This document discusses issues concerning the comparison of Uniform Resource Identifiers (URIs) and documents common practices in this area.
Software is commonly required to compare two URIs to determine whether they identify the same resource. Applications which do this vary in their behavioral and performance requirements, and not all use the same URI-equivalence decision criteria.
This document describes a variety of methods which may be used to compare URIs, the trade-offs between them, and the types of applications which might reasonably be expected to use one method or another.
For brevity, this document will use the term "equivalent" to describe the condition where software determines that two URIs identify the same resource
URIs exist to identify resources. A resource, in the Web Architecture, is an abstraction; a URI may in some cases be dereferenced to yield a representation of the resource. It is generally the case that any two different URIs may identify the same resource, in the view of the user or publisher of that resource. Thus, while comparisons of two URIs can establish with confidence that they are equivalent and identify the same resource, it is always the case that such comparisons can yield "false negatives". Put another way, it is often possible to determine that two URIs are the same, but it is in general never possible to be sure that they are different.
The syntax of URIs is defined by RFC 2396; the present document cannot really be understood without reference to that RFC. RFC2396 defines a URI as a sequence of characters, with the definition of "character" not tied to any particular form of storage; the characters may be stored on disk one byte per character, in a Java string two bytes per character, painted on the side of a bus, or spoken in conversation.
The repertoire of characters in URIs is extremely limited, comprising a subset of those available in US-ASCII. Certain of these characters have special roles, for example ':' and '/', and may not normally be used in URIs outside of those roles without escaping (see below).
Since the world contains many characters useful in identifying resources beyond those in US-ASCII, and since the special characters such as ':' and '/' are also often useful, RFC2396 provides a mechanism for "%-escaping" such characters; they are represented as a sequence of 2-digit hexadecimal codes, each representing the value of one byte and preceded by the percent sign '%'.
RFC2395 also specifies that every URI has a "prefix", a leading sequence
of characters separated from the rest of the uri by a colon character ':'.
Two examples of URIs are http://example.com/uri
and ftp://ftp.example.com/uri
; their prefixes respectively are
"http" and "ftp".
It is common practice to name classes of URIs by their scheme, for example
"HTTP URIs" and "FTP URIs".
Each URI scheme which is appropriately registered with the Internet Assigned Names Authority has its own governing RFC; for example, HTTP URIs are described by RFC 2616. The syntax and semantics of URIs vary significantly as a function of their scheme. For example, URIs whose scheme is "urn" (commonly referred to as URNs) are not allowed to contain slash characters, and certain parts of HTTP URIs (but not others) are meant to be processed case-insensitively.
RFC2396 defines a construct called a "URI reference". URI references differ syntactically from URIs in two ways:
http://example.com/intro#chap1
.intro#chap1
.
Two principles apply to the comparison of URI references.
http://example.com/intro#chap1
and http://example.com/intro
cannot in general be considered
equivalent.A variety of methods are used in practice to test URI equivalence. These methods fall into a range, distinguished by the amount of processing required, and the degree to which the probability of false negatives is reduced. As noted above, false negatives cannot in principle be eliminated. In practice, their probability can be reduced, but this reduction carries some processing cost and is not cost-effective for all applications.
If this range is considered as a ladder, the following discussion will climb the ladder, starting with comparison methods that are cheap but have a relatively higher chance of producing false negatives, and proceeding to those that have higher computational cost and lower risk of false negatives.
If two URIs, considered as character strings, are found identical, then it is safe to conclude that they are equivalent. This type of equivalence test has very low computational cost and is in wide use in a variety of applications. The Namespaces in XML recommendation mandates this class of comparison when testing two Namespace Names for equivalence.
Testing strings for equivalence requires some basic precautions. This procedure is often referred to as "bit-for-bit" or "byte-for-byte" comparison, which is potentially misleading. Testing of strings for equality is normally based on pairwise comparison of the characters that make up the strings, starting from the first and proceeding until both strings are exhausted and all characters found to be equal, or a pair of characters compares unequal or one of the strings is exhausted before the other.
These character comparisons require that each pair of characters be put
in comparable form. Should, for example, one URI be stored in a byte
array in EBCDIC encoding, and the second be in a Java String
object, bit-for-bit comparisons applied naively can produce both
false-positive and false-negative errors.
Thus in principle it is better to speak of equality on a
character-for-character rather than byte-for-byte or bit-for-bit
basis.
In Unicode terminology, this would be properly referred to as
codepoint-for-codepoint comparison.
Software may use logic based on the definitions provided by RFC2396 to reduce the probability of false negatives. Such processing is (moderately) higher in cost than character-for-character string comparison. It is not appropriate to enumerate all the consequences of RFC2396's rules here. However, an application using this approach would reasonably consider the following two URIs equivalent:
example://a/b/c/d/%7A
eXAMPLE://a/b/../x/b/c/%7a
Web user agents such as browsers typically apply this kind of processing when determining whether a URI's representation has been cached.
Software may choose to implement none, some, or all of the comparison-related processing implied by RFC2396's rules.
It would seem almost wilfully perverse to consider the characters
represented respectively by %7A
and %7a
in the
example above as different.
In fact, since the Namespaces in XML recommendation specifies
"character-for-character" comparison, it might be argued that
since %7A
and %7a
must per RFC2396 represent
the same character, XML namespaces which differ only in this respect
might reasonably be considered equal.
However, most real-world XML Namespaces implementations use simple
mechanisms such as C's strcmp()
or
Java's String.equals
, which would fail
the %7A
/%7a
reasonableness test, but claim with
some reason to be compliant implementations.
%-escaping raises other issues. Consider two URIs:
http://a/b/
http://a%2fb
Software applying RFC2396's rules would not consider these
equivalent, since the %2f
is being used explicitly to escape
the special semantics in URIs of the '/' character.
Another example:
http://a/b/
http://%61/b/
Such software might consider these equivalent,
since %61
encodes the character 'a' in both ASCII and UTF-8,
but context becomes significant.
RFC2396 does not constrain the character encoding scheme of URIs; if the
original document were encoded in EBCDIC, or the URIs were sourced from
two different documents whose original encoding was not known, there is a
(slim) chance of a false-positive in finding these equivalent.
It is in the interests of everyone to avoid false-negatives in comparing URIs, and simultaneously to do the minimum amount of software processing in such comparisons.
Those who generate URIs and transmit them, or include them in resource representations, can make a major contribution to this common good by understanding the rules in RFC2396 and generating URIs in an at-least-partly canonicalized form. Specifically:
a
-through-f
characters when %-escaping././
, and the
vacuous /../
in the example above.The syntax and semantics of URIs, as noted, vary from scheme to scheme as described by the relevant RFCs. Software may have built-in knowledge of these scheme-specific rules and use them, at further processing cost, to reduce the probability of false negatives. In particular, the Web Robots that drive most large search engines would consider the following two URIs to be equivalent:
http://example.com/
http://Example.COM:80/
This is reasonable behavior based on the rules provided by RFC 2616, which defines HTTP.
As a more extreme example, if the domain name "example.com" were
known, at the time of some particular processing sequence, to
map to the IP address 10.20.30.40, then some software might consider the
URIs above equivalent to http://10.20.30.40/
.
As noted above with respect to RFC2396-sensitive processing, those who generate URIs can reduce the cost of processing and the risk of false negatives by generating them in a form which is reasonably canonical with respect to the governing scheme.
In fact, Web Robots, which are at pains to reduce the incidence of
false negatives, are observed to implement even more aggressive
techniques in URI comparison.
For example, if they observe that http://example.com/data
redirects to http://example.com/data/
(note trailing slash)
they will likely regard the two as equivalent, at least until the next
visit.
They may even observe that http://example.com/
returns a
representation with a media-type of HTML, and conclude
that http://example.com/#
is equivalent.
Obviously this kind of technique is only appropriate in special situations.