How to Compare Uniform Resource Identifiers

Author: Tim Bray

Abstract

This document discusses issues concerning the comparison of Uniform Resource Identifiers (URIs) and documents common practice.

Introduction

Software is commonly required to compare two URIs. Such comparison is always in respect of some particular purpose, and different software modules might reasonably come to different conclusions about the same pair of URIs. This document uses the terms "different" and "equivalent" to describe the possible outcomes of such comparisons, but, as the discussion of examples and procedures makes clear, there are many possible application-dependent versions of equivalence.

Since URIs exist to identify resources, presumably they should be considered equivalent when they identify the same resource. This definition of equivalence is not of much practical use for reasons which include:

Resources may have many different identifiers.
Web architecture defines how resources are named and how their representations are interchanged, but doesn't define resource equivalence.

For these reasons, determination of equivalence or difference must be based on string comparison, perhaps augmented by reference to additional rules provided in one or more RFCs.

Software modules performing such comparisons differ in their requirements and therefore their URI equivalence criteria. This document describes a variety of methods which may be used to compare URIs, the trade-offs between them, and the types of applications which might use them.

The expressiveness of URIs is limited by their small character repertoire. The IRI specification currently under development is aimed at addressing this. The material in this note applies equally to URIs and IRIs.

Status of This Document

This the second draft of this document, and reflects editorial input from members of the TAG and the broader community, but may not represent the consensus of the TAG.

Background

Inevitability of False Negatives

URIs exist to identify resources. A resource, in the Web Architecture, is an abstraction; a URI may in some cases be dereferenced to yield a representation of the resource. Any two different URIs may identify the same resource, in the view of the user or publisher of that resource. Thus, while comparison of two URIs can establish with confidence that they are equivalent and identify the same resource, such comparisons can always yield "false negatives". Put another way, it is often possible to determine that two URIs are equivalent, but it is never possible to be sure that they identify different resources.

Rules Governing URIs

The syntax of URIs is defined by RFC2396; the present document cannot really be understood without reference to that RFC. RFC2396 defines a URI as a sequence of characters, with the definition of "character" not tied to any particular form of storage; the characters may be stored on disk one byte per character, in a Java string two bytes per character, painted on the side of a bus, or spoken in conversation.

The repertoire of characters in URIs is limited, comprising a subset of US-ASCII. Certain of these characters have special roles, for example : and /, and may not be otherwise used in URIs.

The world contains many characters useful in identifying resources beyond those in US-ASCII, and furthermore the special characters such as : and / are also often useful. RFC2396's "%-escaping" mechanism is helpful in these situations. %-escaping is a two-step process; the logical characters in the URI are encoded in some fashion (such as ASCII, UTF-8, or Shift-JIS) as a series of octets; each octet is then represented as a 2-digit hexadecimal code preceded by the percent sign %.

URI Schemes

RFC2396 specifies that every URI has a "scheme", a leading sequence of characters delimited by a colon character :. Two examples are http://example.com/uri and ftp://ftp.example.com/uri; their schemes respectively are "http" and "ftp". It is common practice to name classes of URIs by their scheme, for example "HTTP URIs" and "FTP URIs".

Each URI scheme which is appropriately registered with the Internet Assigned Names Authority has a governing RFC; for example, HTTP URIs are described by RFC 2616. The syntax and semantics of URIs vary significantly as a function of their scheme. For example, URIs whose scheme is urn (commonly referred to as URNs) are not allowed to contain / characters, and certain parts of HTTP URIs (but not others) are meant to be processed case-insensitively.

Comparison of Relative URI References

RFC2396 defines a construct called a "URI reference" which differs syntactically from URIs in two ways:

The URI may be followed by the character # (not otherwise allowed in URIs) and a string called the "fragment identifier". The semantics of the fragment identifier are specific to the media type of the resource representation. An example is http://example.com/intro#chap1.
Initial portions of the URI may be omitted, producing "relative URI references". The reference may be made absolute by prepending a "base URI" - there are a variety of mechanisms to establish the base URI. An example of a relative reference is intro#chap1.

Two principles apply to the comparison of URI references:

In testing for equivalence, it is generally not useful to compare relative URI references; they must be converted to their absolute form before comparison.
RFC2396 states that the trailing # and fragment identifier are not part of a URI, and that fragment processing happens in the context of a retrieved presentation. Applications may choose to perform comparison operations on either the base URIs or the references including fragment identifiers. It is important that software and specifications be clear about which of these is being done. For example, when asked to navigate to http://example.com/intro#chap1, a user-agent might compare the base URI http://example.com/intro against the contents of its cache to determine of a network access is required.

The URI Comparison Ladder

A variety of methods are used in practice to test URI equivalence. These methods fall into a range, distinguished by the amount of processing required and the degree to which the probability of false negatives is reduced. As noted above, false negatives cannot in principle be eliminated. In practice, their probability can be reduced, but this reduction requires more processing and is not cost-effective for all applications.

If this range of comparison practices is considered as a ladder, the following discussion will climb the ladder, starting with those that are cheap but have a relatively higher chance of producing false negatives, and proceeding to those that have higher computational cost and lower risk of false negatives.

Simple String Comparison

If two URIs, considered as character strings, are identical, then it is safe to conclude that they are equivalent. This type of equivalence test has very low computational cost and is in wide use in a variety of applications. The Namespaces in XML recommendation mandates this class of comparison when testing two Namespace Names for equivalence.

Testing strings for equivalence requires some basic precautions. This procedure is often referred to as "bit-for-bit" or "byte-for-byte" comparison, which is potentially misleading. Testing of strings for equality is normally based on pairwise comparison of the characters that make up the strings, starting from the first and proceeding until both strings are exhausted and all characters found to be equal, or a pair of characters compares unequal or one of the strings is exhausted before the other.

These character comparisons require that each pair of characters be put in comparable form. For example, should one URI be stored in a byte array in EBCDIC encoding, and the second be in a Java String object, bit-for-bit comparisons applied naively will produce both false-positive and false-negative errors. Thus in principle it is better to speak of equality on a character-for-character rather than byte-for-byte or bit-for-bit basis.

Unicode defines a character as being identified by number ("codepoint") with an associated bundle of visual and other semantics. At the software level, it is not practical to compare semantic bundles, so in practical terms, character-by-character comparisons are done codepoint-by-codepoint.

RFC2396-Sensitive Comparison

Software may use logic based on the definitions provided by RFC2396 to reduce the probability of false negatives. Such processing is (moderately) higher in cost than character-for-character string comparison. It is not appropriate to enumerate all the consequences of RFC2396's rules here. However, an application using this approach could reasonably consider the following two URIs equivalent:

example://a/b/c/%7A
eXAMPLE://a/./b/../b/c/%7a

Note that RFC2396 does not authorize the removal of the /./ and b/../ fragments except in the case of relative URI references, but that this is arguably an inconsistency and that software often does so anyhow.

Web user agents such as browsers typically apply this kind of processing when determining whether a resource's representation has been cached.

Software may choose to implement none, some, or all of the comparison-related processing implied by RFC2396's rules.

%-Escaping Issues

It would seem almost willfully perverse to consider the data represented respectively by %7A and %7a in the example above as different, since per RFC2396 they must represent the same octet. In this light, it might be reasonable to find the URIs http://example.com/%7a and http://example.com/%7A equivalent in the framework of of the Namespaces in XML recommendation, which specifies "character-for-character" comparison.

However, most real-world XML Namespaces implementations use simple mechanisms such as C's strcmp() or Java's String.equals(), which would fail the %7A/%7a reasonableness test, but claim with some reason to be compliant implementations.

%-escaping raises other issues. Consider two URIs:

http://a/b
http://a%2fb

Software applying RFC2396's rules would not find these equivalent, since the %2f is being used explicitly to escape the special semantics in URIs of the / character.

Another example:

http://dir/a
http://dir/%61

Such software might consider these equivalent, since %61 encodes the character a in both ASCII and UTF-8, but context becomes significant. RFC2396 does not constrain the character-to-octet mapping scheme used in URIs. If the second URI had been generated on a machine in which the EBCDIC character-to-octet mapping was in use, the %61 would encode the character / (quite naturally, since / must be encoded but a need never be).

Thus, assuming two different URIs whose original character-to-octet mapping is not known, there is a (slim) chance of a false-positive in finding these equivalent.

Good Practice When Generating URIs

It is in the interests of everyone to avoid false-negatives in comparing URIs, and simultaneously to do the minimum amount of software processing in such comparisons.

Those who generate URIs and transmit them, or include them in resource representations, can make a major contribution to this common good by understanding the rules in RFC2396 and generating URIs in an at-least-partly canonicalized form. Specifically:

Always provide the URI scheme in lower-case characters.
Only perform %-escaping where required by RFC2396.
Always use the UTF-8 character-to-octet mapping.
Always use upper-case A-through-F characters when %-escaping. The choice of upper-case is motivated by an observation that a high proportion of deployed software uses this practice.
Avoid idioms such as the /./ and /../ in the example above.

Scheme-Sensitive Processing

The syntax and semantics of URIs, as noted, vary from scheme to scheme as described by the relevant RFCs. Software may use scheme-specific rules, at further processing cost, to reduce the probability of false negatives. In particular, the Web Robots that drive most large search engines would consider the following two URIs to be equivalent:

http://example.com/
http://Example.COM:80/

This is reasonable behavior based on the rules provided by RFC 2616, which defines the syntax and semantics of HTTP URIs.

As a more extreme example, if the domain name "example.com" were known, at the time of some particular processing run, to map to the IP address 10.20.30.40, then software might consider the URIs above equivalent to http://10.20.30.40/.

As noted above with respect to RFC2396-sensitive processing, those who generate URIs can reduce the cost of processing and the risk of false negatives by generating them in a form which is reasonably canonical with respect to their scheme.

In fact, Web Robots, for which substantial effort to reduce the incidence of false negatives is often cost-effective, are observed to implement even more aggressive techniques in URI comparison. For example, if they observe that http://example.com/data redirects to http://example.com/data/ (note trailing slash) they will likely regard the two as equivalent, at least until the next visit. They may even observe that http://example.com/ returns a representation with a media-type of HTML, and conclude that http://example.com/# is equivalent.

Obviously this kind of technique is only appropriate in special situations.