"Deep Linking" in the World Wide Web

Abstract

The community of Web users has been engaged in discussion and litigation concerning the practice of "deep linking". This document is designed to provide technical input to this discussion concerning the relevant aspects of the Web architecture. The goal is that discussion at the level of policy be well-informed about the underlying technology.

Status of this Document

This document is an unreviewed draft prepared by Tim Bray as input to the W3C TAG deliberations of issue DeepLinking-25. It has no official standing of any kind.

Deep Linking Background

People engaged in delivering information or services via the World Wide Web typically speak in terms of "Web sites" which have "home pages" or "portal pages". "Deep linking" is the practice of publishing a hyperlink from a page on one site to a page "inside" another site, i.e. which is not the linked-to site's home page or portal page.

Certain Web publishers wish to prevent or control deep linking into their site, and wish to establish a right to exercise such control as a matter of public policy, i.e. through litigation based on existing law or by instituting new legislation.

Whether or not the exercise of such control is a matter of good business practice is outside the scope of this document. However, it is arguable that public policy in this area that is informed by and coherent with the underlying technical architecture will work more smoothly and be easier to specify, monitor, and enforce.

Difficulties in Reconciling Deep Linking Policy and Technology

The underlying technology which drives the World Wide Web has no built-in notion of "web site" or "home page". This adds to the difficulty of specifying policies in this area in such a way that they are correctly implemented by technologists. Once policies are in place, this also adds to the difficulty of creating technology for automatic monitoring and enforcement. In a system as large and complex as the Web, automated monitoring and enforcement is a necessity for the success of any policy.

The Uniform Resource Identifier and Web Architecture

This issue centers on the use of hyperlinks. The central feature of a hyperlink, and indeed a central feature of Web architecture, is the notion of a "Uniform Resource Identifier" (URI), often called a "Uniform Resource Locator" or URL. Every object on the Web must have a URI, which is simply a string of characters that may be typed into a Web browser, read over the phone, or painted on the side of a vehicle. The Web is unique in the history of information systems in being built around the ability to address anything with a short character string. Basic to the architecture of the Web is that URIs may be freely interchanged, and that once one knows a URI, one may pass it on to others, publish it, and attempt to access whatever resource it identifies.

The vast majority of the software on the Web is built around these assumptions, and attempts at a policy level to control the interchange and use of URIs without some sort of automatic monitoring and enforcement are likely to prove expensive and problematic.

The formal definition of the URI, on which all of the software that successfully drives the Web is built, is in [RFC2396].
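As a minimal sketch of this point, the following Python fragment splits two URIs into their generic syntactic components. The URIs and the helper name describe are hypothetical, chosen only for illustration. A "home page" URI and a "deep" URI decompose into the same kinds of parts, and nothing in the syntax marks one of them as an entry point.

```python
from urllib.parse import urlsplit

# Hypothetical URIs, used only for illustration.
HOME = "http://www.example.com/"
DEEP = "http://www.example.com/archive/2002/report.html"


def describe(uri):
    """Split a URI into its generic syntactic components."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
    }


home = describe(HOME)
deep = describe(DEEP)

# Both URIs yield the same set of components; only the path value differs.
same_shape = home.keys() == deep.keys()
```

Software that handles these two strings treats them identically; any distinction between "home" and "inside" exists only in the publisher's intent.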

Access Control and Accountability on the Web

While the Web offers little or no control over the ability to refer to any resource, it offers a rich suite of access-control facilities. The procedures by which resources may be accessed over the Web are those of the Hypertext Transfer Protocol (HTTP), which is formally laid out in [RFC2616]. When any piece of software attempts to access a resource via its URI, it sends a request which typically contains a variety of information including:

When such a request is received, it may succeed, or it may fail. It may fail because there is no resource identified by the URI (the well-known "404 Not Found") or because the server refuses, based on the information available, to grant access ("401 Unauthorized" when credentials are required, or "403 Forbidden" when access is refused outright).
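The outcomes described above can be sketched in Python. This is an illustration under stated assumptions, not a real server: the URIs, the KNOWN and PROTECTED sets, and the respond function are all hypothetical, and no network traffic takes place. The request object simply carries a URI plus one of the fields a client might send.

```python
from http import HTTPStatus
from urllib.parse import urlsplit
from urllib.request import Request

# A request as a client might build it (hypothetical URI and field value).
request = Request(
    "http://www.example.com/archive/2002/report.html",
    headers={"Referer": "http://news.example.org/story.html"},
)

# Hypothetical server-side knowledge of which paths exist and which
# of them require the requester to be authenticated.
KNOWN = {"/", "/archive/2002/report.html"}
PROTECTED = {"/archive/2002/report.html"}


def respond(path, authenticated):
    """Choose a status code for a request to the given path."""
    if path not in KNOWN:
        return HTTPStatus.NOT_FOUND      # no resource identified by the URI
    if path in PROTECTED and not authenticated:
        return HTTPStatus.UNAUTHORIZED   # the server declines to grant access
    return HTTPStatus.OK


status = respond(urlsplit(request.full_url).path, authenticated=False)
```

The decision rests entirely with the server, which may use any information present in the request when making it.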

A server can be programmed to deny access to any resource for a variety of reasons, including:

Deep Linking by Analogy

The following analogy attempts to find a parallel in the real world to the issues of policy and technology discussed here.

Consider a plot of land in the countryside at the intersection of Mill Road and Forest Way. Suppose that this plot of land has no fences or gates of any kind. In the middle of the field is a pile of watermelons; beside them is a sign saying "Watermelons". No humans are in attendance.

A local journalist driving by might observe this and note, in the weekly town paper, that there is a pile of watermelons in the field at Mill Road and Forest Way.

If the owner of the watermelons wished to control access to them, the appropriate approach would be to use some combination of fences, gates, sales personnel, and cash registers. Once such a control structure is in place, the watermelons' owner would be entirely justified in taking legal action against anyone who climbed the fence, broke down the gate, held up the salesperson, or paid with counterfeit coin.

It is difficult to imagine a circumstance in which it would be either legitimate or useful to take legal action against the journalist for noting the existence of the watermelons.

The analogy with "deep linking" on the Web is compelling. A provider of Web resources who does not make use of the built-in facilities of the Web to control access to resources is unlikely to achieve either justice or a good business outcome by attempting to suppress information about the resources' existence.

Architecture and Policy Working Together

The Web's structure includes facilities to implement nearly any imaginable set of business policies as regards access control; for example, an access policy based on the "Referer" field could grant access to a site's inner pages only to requests that arrive by following a link from its "home page".
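A policy of this kind might look like the following Python sketch. The site layout, the HOME_PAGE and INNER_PAGES values, and the check_access function are assumptions made for illustration; a real server would express the same rule in its own configuration or application code.

```python
from http import HTTPStatus

# Hypothetical site layout.
HOME_PAGE = "http://www.example.com/"
INNER_PAGES = {"/archive/2002/report.html"}


def check_access(path, headers):
    """Allow inner pages only when the Referer field names the home page."""
    if path in INNER_PAGES and headers.get("Referer") != HOME_PAGE:
        return HTTPStatus.FORBIDDEN
    return HTTPStatus.OK


# A visitor who followed a link from the home page is admitted.
via_home = check_access("/archive/2002/report.html", {"Referer": HOME_PAGE})

# A visitor arriving from a link on another site is refused.
via_deep_link = check_access(
    "/archive/2002/report.html",
    {"Referer": "http://news.example.org/story.html"},
)
```

The "Referer" field is supplied by the requesting software and may be absent, so a rule like this also turns away visitors who typed the URI directly or whose software omits the field.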

Unethical parties could, of course, attempt to circumvent such policies, for example by programming software to transmit false values in various request fields, or by stealing passwords, or any number of other nefarious practices. Such a situation has clearly passed from the domain of technology to that of policy. Public policy may need to be instituted which determines the seriousness of such attempts to subvert the system, the nature of proof required to establish a transgression, the appropriate penalties for transgressors, and many other related issues. Working out the correct policies in these areas will be difficult, even given that there is good support in the technical infrastructure.

To summarize, attempts to limit the usage, transmission and publication of URIs at the policy level are at high risk of failure because of poor support in the underlying technology. However, public policies concerning access to the resources identified by URIs are not only appropriate, they are somewhat urgently needed, and they have a good probability of success given the technical infrastructure in place.