Wednesday, August 29, 2007

Returning HTTP 303s for Semantic Web URIs

The World Wide Web Consortium's (W3C) Technical Architecture Group (TAG) attempted to settle a long standing debate about the use of URL resolution called http-range-14 a couple of years ago by ruling that "If an 'http' resource responds to a GET request with a 303 (See Other) response, then the resource identified by that URI could be any resource." Roy Fielding's original suggestion for the TAG finding is here.

That is a very subtle point. The idea was to cleanly separate those resources that are referred to by an HTTP URL and those that cannot be referred to directly, but might have an HTTP URL assigned to them anyway. The latter include physical items in the real world. Importantly, many of the objects assigned URIs in Semantic Web descriptions are given HTTP URLs but cannot be directly referred to on the Web.

Programmatic resolution of an HTTP URL may refer to an object in the real world (that is, an arbitrary resource) or an information resource (in some virtual form, such as an HTML page or an image or a movie). In that case, the HTTP response code would be 303 instead of 200. A 303 is an indication that the thing referred to may not be an information resource, it may be either an information resource or a "real" object. The body of the 303 (and the Location header) can provide information about the resource without encouraging you to think that what was returned really was a representation of the resource (as you would with a 200 response).

Most HTTP resources are expected to respond to a GET request with a 200 (OK) response and the body of a 200 is an "information resource". The definition of an information resource is that the entire content of the referred object may be "conveyed in a message".

But what about resources that cannot be conveyed in a message? My dog is a resource, as is my car, or myself. These things cannot be conveyed in a message, they can only be referred to. That is where HTTP 303 response codes come in.

RFC 2616 defines HTTP version 1.1 and its response codes. Section 10 defines a 303 thusly:

The response to the request can be found under a
different URI and SHOULD be retrieved using a GET
method on that resource. This method exists primarily
to allow the output of a POST-activated script to
redirect the user agent to a selected resource. The
new URI is not a substitute reference for the originally
requested resource. The 303 response MUST NOT be
cached, but the response to the second (redirected)
request might be cacheable.

The different URI SHOULD be given by the Location
field in the response. Unless the request method was
HEAD, the entity of the response SHOULD contain a
short hypertext note with a hyperlink to the new URI(s).
NB: An information resource should only be returned by an HTTP GET, not a POST, hence the discussion of POST. I unfortunately don't know of many people who bother to comply with Web Architecture to that extent.

Thus, the body of a 303 response is under-specified. There are several open questions. Some of them are:
  • What should it contain?
  • How should it be formatted?
  • Does a "short hypertext note" constrain implementations to a text/html MIME type?
  • How short is "short"?
  • Can there be more than one Location header?
  • If not, what do all the other URIs in the hypertext tell a user? Arbitrarily anything? If so, there is no limitation.
Uche, Eric and I have been trying to answer those questions. I should rightfully include Brian, too. Uche has proposed that the body of a 303 response include RDF (but not necessarily in RDF/XML format!). Eric has suggested encoding RDF content using RDFa, which would allow for the "short hypertext note" to include machine-readable RDF without breaking any existing implementations. The obvious downside is that RDFa is not a standard (merely an Editors Draft) and may not become one.

I rather like the idea of using RDFa, but a broader solution may be to always include a text/html body part, but use a multipart/alternative structure to allow for the body to hold RDF data (in whatever form) if it is present. That way, an implementation would not have to run the body through an RDFa parser just to determine whether any RDF content was present. The use of an additional header to indicate the presence of RDF content would also do it, regardless of the body type.

Of course, others have derided the use of HTTP URLs (emphasis on the Locator aspect) to reference arbitrary resources in the universe (for which the URI, I for identifier, was designed) and for some good reasons. However, I see tremendous value in being able to marry information space with meat space via the resolution of HTTP URLs. In fact, the automated manipulation of Semantic Web content depends upon it. That is why the TAG's finding makes a lot of sense. The use of 303 response codes to both separate resolution from description while maintaining universal addressing ties the abstraction of the Internet to the real world.

The new Persistent URL (PURL) service, now in construction, will allow PURLs to be created that can return 303 responses. The use of 303s to represent arbitrary real-world resources will enable PURLs to coalesce the fragmented persistent identifier space. Do we really need LSIDs, DOIs, INFO URIs and the rest? My answer is, "Not if the Web Architecture supports all their requirements". The representation of arbitrary objects via HTTP URLs and the ability to return multiple See Also URLs from a response would seem to do just that.

TimBL has raised a concern that the information resource contained in a body of a 303 response should not contain anything very interesting, because it is not addressable (in context of the linked data discussion). I disagree. The address of the 303 body is directly addressable because that is what is returned when the URL that got you there is resolved. Further, the 303 status informs a user that the information resource in the body is not the requested resource itself, merely information about it. That is, the URL addresses both the (real world) resource and the information resource in the 303 body and the 303 status allows one to cleanly separate which one you may wish to refer to at any given time, either programmatically or by a human.

As Paul points out, the use of a 303 response code does not put a requirement on the semantics of the identifier. A 303 response may be used to present information to a human user, while at the same time providing an indication to a computer that the resource you addressed is not the one that was returned.

I have posted about http-range-14 issues before, but not in this amount of detail. I think that proper answers to the questions above will be critical to the success of the Semantic Web because we must have a mechanism to programmatically determine whether an HTTP URL refers to an arbitrary thing in the wide universe or that relatively small subset of things that we call information resources.

A big question in any new use of HTTP is how existing browsers have implemented handling of the return codes. I was quite surprised that Firefox and Apple's Safari redirect to the Location header in a 303! For example, go to the home page for Tom Heath (http://kmi.open.ac.uk/people/tom/) and you will be redirected to http://kmi.open.ac.uk/people/tom/html. Neither browser will show you that an intermediate 303 return code was issued or that the browser followed it. You can see the 303 by using something like HTTPTracer or wget. It is also interesting to note that Tom is using a text/html body describing in human terms what the Location header says (a "short hypertext note", captured using HTTPTracer):


HTTP/1.1 303 See Other
Date: Wed, 29 Aug 2007 16:54:19 GMT
Server: Apache/2.0.52 (Red Hat)
Location: http://kmi.open.ac.uk:8888/people/tom/html
Content-Length: 332
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>303 See Other</title>
</head><body>
<h1>See Other</h1>
<p>The answer to your request is located
<a href="http://kmi.open.ac.uk:8888/people/tom/html">here</a>.
</p>
<hr>
<address>Apache/2.0.52 (Red Hat) Server at kmi.open.ac.uk Port 8888</address>
</body></html>


Tom is using a 303 in the manner described by RFC 2616, and it is important to note that it does not conflict with the TAG finding regarding http-range-14. His response does, in fact, refer to "any resource".

The fact that browsers are automatically (and incorrectly, in my opinion) redirecting to the URLs specified in 303 Location headers suggests to me that Semantic Web applications will need to be careful. We can rely upon our own programmatic handling of 303s, but not the browsers'. We should also, again in my opinion, ensure that we do not break existing browser implementations if we can help it. The use of creative 303 body messages and perhaps new headers is one way out.