Wednesday, August 29, 2007

Returning HTTP 303s for Semantic Web URIs

A couple of years ago, the World Wide Web Consortium's (W3C) Technical Architecture Group (TAG) attempted to settle a long-standing debate about the use of URL resolution, known as http-range-14, by ruling that "If an 'http' resource responds to a GET request with a 303 (See Other) response, then the resource identified by that URI could be any resource." Roy Fielding's original suggestion for the TAG finding is here.

That is a very subtle point. The idea was to cleanly separate those resources that are referred to by an HTTP URL and those that cannot be referred to directly, but might have an HTTP URL assigned to them anyway. The latter include physical items in the real world. Importantly, many of the objects assigned URIs in Semantic Web descriptions are given HTTP URLs but cannot be directly referred to on the Web.

Programmatic resolution of an HTTP URL may refer to an object in the real world (that is, an arbitrary resource) or an information resource (in some virtual form, such as an HTML page or an image or a movie). When the URL may refer to an arbitrary resource, the HTTP response code would be 303 instead of 200. A 303 indicates that the thing referred to may not be an information resource; it may be either an information resource or a "real" object. The body of the 303 (and the Location header) can provide information about the resource without encouraging you to think that what was returned really was a representation of the resource (as you would with a 200 response).

Most HTTP resources are expected to respond to a GET request with a 200 (OK) response and the body of a 200 is an "information resource". The definition of an information resource is that the entire content of the referred object may be "conveyed in a message".

But what about resources that cannot be conveyed in a message? My dog is a resource, as is my car, or myself. These things cannot be conveyed in a message; they can only be referred to. That is where HTTP 303 response codes come in.
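The distinction can be sketched in a few lines of Python. This is only an illustration of the reading of the TAG finding given above: the function name `interpret_status` and the label strings are hypothetical, and treating every other status code as "unknown" is an assumption made for the sketch.

```python
def interpret_status(status: int) -> str:
    """Say what a GET response code lets a client conclude about
    the thing an HTTP URL identifies (hypothetical helper)."""
    if status == 200:
        # The body is a representation of the resource itself.
        return "information resource"
    if status == 303:
        # The URL may name anything at all: a dog, a car, a person,
        # or an information resource. The response only describes it.
        return "any resource"
    # Assumption for this sketch: other codes tell us nothing either way.
    return "unknown"


if __name__ == "__main__":
    for code in (200, 303, 404):
        print(code, "->", interpret_status(code))
```

Note that a 303 never says the resource is a real-world object, only that it might be.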

RFC 2616 defines HTTP version 1.1 and its response codes. Section 10.3.4 defines a 303 as follows:

The response to the request can be found under a
different URI and SHOULD be retrieved using a GET
method on that resource. This method exists primarily
to allow the output of a POST-activated script to
redirect the user agent to a selected resource. The
new URI is not a substitute reference for the originally
requested resource. The 303 response MUST NOT be
cached, but the response to the second (redirected)
request might be cacheable.

The different URI SHOULD be given by the Location
field in the response. Unless the request method was
HEAD, the entity of the response SHOULD contain a
short hypertext note with a hyperlink to the new URI(s).

NB: An information resource should only be returned by an HTTP GET, not a POST, hence the discussion of POST. I unfortunately don't know of many people who bother to comply with Web Architecture to that extent.

Thus, the body of a 303 response is under-specified. There are several open questions. Some of them are:
  • What should it contain?
  • How should it be formatted?
  • Does a "short hypertext note" constrain implementations to a text/html MIME type?
  • How short is "short"?
  • Can there be more than one Location header?
  • If not, what do all the other URIs in the hypertext tell a user? Arbitrarily anything? If so, there is no limitation.

Uche, Eric and I have been trying to answer those questions. I should rightfully include Brian, too. Uche has proposed that the body of a 303 response include RDF (but not necessarily in RDF/XML format!). Eric has suggested encoding RDF content using RDFa, which would allow for the "short hypertext note" to include machine-readable RDF without breaking any existing implementations. The obvious downside is that RDFa is not a standard (merely an Editor's Draft) and may not become one.

I rather like the idea of using RDFa, but a broader solution may be to always include a text/html body part, but use a multipart/alternative structure to allow for the body to hold RDF data (in whatever form) if it is present. That way, an implementation would not have to run the body through an RDFa parser just to determine whether any RDF content was present. The use of an additional header to indicate the presence of RDF content would also do it, regardless of the body type.
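Here is a rough sketch of the multipart/alternative idea in Python. It borrows the standard library's email package purely to model the MIME structure; the helper names `build_303_body` and `rdf_part`, the wording of the note, and the choice of application/rdf+xml for the RDF part are all assumptions made for illustration, not part of any proposal above.

```python
from email.message import EmailMessage
from typing import Optional


def build_303_body(location: str, rdf_xml: Optional[bytes] = None) -> EmailMessage:
    """Build a 303 body: always an HTML note, plus an RDF alternative if given."""
    body = EmailMessage()
    note = f'<p>More information is available at <a href="{location}">{location}</a>.</p>'
    body.set_content(note, subtype="html")  # the "short hypertext note"
    if rdf_xml is not None:
        # Adding an alternative turns the body into multipart/alternative,
        # with the HTML note as the first part and the RDF as the second.
        body.add_alternative(rdf_xml, maintype="application", subtype="rdf+xml")
    return body


def rdf_part(body: EmailMessage) -> Optional[bytes]:
    """Return the RDF payload if one is present, without parsing the HTML."""
    for part in body.iter_parts():
        if part.get_content_type() == "application/rdf+xml":
            return part.get_content()
    return None


if __name__ == "__main__":
    rdf = b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
    body = build_303_body("http://example.org/doc", rdf)
    print(body.get_content_type())
    print(rdf_part(body))
```

The point of the structure is that a client can check the part types to learn whether RDF is present, rather than running the HTML through an RDFa parser to find out.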

Of course, others have derided the use of HTTP URLs (emphasis on the Locator aspect) to reference arbitrary resources in the universe (for which the URI, I for identifier, was designed) and for some good reasons. However, I see tremendous value in being able to marry information space with meat space via the resolution of HTTP URLs. In fact, the automated manipulation of Semantic Web content depends upon it. That is why the TAG's finding makes a lot of sense. Using 303 response codes to separate resolution from description, while maintaining universal addressing, ties the abstraction of the Internet to the real world.

The new Persistent URL (PURL) service, now under construction, will allow PURLs to be created that can return 303 responses. The use of 303s to represent arbitrary real-world resources will enable PURLs to coalesce the fragmented persistent identifier space. Do we really need LSIDs, DOIs, INFO URIs and the rest? My answer is, "Not if the Web Architecture supports all their requirements". The representation of arbitrary objects via HTTP URLs and the ability to return multiple See Also URLs from a response would seem to do just that.

TimBL has raised a concern that the information resource contained in the body of a 303 response should not contain anything very interesting, because it is not addressable (in the context of the linked data discussion). I disagree. The 303 body is directly addressable because that is what is returned when the URL that got you there is resolved. Further, the 303 status informs a user that the information resource in the body is not the requested resource itself, merely information about it. That is, the URL addresses both the (real world) resource and the information resource in the 303 body, and the 303 status allows one to cleanly separate which one you may wish to refer to at any given time, whether programmatically or by a human.

As Paul points out, the use of a 303 response code does not put a requirement on the semantics of the identifier. A 303 response may be used to present information to a human user, while at the same time providing an indication to a computer that the resource you addressed is not the one that was returned.

I have posted about http-range-14 issues before, but not in this amount of detail. I think that proper answers to the questions above will be critical to the success of the Semantic Web because we must have a mechanism to programmatically determine whether an HTTP URL refers to an arbitrary thing in the wide universe or that relatively small subset of things that we call information resources.

A big question in any new use of HTTP is how existing browsers have implemented handling of the return codes. I was quite surprised that Firefox and Apple's Safari redirect to the Location header in a 303! For example, go to the home page for Tom Heath (http://kmi.open.ac.uk/people/tom/) and you will be redirected to http://kmi.open.ac.uk/people/tom/html. Neither browser will show you that an intermediate 303 return code was issued or that the browser followed it. You can see the 303 by using something like HTTPTracer or wget. It is also interesting to note that Tom is using a text/html body describing in human terms what the Location header says (a "short hypertext note", captured using HTTPTracer):


HTTP/1.1 303 See Other
Date: Wed, 29 Aug 2007 16:54:19 GMT
Server: Apache/2.0.52 (Red Hat)
Location: http://kmi.open.ac.uk:8888/people/tom/html
Content-Length: 332
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>303 See Other</title>
</head><body>
<h1>See Other</h1>
<p>The answer to your request is located
<a href="http://kmi.open.ac.uk:8888/people/tom/html">here</a>.
</p>
<hr>
<address>Apache/2.0.52 (Red Hat) Server at kmi.open.ac.uk Port 8888</address>
</body></html>


Tom is using a 303 in the manner described by RFC 2616, and it is important to note that it does not conflict with the TAG finding regarding http-range-14. His response does, in fact, refer to "any resource".

The fact that browsers are automatically (and incorrectly, in my opinion) redirecting to the URLs specified in 303 Location headers suggests to me that Semantic Web applications will need to be careful. We can rely upon our own programmatic handling of 303s, but not the browsers'. We should also, again in my opinion, ensure that we do not break existing browser implementations if we can help it. The use of creative 303 body messages and perhaps new headers is one way out.
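To illustrate what "our own programmatic handling" might look like, here is a small Python sketch. Python's http.client does not follow redirects on its own, so the 303 status, the Location header and the body all remain visible to the caller. Everything else is an assumption for the sake of a self-contained demo: the throwaway local server, the handler and function names, and the example.org target URL are hypothetical stand-ins, not Tom's actual setup.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

TARGET = "http://example.org/people/tom/html"  # hypothetical redirect target


class SeeOtherHandler(BaseHTTPRequestHandler):
    """Answer every GET with a 303, a Location header and a short hypertext note."""

    def do_GET(self):
        note = ("<html><body><p>The answer to your request is located "
                f'<a href="{TARGET}">here</a>.</p></body></html>').encode("iso-8859-1")
        self.send_response(303)
        self.send_header("Location", TARGET)
        self.send_header("Content-Type", "text/html; charset=iso-8859-1")
        self.send_header("Content-Length", str(len(note)))
        self.end_headers()
        self.wfile.write(note)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_without_following(host, port, path):
    """GET a URL and return (status, Location, body) without following redirects."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location"), resp.read()
    finally:
        conn.close()


def demo():
    """Start the local server on a free port, fetch once, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), SeeOtherHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_without_following("127.0.0.1", server.server_port, "/people/tom/")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    status, location, body = demo()
    print(status, location)
    print(body.decode("iso-8859-1"))
```

A client written this way can decide for itself whether to follow the Location header, read the body, or both, which is exactly the choice the browsers do not offer.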

7 comments:

  1. David,

    What leads you to the statement ' The obvious downside is that RDFa is not a standard (merely an Editors Draft) and may not become one.'?

    1. We are on rec track [1]. Ok - we are a bit behind schedule, but it was a very aggressive one ;)
    2. Currently, most parts of the syntax are stable, test cases are being approved all the time, and we have plenty of implementations available.

    Please, do not rant about these issues - or at least give concrete evidence.

    Cheers,
    Michael

    [1] http://www.w3.org/2006/07/SWD/wiki/RDFa

  2. Hi Michael,

    I made the statement about RDFa for the simple reason that it is so. It was not a rant, but merely a statement of fact.

    Please note that I have a long history of supporting RDFa as co-chair of the Semantic Web Best Practice and Deployment Working Group. In fact, I defended RDFa publicly and privately during times when others thought of killing it. However, it is not yet a W3C Recommendation.

    Let's say that I made the same statement about GRDDL (also a candidate for embedding RDF into hypertext 303 bodies, by the way). GRDDL is already a Proposed Rec and the Working Group looks set to move it forward shortly. It is simply farther along. That is not your fault, but it is so.

    However, I do not think that lack of Rec status should be the only consideration. I led an effort to create a series of RDF databases (Tucana/Kowari/Mulgara) starting five full years before RDF became a Recommendation. It is wise, though, to look at both the likelihood of Rec status happening and the likelihood of wide industry adoption before suggesting a change to a fundamental part of Web Architecture.

    The bottom line is this: My team is on the verge of implementing the new PURL service and needs to choose useful mechanisms for 303 bodies. We had better get it pretty close to right because we and others plan to create a huge number of PURLs in the coming months, many (most?) of which will be 303s.

  3. David,

    Thanks for teaching me history, though IMHO I am perfectly aware of your role :)

    I see your points and think I do understand your reasons, BUT still do not understand why the 'statement' about 'RDFa may not become a standard' should be true. I might have missed something, but can you give some hints, pointers, explanations, whatever-you-got that actually supports this?

    Cheers,
    Michael

    PS: Keep on doing the good stuff, regardless if with or without RDFa ;)

  4. Because I am an optimist by nature and a pessimist by experience :)

  5. I think it's a great idea to return application/rdf+xml or an HTML document containing RDFa markup in the entity of a 303 response if the returned content-type is consistent with the request Accept: list.

    If the request says that application/rdf+xml (or */*) is acceptable to the client then it seems perfectly reasonable to me to consider an RDF entity to be the specified "hypertext note" in a 303 response.

    I also believe that the behaviour you note of deployed browsers automatically following 303 Location headers is but one more reason why leveraging http: URIs for meat space resources is good for users. The Location returned can be a function of the Accept value in the request, allowing the client to be referred to a specific "see other" resource that is most acceptable to it. I do agree that it would be a help to users if browsers were to indicate when they are following a 'see other', perhaps in the status bar at least. (I'd like to see browsers have a general metadata sidebar where lots of stuff, including security information, citations, and annotations could be indicated.)

    As I was reading (linearly) your post, I stumbled on the phrase "referred to" in your second paragraph. To my ears, all URIs (http or otherwise) are a "reference to" something. The information resource vs. physical resource distinction is one of retrievability. An 'information resource' is one that can be transported across the Internet. Someday we might be able to so transport meat space resources but not yet. So, in paragraph 2 I'd have used the phrase "retrieved [by]" rather than "referred to [by]".

    I agree with TimBL that the entity in a 303 response should not be considered to be directly addressable. You argue that this is the entity that is returned with a GET on a particular URI but I retort that there are even fewer promises about the repeatability of getting this entity on subsequent requests than there are on "normal" 200 responses. I take the statement in the HTTP spec that the 303 response MUST NOT be cached as supporting this claim. Imagine, if you will, what URI you'd use to refer to that 303 entity as the subject of some RDF statements. It certainly wouldn't be appropriate to use the original request URI -- that names a (perhaps meat world) resource, not the response entity. On the other hand, I think it should be perfectly fine to return "interesting" information in the 303 response. It's too valuable a place to not use for something.

  6. I'm with Tim here. I think posting anything of relevance in the message body of a 303 is a bad idea. The 303 is a redirect status code, which means it points someplace else, and is not intended to deliver content back to the client.

    Almost all browsers and HTTP clients automatically redirect to the Location URI. That's how redirects have always been handled, and many web sites rely on it (303-after-POST is a standard REST idiom). I don't understand why that surprises you, isn't that how HTTP redirects are supposed to work?

    This means it is impossible to get at the message body of the 303 response with a standard web browser or the standard HTTP stacks in most programming languages. Thus, putting anything of interest in there is a bad idea.

    The HTTP spec actually tells you quite clearly what should go into the body: A short hypertext note with a link to the Location URI. As with all redirect status codes, the purpose of this is historical: There were very old browsers that didn't support automatic redirects, and they would display the message body instead. Including that hypertext note prevented them from getting stranded. In other words, it's obsolete today, and whatever you put in there today, no user will ever see it.

    So I think Tom is doing the right thing by not worrying at all about this, and just letting Apache insert its little auto-generated message.

    Simply use the Location header to point to a URI where information about the resource can be found. Whatever you planned to say inside the 303 message body, just say it in the redirection target. That's the approach implemented by almost all of the datasets in the Linking Open Data project. It's simple, it works, without any need for messing around with 303 message bodies or new headers.

    To give you my answers to your questions:

    What should it contain? A short hypertext note with a hyperlink to the new URI(s).

    How should it be formatted? Don't care, in all likelihood no one will ever see it.

    Does a "short hypertext note" constrain implementations to a text/html MIME type? No, though it doesn't matter.

    How short is "short"? One sentence will do.

    Can there be more than one Location header? Existing HTTP implementations redirect to the first one and ignore any subsequent headers. No point in putting anything in there.

    If not, what do all the other URIs in the hypertext tell a user? Arbitrarily anything? If so, there is no limitation. Most likely, no user will ever see it.

    That being said, I'm totally with you regarding HTTP URIs vs. DOI, INFO, LSID and friends. And I'm very much looking forward to an updated PURL that supports 303s.

  7. I have given this a lot of thought. The temptation to do something more interesting with the bodies of 303 responses seemed compelling on the surface. The more I thought about it, however, the more I realized that simpler is better - and more Web-like.

    The reason is that (as far as I can see) there is nothing (nothing!) that one can do with an RDF body that one cannot do by redirecting to an RDF document (via the Location header). Sure, following the Location header requires another network connection and the attendant time, but it also means that the resource is properly addressable by URL.

    So, I think the solution to 303 bodies is to keep them simple: A single URL in a single Location header and an auto-generated XHTML body duplicating the information in the Location header in more human friendly terms ("More information about this resource is available at URL_GOES_HERE").

    However, there is still the need to properly identify non-information resources on the Semantic Web. That's where a best practice approach might come in. I suggest that any URL that returns a 303 with a Location header that points to an RDF document be considered a non-information resource (that is, either a physical, "real world" resource (like my car) or a conceptual resource (like the concept of a car)). The RDF eventually returned from following your nose is then known to provide information about the resource in question without being confused with being the resource itself.

    Is a 303/RDF combination adequate to describe non-information resources in general? I think so. Is a 303/RDF combination also useful for describing information resources? Yes, but I don't see any reason to impose a 303 to describe an information resource: one could jump right to some RDF.

    What do others think about that?
