Tuesday, November 09, 2010

A(nother) Guide to Publishing Linked Data Without Redirects

It seems to me that we in the Linked Data community have a need to:

  1. Assign URIs to resources, be they physical, conceptual or virtual (information resources) in nature.

  2. Apply the same mechanisms for metadata description to any resource, regardless of type.

  3. Be able to traverse in obvious ways from a resource to its metadata description and from a metadata description to its resource.

Unfortunately, we can't do all that yet, at least not easily and in all circumstances. We are close, but not close enough.

Linked Data deployment is hampered by the requirement for so-called "slash" URLs to be resolved via a 303 (See Other) redirection. Unfortunately, many people wishing to publish Linked Data don't understand the subtleties of 303 redirection, nor do many of them have adequate control over their Web server configurations to implement 303 redirections. Ian Davis has been looking for a solution to this problem. Unfortunately, I don't think he has found it yet.

Ian published A Guide to Publishing Linked Data Without Redirects specifically to find a way around the confusing (and sometimes difficult) usage of 303 redirects for Linked Data. Ian's original question was: "What breaks on the web if we use status code 200 instead of 303 for our Linked Data?"

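Ian's answer is to return a 200 (OK) response that carries a Content-Location header pointing to the description document. Unfortunately, the use of the Content-Location header with Linked Data raises the same questions as 303s: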

  1. It requires a change of thinking regarding the meaning of 200 (OK), specifically to the http-range-14 finding.

  2. It suffers from the same problem as 303s in relation to deployment with current hosting companies/IT departments. If you don't have control over your Apache configuration, you can't publish your Linked Data.

  3. There is an "implicit redirect", in that one may wish or need to check the URL in the Content-Location header.

The first one admittedly bothers me most. When we resolve a URL and receive a 200 (OK) response, we are currently guaranteed both (a) that our request succeeded in the way we expected and (b) that the thing we received is an information resource. More specifically, we expect it to be a representation of the resource we requested (the one identified by that URL).

In short, I think Ian's proposal addresses most, but not all, of the problems he meant to solve. Unfortunately, in practice it differs little from the status quo. Tom Heath has some of the same concerns.

If we are going to fix fundamental problems with serving Linked Data, I'd prefer to explicitly address the fundamental questions related to URI naming of physical, conceptual and information resources (the overloading of the HTTP name space), so I proposed an alternative solution on the public-lod@w3.org mailing list last week. This post expands on those thoughts with some more detail.

The use of 303 redirections by the Semantic Web and Linked Data community is a bit of a hack on top of the existing 303 functionality laid down in the early Web. The http-range-14 debate tried to end the arguments, but only slowed them down. We can't really hack at the 303 any more than we have. I explored that in 2007 and came up pretty empty.

My Proposal


I propose deprecating the 303 for use in Linked Data (only) in favor of a new HTTP status code. The new status code would state "The URI you just dereferenced identifies a resource that may be informational, physical or conceptual. The information you are being returned in this response contains a metadata description of the resource you dereferenced." This new status code would be used to disambiguate between generic information resources and the special class of information resources that describe (via metadata) an addressed URI.

The "metadata description" would generally be in some form of RDF serialization, but could also be in HTML (for human consumption) or in some future metadata representation format. Existing HTTP content negotiation approaches and Content-Type headers would be sufficient to inform both requester and Web server what they received.

I propose that the new status code be called 210 (Description Found).

Existing HTTP status codes may be found in RFC 2616 Section 10.
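To make the intended reading of the new code concrete, here is a minimal Python sketch of how a Linked Data client might interpret a response status. The function name interpret_status and the labels it returns are hypothetical, and 210 is only the code proposed in this post, not a registered HTTP status code.

```python
def interpret_status(status: int) -> str:
    """Say what the response body tells us about the dereferenced URI.

    The labels returned here are illustrative only.
    """
    if status == 200:
        # Under the http-range-14 finding, the URI names an information
        # resource and the body is a representation of it.
        return "representation"
    if status == 210:
        # Proposed code (an assumption of this sketch): the body is a
        # metadata description of the resource, whatever its nature.
        return "description"
    if status == 303:
        # Current Linked Data practice: the description lives at the
        # URI given in the Location header.
        return "see-other"
    return "other"
```

A client would branch on the label rather than on the nature of the resource, which it cannot know in advance.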

Example Requests and Responses


Let's start with the basics. If we resolve a URI to an information resource, we get a 200 (OK) response upon success:

# Get an information resource:
$ curl -I http://example.com/toucan.info
HTTP/1.1 200 OK
Date: Wed, 10 Nov 2010 21:37:44 GMT
Server: Apache/2.2.3 (Red Hat)
Content-Type: text/html;charset=UTF-8
Content-Length: 1739



An information resource that supports some (any!) form of embedded RDF can easily point to its metadata description at another URL (e.g. via a link element or a POWDER description). The metadata description can easily point back to the described resource.

Physical and conceptual resources are where we have historically run into trouble on the Web of Data. A "slash" URI assigned to name a physical or conceptual resource has required a 303 redirection to another document and the semantics are unclear at best. Instead, this proposal suggests that physical and conceptual resources explicitly return a 210 (Description Found) status code, thus removing any ambiguity from the response.

The resolution of a URI to a physical resource might return:

# Get a description of a physical resource:
$ curl -I http://example.com/toucan.physical
HTTP/1.1 210 Description Found
Date: Wed, 10 Nov 2010 21:38:52 GMT
Server: Apache/2.2.3 (Red Hat)
Content-Type: text/turtle
Content-Length: 1739



The body of the response would naturally be (in this case) an RDF document describing the physical resource. The fact that the resource is physical would be encoded in an RDF statement in the description.

Conceptual resources could be handled in an identical manner. The only difference would be in the requested URI and differing content returned:

# Get a description of a conceptual resource:
$ curl -I http://example.com/toucan.concept
HTTP/1.1 210 Description Found
Date: Wed, 10 Nov 2010 21:40:12 GMT
Server: Apache/2.2.3 (Red Hat)
Content-Type: text/turtle
Content-Length: 1214



Again, the fact that the resource is conceptual would be encoded in an RDF statement in the description.
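The three exchanges above can be reproduced locally with a small sketch built on Python's standard http.server module. Everything specific here is an assumption made for illustration: the paths mirror the example URLs, the Turtle bodies are placeholders, the classes in the http://example.com/ns# namespace are invented, and only GET is handled (curl -I sends HEAD).

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder descriptions; the ns# classes are made up for this sketch.
DESCRIPTIONS = {
    "/toucan.physical": b"<http://example.com/toucan.physical> "
                        b"a <http://example.com/ns#PhysicalResource> .\n",
    "/toucan.concept": b"<http://example.com/toucan.concept> "
                       b"a <http://example.com/ns#ConceptualResource> .\n",
}
DOCUMENTS = {"/toucan.info": b"<html><body>About toucans</body></html>\n"}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path in DESCRIPTIONS:
            # Physical or conceptual resource: answer with the proposed 210.
            self._send(210, "Description Found", "text/turtle",
                       DESCRIPTIONS[self.path])
        elif self.path in DOCUMENTS:
            # Ordinary information resource: plain 200.
            self._send(200, "OK", "text/html;charset=UTF-8",
                       DOCUMENTS[self.path])
        else:
            self.send_error(404)

    def _send(self, status, reason, content_type, body):
        self.send_response(status, reason)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(path):
    """Serve on a free local port, GET path once, return the response parts."""
    server = HTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
        conn.request("GET", path)
        response = conn.getresponse()
        result = (response.status, response.reason,
                  response.getheader("Content-Type"), response.read())
        conn.close()
        return result
    finally:
        server.shutdown()
        server.server_close()
```

Calling fetch("/toucan.physical") returns the status, reason phrase, content type and body in one request, with no redirect involved.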

Savvy readers might note that the existing status code 300 (Multiple Choices) could be used when multiple metadata descriptions of a resource are available:

The requested resource corresponds to any one of a set of
representations, each with its own specific location, and
agent-driven negotiation information (section 12) is being
provided so that the user (or user agent) can select a
preferred representation and redirect its request to that
location.


Note that Ian's statement that when using a 303 "only one description can be linked from [a resource's URI]" is not correct; a standards-compliant Web server could return a 300 status code instead, should its operator so wish (and be able to figure out how to configure the server to do that).

Ramifications


How does my proposal stack up to Ian's? Ian listed nine problems with the 303, the most important of which (in my opinion) were:

  • it requires an extra round-trip to the server for every request (at least, that's important to those implementing browsers, spiders and Linked Data clients and to those with limited bandwidth)

  • the user enters one URI into their browser and ends up at a different one, causing confusion when they want to reuse the URI (PURLs also suffer from this due to odd UI decisions by browser makers)

  • having to explain the reasoning behind using 303 redirects to mainstream web developers simply reinforces the perception that the semantic web is baroque and irrelevant to their needs.


Additionally, three of his concerns related to the difficulties of Web server configuration:

  • it's non-trivial to configure a web server to issue the correct redirect and only to do so for the things that are not information resources.

  • the server operator has to decide which resources are information resources and which are not without any precise guidance on how to distinguish the two

  • it cannot be implemented using a static web server setup, i.e. one that serves static RDF documents



The 210 status code proposal would effectively deal with Ian's major issues. Metadata describing a resource could be returned in a single GET if the resource were physical or conceptual (that is, not an information resource). It would be reachable for information resources, although requiring two hops if the URL to the metadata is not known. The URI displayed by a browser would not change. Importantly, the 210 is conceptually much easier to explain.
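The round-trip and changing-URI points can be illustrated with a toy simulation. This is only a sketch: the two dictionaries stand in for Web servers, and the URIs, bodies and the function name get_description are all made up for the example.

```python
# Toy "webs": URI -> (status, headers, body). All entries are invented.
WEB_303 = {
    "http://example.com/toucan":
        (303, {"Location": "http://example.com/toucan.rdf"}, ""),
    "http://example.com/toucan.rdf":
        (200, {"Content-Type": "text/turtle"}, "# description of toucan"),
}
WEB_210 = {
    "http://example.com/toucan":
        (210, {"Content-Type": "text/turtle"}, "# description of toucan"),
}


def get_description(uri, web, max_requests=5):
    """Return (description, final_uri, requests_made) for a resource URI."""
    current = uri
    requests_made = 0
    while requests_made < max_requests:
        status, headers, body = web[current]
        requests_made += 1
        if status == 303:
            # The client must ask again, and the URI it ends up at changes.
            current = headers["Location"]
            continue
        if status in (200, 210):
            return body, current, requests_made
        raise ValueError("unexpected status %d" % status)
    raise RuntimeError("too many redirects")
```

With WEB_303 the description arrives after two requests and at a different URI; with WEB_210 it arrives after one request and the URI is unchanged.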

Support For Existing Web Servers


Web servers, even existing ones at hosting centers, can be configured to serve 210 content immediately, at least via a simple hack. The hosting company we use for 3roundstones.com (Arvixe) allows limited site configuration using cPanel, which allows Apache handlers to be associated with file extensions in URLs. One of the handlers installed by default with Apache is mod_asis.

mod_asis is used to send a file "as is". A file sent that way can contain HTTP headers, separated from the body by a blank line. Using that trick, we might associate a URI (say, http://example.com/toucan.physical) with a metadata description of a physical object. The resource file served when that URL is resolved looks like this (inclusive of the 210 status code!):

Status: 210 Description Found
Date: Wed, 10 Nov 2010 15:07:14 GMT
Content-Type: text/turtle
 
<http://example.com/toucan.physical>
a <http://dbpedia.org/resource/Toucan> ;
...



The combination of mod_asis and a file (with a mapped extension) containing custom HTTP headers (including a Status pseudo header) will result in the remainder of the file being served with the designated headers. In this case, that means that we can return 210 status codes from any URL we wish using a stock Web hosting service.
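Files in this layout are easy to generate. The following Python sketch assembles one; the helper name build_asis_file and the placeholder body are assumptions, not part of any tool. As far as I know, the handler that mod_asis registers is named send-as-is, and Apache's documentation says the server adds the Date and Server headers itself, so the sketch leaves them out.

```python
def build_asis_file(status_line, headers, body):
    """Return the text of a file suitable for serving through mod_asis.

    Layout: a Status pseudo header, the remaining headers, one blank
    line, then the body exactly as it should be sent.
    """
    lines = ["Status: " + status_line]
    lines += ["%s: %s" % (name, value) for name, value in headers.items()]
    return "\n".join(lines) + "\n\n" + body


# Placeholder body; a real file would carry the Turtle description.
content = build_asis_file(
    "210 Description Found",
    {"Content-Type": "text/turtle"},
    "# Turtle description of the resource goes here\n",
)
```

Writing content to a file whose extension is mapped to the handler should be all that a static site needs.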

Some might consider the use of file extensions restrictive (or just a PITA), but the Principle of URI Opacity protects us from people like that :)

Other Considerations


It may interest some to note that common Web clients (including human-oriented browsers and command line clients such as curl and wget) do not seem to mind a non-standard 200-series status code. They return the document and the new status code without complaint.

There are some disadvantages to the 210 proposal. Most importantly, this proposal is a change to the very fabric of HTTP and thus the Web. The W3C and IETF would need to standardize the 210 status code, probably in a new IETF RFC. That will take time and effort. Web server operators would have to configure their Web servers to return the correct status code (as described above), at least until Web servers ship with 210 support by default.

Please comment. If we want to build the Semantic Web and the Linked Data community on a designed fabric instead of a series of hacks, the time to start is now. Even now is late, but it is not (yet) impossible.
