Monday, March 23, 2009

With Apologies to Emerson

Rich are the Web-gods: who gives gifts but they?
They grope the Web for PURLs, but more than PURLs:
They pluck Force thence and give it to the wise.

Thursday, March 05, 2009

O'Reilly Media Joins the Semantic Web

O'Reilly Media (http://oreilly.com/), the current name for the geek publishing giant founded by Tim O'Reilly, has finally joined the Semantic Web.  O'Reilly's coining of the term "Web 2.0" and early misunderstandings of the Semantic Web stack led some to think that he didn't see much value in machine-readable information.  That seems to have changed, at least within O'Reilly Labs.

O'Reilly Labs launched a Beta product last month called the O'Reilly Product Metadata Interface (OPMI), which is available at http://labs.oreilly.com/opmi.html.  The OPMI is a technical platform for the exchange of metadata between publishing trading partners.  Now that it is in RDF and publicly accessible, the rest of us can play with it, too.

It is easy to retrieve RDF/XML describing any book that O'Reilly publishes. You simply perform an HTTP GET on a URL constructed with the book's International Standard Book Number (ISBN). Every edition of a published book has an ISBN and they come in two flavors, the older 10-digit variety and the newer 13-digit version. All ISBNs issued after 1 January 2007 have been 13 digits. Some books are assigned both forms by their publishers for convenience during the transition.

For example, let's get the metadata description of an O'Reilly book I wrote, Programming Internet Email. The 13-digit ISBN for the second edition of the paperback is 9781565924796, and the 10-digit equivalent is 1-56592-479-7. The OPMI nicely works with either one, but the returned RDF uses the modern 13-digit one as canonical, as it should.
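The relationship between the two ISBN forms can be sketched in a few lines of Python. This is only an illustration of the standard conversion (prefix the first nine digits with 978 and recompute the check digit); the function names are my own and have nothing to do with the OPMI itself.

```python
def isbn13_check_digit(first12: str) -> str:
    """Return the ISBN-13 check digit for the first 12 digits."""
    # Digits are weighted 1, 3, 1, 3, ... from left to right.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 (hyphens allowed) to its 13-digit equivalent."""
    digits = isbn10.replace("-", "").replace(" ", "")
    if len(digits) != 10:
        raise ValueError("expected 10 characters after removing hyphens")
    # Drop the old check digit and add the 978 "Bookland" prefix.
    body = "978" + digits[:9]
    return body + isbn13_check_digit(body)


print(isbn10_to_isbn13("1-56592-479-7"))  # prints 9781565924796
```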

The URL for any O'Reilly book is http://opmi.labs.oreilly.com/product/ followed by its ISBN, in this case 9781565924796. The full URL is thus http://opmi.labs.oreilly.com/product/9781565924796.

An HTTP GET may be done with any Web browser, of course, or on a command line by use of the curl utility:

$ curl http://opmi.labs.oreilly.com/product/9781565924796
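The same request can be made from a script. Here is a minimal Python sketch using only the standard library. The function names are hypothetical, and I am assuming the response can be decoded as UTF-8; the fetch itself obviously needs network access and a live OPMI service.

```python
import urllib.request

OPMI_BASE = "http://opmi.labs.oreilly.com/product/"  # base URL given in this post


def opmi_url(isbn: str) -> str:
    """Build the OPMI product URL for an ISBN, dropping hyphens and spaces."""
    return OPMI_BASE + isbn.replace("-", "").replace(" ", "")


def fetch_opmi_rdf(isbn: str, timeout: float = 10.0) -> str:
    """Perform the HTTP GET and return the body (assumed to be UTF-8 RDF/XML)."""
    with urllib.request.urlopen(opmi_url(isbn), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(opmi_url("9781565924796"))
```

Calling fetch_opmi_rdf("9781565924796") would then return the same document that curl prints.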


The returned RDF includes a wealth of information about the book. The OPMI uses four vocabulary descriptions in its RDF: Dublin Core for describing books (title, subject, language, etc.), Friend-of-a-Friend (FOAF) for describing people associated with those books, the library community's MARC (MAchine Readable Cataloging) relator codes for relating people and books, and the Metadata Object Description Schema (MODS) for specifying the edition of a book. MARC and MODS come from the Library of Congress and are traditionally used in library cataloging systems.

Since this metadata is on the Web, we can use standard Semantic Web query tools to query it. Using SPARQLer, a SPARQL query language processor available freely on the Web, we can query the RDF to extract bits we want. A bit of playing around makes it easy to get the author's name and the unique URI assigned to the author by O'Reilly:

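prefix dc:
prefix foaf: <http://xmlns.com/foaf/0.1/>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?work ?authorURI ?author
FROM <http://opmi.labs.oreilly.com/product/9781565924796>
WHERE {
  ?work dc:creator ?authorType .
  ?authorType rdf:_1 ?authorURI .
  ?authorURI foaf:name ?author
}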
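If you would rather generate this query for any product URL, something like the following Python sketch might do. The namespace IRIs in PREFIXES are assumptions on my part, the Dublin Core one in particular, so check them against the RDF that the OPMI actually returns. The endpoint in the usage line is a placeholder, not a real SPARQLer address.

```python
from urllib.parse import urlencode

# Assumed namespace IRIs; the Dublin Core one especially is a guess.
PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def author_query(graph_url: str) -> str:
    """Build the author-lookup query against the RDF found at graph_url."""
    lines = [f"PREFIX {name}: <{iri}>" for name, iri in PREFIXES.items()]
    lines += [
        "SELECT ?work ?authorURI ?author",
        f"FROM <{graph_url}>",
        "WHERE {",
        "  ?work dc:creator ?authorType .",
        "  ?authorType rdf:_1 ?authorURI .",
        "  ?authorURI foaf:name ?author",
        "}",
    ]
    return "\n".join(lines)


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode a query as a SPARQL Protocol GET URL (the 'query' parameter)."""
    return endpoint + "?" + urlencode({"query": query})


query = author_query("http://opmi.labs.oreilly.com/product/9781565924796")
# Hypothetical endpoint, for illustration only.
print(sparql_get_url("http://sparql.example.org/query", query))
```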


The results look like this:
work | authorURI | author
<urn:x-domain:oreilly.com:product:9781565924796.IP> | <urn:x-domain:oreilly.com:agent:pdb:2495> | "David Wood"@en
<urn:x-domain:oreilly.com:product:9781565924796.BOOK> | <urn:x-domain:oreilly.com:agent:pdb:2495> | "David Wood"@en


There are two results because the first (.IP) is the overall URI for the work in all of its possible formats. The second (.BOOK) is the book edition of the work. If this book had been published on Safari, O'Reilly's electronic publishing forum, it would also have a URL ending in ".SAF". E-books get an ".EBOOK" and Apple iPhone applications get a ".APP".
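To pull the ISBN and the format suffix apart in a script, a small Python sketch like this one could work. The pattern is inferred only from the two sample results in this post, not from any O'Reilly specification, and the function name is hypothetical.

```python
import re
from typing import Tuple

# Inferred from the sample results; optional whitespace is tolerated because
# copied query output may contain a stray space after the domain.
PRODUCT_URN = re.compile(r"^urn:x-domain:oreilly\.com:\s*product:(\d{13})\.([A-Z]+)$")


def parse_product_urn(urn: str) -> Tuple[str, str]:
    """Split a product URN into (ISBN-13, format suffix such as IP or BOOK)."""
    match = PRODUCT_URN.match(urn.strip("<> "))
    if match is None:
        raise ValueError(f"not a recognized product URN: {urn!r}")
    return match.group(1), match.group(2)


print(parse_product_urn("<urn:x-domain:oreilly.com:product:9781565924796.BOOK>"))
```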

O'Reilly claims to have published metadata for over 1,100 books, which is a pretty reasonable addition to the Semantic Web, even in Beta. Naturally, I now want O'Reilly to publish machine-readable metadata on their human-readable Web pages using RDFa. There has been no sign of that yet, though.

This content was cross-posted to Semantic Universe.

Monday, March 02, 2009

PURL Legacy Loader Now Open Source

A legacy loader is available that takes old OCLC version 1 Persistent URL (PURL) database dumps and uploads the PURLs through the new project's RESTful API. This is not production code, but is provided in the hope that it may be useful to operators of old PURL servers wishing to migrate to a more modern PURL server. The legacy loader has been released under an Apache 2.0 license.

To get the legacy loader, use Subversion to check it out like this:

svn co http://purlz.zepheira.com/svn/purlz/purlsbulkloader

Once the checkout completes, follow the directions in the file README.txt.

This information is also available at the PURL Project's Download Area.

Persistent URL (PURL) Server version 1.4 Released

The PURLZ Persistent URL Server version 1.4 is now available. See the PURLZ Downloads area to get your copy now. This release improves handling of URLs with query strings and special characters. It is recommended for immediate use by all PURL server operators.

PURLs are Web addresses or Uniform Resource Locators (URLs) that act as permanent identifiers in the face of a dynamic and changing Web infrastructure. This capability provides continuity of references to network resources that may migrate from machine to machine for business, social or technical reasons. Details are available on the PURLZ community site.
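For readers new to the idea, here is a toy Python sketch of how a persistent address can stay fixed while its target moves. It is not PURLZ code: the class, the paths and the hosts are made up for illustration, and answering with a 302 redirect is an assumption about one common configuration.

```python
from typing import Dict, Tuple


class PurlTable:
    """Toy in-memory PURL table: persistent path -> current target URL."""

    def __init__(self) -> None:
        self._targets: Dict[str, str] = {}

    def register(self, purl_path: str, target_url: str) -> None:
        """Create a PURL, or repoint an existing one at a new target."""
        self._targets[purl_path] = target_url

    def resolve(self, purl_path: str) -> Tuple[int, str]:
        """Return an (HTTP status, Location) pair for a requested PURL path."""
        target = self._targets.get(purl_path)
        if target is None:
            return 404, ""
        return 302, target


table = PurlTable()
table.register("/net/example/report", "http://old-host.example.org/report.html")
print(table.resolve("/net/example/report"))

# The resource migrates to another machine; the PURL itself does not change.
table.register("/net/example/report", "http://new-host.example.org/docs/report.html")
print(table.resolve("/net/example/report"))
```

Anyone who bookmarked the PURL keeps a working link across the move, because only the table entry is updated.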

Please see also the README and Release Notes for version 1.4.