Thursday, 9 November 2006
I attended this morning's track on Applications of SW Technologies with Lessons Learned, which included these papers:
Crawling and Indexing Semantic Web Data (Andreas Harth, Juergen Umbrich, Stefan Decker),
Using Ontologies for Extracting Product Features from Web Pages (Wolfgang Holzinger, Bernhard Kruepl, Marcus Herzog), and
Characterizing the Semantic Web on the Web (Li Ding, Tim Finin).
I asked Andreas Harth and Li Ding (Swoogle) about indexing RDFa content, and they confirmed my opinion regarding its difficulty. Neither project currently indexes RDFa documents, for the simple reason that they have no way to identify RDFa content without parsing every XHTML document they come across. The cost of doing that is too high.
I spoke with DanC and Ivan Herman about this at some length, but nobody seems to know what to do about it. Do you add an in-document identifier for RDFa content? If so, you lose a critical RDFa feature: the ability to cut and paste sections of content without losing machine readability. Do you just point to RDFa-compatible documents from other documents in such a way that search engines get the hint they need? Swoogle would be fine with that, but it doesn't address how RDFa documents are consumed in a browser by the general public. Perhaps the answer is, as Steve Harris would have it, that your browser should just parse a document locally to see if it contains any triples of interest to you. That doesn't address global searching, but many people seem willing to cede that to those willing to parse the documents, like Google.
The W3C's RDF-in-XHTML Task Force, which has recently moved to the Semantic Web Deployment Working Group, has discussed this at length and not come up with an answer. I don't have one myself.
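To make the per-document cost concrete, here is a minimal sketch of what a crawler would have to do to every page it fetches, absent any external hint. It is an illustration only, not how Swoogle or any other crawler works. The function name `might_contain_rdfa` is hypothetical, and treating the `about` and `property` attributes as markers of RDFa is my own simplifying assumption.

```python
from html.parser import HTMLParser

# Assumption for illustration: treat these attribute names as signs of RDFa.
# A real detector would need the full attribute set from the RDFa drafts.
RDFA_HINT_ATTRS = {"about", "property"}


class RDFaHintParser(HTMLParser):
    """Walks every start tag and records whether a hint attribute appears."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if any(name in RDFA_HINT_ATTRS for name, _ in attrs):
            self.found = True


def might_contain_rdfa(markup: str) -> bool:
    """Return True if any element carries one of the hint attributes."""
    parser = RDFaHintParser()
    parser.feed(markup)
    parser.close()
    return parser.found
```

The check itself is trivial; the expense is that the whole document has to be fetched and tokenized before you learn whether it was worth fetching.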
Great to see you mentioning RDFa...thanks for that. :)
Just one thing to say though, that the issue of how to identify when an XHTML document contains RDFa is really not a show-stopper, and it certainly hasn't been discussed at length! The reason for that is simply that it won't be very difficult to do, should we decide that it is needed.
As it stands at the moment, it's not completely clear that we do need to do this. The reason I say that is that in many ways RDFa is simply a question of interpretation; we've worked very hard to ensure that RDFa harmonises with normal HTML metadata practices, so you could say that HTML documents already contain RDFa. (See for example, RDFa: The Gentle Road to RDF.)
But I wouldn't rule out having some mechanism that indicates that 'there's useful stuff in here', such as using @profile, for example. (Although it might save on server processing if you could indicate the presence of the metadata outside of the document, it would limit RDFa's usefulness; one of the major use cases of RDFa is the ability to publish a blog via something like Blogger and just have the RDFa 'work'.)
Thanks again for the interest.
"But I wouldn't rule out having some mechanism that indicates that 'there's useful stuff in here'" ...
how about /html/head/@profile = 'http://www.w3.org/2003/g/data-view' and link[@rel='transformation']/@href = 'http://www.w3.org/2001/sw/grddl-wg/td/RDFa2RDFXML.xsl'. Or an RDFa profile which identifies a transformation to use to extract RDFa..
I.e., leave a GRDDL trail for extracting RDF/XML from XHTML+RDFa
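A rough sketch of how a consumer might follow such a trail: check the head for the GRDDL profile, then collect any transformation links. The function name `grddl_transformations` is hypothetical, and the sketch assumes the XHTML is well-formed XML with no undeclared entities. It only finds the stylesheet URLs; it does not fetch or run them.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def grddl_transformations(xhtml: str) -> list[str]:
    """Return hrefs of rel='transformation' links if the GRDDL profile is set."""
    root = ET.fromstring(xhtml)  # assumes the root element is <html>
    head = root.find(f"{{{XHTML_NS}}}head")
    if head is None:
        return []

    # @profile may hold several space-separated URIs.
    profiles = (head.get("profile") or "").split()
    if GRDDL_PROFILE not in profiles:
        return []

    hrefs = []
    for link in head.iter(f"{{{XHTML_NS}}}link"):
        # @rel may also hold several space-separated values.
        rels = (link.get("rel") or "").split()
        href = link.get("href")
        if "transformation" in rels and href:
            hrefs.append(href)
    return hrefs
```

Note that this still parses the document, so it helps a consumer decide how to extract the triples rather than sparing a crawler the fetch.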