Tuesday, June 14, 2011
Friday, June 03, 2011
There has been a long-standing argument between microformats and the Semantic Web. Many developers and, to some degree, search engines have preferred microformats because they are easy to use and to understand. Microformats are widely deployed as a result. However, there is simply no way to combine microformats on a single page. This is the Achilles heel of microformats: sooner or later someone wishes to use more than one (or a few, if they play together particularly nicely) at a time and can't do it.
RDF is harder to understand (although an experiment in Germany showed that fifth graders could easily be taught RDF; it is adults, who have already learned other ways to think, who have trouble). RDF is a completely general solution to the problems that microformats solve. RDF's raison d'être is to allow the combination of data from multiple parties (e.g. between developers and search engines, across multiple relational databases, or as an interchange format between proprietary systems). RDF can represent any type of data, and combines easily with other RDF.
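The "combines easily" point is worth making concrete. RDF's data model is a set of triples, so merging two datasets from different publishers is essentially set union. A minimal sketch in plain Python over N-Triples lines (the URIs are invented for illustration):

```python
# Two N-Triples documents from different publishers. Merging RDF is
# (to a first approximation) just set union over triples; duplicate
# statements collapse automatically. All URIs here are invented.
doc_a = """\
<http://example.com/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.com/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.com/bob> .
"""
doc_b = """\
<http://example.com/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .
<http://example.com/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
"""

def triples(doc: str) -> set:
    """Treat each non-blank line of an N-Triples document as one triple."""
    return {line for line in doc.splitlines() if line.strip()}

merged = triples(doc_a) | triples(doc_b)
print(len(merged))  # 3 distinct triples (the duplicate collapsed)
```

A real merge must additionally keep blank nodes from the two graphs distinct; an RDF library handles that detail, but the principle is the one shown. Try doing that with two arbitrary microformats.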
The argument between microformats and RDF can thus be thought of as an argument between short-term pragmatism and long-term planning. Those who want to solve a specific problem now use microformats. Those who want to solve more general problems in the future use RDF.
The presumption in the Semantic Web community is that the best (perhaps the only) way to combine microformats is either RDF or something very much like it. Further, people have been expressing needs to combine the use of multiple microformats on Web pages for about five years.
Microsoft is aware of the Semantic Web and, in fact, was an early supporter of the RDF standards at the World Wide Web Consortium (W3C). They even paid for some marketing; proof may be found here.
Unfortunately for those interested in open standards, Microsoft decided that Netscape's (remember Netscape??) use of RDF in their portal was threatening, so they decided to reinvent RDF internally as a proprietary technology. Microsoft's internal version of RDF has appeared in their file system, SharePoint and other products. That's the way this story simply must play out: use RDF or reinvent it.
Yahoo was the first search engine to support RDFa (RDF in Web pages), followed by Google. Both supported particular vocabularies of RDFa, which is the same as saying 'microformats encoded in RDF' and therefore along the lines of my earlier comments.
The new schema.org announcement is a partnership between "Google, Bing and Yahoo" or "Google, Microsoft and Yahoo" depending where you look. Since Bing is Microsoft's search engine and Yahoo has licensed Bing for its search services, schema.org is really a partnership between Google and Microsoft.
So, I read schema.org as an attempt (actually, a further attempt) by Microsoft to reduce the impact of RDF and Semantic Web techniques on the search business specifically and their larger business in general. Time will tell whether that will work. History suggests that it will partially work by changing the places RDF is seen as threatening to big business. Another similar area to watch will be RDF and Linked Data's threat to the Data Warehousing market (a $10 billion market in 2010). That fight will be primarily between standards and Oracle.
Michael Hausenblas at DERI released Schema.org in RDF while I was writing this. Well done to Michael and his colleagues. As Michael said, "We're sorry for the delay". Awesome.
Tuesday, February 08, 2011
Thursday, January 06, 2011
Ian challenged me to come up with a compelling reason why HTTP should encode the difference between a resource representation and a resource description and, after some effort, I simply could not. Ian summarized his thoughts in a new post: Back to Basics with Linked Data and HTTP.
The problem in my mind has always related to the use of HTTP URIs to identify things in the real world. We can get around that easily enough by returning RDF whenever someone resolves those URIs. You get a description of a real-world thing that is as richly described as the publisher wanted it to be. Cool.
Tuesday, November 09, 2010
- Assign URIs to resources, be they physical, conceptual or virtual (information resources) in nature.
- Apply the same mechanisms for metadata description to any resource, regardless of type.
- Be able to traverse in obvious ways from a resource to its metadata description and from a metadata description to its resource.
Linked Data deployment is hampered by the requirement for so-called "slash" URLs to be resolved via a 303 (See Other) redirection. Unfortunately, many people wishing to publish Linked Data don't understand the subtleties of 303 redirection, nor do many of them have adequate control over their Web server configurations to implement 303 redirections. Ian Davis has been looking for a solution to this problem. Unfortunately, I don't think he has found it yet.
Ian published A Guide to Publishing Linked Data Without Redirects specifically to find a way around the confusing (and sometimes difficult) usage of 303 redirects for Linked Data. Ian's original question was: "What breaks on the web if we use status code 200 instead of 303 for our Linked Data?"
Unfortunately, the use of the Content-Location header with Linked Data raises the same questions as 303s:
- It requires a change of thinking regarding the meaning of 200 (OK), specifically with respect to the httpRange-14 finding.
- It suffers from the same problem as 303s in relation to deployment with current hosting companies/IT departments. If you don't have control over your Apache, you can't publish your Linked Data.
- There is an "implicit redirect", in that one may wish or need to check the URL in the Content-Location header.
In short, I think Ian's proposal mostly, but not completely, solves the problems that Ian meant to address. Unfortunately, there is little practical difference from the status quo. Tom Heath has some of the same concerns.
If we are going to fix fundamental problems with serving Linked Data, I'd prefer to explicitly address the fundamental questions related to URI naming of physical, conceptual and information resources (the overloading of the HTTP name space), so I proposed an alternative solution on the firstname.lastname@example.org mailing list last week. This post expands on those thoughts with some more detail.
The use of 303 redirections by the Semantic Web and Linked Data community is a bit of a hack on top of the 303 functionality laid down in the early Web. The httpRange-14 debate tried to end the arguments, but only slowed them down. We can't really hack at the 303 any more than we have; I explored that in 2007 and came up pretty empty.
I propose deprecating the 303 for use in Linked Data (only) in favor of a new HTTP status code. The new status code would state: "The URI you just dereferenced identifies a resource that may be informational, physical or conceptual. The body of this response contains a metadata description of the resource you dereferenced." This new status code would disambiguate between generic information resources and the special class of information resources that describe (via metadata) an addressed URI.
The "metadata description" would generally be in some form of RDF serialization, but could also be in HTML (for human consumption) or in some future metadata representation format. Existing HTTP content negotiation approaches and Content-Type headers would be sufficient to inform both requester and Web server what they received.
I propose that the new status code be called 210 (Description Found).
Existing HTTP status codes may be found in RFC 2616 Section 10.
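Client-side handling of the new code would be straightforward, which is much of its appeal. A minimal sketch of the dispatch a Linked Data client would perform (the 210 code and its meaning are, of course, only the proposal made here, not a standard):

```python
def interpret(status: int) -> str:
    """Classify an HTTP response per this proposal (210 is hypothetical)."""
    if status == 200:
        return "representation"  # the body represents the resource itself
    if status == 210:
        return "description"     # the body is metadata describing the resource
    if status == 303:
        return "redirect"        # legacy Linked Data: follow the Location header
    return "other"

print(interpret(210))  # description
```

Contrast this with the current situation, where a client must follow a 303, fetch a second URL, and then remember which URL named the thing and which named its description.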
Example Requests and Responses
Let's start with the basics. If we resolve a URI to an information resource, we get a 200 (OK) response upon success:
# Get an information resource:
$ curl -I http://example.com/toucan.info
HTTP/1.1 200 OK
Date: Wed, 10 Nov 2010 21:37:44 GMT
Server: Apache/2.2.3 (Red Hat)
An information resource that supports some (any!) form of embedded RDF can easily point to its metadata description at another URL (e.g. via a link element or a POWDER description). The metadata description can easily point back to the described resource.
Physical and conceptual resources are where we have historically run into trouble on the Web of Data. A "slash" URI assigned to name a physical or conceptual resource has required a 303 redirection to another document, and the semantics are unclear at best. Instead, this proposal suggests that resolutions of URIs for physical and conceptual resources explicitly return a 210 (Description Found) status code, thus removing any ambiguity from the response.
The resolution of a URI to a physical resource might return:
# Get a physical resource:
$ curl -I http://example.com/toucan.physical
HTTP/1.1 210 Description Found
Date: Wed, 10 Nov 2010 21:38:52 GMT
Server: Apache/2.2.3 (Red Hat)
The body of the response would naturally be (in this case) an RDF document describing the physical resource. The fact that the resource is physical would be encoded in an RDF statement in the description.
Conceptual resources could be handled in an identical manner. The only difference would be in the requested URI and differing content returned:
# Get a conceptual resource:
$ curl -I http://example.com/toucan.concept
HTTP/1.1 210 Description Found
Date: Wed, 10 Nov 2010 21:40:12 GMT
Server: Apache/2.2.3 (Red Hat)
Again, the fact that the resource is conceptual would be encoded in an RDF statement in the description.
Savvy readers might note that the existing status code 300 (Multiple Choices) could be used when multiple metadata descriptions of a resource are available. RFC 2616 defines it thus:

    The requested resource corresponds to any one of a set of
    representations, each with its own specific location, and
    agent-driven negotiation information (section 12) is being
    provided so that the user (or user agent) can select a
    preferred representation and redirect its request to that
    location.
Note that Ian's statement that when using a 303 "only one description can be linked from [a resource's URI]" is not correct; standards-compliant Web servers could use a 300 status code should they so wish (and can figure out a way to configure their Web server to do that).
How does my proposal stack up to Ian's? Ian proposed nine problems with the 303, the most important of which (in my opinion) were:
- it requires an extra round-trip to the server for every request (at least, that's important to those implementing browsers, spiders and Linked Data clients and to those with limited bandwidth)
- the user enters one URI into their browser and ends up at a different one, causing confusion when they want to reuse the URI (PURLs also suffer from this due to odd UI decisions by browser makers)
- having to explain the reasoning behind using 303 redirects to mainstream web developers simply reinforces the perception that the semantic web is baroque and irrelevant to their needs.
Additionally, three of his concerns related to the difficulties of Web server configuration:
- it's non-trivial to configure a web server to issue the correct redirect, and only to do so for the things that are not information resources.
- the server operator has to decide which resources are information resources and which are not without any precise guidance on how to distinguish the two
- it cannot be implemented using a static web server setup, i.e. one that serves static RDF documents
The 210 status code proposal would effectively deal with Ian's major issues. Metadata describing a resource could be returned in a single GET if the resource were physical or conceptual (that is, not an information resource). Metadata would still be reachable for information resources, although it would require two hops if the URL of the metadata were not known. The URI displayed by a browser would not change. Importantly, the 210 is conceptually much easier to explain.
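The round-trip saving is easy to see in miniature. Under the 303 pattern a client makes two requests (one to the resource's URI, one to the redirect target); under the proposed 210, one. A toy simulation with an in-memory "server" (all URIs, headers and bodies are invented for illustration):

```python
# Toy server: URI -> (status, headers, body). Entirely invented data;
# "/toucan" uses today's 303 pattern, "/toucan210" the proposed 210.
SERVER = {
    "/toucan":       (303, {"Location": "/toucan.about"}, ""),
    "/toucan.about": (200, {}, "<RDF description>"),
    "/toucan210":    (210, {}, "<RDF description>"),
}

def fetch_description(uri: str):
    """Resolve a URI to its description; return (body, requests made)."""
    requests = 0
    while True:
        status, headers, body = SERVER[uri]
        requests += 1
        if status == 303:
            uri = headers["Location"]  # the extra round trip the 210 avoids
            continue
        return body, requests

print(fetch_description("/toucan"))     # ('<RDF description>', 2)
print(fetch_description("/toucan210"))  # ('<RDF description>', 1)
```

One request instead of two may sound trivial, but it matters to spiders, Linked Data clients and anyone on limited bandwidth, exactly as Ian noted.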
Support For Existing Web Servers
Web servers, even existing ones at hosting centers, can easily be configured to serve 210 content immediately - at least via a simple hack. The host we use for 3roundstones.com (Arvixe) allows limited site configuration using cPanel. cPanel allows Apache handlers to be associated with file extensions in URLs. One of the handlers installed by default with Apache is mod_asis.
mod_asis is used to send a file "as is". A file sent that way can contain HTTP headers, separated from the body by a blank line. Using that trick, we might associate a URI (say, http://example.com/toucan.physical) with a metadata description of a physical object. The resource file served when that URL is resolved looks like this (inclusive of the 210 status code!):
Status: 210 Description Found
Content-Type: text/turtle
Date: Wed, 10 Nov 2010 15:07:14 GMT

<http://example.com/toucan.physical>
    a <http://dbpedia.org/resource/Toucan> .
The combination of mod_asis and a file (with a mapped extension) containing custom HTTP headers (including a Status pseudo header) will result in the remainder of the file being served with the designated headers. In this case, that means that we can return 210 status codes from any URL we wish using a stock Web hosting service.
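For reference, the cPanel step amounts to a one-line Apache directive. Something like the following (the .physical extension is just the one used in these examples; any mapped extension works) associates the extension with mod_asis:

```apache
# Serve files ending in .physical "as is", headers and all (mod_asis).
AddHandler send-as-is .physical
```

That is the entire server-side configuration; everything else lives in the static files themselves.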
Some might consider the use of file extensions restrictive (or just a PITA), but the Principle of URI Opacity protects us from people like that :)
It may interest some to note that common Web clients (including human-oriented browsers and command line clients such as curl and wget) do not seem to mind a non-standard 200-series status code. They return the document and the new status code without complaint.
There are some disadvantages to the 210 proposal. Most importantly, this proposal is a change to the very fabric of HTTP and thus the Web. The W3C and IETF would need to standardize the 210 status code, probably in a new IETF RFC. That will take time and effort. Web server operators would have to configure their Web servers to return the correct status code (as described above), at least until Web servers ship with 210 support by default.
Please comment. If we want to build the Semantic Web and the Linked Data community on a designed fabric instead of a series of hacks, the time to start is now. Even now is late, but it is not (yet) impossible.
Wednesday, October 20, 2010
Thursday, October 07, 2010
A primary goal of this book is to highlight both costs and benefits to broader society of the publication of raw data to the Web by government agencies. How might the use of government Linked Data by the Fourth Estate of the public press change societies? How can agencies fulfill their missions with less cost? How must intra-agency culture change to allow public presentation of Linked Data?
Monday, August 02, 2010
I edited this book for Springer and the publisher has created a Web site for it as it enters production.
Springer seems to think the book won't be out until 2011, but I'm hoping for November because I'll be speaking at a conference then and would like to see it out.
I have been given the rights to put the entire book's content on the Web and plan to do so as Linked Data shortly.
I wish Zepheira well and believe I am leaving at a time when the company is strong and their future looks bright.
The future for me is a bit less certain at the moment, but I'm speaking with a number of good people. More when a decision has been made, probably in late August around my birthday. In the meantime, I've updated my resume and Linked In profile as I make the rounds.
Feel free to contact me or leave a comment if you know of exciting opportunities.
Thursday, July 01, 2010
Callimachus version 0.1.1 is now available. This release includes
updated documentation and the first sample applications.
Please see the directions in the file SAMPLE-APPS.txt to understand
the sample applications. More are coming soon!
You can acquire this release either by downloading the ZIP archive
from the downloads area or by checking out the v0.1.1 tag:
svn checkout http://callimachus.googlecode.com/svn/tags/0.1.1/
Either way, follow the directions in README.txt to get started.
Have fun and please report your experiences with Callimachus to the discussion list!
Thursday, October 22, 2009
The tiny fēn is about 3 mm. The cùn is traditionally the width of a person's thumb at the knuckle. The chǐ (or Chinese 'foot') is derived from the length of a human forearm, like a cubit. Or so says Wikipedia.
Those were hard-working people, to have thumbs as wide as a cùn.
The ruler is wooden, with brass inlays marking the units.
Friday, September 04, 2009
This article at Reuters reported on damage control attempts at Amazon after it (in a delicious piece of irony) deleted copies of George Orwell's 1984 from its Kindles in July. The provider of the ebook version of 1984 apparently did not own the appropriate publication rights. Readers were naturally upset at the sudden disappearance of content from their readers, although of course they forgot to read the fine print, didn't they? You can't buy an ebook, you can only rent one. Amazon was technically within its rights to delete the content.
That's hardly the full story, though. Amazon was sued by a high school student for having also removed his "copious notes" regarding the deleted novel. The Reuters story linked above showed Amazon's hand when they reported:
Amazon's email on Thursday said that the company would replace
the deleted books along with any annotations made by customers.
That's right, Kindle fans. Amazon has admitted publicly that they, like Orwell's Big Brother, keep copies of any annotations that Kindle users make on the devices - for months, at least. Holy cow!
The full text of Amazon's email to affected customers is available at the WSJ.
Perhaps more amazing is that Kindle readers don't particularly seem to care (cf. comments to the WSJ blog post). Kindle notes are synced to an Amazon server and thus available to readers over the Web. That may seem like a feature to some, but not to me. I'll back up my own notes, thanks.
Friday, August 21, 2009
This is a new style of "collective wisdom" books from O'Reilly. An earlier one was aimed at software architects.
I was pleased to see that O'Reilly used one of my quotes at the top of their home page for the book ("Clever Code Is Hard to Maintain...and Maintenance Is Everything").
The tips I wrote for this book were:
- Clever Code Is Hard To Maintain
- The 60/60 Rule
- The Fallacy Of Perfect Execution
- The Fallacy Of Perfect Knowledge
- The Fallacy Of The Big Round Ball
- The Web Points The Way, For Now
Monday, August 17, 2009
I don't know if I still have the math to slog through it, but it looks to be worth the effort.
Called the Invariant Set Postulate, the proposed law offers a geometry of space-time that resolves long-standing difficulties in quantum mechanics, including complementarity, quantum coherence, superposition and wave-particle duality. Quantum description of gravity may even be possible. Wow. That is an amazingly out-of-the-box contribution.
For the faint of heart, here is a key quote: "The Invariant Set Postulate appears to reconcile Einstein’s view that quantum mechanics is incomplete, with the Copenhagen interpretation that the observer plays a vital role in defining the very concept of reality."
Monday, June 15, 2009
Friday, June 12, 2009
Monday, June 01, 2009
Friday, May 29, 2009
Zepheira partners Eric Miller, Uche Ogbuji and myself will brief representatives of the press at 12:00 US Pacific Time in the Fairmont Hotel in San Jose. Zepheira will demonstrate Freemix in a booth on the SemTech exhibit floor.
SemTech conference attendees may also attend a briefing on Freemix on Wednesday, 17 June 2009 from 5:00-6:00 PM US PST.
If you are a spreadsheet user and want to share your data more widely, Freemix is for you. Wouldn't it be nice if your data had friends, too?
Thursday, May 21, 2009
Dan McCreary and I will be giving a three-hour tutorial on entity extraction on the Monday. I'll be presenting a talk on Active PURLs: Stored Procedures for the Semantic Web on the Tuesday. Additionally, it seems likely that I will replace Uche on a panel dubiously entitled Web3-4-Web2, also on the Tuesday.
Speakers have been authorized to share coupons for up to $200 off registration fees. If you would like to get the coupon code, please contact me or leave a comment here by May 29, 2009.
Zepheira is a gold sponsor again this year and we will have a very cool announcement. We are going to officially launch Freemix at the conference. The site is still under authentication, but will be released to the public just before the conference. It should be exciting. If you care about putting real, live, useful, everyday data on the Semantic Web, come see it.
Saturday, May 16, 2009
Firstly (using the rare American adverb here - don't be confused), you can't expect Wolfram Alpha to act like Google. It is a new kind of search engine, as one should expect from Stephen Wolfram. Wolfram is famously the inventor of Mathematica and author of A New Kind of Science.
Wolfram Alpha seems to consist of a linguistic interpretation engine coupled to Mathematica and a growing number of databases. Google, on the other hand, is a free-text indexer of Web content. That suggests that while one might be able to type just about any word or phrase into Google that is somewhere on the Web, one must limit Wolfram Alpha queries to concepts that are in its databases or may be treated as mathematical relationships. Indeed, this seems to be the case.
Wolfram's overview video is well worth watching. It, and the example search results available from the home page, give a flavor for the powerful searches one can do with the site.
Following a lead from the video, I tried typing the female name "Bernadette" into the search box. Wolfram Alpha, as advertised, did indeed respond with a presumption that I wanted information about the name and results that included a time distribution plot of popularity. Searching for "Bernadette David" gave me a distribution plot of both names which showed the highest combined popularity did in fact occur around our birth years. Well done, Wolfram Alpha.
Changing the previous search to "Bernadette Peters" resulted in some minor information about the actress and a link to her Wikipedia entry. Wikipedia links are provided where possible, as a transparent but useful attempt to provide flesh to limited source content.
However, more general searches, such as the word "Zepheira", produced no results. Wolfram Alpha responds to null result sets with a message saying "Wolfram|Alpha isn't sure what to do with your input." That alone makes it clear that Wolfram Alpha and Google are at best complementary.
Too many users on the site result in a cute message saying "I'm sorry Dave, I'm afraid I can't do that..." - which is only mildly freaky if your name happens to be Dave. The reference naturally comes from the mutiny of the HAL 9000 computer in the film "2001: A Space Odyssey".
Math, science, engineering and finance queries work well, as expected. A Web interface to Mathematica is useful in itself. I suspect that the site will be most effectively used by college students and some working professionals. My mom and dad are unlikely to find it compelling (although my dad is a weather geek and weather data is well represented, so I might be wrong). Still, the lack of detailed weather results such as live RADAR images would more likely lead him to weather.com.
One can do funky and useless math with aplomb. Wolfram Alpha rapidly provided me with the correct interpretation, unit dimensions and unit conversions for the search "100 furlongs per microfortnight", a speed well above that of sound but under that of light.
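The arithmetic behind that claim checks out, and is a nice exercise in unit conversion (a furlong is exactly 660 feet, i.e. 201.168 m, and a microfortnight is 14 days times 10^-6):

```python
# 100 furlongs per microfortnight, converted to SI.
furlong_m = 201.168                        # 1 furlong = 660 ft = 201.168 m
microfortnight_s = 14 * 24 * 3600 * 1e-6   # 1.2096 seconds

speed = 100 * furlong_m / microfortnight_s  # metres per second
print(round(speed, 1))  # 16631.0
```

Roughly 16.6 km/s: far above the speed of sound (about 343 m/s at sea level) and comfortably below the speed of light, exactly as Wolfram Alpha reported.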
Minor misspellings were handled effectively (e.g. "area of icosehedron" was correctly interpreted as "area of icosahedron"). Similarly, "volume of icosahedron" resulted in a correct interpretation. I expected the search "distance to a star" to fail miserably, but the answer was surprisingly useful. Try it yourself to see what I mean.
The problem with this kind of interface is that interpretations of intent are notoriously hard, if not impossible, in the general case. How can Wolfram Alpha expect to know that when I typed "birth year of gandhi" I meant Mahatma Gandhi? What if I meant Indira Gandhi? Guessing is fine as far as it goes, but most search engines chose to give up that approach a decade ago in favor of simply returning ranked search results.
The interface style is also naturally limited by its underlying data. Searching for "the size of the World Wide Web" resulted in a suggestion to try "the size of the world wide" - which it could answer as the diameter of Earth.
I wonder how many people recall that Yahoo used to allow mathematical equations in their search engine? They seem to have removed the functionality. One can only presume that it got in the way of becoming a more general Internet search engine. I suspect there is a lesson there for Wolfram Research. Will Wolfram Alpha stay aimed at specialists or will it grow into a more general tool? Time will tell. Their plan to integrate more databases does not promise to address the inherent limitations of guessing linguistic intent.
In summary, Wolfram Alpha is an expert-friendly search system for specialists and is best used as an orthogonal complement to Google and other general search engines. Its approach is pure Wolfram - unashamedly different and unapologetically ignorant of lessons learned by others.
Sunday, April 26, 2009
Fortunately, others are doing active research on agricultural origins even if I am not. Dr. Dorian Fuller of the Institute of Archaeology at University College London has cracked a very special nut, indeed. He and his team have located substantial evidence of the location and timing of rice domestication in the Lower Yangtze region of Zhejiang, China.
Dr. Fuller and his colleagues discovered a location where the local diet shifted dramatically from a hunter-gatherer lifestyle to an agricultural one over a mere three hundred years. That alone is fascinating and an important discovery. Equally interesting was the dating of the shift, from 6900 to 6600 years ago. That places rice domestication in a timeframe fully two thousand years later than thought and lends serious support to diffusion theories (versus parallel development).
Fuller's team collected mixtures of midden material from the site and painstakingly separated wild rice remains from domesticated rice remains. Specifically, they looked at spikelet bases, the points where rice seeds attach to stalks. Like other domesticated plants, rice underwent a genetic shift, through artificial selection, to retain its seeds for harvest by humans. The shape of the spikelet bases is sufficiently different for the two to be distinguishable.
There is a nice scanning electron microscope image of a wild rice spikelet base at the Agricultural Biodiversity Weblog.
The last I heard, Londo's investigation [1] was still suggesting multiple independent origins of rice in Southeast Asia and lower China. Hopefully Fuller's paper [2] will put that to rest. Londo at least admitted that his team wasn't certain.
Wikipedia's entry on rice says, "Rice has been cultivated in Asia likely over 10,000 years." It is clearly time to correct that entry and, more broadly, correct the education of literally billions of people who are taught it. I really need to get back to work on my Origins of Agriculture summary and update it with these findings.
[1] Londo, J.P., Chiang, Y-C., Hung, K-H., Chiang, T-Y. and Schaal, B.A. (2006). "Phylogeography of Asian wild rice, Oryza rufipogon, reveals multiple independent domestications of cultivated rice, Oryza sativa". PNAS 103(25), pp. 9578-9583. http://www.pnas.org/content/103/25/9578.long
[2] Fuller, D.Q., Qin, L., Zheng, Y., Zhao, Z., Chen, X., Hosoya, L.A. and Sun, G-P. (2009). "The Domestication Process and Domestication Rate in Rice: Spikelet Bases from the Lower Yangtze". Science 323(5921), pp. 1607-1610. http://www.sciencemag.org/cgi/content/abstract/323/5921/1607
Saturday, April 25, 2009
Even my eight-year-old can figure this one out, all by herself and with no hints.