Sunday, November 26, 2006

Requiem for a Lost Soul

My brother, you had it all.
You were so smart, so strong, so handsome.
You took our father's name and made me jealous.

My brother, your skills stunned us.
You were a wonderful musician,
a skilled linguist, a clear thinker, a nice man.

My brother, why did you not see your own worth?
Why did you need others to validate you?
Why did you need others to force you to lead?

My brother, you loved and were loved
and yet it was not enough to save you.
Your poison of choice was too strong.

My brother, now you are dead
and the world is a happier place for it.
Not one of us anticipated that.

Monday, November 13, 2006

Sun Makes a Really Great Mess

In a shock move, Sun Microsystems, every geek's favorite non-profit corporation, released Java ME and SE today under - get this - the GNU General Public License version 2.

Do they have any idea what they did to the industry? I don't think so. Sun seems to be claiming that Java programs which run on a GPL'd Java Virtual Machine are not "derivative works" of the Java language.

A derivative work in the GPL is defined as it is under copyright law, namely, "a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language."

The question is whether a Java program will be considered a derivative work of Java (such as when you extend java.lang.Object or use reflection) by any court, anywhere, under any nation's copyright law. That it will happen somewhere seems likely to me, and incredibly dangerous to the Java industry.

Why is this such a problem? Because the GPL says, "But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it." This is the "viral nature" of the GPL and the heart of the debate that will dominate the blogosphere and the industry news media for the immediate future.

I think Sun has just let a very powerful Pandora out of the box.

Thursday, November 09, 2006

International Semantic Web Conference (ISWC) 2006 DAY 3

Thursday, 9 November 2006

I attended the track this morning on Applications of SW Technologies with Lessons Learned, including these papers:

Crawling and Indexing Semantic Web Data (Andreas Harth, Juergen Umbrich, Stefan Decker),
Using Ontologies for Extracting Product Features from Web Pages (Wolfgang Holzinger, Bernhard Kruepl, Marcus Herzog) and Characterizing the Semantic Web on the Web (Li Ding, Tim Finin).

I asked Andreas Harth and Li Ding (Swoogle) about indexing RDFa content and confirmed my opinion regarding its difficulty. Neither project currently indexes RDFa documents, for the simple reason that they have no way to identify RDFa content without parsing every XHTML document they come across. The cost of doing that is too high.

I spoke with DanC and Ivan Herman about this at some length, but nobody seems to know what to do about it. Do you add an in-document identifier for RDFa content? If so, you lose a critical RDFa feature: the ability to cut-and-paste sections of content without losing machine readability. Do you just point to RDFa compatible documents from other documents in such a way that search engines get the hint they need? Swoogle would be fine with that, but it doesn't address how RDFa documents are consumed in a browser by the general public. Perhaps the answer is, as Steve Harris would have it, that your browser should just parse a document locally to see if it contains any triples of interest to you. It doesn't address global searching, but many people seem willing to cede that to those willing to parse the documents, like Google.
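Steve Harris's "just parse it locally" suggestion can be sketched in a few lines. The toy scanner below looks for RDFa-style attributes (about, property) in an (X)HTML fragment and pulls out simple literal triples. It is purely illustrative, and deliberately naive: a real RDFa processor must also handle CURIE resolution via namespace declarations, rel/rev links, datatypes and nesting.

```python
# Sketch of local RDFa sniffing with the standard library HTML parser.
# Illustrative only: a real RDFa parser does much more (CURIE
# resolution, rel/rev, datatypes, nested subjects).
from html.parser import HTMLParser

class RDFaSniffer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.subject = None
        self.pending = None          # property attribute awaiting its text
        self.triples = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "about" in a:
            self.subject = a["about"]
        if "property" in a:
            self.pending = a["property"]

    def handle_data(self, data):
        if self.pending and self.subject and data.strip():
            self.triples.append((self.subject, self.pending, data.strip()))
            self.pending = None

doc = """<div about="#me">
  <span property="foaf:name">Jane Doe</span>
</div>"""

sniffer = RDFaSniffer()
sniffer.feed(doc)
# sniffer.triples now holds ("#me", "foaf:name", "Jane Doe")
```

Note that this also demonstrates the cost objection: to learn whether the document held any triples at all, we had to parse the whole thing.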

The W3C's RDF-in-XHTML Task Force, which has recently moved to the Semantic Web Deployment Working Group, has discussed this at length and not come up with an answer. I don't have one myself.

Wednesday, November 08, 2006

International Semantic Web Conference (ISWC) 2006 DAY 2

Wednesday, 8 November 2006

Susie Stephens of Oracle presented her Industry Track paper Integrating Enterprise Data with Semantic Technologies. Oracle 10g Release 2 embeds a SPARQL-like graph pattern into SQL (that is industry speak for "We don't support SPARQL, but please don't penalize us for it"). There is apparently no funded project within Oracle to support SPARQL. A forward chaining rules engine, equivalent to Mulgara's Krule, supports RDFS and user-defined rules.
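For readers unfamiliar with forward chaining: the engine applies its rules to the known facts, adds whatever new facts result, and repeats until nothing new appears (a fixpoint). Here is a minimal sketch using two RDFS entailment rules (subclass transitivity and type propagation); this is a toy illustration, not Krule's or Oracle's actual implementation.

```python
# Naive forward chaining over RDF triples, iterating two RDFS
# entailment rules to a fixpoint. Illustration only.

def forward_chain(triples):
    facts = set(triples)
    while True:
        new = set()
        for (s, p, o) in facts:
            if p == "rdfs:subClassOf":
                for (s2, p2, o2) in facts:
                    # rdfs11: subClassOf is transitive
                    if p2 == "rdfs:subClassOf" and s2 == o:
                        new.add((s, "rdfs:subClassOf", o2))
                    # rdfs9: instances of a subclass belong to the superclass
                    if p2 == "rdf:type" and o2 == s:
                        new.add((s2, "rdf:type", o))
        if new <= facts:          # fixpoint reached: no new facts
            return facts
        facts |= new

base = {
    ("ex:Dog", "rdfs:subClassOf", "ex:Mammal"),
    ("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),
    ("ex:fido", "rdf:type", "ex:Dog"),
}
inferred = forward_chain(base)
# inferred now includes ("ex:fido", "rdf:type", "ex:Animal")
```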

She said that Oracle has put up to 1 billion RDF statements into Oracle. That tells me that we had better get on with funding Mulgara's XA2 next-generation data store in order to stay relevant in terms of scaling. However, she only showed numbers for query speeds up to 80 million triples.

Oracle's advantage is building on their mature database, which allows them access to existing features, such as encryption or scaling or clustering. A nice example of combining previous and new features was shown which combined a multimedia search with a term equivalence being given in RDF. Thus, a search for X-rays of "jaw" could pick up those tagged with the term "mandible".
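The mechanism behind the jaw/mandible example is simple query-term expansion over equivalences stated as triples. A sketch, with made-up vocabulary URIs (and a single expansion pass; chains of equivalences would need a fixpoint):

```python
# Expand a search term using equivalence triples, then match tags.
# URIs and the ex:equivalentTerm predicate are invented for illustration.
equiv = {("ex:jaw", "ex:equivalentTerm", "ex:mandible")}

def expand(term, triples):
    terms = {term}
    for (s, p, o) in triples:
        if p == "ex:equivalentTerm":
            if s in terms:
                terms.add(o)
            if o in terms:
                terms.add(s)
    return terms

tags = {"img1": {"ex:mandible"}, "img2": {"ex:femur"}}
hits = [img for img, t in tags.items() if t & expand("ex:jaw", equiv)]
# hits == ["img1"]: the mandible-tagged X-ray matches a search for "jaw"
```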

Oracle 11g will support some level of OWL, reportedly something "similar" to OWL Lite but modified to reduce the ability to perform computationally intensive searches.

Oracle seems to have made it easy to get started with Semantic Web technologies. That is a good and positive thing for everyone in the industry. Her comments regarding modifications to the existing URI-based identifiers to allow use of existing unique identification schemes concerned me, though. Semantic Web technologies without URIs would be a huge step backward, even taking into account the obvious short term gains. Better to facilitate the mapping of existing identifiers to URIs.

Susie also mentioned that webMethods has announced RDF and OWL support in the new version of Fabric. Indeed, this press release from webMethods says that Fabric uses RDF and OWL. Specifically, "the library automatically learns dependencies and relationships between IT assets."

Explaining Conclusions from Diverse Knowledge Sources (J William Murdock, Deborah McGuinness, Paulo Pinheiro da Silva, Chris Welty, David Ferrucci). Their Open Source framework, Unstructured Information Management Architecture in Java, seems worth a look. The goal is to produce good search results when dealing with a mixture of structured and unstructured content. The example in the talk involved some textual data extraction coupled with some theorem proving.

There was an active Semantic Web Services track at the conference. This is hardly surprising, since UDDI is so badly broken and Web Services are left without a reasonable way to perform composition and discovery. Yet Semantic Web Services still seem to be mired in academia. Perhaps the industry will start to see the light if Oracle and webMethods successfully deploy useful semantic tools to the community.

A Software Engineering Approach to Design and Development of Semantic Web Service Applications (Marco Brambilla, Irene Celino, Stefano Ceri, Dario Cerizza, Emanuele Della Valle, Federico Michele Facca) described a top-down approach toward annotating Semantic Web Services using a Spiral development model. This was the first time I have seen anyone actually use WebML. I am going to have to look at that, especially since there is a tool which implements it. They also used WSML. The link provides an interesting summary of Web rules languages.

RS2D: Fast Adaptive Search for Semantic Web Services in Unstructured P2P Networks (Matthias Klusch, Ulrich Basters) presented a model for Open World searching of semantic services. They introduced concepts like Semantic Gain, Semantic Loss and a Bayesian-derived risk factor to judge the likelihood that peers would have something to add to an answer. The idea is to use machine learning to reduce gratuitous network communication when querying Semantic Web Services. The algorithm works well for unstructured, peer-to-peer networks without a single authoritative source for information. This was also a best paper nominee.

Web 2.0 panel: Tom Gruber of realtravel identified "Collective Intelligence" as the critical feature of Web 2.0 and the area where the Semantic Web can provide the most value. Truth and semistructured queries are the critical components. "Don't ask what the Web knows, ask what the World knows", by which he means, "ask what people know."

The problem with SemWeb apps seems to me to be that they are almost all closed world. We need some large-scale, open world applications. Simile or Tabulator are the closest I have seen and incredibly cool, but they are far from mainstream. We need to encourage the development of more SemWeb apps which pull data from the Web and publish data back to the Web. I have been as guilty as anyone else on this, but I promise to try to get better.

Tom Gruber suggests that any app to address this problem should explicitly allow others to mash up on top of it. That is an excellent point.

Patrick Stickler of Nokia and Marja-Riitta Koivunen of Annotea and I discussed the state of SPARQL at dinner. The lack of a simple syntactical means of performing negation is a real problem. Perhaps we can fix that before SPARQL becomes a W3C Recommendation. Perhaps too I should recover my notes on Mulgara/Kowari/TKS's EXCLUDE operator and its relation to Jena's NOT. We don't need an RDF query language standard that is hard to use, hard to implement and has two different types of null...
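To make the negation complaint concrete: the draft language has no direct NOT, so the usual idiom is negation as failure via an OPTIONAL pattern plus a FILTER on !bound(). A sketch of the idiom, on a generic query not tied to any particular dataset, with the same semantics hand-evaluated over a toy dataset in plain Python:

```python
# The workaround idiom for negation in draft-era SPARQL: find people
# who have NO foaf:mbox by optionally matching one and filtering on
# the variable being unbound. Generic illustration only.
query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person
WHERE {
  ?person a foaf:Person .
  OPTIONAL { ?person foaf:mbox ?mbox }
  FILTER (!bound(?mbox))
}
"""

# The same semantics evaluated by hand over toy data:
people = {"ex:alice", "ex:bob"}
mboxes = {("ex:alice", "mailto:alice@example.org")}
without_mbox = {p for p in people if not any(s == p for (s, _) in mboxes)}
# without_mbox == {"ex:bob"}
```

The point stands: having to spell "not" as an optional match plus an unboundness test is hardly a simple syntactical means of performing negation.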

International Semantic Web Conference (ISWC) 2006 DAY 1

Tuesday, 7 November 2006

The first day of ISWC was intensely busy for me. I was only able to attend a single session due to the number of people I spoke with.

I had a lengthy and interesting conversation with Harry Halpin, co-chair of the W3C's GRDDL Working Group, regarding RDFa and GRDDL. I found myself representing RDFa, which I did to the best of my ability. Fortunately, I was able to recall the critical use case for RDFa over GRDDL: the requirement to support a cut-and-paste of a block of XHTML without losing information on how that block should be interpreted. We discussed at length RDFa's potential showstopper - the lack of explicit identification of RDFa content. That prohibits searching for documents which support RDFa and leaves open the question of whether one may rely on RDFa extraction from documents where there is no a priori knowledge that a source document complied with RDFa markup. I suggest that will kill RDFa in practice unless it is addressed.

Harry cheered Eric Miller's push for a "persistent URI" service for RDF identifiers, similar to the persistent URL service operated by OCLC at purl.org.

Harry and I were tutored by IBM's Chris Welty on the difference between rdfs:Resource and owl:Thing. I can never remember the difference. rdfs:Resource includes its own language-specific features, owl:Thing does not (in OWL DL), but they are the same in OWL Full. Thanks, Chris!

Chris has made a tremendous amount of progress as co-chair of the W3C's Rules Interchange Format (RIF) Working Group. The group has reportedly agreed that they will, in fact, produce a rules interchange format (hey, that was hard!), that they will define a core feature set based on positive Horn clauses and that they will support an arbitrary number of non-Horn extensions via an extension mechanism. That level of early structure should allow the group to proceed without the factionalization that dogged the WebOnt group (producers of the OWL standards).
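For those outside the rules world: a positive Horn rule has a conjunctive body and a single positive head, e.g. parent(X,Y) ∧ parent(Y,Z) → grandparent(X,Z). Applying such a rule over ground facts is mechanical, which is exactly why it makes a good interchange core. A toy sketch of that one rule:

```python
# Applying a single positive Horn rule over ground facts:
#   parent(X,Y) AND parent(Y,Z)  ->  grandparent(X,Z)
# Toy illustration of what a "core based on positive Horn" means.
facts = {("parent", "ann", "bob"), ("parent", "bob", "cid")}

def apply_grandparent(facts):
    derived = set(facts)
    for (p1, x, y) in facts:
        for (p2, y2, z) in facts:
            if p1 == p2 == "parent" and y == y2:
                derived.add(("grandparent", x, z))
    return derived

derived = apply_grandparent(facts)
# derived now includes ("grandparent", "ann", "cid")
```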

Years into Semantic Web development, we still need a coordinated location for good ontologies. Harry spoke to the guy who runs http://semanticweb.org/ about hosting mappings between Web 2.0 vocabularies and RDF and ontologies. The response was positive, but it still needs to happen. Unfortunately, I did not get his name.

Apparently there is still a need for a good geospatial ontology, even for simple agreed concepts like latitude and longitude. This has been a problem for years, especially within government circles and supporting organizations such as MITRE. Harry pointed me toward Harry Chen's blog entries here.

Harry and Norm Walsh have been working on an isomorphic mapping between vcard and RDF. Details are available on Norm's blog, although Harry told me there is a newer version which he promised to send me. This is a useful thing, especially as it makes use of some FOAF to make the mapping clean.

Steve Harris, of OWL Tiny fame, is now at UK identity protection startup Garlik. Garlik is currently operating in the UK only, but they are planning a US market entry next year. They ask their customers for personal information used to identify them to their banks and watch public and subscription databases to determine if others are using their identity. Naturally, this is done via an RDF graph. There certainly is a need for some kind of identity protection service in the US. USA Today ("McPaper") reported today that 8.9 million Americans (4% of the population) lose their identity each year and that it costs them an average of US$6,383.

I finally met Chimezie Ogbuji, now at the Cleveland Clinic and formerly of Fourthought. He worked on the 4Suite CMS, which uses RDF and XML databases to manage content. Interestingly, Chimezie is a fan of Daniel Krech. 4Suite uses Daniel's rdflib! It is also using a GRDDL-like transform between XML documents to generate the RDF.

David Taowei Wang did a good job presenting A Survey of the Web Ontology Landscape (Taowei Wang, Bijan Parsia, Jim Hendler). Bijan now has a huge unkempt beard and a nineteenth century waxed mustache. He appears to be enjoying teaching at Manchester.

The most interesting paper I have seen in a while was Semantics and Complexity of SPARQL (Jorge A. Perez, Marcelo Arenas, Claudio Gutierrez). It is up for a best paper award and probably deserves it. It is great to see someone, even if not the W3C, providing a model-theoretic semantics for SPARQL. Unfortunately, the work does not yet cover entailments or bnodes, which are outstanding issues at the W3C.

I missed seeing OntoWiki - A Tool for Social, Semantic Collaboration (Sören Auer, Thomas Riechert, Sebastian Dietzold) because I was talking to Guus Schreiber, Chris Welty, Ivan Herman and Harry Halpin (again). I'll have to read it in the proceedings, though, because it looks interesting.


The poster session was well attended. Six posters from MINDSWAP were accepted, including the Semantic Web challenge entry. Unfortunately, three of them were down a long corridor in the lunch room and received very few visitors :( I was fortunate to get an excellent location with plenty of traffic and little noise from the band.

My poster, Enhancing Software Maintenance by using Semantic Web Techniques, reported on some research in progress. My purpose in submitting it to ISWC was to get initial responses to the research direction and gather some ideas for next steps. I was pleasantly surprised to receive some very positive feedback. There was a lot of interest from software engineers, especially the more pragmatic ones, such as those from IBM, Accenture and SRI International. Software maintenance costs money and that makes a market.

There were several comments regarding the depth of my use of OWL-DL. I had created an ontology of software engineering concepts which was focused on Java for the prototype implementation. I had used OWL-DL in order to use SWOOP to ensure logical consistency. A good next step would be to represent the high-level constructs of other languages so that multi-language projects could be managed in the environment. This is necessary because different languages treat even basic concepts differently, such as the separation of abstract classes from interfaces or the existence of unimplemented method signatures. An OWL-DL ontology could readily map the equivalent and disjoint constructs across the various languages and be used to infer inheritance relationships.
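The inference step is essentially mapping each language's constructs to shared concepts and then taking a transitive closure over the direct inheritance edges. A sketch, where the concept names (se:Type and friends) are hypothetical stand-ins, not terms from the actual poster ontology:

```python
# Sketch: map per-language constructs to shared concepts, then infer
# all ancestors of a type via transitive closure over "extends" edges.
# Concept URIs here are invented for illustration.

construct_kind = {
    "java:Interface": "se:Type",        # Java interfaces and abstract
    "java:AbstractClass": "se:Type",    # classes both map to one
    "cpp:AbstractClass": "se:Type",     # shared concept, as does the
}                                       # C++ equivalent

extends = {  # direct inheritance edges, possibly spanning languages
    ("MyList", "AbstractList"),
    ("AbstractList", "List"),
}

def ancestors(t, edges):
    # breadth-first transitive closure over the extends relation
    out, frontier = set(), {t}
    while frontier:
        nxt = {b for (a, b) in edges if a in frontier} - out
        out |= nxt
        frontier = nxt
    return out

# ancestors("MyList", extends) == {"AbstractList", "List"}
```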

Dr. Kerry Taylor from CSIRO in Canberra stopped by. She knows my advisor Dave Carrington at UQ and was surprised to see him involved in SemWeb work. She confirmed the earlier comments regarding inferencing across language differences.

I must look at a similar project called FAMIX. FAMIX provides "a language-independent representation of object-oriented source code and is used ... as a basis for exchanging information about object-oriented software systems." Avi Bernstein of the University of Zurich told me about it and recommended that I discuss it with his collaborator, Harald Gall. This is what conferences are for.

Monday, November 06, 2006

2006 Survey on Software Engineering Practices Closed

The 2006 Survey on Software Engineering Practices is now closed. 448 software engineers from 52 countries participated! Thanks very much to all who helped. I will post summary data from the survey as soon as I can.

Friday, November 03, 2006

Getting the World to Listen

Today's news was chock full of articles about a researcher in Canada and his colleagues in the UK and Germany who concluded the world's oceans would be fished out by 2048. What news! A little digging showed that the original article in the journal Science came out in August of 2005 - a year and a quarter ago. Why the wait until the world noticed?

It turns out that Drs. Boris Worm and Ransom Myers of Dalhousie University's Department of Biology got tired of nobody listening to their research on the loss of biodiversity in the world's oceans. They took matters into their own hands and made it easy for reporters to break the story. They prepared press releases and did most of the work for the reporters. The result was that the research was finally picked up by the Associated Press and syndicated widely.

Today, Google news reported 570 articles (!) telling the story.

It seems sad that researchers have to become experts on both science and marketing to get their message out, but there it is.