Wednesday, October 27, 2004

Life Sciences Industry Perspectives on the Semantic Web

Still at the W3C Semantic Web in Life Sciences Workshop (agenda).

The most consistent message from the pharma industry has been that the data management and discovery problems they face are complex. Otto Ritter from AstraZeneca said, "Drug discovery is a complex, costly, risky, information-driven enterprise". Not a bad quote, but it doesn't make you feel the truth like Ted Slator's comments about Pfizer. Pfizer is the industry's largest R&D organization. They have 12,500 employees and plan to spend US$7.9 billion on R&D alone in 2004. They have hundreds of ongoing R&D efforts in 18 therapeutic areas. Now, take that money-to-person ratio (roughly US$630,000 per employee) and combine it with the fact that the current state of knowledge management is driven by M$ Excel and PowerPoint. That is, data is collected in Excel and shown to other researchers solely (in most cases) via PowerPoint. Wow.

According to Eric Neumann (Global Head of Knowledge Management, Aventis Pharmaceutical), the primary concerns when developing drugs are safety, efficacy (will it do what it is supposed to do), cost-effectiveness and timeliness. It strikes me that the same list could be applied to software engineering projects; they are simply statements of economics. However, software projects that violate the rules are often fielded anyway.

A fundamental problem in applying semantic techniques to the life sciences industry is that basic terms are not well defined. Even simple terms like "protein" and "gene" are the subject of much argument. This would definitely hamper the development of ontologies. Still, many people are doing it in the best spirit of just getting on with things.

Two subjects of discussion in the Semantic Web Best Practices Working Group have been highlighted here: provenance and transitive relationships. Biology is complex, and the statement that a gene encodes a protein may only be true within a certain context (including the species of the genome, the gene sequence used, version info, etc). That makes transitive relationships suspect, and infers (there's that word again) that they should be made only when context is very clear. Even simple in silico experiments suffer from a failure to capture software versions and operating environments. That situation gets worse when biological experiments fail to encode full provenance.
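To make the point about context concrete, here is a minimal sketch of my own (nothing like it was shown at the workshop). It assumes each statement carries its provenance as a set of key/value pairs and only chains a transitive relationship when both statements share exactly the same context. The `Statement` type, the `ctx` helper, the `upstream_of` predicate, and the entities A through D are all hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str
    context: frozenset  # provenance as (key, value) pairs; hypothetical layout


def ctx(**kwargs):
    """Build a hashable context from keyword arguments."""
    return frozenset(kwargs.items())


def transitive_closure(statements, predicate):
    """Chain `predicate` statements transitively, but only when the two
    statements being joined have identical contexts."""
    known = {s for s in statements if s.predicate == predicate}
    changed = True
    while changed:
        changed = False
        for a in list(known):
            for b in list(known):
                if a.obj == b.subject and a.context == b.context:
                    inferred = Statement(a.subject, predicate, b.obj, a.context)
                    if inferred not in known:
                        known.add(inferred)
                        changed = True
    return known


human = ctx(species="human", assembly="v1")
mouse = ctx(species="mouse", assembly="v1")

facts = [
    Statement("A", "upstream_of", "B", human),
    Statement("B", "upstream_of", "C", human),
    Statement("C", "upstream_of", "D", mouse),  # different context
]

# Only statements that were derived, not asserted
inferred = transitive_closure(facts, "upstream_of") - set(facts)
for s in sorted(inferred):
    print(s.subject, s.predicate, s.obj, dict(s.context))
```

Here A is inferred to be upstream of C because both source statements were made in the human context, while nothing is inferred about D because the C-to-D statement comes from a different species. Requiring an exact context match is deliberately conservative; a real system might want to allow compatible rather than identical contexts.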

The industry currently has no data exchange standards for its information supply chain, nor are any likely to come soon. Trust issues and funding sources ensure that data is simply not shared. This could result in semantic techniques being applied solely within companies in the short and medium terms. I would love to see some of the pharmas get together to define common ontologies, though, even if the instance data is purely internal.

Overall, the industry is drowning in data and unable to get its hands around it. Ted Slator (Pfizer) says, "Our domain is too big to fit in our heads", and yet data integration is generally being attempted that way. It is no wonder that this workshop attracted such a large attendance.
