Hatching the next generation Web in the incubator of e-Science

Or

Let’s stop doing stuff in the abstract; we nearly always get it wrong anyhow

Carole Goble

School of Computer Science,

The University of Manchester,

Manchester, M13 9PL UK

carole@cs.man.ac.uk

http://www.cs.man.ac.uk/~carole

 

The Web has many faces, many futures and many kinds of user. Although I suspect that much of our discussion will be about the Semantic Web, this is only one aspect of the future. There is the “it’s all infrastructure for applications” case (Dieter, Dave), the “it’s all AI” argument (Ora), the “it’s not A.I., it’s naming and statistics” case (Henry), etc. Jim is right when he says, enough already! Because it is all of these. It’s horses for courses. It’s the tool for the job. It’s a mixed ecology not a monoculture, and all the more healthy for it.

It is so tempting to look at part of the Next Generation Web (NGW)  - the Semantic Web, the Web or whatever - as a set of horizontal technologies. We build walls around our bit of territory and put a flag in it and then throw stones at each other.

Meanwhile, how the NGW pans out is as much down to how its many users will evolve it. Did we get the vision for the Web right? What makes the Web tick? “Content will be provided by the major media content providers” – nope, 60% is by The People. As hoped by the hypertext community of old, consumers have become providers – prosumers, as Wired puts it. “No one will buy anything on the Web.” Hmmm. “The Web will only be for boys” – actually its biggest demographics are the middle-aged and women (eBay anyone?).

Amazing things about the Web.

  1. The link. The links between people and the links between their intelligence. A personal statement contributes to a global zeitgeist. The finding and building of communities. Social networking. Social bookmarking.
  2. The content. The scale of the content. It doesn’t have to be perfect or good; there just has to be enough of it. When your content providers are millions all over the world, when your auctions are in the billions, then you can do stuff on a global scale, like clustering, statistics and machine learning.
  3. Grassroots spontaneity. Blogging was not part of the corporate or government model for the Web. Sadly, neither were spam, phishing or viruses.
  4. The personal becomes global (Global yet Personal).
  5. The infrastructure that makes all this seem straightforward when it is not. How many bloggers know about HTTP? Hopefully none. Do you know how the caching works? Hopefully not. It’s magic and therefore “simple”.

The most amazing thing is Collective Intelligence. The Semantic Web is not A.I.; it’s just I. Or C.I. The Web is already a repository and a mechanism for collective intelligence; the Semantic Web is a means of converting it into something automatically processable in some way, and that some way may be as simple as making a link. Not all the Web will be Semantic and the bits that become so will not always remain so. The most recent Gartner report talked about “corporate semantic webs”, not THE Semantic Web.

I think the way to evolve towards the NGW is to think vertically. Henry pointed out that most of the success of the current Semantic Web activity is in specific domain applications. He is right. Moreover, the real gains from the Semantic Web language and reasoning work have been in knowledge applications where there isn’t much web. There is a kind of partitioning – lots of semantics, not much web (the Semantic Web); lots of Web, not much semantics (folksonomies). But this doesn’t matter, because in a vertical domain we need both. And privacy, security, identity, etc. The one big thing I have learnt from working in e-Science is that if you don’t have a concrete application you are in trouble.

Pick an incubating, inspiring, supportive vertical domain and build just for it. Forget e-Commerce. Why is information sharing part of the business model? Choose e-Science. It’s happening spontaneously anyhow. Secondly, figure out how the five points listed above can be replicated to some degree in the Semantic Web.

e-Science is science performed through distributed global collaborations between scientists and their resources, enabled by electronic means, in order to solve scientific problems. No one scientific laboratory has the resources or tools, the raw data or derived understanding, or the expertise to harness the knowledge available to a scientific community. Real progress depends on pooling know-how and results. It depends on collaboration and making connections between ideas, people, and data. It depends on finding and interpreting results and knowledge generated by scientific colleagues you do not know and who do not know you, to be analysed in ways they did not anticipate, to generate new hypotheses to be pooled in their turn. The importance of e-Science has been highlighted in the UK, for example, by an investment of over £240 million over the past five years to specifically address the research and development issues that have to be tackled to develop a sustainable and effective e-Science e-Infrastructure.

Scientific communities are ideal incubators for the Next Generation Web. They are knowledge driven, fragmented, and have valuable knowledge assets whose contents need to be combined and used by many applications. The content is diverse, being structured (databases, electronic lab books), semi-structured (papers, spreadsheets) and unstructured (presentations, Web blogs, images). The scale necessitates that the processing be done at least partially automatically. There are many suppliers and consumers of knowledge and a loose coupling between suppliers and consumers – information is used in unanticipated ways by knowledge workers unknown to those who deposited it. People naturally form communities of practice, and there is a culture of sharing and knowledge curation. For a Semantic Web to flourish, the communities it would serve need to be willing to create and maintain the semantic content. Most scientific communities embrace ontologies. The Life Science world, for example, has the desire for collaboration, a culture of annotation, and service providers that might be persuaded to generate RDF or at least annotated XML. A semantic web is expensive to set up and maintain, and thus is only likely to work for communities where the added value is worthwhile and an “open source data” philosophy prevails.

The inferencing capabilities of OWL have been shown to aid the building of large and sophisticated ontologies such as The Gene Ontology (http://www.geneontology.org) and BioPAX (http://www.biopax.org/). The self-describing nature of RDF and OWL models enables flexible descriptions for data collections, suiting those whose schemas may evolve and change, or whose data types are hard to fix, like knowledge bases of scientific hypotheses, provenance records of in silico experiments or publication collections [3]. These are examples where the semantic technologies have been adopted by scientific applications. The Life Science community is tackling the naming issue highlighted by Henry through the Life Science Identifier standard. Genuine “Semantic WEB” examples, with the emphasis on Web, are also starting to appear. SciFOAF builds a FOAF community mined from the analysis of authors and publications over PubMed (http://www.urbigene.com/foaf/). Scientific publishers like the Institute of Physics (http://syndication.iop.org/) publish RSS feeds in RDF using standard RSS, Dublin Core and PRISM RDF vocabularies. The Uniprot protein sequence database has an experimental publication of results in RDF (http://www.isb-sib.ch/~ejain/rdf/). YeastHub [4] converts the outputs of a variety of databases into RDF and combines them in a warehouse built over a native RDF data store. BioDASH (http://www.w3.org/2005/04/swls/BioDash/Demo/) is an experimental Drug Development Dashboard that uses RDF and OWL to associate disease, compounds, drug progression stages, molecular biology, and pathway knowledge for a team of users. Correspondences are not necessarily obvious to detect, requiring specific rules. Semantic technologies are being used to assist in the configuration and operation of e-Science middleware such as the Grid. Building workflows is what e-Scientists do for a living. These examples should be an inspiration to the Semantic Web community.

By concentrating on one community and doing it in detail we get down to solving real problems, not invented ones. There are some problems with the expressivity of OWL for Life Science, Chemical and Clinical ontologies. The mechanisms for trust, security, and context are important for intellectual property, provenance tracing, accountability and security, as well as untangling contradictions or weighting support for an assertion; yet these are immature or missing. Performance over medium-large RDF datasets is disappointing – the CombeChem combinatorial chemistry project generated 80 million triples trivially and broke most of the triple stores it tried (http://www.combechem.org). There is poor support for grouping RDF statements, yet this is fundamental. Semantic Web purists claim that the Life Science Identifier [6], for example, is unnecessary, although these critics seem not to have actually developed any applications for life scientists. Working with a real community, technologists would be less likely to misplace the emphasis on what is important and what is not, a mistake that leads to a communication failure between those for whom the Semantic Web is a means to an end and those for whom it is the end [7].

The Web was developed to serve a highly motivated community with an application and a generous spirit – High Energy Physics. The Semantic Web would also benefit from the nursery of e-Science.

References

[1] James Hendler, Science and the Semantic Web, Science 299: 520-521, 2003

[2] Eric Neumann A Life Science Semantic Web: Are We There Yet? Sci. STKE, Vol. 2005, Issue 283, 10 May 2005

[3] Jun Zhao, Chris Wroe, Carole Goble, Robert Stevens, Dennis Quan, Mark Greenwood, Using Semantic Web Technologies for Representing e-Science Provenance in Proc 3rd International Semantic Web Conference ISWC2004, Hiroshima, Japan, 9-11 Nov 2004, Springer LNCS 3298

[4] Cheung K.H., Yip K.Y., Smith A., deKnikker R., Masiar A., Gerstein M. YeastHub: a semantic web use case for integrating data in the life sciences domain (2005) Bioinformatics 21 Suppl 1: i85-i96.

[5] Clark T., Martin S., Liefeld T. Globally Distributed Object Identification for Biological Knowledgebases Briefings in Bioinformatics 5.1:59-70, March 1, 2004.

[6] Goble CA, De Roure D, Shadbolt NR and Fernandes AAA Enhancing Services and Applications with Knowledge and Semantics in The Grid: Blueprint for a New Computing Infrastructure Second Edition (eds. I Foster and C Kesselman), Morgan Kaufmann 2003

[7] Phillip Lord, Sean Bechhofer, Mark Wilkinson, Gary Schiltz, Damian Gessler, Carole Goble, Lincoln Stein, Duncan Hull. Applying semantic web services to bioinformatics: Experiences gained, lessons learnt. in Proc 3rd International Semantic Web Conference ISWC2004, Hiroshima, Japan, 9-11 Nov 2004, Springer LNCS 3298