Using Semantic Concepts in the myGrid project

Carole Goble, Chris Wroe, Phil Lord, Jun Zhao and Robert Stevens
University of Manchester, UK
carole@cs.man.ac.uk

Abstract

myGrid is an e-Science pilot research project developing open source high-level middleware to support personalised in silico experiments in biology. In silico experiments use databases and computational analysis rather than laboratory investigations to test hypotheses. The project relies heavily on metadata that associates ontological concepts with its various middleware services in order to operate, interoperate and reason over them intelligently. Thus myGrid can be thought of as an early Semantic Grid project. In this paper we present myGrid, its semantic services and the various ways in which concepts drawn from ontologies are used, or are proposed to be used.

1. myGrid Motivation

myGrid aims to develop open source high-level service-based middleware to support in silico experiments in biology. In silico experiments are procedures that use computer-based information repositories and computational analysis to test hypotheses or to demonstrate known facts. In our case the emphasis is on data-intensive experiments that combine the use of applications and database queries. The user is helped to create workflows (a.k.a. experiments), to share and discover others' workflows, and to interact with workflows as they run.

Rather than thinking in terms of data grids or computational grids we think in terms of Service Grids, where the primary services support routine in silico experiments. The intention is that the project's middleware services are a toolkit to be adopted and used in a `pick and mix' way by bioinformaticians, tool builders and service providers who in turn produce the end applications for biologists. The target environment is open, by which we mean that services and their users are decoupled. Services are used not only by their publishers but also by users unknown to the service provider, who may use them in unexpected ways.

myGrid focuses on speculative explorations by a scientist to form discovery experiments. These evolve with the scientist's thinking, and are composed incrementally as the scientist designs and prototypes the experiment. Intermediate versions and intermediate data are kept, notes and thoughts are recorded, and parts of the experiment and other experiments are linked together to form a network of evidence, as we see in bench laboratory books. We aim to collect, share and reuse:

  1. Experimental design components: workflow specifications; query specifications; notes describing objectives; applications; databases; relevant papers; the web pages of important workers, and so on.
  2. Experimental instances that are records of enacted experiments: data results; a history of services invoked by a workflow engine; instances of services invoked; parameters set for an application; notes commenting on the results and so on.
  3. Experimental glue that groups and links design and instance components: a query and its results; a workflow linked with its outcome; links between a workflow and its previous and subsequent versions; a group of all these things linked to a document discussing the conclusions of the biologist and so on.

Discovery experiments by their nature presume that the e-biologist is actively interacting with and steering the experimentation process, as well as interacting with colleagues (in the simplest case by email). [13] gives a detailed motivation for the project.

myGrid has developed, together with its biologist stakeholders, a detailed set of scenarios for the examination of the genetics of Graves' disease, an immune disorder causing hyperthyroidism [9]. This case study is our test bed application, though we are also applying our services to investigations into Williams Syndrome and African sleeping sickness in cattle.

We have built an electronic laboratory workbench demonstrator application in NetBeans as a vehicle to experiment with our services: their functionality, their deployment and their interactions, and to crystallise our architecture [12]; [26] shows the workbench in action. In addition, Talisman is a third party application that is prototyping the use of our workflow components [10].

The myGrid middleware is prototyped first with Web Services [4], with an anticipated migration path to the Open Grid Services Architecture (OGSA) [15]. Figure 1 shows the layered middleware stack of services. The primary services to support routine in silico experiments fall into four categories:

  1. services that are the tools that will constitute the experiments, that is, external third party services such as databases, computational analyses, simulations etc., wrapped as web services.
  2. services for forming and executing experiments, that is: workflow management services [3], information management services, distributed database query processing [5]. myGrid regards in silico experiments as distributed queries and workflows. Data and parameters are taken as input to an analysis or database service; then output from these is taken, perhaps after interaction with the user, as input to further tools or database queries.
  3. services for supporting the e-Science scientific method and best practice found at the bench but often neglected at the workstation, specifically: provenance management [6] and change notification [8].
  4. semantic services for discovering services and workflows, and managing metadata, such as: third party service registries and federated personalised views over those registries [18,7], and ontologies and ontology management [11].

The final layer (e) constitutes the applications and application services that use some or all of the services described above.

fig1
Figure 1: The myGrid services and middleware stack

Section 2 discusses how concepts are served into the myGrid architecture: the services ontology, the OWL ontology server, the RDF repositories, the reasoning engines and the instance store. Section 3 describes how semantics are used to describe and discover workflows, services and data; to control and validate workflow compositions and service substitutions; to annotate workflow execution provenance logs and information repository entries; and to link these using reasoning over the annotations. We show that concepts can be used as `glue' to link experiments and experimental components together. We briefly summarise in section 4.

2. Concept Services

myGrid uses a suite of ontologies to represent metadata. Ontologies provide a consensual vocabulary of terms or concepts used in descriptions associated with objects such as services, workflows or data; objects are said to be `annotated' with terms. Using free text to describe a service, for example, is expressive and flexible, but free text is difficult to search, to reason over, or to organize automatically into classifications. Controlled vocabularies that are consensually agreed by a community give a consistent way to bridge the communication gap between suppliers of items and (potential) consumers.

Ontologies organize the concepts or terms into a classification structure. Other relationships and axioms capture and constrain the properties of the concept. Our ontologies are expressed in DAML+OIL [23], and are subject to a migration to its successor suite of languages OWL [24], the W3C candidate recommendation for a web ontology language.

DAML+OIL (and OWL DL) provides: (a) a means of building concept classifications; (b) a vocabulary for expressing concept descriptions; and (c) a reasoning process that manages the coherency of the classifications and descriptions when they are created, and supports querying and matching when they are deployed. Both DAML+OIL and OWL build upon existing Web standards, such as XML and RDF, and are underpinned by an expressive Description Logic (DL). It is these formal semantics that enable machine interpretation and reasoning support; see [11] for more details. Our key concept services are:

Ontology services An Ontology Server provides a single point of reference for DAML+OIL/OWL concepts. Description logic reasoning over concept expressions is by means of the FaCT Reasoner. A Matchmaker matches query concept descriptions against those in the ontology to find concepts that are subsumed by (are more specialised than) or subsume (are more general than) the query concept. In DLs the query language and the description language are unified, such that you describe the object you want to find (be it concept or instance) and classify it; we then use this classification lattice to explore potential answers to this and more general or specific questions. An Instance Store uses this active concept classification as a sophisticated yet efficient index to a large set of instances held in a database or registry elsewhere. The instance store supports inexact queries, that is, instances are recovered that are not exactly the same as the concept query but are inferred to be included as members of that concept.
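The matchmaking and inexact retrieval described above can be illustrated with a minimal sketch. It assumes a small, pre-computed (`told') hierarchy rather than a description logic reasoner such as FaCT, and every concept and instance name in it is hypothetical; it is not the myGrid implementation.

```python
# Toy "told" hierarchy: concept -> set of direct parents (hypothetical names).
PARENTS = {
    "sequence": set(),
    "protein_sequence": {"sequence"},
    "nucleotide_sequence": {"sequence"},
    "swissprot_record": {"protein_sequence"},
}


def ancestors(concept):
    """Return every concept that strictly subsumes `concept`."""
    seen = set()
    stack = list(PARENTS[concept])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def subsumes(general, specific):
    """True if `general` is the same as, or more general than, `specific`."""
    return general == specific or general in ancestors(specific)


def match(query):
    """Return (concepts subsumed by the query, concepts subsuming the query)."""
    more_specific = {c for c in PARENTS if c != query and subsumes(query, c)}
    more_general = ancestors(query)
    return more_specific, more_general


# Instance store: instance identifier -> asserted concept (hypothetical data).
INSTANCES = {
    "entry-1": "swissprot_record",
    "entry-2": "nucleotide_sequence",
    "entry-3": "protein_sequence",
}


def retrieve(query):
    """Inexact retrieval: instances whose concept is subsumed by the query."""
    return sorted(i for i, c in INSTANCES.items() if subsumes(query, c))
```

Here `retrieve("protein_sequence")` also returns the instance asserted only as a `swissprot_record`, which is the sense in which the query is inexact. A real reasoner would compute the hierarchy from the concept descriptions instead of reading it from a table.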

Annotation components myGrid uses semantic web annotation tools such as COHSE to capture annotations. RDF is used as a graph-based model to link resources with OWL concepts, with other URI or LSID identified resources, and with XML Schema data types such as literals. Life Science Identifiers [19] are persistent, location-independent, resource identifiers for uniquely naming biologically significant resources including but not limited to individual genes or proteins, or data objects that encode information about them. We have adopted LSIDs as a unique naming scheme for data objects in external databases as well as objects in our own Information Repository.

The use of semantic web technology such as ontology services and semantic annotation tools makes myGrid an early example of a `Semantic Grid' [16].

3. Using Concepts in myGrid

Concepts drawn from the myGrid ontology are used in several ways.

3.1 Semantic discovery of services and workflows

Much of e-Science depends on pooling, discovering and reusing experimental design components, or even experimental instances. We make no distinction, from the scientist's point of view, between Web or Grid services such as databases, simulations or analysis tools, and the workflows that orchestrate them: they both take inputs and produce outputs, and they both have a function to perform as well as other metadata [20]. Our FreeFluo workflow enactment engine [3] supports two XML workflow languages: one based on IBM's Web Service Flow Language, and our own, XScufl, developed as part of the Taverna project in collaboration with the Human Genome Mapping Project [14]. Workflow templates represent the type or class of service that should be invoked at different stages without specifying a specific instance of the service. To use a workflow template, the abstracted service representations need to be instantiated by available services. Generating enactable workflows from abstracted workflow templates, through the use of service discovery and semantic reasoning, is described in detail in [20].

Services and workflows are published in a federated registry [ref]. We support the UDDI API; however, our registry has been underpinned with a flexible RDF storage component which enables it to support additional metadata. Semantic descriptions in RDF and DAML+OIL are attached to registry entries to allow more precise searching by both people and machines. The RDF descriptions cover metadata regarding the operational characteristics of the services or workflows (location, cost, quality of service, availability and so on) and are queried using RDQL. Queries such as `what services offering x currently give the best quality of service?' or `which service would the local bioinformatics expert suggest we use?' involve searching on properties provided by third parties (users, organizational administrators, domain experts, independent validating institutions, etc.) either as opinion, observable behaviour or previous usage. Such metadata might vary from user to user. Thus our registries support third party metadata and multiple semantic descriptions.
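As a rough illustration of querying third party metadata, the sketch below stores statements as plain tuples with a recorded source and matches them against patterns with wildcards. It stands in for an RDF store queried with RDQL and uses neither; the service names, property names and values are all hypothetical.

```python
# Each statement is (subject, predicate, object, source); all values are
# hypothetical and only stand in for registry metadata held as RDF.
STATEMENTS = [
    ("service:blastA", "offers", "BLAST", "publisher"),
    ("service:blastB", "offers", "BLAST", "publisher"),
    ("service:clustal", "offers", "alignment", "publisher"),
    ("service:blastA", "qualityOfService", 0.7, "monitor"),
    ("service:blastB", "qualityOfService", 0.9, "monitor"),
    ("service:blastA", "recommendedBy", "local-expert", "local-expert"),
]


def find(subject=None, predicate=None, obj=None, source=None):
    """Return statements matching the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj, source)
    return [
        s for s in STATEMENTS
        if all(p is None or p == v for p, v in zip(pattern, s))
    ]


def best_quality(function):
    """Which service offering `function` has the highest recorded quality?"""
    offering = {s[0] for s in find(predicate="offers", obj=function)}
    rated = [
        (s[2], s[0])
        for s in find(predicate="qualityOfService")
        if s[0] in offering
    ]
    return max(rated)[1] if rated else None


def recommended_by(function, who):
    """Which services offering `function` does `who` recommend?"""
    offering = {s[0] for s in find(predicate="offers", obj=function)}
    return sorted(
        s[0] for s in find(predicate="recommendedBy", obj=who)
        if s[0] in offering
    )
```

Keeping the source alongside each statement is one simple way to let different users see different opinions about the same registry entry.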

DAML+OIL descriptions for the inputs, outputs and function of the registry entries are based on the DAML-S service profile ontology [25], and are queried using the ontology services. Again, a service may attract multiple descriptions reflecting different interpretations of an entry, in particular services that are polymorphic depending on their parameters. For example, the seqret service reads and writes (returns) nucleotide sequences; however, depending on the inputs, outputs and configuration it can be a service for extracting sequences from databases; displaying sequences; reformatting a sequence; producing the reverse complement of a sequence; simple extraction of a region of a sequence; or a simple sequence beautification utility. Parameterised polymorphic services require multiple semantic descriptions.

Figure 2 shows an extract of such a service classification and an example of a DAML+OIL/OWL class description used to calculate the hierarchy (in abstract OWL syntax).

Class (BLASTpService complete WebService
  restriction (input someValuesFrom (Protein))
  restriction (usesSomeResource someValuesFrom
    (protein sequence database))
  restriction (isFunctionOf someValuesFrom (BLAST)))

Figure 2: Using a classification to discover classes of services to fulfil the task, and a service description concept.
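To make the role of the class description concrete, the following sketch checks a service description against the three restrictions of BLASTpService. It is a simplified, closed-world check over asserted values, not description logic classification; the WebService superclass is left out, and the filler hierarchy and the candidate descriptions are assumptions made for illustration.

```python
# Told hierarchy for restriction fillers: concept -> direct parents.
# The names are illustrative stand-ins for concepts in the ontology.
PARENTS = {
    "Protein": set(),
    "ProteinSequenceDatabase": set(),
    "SWISS-PROT": {"ProteinSequenceDatabase"},
    "BLAST": set(),
}


def is_a(concept, general):
    """True if `concept` equals `general` or has it as an ancestor."""
    if concept == general:
        return True
    return any(is_a(parent, general) for parent in PARENTS.get(concept, ()))


# The Figure 2 definition as (property, filler) pairs, each read as a
# someValuesFrom restriction. The WebService superclass is omitted here.
BLASTP_SERVICE = [
    ("input", "Protein"),
    ("usesSomeResource", "ProteinSequenceDatabase"),
    ("isFunctionOf", "BLAST"),
]


def satisfies(description, definition):
    """Every restriction needs at least one asserted value under its filler."""
    return all(
        any(is_a(value, filler) for value in description.get(prop, ()))
        for prop, filler in definition
    )


# Hypothetical service descriptions: property -> asserted values.
candidate = {
    "input": ["Protein"],
    "usesSomeResource": ["SWISS-PROT"],
    "isFunctionOf": ["BLAST"],
}
incomplete = {
    "input": ["Protein"],
    "isFunctionOf": ["BLAST"],
}
```

With these descriptions, `satisfies(candidate, BLASTP_SERVICE)` holds because SWISS-PROT falls under the protein sequence database concept, while `incomplete` fails for want of any resource. A DL reasoner works under open-world semantics and would treat missing information differently.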

The scientist interacts with, personalises and chooses services, workflows and data through a workbench. This workbench acts as a client to two components used for the discovery of workflows, and services that can instantiate parts of a workflow. Discovery components take advantage of the richer metadata within the registry view to enable more sophisticated semantic discovery.

Figure 3 shows a workflow discovery wizard making use of the semantic find component to find relevant services and workflows in the myGrid registry that operate on data of a specific semantic type (here an Affymetrix probe set identifier). A registry browser is also available in the workbench to allow the user to browse more freely for a workflow or service using a hierarchical categorisation based on each individual semantic description.

fig3
Figure 3: Workbench Workflow Wizard

3.2 Semantic workflow construction: guidance and validation

The construction of workflows by the bioinformatician is guided by constraining the choice to those services that have semantically compatible inputs and outputs. The semantic type of the output data from a workflow or service constrains the input data of the succeeding step. For example, if the first step produces a set of protein sequence records, the next stage must accept inputs that are classified as a set of protein sequence records, such as a SWISS-PROT record entry or a PIR database entry. The semantic find components are being incorporated into the Taverna workflow development environment. In fact, workflow resolution and harmonisation is more complex than this, as described in [20].
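A minimal sketch of this kind of guidance follows. It assumes one particular compatibility rule, namely that a step may follow another when the upstream output type is the same as, or more specific than, the type the downstream service accepts. The service names, their signatures and the small type hierarchy are hypothetical.

```python
# Told hierarchy of semantic types: type -> direct parents (illustrative).
PARENTS = {
    "protein_sequence_record": set(),
    "swissprot_entry": {"protein_sequence_record"},
    "pir_entry": {"protein_sequence_record"},
    "nucleotide_sequence_record": set(),
}


def is_a(concept, general):
    """True if `concept` equals `general` or has it as an ancestor."""
    if concept == general:
        return True
    return any(is_a(parent, general) for parent in PARENTS.get(concept, ()))


# Hypothetical services: name -> (accepted input type, produced output type).
SERVICES = {
    "fetch_swissprot": ("accession", "swissprot_entry"),
    "protein_blast": ("protein_sequence_record", "blast_report"),
    "translate": ("nucleotide_sequence_record", "protein_sequence_record"),
}


def compatible_successors(output_type):
    """Services whose accepted input type subsumes the given output type."""
    return sorted(
        name for name, (accepted, _) in SERVICES.items()
        if is_a(output_type, accepted)
    )


def validate(workflow):
    """Check each adjacent pair of steps in a linear workflow."""
    for upstream, downstream in zip(workflow, workflow[1:]):
        produced = SERVICES[upstream][1]
        accepted = SERVICES[downstream][0]
        if not is_a(produced, accepted):
            return False
    return True
```

An editor could call `compatible_successors` to narrow the choices offered for the next step, and `validate` to check a finished composition. Real workflows are not always linear, and the fuller treatment of resolution and harmonisation is the one in [20].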

3.3 Semantic discovery and gluing together of repository elements

The myGrid Information Repository (mIR) acts as a personalised store of all information relevant to a scientist performing an in silico experiment. It implements an information model tailored to e-Science. Experimental data is stored together with provenance records of its origin. The mIR has also been designed to store information about people and projects both directly linked to the investigation and from the wider scientific community to aid collaboration.

Metadata storage is a central feature of the mIR, with annotation possible for all internally stored objects in addition to objects stored in disparate remote repositories. Annotations are currently stored in an RDF-triple-like manner. We are considering the use of off-the-shelf RDF triple stores such as the Jena Semantic Web toolkit [22], which we already use for the registry. Several types of annotation are used. Free-text notes on the object's significance with respect to the investigation, the hypothesis of the experiment, thoughts and opinions of the scientist and the quality of results are stored as XML in the mIR or as regular web documents. All mIR entries may have DAML+OIL ontology annotations of what the object represents, and must have provenance attributes regarding who created it, when, in what context, and so on. These annotations answer questions such as `what recent workflows were run by Dr. Pearce using BLAST?', `what workflows have been recently run by members of my project?' and `what other workflows can operate over this kind of data?' (as in Figure 3).
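The first of these questions can be sketched as a filter over entries that carry provenance attributes. The record layout, the LSID authority `example.org`, the dates and the second creator's name are all invented for illustration; they do not reflect the mIR information model.

```python
from datetime import date

# Hypothetical mIR entries with mandatory provenance attributes (creator,
# creation date) and an optional set of concept annotations.
ENTRIES = [
    {
        "id": "urn:lsid:example.org:workflowrun:1",
        "kind": "workflow_run",
        "creator": "Dr. Pearce",
        "created": date(2003, 8, 20),
        "uses": {"BLAST"},
    },
    {
        "id": "urn:lsid:example.org:workflowrun:2",
        "kind": "workflow_run",
        "creator": "Dr. Pearce",
        "created": date(2003, 3, 2),
        "uses": {"BLAST"},
    },
    {
        "id": "urn:lsid:example.org:workflowrun:3",
        "kind": "workflow_run",
        "creator": "A. N. Other",
        "created": date(2003, 8, 25),
        "uses": {"seqret"},
    },
]


def workflows_run(creator=None, using=None, since=None):
    """Return ids of workflow runs matching every criterion that is given."""
    results = []
    for entry in ENTRIES:
        if entry["kind"] != "workflow_run":
            continue
        if creator is not None and entry["creator"] != creator:
            continue
        if using is not None and using not in entry["uses"]:
            continue
        if since is not None and entry["created"] < since:
            continue
        results.append(entry["id"])
    return results
```

`Recent' is expressed here as an explicit cut-off date passed in `since`, which is a choice made for the sketch.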

myGrid makes liberal use of ontologies to annotate, discover and manage its various components. Together, the LSIDs that identify mIR entries, the DAML+OIL concepts associated with mIR entries and the RDF graphs linking concepts and LSIDs form the experimental glue we talked about in section 1. We are trying to build a web of related pages relevant to an experimental investigation, marked up with, and linked together using, annotations drawn from shared ontologies (see Figure 4). This web includes not only the provenance record of a workflow run but also links to provenance records of other related or unrelated workflow runs, diagrams of the workflow specifications, web pages about people who ran the workflow or have related studies in the provenance, literature relevant to the study described in the provenance, notes on the experiment and so on. This is the idea behind a `web of science' as proposed in [21].

fig4
Figure 4: A Web of Experimental Holdings connected through shared concepts forming semantic glue

An organisation would typically have a single mIR, which would be shared by many users, each using it to store their own provenance, data and metadata. Different users can be provided with different views of the information it contains. These types of views can be built by exploiting the rich metadata associated with each object. Not only can views be constructed based on user but myGrid also aims to provide views based on criteria such as experiment, project and subject topic.

3.4 Semantically annotating and linking workflow provenance logs

When a workflow is executed, FreeFluo generates provenance logs in the form of XML files, recording the start time, end time and service instances invoked in the workflow. Data, metadata about the workflow and the provenance logs are stored in the mIR. By annotating provenance logs with concepts drawn from the myGrid ontology, and by reasoning over the ontology using the COHSE (Conceptual Open Hypermedia Services Environment) system, we can dynamically generate a hypertext of provenance documents, data, services and workflows based on their associated concepts. See [17] for details of how we annotate these logs.

The concepts lymphocyte and neutrophil are both subsumed by the concept white blood cell in the Domain Ontology. Figure 5 shows a provenance document (left) that includes an input to the service AffyMetrixMapper that is a ProbeSetId that has been annotated with the concept lymphocyte. When the scientist clicks on the annotation icon (a `C' icon) next to the link anchor, the links that are generated are to other documents that are also annotated as lymphocyte. The `More General Links' refer to other documents labelled with subsuming concepts, here white blood cell. On the right, a link anchor is generated for the subsuming concept (white blood cell). Links to documents annotated with more specific concepts (lymphocyte and neutrophil) are displayed as `More Specific Links' in the popup window.
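The link generation just described can be sketched as a lookup over annotated documents. The three concepts come from the example above; the document names, the single-parent hierarchy and the function names are assumptions, and the sketch does not use or reproduce the COHSE Linking Service.

```python
# Single-parent told hierarchy taken from the example in the text.
PARENT = {
    "white_blood_cell": None,
    "lymphocyte": "white_blood_cell",
    "neutrophil": "white_blood_cell",
}

# Document -> annotated concept; the document names are hypothetical.
ANNOTATIONS = {
    "provenance-1.xml": "lymphocyte",
    "provenance-2.xml": "lymphocyte",
    "provenance-3.xml": "neutrophil",
    "overview.html": "white_blood_cell",
}


def generalisations(concept):
    """Concepts that subsume `concept`, nearest first."""
    result = []
    current = PARENT[concept]
    while current is not None:
        result.append(current)
        current = PARENT[current]
    return result


def specialisations(concept):
    """Concepts that `concept` subsumes."""
    return [c for c in PARENT if concept in generalisations(c)]


def documents_for(concepts, exclude=None):
    """Documents annotated with any of `concepts`, minus the source document."""
    return sorted(
        doc for doc, concept in ANNOTATIONS.items()
        if concept in concepts and doc != exclude
    )


def generate_links(concept, source=None):
    """Group link targets for an anchor annotated with `concept`."""
    return {
        "same": documents_for([concept], exclude=source),
        "more_general": documents_for(generalisations(concept), exclude=source),
        "more_specific": documents_for(specialisations(concept), exclude=source),
    }
```

Calling `generate_links("lymphocyte", source="provenance-1.xml")` groups the other lymphocyte document under `same` and the white blood cell document under `more_general`, mirroring the two kinds of link in the popup window.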

fig5
Figure 5: Generated links between provenance documents

Figure 6 shows that other kinds of documents can be annotated and linked into the provenance logs with the help of the Generic Ontology. The web page of the Institute of Human Genetics at the University of Newcastle is linked to the provenance logs based on the common annotated concept Human Genetics. Links to other literature related to human genetics are also provided for the Human Genetics link anchor.

fig6
Figure 6: Generated links between provenance documents and other kinds of documents

These two figures also demonstrate different views of linking between documents due to different ontologies applied for conceptual linking. As introduced above, we used two ontologies in this project. By choosing one ontology for conceptual linking each time, different link anchors are recognized by the Linking Service in COHSE and different target links are provided for different concepts.

4. Summary

Semantic web technologies such as annotation, discovery and ontology services are still at an early stage of development. Current myGrid prototypes have been useful in crystallising requirements for semantics within e-Science and specifically how those semantics are integrated into myGrid components. The next phase of the project will aim to deliver more robust semantic components tailored to this environment, and to evaluate the utility of the approaches.

Acknowledgements

This work is supported by the UK e-Science programme EPSRC GR/R67743, & DARPA DAML subcontract PY-1149, Stanford University.

References

[1]
Proceedings UK OST e-Science 2nd All Hands Meeting 2003, Nottingham, UK 2-4 Sept, 2003
[2]
M Senger, P Rice, T Oinn, Soaplab - a unified Sesame door to analysis tools in [1]
[3]
M Addis, T Oinn, M Greenwood, J Ferris, D Marvin, P Li, A Wipat Experiences with eScience workflow specification and enactment in bioinformatics in [1]
[4]
S Parastatidis, P Watson The NEReSC Core Grid Middleware in [1]
[5]
MN Alpdemir, A Mukherjee, NW Paton, P Watson, AAA. Fernandes, A Gounaris, J Smith Service-Based Distributed Query Processing on the Grid in [1]
[6]
R Stevens, M Greenwood, CA Goble, Provenance of e-Science Experiments - experience from Bioinformatics in [1]
[7]
P Lord, C Wroe, R Stevens, CA Goble, S Miles, L Moreau, K Decker, T Payne, J Papay, Semantic and Personalised Service Discovery in Proceedings IEEE/WIC International Conference on Web Intelligence / Intelligent Agent Technology Workshop on "Knowledge Grid and Grid Intelligence" October 13, 2003, Halifax, Canada
[8]
L Moreau, X Liu, S Miles, A Krishna, V Tan, R Lawley myGrid Notification Service in [1]
[9]
R Stevens, K Glover, C Greenhalgh, C Jennings, P Li, M Radenkovic, A Wipat. Performing in silico Experiments on the Grid: A Users Perspective in [1]
[10]
T Oinn Talisman - Rapid Application Development for the Grid in Proceedings of Intelligent Systems in Molecular Biology, Brisbane Australia, July 2003
[11]
C Wroe, R Stevens, CA Goble, A Roberts, and M Greenwood. A suite of DAML+OIL ontologies to describe bioinformatics web services and data. In International Journal of Cooperative Information Systems, 12(2):197-224, March 2003.
[12]
R Stevens, A Robinson, and CA Goble myGrid: Personalised Bioinformatics on the Information Grid, in Proceedings of Intelligent Systems in Molecular Biology, Brisbane Australia, July 2003
[13]
CA Goble, S. Pettifer, R. Stevens and C. Greenhalgh Knowledge Integration: In silico Experiments in Bioinformatics in The Grid: Blueprint for a New Computing Infrastructure Second Edition (eds. I Foster and C Kesselman), 2003, Morgan Kaufman, in press
[14]
Taverna workflow environment for bioinformatics: http://sourceforge.net/projects/taverna
[15]
D Talia, "The Open Grid Services Architecture - Where the Grid Meets the Web", IEEE Internet Computing, vol. 6, no. 6, pp. 67-71, 2002.
[16]
CA Goble and D De Roure Semantic Grid: An Application of the Semantic Web in ACM SIGMOD Record, 31(4) December 2002
[17]
J Zhao, CA Goble, M Greenwood, C Wroe, R Stevens Annotating, linking and browsing provenance logs for e-Science in 2nd Intl Semantic Web Conference (ISWC2003) Workshop on Retrieval of Scientific Data, Florida, USA, October 2003
[18]
S Miles, J Papay, V Dialani, M Luck, K Decker, T Payne and Luc Moreau. Personalised Grid Service Discovery. Performance Engineering. 19th Annual UK Performance Engineering Workshop. pp131-140, 2003
[19]
The Life Science Identifier http://www.i3c.org
[20]
C Wroe, CA Goble, M Greenwood, P Lord, S Miles, L Moreau, J Papay, T Payne, Experiment automation using semantic data on a bioinformatics Grid submitted to IEEE Intelligent Systems
[21]
J Hendler Science and The Semantic Web, Science, Jan 24, 2003.
[22]
Jena Semantic Web Toolkit http://www.hpl.hp.com/semweb/jena.htm
[23]
Horrocks DAML+OIL: a reason-able web ontology language. In Proc. of EDBT 2002, March 2002.
[24]
OWL Web Ontology Language Overview. http://www.w3c.org/TR/owl-features/
[25]
The DAML Services Coalition (alphabetically Anupriya Ankolenkar, Mark Burstein, Jerry R. Hobbs, Ora Lassila, David L. Martin, Drew McDermott, Sheila A. McIlraith, Srini Narayanan, Massimo Paolucci, Terry R. Payne and Katia Sycara), "DAML-S: Web Service Description for the Semantic Web", The First International Semantic Web Conference (ISWC), Sardinia (Italy), June, 2002.
[26]
Goble CA, Wroe C, Stevens R, and the myGrid consortium The myGrid project: services, architecture and demonstrator in [1]