Research Agenda for the Semantic Grid De Roure, Jennings and Shadbolt December 2001


1. Introduction

1.1 Motivation

Scientific research and development has always involved large numbers of people, with different types and levels of expertise, working in a variety of roles, both separately and together, making use of and extending the body of knowledge. In recent years, however, there have been a number of important changes in the nature and the process of research. In particular, there is an increased emphasis on collaboration between large teams, an increased use of advanced information processing techniques, and an increased need to share results and observations between participants who are widely dispersed. When taken together, these trends mean that researchers are increasingly relying on computer and communication technologies as an intrinsic part of their everyday research activity. At present, the key communication technologies are predominantly email and the Web. Together these have shown a glimpse of what is possible; however to more fully support the e-Scientist the next generation of technology will need to be much richer, more flexible and much easier to use. Against this background, this report focuses on the requirements, the design and implementation issues, and the research challenges associated with developing a computing infrastructure to support e-Science.

 

The computing infrastructure for e-Science is commonly referred to as the Grid [Foster98] and this is, therefore, the term we will use here. This terminology is chosen to connote the idea of a ‘power grid’: namely that e-Scientists can plug into the e-Science computing infrastructure like plugging into a power grid. An important point to note however is that the term ‘grid’ is sometimes used synonymously with a networked, high performance computing infrastructure. While this aspect is certainly an important and exciting enabling technology for future e-Science, it is only a part of a much larger picture that also includes information handling and support for knowledge within the e-scientific process. It is this broader view of the e-Science infrastructure that we adopt in this document and we refer to this as the Semantic Grid. Our view is that as the Grid is to the Web, so the Semantic Grid is to the Semantic Web. Thus the Semantic Grid is characterised by an open system, with a high degree of automation, that supports flexible collaboration and computation on a global scale.

 

The grid metaphor intuitively gives rise to the view of the e-Science infrastructure as a set of services that are provided by particular individuals or institutions for consumption by others. Given this, and coupled with the fact that many research and standards activities are embracing a similar view [WebServices01], we adopt a service-oriented view of the Grid throughout this document (see section 2 for a more detailed justification of this choice). This view is based upon the notion of various entities providing services to one another under various forms of contract (or service level agreement).

 

Given the above view of the scope of e-Science, which includes information and knowledge, it has become popular to conceptualise the computing infrastructure as consisting of three conceptual layers[1]:

 

This layer deals with the way that computational resources are allocated, scheduled and executed and the way in which data is shipped between the various processing resources. It is characterised as being able to deal with large volumes of data, providing fast networks and presenting diverse resources as a single metacomputer (i.e. a single virtual computer). In terms of technology, this layer shares much with the body of research that has been undertaken into distributed computing systems. In the context of this document, we will discuss a variety of frameworks that are, or could be, deployed in grid computing at this level. The data/computation layer builds on the physical ‘grid fabric’, i.e. the underlying network and computer infrastructure, which may also interconnect scientific equipment. Here data is understood as uninterpreted bits and bytes.

 

This layer deals with the way that information is represented, stored, accessed, shared and maintained. Given its key role in many scientific endeavours, the World Wide Web (WWW) is the obvious point of departure for this level. Thus in the context of this document, we will consider the extent to which the Web meets the e-Scientists’ information requirements. In particular, we will pay attention to the current developments in Web research and standards and identify gaps where the needs of e-Scientists are not being met. Here information is understood as data equipped with meaning.

 

This layer is concerned with the way that knowledge is acquired, used, retrieved, published and maintained to assist e-Scientists to achieve their particular goals and objectives. In the context of this document, we review the state of the art in knowledge technologies for the Grid, and identify the major research issues that still need to be addressed. Here knowledge is understood as information applied to achieve a goal, solve a problem or enact a decision.

 

 

 

 

 

 

 

 

 

 

 


Figure 1.1: Three layered architecture viewed as services

 

We believe this structuring is compelling and we have, therefore, adopted it to structure this report.  We also intend that this approach will provide a reader who is familiar with one layer with an introduction to the others, since the layers also tend to reflect distinct research communities and there is a significant need to bridge communities. However, there are a number of observations and remarks that need to be made. Firstly, all grids that have or will be built have some element of all three layers in them. The degree to which the various layers are important and utilised in a given application will be domain dependent – thus in some cases, the processing of huge volumes of data will be the dominant concern, while in others the knowledge services that are available will be the overriding issue. Secondly, this layering is a conceptual view on the system that is useful in the analysis and design phases of development. However, the strict layering may not be carried forward to the implementation for reasons of efficiency. Thirdly, the service-oriented view applies at all the layers. Thus there are services, producers, consumers and contracts at the computational layer, at the information layer and at the knowledge layer (figure 1.1).

 

Fourthly, a (power) grid is useless without appliances to plug in. Confining the infrastructure discussion to remote services runs the risk of neglecting the interface, i.e. the computers, devices and apparatus with which the e-Scientist interacts. While virtual supercomputers certainly offer potential for scientific breakthrough, trends in computer supported cooperative work (CSCW), such as embedded devices and collaborative virtual environments, also have tremendous potential in facilitating the scientific process. Whereas the grid computing literature has picked up on visualisation [Foster99] and virtual reality (VR) [Leigh99], comparatively little attention has been paid to pervasive computing and augmented reality – what might be called the ‘smart laboratory’. This document addresses this omission.

 

With this context established, the next sub-section introduces a grid application scenario that we will use to motivate and explain the various tools and techniques that are available at the different grid levels.

 

1.2 Motivating Scenario

At this time, the precise set of requirements on the e-Science infrastructure are not clear since comparatively few applications have been envisaged in detail (let alone actually developed). While this is certainly going to change over the coming years, it means the best way of grounding the subsequent discussion on models, tools and techniques is in terms of a scenario. To this end, we will use the following scenario to motivate the discussion in this document.

 

This scenario is derived from talking with e-Scientists across several domains including physical sciences. It is not intended to be domain-specific (since this would be too narrow) and at the same time it cannot be completely generic (since this would not be detailed enough to serve as a basis for grounding our discussion). Thus it falls somewhere in between. Nor is the scenario science fiction – these practices exist today, but on a restricted scale and with a limited degree of automation. The scenario itself (figure 1.2) fits with the description of grid applications as “coordinated resource sharing and problem solving among dynamic collections of individuals” [Foster01].


Scenario

 

The sample arrives for analysis with an ID number. The technician logs it into the database and the information about the sample appears (it had been entered remotely when the sample was taken). The appropriate settings are confirmed and the sample is placed with the others going to the analyser (a piece of laboratory equipment). The analyser runs automatically and the output of the analysis is stored together with a record of the parameters and laboratory conditions at the time of analysis.

 

The analysis is automatically brought to the attention of the company scientist who routinely inspects analysis results such as these. The scientist reviews the results from their remote office and decides the sample needs further investigation. They request a booking to use the High Resolution Analyser and the system presents configurations for previous runs on similar samples; the scientist selects the appropriate parameters. Prior to the booking, the sample is taken to the analyser and the equipment recognizes the sample identification. The sample is placed in the equipment which configures appropriately, the door is locked and the experiment is monitored by the technician by live video then left to run overnight; the video is also recorded, along with live data from the equipment. The scientist is sent a URL to the results.

 

Later the scientist looks at the results and, intrigued, decides to replay the analyser run, navigating the video and associated data. They then press the “query” button and the system summarises previous related analyses reported internally and externally, and recommends other scientists who have published work in this area. The scientist finds that their results appear to be unique.

 

The scientist requests an agenda item at the next research videoconference and publishes the experimental data for access by their colleagues (only) in preparation for the meeting. The meeting decides to make the analysis available for the wider community to look at, so the scientist then logs the analysis and associated metadata into an international database and provides some covering information. Its provenance is recorded. The availability of the new data prompts other automatic processing and a number of databases are updated; some processing of this new data occurs.

 

Various scientists who had expressed interest in samples or analyses fitting this description are notified automatically. One of them decides to run a simulation to see if they can model the sample, using remote resources and visualizing the result locally. The simulation involves the use of a problem solving environment (PSE) within which to assemble a range of components to explore the issues and questions that arise for the scientist. The parameters and results of the simulations are made available via the public database.  Another scientist adds annotation to the published data.


Figure 1.2: Workflow in the scenario

 

This scenario draws out a number of underlying assumptions and raises a number of requirements that we believe are broadly applicable to a range of e-Science applications:

 

 

We believe that these requirements will be ubiquitous in e-Science applications conducted or undertaken in a grid context. The rest of this document explores the issues that arise in trying to satisfy these requirements.

1.3 Report Structure

The remainder of this document is organised in the following manner:

 

Section 2. Service-Oriented Architectures

Justification and presentation of the service-oriented view at the various levels of the infrastructure. This section feeds into sections 4 and 5 in particular.

 

Section 3. The Data/Computation Layer

Discussion of the distributed computing frameworks that are appropriate for grid computing, informed by existing grid computing activities.

 

Section 4. The Information Layer

Discussion of the Web, distributed information management research and collaborative information systems that can be exploited in e-Science applications.

 

Section 5. The Knowledge Layer

Description of the tools and techniques related to knowledge capture, modelling and publication that are pertinent to e-Science applications.

 

Section 6. Research agenda

Summary of the steps that need to be undertaken in order to realise the vision of the Semantic Grid as outlined in this report.



[1] The three layer grid vision is attributed to Keith G. Jeffery of CLRC, who introduced it in a paper for the UK Research Councils Strategic Review in 1999.