| Research Agenda for the Semantic Grid | De Roure, Jennings and Shadbolt | December 2001 |
In this section we focus firstly on the Web. The Web’s information handling capabilities are clearly an important component of the e-Science infrastructure, and the web infrastructure is itself of interest as an example of a distributed system that has achieved global deployment. The second aspect addressed in this section is support for collaboration, something which is key to e-Science. We show that the web infrastructure itself lacks support for synchronous collaboration between users, and we discuss technologies that do provide such support.
It is interesting to consider the rapid uptake of the Web and how this might inform the design of the e-Science infrastructure, which has similar aspirations in terms of scale and deployment. One principle is clearly simplicity – there was little new in HTTP and HTML, and this facilitated massive deployment. We should however be aware of a dramatic contrast between Web and Grid: despite the large scale of the Internet, the number of hosts involved in a typical web transaction is still small, significantly lower than that envisaged for many grid applications.
The information layer aspects build on the idea of a ‘collaboratory’, defined in a 1993 US NSF study [Cerf93] as a “centre without walls, in which the nation’s researchers can perform their research without regard to geographical location - interacting with colleagues, accessing instrumentation, sharing data and computational resource, and accessing information in digital libraries.” This view accommodates ‘information appliances’ in the laboratory setting, which might, for example, include electronic logbooks and other portable devices.
The next section discusses technologies for the information layer and this is followed in 4.3 by consideration of support for collaboration. Section 4.4 considers the information layer aspects of the scenario.
4.1.1 The Web for Information Distribution
The early web architecture involved HTTP servers and web browsers, the browser simply being a client that supports multiple Internet protocols (including HTTP) and the then-new document markup language HTML. In responding to an HTTP request, the server can deliver a file or, using the Common Gateway Interface (CGI), invoke a local program (typically a script) that obtains or generates the content to be returned. The document returned by the server to the client is MIME-encoded so that the client can recognise its type. Different content types are handled at the browser by native support, invocation of a helper application or, more recently, browser ‘plugins’. With forms, the client effectively sends a document rather than a simple request (in fact HTTP has always supported document uploads, but this is only now becoming used as authoring tools become better integrated).
Scalability is achieved through caching, whereby a copy of a document is stored at an intermediary between client and server so that future requests can be serviced without retrieving the entire document from the server. Familiar examples are the client-side caches maintained by web browsers, and ‘caching proxies’ shared by a group of users. Hierarchical caching schemes have also been explored. These involve a tree of parent and child caches in which each parent acts as a cache for the child caches below it. A cache consults all its neighbours if it does not have a document, and if none of them has it then it is obtained via a parent. Here there is a trade-off between performance and freshness of the content – there are fewer gains from accessing highly dynamic content, and some content is not cached at all. A server may specify an expiry time for a document; once this time has passed the document is considered out-of-date and the proxy server will check for a newer version. Documents that have changed recently are considered more likely to change again. Sometimes it is necessary to access a server directly, not via a cache, for reasons of access control (e.g. subscription to content).
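The lookup order described above (local copy, then neighbours, then parent, with expiry deciding freshness) can be sketched as follows. This is a minimal illustration under stated assumptions rather than a description of any real proxy: the `Cache` class, its method names and the `fetch_origin` callback are all hypothetical, and time is passed in explicitly so that the behaviour is easy to follow.

```python
from typing import Callable, Dict, List, Optional, Tuple

Entry = Tuple[str, float]  # (document body, absolute expiry time in seconds)


class Cache:
    """One node in a hypothetical cache hierarchy; not a real proxy implementation."""

    def __init__(self, name: str, parent: Optional["Cache"] = None,
                 fetch_origin: Optional[Callable[[str], Tuple[str, float]]] = None):
        self.name = name
        self.parent = parent
        self.fetch_origin = fetch_origin  # only the root cache talks to the origin server
        self.siblings: List["Cache"] = []
        self.store: Dict[str, Entry] = {}

    def fresh(self, url: str, now: float) -> Optional[Entry]:
        """Return the stored entry only if it has not yet expired."""
        entry = self.store.get(url)
        return entry if entry is not None and entry[1] > now else None

    def get(self, url: str, now: float) -> Tuple[str, float, str]:
        """Return (body, expiry, name of whoever supplied the document)."""
        entry = self.fresh(url, now)
        if entry is not None:
            return entry[0], entry[1], self.name
        for neighbour in self.siblings:  # ask neighbours before going upwards
            entry = neighbour.fresh(url, now)
            if entry is not None:
                self.store[url] = entry
                return entry[0], entry[1], neighbour.name
        if self.parent is not None:  # then obtain the document via the parent
            body, expiry, source = self.parent.get(url, now)
        else:  # the root retrieves it from the origin server
            body, lifetime = self.fetch_origin(url)
            expiry, source = now + lifetime, "origin"
        self.store[url] = (body, expiry)
        return body, expiry, source
```

In this sketch a child that misses locally is served by a sibling if the sibling holds an unexpired copy, and the origin server is contacted again only once every copy along the path has expired.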
The simplicity of the web architecture is constraining, and several techniques have emerged to overcome this. For example, forms and cookies are used to maintain state across multiple HTTP requests in order to have a concept of a session. This is achieved by the server returning state information to the client that is then sent back to the server in later requests. However, there are many practical advantages to the use of the strict client-server architecture and bi-directional synchronous communication over a single TCP/IP session. For example, once the client has connected to the server then the return path is immediately available; to establish the connection in the reverse direction may be prohibited by a firewall. There are performance penalties for establishing a new TCP/IP connection for every request, because it takes time for TCP/IP to establish a connection and for it to self-adjust to optimise performance to prevailing network conditions (‘slow start’). This has led to the introduction of ‘persistent connections’ in HTTP version 1.1 so that a connection can be reused over a close succession of requests. This is more efficient than using short connections in parallel.
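The way a cookie carries session state between otherwise independent requests can be illustrated with a short sketch. It uses Python's standard `http.cookies` module; the `handle_request` and `client_session` functions, and the `visits` cookie name, are assumptions made for the illustration, and no network traffic is involved.

```python
from http.cookies import SimpleCookie


def handle_request(headers):
    """A stateless handler: session state arrives in, and leaves via, the cookie."""
    cookie = SimpleCookie(headers.get("Cookie", ""))
    visits = int(cookie["visits"].value) + 1 if "visits" in cookie else 1
    reply = SimpleCookie()
    reply["visits"] = str(visits)
    # The server returns the updated state to the client in a Set-Cookie header.
    return {"Set-Cookie": reply["visits"].OutputString()}, f"visit {visits}"


def client_session(requests_to_make):
    """A client that sends back whatever state the server last returned."""
    headers, bodies = {}, []
    for _ in range(requests_to_make):
        response_headers, body = handle_request(headers)
        headers = {"Cookie": response_headers["Set-Cookie"].split(";")[0]}
        bodies.append(body)
    return bodies
```

The handler keeps nothing between calls; the notion of a session exists only because the client echoes the state back on each request.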
For security, HTTP can be transported through the ‘secure socket layer’ (SSL) which uses encryption – this is ‘HTTPS’, and the related X.509 standard deals with public key certificates. The evolution of SSL within IETF is called TLS (Transport Layer Security). Although HTTP is thought of as a reliable protocol, things can still go wrong and IBM have proposed reliable HTTP (HTTPR), a protocol for reliable messaging over HTTP.
In section 3.1.2 we discussed the Web as an infrastructure for distributed applications, where information is exchanged between programs rather than being presented for a human reader. Such information exchange is facilitated by the XML family of recommendations from W3C.
XML is designed to mark up documents and has no fixed tag vocabulary; the tags are defined for each application using a Document Type Definition (DTD) or an XML Schema. A well-formed XML document is a labelled tree. Note that the DTD or Schema addresses syntactic conventions and does not address semantics. XML Schema are themselves valid XML expressions. Many new ‘formats’ are expressed in XML, such as SMIL (the synchronised multimedia integration language).
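To make the ‘labelled tree’ view concrete, the sketch below parses a small document with Python's standard `xml.etree.ElementTree` module and lists its element labels by depth. The tag vocabulary (`analysis`, `parameters`, `temperature`, `result`) is invented for the example, and the code checks well-formedness only; it does not validate against a DTD or Schema.

```python
import xml.etree.ElementTree as ET

# A hypothetical application vocabulary; the tag names are illustrative only.
DOCUMENT = """
<analysis sample="s42">
  <parameters><temperature unit="K">293</temperature></parameters>
  <result>0.73</result>
</analysis>
""".strip()


def outline(element, depth=0):
    """List the tree's labels, indented by their depth in the tree."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


def is_well_formed(text):
    """True if the text parses as XML, i.e. it forms a properly nested tree."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False
```

Note that nothing here says what `temperature` means: as the text observes, the structure is syntactic and the semantics must be agreed elsewhere.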
RDF (Resource Description Framework) is a standard way of expressing metadata, specifically metadata about resources on the Web, though in fact it can be used to represent structured data in general. It is based on ‘triples’ where each triple expresses the fact that an object O has attribute A with value V, written A(O,V). An object can also be a value, enabling triples to be ‘chained’, and in fact any RDF statement can itself be an object or attribute – this is called reification and permits nesting. RDF Schema are to RDF what XML Schema are to XML: they permit definition of a vocabulary. Essentially RDF schema provide a basic type system for RDF such as Class, subClassOf and subPropertyOf. RDF Schema are themselves valid RDF expressions.
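The triple model, chaining and nesting can be sketched with plain Python tuples. This is a simplified illustration and not the RDF data model or any RDF library: the `Triple` type, the helper functions and all identifiers such as `doc1` and `catalogue7` are hypothetical, and a statement is nested directly as an object, whereas RDF proper reifies a statement as a resource with its own properties.

```python
from typing import Any, NamedTuple


class Triple(NamedTuple):
    """A(O, V): object O has attribute A with value V."""
    attribute: str
    obj: Any
    value: Any


claim = Triple("creator", "doc1", "person1")
store = {
    claim,
    Triple("name", "person1", "Alice"),        # person1 is a value above and an object here
    Triple("assertedBy", claim, "catalogue7"),  # a statement about a statement
}


def values(triples, attribute, obj):
    """All values V such that attribute(obj, V) is in the store."""
    return {t.value for t in triples if t.attribute == attribute and t.obj == obj}


def chain(triples, obj, *attributes):
    """Follow a sequence of attributes, using each value as the next object."""
    current = {obj}
    for attribute in attributes:
        current = {v for o in current for v in values(triples, attribute, o)}
    return current
```

For example, `chain(store, "doc1", "creator", "name")` follows two triples to reach the creator's name, while `values(store, "assertedBy", claim)` retrieves information about the nested statement.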
XML and RDF (with XML and RDF schema) enable the standard expression of content and metacontent. Additionally a set of tools has emerged to work with these formats, for example parsers, and there is increasing support by other tools. Together this provides the infrastructure for the information layer. Other representational formats include the Ontology Interchange Language (OIL) and the DARPA Agent Markup Language (DAML), which have been brought together to form DAML+OIL. These are discussed in Section 5. W3C has created a Web Ontology Working Group to focus on the development of a language to extend the semantic reach of current XML and RDF metadata efforts.
W3C ran a “Metadata Activity”, which addressed technologies including RDF, and this has been succeeded by the Semantic Web Activity. The activity statement [Semweb] describes the Semantic Web as follows:
“The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation. It is the idea of having data on the Web defined and linked in a way that it can be used for more effective discovery, automation, integration, and reuse across various applications. The Web can reach its full potential if it becomes a place where data can be shared and processed by automated tools as well as by people.”
This vision is familiar – it shares much with the e-Science vision. The Scientific American paper [BernersLee01] provides motivation, with a scenario that uses agents. In a nutshell, the Semantic Web is intended to do for knowledge representation what the Web did for hypertext. The Semantic Web activity now includes the Web Ontology Working Group, which will build upon the RDF core work. Section 5 focuses on Semantic Web and associated technologies.
Having focused on process-to-process information exchange, in this section we revisit the interface to the human. A key goal of the Semantic Grid is to provide the e-Scientist with the right information at the right time, i.e. personalisation and a degree of context-awareness. This requirement is amplified by the huge scale of information that will be generated by e-Science.
Content can be generated automatically or pre-existing content can be transformed dynamically according to circumstances. To this end, the sets of links (hyperstructure) support navigation around a body of documents and adapting these links is a powerful mechanism for personalisation which might not require dynamic content generation (rather it provides a different view on a set of pre-existing information resources, which can be a useful separation of concerns). The lifetime of dynamically generated content must be considered – for example, it may need to persist so that it can be annotated.
Although this is a well-established approach within hypermedia systems, the web linking technology is particularly problematic. The basic problem is that once a document is published, anybody on the planet can embed a link to it within their own documents. That link is then hardwired, and it is also vulnerable to decay as the original document might be renamed or moved. An alternative architecture (known as ‘open’ hypermedia [Davis99]) maintains the link information separately from the document. This facilitates dynamic generation of the hyperstructures and maintenance of link integrity. Web standards have evolved to accommodate this mode of working, supporting ‘out of line’ links.
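The separation of links from documents can be illustrated with a small sketch in which a ‘linkbase’ is applied to unmodified text at delivery time. The `apply_links` function, the phrase-to-URL mapping and the example URLs are assumptions made for the illustration; they do not reproduce the architecture of [Davis99] or any out-of-line linking standard.

```python
import re


def apply_links(text, linkbase):
    """Wrap each phrase found in the linkbase with an anchor; the text itself stores no links."""
    if not linkbase:
        return text
    # Longer phrases first, so that a phrase is not pre-empted by one of its prefixes.
    phrases = sorted(linkbase, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in phrases))
    return pattern.sub(
        lambda m: f'<a href="{linkbase[m.group(0)]}">{m.group(0)}</a>', text
    )


document = "See the spectrometer log."
linkbase = {"spectrometer": "http://example.org/instruments/spectrometer"}
```

If the destination moves, only the linkbase entry changes and every document using it is repaired at once; supplying a different linkbase for each user gives a personalised view of the same unmodified content.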
In contrast, current generation search engines are a good example of a service that is not adaptive – typically two different users with the same query will get the same result. Ideally the search query will be qualified by the user’s context, and the results ranked accordingly on return. The ultimate search engine, however, would be one that accepted an empty query and gave the right results, entirely based on information about the user and what they are doing. Metasearch takes advantage of several search engines and typically involves some form of query expansion and query routing, a process that can also be adapted dynamically.
To illustrate the creation of user models, consider the following two techniques. Firstly, the user might specify their interests, perhaps on a form or by providing a representative set of documents (perhaps from bookmarks). Secondly, the model might be obtained by incidental capture of information based on browsing habits. The latter can occur close to the server (e.g. from server logs), near the client (where detailed information, such as dwell time and other actions on the document can be recorded) or at an intermediate stage. The set of documents visited by a user is their trail, and based on trails it is possible to help users find other users who have travelled (or are travelling) the same route.
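As a simple illustration of using trails to find users who have travelled the same route, the sketch below compares the sets of documents visited. The overlap measure (shared documents divided by all documents visited by either user), the threshold and the function names are our own assumptions; the text does not prescribe a particular measure, and this one ignores the order of visits.

```python
def trail_overlap(trail_a, trail_b):
    """Fraction of documents shared between two trails (set overlap, order ignored)."""
    a, b = set(trail_a), set(trail_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def fellow_travellers(user, trails, threshold=0.5):
    """Other users whose trails overlap the given user's trail by at least the threshold."""
    return sorted(
        other
        for other, trail in trails.items()
        if other != user and trail_overlap(trails[user], trail) >= threshold
    )
```

The trails themselves could come from any of the capture points mentioned above, such as server logs or client-side recording.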
A popular mechanism to realise adaptive systems in the web context is the use of proxies, using them to perform other functions as the requests pass through and documents pass back [Barrett98]. There are many examples:
· Caching proxies, including hierarchical caching in which caches are arranged as parents and siblings.
· Proxies producing logs. This enables logging local to part of a network, rather than using server logs. Logs can be used for management purposes but also to capture user browsing behaviour as part of an adaptive information system.
· Proxies modifying document content. Simple modifications include rewriting URLs, introducing links, prepending or appending content. With a full parser, the proxy can do more sophisticated transformations (but may take more time to do it).
· Proxies modifying requests. The proxy can rewrite a request, or respond to a request by redirecting the client.
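Three of the proxy roles listed above (logging, modifying requests and modifying content) can be combined in a short sketch. The `make_proxy` function and its parameters are hypothetical, and `fetch` stands in for whatever actually retrieves a document; no real HTTP traffic is involved.

```python
def make_proxy(fetch, log, rewrites, footer=""):
    """Build a proxy around a fetch function (all names here are illustrative)."""

    def proxy(url):
        log.append(url)  # proxy producing logs: record the request as received
        for old_prefix, new_prefix in rewrites.items():
            if url.startswith(old_prefix):  # proxy modifying requests: rewrite the URL
                url = new_prefix + url[len(old_prefix):]
                break
        return fetch(url) + footer  # proxy modifying content: append to the document

    return proxy
```

Because the proxy has the same calling interface as the fetch function it wraps, proxies of this kind can be stacked, each adding one function as requests pass through and documents pass back.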
Another technique to extend functionality is to download code to the client, e.g. as Java or JavaScript. Such applications are constrained by the execution environment and typically only provide additional interactive functionality. Note that a downloaded Java application (applet) is able to initiate a connection to the server, enabling new bi-directional communication channels to be opened up. For example, this enables a server to send live data to the client without the client polling for it.
What we have described in this section is current web practice that is relevant to the information grid. Building on top of this, however, we envisage an “adaptive grid” being developed, and this will be further supported by the emerging technologies described in section 5.2.
The role of the browser is then to provide the human interface. Web browsers have become the standard user interface to many services and applications, where they provide a widely available and highly configurable interface. However in the fullness of time it is clear that conventional browsers will not be the only form of interface to web-based information systems; for example, with augmented reality, queries will be made via many forms of device, and responses will appear via others.
The Web was originally created for distribution of information in an e-Science context at CERN. An obvious question, then, is whether the information distribution architecture described in 4.1.1-3 meets grid requirements. A number of concerns arise:
· Version control. The popular publishing paradigm of the Web involves continually updating pages without version control. In itself the web infrastructure does not explicitly support versioning.
· Quality of service. Links are embedded, hardwired global references and they are fragile, rendered useless by changing the server, location, name or content of the destination document. Expectations of link consistency are low and e-Science may demand a higher quality of service.
· Provenance. There is no standard mechanism to provide legally significant evidence that a document has been published on the Web at a particular time [Probity][Haber91].
· Digital Rights Management. e-Science demands particular functionality with respect to management of the digital content, including for example copy protection and intellectual property management.
· Curation. Much of the web infrastructure focuses on the machinery for delivery of information rather than the creation and management of content. Grid infrastructure designers need to address metadata support from the outset (this issue is more fully justified in section 5).
To address some of these issues we can look to work in other communities. For example, the multimedia industry also demands support for digital rights management. MPEG-21 aims to define ‘a multimedia framework to enable transparent and augmented use of multimedia resources across a wide range of networks and devices used by different communities’ [MPEG21], addressing the multimedia content delivery chain. It is interesting to consider its seven elements in the context of grid computing:
1. Digital Item Declaration - schema for declaring digital items
2. Digital Item Identification and Description - framework for identification and description of any entity
3. Content Handling and Usage - interfaces and protocols for creation, manipulation, search, access, storage, delivery and content use and reuse
4. Intellectual Property Management and Protection
5. Terminals and Networks - transparent access to content across networks and terminals
6. Content Representation
7. Event Reporting – for users to understand the performance of all reportable events within the framework.
Authoring is another major concern, especially collaborative authoring. The Web-based Distributed Authoring and Versioning (WebDAV) activity [WebDAV] is chartered “to define the HTTP extensions necessary to enable distributed web authoring tools to be broadly interoperable, while supporting user needs”.
In summary, although the Web provides an effective layer for information transport, it does not provide a comprehensive information infrastructure for e-Science.
We can view the e-Science infrastructure as a number of interacting components, and the information that is conveyed in these interactions falls into a number of categories. One of those is the domain specific content that is being processed. Additional types include:
· Information about components and their functionalities within the domain
· Information about communication with the components
· Information about the overall workflow and individual flows within it
These must be tied down in a standard way to promote interoperability between components, with agreed common vocabularies. By way of example, Agent Communication Languages (ACLs) address exactly these issues. In particular the Foundation for Intelligent Physical Agents (FIPA) activity [FIPA], and more recently DAML-S, provide approaches to establishing a semantics for this information in an interoperable manner. FIPA produces software standards for heterogeneous and interacting agents and agent-based systems, including extensive specifications. In the FIPA abstract architecture:
· Agents communicate by exchanging messages which represent speech acts, and which are encoded in an agent-communication-language.
· Services provide support for agents, including directory-services and message-transport-services.
· Services may be implemented either as agents or as software that is accessed via method invocation, using programming interfaces (e.g. in Java, C++ or IDL).
Again we can identify agent-to-agent information exchange and directory entries as information formats which are required by the infrastructure ‘machinery’.
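The three kinds of information involved (messages carrying speech acts, directory entries, and message transport) can be sketched as follows. This is an illustration only and does not implement the FIPA specifications: the class names, fields and the `analyser` agent are hypothetical, and the message content is a plain string rather than an expression in an agreed content language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    performative: str  # the speech act, e.g. "request" or "inform"
    sender: str
    receiver: str
    content: str


class Directory:
    """A directory service: agents register the services they offer."""

    def __init__(self):
        self._entries = {}

    def register(self, agent_name, service):
        self._entries.setdefault(service, []).append(agent_name)

    def search(self, service):
        return list(self._entries.get(service, []))


class Transport:
    """A message transport service: delivers a message to the named receiver."""

    def __init__(self):
        self._handlers = {}

    def attach(self, agent_name, handler):
        self._handlers[agent_name] = handler

    def send(self, message):
        return self._handlers[message.receiver](message)


def analyser(message):
    """A hypothetical agent that answers requests and rejects anything else."""
    if message.performative == "request":
        return Message("inform", message.receiver, message.sender, f"done: {message.content}")
    return Message("not-understood", message.receiver, message.sender, message.content)
```

A requesting agent would first search the directory for a provider and then send that provider a `request` message through the transport; interoperability depends on both sides agreeing what each performative and each content vocabulary means.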
The two recent developments in the web community that have attracted most interest are Web Services and the Semantic Web. Web Services, outlined in this section, provide a potential realisation of certain aspects of the service-oriented architecture proposed in section 2. The Semantic Web, on the other hand, utilises the RDF and RDF Schema technologies and is discussed in section 5.
We have seen that the web architecture for document delivery is different to the technologies used for distributed applications, and that the web infrastructure is increasingly supportive of process-to-process information exchange. Web Services represent the convergence of these technologies. The Web Services proposals are industry led, and are in various stages of the W3C process:
· XML Protocol. Originally this was the Simple Object Access Protocol (SOAP). Essentially, XML Protocol [XMLP] enables applications to access remote applications; it is often characterised as an XML-based remote procedure call mechanism. XML Protocol can be used for exchanging any XML information. SOAP is a working draft from W3C, together with XML Protocol Abstract Model (the set of concepts that XML Protocol must adhere to).
· Web Services Description Language (WSDL). Describes the service (in a limited way) and how to use it, in some ways similar to an IDL. WSDL is available as a W3C note [WSDL].
· Universal Description Discovery and Integration (UDDI). This is a directory of registered services, similar to a ‘yellow pages’. UDDI supports ‘publish, find and bind’: a service provider publishes the service details to the directory; service requestors make requests to the registry to find the providers of a service; the services ‘bind’ using the technical details provided by UDDI. Effectively, UDDI integrates yellow and white page services because it holds both business and service entries; it additionally provides the ‘binding templates’ [UDDI].
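The ‘publish, find and bind’ pattern can be sketched with an in-memory registry. This does not use the UDDI API or its data structures: the `Registry` and `Entry` types are hypothetical, and a Python callable stands in for the technical details that a real binding would supply.

```python
from typing import Callable, List, NamedTuple


class Entry(NamedTuple):
    provider: str                  # who offers the service
    category: str                  # what kind of service it is
    binding: Callable[[str], str]  # stand-in for the technical details needed to invoke it


class Registry:
    """A hypothetical in-memory directory of registered services."""

    def __init__(self):
        self._entries: List[Entry] = []

    def publish(self, entry: Entry) -> None:
        self._entries.append(entry)

    def find(self, category: str) -> List[Entry]:
        return [e for e in self._entries if e.category == category]


def bind(entry: Entry) -> Callable[[str], str]:
    """Use the registered technical details to obtain something invocable."""
    return entry.binding
```

A requestor calls `find` with a category, chooses among the returned providers and then calls `bind` on the chosen entry; the registry itself is involved only in discovery and takes no part in the subsequent invocation.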
The next service attracting interest is at the process level. For example, Web Services Flow Language (WSFL) [WSFL] is an IBM proposal that defines workflows as combinations of Web Services and enables workflows themselves to appear as services; XLANG [XLANG] from Microsoft supports complex transactions that may involve multiple Web Services. It may also be useful to be able to decompose services, replacing elements to meet individual needs.
As an information dissemination mechanism the Web might have involved many users as ‘sinks’ of information published from major servers. However in practice, part of the web phenomenon has been widespread publishing by the users. This has had a powerful effect in creating online communities. However, the paradigm of interaction is essentially ‘publishing things at each other’, and is reinforced by email and newsgroups which also support asynchronous collaboration.
Despite this, however, the underlying internet infrastructure is entirely capable of supporting live (real-time) information services and synchronous collaboration. For example:
· Live data from experimental equipment
· Live video feeds (‘webcams’) via unicast or multicast (e.g. MBONE).
· Videoconferencing (e.g. H.323, coupled with T.120 to applications, SIP)
· Internet Relay Chat.
· MUDs
· Chat rooms
· Collaborative Virtual Environments
All of these have a role in supporting e-Science, directly supporting people, behind the scenes between processes in the infrastructure, or both. In particular they support the extension of e-Science to new communities that transcend current organisational and geographical boundaries.
Although the histories of these technologies predate the Web, they can interoperate with the Web and build on the web infrastructure technologies through adoption of appropriate standards. For example, messages can be expressed in XML and URLs are routinely exchanged. In particular the web’s metadata infrastructure has a role: data from experimental equipment can be expressed according to an ontology, enabling it to be processed by programs in the same way as static data such as library catalogues.
The application of computer systems to augment the capability of humans working in groups has a long history, with origins in the work of Doug Engelbart [Englebart62]. In this context, however, the emphasis is on facilitating distributed collaboration, and we wish to embrace the increasingly ‘smart’ workplaces of the e-Scientist including meeting rooms and laboratories. Amongst the considerable volume of work in the ‘smart space’ area we note in particular the Smart Rooms work by Pentland [Pentland96] and Coen’s work on the Intelligent Room [Coen98]. This research area falls under the “Advanced Collaborative Environments” Working group of the Global Grid Forum (ACE Grid), which addresses both collaboration environments and ubiquitous computing.
The Access Grid is a multicast videoconferencing infrastructure to support the collaboration of e-Scientists. In the UK there will be access grid nodes at the national and eight regional e-Science centres [AccessGrid]. Multicast videoconferencing is a familiar infrastructure in the UK in the form of the JANET multicast backbone (MBONE) which has been in service since 1991, first using tunnels and as a native service since 2000; international native multicast peerings have also been established.
The ISDN-based videoconferencing world (based on H.320) has evolved alongside this, and the shift now is to products supporting LAN-based videoconferencing (H.323). In this world, the T.120 protocol is used for multicast data transfer, such as remote camera control and application sharing. Meanwhile the IETF has developed Session Initiation Protocol (SIP), which is a signalling protocol for establishing real-time calls and conferences over internet networks. This resembles HTTP and uses Session Description Protocol (SDP) for media description.
During a meeting there is live exchange of information, and this brings the information layer aspects to the fore. For example, events in one space can be communicated to other spaces to facilitate the meeting. At the simplest level this might be slide transitions or remote camera control. These provide metadata which is generated automatically by software and devices, and can be used to enrich the conference and stored for later use. New forms of information may need to be exchanged to handle the large scale of meetings, such as distributed polling and voting.
Another source of live information is the notes taken by members of the meeting, or the annotations that they make on existing documents. Again these can be shared and stored to enrich the meeting. A feature of current collaboration technologies is that sub-discussions can be created easily and without intruding – these also provide enriched content.
In videoconferences, the live video and audio feeds provide presence for remote participants – especially in the typical access grid installation with three displays each with multiple views. It is also possible for remote participants to establish other forms of presence, such as the use of avatars in a collaborative virtual environment. For example, participants can share a 3D visualisation of the meeting spaces. This convergence of the digital and physical – where people are immersed in a virtual meeting space and/or remote participants are ‘ghosts’ in the physical meeting space – is the area of the Equator project, one of the Interdisciplinary Research Collaborations funded by the EPSRC in 2000 [EQUATOR].
The combination of Semantic Web technologies with live information flows is highly relevant to grid computing and is an emerging area of activity [Page01]. Metadata streams may be generated by people, by equipment or by programs – e.g. annotation, device settings, data processed in real-time. Live metadata in combination with multimedia streams (such as multicast video) raises quality of service (QoS) demands on the network and raises questions about whether the metadata should be embedded (in which respect, the multimedia metadata standards are relevant).
To realise the scenario, and the information services in table 2.2, the information layer of the Semantic Grid needs to deal with a variety of information types. These are identified in table 4.1, with comments on the content representation and metadata.
| Information | Representation | Description (metadata) |
| Sample ID | RDF | |
| Analysis results | Raw or XML | ID, timestamp, parameters |
| Analyser configurations used in previous runs | XML | parameters |
| Video | MPEG etc | ID, timestamp (RDF), events |
| A URL | | |
| Agenda | XML | Author etc |
| Videoconference | RTP etc | Participants, slide transitions |
| Published analysis results | XML | RDF catalogue data |
| Notifications of new results | XML or RDF | |
| Community publications | raw | RDF bibliographic data |
| Service descriptions | WSDL | |
Table 4.1: Information in the scenario
Each of these pieces of information requires a common understanding by those parties using it at the time and many require an understanding for retrospective use. There is an additional form of information, key to provenance and to automation: the description of the workflow. This could be described in a language such as WSFL. The discovery of services may involve a registry (such as UDDI) which does not appear explicitly in the scenario. There will also be security and provenance information (e.g. certificates, digests), as well as cost information for charging and other “housekeeping” information. Exception handling will also result in information flow and is an important area that may be too readily overlooked.
Although many of the technologies discussed in this section are available today (even if only in a limited form), a number of the topics still require further research. These include:
1. Issues relating to e-Science content types. Caching when new content is being produced. How will the web infrastructure respond to the different access patterns resulting from automated access to information sources? Issues in curation of e-Science content.
2. Digital rights management in the e-Science context (as compared with multimedia and e-commerce, for example).
3. Provenance. Is provenance stored to facilitate reuse of information, repeat of experiments, or to provide evidence that certain information existed at a certain time?
4. Creation and management of metadata, and provision of tools for metadata support.
5. Service descriptions, and tools for working with them. How best does one describe a service-based architecture?
6. Workflow description and enaction, and tools for working with descriptions.
7. Adaptation and personalisation. With the system ‘metadata-enabled’ throughout, how much knowledge can be acquired and how can it be used?
8. Collaboration infrastructure for the larger community, including interaction between scientists, with e-Science content and visualisations, and linking smart laboratories and other spaces.
9. Use of metadata in collaborative events, especially live metadata; establishing metadata schema to support collaboration in meetings and in laboratories.
10. Capture and presentation of information using new forms of device; e.g. for scientists working in the field.
11. Interplay between ‘always on’ devices in the e-Scientist’s environment and portable devices with local storage.
12. Representation of information about the underlying grid fabric, as required by applications; e.g. for resource scheduling and monitoring.