| Research Agenda for the Semantic Grid | De Roure, Jennings and Shadbolt | December 2001 |
The aim of the knowledge layer is to act as
an infrastructure to support the management and application of scientific
knowledge to achieve particular types of goal and objective. In order to
achieve this, it builds upon the services offered by the data-computation and
information layers.
The first thing to reiterate at this layer is the problem
of the sheer scale of content we are dealing with. We recognise that the amount
of data that the data grid is managing will be huge. By the time that data is
equipped with meaning and turned into information, we can expect order of
magnitude reductions in the amount. However, the amount of information remaining
will certainly be enough to present us with a problem – a problem recognised as
infosmog – the condition of having too much information to be able to
take effective action or apply it in an appropriate fashion to a specific
problem. Once information is delivered that is destined for a particular
purpose, we are in the realm of the knowledge grid, which is fundamentally
concerned with abstracted and annotated content and with the management of
scientific knowledge.
We can see this process of scientific knowledge management
in terms of a life cycle (figure 5.1) of knowledge-oriented activity that ranges
over knowledge acquisition and modelling, knowledge retrieval and reuse,
knowledge publishing and knowledge maintenance (section 5.1). In the rest of
this section we first review the knowledge life cycle in more detail. Next we
discuss the fundamental role that ontologies will play in providing semantics
for the knowledge layer (section 5.2). Section 5.3 then reviews the current
state of knowledge technologies – that is, tools and methods for managing the
sort of knowledge content that will be supported in the knowledge grid. Section 5.4 then considers how the knowledge
layers of the Grid would support our extended scenario. Finally, we review the
research issues that arise out of our requirements for a knowledge grid
(section 5.5).
Although we often suffer from a deluge of data and too much information, all too often what we have is still insufficient or too poorly specified to address our problems, goals and objectives. In short, we have insufficient knowledge. Knowledge acquisition sets the challenge of getting hold of the information that is around, and turning it into knowledge by making it usable. This might involve, for instance, making tacit knowledge explicit, identifying gaps in the knowledge already held, acquiring and integrating knowledge from multiple sources (e.g. different experts, or distributed sources on the web), or acquiring knowledge from unstructured media (e.g. natural language or diagrams).

Figure 5.1: The knowledge life cycle
Knowledge modelling bridges the gap between the acquisition of knowledge and its use. Knowledge models must be able both to act as straightforward placeholders for the acquired knowledge coming in, and to represent the knowledge so that it can be used for problem-solving.
Once knowledge has been acquired and modelled, it needs to be stored or hosted somewhere, and we then need to be able to retrieve it efficiently. There are two related problems to do with knowledge retrieval. First, there is the issue of finding knowledge again once it has been stored. Second, there is the problem of retrieving the subset of content that is relevant to a particular problem. This second problem is particularly challenging for a knowledge retrieval system when the knowledge alters regularly and quickly during problem-solving.
One of the most serious impediments to the cost-effective use of knowledge is that too often knowledge components have to be constructed afresh. There is little knowledge reuse. This arises partly because knowledge tends to require different representations depending on the problem-solving that it is intended to do. We need to understand how to find patterns in knowledge, to allow for its storage so that it can be reused when circumstances permit. This would save a good deal of effort in reacquiring and restructuring the knowledge that had already been used in a different context.
Having acquired knowledge,
modelled and stored it, the issue then arises as to how to get that knowledge
to the people who subsequently need it. The challenge of knowledge
publishing or disseminating can be described as getting the right
knowledge, in the right form, to the right person or system, at the right time
and is analogous to many of the issues raised in section 4.1.2 at the information
layer. Different users and systems will require knowledge to be presented and
visualised in different ways. The quality of such presentation is not merely a
matter of preference. It may radically affect the utility of the knowledge.
Getting presentation right will involve understanding the different
perspectives of people with different agendas and systems with different
requirements. An understanding of knowledge content will help to ensure that
important related pieces of knowledge get published at the appropriate time.
Finally, having acquired and modelled the knowledge, and having managed to retrieve and disseminate it appropriately, the last challenge is to keep the knowledge content current – knowledge maintenance. This may simply involve regularly updating content as it changes. But it may also involve a deeper analysis of the knowledge content. Some content has considerable longevity, while other knowledge dates very quickly. If knowledge is to remain active over a period of time, it is essential to know which parts of the knowledge base must be discarded and when. Other problems involved in maintenance include verifying and validating the content, and certifying its safety.
If the
knowledge intensive activities described above are to be delivered in a grid
context we will need to build upon and extend the main elements of the proposed
Semantic Web. It is to this we now turn.
While
the basic concepts and languages of the Semantic Web (as introduced in section
4) are generally appropriate for specifying and delivering services at the
information layer, it is generally agreed that they lack the expressive
power to be used as the basis for modelling and reasoning with knowledge[1].
To this end, the concept of an ontology is necessary. Generally
speaking, an ontology determines the extension of terms and the relationships
between them. However, in the context of knowledge and web engineering, an
ontology is simply a published, more or less agreed, conceptualization of an
area of content. The ontology may describe objects, processes, resources,
capabilities or whatever.
Recently a number of
languages have appeared that attempt to take concepts from the knowledge
representation languages of AI and extend the expressive capability of RDF and
RDF Schema. Examples include SHOE [Luke00], DAML [Hendler00],
and OIL [vanHarmelen00]. Most recently there has been an attempt to integrate the
best features of these languages in a hybrid called DAML+OIL. As well
as incorporating constructs to help model ontologies, DAML+OIL is being equipped
with a logical language to express rule-based generalizations. The W3C Web Ontology Working Group, part of
the Semantic Web Activity, is focusing on the development of a language to extend
the semantic reach of current XML and RDF metadata efforts.
However the development of the Semantic Web is not simply about producing machine-readable languages to facilitate the interchange and integration of heterogeneous information. It is also about the elaboration, enrichment and annotation of that content. To this end, the list below is indicative of how rich annotation can become. Moreover it is important to recognize that enrichment or meta-tagging can be applied at any conceptual level in the three tier grid. This yields the idea of meta-data, meta-information and meta-knowledge.
The benefits of ontologies include improving
communication between systems, whether machines, users or organizations. Ontologies
aim to establish an agreed and perhaps normative model. They endeavour to be
consistent and unambiguous, and to integrate a range of perspectives. Another
benefit that arises from adopting an ontology is inter-operability, and this is
why ontologies figure large in the vision for the Semantic Web. An ontology can act
as an interlingua; it can promote reuse of content, ensure a clear
specification of what content or a service is about, and increase the chance
that content and services can be successfully integrated.
A number of ontologies are emerging
as a consequence of commercial imperatives where vertical marketplaces need to
share common descriptions. Examples include the Common Business Library (CBL),
Commerce XML (cXML), ecl@ss, the Open Applications Group Integration Specification
(OAGIS), Open Catalog Format (OCF), the Open Financial Exchange (OFX), Real
Estate Transaction Markup Language (RETML), RosettaNet, UN/SPSC (see
www.diffuse.org), and UCEC. Moreover, there are a number of large-scale
ontology initiatives underway in specific scientific communities. One such is
in the area of genetics where a great deal of effort was invested in producing
common terminology and definitions to allow scientists to manage their
knowledge (www.geneontology.org/).
This effort provides a glimpse of how ontologies will play a critical role in
sustaining the e-Scientist.
This work can also be exploited to
facilitate the sharing, reuse, composition, mapping, and succinct
characterizations of (web) services. In this vein, [McIlraith01] exploit a web
service markup that provides an agent-independent declarative API that
is aimed at capturing the data and metadata associated with a service together
with specifications of its properties and capabilities, the interface for its
execution, and the prerequisites and consequences of its use. A key ingredient
of this work is that the markup of web content exploits ontologies. They have
used DAML for semantic markup of Web Services. This provides a means for
agents to populate their local knowledge bases so that they can reason about
Web Services to perform automatic web service discovery, execution, composition
and interoperation.
More generally speaking, we can observe a variety of e-Business infrastructure companies that are beginning to
announce platforms to support some level of Web Service automation. Examples of
such products include Hewlett-Packard’s e-speak, a description, registration,
and dynamic discovery platform for e-services; Microsoft’s .NET and BizTalk
tools; Oracle’s Dynamic Services Framework; IBM’s Application Framework for
e-Business; and Sun’s Open Network Environment. The company VerticalNet
Solutions is building ontologies and tools to organize and customize web service
discovery. Its OSM Platform promises an infrastructure that is able to
coordinate Web Services for public and private trading exchanges. These
developments are very germane not only for e-Business but also for e-Science.
It can be
seen that ontologies clearly provide a basis for the communication, integration
and sharing of content. But they can also offer other benefits. An ontology can
be used for improving search accuracy by removing ambiguities and spotting
related terms, or by associating the information retrieved from a page with
other information. Ontologies can act as the backbone for accessing information from
a community web portal [Staab00]. Internet reasoning systems are beginning to
emerge that exploit ontologies to extract and generate annotations from the
existing web [Decker99].
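The idea of improving search by spotting related terms can be illustrated with a minimal sketch. It assumes a tiny hand-written vocabulary (the `ONTOLOGY` table, its terms and both function names are hypothetical and purely illustrative) and expands a query with the synonyms and narrower terms recorded for it before matching whole words in each document.

```python
# Hypothetical mini-ontology: preferred term -> synonyms and narrower terms.
# The entries are illustrative only and are not taken from any real ontology.
ONTOLOGY = {
    "gene": {"synonyms": ["cistron"], "narrower": ["pseudogene"]},
    "protein": {"synonyms": ["polypeptide"], "narrower": ["enzyme"]},
}

def expand_query(term):
    """Return the query term plus any related terms the ontology records for it."""
    key = term.lower()
    entry = ONTOLOGY.get(key)
    if entry is None:
        return [key]
    return [key] + entry["synonyms"] + entry["narrower"]

def search(documents, term):
    """Return documents containing the term or a related term as a whole word."""
    terms = set(expand_query(term))
    return [doc for doc in documents if terms & set(doc.lower().split())]
```

With this sketch, a query for "protein" also retrieves a document that mentions only "enzyme", which a plain keyword match would miss.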
Given the developments reviewed in this section, a
general process that might drive the emergence of the knowledge grid would
comprise:
Given the processes outlined above, this section deals
with the state of those technologies that might contribute to the construction
and exploitation of annotated knowledge content, and to the general life cycle
of knowledge content.
We have reviewed some
of the language development work that is being undertaken to provide
capabilities for expressive ontology modeling and content enrichment. The W3C
has recently convened a Semantic Web activity (www.w3.org/2001/sw) that is looking into
the options available and a community portal (www.semanticWeb.org) is in existence to
give access to a range of resources and discussion forums. It is fair to say
that at the moment most of the effort is in building XML (www.w3.org/XML, www.xml.com)
and RDF (www.w3.org/RDF) resources that,
whilst limited, do give us ways and means to build ontologies and annotate
content.
Tools to build ontologies are thin on the ground, but
recently one of the best has become open source and is attracting a lot of
external development work. Protégé 2000 is a graphical software tool
developed at Stanford. Protégé is conceptually clear and supports both the
import and export of ontologies in RDF, RDFS, and XML. The currently available
versions of Protégé 2000 do not provide annotation tools to map information from
an ontology to related web content. Currently the process involves inserting
the XML or RDF annotations into the content manually. There is clearly an
urgent need for the semi-automatic annotation of content.
Besides ontology construction and annotation there are
many other services and technologies that we need in our knowledge grid:
The need for these services is leading to the
emergence of controlled vocabularies for describing or advertising capabilities
– one could almost say ontologies of service types and competencies. Some of
these have already been reviewed in section 4 and include the UDDI specification (www.uddi.org), ebXML (www.ebXML.org), and
e-speak (www.e-speak.hp.com).
A number of these services are going to require inference
engines capable of running on content distributed on the knowledge grid.
Inference has been an important component in the visions of the Semantic Web
that have been presented to date [BernersLee99,01]. According to these views,
agents and inference services will gather knowledge expressed in RDF from many
sources and process this to fulfil some task. However, most Semantic Web-enabled inference
engines are little more than proofs of concept. SiLRI
[Decker98], for example, is an F-Logic inference engine that has the advantage of
greater maturity over the majority of other solutions. In the US, work on the
Simple HTML Ontology Extensions (SHOE) has used, for example, Parka, a
high-performance frame system [Stoffel97], and XSB, a
deductive database [Sagonas94], to reason over annotations crawled out of
Semantic Web content. There are other inference engines that have been
engineered to operate on the content associated with or extracted from web
pages. Description Logics are particularly well suited to inferences associated
with ontological structures – for example inferences associated with
inheritance, establishing which classes particular instances belong to and
providing the means to reorganize ontologies to capture generalities whilst
maintaining maximum parsimony [Horrocks99] (see also www.cs.man.ac.uk/~horrocks/FaCT).
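Two of the ontological inferences mentioned above, inheritance and establishing which classes an instance belongs to, can be illustrated with a toy sketch. This is not a Description Logic reasoner and makes no claim about how systems such as FaCT work; the class names and the `SUBCLASS_OF` table are hypothetical, and only a single-parent hierarchy is handled.

```python
# Toy subsumption hierarchy (child -> parent); all class names are hypothetical.
SUBCLASS_OF = {
    "MassSpectrometer": "Analyser",
    "HighResolutionAnalyser": "Analyser",
    "Analyser": "Instrument",
}

def ancestors(cls):
    """Return every superclass of cls, nearest first, by following SUBCLASS_OF."""
    result = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        result.append(cls)
    return result

def classes_of(asserted_type):
    """All classes an instance belongs to: its asserted type plus inherited ones."""
    return [asserted_type] + ancestors(asserted_type)

def is_a(sub, sup):
    """True if sub is the same class as sup or inherits from it."""
    return sup in classes_of(sub)
```

An instance asserted to be a `MassSpectrometer` is thereby also classified as an `Analyser` and an `Instrument`, without those facts being stated explicitly.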
However, with all of these inference engines there are
likely to be problems of scale and consistency depending on the number and
quality of annotations crawled out of enriched content. How can we ensure that
any of our automated inference methods deliver results that we can trust?
Trust spans a range of issues but includes at least the following
important considerations:
A
different perspective on undertaking reasoning on the Grid or Web is to move
away from generic methods applicable to any annotated content and concentrate
instead on task specific reasoning. The emergence of problem-solving
environments (PSEs) takes this position. These systems allow the user to
exploit resources to solve particular problems without having to worry about
the complexities of grid fabric management. As Gannon and Grimshaw [Gannon99]
note “these systems …allow users to approach a problem in terms of the
application area semantics for which the PSE was designed”. Examples include
the composition of suites of algorithms to solve, for example, design
optimization tasks. Consider the design optimisation of a typical aero-engine
or wing (see figure 5.2). It is necessary (1) to specify the wing geometry in a
parametric form which specifies the permitted operations and constraints for
the optimisation process, (2) to generate a mesh for the problem (though this
may be provided by the analysis code), (3) decide which code to use for the
analysis, (4) decide the optimisation schedule, and finally (5) execute the
optimisation run coupled to the analysis code.
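The five steps above can be sketched as a small pipeline. This is a toy under stated assumptions, not the Geodise architecture: the single `span` parameter, the made-up objective in `analyse`, and the evenly spaced sweep in `grid_schedule` are all hypothetical stand-ins for a real parametric geometry, analysis code and optimisation schedule.

```python
from dataclasses import dataclass

@dataclass
class Parameter:
    """Step 1: a design variable with the range the optimiser may explore."""
    name: str
    low: float
    high: float

def make_mesh(span, n_cells=10):
    """Step 2: a stand-in 'mesh', here just evenly spaced points along the span."""
    return [span * i / n_cells for i in range(n_cells + 1)]

def analyse(span, mesh):
    """Step 3: a made-up objective standing in for an analysis code (lower is better)."""
    return (span - 30.0) ** 2 + 0.01 * len(mesh)

def grid_schedule(param, steps):
    """Step 4: a simple optimisation schedule, an even sweep of the permitted range."""
    if steps < 2:
        return [param.low]
    width = param.high - param.low
    return [param.low + width * i / (steps - 1) for i in range(steps)]

def optimise(param, steps=5):
    """Step 5: run the schedule coupled to the analysis and keep the best design."""
    best = None
    for span in grid_schedule(param, steps):
        score = analyse(span, make_mesh(span))
        if best is None or score < best[1]:
            best = (span, score)
    return best
```

Calling `optimise(Parameter("span", 20.0, 40.0))` sweeps five candidate spans and returns the best `(span, score)` pair. In a PSE each of these functions would be a separately wrapped service rather than a local function.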

Figure 5.2: A web service knowledge enabled PSE
[Geodise]
In the type of architecture outlined above,
each of the components is viewed as a web service. It is therefore necessary to
wrap each component using, for example, open W3C standards by providing an XML
schema and using XML Protocol to interact with it – in other words an ontology.
However, often the knowledge in a human designer’s mind as to how to
combine and set up a suite of tasks to suit a particular domain problem remains
implicit. One of the research issues confronting architectures such as that
outlined above is to start to make the designer’s procedural knowledge explicit
and encode it within PSEs.
A similar approach to specialist problem solving environments on the Web originates out of the knowledge engineering community. In the IBROW project (http://www.swi.psy.uva.nl/projects/ibrow/home.html) the aim is the construction of Internet reasoning services. These aim to provide a semi-automated facility that assists users in selecting problem-solving components from online libraries and configuring them into a running system adapted to their domain and task. The approach requires the formal description of the capabilities and requirements of problem-solving components. Initial experiments have demonstrated the approach for classification problem solving.
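The role of formal capability and requirement descriptions can be illustrated with a minimal sketch. It does not reproduce the IBROW description language; the `LIBRARY` entries, their field names and the `select_components` function are hypothetical. A component is selected when its declared capability matches the task and everything it requires is available in the user's domain.

```python
# Hypothetical library entries: what each component can do and what it needs.
LIBRARY = [
    {"name": "prune-classifier", "capability": "classification",
     "requires": {"taxonomy", "observations"}},
    {"name": "heuristic-classifier", "capability": "classification",
     "requires": {"rules", "observations"}},
    {"name": "propose-and-revise", "capability": "configuration",
     "requires": {"constraints", "components"}},
]

def select_components(task, available_knowledge):
    """Return names of components whose capability matches the task and whose
    requirements are all met by the knowledge the user's domain can supply."""
    return [c["name"] for c in LIBRARY
            if c["capability"] == task and c["requires"] <= available_knowledge]
```

For a classification task in a domain that offers a taxonomy and observations but no rules, only `prune-classifier` qualifies.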
Providing complete
reasoning services will remain a difficult challenge to meet. However, there
are a variety of technologies available and under research (www.aktors.org) to support the more general
process of knowledge acquisition, reuse, retrieval, publishing and maintenance.
Capturing knowledge and modelling it in computer systems has been the goal of knowledge-based systems (KBS) research for some 25 years [Hoffman95]. For example, commercial tools are available to facilitate and support the elicitation of knowledge from human experts [Milton99]. One such technique, the repertory grid, helps the human expert make tacit knowledge explicit.
KBS research has also produced methodologies to guide the developer through the process of specifying the knowledge models that need to be built. One such methodology, CommonKADS [Schrieber00], guides the process of building and documenting knowledge models. Since the development of knowledge intensive systems is a costly and lengthy process, it is important that we are able to re-use knowledge content. CommonKADS is based around the idea of building libraries of problem solving elements and domain descriptions that can be reused.
Knowledge publishing
and dissemination is supported through a range of document management systems
that provide more or less comprehensive publication services. Many are now
exploiting content mark up languages to facilitate the indexing, retrieval and
presentation of content. As yet few of them are powerfully exploiting the
customization or personalization of content. However, within the UK EPSRC
funded Advanced Knowledge Technologies (www.aktors.org)
or AKT project, there is a demonstration of how content might be personalized
and delivered in a way determined by a user’s interests expressed via an
ontology (http://eldora.open.ac.uk/my-planet/).
One of the
interesting developments in knowledge publishing is the emergence of effective
peer-to-peer publication archiving. The ePrints initiative (www.eprints.org) provides a range of
services to allow publications to be archived and advertised with extended
meta-data that will potentially allow a variety of knowledge services to be
developed, for example content-based retrieval methods. Workflow tools also
exist that help support organization procedures and routines, for example
tools to support the collection, annotation and cataloguing of genomic data.
However, they often use proprietary standards.
One of the key capabilities is support for
communication and collaborative work and this was discussed in section 4.2.1. A
range of web conferencing tools and virtual shared workspace applications is
increasingly facilitating dialogue and supporting communication between
individuals and groups. NetMeeting, using MCU technology to support multicasting
over JANET, is being trialled by UKERNA. CVW, developed by the MITRE Corporation,
presents a rich environment for virtual collaborative meeting spaces. Digital
whiteboards and other applications are also able to support brainstorming
activities between sites. More ambitious visions of collaboration can be found
in teleimmersive collaborative design proposals [Gannon99] exploiting CAVE
environments. This aspect of large-scale immersive technology and the mix of
real and virtual environments is at the heart of the research agenda for
EQUATOR IRC (http://www.equator.ac.uk).
As we noted in section 4.1.2, Access Grids are among the core components to be installed in the UK e-Science Regional Centres. Access Grids support large-scale distributed meetings, collaborative teamwork sessions, seminars, lectures, tutorials, and training. An Access Grid node consists of large-format multimedia display, presentation, and interaction software environments, together with interfaces to grid middleware and to remote visualization environments.
Work in the area of
the sociology of knowledge sharing and management indicates that useful
information is exchanged through social activities that involve physical
collocation such as the coffee bar or water cooler. Digital equivalents of
these are being used with some success.
The use of extranets
and intranets has certainly been one of the main success stories in applied
knowledge management. Corporate intranets allow best practice to be
disseminated, and enable rapid dissemination of content. Extranets enable the
coordination and collaboration of virtual teams of individuals.
A wide variety of
components exist to support knowledge working on the knowledge grid. There is
still much to do in order to exploit potential synergies between these
technologies. Moreover, there is a great deal of research needed to further
develop tools for each of the major phases of the knowledge life cycle. There
are many challenges to be overcome if we are to support the problem solving
activities required to enact a knowledge grid for e-Scientists (see section 5.5
for more details).
Let us now consider our scenario in terms of the opportunities it offers for knowledge technology services (see table 5.1). We will describe the knowledge layer aspects in terms of the agent-based service oriented analysis developed in section 2.3. Important components of section 2.3 were the software proxies for human agents such as the scientist agent (SA) and a technician agent (TA). These software agents will interact with their human counterparts to elicit preferences, priorities and objectives. The software proxies will then realise these elicited items on the Grid. This calls for knowledge acquisition services. A range of methods could be used. Structured interview methods invoke templates of expected and anticipated information. Scaling and sorting methods enable humans to rank their preferences according to relevant attributes that can either be explicitly elicited or pre-enumerated. The laddering method enables users to construct or select from ontologies. Knowledge capture methods need not be explicit – a range of pattern detection and induction methods exist that can construct, for example, preferences from past usage.
One of the most pervasive knowledge services in our
scenario is the partial or fully automated annotation of scientific data.
Before it can be used as knowledge, we need to equip the data with meaning.
Thus agents require capabilities that can take data streaming from instruments
and annotate it with meaning and context. Example annotations include the
experimental context of the data (where, when, what, why, which, how).
Annotation may include links to other previously gathered information or its
contribution and relevance to upcoming and planned work. Such knowledge
services will certainly be one of the main functions required by our Analyser
Agent (AA) and Analyser Database Agent (ADA). In the case of the High Resolution
Analyser Agent (HRAA) we have the additional requirement to enrich a range of
media types with annotations. In the original scenario this included video of
the actual experimental runs.
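As a minimal sketch of what equipping data with meaning might look like, the record below pairs a raw reading with the experimental context named above (where, when, what, why, which, how) and optional links to earlier work. The `Annotation` class, the `annotate` function and the field layout are assumptions made for illustration; a real service would express the context using terms from a shared ontology rather than free text.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """Experimental context for a reading; the field names follow the text above."""
    where: str
    when: str
    what: str
    why: str
    which: str
    how: str
    links: list = field(default_factory=list)  # identifiers of related information

def annotate(reading, context, links=()):
    """Pair a raw reading with its context, leaving the reading itself untouched."""
    return {"data": reading, "annotation": Annotation(links=list(links), **context)}
```

Keeping the annotation alongside the data rather than inside it means the same reading can later be re-annotated without altering the original measurement.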
These acquisition and annotation services along with
many others will be underpinned by ontology services that maintain agreed
vocabularies and conceptualizations of the scientific domain. These are the
names and relations that hold between the objects and processes of interest to
us. Ontology services will also manage the mapping between ontologies that will
be required by agents with differing interests and perspectives.
| Agent Requirements | Knowledge Technology Services |
| Scientist Agent (SA) | Knowledge Acquisition of Scientist Profile; Ontology Service |
| Technician Agent (TA) | Knowledge Acquisition of Technician Profile; Ontology Service; Knowledge Based Scheduling Service to book analyser |
| Analyser Agent (AA) | Annotation and enrichment of instrument streams; Ontology Service |
| Analyser Database Agent (ADA) | Annotation and enrichment of databases; Ontology Service |
| High Resolution Analyser Agent (HRAA) | Annotation and enrichment of media; Ontology Service; Language Generation Services; Internet Reasoning Services |
| Interest Notification Agent (INA) | Knowledge Publication Services; Language Generation Services; Knowledge Personalisation Services; Ontology Service |
| Experimental Results Agent (ERA) | Language Generation Services; Result Clustering and Taxonomy Formation; Knowledge and Data Mining Service; Ontology Service |
| Research Meeting Convener Agent (RMCA) | Constraint Based Scheduling Service; Knowledge Personalisation Service; Ontology Service |
| International Sample Database Agent (ISDA) | Result Clustering and Taxonomy Formation; Knowledge and Data Mining Services; Ontology Service |
| Paper Repository Agent (PRA) | Annotation and enrichment of papers; Ontology Service; Dynamic Link Service; Discussion and Argumentation Service |
| Problem Solving Environment Agent (PSEA) | Knowledge Based Configuration of PSE Components; Knowledge Based Parameter Setting and Input Selection; Ontology Service |
Table 5.1: Example knowledge technology services required by agents in the scenario
Personalisation services will also be invoked by a
number of our agents in the scenario. These might interact with the annotation
and ontology services already described so as to customize the generic
annotations with personal markup – the fact that certain types of data are of
special interest to a particular individual. Personal annotations might reflect
genuine differences of terminology or perspective – particular signal types
often have local vocabulary to describe them. Ensuring that certain types of
content are noted as being of particular interest to particular individuals
brings us on to services that notify and push content in the direction of
interested parties. The Interest Notification Agent (INA) and the Research
Meeting Convener Agent (RMCA) could both be involved in the publication of
content customized to either individual or group interests. Portal technology
can support the construction of dynamic content to assist the presentation of
experimental results.
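The notification behaviour described here can be sketched as a simple interest registry. This is an illustration only, not a design for the INA: the `InterestNotifier` class and the concept names used with it are hypothetical, and interests are matched by exact concept name rather than through an ontology.

```python
from collections import defaultdict

class InterestNotifier:
    """Minimal registry mapping annotated concepts to interested subscribers."""

    def __init__(self):
        self._interests = defaultdict(set)  # concept -> subscriber names

    def register(self, subscriber, concepts):
        """Record that a subscriber is interested in each of the given concepts."""
        for concept in concepts:
            self._interests[concept].add(subscriber)

    def publish(self, item_concepts):
        """Return, in sorted order, subscribers interested in any of the concepts."""
        notified = set()
        for concept in item_concepts:
            notified |= self._interests.get(concept, set())
        return sorted(notified)
```

A fuller version would consult the ontology service so that interest in a general concept also matches content annotated with its more specific sub-concepts.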
Agents such as the
High Resolution Analyser Agent (HRAA) and Experimental Results Agent (ERA) have
interests in classifying or grouping certain information and annotation types
together. Examples might include all signals collected in a particular context,
or sets of signals collected and sampled across contexts. This in turn provides a
basis for knowledge discovery and the mining of patterns in the content. Should
such patterns arise these might be further classified against existing pattern
types held in international databases – in our scenario this is managed in
market places by agents such as the International Sample Database Agent (ISDA).
At this point agents
are invoked whose job it is to locate other systems or agents that might have
an interest in the results. Negotiating the conditions under which the results
can be released and determining the quality of results might both be undertaken by
agents that are engaged to provide result brokering and result update services.
Raw results are
unlikely to be especially interesting in themselves, so the generation of natural
language summaries of results will be important for many of the agents in our
scenario. Results that are published this way will also need to be linked and
threaded to existing papers in the field and made available in ways that
discussion groups can usefully comment on. Link services are one sort of
knowledge technology that will be ubiquitous here – this is the dynamic linking
of content in documents in such a way that multiple markups and hyperlink annotations
can be simultaneously maintained. Issue tracking and design rationale methods
allow multiple discussion threads to be constructed and followed through
documents. In our scenario the Paper Repository Agent (PRA) will not only
retrieve relevant papers but mark them up and thread them in ways that reflect
the personal interests and conceptualizations (ontologies) of individuals or
research groups.
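The idea of a link service, in which links are held apart from the documents so that several sets of markup can coexist, can be sketched as follows. The `apply_links` function, the linkbase dictionaries and the target identifiers used with it are hypothetical; matching is by simple case-insensitive phrase lookup.

```python
def apply_links(text, *linkbases):
    """Return (phrase, target) pairs for every linkbase phrase found in the text.

    The document itself is never modified, so any number of linkbases
    (for example one per research group or individual) can be applied together.
    """
    lowered = text.lower()
    found = []
    for linkbase in linkbases:
        for phrase, target in linkbase.items():
            if phrase.lower() in lowered:
                found.append((phrase, target))
    return found
```

For example, a group linkbase and a personal linkbase can both be applied to the same paper, each contributing its own links without either being embedded in the text.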
The use of Problem
Solving Environment Agents (PSEAs) in our simulation of experimentally derived
results presents us with classic opportunities for knowledge intensive
configuration and processing. Once again these results may be released to
communities of varying size with their own interests and viewpoints.
Ultimately it will be
up to application designers to determine if the knowledge services described in
this scenario are invoked separately or else as part of the inherent
competences of the agents described in section 2.3. Whatever the design
decisions, it is clear that knowledge services will play a fundamental role in
realizing the potential of the Semantic Grid for the e-Scientist.
The following is by no means an exhaustive list of the
research issues that remain for exploiting knowledge services in the e-Science
Grid. They are, however, likely to be the key ones. There are small-scale
exemplars for most of these services. Consequently many of the issues relate to
the problems of scale and distribution.
[1] When building or modeling ontologies there are certain representational and semantic distinctions that it is important to be able to model. Examples would be the necessary and sufficient conditions for membership of a class and whether two classes are equivalent or disjoint. However, RDF and RDF Schema lack these capabilities. Nor do they have the ability to model general constraints.