INFORMATION TECHNOLOGY DEPARTMENT
Knowledge, Information and Data
Keith G Jeffery
Director IT, CLRC
November 1998: Tony Blair: Lord Mayor’s Banquet, Guildhall, London. “Advances in Knowledge are the driving force behind the industries of the future.”
21 July 1999: Stephen Byers: London Business School. “The successful economies of the future will excel at generating and disseminating knowledge and exploiting it commercially. Our objective must be a dynamic knowledge based economy founded on individual empowerment and opportunity.”
Tom Stewart: “Now more than 50% of the cost of extracting petroleum from the earth is information processing and information gathering…… knowledge is now the principal raw material”
The last CSR (and the emerging SR2000 papers) identify genomics and informatics as the major targets for the next decade. A component of the latter is addressed here. For the whole SET sector, across all Research Councils (and across all human activity), there is a need for the value-adding chain:
Data ==> Information ==> Knowledge which ends with insight.
The insight provides innovation and from this comes wealth creation and improvements in the quality of life (and an improvement in the stock of human understanding).
This value-adding chain is of immense importance in many ways of which the following provide a handful of examples:
(a) in handling observational data e.g. earth observation from space, social science surveys;
(b) in handling experimental data derived from instrumentation (sensors) e.g. particle physics experiments, national facilities such as ISIS and the new synchrotron;
(c) in providing access to heterogeneous distributed information sources worldwide (e.g. over the WWW), for example to provide advice and assistance in managing natural disasters;
(d) more specifically, in providing access to CRIS (Current Research Information Systems) with details of projects, experts, facilities, equipment, patents, prototypes, publications (both refereed as held in bibliographic systems and ‘grey literature’);
(e) in commercial (trading) activities (with or without exchange of money) whether the goods are artefacts or knowledge products;
(f) in design activity whether the product is a car, an electronic component, a piece of furniture or a theatre production;
(g) in decision-making of any kind from control room operation to government strategy;
(h) in the education of our next generation of researchers, developers and innovators (through interaction with the value-adding chain: obtaining information, then testing and experimenting to obtain knowledge).
This multilingual, multimedia information infrastructure is required (particularly in a European context) for all areas of SET in order to make the next leap forward. It is the modern equivalent of Newton’s famous ‘other men’s shoulders’ remark. Providing this infrastructure requires a progressive implementation, starting from what is available now and extending to products of R&D that has not yet even been proposed.
The requirement in IT terms, predicated by the Introduction, consists of the provision of an architected facility based on several layered components each benefiting from the substructure below:
(a) a computational / data grid to provide raw computing power and associated data stores network-connected with abilities in both floating point computation and data-handling with logic;
(b) an information grid superimposed on (a) connecting together the major information sources with interfaces to allow homogeneous access to heterogeneous distributed information sources. The information grid also requires sophisticated statistical analysis / reduction techniques for floating point numbers, textual information and multimedia information and with special facilities for images, all with associated visualisation and VR (virtual reality) facilities. Metadata to describe the information is an essential component both for integration and for guiding of analysis and utilisation. Metadata has 3 major components: schema metadata constrains the information to have integrity (including thesauri and terminology), navigational metadata locates the information and associational metadata describes succinctly instances or collections of information (e.g. a catalogue) and provides restrictions on access such as security or copyright;
(c) a knowledge grid superimposed on (b) utilising KDD (knowledge discovery in databases) technology, of which a well-known component is ‘data mining’. The knowledge grid will also support intelligent assistance for decision makers (from control room operators to strategic thinkers) and provide interpretational semantics on the information.
Each grid will have suitable security controls (both for information availability and prevention of unauthorised access) appropriate to the source and the accessor. Similarly rights access (e.g. copyright, IPR) will be controlled.
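The three kinds of metadata described under (b), together with the access controls just mentioned, can be pictured as a simple record structure. The following is only a minimal sketch under stated assumptions: every class, field and value name here (ResourceMetadata, allowed_roles, the "archive://example/..." location and so on) is hypothetical and chosen for illustration, not taken from any existing system or standard.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaMetadata:
    """Constrains the information so that it keeps its integrity."""
    fields: dict[str, type]  # field name -> required Python type
    thesaurus: dict[str, str] = field(default_factory=dict)  # variant -> preferred term


@dataclass
class NavigationalMetadata:
    """Locates the information."""
    location: str


@dataclass
class AssociationalMetadata:
    """Describes the resource succinctly and carries access restrictions."""
    description: str
    allowed_roles: set[str] = field(default_factory=set)
    copyright_holder: str = ""


@dataclass
class ResourceMetadata:
    schema: SchemaMetadata
    navigation: NavigationalMetadata
    association: AssociationalMetadata

    def validate(self, record: dict) -> bool:
        """Integrity check: every schema field is present with the right type."""
        return all(
            name in record and isinstance(record[name], expected)
            for name, expected in self.schema.fields.items()
        )

    def may_access(self, role: str) -> bool:
        """Access check driven by the associational metadata."""
        return role in self.association.allowed_roles


# Hypothetical example resource (all values invented for illustration).
survey = ResourceMetadata(
    schema=SchemaMetadata(fields={"station": str, "temperature_c": float}),
    navigation=NavigationalMetadata(location="archive://example/ocean-survey"),
    association=AssociationalMetadata(
        description="Hypothetical ocean temperature survey",
        allowed_roles={"researcher"},
    ),
)
```

In this sketch the schema metadata drives the integrity check, the navigational metadata is just a location string, and the associational metadata carries both the catalogue-style description and the access restriction. A real implementation would need far richer models for each.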
At present the UK has the basics of the computation / data grid. There are supercomputer facilities at Manchester (CSAR), Edinburgh and CLRC-RAL, each with associated large data stores. (CSAR has more than 100 TB of nearline storage, of which only a few TB are used; CLRC-RAL has more than 30 TB, with a large upgrade planned.) Every Research Institute and University has considerable local computation and data storage capacity, much of it provided under recent JIF bids. Some have nationally-relevant data centres such as the ESRC data archive at University of Essex, MIMAS at Manchester, EDINA and the Data Library at Edinburgh, the space science data centre (supporting several RCs) at CLRC-RAL, the British Oceanographic Data Centre at NERC-Bidston, and the genomic databases at MRC-Hinxton. The UK has a reasonable quality academic network connecting these resources.
The UK has a much less well-developed information grid. Interoperation with, and use of, datasets from a location other than the research team base happens only when the remote datasets are at a recognised national facility. These national facilities tend to be discipline-based and are usually known only to a minority of researchers in that discipline, so even within-discipline use, with its associated cooperation and discussion, is not widespread. Cross-disciplinary use is rarer still, since the existence of a resource is seldom known to researchers in another discipline. Furthermore, without adequate metadata (e.g. for integrity control such as precision and accuracy, and for explanation of collection method) cross-disciplinary resource usage is hazardous. Certain national facilities provide metadata for their data holdings, but not in a consistent form such that datasets from different centres (or even from the same centre in the same discipline) can be used together comfortably. Athena at Edinburgh provides uniform access control to a limited set of national resources. The funding regime of the various, ever-changing national facilities does not encourage persistent excellence.
The UK does not have a knowledge grid, except for the knowledge recorded in scholarly publications and available online. The storage and cataloguing of grey literature is extremely variable. There are isolated nodes on the network where knowledge is elicited from humans and stored, or where knowledge is derived from information by KDD techniques.
The immediate requirement is for these valuable resources of computation and data:
(a) to be catalogued so that they are known to exist;
(b) to be made easily available to authorised users subject to controls ranging from security to payment for access;
(c) to be placed on a network-connected grid with central coordination at a long-lived, internationally recognised data centre (providing the catalogue of resources as metadata) so that the UK can benefit maximally from the investment of funding by many agencies of government (and some associated commercial sources).
Such an action would indicate immediately, at a national level, areas where further investment might be required and, more importantly, areas where resource might with benefit be shared. Most importantly, it would facilitate interoperation of computation and data with associated cooperation. Furthermore, it would provide the basis for an information grid, thus making the resources more generally usable.
The provision of an information grid requires the provision of metadata to agreed standards across all nationally-relevant resources such that, from the metadata description, authorised users can decide whether a resource is relevant and useful and, furthermore, can utilise it. In some cases this metadata may be generated by automated techniques, and this is an area of active research. The provision of access to heterogeneous information resources utilising metadata (and intelligent agents) is also the subject of active research, in which the UK has an excellent reputation. Again, coordination is the key to effective provision.
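The idea of homogeneous access over heterogeneous sources can be illustrated with a small sketch. It assumes, purely for illustration, two hypothetical archives that hold similar information in different native forms (CSV text and in-memory records) under different local field names; a per-source mapping, playing the role of schema metadata, translates each into an assumed common vocabulary so that a single query works across both. None of the names or data below come from a real system.

```python
import csv
import io
from typing import Callable, Iterable, Iterator

Record = dict[str, str]  # a record expressed in the assumed common vocabulary


class Source:
    """Wraps one information source and exposes its rows in common terms."""

    def __init__(self, name: str, rows: Callable[[], Iterable[dict]],
                 field_map: dict[str, str]):
        self.name = name
        self._rows = rows            # callable returning rows in native form
        self._field_map = field_map  # local field name -> common field name

    def records(self) -> Iterator[Record]:
        for row in self._rows():
            common = {self._field_map[key]: str(value)
                      for key, value in row.items() if key in self._field_map}
            common["source"] = self.name
            yield common


def csv_rows(text: str) -> Callable[[], Iterable[dict]]:
    """Adapter for a source whose native form is CSV text."""
    return lambda: csv.DictReader(io.StringIO(text))


def query(sources: list[Source], **criteria: str) -> list[Record]:
    """One query, expressed in common terms, evaluated over every source."""
    return [
        record
        for source in sources
        for record in source.records()
        if all(record.get(key) == value for key, value in criteria.items())
    ]


# Two hypothetical archives with different formats and field names.
archive_a = Source(
    "archive_a",
    csv_rows("title,subject\nTide tables,oceanography\nLabour survey,economics\n"),
    {"title": "title", "subject": "discipline"},
)
archive_b = Source(
    "archive_b",
    lambda: [{"name": "Sea surface temperature", "field": "oceanography"}],
    {"name": "title", "field": "discipline"},
)

hits = query([archive_a, archive_b], discipline="oceanography")
for hit in hits:
    print(hit["source"], hit["title"])
```

The user of query never sees that one archive calls the field "subject" and the other "field". That translation lives entirely in the metadata, which is why agreed metadata standards matter more than any particular access software.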
The provision of a knowledge grid requires two major elements: firstly, an agreed knowledge representation, with knowledge provided in this representation by elicitation from humans and/or by discovery (inference) from databases; and secondly, homogeneous access over heterogeneous sources of scholarly publications and grey literature. This is likely to include facilities such as thesauri and/or domain ontologies to assist in understanding, and multilingual facilities. Once again, coordination is the key to effective provision. There are advantages in coordinating all 3 grids at the same node.
Since the computation power available per unit price, in both modern multiprocessor servers and workstations, is increasing rapidly, it is unlikely that large additional expenditure will be required for computation over and above that planned. This is especially true if future supercomputing provision uses clustered standard processors in ‘Beowulf-like’ configurations such as those being pioneered in CLRC and elsewhere. Client devices will continue to evolve through notebook computers to intelligent digital phones to wearable computers, all at the same price but with increased functionality, while there will continue to be a need for increasingly powerful high-end visualisation client devices including VR. This is one area where investment will be needed.
However, the rate of production and digitisation of data (especially from scientific detector systems such as those on satellites and the LHC at CERN, and from video cameras, digital cameras and audio recordings) is increasing faster than the increase in storage capacity per unit price, and so additional investment will be required. The cost of massive disk storage is reducing rapidly so that, unless there are great technological breakthroughs, the cost advantage of nearline storage has to be balanced against the additional access time costs. However, there are arguments for keeping archived copies of datasets to secure their availability. In any case, it may be expected that the demand for managed data storage will increase by a factor of 5-10 per year.
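A short calculation shows how quickly a growth factor of 5-10 per year compounds. The sketch below takes the 30 TB store mentioned earlier purely as an illustrative starting point. The figure of 2x per year for the improvement in storage capacity per unit price is an assumption made only for this example and is not a figure from this paper.

```python
def projected_demand(current_tb: float, annual_factor: float, years: int) -> float:
    """Storage demand after compounding a fixed annual growth factor."""
    return current_tb * annual_factor ** years


def relative_cost(demand_factor: float, capacity_per_price_factor: float,
                  years: int) -> float:
    """Spending relative to today if demand outgrows capacity per unit price."""
    return (demand_factor / capacity_per_price_factor) ** years


START_TB = 30                # illustrative starting point only
ASSUMED_PRICE_FACTOR = 2     # ASSUMPTION: capacity per unit price doubles yearly

for factor in (5, 10):
    demand = [projected_demand(START_TB, factor, year) for year in range(4)]
    cost = relative_cost(factor, ASSUMED_PRICE_FACTOR, 3)
    print(f"x{factor}/year: demand (TB) over years 0-3 = {demand}, "
          f"relative spend after 3 years = {cost}")
```

Under these assumptions a 30 TB store would need to grow to somewhere between 3,750 TB and 30,000 TB in three years, and even with capacity per unit price doubling annually the spend would rise roughly 16-fold at the lower growth factor.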
There are implications concerning network capacity. ‘Abilene’ in the USA has a 2 Gbps backbone and has similar speed access trunks into many nodes. The current SuperJanet backbone capacity is 155 Mbps, more than an order of magnitude less. There are plans to upgrade by a factor of ~4. Similarly, onsite networks tend to be of the order of 100 Mbps, and access trunks are 2 orders of magnitude slower than those in ‘Abilene’.
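The practical effect of these capacities can be seen by estimating how long it takes to move a large dataset. The sketch below uses the backbone figures quoted above and a hypothetical 1 TB dataset. It assumes the whole link is available to a single transfer and ignores protocol overhead, so real transfer times would be longer.

```python
def transfer_hours(size_bytes: float, link_bits_per_second: float) -> float:
    """Idealised transfer time in hours: whole link, no protocol overhead."""
    return size_bytes * 8 / link_bits_per_second / 3600


TB = 10 ** 12               # one terabyte in bytes (decimal convention)
SUPERJANET_BPS = 155e6      # 155 Mbps backbone, as quoted in the text
ABILENE_BPS = 2e9           # 2 Gbps backbone, as quoted in the text

ratio = ABILENE_BPS / SUPERJANET_BPS
print(f"Backbone capacity ratio: {ratio:.1f}")
print(f"1 TB over SuperJanet: {transfer_hours(TB, SUPERJANET_BPS):.1f} hours")
print(f"1 TB over Abilene:    {transfer_hours(TB, ABILENE_BPS):.1f} hours")
print(f"1 TB after a x4 upgrade: {transfer_hours(TB, 4 * SUPERJANET_BPS):.1f} hours")
```

On these idealised figures a 1 TB dataset occupies the current backbone for over 14 hours, against a little over one hour on the faster backbone, and the planned fourfold upgrade closes only part of that gap.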
The major suppliers of software for data, information and knowledge handling are all working on these problems, some closely with UK R&D teams (in industry as well as the RC sector). Additionally, there are advantages to be gained for the UK in being ahead of the game in defining standards and protocols (e.g. the work at CLRC-RAL as the office for W3C), which ensures that the IT suppliers provide what is required to conform to the standards (usually with proprietary extensions for marketing reasons!). There is great advantage to the UK SET community in using common middleware software for interconnection of information sources.
The major implications concern effort to generate the metadata required for the information grid and the associated systems for access and utilisation of the information. Advances have been made and can be made in ensuring (e.g. by conditions of funding) that datasets of more than parochial value have associated metadata before deposition in a data library or archive. R&D work on homogeneous access over heterogeneous data sources (Manchester, Cardiff, CLRC-RAL) is well advanced and could be made into production services on a timescale similar to that for widespread provision of metadata.
The effort of experts from whom knowledge can be elicited is expensive, as is that of the knowledge engineers who elicit and encode the knowledge. Existing knowledge-based systems generally are rudimentary and commonly domain-specific.
The work on intelligent agents for information support (Manchester, Cardiff, Open University, CLRC-RAL, Southampton, Edinburgh) provides a basis for development of suitable knowledge grid software. Knowledge Discovery in Database systems exist commercially (e.g. IBM IDM, the UK system Clementine, the Dutch system Data Distilleries) and are used commonly in certain commercial applications. However, the knowledge so generated is rarely used in conjunction with other knowledge because of representation incompatibilities and domain-specific characteristics. This is an area for R&D to develop the knowledge grid.
The provision of metadata to characterise scholarly publications is now commonly a requirement from the publishers; increasingly this is also true for grey literature. CLRC-RAL, jointly with colleagues in Norway, has defined a formal metadata set (which can be represented as information or knowledge) for grey literature based on the widely-known, but informal, ‘Dublin Core’. However, the provision of thesauri and domain ontologies to assist readers or users of this published knowledge is lacking. Effort to provide such facilities is expensive, although, partly because of multilinguality, countries with a language other than English (e.g. Norway, Italy, Greece) have invested in this area.