IMeshTk

IMeshTk: Subject Gateway Review Literature Review

Authors:
libby miller <libby.miller@bristol.ac.uk>
dan brickley <daniel.brickley@bristol.ac.uk>
based on extensive research by
martin hamilton <martin@net.lut.ac.uk>

Last updated: 2000-09-14

Overview

As stated in the Subject Gateway Review Plan [1]

"the objective of the Subject Gateway Review is to ensure that the IMeshTk architectural and technical strategies are well-grounded in the documented needs and practical requirements of the internet cataloguing community as they stand now, and with a view to the next 2-3 years".

The Subject Gateway Review work is responsible for producing scope and prioritisation guidelines for the IMesh toolkit, particularly for the technical and architectural activities.

The Subject Gateway Review will encompass three main elements:

"
- a qualitative study of the current and anticipated technical needs of the IMesh community, through discussion on the IMesh mailing list
- a case-study based review of the software-related needs of the subject gateway community, gathered through informal interviews
- a literature review detailing potentially useful emerging and current technologies and standards not necessarily already used by the IMesh community, including a summary of related projects and initiatives (this would for example cover IMesh list archives, Renardus project deliverables etc.)
"

This document is the literature review. Its objectives are:

This document will feed into the other parts of the Subject Gateway Review:

by creating definitions and scoping these activities.

1.0 Scope

There are two main aspects to defining the scope of this document and the IMesh toolkit itself. The first is that it is unclear exactly what a subject gateway is; although several precise definitions are available [see, for example the Renardus definition: [2]], the IMesh toolkit project plan acknowledges that many of the elements of a subject gateway toolkit defined in this way would also be useful to other users:

The toolkit Project Plan argues that a well-defined architecture is necessary for a modular approach to the building of subject gateways, and that the IMesh toolkit project will

"...define, implement and test such an architecture through the provision of an integrated toolkit for subject gateways and other metadata creators." [3]

As a concrete example, the SOSIG service includes not only the hand-selected and classified collection of the main catalogue, but also the Social Science Search Engine, a robot-harvested collection of metadata about webpages. In addition it contains databases of conferences, and CV information in my grapevine, which is a different sort of data altogether. The first catalogue satisfies the criteria set out in the Renardus/Koch document for a quality controlled subject gateway:

It is a subject-based resource discovery guide which provides links to information resources (documents, collections, sites or services), predominantly accessible via the Internet, and applies a documented set of quality measures to support systematic resource discovery. It is also managed, collected by humans according to documented selection criteria, with maintenance criteria, with a fixed metadata set and controlled subject classification. [paraphrased from [2]]

But the other parts do not. On the other hand, we could stretch the definition of a subject gateway a little, so that all the different elements of SOSIG fit in, since the CV and conference services are subject-based.

"Subject Gateways - are subject-based resource discovery guides that provide links to information resources (documents, collections, sites or services), predominantly accessible via the Internet." [2]

Now consider that a service such as SlashDot [4] on this definition is a subject-based gateway to information resources about computing. In fact SlashDot fits this description rather well, being a filtered, regularly updated guide to resources on the internet for computing. It also fits the IMesh tk definition of the subject gateway model, in which

"databases of resource descriptions are built up through manual selection, `cataloguing' and classification." [3]

But you might well argue that IMesh tk is not funded to provide an architecture for SlashDot. It is funded under the JISC/NSF DLI2 program

"Digital Libraries Initiative Phase Two is a multiagency initiative which seeks to provide leadership in research fundamental to the development of the next generation of digital libraries, to advance the use and usability of globally distributed, networked information resources, and to encourage existing and new communities to focus on innovative applications areas." [5]

and so concerns libraries, or perhaps, library-like projects. A difference between SlashDot and those services we name 'quality controlled subject gateways' is that information is selected primarily by the users of the site and not by information collection professionals. However, to some degree user input is accepted by services such as SOSIG [6] and Biz/ed [7] and then filtered, just as SlashDot uses subject experts to filter its information content.

Even if we were to include only those sites where information was both described and sought by subject experts, there are many sites available where subject experts on business news or certain aspects of computing provide just such a service, for example The Register [8] and xmlhack [9].

There is a further scoping consideration related to this, namely, that from a technical point of view the needs of quality controlled subject gateways may not be so far from those met by the database-backed site management tools used by SlashDot and other sites. Zope [10], for example, provides a simple, free system to manage and publish data on the web, and requires little technical experience to set up and run; at the other end of the scale, if technical help is available, a relational database system such as MySQL plus PHP or Perl provides a scalable approach to publishing data on the web, which can be written from scratch. Biome has in fact taken this approach [11].
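As a rough illustration of how little code a database-backed catalogue needs, the sketch below uses Python with the standard-library sqlite3 module as a stand-in for MySQL plus PHP or Perl. The table layout, the field names and the sample records are assumptions made for this example only; they do not describe the schema of Biome or of any other gateway.

```python
import sqlite3
from html import escape


def build_catalogue(records):
    """Create an in-memory catalogue and load (title, url, subject) rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE resource (title TEXT, url TEXT, subject TEXT)")
    db.executemany("INSERT INTO resource VALUES (?, ?, ?)", records)
    return db


def search(db, term):
    """Return (title, url) pairs whose title or subject contains the term."""
    pattern = f"%{term}%"
    rows = db.execute(
        "SELECT title, url FROM resource "
        "WHERE title LIKE ? OR subject LIKE ? ORDER BY title",
        (pattern, pattern),
    )
    return rows.fetchall()


def render(results):
    """Render search results as an HTML list, escaping all stored text."""
    items = "".join(
        f'<li><a href="{escape(url)}">{escape(title)}</a></li>'
        for title, url in results
    )
    return f"<ul>{items}</ul>"


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    db = build_catalogue([
        ("Alpha", "http://example.org/a", "economics"),
        ("Beta", "http://example.org/b", "law"),
    ])
    print(render(search(db, "econ")))
```

The point of the sketch is only that storage, search and presentation are three small, separable steps; a production gateway would add cataloguing workflows, classification and quality control on top.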

The scoping issue here is that the problems of running a quality controlled subject gateway are similar to those involved in running a large website. There are numerous open source and commercial tools available to do this; and if subject gateways want to offer services such as personalization, tools such as SlashCode [12] might fit the bill nicely.

The second aspect of scoping is the time element. Many quality controlled information gateways (to take the clearest case) use the ROADS [13] software to create, maintain and present subject-based metadata. Improving these gateways' speed of searching might be achieved by providing an architecture which allows the swapping in of faster back-end databases such as MySQL. This would improve the lives of these gateways now; however, while this may well be a suitable use of IMesh time, the project needs to focus on what will be useful to subject gateways at the end of the project, when we can offer them the APIs and reference implementations. It is no use offering subject gateways APIs which do not take into account changes in needs and technologies in the course of the project.

However, naturally it will be very difficult to assess what a subject gateway will need in 2-3 years, or even what it will look like. Hence the requirement for continuing contact with the IMesh community over the course of the project. But at this stage it is necessary to examine possibly useful technologies and standards, to try to play the game of anticipating the needs of the subject gateways. This document will examine some of these potential standards and technologies, and in addition, look at technologies and standards used by similar projects in this area. Additional emphasis will be given to the subject gateways' predicted needs when subject gateways are questioned.

Part of the aim of this document and the Subject Gateway Review as a whole is to pin down what we might mean by the term 'subject gateway', now and in the future. This will allow us to assess what technological and architectural recommendations might be appropriate for the IMesh toolkit. Separately, we need to scope the rest of the Subject Gateway Review, including who will be interviewed and consulted for their opinions.

1.1 Defining a Subject Gateway

We have already seen some of the definitions of a subject gateway; we have noted above (as an example) that only part of the SOSIG service fits the notion of a quality controlled subject gateway, and that all of SlashDot does. So the definition we have is both too narrow and too broad. The definition is too narrow because of the common lack of distinction made between the SOSIG internet catalogues and the SOSIG service, which incorporates other services and databases. The definition is too broad because we have an implicit understanding of who the IMesh toolkit is aimed at, encapsulated, albeit vaguely, in the notion of the IMesh community.

It is useful to start to examine who the IMesh community are at this point.

1.2 IMesh Community

Who are the IMesh community?

A starting point is the list of delegates at IMesh Framework Workshop [14]. This workshop was organised by the Resource Discovery Network Centre (RDNC) on behalf of the IMesh initiative with JISC support; out of it grew the IMesh Toolkit project proposal.

From this, the IMesh list [15] and the IMesh workshop email list [16], we have begun to construct a list of members of the IMesh mailing list with past and current affiliations [17] and used this as a starting point.

Loosely, the IMesh community consists of a number of organisations and projects with an interest in information retrieval from a library perspective, and with respect to quality internet metadata. Often these will be libraries or projects and services attached to libraries. Often they will be interested in cross-searching. Many of the individuals and organisations are involved in numerous projects simultaneously. A wide variety of countries are represented on the list and at the workshop.

Some characteristics of the workshop group include:

The IMesh list in total is even more diverse.

1.3 IMesh Toolkit Definitions

The IMesh toolkit project plan is a little vague about the definition of a subject gateway, although the document does provide certain hints about the targeted services [3]:

"Recent years have seen the emergence of the subject gateway approach to Internet resource discovery. In this model, databases of resource descriptions are built up through manual selection, `cataloguing' and classification."

"Various design and technical choices have been made in subject gateway services, however there are some common features: a metadata format (Dublin Core or IAFA templates, for example), a search and retrieve protocol (Z39.50, LDAP and Whois++, for example), and a mechanism for routing queries between gateways (the Common Indexing Protocol). "

One might also add that necessarily for web-based services, subject gateways require a web-enabled user interface to these services.

We can summarise this view from the project plan:

1.4 Scope

It seems clear that this project is extremely difficult to scope. Add this to the very variable interests of the IMesh community (as represented by the subset who attended the first workshop) and it seems impossible to scope the project at all.

For this reason, in the subject gateways review interviews, we will restrict ourselves, somewhat arbitrarily, to the needs of the Renardus definition of quality controlled subject gateways, so:

It is a subject-based resource discovery guide which provides links to information resources (documents, collections, sites or services), predominantly accessible via the Internet, and applies a documented set of quality measures to support systematic resource discovery. It is also managed, collected by humans according to documented selection criteria, with maintenance criteria, with a fixed metadata set and controlled subject classification. [paraphrased from [2]]

This will enable us to narrow down a list of subject gateways to interview in depth about what they need.

However, a crucial part of this research will be to ask them how they anticipate changing in the next 2-3 years. With the advent of hybrid libraries and portalisation, our scoping policy may be even more arbitrary than it seems at first, since we may be able to learn from the more complex broker and cross-search services. The changes they anticipate may include harvesting, portalisation, preferences, and cross-searching with different kinds of databases.

This means that we need to take account of the experiences of excluded projects both before and after interviews. We also need to anticipate what sorts of standards and technologies are on the ascendant. This requires that the rest of this document be quite wide-ranging, since we need to be open-minded about how these services will develop and what sorts of technologies they might use or need in the future.

2.0 Existing Research About Subject Gateways

Substantial research already exists about quality controlled subject gateways. This section surveys the available research where it is relevant to the Subject Gateway Review.

2.1 MIA and the IMesh Toolkit Technology Review

Draft by Martin Hamilton of I3 (Technology Review) dated Jan 2000 [18]

Martin Hamilton uses the MIA architecture to analyse three existing resource discovery architectures in this draft of this IMesh tk internal deliverable. Briefly analysing his research is useful because he looks at existing software packages commonly in use by quality controlled subject gateways, within an analytical framework, giving us information about the current requirements of some QSBGs.

Hamilton examines three resource discovery mechanisms, Chic-Pilot, ROADS and Isaac, according to the 5 layers of the MIA, i.e.

Layer | Implements
Presenter | The user interface, responsible for interaction with users - both human (e.g. via Web browsers) and software agents.
Coordinator | Offers services suitable for the user community, e.g. user profiles (e.g. per user customisations) and session maintenance (e.g. holding a cache of search result sets)
Mediator | Provides an opaque interface to lower level services, e.g. discovery and retrieval.
Communicator | Encapsulates the technical details (e.g. communications protocols and metadata formats) required for access to a service.
Provider | Provides the external services which the Communicator interacts with, e.g. catalogue of resource descriptions, database of user profiles, and so on.
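To make the division of responsibilities concrete, the five layers can be sketched as a stack of small Python classes. This is our own minimal illustration under stated assumptions, not code taken from MIA or from any of the packages reviewed: every class and method name is hypothetical, and the Communicator here simply filters an in-memory list where a real one would speak Whois++, Z39.50 or LDAP.

```python
from abc import ABC, abstractmethod


class Provider:
    """Bottom layer: holds the catalogue of resource descriptions."""

    def __init__(self, records):
        self.records = records  # list of dicts with "title" and "url" keys


class Communicator(ABC):
    """Encapsulates the protocol details needed to reach one Provider."""

    @abstractmethod
    def fetch(self, term):
        """Return the records matching the search term."""


class InMemoryCommunicator(Communicator):
    """Hypothetical stand-in for a Whois++, Z39.50 or LDAP client."""

    def __init__(self, provider):
        self.provider = provider

    def fetch(self, term):
        needle = term.lower()
        return [r for r in self.provider.records if needle in r["title"].lower()]


class Mediator:
    """Opaque discovery interface: upper layers never see a protocol."""

    def __init__(self, communicators):
        self.communicators = communicators

    def discover(self, term):
        hits = []
        for communicator in self.communicators:
            hits.extend(communicator.fetch(term))
        return hits


class Coordinator:
    """Session services: here, only a cache of search result sets."""

    def __init__(self, mediator):
        self.mediator = mediator
        self.cache = {}

    def search(self, term):
        if term not in self.cache:
            self.cache[term] = self.mediator.discover(term)
        return self.cache[term]


class Presenter:
    """User interface layer: formats results for a human reader."""

    def __init__(self, coordinator):
        self.coordinator = coordinator

    def show(self, term):
        hits = self.coordinator.search(term)
        return "\n".join(f"{h['title']} <{h['url']}>" for h in hits)
```

In this sketch each layer holds a reference only to the layer directly beneath it, which is the property that would allow one layer to be replaced without touching the others.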

In each case the software can be fitted into the architecture, either easily or in a more forced manner; however, in each case there are interdependencies between layers, in the form of reliance on special protocols or on more detailed API elements, which mean that the layers are not independent of one another.

This is not a particular problem for the QSBGs in itself, since protocols such as LDAP and Whois++ were found to be sufficient and useful within these packages. There may not even be (theoretical) difficulties with swapping pieces of these over: both Chic-Pilot and ROADS use the Whois++ protocol, for example. The difficulties instead come from certain theoretical use-case scenarios, for example:

As a secondary point, it is important to note that the MIA analysis of the architecture suggests commonalities between the internal architecture of a QSBG and the external interfaces of a broker architecture which uses QSBGs. Consider a QSBG which intends to form part of a cross-search broker service such as Renardus [19]. The broker need only talk to the Mediator level of the architecture in this case, if such a level exists. At present, cross-searching, say, Whois++ and Z39.50 databases is much more like accessing the Communicator level, at which it is necessary to know the protocol used by the database and other technical data. Similarly, one way of ensuring pluggability of different databases from different software packages would be to create this Mediator layer. The layer provides a solution both to modularity of software elements and to compatibility for interaction between different installations of QSBG software.
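The broker scenario can be sketched in a few lines of Python. This is a hedged illustration only: GatewayMediator, broker_search and the Record fields are hypothetical names of our own, and the per-gateway search functions stand in for real Whois++ or Z39.50 clients. What the sketch shows is that the broker calls one method, discover(), and never needs to know which protocol a gateway uses internally.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    url: str
    source: str  # name of the gateway that supplied the record


class GatewayMediator:
    """Hypothetical Mediator-level facade in front of one gateway."""

    def __init__(self, name, protocol, search_fn):
        self.name = name
        self.protocol = protocol  # known inside the gateway, unused by the broker
        self._search_fn = search_fn  # callable: term -> list of (title, url)

    def discover(self, term):
        return [Record(t, u, self.name) for t, u in self._search_fn(term)]


def broker_search(mediators, term):
    """Cross-search several gateways through their Mediator interface only."""
    merged = {}
    for mediator in mediators:
        for record in mediator.discover(term):
            # Keep the first copy when two gateways describe the same URL.
            merged.setdefault(record.url, record)
    return sorted(merged.values(), key=lambda r: r.title)
```

Each record keeps the name of the gateway it came from, so the origin of the data stays visible after results are merged.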

2.2 Renardus Research

Extensive research on issues similar to those that the IMesh tk Subject Gateway Review is interested in has already been undertaken by the Renardus project [19]. Renardus will eventually be a broker system for simultaneous access to quality-controlled subject gateways and other Internet-based, distributed services. It has 12 partners in Europe, and 11 of these responded to a questionnaire about their preferences for such a broker system. [url of questionnaire and results in the references section of D1.2 - User Requirements for the Broker System [20] ]

In addition we have

D1.1 Review of Existing Broker Services, produced with IMesh tk [21]

and D1.6 Evaluation Report of Partner Subject Gateways (section 1.8 Index Type/ Technical Notes) [22]

Here I have summarised the results of D1.1 in a table, and provided a short summary of the conclusions of D1.2. There is more information about D1.2 and D1.6 in section 3.

2.2.1 Review of Existing Broker Services [21]

[partially funded by IMesh tk]

This is a review of existing broker architectures and technical information. It covers 19 systems, fitted to the MIA architecture where possible.

In summarising these results, I have concentrated on the layers of MIA between presenter and communicator. The communicator almost invariably communicates directly with the database. Often, as illustrated by the table below, the presenter communicates directly with the communicator, leaving out the coordinator and mediator layers.

Name of Broker Service | Description | Presenter/Communicator/Mediator API | Underlying Protocol(s)
Agora | hybrid library cross-search | unclear from description | Z39.50, ILL
Aquarelle | Museum archives and annotations; 'folders' of archives | AQL query language | Z39.50, folders API
ASF | Freeware interoperability framework for government information | arbitrary cgi scripts | Z39.50, whois++
Chic-PILOT | software for indexing whois++ databases and routing queries to them | cgi scripts | whois++
CORC | metadata management system | html -> Z39.50 query? results -> HTML/XMLRDF | Z39.50
DEF: Denmark Electronic research Library | distributed search environment/database hybrid | HTML <-> Z39.50. uses ZAP gateway tool | Z39.50
NRW: Die Digitale Bibliothek Nordrhein-Westfalen | hybrid library cross-search: full text, library catalogues, internet services | HTML to Z39.50 gateway - WebPAC | Z39.50, ILL
ETB: European Schools Treasury Broker | Thesaurus, metadata registry, remote searching of distributed databases | html <-> z39.50; z39.50 -> xml | Z39.50, XML
EULER: European Libraries and Electronic Resources in Mathematical sciences | cross-searching of bibliographic databases, library catalogues, electronic journals, preprint archives, internet resources | HTML <-> Z39.50 gateway: Euler engine | Z39.50
Finnish Virtual Library: FVL | cross-search of ROADS/whois++ internet resource databases. uses centroids | HTML <-> whois++ gateway cgi script | whois++
GAIA: Generic Architecture for Information Availability | discovery of information, location of suppliers, negotiation of quantity, price, delivery digitally, authentication, pricing, payment. Uses Corba | HTML/Z39.50/VRML/SMTP/SMS gateways | Z39.50, ftp, VAVIC, RTP, ILL, LDAP, SET
Harvest | indexer, gatherer | HTML <-> Harvest broker protocol (cgi-script) | Harvest broker protocol
ht://Dig | full-text indexing of web pages | HTML <-> htsearch program (cgi-script) | none
Isaac Network | metadata management tool; cross-searching; uses Tios | HTML <-> LDAP cgi-script | LDAP
Jointly Administered Knowledge Environment (jake) | search and retrieval of electronic resources such as online journals. High performance | HTML <-> SQL database using PHP3 | none
NCSTRL/Dienst: Network Computer Science Technical Research Library | stored digital documents, indexing, cross-searching. Dienst by the Open Archives initiative. Renardus considering using it | HTML (via structured url) <-> Dienst protocol. also Dienst -> XML | Dienst
RDN Resource Finder | Cross-search of whois++ ROADS gateways | HTML <-> Whois++ gateway (cgi script) | Whois++
ROADS | Searching/browsing metadata. Also CIPS, Harvest, Z39.50 | HTML <-> Whois++ gateway (cgi script) | whois++
UNIverse | distributed virtual union library service | HTML <-> Z39.50 gateway | Z39.50, ILL

Recall that the communicator talks a protocol like Z39.50 to one or more underlying databases. MIA suggests a mediator layer, to shield the presentation layer from the specific protocol.

Essentially the path from presenter to communicator requires some sort of query language to take the query input by the user and turn it into a Z39.50, LDAP or Whois++ query, and some kind of layer which converts the result set into HTML.

Normally this will be http/html -> cgi script/arbitrary or convenient API -> Z39.50 query

and the other way around

z39.50 results set -> cgi-script/arbitrary API -> http/html
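The two directions of this path can be sketched as three small Python functions. This is a minimal sketch under loud assumptions: the back-end query syntax produced by to_protocol_query is a toy notation invented for this example, not real Z39.50, LDAP or Whois++ syntax, and all function names are hypothetical.

```python
from html import escape
from urllib.parse import parse_qs


def form_to_query(query_string):
    """Outward step 1: decode a submitted HTML form into field/term pairs."""
    fields = parse_qs(query_string)
    return {name: values[0] for name, values in fields.items() if values[0].strip()}


def to_protocol_query(fields):
    """Outward step 2: build a back-end query (toy syntax, an assumption)."""
    return " AND ".join(f'{name}="{term}"' for name, term in sorted(fields.items()))


def results_to_html(result_set):
    """Return path: turn a result set (a list of titles) into HTML."""
    if not result_set:
        return "<p>No results.</p>"
    rows = "".join(f"<li>{escape(title)}</li>" for title in result_set)
    return f"<ol>{rows}</ol>"


if __name__ == "__main__":
    fields = form_to_query("title=economics&subject=trade")
    print(to_protocol_query(fields))
    print(results_to_html(["Trade & Economics Guide"]))
```

In the services reviewed, the equivalent of these three functions typically lives inside a single cgi script, which is why the interface between them is rarely documented.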

As we have seen, in many of these broker services the communicator and mediator layers are compressed together, or the communicator, mediator and presenter layers are. In most cases the APIs at this point are not clearly defined: this is the interface with a human searcher, and so the point at which presentation gets mixed with content.

An exception is the Dienst protocol, which provides a mediator in which the input from an HTML form, expressed as parameters on a URL, is converted to a query of some sort. Dienst essentially provides an HTTP-based query language, which is currently used to query a Dienst database; however, there is no reason why this or a similar system should not work for Z39.50 or whois++ databases. Dienst then provides XML or HTML results.

A system based on a protocol such as Dienst would constitute a simple, lightweight API which would enable cross-searching of databases, and would also increase the pluggability of the components of subject gateway software. In addition, it would not constitute a large change from current practice, merely representing a formalisation of what already happens: an arbitrary cgi script interfaces with the database protocol.
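A lightweight API of this kind might look something like the following Python sketch. It is written in the spirit of a URL-based query interface and does not reproduce the actual Dienst request syntax: the URL layout, the parameter names term and limit, and the XML element names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlparse


def parse_search_url(url):
    """Turn a structured URL into a protocol-neutral query description."""
    parts = urlparse(url)
    params = parse_qs(parts.query)
    return {
        "verb": parts.path.rstrip("/").split("/")[-1],
        "term": params.get("term", [""])[0],
        "limit": int(params.get("limit", ["10"])[0]),
    }


def run_query(query, backend):
    """Dispatch to any back end exposed as a callable taking a search term."""
    if query["verb"] != "search":
        raise ValueError(f"unsupported verb: {query['verb']}")
    return backend(query["term"])[: query["limit"]]


def to_xml(records):
    """Serialise (title, url) pairs as a simple XML result set."""
    root = ET.Element("results")
    for title, url in records:
        record = ET.SubElement(root, "record")
        ET.SubElement(record, "title").text = title
        ET.SubElement(record, "url").text = url
    return ET.tostring(root, encoding="unicode")
```

Because run_query accepts any callable as its back end, the same URL interface could sit in front of a Whois++, Z39.50 or SQL database without the caller noticing the difference.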

2.2.2 User Requirements for the Broker System [20]

Service providers were asked to complete a questionnaire concerning their requirements for the broker functionality [20] [list of participant gateways [23] ]. Some of the results of this report are highly relevant to IMesh Toolkit. Some comments from the report are included in section 3 below. In summary, however, respondents wanted a broker system which used Whois++, Z39.50 and DC/RDF/XML as protocols or export formats. They wanted a distributed model, and wanted to have the possibility of building a personalized user interface, perhaps adding bookmarks and recommendations to the system. They also required that data should be importable into their own systems, but that data should also be clearly contextualized: the source of the data should be transparent to the user. From the conclusions:

"From the service providers' perspective Renardus should be an addition to their existing information gateway requiring only reasonably small adjustments to the current way of doing things."

These conclusions are interesting because the respondents' protocol preferences vary so widely; from the point of view of IMesh tk, this points to the complexity of producing a way of interfacing with back-end databases of many different types. The providers do not want to change the way they work or the software they use.

3.0 Current and Possible Future Technologies and Standards

see also: [24] Renardus: Deliverable 2.1 (internal): Technical standards

Current:

From Renardus Evaluation Report of Partner Subject Gateways [22] (section 1.8 Index Type/ Technical Notes). [25] is a summary table.

"At the basis of the technical system lies the operating system, usually some UNIX variant. The freely available LINUX is mostly used, the other gateways use proprietary systems like SOLARIS, Digital UNIX, or HP-UX, only SSG-FI uses Windows as well."

"The search engines built on this basis again employ mostly freeware, like the different versions of the Harvest system or the Zebra Z39.50 server. Other systems used by several partners are the ROADS software and the Z'mbol server. Special systems are the Basis plus database system employed by DAINet and the Allegro/Avanti system used in the SSG-FI guides."

"For communication, the Z39.50 and the Whois++ protocol are used, as well as some version of the LDAP. DAINet offers a special interactive module FQM."

"The records are all freely available and describe single resources. The HTML display is produced with a combination of static pages, usually for (the entry points of) the browsing structure, and pages generated on the fly, used in particular to display search results."

4.0 Summary

Currently:

An examination of previous literature reviews and surveys shows that quality controlled subject gateway software is closely tied to specific search and retrieval protocols, as MIA analysis shows. The preliminary Renardus analysis suggests that a layer on top of this would be useful; however it also suggests that people are attached to and experienced with the technologies they use, and that experience shows that there is a great lag between new technologies appearing and people taking them up.

This analysis applies to software which has been created for or used within the quality controlled subject gateway community. Our analysis indicates that quality controlled subject gateways have many technical considerations in common with running a large website.

Future:

The current buzzword is XML; XML provides a very flexible format for structured data of any kind. However, its flexibility also means that the structure of documents must be determined elsewhere, in a schema or DTD. XML tools are currently new and untested. XML lacks a query language, although XSLT provides something with similar functionality for documents but not databases. Together, XML and XSLT mean that device-independent content can be created and used. SOAP might be a useful way of formalizing the process of transforming result-set type objects via an arbitrary protocol or API to human-readable objects on the web.
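The idea of device-independent content can be illustrated with a short Python sketch: one XML record, using Dublin Core element names, is rendered in two different ways. Python's standard library has no XSLT processor, so the two rendering functions below are ordinary functions standing in for stylesheets. The record layout, the sample values and the function names are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from html import escape

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element namespace

# Hypothetical sample record, for illustration only.
RECORD_XML = f"""<record xmlns:dc="{DC}">
  <dc:title>Example Gateway</dc:title>
  <dc:identifier>http://example.org/</dc:identifier>
  <dc:subject>Economics</dc:subject>
</record>"""


def read_record(xml_text):
    """Parse one XML record into a plain dictionary of its fields."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext(f"{{{DC}}}title", default=""),
        "identifier": root.findtext(f"{{{DC}}}identifier", default=""),
        "subject": root.findtext(f"{{{DC}}}subject", default=""),
    }


def as_html(rec):
    """Rendering for a web browser."""
    link = f'<a href="{escape(rec["identifier"])}">{escape(rec["title"])}</a>'
    return f'{link} ({escape(rec["subject"])})'


def as_text(rec):
    """Rendering for a plain-text device or an email digest."""
    return f'{rec["title"]} [{rec["subject"]}] {rec["identifier"]}'


if __name__ == "__main__":
    record = read_record(RECORD_XML)
    print(as_html(record))
    print(as_text(record))
```

The stored record never changes; only the final rendering step differs per device, which is the separation that an XML and XSLT pipeline would formalise.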

References

[1] http://www.ilrt.bris.ac.uk/discovery/2000/07/itk-sgr/

[2] http://www.renardus.org/gateway/gateway.html

[3] http://www.Ukoln.ac.uk/metadata/IMesh-toolkit/plan.htm

[4] http://www.slashdot.org

[5] http://www.dli2.nsf.gov/

[6] http://www.sosig.ac.uk/

[7] http://www.bized.ac.uk/

[8] http://www.theregister.co.uk/

[9] http://www.xmlhack.com/

[10] http://www.zope.org/

[11] http://biome.ac.uk

[12] http://slashcode.com/

[13] http://www.roads.lut.ac.uk/

[14] http://www.Ukoln.ac.uk/events/imesh-workshop-jun99/delegates.html

[15] http://www.mailbase.ac.uk/lists/imesh/

[16] http://www.ukoln.ac.uk/services/mailing-lists/ukoln-external-open/imesh-workshop/

[17] http://www.ilrt.bris.ac.uk/discovery/2000/09/imesh/imesh_list.html

[18] http://clark.cs.wisc.edu/cgi-bin/cvsweb.cgi/docs/IMeshTk--I3--Technology_Review.txt

[19] http://www.renardus.org/

[20] http://www.renardus.org/deliverables/d1_2/doc0007.htm

[21] http://www.renardus.org/deliverables/d1_1/D1_1_final.pdf

[22] http://www.renardus.org/deliverables/d6_1/doc0006.htm

[23] http://www.renardus.org/gateway/participants.html

[24] http://nwi.dtv.dk/RENARDUS/D2.1/

[25] http://db1-www.sub.uni-goettingen.de/servlets/renaList?Table=technics&Head=Index/Technical+Notes

[26] http://www.fdgroup.co.uk/easel/about.html

[27] http://www.bib.uab.es/decomate2

[28] http://www.bib.uab.es/project/eng/d31.pdf

[29] http://rudolf.opensource.ac.uk/about/java-info.htm

[30] http://rudolf.opensource.ac.uk/about/specs/sitemap.html

[31] http://www.w3.org/2000/xp/Group/

[32] http://www.w3.org/Style/XSL/

[33] http://www.w3.org/XML/Query
