IMeshTk

IMesh Toolkit Technical Results

Libby Miller
July 2002

see also: ILRT documents funded by IMesh toolkit

Summary

This document provides a summary of some of the technical results of the IMesh Toolkit project [1], a JISC-funded project which ran from July 1999 to July 2002.

A major aim of the IMesh toolkit was to create an architecture for subject gateway software and

"To develop or identify existing APIs (programming interfaces) between the principal components of such an architecture to allow them to communicate"
IMesh project plan, Introduction

The project undertook a consultation exercise with the IMesh community (as represented by the IMesh mailing list [2]). The summary of the questionnaire results received is here [3]. The results of this exercise have informed the creation of the software.

In the project we have taken several approaches with one design principle:

separate out the architectural layers

This is based on the following observations:

It is much easier to create a common model of a query than of a protocol, and particularly of a query compared with the combination of a query and a protocol.

Our aim was to build a system based on separating out the protocol from the query language.

This was mooted by Dan Brickley's post to the project list 2000-10-17 [4] and is discussed further in Monica Bonett's architecture notes [5].

With respect to the querying part of the system, the architecture is similar to the SQL-based query language implementations. There are language specific APIs to SQL stores, for example JDBC, Perl DBI, which nevertheless use a common query language. This makes for applications which can easily be written in different languages and which can nevertheless interoperate well.

Results

Martin Hamilton has written a prototype query schema which uses SQL as the query language and Perl's DBI as the API [6]. It has a specific schema, incorporating a generic data store plus specific add-ons for annotations, security and recommendations.

query layer
searcher --> SQL database
(layered DBI API+ SQL)

Libby Miller used an even more generic but also more experimental architecture, in an RDF query software implementation [7] co-funded by IMesh toolkit and the Harmony project [8].

SQL mixes the way the information is stored with the data itself. To access it, you need to know how it is stored. Libby has developed a query language and implementation for RDF which separates out these layers. Metadata creation is independent of storage and separate from query.

creation/harvesting layer
metadata creator -->  RDF/XML doc 
RDF/XML doc --> database
(network API (uses http, RDF parser))
 
query layer
searcher --> RDF/XML doc
searcher --> RDF/XML database
(layered JDBC-like API+ RDFQL)

In addition, work has been done in IMesh to investigate specific aspects of the component-based architecture of particular interest to subject gateways in the future. These are described in detail in a separate document [9]. Monica has done work investigating personalization - a key interest of the respondents to the IMesh subject gateway review questionnaire. Monica has experimented with creating a SOAP-based interface to authentication information in an LDAP store. Richard has been looking into annotation.

We think that we have provided significant advances in the analysis of the architecture of subject gateways, and sample implementations and code which illustrate our approach.

Background

From the IMesh toolkit project plan introduction:

Subject gateway technology is now at a point where an architectural approach is necessary. Individually developed software tools cannot be expected to work together automatically; they can be 'glued' together using custom-developed code but this approach is not scalable or maintainable. What is required is an overall architecture which specifies individual components and how they communicate. Software with a well-defined architecture is known to be more maintainable, extensible and reusable.

The strategic aims of the project were

The specific objectives of Imesh toolkit included:

The notion of an API was left somewhat unspecified in the IMesh project plan. Below is a fairly standard definition from the Free On-line Dictionary of Computing FOLDOC [10] Editor Denis Howe:

Application Program Interface
(API, or "application programming interface") The interface (calling conventions) by which an application program accesses operating system and other services. An API is defined at source code level and provides a level of abstraction between the application and the kernel (or other privileged utilities) to ensure the portability of the code.
An API can also provide an interface between a high level language and lower level utilities and services which were written without consideration for the calling conventions supported by compiled languages. In this case, the API's main task may be the translation of parameter lists from one format to another and the interpretation of call-by-value and call-by-reference arguments in one or both directions.

APIs are usually language specific.

Methodology

Our approach follows from a finding of our research: there are almost no existing documented programmatic APIs specific to the connections between the components illustrated in the component-based architecture for subject gateways [5]. This was a very definite finding of the subject gateway review questionnaire [3a]. Instead of documented APIs, a variety of code fragments and protocols connect the components in existing software.

Protocols and query languages.

Much existing software uses the Z39.50, Whois++ and LDAP protocols, each of which combines the protocol with the query language; this makes interoperability difficult.

For example, there was a strawman proposal in the IMesh project to create an 'IMesh protocol' (which could have been some existing protocol such as LDAP). This would be used as a common denominator between other protocols so that access to any given repository of subject gateway records could be by this protocol. However, since a protocol is a stateful handshake between two pieces of software, it is difficult to map between two protocols. As Mike Wright of the DLESE gateway [11] pointed out in the responses to the IMesh subject gateway review questionnaire,

"Each of these embed the query language (like boolean) with the protocol of dealing with a connection. Being able to understand the query structure in each would be a wishlist. Having access to that without the connection baggage could be interesting."

The approach we have taken is based on this general idea: to separate out the layers of processing. We have been influenced by the REST [12] architecture in this approach.

REST - REpresentational State Transfer - is a set of architectural design principles which Roy Fielding hypothesised retrospectively as being the design behind the web. The web is an example of very extensive interoperability - millions of computers communicate with each other through a very simple set of HTTP methods: GET, POST, PUT and DELETE. This is a layer on top of other layers - TCP/IP for example.

Simple APIs decrease coordination costs and allow implementation details to be hidden; since these details are unspecified, they can be optimised for the needs of the application.

So in the subject gateway context

SQL is a good example of this: the SQL query language can express very complex queries, but a client talking to an SQL database only needs to know how to connect to the database and receive the results. It does not need to implement the query language itself, which it sees as a black-box, string-like object.
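As a minimal sketch of this point, written in Python rather than Perl or Java, and assuming an in-memory SQLite database with a hypothetical `records` table (neither is part of the IMesh software): the client only opens a connection, hands over the query as a string, and reads the rows back.

```python
import sqlite3


def run_query(connection, query, params=()):
    """Send a query string to the database and return all result rows.

    The client never parses or interprets the SQL; to it the query
    is an opaque string.
    """
    cursor = connection.execute(query, params)
    return cursor.fetchall()


# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (url TEXT, title TEXT, subject TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("http://example.org/a", "Economics links", "economics"),
        ("http://example.org/b", "Sociology links", "sociology"),
    ],
)

rows = run_query(
    conn,
    "SELECT title FROM records WHERE subject = ? ORDER BY title",
    ("economics",),
)
print(rows)
```

The same query string could be sent unchanged through Perl's DBI or JDBC; only the few lines that connect and fetch would differ.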

Architecture

Monica Bonett's architecture notes describe a component-based architecture for subject gateways, with functionality extended beyond search and browse [5]. This architecture does not as it stands specify the type of API between the various components, but specifies what the components should be and which components should be connected.

In two of our outputs we have focussed primarily on two main functional areas of the components diagram:

These are the core components although not by any means the only ones as the diagram shows. The general approach is to separate out the functional layers.

1: Creation, storage and harvesting

Creation and storage are very important parts of systems. Creation is complex, skilled and often system- and schema-specific. Often creation is an integral part of integrated systems, for example as in ROADS. This makes for inflexible systems which are hard to maintain and extend. Similarly with databases: storage is schema-specific, and so hard to change.

Consider the following model:

object      property    value

and some sample objects with properties and values:

resource12345   title       "Sosig: Welcome"
resource12345   description "Sosig home page"

resource98765   annotates   resource12345
resource98765   annotatedBy resource76543

resource76543   name        Libby Miller

If information is modeled in this way then there are database-schema independent ways of storing that information. The simplest is to have a three-column database table for the object, property and value.
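The three-column idea can be sketched as follows. This is an illustration only, assuming SQLite and the hypothetical column names `obj`, `prop` and `val`; it is not the schema used by either implementation described later.

```python
import sqlite3

# The sample statements from the text, as (object, property, value) rows.
TRIPLES = [
    ("resource12345", "title", "Sosig: Welcome"),
    ("resource12345", "description", "Sosig home page"),
    ("resource98765", "annotates", "resource12345"),
    ("resource98765", "annotatedBy", "resource76543"),
    ("resource76543", "name", "Libby Miller"),
]


def build_store(triples):
    """Create an in-memory three-column table and load the statements."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (obj TEXT, prop TEXT, val TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    return conn


def properties_of(conn, obj):
    """Return a {property: value} dict for one object."""
    rows = conn.execute(
        "SELECT prop, val FROM triples WHERE obj = ? ORDER BY prop", (obj,)
    )
    return dict(rows.fetchall())


store = build_store(TRIPLES)
print(properties_of(store, "resource12345"))

# A new kind of statement needs no schema change, only a new row.
store.execute(
    "INSERT INTO triples VALUES (?, ?, ?)",
    ("resource12345", "recommendedBy", "resource76543"),
)
```

The final insert shows the design choice: adding a previously unknown property is a data change rather than a schema change.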

Creation of data like this can be integrated into the system as a set of web forms for example. It can also be generated by generic tools which produce object property value outputs. RDF data creation tools can do this - several systems are available, notably RDFAuthor [14].

The architectural objective here is to separate the document format and the storage from the details of the implementation. As subject gateways change and add new functionality such as annotation, recommendations, personalization, and new properties of existing objects, it is expensive to rewrite the database schema and metadata generator. One system described below solves the database schema extensibility problem by storing all the information in a more efficient version of the three-column table described above. This has some potential associated scalability and efficiency problems. The alternative approach taken is to put some information into a three-column table and also have additional tables for certain types of objects and the properties and values you know or think you will need, such as people and their usernames, passwords and email addresses.

We can also separate out the metadata generation from the storage system. One implementation does this by using a standards-based XML interchange format for metadata, RDF. RDF documents use the object-property-value model used above, and can be generated by tools and consumed by other tools. Generation can be done with a generic or a specific generator, for example a web form or an RDF authoring tool, or a text editor. Consumption requires an RDF parser, of which there are many available at this time, for example ARP (Jena), SiRPAC (Stanford RDF API), RDFStore and Raptor (Redland). See the RDF Tools section below for locations.
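To make the consumption step concrete, here is a deliberately simplified sketch that reads object-property-value statements out of a small RDF/XML document using Python's standard XML library. It is not a real RDF parser: it only handles flat `rdf:Description` elements, and the sample document and the function name `simple_triples` are assumptions made for illustration. A production system would use one of the parsers named above.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical sample document, for illustration only.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/resource12345">
    <dc:title>Sosig: Welcome</dc:title>
    <dc:description>Sosig home page</dc:description>
  </rdf:Description>
</rdf:RDF>"""


def simple_triples(rdf_xml):
    """Yield (object, property, value) for flat rdf:Description elements only."""
    root = ET.fromstring(rdf_xml)
    for description in root.findall("{%s}Description" % RDF_NS):
        subject = description.get("{%s}about" % RDF_NS)
        for child in description:
            # ElementTree writes qualified names as "{namespace}local".
            namespace, local = child.tag[1:].split("}", 1)
            resource = child.get("{%s}resource" % RDF_NS)
            if resource is not None:
                value = resource
            else:
                value = (child.text or "").strip()
            yield (subject, namespace + local, value)


triples = list(simple_triples(SAMPLE))
for triple in triples:
    print(triple)
```

Because the output is plain object-property-value tuples, the storage layer does not need to know which tool produced the document.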

2: Query

The IMesh subject gateway review questionnaire found that Z39.50 and Whois++ were commonly used in subject gateways software [3b]. Both of these are protocols and query languages combined, which means that software which is written to interoperate with them has to be able to support the back and forth 'handshake' of communication as well as the query itself.

So to create an interoperability layer at this crucial place in the architecture, we could build a tool which maps between various methods of accessing data. But it is difficult to map between a protocol and a query language at once, whereas a mapping between query languages alone is much easier. Combining protocol and query language mixes up the means of talking to the database and the sorts of questions the database understands.
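A toy sketch of what keeping the two apart can look like: the query is just a string, and the way it reaches the query engine is interchangeable. Everything here is hypothetical (the `answer` engine, its canned data and the example endpoint URL), and no network request is made; the second function only builds a URL and decodes it as a server would.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def answer(query):
    """Stand-in query engine: a hypothetical lookup keyed by the query text."""
    canned = {"subject = economics": ["http://example.org/econ-feed"]}
    return canned.get(query, [])


def via_direct_call(query):
    # Transport 1: an in-process function call.
    return answer(query)


def via_http_style_url(query, endpoint="http://example.org/search"):
    # Transport 2: the query travels as a URL parameter.
    url = endpoint + "?" + urlencode({"q": query})
    received = parse_qs(urlsplit(url).query)["q"][0]
    return answer(received)


query = "subject = economics"
print(via_direct_call(query))
print(via_http_style_url(query))
```

Both routes give the same answer because the query text is unchanged by the transport; mapping between query languages would then be a separate, simpler task.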

Our aim was to build a system based on separating out the protocol from the query language. This architecture is similar to the SQL-based query language implementations. There are language specific APIs to SQL stores, which use a common query language. This makes for applications which can easily be written in different languages and which can nevertheless interoperate well.

With an SQL database you can access the data from a given database as long as you know how the data is stored in the sense of the tables and variable names used to represent the data, regardless of the programming language you are using as means to get at the data. In the subject gateways context this makes it easier for experts in different languages to access information from the same store.

In addition SQL is very flexible. It can accommodate change through a change in the tables-based structure of the storage and in a change to the query language.

One implementation below goes a step further in interoperability terms. A problem with relational databases is that as data changes the storage has to be redesigned, which is time-consuming to implement. In contrast, implementations of the object-property-value RDF model can be queried regardless of the way the data is stored. RDF query languages mirror the generic data structure rather than the particular storage strategy. This can have efficiency and scalability implications. However, it means that you do not have to redesign the structure of your database when you need to store more information. It also means that you need less information to query the database: instead of the particular format the data is stored in, all you need to know is the sorts of objects in the database and/or the properties that they have.
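The storage independence described here can be illustrated with a small sketch. The `match` function and the column names are assumptions for illustration, not part of any implementation described in this document: the same pattern-matching code runs over statements held in a plain list and over statements read from a relational table.

```python
import sqlite3


def match(statements, obj=None, prop=None, val=None):
    """Return statements matching the given terms; None acts as a wildcard."""
    return [
        (o, p, v)
        for (o, p, v) in statements
        if obj in (None, o) and prop in (None, p) and val in (None, v)
    ]


STATEMENTS = [
    ("resource12345", "title", "Sosig: Welcome"),
    ("resource98765", "annotates", "resource12345"),
]

# Source 1: statements held in a plain list (for example, parsed from a document).
from_list = match(STATEMENTS, prop="annotates")

# Source 2: the same statements held in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (obj TEXT, prop TEXT, val TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", STATEMENTS)
from_db = match(conn.execute("SELECT obj, prop, val FROM triples"), prop="annotates")

print(from_list == from_db)
```

The caller only needs to know which properties exist, not how or where the statements are stored. Filtering in application code like this is simple but is also where the efficiency concerns mentioned above come from.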

Implementations

Martin Hamilton's Perl implementation [6]

Martin Hamilton has written a prototype query schema which uses SQL as the query language and Perl's DBI as the API. It has a specific schema, incorporating a generic data store plus specific add-ons for annotations, security and recommendations.

query layer
searcher --> SQL database
(layered DBI API+ SQL)

Inkling

Libby Miller used an even more generic architecture. SQL query language mixes the way the information is stored with the data itself. To access it, you need to know how it is stored. Libby implemented a query language (SquishQL) for RDF which separates out the way the information is stored from the information itself. Inkling [7] is an open source project written in Java, with several contributors, part funded by IMesh toolkit and the Harmony project [8].

Why use RDF?

RDF is a model of objects ('resources') and their properties and values. [15] RDF is the W3C's standard for metadata [16], and is the primary format used for Dublin Core [17], a very important technology for respondents to the Imesh questionnaire [3c]. It can be written in XML syntax, another important technology for those respondents [3c]. RDF query is able to query any RDF source, i.e. documents and databases, because it does not care how the information is stored, only about how it is modeled.

Disadvantages include the relative immaturity of RDF technologies compared with, for example, relational databases. For this reason we chose to use a relational database as the backend database for RDF and map between RDF query and SQL.
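One possible way to map a query made of object-property-value patterns onto SQL is to turn each pattern into one copy of a three-column table and each shared variable into a join condition. The sketch below assumes SQLite, a hypothetical `triples` table, abbreviated property names and patterns written in (property, subject, value) order; it is an illustration of the general technique, not Inkling's actual mapping.

```python
import sqlite3


def patterns_to_sql(select_vars, patterns):
    """Translate (property, subject, value) patterns into one SQL self-join.

    Terms starting with '?' are variables; anything else is a constant.
    Returns (sql, parameters).
    """
    first_seen = {}  # variable name -> column where it was first bound
    conditions, params = [], []
    for i, (prop, subj, val) in enumerate(patterns):
        alias = "t%d" % i
        for column, term in (("prop", prop), ("obj", subj), ("val", val)):
            ref = "%s.%s" % (alias, column)
            if term.startswith("?"):
                if term in first_seen:
                    # Repeated variable: join on the earlier binding.
                    conditions.append("%s = %s" % (ref, first_seen[term]))
                else:
                    first_seen[term] = ref
            else:
                conditions.append("%s = ?" % ref)
                params.append(term)
    columns = ", ".join(first_seen[v] for v in select_vars)
    tables = ", ".join("triples t%d" % i for i in range(len(patterns)))
    sql = "SELECT %s FROM %s" % (columns, tables)
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


# Hypothetical data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (obj TEXT, prop TEXT, val TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("paper1", "dc:title", "RDF Query by example"),
    ("paper1", "dc:creator", "person1"),
    ("person1", "foaf:name", "Libby Miller"),
])

sql, params = patterns_to_sql(
    ["?name", "?title"],
    [("dc:title", "?paper", "?title"),
     ("dc:creator", "?paper", "?creator"),
     ("foaf:name", "?creator", "?name")],
)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Each extra pattern adds another self-join, which is one reason this style of storage can be slower than a purpose-built relational schema.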

Metadata creation is independent of storage and separate from query. Inkling does not have built-in metadata creation support, although it can harvest RDF files into the databases created by other tools.

creation/harvesting layer
metadata creator -->  RDF/XML doc 
RDF/XML doc --> database
(network API (uses http, RDF parser))
 
query layer
searcher --> RDF/XML doc
searcher --> RDF/XML database
(layered JDBC-like API+ RDFQL)

Here is a sample SquishQL query that can be answered by an installation of Inkling:

select ?name, ?title, ?identifier 
where 
(dc::title ?paper ?title)
(dc::creator ?paper ?creator)
(dc::identifier ?paper ?identifier)
(foaf::name ?creator ?name) 
(foaf::mbox ?creator mailto:libby.miller@bristol.ac.uk) 
using dc for http://purl.org/dc/elements/1.1/
foaf for http://xmlns.com/foaf/0.1/

And here is a possible result of that query in pictorial form:

Usecase: how to create a 'portal' using Inkling and RDF query

RSS 1.0 is a simple, extensible XML/RDF format, originally for syndication of news stories. In its simplest form, it consists of a list of links in a container (a 'channel') with an associated logo, name, description, and url.

RSS 1.0 [18], unlike earlier versions, is extensible - you can add 'modules' to describe different types of object: webpages (Dublin Core); events (events module). A portal-like interface can be built by displaying the RSS feeds from different sources, as the Sosig-grapevine system does [19].

To build a portal using Inkling, you first need to download and install the version which includes the Jakarta-Tomcat Java webserver. This allows you to create JSPs or servlets to present the results of your query to the user. Installing simply consists of starting up the webserver by typing ./bin/startup.sh (Linux) or ./bin/startup.bat (Windows). The Inkling code is in the common/lib directory as a .jar file.

In this example, data is downloaded directly from the web and stored in memory. Only a small change is needed to use a persistent (relational) database to store the data instead. To do this you use a "scutter" file to preload the database with the RDF data you are interested in, and use the API slightly differently. The queries are the same. For more detail, see the Inkling website [20].

The first thing you need to do is build a list of the RSS feeds you are interested in, like this one, for example.

This is a very simple list of RSS feeds classified according to their subject. Then you can get a list of feeds you want to display by asking a SquishQL query like this one:

Get me the RSS urls and titles where the subject matches the string 'economics'

select ?feedUrl, ?title  
where 
(dc::subject ?feedUrl ?subject)
(rss::title ?feedUrl ?title)
and ?subject ~ 'economics'
using rss for http://purl.org/rss/1.0/
dc for http://purl.org/dc/elements/1.1/

This technique is very useful if you would like to personalize the feed display according to subject. You could also personalize it according to the user id of a person, using a similar technique.
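What the feed-selection query computes can be sketched in a few lines of Python. The feed list below is hypothetical, and the sketch assumes that the `~` operator behaves as a case-insensitive substring match, which the text does not specify.

```python
DC = "http://purl.org/dc/elements/1.1/"
RSS = "http://purl.org/rss/1.0/"

# Hypothetical feed list as (object, property, value) statements.
FEED_LIST = [
    ("http://example.org/econ.rss", DC + "subject", "Economics"),
    ("http://example.org/econ.rss", RSS + "title", "Economics events"),
    ("http://example.org/law.rss", DC + "subject", "Law"),
    ("http://example.org/law.rss", RSS + "title", "Law news"),
]


def feeds_for_subject(statements, term):
    """Return (feed URL, title) pairs whose subject contains the search term."""
    subjects = {}
    titles = {}
    for feed_url, prop, value in statements:
        if prop == DC + "subject":
            subjects.setdefault(feed_url, []).append(value)
        elif prop == RSS + "title":
            titles[feed_url] = value
    needle = term.lower()
    return [
        (feed_url, titles[feed_url])
        for feed_url, values in sorted(subjects.items())
        if feed_url in titles and any(needle in v.lower() for v in values)
    ]


print(feeds_for_subject(FEED_LIST, "economics"))
```

Swapping the subject property for a property that links a feed to a user id would give the per-person variant mentioned above.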

Then you need to display the content of the RSS feeds themselves. The code for a JSP which does this is here.

This is the query it asks:

For each channel found, print the title and the url of each item

select ?item, ?ti, ?li 
where 
(rss::items http://chewbacca.ilrt.bris.ac.uk/events/events.xml ?seq)
(?contains ?seq ?item)
(rss::title ?item ?ti) 
(rss::link ?item ?li) 
using rss for http://purl.org/rss/1.0/

Here's a demo that shows this in action.
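For readers without a running Inkling installation, the effect of the item query can be approximated directly on the XML. The sketch below uses Python's standard XML library on a hypothetical RSS 1.0 feed; it follows the channel's item sequence and collects each item's title and link. It treats the feed as plain XML rather than as RDF, so it is a stand-in for the query, not an equivalent of it.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
RSS = "{http://purl.org/rss/1.0/}"

# Hypothetical feed, for illustration only.
SAMPLE_FEED = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/events.xml">
    <title>Example events</title>
    <link>http://example.org/</link>
    <description>Hypothetical feed</description>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://example.org/e1"/>
        <rdf:li rdf:resource="http://example.org/e2"/>
      </rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://example.org/e1">
    <title>Workshop</title>
    <link>http://example.org/e1</link>
  </item>
  <item rdf:about="http://example.org/e2">
    <title>Conference</title>
    <link>http://example.org/e2</link>
  </item>
</rdf:RDF>"""


def channel_items(feed_xml):
    """Return (title, link) for each item, in the order of the channel's sequence."""
    root = ET.fromstring(feed_xml)
    items = {item.get(RDF + "about"): item for item in root.findall(RSS + "item")}
    results = []
    for li in root.iter(RDF + "li"):
        # Accept both the qualified and the unqualified resource attribute.
        target = li.get(RDF + "resource") or li.get("resource")
        item = items.get(target)
        if item is not None:
            results.append((item.findtext(RSS + "title"), item.findtext(RSS + "link")))
    return results


for title, link in channel_items(SAMPLE_FEED):
    print(title, link)
```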

RSS modules and RDF query mean easy extensibility

Example: events data from LTSN economics. RSS 1.0 can be extended by adding modules, such as the events module [21]. This means that simple RSS readers (like a Perl regex) can still display an RSS feed, because they ignore the parts they don't understand. More complex displayers can do interesting things with the new information. Some examples showing display in a calendar format are here [22].

Inkling can do this by adding to the query. The first query below is the original query to get the RSS feed item titles and links. The second query asks for their startdate, enddate and location as well.

select ?item, ?title, ?link 
where 
(rss::items 
  http://chewbacca.ilrt.bris.ac.uk/events/events.xml ?seq)
(?contains ?seq ?item) 
(rss::title ?item ?title) 
(rss::link ?item ?link)
using
rss for http://purl.org/rss/1.0/

select ?title, ?link, ?start, ?end, ?location 
where 
(rss::items 
  http://chewbacca.ilrt.bris.ac.uk/events/events.xml ?seq)
(?contains ?seq ?item) 
(rss::title ?item ?title) 
(rss::link ?item ?link)
(ev::startdate ?item ?start) 
(ev::enddate ?item ?end) 
(ev::location ?item ?location) 
using rss for http://purl.org/rss/1.0/ 
ev for http://purl.org/rss/1.0/modules/event/
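The earlier point that a simple reader can ignore modules it does not understand can be sketched with a regular expression, here in Python rather than Perl. The item below is hypothetical; the reader picks out only the title and link and is unaffected by the extra event elements.

```python
import re

# Hypothetical RSS 1.0 item carrying extra event-module elements.
ITEM_WITH_EVENT_MODULE = """<item rdf:about="http://example.org/e1">
  <title>Workshop</title>
  <link>http://example.org/e1</link>
  <ev:startdate>2002-04-10</ev:startdate>
  <ev:location>Bristol</ev:location>
</item>"""

# Match the first title and link inside each item; everything else is skipped.
TITLE_AND_LINK = re.compile(
    r"<item\b.*?<title>(.*?)</title>.*?<link>(.*?)</link>", re.DOTALL
)


def simple_reader(feed_text):
    """Return (title, link) pairs, ignoring any elements it does not know."""
    return TITLE_AND_LINK.findall(feed_text)


print(simple_reader(ITEM_WITH_EVENT_MODULE))
```

A regular expression like this is fragile against variations in the XML, which is part of the case for querying the RDF model instead.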

Here are some more examples of queries you could use to extend these ideas:

Combine, store and optimise feeds

"find me all the events starting in April 2002 from all feeds which can be
picked out using the search term 'economics'" 

Use different modules: bookmarks, learning objects

"find all the learning objects which can be picked out using the search
term economics" 

Personalize

"select events from feeds picked by people who match my SOSIG Grapevine
profile"

There is a list of other use cases, and instructions on how to create them yourself, on the Inkling website [23].

Technical Issues

The Inkling software is demonstration software, and it is not scalable to the size of a gateway like Sosig, for example. It has been tested with 30,000 object-property-value statements and performs reasonably well, but that represents only approximately 1,500 records (about 20 statements per record), so it is not suitable for most subject gateways. A major issue is string searching, at which it performs poorly compared with exact string or URI matching; this is a product of the underlying database used (PostgreSQL [24]).

However there are many other RDF databases available which are much more scalable - see the RDF Tools section below for some links.

Martin's work is more scalable and has been tested with approximately 30,000 records. The main difference seems to be the underlying database used - in Martin's case MySQL [25] which is known to have better text-matching capabilities.

A significant problem with both sets of tools is the degree of technical expertise required to use them. The IMesh questionnaire indicated [3d] that the presence of technical skills in subject gateways could not be assumed, and both tools require a level of technical understanding and time far above what would be expected of technical support. In particular, tools for the creation of metadata would have to be implemented by anyone using either implementation. This can be done in a technically simple way (for example by writing CGI or JSP pages), but the complexity of the user interface requirements makes it non-trivial.

RDF Tools

Documents funded by IMesh toolkit from ILRT

http://ilrt.org/discovery/2001/02/imeshdb/
A writeup of the work involved in finding out what sorts of people and projects were on the IMesh mailing list, and so what the 'IMesh community' could be said to consist of.
Libby Miller

http://sw1.ilrt.org/discovery/2001/01/imesh/
A database of the people and projects represented on the IMesh mailing list, implemented in RDF using SquishQL/Inkling.
Libby Miller

http://sw1.ilrt.org/discovery/2001/04/imeshqs/
The IMesh questionnaire raw results.

http://ilrt.org/discovery/2001/05/imeshqs/
The IMesh questionnaire results analysis.
Libby Miller, Monica Bonett

http://ilrt.org/discovery/2000/09/imesh/
The IMesh subject gateways literature review.
Libby Miller, Dan Brickley, Martin Hamilton

http://ilrt.org/discovery/2000/07/itk-sgr/
The IMesh subject gateways review plan.
Dan Brickley, Libby Miller

Papers funded in part by IMesh

http://ilrt.org/discovery/2000/09/metamesh/
Search Mesh Topology and Visualisation
Dan Brickley

http://ilrt.org/discovery/2001/07/KCAP-ann/
Using RDF to annotate the (semantic) web (Accepted at KCAP Annotations workshop, 2001)
Phil Cross, Libby Miller, Sean Palmer

http://ilrt.org/discovery/2002/04/query/
RDF Query by example (Invited talk, Netlab and Friends 2002)
Libby Miller

References

[1] http://www.imesh.org/toolkit

[2] http://www.jiscmail.ac.uk/lists/IMESH.html

[3] http://www.ilrt.bris.ac.uk/discovery/2001/05/imeshqs/

[3a] http://sw1.ilrt.org/discovery/2001/04/imeshqs/bynumber.jsp?num=http%3A%2F%2Filrt.org%2Fdiscovery%2F2001%2F03%2Fimesh%2Fquestionnaire.rdf%234i

[3b] http://sw1.ilrt.org/discovery/2001/04/imeshqs/bynumber.jsp?num=http%3A%2F%2Filrt.org%2Fdiscovery%2F2001%2F03%2Fimesh%2Fquestionnaire.rdf%234f2

[3c] http://sw1.ilrt.org/discovery/2001/04/imeshqs/bynumber.jsp?num=http%3A%2F%2Filrt.org%2Fdiscovery%2F2001%2F03%2Fimesh%2Fquestionnaire.rdf%2312

[3d] http://sw1.ilrt.org/discovery/2001/04/imeshqs/bynumber.jsp?num=http%3A%2F%2Filrt.org%2Fdiscovery%2F2001%2F03%2Fimesh%2Fquestionnaire.rdf%238

[4] dan2000-10-17.txt

[5] http://www.imesh.org/toolkit/work/architecture/

[6] http://martinh.net/hsemi3/

[7] http://swordfish.rdfweb.org/rdfquery/

[8] http://metadata.net/harmony/

[9] Monica and Richard's work on Imesh

[10] http://www.foldoc.org/

[11] http://www.dlese.org/

[12] http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm

[13] http://www.openarchives.org/

[14] http://rdfweb.org/people/Damian/RDFAuthor

[15] http://www.w3.org/TR/rdf-mt/

[16] http://www.w3.org/TR/REC-rdf-syntax/

[17] http://www.dublincore.org/

[18] http://purl.org/rss/1.0/

[19] http://www.sosig.ac.uk/gv/

[20] http://swordfish.rdfweb.org/rdfquery/howto.html

[21] http://groups.yahoo.com/group/rss-dev/files/Modules/Proposed/mod_event.html

[22] http://sw1.ilrt.org/discovery/2002/05/rsscal/
http://sw1.ilrt.org/discovery/2002/05/rsscal/calmonth.jsp?url=http%3A%2F%2Fchewbacca.ilrt.bris.ac.uk%2Fevents%2Fevents.xml&date=&mbox=&rdfweburl=

[23] http://swordfish.rdfweb.org/rdfquery/downloads.html

[24] http://www.postgresql.org

[25] http://www.mysql.com/