An RDFWeb Aggregation Service for the ILRT

Libby Miller <libby.miller@bristol.ac.uk>
ILRT
Dan Brickley <danbri@w3.org>
W3C

Abstract

RDFWeb is a project that combines open source/free software tools, data and documentation to help people create RDF databases or 'aggregation nodes' of people and the things that they create or participate in or belong to - organisations, workplaces, projects, documents, depictions. In doing so we can find various different ways of connecting people, and visualizing these connections, accessing the data through HTML, SOAP, XML/RDF and N-triples.

This document describes how we created an RDFWeb aggregation node with data from sources for one particular organisation, the Institute for Learning and Research Technology (ILRT), a department of the University of Bristol in the UK. The ILRT is a complex, decentralised organisation with multiple connections between people, projects and technologies within it. An RDFWeb node enabled some of these connections to be made more evident, and enabled information about the ILRT to be presented more effectively both externally and within the organisation.

This document describes some of the technical and social obstacles to the creation of a tool that an organisation can use, internally and externally, to examine the data it holds about itself.

Keywords: RDF aggregation, Semantic Web, DAML

Word count: 3809

1. Introduction

RDFWeb [1] is a project that combines open source/free software tools, data and documentation to help people create RDF databases or 'aggregation nodes' of people and the things that they create or participate in or belong to - organisations, workplaces, projects, documents, depictions. In doing so we can find various different ways of connecting people, and visualizing these connections, accessing the data through HTML, SOAP, XML/RDF and N-triples.

It is hoped that RDFWeb will provide a forum for ongoing sanity-checking for the development of the Semantic Web, a place to tackle some of the difficult technical issues involved in aggregating and presenting RDF data while providing a practical and useful service and without waiting for all the issues to be resolved.

RDFWeb focuses on people because people do things: they create documents and multimedia objects; they join and leave organisations and projects; they know other people and have meetings with them. Earlier RDFWeb [2] nodes have focussed on finding social connections between people, in the manner of SixDegrees.com [3], and connections formed by co-writing papers (for example, Erdos numbers [4]). Within an organisation there may also be social and co-authorship connections, but there may in addition be connections according to the projects or groupings people work under, and according to the technical or subject-based interests they pursue.

This document describes how we created an RDFWeb aggregation node with data from sources for one particular organisation, the Institute for Learning and Research Technology (ILRT) [5], a department of the University of Bristol [6] in the UK. The ILRT is a complex, decentralised organisation with multiple connections between people, projects and technologies within it. An RDFWeb node enabled some of these connections to be made more evident, and enabled information about the ILRT to be presented more effectively both externally and within the organisation. RDF was chosen to make these connections because several individuals at the ILRT are involved in Semantic Web technologies, and the ILRT itself provided an interesting subject for experimentation with the technologies we had been working on.

This document is not primarily about the particular data management systems used. There are many powerful RDF databases, query languages and integrated RDF data management systems available [7], offering more or less functionality in harvesting and querying RDF data, any of which could have been used to create an RDFWeb aggregation node for the ILRT.

Instead, this is about describing some of the technical and social obstacles to the creation of a tool for an organisation to use internally and externally to examine the data that it holds about itself: locating the data the organisation holds about itself, much of which is not in machine-processable form; persuading people to create and maintain the missing pieces; modeling the data consistently; and finding a suitable RDF data management system.

Finding a suitable RDF data management system is one of the easier items in this list. This is not to say that the ILRT was resistant to the initiative - quite the reverse. However, all these steps are important for a complete picture of an organisation, and they do not just happen.

2. ILRT

The ILRT consists of many different projects, some working together. Many people work on more than one project. All the projects are funded externally, and most involve external partners; many cover similar technical issues. Projects usually have a duration of between 1 and 5 years and employ between 2 and 10 people at the ILRT. ILRT currently employs about 80 people in 40 projects.

The complexity of the organisation means that communication can sometimes be difficult. When finding partners for a funding proposal, it is easy to miss work going on already in the ILRT. Many people do not know about all the current projects, especially when these are new. The size is sufficiently large that not everyone knows everyone else.

A major problem is the promotion of the activities of the ILRT internally to the University of Bristol and externally to the wider world. As with many organisations, different types of data about the ILRT are held in different sources, in this case both in different places at the ILRT and centrally at the University of Bristol. Presenting this data to people who want to find out about the ILRT is difficult and tends to involve duplication.

3. Data about the ILRT

Data about documents written by members of the department are held in a central university database, where they have been entered by administrative staff and the authors of the documents. They are held centrally so that the University and departments can structure the data required for the UK Higher Education Research Assessment Exercise, which helps determine the amounts of funding going to departments and the university. Bristol also has plans for a university 'data hub'[8], where data about departments will be managed centrally and through which data around the university will be kept consistent.

Data about ILRT publications was incomplete at the time of the project, apparently because people see no immediate incentive to create the data, which is not made available on the web to individuals outside the University of Bristol. In addition, the typical publication produced by members of the ILRT is a web document, a piece of software or a demonstrator, rather than a more traditional formal conference or journal publication.

Other data about the ILRT is held in local databases. Examples are information about who belongs to which project, who used to work on which project and current, past and future calendar data. Much of this data is held on Windows machines in Access databases to enable administrative, financial and management staff to add data easily.

Other information is not explicitly written down in machine-processable format. Some information about people is described in HTML pages on the website, the editing of which is devolved to individual members of staff. Much data is also kept in Word files on the intranet. Many informal discussions and meetings are never documented, and information about them exists only in people's heads.

The data at the ILRT is often stored in the ways easiest for those who need to edit it. But this also means that data is widely spread and may become inconsistent. In addition, the large number of links between people, projects and other entities means that it is difficult to get an overview of the work of the ILRT. Several awaydays and meetings made it clear that information was not being shared optimally among people and projects at the ILRT.

4. Modeling the data

Because data was being collected from so many different sources it was important to decide on a modeling style that could make the data consistent. We did not want to make the mistake of conflating the identifier for an entity with the entity itself, whether that is an internal identifier for, say, a person, within a particular database, or a more generally-known identifier such as name or an email address.

However, there are advantages to having well-known identifiers for objects, as these provide coordination points for other data-creators, and involve minimal effort in the discovery of identifiers. We chose to use email addresses and webpages where available not as identifiers for entities, but as descriptive expressions which could serve to identify objects without well-known identifiers themselves [9].

For example:
"the person whose mailbox is libby.miller@bristol.ac.uk"
"the project whose homepage is http://www.sosig.ac.uk/"
"the group whose email address is ilrt-semanticweb@bristol.ac.uk"

We used the 'friend of a friend' (foaf) [10] vocabulary as a schema for the modeling, adding properties and classes as we needed them. Foaf is a schema used for experimentation in RDFWeb; it contains some basic types in a very flat structure, and a number of useful properties such as name, mbox and knows. In some places it overlaps with the vCard [11] schema and other similar contact schemas such as that developed by the W3C [12]. When new classes are added we subclass Wordnet classes [13],[14]. The descriptive expressions used for identifying objects without identifiers are annotated with daml:UniqueProperty [15], to express the fact that properties such as foaf:mbox uniquely pick out individuals.
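The descriptive-expression pattern can be sketched as triples: each entity is an anonymous (blank) node, described only by a uniquely-identifying property such as foaf:mbox. A minimal illustration in Python - the helper function is ours, not part of any RDF toolkit:

```python
# Sketch of "the person whose mailbox is X": the person is a blank node
# (no well-known identifier), identified only via foaf:mbox.

FOAF = "http://xmlns.com/foaf/0.1/"

def describe_person(bnode_id, mbox, name):
    """Emit N-Triples-style tuples describing 'the person whose mailbox is <mbox>'."""
    return [
        (f"_:{bnode_id}", f"<{FOAF}mbox>", f"<mailto:{mbox}>"),
        (f"_:{bnode_id}", f"<{FOAF}name>", f'"{name}"'),
    ]

triples = describe_person("p1", "libby.miller@bristol.ac.uk", "Libby Miller")
for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

Any other data source that mentions the same mailbox can then be merged with this description, even though neither source knows the other's blank node identifier.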

5. Creating RDF

Because ILRT data is stored in disparate sources, creation of RDF data had to take place in several different ways.

RDF modeling using string and paper

An advantage of the modeling technique described above was that it was easy to understand for individuals not used to modeling data. The RDF model is very straightforward, and fits well with how many people seem to understand the world: as a series of objects with links of different types between them. Anecdotally, people seem to find it obvious and straightforward that objects can have types and that the links between them can be of different sorts. However, even people with substantial experience of RDF find the RDF/XML syntax [16] very offputting and difficult to write and understand. N3 syntax [17] is easier to write accurately, but still needs to be checked using the validator [18].

Because some of the data we had was not in machine-readable format we needed a tool that could help us efficiently create RDF instance data, and preferably which individuals who did not know or care about RDF could also use and understand.

As a preliminary, as part of staff development week, members of the ILRT created links between people, projects, funders and technologies using wool to connect paper representations of projects, funders and subjects. The wool had different colours, representing different types of links between projects, people, funders, interests and subjects. This gave a dramatic visual representation of the complexity of the organisation, and prompted discussion about the connectedness of the ILRT.

Photos of the mass of connections between people, projects, funders and subjects at the ILRT.

RDFAuthor-created data

This data was then transferred to machine-readable format using RDFAuthor [19], a tool for the visual creation of RDF instance data. Because RDFAuthor is simple to use, several members of the ILRT were able to input the data using the principles described above. RDFAuthor uses predefined schema elements and a drag and drop user interface to allow fast, accurate creation of RDF instance data.

An image of the Harmony project in RDF
An image of the Harmony project in RDF from RDFAuthor

Centrally-held data as a dump from a university database

At the University of Bristol, all publications are held in the central IRIS [20] database. People input them using a web form, and individual departments can download dumps of the data. The university uses several different internal identifiers for people and for publications.

Queries of the university database using SQL and Perl enabled us to produce RDF data dumps, using the same principles as above to ensure that people were linked by email address and documents by URL where possible. Where a document had no URL there was a danger of duplicate records: if two individuals had collaborated on the same document, it might appear in two files, or in two places in the same file, under different identifiers. In practice the controlled input for documents removed this difficulty. There is also another way of avoiding this problem, described below in the section about 'smushing' [9].
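The dump-to-RDF step can be sketched as follows. The row keys (title, url, author_mbox) and the example URL are assumptions, not the real IRIS column names; the point is that authors are linked by mailbox and documents by URL, so no internal database identifiers leak into the RDF:

```python
# Hypothetical sketch: one publications-database row becomes triples that
# identify the author by mailbox and the document by URL.

DC = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"

def row_to_triples(row, doc_node="_:doc", person_node="_:person"):
    """Map one (title, url, author_mbox) row to N-Triples-style tuples."""
    triples = [
        (person_node, FOAF + "mbox", "mailto:" + row["author_mbox"]),
        (doc_node, DC + "creator", person_node),
        (doc_node, DC + "title", row["title"]),
    ]
    if row.get("url"):  # only documents that actually have a URL get one
        triples.append((doc_node, DC + "identifier", row["url"]))
    return triples

row = {"title": "An RDFWeb Aggregation Service for the ILRT",
       "url": "http://ilrt.example/paper",  # hypothetical URL
       "author_mbox": "libby.miller@bristol.ac.uk"}
triples = row_to_triples(row)
```

A document with no URL simply remains a blank node, carrying the duplication risk described above.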

Local data from ILRT databases.

A particularly useful database to have included would have been the ILRT's calendar data, held in an Access database. Unfortunately, people and projects are not modeled within its table structure, making export for these purposes impractical. However, there are other databases held at the ILRT concerning past and present links between projects and people. These were taken as XML dumps and transformed to RDF using XSLT.

Hand-created XML

Where parts of the data were missing or incomplete we researched and wrote small pieces of XML/RDF data by hand, and validated the RDF against the RDF validator [18].

6. RDF data storage

Both authors of this paper have worked on the Inkling RDF query implementation [21], which includes an SQL database backend, so for prototyping purposes we used this system because we were familiar with it. For the amount of data we created (approximately 5000 triples) this system is adequate; larger organisations or larger data sets would require a more efficient system. There is a strong advantage to using an SQL database rather than Berkeley DB [26], which (for example) Jena [22] and Redland [23] use as persistent storage at the time of writing: the number of people who can set up and manage a free SQL database like PostgreSQL [24] or MySQL [25] is far greater than the number who can install and run Berkeley DB.

In addition, the presence of a simple RDF query language in the system meant that setting up a user interface to the system was simple and fast compared with querying the data at an API level.

Many systems are available which support this type of functionality; the important data management decision was to create RDF/XML files so that we could try out several RDF databases - if the experience of managing one system is lost, the data is not. The data was exported to files which, after some discussion, were made public, since without the calendar information all the data was already in the public domain.

The data was fed into a very simple RDF database which uses PostgreSQL as a persistent backend. The database contains only a single table, describing the triples in full text form as subject, predicate and object, with a boolean to identify whether the object is a resource or literal, and an identifier for the triple.
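The one-table layout can be sketched as below, with sqlite3 standing in for PostgreSQL for illustration. The column names are assumptions; the paper specifies only that the table holds subject, predicate and object in full text form, a resource/literal flag, and a triple identifier:

```python
# A minimal stand-in for the single-table triple store described above.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE triples (
        id          INTEGER PRIMARY KEY,   -- identifier for the triple
        subject     TEXT NOT NULL,
        predicate   TEXT NOT NULL,
        object      TEXT NOT NULL,
        is_resource INTEGER NOT NULL       -- 1 = resource, 0 = literal
    )
""")
conn.execute(
    "INSERT INTO triples (subject, predicate, object, is_resource) "
    "VALUES (?, ?, ?, ?)",
    ("_:p1", "http://xmlns.com/foaf/0.1/mbox",
     "mailto:libby.miller@bristol.ac.uk", 1),
)
count = conn.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
```

The simplicity of this schema is what makes the store so flexible: new kinds of data need no new tables.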

7. Getting the data out again using an RDF query language

We used SquishQL, a simple RDF query language based on rdfDB's [27] query language [28], to export the data to HTML using the Jakarta Tomcat Java servlet engine and JSPs.

Here are some example queries:

Find me the names and homepages of things (projects) which are the projects of things (people) with mbox libby.miller@bristol.ac.uk

SELECT ?homepage, ?name 
WHERE 
(foaf::project ?person ?project) 
(foaf::mbox ?person mailto:libby.miller@bristol.ac.uk) 
(foaf::homepage ?project ?homepage)
(foaf::name ?project ?name)
USING
foaf FOR http://xmlns.com/foaf/0.1/

Results:

http://metadata.net/harmony Harmony
http://www.imesh.org/toolkit/ Imesh Toolkit

Find me the homepages of things (projects) which are the past projects of things (people) with mbox libby.miller@bristol.ac.uk

SELECT ?homepage, ?name
WHERE
(foaf::pastProject ?person ?project) 
(foaf::mbox ?person mailto:libby.miller@bristol.ac.uk)
(foaf::homepage ?project ?homepage)
(foaf::name ?project ?name)
USING
foaf FOR http://xmlns.com/foaf/0.1/

Results:

http://www.desire.org/ Desire
http://www.bized.ac.uk/ Biz/ed
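One way such queries can be answered over the one-table store is to turn each triple pattern into a self-join on the triples table. This is an illustrative sketch, not Inkling's actual translation, and it uses sqlite3 with a simplified three-column table:

```python
# Each SquishQL triple pattern becomes one alias of the triples table;
# shared variables become join conditions between aliases.
import sqlite3

FOAF = "http://xmlns.com/foaf/0.1/"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("_:p1",    FOAF + "mbox",     "mailto:libby.miller@bristol.ac.uk"),
    ("_:p1",    FOAF + "project",  "_:proj1"),
    ("_:proj1", FOAF + "homepage", "http://metadata.net/harmony"),
    ("_:proj1", FOAF + "name",     "Harmony"),
])

# (foaf::mbox ?person <mbox>) (foaf::project ?person ?project)
# (foaf::homepage ?project ?homepage) (foaf::name ?project ?name)
rows = conn.execute("""
    SELECT t3.object, t4.object
    FROM triples t1, triples t2, triples t3, triples t4
    WHERE t1.predicate = ? AND t1.object = 'mailto:libby.miller@bristol.ac.uk'
      AND t2.predicate = ? AND t2.subject = t1.subject
      AND t3.predicate = ? AND t3.subject = t2.object
      AND t4.predicate = ? AND t4.subject = t2.object
""", (FOAF + "mbox", FOAF + "project",
      FOAF + "homepage", FOAF + "name")).fetchall()
print(rows)  # [('http://metadata.net/harmony', 'Harmony')]
```

The cost of each extra triple pattern is one extra join, which is why larger data sets need a more efficient store than this prototype.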

The database is available via HTML as a display of some predefined queries such as those described above. There is also a SquishQL query language interface, where you can ask your own queries of the data. Finally, we have made a SOAP interface available using Axis and SquishQL [29].

Feedback has generally been positive, with individuals quickly noticing errors and suggesting enhancements.

8. Issues

Various technical and social issues have arisen during the creation of this and other RDFWeb nodes. Here we present some of the more important ones.

'Smushing' [9]

The model we have used describes a particular person as "the thing with the foaf:mbox libby.miller@bristol.ac.uk", and a particular project as "the thing with foaf:homepage http://www.sosig.ac.uk/". Now consider the following query:

SELECT ?title, ?url, ?mbox
WHERE
(foaf::project ?person ?project) 
(foaf::homepage ?project http://www.sosig.ac.uk/)
(foaf::mbox ?person ?mbox)
(dc::creator ?doc ?person)
(dc::title ?doc ?title)
(dc::identifier ?doc ?url)
USING
dc FOR http://purl.org/dc/elements/1.1/
foaf FOR http://xmlns.com/foaf/0.1/

i.e. get me the title and url of documents created by people who work on Sosig.

Because neither people nor projects have direct identifiers, the clauses

(foaf::mbox ?person ?mbox)

and

(dc::creator ?doc ?person)

may bring back different internal identifiers for ?person. This could be because these pieces of information come from different databases, which use different identifiers for people. Or it could be that they are simply represented by different internal identifiers created by the RDF parser for blank nodes. Either way, the fairly simple query above will not be resolvable unless the database can be told, or can discover, that the two representations of ?person in fact refer to the same entity.

This can be done in various ways. The one we chose was to manage the data as it is entered into the database: as information about specified predicates is added, the database is first queried to see whether another representation of the node already exists; if so, the original identifier is reused. There are various issues with this approach. For example, some identifiers may be 'better' than others - some may be URIs and others internal b-node identifiers - and the system should distinguish between these.
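The add-time strategy can be sketched as follows. This is assumed code, not the system's actual implementation; foaf:mbox is real, the class and its internals are ours:

```python
# Smushing on insertion: values of uniquely-identifying properties (here
# just foaf:mbox) are checked against the store, and an already-known
# identifier replaces a new blank node id.

FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"
UNIQUE_PROPERTIES = {FOAF_MBOX}

class SmushingStore:
    def __init__(self):
        self.triples = set()
        self.canonical = {}  # (unique property, value) -> canonical node id
        self.alias = {}      # duplicate blank node id -> canonical node id

    def add(self, s, p, o):
        s = self.alias.get(s, s)
        o = self.alias.get(o, o)
        if p in UNIQUE_PROPERTIES:
            known = self.canonical.get((p, o))
            if known is not None and known != s:
                # s duplicates a known node: record the alias and rewrite
                # triples already stored under the duplicate identifier
                self.alias[s] = known
                self.triples = {(known if x == s else x, pred,
                                 known if y == s else y)
                                for x, pred, y in self.triples}
                s = known
            else:
                self.canonical[(p, o)] = s
        self.triples.add((s, p, o))

store = SmushingStore()
store.add("_:a", FOAF_MBOX, "mailto:libby.miller@bristol.ac.uk")
store.add("_:b", "http://xmlns.com/foaf/0.1/name", "Libby Miller")
store.add("_:b", FOAF_MBOX, "mailto:libby.miller@bristol.ac.uk")
# _:b is recognised as the same person as _:a, so the store holds just
# two triples, both attached to _:a
```

Note that this sketch treats all identifiers as equal; as discussed above, a real system should prefer URIs over internal b-node identifiers when choosing which one survives the merge.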

Provenance

In the context of RDFWeb, the ILRT aggregation node is a rather unusual case, since previous aggregation nodes have been open to new additions by anyone, whereas the ILRT node was created by a team of trusted individuals. This allowed us to ignore an important issue for the storage, querying and presentation of information in RDFWeb, namely tracking the source of the data.

To take a simple example, consider the image below.

RDFWeb for danbri@w3.org
An RDFWeb page on an 'open' node for Dan Brickley [30]

This is an RDFWeb node to which anyone can add data. It happens that in one RDF file entered into the database, the foaf:name of the person with foaf:mbox danbri@w3.org is given as "Danny Boy", while in another file it is given as "Dan Brickley". The system comes up with both names. In this case the inclusion of a different name might merely make the individual concerned seem less serious, but other additions might associate libelous or offensive material with that person. The presence of libelous or offensive material on the web is hardly new or unusual, but its presence on one aggregated page might mislead users and limit the usefulness of the RDFWeb service.

Consider an analogy with Google. Search Google for "Dan Brickley" and you get a fair proportion of the work that Dan has done in the past - it makes a reasonable CV for him. In a system like RDFWeb, people might prefer the most prominent information about them to be information they themselves had written about their work, rather than things others had written about it. This is analogous to search engines which allow companies to dictate the order in which links about them appear, for promotional purposes.

For the trained observer it is fairly easy to discover who has said what about whom on the web; but it would be better still if this information were made explicit, so that you could ask of any piece of information: "where did this come from?" and "what other information has come from this source?". This is a simple matter of tracking where information came from, which can be done even when metadata about that information - for example, who actually wrote it - is not present: because RDF/XML is stored in an HTTP-retrievable form, you can fetch the source document and derive certain types of information from it.

This requirement for provenance-tracking is reflected in many systems for storing RDF, in which the URL of the source of RDF data is stored alongside the data it contained. However this information is not exposed in many APIs, which focus on the RDF model rather than on information about the storage of the model.

This information is also not reflected in the query language used above, so provenance tracking is difficult to offer to the user within this system. It is possible to create a separate database containing provenance information, linked off hashed values of the triples. This aspect of the storage of RDF needs further examination.
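The side-table idea can be sketched as follows: provenance lives in a separate mapping from a hash of each triple to the URLs of the source documents that asserted it. The function and variable names are ours, and the second source URL is hypothetical:

```python
# Provenance kept apart from the triple store, keyed on a triple hash.
import hashlib

def triple_hash(s, p, o):
    """Stable hash of a triple, usable as a key in a provenance table."""
    return hashlib.sha1(f"{s}\x00{p}\x00{o}".encode("utf-8")).hexdigest()

provenance = {}  # triple hash -> set of source document URLs

def record(triple, source_url):
    provenance.setdefault(triple_hash(*triple), set()).add(source_url)

t = ("_:d", "http://xmlns.com/foaf/0.1/name", "Dan Brickley")
record(t, "http://rdfweb.org/people/danbri/foaf.rdf")
record(t, "http://example.org/other.rdf")  # hypothetical second source
sources = sorted(provenance[triple_hash(*t)])
```

With such a table, "where did this come from?" becomes a single lookup, without the triple store itself needing to change.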

An alternative (or perhaps complementary) model is to have RDF aggregation nodes with different policies for the trustability of their information. You could then query the ILRT node knowing that the data is trustworthy, perhaps through machine-processable assertions involving web-of-trust mechanisms, but query a generic RDFWeb node with the same caution you would apply to any web search engine.

Consistency

This database does not control consistency between the source databases: it is a read-only amalgamation of data from different sources and does not change the data in them. It may, however, make consistency checking easier. For example, if calendar data from private calendars maintained by individuals is combined with information from an ILRT calendar, querying the combined data will flag up inconsistencies and clashes, even if neither source database can synchronise with the other or perform consistency checking itself.
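The clash-flagging idea can be illustrated with a small sketch: once events from several calendars sit in one store, a single pass over a person's events finds overlaps. The event fields are assumptions, not a real calendar schema, and the sort-and-adjacent check is a simplification rather than a complete interval-overlap algorithm:

```python
# Detecting clashes in merged calendar data.
from datetime import datetime

def clashes(events):
    """Return pairs of event titles for the same person whose times overlap.

    Checks adjacent events after sorting, which catches simple pairwise
    overlaps; a full implementation would compare all overlapping intervals.
    """
    out = []
    events = sorted(events, key=lambda e: (e["who"], e["start"]))
    for a, b in zip(events, events[1:]):
        if a["who"] == b["who"] and b["start"] < a["end"]:
            out.append((a["title"], b["title"]))
    return out

events = [
    {"who": "mailto:libby.miller@bristol.ac.uk", "title": "Harmony meeting",
     "start": datetime(2001, 11, 5, 10), "end": datetime(2001, 11, 5, 11)},
    {"who": "mailto:libby.miller@bristol.ac.uk", "title": "ILRT awayday",
     "start": datetime(2001, 11, 5, 10, 30), "end": datetime(2001, 11, 5, 12)},
]
found = clashes(events)
```

Neither source calendar needs to know about the other; the clash only becomes visible once the data is aggregated.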

Flexibility

A primary reason for using RDF to examine data from different sources is its flexibility. This is most evident in the database storage and in the querying.

Suppose that we wanted to include calendar data in the system. The advantage of using RDF is that we would not have to change the structure of the database used to encompass this data. Instead we would get an export of the calendar data as RDF and insert it into the RDF database as is, and then use the query language to retrieve the data. We can do this for any data we consider has the appropriate level of privacy for the database concerned, and as long as we use the simple modeling rules described above and known descriptive expressions for objects, then this data is linked to other data about the people, projects, funders and subjects in the database.

9. Conclusions

The experiment of creating an RDFWeb aggregation service for the ILRT has been interesting because of the social aspects of collecting data from a diverse and decentralised organisation. For example, in this particular organisation there is a need to obtain information directly from the individuals who make up the organisation, because much information is not stored in machine-processable format. This means either that they have to be 'interviewed' to obtain this information, or that we need to provide a simple mechanism for continually creating and updating it, such as a tool like RDFAuthor. If the creation of this information gives an immediate payback (such as appearing on the organisation or project website) then this provides an incentive to create it.

A second conclusion is that it is easy to imagine that one day all the data in an organisation will be well-organised and available in one centrally managed, well-designed database. In practice there are often good reasons why data is stored where it is, for example because it is easy for the people who need to access the data to do so. Most importantly, the data an organisation needs is likely to change over time, so there is never a time-independent optimal information storage system for an organisation, only a series of databases each of which is optimal at a particular time. The overhead of migrating data from one best database to the next may be very high, involving redesign of the database. At some points in time this is the best thing for the organisation; in the meantime, one use of RDF is to quickly create visualisations of the data in the organisation without the high overhead of creating databases specific to the shape of the data.

Acknowledgements

Thanks to Ian Sealy, Dave Beckett, Grainne Conole, Debra Hiom, Damian Steer.

References

[1] RDFWeb http://rdfweb.org

[2] Sample RDFWeb node: http://swordfish.rdfweb.org/rweb/who

[3] Sixdegrees http://sixdegrees.com (now defunct)

[4] Erdos numbers: Caspar Goffman. And what is your Erdos number? American Mathematical Monthly, v. 76 (1969), p. 791; also Erdos number project http://www.oakland.edu/~grossman/erdoshp.html

[5] ILRT http://www.ilrt.bris.ac.uk/

[6] University of Bristol http://www.bris.ac.uk/

[7] for example posts to www-rdf-rules: http://lists.w3.org/Archives/Public/www-rdf-rules/2001Nov/0013.html http://lists.w3.org/Archives/Public/www-rdf-rules/2001Nov/0012.html http://lists.w3.org/Archives/Public/www-rdf-rules/2001Nov/0011.html http://lists.w3.org/Archives/Public/www-rdf-rules/2001Nov/0010.html http://lists.w3.org/Archives/Public/www-rdf-rules/2001Nov/0008.html http://lists.w3.org/Archives/Public/www-rdf-rules/2001Nov/0007.html http://lists.w3.org/Archives/Public/www-rdf-rules/2001Nov/0006.html http://lists.w3.org/Archives/Public/www-rdf-rules/2001Nov/0005.html http://lists.w3.org/Archives/Public/www-rdf-rules/2001Nov/0004.html http://lists.w3.org/Archives/Public/www-rdf-rules/2001Nov/0003.html http://lists.w3.org/Archives/Public/www-rdf-rules/2001Nov/0002.html

[8] News from Information Strategy http://www.ltss.bris.ac.uk/in20p24.htm

[9] RDFWeb notebook: aggregation strategies, Dan Brickley. http://rdfweb.org/2001/01/design/smush.html

[10] Foaf schema http://xmlns.com/foaf/0.1/

[11] Representing vCard Objects in RDF/XML by Renato Iannella http://www.w3.org/TR/vcard-rdf

[12] Contact schema http://www.w3.org/2000/10/swap/pim/contact

[13] Wordnet http://www.cogsci.princeton.edu/~wn/

[14] Wordnet schema in RDF by Dan Brickley http://xmlns.com/wordnet/1.6/

[15] Reference description of the DAML+OIL (March 2001) ontology markup language Frank van Harmelen, Peter F. Patel-Schneider and Ian Horrocks, editors. http://www.daml.org/2001/03/reference.html

[16] Resource Description Framework (RDF) Model and Syntax Specification, Ora Lassila and Ralph R. Swick, editors. http://www.w3.org/TR/REC-rdf-syntax

[17] N3 RDF syntax A Primer Tim Berners-Lee http://www.w3.org/2000/10/swap/Primer

[18] RDF Validation Service http://www.w3.org/rdf/validator

[19] RDFAuthor by Damian Steer http://rdfweb.org/people/damian/2001/10/RDFAuthor/

[20] IRIS database (University of Bristol users only) http://www.iris.bris.ac.uk/

[21] Inkling by Libby Miller http://swordfish.rdfweb.org/rdfquery/

[22] Jena Toolkit by Brian McBride et al. http://www.hpl.hp.com/semweb/

[23] Redland by Dave Beckett http://www.redland.opensource.ac.uk/

[24] PostgreSQL http://www.postgresql.org/

[25] MySQL http://www.mysql.org/

[26] BerkeleyDB http://www.sleepycat.com

[27] rdfDB query http://sourceforge.net/projects/rdfdb/, http://web1.guha.com/rdfdb/ R.V. Guha

[28] rdfDB query language http://web1.guha.com/rdfdb/query.html R.V. Guha

[29] ILRT RDF database http://swordfish.rdfweb.org/discovery/2001/11/ilrt/

[30] An 'open' RDFWeb node: http://swordfish.rdfweb.org/rweb/who