RDF Query by example

2002-04-04

Libby Miller <libby.miller@bristol.ac.uk>

Latest version: http://ilrt.org/discovery/2002/04/query/

Slides: http://ilrt.org/discovery/2002/04/query/Overview.html

RSS demo: http://sw1.ilrt.org/discovery/2002/04/rss/

Acknowledgements

The SquishQL/Inkling query work was part-funded by the Harmony project

Contents

Introduction

  1. Three principles for creating interoperable data
  2. RDF Query: asking questions of very flexible data
  3. Lightweight, hackable interoperability with RDF query: two examples
  4. Completer-finisher stage: Getting from hack to service

Introduction

This paper is for the 'Netlab and friends' [NETLAB] conference in the section on interoperability. 'NetLab and friends' is a celebration of ten years of Netlab, and of Netlab's contribution to technology on the web and in particular to the development of Digital Libraries. This paper is about RDF query, and specifically how simple RDF query languages can help members of the Digital Libraries community use RDF data right now.

RDF is structured data. Instead of putting information in a Word document or a simple html file, or in 'vanilla' XML, you store your information in XML documents using a particular set of conventions for writing that XML, or in such a way that you can export the data to XML documents using those conventions (for example in a database or a spreadsheet, or even in certain forms of xhtml).

The important thing about RDF is the information model, which is a directed labelled graph: objects in the world are represented by nodes, and their properties by arcs that link the nodes together. Here is an example of the RDF for this paper, expressed as a graph:

The goal of using the RDF conventions for exporting structured data is data interoperability. RDF's node-arc-node model provides a minimal structure for interoperability, because it directly encodes information which is often implicit in other formats, such as 'vanilla' XML documents or IAFA templates.

The first hurdle with RDF is the question: why would you want to use RDF as your structured data format for your particular project? After all, there are a number of widely used, well tested and mature protocols and query and data formats in use in the Digital Library community already.

The answer is that if you have control, and will always have control, over your data, you don't need RDF. You can use any data format and protocol you like: an XML format of your own choice, or a particular profile of Z39.50 or Whois++, for example. This applies whether the data is private to your organisation (such as internal company records or databases), or whether you can control the data that is provided for you (for example as with the normalisation of data for the Renardus project). There is no one best way of storing and retrieving structured data.

However, in many cases you will not control the data you are working with. Distributed data is difficult to control, because you have to rely on others to provide it. Data is also difficult to control over time, as requirements for the data change. Another example is data from unexpected sources, which might be reusable, but probably isn't in your preferred format.

If you do not control the data you have to deal with, because you don't own it, or because it may change over time, or because you don't know what you might want to combine it with next, then using RDF and associated modeling strategies and tools ('Semantic Web technologies') is useful now. Part of the reason for this is that the RDF model uses certain principles for modelling data which help with interoperability. Three of these are discussed in the next section, section 1. Section 2 briefly describes a simple RDF query language. In Section 3 we describe two different possibilities for combining data from different sources using RDF and RDF query.

1. Three principles for creating interoperable data

RDF is defined by its model, not its syntax. Using the RDF model for your data from the start can help save time if you ever want to combine the data with any other data, whether you decide to use the RDF syntax and an RDF database and query language or whether you use a relational database or Z39.50 to store and serve up the data. There are three basic modeling principles that are used to create RDF data:

  1. Use the RDF information model: nodes and arcs
  2. Use universal resource identifiers: URLs, mailtos
  3. Define the structure of information: write schemas

Principle 1: Use the RDF information model (nodes and arcs)

The RDF model is a good sanity check for interoperable data. RDF uses nodes and connecting arcs to talk about objects and their properties, so that you start thinking in terms of objects and their properties rather than in terms of documents and their syntax. This improves extensibility and may help modelling style.

For example, suppose you have a ROADS SERVICE file format [ROADS], which has the field 'Record-Created-Email'. An addition to the template requires a (small) change to the parser, so the format is not very easily extensible. Moreover, the underlying model is implicit in the format. 'Record-Created-Email' is a shortcut from the record to the person that created the record, to the email address of that person. This is just a different modelling style, but it also obscures some of the structure of the record that might help interoperability. 'Persons' often crop up in different sorts of records, but here the fact that a person created the record isn't clear. This is also a common problem with 'vanilla' XML: the emphasis is on the syntax of the document, not the underlying model. There is nothing wrong per se with this type of modelling, but it can limit extensibility and interoperability.
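
To make the contrast concrete, here is a minimal Python sketch of the same record written first as a flat template and then as node-arc-node triples. It is not part of ROADS or of any RDF toolkit. The record URI and the 'createdBy' property are hypothetical names invented for the illustration; only dc:title and foaf:mbox come from real vocabularies.

```python
# A flat, template-style record: the person who created it is only implied
# by the name of the field.
flat_record = {
    "Title": "RDF Query by example",
    "Record-Created-Email": "libby.miller@bristol.ac.uk",
}

# Hypothetical identifiers, used only for this illustration.
RECORD = "http://example.com/records/1"
CREATED_BY = "http://example.com/schema/createdBy"
# Properties from existing vocabularies.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"


def to_triples(record, record_uri):
    """Unpack the flat record into (subject, predicate, object) triples,
    making the person an explicit node in the graph."""
    person = "_:creator"  # blank node: we only know the person by mailbox
    return [
        (record_uri, DC_TITLE, record["Title"]),
        (record_uri, CREATED_BY, person),
        (person, FOAF_MBOX, "mailto:" + record["Record-Created-Email"]),
    ]


triples = to_triples(flat_record, RECORD)
for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

Once the person is a node in its own right, more properties (a name, a homepage) can be attached to it without touching the record or any parser.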

The RDF/XML syntax has been much criticised, but it is highly expressive and extensible, and there are now many tools that process it [JENA], [SIRPAC]. There are also some tools which allow you to create arbitrary RDF using a visual editor [RDFAUTHOR], [ISAVIZ]. You do not have to use RDF/XML syntax as the primary store of your data, but as an interchange format, it is very useful.

Principle 2: Use URIs as identifiers where possible

Or rather, use well-known, public URIs where possible: don't make them up if you can avoid it. In RDF nodes are either 'blank nodes' or are identified using URIs. RDF processors automatically assume that objects with the same URI are the same object.

In the case of people and certain documents it is not good modelling style to conflate the URI with the thing. A person is not their email address, and a document might have several URLs. In this case it is useful to use indirection. If the document does not have a single URL, then it can still have a dc:identifier pointing to a URL. A person can have a foaf:mbox pointing to their email address. RDF processors will not merge nodes identified indirectly, although schema annotations can be used to do this [SMUSH]. Sometimes it may seem simpler to forgo modelling accuracy for simplicity, but be careful: things can go wrong because RDF will assume that things with the same URI are the same object. For example, characteristics of a document might get confused with the characteristics of a particular instance of the document retrievable from a URL, if you name the document with the URL of a retrievable instance of it.

For interoperability purposes, it is most helpful to use well-known, public identifiers; otherwise you'll have to keep stating that 'urn:ilrtperson2356' and 'urn:rdfwebperson347687' are the same individual, and processing information from different sources will be slow and cumbersome.
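
As a rough illustration of what merging indirectly identified nodes involves, the following Python sketch collapses nodes that share the same foaf:mbox value into one node. This is a simplification under stated assumptions: the identifying property is passed in by hand rather than discovered from schema annotations as described in [SMUSH], and the 'worksAt' property is hypothetical.

```python
FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
WORKS_AT = "http://example.com/schema/worksAt"  # hypothetical property

# Two sources describe the same person under different made-up identifiers.
triples = [
    ("urn:ilrtperson2356", FOAF_MBOX, "mailto:libby.miller@bristol.ac.uk"),
    ("urn:ilrtperson2356", FOAF_NAME, "Libby Miller"),
    ("urn:rdfwebperson347687", FOAF_MBOX, "mailto:libby.miller@bristol.ac.uk"),
    ("urn:rdfwebperson347687", WORKS_AT, "ILRT"),
]


def smush(triples, identifying_property):
    """Rewrite the triples so that nodes sharing a value for the identifying
    property are represented by a single node (the first one seen)."""
    canonical = {}  # identifying value -> chosen node
    alias = {}      # node -> chosen node
    for s, p, o in triples:
        if p == identifying_property:
            alias[s] = canonical.setdefault(o, s)
    merged = []
    for s, p, o in triples:
        triple = (alias.get(s, s), p, alias.get(o, o))
        if triple not in merged:  # drop duplicates created by the merge
            merged.append(triple)
    return merged


merged = smush(triples, FOAF_MBOX)
for triple in merged:
    print(triple)
```

Every pair of sources needs this extra pass when identifiers are made up locally, which is the cost that well-known public identifiers avoid.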

Principle 3: Write schemas

RDF Schemas describe the broad structure of types of RDF documents, such as what types of arcs can link what types of nodes. They can also include information about class hierarchies: you can build elaborate links describing subclasses of objects.

So, for example, you could use Dublin Core to represent information about webpages and other documents, and the experimental foaf vocabulary or vCard in RDF for people, addresses and relationships. Reusing vocabularies can be difficult if parts of a schema are similar to what you want, but do not quite represent it exactly. RDF has subClassOf and subPropertyOf relationships to accommodate these similarities interoperably. These are not currently processed in very many tools but are useful for fixing the meaning of classes and properties precisely, and will likely be used in the future.

RDF does not require schemas: you can do a great deal without using them at all. However for interoperability it is helpful to describe what you meant your RDF to represent in machine-readable and human-readable form. This enables people to reinterpret your data more accurately.
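
To show what a tool that did process subPropertyOf might do, here is a small Python sketch. The 'illustrator' property and the book URI are hypothetical; the sketch assumes a schema stating that 'illustrator' is a subproperty of dc:creator, and adds the more general statement for each specific one it finds.

```python
RDFS_SUBPROPERTY = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
EX_ILLUSTRATOR = "http://example.com/schema/illustrator"  # hypothetical

# The schema: one statement relating a local property to a well-known one.
schema = [(EX_ILLUSTRATOR, RDFS_SUBPROPERTY, DC_CREATOR)]
# The data uses only the local, more specific property.
data = [("http://example.com/book/1", EX_ILLUSTRATOR, "_:person1")]


def superproperties(prop, schema):
    """Return every property that prop is a (transitive) subproperty of."""
    found, todo = set(), [prop]
    while todo:
        current = todo.pop()
        for s, p, o in schema:
            if s == current and p == RDFS_SUBPROPERTY and o not in found:
                found.add(o)
                todo.append(o)
    return found


def infer(data, schema):
    """Return the data plus one extra triple per superproperty."""
    inferred = set(data)
    for s, p, o in data:
        for general in superproperties(p, schema):
            inferred.add((s, general, o))
    return inferred


for triple in sorted(infer(data, schema)):
    print(triple)
```

A query written against dc:creator would then also find data published with the more specific local property.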

Storing RDF

Once you have used these methods in your modeling, you have various options for actually storing your data. One option is to use RDF/XML documents as the primary store of your data and then use an RDF processor and a pure RDF database (one designed to store RDF). However, you do not have to use a 'pure' RDF database for storing the information, and in fact there are significant overheads to doing so, in particular the immaturity of the technology and the specialist knowledge that will probably be required. Another option is to store your data in (say) a relational database and export it to RDF/XML files.
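
The second option might look something like the following Python sketch, which uses only the standard library. The 'papers' table and its columns are assumptions made for the example, and for brevity the creator is written as a plain string rather than as a separate person node.

```python
import sqlite3
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical table of papers held in an ordinary relational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE papers (url TEXT, title TEXT, creator TEXT)")
db.execute(
    "INSERT INTO papers VALUES (?, ?, ?)",
    ("http://ilrt.org/discovery/2002/04/query/",
     "RDF Query by example", "Libby Miller"),
)


def export_rdfxml(connection):
    """Serialise each row as an rdf:Description with Dublin Core properties."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("{%s}RDF" % RDF_NS)
    rows = connection.execute("SELECT url, title, creator FROM papers")
    for url, title, creator in rows:
        description = ET.SubElement(
            root, "{%s}Description" % RDF_NS, {"{%s}about" % RDF_NS: url})
        ET.SubElement(description, "{%s}title" % DC_NS).text = title
        ET.SubElement(description, "{%s}creator" % DC_NS).text = creator
    return ET.tostring(root, encoding="unicode")


rdfxml = export_rdfxml(db)
print(rdfxml)
```

The relational database remains the primary store; the RDF/XML is regenerated whenever the data needs to be exchanged.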

Having said that, RDF databases are extremely flexible, and if your data structure changes rapidly you may find them very useful. For similar reasons, an RDF database can provide an interim solution for experimenting with combining different types of data until you settle on a good optimised structure in a well-understood database. The rest of this paper provides some examples of using mixed data with an RDF database that understands an RDF query language.

2. RDF Query: asking questions of very flexible data

If you do choose an RDF database, you need a way of accessing information that mirrors the flexibility of the RDF information model. RDF query languages such as SquishQL [INKLING] (which I've been working on) all query the RDF information model directly, and do not care about how the information is stored.

For example, the query below says:

"find me the name of the person whose email address is libby.miller@bristol.ac.uk, and also find me the title and identifier of anything that she has created"

select ?name, ?title, ?identifier 
where 
(dc::title ?paper ?title)
(dc::creator ?paper ?creator)
(dc::identifier ?paper ?identifier)
(foaf::name ?creator ?name) 
(foaf::mbox ?creator mailto:libby.miller@bristol.ac.uk) 
using dc for http://purl.org/dc/elements/1.1/
foaf for http://xmlns.com/foaf/0.1/

Simple RDF queries such as this one just try to match parts of the RDF model in the database. Because of this, you can describe the query itself as a graph, as below:

The query describes the pattern of the information we want to match, not the way it is stored in the database. If we find some more data about reviews of papers, we can add this to our database and query it without redesigning the database structure, and by rewriting our query slightly:

select ?name, ?title, ?identifier, ?content  
where 
(dc::title ?paper ?title)
(dc::creator ?paper ?creator)
(dc::identifier ?paper ?identifier)
(foaf::name ?creator ?name) 
(foaf::mbox ?creator mailto:libby.miller@bristol.ac.uk) 
(foaf::review ?paper ?review)
(foaf::content ?review ?content)
using dc for http://purl.org/dc/elements/1.1/
foaf for http://xmlns.com/foaf/0.1/

This differs from the relational model where you have to know the structure of tables before you can make the query. The relational model ties the data strongly to the way it is stored, which means that changes to the structure of the data require changes to the structure of the database, as well as changes to the query.

In contrast when new information is added to a pure RDF database, the structure of the database does not change in a relational sense. RDF data is semi-structured and does not rely on the presence of a schema for storage or query. This makes pure RDF databases very useful for prototyping and for other fast-moving data environments.
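
The pattern matching described in this section can be sketched in a few lines of Python. This is only an illustration of the idea, not how Inkling or any other SquishQL implementation works. Note that the sketch writes each pattern as (subject, predicate, object), whereas SquishQL puts the predicate first; the blank node name '_:libby' is an assumption of the example.

```python
DC = "http://purl.org/dc/elements/1.1/"
FOAF = "http://xmlns.com/foaf/0.1/"
PAPER = "http://ilrt.org/discovery/2002/04/query/"

# A tiny in-memory "database": a list of (subject, predicate, object) triples.
triples = [
    (PAPER, DC + "title", "RDF Query by example"),
    (PAPER, DC + "creator", "_:libby"),
    ("_:libby", FOAF + "name", "Libby Miller"),
    ("_:libby", FOAF + "mbox", "mailto:libby.miller@bristol.ac.uk"),
]


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all of the patterns.
    Terms starting with '?' are variables; other terms must match exactly."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(rest, triples, candidate)


query = [
    ("?paper", DC + "title", "?title"),
    ("?paper", DC + "creator", "?creator"),
    ("?creator", FOAF + "name", "?name"),
    ("?creator", FOAF + "mbox", "mailto:libby.miller@bristol.ac.uk"),
]

for row in match(query, triples):
    print(row["?name"], row["?title"])
```

Adding review data means appending more triples to the list and more patterns to the query; nothing about the store itself has to be redesigned.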

3. Lightweight, hackable interoperability with RDF query: two examples

RDF gets you thinking in terms of combining information, and how to make it (relatively) easy. You might start thinking...

Example 1: Publications, ROADS editors' recommendations and bookmarks

So, let's find out more about the editors at SOSIG [SOSIG]. You can already get to a brief description of the editor of a section [EDITORS]: why not combine this information with papers they have authored, and with their current SOSIG-related bookmarks list? Technically this is fairly simple.

Using Dublin Core and a simple bookmarks schema, all these formats can be converted to RDF/XML files. They can then be harvested into an RDF database and queried (see below for various options for storing and querying RDF data). An example query might look something like this:

"find me all the bookmarks from the bookmarks file of the person with email address 'emma.place@bristol.ac.uk' which are dated more recently than 1st April 2002".

select ?bookmark, ?title, ?date   
where 
(foaf::mbox ?person mailto:emma.place@bristol.ac.uk) 
(bm::bookmarkFile ?person ?file) 
(bm::bookmark ?file ?bookmark) 
(dc::title ?bookmark ?title) 
(dc::date ?bookmark ?date) 
and ?date > 2002-04-01
using dc for http://purl.org/dc/elements/1.1/
foaf for http://xmlns.com/foaf/0.1/
bm for http://example.com/bookmarks/

The results of such a query might look something like this:

bookmark                            title             date
http://www.psychology.ltsn.ac.uk/   LTSN Psychology   2002-04-02
http://www.bids.ac.uk/              BIDS              2002-04-04

The social constraints on using these pieces of information together might be more limiting than the technical aspects. Although many people make their bookmarks available publicly on the web, some would not dream of doing so. People might object to such a swathe of information being made available about them.

What benefit might something like this give SOSIG?

It gets people thinking about different kinds of information that might improve the service. Let's say that the editors aren't happy with everyone being able to see their bookmarks, but would be happy for other editors to see them. Then they could see what other subject editors were working on, check for duplicates, and get useful ideas for where to go next. Or suppose they're not happy for people to see their bookmarks, but would find a blogging tool helpful in the cataloging process, the output of which they would be happy to share.

The user of the service now has an interesting way of checking the credentials of the subject editors. The service does not just say: "trust us because we are trustworthy, non-profit making and have a long history of providing good resources", but "trust us because of the credentials of the named individuals who find and catalogue resources for us."

Finally, combining this information together might make technical people think about how they might generalise this cataloging model to a more inclusive annotating model, and how this might be managed, for example:

"select annotations by people that Emma knows of professionally"

select ?annotation, ?content 
where 
(ann::annotates ?annotation ?file) 
(ann::content ?annotation ?content) 
(dc::creator ?annotation ?knownPerson) 
(foaf::mbox ?person mailto:emma.place@bristol.ac.uk) 
(foaf::knowsOfProfessionally ?person ?knownPerson) 
using dc for http://purl.org/dc/elements/1.1/
foaf for http://xmlns.com/foaf/0.1/
ann for http://example.com/annotations/

Example 2: A very simple portal

For interoperability between organisations, simple is often better, especially if there are many organisations involved.

RSS [RSS] is a well-known syndication format expressed in RDF/XML, originally designed to syndicate news stories. It's very simple indeed, consisting essentially of a list of links with titles and descriptions, and a container to hold them. As of RSS 1.0, simple RSS files can be extended with modules for a particular purpose, for example with a set of Dublin Core elements to describe webpages in more detail.

At LTSN Economics [LTSN-ECON], Martin Poulter has been experimenting with the RSS 1.0 events module [RSS-EVENTS] to describe conference information [LTSN-EVENTS]. The events module adds a start date, an end date, a location and an organiser to the standard RSS item. Other LTSN centres have also been producing ordinary RSS 1.0 feeds [LTSN-FEEDS], as have various organisations in the subject gateway community, spurred on by UKOLN's RSSXpress [RSSXPRESS].

SOSIG's Grapevine service [GRAPE] has a personalization feature that can display feeds described as RSS 1.0. But we can do even more interesting things using query. Let's suppose we have a list of feeds such as that at UKOLN's RSSXpress page. Then we produce a piece of RDF describing those feeds, for example by classifying them according to their subject within the SOSIG classification system. Then we can load all the feeds into an RDF database, load in our little RDF description file about the feeds, and with just a couple of queries we have a 'portal'.

First we ask: what feed would you like?

select ?feedUrl, ?title  
where 
(dc::subject ?feedUrl ?subject)
(rss::title ?feedUrl ?title)
and ?subject ~ "economics"
using rss for http://purl.org/rss/1.0/
dc for http://purl.org/dc/elements/1.1/

Results:

feedUrl                                              title
http://chewbacca.ilrt.bris.ac.uk/events/events.xml   LTSN Economics events
http://www.cepr.org/aboutcepr/cepr.rss               Centre for Economic Policy Research
http://www.bized.ac.uk/homeinfo/whatsnew.htm         Biz/ed What's New

Then we create each feed using another query which asks for the items, their URLs and titles:

select ?item, ?ti, ?li 
where 
(rss::items http://chewbacca.ilrt.bris.ac.uk/events/events.xml ?seq)
(?contains ?seq ?item)
(rss::title ?item ?ti) 
(rss::link ?item ?li) 
using rss for http://purl.org/rss/1.0/

And with a little HTML formatting, we have a portal!
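
The formatting step really can be small. As a sketch, assuming the item query hands back rows of (title, link) pairs, a few lines of Python are enough to turn them into an HTML list. The row values below are hypothetical placeholders, not real feed items.

```python
from html import escape

# Rows as they might come back from the item query: (title, link) pairs.
# These values are hypothetical placeholders.
rows = [
    ("An example event", "http://example.com/events/1"),
    ("Markets & institutions seminar", "http://example.com/events/2?a=1&b=2"),
]


def render_feed(feed_title, rows):
    """Return an HTML fragment listing each item as a link."""
    items = "\n".join(
        '  <li><a href="%s">%s</a></li>'
        % (escape(link, quote=True), escape(title))
        for title, link in rows
    )
    return "<h2>%s</h2>\n<ul>\n%s\n</ul>" % (escape(feed_title), items)


page = render_feed("LTSN Economics events", rows)
print(page)
```

Escaping the titles and links matters because feed content comes from other people's files and cannot be assumed to be safe to paste into a page.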

If we have a channel picker like that on SOSIG Grapevine, we could store people's selections of feeds as an RDF file, and then pull them out again next time using a query like this one:

select ?feedUrl, ?title  
where 
(sosig::profile ?person ?profile)
(sosig::channel ?profile ?feedUrl)
(rss::title ?feedUrl ?title)
(foaf::mbox ?person mailto:libby.miller@bristol.ac.uk)
using rss for http://purl.org/rss/1.0/
foaf for http://xmlns.com/foaf/0.1/
sosig for http://www.sosig.ac.uk/schemas/profiles/

Something as simple as a Perl regex can be used to parse RSS files with great effect. The advantage of using RDF query to do it is that it makes the query and display of RSS extensions or modules very simple. For example, a query of the basic RSS form of the LTSN event feed looks just like the query for any other feed:

select ?item, ?title, ?link  
where 
(rss::items http://chewbacca.ilrt.bris.ac.uk/events/events.xml ?seq)
(?anyPredicate ?seq ?item)
(rss::title ?item ?title) 
(rss::link ?item ?link) 
using rss for http://purl.org/rss/1.0/

Adding the events information makes the query look like this:

select ?title, ?link, ?start, ?end, ?location  
where 
(rss::items http://chewbacca.ilrt.bris.ac.uk/events/events.xml ?seq)
(?anyPredicate ?seq ?item)
(rss::title ?item ?title) 
(rss::link ?item ?link) 
(ev::startdate ?item ?start) 
(ev::enddate ?item ?end) 
(ev::location ?item ?location) 
using rss for http://purl.org/rss/1.0/
ev for http://purl.org/rss/1.0/modules/event/

This gives us results that look like this:

link                                      title                                         start        end          location
http://crm.hct.ac.ae/tend2002/            Bridging the Divide - Strategies for Change   2002-04-07   2002-04-09   Dubai, United Arab Emirates
http://www.scoteconsoc.org/ses2002.html   SEA Annual Conference                         2002-04-11   2002-04-12   Dundee, UK

We could even start limiting the scope of our searches and combining various feeds, for example:

"find me all the events starting in April 2002 from all feeds which can be picked out using the search term economics"

select ?item, ?title, ?link, ?start, ?end, ?location  
where 
(dc::subject ?feedUrl ?subject)
(rss::items ?feedUrl ?seq)
(?anyPredicate ?seq ?item)
(rss::title ?item ?title) 
(rss::link ?item ?link) 
(ev::startdate ?item ?start) 
(ev::enddate ?item ?end) 
(ev::location ?item ?location) 
and ?subject ~ "economics"
and ?start ~ "2002-04" 
using rss for http://purl.org/rss/1.0/
ev for http://purl.org/rss/1.0/modules/event/
dc for http://purl.org/dc/elements/1.1/

Simple RDF queries like the SquishQL examples here are not particularly powerful: they can't do 'OR' queries for example. But they can make queries of flexible data in a fairly easy to understand fashion.

4. Completer-finisher stage: Getting from hack to service

You may want to optimise your database for certain queries, once you have experimented with combining RDF data. One way of doing this is to collect RDF information in one RDF database, and then use RDF query to pick out the parts you want to optimise for.

For example, suppose we had a number of interesting economics RSS events feeds available. A query such as

select ?item, ?title, ?link, ?start, ?end, ?location  
where 
(dc::subject ?feedUrl ?subject)
(rss::items ?feedUrl ?seq)
(?anyPredicate ?seq ?item)
(rss::title ?item ?title) 
(rss::link ?item ?link) 
(ev::startdate ?item ?start) 
(ev::enddate ?item ?end) 
(ev::location ?item ?location) 
and ?subject ~ "economics"
and ?start ~ "2002-04" 
using rss for http://purl.org/rss/1.0/
ev for http://purl.org/rss/1.0/modules/event/
dc for http://purl.org/dc/elements/1.1/

(repeated from above) gives us the raw ingredients of a simple flat relational database structure, with one table:

id link title start end location

which we can then query using SQL to create a nice HTML list of conferences for our economics users (or perhaps our own RSS feed). In this way, RDF databases and query can be an intermediate step, helping us to gather and organise diverse data before optimising.
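
As a minimal sketch of this last step, the Python below loads query results into a single SQLite table and asks an ordinary SQL question of it. The rows are copied from the example results earlier in the paper; the table name and the column names 'start_date' and 'end_date' (chosen to steer clear of SQL keywords) are assumptions of the example.

```python
import sqlite3

# Rows as returned by the RDF query, in the column order
# link, title, start, end, location.
rows = [
    ("http://crm.hct.ac.ae/tend2002/",
     "Bridging the Divide - Strategies for Change",
     "2002-04-07", "2002-04-09", "Dubai, United Arab Emirates"),
    ("http://www.scoteconsoc.org/ses2002.html",
     "SEA Annual Conference",
     "2002-04-11", "2002-04-12", "Dundee, UK"),
]

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE events ("
    " id INTEGER PRIMARY KEY, link TEXT, title TEXT,"
    " start_date TEXT, end_date TEXT, location TEXT)"
)
db.executemany(
    "INSERT INTO events (link, title, start_date, end_date, location)"
    " VALUES (?, ?, ?, ?, ?)",
    rows,
)

# ISO dates sort correctly as text, so a plain string comparison is enough.
upcoming = db.execute(
    "SELECT title, start_date, location FROM events"
    " WHERE start_date >= ? ORDER BY start_date",
    ("2002-04-10",),
).fetchall()
print(upcoming)
```

From here the table can be indexed and tuned like any other relational data, while the RDF database carries on gathering new feeds.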

Summary

My aim has been to show how RDF tools can be useful to the Digital Libraries community now. I've suggested that while RDF tools may not be as fast and well-understood as more conventional databases and protocols, nevertheless they can be used to combine information from multiple sources in interesting and practical ways that can extend the functionality of services.

Tools you might like to try for storing and querying RDF in SQL databases include Jena [JENA], SquishQL/Inkling [INKLING] and SquishQL/Ruby [RUBY-RDF]. RDFStore [RDFSTORE] includes a Perl implementation; RDFdb [RDFDB] has a similar query language, on which SquishQL was based. Redland [REDLAND] is a fast RDF database. Many more RDF query languages and database systems are available: see Dave Beckett's RDF resource guide [BECKETT].

Acknowledgements

Thanks to Dan Brickley and Damian Steer for helpful comments and discussion.

References

[NETLAB] http://www.lub.lu.se/netlab/conf/

[JENA] http://www.hpl.hp.com/semweb/jena-top.html

[SIRPAC] http://www-db.stanford.edu/~melnik/rdf/api.html

[RDFAUTHOR] http://rdfweb.org/people/damian/RDFAuthor

[ISAVIZ] http://www.w3.org/2001/11/IsaViz/

[SMUSH] http://rdfweb.org/2001/01/design/smush.html

[INKLING] http://swordfish.rdfweb.org/rdfquery/

[ROADS] http://www.ukoln.ac.uk/metadata/roads/templates/

[SOSIG] http://www.sosig.ac.uk/

[EDITORS] for example http://www.sosig.ac.uk/profiles/econ_management.html

[RSS] http://www.purl.org/rss/1.0/

[LTSN-ECON] http://www.economics.ltsn.ac.uk/

[RSS-EVENTS] http://groups.yahoo.com/group/rss-dev/files/Modules/Proposed/mod_event.html

[LTSN-EVENTS] http://chewbacca.ilrt.bris.ac.uk/events/events.xml

[LTSN-FEEDS] http://www.ltsneng.ac.uk/rssfeeds/rsseg.asp

[RSSXPRESS] http://rssxpress.ukoln.ac.uk/

[GRAPE] http://www.sosig.ac.uk/gv/

[RUBY-RDF] http://www.w3.org/2001/12/rubyrdf/

[RDFSTORE] http://rdfstore.sourceforge.net/

[RDFDB] http://web1.guha.com/rdfdb/

[REDLAND] http://www.redland.opensource.ac.uk/

[BECKETT] http://www.ilrt.bris.ac.uk/discovery/rdf/resources/