
RDF, SQL and the Semantic Web - a case study

Authors: Dan Brickley, Libby Miller

Initial draft: 2000-10-31
This version: 2000-11-08
http://ilrt.org/discovery/2000/10/swsql/

Status of this Document

You are looking at a pretty rough draft. Lots of raw materials, much still to be done.

Overview

This is a quick writeup of some work in progress, exploring various inter-relationships between relational databases, tools from the logic programming community, and W3C's RDF. We first discuss some broad issues surrounding RDF APIs and query systems, before exploring a number of more detailed case studies. Our first case study is based around our SQL-ish ("Squish") interface to any RDF database that exports either the Jena or Stanford RDF APIs. Our second case study shows how tools from the logic programming community can be used to export RDBMS systems to the Web.

Background: SQL and the Web...

This note discusses the relationship between low-level RDF APIs and SQL-like query interfaces, before reviewing various implementation options and reporting some experimental case studies.

During the period 1997-2000 we have seen various proposed 'graph APIs' for the RDF data model. In Java we have SiRPAC (now using the Stanford RDF API), the proposed Jena interface, and a number of others. In C, the evolving Mozilla RDF APIs and more recently the Redland distribution offer similar facilities. These APIs are typically couched in terms of either the 'graph' metaphor (nodes and arcs, properties and resources) or in terms of 'statements' (triples).

Regardless of terminology, we see similar functionality in most of these APIs. Typically some mechanism exists for navigating stored RDF data (e.g. the GetTargets(propertyname) call in Mozilla), as well as some mechanism for requesting matches from the database given some partially completed template statement. For example, many systems allow applications to call something like a triplesMatching(propertyname, null, null) method to ask for all the RDF statements that use some specified predicate.
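
As a concrete illustration, a minimal sketch of such an interface in Java might look like the following. The type and method names here are hypothetical, chosen for this note; they are not the names used by any particular API.

import java.util.Iterator;

// Hypothetical types standing in for those of a concrete RDF API.
interface Node {}
interface Resource extends Node {}

// A minimal triples-matching interface; a null argument matches anything.
interface TripleStore {
    // e.g. triplesMatching(null, someProperty, null) returns an iterator
    // over every statement using that predicate.
    Iterator triplesMatching(Resource subject, Resource predicate, Node object);
}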

The storage and query systems that expose such APIs exploit a wide variety of implementation strategies. RDF provides a highly general information model; as a result, a wide variety of data services can be encapsulated behind an RDF-oriented interface. While there is no crisp dividing line between systems that 'really' store RDF and ones which 'merely' manifest an RDF view, we need to make some such distinction to limit the scope of our discussion. For example, Mozilla uses RDF as an abstraction layer to wrap a number of services that are not 'general RDF databases' (e.g. mail/news). This note focuses on what we will call general RDF databases, i.e. systems capable of representing any arbitrary RDF model (subject to resource limitations, e.g. storage space), rather than those with a more constrained information model but which nevertheless manifest an RDF interface. This distinction could be fleshed out with more subtlety than used here; for our current purposes, this broad-brush notion of a 'general' RDF database is adequate.

Considering only these 'general' RDF databases, there have typically been a few common categories of implementation. Firstly we have the simple in-memory database. These systems offer graph navigation and partial statement matching against a large data structure held entirely in memory. Persistent RDF databases have most frequently been built on top of relational databases and simpler systems such as Sleepycat's BerkeleyDB. The remainder of this overview discusses various options for exposing SQL-like interfaces on top of these various kinds of 'generic' RDF database.

It should first be noted that a simple un-augmented RDF API (such as sketched above) is enough to construct SQL-like query systems on top of. We know this because, in the worst case, an API that offers the ability to 'return all statements' (i.e. dump the entire database to client applications) clearly exposes enough information for external query engines to manipulate. When efficiency is not a major concern, this provides a useful backup strategy: any general RDF database that exposes a basic API can have a richer query interface wrapped around it. For developers, this can be a big win: instead of applications having to grovel around the RDF data store using the graph metaphor, or interact with it at a highly granular statement-by-statement level, we can offer a more abstract interface.
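
As a rough sketch of this fallback strategy (reusing the hypothetical Node and Resource types above, plus an equally hypothetical RDFStatement accessor interface), partial-match queries can be layered over a plain 'return all statements' call simply by filtering:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical statement type with simple accessors (not from any real API).
interface RDFStatement {
    Resource getSubject();
    Resource getPredicate();
    Node getObject();
}

// Wraps a store that can only dump every statement, filtering the dump to
// provide partial-match queries. Inefficient, but sufficient to build richer
// query interfaces on top of.
class DumpBackedMatcher {
    static List triplesMatching(Iterator allStatements,
                                Resource s, Resource p, Node o) {
        List results = new ArrayList();
        while (allStatements.hasNext()) {
            RDFStatement st = (RDFStatement) allStatements.next();
            boolean match = (s == null || s.equals(st.getSubject()))
                         && (p == null || p.equals(st.getPredicate()))
                         && (o == null || o.equals(st.getObject()));
            if (match) results.add(st);
        }
        return results;
    }
}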

If I want to ask our Intranet for the homepages and phone numbers of staff who are working on projects that are funded by the organisation whose home page is http://www.jisc.ac.uk/, I would like to be able to write a single expression representing that query rather than have to write a dozen lines of code to navigate the data graph 'by hand'. The existence (and success) of SQL and relational databases has conditioned developers to expect a certain degree of abstraction from databases. In the RDF world, we are only now beginning to see systems that offer such an interface.

Before discussing implementation details any further, it is worth stepping back and contrasting the RDF and SQL/RDBMS approaches. Our goal is to find something that fills an SQL-like role, but using the RDF information model.

Why is this an interesting thing to do? Some observations on SQL and the Web: SQL-based databases, while hugely successful, work best in predictable environments, since schema evolution can be expensive and disruptive. Unless an extremely general RDBMS schema is used (more on which later), SQL systems require application developers to decide in advance on the kind of things their database will be able to store, and the kinds of inter-relations that hold between the entities represented in the database. While it is possible to evolve and extend RDBMS systems over time, this is rarely convenient. With predictability comes efficiency: a database that knows from the outset what it will be doing has a reasonable chance of indexing and storing data efficiently. However, another effect of this is the tendency for SQL-based systems to entangle implementation details with abstract application-level models. If I write an application that needs to embed SQL queries, that application becomes tightly coupled not only to the abstract 'model of the world' (i.e. entities and relationships etc) I've used, but also to the nitty-gritty detail of how those entities and relationships are stored in my particular RDBMS.

These are related and well-understood problems. While they have become increasingly annoying over time, we have also got better at working around them. With networked computing becoming ubiquitous, databases increasingly have to talk to one another, and have to describe the same entities, relationships, attributes etc. Applications need to become less tightly coupled to any particular storage/representation, i.e. avoid commitment to any particular storage strategy.

Such observations are commonplace, and the computing industry has been busy exploring alternatives and coping strategies for dealing with the shortcomings of the RDBMS approach (e.g. encapsulating everything via Enterprise Java Bean wrappers, using Object databases, projecting out LDAP interfaces, or using the magic of XML...). Our concern here is more limited: we hope to connect these issues to some questions facing RDF implementors. How can we augment our basic RDF APIs with SQL-like facilities? How can implementations exploit meta-information about the RDF backend storage system, without being tightly coupled to any particular storage strategy? How can query processors and database indexes be constructed to work in the unpredictable, heterogeneous Web data environment?

Case Study 1: SQL-ish query over RDF APIs

To illustrate some practical issues faced by developers, we provide a brief overview of a simple, experimental implementation of an SQL-like query language on top of three different Java RDF APIs.

We experimented with a number of Java RDF APIs, in particular three examples: the Stanford RDF API, the proposed Jena interface, and our own in-house RDFModelCore testbed.

All have similar features including a triples-matching method, e.g. in RDFModelCore (our in-house simplest-imaginable testbed API) this is...

RDFModelCore triplesWhere(Resource subject, Resource predicate, Node object)

We had previously built an SQL-like query language implementation on top of RDFModelCore, providing simple conjunctive search functionality similar to R.V. Guha's RDFDb. This was subsequently generalized to use these different (but basically similar) RDF APIs. We have two experimental implementations:

Using JDBC for RDF Query

We decided to experiment with the use of the existing JDBC API for representing RDF query services, largely because it is well-used and familiar. While our strawman query language, Squish, was not SQL, it is reasonably similar. A Squish query can be represented as a textual string, and the results of a Squish query can be thought of either as an RDF model, or as a tabular set of variable-to-data bindings. The former can be represented using existing RDF APIs; the latter (we propose) can be manipulated using the Java JDBC APIs, or something very like them.

The JDBC API is used for accessing SQL databases via a Driver written for the particular database type, which enables a Connection object to be made. The Connection object is used to create a Statement object, which queries the database and returns a ResultSet object. The API consists of interfaces to be implemented. In this implementation many methods are not implemented, usually because RDF does not yet use facilities such as transactions.
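
To make this concrete, here is a minimal sketch of issuing a Squish query through the familiar JDBC pattern. The driver class name and the 'jdbc:squish:' URL scheme are invented for this illustration; they are not the identifiers used by our actual code.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SquishQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical driver class and connection URL, for illustration only.
        Class.forName("org.example.squish.SquishDriver");
        Connection con = DriverManager.getConnection("jdbc:squish:memory");
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery(
            "select ?x, ?m where ({http://xmlns.com/foaf/0.1/mbox} ?x ?m)");
        while (rs.next()) {
            // Each row is one set of variable-to-data bindings.
            System.out.println(rs.getString("x") + " " + rs.getString("m"));
        }
        con.close();
    }
}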

In this implementation, Statement does most of the hard work. It:

* parses the query into triple objects
* orders the query triples so that there are no dangling bindings
* makes the queries in order, one by one (unoptimised), to the underlying storage using the triplesWhere method (or similar); a rough sketch of this step follows the list
* forms a java.sql.ResultSet object, which is returned.
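
The conjunctive matching step can be sketched roughly as follows. This toy version works over plain String[] triples rather than a real RDF API, ignores the ordering step (bindings simply propagate through the recursion), and uses names invented for this note rather than those of our implementation.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy conjunctive matcher: statements and query triples are String[3] arrays
// in (predicate, subject, object) order; query slots starting with "?" are
// variables. Purely illustrative.
class ToyConjunctions {

    static boolean isVar(String slot) { return slot.startsWith("?"); }

    // Returns one Map of variable-to-value bindings per solution.
    static List solve(List queryTriples, List statements, Map bindings) {
        List solutions = new ArrayList();
        if (queryTriples.isEmpty()) {            // all conjuncts satisfied
            solutions.add(new HashMap(bindings));
            return solutions;
        }
        String[] q = (String[]) queryTriples.get(0);
        List rest = queryTriples.subList(1, queryTriples.size());
        for (int i = 0; i < statements.size(); i++) {
            String[] st = (String[]) statements.get(i);
            Map extended = new HashMap(bindings);
            boolean ok = true;
            for (int slot = 0; slot < 3 && ok; slot++) {
                String pattern = q[slot];
                if (isVar(pattern)) {
                    String bound = (String) extended.get(pattern);
                    if (bound == null) extended.put(pattern, st[slot]);
                    else ok = bound.equals(st[slot]);
                } else {
                    ok = pattern.equals(st[slot]);
                }
            }
            if (ok) solutions.addAll(solve(rest, statements, extended));
        }
        return solutions;
    }
}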

Following the SQL model, query triples are treated as conjunctions, i.e.:

select ?x, ?y, ?m where ({http://xmlns.com/foaf/0.1/livesWith} ?x ?y)
({http://xmlns.com/foaf/0.1/mbox} ?y ?m) ({http://xmlns.com/foaf/0.1/mbox}
?x mailto:daniel.brickley@bristol.ac.uk)

is interpreted as 

select ?x, ?y, ?m where ({http://xmlns.com/foaf/0.1/livesWith} ?x ?y) &
({http://xmlns.com/foaf/0.1/mbox} ?y ?m) & 
({http://xmlns.com/foaf/0.1/mbox} ?x mailto:daniel.brickley@bristol.ac.uk)

and will not return a row unless all the slots are filled in that row.

Statement and Connection are API-independent; the only restriction on the API implementation (whether in-memory or not) is that it implements a particular form of the triplesMatching(s, p, o) method, returning an RDFModelCore object (API specific), forming a wrapper for the API's own implementation of this method. This is implemented by making the model conform to a tiny interface (sketched below). The API-specific classes are the Driver and the in-memory and SQL front end classes.
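
The 'tiny interface' and its API-specific wrappers can be sketched like this, again reusing the hypothetical Node, Resource and TripleStore types from earlier; the names are ours, not the actual class names used in the implementation:

// Hypothetical 'tiny interface' that each API-specific wrapper implements,
// so the query engine sees a single triples-matching signature.
interface MinimalTripleSource {
    // Returns a small model holding just the matching statements (the real
    // code returns an RDFModelCore; here we reuse the hypothetical TripleStore).
    TripleStore triplesWhere(Resource subject, Resource predicate, Node object);
}

// Stand-in for some concrete RDF API's model class and its own matching call.
interface SomeVendorModel {
    TripleStore find(Resource subject, Resource predicate, Node object);
}

// Adapter: delegates the tiny interface to the underlying API's own method.
class SomeVendorModelAdapter implements MinimalTripleSource {
    private final SomeVendorModel model;
    SomeVendorModelAdapter(SomeVendorModel model) { this.model = model; }
    public TripleStore triplesWhere(Resource s, Resource p, Node o) {
        return model.find(s, p, o);
    }
}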

Using this JDBC API means that details of the access to the API and details of the access to the SQL store (if used) are hidden.

The software as written is inefficient because it does not use any optimization to access SQL databases. But it does show that it is easy (using this implementation or something similar) to turn a basic RDF API into an SQL-like query language.

Case Study 2: RDBMS meets Logic Programming, using XSB with ODBC

This case study describes how to get up and running with a relational database / inference engine environment, providing a useful testbed for RDF / RDBMS mapping experiments. It may also be applicable to general mapping of tabular data (e.g. screen-scraped procedurally, with XSL, or from annotated HTML TABLEs).

There are two flavours of relationally-stored data that we can access in this way. The first is a 'classic' RDBMS schema. The second is a more generic RDBMS structure that has been designed to represent arbitrary RDF data (see Storing RDF in a relational database by Sergey Melnik for some examples of this).

In this case study, we demonstrate both techniques. While the examples shown here are in a Win32 environment, similar systems can be constructed in Unix (where ODBC facilities are less consistently organised).

Basic idea

The basic concept is as follows. There are lots of reasons to store data in relational databases. Logic-based systems can work well for querying RDF. Some logic systems can talk to RDBMS data stores. So let's try to glue all these things together...

Ingredients

The XSB inference system (download v2.2 Win32 version), Microsoft Access, ODBC Control panel. Some data.

Instructions

Set up a simple database in Access and expose it as an ODBC datasource using the ODBC 32 control panel. Avoid choosing field names that contain spaces or other punctuation (since the SQL generated by the logic system doesn't account for this). Make a note of the ODBC name for this database (e.g. rdftest1).

Run XSB and connect to the database

[odbc_call].
odbc_open(rdftest1, anon, none).

Read the XSB man pages on the ODBC adaptor, or try variations on the following to associate logical relations in XSB with RDBMS fields.

Note: XSB (and the related PrologSQL code for SWI-Prolog) has two mechanisms for doing this, the view-level and relation-level interfaces. Here we experiment with the latter, a simple and crude mapping of a single table into logical relations. The former allows for more flexible generation of complex SQL statements, which allows more computation to be done on the remote ODBC host.

Example 1: Books

This creates a relation 'mybookrelation' in the inference engine corresponding to fields 'Title','Description' etc in the ODBC datasource we're connected to. The relational data we're connecting to has, in this case, not been managed as RDF data, so we need to take care of that mapping here.

odbc_import('Books'('Title','Description','Subject','Etc'),mybookrelation).

If the connection is successful, XSB will return 'yes'.

Example 2: RDF Triplestore

In this example, we show how to interface with a generic RDF store implemented using a relational database.

We show how to project RDF-based RDBMS structures into RDF statements. In this case we're connecting to a very generic RDBMS representation of RDF statements, so the mappings are simple.

| ?- odbc_import('sw1'('SID','pred','subj','obj','dt'),rdfstatement).

yes

Now we ask for properties of a given resource...

| ?- rdfstatement(S,P,'http://www.ilrt.bris.ac.uk/discovery/',O,D).

The XSB ODBC adaptor translates this into an SQL query...

SELECT rel1.SID , rel1.pred , rel1.subj , rel1.obj , rel1.dt
FROM sw1 rel1
WHERE rel1.subj = ?
;

And we get a result set acquired from the relational database, mapped into the logic representation.

S = 2
P = http://purl.org/dc/elements/1.1/title
O = "ILRT metadata research"
D = rdfs:Literal;

S = 3
P = http://purl.org/dc/elements/1.1/description
O = "An overview of metadata research at ILRT"
D = rdfs:Literal

no

Another query. What 'titles' are in the database? (remembering that the database itself contains full URIs, so we don't have to tell XSB about these).

| ?- rdfstatement(S,'http://purl.org/dc/elements/1.1/title',R,O,D).

SELECT rel1.SID , rel1.pred , rel1.subj , rel1.obj , rel1.dt
FROM sw1 rel1
WHERE rel1.pred = ?
;


S = 1
R = http://www.desire.org/
O = "The Desire website"
D = rdfs:Literal;

S = 2
R = http://www.ilrt.bris.ac.uk/discovery/
O = "ILRT metadata research"
D = rdfs:Literal;


Note: in the XSB interface press ';' followed by [enter] to cycle through result sets when they're being displayed on the interactive console.

Example 3: another representation of RDF

Here's an example session...

[xsb_configuration loaded]
[sysinitrc loaded]
[packaging loaded]

XSB Version 2.2 (Tsingtao) of April 20, 2000
[x86-pc-windows; mode: optimal; engine: chat; scheduling: batched]
| ?- [odbc_call].
[odbc_call loaded]
| ?- odbc_open(xsbdemo,none,none).

yes
| ?- odbc_import(rdfweb('ID',sid,predicate,subject,objectres,objval),rdfclaim).

yes
| ?- odbc_import(objvaldata('ID',objval,basedatatype,intval,strval,dateval,currval,language),objvaldata).

| ?- rdfclaim(ID,StatementID,Pred,Sub,ObjRes,ObjVal).

ID = 1
StatementID = 001
Pred = http://purl.org/dc/elements/1.1/title
Sub = http://rdfwebring.org/rdfweb/
ObjRes = NULL(null2)
ObjVal = 1;

ID = 2
StatementID = 002
Pred = http://purl.org/dc/elements/1.1/description
Sub = http://rdfwebring.org/rdfweb/
ObjRes = NULL(null3)
ObjVal = 2;

ID = 3
StatementID = NULL(null4)
Pred = http://purl.org/abc/event-start
Sub = var:uuid:3242626843265234
ObjRes = NULL(null5)
ObjVal = 3


Exploring Inference

Having reflected the ODBC datasources into an inference environment, we can now explore the use of logical rules expressed over the ODBC-accessible data.

For example, if the following Prolog rule is stored in a file (call it 'test1.P') and consulted with consult('test1.P'), the rules and data work together.

dctitle(S,O) :- rdfstatement(S,P,'http://www.ilrt.bris.ac.uk/discovery/',O,D).

Now we can ask XSB for 'dctitle' relationships instead of using the more verbose, RDF-statement version that we originally used when mapping the ODBC database.

dctitle(X,Y).

X = 2
Y = "ILRT metadata research"

yes

References

[Webize] Webizing existing systems, Tim Berners-Lee, 1998-2000

[MozInf] Mozilla Enabling Inference project

[QL98Inf] Enabling Inference QL-98 Position Paper, Guha, Lassila, Miller, Brickley

[PrologSQL] PrologSQL: SWI-Prolog to SQL bridge

Acknowledgements

Thanks to Jan Grant for comments on an even earlier draft :)

Further work...

Todo: 1. explore mapping of more generic RDBMS structures into the RDF data model. Datatyping, representation of sub-graphs, provenance etc. (See JanG's code snippet below.)

Layering/contexts/etc. idea from Jan:

triple(subclass, a, b, db).
triple(subclass, b, c, db).

triple(P, S, O, subclasschaining(X)) :- triple(P, S, O, X).
triple(subclass, S, O, subclasschaining(X)) :- triple(subclass, S, O1, X),
        triple(subclass, O1, O, subclasschaining(X)).

triple(P, S, O, subpropchaining(X)) :- triple(P, S, O, X).
triple(subproperty, S, O, subpropchaining(X)) :- triple(subproperty, S,
O1, X),
        triple(subclass, O1, O, subpropchaining(X)).

# Example query:

:- triple(P, S, O, subpropchaining(subclasschaining(db))).     

To-do list...

An elaboration of this document might explore the view-level rather than the relation-level SQL/logic interface. Look at common codebase issues w.r.t. SWI-Prolog (the PrologSQL library) and XSB. Find out about ODBC in the Linux/Unix environment (can we talk to MySQL, PostgreSQL, or just Oracle?). Look at datatyping (e.g. datetime in Event models, XML Schema datatypes). Suggestion from danc: move the triplestore stuff to an RDF geek appendix.

Last amended $date$ daniel.brickley@bristol.ac.uk