Inkling Architectural Overview

Author

Libby Miller <libby.miller@bristol.ac.uk> 2001-07-11

Abstract

This is an overview of the architecture of Inkling, an implementation of the SquishQL query language for RDF in Java.

Introduction

SquishQL is a simple SQL-like query language for RDF. Inkling is the first pass at an implementation of Squish in Java. The aim behind it was to have a query engine that could be used with almost any RDF database implementation, and that could be used for experimenting with the SquishQL query language. It has succeeded in being useful for this limited purpose; this document represents an overview of the architecture which will feed into implementations with greater focus on efficiency.

SquishQL

SquishQL is a simple, triples-based query language for RDF, which is designed to be human readable ('SQL-ish'). It was strongly influenced by Guha's RDFdb query language [1], and the paper 'Enabling Inferencing' [2]. A number of people and organisations have either developed similar query languages for RDF independently (Geoff Chappell [3]) or used and implemented Squish (Colin Britton, Andy Seaborne). There's also the Algae triples-based query language which predates SquishQL and has similar functionality [4].

SquishQL is:

simple:

It does not implement transitive closure for classes and properties as RQL does [5].

triples-based:

Queries are conjunctions of triple patterns (?p ?s ?o), in which each position is either a variable or a value to match. SquishQL does not support arbitrary boolean combinations of AND and OR. The triples-based format keeps SquishQL very close to the RDF model, although there is no RDF/XML version at present.

human-readable:

It follows the SQL convention of SELECT....variables....FROM....datasource(s)....WHERE....constraints. Currently it also has AND clauses for more specific SQL-like constraints (~, +, >, <). It also has a 'USING' clause which enables you to use abbreviated namespaces in the queries.

Inkling

JDBC

Inkling uses the JDBC interfaces to make SquishQL queries. This enables the implementation to be fairly independent of the database to be searched, and also means that Java programmers will probably be familiar with the means of accessing the queries.

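Using JDBC, in order to make a query you load a driver, ask the DriverManager for a Connection, create a Statement from that connection, call executeQuery on the statement, and read the results from the ResultSet it returns.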

To those who don't use JDBC this may seem a little baroque; however, its familiarity to most Java programmers means that it is probably more useful to implement it in this way rather than using a new API. Analogously, it is envisaged that a Perl implementation of SquishQL could use the Perl DBM APIs.
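
The standard JDBC calls involved can be sketched as follows. This is only an illustration: the class name JdbcQuerySketch, the helper titles and the query text are assumptions made for the example (the query just follows the SELECT/FROM/WHERE/USING shape described earlier), and obtaining the Connection is left to the caller because the driver class name and database URL depend on the installation.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class JdbcQuerySketch {
    // Illustrative query only; the exact SquishQL text is an assumption.
    static final String QUERY =
        "SELECT ?title "
      + "FROM http://example.com/test.rdf "
      + "WHERE (dc::title ?doc ?title) "
      + "USING dc FOR http://purl.org/dc/elements/1.1/";

    /** Runs the query on an already-open connection and collects one column. */
    static List<String> titles(Connection conn) throws SQLException {
        List<String> out = new ArrayList<>();
        // Statement and ResultSet are closed automatically at the end of the block.
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(QUERY)) {
            while (rs.next()) {
                // Assumes the column is named after the variable, without the '?'.
                out.add(rs.getString("title"));
            }
        }
        return out;
    }
}
```

Only java.sql interfaces appear in the calling code, which is the point of the design: the caller does not need to know which RDF database sits behind the Connection.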

Making a query

Driver

The JDBC process of making a query gives us a framework to look at the Inkling implementation of JDBC for SquishQL.

The Driver is a very simple implementation of the java.sql.Driver interface which:

These methods are called by Java's DriverManager.

Connection

The Connection object is a simple implementation of the java.sql.Connection interface, and is created by the Driver using an instance of the database implementation class and the name of the database implementation class.

public Connection(Object source, String handler)

The only other method I implemented was

public java.sql.Statement createStatement()

which creates a Statement object using the database implementation class and the name of the database implementation class.

Statement

The Statement class itself is an implementation of java.sql.Statement, which used to be an instance of the query engine QE itself, but now just creates an instance of it when called (which means that a new instance is created each time a statement is created). For this reason there are a bunch of redundant methods in this class which used to be used when it was a QE. Its only used function now (apart from instantiation, which creates the query engine QE) is:

public java.sql.ResultSet executeQuery(String query)

which simply delegates to the query engine with return qe.executeQuery(query); and so returns a java.sql.ResultSet object.

Query Engine QE.java

This class does the bulk of the query work. It implements java.sql.Statement, though that's not used any more.

There are instances of ParsedQuery and Resultset associated with the class, plus a triples root, to which the query triples are added, and a source object - an RDF database.

It implements several methods to do with parsing, which ParsedQuery now does.

The basic process is illustrated by

public java.sql.ResultSet executeQuery(String query)

which is like a main method.

First the query is parsed:

pq = ParsedQuery.parse(query);

and various useful variables are set from the ParsedQuery

variables = pq.variables; ns = pq.ns; Vector v = pq.triples;

then we look for a source graph; if there isn't one we try to download the urls from the query:

if (pq.graphs.size() != 0) {
    sourceHandler = "org.desire.rudolf.query.modelCore.MemModelCore";
    source = org.desire.rudolf.query.DownloadUrls.getUrls(pq.graphs);
}

we sort the triples

org.desire.rudolf.rdf.Triple sorted = sort(v);

this:

Getting the top query and ordering the queries is currently redundant because the queries are made in a naive manner: each subquery is asked as it stands, even if all its variables are null. No attempt is made at the moment to fill in values in other subqueries as they are found.

then we make the query:

RDFModelCore rdftest = makeQuery(sorted, rests, source);

rests is an empty triple which is now redundant; source is the source RDF database, sorted is the Triple containing sorted queries.

There's a certain amount of redundancy here due to the organic way this class has developed.

then we check for the interface QueryDirect:

if(source instanceof org.desire.rudolf.query.QueryDirect){

which means we can ask the query directly and get back a java.sql.ResultSet

try {
    finalRows = qd.makeQuery(pq);
} catch (Exception er) {
    ErrorLog.write("QE: could not make query " + er);
}

otherwise we loop through the subqueries stored in sorted, create a triple query to ask and ask it

sql = (SQLHandlerInterface) source;
results = sql.queryDatabase(tmp, s, p, o);

tmp is an RDFModelCore to put the results of the query in (you don't get this if you use QueryDirect; also I'm not sure exactly what you get back here - it's just what comes back from the database).

We then create a resultset from the results Vector we get back (a list of triples)

java.sql.ResultSet res = makeResultSet(results, query);

then smoosh with previous resultset if appropriate:

finalRows = smooshResultSets((java.sql.ResultSet) finalRows, res, i);
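
One plausible reading of the smooshing step is a join on shared variables: rows from the two result sets are combined only when they agree on every variable they have in common. The sketch below illustrates that idea on plain maps of variable name to value. It is not the code of smooshResultSets; the class name SmooshSketch and the use of Map and List instead of java.sql.ResultSet are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SmooshSketch {
    /** Joins two lists of variable bindings, keeping combinations that agree on shared variables. */
    static List<Map<String, String>> smoosh(List<Map<String, String>> left,
                                            List<Map<String, String>> right) {
        List<Map<String, String>> joined = new ArrayList<>();
        for (Map<String, String> l : left) {
            for (Map<String, String> r : right) {
                if (compatible(l, r)) {
                    Map<String, String> merged = new LinkedHashMap<>(l);
                    merged.putAll(r);
                    joined.add(merged);
                }
            }
        }
        return joined;
    }

    /** True if no variable is bound to different values in the two rows. */
    static boolean compatible(Map<String, String> l, Map<String, String> r) {
        for (Map.Entry<String, String> e : r.entrySet()) {
            String bound = l.get(e.getKey());
            if (bound != null && !bound.equals(e.getValue())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Map<String, String>> titles = List.of(
            Map.of("doc", "d1", "title", "Inkling"),
            Map.of("doc", "d2", "title", "Other"));
        List<Map<String, String>> authors = List.of(
            Map.of("doc", "d1", "author", "Libby"));
        // Only the d1 rows share a value for "doc", so one merged row survives.
        System.out.println(smoosh(titles, authors).size());
    }
}
```

This nested loop compares every pair of rows, which matches the naive strategy described above; an engine that substituted known values into later subqueries would avoid most of that work.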

Getting back to executeQuery, we then applyConstraints (things like ~, >, <, =), and removeSurplusVariables (variables we didn't ask for the value of), if we didn't do a direct query.
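
The two post-processing steps can be illustrated on the same kind of rows (maps of variable name to value). This is a sketch rather than the Inkling code: the class name PostProcessSketch and the helper like are hypothetical, and treating ~ as a case-insensitive substring match is an assumption about what that constraint means.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class PostProcessSketch {
    /** Keeps only the rows that satisfy the constraint. */
    static List<Map<String, String>> applyConstraint(List<Map<String, String>> rows,
                                                     Predicate<Map<String, String>> constraint) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> row : rows) {
            if (constraint.test(row)) {
                kept.add(row);
            }
        }
        return kept;
    }

    /** Drops every variable that was not asked for in the SELECT clause. */
    static List<Map<String, String>> removeSurplusVariables(List<Map<String, String>> rows,
                                                            List<String> wanted) {
        List<Map<String, String>> projected = new ArrayList<>();
        for (Map<String, String> row : rows) {
            Map<String, String> slim = new LinkedHashMap<>();
            for (String name : wanted) {
                if (row.containsKey(name)) {
                    slim.put(name, row.get(name));
                }
            }
            projected.add(slim);
        }
        return projected;
    }

    /** Assumed meaning of "?var ~ text": case-insensitive substring match. */
    static Predicate<Map<String, String>> like(String variable, String text) {
        return row -> {
            String value = row.get(variable);
            return value != null && value.toLowerCase().contains(text.toLowerCase());
        };
    }
}
```

Constraints have to be applied before the surplus variables are removed, because a constraint may refer to a variable that was not selected.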

Finally we return the class variable finalRows Resultset as java.sql.ResultSet.

SQLHandlerInterface

This is an extremely basic interface which enables the query engine QE.java to talk to a number of different Java RDF APIs. It consists of one method:

public Vector queryDatabase(RDFModelCore rdf, Resource r, Property p, RDFnode n);

which all databases have to implement; essentially a very simple triplesWhere(p,s,o) method where any of the three parts of the query can be null. It's misnamed (any RDF database can implement it, not just SQL ones). It's also inefficient and rather clumsy because it uses a deprecated set of objects for nodes, graphs and properties. These were part of Sergey Melnik's RDF API, but he created a whole new set of equivalents in a different package and no longer uses these, which caused awkwardness when trying to make Inkling work with Sergey's RDF implementation. It's also inefficient because, although most applications implement a 'triplesWhere' type query method, this isn't necessarily the best way of getting the data out; in particular, where the values of two variables are known, common in-memory implementations can have faster methods (although this may be part of the implementations anyway, so possibly a moot point).
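
The null-as-wildcard matching behind a triplesWhere style method can be sketched in a few lines. The example uses plain String arrays of (subject, predicate, object) instead of the Resource, Property and RDFnode classes, and the names TripleStoreSketch, triplesWhere and matches are made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TripleStoreSketch {
    /** Returns every triple matching the pattern; a null part matches anything. */
    static List<String[]> triplesWhere(List<String[]> store, String s, String p, String o) {
        List<String[]> out = new ArrayList<>();
        for (String[] t : store) {
            if (matches(s, t[0]) && matches(p, t[1]) && matches(o, t[2])) {
                out.add(t);
            }
        }
        return out;
    }

    static boolean matches(String pattern, String value) {
        return pattern == null || pattern.equals(value);
    }

    public static void main(String[] args) {
        List<String[]> store = Arrays.asList(
            new String[] {"d1", "dc:title", "Inkling"},
            new String[] {"d1", "dc:creator", "Libby"},
            new String[] {"d2", "dc:title", "Other"});
        // All titles, whatever the subject or object.
        for (String[] t : triplesWhere(store, null, "dc:title", null)) {
            System.out.println(String.join(" ", t));
        }
    }
}
```

The linear scan shows why a single per-triple method is easy for any database to implement, and also why it leaves no room for the database to use an index when two of the three parts are known.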

The RDFModelCore is passed to the method because for certain implementations it was useful to have an RDF representation of the results. Filling this RDFModelCore is inefficient when converting from other APIs such as Jena or the Stanford API.

There are various other problems with the simplicity of the interface, for example, if there are optimisations the database can perform for certain kinds of queries such as reification, bags and sequences, then this per-triple interface won't be useful. One example of how we got around this is by passing the entire query to the underlying database using the QueryDirect interface below.

Databases

I've provided simple SQLHandlerInterface implementations for RDFModelCore, Jena's ModelMem.java (MemModelJena.java), and Sergey Melnik's ModelImpl.java (MemModelMelnik.java), and also written SQL accessing versions of all three. In all these cases I just extended the original implementations so that they would implement SQLHandlerInterface and also the org.w3c.rdf.RDFConsumer (now deprecated class I think) methods for use with SiRPAC. These are:

public void assert(org.w3c.rdf.DataSource ds, org.w3c.rdf.Property p, org.w3c.rdf.Resource r, org.w3c.rdf.RDFnode v);

public void start(org.w3c.rdf.DataSource ds);

public void end(org.w3c.rdf.DataSource ds);

There are various inefficiencies with these implementations to do with converting from one Node, Property, Graph format to another.

Parser

ParsedQuery is a representation of the query in an abstract form. It is invoked by calling the static method parse with a String query, returning a String. It can parse SquishQL queries and Algae queries, and it also contains a rough, old and incomplete implementation of parsing rdfpath queries.

public static Vector triples = new Vector();

public static Vector variables = new Vector();

public static Vector constraints = new Vector();

public static Vector graphs = new Vector();

public static Hashtable ns = new Hashtable();

ResultSet

This is my (partial) implementation of the java.sql.ResultSet interface. It's an extension of Vector, and contains a Hashtable of rows, keyed by the variable name (without the ? - this is a recent change to make it consistent with the ResultSet you get back from SQL databases, which don't allow ?s in variable names).

I've also recently added a columns Vector and a columnNames Vector. The first is a list of values, i.e. all the values from the hashtable. I'm not sure about this at all... The second is a list of column names, i.e. the variable names without the ?, which is useful to know.

There's an extra method

public Vector getColumnNames()

which gets the full list of column names (although it's more useful to be able to get the column names you asked for. Need to implement this.)

ResultSetMetaData

This is a useful java.sql interface I've only just come across. The constructor takes a java.sql.ResultSet and I've implemented

public int getColumnCount()

public String getColumnLabel(int column)

public String getColumnName(int column)

This means that you can loop through the number of columns, and get their corresponding names. I think there's also the possibility that we could do resource/literal typing here:

public String getColumnTypeName(int column)
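
Looping over the columns with these methods might look like the sketch below. It uses only the standard java.sql.ResultSetMetaData interface; the class name ColumnNamesSketch is made up, and counting columns from 1 follows the general JDBC convention, which is an assumption as far as this implementation is concerned.

```java
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class ColumnNamesSketch {
    /** Collects the column names in order, using JDBC's 1-based column index. */
    static List<String> columnNames(ResultSetMetaData md) throws SQLException {
        List<String> names = new ArrayList<>();
        for (int i = 1; i <= md.getColumnCount(); i++) {
            names.add(md.getColumnName(i));
        }
        return names;
    }
}
```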

SQL converter

PQ2SQL.java (Parsed query to SQL) is a class which takes a ParsedQuery and returns an SQL query string. I adapted it from a PHP script by Matt Biddulph <matt@picdiary.com>. The main method is public static String convert(ParsedQuery p)

which creates a new instance of PQ2SQL, processes some parts of the Parsed query and passes them to

public String triple_sql(boundvars,specvars,clauses,constraints);

To use triple_sql you need, at a minimum, the full list of variables (boundvars, i.e. not just the ones you want returned), the variables you want returned (specvars), and a vector of arrays of clauses (pred, sub, obj). Constraints are optional, and are passed as pseudo-triples at the moment, e.g. (~ ?y bla). This is the only important method.
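
To give a feel for what such a conversion involves, here is a minimal sketch that turns (pred, sub, obj) clauses into an SQL self-join. It is not the PQ2SQL code: the class TripleSqlSketch, the method toSql and the table layout triples(subject, predicate, object) are all assumptions made for the example, and constraints are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TripleSqlSketch {
    /** Builds a self-join over a hypothetical triples(subject, predicate, object) table. */
    static String toSql(List<String> selectVars, List<String[]> clauses) {
        String[] columns = {"predicate", "subject", "object"}; // clause order: pred, sub, obj
        Map<String, String> firstUse = new LinkedHashMap<>();  // variable -> column that first binds it
        List<String> from = new ArrayList<>();
        List<String> where = new ArrayList<>();
        for (int i = 0; i < clauses.size(); i++) {
            String alias = "t" + i;
            from.add("triples " + alias);
            for (int j = 0; j < 3; j++) {
                String term = clauses.get(i)[j];
                String column = alias + "." + columns[j];
                if (term.startsWith("?")) {
                    // A repeated variable becomes a join condition on its first column.
                    String earlier = firstUse.putIfAbsent(term, column);
                    if (earlier != null) {
                        where.add(column + " = " + earlier);
                    }
                } else {
                    // Quoting is kept simple here; real code should use PreparedStatement parameters.
                    where.add(column + " = '" + term.replace("'", "''") + "'");
                }
            }
        }
        List<String> select = new ArrayList<>();
        for (String v : selectVars) {
            String column = firstUse.get(v);
            if (column == null) {
                throw new IllegalArgumentException("unbound variable " + v);
            }
            select.add(column + " AS " + v.substring(1));
        }
        String sql = "SELECT " + String.join(", ", select) + " FROM " + String.join(", ", from);
        return where.isEmpty() ? sql : sql + " WHERE " + String.join(" AND ", where);
    }
}
```

Each clause gets its own alias of the triples table, so a query with n clauses becomes an n-way self-join in which shared variables supply the join conditions. This is why the full list of variables is needed and not just the ones to be returned.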

QueryDirect

An interface which we can test for on a given RDF database to see if it can process queries itself. Its single method is

public java.sql.ResultSet makeQuery(ParsedQuery p)

PQ2SQL.java implements it.
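
The capability test itself is just an instanceof check followed by a fallback. The sketch below shows the pattern with stand-in types: DirectSource plays the role of QueryDirect, queries and results are plain Strings, and all the names are hypothetical.

```java
public class QueryDirectSketch {
    /** Stand-in for QueryDirect: a source that can answer a whole query itself. */
    interface DirectSource {
        String answer(String query);
    }

    /** Uses the direct route when the source supports it, otherwise falls back. */
    static String execute(Object source, String query) {
        if (source instanceof DirectSource) {
            return ((DirectSource) source).answer(query);
        }
        // In the real engine this branch would ask each subquery one triple at a time.
        return "fallback:" + query;
    }
}
```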

Scutter

A class for inserting RDF from XML files into an SQL RDF database. It can be called from the command line: org.desire.rudolf.query.Scutter [db url] [scutter file name]

e.g.

org.desire.rudolf.query.Scutter rdf:jdbc:postgresql://localhost:5432/test?auth=password&user=postgres&password=notneeded scutterplan.txt

which are the defaults. The scutterplan.txt file should be in the top level directory, and contain one url per line like this:

./scutter http://example.com/test.rdf

Scutter reads in the file, takes all the urls into a Vector and calls insertData for each one.

This in turn calls getDB, which downloads the RDF data and puts it in a file; how the data is handled depends on its MIME type. It uses different MIME-type handlers, e.g. org.desire.rudolf.query.TextHandler, which implement MimeHandler.java, an interface with a single method:

public RDFModelCore getRDF(URL Resource);

SiRPAC's own downloading is bypassed because of an implementation I did which got the data from jpgs, which caused SiRPAC to fail.

Once insertData has the in memory RDF database and an SQLModelCore object (an implementation of modelcore for SQL), it first deletes all data from the source URI from the database, and then calls assert(rdf, rdfuri); on the SQL database, where rdf is the RDFModelCore of data, and rdfuri is the source URI.

This method is not database-independent.

There is also a similar method that can be accessed from other classes (for example jsps).

public static boolean insertFromJsp (String urlpre, String dburl, String newUrl)

newUrl is the RDF url to add to the database or the text; dburl is the database identifier url; urlpre is the identifier url for the RDF, which is used to stop SiRPAC crashing if we are parsing text.

This always uses scutterplan.txt as its list, which it adds the url to. It can also insert chunks of RDF/XML text directly using

insertText(urlpre,newUrl, dburl);

Issues

cross-searching ....

References

[1] http://web1.guha.com/rdfdb/

[2] http://www.w3.org/TandS/QL/QL98/pp/enabling.html

[3] http://www.intellidimension.com/RDFGateway/Docs/RDFQLmanual.asp

[4] http://www.w3.org/2000/Talks/05/18-perl-RDF-lib/Overview.html

[5] http://139.91.183.30:9090/RDF/RQL/