
Subject Portal Architecture Notes

Author: Dan Brickley

Date: Feb 6 2001
Latest version: http://ilrt.org/discovery/2001/02/sparch/

Abstract

This brief discussion document provides an overview of some issues and possible work items connected with ILRT's ongoing work on subject-based portal services. In particular, it places ILRT's subject portal work in the context of wider DNER policy issues, and attempts to find common and practical ground connecting the JISC-funded PDP and IMeshTk efforts.

Status of this Document

Rough draft for discussion. The layout leaves much to be desired; in particular, it should separate general analysis from proposed work items. This is a heap of ideas and observations regarding service portalisation technologies, not a coherent position paper or survey.

Introduction

ILRT have long been involved in JISC-funded resource discovery services (ROADS, SOSIG...). Recently JISC's activities in this area have been organised through the Resource Discovery Network (RDN), coordinated by the Distributed National Electronic Resource (DNER) office. ILRT are involved in two JISC-funded projects, PDP and IMeshTk, which aim to support the evolution of SOSIG-like services into a Web of collaborating, inter-communicating subject services. One component of this environment will be the subject portal. The role of a subject portal is to provide a (typically) Web-based user interface to a distributed collection of related information resources, typically by combining search (keyword and phrase matching) and hypertext (browse) UI conventions in a consistent manner across the collection. This document describes some possible technical, outreach and dissemination work items that ILRT/SOSIG might consider undertaking in support of these goals.

From Subject Gateways to Subject Portals

Much of the UK HE work on subject gateways was associated in some way with the eLib ROADS project (1995-1999), although alternative approaches, both commercial (eg. ADAM) and home-grown (eg. EEVL), have been explored. ROADS provides both an open source Internet cataloguing system and a Web-based search/browse interface for end users. Since active development on ROADS has largely ceased, services that might previously have considered using (and contributing to) ROADS are exploring a variety of other options. In addition, expectations regarding subject-based resource discovery services now outstrip the capabilities of a basic ROADS installation. In particular, many of the information aggregation and customisation features typically ascribed to "portals" are missing from the ROADS system. ROADS offers no personalisation / user profile tools, and only limited distributed/parallel search functionality. This document is not intended as a critique of ROADS, but takes the ROADS package as indicative of the facilities currently available in "off the shelf" free software.

The situation regarding freely available, standards-based "subject portal" tools at the time of writing (early 2001) creates a dilemma for organisations involved in managing and architecting such services. Various options are available including...

There are reasons for concern with each option. Commercial software can be hard to customise, can put services at the mercy of corporate policy decisions by the provider, and can provide little return on resources invested in development, configuration and customisation. Opensource tools can lack stability and support, can have an unpredictable development strategy and requirements process, and may lack the sophistication and 'polish' of some commercial offerings. Finally, a home-grown solution risks locking a service into maintaining an ad hoc system, through accidental drift into software development efforts that are not adequately resourced.

This is the situation we find ourselves in. The remainder of this document outlines some modest possible deliverables proposed for the PDP project, produced in the context of experience with ROADS-based systems during 1995-2000 and an analysis of metadata interoperability trends (eg. Open Archives, RSS) that have emerged more recently.

PDP

The PDP project offers two possible avenues for ILRT involvement: ILRT could either install, configure and evaluate OCLC Z39.50 search tools, or attempt to make progress using opensource software. Both approaches have their attractions. This document focusses more on the latter possibility, but should not be taken as asserting that the opensource option is the most appropriate one.

Commercial Subject Portal tools

@todo: list here what we (ILRT, SOSIG, DNER community generally?) might hope to get out of exploring commercial tools.

including...

Opensource Subject Portal tools

ILRT's PDP effort is committed to delivering a prototype of a (Z39.50-based?) search environment that provides a consistent interface to at least three Social Science data sources (SOSIG, REGARD, ....).
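By way of illustration, a minimal sketch of the kind of building block such a prototype might use, assuming the freely available Net::Z3950 Perl module; the target host, port, database name and query are placeholders, not details of any actual service:

    use strict;
    use Net::Z3950;

    # Connect to a (hypothetical) Z39.50 target; host, port and
    # database name below are placeholders only.
    my $conn = new Net::Z3950::Connection('z3950.example.ac.uk', 210,
                                          databaseName => 'socsci')
        or die "connection failed: $!";

    # A simple keyword search; a real portal would fan the same query
    # out to several targets in parallel and merge the result sets.
    my $rs = $conn->search('poverty')
        or die "search failed\n";

    printf "Found %d records\n", $rs->size();
    for my $i (1 .. ($rs->size() < 5 ? $rs->size() : 5)) {
        print $rs->record($i)->render();    # human-readable dump
    }

    $conn->close();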

Observations on Z39.50 and distributed search:

RSS for Subject-based newsfeed aggregation

This section sketches an RSS-based portal news system for the social sciences.

RSS, the RDF Site Summary format, is a lightweight XML-based data format designed for newsfeed and "what's new" message interchange amongst Web-based information services. The RSS 1.0 specification in particular has been designed to work well with other metadata specifications expressible in RDF/XML, including in particular the Dublin Core metadata element set. RSS enjoys increasing use amongst Web services offering portal-like information aggregation facilities, and was initially designed and deployed by Netscape (now part of AOL Time Warner) for their My Netscape portal. Such a portal (known as an RSS aggregator) typically maintains a catalogue of (classified) URLs pointing to RSS "feeds". An RSS feed is an XML file that encodes a description of the latest 10-15 "items" on some site. Depending on context these might be news articles, discussion group postings, job, course or training opportunity announcements, or information about recently updated pages on some Web site. RSS can provide a mechanism for disseminating 'table of contents' information, and support current awareness portal services organised by discipline, format or locality.
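For concreteness, a minimal sketch of an RSS 1.0 feed using the Dublin Core module; the feed URL and item details are invented for illustration:

    <?xml version="1.0" encoding="utf-8"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns="http://purl.org/rss/1.0/">

      <channel rdf:about="http://example.ac.uk/news.rss">
        <title>Example Gateway: What's New</title>
        <link>http://example.ac.uk/</link>
        <description>Recent additions to an example subject gateway.</description>
        <items>
          <rdf:Seq>
            <rdf:li rdf:resource="http://example.ac.uk/2001/02/item1"/>
          </rdf:Seq>
        </items>
      </channel>

      <item rdf:about="http://example.ac.uk/2001/02/item1">
        <title>New resource catalogued</title>
        <link>http://example.ac.uk/2001/02/item1</link>
        <dc:subject>Social welfare</dc:subject>
        <dc:date>2001-02-06</dc:date>
      </item>

    </rdf:RDF>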

Contrasting RSS and Z39.50

The rapid adoption of RSS both by content-providers and aggregators (portals) can be contrasted with the continuing relative obscurity of standards-based search tools, such as Z39.50.

A number of factors combine here...


Suggested Work Items (DRAFT)

Z39.50-based task list (draft)

An RSS-based task list (draft)

This sketch outlines a modest proposal for prototyping RSS-based portal facilities. It combines the newsfeed / current awareness theme with the need to explore customisation/personalisation facilities in subject portals. Note that the former theme is one that might successfully be addressed by a separately packaged software bundle, whereas incorporation of advanced customisation, user profiling and similar facilities is likely to assume some pre-existing user profile and authentication system (such as SOSIG's Grapevine database). Work on the latter aspects of RSS-based portal tools should consequently focus on usability issues rather than software shrinkwrapping, and in particular seek metrics for evaluating the value and usability of (RSS-based) personalisation facilities.
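To make the personalisation theme concrete, a minimal sketch of the kind of per-user channel selection involved; this is purely illustrative Perl, assumes nothing about how Grapevine actually stores its profiles, and uses invented usernames and feed URLs:

    # Illustrative only: each user's profile records the RSS channels
    # they have selected. In a real portal this would live behind the
    # existing profile/authentication system, not in an in-memory hash.
    my %subscriptions = (
        alice => [
            'http://example.ac.uk/sociology/news.rss',
            'http://example.org/jobs/socsci.rss',
        ],
        bob => [
            'http://example.ac.uk/politics/news.rss',
        ],
    );

    # Rendering a personalised page then reduces to aggregating just
    # the feeds listed in @{ $subscriptions{$username} }.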

The guiding observation here is that with RSS, simple yet useful things are known to be readily achieved using freely available tools (such as the XML::RSS Perl library). A static (ie. non-personalised) "pseudo-portal" page built from a list of a dozen RSS feeds can be created in an afternoon's work. The goal would be to explore additional layers of sophistication and functionality on top of this "trivial but useful" core.
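As a sketch of that "trivial but useful" core, the following uses the XML::RSS library mentioned above, plus LWP::Simple for fetching; the feed URLs are invented placeholders:

    use strict;
    use XML::RSS;
    use LWP::Simple qw(get);

    # A hand-maintained list of feeds; a real portal would read these
    # from its (classified) catalogue of RSS channels.
    my @feeds = (
        'http://example.ac.uk/sociology/news.rss',
        'http://example.org/whatsnew.rss',
    );

    print "<html><head><title>Pseudo-portal</title></head><body>\n";
    for my $url (@feeds) {
        my $xml = get($url) or next;             # skip unreachable feeds
        my $rss = XML::RSS->new;
        eval { $rss->parse($xml); 1 } or next;   # skip malformed feeds

        print "<h2>", $rss->{channel}{title}, "</h2>\n<ul>\n";
        for my $item (@{ $rss->{items} }) {
            print qq{<li><a href="$item->{link}">$item->{title}</a></li>\n};
        }
        print "</ul>\n";
    }
    print "</body></html>\n";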

Summary / Conclusions...

@todo ;-)

Appendix A: RSS/XHTML HOWTO Case Study

This informal case study explains something of how the W3C "What's New" information is syndicated using RSS, yet managed at W3C solely using a (carefully structured) HTML page. A generalisation of this approach might provide one deployment strategy in support of wide-scale RSS adoption. Other options (such as screenscraping and the use of add-ons for Content Management Systems (CMS)) would also be needed.

The following note is an unreconstructed post from some Bristol mailing lists; @@todo. While this note is targeted at a technical audience, simpler and more practical guidance would be needed for general use.


From Daniel.Brickley@bristol.ac.uk Tue Feb  6 14:02:39 2001
Date: Fri, 17 Nov 2000 15:02:29 +0000 (GMT)
From: Dan Brickley 
To: bris-www@bristol.ac.uk
Cc: ilrt-tech@bristol.ac.uk
Subject: XHTML, XSLT, and RSS/RDF: W3C home page case study



bris-www,

I was just looking at the Bristol Uni homepage for the first time in a
while, noticed the 'news' items at the top of the page, and thought
some of you might be interested in a technique we're using on the W3C (World
Wide Web Consortium) home page to transform human-oriented HTML markup
into machine-oriented RDF Site Summary files. These can then flow
around the network and be understood by RSS aggregators, such as
O'Reilly's Meerkat system or ILRT's SOSIG [1]. Why? More people see the
news items, because other websites know how to use RSS data feeds
and point people at the source website for the 'full story'. The
testbed we've built for the W3C home page seems to be working well, so I
thought I'd describe what we've done in case it makes sense to try
something similar locally.

Context: RSS is a lightweight news syndication format; recently
some of us have been working (see proposal at [3], lively discussions at
[4]) to recast RSS in a more flexible, modular fashion. The brief note
at [5] shows how the new RSS spec allows you to syndicate and query job
descriptions, CVs and other structured descriptions. I don't have space
here to explain what XML is in any detail, so advance apologies if this
is too geeky for some folk.


Anyhow, here's what we do.

First, the W3C home page[2] is managed as XHTML rather than HTML. This
is HTML with some tweaks to make it count as 'well formed' XML markup. Do a
'view source' on [2] to see the details.
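
To give the flavour of those tweaks (an invented fragment, not taken
from the actual page): plain HTML like

    <P>News<BR>
    <IMG SRC=logo.gif>

becomes, as XHTML,

    <p>News<br />
    <img src="logo.gif" alt="logo" /></p>

ie. lowercase element names, quoted attributes, and every element
explicitly closed.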

Secondly, we adopt a profile or flavour of XHTML that uses some
additional conventions to indicate the structure of the 'latest news'
section of the page. For this we use the 'class' attribute familiar
from CSS use, eg. class="item" (or similar) for a news item,
class="date" for the date, and rel="details" to indicate the hyperlink
that points to the full details of the news item. These conventions
(which might be re-invented differently on different sites) are
documented on the W3C site as part of the Semantic Web development --
see notes and online converter at [6].
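
Roughly, a marked-up news entry looks something like this (an invented
fragment following the conventions described, not the actual W3C
markup):

    <div class="item">
      <span class="date">2001-02-06</span>
      New technical report published:
      <a rel="details" href="http://www.w3.org/TR/example/">Example spec</a>
    </div>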
Third, there's an XSLT stylesheet that, when applied to the XHTML
content of the W3C home page, outputs a different style of XML markup
more oriented towards machine consumption, ie. RSS (RDF Site Summary).
XSLT is an XML-to-XML transformation system designed to provide an
ultra-flexible style sheet / presentation mechanism for XML. It's
basically a quirky programming language, and is therefore flexible
enough to do the fairly complex transformation of the W3C homepage
into a structured XML newsfeed. For the curious, [8] is the XSLT file
that achieves this. For the application-minded folk who don't care to
read XSLT files, [8] can be re-used on other sites with little/no
knowledge of XSLT details. If one wanted to adopt a different style of
representing news items in XHTML, the XSLT transform would be slightly
different.

The W3C site happens to be configured so that anyone looking at the
RSS homepage URL, ie [7], will see the result of this XSLT
transformation. The transformation is accomplished by a generic Java
program that takes XSLT + markup as input, and spits out new XML
markup in return.

What does this buy us? For relatively low effort, we have a mechanism
for automatically disseminating W3C press releases, announcements etc.
to a Web of consumer applications that re-process this data and expose
it to more eyeballs. In practical terms, it gives us a URL, [7], that
can be called to get a machine-processable snapshot of the current
items of interest on the W3C home page. For example, [9] is the
Meerkat RSS aggregator's view of the W3C RSS newsfeed. Another
example: O'Reilly's "industry portal" for XML developers, XML.com, has
a "developer news" section on the front page, derived from an RSS feed
supplied by xmlhack.com. A third example: ILRT's Social Science
Information Gateway includes, as part of the Grapevine user-profiles
section, the ability to add 'news channels'. These are collected from
various sources using RSS/XML.

We're starting to see a marketplace emerge for supplying and consuming
RSS data -- see the recent O'Reilly press release about the RSS 1.0
proposal ([10]). If this turns out to blossom rather than go the way
of 'push technology' and VRML, I reckon it'd be fun to try using it
within a University context, eg. to aggregate 'news' listings from the
various departmental webservers.

The idea is that one can author/manage a site in a way that is only
slightly more complex or restricted than before. Then, by using new
stuff like XHTML, XSLT and RSS/RDF, we can mechanise certain things
that previously had to be done by hand. This makes for a cheap way of
disseminating new/interesting items on one's web site, and for a cheap
way of tapping into newsfeeds from other sites...
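
To give the flavour of the XSLT involved, here's a cut-down sketch
rather than the actual [8] stylesheet; it assumes the invented
class="item" / rel="details" markup shown above, and a real feed would
also need a channel block:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xhtml="http://www.w3.org/1999/xhtml"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns="http://purl.org/rss/1.0/">
      <xsl:output method="xml" indent="yes"/>

      <!-- Wrap the output in an RDF envelope, then turn each XHTML
           element marked class="item" into an RSS item. (A complete
           RSS 1.0 feed would also emit a channel element here.) -->
      <xsl:template match="/">
        <rdf:RDF>
          <xsl:apply-templates select="//xhtml:div[@class='item']"/>
        </rdf:RDF>
      </xsl:template>

      <xsl:template match="xhtml:div[@class='item']">
        <item rdf:about="{xhtml:a[@rel='details']/@href}">
          <title>
            <xsl:value-of select="normalize-space(xhtml:a[@rel='details'])"/>
          </title>
          <link>
            <xsl:value-of select="xhtml:a[@rel='details']/@href"/>
          </link>
        </item>
      </xsl:template>
    </xsl:stylesheet>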
Dunno if this is of interest in these parts, but I've been having
great fun hacking about with this stuff so thought I'd report back :-)

Dan

[1] http://www.oreillynet.com/meerkat/
    http://www.sosig.ac.uk/
[2] http://www.w3.org/
[3] http://purl.org/rss/1.0/
[4] http://www.egroups.com/group/rss-dev/
[5] http://www.ilrt.bris.ac.uk/discovery/2000/11/rss-query/
[6] http://www.w3.org/2000/01/sw/ -> http://www.w3.org/2000/08/w3c-synd/
[7] http://www.w3.org/2000/08/w3c-synd/home.rss
[8] http://web4.w3.org/2000/08/w3c-synd/home2rss.xsl
[9] http://www.oreillynet.com/meerkat/index.php?&c=4743&t=ALL
[10] http://www.oreillynet.com/pub/a/mediakit/pressrelease/20000828.html