
XMLArmyKnife


May 25, 2006

Keep On Truckin'

Reading Danny's post earlier reminded me that I'd been meaning to do an online version of John Cowan's TagSoup parser.

I've often found the W3C HTML Tidy service quite useful for fixing up dodgy HTML so that data can be scraped from it using XSLT. However, Tidy sometimes can't handle all the bizarre HTML variants thrown at it, whereas TagSoup's "Keep On Truckin'" philosophy seems to let it deal with a wider array of problems. (This is anecdotal evidence; I've not done a true comparison of the tools.)

Anyway, I've packaged up TagSoup so you can try it for yourself. Here's the documentation. The service supports all the same parameters as the TagSoup command line.

It even supports Pyx! Here's Danny in line-mode.

Posted by ldodds at 09:32 PM

User Agent

FYI, I've tweaked the code behind the XMLArmyKnife RDF services so that requests made by them are properly advertised using the User-Agent header.

At the moment the UA string is XMLArmyKnife.org RDF Services; http://xmlarmyknife.org. So if you want to monitor use and abuse you should now be able to. If you have concerns about heavy use let me know and perhaps we can discuss local mirroring of the data of interest. I'm doing this already for some sources.
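For anyone writing a client with similar manners, here's a minimal Python sketch of advertising a User-Agent on outgoing requests. The UA string is the one quoted above; the `build_request` helper and the example.org URL are purely illustrative, and this is not the code running behind the services.

```python
from urllib.request import Request

# UA string as quoted in the post.
USER_AGENT = "XMLArmyKnife.org RDF Services; http://xmlarmyknife.org"


def build_request(url: str) -> Request:
    """Return a request that identifies its sender (hypothetical helper)."""
    return Request(url, headers={"User-Agent": USER_AGENT})


# Building the request does not touch the network; it would only be sent
# if passed to urllib.request.urlopen().
req = build_request("http://example.org/data.rdf")
```

Site owners can then filter their access logs on that string to see how much traffic a service is generating.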

Posted by ldodds at 02:02 PM

Experimenting with EmbeddedRDF and GRDDL Support

Embedded RDF is a method of embedding (a subset of) RDF within XHTML and HTML documents. A simple XSLT transformation can be used to extract the RDF from within the document.

A related and more generalised technology is GRDDL which defines how to associate transformation algorithms (i.e. XSLT stylesheets) with XHTML profiles or microformats so that there's a clear mapping from embedded metadata into RDF.

I've been experimenting with adding support for both of these technologies in the XMLArmyKnife SPARQL query service. This provides a means to directly query RDF embedded in XHTML documents.

The mechanism works as follows. When the service retrieves some remote data it checks the Content-Type of the response. If it's application/xhtml+xml or text/html it applies the following rules; otherwise it's business as usual and the content will be parsed as RDF.

If possible the retrieved content is parsed as XML and then inspected to discover the XHTML profiles associated with the content.

If the profile URIs include that of Embedded RDF, it applies a suitable stylesheet to retrieve some triples.

It also looks for the GRDDL data view profile. If that profile is found, then the processor tries to find any additional transformations associated with the document by its author. This mechanism is defined in the GRDDL Profile for XHTML. Essentially it just looks for all link elements with rel="transformation" attributes. If it finds any, each is applied in turn.

The end result is a single chunk of RDF which is then made available to the SPARQL query as normal.
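The discovery steps above can be sketched in a few lines of Python. This is only an illustration of the GRDDL part of the logic, not the Jena-based code behind the service: the function name is made up, the Embedded RDF branch is left out, and actually applying the stylesheets is left to the caller.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"
HTML_TYPES = {"application/xhtml+xml", "text/html"}


def find_transformations(content_type: str, body: str) -> list[str]:
    """Return stylesheet URIs licensed by the GRDDL profile, or [] if none apply."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type not in HTML_TYPES:
        return []  # business as usual: the caller parses the body as RDF
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return []  # not well-formed XML, so no profiles can be discovered
    head = root.find(f"{XHTML_NS}head")
    if head is None:
        return []
    # The profile attribute holds a whitespace-separated list of URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    return [
        link.get("href")
        for link in head.iter(f"{XHTML_NS}link")
        if "transformation" in link.get("rel", "").split() and link.get("href")
    ]


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Example</title>
    <link rel="transformation" href="http://example.org/extract.xsl"/>
    <link rel="stylesheet" href="style.css"/>
  </head>
  <body/>
</html>"""

transforms = find_transformations("application/xhtml+xml; charset=utf-8", SAMPLE)
```

Each URI returned would then be fetched and applied to the document, with the resulting RDF merged into one graph.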

The mechanism allows you to write queries such as this:


PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?blog 
FROM <http://iandavis.com>
WHERE 
{<http://iandavis.com/#ian> foaf:weblog ?blog.}

...which discovers Ian Davis's blogs by querying his homepage.

A similar technique can be applied to directly query Dan Connolly's homepage to discover the dates he's attending the WWW2006 conference.

Dan's homepage is interesting as he's combined Embedded RDF with hCalendar. It nicely illustrates that you can merge together multiple RDF views of the same source page, as well as demonstrating that SPARQL can be applied to microformat content very easily. All that's needed is a suitable XSLT stylesheet. Danny wrote a nice posting looking at Microformats on the GRDDL which makes useful background reading.

The current code needs to be generalised to support arbitrary profile URIs. The GRDDL specification outlines a more general solution to discovering transformations by dereferencing the profile URI. That aside, the current implementation can deal with any transformation licensed via the "data view" profile. If you include a suitable link then any microformat is already supported.

I'd like to gather some feedback on the initial implementation first though. Let me know if you cook up any cool demonstrations.

The way I've implemented the support is via plugging in some custom code to the Jena FileManager component, so I may consider releasing it as a separate Jena contribution. Mail me if you're interested in that.

Oh, and of course this mechanism isn't restricted to just querying via SELECT. You can extract the data as RDF using CONSTRUCT and DESCRIBE. It would be interesting to plug it into Slug to provide a way to crawl and aggregate microformat content.

Posted by ldodds at 01:51 PM

May 12, 2006

SPARQL Geo Extensions

I've been experimenting with some ARQ Extensions for manipulating geographical information in SPARQL.

I've implemented three functions which are now running on this server. To use them you'll need to add the following PREFIX to your queries:


PREFIX myfn: <java:com.ldodds.sparql.>

You're free to change the name of the PREFIX, but the URI must be the above for now. WARNING: the URI is likely to change, initially to a proper URI for the collection of geo extensions, and hopefully eventually to a formally specified URI if I can drum up interest amongst SPARQL developers to implement the extensions on other engines. So use with care for now and check back for updates.

With that PREFIX declared you can now use the following three functions:


myfn:Distance(?a, ?b)

The Distance function will return the distance between two points in kilometres. It assumes that the variables it's passed conform to the examples given in the GeoRSS in RDF docs. The literal values are automatically split to get the latitude and longitude. Here's an example query.

A better option for data that uses the Geo vocab, such as this data mapping a walk through Bristol, is to use the following function:


myfn:Distance(?lat1, ?lon1, ?lat2, ?lon2)

Here's an example query.
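If you want to sanity-check the results, here's a rough Python equivalent of the two Distance forms. The formula is an assumption: this sketch uses the haversine formula on a spherical Earth with a mean radius of 6371 km, which may differ from what the extension functions actually compute. It also assumes point literals look like "lat lon" separated by whitespace, as in the GeoRSS examples. All function names here are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; the post does not state one


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points in kilometres (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def parse_point(literal: str) -> tuple[float, float]:
    """Split a 'lat lon' literal into a (latitude, longitude) pair."""
    lat, lon = literal.split()
    return float(lat), float(lon)


def distance_between_points(a: str, b: str) -> float:
    """Two-argument form: both arguments are 'lat lon' literals."""
    return distance_km(*parse_point(a), *parse_point(b))


# One degree of longitude along the equator.
one_degree = distance_km(0.0, 0.0, 0.0, 1.0)
```

Under these assumptions one degree of longitude at the equator comes out at roughly 111.2 km, which is a handy figure for checking query output.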

The final function allows a query to test whether an arbitrary point is within a bounding box:


myfn:PointInBoundingBox(?minlat, ?minlon, ?maxlat, ?maxlon, ?lat, ?long)

The first four parameters define the southwest and northeast corners of the bounding box. The last two define the point to test.

There's an example query.
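The bounding box test amounts to two range checks. Here's a minimal Python sketch using the same parameter order as the function above. It assumes inclusive edges and a box that doesn't cross the 180° meridian; whether the server-side function treats those cases the same way is not something this sketch can confirm.

```python
def point_in_bounding_box(min_lat: float, min_lon: float,
                          max_lat: float, max_lon: float,
                          lat: float, lon: float) -> bool:
    """True if (lat, lon) lies inside the box, edges included.

    (min_lat, min_lon) is the southwest corner and (max_lat, max_lon)
    the northeast corner. Boxes spanning the antimeridian are not handled.
    """
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


# A hypothetical box and two test points.
inside = point_in_bounding_box(51.0, -3.0, 52.0, -2.0, 51.45, -2.59)
outside = point_in_bounding_box(51.0, -3.0, 52.0, -2.0, 53.48, -2.24)
```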

Let me know if you come up with any interesting hacks, or whether you think the math is wrong!

Posted by ldodds at 08:55 PM

Applying XSLT to SPARQL Results

Using the XSLT service it's possible to apply a transformation to any XML document, including the results of SPARQL queries.

The Jena RDF/XML-ABBREV output, which can be used to serialize the results of CONSTRUCT and DESCRIBE queries, can be processed with XSLT as it's more regular than the default RDF/XML format. For example, Masahide Kanzaki has a stylesheet that can be used to transform RDF calendar data to an HTML calendar view.

However, it's the SPARQL results format which is most amenable to post-processing with XSLT. It's a very regular and simple format that can be easily transformed into many different vocabularies.

As a convenience feature I've updated the SPARQL query service to accept an extra xslt-uri parameter. The parameter should indicate an XSLT stylesheet that will automatically be applied to the results of any SELECT query. Only this response format is supported at present, although I'll expand it to cover the other options when I get a chance.

Any non-SPARQL protocol parameters present in the query string, i.e. anything not listed in the documentation, will be automatically made available to the XSLT stylesheet engine (Saxon 8.6). This means the stylesheet can be parameterised to take in additional data.
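To make the parameter pass-through concrete, here's a small Python sketch of splitting a query string into the parameters the service consumes and those handed on to the stylesheet. The set of service parameters is an assumption (the standard SPARQL protocol names plus xslt-uri); the authoritative list is in the service documentation. The query string and stylesheet URL are invented for the example.

```python
from urllib.parse import parse_qs

# Assumed set of parameters consumed by the service itself.
SERVICE_PARAMS = {"query", "default-graph-uri", "named-graph-uri", "xslt-uri"}


def stylesheet_params(query_string: str) -> dict[str, str]:
    """Return the parameters that would be passed on to the stylesheet."""
    parsed = parse_qs(query_string)
    return {
        name: values[0]  # keep the first value if a name is repeated
        for name, values in parsed.items()
        if name not in SERVICE_PARAMS
    }


# Hypothetical request: a query, a stylesheet, and one extra parameter.
qs = ("query=SELECT+%2A+WHERE+%7B%3Fs+%3Fp+%3Fo%7D"
      "&xslt-uri=http%3A%2F%2Fexample.org%2Frss.xsl"
      "&title=My+Feed")
extra = stylesheet_params(qs)
```

In the stylesheet each of those names would be picked up by a matching top-level xsl:param declaration.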

To set the MIME type for the response, ensure that you use the xsl:output element in the stylesheet and configure an appropriate media-type attribute.


Morten Frederiksen has previously demonstrated how to convert SPARQL results to RSS. I've taken his stylesheet and made a few tweaks (bringing it up to date, renaming a few parameters) and made it available on my server. So using this stylesheet you can transform the results of any SPARQL query to RSS 1.0.

There are a couple of rules to follow:

  • Ensure there are variables called rsstitle and rsslink in your response. These will form the rss:title and rss:link tags of each item. Note: rsslink MUST be bound to a URI in the query.
  • Provide a title parameter in the URL to set the rss:title of the feed
  • The channel and description parameters are similarly mapped to their respective top-level channel elements

Here's an example of a SPARQL query that provides results in the correct format. It selects the 10 most recently tagged items from 2 users of del.icio.us. Here are the results in HTML, and here are the results with the stylesheet applied.

The nice aspect to this approach is that you can serialize any result set that conforms to the expectations of the stylesheet. "Shape" your query to match expectations and you can do all sorts of interesting transformations.
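To show what "conforming to the expectations of the stylesheet" means in practice, here's a Python sketch that reads a SPARQL XML result set and pulls out the rsstitle and rsslink bindings following the rules above. It stands in for the XSLT rather than reproducing it: the function name and sample data are invented, and skipping rows whose rsslink isn't a URI is this sketch's choice, not necessarily what the stylesheet does.

```python
import xml.etree.ElementTree as ET

SR = "{http://www.w3.org/2005/sparql-results#}"


def rss_items(results_xml: str) -> list[tuple[str, str]]:
    """Return (title, link) pairs for each result that satisfies the rules."""
    items = []
    root = ET.fromstring(results_xml)
    for result in root.iter(f"{SR}result"):
        bindings = {b.get("name"): b for b in result.findall(f"{SR}binding")}
        title = bindings.get("rsstitle")
        link = bindings.get("rsslink")
        if title is None or link is None:
            continue  # both variables must be bound
        uri = link.find(f"{SR}uri")
        if uri is None or not uri.text:
            continue  # rsslink must be bound to a URI, not a literal
        items.append(("".join(title.itertext()).strip(), uri.text))
    return items


SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="rsstitle"/><variable name="rsslink"/></head>
  <results>
    <result>
      <binding name="rsstitle"><literal>TagSoup</literal></binding>
      <binding name="rsslink"><uri>http://example.org/tagsoup</uri></binding>
    </result>
    <result>
      <binding name="rsstitle"><literal>No usable link</literal></binding>
      <binding name="rsslink"><literal>not a URI</literal></binding>
    </result>
  </results>
</sparql>"""

items = rss_items(SAMPLE)
```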

Posted by ldodds at 08:39 PM

Content-Type Bug Fixes

I've made some bug fixes to all of the services that should resolve problems with generating results that contain non-ASCII characters. I should have caught this earlier, but I wasn't correctly setting the charset parameter in all responses.

I've also double-checked and ensured that I'm returning the correct mimetype for N3, Turtle, etc.
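For anyone unsure why the missing charset mattered, here's a tiny Python illustration. The helper name is made up and this isn't the service code; it just shows the same bytes decoded with and without the right encoding.

```python
def content_type_header(media_type: str, charset: str = "utf-8") -> str:
    """Build a Content-Type value that declares the encoding of the body."""
    return f"{media_type}; charset={charset}"


text = "Caf\u00e9"  # contains one non-ASCII character
body = text.encode("utf-8")
header = content_type_header("application/rdf+xml")

# A client that honours the declared charset recovers the original text.
decoded_ok = body.decode("utf-8")
# A client left to guess, and guessing Latin-1, sees two junk characters
# in place of the accented letter.
decoded_wrong = body.decode("latin-1")
```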

If you notice any other Content-Type related bugs (or anything else for that matter!) please drop me a mail.

Posted by ldodds at 07:44 PM

May 08, 2006

Sample SPARQL Queries

Whilst putting together my local cache of useful data sources I wrote a series of SPARQL queries to exercise the data.

They're all pretty trivial, but demonstrate the kinds of data available in each of the data sources as well as illustrating that some of the sources merge together very nicely as care has been taken to reuse standard identifiers, e.g. for countries.

The complete set of queries is browsable, but here's a few:

  • List airports in the UK. View Results
  • List countries and their dialing codes. View Results
  • List countries and their main currencies (where available in CIA World Fact Book). View Results
  • List Countries and their FIPS country code. View Results
  • List Countries and their ISO 3166 country code. View Results
  • Show natural hazards in the US. View Results
  • Which countries observe daylight saving time? View Results

With a bit of mapping mashup action, and some user contributed metadata thrown in for good measure, I think there's the basis for a good travel guide application in there somewhere.

Posted by ldodds at 02:09 PM

May 05, 2006

Local Mirrors of Useful RDF Data Sources

This morning I released another internal improvement to the SPARQL query service, adding the first of several caching layers to improve performance and efficiency.

The service now has local caches of a large set of useful RDF data sources including many from the DAML project, a copy of the CIA World Factbook, airport and country codes, etc. There's also a complete mirror of the Historical IDs data set. I'm expecting this mirror to grow to encompass other useful and reasonably static data sources. Mail me if you have further suggestions.

I'm currently considering adding some of the GovTrack data, as well as some bioinformatics resources.

There's no need to adapt queries to use the locally mirrored data. Using the Jena FileManager API, specific URLs are automatically remapped to local copies. A complete list of cached URLs is included below.
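The remapping idea is easy to picture as a lookup table. The Python sketch below mimics it with exact-match and prefix entries; it is an illustration only, not Jena's API, and the URLs and cache paths are invented rather than taken from the real mapping config.

```python
# Hypothetical mapping tables; the real ones live in the service's config.
EXACT = {"http://example.org/data/airports.rdf": "file:cache/airports.rdf"}
PREFIXES = {"http://example.org/factbook/": "file:cache/factbook/"}


def remap(url: str) -> str:
    """Return the local copy for a URL, or the URL itself if unmapped."""
    if url in EXACT:
        return EXACT[url]
    # Prefer the longest matching prefix so specific entries win.
    matches = [prefix for prefix in PREFIXES if url.startswith(prefix)]
    if matches:
        best = max(matches, key=len)
        return PREFIXES[best] + url[len(best):]
    return url


mapped = remap("http://example.org/factbook/uk.rdf")
```

Because the lookup happens at retrieval time, a FROM clause can keep naming the original URL and still be served from the cache.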

The table is generated using a SPARQL query over the location mapping config and rendered as Javascript code by the SPARQL service. No AJAX, just simple code insertion.