XMLArmyKnife

May 25, 2006

Keep On Truckin'

Reading Danny's post earlier reminded me that I'd been meaning to do an online version of John Cowan's TagSoup parser.

I've often found the W3C HTML Tidy service quite useful for fixing up dodgy HTML so that data can be scraped from it using XSLT. However, Tidy sometimes can't handle all the bizarre HTML variants thrown at it, whereas TagSoup's "Keep On Truckin'" philosophy seems to let it deal with a wider array of problems. (This is anecdotal evidence; I've not done a true comparison of the tools.)

Anyway, I've packaged up TagSoup so you can try it for yourself. Here's the documentation. The service supports all the same parameters as the TagSoup command-line tool.

It even supports PYX! Here's Danny in line-mode.
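
For readers who haven't met PYX: it is a line-oriented rendering of XML in which the first character of each line says what the line is (`(` start tag, `A` attribute, `-` text, `)` end tag). The following is a minimal sketch of that notation in Python, not TagSoup's own code; the `to_pyx` and `escape` helpers are hypothetical names, and processing instructions are left out.

```python
import xml.etree.ElementTree as ET


def escape(text):
    # PYX keeps one event per line, so backslashes, newlines and tabs are escaped.
    return text.replace("\\", "\\\\").replace("\n", "\\n").replace("\t", "\\t")


def to_pyx(element):
    """Return the PYX lines for an ElementTree element and its descendants."""
    lines = ["(" + element.tag]
    for name, value in element.attrib.items():
        lines.append("A" + name + " " + escape(value))
    if element.text:
        lines.append("-" + escape(element.text))
    for child in element:
        lines.extend(to_pyx(child))
        if child.tail:
            # Text that follows a child element belongs to the parent.
            lines.append("-" + escape(child.tail))
    lines.append(")" + element.tag)
    return lines


if __name__ == "__main__":
    print("\n".join(to_pyx(ET.fromstring('<p class="x">Hi <b>there</b></p>'))))
```

Because every event sits on its own line, the output is easy to pick apart with grep, awk and similar line-based tools.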

Posted by ldodds at 09:32 PM

User Agent

FYI, I've tweaked the code behind the XMLArmyKnife RDF services so that the requests they make are properly advertised using the User-Agent header.

At the moment the UA string is XMLArmyKnife.org RDF Services; http://xmlarmyknife.org. So if you want to monitor use and abuse you should now be able to. If you have concerns about heavy use let me know and perhaps we can discuss local mirroring of the data of interest. I'm doing this already for some sources.
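
If you do want to monitor usage, something along these lines would count the service's requests in a web server access log. It is only a sketch: it assumes the common "combined" log format, where the User-Agent is the last quoted field on each line, and the `count_service_hits` name is made up for illustration.

```python
import re

# Distinctive part of the UA string quoted above.
UA_MARKER = "XMLArmyKnife.org RDF Services"

# Assumption: combined log format, so the User-Agent is the final quoted field.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_service_hits(log_lines):
    """Count log lines whose User-Agent field mentions the service."""
    hits = 0
    for line in log_lines:
        match = LAST_QUOTED_FIELD.search(line)
        if match and UA_MARKER in match.group(1):
            hits += 1
    return hits
```

Pass it any iterable of lines, for example an open log file.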

Posted by ldodds at 02:02 PM

Experimenting with EmbeddedRDF and GRDDL Support

Embedded RDF is a method of embedding (a subset of) RDF within XHTML and HTML documents. A simple XSLT transformation can be used to extract the RDF from within the document.

A related and more generalised technology is GRDDL, which defines how to associate transformation algorithms (i.e. XSLT stylesheets) with XHTML profiles or microformats so that there's a clear mapping from embedded metadata into RDF.

I've been experimenting with adding support for both of these technologies in the XMLArmyKnife SPARQL query service. This provides a means to directly query RDF embedded in XHTML documents.

The mechanism works as follows. When the service retrieves some remote data it checks the Content-Type of the response. If it's application/xhtml+xml or text/html it applies the following rules; otherwise it's business as usual and the content is parsed as RDF.

If possible the retrieved content is parsed as XML and then inspected to discover the XHTML profiles associated with the content.

If the profile URIs include that of Embedded RDF, it applies a suitable stylesheet to retrieve some triples.

It also looks for the GRDDL data view profile. If that profile is found, the processor tries to find any additional transformations that the author has associated with the document. This mechanism is defined in the GRDDL Profile for XHTML. Essentially it just looks for all link elements with rel="transformation" attributes. If it finds any, each is applied in turn.

The end result is a single chunk of RDF which is then made available to the SPARQL query as normal.
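
The discovery steps above can be sketched roughly as follows. This is a Python illustration rather than the service's actual code (which plugs into Jena); the profile URIs are the ones I understand the two specifications to use, and `plan_transformations` and the `extract-rdf.xsl` stylesheet name are hypothetical. It only works out which stylesheets would be applied; running the XSLT and merging the resulting RDF is left out.

```python
import xml.etree.ElementTree as ET

# Assumed profile URIs for Embedded RDF and the GRDDL data view.
ERDF_PROFILE = "http://purl.org/NET/erdf/profile"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"
HTML_TYPES = {"application/xhtml+xml", "text/html"}


def local_name(tag):
    # Strip any "{namespace}" prefix that ElementTree puts on tag names.
    return tag.rsplit("}", 1)[-1]


def plan_transformations(content_type, body, erdf_stylesheet="extract-rdf.xsl"):
    """Return the stylesheets to apply, or None if the body is plain RDF."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in HTML_TYPES:
        return None  # business as usual: parse the content as RDF
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return []  # not well-formed XML, so no profiles can be discovered
    head = next((el for el in root.iter() if local_name(el.tag) == "head"), None)
    if head is None:
        return []
    profiles = head.get("profile", "").split()
    stylesheets = []
    if ERDF_PROFILE in profiles:
        stylesheets.append(erdf_stylesheet)
    if GRDDL_PROFILE in profiles:
        # Author-supplied transformations: <link rel="transformation" href="..."/>
        for link in head.iter():
            if local_name(link.tag) != "link":
                continue
            if "transformation" in link.get("rel", "").split() and link.get("href"):
                stylesheets.append(link.get("href"))
    return stylesheets
```

Returning None for non-HTML content keeps the "already RDF" case distinct from "HTML, but nothing to transform".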

The mechanism allows you to write queries such as this:


PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?blog 
FROM <http://iandavis.com>
WHERE 
{<http://iandavis.com/#ian> foaf:weblog ?blog.}

...which discovers Ian Davis's blogs by querying his homepage.

A similar technique can be applied to directly query Dan Connolly's homepage to discover the dates he's attending the WWW2006 conference.

Dan's homepage is interesting as he's combined Embedded RDF with hCalendar. It nicely illustrates that you can merge together multiple RDF views of the same source page, as well as demonstrating that SPARQL can be applied to microformat content very easily. All that's needed is a suitable XSLT stylesheet. Danny wrote a nice posting looking at Microformats on the GRDDL which makes useful background reading.

The current code needs to be generalised to support arbitrary profile URIs. The GRDDL specification outlines a more general solution to discovering transformations by dereferencing the profile URI. That aside, the current implementation can deal with any transformation licensed via the "data view" profile. If you include a suitable link then any microformat is already supported.

I'd like to gather some feedback on the initial implementation first though. Let me know if you cook up any cool demonstrations.

The way I've implemented the support is via plugging in some custom code to the Jena FileManager component, so I may consider releasing it as a separate Jena contribution. Mail me if you're interested in that.

Oh, and of course this mechanism isn't restricted to just querying via SELECT. You can extract the data as RDF using CONSTRUCT and DESCRIBE. It would be interesting to plug it into Slug to provide a way to crawl and aggregate microformat content.
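
As a rough illustration, here are CONSTRUCT and DESCRIBE versions of the weblog query, wrapped in a small Python helper that builds a GET URL for a SPARQL endpoint. The endpoint address is a placeholder, not the service's real URL, and `request_url` is a made-up name; the only thing taken from the SPARQL protocol is that the query travels in the `query` parameter.

```python
from urllib.parse import urlencode

# Placeholder endpoint, for illustration only.
ENDPOINT = "http://example.org/sparql"

# Returns the matching triples as an RDF graph instead of a result table.
CONSTRUCT_QUERY = """\
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT { <http://iandavis.com/#ian> foaf:weblog ?blog . }
FROM <http://iandavis.com>
WHERE { <http://iandavis.com/#ian> foaf:weblog ?blog . }
"""

# Asks the service for whatever description it can give of the resource.
DESCRIBE_QUERY = """\
DESCRIBE <http://iandavis.com/#ian>
FROM <http://iandavis.com>
"""


def request_url(query, endpoint=ENDPOINT):
    """Build a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(request_url(CONSTRUCT_QUERY))
    print(request_url(DESCRIBE_QUERY))
```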

Posted by ldodds at 01:51 PM