November 08, 2005
Introduction to XAK
In this posting I want to introduce the basic concepts of the site in order to outline my intentions for creating this service.
URL Pipelining
Over the past few years of building web applications and simple web-based tools and utilities, I've found it invaluable to have to hand one or more web services that carry out specific tasks such as XSLT transforms, HTML-to-XHTML conversion, and validation.
The most useful of these services are those that follow the familiar Unix pipeline model. These services take data already on the web, referenced by URL, perform some useful processing and then return the results. By chaining together such services one can perform quite complex tasks: e.g. extracting metadata from an HTML document by cleaning it with HTML Tidy, and then processing it with XSLT.
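To make the chaining idea concrete, here is a minimal sketch in Python of the metadata example above. The two service addresses and the parameter names (`url`, `xml`, `xsl`) are assumptions invented for illustration and do not describe any real service. The point is only that the output of one stage is itself a URL, so it can be passed as the input of the next.

```python
from urllib.parse import urlencode

# Hypothetical service endpoints, used purely for illustration.
TIDY_SERVICE = "http://example.org/tidy"
XSLT_SERVICE = "http://example.org/xslt"


def pipe(service, **params):
    """Build the URL of one service call, URL-encoding every parameter."""
    return service + "?" + urlencode(params)


def metadata_url(document_url, stylesheet_url):
    """Chain two stages: clean the HTML, then transform the result."""
    # Stage 1: the tidy call is just a URL that yields cleaned-up markup.
    tidied = pipe(TIDY_SERVICE, url=document_url)
    # Stage 2: that URL becomes the input document of the XSLT stage.
    return pipe(XSLT_SERVICE, xml=tidied, xsl=stylesheet_url)


chained = metadata_url(
    "http://example.com/page.html",
    "http://example.com/meta.xsl",
)
print(chained)
```

Fetching `chained` would, under these assumptions, ask the XSLT service to retrieve its input from the tidy service, which in turn retrieves the original page. Note that the inner URL has to be encoded a second time when it is nested, which is why the sketch leaves all encoding to `urlencode`.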
Recent experience in building REST-style web services and API "mashups" has further demonstrated the utility of tools that live and work on the web.
Yet, while a growing number of sites are exposing data as XML, enabling more complex applications to be developed, the suite of online XML processing tools hasn't been greatly extended. Some tools, e.g. for XQuery processing, are notably absent; others, such as the XSLT processing offered by the W3C's online tool, lag behind the latest specifications. I think there is a growing need to have a wider range of XML processing tools available.
XML processing isn't the only area needing attention. Tools for Semantic Web developers are also lacking. The W3C RDF Validator is invaluable when authoring RDF content, but there are, as yet, few services offering features such as SPARQL querying, inferencing, etc. It's possible that the availability of such tools may help bootstrap additional interest in Semantic Web technologies amongst web hackers.
So, the initial goal for the XML Army Knife (XAK) project is to expand on the tools available for processing XML and RDF data on the web.
XAK will therefore be a multi-function tool, consisting of a range of services covering specific areas. I've dubbed these blades to continue the symbolism, with each blade having multiple tools. Initially I've identified XQuery, XSLT, RDF and HTML as top level "blades". Each of these will consist of at least one RESTful service. For example the XQuery blade will offer a single service (query processing) while the RDF blade will offer SPARQL querying, syntax conversion, validation, smushing and inferencing.
Beyond the initial goal of providing a richer toolset for web hackers there are some areas worthy of additional study and further work.
Data Storage and Gathering
The first is in online data storage for the kind of semi-structured XML and RDF data commonly manipulated by mashups and social content engines. Interesting work is continuing in this area with the launch of Ning and the imminent arrival of Google Base. But there will certainly be room (for some time to come) for other kinds of flexible data storage APIs. And not only data storage: data gathering is another important but so far largely overlooked feature.
Many Semantic Web developers have written web crawlers and experimented with aggregating data from across the web. Yet there are no services that offer online data aggregation. The problem of crawling the Semantic Web is similar to that of generic web crawling and as the amount of RDF data grows Semantic Web crawlers ("Scutters") will require a similar level of resources and investment.
But there is room for more focused data gathering services, e.g. services that aggregate data about a particular person or community. These data aggregations may vary in lifetime, ranging from simple "gather and query" use cases through to regularly refreshed data sources.
XAK will ultimately provide a number of services in this area. Again, it is hoped that the availability of larger chunks of RDF data, aggregated from multiple sources, will help foster interest and innovation in the creation of RDF applications.
Processing Pipelines
While REST-style online services provide a lot of flexibility in working with online data, they suffer from several problems:
- Lack of Caching -- very often services fetch all data "on demand", leading to inefficient and redundant network accesses
- Lack of Error Reporting -- as the services are chained together it becomes hard to spot problems in any one section of the processing pipeline
More integrated services can deal with these problems by using a shared caching infrastructure and an error "endpoint" for communicating problems encountered during processing. As the W3C has chartered an XML Processing Model Working Group, the time seems right for experimenting with XML processing frameworks to create more flexible and integrated services.
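As a rough illustration of what "more integrated" might mean, the Python sketch below runs named stages over a fetched document, shares one cache between pipelines, and records which stage failed. Everything in it (the `Pipeline` class, the stand-in `fake_fetch` function, the stage names) is hypothetical and written for this posting; it is not a description of how XAK is implemented.

```python
class Pipeline:
    """Run named stages over a document fetched by URL."""

    def __init__(self, fetch, stages, cache=None):
        self.fetch = fetch      # callable: url -> document text
        self.stages = stages    # list of (name, callable) pairs
        # Passing the same dict to several pipelines gives a shared cache.
        self.cache = cache if cache is not None else {}
        self.errors = []        # stands in for an error "endpoint"

    def get(self, url):
        # Only go to the network when the document is not cached yet.
        if url not in self.cache:
            self.cache[url] = self.fetch(url)
        return self.cache[url]

    def run(self, url):
        try:
            data = self.get(url)
        except Exception as exc:
            self.errors.append(("fetch", str(exc)))
            return None
        for name, stage in self.stages:
            try:
                data = stage(data)
            except Exception as exc:
                # Record which stage failed, then stop the pipeline.
                self.errors.append((name, str(exc)))
                return None
        return data


# Stand-ins for a real HTTP fetch and real processing stages.
fetch_log = []


def fake_fetch(url):
    fetch_log.append(url)
    return "<p>hello</p>"


def failing_stage(data):
    raise ValueError("not well-formed")


URL = "http://example.com/a"
shared_cache = {}
cleaner = Pipeline(fake_fetch, [("upper", str.upper)], shared_cache)
broken = Pipeline(fake_fetch, [("transform", failing_stage)], shared_cache)

result = cleaner.run(URL)   # fetches the document and caches it
failed = broken.run(URL)    # reuses the cached copy, then fails in "transform"
print(result, failed, fetch_log, broken.errors)
```

Because both pipelines share `shared_cache`, the document is fetched once even though it is processed twice, and the error list names the stage that broke rather than leaving the caller to guess where in the chain things went wrong.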
Semantic Web developers could also benefit from an "RDF Pipeline" framework that describes how to assemble and subsequently process an RDF data set. A key component of the XAK data storage environment will be an RDF data set assembly language. (More on this and RDF storage in general in another essay.)
Hopefully that outlines the basic goals for this service. If you have suggestions please mail them to suggest@xmlarmyknife.org.
