Tuesday, May 19, 2009

Why XProc Rocks

These are exciting times to be a technologist in the news publishing business. The industry is going through fundamental changes driven by the current economic downturn, and innovation has become an imperative. Recent innovations include news content APIs (the newspaper as a platform), specialized desktop news readers, publishing to a growing number of mobile devices (Kindle, iPhone, etc.), and migrating news content to the Semantic Web and the Linked Open Data (LOD) cloud.

The result is that news publishing workflows are getting more complex. The following is an example of what can be expected from such a workflow:

  1. Retrieve content from a native XML database such as eXist using XQuery, and from a MySQL database, and combine the results into a single XML document.
  2. Expand XIncludes and other content references in the XML.
  3. Apply various processing to the XML depending on the content type.
  4. Make REST calls to data.gov and recovery.gov to retrieve government-published data in XML for a graphical mashup visualization of the data.
  5. Transform the XML into XHTML+RDFa for consumption by Yahoo's SearchMonkey and the recently unveiled Google Rich Snippets.
  6. Transform the XML into RDF/XML and validate the result using Jena's command-line ARP RDF parser. The RDF/XML output will provide an RDF dump for RDF crawlers and will be searchable via a SPARQL endpoint.
  7. Transform the XML into an Amazon Kindle-friendly format such as plain text.
  8. Publish the content as a PDF document with a print layout (e.g. header, footer, multi-column, pagination, etc.).
  9. Transform the XML into a NewsML document and validate the result against the NewsML XML Schema and an ISO Schematron schema.
  10. Generate an Atom feed containing NewsML elements and validate the result using NVDL (Namespace-based Validation Dispatching Language). NVDL is required here because the Atom feed will contain nodes that will be validated against the Atom RelaxNG schema as well as nodes that will be validated against the NewsML XML schema.

As you can see, this publishing workflow is XML document-centric. If you are a Java developer, your first instinct might be to automate all those processing steps with an Ant build file. There is now a better alternative: XProc, an XML pipeline language that is currently a W3C Candidate Recommendation. Here are some reasons why XProc is a superior alternative when dealing with an XML document-centric publishing workflow:

  • With XProc, XML documents are first-class citizens. Java objects are first-class citizens in Ant. In fact, using Ant to automate a complex XML document-centric publishing workflow can quickly lead to spaghetti code.
  • XProc allows you to add, delete, rename, replace, filter, split, wrap, unwrap, compare, insert, and conditionally process nodes in XML documents by addressing these nodes using XPath.
  • XProc comes with built-in steps for XML processing such as p:xquery, p:xinclude, p:xslt, p:validate-with-xml-schema, p:validate-with-relax-ng, p:validate-with-schematron, p:xsl-formatter, and p:http-request for making REST calls.
  • XProc is declarative and neutral with respect to programming language and platform. For example, work is underway on an XQuery-based implementation for the eXist database. Note that step 1 above can be implemented with the SQL extension function that eXist provides for retrieving data from relational databases over JDBC.
  • XProc is extensible, so you can add custom steps. For example, the XML Calabash XProc implementation provides extension steps such as cx:nvdl and cx:unzip.
  • XProc provides exception handling (p:try and p:catch), which is essential for any complex workflow.
  • XProc pipelines are amenable to streaming.
  • XProc can simplify the design, documentation, and maintenance of very complex publishing workflows. Since XProc is in XML format, one can envision a visual designer for building XProc pipelines or a tool (such as Graphviz) to visualize XProc documents in the form of processing diagrams.
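
To make these points more concrete, here is a minimal sketch of how steps 2, 3, and 9 of the workflow above might look as an XProc pipeline. It is an illustration under assumptions rather than a production pipeline: the stylesheet and schema file names, the type attribute used to detect the content type, and the publish-failed placeholder element are all hypothetical.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">

  <!-- Step 2: expand XIncludes in the source document -->
  <p:xinclude/>

  <!-- Step 3: process the document according to its content type.
       The type attribute and gallery.xsl are hypothetical. -->
  <p:choose>
    <p:when test="/*/@type = 'photo-gallery'">
      <p:xslt>
        <p:input port="stylesheet">
          <p:document href="gallery.xsl"/>
        </p:input>
      </p:xslt>
    </p:when>
    <p:otherwise>
      <!-- Pass every other content type through unchanged -->
      <p:identity/>
    </p:otherwise>
  </p:choose>

  <!-- Step 9: transform to NewsML and validate, with an exception handler -->
  <p:try>
    <p:group>
      <p:xslt>
        <p:input port="stylesheet">
          <p:document href="to-newsml.xsl"/>
        </p:input>
      </p:xslt>
      <p:validate-with-xml-schema>
        <p:input port="schema">
          <p:document href="newsml.xsd"/>
        </p:input>
      </p:validate-with-xml-schema>
      <p:validate-with-schematron>
        <p:input port="schema">
          <p:document href="newsml.sch"/>
        </p:input>
      </p:validate-with-schematron>
    </p:group>
    <p:catch>
      <!-- If the transform or either validation fails, emit a placeholder
           document instead of aborting the whole pipeline -->
      <p:identity>
        <p:input port="source">
          <p:inline><publish-failed/></p:inline>
        </p:input>
      </p:identity>
    </p:catch>
  </p:try>

</p:pipeline>
```

Each step reads the output of the step before it, so the document flows from top to bottom without any explicit wiring. Note that p:validate-with-xml-schema and p:validate-with-schematron are optional steps in the XProc specification, so a given processor may not support them.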

There are other pipeline technologies that can be useful as well, particularly if you're building mashup applications. Yahoo Pipes is a good choice for combining feeds from various sources, manipulating them as needed, and outputting RSS, JSON, KML, and other formats. And if you are into Semantic Web mashups, DERI Pipes provides support for RDF, OWL, and SPARQL queries. DERI Pipes works as a command-line tool and can be hooked into an XProc pipeline with XProc's built-in p:exec step.
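
As a rough sketch of that last point, a p:exec step could wrap a command-line pipe runner along these lines. The command path and its argument are placeholders, not the actual DERI Pipes command line, and the sketch assumes the tool writes well-formed XML (RDF/XML, for example) to standard output.

```xml
<p:pipeline xmlns:p="http://www.w3.org/ns/xproc" version="1.0">
  <!-- Hypothetical wrapper script and argument; substitute the real command line -->
  <p:exec command="/usr/local/bin/run-pipe" args="news-pipe.xml" result-is-xml="true">
    <!-- Send nothing to the command's standard input -->
    <p:input port="source">
      <p:empty/>
    </p:input>
  </p:exec>
</p:pipeline>
```

The command's output arrives wrapped in a c:result element, which later steps can unwrap or transform. p:exec is an optional step in XProc, so it is worth checking that your processor supports it.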