Monday, September 21, 2009

Relational, XML, or RDF?

During the last 15 years, I have had the opportunity to work with different data models and approaches to application development. I started with SGML in the aerospace content management space back in 1995, then saw the potential of XML and fully embraced it in 1998. Since that time, I have been continuously following the evolution of XML related specifications and have been able to leverage the bleeding edge including XForms, XQuery, XSLT2, XProc, ISO Schematron, and even XML Schema 1.1.

However, being a curious person, I decided to explore other approaches to data management and application development. I worked on systems using a relational database backend and application development frameworks like Spring, Hibernate, and JSF. I've been involved in SOAP-based web services projects where XML data (constrained and validated by an XML Schema) was unmarshalled into Java objects, and then persisted into a relational table with an Object-Relational Mapping (ORM) solution such as Hibernate.

I also had the opportunity to work with the Java Content Repository (JCR) model in magazine content publishing, and the Entity-Attribute-Value with Classes and Relationships (EAV/CR) model in the context of medical informatics. EAV/CR is suited for domains where entities can have thousands of frequently changing parameters.

Lately, I have been working on Semantic Web technologies including the RDF data model, OWL (the Web Ontology Language), and the SPARQL query language for RDF.

Clients often ask me which of these approaches is the best or which is the most appropriate for their project. Here is what I think:

  • Different approaches should be part of the software architect's toolkit (not just one).
  • To become more productive in an agile environment, every developer should become a "generalizing specialist".
  • The software architect or developer should be open minded (no "not invented here syndrome" or "what's wrong with what we're doing now" attitude).
  • The software architect or developer should be willing to learn new technologies outside of their comfort zone and IT leadership should encourage and reward that learning.
  • Learning new technologies sometimes requires a new way of thinking about the problems at hand and "unlearning" old knowledge.
  • It is important not to have a purist or religious approach to selecting any particular approach, since each has its own merits.
  • Ultimately, the overall context of the project will dictate your choice. This includes but is not limited to: skills set, learning curve, application performance, cost, and time to market.

Based on my personal experience, here is what I have learned:

The XPath 2.0 and XQuery 1.0 Data Model (XDM)


The roots of SGML and XML are in content management applications in domains such as law, aerospace, defense, scientific, technical, medical, and scholarly publishing. The XPath 2.0 and XQuery Data Model (XDM) is particularly well suited for companies selling information products directly as a source of revenues (e.g. non-ad based publishers).

XSLT2 facilitates media-independent publishing (single sourcing) to multiple devices and platforms. XSLT2 is also a very powerful XML transformation language that allows these publishers to perform the series of complex transformations that are often required as the content is extracted from various data sources and assembled into a final information product.

With XQuery, sophisticated contextualized database-like queries can be performed. In addition, XQuery allows the dynamic assembly of content where new compound documents can be constructed on the fly from multiple documents and external data sources. This allows publishers to repurpose content into new information products as needed to satisfy new customer demands and market opportunities.

XInclude enables the chunking of content into reusable pieces. XProc is an XML pipeline language that essentially allows you to automate a complex publishing workflow which typically includes many steps such as content assembly, validation, transformation, and query.

The second category of application for which XML is a strong candidate is what is sometimes referred to as an "XML Workflow" application. The typical design pattern here is XRX (XForms, REST, and XQuery) where user inputs are captured with an XForm front end (itself potentially auto-generated from an XML schema) and data is RESTfully submitted to a native XML database, then queried and manipulated with XQuery. The advantages of this approach are:

  • It is more resilient to schema changes. In fact the data can be stored without a schema.
  • It does not require handling the impedance mismatch between XML documents, Java objects, and relational tables which can introduce design complexity, performance, and maintainability issues even when using code generation.

A typical example of an "XML Workflow" application would be a Human Resources (HR) form-based application that allows employees to fill and submit a form and also provides reporting capabilities.

The third and last category of application are Web Services (RESTful or SOAP-based) that consume XML data from various sources, store the data natively in an XML database directly (bypassing the XML databinding and ORM layers altogether), and perform all processing and queries on the data using a pure XML approach based on XSLT2 and XQuery. An example is a dashboard or mashup application that stores all of the submitted data in a native XML database. In this scenario, the data can be cached for faster response to web services requests. Again the benefits listed for "XML Workflow" applications apply here as well.


The Relational Model

The relational model is well established and well understood. It is usually an option for data-oriented and enterprise-centric applications that are based on a closed world assumption. In such a scenario, there is usually no need for handling the data in XML and a conventional approach based on JSF, Spring, Hibernate and a relational database backend is enough.

Newer Java EE frameworks like JBoss Seam and its seam-gen code generation tools are particularly well-suited for this kind of task. There is no running away from XML however, since these frameworks use XML for their configuration files. Unfortunately, there is currently a movement away from XML configuration files toward Java annotations due to some developers complaining about "XML Hell".

The relational model supports transactions and is scalable although a new movement called NoSQL is starting to challenge that last assumption. An article entitled "Is the Relational Database Doomed?" on readwriteweb.com describes this emerging trend.


The RDF Data Model

Semantic Web technologies like RDF (an incarnation of the EAV/CR model mentioned above), OWL, SKOS, SWRL, and SPARQL and Linked Data publishing principles have received a lot of attention lately. They are well suited for the following applications:

  • Applications that need to infer new implicit facts based on existing explicit facts. Such inferences can be driven by an ontology expressed in OWL or a set of rules expressed in a rule language such SWRL.
  • Applications that need to map concepts across domains such as a trading network where partners use different e-commerce XML vocabularies.
  • Master Data Management (MDM) applications that provide an RDF view and reasoning capabilities in order to facilitate and enhance the process of defining, managing, and querying an organization's core business entities. A paper on IBM's Semantic Master Data Management (SMDM) project is available here.
  • Applications that use a taxonomy, a thesaurus, or a similar concept scheme such as online news archives and medical terminologies. SKOS, recently approved as a W3C recommendation was designed for that purpose.
  • Silo-busting applications that need to link data items to over data items on the web, in order to perform entity correlation or allow users to explore a topic further. The Linked Data design pattern is based on an open world assumption, uses dereferenceable HTTP URIs for identifying and accessing data items, RDF for describing metadada about those items, and semantic links to describe the relationships between those items. An example is an Open Government application that correlates campaign contributions, voting records, census, and location data. Another example is a semantic social application that combines an individual's profiles and social networks from multiple sites in order to support data portability and fully explore the individual's social graph.


Of course, these different approaches are not mutually exclusive. For example, it is possible to provide an RDF view or layer on top of existing XML and relational database applications.

Tuesday, August 11, 2009

Adding Semantics to SOA

What can Semantic Web technologies such as RDF, OWL, SKOS, SWRL, and SPARQL bring to Web Services. One of the most difficult challenges of SOA is data model transformation. This problem occurs when services don't share a canonical XML schema. XML transformation languages such as XSLT and XQuery are typically used for data mediation in such circumstances.

While it is relatively easy to write these mappings, the real difficulty lies in mapping concepts across domains. This is particularly important in B2B scenarios involving multiple trading partners. In addition to proprietary data models, it is not uncommon to have multiple competing XML standards in the same vertical. In general, these data interoperability issues can be syntactic, structural, or semantic in nature. Many SOA projects can trace their failure to those data integration issues.

This is where semantic web technologies can add significant value to SOA. The Semantic Annotations for WSDL and XML Schema (SAWSDL) is a W3C recommendation which defines the following extension attributes that can be added to WSDL and XML Schema components:

  • The modelReference extension attribute associates a WSDL or XML Schema component to a concept in a semantic model such as OWL. The semantic representation is not restricted to OWL (for example it could be an SKOS concept). The modelReference extension attribute is used to annotate XML Schema type definitions, element and attribute declarations as well as WSDL interfaces, operations, and faults.
  • The liftingSchemaMapping and loweringSchemaMapping extension attributes typically point to an XSLT or XQuery mapping file for transforming between XML instances and ontology instances.

A typical example of how SAWSDL might be used is in an electronic commerce network where trading partners use various standards such as EDI, UBL, ebXML, and RosettaNet. In this case, the modelReference extension attribute can be used to map a WSDL or XML Schema component to a concept in a common foundational ontology such as one based on the Suggested Upper Merged Ontology (SUMO). In addition, lifting and lowering XSLT transforms are attached to XML Schema components in the SAWSDL with liftingSchemaMapping and loweringSchemaMapping extension attributes respectively. Note that any number of those transforms can be associated with a given XML schema component.

Traditionally, when dealing with multiple services (often across organizational boundaries), an Enterprise Services Bus (ESB) provides mediation services such as business process orchestration, business rules processing, data format and data model transformation, message routing, and protocol bridging. Semantic mediation services can be added as a new type of ESB service. The SAWSDL4J API defines an object model that allows SOA developers to access and manipulate SAWSDL annotations.

Ontologies have been developed for some existing e-commerce standards such as EDI X12, RosettaNet, and ebXML. When required, ontology alignment can be achieved with OWL constructs such as subClassOf , equivalentClass , and equivalentProperty.

Semantic annotations provided by SAWSDL can also be leveraged in orchestrating business processes using the business process execution language (BPEL). To facilitate service discovery in SOA Registries and Repositories, interface definitions in WSDL documents can be associated with a service taxonomy defined in SKOS. In addition, once an XML message is lifted to an ontology instance, the data in the message becomes available to Semantic Web tools like OWL and SWRL reasoners and SPARQL query engines.

Sunday, July 26, 2009

From Web 2.0 to the Semantic Web: Bridging the Gap in Newsmedia

In this presentation, I explain the Semantic Web value proposition for the newsmedia industry and propose some concrete steps to bridge the gap.

Welcome to the world of news in the Web 3.0 era.

Wednesday, July 8, 2009

Semantic Social Computing

The Web 2.0 revolution has produced an explosion in social data that is fundamentally transforming business, politics, culture, and society in general. Using tools such as wikis, blogs, online forums, and social networking sites, users can now express their point of view, build relationships, and exchange ideas and multimedia content. Combined with portable electronic devices such as cameras and cell phones, these tools are enabling the citizen journalist who can report facts and events faster than traditional media outlets and government agencies.

One of the challenges posed by this explosion in social data is data portability between social networking sites. But the next biggest challenge will be the ability to harvest all that social data in order to extract actionable intelligence (e.g. a better understanding of consumer behavior or the events unfolding at a particular location). In addition, in a world where security has become the number one priority, various sensors from traffic cameras to satellite sensors are also collecting huge amounts of data. The integration of sensor data and social data offers new possibilities.

Those are the types of integration challenges that Semantic Web technologies are designed to solve. The SIOC (Semantically Interlinked Online Communities) Core ontology describes the structure and content of online community sites. A comprehensive list of SIOC tools is available at the SIOC Applications page. Using these tools, developers can export SIOC compliant RDF data from various data sources such as blogs, wikis, online forums, and social networking sites such as Twitter and Flickr. Once exported, the SIOC data can be crawled, aggregated, stored, indexed, browsed, and queried (using SPARQL) to answer interesting questions. Natural Language Processing (NLP) techniques can be used to facilitate entity extraction from user generated content.

SIOC leverages the FOAF ontology to describe the social graph on social networking sites. For example, this can offer deeper insights for marketers into how social recommendations affect consumer behavior.

One unique capability offered by Semantic Web technologies is the ability to infer new facts (inference) from explicit facts based on the use of an ontology (RDFS or OWL) or a set of rules expressed in a rule language such as the Semantic Web Rule Language (SWRL). Using constructs such as owl:sameAs or rdfs:seeAlso, it becomes easy to express the fact that two or more different web pages relate to the same resource (e.g. different profile pages of the same person on difference social networking sites). Linked Data principles can help in linking social data, therefore building bridges between the data islands that today's social networking sites represent.

SIOC compliant social data can be meshed up with other data sources such as sensor data to reveal very useful information about events related to logistics, public safety, or political unrest at a particular location for example. With the advent of GPS-enabled cameras and cell phones, temporal and spatial context can be added to better describe those events. The W3C Time OWL Ontology (OWL-Time) and the Basic Geo Vocabulary have been developed for that purpose.

Thursday, June 11, 2009

Publishing Government Data to the Linked Open Data (LOD) Cloud

In a previous post, I outlined a roadmap for migrating news content to the Semantic Web and the Linked Open Data (LOD) cloud. The BBC has been doing some interesting work in that space by using Linked Data principles to connect BBC Programmes and BBC Music to MusicBrainz and DBpedia. SPARQL endpoints are now available for querying the BBC datasets.

It is clear that Europe is ahead of the US and Canada in terms of Semantic Web research and adoption. The Europeans are likely to further extend their lead with the announcement this week that Tim Berners-Lee (the visionary behind the World Wide Web, the Semantic Web, and the Linked Open Data movement) will be advising the UK Government on making government data more open and accessible.



In the US, data.gov is part of the Open Government Initiative of the Obama Administration. The following is an excerpt from data.gov:

A primary goal of Data.gov is to improve access to Federal data and expand creative use of those data beyond the walls of government by encouraging innovative ideas (e.g., web applications). Data.gov strives to make government more transparent and is committed to creating an unprecedented level of openness in Government. The openness derived from Data.gov will strengthen our Nation's democracy and promote efficiency and effectiveness in Government.

Governments around the world have taken notice and are now considering similar initiatives. It is clear that these initiatives are important for the proper functioning of democracy since they allow citizens to make informed decisions based on facts as opposed to the politicized views of special interests, lobbyists, and their spin doctors. These facts are related to important subjects such as health care, the environment, the criminal justice system, and education. There is an ongoing debate in the tech community about the best approach for publishing these datasets. There are several government data standards available such as the National Information Exchange Model (NIEM). In the Web 2.0 world, RESTful APIs with ATOM, XML, and JSON representation formats have become the norm.

I believe however that Semantic Web technologies and Linked Data principles offer unique capabilities in terms of bridging data silos, queries, reasoning, and visualization of the data. Again, the methodology for adopting Semantic Web technologies is the same:

  1. Create an OWL ontology that is flexible enough to support the different types of data in the dataset including statistical data. This is certainly the most important and challenging part of the effort.
  2. Convert the data from its source format to RDF. For example XSLT2 can be used to convert from CSV or TSV to RDF/XML and XHTML+RDFa. There are also RDFizers such as D2R for relational data sources.
  3. Link the data to other data sources such as Geonames, Federal Election Commission (FEC), and US Census datasets.
  4. Provide an RDF dump for Semantic Web Crawlers and a SPARQL endpoint for querying the datasets.

The following are some of the benefits of this approach:

  • It allows users to ask sophisticated questions against the datasets using the SPARQL query language. These are the kind of questions that a journalist, a researcher, or a concerned citizen will have in mind. For example, which airport has the highest number of reported aircraft bird strikes? (read more here about why Transportation Secretary Ray LaHood rejected a proposal by the FAA to keep bird strikes data secret). Currently data.gov provides only full-text and category-based search.
  • It bridges data silos by allowing users to make queries and connect data in meaningful ways across datasets. For example, a query that correlates health care, environment, and census data.
  • It provides powerful visualizations of the data through Semantic Web meshups.

Tuesday, June 2, 2009

S1000D and SCORM Integration

This is a presentation I gave at the DocTrain Boston 07 conference on how to reduce product lifecycle costs by integrating the S1000D and SCORM specifications.

S1000D is the International Specification for Technical Publications utilizing a Common Source Database (CSDB). Based on open XML standards, the latest issue (4.0) has been developed by the AeroSpace and Defence Industries Association of Europe (ASD), the Aerospace Industries Association of America (AIA), and the Air Transport Association of America (ATA).

Sharable Content Object Reference Model (SCORM) is a specification for online learning content developed by the Advanced Distributed Learning (ADL) Initiative.

The presentation has been updated to reflect the addition of SCORM support in S1000D 4.0.

Tuesday, May 19, 2009

Why XProc Rocks

These are exciting times to be a technologist in the news publishing business. The industry is going through fundamental changes driven by the current economic downturn. Innovation has become an imperative. Examples of recent innovations include news content APIs (newspaper as a Platform), specialized desktop news readers, the ability to publish to an increasing number of mobile devices (Kindle, iPhone, etc.), and migrating news content to the Semantic Web and Linked Open Data (LOD) cloud.

The result is that news publishing workflows are getting more complex. The following is an example of what can be expected from such a workflow:

  1. Retrieve content from a native XML database such as Exist using XQuery and from a MySQL database and combine the result as an XML document.
  2. Expand XIncludes and other content references in the XML.
  3. Apply various processing to the XML depending on the content type.
  4. Make a REST call to data.gov and recovery.gov to retrieve some government published data in XML for a graphical mashup visualization of the data.
  5. Transform the XML into XHTML+RDFa for consumption by Yahoo's SearchMonkey and the recently unveiled Google Rich Snippets.
  6. Transform the XML into RDF/XML and validate the result using the Jena command line ARP RDF parser. The RDF/XML output will provide an RDF dump for RDF crawlers and will be searched via a SPARQL endpoint.
  7. Transform the XML into an Amazon Kindle-friendly format such as Text.
  8. Publish the content as a PDF document with a print layout (e.g. header, footer, multi-column, pagination, etc.).
  9. Transform the XML into a NewsML document and validate the result against the NewsML XML Schema and an ISO Schematron schema.
  10. Generate an Atom feed containing NewsML elements and validate the result using NVDL (Namespace-based Validation Dispatching Language). NVDL is required here because the Atom feed will contain nodes that will be validated against the Atom RelaxNG schema as well as nodes that will be validated against the NewsML XML schema.

As you can see, this publishing workflow is XML document-centric. If you are a Java developer, your first instinct might be to automate all those processing steps with an Ant build file. There is now a better alternative and it is called XProc (an XML Pipeline Language) currently a W3C candidate recommendation. Here are some reasons why XProc is a superior alternative when dealing with an XML document-centric publishing workflow:

  • With XProc, XML documents are first-class citizens. Java objects are first-class citizens in Ant. In fact, using Ant to automate a complex XML document-centric publishing workflow can quickly lead to spaghetti code.
  • XProc allows you to add, delete, rename, replace, filter, split, wrap, unwrap, compare, insert, and conditionally process nodes in XML documents by addressing these nodes using XPath.
  • XProc comes with built-in steps for XML processing such as p:xquery, p:xinclude, p:xslt, p:validate-with-xml-schema, p:validate-with-relax-ng, p:validate-with-schematron, p:xsl-formatter, and p:http-request for making REST calls.
  • XProc is declarative, and programming language and platform neutral. For example, work is underway for an XQuery implementation for the Exist database. Note that step 1. above can be implemented with the SQL extension function provided by Exist for retrieving data from relational databases using JDBC.
  • XProc is extensible, so you can add custom steps. For example the XML Calabash XProc implementation provides extension steps such as cx:nvdl and cx:unzip.
  • XProc provides exception handlers which are essential for any complex workflow.
  • XProc pipelines are amenable to streaming.
  • XProc can simplify the design, documentation, and maintenance of very complex publishing workflows. Since XProc is in XML format, one can envision a visual designer for building XProc pipelines or a tool (such as Graphviz) to visualize XProc documents in the form of processing diagrams.

There are other pipeline technologies that can be useful as well, particularly if you're building mashup applications. Yahoo Pipes is a good choice for combining feeds from various sources, manipulating them as needed, and outputting RSS, JSON, KML, and other formats. And if you are into Semantic Web meshups, DERI Pipes provides support for RDF, OWL, and SPARQL queries. DERI Pipes works as a command line tool and can be hooked into an XProc pipeline with the XProc built-in p:exec step.