Monday, September 21, 2009

Relational, XML, or RDF?

During the last 15 years, I have had the opportunity to work with different data models and approaches to application development. I started with SGML in the aerospace content management space back in 1995, then saw the potential of XML and fully embraced it in 1998. Since that time, I have been continuously following the evolution of XML-related specifications and have been able to leverage the bleeding edge, including XForms, XQuery, XSLT2, XProc, ISO Schematron, and even XML Schema 1.1.

However, being a curious person, I decided to explore other approaches to data management and application development. I worked on systems using a relational database backend and application development frameworks like Spring, Hibernate, and JSF. I've been involved in SOAP-based web services projects where XML data (constrained and validated by an XML Schema) was unmarshalled into Java objects, and then persisted into a relational table with an Object-Relational Mapping (ORM) solution such as Hibernate.

I also had the opportunity to work with the Java Content Repository (JCR) model in magazine content publishing, and the Entity-Attribute-Value with Classes and Relationships (EAV/CR) model in the context of medical informatics. EAV/CR is suited for domains where entities can have thousands of frequently changing parameters.

Lately, I have been working on Semantic Web technologies including the RDF data model, OWL (the Web Ontology Language), and the SPARQL query language for RDF.

Clients often ask me which of these approaches is the best or which is the most appropriate for their project. Here is what I think:

  • Different approaches should be part of the software architect's toolkit (not just one).
  • To become more productive in an agile environment, every developer should become a "generalizing specialist".
  • The software architect or developer should be open-minded (no "not invented here" syndrome or "what's wrong with what we're doing now" attitude).
  • The software architect or developer should be willing to learn new technologies outside of their comfort zone and IT leadership should encourage and reward that learning.
  • Learning new technologies sometimes requires a new way of thinking about the problems at hand and "unlearning" old knowledge.
  • It is important not to take a purist or religious stance when selecting an approach, since each has its own merits.
  • Ultimately, the overall context of the project will dictate your choice. This includes, but is not limited to, skill sets, learning curve, application performance, cost, and time to market.

Based on my personal experience, here is what I have learned:

The XPath 2.0 and XQuery 1.0 Data Model (XDM)


The roots of SGML and XML are in content management applications in domains such as law, aerospace, defense, scientific, technical, medical, and scholarly publishing. The XPath 2.0 and XQuery 1.0 Data Model (XDM) is particularly well suited for companies that sell information products directly as a source of revenue (e.g. non-ad-based publishers).

XSLT2 facilitates media-independent publishing (single sourcing) to multiple devices and platforms. XSLT2 is also a very powerful XML transformation language that allows these publishers to perform the series of complex transformations that are often required as the content is extracted from various data sources and assembled into a final information product.

With XQuery, sophisticated contextualized database-like queries can be performed. In addition, XQuery allows the dynamic assembly of content where new compound documents can be constructed on the fly from multiple documents and external data sources. This allows publishers to repurpose content into new information products as needed to satisfy new customer demands and market opportunities.

XInclude enables the chunking of content into reusable pieces. XProc is an XML pipeline language that essentially allows you to automate a complex publishing workflow which typically includes many steps such as content assembly, validation, transformation, and query.
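As a rough illustration of chunked, reusable content, the sketch below uses Python's standard-library xml.etree.ElementInclude module to resolve an xi:include element. The chunk name warning.xml, the in-memory CHUNKS dictionary, and the load_chunk and assemble helpers are assumptions made up for this example; a real publishing system would load chunks from files or a repository, and ElementInclude offers only limited XInclude support.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical reusable chunks, keyed by the href used in xi:include.
CHUNKS = {
    "warning.xml": "<warning>Disconnect power before servicing.</warning>",
}

# A document that pulls in a shared chunk instead of copying it.
MANUAL = """<manual xmlns:xi="http://www.w3.org/2001/XInclude">
  <title>Pump maintenance</title>
  <xi:include href="warning.xml"/>
</manual>"""


def load_chunk(href, parse, encoding=None):
    """Custom loader: return a parsed element for parse="xml", else raw text."""
    if parse == "xml":
        return ET.fromstring(CHUNKS[href])
    return CHUNKS[href]


def assemble(source):
    """Parse the document and replace each xi:include with its chunk."""
    root = ET.fromstring(source)
    ElementInclude.include(root, loader=load_chunk)
    return root


if __name__ == "__main__":
    print(ET.tostring(assemble(MANUAL), encoding="unicode"))
```

The same warning chunk could be referenced from any number of manuals, which is the reuse that chunking is meant to enable.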

The second category of application for which XML is a strong candidate is what is sometimes referred to as an "XML Workflow" application. The typical design pattern here is XRX (XForms, REST, and XQuery), where user inputs are captured with an XForms front end (itself potentially auto-generated from an XML schema) and data is RESTfully submitted to a native XML database, then queried and manipulated with XQuery. The advantages of this approach are:

  • It is more resilient to schema changes. In fact the data can be stored without a schema.
  • It does not require handling the impedance mismatch between XML documents, Java objects, and relational tables, which can introduce design complexity as well as performance and maintainability issues, even when using code generation.

A typical example of an "XML Workflow" application would be a Human Resources (HR) form-based application that allows employees to fill out and submit a form and also provides reporting capabilities.
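To make the schema-resilience point concrete, here is a minimal sketch in Python. It is a stand-in only: a plain list plays the role of the native XML database and ElementTree path lookups play the role of XQuery. The leaveRequest vocabulary, the SUBMISSIONS data, and the total_days_by_employee helper are all hypothetical. The second submission carries extra elements that the first does not, yet the same report code handles both without any schema or mapping layer being updated.

```python
import xml.etree.ElementTree as ET

# Hypothetical form submissions stored as-is; the second one has
# extra elements (dept, reason) added after a form change.
SUBMISSIONS = [
    "<leaveRequest><employee><name>Ana</name></employee>"
    "<days>3</days></leaveRequest>",
    "<leaveRequest><employee><name>Ben</name><dept>QA</dept></employee>"
    "<days>5</days><reason>Travel</reason></leaveRequest>",
]


def total_days_by_employee(docs):
    """Build a simple report: total requested days per employee name."""
    report = {}
    for xml_text in docs:
        root = ET.fromstring(xml_text)
        # Path lookups ignore elements the report does not care about.
        name = root.findtext("employee/name")
        days = int(root.findtext("days"))
        report[name] = report.get(name, 0) + days
    return report


if __name__ == "__main__":
    print(total_days_by_employee(SUBMISSIONS))
```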

The third and last category of application is Web Services (RESTful or SOAP-based) that consume XML data from various sources, store the data directly in a native XML database (bypassing the XML databinding and ORM layers altogether), and perform all processing and queries on the data using a pure XML approach based on XSLT2 and XQuery. An example is a dashboard or mashup application that stores all of the submitted data in a native XML database. In this scenario, the data can be cached for faster response to web services requests. The benefits listed for "XML Workflow" applications apply here as well.


The Relational Model

The relational model is well established and well understood. It is usually an option for data-oriented and enterprise-centric applications that are based on a closed world assumption. In such a scenario, there is usually no need for handling the data in XML and a conventional approach based on JSF, Spring, Hibernate and a relational database backend is enough.

Newer Java EE frameworks like JBoss Seam and its seam-gen code generation tools are particularly well-suited for this kind of task. There is no running away from XML however, since these frameworks use XML for their configuration files. Unfortunately, there is currently a movement away from XML configuration files toward Java annotations due to some developers complaining about "XML Hell".

The relational model supports transactions and is scalable, although a new movement called NoSQL is starting to challenge that last assumption. An article entitled "Is the Relational Database Doomed?" on readwriteweb.com describes this emerging trend.
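For readers less familiar with what transaction support buys you, the following is a small sketch using Python's built-in sqlite3 module. The account table, the open_bank, transfer, and balances helpers, and the amounts are assumptions invented for the example. Using the connection as a context manager commits the transaction if the block succeeds and rolls it back if an exception is raised, so a transfer that would overdraw an account leaves both balances untouched.

```python
import sqlite3


def open_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE account ("
        " id INTEGER PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move money atomically: both updates apply, or neither does."""
    with conn:  # commit on success, roll back on any exception
        conn.execute(
            "UPDATE account SET balance = balance + ? WHERE id = ?",
            (amount, dst),
        )
        # Raises sqlite3.IntegrityError if the CHECK constraint fails,
        # which also undoes the credit made just above.
        conn.execute(
            "UPDATE account SET balance = balance - ? WHERE id = ?",
            (amount, src),
        )


def balances(conn):
    """Return {account id: balance} for all accounts."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))
```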


The RDF Data Model

Semantic Web technologies like RDF (an incarnation of the EAV/CR model mentioned above), OWL, SKOS, SWRL, and SPARQL, together with Linked Data publishing principles, have received a lot of attention lately. They are well suited for the following applications:

  • Applications that need to infer new implicit facts based on existing explicit facts. Such inferences can be driven by an ontology expressed in OWL or a set of rules expressed in a rule language such as SWRL.
  • Applications that need to map concepts across domains such as a trading network where partners use different e-commerce XML vocabularies.
  • Master Data Management (MDM) applications that provide an RDF view and reasoning capabilities in order to facilitate and enhance the process of defining, managing, and querying an organization's core business entities. A paper on IBM's Semantic Master Data Management (SMDM) project is available here.
  • Applications that use a taxonomy, a thesaurus, or a similar concept scheme such as online news archives and medical terminologies. SKOS, recently approved as a W3C recommendation, was designed for that purpose.
  • Silo-busting applications that need to link data items to other data items on the web, in order to perform entity correlation or allow users to explore a topic further. The Linked Data design pattern is based on an open world assumption, uses dereferenceable HTTP URIs for identifying and accessing data items, RDF for describing metadata about those items, and semantic links to describe the relationships between those items. An example is an Open Government application that correlates campaign contributions, voting records, census, and location data. Another example is a semantic social application that combines an individual's profiles and social networks from multiple sites in order to support data portability and fully explore the individual's social graph.
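The first item in the list, inferring implicit facts from explicit ones, can be sketched in a few lines of plain Python. This is a toy forward-chaining loop over two RDFS-style rules (subclass transitivity and type propagation), not an OWL or SWRL reasoner, and the ex: terms and the infer function are hypothetical names chosen for the example.

```python
SUBCLASS_OF = "rdfs:subClassOf"
TYPE = "rdf:type"

# Explicit facts as (subject, predicate, object) triples; the ex: terms
# are made up for illustration.
EXPLICIT = {
    ("ex:Aspirin", TYPE, "ex:Analgesic"),
    ("ex:Analgesic", SUBCLASS_OF, "ex:Drug"),
    ("ex:Drug", SUBCLASS_OF, "ex:Substance"),
}


def infer(triples):
    """Apply two RDFS-style rules repeatedly until nothing new appears."""
    facts = set(triples)
    while True:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                if p2 != SUBCLASS_OF or o != s2:
                    continue
                if p == SUBCLASS_OF:
                    # A subClassOf B, B subClassOf C => A subClassOf C
                    new.add((s, SUBCLASS_OF, o2))
                elif p == TYPE:
                    # x type A, A subClassOf B => x type B
                    new.add((s, TYPE, o2))
        if new <= facts:
            return facts
        facts |= new


if __name__ == "__main__":
    for triple in sorted(infer(EXPLICIT) - EXPLICIT):
        print(triple)
```

Nothing in the explicit data says that aspirin is a substance; that fact only appears after the rules have been applied, which is the kind of result an ontology-driven application relies on.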


Of course, these different approaches are not mutually exclusive. For example, it is possible to provide an RDF view or layer on top of existing XML and relational database applications.
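As a very rough sketch of what an RDF view over a relational table could look like, the Python snippet below turns each row into a set of triples: the primary key becomes part of the subject URI and each remaining column becomes a predicate. The http://example.org/ namespace, the URI layout, and the rows_as_triples helper are assumptions made for this example and do not follow any particular mapping standard or product. The table and key names are interpolated into the SQL, so they are assumed to come from trusted code rather than user input.

```python
import sqlite3

BASE = "http://example.org/"  # hypothetical namespace for minted URIs


def rows_as_triples(conn, table, key):
    """Expose every row of a table as (subject, predicate, object) triples."""
    conn.row_factory = sqlite3.Row  # lets us read columns by name
    triples = []
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        subject = f"{BASE}{table}/{row[key]}"
        for column in row.keys():
            value = row[column]
            # The key is already in the subject URI; NULLs yield no triple.
            if column != key and value is not None:
                triples.append((subject, f"{BASE}{table}#{column}", value))
    return triples


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
    )
    conn.execute("INSERT INTO employee VALUES (7, 'Ana', NULL)")
    conn.execute("INSERT INTO employee VALUES (8, 'Ben', 'QA')")
    for triple in rows_as_triples(conn, "employee", "id"):
        print(triple)
```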