Tuesday, November 11, 2008

The Content Imperative: Unlearning the Relational Model

The relational data model and the SQL query language are an essential part on any computer science curriculum and are well understood by a large number of developers. On the other hand, the use of markup technologies such as SGML and XML for content management and publishing has remained a niche market for highly specialized vendors and consultants.

Today, the majority of developers use XML for application configuration files (e.g. Spring, Hibernate, JSF), syndication (RSS), and web services. When these developers are asked to design and develop XML content management and publishing applications, they often approach the problem from a relational data paradigm which is what they know and are used to. For example, when migrating content stored in a relational database into an XML format, they will simply dump the relational tables into a flat XML representation. The problem is that content is not relational data. The following are some fundamental differences between content and relational data:

  • Content is created to be consumed by eyeballs
  • Content can be rendered in multiple presentation formats such as print, web, and wireless devices. Therefore it is very important to cleanly separate content from presentation
  • Content can have an inherent deep hierarchical structure. For example, think about the book/part/chapter/section/subsection/paragraph hierarchy
  • The relationships between content items are expressed through hierarchical containment and hyperlinks
  • Content is often mixed (in the sense of mixed content in XML). For example, inside a paragraph, some words are italicized, in bold, or underlined to indicate special meaning
  • Content can have multi-valued properties such as the authors of a document. Multi-valued properties are not supported by SQL.

However, content and data do have one important thing in common. Data and content stay with us for a long time, sometimes forever. APIs, protocols, and programming languages come and go. Therefore, content modeling is by far the most important investment you can make during a content management migration project.

Unstructured Content Modeling

In a typical enterprise, there are two different types of content: unstructured and structured. Unstructured content represents the large majority of content in the enterprise. Examples are Office documents such as Word, PowerPoint, and Excel. Content modeling for unstructured content consists in describing document metadata as well as the relationships between the documents. The metadata is usually stored in a relational database. In a typical CMS, the content model is used to customize the user interface for querying specific documents based on metadata and for customizing the user interface (e.g. for capturing and displaying metadata).

Most content management systems have their own proprietary meta-model. The Java content repository API (JCR) introduced a standardized hierarchical repository model. This article by David Nuescheler (JCR spec Lead and CTO at Day Software) explains the peculiarities of JCR content modeling and the gotchas for people coming from a relational data modeling background.

Apache Jackrabbit (the JCR reference implementation) uses a textual DSL called Compact Namespace and Node Type Definition (CND) for specifying a JCR content model. There is no formal graphical notation like UML or ERDs for specifying JCR content models. Lars Trieloff of Day Software proposed a content modeling notation based on UML and Fundamental Modeling Concepts (FMC).

The Content Management Interoperability Services (CMIS) specification proposed a simplified meta-model based on documents, folders, policies, and relationships. The CMIS query language extends SQL 92 with text search, multi-valued properties, and folder-scoped queries. This choice was made because most existing CMS use a relational database and SQL is well understood by the majority of developers out there.

The Case Against Unstructured Content

The problem with unstructured content is that it cannot be processed and queried like the well-structured relational data stored by the RDBMS on which your ERP and CRM systems sit. XML goes beyond tags (in the web 2.0 sense), taxonomies, full-text search, and content categorization to provide fine-grained content discovery, query, and processing capabilities. With XML, the document becomes the database. If your business is content (you are a media company, a publisher, or the technical documentation department of a manufacturing company), then you should seriously consider the benefits of XML in terms of content longevity, reuse, repurposing, and cross-media publishing.

The question is: how to do it right?


From SGML to XML to the Infoset

Charles Goldfarb invented SGML at IBM in 1969, the same year his colleague Edgar Codd invented the relational model. The SGML (through its very popular subset called XML) and relational models are still rock solid today. To be precise, the abstract data model for XML documents was formally specified in the XML Infoset and subsequently the XQuery 1.0 and XPath 2.0 data model (XDM) specifications.

SGML was originally designed at IBM for the editing, retrieval, and composition of legal documents. The second edition of the Oxford English Dictionary and the US Department of Defense (DoD) adopted SGML in the 80s. The goal of the US DoD CALS initiative was to replace the huge quantities of paper with digital data in the acquisition of weapon systems. Before the adoption of XML by the W3C in February 98, SGML was primarily used in the publishing industry as well as for technical documentation applications in industries such as aerospace, defense, software, and telecommunications. Examples of SGML vocabularies include S1000D (aerospace and defense), Docbook (software), and TIM (telecommunications). The most popular SGML vocabulary, HTML, is the foundation of the web itself.

The XML Infoset describes the content of a well-formed XML document as an abstract tree of information items including document, namespace, element, and attribute information items. These information items have properties such as children and parent. Based on the XML Infoset, the XDM defines the abstract data model for the input to XSLT 2.0 and XQuery 1.0 processors. In addition, the XDM supports XML Schema types, atomic typed values, and ordered heterogeneous sequences.

The relational data model is based on set theory and predicate logic. It was designed originally for accounting and banking systems. Data is represented as n-ary relations and manipulated with relational algebra. CMS vendors and even standard bodies have tried to fork SQL in order to support hierarchies and multi-value properties. It is clear however that XQuery is a superior alternative, specifically designed to address those content-related concerns.

WikiXMLDB is an interesting application that uses XQuery to query Wikipedia content. WikiXMLDB allows you to not only perform database-like queries, but also enables dynamic content assembly or the ability to build compound documents from multiple Wikipedia pages. This opens up new opportunities in terms of content enrichment at a time when publishers are struggling to find new ways to monetize their content assets in the face of declining ad revenues.

Structured Content Modeling

There is a body of specialized knowledge in SGML/XML content analysis and modeling that has been applied successfully to projects such as the Oxford English Dictionary (OED), CALS, and Docbook.

In relational data modeling, the three phases of modeling typically include:

  1. The conceptual or domain model
  2. The logical data model (LDM)
  3. The physical data model (PDM)

The LDM describes entity types, data attributes, and the relationships between the entities. The LDM is normalized to eliminate redundancies. The PDM describes the schema of the database that will be used to store the data and may be denormalized to improve performance. Relational data modeling uses well known modeling notations like Entity-Relationship (ER) diagrams and UML.

The difference between logical model and physical model is even more important in modeling XML documents. The logical model can be expressed in XML Schema or Relax NG. However, the physical model depends on the underlying data persistence mechanism (e.g. relational vs. JCR vs. XQuery-enabled native XML database). Many projects make the mistake of skipping the logical modeling phase. The problem is that the physical storage can and will probably change over time.

There is no formal notation for modeling XML documents. Back in the SGML days, document analysis and modeling was a well understood process. Eve Maler and Jeanne El Andaloussi describes a tree notation for modeling DTDs in their book entitled “Developing SGML DTDs: From Text To Model To Markup”. One of the peculiarities of modeling narrative text is the presence of mixed content which is essential for intelligent processing of content (for example for the automatic extraction of book indexes).

David Carlson proposed a UML Profile for XSD which can be used to auto-generate XML schemas from UML class diagrams. As with any model-driven development tool, care should be taken to ensure that the generated XML Schema complies with XML content modeling principles and satisfies the business and technical requirements such as content reuse and repurposing.

Many people prefer Relax NG to XML Schema for its simplicity. The important think to remember however is that XML Schema is part of a full stack of XML specifications which also includes XForms, XPath 2.0, XQuery, and XSLT 2.0. The ability to load XML Schema types into your XForms, validate XForms submissions against an XML Schema, and create schema-aware XSLT 2.0 transforms and XQuery queries can be a deciding factor.

Instead of starting your content modeling effort from scratch, you can leverage any of the existing proven and well tested document-oriented XML vocabularies such as Docbook, DITA, S1000D, and NewsML.

JCR and XML Content Modeling

JCR supports the import of arbitrary XML documents into a compliant repository. The following is an excerpt from the specification:
On import...

Each XML element E becomes a content repository node of the same name, E.
...
Each child XML element C of XML element E becomes a content repository child node C of node E.
Each XML attribute A within an XML element E becomes a property A of content repository node E. The value of each XML attribute A becomes the value of the corresponding property A.
...
Text within an XML element E becomes a STRING property called jcr:xmlcharacters of a node called jcr:xmltext, which itself becomes a child node of the node E.

This is fine if you're only storing a small quantity of XML documents. Performance will probably degrade quickly if you're storing a large quantity of XML documents with a deep hierarchy. Same name siblings are almost always used in document-oriented XML (e.g. paragraph siblings) and can cause performance to degrade or JCR paths to become brittle if you remove or reorder nodes. JCR allows roundtripping of imported XML. However, JCR adds repository metadata such as jcr:primaryType that must be stripped out at export time. It's possible to derive an optimized JCR content model from the XML document's logical model, although this could prevent you from fully exploiting the original hierarchy of the XML documents in your application. This shows the importance of separating the logical model from the physical model.

You should seriously consider a native XML database when dealing with large quantities of document-oriented XML documents. A simple benchmarking exercise (between Jackrabbit and the Exist database for example) can help you settle down on the right solution.

Business Rules and Content Quality

In addition to specifying the content items, types, and relationships in your schema, you should also specify business rules that are beyond the capabilities of the XML Schema language. ISO Schematron uses XPath 2.0 to declare assertions about arbitrary patterns in XML documents and then reports on the presence or absence of those patterns. In this article, Rick Jelliffe, inventor of Schematron explains how it works.

Assertions and conditional type assignments capabilities have been added to XML Schema 1.1.

Content Capture

XML editors have been around for a long time. However, these XML editors remain complex specialist tools that are often used only by professional technical authors in documentation departments. Using XForms, you can provide a user friendly interface for your end users to contribute XML content by presenting them with a simple XHTML form. The Alfresco web content management (WCM) platform uses XForms for content capture and XSLT/XSL FO for content rendition. The XForms controls can be generated automatically from an XML Schema. Alfresco's implementation is based on the open source Chiba XForms engine.

The Right Tools for the Job

When content is stored in a relational database, typically an object relational mapping (ORM) solution like Hibernate is used to map relational tables into Java objects. The business logic is handled with POJOs such as Spring beans which communicate with the front end through a UI framework or templating technology such as JSP or Freemarker.

When dealing with XML documents however, there are domain specific languages such as XForms, XInclude, XLink, XPointer, XPath 2.0, XSLT 2.0, XQuery, and XSL FO that greatly facilitate the processing of document-oriented XML and provide unmatched processing power when compared to traditional approaches based on SQL, JSP, or Freemarker. These XML-related languages are declarative in nature and require some learning curve. Many software architects and developers are not yet aware of the power of this new paradigm. However, I believe that rather than using only familiar development frameworks, it’s important to always evaluate alternatives and select the best approach even if it involves a learning curve.

XML databases such as Exist can store data natively in XML and provide full XQuery support for sophisticated queries and manipulation of XML content. Exist also provides integrated support for XInclude, XSLT 2.0, and AtomPub. The XRX (XForms, REST, XQuery) architecture with Exist and the Orbeon XForms engine allows you to integrate an XForm front end to an Exist data store through a REST API, therefore bypassing the ORM layer altogether.

Content Modeling for the Semantic Web

The web ontology language (OWL) is used for modeling for the Semantic Web. In the RDF data model, statements are expresses as Subject-Predicate-Object. Specialized RDF stores exist for storing RDF triples. In this blog post, Kingsley Idehen, CEO of OpenLink Software (maker of the Virtuoso RDF triple store), explains why The Time for RDBMS Primacy Downgrade is Nigh!

Conclusion

In summary, content modeling in general and XML content modeling in particular are intrinsically different from relational data modeling. In modeling XML content, the logical model should be separated from the physical model because the latter depends on the persistence storage mechanism which can change during the content's life cycle. XQuery-enabled native XML databases provide a better alternative for storing, querying, and processing large quantities of document-oriented XML. The relational data model is still rock solid today and will be around for many years to come because of its strong foundations in mathematics. However, it is not a panacea for all information management problems. While SQL can be forked to support content characteristics such as hierarchy and multi-valued properties, XQuery was natively designed to address those concerns. Companies can help their developers by providing training on declarative XML processing languages like XForms, XQuery, and XSLT 2.0 which are better suited for handling XML content.

4 comments:

Anonymous said...

Great piece!

The content heads and the data heads definitively have a lot to learn from each other.

Jeff

Pat said...

Martin Fowler, one of the best minds in the software industry had similar thoughts about the database landscape:

http://martinfowler.com/bliki/DatabaseThaw.html

Seth Gottlieb said...

Great article Joel. The one thing that I might add is that the RDBMS/POJO/ORM solution is familiar to most developers (in most languages) and handles many content management use cases. This approach seems to break down when you want to manipulate the unstructured part of a large collection of assets.

I would say that most content is semi-structured with a few structured fields (title, author... anything without markup) and then 1 or more unstructured elements. In many websites, the unstructured pieces are just pushed out as they are stored (with some filtering to prevent XSS attacks). For those cases, RDBMS storage still seems very appropriate.

Joel Amoussou said...

Seth,

I agree with you that an RDBMS is fine if you have metadata fields and unstructured content (typically in HTML). XQuery is ideal (compared to JCR or RDBMS) with document-oriented XML (e.g. DITA, DocBook, and S1000D). Document-oriented XML has many benefits in terms of content repurposing, structured queries, and dynamic content assembly.

The problem is that XML authoring is quite demanding. One emerging alternative is automated entity extraction with tools like OpenCalais.