
Wednesday, June 9, 2010

Data Modeling for Electronic Health Records (EHR) Systems

Getting the data model right is of paramount importance for an Electronic Health Records (EHR) system. The factors that drive the data model include but are not limited to:

  • Patient safety
  • Support for clinical workflows
  • Different uses of the data such as input to clinical decision support systems
  • Reporting and analytics
  • Regulatory requirements such as Meaningful Use criteria

Model First

Proven methodologies like contract-first web service design and model-driven development (MDD) put the emphasis on deriving application code from the data model, not the other way around. Thousands of lines of code can be auto-generated from the model, so it's important to get the model right.


Requirements Gathering

The objective here is to determine the entities, their attributes, and the relationships between those entities. For example, what are the attributes that are necessary to describe a patient's condition and how do you express the fact that a condition is a manifestation of an allergy? The data modeler should work closely with clinicians to gather those requirements. Industry standards should be leveraged as well. For example, HITSP C32 defines the data elements for each EHR data module such as conditions, medications, allergies, and lab results. These data elements are then mapped to the HL7 Continuity of Care Document (CCD) XML schema.

The HL7 CCD is itself derived from the HL7 Reference Information Model (RIM). The latter is expressed as a set of UML class diagrams and is the foundational model for health care and clinical data. A simpler alternative to the CCD is the ASTM Continuity of Care Record (CCR). Both the CCD and the CCR provide an XML schema for data exchange and are named in the Meaningful Use criteria. Another relevant data model is the HL7 vMR (Virtual Medical Record), which aims to define a data model for the input and output of Clinical Decision Support Systems (CDSS).

These standards can be cumbersome to use directly from a software development perspective. Nonetheless, they can inform the design of the data model for an EHR system. Alignment with the CCD and the CCR will facilitate data exchange with other providers and organizations. The following are Meaningful Use criteria for data exchange:

  1. Electronically receive a patient summary record, from other providers and organizations including, at a minimum, diagnostic test results, problem list, medication list, medication allergy list, immunizations, and procedures and upon receipt of a patient summary record formatted in an alternative standard specified in Table 2A row 1, displaying it in human readable format.

  2. Enable a user to electronically transmit a patient summary record to other providers and organizations including, at a minimum, diagnostic test results, problem list, medication list, medication allergy list, immunizations, and procedures in accordance with the standards specified in Table 2A row 1.



Applying Data Modeling Patterns

Applying data modeling patterns improves model consistency and quality. Relational data modeling is a well-established discipline. My favorite resource for relational data modeling patterns is The Data Model Resource Book, Vol. 3: Universal Patterns for Data Modeling.

Some XML Schema best practices can be found here.


Data Stores

Today, options for data storage are no longer limited to relational databases. Alternatives include native XML databases (e.g. DB2 pureXML), Entity-Attribute-Value with Classes and Relationships (EAV/CR), and Resource Description Framework (RDF) stores.

Native XML databases are more resilient to schema changes and do not require handling the impedance mismatch between XML documents, Java objects, and relational tables, which can introduce design complexity as well as performance and maintainability issues.

Storing EHRs in an RDF store can enable the inference of new medical facts from existing explicit medical facts. Such inferences can be driven by an ontology expressed in OWL or a set of rules expressed in a rule language such as SWRL. Semantic Web technologies can also be helpful for checking the consistency of a model, for integrating data and knowledge across domains (e.g. the genomics and clinical domains), and for managing classification schemes like medical terminologies. RDF, OWL, and SWRL have been successfully implemented in Clinical Decision Support Systems (CDSS).

The data modeling notation used should be independent of the storage model or at least compatible with the latter. For example, if native XML storage is used, then a relational modeling notation might not be appropriate. In general, UML provides the right level of abstraction for implementation-agnostic modeling.


Due Diligence

When adopting a "NoSQL" storage model, it is important to ensure that (a) the database can meet performance and scalability criteria and (b) the team has the skills to develop and maintain the database. Due diligence should be performed through benchmarking, using a tool such as IBM's Transaction Processing over XML (TPoX) benchmark. The team might also need formal training in a new query language like XQuery or SPARQL.


A Longitudinal View of the Patient's Health

Maintaining an up-to-date and truly longitudinal view of a patient's medical history requires merging and reconciling data from heterogeneous sources including providers' EMR systems, lab companies, medical devices, and payers' claim transaction repositories. The data model should facilitate the assembly of data from such diverse sources. XML tools based on XSLT, XQuery, or XQuery Update can be used to automate the merging.
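As a minimal sketch of such automated merging (the collection URIs and element names below are hypothetical, not drawn from any particular standard), an XQuery query could assemble a longitudinal record by pulling one patient's entries from several source collections:

  xquery version "1.0";
  (: Merge condition entries for one patient from two hypothetical
     source collections into a single longitudinal document. :)
  declare variable $patient-id external;

  <patient-record id="{$patient-id}">
    <conditions>
    {
      for $c in (collection('/db/emr')/record[@patient = $patient-id]/condition,
                 collection('/db/claims')/claim[@patient = $patient-id]/diagnosis)
      order by $c/onset-date
      return
        <condition code="{$c/@code}" source="{base-uri($c)}">{ string($c/name) }</condition>
    }
    </conditions>
  </patient-record>

In practice, true reconciliation (e.g. de-duplicating the same condition reported by two sources) would require additional matching logic on codes and dates.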


The Importance of Data Validation

Data validation can be performed at the database layer, the application layer, and the UI layer. The data model should support the validation of the data. The following are examples of techniques that can be used for data validation:

  • XML Schema for structural validation of XML documents
  • ISO Schematron (based on XPath 2.0 and XSLT 2.0) for business rules validation of XML documents (see the sketch after this list)
  • A business rules engine like Drools
  • A data processing framework like Smooks
  • The validation features of a UI framework such as JSF2
  • The built-in validation features of the database.
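To make the Schematron bullet concrete, here is a minimal sketch of the kind of business rule that structural validation cannot express, written here as an XQuery check (the element names are hypothetical; in a production system the same assertion would typically live in a Schematron pattern):

  (: Business rule: a medication allergy must reference a medication code.
     Element names are hypothetical. :)
  declare function local:check-allergy($a as element(allergy)) as xs:string* {
    if ($a/@type = 'medication' and empty($a/medication/@code))
    then concat('Allergy ', string($a/@id), ' is missing a medication code')
    else ()
  };

  for $a in collection('/db/ehr')//allergy
  return local:check-allergy($a)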


The Future: Modeling with the NIEM IEPD


The HHS ONC issued an RFP for using the National Information Exchange Model (NIEM) Information Exchange Package Documentation (IEPD) process for healthcare data exchange. The ONC will release a NIEM Concept of Operations (ConOps). The NIEM IEPD process is explained here.

Monday, September 21, 2009

Relational, XML, or RDF?

During the last 15 years, I have had the opportunity to work with different data models and approaches to application development. I started with SGML in the aerospace content management space back in 1995, then saw the potential of XML and fully embraced it in 1998. Since that time, I have been continuously following the evolution of XML-related specifications and have been able to leverage the bleeding edge, including XForms, XQuery, XSLT2, XProc, ISO Schematron, and even XML Schema 1.1.

However, being a curious person, I decided to explore other approaches to data management and application development. I worked on systems using a relational database backend and application development frameworks like Spring, Hibernate, and JSF. I've been involved in SOAP-based web services projects where XML data (constrained and validated by an XML Schema) was unmarshalled into Java objects, and then persisted into a relational table with an Object-Relational Mapping (ORM) solution such as Hibernate.

I also had the opportunity to work with the Java Content Repository (JCR) model in magazine content publishing, and the Entity-Attribute-Value with Classes and Relationships (EAV/CR) model in the context of medical informatics. EAV/CR is suited for domains where entities can have thousands of frequently changing parameters.

Lately, I have been working on Semantic Web technologies including the RDF data model, OWL (the Web Ontology Language), and the SPARQL query language for RDF.

Clients often ask me which of these approaches is the best or which is the most appropriate for their project. Here is what I think:

  • Different approaches should be part of the software architect's toolkit (not just one).
  • To become more productive in an agile environment, every developer should become a "generalizing specialist".
  • The software architect or developer should be open minded (no "not invented here syndrome" or "what's wrong with what we're doing now" attitude).
  • The software architect or developer should be willing to learn new technologies outside of their comfort zone and IT leadership should encourage and reward that learning.
  • Learning new technologies sometimes requires a new way of thinking about the problems at hand and "unlearning" old knowledge.
  • It is important not to take a purist or religious stance when selecting an approach, since each has its own merits.
  • Ultimately, the overall context of the project will dictate your choice. This includes but is not limited to: skill set, learning curve, application performance, cost, and time to market.

Based on my personal experience, here is what I have learned:

The XPath 2.0 and XQuery 1.0 Data Model (XDM)


The roots of SGML and XML are in content management applications in domains such as law, aerospace, defense, scientific, technical, medical, and scholarly publishing. The XPath 2.0 and XQuery 1.0 Data Model (XDM) is particularly well suited for companies selling information products directly as a source of revenue (e.g. non-ad-based publishers).

XSLT2 facilitates media-independent publishing (single sourcing) to multiple devices and platforms. It is also a very powerful XML transformation language that allows these publishers to perform the series of complex transformations that are often required as content is extracted from various data sources and assembled into a final information product.

With XQuery, sophisticated contextualized database-like queries can be performed. In addition, XQuery allows the dynamic assembly of content where new compound documents can be constructed on the fly from multiple documents and external data sources. This allows publishers to repurpose content into new information products as needed to satisfy new customer demands and market opportunities.
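As a minimal sketch (the document URIs and element names are hypothetical), dynamic assembly with XQuery can be as simple as constructing a new compound document from pieces of existing ones:

  (: Build a new compound document on the fly from two existing documents. :)
  <report title="Quarterly Review">
    <summary>{ doc('summaries/q3.xml')/summary/node() }</summary>
    <findings>
    {
      for $f in doc('audits/q3-audit.xml')//finding[@severity = 'high']
      return <item>{ string($f/title) }</item>
    }
    </findings>
  </report>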

XInclude enables the chunking of content into reusable pieces. XProc is an XML pipeline language that essentially allows you to automate a complex publishing workflow which typically includes many steps such as content assembly, validation, transformation, and query.

The second category of application for which XML is a strong candidate is what is sometimes referred to as an "XML Workflow" application. The typical design pattern here is XRX (XForms, REST, and XQuery), where user inputs are captured with an XForms front end (itself potentially auto-generated from an XML schema) and data is RESTfully submitted to a native XML database, then queried and manipulated with XQuery. The advantages of this approach are:

  • It is more resilient to schema changes. In fact, the data can be stored without a schema.
  • It does not require handling the impedance mismatch between XML documents, Java objects, and relational tables which can introduce design complexity, performance, and maintainability issues even when using code generation.

A typical example of an "XML Workflow" application would be a Human Resources (HR) form-based application that allows employees to fill in and submit a form and also provides reporting capabilities.
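As a minimal sketch of the reporting side of such an application (the collection path and element names are hypothetical), a stored XQuery could aggregate the RESTfully submitted form documents directly, with no ORM layer in between:

  (: Aggregate <timesheet> documents submitted from an XForms front end. :)
  <report>
  {
    for $emp in distinct-values(collection('/db/hr/timesheets')//timesheet/employee)
    let $hours := sum(collection('/db/hr/timesheets')//timesheet[employee = $emp]/hours)
    order by $emp
    return <employee name="{$emp}" total-hours="{$hours}"/>
  }
  </report>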

The third and last category of application is Web Services (RESTful or SOAP-based) that consume XML data from various sources, store the data natively in an XML database (bypassing the XML data binding and ORM layers altogether), and perform all processing and queries on the data using a pure XML approach based on XSLT2 and XQuery. An example is a dashboard or mashup application that stores all of the submitted data in a native XML database. In this scenario, the data can be cached for faster responses to web service requests. Again, the benefits listed for "XML Workflow" applications apply here as well.


The Relational Model

The relational model is well established and well understood. It is a natural option for data-oriented and enterprise-centric applications that are based on a closed-world assumption. In such a scenario, there is usually no need for handling the data in XML, and a conventional approach based on JSF, Spring, Hibernate, and a relational database backend is enough.

Newer Java EE frameworks like JBoss Seam, with its seam-gen code generation tool, are particularly well suited for this kind of task. There is no running away from XML, however, since these frameworks use XML for their configuration files. Unfortunately, there is currently a movement away from XML configuration files toward Java annotations, driven by developers complaining about "XML Hell".

The relational model supports transactions and is scalable, although a new movement called NoSQL is starting to challenge the scalability assumption. An article entitled "Is the Relational Database Doomed?" on readwriteweb.com describes this emerging trend.


The RDF Data Model

Semantic Web technologies like RDF (an incarnation of the EAV/CR model mentioned above), OWL, SKOS, SWRL, and SPARQL, together with Linked Data publishing principles, have received a lot of attention lately. They are well suited for the following applications:

  • Applications that need to infer new implicit facts based on existing explicit facts. Such inferences can be driven by an ontology expressed in OWL or a set of rules expressed in a rule language such as SWRL.
  • Applications that need to map concepts across domains such as a trading network where partners use different e-commerce XML vocabularies.
  • Master Data Management (MDM) applications that provide an RDF view and reasoning capabilities in order to facilitate and enhance the process of defining, managing, and querying an organization's core business entities. A paper on IBM's Semantic Master Data Management (SMDM) project is available here.
  • Applications that use a taxonomy, a thesaurus, or a similar concept scheme, such as online news archives and medical terminologies. SKOS, recently approved as a W3C Recommendation, was designed for that purpose.
  • Silo-busting applications that need to link data items to other data items on the web, in order to perform entity correlation or allow users to explore a topic further. The Linked Data design pattern is based on an open-world assumption; it uses dereferenceable HTTP URIs for identifying and accessing data items, RDF for describing metadata about those items, and semantic links to describe the relationships between those items. An example is an Open Government application that correlates campaign contributions, voting records, census, and location data. Another example is a semantic social application that combines an individual's profiles and social networks from multiple sites in order to support data portability and fully explore the individual's social graph.


Of course, these different approaches are not mutually exclusive. For example, it is possible to provide an RDF view or layer on top of existing XML and relational database applications.

Tuesday, November 11, 2008

The Content Imperative: Unlearning the Relational Model

The relational data model and the SQL query language are an essential part of any computer science curriculum and are well understood by a large number of developers. On the other hand, the use of markup technologies such as SGML and XML for content management and publishing has remained a niche market for highly specialized vendors and consultants.

Today, the majority of developers use XML for application configuration files (e.g. Spring, Hibernate, JSF), syndication (RSS), and web services. When these developers are asked to design and develop XML content management and publishing applications, they often approach the problem from a relational data paradigm which is what they know and are used to. For example, when migrating content stored in a relational database into an XML format, they will simply dump the relational tables into a flat XML representation. The problem is that content is not relational data. The following are some fundamental differences between content and relational data:

  • Content is created to be consumed by eyeballs
  • Content can be rendered in multiple presentation formats such as print, web, and wireless devices. Therefore it is very important to cleanly separate content from presentation
  • Content can have an inherent deep hierarchical structure. For example, think about the book/part/chapter/section/subsection/paragraph hierarchy
  • The relationships between content items are expressed through hierarchical containment and hyperlinks
  • Content is often mixed (in the sense of mixed content in XML). For example, inside a paragraph, some words are italicized, in bold, or underlined to indicate special meaning
  • Content can have multi-valued properties, such as the authors of a document. Multi-valued properties are not directly supported by standard SQL.

However, content and data do have one important thing in common: both stay with us for a long time, sometimes forever. APIs, protocols, and programming languages come and go. Therefore, content modeling is by far the most important investment you can make during a content management migration project.

Unstructured Content Modeling

In a typical enterprise, there are two different types of content: unstructured and structured. Unstructured content represents the large majority of content in the enterprise; examples are Office documents such as Word, PowerPoint, and Excel. Content modeling for unstructured content consists of describing document metadata as well as the relationships between the documents. The metadata is usually stored in a relational database. In a typical CMS, the content model is used to customize the user interface, for example for querying documents based on their metadata and for capturing and displaying that metadata.

Most content management systems have their own proprietary meta-model. The Java Content Repository (JCR) API introduced a standardized hierarchical repository model. This article by David Nuescheler (JCR spec lead and CTO at Day Software) explains the peculiarities of JCR content modeling and the gotchas for people coming from a relational data modeling background.

Apache Jackrabbit (the JCR reference implementation) uses a textual DSL called Compact Namespace and Node Type Definition (CND) for specifying a JCR content model. There is no formal graphical notation like UML or ERDs for specifying JCR content models. Lars Trieloff of Day Software proposed a content modeling notation based on UML and Fundamental Modeling Concepts (FMC).

The Content Management Interoperability Services (CMIS) specification proposed a simplified meta-model based on documents, folders, policies, and relationships. The CMIS query language extends SQL-92 with text search, multi-valued properties, and folder-scoped queries. This choice was made because most existing CMSs use a relational database and SQL is well understood by the majority of developers out there.

The Case Against Unstructured Content

The problem with unstructured content is that it cannot be processed and queried like the well-structured relational data stored by the RDBMS on which your ERP and CRM systems sit. XML goes beyond tags (in the web 2.0 sense), taxonomies, full-text search, and content categorization to provide fine-grained content discovery, query, and processing capabilities. With XML, the document becomes the database. If your business is content (you are a media company, a publisher, or the technical documentation department of a manufacturing company), then you should seriously consider the benefits of XML in terms of content longevity, reuse, repurposing, and cross-media publishing.

The question is: how to do it right?


From SGML to XML to the Infoset

Charles Goldfarb invented GML, the precursor of SGML, at IBM in 1969, the same year his colleague Edgar Codd first described the relational model. The SGML model (through its very popular subset, XML) and the relational model are still rock solid today. To be precise, the abstract data model for XML documents was formally specified in the XML Infoset and subsequently in the XQuery 1.0 and XPath 2.0 Data Model (XDM) specifications.

SGML was originally designed at IBM for the editing, retrieval, and composition of legal documents. The second edition of the Oxford English Dictionary and the US Department of Defense (DoD) adopted SGML in the 1980s; the goal of the US DoD CALS initiative was to replace the huge quantities of paper involved in the acquisition of weapon systems with digital data. Before the adoption of XML by the W3C in February 1998, SGML was primarily used in the publishing industry as well as for technical documentation applications in industries such as aerospace, defense, software, and telecommunications. Examples of SGML vocabularies include S1000D (aerospace and defense), Docbook (software), and TIM (telecommunications). The most popular SGML vocabulary, HTML, is the foundation of the web itself.

The XML Infoset describes the content of a well-formed XML document as an abstract tree of information items including document, namespace, element, and attribute information items. These information items have properties such as children and parent. Based on the XML Infoset, the XDM defines the abstract data model for the input to XSLT 2.0 and XQuery 1.0 processors. In addition, the XDM supports XML Schema types, atomic typed values, and ordered heterogeneous sequences.
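A two-line XQuery fragment illustrates what the XDM adds beyond the Infoset, namely typed atomic values and ordered, heterogeneous sequences that can mix nodes and atomics:

  let $seq := (1, "two", xs:date("2008-11-11"), <item>three</item>)
  return ($seq[3] instance of xs:date, count($seq))
  (: yields true, 4 :)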

The relational data model is based on set theory and predicate logic; data is represented as n-ary relations and manipulated with relational algebra. It was originally designed with business data processing such as accounting and banking systems in mind. CMS vendors and even standards bodies have tried to fork SQL in order to support hierarchies and multi-valued properties. It is clear, however, that XQuery is a superior alternative, specifically designed to address those content-related concerns.
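A short example shows why: hierarchy and multi-valued properties (here, multiple authors per book) are native to XQuery, with no forking required. The bibliography document is hypothetical:

  for $b in doc('bibliography.xml')//book[author = 'Eve Maler']
  return
    <hit title="{$b/title}">
      { for $a in $b/author return <author>{ string($a) }</author> }
    </hit>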

WikiXMLDB is an interesting application that uses XQuery to query Wikipedia content. WikiXMLDB allows you not only to perform database-like queries but also to assemble content dynamically, building compound documents from multiple Wikipedia pages. This opens up new opportunities in terms of content enrichment at a time when publishers are struggling to find new ways to monetize their content assets in the face of declining ad revenues.

Structured Content Modeling

There is a body of specialized knowledge in SGML/XML content analysis and modeling that has been applied successfully to projects such as the Oxford English Dictionary (OED), CALS, and Docbook.

In relational data modeling, the three phases of modeling typically include:

  1. The conceptual or domain model
  2. The logical data model (LDM)
  3. The physical data model (PDM)

The LDM describes entity types, data attributes, and the relationships between the entities. The LDM is normalized to eliminate redundancies. The PDM describes the schema of the database that will be used to store the data and may be denormalized to improve performance. Relational data modeling uses well known modeling notations like Entity-Relationship (ER) diagrams and UML.

The distinction between the logical model and the physical model is even more important when modeling XML documents. The logical model can be expressed in XML Schema or Relax NG. However, the physical model depends on the underlying data persistence mechanism (e.g. relational vs. JCR vs. an XQuery-enabled native XML database). Many projects make the mistake of skipping the logical modeling phase. The problem is that the physical storage can, and probably will, change over time.

There is no widely adopted formal notation for modeling XML documents. Back in the SGML days, document analysis and modeling was a well understood process. Eve Maler and Jeanne El Andaloussi describe a tree notation for modeling DTDs in their book entitled “Developing SGML DTDs: From Text to Model to Markup”. One of the peculiarities of modeling narrative text is the presence of mixed content, which is essential for the intelligent processing of content (for example, for the automatic extraction of book indexes).
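Continuing the book index example, here is a minimal XQuery sketch of how inline markup in mixed content supports that kind of intelligent processing (the <indexterm> convention shown is hypothetical, though Docbook defines a similar element):

  (: Extract a back-of-book index from inline <indexterm> elements
     buried in the mixed content of paragraphs. :)
  <index>
  {
    for $term in distinct-values(doc('book.xml')//para/indexterm)
    order by $term
    return
      <entry term="{$term}">
      { for $s in doc('book.xml')//section[.//indexterm = $term]
        return <ref section="{$s/@id}"/> }
      </entry>
  }
  </index>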

David Carlson proposed a UML Profile for XSD which can be used to auto-generate XML schemas from UML class diagrams. As with any model-driven development tool, care should be taken to ensure that the generated XML Schema complies with XML content modeling principles and satisfies the business and technical requirements such as content reuse and repurposing.

Many people prefer Relax NG to XML Schema for its simplicity. The important thing to remember, however, is that XML Schema is part of a full stack of XML specifications that also includes XForms, XPath 2.0, XQuery, and XSLT 2.0. The ability to load XML Schema types into your XForms, validate XForms submissions against an XML Schema, and create schema-aware XSLT 2.0 transforms and XQuery queries can be a deciding factor.

Instead of starting your content modeling effort from scratch, you can leverage any of the existing proven and well tested document-oriented XML vocabularies such as Docbook, DITA, S1000D, and NewsML.

JCR and XML Content Modeling

JCR supports the import of arbitrary XML documents into a compliant repository. The following is an excerpt from the specification:
On import...

Each XML element E becomes a content repository node of the same name, E.
...
Each child XML element C of XML element E becomes a content repository child node C of node E.
Each XML attribute A within an XML element E becomes a property A of content repository node E. The value of each XML attribute A becomes the value of the corresponding property A.
...
Text within an XML element E becomes a STRING property called jcr:xmlcharacters of a node called jcr:xmltext, which itself becomes a child node of the node E.
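As a concrete illustration of these rules, consider a hypothetical two-paragraph section, shown here as an XQuery element constructor with the resulting repository structure sketched in a comment:

  let $doc :=
    <section title="Introduction">
      <para>First paragraph.</para>
      <para>Second paragraph.</para>
    </section>
  (: Under the import rules quoted above, this yields roughly:
     section                      <- node, from element E
       title = "Introduction"     <- property, from attribute A
       para                       <- child node
         jcr:xmltext with jcr:xmlcharacters = "First paragraph."
       para                       <- same-name sibling node
         jcr:xmltext with jcr:xmlcharacters = "Second paragraph."
  :)
  return $doc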

This mapping is fine if you're only storing a small quantity of XML documents. Performance will probably degrade quickly if you're storing a large number of XML documents with deep hierarchies. Same-name siblings are almost always present in document-oriented XML (e.g. paragraph siblings) and can cause performance to degrade or JCR paths to become brittle if you remove or reorder nodes. JCR allows round-tripping of imported XML; however, JCR adds repository metadata such as jcr:primaryType that must be stripped out at export time. It's possible to derive an optimized JCR content model from the XML document's logical model, although this could prevent you from fully exploiting the original hierarchy of the XML documents in your application. This shows the importance of separating the logical model from the physical model.

You should seriously consider a native XML database when dealing with large quantities of document-oriented XML documents. A simple benchmarking exercise (between Jackrabbit and the eXist database, for example) can help you settle on the right solution.

Business Rules and Content Quality

In addition to specifying the content items, types, and relationships in your schema, you should also specify business rules that are beyond the capabilities of the XML Schema language. ISO Schematron uses XPath 2.0 to declare assertions about arbitrary patterns in XML documents and then reports on the presence or absence of those patterns. In this article, Rick Jelliffe, the inventor of Schematron, explains how it works.

Assertion and conditional type assignment capabilities have been added in XML Schema 1.1.

Content Capture

XML editors have been around for a long time. However, they remain complex specialist tools that are often used only by professional technical authors in documentation departments. Using XForms, you can provide a user-friendly interface that lets end users contribute XML content by filling in a simple XHTML form. The Alfresco web content management (WCM) platform uses XForms for content capture and XSLT/XSL-FO for content rendition. The XForms controls can be generated automatically from an XML Schema. Alfresco's implementation is based on the open source Chiba XForms engine.

The Right Tools for the Job

When content is stored in a relational database, an object-relational mapping (ORM) solution like Hibernate is typically used to map relational tables to Java objects. The business logic is handled by POJOs such as Spring beans, which communicate with the front end through a UI framework or templating technology such as JSP or Freemarker.

When dealing with XML documents, however, there are domain-specific languages such as XForms, XInclude, XLink, XPointer, XPath 2.0, XSLT 2.0, XQuery, and XSL-FO that greatly facilitate the processing of document-oriented XML and provide unmatched processing power compared to traditional approaches based on SQL, JSP, or Freemarker. These XML-related languages are declarative in nature and involve a learning curve. Many software architects and developers are not yet aware of the power of this paradigm. However, I believe that rather than reaching only for familiar development frameworks, it's important to always evaluate alternatives and select the best approach, even if it involves a learning curve.

XML databases such as eXist can store data natively in XML and provide full XQuery support for sophisticated queries and manipulation of XML content. eXist also provides integrated support for XInclude, XSLT 2.0, and AtomPub. The XRX (XForms, REST, XQuery) architecture with eXist and the Orbeon XForms engine allows you to connect an XForms front end to an eXist data store through a REST API, thereby bypassing the ORM layer altogether.

Content Modeling for the Semantic Web

The Web Ontology Language (OWL) is used for modeling on the Semantic Web. In the RDF data model, statements are expressed as subject-predicate-object triples. Specialized RDF stores exist for storing these triples. In this blog post, Kingsley Idehen, CEO of OpenLink Software (maker of the Virtuoso RDF triple store), explains why "The Time for RDBMS Primacy Downgrade is Nigh!"

Conclusion

In summary, content modeling in general and XML content modeling in particular are intrinsically different from relational data modeling. In modeling XML content, the logical model should be separated from the physical model because the latter depends on the persistence mechanism, which can change during the content's life cycle. XQuery-enabled native XML databases provide a better alternative for storing, querying, and processing large quantities of document-oriented XML. The relational data model is still rock solid today and will be around for many years to come because of its strong foundations in mathematics. However, it is not a panacea for all information management problems. While SQL can be forked to support content characteristics such as hierarchy and multi-valued properties, XQuery was designed natively to address those concerns. Companies can help their developers by providing training on declarative XML processing languages like XForms, XQuery, and XSLT 2.0, which are better suited for handling XML content.