Sunday, December 27, 2009

Unlocking the Potential of Health Information Technology (HIT)

Around the world, HIT is widely seen as a key factor in containing rising health care costs and improving the quality of patient care. President Obama, in a speech at George Mason University in January 2009, declared:

To improve the quality of our health care while lowering its cost, we will make the immediate investments necessary to ensure that, within five years, all of America's medical records are computerized.

Yet, recent studies have raised some concerns about the effectiveness of Electronic Health Records (EHRs). A survey of 4,000 hospitals conducted by Dr. David Himmelstein and his team of researchers at Harvard Medical School, published in the American Journal of Medicine, concluded that the adoption of EHRs has had little impact on administrative costs and quality of care.

Another study, published in the Milbank Quarterly, raised questions about the effectiveness of EHRs (as compared to paper records) in primary clinical work.

In the US, the American Recovery and Reinvestment Act (ARRA) of 2009 (also known as the "economic stimulus package") contains a portion called the Health Information Technology for Economic and Clinical Health Act (HITECH Act), which provides $19.2 billion in incentives to physicians and hospitals for the adoption of EHRs. No matter where you stand on the political spectrum, ARRA, the HITECH Act, and Health Care Reform are now realities to contend with in the US health care sector.

What can be done to ensure that these massive investments in HIT and EHRs live up to their expectations? A research paper entitled "Use of Electronic Health Records in US Hospitals" and published in April 2009 in The New England Journal of Medicine reveals:

On the basis of responses from 63.1% of hospitals surveyed, only 1.5% of U.S. hospitals have a comprehensive electronic-records system (i.e., present in all clinical units), and an additional 7.6% have a basic system (i.e., present in at least one clinical unit). Computerized provider-order entry for medications has been implemented in only 17% of hospitals.

The EHR is the cornerstone of the emerging Nationwide Health Information Network (NHIN). Software developers can do all sorts of interesting things (in the spirit of "meaningful use") with patient health data once it becomes available in electronic form. In 2010, EHRs will be a top priority for providers because of the incentives available under the HITECH Act. The ECRI Institute puts EHRs in the number two spot on its list of Top 7 Technologies for Health Plans in 2010.

The following factors will be key to unlocking the full potential of HIT:

  • Standardization
  • Open Source
  • Training the HIT Workforce

Standardization


Standards are important for the seamless exchange of data between the different stakeholders (hospitals, physicians, payers, lab companies, etc.) in the health care industry. Furthermore, due to the complexity of the health care domain, organizations adopting HIT cannot afford to reinvent the wheel when deciding on key issues such as:

  • A data model for representing EHR data and quality measure reporting content.
  • Coding systems for representing patient health information in a machine-processable format.
  • A language for expressing clinical practice guidelines for their automated execution in clinical decision support systems (CDSS) as a "meaningful use" of EHRs.
  • A protocol for cross-organization exchange of patient data.
  • Technologies for privacy, security, audit trails, and patient consents.

Fortunately, some of these standards exist today and should be adopted. Organizations such as the Health Information Technology Standards Panel (HITSP) and Integrating the Healthcare Enterprise (IHE) in the US, and Canada Health Infoway's Standards Collaborative, are working on harmonizing standards for data exchange as well as security and privacy in the health care domain.

EHR Data and Quality Reporting


The HL7 Reference Information Model (RIM), expressed as a set of UML class diagrams, is the foundation model for health care and clinical data. The HL7 Clinical Document Architecture (CDA), which is derived from the HL7 RIM, has been widely adopted around the world for EHR implementation projects. The HL7 Continuity of Care Document (CCD) is a constraint on the CDA driven by the requirements of the ASTM Continuity of Care Record (CCR) specification. The HITSP C32 specification, which is a further constraint on the HL7 CCD, is emerging as the standard for the exchange of EHR data in the US realm.

On the CCR vs. CCD debate, I have come to appreciate and respect the exhaustive metadata provided by the CCD, which is important not only for medico-legal reasons but also for building software applications such as clinical decision support systems (CDSS) that rely heavily on the availability of such metadata. However, I also realize that some patient-facing web applications for managing personal health records (PHRs) can benefit from the simplicity of the CCR.

For collecting and reporting performance measure data to improve the quality of care, the Physician Quality Reporting Initiative (PQRI) Registry XML Specification and the HL7 Quality Reporting Document Architecture (QRDA) are available.

Laboratory content can be exchanged in HL7 2.5.1 messages or HITSP CCD documents. Digital Imaging and Communications in Medicine (DICOM), DICOM Structured Reporting (SR), and Web Access to DICOM Objects (WADO) are used for medical imaging.

Medical Terminologies


The HITSP C80 Clinical Document and Message Terminology Component specification defines the vocabularies and terminologies used by various HITSP specifications. HITSP C80 recommends Logical Observation Identifiers Names and Codes (LOINC) for laboratory observations, RxNorm for standard names of clinical drugs, Unique Ingredient Identifier (UNII) for allergies, SNOMED CT for Problem Lists, and Unified Code for Units of Measure (UCUM) for units of measure.

Integrating terminologies is a big challenge in Health IT. An example is the mapping from SNOMED CT to ICD-10 or Current Procedural Terminology (CPT). Semantic web technologies such as RDF, OWL 2, SWRL, SPARQL, and SAWSDL have proven useful in semantic mediation across coding systems as well as in building Clinical Decision Support Systems (CDSS).
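
To make that idea of semantic mediation a little more concrete, here is a minimal sketch using the Jena 2.x API (the com.hp.hpl.jena package names; later Apache Jena releases renamed them). It asserts an owl:equivalentClass link between a SNOMED CT concept and an ICD-10 code and then follows the link. The namespace URIs and the codes are purely hypothetical placeholders; a real terminology service would involve curated cross-maps and far more than a single triple.

```java
import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.vocabulary.OWL;

public class CodeSystemMapping {
    // Hypothetical namespace URIs used only for illustration.
    private static final String SNOMED_NS = "http://example.org/snomedct/";
    private static final String ICD10_NS  = "http://example.org/icd10/";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        // Assert that a SNOMED CT concept and an ICD-10 code denote the same class of condition.
        Resource snomedConcept = model.createResource(SNOMED_NS + "22298006"); // placeholder code
        Resource icd10Code     = model.createResource(ICD10_NS + "I21");       // placeholder code
        model.add(snomedConcept, OWL.equivalentClass, icd10Code);

        // A mediation component can later simply follow the equivalence link.
        NodeIterator targets = model.listObjectsOfProperty(snomedConcept, OWL.equivalentClass);
        while (targets.hasNext()) {
            System.out.println("Maps to: " + targets.nextNode());
        }
    }
}
```

The codes would of course come from the licensed terminologies themselves and the mappings from a curated cross-map, but the lookup pattern stays the same.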

Administrative Transactions and E-Prescribing


Relevant standards include X12 4010 and 5010 for administrative transactions, Council for Affordable Quality Healthcare (CAQH) Core for online patient eligibility and benefits inquiries, and NCPDP SCRIPT 10.6 for electronic prescriptions (e-prescribing). Conversion from 4010 to 5010 and ICD-9 to ICD-10 will be a priority on the agenda in the next three years (details on final compliance dates can be found on this HHS web page). XQuery, XQuery Update, and XSLT2 are likely to play an important role in that conversion effort.

An Expression and Query Language for Clinical Decision Support Systems (CDSS)


There is currently no consensus on a single standard for expressing clinical practice guidelines for their automated execution in CDSS. Existing specifications include the Arden Syntax, the Guideline Interchange Format (GLIF), and GELLO (roughly, Guideline Expression Language Object-Oriented). Some wonder if a standard is needed in this space at all. My position is that a standard backed by an open source reference implementation, with support for some publicly available and approved clinical guidelines, could help speed up adoption.

In addition, since EHR data is one of the primary inputs into a CDSS, the CDSS expression language should at least share the same underlying data model with the EHR data input. That model is the HL7 Reference Information Model (RIM). The "virtual medical record" (or vMR) specified by GELLO is defined as a view of the HL7 RIM. It would be a great help if HITSP could step in and harmonize standards in the CDSS space.

Cross-Enterprise Document Sharing


The Integrating the Healthcare Enterprise (IHE) Cross-Enterprise Document Sharing (XDS) Integration Profile specifies a protocol for sharing and accessing EHRs across health enterprises. XDS defines concepts such as an "Affinity Domain" (a group of collaborating health enterprises), the "Document Source", the "Document Repository", and the "Document Registry". It also specifies transactions such as "Provide and Register Document", "Query Document", and "Retrieve Document".

Other relevant IHE Integration Profiles include Cross-Community Access (XCA), Cross-Enterprise Document Reliable Interchange (XDR), and Cross-Enterprise Document Media Interchange (XDM).

Privacy and Security



On the privacy and security front, the following publicly available standards can be used to comply with ARRA and HIPAA requirements (a minimal Java sketch using two of them, AES and SHA, follows the list):

  • Transport Layer Security (TLS)
  • Advanced Encryption Standard (AES)
  • Secure Hash Algorithm (SHA)
  • OASIS eXtensible Access Control Markup Language (XACML)
  • OASIS Security Assertion Markup Language (SAML)
  • OASIS WS-Security
  • OASIS WS-Trust
  • OASIS WS-Policy.
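
As a small illustration of the building blocks behind two of those standards, the sketch below uses the standard javax.crypto and java.security APIs to AES-encrypt a payload and compute its SHA-256 digest. It is only a demonstration of the primitives; a HIPAA-grade design also needs key management, TLS transport, access control, and audit logging as covered by the profiles that follow.

```java
import java.security.MessageDigest;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

public class CryptoSketch {
    public static void main(String[] args) throws Exception {
        byte[] payload = "<ClinicalDocument/>".getBytes("UTF-8");

        // AES encryption of the payload with a freshly generated 128-bit key.
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(128);
        SecretKey key = keyGen.generateKey();
        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] encrypted = cipher.doFinal(payload);

        // SHA-256 digest of the original payload, e.g. for integrity checks in an audit trail.
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        byte[] digest = sha.digest(payload);

        System.out.println("Ciphertext bytes: " + encrypted.length);
        System.out.println("Digest bytes: " + digest.length);
    }
}
```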

The following IHE Profiles provide specific security guidelines for health enterprises and Health Information Exchanges (HIEs):

  • Audit Trail and Node Authentication (ATNA)
  • Consistent Time (CT)
  • Basic Patient Privacy Consents (BPPC)
  • Enterprise User Authentication (EUA)
  • Cross-Enterprise User Assertion (XUA)
  • Patient Demographics Query (PDQ)
  • Patient Identifier Cross-Referencing (PIX)
  • Digital Signatures (DSG).

For the de-identification and re-identification of content which are important for privacy as well as public health research and epidemiology, the following specifications are available:

  • 45 CFR Parts 160 and 164. Standards for Privacy of Individually Identifiable Health Information; Final Rule. August 14, 2002. Section 164.514(a-b) Deidentification of protected health information. (Deidentification)
  • 45 CFR Parts 160 and 164. Standards for Privacy of Individually Identifiable Health Information; Final Rule. August 14, 2002. Section 164.514(c) Reidentification (Pseudonymization)
  • ISO/TS 25237:2008 Health Informatics --Pseudonymisation, Unpublished Technical Specification (Pseudonymization).

Open Source


High costs, total failure, disappointing return on investment (ROI), and adverse impact on end-user productivity are always risk factors in any large-scale software implementation project. HIT projects are no exception. Driving costs out of health care will be a top priority for all stakeholders in 2010, so companies will look for viable alternatives to insane software licensing and maintenance fees.

Open source implementations of some of the HIT standards mentioned above can be leveraged to reduce such risks. In addition to lowering costs and risk, open source can provide the transparency that is needed to ensure that HIT software provides adequate privacy and security protections. Examples of these tools include:

  • The Open eHealth Integration Platform which is based on Apache Camel and implements HIT standards such as HL7 2.x, HL7 CDA/CCD, and IHE XDS.
  • The Open Health Tools Project provides tooling for developing HIT applications.
  • The Connect Gateway allows health enterprises to easily connect their HIT systems to health information exchanges (HIE). It implements key components such as a Master Patient Index (MPI), XDS.b Document Registry and Repository, Authorization Policy Engine, Consumer Preferences Manager, and a HIPAA-compliant Audit Log.
  • The WorldVistA EHR, an open source EHR system based on the VistA software sponsored by the U.S. Department of Veterans Affairs (VA). Work is underway to add CCR and CCD support to WorldVistA.

Training the HIT Workforce


Finally, the potential of HIT will not be fully realized without a trained and competent workforce. All stakeholders in the health care industry should make the necessary investments in educating the future HIT workforce through college and university programs as well as corporate training.

The usability of EHRs in clinical settings also deserves more attention and additional research.

Monday, September 21, 2009

Relational, XML, or RDF?

During the last 15 years, I have had the opportunity to work with different data models and approaches to application development. I started with SGML in the aerospace content management space back in 1995, then saw the potential of XML and fully embraced it in 1998. Since that time, I have been continuously following the evolution of XML related specifications and have been able to leverage the bleeding edge including XForms, XQuery, XSLT2, XProc, ISO Schematron, and even XML Schema 1.1.

However, being a curious person, I decided to explore other approaches to data management and application development. I worked on systems using a relational database backend and application development frameworks like Spring, Hibernate, and JSF. I've been involved in SOAP-based web services projects where XML data (constrained and validated by an XML Schema) was unmarshalled into Java objects, and then persisted into a relational table with an Object-Relational Mapping (ORM) solution such as Hibernate.

I also had the opportunity to work with the Java Content Repository (JCR) model in magazine content publishing, and the Entity-Attribute-Value with Classes and Relationships (EAV/CR) model in the context of medical informatics. EAV/CR is suited for domains where entities can have thousands of frequently changing parameters.

Lately, I have been working on Semantic Web technologies including the RDF data model, OWL (the Web Ontology Language), and the SPARQL query language for RDF.

Clients often ask me which of these approaches is the best or which is the most appropriate for their project. Here is what I think:

  • Different approaches should be part of the software architect's toolkit (not just one).
  • To become more productive in an agile environment, every developer should become a "generalizing specialist".
  • The software architect or developer should be open minded (no "not invented here syndrome" or "what's wrong with what we're doing now" attitude).
  • The software architect or developer should be willing to learn new technologies outside of their comfort zone and IT leadership should encourage and reward that learning.
  • Learning new technologies sometimes requires a new way of thinking about the problems at hand and "unlearning" old knowledge.
  • It is important not to have a purist or religious approach to selecting any particular approach, since each has its own merits.
  • Ultimately, the overall context of the project will dictate your choice. This includes but is not limited to: skill set, learning curve, application performance, cost, and time to market.

Based on my personal experience, here is what I have learned:

The XPath 2.0 and XQuery 1.0 Data Model (XDM)


The roots of SGML and XML are in content management applications in domains such as law, aerospace, defense, scientific, technical, medical, and scholarly publishing. The XPath 2.0 and XQuery Data Model (XDM) is particularly well suited for companies selling information products directly as a source of revenues (e.g. non-ad based publishers).

XSLT2 facilitates media-independent publishing (single sourcing) to multiple devices and platforms. XSLT2 is also a very powerful XML transformation language that allows these publishers to perform the series of complex transformations that are often required as the content is extracted from various data sources and assembled into a final information product.

With XQuery, sophisticated contextualized database-like queries can be performed. In addition, XQuery allows the dynamic assembly of content where new compound documents can be constructed on the fly from multiple documents and external data sources. This allows publishers to repurpose content into new information products as needed to satisfy new customer demands and market opportunities.

XInclude enables the chunking of content into reusable pieces. XProc is an XML pipeline language that essentially allows you to automate a complex publishing workflow which typically includes many steps such as content assembly, validation, transformation, and query.

The second category of application for which XML is a strong candidate is what is sometimes referred to as an "XML Workflow" application. The typical design pattern here is XRX (XForms, REST, and XQuery), where user inputs are captured with an XForms front end (itself potentially auto-generated from an XML schema) and data is RESTfully submitted to a native XML database, then queried and manipulated with XQuery. The advantages of this approach are:

  • It is more resilient to schema changes. In fact the data can be stored without a schema.
  • It does not require handling the impedance mismatch between XML documents, Java objects, and relational tables which can introduce design complexity, performance, and maintainability issues even when using code generation.

A typical example of an "XML Workflow" application would be a Human Resources (HR) form-based application that allows employees to fill in and submit a form and also provides reporting capabilities.

The third and last category of application is Web Services (RESTful or SOAP-based) that consume XML data from various sources, store the data natively in an XML database (bypassing the XML databinding and ORM layers altogether), and perform all processing and queries on the data using a pure XML approach based on XSLT2 and XQuery. An example is a dashboard or mashup application that stores all of the submitted data in a native XML database. In this scenario, the data can be cached for faster response to web services requests. Again, the benefits listed for "XML Workflow" applications apply here as well.
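
To make the XQuery side of these scenarios concrete, here is a minimal sketch that runs a query with Saxon's s9api interface. The document name, element names, and the query itself are hypothetical; in an XRX or native-XML-database setup the same kind of query would normally run inside the database (eXist, MarkLogic, etc.) rather than in application code.

```java
import java.io.File;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.*;

public class ExpenseQuery {
    public static void main(String[] args) throws SaxonApiException {
        Processor processor = new Processor(false); // false = non-schema-aware edition

        // Parse the input document (hypothetical expenses.xml).
        DocumentBuilder builder = processor.newDocumentBuilder();
        XdmNode doc = builder.build(new StreamSource(new File("expenses.xml")));

        // Compile and run a FLWOR expression against it.
        XQueryCompiler compiler = processor.newXQueryCompiler();
        XQueryExecutable executable = compiler.compile(
            "for $e in //expense[amount > 100] order by $e/amount descending return $e/description");
        XQueryEvaluator evaluator = executable.load();
        evaluator.setContextItem(doc);

        XdmValue result = evaluator.evaluate();
        for (XdmItem item : result) {
            System.out.println(item.getStringValue());
        }
    }
}
```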


The Relational Model

The relational model is well established and well understood. It is usually an option for data-oriented and enterprise-centric applications that are based on a closed world assumption. In such a scenario, there is usually no need for handling the data in XML and a conventional approach based on JSF, Spring, Hibernate and a relational database backend is enough.
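
For readers less familiar with that stack, the sketch below shows the kind of JPA-annotated entity that Hibernate (or any JPA provider) maps to a relational table. The entity name, columns, and table are made up for illustration.

```java
import javax.persistence.*;

@Entity
@Table(name = "EMPLOYEE")
public class Employee {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Long id;

    @Column(nullable = false, length = 100)
    private String name;

    @Column(name = "HIRE_DATE")
    @Temporal(TemporalType.DATE)
    private java.util.Date hireDate;

    // JPA requires a no-argument constructor.
    protected Employee() {}

    public Employee(String name, java.util.Date hireDate) {
        this.name = name;
        this.hireDate = hireDate;
    }

    public Long getId() { return id; }
    public String getName() { return name; }
    public java.util.Date getHireDate() { return hireDate; }
}
```

An EntityManager (or a Hibernate Session) then handles persistence, e.g. em.persist(new Employee(...)) and em.find(Employee.class, id), with Spring or Seam typically managing the EntityManager and transactions.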

Newer Java EE frameworks like JBoss Seam and its seam-gen code generation tools are particularly well-suited for this kind of task. There is no running away from XML however, since these frameworks use XML for their configuration files. Unfortunately, there is currently a movement away from XML configuration files toward Java annotations due to some developers complaining about "XML Hell".

The relational model supports transactions and is scalable, although a new movement called NoSQL is starting to challenge that last assumption. An article entitled "Is the Relational Database Doomed?" on readwriteweb.com describes this emerging trend.


The RDF Data Model

Semantic Web technologies like RDF (an incarnation of the EAV/CR model mentioned above), OWL, SKOS, SWRL, and SPARQL, together with Linked Data publishing principles, have received a lot of attention lately. They are well suited for the following applications:

  • Applications that need to infer new implicit facts based on existing explicit facts. Such inferences can be driven by an ontology expressed in OWL or a set of rules expressed in a rule language such as SWRL (a minimal Jena sketch follows this list).
  • Applications that need to map concepts across domains such as a trading network where partners use different e-commerce XML vocabularies.
  • Master Data Management (MDM) applications that provide an RDF view and reasoning capabilities in order to facilitate and enhance the process of defining, managing, and querying an organization's core business entities. A paper on IBM's Semantic Master Data Management (SMDM) project is available here.
  • Applications that use a taxonomy, a thesaurus, or a similar concept scheme, such as online news archives and medical terminologies. SKOS, recently approved as a W3C Recommendation, was designed for that purpose.
  • Silo-busting applications that need to link data items to other data items on the web, in order to perform entity correlation or allow users to explore a topic further. The Linked Data design pattern is based on an open world assumption, uses dereferenceable HTTP URIs for identifying and accessing data items, RDF for describing metadata about those items, and semantic links to describe the relationships between those items. An example is an Open Government application that correlates campaign contributions, voting records, census, and location data. Another example is a semantic social application that combines an individual's profiles and social networks from multiple sites in order to support data portability and fully explore the individual's social graph.
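
Here is the minimal Jena sketch promised in the first bullet: an RDFS reasoner derives an implicit rdf:type fact from an explicit subclass assertion. The vocabulary URIs are hypothetical, and the example uses the Jena 2.x package names; with an OWL ontology or SWRL rules the principle is the same, only the reasoner and the expressivity change.

```java
import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.vocabulary.RDF;
import com.hp.hpl.jena.vocabulary.RDFS;

public class InferenceSketch {
    public static void main(String[] args) {
        String ns = "http://example.org/vocab#"; // hypothetical vocabulary

        // Explicit facts: MyocardialInfarction is a kind of HeartDisease,
        // and condition-123 is typed as a MyocardialInfarction.
        Model schema = ModelFactory.createDefaultModel();
        Resource heartDisease = schema.createResource(ns + "HeartDisease");
        Resource mi = schema.createResource(ns + "MyocardialInfarction");
        schema.add(mi, RDFS.subClassOf, heartDisease);

        Model data = ModelFactory.createDefaultModel();
        Resource condition = data.createResource(ns + "condition-123");
        data.add(condition, RDF.type, mi);

        // The RDFS reasoner infers the implicit fact: condition-123 is also a HeartDisease.
        InfModel inferred = ModelFactory.createRDFSModel(schema, data);
        System.out.println(inferred.contains(condition, RDF.type, heartDisease)); // prints true
    }
}
```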


Of course, these different approaches are not mutually exclusive. For example, it is possible to provide an RDF view or layer on top of existing XML and relational database applications.

Tuesday, August 11, 2009

Adding Semantics to SOA

What can Semantic Web technologies such as RDF, OWL, SKOS, SWRL, and SPARQL bring to Web Services? One of the most difficult challenges of SOA is data model transformation. This problem occurs when services don't share a canonical XML schema. XML transformation languages such as XSLT and XQuery are typically used for data mediation in such circumstances.

While it is relatively easy to write these mappings, the real difficulty lies in mapping concepts across domains. This is particularly important in B2B scenarios involving multiple trading partners. In addition to proprietary data models, it is not uncommon to have multiple competing XML standards in the same vertical. In general, these data interoperability issues can be syntactic, structural, or semantic in nature. Many SOA projects can trace their failure to those data integration issues.

This is where semantic web technologies can add significant value to SOA. Semantic Annotations for WSDL and XML Schema (SAWSDL) is a W3C Recommendation that defines the following extension attributes that can be added to WSDL and XML Schema components:

  • The modelReference extension attribute associates a WSDL or XML Schema component with a concept in a semantic model such as OWL. The semantic representation is not restricted to OWL (for example, it could be a SKOS concept). The modelReference extension attribute is used to annotate XML Schema type definitions, element and attribute declarations, as well as WSDL interfaces, operations, and faults.
  • The liftingSchemaMapping and loweringSchemaMapping extension attributes typically point to an XSLT or XQuery mapping file for transforming between XML instances and ontology instances.

A typical example of how SAWSDL might be used is in an electronic commerce network where trading partners use various standards such as EDI, UBL, ebXML, and RosettaNet. In this case, the modelReference extension attribute can be used to map a WSDL or XML Schema component to a concept in a common foundational ontology such as one based on the Suggested Upper Merged Ontology (SUMO). In addition, lifting and lowering XSLT transforms are attached to XML Schema components in the SAWSDL with liftingSchemaMapping and loweringSchemaMapping extension attributes respectively. Note that any number of those transforms can be associated with a given XML schema component.
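
As a small illustration of how a mediation component might pick up those annotations, the sketch below uses plain JAXP/DOM to scan a schema or WSDL file for sawsdl:modelReference attributes and print the referenced concepts. The file name is hypothetical, and a real implementation would more likely use a dedicated API such as SAWSDL4J.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

public class ModelReferenceScanner {
    private static final String SAWSDL_NS = "http://www.w3.org/ns/sawsdl";

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new File("purchase-order.wsdl"));

        // Walk every element and print the ontology concept it is annotated with, if any.
        NodeList elements = doc.getElementsByTagNameNS("*", "*");
        for (int i = 0; i < elements.getLength(); i++) {
            Element element = (Element) elements.item(i);
            String modelReference = element.getAttributeNS(SAWSDL_NS, "modelReference");
            if (modelReference != null && modelReference.length() > 0) {
                System.out.println(element.getLocalName() + " -> " + modelReference);
            }
        }
    }
}
```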

Traditionally, when dealing with multiple services (often across organizational boundaries), an Enterprise Services Bus (ESB) provides mediation services such as business process orchestration, business rules processing, data format and data model transformation, message routing, and protocol bridging. Semantic mediation services can be added as a new type of ESB service. The SAWSDL4J API defines an object model that allows SOA developers to access and manipulate SAWSDL annotations.

Ontologies have been developed for some existing e-commerce standards such as EDI X12, RosettaNet, and ebXML. When required, ontology alignment can be achieved with OWL constructs such as subClassOf, equivalentClass, and equivalentProperty.

Semantic annotations provided by SAWSDL can also be leveraged in orchestrating business processes using the business process execution language (BPEL). To facilitate service discovery in SOA Registries and Repositories, interface definitions in WSDL documents can be associated with a service taxonomy defined in SKOS. In addition, once an XML message is lifted to an ontology instance, the data in the message becomes available to Semantic Web tools like OWL and SWRL reasoners and SPARQL query engines.

Sunday, July 26, 2009

From Web 2.0 to the Semantic Web: Bridging the Gap in Newsmedia

In this presentation, I explain the Semantic Web value proposition for the newsmedia industry and propose some concrete steps to bridge the gap.

Welcome to the world of news in the Web 3.0 era.

Wednesday, July 8, 2009

Semantic Social Computing

The Web 2.0 revolution has produced an explosion in social data that is fundamentally transforming business, politics, culture, and society in general. Using tools such as wikis, blogs, online forums, and social networking sites, users can now express their point of view, build relationships, and exchange ideas and multimedia content. Combined with portable electronic devices such as cameras and cell phones, these tools are enabling the citizen journalist who can report facts and events faster than traditional media outlets and government agencies.

One of the challenges posed by this explosion in social data is data portability between social networking sites. But the next biggest challenge will be the ability to harvest all that social data in order to extract actionable intelligence (e.g. a better understanding of consumer behavior or the events unfolding at a particular location). In addition, in a world where security has become the number one priority, various sensors from traffic cameras to satellite sensors are also collecting huge amounts of data. The integration of sensor data and social data offers new possibilities.

Those are the types of integration challenges that Semantic Web technologies are designed to solve. The SIOC (Semantically Interlinked Online Communities) Core ontology describes the structure and content of online community sites. A comprehensive list of SIOC tools is available at the SIOC Applications page. Using these tools, developers can export SIOC compliant RDF data from various data sources such as blogs, wikis, online forums, and social networking sites such as Twitter and Flickr. Once exported, the SIOC data can be crawled, aggregated, stored, indexed, browsed, and queried (using SPARQL) to answer interesting questions. Natural Language Processing (NLP) techniques can be used to facilitate entity extraction from user generated content.
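
As a minimal sketch of that last querying step, the code below uses Jena ARQ to ask an aggregated SIOC store for posts and their creators. The data file name is hypothetical, the Jena 2.x package names are assumed, and the two terms used (sioc:Post, sioc:has_creator) come from the SIOC Core ontology.

```java
import com.hp.hpl.jena.query.*;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.util.FileManager;

public class SiocQuery {
    public static void main(String[] args) {
        // Hypothetical dump of aggregated SIOC data.
        Model model = FileManager.get().loadModel("sioc-dump.rdf");

        String sparql =
            "PREFIX sioc: <http://rdfs.org/sioc/ns#> " +
            "SELECT ?post ?creator " +
            "WHERE { ?post a sioc:Post ; sioc:has_creator ?creator }";

        QueryExecution execution = QueryExecutionFactory.create(QueryFactory.create(sparql), model);
        try {
            ResultSet results = execution.execSelect();
            while (results.hasNext()) {
                QuerySolution solution = results.nextSolution();
                System.out.println(solution.get("post") + " by " + solution.get("creator"));
            }
        } finally {
            execution.close();
        }
    }
}
```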

SIOC leverages the FOAF ontology to describe the social graph on social networking sites. For example, this can offer deeper insights for marketers into how social recommendations affect consumer behavior.

One unique capability offered by Semantic Web technologies is the ability to infer new facts (inference) from explicit facts based on the use of an ontology (RDFS or OWL) or a set of rules expressed in a rule language such as the Semantic Web Rule Language (SWRL). Using constructs such as owl:sameAs or rdfs:seeAlso, it becomes easy to express the fact that two or more different web pages relate to the same resource (e.g. different profile pages of the same person on different social networking sites). Linked Data principles can help in linking social data, thereby building bridges between the data islands that today's social networking sites represent.

SIOC compliant social data can be meshed up with other data sources such as sensor data to reveal very useful information about events related to logistics, public safety, or political unrest at a particular location for example. With the advent of GPS-enabled cameras and cell phones, temporal and spatial context can be added to better describe those events. The W3C Time OWL Ontology (OWL-Time) and the Basic Geo Vocabulary have been developed for that purpose.

Thursday, June 11, 2009

Publishing Government Data to the Linked Open Data (LOD) Cloud

In a previous post, I outlined a roadmap for migrating news content to the Semantic Web and the Linked Open Data (LOD) cloud. The BBC has been doing some interesting work in that space by using Linked Data principles to connect BBC Programmes and BBC Music to MusicBrainz and DBpedia. SPARQL endpoints are now available for querying the BBC datasets.

It is clear that Europe is ahead of the US and Canada in terms of Semantic Web research and adoption. The Europeans are likely to further extend their lead with the announcement this week that Tim Berners-Lee (the visionary behind the World Wide Web, the Semantic Web, and the Linked Open Data movement) will be advising the UK Government on making government data more open and accessible.



In the US, data.gov is part of the Open Government Initiative of the Obama Administration. The following is an excerpt from data.gov:

A primary goal of Data.gov is to improve access to Federal data and expand creative use of those data beyond the walls of government by encouraging innovative ideas (e.g., web applications). Data.gov strives to make government more transparent and is committed to creating an unprecedented level of openness in Government. The openness derived from Data.gov will strengthen our Nation's democracy and promote efficiency and effectiveness in Government.

Governments around the world have taken notice and are now considering similar initiatives. It is clear that these initiatives are important for the proper functioning of democracy since they allow citizens to make informed decisions based on facts as opposed to the politicized views of special interests, lobbyists, and their spin doctors. These facts are related to important subjects such as health care, the environment, the criminal justice system, and education. There is an ongoing debate in the tech community about the best approach for publishing these datasets. There are several government data standards available such as the National Information Exchange Model (NIEM). In the Web 2.0 world, RESTful APIs with ATOM, XML, and JSON representation formats have become the norm.

I believe however that Semantic Web technologies and Linked Data principles offer unique capabilities in terms of bridging data silos, queries, reasoning, and visualization of the data. Again, the methodology for adopting Semantic Web technologies is the same:

  1. Create an OWL ontology that is flexible enough to support the different types of data in the dataset including statistical data. This is certainly the most important and challenging part of the effort.
  2. Convert the data from its source format to RDF. For example, XSLT2 can be used to convert from CSV or TSV to RDF/XML and XHTML+RDFa (a minimal Jena sketch follows this list). There are also RDFizers such as D2R for relational data sources.
  3. Link the data to other data sources such as Geonames, Federal Election Commission (FEC), and US Census datasets.
  4. Provide an RDF dump for Semantic Web Crawlers and a SPARQL endpoint for querying the datasets.
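
Here is the minimal Jena sketch referred to in step 2: it turns rows of a hypothetical CSV file into RDF resources and writes them out as RDF/XML. The column layout, the vocabulary namespace, and the property name are all made up for illustration; XSLT2 or a tool like D2R would be used for anything beyond toy data.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import com.hp.hpl.jena.rdf.model.*;

public class CsvToRdf {
    public static void main(String[] args) throws Exception {
        String ns = "http://example.gov/vocab#";     // hypothetical vocabulary
        String base = "http://example.gov/airport/"; // hypothetical resource base URI

        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("ex", ns);
        Property birdStrikes = model.createProperty(ns, "reportedBirdStrikes");

        // Assume rows of the form: airportCode,strikeCount
        BufferedReader reader = new BufferedReader(new FileReader("bird-strikes.csv"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split(",");
            Resource airport = model.createResource(base + fields[0].trim());
            airport.addProperty(birdStrikes, fields[1].trim());
        }
        reader.close();

        model.write(new FileWriter("bird-strikes.rdf"), "RDF/XML");
    }
}
```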

The following are some of the benefits of this approach:

  • It allows users to ask sophisticated questions against the datasets using the SPARQL query language. These are the kind of questions that a journalist, a researcher, or a concerned citizen will have in mind. For example, which airport has the highest number of reported aircraft bird strikes? (read more here about why Transportation Secretary Ray LaHood rejected a proposal by the FAA to keep bird strikes data secret). Currently data.gov provides only full-text and category-based search.
  • It bridges data silos by allowing users to make queries and connect data in meaningful ways across datasets. For example, a query that correlates health care, environment, and census data.
  • It provides powerful visualizations of the data through Semantic Web meshups.

Tuesday, June 2, 2009

S1000D and SCORM Integration

This is a presentation I gave at the DocTrain Boston 07 conference on how to reduce product lifecycle costs by integrating the S1000D and SCORM specifications.

S1000D is the International Specification for Technical Publications utilizing a Common Source Database (CSDB). Based on open XML standards, the latest issue (4.0) has been developed by the AeroSpace and Defence Industries Association of Europe (ASD), the Aerospace Industries Association of America (AIA), and the Air Transport Association of America (ATA).

Sharable Content Object Reference Model (SCORM) is a specification for online learning content developed by the Advanced Distributed Learning (ADL) Initiative.

The presentation has been updated to reflect the addition of SCORM support in S1000D 4.0.

Tuesday, May 19, 2009

Why XProc Rocks

These are exciting times to be a technologist in the news publishing business. The industry is going through fundamental changes driven by the current economic downturn. Innovation has become an imperative. Examples of recent innovations include news content APIs (newspaper as a Platform), specialized desktop news readers, the ability to publish to an increasing number of mobile devices (Kindle, iPhone, etc.), and migrating news content to the Semantic Web and Linked Open Data (LOD) cloud.

The result is that news publishing workflows are getting more complex. The following is an example of what can be expected from such a workflow:

  1. Retrieve content from a native XML database such as eXist using XQuery and from a MySQL database, and combine the results as an XML document.
  2. Expand XIncludes and other content references in the XML.
  3. Apply various processing to the XML depending on the content type.
  4. Make a REST call to data.gov and recovery.gov to retrieve some government published data in XML for a graphical mashup visualization of the data.
  5. Transform the XML into XHTML+RDFa for consumption by Yahoo's SearchMonkey and the recently unveiled Google Rich Snippets.
  6. Transform the XML into RDF/XML and validate the result using the Jena command line ARP RDF parser. The RDF/XML output will provide an RDF dump for RDF crawlers and will be searched via a SPARQL endpoint.
  7. Transform the XML into an Amazon Kindle-friendly format such as Text.
  8. Publish the content as a PDF document with a print layout (e.g. header, footer, multi-column, pagination, etc.).
  9. Transform the XML into a NewsML document and validate the result against the NewsML XML Schema and an ISO Schematron schema.
  10. Generate an Atom feed containing NewsML elements and validate the result using NVDL (Namespace-based Validation Dispatching Language). NVDL is required here because the Atom feed will contain nodes that will be validated against the Atom RelaxNG schema as well as nodes that will be validated against the NewsML XML schema.

As you can see, this publishing workflow is XML document-centric. If you are a Java developer, your first instinct might be to automate all those processing steps with an Ant build file. There is now a better alternative and it is called XProc (an XML Pipeline Language), currently a W3C Candidate Recommendation. Here are some reasons why XProc is a superior alternative when dealing with an XML document-centric publishing workflow:

  • With XProc, XML documents are first-class citizens. Java objects are first-class citizens in Ant. In fact, using Ant to automate a complex XML document-centric publishing workflow can quickly lead to spaghetti code (a hand-rolled JAXP sketch of just two of the steps above follows this list).
  • XProc allows you to add, delete, rename, replace, filter, split, wrap, unwrap, compare, insert, and conditionally process nodes in XML documents by addressing these nodes using XPath.
  • XProc comes with built-in steps for XML processing such as p:xquery, p:xinclude, p:xslt, p:validate-with-xml-schema, p:validate-with-relax-ng, p:validate-with-schematron, p:xsl-formatter, and p:http-request for making REST calls.
  • XProc is declarative, and programming language and platform neutral. For example, work is underway on an XQuery-based XProc implementation for the eXist database. Note that step 1 above can be implemented with the SQL extension function provided by eXist for retrieving data from relational databases using JDBC.
  • XProc is extensible, so you can add custom steps. For example the XML Calabash XProc implementation provides extension steps such as cx:nvdl and cx:unzip.
  • XProc provides exception handlers which are essential for any complex workflow.
  • XProc pipelines are amenable to streaming.
  • XProc can simplify the design, documentation, and maintenance of very complex publishing workflows. Since XProc is in XML format, one can envision a visual designer for building XProc pipelines or a tool (such as Graphviz) to visualize XProc documents in the form of processing diagrams.
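
To make the first bullet concrete, here is roughly what just two of the workflow steps above (step 2's XInclude expansion and step 9's transform-then-validate, XSD only) look like when hand-coded against the standard JAXP APIs. The file names are hypothetical. An XProc pipeline expresses the same chain declaratively in a few p:* steps, which is exactly the point.

```java
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;

public class HandRolledPipeline {
    public static void main(String[] args) throws Exception {
        // Step 2: parse the article with XInclude expansion enabled.
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        dbf.setXIncludeAware(true);
        Document article = dbf.newDocumentBuilder().parse(new File("article.xml"));

        // Step 9 (first half): transform the article into NewsML.
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer toNewsML = tf.newTransformer(new StreamSource(new File("article-to-newsml.xsl")));
        DOMResult newsml = new DOMResult();
        toNewsML.transform(new DOMSource(article), newsml);

        // Step 9 (second half): validate the result against the NewsML XML Schema
        // (the Schematron check is omitted here).
        SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Validator validator = sf.newSchema(new StreamSource(new File("newsml.xsd"))).newValidator();
        validator.validate(new DOMSource((Document) newsml.getNode()));

        System.out.println("Two steps done; now imagine wiring all ten together in Ant.");
    }
}
```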

There are other pipeline technologies that can be useful as well, particularly if you're building mashup applications. Yahoo Pipes is a good choice for combining feeds from various sources, manipulating them as needed, and outputting RSS, JSON, KML, and other formats. And if you are into Semantic Web meshups, DERI Pipes provides support for RDF, OWL, and SPARQL queries. DERI Pipes works as a command line tool and can be hooked into an XProc pipeline with the XProc built-in p:exec step.

Sunday, April 26, 2009

Thoughts on SOAP vs. REST

REST is now an increasingly popular architectural style for building web services. The question for developers is: should REST always be the preferred mechanism for building web services or is SOAP still relevant for certain use cases?

In my opinion, REST is usually a no-brainer when you are exposing a public API over the internet and all you need is basic CRUD operations on your data. However, when designing a truly RESTful web services interface (as opposed to some HTTP API), care must be taken to adhere to key principles:

  • Everything is a URI addressable resource
  • Representations (media types such as XHTML, JSON, RDF, and Atom) describe resources and use links to describe the relationships between those resources
  • These links drive changes in application state (hence Representational State Transfer or REST)
  • The only type that is significant for clients is the representation media type
  • URI templates as opposed to fixed or hard coded resource names
  • Generic HTTP methods (no RPC-style overloaded POST)
  • Statelessness (the server keeps no state information)
  • Cacheability.

Adherence to these principles is what enables massive scalability. One good place to start is the AtomPub protocol which embodies these principles. In the Java space, the recently approved Java API for RESTful Web Services (JAX-RS) specification greatly simplifies REST development with simple annotated POJOs.
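
To give a flavor of those annotated POJOs, here is a minimal, hypothetical JAX-RS resource exposing read access to articles. The paths, the Article class, and the in-memory store are made up for the example; a real service would delegate to a repository and add security.

```java
import java.util.HashMap;
import java.util.Map;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.WebApplicationException;
import javax.xml.bind.annotation.XmlRootElement;

@Path("/articles")
public class ArticleResource {

    // A trivial in-memory store standing in for a real repository.
    private static final Map<String, Article> STORE = new HashMap<String, Article>();
    static {
        STORE.put("1", new Article("1", "Stimulus spending tracker launched"));
    }

    @GET
    @Path("{id}")
    @Produces("application/xml")
    public Article getArticle(@PathParam("id") String id) {
        Article article = STORE.get(id);
        if (article == null) {
            throw new WebApplicationException(404); // mapped to HTTP 404 Not Found
        }
        return article; // JAXB serializes it to XML
    }

    @XmlRootElement
    public static class Article {
        public String id;
        public String title;
        public Article() {} // JAXB needs a no-arg constructor
        public Article(String id, String title) { this.id = id; this.title = title; }
    }
}
```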

Within the enterprise and in B2B scenarios, SOAP (and its WS-* family of specifications) is still very attractive. This is not to say that REST is not enterprise ready. In fact, there are known successful RESTful implementations in mission critical applications such as banking. However, enterprise applications can have specific requirements in the areas of security, reliable messaging, business process execution, and transactions for which SOAP, the WS-* specifications, and supporting tools provide solutions.

These specifications include:

  • WS-Addressing
  • WS-Policy
  • WS-ReliableMessaging
  • WS-SecureConversation
  • WS-Security
  • WS-SecurityPolicy
  • WS-Trust
  • WS-AtomicTransaction
  • WS-BPEL (Business Process Execution Language)

RESTafarians will tell you that REST can handle these requirements as well. For example, RESTful transactions can be implemented by treating the transactions themselves as URI addressable REST resources. This approach can work, but is certainly not trivial to implement. In fact, it is often difficult to support some of these requirements without resorting to overloaded POST, which works more like SOAP and is a clear departure from a pure REST architectural style.

One characteristic of enterprise SOA is the need to expose pieces of application logic (as opposed to data) as web services and this can be more amenable to a SOAP-based approach. Existing SOAP web services toolkits such as Apache CXF provide support for WS-* specifications. More importantly, they greatly simplify the development process by providing various tools such as the ability to create new services with a contract-first approach where JAX-WS annotated services and server stubs can be automatically generated from an existing WSDL.

Furthermore, during the last ten years, organizations have made significant investments in SOAP-based infrastructure such as Enterprise Service Buses (ESBs) and Business Process Management (BPM) software based on WS-BPEL. The Content Management Interoperability Services (CMIS) specification which is currently being developed by OASIS specifies protocol bindings for both SOAP and AtomPub. The SOAP binding will allow organizations to leverage those investments in building interoperable content repositories.

Architecting an SOA solution is a balancing act. It's important not to dismiss any particular approach too soon. Both SOAP and REST should carefully be considered for new web services projects.

Thursday, March 26, 2009

News Content APIs: Uniform Access or Content Silos

The first generation of online news content applications was designed for consumption by humans. With the massive amounts of online content now available, machine-processable structured data will be the key to findability and relevance. Major news organizations like the New York Times (NYT), National Public Radio (NPR), and The Guardian have recently opened up their content repositories through APIs.

These APIs have generated a lot of excitement in the content developer community and are certainly a significant step forward in the evolution of how news content is processed and consumed on the web. The APIs allow developers to create interesting new mashup applications. An example of such a mashup is a map of the United States showing how the stimulus money is being spent by municipalities across the country, with hotspots to local newspaper articles about corruption investigations related to the spending. The stimulus spending data will be provided by the Stimulus Feed on the recovery.gov site as specified by the "Initial Implementation Guidance for the American Recovery and Reinvestment Act" document. This is certainly an example of a mashup that US taxpayers will like.

For news organizations, these APIs represent an opportunity to grow their ad network by pushing their content to more sites on the web. That's the idea behind the recent release of The Guardian Open Platform API.

APIs and Content Silos

The emerging news content APIs typically offer a REST or SOAP web services interface and return content in XML, JSON, or ATOM feeds. However, despite the excitement that they generate, these APIs can quickly turn into content silos for the following reasons:

  • The structure of the content is often based on a proprietary schema. This introduces several potential interoperability issues for API users in terms of content structure, content types, and semantics.
  • It is not trivial to link content across APIs.
  • Each API provides its own query syntax. There is a need for universal data browsers and a query language to read, navigate, crawl, and query structured content from different sources.

XML, XSD, XSLT, and XQuery

Migrating content from HTML to XML (so called document-oriented XML) has many benefits. XSLT enables media-independent publishing (single sourcing) to multiple devices such as Amazon's Kindle e-reader and "smart phones". With XQuery, sophisticated contextualized database-like queries can be performed, turning the content itself into a database. In addition, XQuery allows the dynamic assembly of content where new compound documents can be constructed on the fly from multiple documents and external data sources. This allows publishers to repurpose content into new information products as needed to satisfy new customer demands and market opportunities.

However, XSD, XSLT, and XQuery operate at the syntax level. The next level up in the content technology stack is semantics and reasoning, and that's where RDF, OWL, and SPARQL come into play. To illustrate the issue, consider three news organizations, each with its own XML Schema for describing news articles. To describe the author of an article, the first news organization uses the <creator> element, the second the <byline> element, and the third the <author> element. All three distinct element names have exactly the same meaning. Using an OWL ontology, we can establish that these three terms are equivalent.
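
A minimal sketch of that equivalence using Jena's ontology API (Jena 2.x package names) might look like the following; the namespace URIs are hypothetical stand-ins for the three organizations' vocabularies.

```java
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntProperty;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class AuthorPropertyAlignment {
    public static void main(String[] args) {
        OntModel ontology = ModelFactory.createOntologyModel();

        // Three vocabularies, three names for the same thing.
        OntProperty creator = ontology.createOntProperty("http://news-a.example.org/ns#creator");
        OntProperty byline  = ontology.createOntProperty("http://news-b.example.org/ns#byline");
        OntProperty author  = ontology.createOntProperty("http://news-c.example.org/ns#author");

        // State that they mean the same thing.
        creator.addEquivalentProperty(byline);
        creator.addEquivalentProperty(author);

        // List what creator has been declared equivalent to.
        System.out.println(creator.listEquivalentProperties().toList());
    }
}
```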

Semantic Web and Linked Data to the Rescue

Semantic web technologies such as RDF, OWL, and SPARQL can help us close the semantic gap and also open up new opportunities for publishers. Furthermore, with the decline in ad revenues, news organizations are now considering charging users for accessing content online. Semantic web technologies can enrich content by providing new ways to discover and explore content based on user context and interests. An interesting example is a mashup application built by Zemanta called Guardian topic researchr, which extracts entities (people, places, organizations, etc.) from The Guardian Open Platform API query results and allows readers to explore these entities further. In addition, the recently unveiled Newssift site by the Financial Times is an indication that the industry is starting to pay attention to the benefits of "semantic search" as opposed to keyword search.

The rest of this post outlines some practical steps for migrating news content to the Semantic Web. For existing news content APIs, an interim solution is to create Semantic Web wrappers around these APIs (more on that later). The long term objective however should be to fully embrace the Semantic Web and adopt Linked Data principles in publishing news content.

Adopt the International Press Telecommunication Council (IPTC) News Architecture (NAR)

The main reason for adopting the NAR is interoperability at the content structure, content types, and semantic levels. Imagine a mashup developer trying to integrate news content from three different news organizations. In addition to using three different element names (<creator>, <byline>, and <author>) to describe the same concept, these three organizations use completely different XML Schemas to describe the structure and types of their respective news content. That can lead to a data mapping nightmare for the mashup developer and the problem will only get worse as the number of news sources increases.

The NAR content model defines four high level elements: newsItem, packageItem, conceptItem, and knowledgeItem. You don't have to manage your content internally using the XML structure defined by the NAR. However, you should be able to map and export your content to the NAR as a delivery format. If you have fields in your content repository that do not map to the NAR structure, then you should extend the standard NAR XML Schema using the appropriate XML Schema extension mechanism that allows you to clearly identify your extension elements in your own XML namespace. Provide a mechanism such as dereferenceable URIs to allow users to obtain the meaning of these extension elements.

The same logic applies to the news taxonomy that you use. Adopting the IPTC News Codes, which specify some 1,300 terms used for categorizing news content, will greatly facilitate interoperability as well.

Adopt or Create a News Ontology

Several news ontologies in RDFS or OWL format are now available. The IPTC is in the process of creating an IPTC news ontology in OWL format. To facilitate semantic interoperability, news organizations should use this ontology when it becomes available. In mapping XML Schemas into OWL, ontology best practices should be followed. For example, if mapped automatically, container elements in the XML Schema could generate blank nodes in the RDF graph. However, blank nodes cannot be used for external RDF links and are not recommended for Linked Data applications. Also, RDF reification, RDF containers, and RDF collections are not SPARQL-friendly and should be avoided.

While creating the news ontology, you should reuse or link to other existing ontologies such as FOAF and Dublin Core using OWL elements like owl:equivalentProperty, owl:equivalentClass, rdfs:subClassOf, or rdfs:subPropertyOf.

Similarly, existing taxonomies should be mapped to an RDF-compatible format using the SKOS specification. This makes it possible to use an owl:Restriction to constrain the value of a property in the OWL ontology to be a skos:Concept or skos:ConceptScheme.

Generate RDF Data

Assign a dereferenceable HTTP URI to each news item and use content negotiation to provide both an XHTML and an RDF/XML representation of the resource. When the resource is requested, an HTTP 303 See Other redirect is used to serve XHTML or RDF/XML depending on whether the client's Accept header is text/html or application/rdf+xml. The W3C Best Practice Recipes for Publishing RDF Vocabularies explains how dereferenceable URIs and content negotiation work in the Semantic Web.
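
A minimal, hypothetical JAX-RS sketch of that redirect logic is shown below. The path layout (/resource, /page, /data) mirrors the common Linked Data convention, and a full implementation would parse the Accept header properly (including quality values) rather than just searching it for a media type.

```java
import java.net.URI;
import javax.ws.rs.GET;
import javax.ws.rs.HeaderParam;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.Response;

@Path("/resource")
public class NewsItemResource {

    @GET
    @Path("{id}")
    public Response resolve(@PathParam("id") String id,
                            @HeaderParam("Accept") String accept) {
        // 303 See Other: send RDF-aware clients to the data URI, browsers to the page URI.
        boolean wantsRdf = accept != null && accept.contains("application/rdf+xml");
        URI target = URI.create(wantsRdf ? "/data/" + id : "/page/" + id);
        return Response.status(303).location(target).build();
    }
}
```

Requesting /resource/123 with Accept: application/rdf+xml then yields a 303 redirect to /data/123, while an ordinary browser is sent to /page/123.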

The RDF data can be generated using a variety of techniques. For example, you can use an XSLT-based RDFizer to extract RDF/XML from news items already marked up in XML. There are also RDFizers for relational databases. Entity extraction tools like Open Calais can also be useful, particularly for extracting RDF metadata from legacy news items available in HTML format.

Link the RDF data to external data sources such as DBpedia and Geonames by using RDF links from existing vocabularies such as FOAF. For example, an article about US Treasury Secretary Timothy Geithner can use foaf:based_near to link the news item to a resource describing Washington, DC on DBpedia. If there is an HTTP URI that describes the same resource in another data source, then use owl:sameAs links to link the two resources. For example, if a news item is about Timothy Geithner, then you can use owl:sameAs to link to Timothy Geithner's data page on DBpedia. An RDF browser like Tabulator can traverse those links and help the reader explore more information about topics of interest.


Expose a SPARQL Endpoint

Use a Semantic Sitemap (an extension to the Sitemap Protocol) to specify the location of the SPARQL endpoint or an RDF dump for Semantic Web clients and crawlers. OpenLink Virtuoso is an RDF store that also provides a SPARQL endpoint.

Provide a user interface for performing semantic searches. Expose the RDF metadata as facets for browsing the news items.

Provide a Semantic Web Wrapper for Existing APIs

A wrapper provides a dereferenceable URI for every news item available through an existing news content API. When an RDF browser requests the news item, the Semantic Web wrapper translates the request into an API call, transforms the response from XML into RDF, and sends it back to the Semantic Web client. The RDF Book Mashup is an example of how a Semantic Web Wrapper can be used to integrate publicly available APIs from Amazon, Google, and Yahoo into the Semantic Web.

Conclusion

The Semantic Web is still an obscure topic in the mainstream developer community. I hope I've outlined a few practical steps you can take now to take advantage of the new Web of Linked Data.

Saturday, February 21, 2009

TOGAF 9: The missing link between IT and the business

In a recent and famous blog about the death of SOA, Anne Thomas Manes wrote:

After investing millions, IT systems are no better than before. In many organizations, things are worse: costs are higher, projects take longer, and systems are more fragile than ever. The people holding the purse strings have had enough. With the tight budgets of 2009, most organizations have cut funding for their SOA initiatives.
...
The small select group of organizations that has seen spectacular gains from SOA did so by treating it as an agent of transformation. In each of these success stories, SOA was just one aspect of the transformation effort. And here’s the secret to success: SOA needs to be part of something bigger. If it isn’t, then you need to ask yourself why you’ve been doing it.


In my opinion, that "something bigger" that Anne is referring to is Enterprise Architecture (EA), the missing link between IT and the business. There are several reasons why IT projects fail to deliver on their promise. The lack of EA expertise and practice is certainly one of them. If you ask a group of developers how they will go about architecting an SOA solution, the typical answer that you will hear is the use of some kind of agile or UML-based methodology for gathering the requirements and modelling the application. While these steps are required in any software development project, the lack of a methodology and governance framework for aligning IT with the overarching business context, vision, and drivers can lead to chaos and total failure. In the case of SOA, this situation creates a phenomenon known as JBOWS (Just a Bunch of Web Services) Architecture.

For people who are interested or responsible for EA in their organizations, this month has seen the release of two interesting publications:

  • "97 Things Every Software Architect Should Know: Collective Wisdom from the Experts" by O'Reilly Media
  • TOGAF 9 by the Open Group

In the first publication, several software architects share their experiences and lessons learned from the trenches. One of those "97 things" is entitled "Architecting is about balancing". I couldn't agree more. The wisdom of these experts will be a great asset to any software architect.

TOGAF 9 is an ambitious project by the Open Group to create a set of standardized semantics, methodology, and processes in the field of EA. IT professionals have been told repeatedly that they need to become business savvy in order to be more effective in their organizations. TOGAF 9 helps them bridge the gap. The diagram below from the TOGAF 9 documentation provides an overview.



TOGAF 9 has a modular structure which permits an incremental approach to implementation and allows different stakeholders to focus on their respective concerns. The Architecture Development Method (ADM) describes a method for developing an enterprise architecture and is the core of TOGAF. The diagram below from the TOGAF 9 documentation shows the different phases of the ADM.



The Content Framework specifies deliverables and artifacts for each architectural building block. These artifacts represent the output of the ADM. The specification provides guidance on how the ADM can be applied to SOA and enterprise security. TOGAF 9 also addresses the important issues of architecture partitioning and architecture repository.

Finally, the Open Group has been working on the adoption of an open EA modeling standard called ArchiMate. ArchiMate provides a higher level view of EA when compared to modeling standards such as BPMN and UML. It can be used to depict different layers of EA including business processes, applications, and technology in a way that can be consumed by non-technical business stakeholders. A sample of an ArchiMate enterprise view of a hospital can be found here.

Saturday, January 24, 2009

In Defense of XSLT

I recently had a conversation with a Java programmer about why he doesn't like XSLT. The following describes his objections and how I handled them as an XSLT salesman.

XSLT is hard to read and debug

This is a matter of using the right tool for the job. While an Eclipse plug-in like EclipseXSLT is better than a simple text editor, using a full blown XML IDE like OxygenXML can greatly increase productivity.

OxygenXML can help you navigate both the input XML document (XML input document view) and the XSLT transform (XSLT template view). The contextual XPath and XSLT content assistants are very helpful given the large number of XSLT 2.0 elements and XPath 2.0 functions now available. The XSLT refactoring feature allows you to turn a selection into a named template or an included XSLT fragment. Finally, the XSLT debugging perspective allows execution with breakpoints as well as an XSLT call stack and XPath watch views and many other goodies.

In addition, the XSLT 2.0 specification itself defines constructs to facilitate the debugging process. You can use the select attribute (an XPath expression) of the <xsl:message> instruction or the content of <xsl:message> as a sequence constructor to output useful information to standard output or even to a log file. The <xsl:comment> instruction and the XPath 2.0 trace() function can be helpful in debugging as well.

XSLT does not support the reusability and maintainability of code

In addition to XSLT named templates inherited from XSLT 1.0, you can now create your own custom functions in XSLT 2.0. While the content models of both <xsl:template> and <xsl:function> are the same, <xsl:function> is preferred for computing new values and selecting nodes (because functions are called from XPath expressions), while <xsl:template> is preferred for constructing new nodes. The "as" attribute which can be specified on <xsl:template>, <xsl:function>, and their <xsl:param> children allows you to constrain the type of returned value and input parameters.

For reusability and maintainability, the <xsl:include>, <xsl:import>, and <xsl:apply-imports> elements inherited from XSLT 1.0 are still available. When used properly, <xsl:import> provides capabilities that are similar to inheritance in object-oriented languages like Java. <xsl:apply-imports> and the XSLT 2.0 <xsl:next-match> instructions are reminiscent of a call to super() in Java.

There is no way to integrate XSLT with my Java libraries

With Saxon, you can represent a Java class as the namespace URI of a function and call Java methods and constructors directly from your XSLT transform. This allows you to reuse existing pieces of application logic built in Java without rewriting them in XSLT.
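
Here is a minimal sketch of that integration: an XSLT 2.0 stylesheet, embedded as a string for brevity, calls the static java.util.UUID.randomUUID() method through Saxon's reflexive extension mechanism (a namespace URI of the form java:fully.qualified.ClassName). Whether reflexive extension functions are available depends on the Saxon version and edition you are running, so treat this as illustrative rather than definitive.

```java
import java.io.StringReader;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class ReflexiveExtensionDemo {
    public static void main(String[] args) throws Exception {
        String stylesheet =
            "<xsl:stylesheet version='2.0'" +
            "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'" +
            "    xmlns:uuid='java:java.util.UUID'" +   // bind a prefix to a Java class
            "    exclude-result-prefixes='uuid'>" +
            "  <xsl:template match='/'>" +
            "    <id><xsl:value-of select='uuid:randomUUID()'/></id>" +  // call the static method
            "  </xsl:template>" +
            "</xsl:stylesheet>";

        // Use Saxon's JAXP TransformerFactory so the extension call is understood.
        TransformerFactory factory = new net.sf.saxon.TransformerFactoryImpl();
        Transformer transformer = factory.newTransformer(new StreamSource(new StringReader(stylesheet)));
        transformer.transform(new StreamSource(new StringReader("<doc/>")),
                              new StreamResult(System.out));
    }
}
```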

There is no type checking and no way to verify that the result of a transformation is valid against a schema

Again, this argument is no longer valid with XSLT 2.0. You can now validate both the input and output by using a schema-aware XSLT processor. I strongly recommend the schema-aware version of Saxon. This allows you to root out errors and correct bugs early. In addition to the built-in XML Schema types such as xs:decimal and xs:dateTime, you can define custom types in an XSD. You can then write a template that matches all elements of a certain type. XSD type hierarchies and substitution groups are fully supported as well.

After schema validation, a Post Schema Validation Infoset (PSVI) is generated and each node is assigned a typed value (which can be obtained using the data() function) and a type annotation (which is the schema type used to validate the node). To ensure that a string has a given type annotation, constructor functions are available for built-in and custom types such as in xs:date("2009-01-24").

Saxon supports “optimistic static type-checking”. The following is an excerpt from the Saxon FAQ:

Saxon does not do static type-checking in the sense that the term is used in the W3C language specifications (this refers to pessimistic type checking, in which any construct that might fail at run-time is rejected at compile time). This is an optional feature of the W3C specifications. Saxon does however perform optimistic static analysis of queries and stylesheets, in which an error is reported only for constructions that must always fail at run-time. The information derived from this static analysis is also used to optimize the run-time code.

Unlike Java, XSLT 2.0 lacks a try/catch feature. However, you can use the handy "castable as" and "instance of" operators to detect programming errors as well as data errors in the input document.

With XSLT, it is "run and pray": there is no unit and functional testing framework like in Java

Automated unit and functional testing are essential in agile software development. Type checking (or schema-aware XSLT) can help you reduce the number of unit tests needed to fully test your XSLT transform but may not be enough to detect all errors such as those related to business rules violations in the output.

Unit testing is performed by testing individual XSLT functions and templates. You can try Jeni Tennison's XSpec framework which is inspired by the Ruby RSpec framework itself based on a Behavior Driven Development (BDD) approach.

Functional testing consists of testing whole outputs of the XSLT transform. One way of doing this is to run a Schematron schema which contains XPath 2.0-based assertions against the output of the XSLT transform. Since the Schematron validation process is itself based on XSLT, you can chain all these transformations together with a build tool like Ant, which produces an HTML report with friendly diagnostic messages. With XML Schema 1.1, you will be able to build these assertions directly into your schema, although you will have less control over the diagnostic messages that are produced by the XSD validator.

Conclusion

In a world marked up in XML, XSLT 2.0 is a very powerful language available to developers. However, to take full advantage of the language, it is important to use the right tools and take the time to explore its full capabilities, particularly its code reuse and type checking features. Finally, to bring XSLT into the realm of agile software development, unit and functional testing should become an integrated part of the XSLT development process.