
Saturday, May 5, 2012

How to Add Arbitrary Metadata to Any Element of an HL7 CDA Document

There has been a lot of buzz lately about metadata tagging in the health IT community. In this post, I describe an approach to annotating HL7 CDA documents (or any other XML documents) without actually editing the document being annotated. Metadata tagging is just one example of annotation. The underlying principle of this approach, well known in the Semantic Web community, is that anyone can say anything about anything (the AAA slogan). In other words, anyone (e.g., a patient, caregiver, physician, or provider organization) should be able to add arbitrary metadata to any element of a CDA document. For the sake of "Separation of Concerns", a fundamental principle in software engineering, the metadata should be kept out of the CDA document. The benefits of keeping the metadata or annotations out of the CDA document include:
  • Reuse of the same metadata by distinct elements from potentially multiple clinical documents.
  • The ability to update the metadata without affecting the target CDA documents.
  • The ability for any individual, organization, or community of interest (e.g., the privacy community or medical device manufacturers) to create a metadata vocabulary without going through the process of modifying the normative CDA specification (or one of its derived specifications such as the CCD, the C32, or the Consolidated CDA) or the XDS metadata specifications.

History and Current Status of Metadata Standards in Health IT


The CDA specification defines some metadata in the header of a CDA document. In addition, the XD* family of specifications (XDS, XDR, and XDM) defines a comprehensive set of metadata to be used in cross-enterprise document exchange. NIEM is currently being used in several health IT projects. In a previous post titled "Toward a Universal Exchange Language for Healthcare", I described how the NIEM metadata approach could be adapted to the healthcare domain.

The President's Council of Advisors on Science and Technology (PCAST) published a report in December 2010 entitled: "Realizing the Full Potential of Health Information Technology to Improve Healthcare for Americans: The Path Forward". To describe the proposed approach to metadata tagging, the report provides an example based on the exchange of mammograms:
"The physician would be able to securely search for, retrieve, and display these privacy protected data elements in much the way that web surfers retrieve results from a search engine when they type in a simple query.
What enables this result is the metadata attached to each of these data elements (mammograms), which would include (i) enough identifying information about the patient to allow the data to be located (not necessarily a universal patient identifier), (ii) privacy protection information—who may access the mammograms, either identified or de-identified, and for what purposes, (iii) the provenance of the data—the date, time, type of equipment used, personnel (physician, nurse, or technician), and so forth."
The HIT Standards Committee (HITSC) Metadata Tiger Team made specific recommendations to the ONC in June 2011. These recommendations included the use of:

  • Policy Pointers: URLs that point to external policy documents affecting the tagged data element.
  • Content Metadata: the actual metadata with datatype (clinical category) and sensitivity (e.g., substance abuse and mental health).
  • The HL7 CDA R2 header.

Based on those recommendations, the ONC published a Notice of Proposed Rule Making (NPRM) in August 2011 to receive comments on proposed metadata standards.

The Data Segmentation Working Group of the ONC Standards and Interoperability Framework is currently working on metadata tagging for compliance with privacy policies and consent directives.


The Annotea Protocol


The capability to add arbitrary metadata to documents without modifying them has been available in the Semantic Web for at least a decade. Indeed, it is hard to talk about metadata without a reference to the Semantic Web. I will use the W3C Annotea Protocol (which is implemented by the Amaya open source project) to demonstrate this capability. I will also show that this approach does not require the use of the Resource Description Framework (RDF) format and related Semantic Web technologies like OWL and SPARQL. The approach can be adapted to alternative representation formats such as XML, JSON, or the Atom syndication format. Let's assume that I need to add metadata tags to the CDA document below. The CDA document has only one problem entry, for substance abuse disorder (SNOMED CT code 66214007), and my goal is to attach privacy metadata prohibiting the disclosure of that information:

<ClinicalDocument>
.....
<component>
<structuredBody>
<component>
<!--Problems-->
<section>
<templateId root="2.16.840.1.113883.3.88.11.83.103"
    assigningAuthorityName="HITSP/C83"/>
<templateId root="1.3.6.1.4.1.19376.1.5.3.1.3.6"
    assigningAuthorityName="IHE PCC"/>
<templateId root="2.16.840.1.113883.10.20.1.11" assigningAuthorityName="HL7 CCD"/>
<!--Problems section template-->
<code code="11450-4" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC"
    displayName="Problem list"/>
<title>Problems</title>
<text>...</text>
<entry typeCode="DRIV">
<act classCode="ACT" moodCode="EVN">
    <templateId root="2.16.840.1.113883.3.88.11.83.7"
        assigningAuthorityName="HITSP C83"/>
    <templateId root="2.16.840.1.113883.10.20.1.27"
        assigningAuthorityName="CCD"/>
    <templateId root="1.3.6.1.4.1.19376.1.5.3.1.4.5.1"
        assigningAuthorityName="IHE PCC"/>
    <templateId root="1.3.6.1.4.1.19376.1.5.3.1.4.5.2"
        assigningAuthorityName="IHE PCC"/>
    <!-- Problem act template -->
    <id root="6a2fa88d-4174-4909-aece-db44b60a3abb"/>
    <code nullFlavor="NA"/>
    <statusCode code="completed"/>
    <effectiveTime>
        <low value="1950"/>
        <high nullFlavor="UNK"/>
    </effectiveTime>
    <performer typeCode="PRF">
        <assignedEntity>
            <id extension="PseudoMD-2" root="2.16.840.1.113883.3.72.5.2"/>
            <addr/>
            <telecom/>
        </assignedEntity>
    </performer>
    <entryRelationship typeCode="SUBJ" inversionInd="false">
        <observation classCode="OBS" moodCode="EVN">
            <templateId root="2.16.840.1.113883.10.20.1.28"
                assigningAuthorityName="CCD"/>
            <templateId root="1.3.6.1.4.1.19376.1.5.3.1.4.5"
                assigningAuthorityName="IHE PCC"/>
            <!--Problem observation template - NOT episode template-->
            <id root="d11275e7-67ae-11db-bd13-0800200c9a66"/>
            <code code="64572001" displayName="Condition"
                codeSystem="2.16.840.1.113883.6.96"
                codeSystemName="SNOMED-CT"/>
            <text>
                <reference value="#PROBSUMMARY_1"/>
            </text>
            <statusCode code="completed"/>
            <effectiveTime>
                <low value="1950"/>
            </effectiveTime>
            <value  displayName="Substance Abuse Disorder" code="66214007" codeSystemName="SNOMED" codeSystem="2.16.840.1.113883.6.96"/>
            <entryRelationship typeCode="REFR">
                <observation classCode="OBS" moodCode="EVN">
                    <templateId root="2.16.840.1.113883.10.20.1.50"/>
                    <!-- Problem status observation template -->
                    <code code="33999-4" codeSystem="2.16.840.1.113883.6.1"
                        displayName="Status"/>
                    <statusCode code="completed"/>
                    <value  code="55561003"
                        codeSystem="2.16.840.1.113883.6.96"
                        displayName="Active">
                        <originalText>
                        <reference value="#PROBSTATUS_1"/>
                        </originalText>
                    </value>
                </observation>
            </entryRelationship>
        </observation>
    </entryRelationship>
</act>
</entry>
</section>
</component>
</structuredBody>
</component>
</ClinicalDocument>




The following is a separate annotation document containing some metadata pointing to the Substance Abuse Disorder entry in the target CDA document:

<r:RDF xmlns:r="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:a="http://www.w3.org/2000/10/annotation-ns#"
    xmlns:d="http://purl.org/dc/elements/1.1/">
    <r:Description>
        <r:type r:resource="http://www.w3.org/2000/10/annotation-ns#Annotation"/>
        <r:type r:resource="http://www.w3.org/2000/10/annotationType#Metadata"/>
        <a:annotates r:resource="http://hospitalx.com/ehrs/cda.xml"/>
        <a:context>http://hospitalx.com/ehrs/cda.xml#xpointer(/ClinicalDocument/component/structuredBody/component[1]/section[1]/entry[1])</a:context>
        <d:title>Sample Metadata Tagging</d:title>
        <d:creator>Bob Smith</d:creator>
        <a:created>2011-10-14T12:10Z</a:created>
        <d:date>2011-10-14T12:10Z</d:date>
        <a:body>Do Not Disclose</a:body>
    </r:Description>
</r:RDF>

Please note a few interesting facts about the annotation document:

  • As explained by the original specification: "The Annotea protocol works without modifying the original document; that is, there is no requirement that the user have write access to the Web page being annotated."
  • The annotation itself has metadata, using the well-known Dublin Core metadata specification, to specify who created this annotation and when.
  • The document being annotated is cda.xml located at http://hospitalx.com/ehrs/cda.xml. This is described by the element <a:annotates r:resource="http://hospitalx.com/ehrs/cda.xml"/>.
  • The specific element that is being annotated within the target CDA document is specified by the context element: <a:context>http://hospitalx.com/ehrs/cda.xml#xpointer(/ClinicalDocument/component/structuredBody/component[1]/section[1]/entry[1])</a:context> using XPointer, a specification described by the W3C as "the language to be used as the basis for a fragment identifier for any URI reference that locates a resource whose Internet media type is one of text/xml, application/xml, text/xml-external-parsed-entity, or application/xml-external-parsed-entity."
  • The XPath expression /ClinicalDocument/component/structuredBody/component[1]/section[1]/entry[1] within the XPointer is used to target the entry element in the CDA document.
  • Using XPath (1.0 or 2.0) allows us to address any element (or node) in an XML document. For example, the XPath expression //value[@code='66214007']/ancestor::entry points to any entry element that contains a value element with the attribute code='66214007' (essentially targeting all entry elements that contain a substance abuse observation). The combination of XPath, XPointer, and standard medical terminology codes makes it possible to attach annotations or metadata, with interoperable semantics, to any element.
  • The body element contains the actual annotation: <a:body>Do Not Disclose</a:body>. However, the body of the annotation can also be located outside of the annotation (e.g., in a shared metadata registry) in which case the body element will be marked up as in the following example: <a:body r:resource="http://metadataregistry.com/myconsentdirectives.xml"/>

Alternative Representations

 

As mentioned before, for those who for one reason or another don't want to use RDF and related Semantic Web technologies, the annotation can easily be converted to a pure XML (as opposed to RDF/XML), JSON, or Atom representation. The original Annotea specification describes a RESTful protocol that includes the following operations: posting, querying, downloading, updating, and deleting annotations. The Atom Publishing Protocol (APP) is a newer RESTful protocol that is well adapted to the Atom syndication format.
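For illustration, the same annotation could be rendered as an Atom entry along the following lines. The mapping of Annotea elements onto Atom elements shown here is my own, not part of any specification, and the entry id is a made-up UUID:

<entry xmlns="http://www.w3.org/2005/Atom">
    <title>Sample Metadata Tagging</title>
    <!-- made-up identifier for the annotation itself -->
    <id>urn:uuid:9f1c2a34-0d6b-4d27-9a61-3c5f8e2b7a10</id>
    <author>
        <name>Bob Smith</name>
    </author>
    <updated>2011-10-14T12:10:00Z</updated>
    <!-- the annotated element, addressed with the same XPointer as in the RDF version -->
    <link rel="related"
        href="http://hospitalx.com/ehrs/cda.xml#xpointer(/ClinicalDocument/component/structuredBody/component[1]/section[1]/entry[1])"/>
    <content type="text">Do Not Disclose</content>
</entry>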


Processing Annotations with XPointer


How the annotations are processed and consumed is limited only by the requirements of a specific application and the imagination of the developers writing it. For example, an application can read both the annotation document and the target CDA document and overlay the annotations on top of the entries in the CDA document while displaying the latter in a web browser. Another example is the enforcement of privacy policies and preferences prior to exchanging the CDA document. One practical issue is how to process the XPointer fragment identifiers. XPointer uses XPath, a well-established XML addressing mechanism supported by many XML processing APIs across programming languages. For those of you who use XSLT2 to process CDA documents, there is the open source XPointer Framework for XSLT2 for use with the Saxon XSLT2 engine.
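As a minimal sketch, assuming the XPath expression has already been extracted from the annotation's context element and that the CDA instance is in no namespace (as in the sample above), an XSLT 2.0 identity transform can flag the annotated entry. The annotation attribute below is purely illustrative and not part of CDA:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- identity transform: copy everything unchanged by default -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <!-- the XPath from the annotation's context element, hard-coded here for brevity -->
    <xsl:template match="/ClinicalDocument/component/structuredBody/component[1]/section[1]/entry[1]">
        <entry annotation="Do Not Disclose">
            <xsl:apply-templates select="@*|node()"/>
        </entry>
    </xsl:template>
</xsl:stylesheet>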

Thursday, February 3, 2011

A Therapeutic Layered Cake

With all the talk about the PCAST Report, I've been doing some systems thinking about semantic interoperability in healthcare IT. Trying to put all the pieces together, I remembered Tim Berners-Lee's "Semantic Web Layer Cake".

[Image: the Semantic Web Layer Cake]

The Semantic Web Layer Cake has gone through several iterations over the years (see James Hendler's presentation on that subject). However, I think it can still be very helpful in visualizing a unified framework for addressing the challenges of semantic interoperability in healthcare IT.

As we move to Stage 2 of Meaningful Use, I believe Clinical Decision Support (CDS) will take center stage. Beyond currently used XML-based data structures (such as HL7 v3 messages), this will put an increased emphasis on medical terminologies, ontologies, and knowledge representation in OWL. For example, ICD-11 is being developed using OWL to allow consistency checking and linking to other biomedical terminologies and ontologies. Equally important to knowledge representation, but not shown in the layer cake above, is the Simple Knowledge Organization System (SKOS) specification.

In a report entitled "Semantic Interoperability Deployment and Research Roadmap", Alan Rector summarized the difference between the notions of ontology, knowledge representation, and data model:

  • Ontology – A representation of what is universally true, including what is true by definition

  • Knowledge Representation or "Background knowledge resource" – a representation of what is generally true, or widely known to be true in some specific instance. In general, the knowledge representation is formulated in terms of and indexed by the Ontology.

  • Information model or Data model – a model of how information is structured in a given software system, message, or electronic health record. In general, the data structures carry codes for the ontology as their content.

Clinical guidelines are published in the form of narrative text, sometimes with an evaluation algorithm. The translation of those guidelines into an executable representation is a complex and costly process. Several formalisms and standards have been proposed, such as the Arden Syntax, GLIF, GELLO, and GEM. However, none of these standards has been widely adopted. Developed with input from the Business Rules, Logic Programming, and Semantic Web communities, the W3C Rule Interchange Format (RIF) can help with the interchange of executable Clinical Decision Support (CDS) rules, in addition to adding reasoning capabilities to patient records. This example shows how decision support rules could be exchanged between two rules engines (Drools and Jess) using the RIF PRD syntax, a standard XML serialization format for production rule languages.

Existing patient records marked up in XML HITSP C32 or ASTM CCR can be lifted into RDF statements (with XSLT or XQuery for example) and queried using SPARQL.
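As a rough sketch of what such a lifting transform might look like (the ex: vocabulary and the urn:uuid identifier scheme are invented for illustration), the following XSLT 2.0 stylesheet turns SNOMED CT-coded problem observations into RDF statements:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/ehr#">
    <xsl:template match="/">
        <rdf:RDF>
            <!-- one RDF resource per SNOMED CT-coded observation that carries an id -->
            <xsl:for-each select="//observation[id/@root and value/@codeSystem = '2.16.840.1.113883.6.96']">
                <rdf:Description rdf:about="urn:uuid:{id/@root}">
                    <ex:snomedCode><xsl:value-of select="value/@code"/></ex:snomedCode>
                    <ex:displayName><xsl:value-of select="value/@displayName"/></ex:displayName>
                </rdf:Description>
            </xsl:for-each>
        </rdf:RDF>
    </xsl:template>
</xsl:stylesheet>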

Proof, Trust, and Cryptography are currently being addressed by various standards and specifications in the healthcare industry, notably the OASIS Cross-Enterprise Security and Privacy Authorization (XSPA) Profiles of XACML, SAML, and WS-Trust.

On the User Interface side, I see HTML5 giving both Flex and Silverlight a run for their money in the next few years. This will be driven in part by the demand for mobile health (mHealth).

Monday, December 13, 2010

Toward a Universal Exchange Language for Healthcare

The US President's Council of Advisors on Science and Technology (PCAST) published a report last week entitled: "Realizing the Full Potential of Health Information Technology to Improve Healthcare for Americans: The Path Forward". The report calls for a universal exchange language for healthcare (abbreviated as UELH in this post). Specifically, the report says:

"We believe that the natural syntax for such a universal exchange language will be some kind of exten­sible markup language (an XML variant, for example) capable of exchanging data from an unspecified number of (not necessarily harmonized) semantic realms. Such languages are structured as individual data elements, together with metadata that provide an annotation for each data element."

First, let me say that I fully support the idea of a UELH. I've written in the past about the future of healthcare data exchange standards. The ASTM CCR and the HL7 CCD have been adopted for Meaningful Use Stage 1, and that was the right choice. In my opinion, the UELH proposed by PCAST is about the next-generation healthcare data exchange standard that is yet to be built. It's part of the natural evolution and innovation that are inherent to the information technology industry. It is also a very challenging task that should be informed by the important work that has been done previously in this field, including:

  • The ASTM CCR
  • The HL7 RIM, CDA, CCD, and greenCDA
  • Archetype-based EN 13606 and openEHR
  • The National Information Exchange Model (NIEM)
  • HITSP C32
  • Biomedical Ontologies using semantic web technologies such as OWL2, SKOS, and RDF.
  • Medical Terminologies such as SNOMED and RxNorm.

This new language should focus on identifying, addressing, and solving the issues with the use of the current set of healthcare data exchange standards. This will require a public discourse that is cordial and focused on solutions and innovative ideas. Most importantly, it will require listening to the concerns of implementers. This proposal should not be about reinventing the wheel. It should be about creating a better future by learning lessons from the past while being open-minded about new ideas and approaches to solving problems.

Note that the report talks about the syntax of this new language as some kind of an "XML variant". It also mentions that the language must be extensible. This is important in order to enable innovation in this field. For example, we've recently seen a serious challenge to XML coming from JSON in the web APIs space (Twitter and Foursquare removed support for XML in their APIs and now provide only a JSON API). Similarly, in the Semantic Web space, alternatives to the RDF/XML serialization syntax have emerged, such as the N-Triples notation. This is not to say that XML is the wrong representation for healthcare data. It simply means that we should be open to innovation in this area.

Metadata and the Semantic Web in Healthcare

Closely related to the notion of metadata is the idea of the Semantic Web. Although semantic web technologies are not widely used in healthcare today, they could help address some of the issues with current healthcare standard information models including: model consistency, reasoning, and knowledge integration across domains (e.g. the genomics and clinical domains). In a report entitled "Semantic Interoperability Deployment and Research Roadmap", Alan Rector, an authority in the field of biomedical ontologies, explains the difference between ontologies and data structures:

A second closely related notion is that of an "information model" or "model of data structures". Both Archetypes and HL7 V3 Messages are examples of data structures. Formalisms for data structures bear many resemblances to formalisms for ontologies. The confusion is made worse because description logics are often used for both. However, there is a clear difference.

  • Ontologies are about the things being represented – patients, their diseases. They are about what is always true, whether or not it is known to the clinician. For example, all patients have a body temperature (possibly ambient if they are dead); however, the body temperature may not be known or recorded. It makes no sense to talk about a patient with a "missing" body temperature.
  • Data structures are about the artefacts in which information is recorded. Not every data structure about a patient need include a field for body temperature, and even if it does, that field may be missing for any given patient. It makes perfect sense to speak about a patient record with missing data for body temperature.

A key point is that "epistemological issues" – issues of what a given physician or the healthcare system knows – should be represented in the data structures rather than the ontology. This causes serious problems for terminologies and coding systems, which often include notions such as "unspecified" or even "missing". This practice is now widely deprecated but remains common.

One of the Common Terminology Services (CTS 2) submissions to the OMG is based on Semantic Web technologies such as OWL2, SKOS, and SPARQL. The UELH proposed by the PCAST should leverage the work that has been done by the biomedical ontology community.

The NIEM Approach to Metadata-Tagged Data Elements

The report goes on to say that the metadata attached to each of these data elements

"...would include (i) enough identifying information about the patient to allow the data to be located (not necessarily a universal patient identifier), (ii) privacy protection information—who may access the mammograms, either identified or de-identified, and for what purposes, (iii) the provenance of the data—the date, time, type of equipment used, personnel (physician, nurse, or technician), and so forth."

The report does not explain exactly how this should be done. So let's combine the wisdom of NIEM, HL7 greenCDA, and OASIS XSPA (Cross-Enterprise Security and Privacy Authorization Profile of XACML for healthcare) to propose a solution. Let's assume that we need to add metadata about the equipment used for the lab result, as well as patient consent directives, to the following lab result entry, which is marked up in the greenCDA format:

<result>
<resultID root="107c2dc0-67a5-11db-bd13-0800200c9a66" />
<resultDateTime value="200003231430" />
<resultType codeSystem="2.16.840.1.113883.6.1" code="30313-1"
displayName="HGB" />
<resultStatus code="completed" />
<resultValue>
<physicalQuantity value="13.2" unit="g/dl" />
</resultValue>
<resultInterpretation codeSystem="2.16.840.1.113883.5.83"
code="N" />
<resultReferenceRange>M 13-18 g/dl; F 12-16
g/dl</resultReferenceRange>
</result>

In the following, an s:metadata attribute is added to the root element (s:metadata is of type IDREFS and for brevity, I am not showing the namespace declarations):

<result s:metadata="equipment consent">
<resultID root="107c2dc0-67a5-11db-bd13-0800200c9a66" />
<resultDateTime value="200003231430" />
<resultType codeSystem="2.16.840.1.113883.6.1" code="30313-1"
displayName="HGB" />
<resultStatus code="completed" />
<resultValue>
<physicalQuantity value="13.2" unit="g/dl" />
</resultValue>
<resultInterpretation codeSystem="2.16.840.1.113883.5.83"
code="N" />
<resultReferenceRange>M 13-18 g/dl; F 12-16
g/dl</resultReferenceRange>
</result>

The following is the lab test equipment metadata:

<LabTestEquipmentMetadata s:id="equipment">
<SerialNumber>93638494749</SerialNumber>
<Manufacturer>MedLabEquipCo.</Manufacturer>
</LabTestEquipmentMetadata>

And here are the patient consent directives, marked up in the XSPA XACML format (this snippet is taken from the NHIN Access Consent Policies Specification):

<ConsentMetadata s:id="consent">
<Policy xmlns="urn:oasis:names:tc:xacml:2.0:policy:schema:os"
PolicyId="12345678-1234-1234-1234-123456781234"
RuleCombiningAlgId="urn:oasis:names:tc:xacml:1.0:rule-combining-algorithm:first-applicable">
<Description>Sample XACML policy for NHIN</Description>
<!-- The Target element at the Policy level identifies the subject to whom the Policy applies -->
<Target>
<Resources>
<Resource>
<ResourceMatch MatchId="http://www.hhs.gov/healthit/nhin/function#instance-identifier-equal">

<AttributeValue DataType="urn:hl7-org:v3#II"
xmlns:hl7="urn:hl7-org:v3">
<hl7:PatientId root="2.16.840.1.113883.3.18.103"
extension="00375" />
</AttributeValue>
<ResourceAttributeDesignator AttributeId="http://www.hhs.gov/healthit/nhin#subject-id"
DataType="urn:hl7-org:v3#II" />
</ResourceMatch>
</Resource>
</Resources>
<Actions>
<!-- This policy applies to all document query and document retrieve transactions -->
<Action>
<ActionMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:anyURI-equal">

<AttributeValue DataType="http://www.w3.org/2001/XMLSchema#anyURI">
urn:ihe:iti:2007:CrossGatewayRetrieve</AttributeValue>
<ActionAttributeDesignator AttributeId="urn:oasis:names:tc:xacml:2.0:action"
DataType="http://www.w3.org/2001/XMLSchema#anyURI" />
</ActionMatch>
</Action>
<Action>
<ActionMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:anyURI-equal">

<AttributeValue DataType="http://www.w3.org/2001/XMLSchema#anyURI">
urn:ihe:iti:2007:CrossGatewayQuery</AttributeValue>
<ActionAttributeDesignator AttributeId="urn:oasis:names:tc:xacml:2.0:action"
DataType="http://www.w3.org/2001/XMLSchema#anyURI" />
</ActionMatch>
</Action>
</Actions>
</Target>
<Rule RuleId="133" Effect="Permit">
<Description>Permit access to all documents to all
physicians and nurses</Description>
<Target>
<Subjects>
<Subject>
<SubjectMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">

<!-- coded value for physicians -->
<AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">
112247003</AttributeValue>
<SubjectAttributeDesignator AttributeId="urn:oasis:names:tc:xacml:2.0:subject:role"
DataType="http://www.w3.org/2001/XMLSchema#string" />
</SubjectMatch>
</Subject>
<Subject>
<SubjectMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">

<!-- coded value for nurses -->
<AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">
106292003</AttributeValue>
<SubjectAttributeDesignator AttributeId="urn:oasis:names:tc:xacml:2.0:subject:role"
DataType="http://www.w3.org/2001/XMLSchema#string" />
</SubjectMatch>
</Subject>
</Subjects>
<!-- since there is no Resource element, this rule applies to all resources -->
</Target>
</Rule>
<Rule RuleId="134" Effect="Permit">
<Description>Allow access Dentists and Dental Hygienists
Access from the Happy Tooth dental practice to documents
with "Normal" confidentiality for a defined time
period.</Description>
<Target>
<Subjects>
<Subject>
<SubjectMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">

<!-- coded value for dentists -->
<AttributeValue DataType="http://www.w3.org/2001/XMLSchema#anyURI">
106289002</AttributeValue>
<SubjectAttributeDesignator AttributeId="urn:oasis:names:tc:xacml:2.0:subject:role"
DataType="http://www.w3.org/2001/XMLSchema#string" />
</SubjectMatch>
<SubjectMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:anyURI-equal">

<AttributeValue DataType="http://www.w3.org/2001/XMLSchema#anyURI">
http://www.happytoothdental.com</AttributeValue>
<SubjectAttributeDesignator AttributeId="urn:oasis:names:tc:xspa:1.0:subject:organization-id"
DataType="http://www.w3.org/2001/XMLSchema#anyURI" />
</SubjectMatch>
</Subject>
<Subject>
<SubjectMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">

<!-- coded value for dental hygienists -->
<AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">
26042002</AttributeValue>
<SubjectAttributeDesignator AttributeId="urn:oasis:names:tc:xacml:2.0:subject:role"
DataType="http://www.w3.org/2001/XMLSchema#string" />
</SubjectMatch>
<SubjectMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:anyURI-equal">

<AttributeValue DataType="http://www.w3.org/2001/XMLSchema#anyURI">
http://www.happytoothdental.com</AttributeValue>
<SubjectAttributeDesignator AttributeId="urn:oasis:names:tc:xspa:1.0:subject:organization-id"
DataType="http://www.w3.org/2001/XMLSchema#anyURI" />
</SubjectMatch>
</Subject>
</Subjects>
<Resources>
<Resource>
<ResourceMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:string-equal">

<AttributeValue DataType="http://www.w3.org/2001/XMLSchema#string">
N</AttributeValue>
<ResourceAttributeDesignator AttributeId="urn:oasis:names:tc:xspa:1.0:resource:patient:hl7:confidentiality-code"
DataType="http://www.w3.org/2001/XMLSchema#string" />
</ResourceMatch>
</Resource>
</Resources>
<Environments>
<Environment>
<EnvironmentMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:date-greater-than-or-equal">

<AttributeValue DataType="http://www.w3.org/2001/XMLSchema#date">
2009-07-01</AttributeValue>
<EnvironmentAttributeDesignator AttributeId="http://www.hhs.gov/healthit/nhin#rule-start-date"
DataType="http://www.w3.org/2001/XMLSchema#date" />
</EnvironmentMatch>
</Environment>
<Environment>
<EnvironmentMatch MatchId="urn:oasis:names:tc:xacml:1.0:function:date-less-than-or-equal">

<AttributeValue DataType="http://www.w3.org/2001/XMLSchema#date">
2009-12-31</AttributeValue>
<EnvironmentAttributeDesignator AttributeId="http://www.hhs.gov/healthit/nhin#rule-end-date"
DataType="http://www.w3.org/2001/XMLSchema#date" />
</EnvironmentMatch>
</Environment>
</Environments>
</Target>
</Rule>
<Rule RuleId="135" Effect="Deny">
<Description>deny all access to documents. Since this
rule is last, it will be selected if no other rule
applies, under the rule combining algorithm of first
applicable.</Description>
<Target />
</Rule>
</Policy>
</ConsentMetadata>

Please note the following:

  • Metadata "LabTestEquipmentMetadata" asserts the equipment used for the lab test.
  • Metadata "ConsentMetadata" asserts the patient consent directives leveraving the XSPA XACML format.
  • Metadata can be declared once and reused by multiple elements.
  • An element can refer to zero or more metadata objects.

In NIEM, an appinfo:AppliesTo element in a metadata type declaration is used to indicate the type to which the metadata applies as in the following example (note this is not enforced by the XML schema validating parser, but can be enforced at the application level):

<xsd:complexType name="LabTestEquipmentMetadataType">
<xsd:annotation>
<xsd:appinfo>
<i:AppliesTo i:name="LabResultType" />
</xsd:appinfo>
</xsd:annotation>
<xsd:complexContent>
<xsd:extension base="s:MetadataType">
...
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>

<xsd:element name="LabTestEquipmentMetadata" type="LabTestEquipmentMetadataType" nillable="true"/>

NIEM defines a common metadata type that can be extended by any type definition that requires metadata:

<schema
targetNamespace="http://niem.gov/niem/structures/2.0"
version="alpha2"
xmlns:i="http://niem.gov/niem/appinfo/2.0"
xmlns:s="http://niem.gov/niem/structures/2.0"
xmlns="http://www.w3.org/2001/XMLSchema">


<attribute name="id" type="ID"/>
<attribute name="linkMetadata" type="IDREFS"/>
<attribute name="metadata" type="IDREFS"/>
<attribute name="ref" type="IDREF"/>
<attribute name="sequenceID" type="integer"/>

<attributeGroup name="SimpleObjectAttributeGroup">
<attribute ref="s:id"/>
<attribute ref="s:metadata"/>
<attribute ref="s:linkMetadata"/>
</attributeGroup>

<element name="Metadata" type="s:MetadataType" abstract="true"/>

<complexType name="ComplexObjectType" abstract="true">
<attribute ref="s:id"/>
<attribute ref="s:metadata"/>
<attribute ref="s:linkMetadata"/>
</complexType>

<complexType name="MetadataType" abstract="true">
<attribute ref="s:id"/>
</complexType>

</schema>

Any type definition that needs metadata can simply extend ComplexObjectType, as follows for a lab result type:

<xsd:complexType name="LabResultType">
<xsd:complexContent>
<xsd:extension base="s:ComplexObjectType">
<xsd:sequence>...</xsd:sequence>
</xsd:extension>
</xsd:complexContent>
</xsd:complexType>

Wednesday, June 9, 2010

Data Modeling for Electronic Health Records (EHR) Systems

Getting the data model right is of paramount importance for an Electronic Health Records (EHR) system. The factors that drive the data model include but are not limited to:

  • Patient safety
  • Support for clinical workflows
  • Different uses of the data such as input to clinical decision support systems
  • Reporting and analytics
  • Regulatory requirements such as Meaningful Use criteria.

Model First

Proven methodologies like contract-first web service design and model-driven development (MDD) put the emphasis on deriving application code from the data model and not the other way around. Thousands of lines of code can be auto-generated from the model, so it's important to get the model right.


Requirements Gathering

The objective here is to determine the entities, their attributes, and the relationships between those entities. For example, what are the attributes that are necessary to describe a patient's condition and how do you express the fact that a condition is a manifestation of an allergy? The data modeler should work closely with clinicians to gather those requirements. Industry standards should be leveraged as well. For example, HITSP C32 defines the data elements for each EHR data module such as conditions, medications, allergies, and lab results. These data elements are then mapped to the HL7 Continuity of Care Document (CCD) XML schema.

The HL7 CCD is itself derived from the HL7 Reference Information Model (RIM). The latter is expressed as a set of UML class diagrams and is the foundation model for health care and clinical data. A simpler alternative to the CCD is the ASTM Continuity of Care Record (CCR). Both the CCD and CCR provide an XML schema for data exchange and are Meaningful Use criteria. Another relevant data model is the HL7 vMR (Virtual Medical Record), which aims to define a data model for the input and output of Clinical Decision Support Systems (CDSS).

These standards can be cumbersome to use as such from a software development perspective. Nonetheless, they can inform the design of the data model for an EHR system. Alignment with the CCD and CCR will facilitate data exchange with other providers and organizations. The following are Meaningful Use criteria for data exchange:

  1. Electronically receive a patient summary record, from other providers and organizations including, at a minimum, diagnostic test results, problem list, medication list, medication allergy list, immunizations, and procedures and upon receipt of a patient summary record formatted in an alternative standard specified in Table 2A row 1, displaying it in human readable format.

  2. Enable a user to electronically transmit a patient summary record to other providers and organizations including, at a minimum, diagnostic test results, problem list, medication list, medication allergy list, immunizations, and procedures in accordance with the standards specified in Table 2A row 1.



Applying Data Modeling Patterns

Applying data modeling patterns improves model consistency and quality. Relational data modeling is a well-established discipline. My favorite resource for relational data modeling patterns is The Data Model Resource Book, Vol. 3: Universal Patterns for Data Modeling.

Some XML Schema best practices can be found here.


Data Stores

Today, options for data stores are no longer limited to relational databases. Alternatives include native XML databases (e.g. DB2 pureXML), Entity-Attribute-Value with Classes and Relationships (EAV/CR), and Resource Description Framework (RDF) stores.

Native XML databases are more resilient to schema changes and do not require handling the impedance mismatch between XML documents, Java objects, and relational tables, which can introduce design complexity, performance, and maintainability issues.

Storing EHRs in an RDF store can enable the inference of medical facts based on existing explicit medical facts. Such inferences can be driven by an ontology expressed in OWL or a set of rules expressed in a rule language such as SWRL. Semantic Web technologies can also be helpful in checking the consistency of a model, in data and knowledge integration across domains (e.g. the genomics and clinical domains), and in managing classification schemes like medical terminologies. RDF, OWL, and SWRL have been successfully implemented in Clinical Decision Support Systems (CDSS).
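As a minimal sketch (the ex: vocabulary and URIs are invented for illustration), the following RDF/XML states that substance abuse disorder is a kind of sensitive condition; given a record explicitly typed as a substance abuse disorder, an RDFS or OWL reasoner can then infer that the record also describes a sensitive condition, even though that fact is never stated explicitly:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <!-- ontology statement: every substance abuse disorder is a sensitive condition -->
    <rdfs:Class rdf:about="http://example.org/ehr#SubstanceAbuseDisorder">
        <rdfs:subClassOf rdf:resource="http://example.org/ehr#SensitiveCondition"/>
    </rdfs:Class>
    <!-- explicit fact: this problem entry is a substance abuse disorder -->
    <rdf:Description rdf:about="http://example.org/patients/123/problems/1">
        <rdf:type rdf:resource="http://example.org/ehr#SubstanceAbuseDisorder"/>
    </rdf:Description>
</rdf:RDF>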

The data modeling notation used should be independent of the storage model or at least compatible with the latter. For example, if native XML storage is used, then a relational modeling notation might not be appropriate. In general, UML provides the right level of abstraction for implementation-agnostic modeling.


Due Diligence

When adopting a "noSQL" storage model, it is important to ensure that (a) the database can meet performance and scalability criteria and (b) the team has the skills to develop and maintain the database. Due diligence should be performed through benchmarking using a tool such as the IBM Transaction Processing over XML (TpoX). The team might need formal training in a new query language like XQuery or SPARQL.


A Longitudinal View of the Patient Health

Maintaining an up-to-date and truly longitudinal view of a patient's medical history requires merging and reconciling data from heterogeneous sources including providers' EMR systems, lab companies, medical devices, and payers' claim transaction repositories. The data model should facilitate the assembly of data from such diverse sources. XML tools based on XSLT, XQuery, or XQuery Update can be used to automate the merging.


The Importance of Data Validation

Data validation can be performed at the database layer, the application layer, and the UI layer. The data model should support the validation of the data. The following are examples of techniques that can be used for data validation:

  • XML Schema for structural validation of XML documents
  • ISO Schematron (based on XPath 2.0 and XSLT 2.0) for business rules validation of XML documents (see the sketch after this list)
  • A business rules engine like Drools
  • A data processing framework like Smooks
  • The validation features of a UI framework such as JSF2
  • The built-in validation features of the database.
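For instance, here is a minimal ISO Schematron sketch (the rule itself is invented for illustration) expressing a business rule that an XML schema alone cannot easily capture, using the greenCDA lab result example shown earlier on this page:

<schema xmlns="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <pattern>
        <!-- illustrative rule: hemoglobin (LOINC 30313-1) results must use g/dl -->
        <rule context="result[resultType/@code = '30313-1']/resultValue/physicalQuantity">
            <assert test="@unit = 'g/dl'">A hemoglobin (HGB) result must be reported in g/dl.</assert>
        </rule>
    </pattern>
</schema>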


The Future: Modeling with the NIEM IEPD


The HHS ONC issued an RFP for using the National Information Exchange Model (NIEM) Information Exchange Package Documentation (IEPD) process for healthcare data exchange. The ONC will release a NIEM Concept of Operations (ConOps). The NIEM IEPD process is explained here.

Monday, September 21, 2009

Relational, XML, or RDF?

During the last 15 years, I have had the opportunity to work with different data models and approaches to application development. I started with SGML in the aerospace content management space back in 1995, then saw the potential of XML and fully embraced it in 1998. Since that time, I have been continuously following the evolution of XML related specifications and have been able to leverage the bleeding edge including XForms, XQuery, XSLT2, XProc, ISO Schematron, and even XML Schema 1.1.

However, being a curious person, I decided to explore other approaches to data management and application development. I worked on systems using a relational database backend and application development frameworks like Spring, Hibernate, and JSF. I've been involved in SOAP-based web services projects where XML data (constrained and validated by an XML Schema) was unmarshalled into Java objects, and then persisted into a relational table with an Object-Relational Mapping (ORM) solution such as Hibernate.

I also had the opportunity to work with the Java Content Repository (JCR) model in magazine content publishing, and the Entity-Attribute-Value with Classes and Relationships (EAV/CR) model in the context of medical informatics. EAV/CR is suited for domains where entities can have thousands of frequently changing parameters.

Lately, I have been working on Semantic Web technologies including the RDF data model, OWL (the Web Ontology Language), and the SPARQL query language for RDF.

Clients often ask me which of these approaches is the best or which is the most appropriate for their project. Here is what I think:

  • Different approaches should be part of the software architect's toolkit (not just one).
  • To become more productive in an agile environment, every developer should become a "generalizing specialist".
  • The software architect or developer should be open minded (no "not invented here syndrome" or "what's wrong with what we're doing now" attitude).
  • The software architect or developer should be willing to learn new technologies outside of their comfort zone and IT leadership should encourage and reward that learning.
  • Learning new technologies sometimes requires a new way of thinking about the problems at hand and "unlearning" old knowledge.
  • It is important not to take a purist or dogmatic stance toward any particular approach, since each has its own merits.
  • Ultimately, the overall context of the project will dictate your choice. This includes but is not limited to: skill set, learning curve, application performance, cost, and time to market.

Based on my personal experience, here is what I have learned:

The XPath 2.0 and XQuery 1.0 Data Model (XDM)


The roots of SGML and XML are in content management applications in domains such as law, aerospace, defense, scientific, technical, medical, and scholarly publishing. The XPath 2.0 and XQuery 1.0 Data Model (XDM) is particularly well suited for companies selling information products directly as a source of revenue (e.g. non-ad-based publishers).

XSLT2 facilitates media-independent publishing (single sourcing) to multiple devices and platforms. XSLT2 is also a very powerful XML transformation language that allows these publishers to perform the series of complex transformations that are often required as the content is extracted from various data sources and assembled into a final information product.

With XQuery, sophisticated contextualized database-like queries can be performed. In addition, XQuery allows the dynamic assembly of content where new compound documents can be constructed on the fly from multiple documents and external data sources. This allows publishers to repurpose content into new information products as needed to satisfy new customer demands and market opportunities.

XInclude enables the chunking of content into reusable pieces. XProc is an XML pipeline language that essentially allows you to automate a complex publishing workflow which typically includes many steps such as content assembly, validation, transformation, and query.

The second category of application for which XML is a strong candidate is what is sometimes referred to as an "XML Workflow" application. The typical design pattern here is XRX (XForms, REST, and XQuery), where user inputs are captured with an XForms front end (itself potentially auto-generated from an XML schema) and the data is RESTfully submitted to a native XML database, then queried and manipulated with XQuery. The advantages of this approach are:

  • It is more resilient to schema changes. In fact, the data can be stored without a schema.
  • It does not require handling the impedance mismatch between XML documents, Java objects, and relational tables, which can introduce design complexity, performance, and maintainability issues even when using code generation.

A typical example of an "XML Workflow" application would be a Human Resources (HR) form-based application that allows employees to fill out and submit a form and also provides reporting capabilities.

The third and last category of application is Web Services (RESTful or SOAP-based) that consume XML data from various sources, store the data natively in an XML database directly (bypassing the XML databinding and ORM layers altogether), and perform all processing and queries on the data using a pure XML approach based on XSLT2 and XQuery. An example is a dashboard or mashup application that stores all of the submitted data in a native XML database. In this scenario, the data can be cached for faster response to web services requests. Again, the benefits listed for "XML Workflow" applications apply here as well.


The Relational Model

The relational model is well established and well understood. It is usually an option for data-oriented and enterprise-centric applications that are based on a closed world assumption. In such a scenario, there is usually no need for handling the data in XML and a conventional approach based on JSF, Spring, Hibernate and a relational database backend is enough.

Newer Java EE frameworks like JBoss Seam and its seam-gen code generation tools are particularly well-suited for this kind of task. There is no running away from XML however, since these frameworks use XML for their configuration files. Unfortunately, there is currently a movement away from XML configuration files toward Java annotations due to some developers complaining about "XML Hell".

The relational model supports transactions and is scalable although a new movement called NoSQL is starting to challenge that last assumption. An article entitled "Is the Relational Database Doomed?" on readwriteweb.com describes this emerging trend.


The RDF Data Model

Semantic Web technologies like RDF (an incarnation of the EAV/CR model mentioned above), OWL, SKOS, SWRL, and SPARQL and Linked Data publishing principles have received a lot of attention lately. They are well suited for the following applications:

  • Applications that need to infer new implicit facts based on existing explicit facts. Such inferences can be driven by an ontology expressed in OWL or a set of rules expressed in a rule language such as SWRL.
  • Applications that need to map concepts across domains such as a trading network where partners use different e-commerce XML vocabularies.
  • Master Data Management (MDM) applications that provide an RDF view and reasoning capabilities in order to facilitate and enhance the process of defining, managing, and querying an organization's core business entities. A paper on IBM's Semantic Master Data Management (SMDM) project is available here.
  • Applications that use a taxonomy, a thesaurus, or a similar concept scheme, such as online news archives and medical terminologies. SKOS, recently approved as a W3C recommendation, was designed for that purpose.
  • Silo-busting applications that need to link data items to other data items on the web, in order to perform entity correlation or allow users to explore a topic further. The Linked Data design pattern is based on an open world assumption, uses dereferenceable HTTP URIs for identifying and accessing data items, RDF for describing metadata about those items, and semantic links to describe the relationships between those items. An example is an Open Government application that correlates campaign contributions, voting records, census, and location data. Another example is a semantic social application that combines an individual's profiles and social networks from multiple sites in order to support data portability and fully explore the individual's social graph.


Of course, these different approaches are not mutually exclusive. For example, it is possible to provide an RDF view or layer on top of existing XML and relational database applications.

Tuesday, August 11, 2009

Adding Semantics to SOA

What can Semantic Web technologies such as RDF, OWL, SKOS, SWRL, and SPARQL bring to Web Services? One of the most difficult challenges of SOA is data model transformation. This problem occurs when services don't share a canonical XML schema. XML transformation languages such as XSLT and XQuery are typically used for data mediation in such circumstances.

While it is relatively easy to write these mappings, the real difficulty lies in mapping concepts across domains. This is particularly important in B2B scenarios involving multiple trading partners. In addition to proprietary data models, it is not uncommon to have multiple competing XML standards in the same vertical. In general, these data interoperability issues can be syntactic, structural, or semantic in nature. Many SOA projects can trace their failure to those data integration issues.

This is where semantic web technologies can add significant value to SOA. The Semantic Annotations for WSDL and XML Schema (SAWSDL) is a W3C recommendation which defines the following extension attributes that can be added to WSDL and XML Schema components:

  • The modelReference extension attribute associates a WSDL or XML Schema component to a concept in a semantic model such as OWL. The semantic representation is not restricted to OWL (for example it could be an SKOS concept). The modelReference extension attribute is used to annotate XML Schema type definitions, element and attribute declarations as well as WSDL interfaces, operations, and faults.
  • The liftingSchemaMapping and loweringSchemaMapping extension attributes typically point to an XSLT or XQuery mapping file for transforming between XML instances and ontology instances.

A typical example of how SAWSDL might be used is in an electronic commerce network where trading partners use various standards such as EDI, UBL, ebXML, and RosettaNet. In this case, the modelReference extension attribute can be used to map a WSDL or XML Schema component to a concept in a common foundational ontology such as one based on the Suggested Upper Merged Ontology (SUMO). In addition, lifting and lowering XSLT transforms are attached to XML Schema components in the SAWSDL with liftingSchemaMapping and loweringSchemaMapping extension attributes respectively. Note that any number of those transforms can be associated with a given XML schema component.
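As a rough sketch (the purchase order schema, ontology URI, and mapping locations are all invented for illustration), the SAWSDL extension attributes are simply added to the XML Schema component being annotated:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:sawsdl="http://www.w3.org/ns/sawsdl"
    xmlns:po="http://example.com/purchaseorder"
    targetNamespace="http://example.com/purchaseorder">
    <!-- modelReference points the element at a concept in a shared ontology;
         the lifting/lowering attributes point at XSLT transforms between
         XML instances and ontology instances -->
    <xsd:element name="PurchaseOrder" type="po:PurchaseOrderType"
        sawsdl:modelReference="http://example.com/ontology#PurchaseOrder"
        sawsdl:liftingSchemaMapping="http://example.com/mappings/po-lifting.xslt"
        sawsdl:loweringSchemaMapping="http://example.com/mappings/po-lowering.xslt"/>
    <xsd:complexType name="PurchaseOrderType">
        <xsd:sequence>
            <xsd:element name="OrderDate" type="xsd:date"/>
        </xsd:sequence>
    </xsd:complexType>
</xsd:schema>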

Traditionally, when dealing with multiple services (often across organizational boundaries), an Enterprise Services Bus (ESB) provides mediation services such as business process orchestration, business rules processing, data format and data model transformation, message routing, and protocol bridging. Semantic mediation services can be added as a new type of ESB service. The SAWSDL4J API defines an object model that allows SOA developers to access and manipulate SAWSDL annotations.

Ontologies have been developed for some existing e-commerce standards such as EDI X12, RosettaNet, and ebXML. When required, ontology alignment can be achieved with OWL constructs such as subClassOf, equivalentClass, and equivalentProperty.

Semantic annotations provided by SAWSDL can also be leveraged in orchestrating business processes using the business process execution language (BPEL). To facilitate service discovery in SOA Registries and Repositories, interface definitions in WSDL documents can be associated with a service taxonomy defined in SKOS. In addition, once an XML message is lifted to an ontology instance, the data in the message becomes available to Semantic Web tools like OWL and SWRL reasoners and SPARQL query engines.
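For example, a minimal SKOS concept scheme for such a service taxonomy might look like the following (the scheme and concept URIs are invented for illustration):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
    <skos:ConceptScheme rdf:about="http://example.com/taxonomy/services"/>
    <!-- one concept in the taxonomy, with a broader parent concept -->
    <skos:Concept rdf:about="http://example.com/taxonomy/services#PaymentServices">
        <skos:prefLabel>Payment Services</skos:prefLabel>
        <skos:broader rdf:resource="http://example.com/taxonomy/services#FinancialServices"/>
        <skos:inScheme rdf:resource="http://example.com/taxonomy/services"/>
    </skos:Concept>
</rdf:RDF>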

Sunday, July 26, 2009

From Web 2.0 to the Semantic Web: Bridging the Gap in Newsmedia

In this presentation, I explain the Semantic Web value proposition for the newsmedia industry and propose some concrete steps to bridge the gap.

Welcome to the world of news in the Web 3.0 era.

Wednesday, July 8, 2009

Semantic Social Computing

The Web 2.0 revolution has produced an explosion in social data that is fundamentally transforming business, politics, culture, and society in general. Using tools such as wikis, blogs, online forums, and social networking sites, users can now express their point of view, build relationships, and exchange ideas and multimedia content. Combined with portable electronic devices such as cameras and cell phones, these tools are enabling the citizen journalist who can report facts and events faster than traditional media outlets and government agencies.

One of the challenges posed by this explosion in social data is data portability between social networking sites. But the next biggest challenge will be the ability to harvest all that social data in order to extract actionable intelligence (e.g. a better understanding of consumer behavior or the events unfolding at a particular location). In addition, in a world where security has become the number one priority, various sensors from traffic cameras to satellite sensors are also collecting huge amounts of data. The integration of sensor data and social data offers new possibilities.

Those are the types of integration challenges that Semantic Web technologies are designed to solve. The SIOC (Semantically Interlinked Online Communities) Core ontology describes the structure and content of online community sites. A comprehensive list of SIOC tools is available at the SIOC Applications page. Using these tools, developers can export SIOC compliant RDF data from various data sources such as blogs, wikis, online forums, and social networking sites such as Twitter and Flickr. Once exported, the SIOC data can be crawled, aggregated, stored, indexed, browsed, and queried (using SPARQL) to answer interesting questions. Natural Language Processing (NLP) techniques can be used to facilitate entity extraction from user generated content.
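As a small illustration (the URIs are invented, and the choice of properties is mine), a blog post exported as SIOC-compliant RDF might look like this:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:sioc="http://rdfs.org/sioc/ns#"
    xmlns:dcterms="http://purl.org/dc/terms/">
    <sioc:Post rdf:about="http://blog.example.com/2009/07/08/semantic-social-computing">
        <dcterms:title>Semantic Social Computing</dcterms:title>
        <!-- the user account that authored the post -->
        <sioc:has_creator>
            <sioc:UserAccount rdf:about="http://blog.example.com/users/jdoe"/>
        </sioc:has_creator>
        <sioc:content>A short excerpt of the post body goes here.</sioc:content>
    </sioc:Post>
</rdf:RDF>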

SIOC leverages the FOAF ontology to describe the social graph on social networking sites. For example, this can offer deeper insights for marketers into how social recommendations affect consumer behavior.

One unique capability offered by Semantic Web technologies is the ability to infer new facts (inference) from explicit facts, based on the use of an ontology (RDFS or OWL) or a set of rules expressed in a rule language such as the Semantic Web Rule Language (SWRL). Using constructs such as owl:sameAs or rdfs:seeAlso, it becomes easy to express the fact that two or more different web pages relate to the same resource (e.g. different profile pages of the same person on different social networking sites). Linked Data principles can help in linking social data, thereby building bridges between the data islands that today's social networking sites represent.
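For instance, the following RDF/XML (with invented URIs) asserts that two profile pages on different sites describe the same person, which lets a reasoner merge whatever is known about each profile:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
    <rdf:Description rdf:about="http://socialsite-a.example.com/people/jdoe">
        <!-- both profile URIs denote the same real-world person -->
        <owl:sameAs rdf:resource="http://socialsite-b.example.com/users/jdoe"/>
    </rdf:Description>
</rdf:RDF>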

SIOC-compliant social data can be mashed up with other data sources, such as sensor data, to reveal very useful information about events related to logistics, public safety, or political unrest at a particular location, for example. With the advent of GPS-enabled cameras and cell phones, temporal and spatial context can be added to better describe those events. The W3C Time Ontology in OWL (OWL-Time) and the Basic Geo Vocabulary have been developed for that purpose.

Thursday, June 11, 2009

Publishing Government Data to the Linked Open Data (LOD) Cloud

In a previous post, I outlined a roadmap for migrating news content to the Semantic Web and the Linked Open Data (LOD) cloud. The BBC has been doing some interesting work in that space by using Linked Data principles to connect BBC Programmes and BBC Music to MusicBrainz and DBpedia. SPARQL endpoints are now available for querying the BBC datasets.

It is clear that Europe is ahead of the US and Canada in terms of Semantic Web research and adoption. The Europeans are likely to further extend their lead with the announcement this week that Tim Berners-Lee (the visionary behind the World Wide Web, the Semantic Web, and the Linked Open Data movement) will be advising the UK Government on making government data more open and accessible.



In the US, data.gov is part of the Open Government Initiative of the Obama Administration. The following is an excerpt from data.gov:

A primary goal of Data.gov is to improve access to Federal data and expand creative use of those data beyond the walls of government by encouraging innovative ideas (e.g., web applications). Data.gov strives to make government more transparent and is committed to creating an unprecedented level of openness in Government. The openness derived from Data.gov will strengthen our Nation's democracy and promote efficiency and effectiveness in Government.

Governments around the world have taken notice and are now considering similar initiatives. It is clear that these initiatives are important for the proper functioning of democracy, since they allow citizens to make informed decisions based on facts, as opposed to the politicized views of special interests, lobbyists, and their spin doctors. These facts relate to important subjects such as health care, the environment, the criminal justice system, and education. There is an ongoing debate in the tech community about the best approach for publishing these datasets. There are several government data standards available, such as the National Information Exchange Model (NIEM). In the Web 2.0 world, RESTful APIs with Atom, XML, and JSON representation formats have become the norm.

I believe however that Semantic Web technologies and Linked Data principles offer unique capabilities in terms of bridging data silos, queries, reasoning, and visualization of the data. Again, the methodology for adopting Semantic Web technologies is the same:

  1. Create an OWL ontology that is flexible enough to support the different types of data in the dataset, including statistical data. This is certainly the most important and challenging part of the effort.
  2. Convert the data from its source format to RDF (a short conversion sketch follows this list). For example, XSLT 2.0 can be used to convert CSV or TSV files to RDF/XML and XHTML+RDFa. There are also RDFizers such as D2R for relational data sources.
  3. Link the data to other data sources such as Geonames, Federal Election Commission (FEC), and US Census datasets.
  4. Provide an RDF dump for Semantic Web crawlers and a SPARQL endpoint for querying the datasets.
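
Here is a minimal sketch of step 2 with Python's rdflib: converting rows of a CSV dataset to RDF. The column names and the data.example.gov vocabulary are hypothetical; a real conversion would be driven by the ontology from step 1.

```python
# A sketch of converting a CSV dataset to RDF with rdflib; the column
# names and the data.example.gov vocabulary are hypothetical
import csv
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

DATA = Namespace("http://data.example.gov/def/")  # hypothetical vocabulary
g = Graph()

with open("bird_strikes.csv", newline="") as f:  # hypothetical dataset
    for i, row in enumerate(csv.DictReader(f)):
        strike = URIRef(f"http://data.example.gov/strike/{i}")
        g.add((strike, RDF.type, DATA.BirdStrike))
        g.add((strike, DATA.airport, Literal(row["airport"])))
        g.add((strike, DATA.date, Literal(row["date"], datatype=XSD.date)))

# Step 4: the resulting file can be published as an RDF dump
g.serialize(destination="bird_strikes.rdf", format="xml")
```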

The following are some of the benefits of this approach:

  • It allows users to ask sophisticated questions against the datasets using the SPARQL query language. These are the kinds of questions that a journalist, a researcher, or a concerned citizen might have in mind. For example: which airport has the highest number of reported aircraft bird strikes? (Read more here about why Transportation Secretary Ray LaHood rejected a proposal by the FAA to keep bird strike data secret.) A sketch of this query appears after this list. Currently, data.gov provides only full-text and category-based search.
  • It bridges data silos by allowing users to make queries and connect data in meaningful ways across datasets. For example, a query that correlates health care, environment, and census data.
  • It provides powerful visualizations of the data through Semantic Web meshups.
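
For illustration, the bird strike question could be phrased as a SPARQL query along these lines, run against the RDF produced by the conversion sketch above (the vocabulary is still the hypothetical one used there):

```python
# The bird strike question as a SPARQL query, run with rdflib against
# the RDF produced above; the vocabulary is the same hypothetical one
from rdflib import Graph

g = Graph()
g.parse("bird_strikes.rdf", format="xml")

results = g.query("""
    PREFIX data: <http://data.example.gov/def/>

    # Which airport has the highest number of reported bird strikes?
    SELECT ?airport (COUNT(?strike) AS ?strikes)
    WHERE { ?strike a data:BirdStrike ; data:airport ?airport . }
    GROUP BY ?airport
    ORDER BY DESC(?strikes)
    LIMIT 1
""")

for airport, strikes in results:
    print(airport, strikes)
```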

Thursday, March 26, 2009

News Content APIs: Uniform Access or Content Silos

The first generation of online news content applications was designed for consumption by humans. With the massive amounts of online content now available, machine-processable structured data will be the key to findability and relevance. Major news organizations like the New York Times (NYT), National Public Radio (NPR), and The Guardian have recently opened up their content repositories through APIs.

These APIs have generated a lot of excitement in the content developer community and are certainly a significant step forward in the evolution of how news content is processed and consumed on the web. The APIs allow developers to create interesting new mashup applications. An example of such a mashup is a map of the United States showing how the stimulus money is being spent by municipalities across the country, with hotspots to local newspaper articles about corruption investigations related to the spending. The stimulus spending data will be provided by the Stimulus Feed on the recovery.gov site, as specified by the "Initial Implementation Guidance for the American Recovery and Reinvestment Act" document. This is certainly an example of a mashup that US taxpayers will like.

For news organizations, these APIs represent an opportunity to grow their ad network by pushing their content to more sites on the web. That's the idea behind the recent release of The Guardian Open Platform API.

APIs and Content Silos

The emerging news content APIs typically offer a REST or SOAP web services interface and return content in XML, JSON, or ATOM feeds. However, despite the excitement that they generate, these APIs can quickly turn into content silos for the following reasons:

  • The structure of the content is often based on a proprietary schema. This introduces several potential interoperability issues for API users in terms of content structure, content types, and semantics.
  • It is not trivial to link content across APIs.
  • Each API provides its own query syntax. There is a need for universal data browsers and a query language to read, navigate, crawl, and query structured content from different sources.

XML, XSD, XSLT, and XQuery

Migrating content from HTML to XML (so-called document-oriented XML) has many benefits. XSLT enables media-independent publishing (single sourcing) to multiple devices such as Amazon's Kindle e-reader and smartphones. With XQuery, sophisticated contextualized database-like queries can be performed, turning the content itself into a database. In addition, XQuery allows the dynamic assembly of content, where new compound documents can be constructed on the fly from multiple documents and external data sources. This allows publishers to repurpose content into new information products as needed to satisfy new customer demands and market opportunities.

However, XSD, XSLT, and XQuery operate at the syntax level. The next level up in the content technology stack is semantics and reasoning, and that's where RDF, OWL, and SPARQL come into play. To illustrate the issue, consider three news organizations, each with its own XML Schema for describing news articles. To describe the author of an article, the first news organization uses the <creator> element, the second the <byline> element, and the third the <author> element. All three element names have exactly the same meaning. Using an OWL ontology, we can establish that these three terms are equivalent, as the sketch below shows.
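
The following rdflib sketch declares the equivalence in OWL; the three publisher namespaces are made up for illustration:

```python
# A sketch of declaring the three author properties equivalent in OWL;
# the publisher namespaces are made up for illustration
from rdflib import Graph, Namespace
from rdflib.namespace import OWL

PUB1 = Namespace("http://publisher-one.example/ns#")    # uses <creator>
PUB2 = Namespace("http://publisher-two.example/ns#")    # uses <byline>
PUB3 = Namespace("http://publisher-three.example/ns#")  # uses <author>

g = Graph()

# owl:equivalentProperty is symmetric and transitive, so chaining two
# statements is enough to make all three properties pairwise equivalent
g.add((PUB1.creator, OWL.equivalentProperty, PUB2.byline))
g.add((PUB2.byline, OWL.equivalentProperty, PUB3.author))

print(g.serialize(format="turtle"))
```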

Semantic Web and Linked Data to the Rescue

Semantic Web technologies such as RDF, OWL, and SPARQL can help us close the semantic gap and also open up new opportunities for publishers. Furthermore, with the decline in ad revenues, news organizations are now considering charging users for access to content online. Semantic Web technologies can enrich content by providing new ways to discover and explore content based on user context and interests. An interesting example is a mashup application built by Zemanta called Guardian topic researchr, which extracts entities (people, places, organizations, etc.) from The Guardian Open Platform API query results and allows readers to explore those entities further. In addition, the recently unveiled Newssift site by the Financial Times is an indication that the industry is starting to pay attention to the benefits of "semantic search" as opposed to keyword search.

The rest of this post outlines some practical steps for migrating news content to the Semantic Web. For existing news content APIs, an interim solution is to create Semantic Web wrappers around these APIs (more on that later). The long term objective however should be to fully embrace the Semantic Web and adopt Linked Data principles in publishing news content.

Adopt the International Press Telecommunication Council (IPTC) News Architecture (NAR)

The main reason for adopting the NAR is interoperability at the content structure, content types, and semantic levels. Imagine a mashup developer trying to integrate news content from three different news organizations. In addition to using three different element names (<creator>, <byline>, and <author>) to describe the same concept, these three organizations use completely different XML Schemas to describe the structure and types of their respective news content. That can lead to a data mapping nightmare for the mashup developer, and the problem will only get worse as the number of news sources increases.

The NAR content model defines four high-level elements: newsItem, packageItem, conceptItem, and knowledgeItem. You don't have to manage your content internally using the XML structure defined by the NAR. However, you should be able to map and export your content to the NAR as a delivery format. If you have fields in your content repository that do not map to the NAR structure, then you should extend the standard NAR XML Schema using the appropriate XML Schema extension mechanism, so that your extension elements are clearly identified in your own XML namespace. Provide a mechanism such as dereferenceable URIs to allow users to obtain the meaning of these extension elements.

The same logic applies to the news taxonomy that you use. Adopting the IPTC News Codes, which specify some 1300 terms for categorizing news content, will greatly facilitate interoperability as well.

Adopt or Create a News Ontology

Several news ontologies in RDFS or OWL format are now available. The IPTC is in the process of creating an IPTC news ontology in OWL format. To facilitate semantic interoperability, news organizations should use this ontology when it becomes available. In mapping XML Schemas to OWL, ontology best practices should be followed. For example, if the mapping is done automatically, container elements in the XML Schema could generate blank nodes in the RDF graph. However, blank nodes cannot be used for external RDF links and are not recommended for Linked Data applications. Also, RDF reification, RDF containers, and RDF collections are not SPARQL-friendly and should be avoided.

While creating the news ontology, you should reuse or link to other existing ontologies such as FOAF and Dublin Core, using constructs like owl:equivalentProperty, owl:equivalentClass, rdfs:subClassOf, or rdfs:subPropertyOf.

Similarly, existing taxonomies should be mapped to an RDF-compatible format using the SKOS specification, as sketched below. This makes it possible to use an owl:Restriction to constrain the value of a property in the OWL ontology to be a skos:Concept or skos:ConceptScheme.
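
As a sketch, here is how a couple of taxonomy terms could be recast as skos:Concepts in a concept scheme using rdflib (the URIs are illustrative, not actual IPTC NewsCodes):

```python
# A sketch of recasting taxonomy terms as SKOS concepts with rdflib;
# the URIs are illustrative, not actual IPTC NewsCodes
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

CODES = Namespace("http://codes.example.org/subject/")  # hypothetical

g = Graph()
g.add((CODES.scheme, RDF.type, SKOS.ConceptScheme))

g.add((CODES.politics, RDF.type, SKOS.Concept))
g.add((CODES.politics, SKOS.prefLabel, Literal("politics", lang="en")))
g.add((CODES.politics, SKOS.inScheme, CODES.scheme))

g.add((CODES.elections, RDF.type, SKOS.Concept))
g.add((CODES.elections, SKOS.prefLabel, Literal("elections", lang="en")))
g.add((CODES.elections, SKOS.broader, CODES.politics))  # hierarchy link
g.add((CODES.elections, SKOS.inScheme, CODES.scheme))
```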

Generate RDF Data

Assign a dereferenceable HTTP URI to each news item and use content negotiation to provide both an XHTML and an RDF/XML representation of the resource. When the resource is requested, an HTTP 303 See Other redirect serves the XHTML or RDF/XML representation, depending on whether the client's Accept header asks for text/html or application/rdf+xml; a minimal sketch follows. The W3C Best Practice Recipes for Publishing RDF Vocabularies explain how dereferenceable URIs and content negotiation work in the Semantic Web.
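
Here is a deliberately simplified sketch of such a 303 redirect using the Flask web framework in Python; the URI layout is illustrative:

```python
# A simplified content negotiation sketch with Flask: a request for a
# news item's identifier URI is answered with a 303 See Other redirect
# to the HTML or RDF/XML representation; the URI layout is illustrative
from flask import Flask, redirect, request

app = Flask(__name__)

@app.route("/id/news/<item>")
def news_item(item):
    # Real implementations should parse the Accept header properly
    if "application/rdf+xml" in request.headers.get("Accept", ""):
        return redirect(f"/data/news/{item}.rdf", code=303)
    return redirect(f"/page/news/{item}.html", code=303)
```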

The RDF data can be generated using a variety of techniques. For example, you can use an XSLT-based RDFizer to extract RDF/XML from news items already marked up in XML, as in the sketch below. There are also RDFizers for relational databases. Entity extraction tools like OpenCalais can also be useful, particularly for extracting RDF metadata from legacy news items available in HTML format.
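
A sketch of an XSLT-based RDFizer using Python's lxml library is shown below; both file names stand in for whatever source schema and mapping stylesheet a publisher actually uses:

```python
# A sketch of an XSLT-based RDFizer with lxml; both file names stand in
# for a publisher's actual XML schema and mapping stylesheet
from lxml import etree

transform = etree.XSLT(etree.parse("newsitem_to_rdf.xsl"))
news_item = etree.parse("newsitem.xml")

rdf_xml = transform(news_item)  # the RDF/XML produced by the stylesheet
with open("newsitem.rdf", "wb") as f:
    f.write(etree.tostring(rdf_xml, pretty_print=True))
```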

Link the RDF data to external data sources such as DBpedia and Geonames using RDF links from existing vocabularies such as FOAF. For example, an article about US Treasury Secretary Timothy Geithner can use foaf:based_near to link the news item to a resource describing Washington, DC on DBpedia. If there is an HTTP URI that describes the same resource in another data source, then use owl:sameAs to link the two resources. For example, if a news item is about Timothy Geithner, then you can use owl:sameAs to link to Timothy Geithner's data page on DBpedia. An RDF browser like Tabulator can traverse those links and help the reader explore more information about topics of interest.


Expose a SPARQL Endpoint

Use a Semantic Sitemap (an extension to the Sitemap Protocol) to advertise the location of the SPARQL endpoint or an RDF dump to Semantic Web clients and crawlers. OpenLink Virtuoso is an RDF store that also provides a SPARQL endpoint.

Provide a user interface for performing semantic searches. Expose the RDF metadata as facets for browsing the news items.

Provide a Semantic Web Wrapper for Existing APIs

A wrapper provides a dereferenceable URI for every news item available through an existing news content API. When an RDF browser requests the news item, the Semantic Web wrapper translates the request into an API call, transforms the response from XML into RDF, and sends it back to the Semantic Web client, as sketched below. The RDF Book Mashup is an example of how a Semantic Web wrapper can be used to integrate publicly available APIs from Amazon, Google, and Yahoo into the Semantic Web.
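
Below is a toy sketch of such a wrapper in Python (Flask, requests, and rdflib); the API URL and its response fields are hypothetical:

```python
# A toy Semantic Web wrapper: dereferencing a news item URI triggers a
# call to the underlying JSON API, and the response is translated to
# RDF on the fly; the API URL and its fields are hypothetical
import requests
from flask import Flask, Response
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC

app = Flask(__name__)

@app.route("/data/news/<item_id>")
def wrapped_item(item_id):
    api = requests.get(
        f"https://api.news.example.org/items/{item_id}", timeout=10
    ).json()

    g = Graph()
    item = URIRef(f"http://news.example.org/id/news/{item_id}")
    g.add((item, DC.title, Literal(api["headline"])))  # hypothetical field
    g.add((item, DC.creator, Literal(api["byline"])))  # hypothetical field

    return Response(g.serialize(format="xml"), mimetype="application/rdf+xml")
```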

Conclusion

The Semantic Web is still an obscure topic in the mainstream developer community. I hope I've outlined a few practical steps you can take now to take advantage of the new Web of Linked Data.

Monday, December 29, 2008

From Web 2.0 to the Semantic Web: Bridging the Gap in the News and Media Industry

The news and media industry is going through fundamental changes. This transformation is driven by the current economic downturn and the emergence of the web as the new platform for creating and publishing content. The decline in ad revenues has forced some media companies to cancel their print publications (e.g. PC Magazine and the Christian Science Monitor). Others such as Forbes Media are consolidating their online and editorial groups into a single entity.

What are the opportunities and challenges of the Semantic Web (sometimes referred to as Web 3.0) for the industry and how can these companies embrace and extend the new web of Linked Data?

Widgets and APIs

Media companies are looking for ways to gain a competitive advantage from their web offerings. In addition to Web 2.0 features like blogs and RSS feeds, major news organizations such as the New York Times (NYT) and National Public Radio (NPR) are opening their content to external developers through APIs. These APIs allow developers to mash up content in new ways limited only by their own imagination. The ultimate goal is to drive ad revenues by pushing content to places like social networking sites and blogs where readers "hang out" online. Another interesting example is the Times Widget (from the NYT), which allows readers to insert NYT content such as news headlines, stock quotes, and recipes on their personal web pages or blogs.

Beyond Web 2.0: the Semantic Web

Today's end users obtain information from a variety of sources such as online newspapers, blogs, Wikipedia, YouTube videos, Flickr photos, cartoons, animations, and social networking sites such as Facebook and Twitter. I personally get some of the news I read from the people I follow on Twitter (including political cartoons at http://twitter.com/dcagle). The challenge is to integrate all these sources of information into a seamless search and browsing experience for readers. Publishers are starting to realize the importance of this integration as illustrated by the recently unveiled "Times Extra" feature from the NYT.

The key to this integration is metadata, and this is where Semantic Web technologies such as RDF, OWL, and SPARQL have an important role to play in presenting massive amounts of content to end users in a compelling way. Semantic search and browsing, as well as inference capabilities, can help publishers in their efforts to attract and retain readers and boost ad revenues.

News and Media Industry Metadata Standards

Established in 1965, the International Press Telecommunications Council (IPTC) is a consortium of news agencies and publishers including the Associated Press, the NYT, Reuters, and Dow Jones & Company. The IPTC maintains and publishes a set of news exchange and metadata standards for media types such as text, photos, graphics, and streaming media like audio and video. These include:

  • NITF which defines the content and structure of news articles
  • NewsML 1 for the packaging and exchange of multimedia news
  • NewsML-G2, the latest standard for the exchange of all kinds of media types
  • EventsML-G2 for news events
  • SportsML for sports data

These standards are all based on XML and use XML Schema Definitions (XSDs) to describe the structure and content of the media types. In addition, IPTC defines a taxonomy for news items called NewsCodes as well as a Photo Metadata Standard based on Adobe's XMP specification.

Another interesting standard in the news and media industry is the Publishing Requirements for Industry Standard Metadata (PRISM) specification. Also based on XML, PRISM is more applicable to magazines and journals and is compatible with RDF. There is a certain degree of overlap between PRISM and the IPTC news metadata standards. Media companies that have adopted these standards are well positioned to bridge the gap to the Semantic Web.

From XML/XSD to RDF/OWL

The W3C Semantic Web Activity and the Semantic Web community have been working on a number of specifications and tools to facilitate the transition to the Semantic Web. The W3C Gleaning Resource Descriptions from Dialects of Languages (GRDDL) specification defines a method for using XSLT to extract RDF statements from existing XML documents. For example, using GRDDL, a news organization would be able to generate RDF from content items marked up in NITF or NewsML.

The recently approved RDFa W3C specification allows publishers to get their content ready for the Semantic Web by adding extension attributes to XHTML content to capture semantic information based on vocabularies such as Dublin Core or FOAF.

The W3C Simple Knowledge Organization System (SKOS) is an application of RDF that can be used to represent existing taxonomies and classification schemes (such as the IPTC NewsCodes) in a way that is compatible with the Semantic Web. SPARQL is a query language for RDF that is supported in Semantic Web toolkits such as Jena.

There are also tools that can be used to derive an OWL ontology from an existing XSD. This transformation can be straightforward if the XSD was designed to be compatible with RDF. As an example, an OWL ontology can be derived from the IPTC's NewsML-G2 XSDs. Once created, new ontologies can be linked to existing ontologies such as Dublin Core, FOAF, and DBpedia as described by Tim Berners-Lee in his vision of Linked Data. Other relevant ontologies for the news and media industry include MPEG-7, the Core Ontology for Multimedia (COMM), and Geonames.

From Web 2.0 Tagging to Semantic Annotations

To facilitate the transition of the industry to the Semantic Web, it will be important to design content management system (CMS) interfaces that make it easy for content contributors to add semantic annotations to their content. These systems certainly have a lot to learn from Web 2.0 tagging interfaces such as Flickr clustering and machine tags. However, the complexity of content in the news and media industry demands more sophisticated annotation capabilities than those available to the masses on YouTube and Flickr. Therefore, full support for Semantic Web standards like RDF, OWL, SKOS, and SPARQL will be expected.

An interesting application in this space is Thomson Reuters' Calais, which allows content publishers to automatically extract RDF-based metadata about entities such as people, places, facts, and events. LinkedFacts.com is an example of a semantic news search application powered by Calais.

Calais has been integrated with the open source content management system (CMS) Alfresco to enable auto-tagging of content as well as the automatic suggestion of tags to content contributors. The latest release of Calais adds the ability to generate links to other Linked Data sources such as DBpedia, GeoNames, and the CIA World Factbook.

Conclusion

With the rapid growth of news content online, relevance is going to become increasingly important, and relevance requires metadata and structured data in general. By exposing content in XML format, news content APIs are certainly a step in the right direction. However, these APIs have their own limitations and can turn into content silos. Beyond APIs, Semantic Web technologies and Linked Data principles will ensure uniform and intelligent access to news content. This will be the key to reader retention and content monetization.