Saturday, November 12, 2011

Thoughts on the Query Health Initiative

I have been following the Query Health Initiative of the ONC Standards and Interoperability Framework with great interest. The following are key goals of the Query Health Initiative:

  • Identify standards and services for distributed population health queries to EHRs, HIEs, and other clinical data sources such as registries.

  • Define a framework to allow partners to create their own distributed query networks (with or without intermediaries called "Network Data Partners"). I believe that the solution should not mandate a specific implementation in order to foster innovation in the field.

  • Support a number of use cases such as quality measures reporting, public health surveillance, comparative effectiveness research (CER), and patient-centered outcomes research (PCOR).

  • Support queries over a common and extensible Clinical Information Model (CIM).

  • Support security, audit trails, privacy, patient consent directives, and other policy and legal requirements. Techniques such as the de-identification of data will be essential to maintaining privacy.

  • Create a solution that can be implemented with a financially sustainable model. I think there are many lessons to be learned here from failed or struggling Health Information Exchange (HIE) initiatives.


The Query Health Initiative supports a distributed model as opposed to a centralized model. This allows data to be kept in the originating systems, where it is securely queried, and only the results are aggregated.
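
To make the distributed model concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions, not something defined by Query Health: the `Site` class, the `distributed_count` function, and the record fields are all hypothetical. Each site evaluates the same declarative criteria against its own records and returns only a count, which a coordinator then sums.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical record shape: a flat dict of field -> value.
Record = Dict[str, object]
# Hypothetical declarative criteria: every listed field must equal the given value.
Criteria = Dict[str, object]


@dataclass
class Site:
    """A data partner that keeps its records local and answers with counts only."""
    name: str
    records: List[Record]

    def run(self, criteria: Criteria) -> int:
        # Only the aggregate count leaves the site, never the rows themselves.
        return sum(
            1
            for record in self.records
            if all(record.get(field) == value for field, value in criteria.items())
        )


def distributed_count(sites: List[Site], criteria: Criteria) -> Tuple[Dict[str, int], int]:
    """Send the same criteria to every site and sum the per-site counts."""
    per_site = {site.name: site.run(criteria) for site in sites}
    return per_site, sum(per_site.values())


# Usage with made-up data for two hypothetical sites.
sites = [
    Site("clinic_a", [{"diagnosis": "diabetes"}, {"diagnosis": "asthma"}]),
    Site("clinic_b", [{"diagnosis": "diabetes"}, {"diagnosis": "diabetes"}]),
]
per_site, total = distributed_count(sites, {"diagnosis": "diabetes"})
print(per_site, total)
```

A real network would of course add transport security, authorization, auditing, and de-identification on top of this; the sketch only shows where the data stays and what travels.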

Previous initiatives to create such a distributed query health network include:

  • i2b2 (Informatics for Integrating Biology and the Bedside) and SHRINE (Shared Health Research Information Network) - a scalable informatics framework that will enable clinical researchers to use existing clinical data as well as genomic data for research and discovery.

  • hQuery - an open source project by MITRE which leverages the ability of certified EHRs to produce C32 or CCR documents. hQuery is based on a MongoDB document database and uses JavaScript Map and Reduce functions.

  • PopMedNet – a multi-purpose distributed network for secondary use of EHR, administrative, claims, and registry data.
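
For readers unfamiliar with the map and reduce style that hQuery relies on, here is a rough Python analogue. hQuery itself uses JavaScript functions running against MongoDB; this sketch does not reproduce its API, and the document fields (`id`, `conditions`) and function names are assumptions I made up for illustration. The map step emits a key and a value for each patient document, and the reduce step combines all the values emitted for the same key.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_patient(patient: dict) -> Iterator[Tuple[str, int]]:
    """Map step: emit (condition, 1) once per distinct condition on a patient."""
    for condition in set(patient.get("conditions", [])):
        yield condition, 1


def reduce_counts(key: str, values: List[int]) -> int:
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def map_reduce(patients: Iterable[dict]) -> Dict[str, int]:
    """Group the mapped values by key, then reduce each group."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for patient in patients:
        for key, value in map_patient(patient):
            grouped[key].append(value)
    return {key: reduce_counts(key, values) for key, values in grouped.items()}


# Usage with made-up patient documents.
patients = [
    {"id": 1, "conditions": ["diabetes", "hypertension"]},
    {"id": 2, "conditions": ["diabetes"]},
    {"id": 3, "conditions": []},
]
condition_counts = map_reduce(patients)
print(condition_counts)
```

Note that the query logic lives in procedural functions, which is the property I come back to later in this post.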


In addition, large health enterprises are investing considerable effort and resources in building clinical data warehouses and analytics capabilities for their own internal needs. Examples are the Enterprise Data Trust at Mayo Clinic and the STRIDE (Stanford Translational Research Integrated Database Environment) project at Stanford University. These existing systems may eventually have to participate in distributed population health query networks. Most of these systems are based on SQL databases, which have reached a high level of maturity in terms of scalability, performance, and the availability of data processing and analytics techniques and tools.

However, NoSQL and alternatives based on Big Data Analytics (e.g. Hadoop and Hive) are currently making significant inroads into the enterprise. Emerging NoSQL databases include key-value stores, document databases, graph databases, and triple stores such as those based on the SPARQL query language for RDF. In some cases, these NoSQL databases provide superior scalability when compared to SQL databases. They are increasingly popular with developers because they simplify the application development process by eliminating the need for Object Relational Mapping (ORM). For example, in document-oriented databases such as MongoDB (which is used in hQuery), objects are persisted as JSON documents.

So, I believe that the technical choices made for the Query Health Initiative should be grounded in the current reality of enterprise data management.

  1. First, I think that queries should be formulated in a declarative as opposed to a procedural manner. This rules out an approach based on JavaScript.

  2. Second, I believe that queries should be formulated in an established query language. By established query language, I mean a standard like SQL, SPARQL, or XQuery that was designed specifically for the purpose of querying data stores. This rules out standards such as the HL7 Health Quality Measures Format (HQMF) or any implementation of the HL7 CDA. In fact, quality measures reporting is just one of many use cases in Query Health. In my opinion, in a value-based healthcare system, patient-centered outcome measurement is even more important than quality measures, which are essentially process measures and do not necessarily correlate with improved patient outcomes.

    By the way, I believe this same principle should extend to other ONC Standards and Interoperability efforts such as the Data Segmentation Initiative, which is trying to define an interoperable approach to implementing privacy policies, consent directives, and authorizations. The Data Segmentation Initiative should embrace the approach taken by the OASIS Cross-Enterprise Security and Privacy Authorization (XSPA), which consists of defining healthcare profiles for well-established and recognized standards such as SAML, XACML, and WS-Trust. This contrasts with an approach that would consist of creating a CDA implementation for patient consent directives. This discussion on patient consent directives is relevant to the Query Health Initiative.

    i2b2, which is one of the projects considered by the Query Health Initiative, uses SQL. i2b2 is a well-engineered, robust, and proven architecture. The i2b2 data model is based on the "star schema", which has a central "fact" table where each row represents an observation about a patient. The current implementation of the clinical research chart (CRC) in i2b2 is based on the Oracle and Microsoft SQL Server databases.

    The maturity of SQL-based tools could be a deciding factor.

  3. Third, the adopted solution should leave the door open to innovation by giving participants the choice of embracing alternative and emerging solutions such as SPARQL or UnSQL, a newly proposed query language for NoSQL document databases. Erik Meijer and Gavin Bierman from Microsoft Research wrote a paper titled "A co-Relational Model of Data for Large Shared Data Banks" where they propose a new common query language for both SQL and NoSQL databases called coSQL.
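
Going back to the first two points, here is a small sketch of what a declarative SQL query over a star schema can look like. It is an illustration only: it uses Python's built-in sqlite3 module instead of Oracle or SQL Server, the table and column names are loosely modeled on i2b2 but heavily simplified, and the concept codes and rows are made up. The point is that the query states what is wanted (distinct patients per condition and sex) and leaves the how to the database engine.

```python
import sqlite3

# In-memory database standing in for a clinical data warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient_dimension (patient_num INTEGER PRIMARY KEY, sex_cd TEXT);
CREATE TABLE concept_dimension (concept_cd TEXT PRIMARY KEY, name_char TEXT);
-- Central fact table: one row per observation about a patient.
CREATE TABLE observation_fact (
    patient_num INTEGER REFERENCES patient_dimension(patient_num),
    concept_cd  TEXT REFERENCES concept_dimension(concept_cd),
    start_date  TEXT
);
""")

# Made-up sample data.
conn.executemany("INSERT INTO patient_dimension VALUES (?, ?)",
                 [(1, "F"), (2, "M"), (3, "F")])
conn.executemany("INSERT INTO concept_dimension VALUES (?, ?)",
                 [("DX:DM2", "Type 2 diabetes"), ("DX:HTN", "Hypertension")])
conn.executemany("INSERT INTO observation_fact VALUES (?, ?, ?)", [
    (1, "DX:DM2", "2011-01-10"),
    (1, "DX:DM2", "2011-06-02"),  # second observation for the same patient
    (2, "DX:DM2", "2011-03-15"),
    (3, "DX:HTN", "2011-04-20"),
    (1, "DX:HTN", "2011-05-05"),
])

# Declarative query: count distinct patients per condition and sex.
QUERY = """
SELECT c.name_char, p.sex_cd, COUNT(DISTINCT f.patient_num) AS patients
FROM observation_fact AS f
JOIN patient_dimension AS p ON p.patient_num = f.patient_num
JOIN concept_dimension AS c ON c.concept_cd = f.concept_cd
GROUP BY c.name_char, p.sex_cd
ORDER BY c.name_char, p.sex_cd
"""
rows = conn.execute(QUERY).fetchall()
for name, sex, patients in rows:
    print(name, sex, patients)
```

Because the query is plain SQL text, it can be shipped to any participating site that exposes the same logical schema, whatever database engine sits underneath.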


In the era of Clinical Question Answering (CQA), Natural Language Processing (NLP) and ontologies will play a critical role in clinical data repositories. SPARQL-based queries, when combined with the use of ontologies, could offer significant advantages over traditional SQL-based systems (see my previous post titled "Why do we Need Ontologies in Healthcare Applications"). Furthermore, standard vocabularies (such as SNOMED CT) and value sets are essential components of clinical data repositories. These terminologies are often derived from ontologies, so a solution that integrates well with ontologies will be important. I believe that the CTS2 specification satisfies all the vocabulary and value set requirements for Query Health. CTS2 is also currently being implemented by commercial vocabulary tool vendors and various open source projects.

The W3C R2RML (RDB to RDF Mapping Language) specification allows existing applications to provide an RDF view over relational databases. The SPARQL 1.1 query language supports key Query Health requirements such as aggregates, grouping, and subqueries, while the SPARQL 1.1 Federation Extensions specification supports federated queries.

The Translational Medicine Ontology is designed as a unifying ontology for the integration of EHR, genomic, treatment, drug, and other types of clinical data. This allows the creation of knowledge bases that can be queried with SPARQL to answer important questions related to clinical research as well as patient care. If you are interested in this topic, I highly recommend these two papers: