Sunday, April 28, 2013

How I Make Technology Decisions

The open source community has responded to the increasing complexity of software systems by creating many frameworks which are supposed to facilitate the work of developing software. Software developers spend a considerable amount of time researching, learning, and integrating these frameworks to build new software products. Selecting the wrong technology can cost an organization millions of dollars. In this post, I describe my approach to selecting these frameworks. I also discuss the frameworks that have made it to my software development toolbox.

Understanding the Business


The first step is to build a strong understanding of the following:

  • The business goals and challenges of the organization. For example, the healthcare industry is currently shifting to a value-based payment model in a tightening regulatory environment. Healthcare organizations are looking for a computing infrastructure that supports new demands such as the Accountable Care Organization (ACO) model, patient-centered outcomes, patient engagement, care coordination, quality measures, bundled payments, and Patient-Centered Medical Homes (PCMH).

  • The intended buyers and users of the system and their concerns. For example: What are their pain points? Which devices are they using? What are their security and privacy concerns?

  • The standards and regulations of the industry.

  • The competitive landscape in the industry. To build a system that is relevant, it is important to have some idea of the following: Who is the competition? What are the current capabilities of their systems? What is on their road map? What are customers saying about their products? This knowledge can help shape a Blue Ocean Strategy.

  • Emerging trends in technologies.

This type of knowledge comes with industry experience and a habit of continuously paying attention to these issues. For example, on a daily basis, I read industry news as well as scientific and technical publications. As a member of the American Medical Informatics Association (AMIA), I receive the latest issue of the Journal of the American Medical Informatics Association (JAMIA), which gives me access to cutting-edge research in medical informatics. I speak at industry conferences when possible; this allows me not only to hone my presentation skills, but also to attend all sessions for free or at a discounted price. For the latest in software development, I turn to publications like InfoQ, DZone, and TechCrunch.

To better understand the users and their needs and concerns, I perform early usability testing (using sketches, wireframes, or mockups) to test design ideas and obtain feedback before actual development starts. For generating innovative design ideas, I recommend the following book: Universal Methods of Design: 100 Ways to Research Complex Problems, Develop Innovative Ideas, and Design Effective Solutions by Bruce Hanington and Bella Martin.

 

Architecting the Solution


Armed with a solid understanding of the business and technological landscape as well as the domain, I can start creating a solution architecture. Software development projects can be chaotic. Based on my experience working on many software development projects across industries, I found that Domain Driven Design (DDD) can help foster a disciplined approach to software development. For more on my experience with DDD, see my previous post entitled How Not to Build A Big Ball of Mud, Part 2.

Frameworks evolve over time. So, I make sure that the architecture is framework-agnostic and focused on supporting the domain. This allows me to retrofit the system in the future with new frameworks as they emerge.
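As a minimal sketch of what "framework-agnostic" can look like in code, the example below keeps the domain object and the repository interface free of any framework, and pushes storage into a replaceable adapter. All names here (Patient, PatientRepository, InMemoryPatientRepository, RegistrationService) are hypothetical and chosen only for illustration; the sketch is in plain Python rather than any of the frameworks discussed in this post.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class Patient:
    """Domain object; it knows nothing about any framework or database."""
    patient_id: str
    name: str


class PatientRepository(Protocol):
    """Interface owned by the domain layer (hypothetical name)."""

    def save(self, patient: Patient) -> None: ...

    def find_by_id(self, patient_id: str) -> Optional[Patient]: ...


class InMemoryPatientRepository:
    """One interchangeable adapter; a database-backed one could replace it."""

    def __init__(self) -> None:
        self._rows: Dict[str, Patient] = {}

    def save(self, patient: Patient) -> None:
        self._rows[patient.patient_id] = patient

    def find_by_id(self, patient_id: str) -> Optional[Patient]:
        return self._rows.get(patient_id)


class RegistrationService:
    """Application service that depends only on the interface, never on an adapter."""

    def __init__(self, repository: PatientRepository) -> None:
        self._repository = repository

    def register(self, patient_id: str, name: str) -> Patient:
        if self._repository.find_by_id(patient_id) is not None:
            raise ValueError(f"patient {patient_id} is already registered")
        patient = Patient(patient_id, name)
        self._repository.save(patient)
        return patient


if __name__ == "__main__":
    service = RegistrationService(InMemoryPatientRepository())
    print(service.register("p-1", "Jane Doe"))
```

Because RegistrationService only sees the interface, swapping the in-memory adapter for one built on a newer persistence framework should not require touching the domain code.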


 

Due Diligence


Software development is a rapidly evolving field. I keep my eyes on the radar and try not to drink the vendors' Kool-Aid. For example, not all vendors have a good track record in supporting standards, interoperability, and cross-platform solutions.

The ThoughtWorks Technology Radar is an excellent source of information and analysis on emerging trends in software. Its contributors include software thought leaders like Martin Fowler and Rebecca Parsons. I also look at surveys of the developer community to determine the popularity, community size, and usage statistics of competing frameworks and tools. Sites like InfoQ often conduct these types of surveys, such as the recent InfoQ survey on Top JavaScript MVC Frameworks. I also like Matt Raible's Comparing JVM Web Frameworks.

I value the opinion of recognized experts in the field of interest. I read their books and blogs and watch their presentations. Before formulating my own position, I make sure that I read expert opinions on opposing sides of the argument. For example, in deciding on a pure Java EE vs. Spring Framework approach, I read arguments by experts on both sides (experts like Arun Gupta, Java EE Evangelist at Oracle, and Adrian Colyer, CTO at SpringSource).

Finally, consider a peer review of the architecture using a methodology like the Architecture Tradeoff Analysis Method (ATAM). Simply going through the exercise of explaining the architecture to stakeholders and receiving feedback can significantly help in improving it.


Rapid Prototyping 

 

It's generally a good idea to create a rapid prototype to quickly learn and demonstrate the capabilities and value of the framework to the business. This can also generate excitement in the development team, particularly if the framework can enhance the productivity of developers and make their lives easier.

 

The Frameworks I've Selected


The Spring Framework

I am a big fan of the Spring Framework. I believe it is designed to support the needs of developers from a productivity standpoint. In addition to dependency injection (DI), Aspect Oriented Programming (AOP), and Spring MVC, I like the Spring Data repository abstraction for JPA, MongoDB, Neo4j, and Hadoop. Spring supports Polyglot Persistence and Big Data today. I use Spring Roo for rapid application development, which allows me to focus on modeling the domain. I use the Roo scaffolding feature to generate much of the Spring configuration and Java code for the domain, repository (Roo supports JPA and MongoDB), service, and web layers (Roo supports Spring MVC, JSF, and GWT). Spring also supports unit and integration testing, helped by the recent release of Spring MVC Test.

I use Spring Security, which allows me to use AOP and annotations to secure methods and supports advanced features like Remember Me and regular expressions for URLs. I think that JAAS is too low-level. Spring Security allows me to meet all OWASP Top Ten requirements (see my previous post entitled Application-Level Security in Health IT Systems: A Roadmap).

Spring Social makes it easy to connect a Spring application to social network sites like Facebook, Twitter, and LinkedIn using the OAuth protocol. From a tooling standpoint, the Spring Tool Suite (STS) supports many Spring features and I can deploy directly to Cloud Foundry from STS. I look forward to evaluating Grails and the Play Framework, which use convention over configuration and are built on Groovy and Scala respectively.

Thymeleaf, Twitter Bootstrap, and jQuery

I use Twitter Bootstrap because it is based on HTML5, CSS3, jQuery, and LESS, and also supports a Responsive Web Design (RWD) approach. The size of the component library and of the community is quite impressive.

Thymeleaf is an HTML5 templating engine and a replacement for traditional JSP. It is well integrated with Spring MVC and supports a clear division of labor between back-end and front-end developers. Twitter Bootstrap and Thymeleaf work well together.


AngularJS

For Single Page Applications (SPA), my definitive choice is AngularJS. It provides everything I need including a clean MVC pattern implementation, directives, view routing, Deep Linking (for bookmarking), dependency injection, two-way data binding, and BDD-style unit testing with Jasmine. AngularJS has its own dedicated debugging tool called Batarang. There are also several learning resources (including books) on AngularJS.

Check this page comparing the performance of AngularJS vs. KnockoutJS. This is a survey of the popularity of Top JavaScript MVC Frameworks.

 

D3.js 

D3.js is my favorite for data visualization in data-intensive applications. It is based on HTML5, SVG, and JavaScript. For simple charting and plotting, I use jqPlot, which is based on jQuery. See my previous post entitled Visual Analytics for Clinical Decision Making.

 

I use R for statistical computing, data analysis, and predictive analytics. See my previous post entitled Statistical Computing and Data Mining with R.


Development Tools


My development tools include: Git (Distributed Version Control), Maven or Gradle (build), Jenkins (Continuous Integration), Artifactory (Repository Manager), and Sonar (source code quality management). My testing toolkit includes Mockito, DBUnit, Cucumber JVM, JMeter, and Selenium.

Sunday, April 14, 2013

Addressing Challenges to the Adoption of Clinical Decision Support (CDS) Systems


Despite its potential to improve the quality of care, CDS is not widely used in health care delivery today. In tech marketing parlance, CDS has not crossed the chasm. There are several issues that need to be addressed including:

  • Clinician acceptance of the concept of automated execution of evidence-based clinical guidelines

  • Seamless integration into clinical workflows

  • Usability issues including alert fatigue

  • Standardization to enable the interoperability of CDS knowledge artifacts and executable clinical guidelines

  • Integration with Natural Language Processing (NLP) to allow clinicians to ask clinical questions in natural language (Clinical Question Answering or CQA) and to extract clinical answers from very large amounts of unstructured sources of medical knowledge

  • Integration of predictive risk models based on the analysis of historical data to enable targeted interventions for specific at-risk populations

  • Integration of genomics to enable personalized medicine

  • The use of simulation in CDS to explore and compare the outcomes of various treatment alternatives

  • Integration of outcomes research in the context of a shift to a value-based healthcare delivery system. This can be achieved by incorporating the results of Comparative Effectiveness Research (CER) and Patient-Centered Outcomes Research (PCOR) into CDS systems. Increasingly, outcomes research will be performed using observational studies (based on real world data), which are recognized as complementary to randomized controlled trials (RCTs) for discovering what works and what doesn't work in practice. This is a form of Practice-Based Evidence (PBE) that is necessary to close the evidence loop

  • Support for a shared decision making process that takes into account the values, goals, and wishes of the patient

  • The use of Visual Analytics in CDS to facilitate analytical reasoning over very large amounts of structured and unstructured data.

In this post, I share my thoughts on CDS interoperability, integration with clinical workflows, and natural language processing (NLP).

 

Interoperable and Executable Clinical Guidelines


The complexity and cost inherent in capturing the medical knowledge in clinical guidelines and translating that knowledge into executable code remain an impediment to the widespread adoption of CDS software. Therefore, there is a need for standards for the sharing and interchange of CDS knowledge artifacts and executable clinical guidelines.

Different formalisms, methodologies, and architectures have been proposed over the years for representing the medical knowledge in clinical guidelines. Examples include, but are not limited to, the following:

  • The Arden Syntax
  • GLIF (Guideline Interchange Format)
  • GELLO (Guideline Expression Language Object-Oriented)
  • GEM (Guidelines Element Model)
  • The Web Ontology Language (OWL)
  • PROforma
  • EON
  • PRODIGY
  • Asbru
  • GUIDE
  • SAGE.

More recently, the ONC Health eDecision Initiative has published the following specifications:



Because of the complexity and cost of developing CDS software, CDS software capabilities can be exposed as a set of services (part of a Service Oriented Architecture) that can be consumed by other client health IT systems such as EHR and Computerized Physician Order Entry (CPOE) systems. To reduce costs, these CDS software services can be shared by several health care providers. Consensus was recently achieved on the ONC Health eDecision CDS Guidance Service Use Case.

Enabling the interoperability of executable clinical guidelines requires a standardized domain model for representing the medical information of patients and other contextual clinical information. The HL7 virtual Medical Record (vMR) is a standardized domain model for representing the inputs and outputs of CDS systems.

In practice, executable CDS rules (like other complex types of business rules) can be implemented with a business rule engine using forward chaining. This is the approach taken by OpenCDS and some large scale CDS implementations in real world healthcare delivery settings. This allows CDS software developers to externalize the medical knowledge in clinical guidelines in the form of declarative rules as opposed to embedding that knowledge in procedural code. Many viable open source business rule management systems (BRMS) are available today and provide capabilities such as rule authoring, a rule repository, and a testing environment. Furthermore, a rule execution environment can be easily integrated with business processes (see the section below on clinical workflow integration), ontologies, and predictive analytics models.
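To illustrate the idea of forward chaining over declarative rules, here is a deliberately naive sketch in Python. It is not how OpenCDS or any particular BRMS is implemented; production rule engines use far more efficient matching (for example, algorithms in the Rete family). The Rule class, the forward_chain function, and the fact names are all invented for this example, and the rules themselves are not clinical guidance.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class Rule:
    """A declarative IF-THEN rule: when all conditions are known facts, assert the conclusion."""
    name: str
    conditions: FrozenSet[str]
    conclusion: str


def forward_chain(facts: Iterable[str], rules: List[Rule]) -> Set[str]:
    """Fire rules repeatedly until no rule can add a new fact (a fixed point)."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.conditions <= known and rule.conclusion not in known:
                known.add(rule.conclusion)
                changed = True
    return known


# Illustrative rules only; the fact names are invented and are not clinical guidance.
EXAMPLE_RULES = [
    Rule("notify", frozenset({"hba1c_test_due"}), "alert_clinician"),
    Rule("hba1c-overdue", frozenset({"has_diabetes", "no_recent_hba1c"}), "hba1c_test_due"),
]

if __name__ == "__main__":
    derived = forward_chain({"has_diabetes", "no_recent_hba1c"}, EXAMPLE_RULES)
    print(sorted(derived))
```

The point of the sketch is that the knowledge lives in EXAMPLE_RULES as data: the conclusion of one rule becomes a fact that can trigger another, and rules can be added or changed without touching the procedural code in forward_chain.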

The W3C Rule Interchange Format (RIF) specification is a possible solution to the interchange of executable CDS rules. The RIF Production Rule Dialect (PRD) is designed as a common XML serialization syntax for multiple rule languages to enable rule interchange between different BRMS. For example, RIF-PRD would allow the exchange of executable rules between existing BRMS like JBoss Drools, IBM ILOG JRules, and Jess. RIF is currently a W3C Recommendation and is backed by several BRMS vendors. A paper entitled "A model driven approach for bridging ILOG Rule Language and RIF" presented by Valerio Cosentino, Marcos Didonet Del Fabro, and Adil El Ghali at the RuleML 2012 conference describes an application of RIF to rule interoperability.

 

Seamless Integration into Clinical Workflows and Care Pathways


One of the main complaints against CDS systems is that they are not well integrated into clinical workflows. Existing business process management standards like the Business Process Model and Notation (BPMN) can provide a proven, practical, and adaptable approach to the integration of CDS rules and clinical pathways. Some existing open source and commercial BRMS already provide an integration of business rules and business processes out of the box, and there are well-known patterns for integrating rules and processes in business applications.


Integration with Natural Language Processing (NLP)


It is not practical to expect that all medical knowledge required in clinical decision making will be manually translated into IF-THEN statements to be executed by rule engines. A vast and rapidly increasing amount of medical knowledge exists in the form of narrative text in clinical guidelines and other unstructured sources of medical knowledge like textbooks, research papers, and academic journals.

For an in-depth discussion of NLP in Clinical Decision Support, see my previous post entitled Automated Clinical Question Answering: The Next Frontier in Healthcare Informatics.

Sunday, March 24, 2013

Statistical Computing and Data Mining with R

The use of predictive risk models for personalized medicine is becoming a common practice in healthcare delivery. These models can predict the health risk of patients based on their individual health profiles including genetic profiles. Examples include models for predicting breast cancer, stroke, cardiovascular disease, Alzheimer's disease, diabetes, hypertension, operative mortality for patients undergoing cardiac surgery, and hospital readmission. These predictive models are created through data analysis using statistical computing.

Statistical Computing is an essential component of intelligent health IT systems. Over the last few years, the free and open source R Project for Statistical Computing has emerged as one of the most popular tools for data analysis. This poll by kdnuggets.com shows the breakdown in popularity of various data mining and analytic tools.

The following are very useful resources for doing statistical computing and data mining with R:
 
  • RStudio: an Integrated Development Environment (IDE) for R

  • ggplot2: statistical graphics and plotting system for R

  • sqldf: a package for manipulating R data frames using SQL

  • RMySQL: R interface to the MySQL database

  • RMongo: MongoDB Database interface for R

  • RHIPE: Big Data analysis using R and Hadoop. This approach is referred to as D&R (Divide and Recombine) Analysis of Large Complex Data (see this tech report on D&R from the RHIPE team)

  • RHadoop: Big Data analysis using R and Hadoop. This tool provides Hadoop MapReduce functionality in R

  • Rattle: A Graphical User Interface for Data Mining using R. This tool can export predictive models in Predictive Model Markup Language (PMML) format.

Sunday, March 10, 2013

How Not to Build A Big Ball of Mud, Part 2

In a previous post entitled How not to build a big ball of mud, I described the complexity of modern software systems and the challenges faced today by software developers and architects. Domain Driven Design (DDD) is a proven pattern language that can foster a disciplined approach to software development. DDD was first introduced by Eric Evans nine years ago in a seminal book entitled Domain-Driven Design: Tackling Complexity in the Heart of Software. Over the last nine years, a community of practice has emerged around DDD and many lessons have been learned in applying DDD to real world complex software development projects. During that time, software complexity has also increased significantly. Changes in the field of software development during the last few years include:

  • The proliferation of client devices, which requires a Responsive Web Design (RWD) approach. RWD is made possible by open web standards like HTML5, CSS3, and JavaScript, which have displaced proprietary user interface technologies like Flex and Silverlight. RWD frameworks like Twitter Bootstrap and JavaScript libraries like jQuery have become very popular with developers. The demands put on JavaScript on the client side have created the need for JavaScript MVC frameworks like AngularJS and EmberJS.

  • The importance of the user experience in a competitive online marketplace. Performing usability testing early in the software development life cycle (using wireframes or mockups) to test design ideas and obtain early feedback from future users is extremely valuable for creating the right solution. Metrics such as the System Usability Scale (SUS) can be used to assess the results of usability testing.

  • The prevalence of REST, JSON, OAuth2, and Web APIs for achieving web scale.

  • The emergence of Polyglot Persistence or the use of different persistence mechanisms such as relational, document, and graph databases within the same application. Developers are discovering that modeling data for NoSQL databases has many benefits, but also its own peculiarities.

  • The demands for quality and faster time-to-market have led to new techniques like test automation and continuous delivery.
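Since the System Usability Scale (SUS) is mentioned above as a usability metric, here is a short sketch of the standard SUS scoring arithmetic: ten items rated 1 to 5, where odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a score between 0 and 100. The function name sus_score is my own.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Compute the SUS score (0 to 100) for one respondent's ten answers, each 1 to 5."""
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            # Odd-numbered items are positively worded.
            total += response - 1
        else:
            # Even-numbered items are negatively worded, so the scale is reversed.
            total += 5 - response
    return total * 2.5


if __name__ == "__main__":
    print(sus_score([3] * 10))    # all-neutral answers
    print(sus_score([5, 1] * 5))  # most favorable answers
```

All-neutral answers give 50.0 and the most favorable pattern gives 100.0. Scores from several test participants are typically averaged to assess a design.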

The open source community has responded to these challenges by creating many frameworks which are supposed to facilitate the work of developing software. Software developers spend a considerable amount of time researching, learning, and integrating these various frameworks to build a system. Some of these frameworks can indeed be very helpful when used properly. However, DDD puts a big emphasis on understanding the domain. Here is what I learned from applying DDD over the last few years:


  • DDD is a significant intellectual investment, but with a potential for big rewards. To be successful in applying DDD, one must take the time to understand and digest the underlying principles, from the building blocks (entities, aggregates, value objects, modules, domain events, services, repositories, and factories) to the strategic aspects of applying DDD. For example, understanding the difference between an aggregate, a value object, and an entity is essential. Learning the right approach to designing aggregates is also very important as this can significantly impact transactions and performance. I highly recommend reading the recently published Implementing Domain Driven Design by Vaughn Vernon. The book provides a contemporary approach to applying DDD. For example, it covers important topics in applying DDD to modern software systems such as:  sub-domains, domain events, event stores and event sourcing, rules for aggregate design, transactions, eventual consistency, REST, NoSQL, and enterprise application integration with concrete examples.

  • Proper application layering (user interface, application, domain, and infrastructure), understanding the responsibility of each layer (for example, an anemic domain model and a fat application layer are anti-patterns), and coding to interfaces are all important. DDD is object-oriented (OO) design done right. The SOLID Principles of OO design are still applicable.

  • Determine if DDD is right for your project. Most of my work during the last few years has been in the healthcare domain. The HL7 CCDA and the Virtual Medical Record (vMR) define an information model for Electronic Healthcare Records (EHR) and Clinical Decision Support (CDS) systems respectively. Interoperability is an important and challenging issue in healthcare. DDD concepts such as "Strategic Design", "Context Map", "Bounded Context", and "Published Language" are very helpful in addressing and navigating this type of complexity.

  • As I mentioned earlier, DDD puts a big emphasis on understanding the domain. Developers applying DDD should be prepared to dedicate a considerable amount of time to learning about the domain, for example by collaborating and carefully listening to domain experts and by reading as much as they can about the domain. This is also the key to creating a rich domain model with behavior (as opposed to an anemic one). I found that simply reading industry standards and regulations is a great way to understand a domain. So understanding the domain is not just the responsibility of the Business Analyst. The code is the expression of the domain, so the coder needs to understand the domain in order to express it with code.

  • Some developers blame popular frameworks for encouraging anemic domain models. I found that a lack of understanding of the domain and its business rules is a major contributing factor to anemia in the domain model. A rule engine like Drools can help externalize these business rules in the form of declarative rules that can be maintained by domain experts through a DSL, spreadsheet, or web-based user interface.

  • There are opportunities in using recent ideas like Event Sourcing and Command Query Responsibility Segregation (CQRS). These opportunities include scalability, true audit trails, data mining, temporal queries, and application integration. However, being pragmatic can help avoid unnecessary complexity.

  • I recommend exploring tools that are specifically designed to support a DDD or Model-Driven Development (MDD) approach. Apache Isis, Roma Meta Framework, Tynamo, and Naked Objects are examples of such tools. These tools can automatically generate all the layers of an application based on the specification of a domain model. By doing so, these tools allow you to really focus your time and attention on exploring and understanding the domain as opposed to framework and infrastructure concerns. For architects, these tools can serve as design pattern automation, constraining the development process to conform to DDD principles and patterns. I believe this is part of a larger trend in automating software development which also includes the essential practice of test automation. We software developers like to automate the job of other people. However, many tasks that we perform (including coding itself) are still very manual. Aspect-Oriented Programming (AOP) (using AspectJ for example) can also be used to enable this type of design pattern automation through compile-time weaving.

  • Check my previous post for 20 techniques for achieving software excellence.
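To make the difference between a value object, an entity, and an aggregate more concrete, here is a small sketch in Python. The Money, OrderLine, and Order classes and the credit-limit rule are hypothetical and serve only as an illustration; they are not taken from the books mentioned above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Money:
    """Value object: immutable and compared by value; it has no identity."""
    cents: int
    currency: str = "USD"

    def add(self, other: "Money") -> "Money":
        if other.currency != self.currency:
            raise ValueError("currency mismatch")
        return Money(self.cents + other.cents, self.currency)


class OrderLine:
    """Entity: compared by identity (line_id), not by its current attributes."""

    def __init__(self, line_id: str, description: str, price: Money) -> None:
        self.line_id = line_id
        self.description = description
        self.price = price

    def __eq__(self, other: object) -> bool:
        return isinstance(other, OrderLine) and other.line_id == self.line_id

    def __hash__(self) -> int:
        return hash(self.line_id)


class Order:
    """Aggregate root: the only entry point for changing its lines, so it can enforce invariants."""

    def __init__(self, order_id: str, credit_limit: Money) -> None:
        self.order_id = order_id
        self._credit_limit = credit_limit
        self._lines: List[OrderLine] = []
        self._submitted = False

    def total(self) -> Money:
        total = Money(0, self._credit_limit.currency)
        for line in self._lines:
            total = total.add(line.price)
        return total

    def add_line(self, line: OrderLine) -> None:
        # Business rules live in the domain object, not in an application service.
        if self._submitted:
            raise ValueError("cannot change a submitted order")
        if self.total().add(line.price).cents > self._credit_limit.cents:
            raise ValueError("credit limit exceeded")
        self._lines.append(line)

    def submit(self) -> None:
        if not self._lines:
            raise ValueError("cannot submit an empty order")
        self._submitted = True


if __name__ == "__main__":
    order = Order("o-1", credit_limit=Money(10_000))
    order.add_line(OrderLine("l-1", "Consultation", Money(7_500)))
    print(order.total())
```

An anemic version of Order would expose its list of lines and leave the credit-limit and submission checks to a service class. Keeping those checks inside the aggregate root is what gives the model behavior.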

Sunday, February 24, 2013

State of the Semantic Web in the Clinical Domain

In a previous post entitled Why Do We Need Ontologies in Healthcare Applications, I explained the important difference between ontologies, coding systems, and information models of data structures. I also outlined the benefits of using Semantic Web technologies like RDF, RDFS, OWL, SWRL, R2RML, SPARQL, SKOS, and Linked Open Data (LOD). These benefits include:

  • Reasoning and inferencing which are essential characteristics of intelligent Health IT Systems (iHIT)
  • Model consistency checking
  • Open World Assumption (OWA) and Non-Unique Naming Assumption enabling the integration of heterogeneous data sources and knowledge bases using Linked Open Data (LOD) principles. This integration can be accomplished by providing an RDF view over existing relational databases using R2RML (RDB to RDF Mapping Language) and by performing SPARQL federated queries. Intelligent queries can retrieve inferred facts using SPARQL 1.1 Entailment Regimes.
  • Linking to other biomedical ontologies like SNOMED and the Translational Medicine Ontology
  • Clinical Knowledge Management (CKM) using OWL to model and execute Clinical Practice Guidelines (CPGs) and Care Pathways (CPs).

Semantic Web in Clinical and Translational Research


The following are papers on how Semantic Web technologies are being used to realize these benefits in the healthcare domain:

Apache Stanbol


I recently came across Apache Stanbol, a new Apache project which is described as "a set of reusable components for semantic content management". What I really like about Apache Stanbol is that it not only works on unstructured data sources, but also integrates a number of other popular Apache open source projects which can be used to add a semantic layer to modern RESTful content-oriented applications. These components include:

  • Apache Tika for text and metadata extraction from a variety of commonly used document formats
  • Apache OpenNLP for natural language processing and named entity recognition (NER)
  • Apache Solr for document store and semantic search
  • Apache Jena as the RDF and Semantic Web framework.
Other open source components like Apache Mahout (a scalable Machine Learning library) can be integrated to provide document recommendation and clustering services.

The Content Enhancers in Stanbol can perform named entity recognition (NER) and link text annotations to external datasets such as DBPedia. In the clinical domain, these enhancers can be used to extract entities from medical records, journal articles, and clinical guidelines. These entities can then be linked to other clinical data sources such as drug and disease databases using Linked Data techniques.

Apache Stanbol also provides Reasoners based on Jena RDFS, OWL, and OWLMini Reasoners as well as the HermiT OWL Reasoner. These reasoners can perform consistency checking and classification. Stanbol supports Inference Rules in the following formats: SWRL, Jena Rules, and SPARQL (by converting Stanbol Rules into SPARQL CONSTRUCTs).

Sunday, February 17, 2013

Automated Clinical Question Answering: The Next Frontier in Healthcare Informatics

In a previous post, I predicted that 2013 will be the year Intelligent Health IT Systems (iHIT) go mainstream. I based my prediction on a number of factors, notably the transformation of healthcare to a value-based delivery system driven by the latest scientific evidence (evidence-based practice and practice-based evidence).

Last week, IBM, together with health insurer WellPoint Inc. and New York’s Memorial Sloan-Kettering Cancer Center, announced the commercialization of Watson (the supercomputer which beat human champions in "Jeopardy!" on February 16, 2011) for question answering (QA) in the clinical domain. The following are some interesting facts released by IBM as part of this announcement:

  • The supercomputer has ingested 1,500 lung cancer cases from Sloan-Kettering records, plus 2 million pages of text from journals, textbooks and treatment guidelines. This is what I called Big Data in medicine.
  • In 2012, Watson became 240 percent faster and 75 percent smaller so it can run on a single server. No surprise here and I expect this trend to continue.

The following YouTube video entitled Oncology Diagnosis and Treatment explains how IBM envisions using Watson for Clinical Question Answering (CQA):



The User Experience in the Watson Demo

 

  • Clinical questions can be posed in natural language (spoken or typed in by the clinician using a keyboard).
  • The sources used for answering clinical questions include both structured (EMR databases) and unstructured information (journal articles, clinical guidelines, etc.).
  • Personalized medicine: the proposed interventions are driven by the data in the patient's medical record and the system can prompt the clinician for additional information on the patient if necessary. The displayed evidence and recommendations are updated to reflect changes in the patient's clinical data.
  • Human Factors: the clinician is always in the loop. She can ask Watson how it arrives at a specific care recommendation and can even remove a specific piece of evidence (if deemed irrelevant or not appropriate).
  • The use of confidence scoring and evidence highlighting.
  • Patient-centeredness and shared decision making: the treatment plans take into account the values, goals, and wishes of the patient (patient preferences). Treatment options are discussed with the patient.
  • Comparative effectiveness is used to compare the benefits and harms of different interventions.
  • Information is displayed using data visualization (dashboard) to help meet key performance indicators in the context of a value-based payment model.


The Science Behind Watson


The real question is how do we make intelligent health IT systems like Watson widely available to all patients. A landmark report published by the Institute of Medicine in 2001 and titled Crossing the Quality Chasm - A New Health System for the 21st Century contained the following recommendation:

Patients should receive care based on the best available scientific knowledge. Care should not vary illogically from clinician to clinician or from place to place.

For the scientifically (and Artificial Intelligence) inclined, the following are some pointers on the science behind Watson:


The picture below represents a high-level architecture of Watson.


DeepQA



AskHermes and MiPACQ


IBM Watson is not the only effort to develop automated CQA capabilities.  Some earlier CQA efforts used the PICO framework (Problem/Population, Intervention, Comparison, Outcome) to facilitate processing. More recent efforts have focused on the use of clinical questions posed in natural language.

AskHermes (Help clinicians to Extract and aRrticulate Multimedia information for answering clinical quEstionS) allows clinicians to enter questions in natural language and uses the following unstructured information sources: MEDLINE abstracts, PubMed Central full-text articles, eMedicine documents, clinical guidelines, and Wikipedia articles.

The processing pipeline in AskHermes includes the following: Question Analysis, Related Questions Extraction, Information Retrieval, Summarization and Answer Presentation. AskHermes performs question classification using MMTx (MetaMap Technology Transfer) to map keywords to UMLS concepts and semantic types. Classification is also achieved through supervised machine learning algorithms such as Support Vector Machines (SVMs) and Conditional Random Fields (CRFs). Summarization and answer presentation are based on clustering techniques.

MiPACQ (Multi-source Integrated Platform for Answering Clinical Questions) is based on Natural Language Processing (NLP) and Information Retrieval (IR) and utilizes data sources such as Electronic Medical Record (EMR) databases and online medical encyclopedias like Medpedia. MiPACQ uses a processing pipeline based on UIMA (Unstructured Information Management Architecture) and machine learning-based as well as rule-based scoring. NLP capabilities are provided by ClearTK and cTAKES (clinical Text Analysis and Knowledge Extraction System).



The Road Ahead


Automated Clinical Question Answering (CQA) is really hard. However, that is the future of computing: intelligent machines we can have meaningful conversations with. CQA is a multidisciplinary field which combines disciplines like statistical computing, information retrieval, natural language processing, machine learning, rule engines, semantic web technologies, knowledge representation and reasoning, visual analytics, and massively parallel computing. There are several open source projects that provide the building blocks. Many EHR systems today are glorified data entry systems. We need to move to the next level and that will require technical leadership.

Sunday, February 3, 2013

Patient Privacy At Web Scale

A study entitled Patients want granular privacy control over health information in electronic medical records by Kelly Caine and Rima Hanania in the current issue of the Journal of the American Medical Informatics Association (JAMIA) clearly indicates that patients want a granular level of control over the sharing of their medical information. Patients also want to control with whom their health information is shared and for what purpose. The study looks at how the presence of sensitive health information in a medical record affects patient privacy preferences. In this post, I discuss how current and emerging standards can be used to enforce patient privacy preferences at web scale.

First, I think the key to achieving patient privacy at web scale is to adopt proven light-weight protocols and standards such as REST, JSON, OAuth2, and OpenID Connect. The RESTful Health Exchange (RHEx) project funded by the Federal Health Architecture (FHA) was a step in the right direction. These protocols have also been embraced by large internet identity providers like Google, Facebook, and Microsoft. To increase the strength of authentication when using these existing online identities in patient-facing healthcare applications, techniques like multi-factor authentication (e.g., two-factor authentication using the user's phone) and adaptive risk authentication can be used. These light-weight standards and protocols contrast with enterprise-centric alternatives like SOAP and SAML which are the foundation for Integrating the Healthcare Enterprise (IHE) standards including XDS.b, XDR, and XUA.
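To make the phone-based second factor mentioned above more concrete, here is a minimal sketch of a time-based one-time password (TOTP, RFC 6238) check written with the Python standard library only. The function names, the 30-second time step, and the idea of passing the shared secret around as raw bytes are assumptions made for illustration; a real patient-facing application would more likely use a vetted library and add rate limiting and clock-drift tolerance.

```python
import hashlib
import hmac
import struct


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) using HMAC-SHA1."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low nibble of the last byte
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238): HOTP over the time-step counter."""
    return hotp(secret, unix_time // step, digits)


def verify_second_factor(secret: bytes, submitted: str, unix_time: int) -> bool:
    """Compare the submitted code with the expected one in constant time."""
    return hmac.compare_digest(totp(secret, unix_time), submitted)
```

The server and the authenticator app on the user's phone share the secret at enrollment time; at login the server calls something like `verify_second_factor(secret, code_from_user, int(time.time()))` after the first factor has succeeded.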

An emerging approach that could really help put patients in control of the privacy of their electronic medical record is the User-Managed Access (UMA) Protocol of the Kantara Initiative. According to the UMA Core specification:
User-Managed Access (UMA) is a profile of OAuth 2.0. UMA defines how resource owners can control protected-resource access by clients operated by arbitrary requesting parties, where the resources reside on any number of resource servers, and where a centralized authorization server governs access based on resource owner policy.
That sounds a lot like a healthcare environment where a typical patient has her health information residing in the Electronic Health Record (EHR) systems of multiple healthcare providers. A frequent use case is when the patient's health information is shared among providers during primary care physicians' referrals to specialist outpatient clinics. A centralized authorization server as defined in UMA offers the following benefits for patient privacy:

  • The ability to manage her consent directives (scope of access in UMA parlance) from a central location (ideally in the cloud) as opposed to the current paper-based environment where the patient signs a consent form for each provider and has no visibility into how the consent is being used and enforced.
  • It facilitates the update and revocation of the consent directives by the patient. 
  • It would give the patient a full audit trail of requests and access events related to her health information.
  • The patient user experience of managing their privacy preferences online can be significantly enhanced by data visualization. A study titled Exploring Visualization Techniques to Enhance Privacy Control UX for User-Managed Access introduced the notion of a "UMA Connection" for helping users visualize the context of a data sharing policy (e.g., contacts, allowed actions, access restrictions, and trusted claims).
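The first three benefits listed above (central management of consent directives, easy revocation, and a full audit trail) can be sketched with a toy in-memory model. This is not the UMA protocol itself: there are no tokens, no resource server registration, and no claims gathering. The class names, field names, and sample identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ConsentDirective:
    """One patient-defined rule: who may do what with which resource."""
    requesting_party: str  # e.g., a provider identifier (hypothetical)
    resource: str          # e.g., "lab-results"
    scope: str             # e.g., "read"


@dataclass
class AuthorizationServer:
    """Toy stand-in for a centralized authorization server."""
    directives: set = field(default_factory=set)
    audit_log: list = field(default_factory=list)

    def grant(self, directive: ConsentDirective) -> None:
        self.directives.add(directive)

    def revoke(self, directive: ConsentDirective) -> None:
        self.directives.discard(directive)

    def is_permitted(self, requesting_party: str, resource: str, scope: str) -> bool:
        """Evaluate a request against the patient's policy and record it."""
        decision = ConsentDirective(requesting_party, resource, scope) in self.directives
        self.audit_log.append(
            (datetime.now(timezone.utc), requesting_party, resource, scope, decision)
        )
        return decision
```

Because every decision goes through `is_permitted`, the patient can review `audit_log` to see both granted and denied requests, and a call to `revoke` takes effect for all providers at once.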

In UMA, trusted claims (e.g., information about a requesting healthcare provider such as email, name, role, organization, and NPI) can be conveyed using OpenID Connect. The Google OpenID Connect Demo provides a step-by-step guide to OpenID Connect and Nat Sakimura's Dummy’s guide for the Difference between OAuth Authentication and OpenID is a good explanation of how OpenID Connect complements OAuth2. A separate specification entitled Binding Obligations on User-Managed Access (UMA) Participants proposes a legal framework that defines the obligations of parties that operate and use UMA-conforming software programs and services.
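As a rough illustration of how such claims travel, an OpenID Connect ID token is a JWT whose middle segment is a base64url-encoded JSON object of claims. The sketch below only decodes that segment and deliberately skips signature validation, so it must not be used to make trust decisions. The `npi` and `role` claim names, the issuer URL, and the sample values are assumptions; only `iss`, `sub`, `name`, and `email` are standard claims.

```python
import base64
import json


def _b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def _b64url_encode(raw: bytes) -> str:
    """Encode bytes as unpadded base64url, as used in JWT segments."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def read_claims_unverified(id_token: str) -> dict:
    """Return the claims in an ID token WITHOUT checking its signature.

    Illustration only: a real relying party must validate the signature,
    issuer, audience, and expiry before trusting any claim.
    """
    _header, payload, _signature = id_token.split(".")
    return json.loads(_b64url_decode(payload))


# Build a fake, unsigned token so the example is self-contained.
sample_claims = {
    "iss": "https://idp.example.org",
    "sub": "provider-123",
    "name": "Dr. Jane Doe",
    "email": "jane.doe@example.org",
    "npi": "0000000000",     # hypothetical custom claim
    "role": "cardiologist",  # hypothetical custom claim
}
fake_token = ".".join([
    _b64url_encode(json.dumps({"alg": "none"}).encode()),
    _b64url_encode(json.dumps(sample_claims).encode()),
    "",
])
claims = read_claims_unverified(fake_token)
```

An authorization server could compare claims like these with the patient's policy (for example, "only cardiologists from this organization") before releasing access.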

A recent post by Domenico Catalano entitled UMA Approach to Protect and Control Online Reputation describes a UMA-based approach for supporting privacy based on reputation and trust. An example in the post is a "global reputation ranking" in the context of an online e-commerce site. In the context of healthcare privacy, when deciding to share their sensitive medical information with a specific healthcare provider, the same concept could be used to display the number and severity of security breaches experienced by the healthcare provider in the past. Section 13402(e)(4) of the HITECH Act actually requires posting a list of breaches of unsecured protected health information affecting 500 or more individuals. The list is available here.

The recently approved XACML 3.0 standard is a powerful mechanism for expressing and evaluating privacy policies. It provides capabilities such as obligation and advice expressions as well as delegation of authorization. In this presentation, Eve Maler discusses possible integration points between UMA and XACML.  The REST Profile of XACML 3.0 and the Request/Response Interface based on JSON and HTTP for XACML 3.0 proposals introduce the notion of "RESTful Authorization-as-a-Service (AZaaS)" which can facilitate the use of XACML in a UMA-based access control environment.


Sunday, January 20, 2013

Application-Level Security in Health IT Systems: A Roadmap

In a previous post titled "A Journey into Software Excellence", I described a twenty-step journey into software excellence including steps on understanding the OWASP Top Ten (list of the top ten most critical web application security vulnerabilities), secure coding, static analysis, and penetration testing. In this post, I elaborate on these security-related steps and discuss a roadmap to application-level security for healthcare IT systems. A study conducted in 2004 found that 64 percent of vulnerabilities in the National Vulnerability Database (NVD) were the result of coding errors. The Department of Homeland Security issued an alert (US-CERT Alert TA13-010A - Oracle Java 7 Security Manager Bypass Vulnerability) on a Java 7 vulnerability which has received a lot of media attention lately.

The Regulatory and Business Context


An investigative report titled "Health-care sector vulnerable to hackers, researchers say" published last month in the Washington Post on the state of cybersecurity reveals that:

"...health care is among the most vulnerable industries in the country, in part because it lags behind in addressing known problems."

The healthcare industry is indeed seriously lagging in security when compared to other industries that handle sensitive consumer information, such as the payment card industry. The Payment Card Industry Data Security Standard (PCI DSS) is an information security standard for organizations that handle cardholder information. PCI DSS includes requirements for security code reviews, penetration testing, and compliance validation by an external Qualified Security Assessor (QSA).

This week, the Department of Health and Human Services (HHS) issued a final omnibus rule on the HIPAA Privacy, Security, Enforcement, and Breach Notification Rules. The rules impose the following:

  • Increased and tiered civil money penalty structure for security breaches depending on "reasonable diligence", "willful neglect", and "timely correction". The penalty amount varies from $100 to $50,000 per violation with a maximum penalty of $1.5 million annually for all violations of an identical provision.
  • Expansion of accountability and liability for Business Associates (BAs) and subcontractors.
  • Increased privacy protections under the Genetic Information Nondiscrimination Act (GINA).

Furthermore, the Security and Privacy Tiger Team of the US Office of the National Coordinator (ONC) for health IT released a set of recommendations related to the Meaningful Use (MU) Stage 2 requirements for patient access to health record portals. The need for patient engagement as a prerequisite to a successful transformation of healthcare means that particular attention should be paid to the security needs of consumer-facing web applications.

Developers and Architects building health IT systems should have a solid grasp of the regulatory environment for security and privacy in healthcare and the consequences of non-compliance on their organizations and the patients they serve. 

Considering the critical importance of Security and Privacy for the patient's experience as well as the reputation and bottom line of healthcare organizations, I believe that Security and Privacy should be elevated to a core competency and even a source of competitive advantage. C. K. Prahalad and Gary Hamel first introduced the notion of "core competency" in a 1990 Harvard Business Review (HBR) article titled "The Core Competence of the Corporation". Prahalad and Hamel further elaborated on how to build competitive advantage around core competencies in a book titled "Competing for the Future" published in 1996.

For software developers and architects in healthcare IT, Security and Privacy as a core competency means the following:
  • Security and Privacy are top priorities, as important as the functional requirements of the healthcare IT applications we design, develop, deploy, monitor, and maintain.
  • Excellence in secure software development through awareness and use of best practices, methods, and tools.
  • Learning, continuous improvement, and innovation in Security and Privacy.

 

Security in the Software Development Life Cycle (SDLC)


Unfortunately, security, as a non-functional requirement, is often relegated to an afterthought in the software development life cycle (SDLC). As an afterthought, security is added to the software later or at the end of the development cycle. At that point, adding adequate security is difficult and costly, requiring significant rework. In some cases, penetration testing is not performed at all before the application is deployed into production.

This situation can be exacerbated by an interpretation of the Agile methodology that puts the emphasis on the early and frequent demonstrations to the customer of functional (as opposed to non-functional) features of the system under development. To address the issues of secure software development in the context of Agile, the Software Assurance Forum for Excellence in Code (SAFECode) published a guide titled "Practical Security Stories and Tasks for Agile Development Environment".

Another issue is that developers and architects often over-rely on 3rd-party security infrastructure, as opposed to (1) developing a Threat Model for the application they are building and (2) creating a security implementation approach to address the Threat Model. 3rd-party security infrastructure can be helpful, but should serve the security implementation strategy as opposed to driving it. As Bruce Schneier, a well-known cryptographer and computer security specialist said in an article titled "Computer Security: Will We Ever Learn?":
"Security is a process, not a product. Products provide some protection, but the only way to effectively do business in an insecure world is to put processes in place that recognize the inherent insecurity in the products. The trick is to reduce your risk of exposure regardless of the products or patches."


Understanding Potential Security Vulnerabilities


Application Security is a mature discipline. Developers and architects should build a deep understanding of web application security vulnerabilities as opposed to completely relying on 3rd-party security infrastructure for addressing security concerns. The following are well documented bodies of knowledge on security vulnerabilities:

  1. The OWASP Top 10 Web Application Security Risks (cheat sheets explaining each of those vulnerabilities and how to address them are available on the OWASP web site):

    A1: Injection
    A2: Cross-Site Scripting (XSS)
    A3: Broken Authentication and Session Management
    A4: Insecure Direct Object References
    A5: Cross-Site Request Forgery (CSRF)
    A6: Security Misconfiguration
    A7: Insecure Cryptographic Storage
    A8: Failure to Restrict URL Access
    A9: Insufficient Transport Layer Protection
    A10: Unvalidated Redirects and Forwards.

  2. The CWE/SANS Top 25 Most Dangerous Software Errors, the result of collaboration between the SANS Institute, MITRE, and many top software security experts in the US and Europe.
  3. Programming language-specific vulnerabilities such as those listed in the CERT Oracle Secure Coding Standard for Java.
  4. Well-documented security vulnerabilities introduced by the use of 3rd-party application development frameworks such as those based on the popular MVC pattern.
  5. The National Vulnerability Database Version 2.2
  6. The Common Weakness Enumeration (CWE) which is currently maintained by the MITRE Corporation with support from the National Cyber Security Division (DHS). The diagram below from the CWE web site shows a portion of the CWE hierarchical structure.
  7. Obviously, developers should be on the lookout for new and emerging threats to web application security.
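To illustrate the first entry in the OWASP list above (A1: Injection), here is a small sketch contrasting string-concatenated SQL with a parameterized query. It uses Python's built-in `sqlite3` module and a made-up `patients` table; the table, column, and function names are assumptions chosen for the example, and the same principle applies to prepared statements in Java or any other language.

```python
import sqlite3


def find_patient_unsafe(conn, last_name):
    """VULNERABLE: user input is concatenated into the SQL text."""
    query = "SELECT id, last_name FROM patients WHERE last_name = '" + last_name + "'"
    return conn.execute(query).fetchall()


def find_patient_safe(conn, last_name):
    """Safer: the driver binds the value, so it is never parsed as SQL."""
    return conn.execute(
        "SELECT id, last_name FROM patients WHERE last_name = ?", (last_name,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, last_name TEXT)")
conn.executemany("INSERT INTO patients (last_name) VALUES (?)", [("Adams",), ("Baker",)])

malicious = "x' OR '1'='1"
leaked = find_patient_unsafe(conn, malicious)  # the injected OR clause matches every row
blocked = find_patient_safe(conn, malicious)   # treated as a literal name, matches nothing
```

In the unsafe version the attacker's quote characters change the structure of the query; in the safe version the input can only ever be a value.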





Application Threat Modelling


Armed with a deep understanding of potential vulnerabilities, developers and architects can build a Security Policy (who has what type of access to which resource in the system) and a Threat Model including:

  • An analysis of the attack surface of the application.
  • Identification of potential threats and attackers (both inside and outside the organization and its business associates and subcontractors) and their characteristics, tactics, and motivations. A threat categorization methodology such as STRIDE can be used. STRIDE defines the following threat categories: Spoofing of user identity, Tampering, Repudiation, Information disclosure (privacy breach or data leak), Denial of Service (DoS), and Elevation of privilege.
  • The consequences of those attacks for patients and the healthcare organization serving them.
  • Countermeasures and a risk mitigation strategy. The Application Security Frame (ASF) defines the following categories of countermeasures:  Authentication, Authorization, Configuration Management, Data Protection in Storage and Transit, Data Validation/Parameter Validation, Error Handling and Exception Management, User and Session Management, Auditing and Logging.
  • How the deployment environment will impact privacy and security. NIST and the Cloud Security Alliance (CSA) provide specific security guidance for cloud deployment.
When it comes to application threat modelling, to paraphrase Andy Grove, former CEO and Chairman of Intel, "only the paranoid survive".
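One lightweight way to keep such a threat model reviewable is to record it as data alongside the code. The sketch below models the STRIDE categories and ASF countermeasure categories named above; the `Threat` class, the helper functions, and the three sample threats for a patient portal are hypothetical and only meant to show the shape of the idea.

```python
from dataclasses import dataclass
from enum import Enum


class Stride(Enum):
    SPOOFING = "Spoofing of user identity"
    TAMPERING = "Tampering"
    REPUDIATION = "Repudiation"
    INFORMATION_DISCLOSURE = "Information disclosure"
    DENIAL_OF_SERVICE = "Denial of Service"
    ELEVATION_OF_PRIVILEGE = "Elevation of privilege"


@dataclass
class Threat:
    description: str
    category: Stride
    countermeasures: list  # ASF categories, e.g. "Authentication"


def unmitigated(threats):
    """Return threats that have no countermeasure assigned yet."""
    return [t for t in threats if not t.countermeasures]


def coverage_by_category(threats):
    """Count identified threats per STRIDE category to spot blind spots."""
    counts = {category: 0 for category in Stride}
    for threat in threats:
        counts[threat.category] += 1
    return counts


# Hypothetical entries for a patient portal.
model = [
    Threat("Stolen session cookie reused by attacker", Stride.SPOOFING,
           ["Authentication", "User and Session Management"]),
    Threat("Clinician denies having viewed a record", Stride.REPUDIATION,
           ["Auditing and Logging"]),
    Threat("Lab results exposed through verbose error page",
           Stride.INFORMATION_DISCLOSURE, []),
]
```

A category with a count of zero in `coverage_by_category(model)` is not necessarily safe; it may simply mean nobody has thought about it yet, which is exactly what a review should surface.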

 

Developing a Security Implementation Strategy


When it comes to ensuring application security, I believe that the key is verification. The OWASP "Application Security Verification Standard" provides guidelines for introducing security verification into the Software Development Lifecycle (SDLC).


Secure Coding Standards, Static Analysis, and Security Code Review


Many developers are aware of coding conventions (such as the Code Conventions for the Java Programming Language), and the benefits of peer code reviews and static code analysis (using tools like Checkstyle, PMD, FindBugs, and Sonar). These practices should be expanded to cover secure coding as well. The following resources can help:

  • The CERT Oracle Secure Coding Standard for Java.
  • The OWASP Code Review Guide.
  • The "Fundamental Practices for Secure Software Development: A Guide to the Most Effective Secure Development Practices in Use Today" published by the Software Assurance Forum for Excellence in Code (SAFECode)
  • The Payment Card Industry Data Security Standard (PCI DSS) "Information Supplement: Requirement 6.6 Code Reviews and Application Firewalls Clarified" is an example of secure code review requirements in an industry vertical.

There are secure code static analysis tools that can be particularly useful when used in combination with a secure code review process. The static code analysis should be integrated into the build and continuous integration process to provide specific secure code metrics as well as the evolution of those metrics over time.


Penetration Testing


Finally, the application should go through penetration testing before it is deployed into production. Application-level penetration testing should be done in addition to network-level penetration testing. Application-level penetration testing should simulate attacks from malicious insiders and outsiders. OWASP provides a detailed Testing Guide and a number of open source and commercial penetration testing tools are available as well.

Sunday, January 13, 2013

Visual Analytics for Clinical Decision Making

In my last post, I talked about the era of Big Data in medicine, Evidence-Based Practice (EBP),  Practice-Based Evidence (PBE), and the need for a human-centered approach to building intelligent health IT (iHIT) systems. In this post, I discuss Visual Analytics, an emerging discipline in Data Science. In a report titled "Illuminating the Path: The R&D Agenda for Visual Analytics" published in 2004 by the National Visualization and Analytics Center (NVAC), Visual Analytics is defined as "the science of analytical reasoning facilitated by visual interactive interfaces."

The goal of Visual Analytics is to obtain deep insight for effective understanding, reasoning, and decision making through the visual exploration of massive, complex, and often ambiguous data. As a multidisciplinary field, Visual Analytics combines several disciplines such as human perception and cognition, interactive graphic design, statistical computing, data mining, spatio-temporal data analysis, and even art.

In his book titled "Beautiful Evidence", Edward Tufte illustrates the fundamental principles of analytical design by using Charles Minard's famous map known as "Carte figurative des pertes successives en hommes de l'Armée Française dans la campagne de Russie 1812-1813" (Figurative Map of the successive losses in men of the French Army in the Russian Campaign 1812-1813). The map is a dramatic account of the heavy losses of the French army during Napoleon's Russian campaign of 1812. Edward Tufte calls the map the "best statistical graphics ever".




Visual Analytics is also an emerging discipline in healthcare informatics. For example, similar to Minard's map of the Russian Campaign of 1812-1813, Visual Analytics can help in comparing different interventions and care pathways and their respective clinical outcomes over a certain period of time through the vivid showing of causes, variables, comparisons, and explanations. This approach contrasts with the traditional display of clinical data in table rows that is so common in electronic health record (EHR) systems interfaces. Visual Cluster Analysis (another Visual Analytics technique) can be particularly helpful in Comparative Effectiveness Research (CER) where the goal is to compare the benefits and harms of different interventions for different subgroups (groups of patients sharing similar clinical characteristics such as age, gender, race, genetic profile, and comorbidities).

You can find interesting examples of research projects and implementations in the proceedings of the Visual Analytics in Healthcare Workshop which has been held in conjunction with the IEEE VisWeek for the past three years. There are a number of open source toolkits that can be used to implement Visual Analytics. Some of them are based on open web standards such as HTML5, CSS3, SVG, and Javascript. My favorite is D3.js.

Sunday, December 30, 2012

Prediction for 2013: Intelligent Health IT Systems (iHIT) Go Mainstream

iHIT systems represent an evolution of clinical decision support (CDS) systems. Traditionally, CDS systems have provided functionalities such as Alerts and Reminders, Order Sets, Infobuttons, and Documentation Templates. iHIT systems go beyond these basic functionalities and are poised to go mainstream in 2013. This evolution is enabled by recent developments in both computing and healthcare. Notably in computing:

  • The emergence of Big Data and massively parallel computing platforms like Hadoop.
  • The entrance of the following disciplines into the mainstream of computing: Machine Learning (a branch of Artificial Intelligence), Statistical Computing, Visual Analytics, Natural Language Processing, Information Retrieval, Rule engines, and Semantic Web Technologies (like RDF, OWL, SPARQL, and SWRL). These disciplines have been around for many years, but have been largely confined to academia, very large organizations, and niche markets.
  • The availability of open source tools, platforms, and resources to support the technologies mentioned above. Examples include: R (a statistical engine), Apache Hadoop, Apache Mahout, Apache Jena, Apache Stanbol, Apache OpenNLP, and Apache UIMA. The number of books, courses, and conferences dedicated to these topics has increased dramatically over the last two years signalling an entrance into the mainstream.
In addition, the healthcare industry itself is currently going through a significant transformation from a business model based on the number of patients treated to a value-based payment model. The Accountable Care Organization (ACO) is an example of this new model. This model puts an increased emphasis on meeting certain quality and performance metrics driven by the latest scientific evidence (this is called Evidence Based Practice or EBP).

Although very costly, Randomized Controlled Trials (RCTs) are considered the strongest form of evidence in EBP. Despite their inherent methodological challenges (lack of randomization leading to possible bias and confounding), observational studies (using real world data) are increasingly recognized as complementary to RCTs and an important tool in clinical decision making and health policy. According to a report titled "Clinical Practice Guidelines (CPGs) We Can Trust" published by the Institute of Medicine (IOM):
"Randomized trials commonly have an under representation of important subgroups, including those with comorbidities, older persons, racial and ethnic minorities, and low-income, less educated, or low-literacy patients."
Investments into Comparative Effectiveness Research (CER) are increasing as well. CER, an emerging trend in Evidence Based Practice (EBP), has been defined by the Federal Coordinating Council for CER as "the conduct and synthesis of research comparing the benefits and harms of different interventions and strategies to prevent, diagnose, treat and monitor health conditions in 'real world' settings." CER is important not only for discovering what works and what doesn't work in practice, but also for an informed shared decision making process between the patient and her provider.

The use of predictive risk models for personalized medicine is becoming a common practice. These models can predict the health risks of patients based on their individual health profiles (including genetic profiles). These models often take the form of logistic regression models. Examples include models for predicting cardiovascular disease, ICU mortality, and hospital readmission (an important ACO performance measure).
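For readers unfamiliar with logistic regression, the following sketch shows how such a model turns a patient profile into a probability. The intercept, coefficients, and feature names are made up for illustration and do not come from any published risk model.

```python
import math


def predicted_risk(intercept, coefficients, features):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    linear = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Made-up coefficients for illustration only; not from any published model.
INTERCEPT = -4.0
COEFFICIENTS = {"age_decades": 0.3, "prior_admissions": 0.5, "diabetes": 0.7}

patient = {"age_decades": 7.0, "prior_admissions": 2, "diabetes": 1}
risk = predicted_risk(INTERCEPT, COEFFICIENTS, patient)  # roughly 0.45
```

Each coefficient is the change in log-odds per unit of its feature, which is part of why these models remain popular in clinical settings: a clinician can see which factors push the predicted risk up or down.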

Thanks to the Meaningful Use incentive program, adoption of electronic health record (EHR) systems by providers is rapidly increasing. This translates into the availability of huge amounts of EHR data which can be harvested to provide the Practice Based Evidence (PBE) necessary to close the evidence loop. PBE is the key to a learning health system. The Institute of Medicine (IOM) released a report last year titled "Digital Infrastructure for the Learning Health System: The Foundation for Continuous Improvement in Health and Health Care". The report describes a learning health system as:
"...delivery of best practice guidance at the point of choice, continuous learning and feedback in both health and health care, and seamless, ongoing communication among participants, all facilitated through the application of IT."
Both EBP and PBE will require not only rigorous scientific methodologies, but also a computing platform suitable for the era of Big Data in medicine. As William Osler (1849-1919) famously said:
"Medicine is a science of uncertainty and an art of probability."
Lastly, to be successful, the emergence of iHIT systems will require a human-centered design approach. This will be facilitated by the use of techniques that can enhance human cognitive abilities. Examples are: Electronic Checklists (an approach that originates from the aviation industry and has been proven to save lives in healthcare delivery as well) and Visual Analytics.

Happy New Year to You and Your Family!

Saturday, December 8, 2012

A Journey into Software Excellence

I am back in the blogosphere after a seven-month hiatus. It was about time I got my blogging act together. Software development has never been so much fun. In this post, I will share some thoughts on using tools, methods, and practices that can really help in your search for software excellence from the initial prototyping of the user interface to deployment.

  1. With the rapid proliferation of mobile and desktop devices, adopt a Responsive Web Design (RWD) strategy to reach the largest audience possible.
  2. Create responsive sketches, wireframes, or mockups and apply usability guidelines during the initial prototyping. The NHS Common User Interface (CUI) Program is a good example of usability guidelines for healthcare IT applications. Usability.gov has many interesting resources as well.
  3. Perform usability testing to test your design ideas and obtain early feedback from future users of your product before actual development starts. Use metrics such as the System Usability Scale (SUS) to assess the results.
  4. Carefully select the right HTML5, CSS3, and JavaScript libraries and frameworks to support your Responsive Web Design (RWD) strategy. This is no easy task as you need to balance the requirements of responsiveness vs. an optimal mobile user experience.
  5. Consider "Specification By Example" and Behaviour Driven Development (BDD) tools like Cucumber-JVM to create executable user stories.
  6. Pattern languages like Domain Driven Design (DDD) can help you avoid a "Big Ball of Mud" in architecting your software. DDD concepts such as "Strategic Design", "Bounded Context", "Published Language", and "Anti-Corruption Layer" can help you put your architecture in the right perspective, particularly if there is a need to support industry interoperability standards such as HL7 and IHE. However, beware that the practice of DDD has evolved over the last 8 years and new lessons have been learned particularly in the area of "Aggregate" design. So keep up-to-date with new developments in the field in order to leverage the experience of the community. I also found the concept of "Hexagonal Architecture" very helpful in visualizing the complexity of an architecture from different angles.
  7. Consider a peer review of the architecture using a methodology like the Architecture Tradeoff Analysis Method (ATAM).
  8. Embrace Polyglot Persistence (the use of different persistence mechanisms such as relational, document, and graph databases within the same application). However, use the right application development framework to make this easy. Beware of the peculiarities of modeling data for NoSQL databases and remember that "Persistence Ignorance" is not always easy to achieve in practice.
  9. Add a social dimension to your product by integrating the user experience with existing social networking sites that your users already belong to.
  10. Make your application more intelligent through the use of techniques such as Machine Learning (e.g., a recommendation engine), ontologies and rule engines (e.g., automated reasoning), and Natural Language Processing (NLP) (e.g., automated question answering). As Richard Hamming said: "The purpose of computing is insight, not numbers".
  11. To enhance the user experience, adopt HTML5, SVG, and Javascript-based graphing and data visualization techniques for data-intensive applications.
  12. Consider the benefits of deploying the application to the cloud and if you decide to deploy to the cloud, factor that into your entire design and development process including the selection of development tools. Choosing the right Platform-as-a-Service (PaaS) provider can facilitate the process.
  13. Create a Continuous Delivery pipeline based on the core concept of automated testing. Leverage tools like Git (Distributed Version Control), Gradle (build), Jenkins (Continuous Integration), and Artifactory. Continuous Delivery allows you to go to market faster and with confidence in the quality of your product. Save infrastructure costs by using these tools in the cloud during development.
  14. Although there is still a place for manual testing, all tests should be automated as much as possible. In addition to the traditional unit tests (using tools like JUnit, TestNG, and Mockito), embrace automated cross-device, cross-browser, and cross-platform user interface (UI) testing using a tool like Selenium.
  15. Web services and performance testing should also become part of your build and Continuous Delivery pipeline using tools like soapUI and JMeter respectively. Performance testing should not be an afterthought.
  16. Adopt automated code quality inspection with tools like Sonar, Checkstyle, FindBugs, and PMD. This can supplement your peer code review process and can provide you with concrete code quality metrics in addition to automatically flagging bugs (including insecure code) in your code base.
  17. Write secure code by carefully studying the OWASP Top Ten. Adopt OWASP guidelines related to security testing and secure code reviews. Perform penetration testing to find vulnerabilities in your application before it is too late.
  18. Do your due diligence in protecting the privacy of your users' data. Put the users in control of their privacy in your system by adopting standards such as OAuth2, OpenID Connect, and the User Managed Access (UMA) protocol of the Kantara Initiative. Consider increasing the strength of authentication using multi-factor authentication (e.g., two-factor authentication using the user's phone).
  19. Invest in learning and training your development team. Software excellence can only be achieved by skilled professionals.
  20. Relax, have fun, and remember that software excellence is a journey.

Saturday, May 5, 2012

How to Add Arbitrary Metadata to Any Element of an HL7 CDA Document

There has been a lot of buzz lately about metadata tagging in the health IT community. In this post, I describe an approach to annotating HL7 CDA documents (or any other XML documents) without actually editing the document that is being annotated. Metadata tagging is just one example of annotation. The underlying principle of this approach is that Anyone can say Anything about Anything (the AAA slogan), which is well known in the Semantic Web community. In other words, anyone (e.g., patient, caregiver, physician, provider organization) should have the ability to add arbitrary metadata to any element of a CDA document. For the sake of "Separation of Concerns", which is a fundamental principle in software engineering, the metadata should be kept out of the CDA document. The benefits of keeping the metadata or annotations out of the CDA document include:
  • Reuse of the same metadata by distinct elements from potentially multiple clinical documents.
  • The ability to update the metadata without affecting the target CDA documents.
  • The ability for any individual, organization, or community of interest (e.g., privacy or medical device manufacturers) to create a metadata vocabulary without going through the process of modifying the normative CDA specification (or one of its derivatives like the CCD, the C32, or the Consolidated CDA) or the XDS metadata specifications.

History and Current Status of Metadata Standards in Health IT


The CDA specification defines some metadata in the header of a CDA document. In addition, the XD* family of specifications (XDS, XDR, and XDM) also defines a comprehensive set of metadata to be used in cross-enterprise document exchange. The National Information Exchange Model (NIEM) is currently being used in several health IT projects. In a previous post titled "Toward a Universal Exchange Language for Healthcare", I described how the NIEM metadata approach could be adapted to the healthcare domain.

The President's Council of Advisors on Science and Technology (PCAST) published a report in December 2010 entitled: "Realizing the Full Potential of Health Information Technology to Improve Healthcare for Americans: The Path Forward". To describe the proposed approach to metadata tagging, the report provides an example based on the exchange of mammograms:
"The physician would be able to securely search for, retrieve, and display these privacy protected data elements in much the way that web surfers retrieve results from a search engine when they type in a simple query.
What enables this result is the metadata attached to each of these data elements (mammograms), which would include (i) enough identifying information about the patient to allow the data to be located (not necessarily a universal patient identifier), (ii) privacy protection information-who may access the mammograms, either identified or de-identified, and for what purposes, (iii) the provenance of the data-the date, time, type of equipment used, personnel (physician, nurse, or technician), and so forth."
The HIT Standards Committee (HITSC) Metadata Tiger Team made specific recommendations to the ONC in June 2011. These recommendations included the use of:

  • Policy Pointers: URLs that point to external policy documents affecting the tagged data element.
  • Content Metadata: the actual metadata with datatype (clinical category) and sensitivity (e.g., substance abuse and mental health).
  • Use of the HL7 CDA R2 with headers.

Based on those recommendations, the ONC published a Notice of Proposed Rule Making (NPRM) in August 2011 to receive comments on proposed metadata standards.

The Data Segmentation Working Group of the ONC Standards and Interoperability Framework is currently working on metadata tagging for compliance with privacy policies and consent directives.


The Annotea Protocol


The capability to add arbitrary metadata to documents without modifying them has been available in the Semantic Web for at least a decade. Indeed, it is hard to talk about metadata without a reference to the Semantic Web. I will use the W3C Annotea Protocol (which is implemented by the Amaya open source project) to demonstrate this capability. I will also show that this approach does not require the use of the Resource Description Framework (RDF) format and related Semantic Web technologies like OWL and SPARQL. The approach can be adapted to alternative representation formats such as XML, JSON, or the Atom syndication format. Let's assume that I need to add metadata tags to the CDA document below. The CDA document has only one problem entry, for substance abuse disorder (SNOMED CT code 66214007), and my goal is to attach privacy metadata prohibiting the disclosure of that information (the most relevant elements are the entry element and the value element that carries the SNOMED CT code):

<ClinicalDocument>
.....
<component>
<structuredBody>
<component>
<!--Problems-->
<section>
<templateId root="2.16.840.1.113883.3.88.11.83.103"
    assigningAuthorityName="HITSP/C83"/>
<templateId root="1.3.6.1.4.1.19376.1.5.3.1.3.6"
    assigningAuthorityName="IHE PCC"/>
<templateId root="2.16.840.1.113883.10.20.1.11" assigningAuthorityName="HL7 CCD"/>
<!--Problems section template-->
<code code="11450-4" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC"
    displayName="Problem list"/>
<title>Problems</title>
<text>...</text>
<entry typeCode="DRIV">
<act classCode="ACT" moodCode="EVN">
    <templateId root="2.16.840.1.113883.3.88.11.83.7"
        assigningAuthorityName="HITSP C83"/>
    <templateId root="2.16.840.1.113883.10.20.1.27"
        assigningAuthorityName="CCD"/>
    <templateId root="1.3.6.1.4.1.19376.1.5.3.1.4.5.1"
        assigningAuthorityName="IHE PCC"/>
    <templateId root="1.3.6.1.4.1.19376.1.5.3.1.4.5.2"
        assigningAuthorityName="IHE PCC"/>
    <!-- Problem act template -->
    <id root="6a2fa88d-4174-4909-aece-db44b60a3abb"/>
    <code nullFlavor="NA"/>
    <statusCode code="completed"/>
    <effectiveTime>
        <low value="1950"/>
        <high nullFlavor="UNK"/>
    </effectiveTime>
    <performer typeCode="PRF">
        <assignedEntity>
            <id extension="PseudoMD-2" root="2.16.840.1.113883.3.72.5.2"/>
            <addr/>
            <telecom/>
        </assignedEntity>
    </performer>
    <entryRelationship typeCode="SUBJ" inversionInd="false">
        <observation classCode="OBS" moodCode="EVN">
            <templateId root="2.16.840.1.113883.10.20.1.28"
                assigningAuthorityName="CCD"/>
            <templateId root="1.3.6.1.4.1.19376.1.5.3.1.4.5"
                assigningAuthorityName="IHE PCC"/>
            <!--Problem observation template - NOT episode template-->
            <id root="d11275e7-67ae-11db-bd13-0800200c9a66"/>
            <code code="64572001" displayName="Condition"
                codeSystem="2.16.840.1.113883.6.96"
                codeSystemName="SNOMED-CT"/>
            <text>
                <reference value="#PROBSUMMARY_1"/>
            </text>
            <statusCode code="completed"/>
            <effectiveTime>
                <low value="1950"/>
            </effectiveTime>
            <value  displayName="Substance Abuse Disorder" code="66214007" codeSystemName="SNOMED" codeSystem="2.16.840.1.113883.6.96"/>
            <entryRelationship typeCode="REFR">
                <observation classCode="OBS" moodCode="EVN">
                    <templateId root="2.16.840.1.113883.10.20.1.50"/>
                    <!-- Problem status observation template -->
                    <code code="33999-4" codeSystem="2.16.840.1.113883.6.1"
                        displayName="Status"/>
                    <statusCode code="completed"/>
                    <value  code="55561003"
                        codeSystem="2.16.840.1.113883.6.96"
                        displayName="Active">
                        <originalText>
                        <reference value="#PROBSTATUS_1"/>
                        </originalText>
                    </value>
                </observation>
            </entryRelationship>
        </observation>
    </entryRelationship>
</act>
</entry>
</section>
</component>
</structuredBody>
</component>
</ClinicalDocument>




The following is a separate annotation document containing some metadata pointing to the Substance Abuse Disorder entry in the target CDA document:

<r:RDF xmlns:r="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:a="http://www.w3.org/2000/10/annotation-ns#"
    xmlns:d="http://purl.org/dc/elements/1.1/">
    <r:Description>
        <r:type r:resource="http://www.w3.org/2000/10/annotation-ns#Annotation"/>
        <r:type r:resource="http://www.w3.org/2000/10/annotationType#Metadata"/>
        <a:annotates r:resource="http://hospitalx.com/ehrs/cda.xml"/>
        <a:context>http://hospitalx.com/ehrs/cda.xml#xpointer(/ClinicalDocument/component/structuredBody/component[1]/section[1]/entry[1])</a:context>
        <d:title>Sample Metadata Tagging</d:title>
        <d:creator>Bob Smith</d:creator>
        <a:created>2011-10-14T12:10Z</a:created>
        <d:date>2011-10-14T12:10Z</d:date>
        <a:body>Do Not Disclose</a:body>
    </r:Description>
</r:RDF>

Please note a few interesting facts about the annotation document:

  • As explained by the original specification: "The Annotea protocol works without modifying the original document; that is, there is no requirement that the user have write access to the Web page being annotated."
  • The annotation itself has metadata using the well known Dublin Core metadata specification to specify who created this annotation and when.
  • The document being annotated is cda.xml located at http://hospitalx.com/ehrs/cda.xml. This is described by the element <a:annotates r:resource="http://hospitalx.com/ehrs/cda.xml"/>.
  • The specific element that is being annotated within the target CDA document is specified by the context element: <a:context>http://hospitalx.com/ehrs/cda.xml#xpointer(/ClinicalDocument/component/structuredBody/component[1]/section[1]/entry[1])</a:context> using XPointer, a specification described by the W3C as "the language to be used as the basis for a fragment identifier for any URI reference that locates a resource whose Internet media type is one of text/xml, application/xml, text/xml-external-parsed-entity, or application/xml-external-parsed-entity."
  • The XPath expression /ClinicalDocument/component/structuredBody/component[1]/section[1]/entry[1] within the XPointer is used to target the entry element in the CDA document.
  • Using XPath (1.0 or 2.0) allows us to address any element (or node) in an XML document. For example, this XPath //value[@code='66214007']/ancestor::entry will point to any entry element which contains a value element with an attribute code='66214007' (essentially targeting all entry elements which contain a Substance Abuse Observation). The combination of XPath, XPointer, and standard medical terminology codes gives the ability to attach any annotation or metadata to any element having interoperable semantics.
  • The body element contains the actual annotation: <a:body>Do Not Disclose</a:body>. However, the body of the annotation can also be located outside of the annotation (e.g., in a shared metadata registry) in which case the body element will be marked up as in the following example: <a:body r:resource="http://metadataregistry.com/myconsentdirectives.xml"/>
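
The points above can be sketched in a few lines of code. The following is a minimal illustration, not part of the Annotea specification: it uses Python's standard xml.etree.ElementTree module to read a trimmed copy of the annotation document and to split the context value into the target document URI and the XPath expression wrapped by the XPointer. The function name read_annotation and the keys of the returned dictionary are my own assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the annotation document above.
NS = {
    "r": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "a": "http://www.w3.org/2000/10/annotation-ns#",
    "d": "http://purl.org/dc/elements/1.1/",
}

# Trimmed copy of the annotation document shown above.
ANNOTATION = """<r:RDF xmlns:r="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:a="http://www.w3.org/2000/10/annotation-ns#"
    xmlns:d="http://purl.org/dc/elements/1.1/">
    <r:Description>
        <a:annotates r:resource="http://hospitalx.com/ehrs/cda.xml"/>
        <a:context>http://hospitalx.com/ehrs/cda.xml#xpointer(/ClinicalDocument/component/structuredBody/component[1]/section[1]/entry[1])</a:context>
        <d:creator>Bob Smith</d:creator>
        <a:body>Do Not Disclose</a:body>
    </r:Description>
</r:RDF>"""


def read_annotation(rdf_xml):
    """Hypothetical helper: extract the target, the XPath, and the body."""
    desc = ET.fromstring(rdf_xml).find("r:Description", NS)
    context = desc.findtext("a:context", namespaces=NS)
    # The context has the form "<document URI>#xpointer(<XPath>)".
    document, _, fragment = context.partition("#")
    if not (fragment.startswith("xpointer(") and fragment.endswith(")")):
        raise ValueError("unsupported fragment identifier: " + fragment)
    return {
        # Attributes in a namespace are keyed as "{namespace}name".
        "annotates": desc.find("a:annotates", NS).get("{%s}resource" % NS["r"]),
        "document": document,
        "xpath": fragment[len("xpointer("):-1],
        "creator": desc.findtext("d:creator", namespaces=NS),
        "body": desc.findtext("a:body", namespaces=NS),
    }


annotation = read_annotation(ANNOTATION)
print(annotation["xpath"])
print(annotation["body"])
```

This sketch only handles the simple xpointer(...) scheme used in this post and an inline body; a body referenced through r:resource would need an extra lookup.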

Alternative Representations

 

As mentioned before, for those who for one reason or another don't want to use RDF and related Semantic Web technologies, the annotation can be easily converted to a pure XML (as opposed to RDF/XML), JSON, or Atom representation. The original Annotea Protocol describes a RESTful protocol which includes the following operations: posting, querying, downloading, updating, and deleting annotations. The Atom Publishing Protocol (APP) is a newer RESTful protocol that is well adapted to the Atom syndication format.
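
To give an idea of what such a conversion could look like, here is a minimal sketch that turns the RDF/XML annotation into JSON using only the Python standard library. The JSON field names (annotates, context, title, creator, created, body) are my own choice for illustration; neither Annotea nor any health IT standard defines this JSON layout.

```python
import json
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ANNOTEA = "http://www.w3.org/2000/10/annotation-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# Trimmed copy of the annotation document from this post.
ANNOTATION = f"""<r:RDF xmlns:r="{RDF}" xmlns:a="{ANNOTEA}" xmlns:d="{DC}">
  <r:Description>
    <a:annotates r:resource="http://hospitalx.com/ehrs/cda.xml"/>
    <a:context>http://hospitalx.com/ehrs/cda.xml#xpointer(/ClinicalDocument/component/structuredBody/component[1]/section[1]/entry[1])</a:context>
    <d:title>Sample Metadata Tagging</d:title>
    <d:creator>Bob Smith</d:creator>
    <a:created>2011-10-14T12:10Z</a:created>
    <a:body>Do Not Disclose</a:body>
  </r:Description>
</r:RDF>"""


def annotation_to_dict(rdf_xml):
    """Map the annotation onto a flat dictionary (hypothetical field names)."""
    desc = ET.fromstring(rdf_xml).find(f"{{{RDF}}}Description")

    def text(namespace, name):
        return desc.findtext(f"{{{namespace}}}{name}")

    return {
        "annotates": desc.find(f"{{{ANNOTEA}}}annotates").get(f"{{{RDF}}}resource"),
        "context": text(ANNOTEA, "context"),
        "title": text(DC, "title"),
        "creator": text(DC, "creator"),
        "created": text(ANNOTEA, "created"),
        "body": text(ANNOTEA, "body"),
    }


annotation_json = json.dumps(annotation_to_dict(ANNOTATION), indent=2)
print(annotation_json)
```

The same dictionary could just as well be serialized as an Atom entry or a plain XML element; the point is only that the annotation model does not depend on RDF.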


Processing Annotations with XPointer


How the annotations are processed and consumed is only limited by the requirements of a specific application and the imagination of the developers writing it. For example, an application can read both the annotation document and the target CDA document and overlay the annotations on top of the entries in the CDA document while displaying the latter in a web browser. Another example is the enforcement of privacy policies and preferences prior to exchanging the CDA document. The issue that will be raised is how to process the XPointer fragment identifiers. XPointer uses XPath which is a well established XML addressing mechanism supported by many XML processing APIs across programming languages. For those of you who use XSLT2 to process CDA documents, there is the open source XPointer Framework for XSLT2 for use with the Saxon XSLT2 engine.
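
As a rough illustration of the privacy enforcement scenario, the sketch below removes the annotated entry from a heavily abbreviated copy of the CDA document before it is exchanged. It relies on Python's standard xml.etree.ElementTree, which only supports a subset of XPath and only relative paths, so the absolute path from the XPointer is rewritten relative to the root element. The function name redact is hypothetical, and the sample assumes, as the listing in this post does, that the document declares no XML namespace.

```python
import xml.etree.ElementTree as ET

# Heavily abbreviated version of the CDA document shown in this post.
CDA = """<ClinicalDocument>
  <component><structuredBody><component>
    <section>
      <title>Problems</title>
      <entry typeCode="DRIV">
        <act classCode="ACT" moodCode="EVN">
          <entryRelationship typeCode="SUBJ">
            <observation classCode="OBS" moodCode="EVN">
              <value displayName="Substance Abuse Disorder" code="66214007"
                     codeSystem="2.16.840.1.113883.6.96"/>
            </observation>
          </entryRelationship>
        </act>
      </entry>
    </section>
  </component></structuredBody></component>
</ClinicalDocument>"""

# XPath taken from the XPointer in the annotation document.
XPATH = "/ClinicalDocument/component/structuredBody/component[1]/section[1]/entry[1]"


def redact(cda_xml, xpath):
    """Hypothetical helper: drop every element matched by an absolute XPath."""
    root = ET.fromstring(cda_xml)
    # ElementTree rejects absolute paths, so "/Root/a/b" becomes "./a/b".
    root_step, _, rest = xpath.lstrip("/").partition("/")
    if root_step != root.tag or not rest:
        raise ValueError("XPath must start at the root and go below it")
    # ElementTree elements do not know their parent, so build a lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    removed = 0
    for target in root.findall("./" + rest):
        parent_of[target].remove(target)
        removed += 1
    return ET.tostring(root, encoding="unicode"), removed


redacted, removed = redact(CDA, XPATH)
print(removed)
print(redacted)
```

Expressions that use axes such as ancestor:: are outside the subset that ElementTree understands; a full XPath 1.0 engine (for example the one in lxml) would be needed for those.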

Monday, February 6, 2012

Toward Intelligent Health IT (iHIT) Systems: Getting Out of the Box

In this post, I describe a new type of application that I refer to as iHIT. iHIT stands for Intelligent Health IT.

The Architecture of Traditional Health IT systems

Traditional software architectures for health IT systems typically include the following:

  • Dependency Injection (DI)

  • Object Relational Mapping (ORM)

  • An architectural pattern for the presentation layer such as the Model View Controller (MVC) pattern

  • HTML5, CSS3, and a JavaScript library like jQuery or jQuery Mobile

  • Other architectural patterns including GoF Design Patterns, SOLID Principles, and Domain Driven Design (DDD)

  • Structured Query Language (SQL)

  • Enterprise Integration Patterns (EIPs) implemented through an Enterprise Service Bus (ESB) using HL7 messages as the "Published Language"

  • REST or SOAP-based web services.

An entire generation of developers has been trained in these techniques. They represent proven best practices accumulated over several decades of object-oriented design and relational data management. Although pervasive in today's clinical systems, these applications lack basic intelligent features such as the ability to capture and execute expert knowledge, make inferences, or make predictions about the future based on the analysis of historical data. Some of these systems actually look like glorified data entry systems.

With the availability and explosion of medical knowledge and real world observational EHR data, these intelligent features will become important in assisting clinicians in the medical decision making process at the point of care by reducing their cognitive load.

Intelligent Health IT (iHIT) Systems

iHIT systems process huge quantities of both structured and unstructured data to provide clinicians with specific recommendations. iHIT systems play an important role in translating Comparative Effectiveness Research (CER) findings into clinical practice. CER, an emerging trend in Evidence Based Medicine (EBM), has been defined by the Federal Coordinating Council for CER as "the conduct and synthesis of research comparing the benefits and harms of different interventions and strategies to prevent, diagnose, treat and monitor health conditions in 'real world' settings." For example, based on the clinical profile of a patient, CER can help determine the best treatment option for breast cancer among the various options available, such as chemotherapy, radiation therapy, and surgery (mastectomy or lumpectomy).

The following are examples of key characteristics displayed by iHIT systems:

  • The ability to analyze patient data as well as very large historical observational data sets in order to make probability-based predictions about the future and recommend specific actions that can yield the best clinical outcomes given the clinical profile of a patient.

  • The ability to capture and execute expert knowledge such as the medical knowledge contained in Clinical Practice Guidelines (CPGs). This includes the ability to mediate between different CPGs to arrive at a specific recommendation by merging and reconciling the medical knowledge in multiple CPGs as is the case with patients with comorbidities.

  • The ability to perform automated reasoning by inferring new implicit clinical facts from existing explicit facts and by exploiting semantic relationships between concepts and entities.

  • The ability to retrieve knowledge from unstructured data sources such as the biomedical research literature from sources like PubMed in order to answer clinical questions sometimes posed in natural language.

  • The ability to learn over time (and hence become smarter) as the amount of processed data continues its exponential growth.

  • Very fast response time to queries over very large data sets.


Sounds like Artificial Intelligence (AI)? I believe we are indeed witnessing the resurgence of AI and even the ideas of the Semantic Web in the healthcare industry. As healthcare costs and quality become national priorities for many countries around the world, the boundaries of computing will continue to be pushed further. Actually, some of the underlying principles of intelligent systems were originally developed decades and even centuries ago in the field of biomedical research. William Osler (1849-1919) famously said:

Medicine is a science of uncertainty and an art of probability.

Technologically advanced and competitive industries like financial services (e.g., credit eligibility and fraud detection), online retail (e.g., recommendation engine), and logistics (e.g., delivery route optimization) have adopted some of these technologies. Health IT developers now need to embrace them as well. This will require thinking out of the box.


The Ingredients of iHIT Systems

iHIT systems represent not one, but the integration of many different technologies. Mathematical Models, Statistical Analysis, and Machine Learning algorithms play an important role in iHIT systems. Examples include:

  • Logistic Regression models

  • Decision Trees

  • Association Rules

  • Bayesian Networks

  • Neural Networks

  • Random Forests

  • Time Series for temporal reasoning

  • k-means Clustering

  • Support Vector Machines (SVM)

  • Probabilistic Graphical Models (PGMs) based on methods such as Bayesian networks and Markov Networks for making clinical decisions under uncertainty.

These algorithms can be used not only for making therapeutic predictions (e.g., the future hospitalization risk of a patient with Asthma), but also for dividing a population into subgroups based on the clinical profile of patients in order to achieve the best treatment outcomes.
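
To make the idea of a probability-based prediction a little more concrete, here is a minimal sketch of a logistic regression model trained with batch gradient descent in plain Python. Everything about the data is an assumption made for illustration: the two features (number of prior admissions and whether the patient uses a controller medication) and the ten rows are synthetic toy values that I invented, not clinical data, and a real project would use a statistical package and a proper validation protocol.

```python
import math

# Synthetic toy data, invented for illustration only.
# features = (prior admissions in the last year, on controller medication 0/1)
# label    = 1 if the patient was hospitalized, else 0
DATA = [
    ((0, 1), 0), ((0, 1), 0), ((1, 1), 0), ((0, 0), 0), ((1, 0), 0),
    ((2, 1), 0), ((2, 0), 1), ((3, 1), 1), ((3, 0), 1), ((4, 0), 1),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Estimated probability of the positive class for one patient."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def fit(data, learning_rate=0.1, epochs=5000):
    """Batch gradient descent on the mean log-loss."""
    n_features = len(data[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in data:
            error = predict(weights, bias, features) - label
            grad_b += error
            for j, x in enumerate(features):
                grad_w[j] += error * x
        bias -= learning_rate * grad_b / len(data)
        weights = [w - learning_rate * g / len(data)
                   for w, g in zip(weights, grad_w)]
    return weights, bias


weights, bias = fit(DATA)
low_risk = predict(weights, bias, (0, 1))
high_risk = predict(weights, bias, (4, 0))
print(round(low_risk, 3), round(high_risk, 3))
```

The output of predict is a probability between 0 and 1, which is what allows a system to rank patients by risk rather than simply label them.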

Clinical Practice Guidelines (CPGs) are usually based on Systematic Reviews (SRs) of Randomized Controlled Trials (RCTs), which are essentially scientific experiments. According to a report titled "Clinical Practice Guidelines (CPGs) We Can Trust", which was published last year by the Institute of Medicine (IOM):

However, even when studies are considered to have high internal validity, they may not be generalizable to or valid for the patient population of guideline relevance. Randomized trials commonly have an under representation of important subgroups, including those with comorbidities, older persons, racial and ethnic minorities, and low-income, less educated, or low-literacy patients. Many RCTs and observational studies fail to include such "typical patients" in their samples; even when they do, there may not be sufficient numbers of such patients to assess them separately or the subgroups may not be properly analyzed for differences in outcomes.

On the other hand, observational studies using statistical analysis and machine learning algorithms operate on large real world observational data and can therefore provide feedback on the effectiveness of the actual use of different therapeutic interventions. Although very costly, RCTs are still considered the strongest form of evidence in EBM. Despite their inherent methodological challenges (lack of randomization leading to possible bias and confounding), observational studies are increasingly recognized as complementary to RCTs and an important tool in clinical decision making and health policy. iHIT systems play an important role in translating Comparative Effectiveness Research (CER) findings into clinical practice in the form of clinical decision support (CDS) interventions at the point of care.

iHIT systems also use business rules engines to capture and execute expert knowledge such as the medical knowledge contained in Clinical Practice Guidelines (CPGs). Examples include rules engines based on forward chaining inference, also known as production rule systems. These rules engines can be combined with Complex Event Processing (CEP) and Business Process Management (BPM) for intelligent decision making.
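
The principle of forward chaining can be sketched in a few lines of Python. This is a deliberately naive loop and not how production rule engines are implemented (they typically rely on optimized matching algorithms such as Rete). The rules and fact names below are hypothetical and are NOT taken from any real Clinical Practice Guideline.

```python
def forward_chain(facts, rules):
    """Fire rules until no rule can add a new fact (naive forward chaining)."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # A rule fires when all of its conditions are known facts.
            if conclusion not in known and conditions <= known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical rules for illustration only: (conditions, conclusion).
RULES = [
    (frozenset({"diagnosis:asthma", "symptoms:daily"}),
     "classification:persistent_asthma"),
    (frozenset({"classification:persistent_asthma", "medication:no_controller"}),
     "recommendation:review_controller_therapy"),
    (frozenset({"recommendation:review_controller_therapy"}),
     "task:notify_clinician"),
]

derived = forward_chain(
    {"diagnosis:asthma", "symptoms:daily", "medication:no_controller"}, RULES
)
print(sorted(derived))
```

Note how the third rule only fires because the second rule added a new fact; this chaining of conclusions is what distinguishes a rule engine from a flat list of if statements.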

iHIT systems support ontologies such as those represented by the web ontology language (OWL) providing reasoning capabilities as well as the ability to navigate semantic relationships between concepts and entities.
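
The simplest form of such reasoning is subsumption: if a concept "is a" kind of another concept, anything stated about the parent applies to the child. The sketch below computes this over a tiny hand-written hierarchy in plain Python. The concept names and edges are hypothetical and do not reproduce the actual SNOMED CT hierarchy; a real system would load an OWL ontology and use a dedicated reasoner.

```python
# Hypothetical "is-a" edges (child -> parents), for illustration only.
IS_A = {
    "AlcoholAbuse": ["SubstanceAbuseDisorder"],
    "SubstanceAbuseDisorder": ["MentalDisorder"],
    "MentalDisorder": ["Disorder"],
    "Asthma": ["RespiratoryDisorder"],
    "RespiratoryDisorder": ["Disorder"],
}


def ancestors(concept, is_a):
    """All concepts reachable by following is-a edges upward."""
    seen = set()
    stack = list(is_a.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def is_subsumed_by(concept, candidate, is_a):
    """True if concept is the candidate itself or one of its descendants."""
    return concept == candidate or candidate in ancestors(concept, is_a)


# The fact "AlcoholAbuse is a MentalDisorder" is never stated explicitly;
# it is inferred from the two explicit edges above.
print(is_subsumed_by("AlcoholAbuse", "MentalDisorder", IS_A))
print(is_subsumed_by("Asthma", "MentalDisorder", IS_A))
```

In a privacy scenario, for example, a rule written once against a sensitive parent concept would automatically cover every more specific concept beneath it.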

More advanced iHIT systems have Natural Language Processing (NLP) and Automatic Speech Recognition (ASR) capabilities in order to answer clinical questions posed in natural language. They rely on Information Retrieval techniques like probabilistic methods for scoring the relevance of a document given a query and the application of supervised machine learning classification methods such as decision trees, Naive Bayes, K-Nearest Neighbors (kNN), and Support Vector Machines (SVM).
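
As one example of a probabilistic relevance scoring method, the sketch below implements the Okapi BM25 ranking function in plain Python. BM25 is my choice for illustration; the text above does not prescribe a specific method. The three "documents" are toy keyword strings that I made up, not real article titles, and the tokenizer is a simple whitespace split. The parameter defaults k1=1.5 and b=0.75 are commonly used values, and the idf formula is the smoothed variant that never goes negative.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            # Term frequency saturates and is normalized by document length.
            norm = tf + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores


# Toy keyword strings invented for illustration, not real publications.
DOCS = [
    "asthma inhaled corticosteroids children exacerbations",
    "breast cancer lumpectomy mastectomy outcomes",
    "statin therapy cardiovascular risk",
]
QUERY = "asthma corticosteroids children"

scores = bm25_scores(QUERY, DOCS)
best = scores.index(max(scores))
print(best, DOCS[best])
```

A question answering pipeline would typically use a score like this to retrieve candidate passages and then apply a classifier or other NLP components to extract the answer.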

In some cases, the responsibilities of an iHIT system are performed by Intelligent Agents which are autonomous entities capable of observing the clinical environment and acting upon those observations.

For scalability and performance, iHIT systems often sit on NoSQL databases and run on massively parallel computing platforms like Apache Hadoop while leveraging the elasticity of the cloud.

Integrating these technologies is the main challenge posed by iHIT systems. An example is the integration between statistical and machine learning models, business rules, ontologies, and more traditional forms of computing such as object-oriented programming. Various solutions to these challenges have been proposed and implemented.

Human-Centered Design

Finally, iHIT systems fully embrace a human-centered design approach. They provide a seamless integration between automated decision logic and clinical workflows. They provide the clinician with detailed explanations of the rationale behind the actions they recommend. In addition, they use techniques like Visual Analytics to enhance human cognitive abilities in order to facilitate analytical reasoning over very large data sets.