Sunday, August 17, 2014

Natural Language Processing (NLP) for Clinical Decision Support: A Practical Approach

A significant portion of the electronic documentation of clinical care is captured in the form of unstructured narrative text such as psychotherapy and progress notes. Despite the big push to adopt structured data entry (as required by the Meaningful Use incentive program, for example), many clinicians still prefer to document care in free narrative text. The advantage of narrative text over coded entries is that it can tell the story of the patient and the care provided, particularly in complex cases. My opinion is that free narrative text should be used to complement coded entries when necessary to capture relevant information.

Furthermore, medical knowledge is expanding very rapidly. For example, PubMed has more than 24 million citations for biomedical literature from MEDLINE, life science journals, and online books. It is impossible for the human brain to keep up with that amount of knowledge. These unstructured sources of knowledge contain the scientific evidence required for effective clinical decision making in what is referred to as Evidence-Based Medicine (EBM).

In this post, I discuss two practical applications of Natural Language Processing (NLP). The first is the use of NLP tools and techniques to automatically extract clinical concepts and other insights from clinical notes for the purpose of providing treatment recommendations in Clinical Decision Support (CDS) systems. The second is the use of text analytics techniques like clustering and summarization for Clinical Question Answering (CQA).

The emphasis of this post is on a practical approach using freely available and mature open source tools as opposed to an academic or theoretical approach. For a theoretical treatment of the subject, please refer to the book Speech and Language Processing by Daniel Jurafsky and James Martin.


Clinical NLP with Apache cTAKES


Based on the Apache Unstructured Information Management Architecture (UIMA) framework and the Apache OpenNLP natural language processing toolkit, Apache cTAKES provides a modular architecture utilizing both rule-based and machine learning techniques for information extraction from clinical notes. cTAKES can extract named entities (clinical concepts) from clinical notes in plain text or HL7 CDA format and map these entities to various dictionaries including the following Unified Medical Language System (UMLS) semantic types: diseases/disorders, signs/symptoms, anatomical sites, procedures, and medications.

cTAKES includes the following key components which can be assembled to create processing pipelines:

  • Sentence boundary detector based on the OpenNLP Maximum Entropy (ME) sentence detector.
  • Tokenizer
  • Normalizer using the National Library of Medicine's Lexical Variant Generation (LVG) tool
  • Part-of-speech (POS) tagger
  • Shallow parser
  • Named Entity Recognition (NER) annotator using dictionary look-up to UMLS concepts and semantic types. The Drug NER can extract drug entities and their attributes such as dosage, strength, route, etc.
  • Assertion module, which determines the subject of the statement (e.g., whether the statement refers to the patient or to a parent of the patient) and whether a named entity or event is negated (e.g., does the presence of the word "depression" in the text imply that the patient has depression?).
Apache cTAKES 3.2 has added YTEX, a set of extensions developed at Yale University which provide integration with MetaMap, semantic similarity, export to Machine Learning packages like Weka and R, and feature engineering.
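
To make the pipeline idea concrete, here is a minimal sketch of running the components listed above from Java using uimaFIT. It assumes the cTAKES 3.2 modules are on the classpath and uses the ClinicalPipelineFactory helper from the clinical-pipeline module; exact class and package names may differ in other releases, and the sample sentence is purely illustrative.

```java
import org.apache.ctakes.clinicalpipeline.ClinicalPipelineFactory;
import org.apache.ctakes.typesystem.type.textsem.IdentifiedAnnotation;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.factory.JCasFactory;
import org.apache.uima.fit.pipeline.SimplePipeline;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;

public class CtakesDemo {
    public static void main(String[] args) throws Exception {
        // Assemble the default cTAKES pipeline (sentence detector, tokenizer, POS tagger,
        // chunker, dictionary-lookup NER, assertion, etc.). The UMLS dictionary lookup
        // requires UMLS credentials to be configured via system properties.
        AnalysisEngineDescription pipeline = ClinicalPipelineFactory.getDefaultPipeline();

        JCas jcas = JCasFactory.createJCas();
        jcas.setDocumentText("Patient denies chest pain but reports shortness of breath.");

        // Run the pipeline and print every identified clinical concept with its polarity
        // (a negative polarity indicates a negated mention).
        SimplePipeline.runPipeline(jcas, pipeline);
        for (IdentifiedAnnotation ann : JCasUtil.select(jcas, IdentifiedAnnotation.class)) {
            System.out.println(ann.getCoveredText() + " | polarity=" + ann.getPolarity());
        }
    }
}
```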

A diagram on the Apache cTAKES wiki provides an overview of these components and their dependencies.


Massively Parallel Clinical Text Analytics in the Cloud with GATECloud


The General Architecture for Text Engineering (GATE) is a mature, comprehensive, and open source text analytics platform. GATE is a family of tools which includes:

  • GATE Developer: an integrated development environment (IDE) for language processing components with a comprehensive set of available plugins called CREOLE (Collection of REusable Objects for Language Engineering). 
  • GATE Embedded: an object library for embedding services developed with GATE Developer into third-party applications.
  • GATE Teamware: a collaborative semantic annotation environment based on a workflow engine for creating manually annotated corpora for applying machine learning algorithms. 
  • GATE Mímir: the "Multi-paradigm Information Management Index and Repository" which supports a multi-paradigm approach to index and search over text, ontologies, and semantic metadata.
  • GATE Cloud: a massively parallel clinical text analytics platform (Platform as a Service or PaaS) built on the Amazon AWS Cloud.
What makes GATE particularly attractive is the recent addition of the GATECloud.net PaaS, which can boost the productivity of teams working on large-scale text analytics tasks.
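
As a rough illustration of GATE Embedded (from the list above), the sketch below loads the stock ANNIE application and runs it over a one-document corpus. It assumes a local GATE installation whose plugin layout matches the GATE 8 defaults (ANNIEConstants pointing at the bundled ANNIE_with_defaults.gapp); a clinical pipeline would swap in domain-specific plugins and gazetteers.

```java
import java.io.File;
import gate.Corpus;
import gate.CorpusController;
import gate.Document;
import gate.Factory;
import gate.Gate;
import gate.creole.ANNIEConstants;
import gate.util.persistence.PersistenceManager;

public class GateEmbeddedDemo {
    public static void main(String[] args) throws Exception {
        Gate.init();  // assumes gate.home is set via a system property or the environment

        // Load the saved ANNIE application shipped with GATE
        // (an assumption about the default install layout).
        File annieGapp = new File(new File(Gate.getPluginsHome(), ANNIEConstants.PLUGIN_DIR),
                                  ANNIEConstants.DEFAULT_FILE);
        CorpusController annie = (CorpusController) PersistenceManager.loadObjectFromFile(annieGapp);

        // Build a one-document corpus and run the pipeline over it.
        Corpus corpus = Factory.newCorpus("demo");
        Document doc = Factory.newDocument("The patient was prescribed lisinopril in Boston.");
        corpus.add(doc);
        annie.setCorpus(corpus);
        annie.execute();

        // Print the annotations produced by ANNIE (Token, Person, Location, etc.).
        System.out.println(doc.getAnnotations());
    }
}
```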

 

Clustering, Classification, Text Summarization, and Clinical Question Answering (CQA)

 

An unsupervised machine learning technique called clustering can be used to organize large volumes of medical literature into groups (clusters) based on a similarity measure (such as the Euclidean distance). Clustering can be applied at the document, search result, and word/topic levels. Carrot2 and Apache Mahout are open source projects that provide several methods for document clustering. For example, the Latent Dirichlet Allocation (LDA) algorithm in Apache Mahout automatically clusters words into topics and documents into mixtures of topics. Other clustering algorithms in Apache Mahout include Canopy, Mean-Shift, Spectral, K-Means, and Fuzzy K-Means. Apache Mahout is part of the Hadoop ecosystem and can therefore scale to very large volumes of unstructured text.
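
As an example of search-result clustering, the sketch below uses the Carrot2 3.x Java API with the Lingo algorithm. The documents and query are made up, and the API changed in Carrot2 4.x, so treat this as a sketch of the idea rather than a drop-in snippet.

```java
import java.util.ArrayList;
import java.util.List;
import org.carrot2.clustering.lingo.LingoClusteringAlgorithm;
import org.carrot2.core.Cluster;
import org.carrot2.core.Controller;
import org.carrot2.core.ControllerFactory;
import org.carrot2.core.Document;
import org.carrot2.core.ProcessingResult;

public class ClusteringDemo {
    public static void main(String[] args) {
        // Hypothetical search results (title + snippet) to be clustered.
        List<Document> docs = new ArrayList<>();
        docs.add(new Document("Asthma management", "Inhaled corticosteroids as first-line controller therapy."));
        docs.add(new Document("Asthma triggers", "Allergens and exercise as common asthma triggers."));
        docs.add(new Document("COPD overlap", "Distinguishing asthma from COPD in older adults."));

        // Cluster the documents with the Lingo algorithm; each cluster gets a
        // human-readable label derived from the document text.
        Controller controller = ControllerFactory.createSimple();
        ProcessingResult result = controller.process(docs, "asthma treatment",
                                                     LingoClusteringAlgorithm.class);
        for (Cluster cluster : result.getClusters()) {
            System.out.println(cluster.getLabel() + " (" + cluster.getAllDocuments().size() + " docs)");
        }
    }
}
```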

Document classification consists of assigning labels from a predefined set to documents. This can be achieved through supervised machine learning algorithms. Apache Mahout implements the Naive Bayes classifier.
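
For intuition, a multinomial Naive Bayes classifier labels a document d containing words w1, ..., wn with the class c that maximizes the posterior probability, under the (naive) assumption that words are conditionally independent given the class:

\[
\hat{c} \;=\; \arg\max_{c} \; P(c \mid d) \;=\; \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
\]

The class priors P(c) and word likelihoods P(w_i | c) are estimated from labeled training documents, and implementations typically sum log probabilities to avoid numerical underflow.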

Text summarization techniques can be used to present succinct and clinically relevant evidence to clinicians at the point of care. MEAD (http://www.summarization.com/mead/) is an open source project that implements multiple summarization algorithms. In the biomedical domain, SemRep is a program that extracts semantic predications (subject-relation-object triples) from biomedical free text. Subject and object arguments of each predication are concepts from the UMLS Metathesaurus and the relation is from the UMLS Semantic Network (e.g., TREATS, CO-OCCURS_WITH). The SemRep summarization provides a short summary of these concepts and their semantic relations.

AskHermes (Help clinicians to Extract and aRticulate Multimedia information for answering clinical quEstionS) is a project that attempts to implement these techniques in the clinical domain. It allows clinicians to enter questions in natural language and draws on the following unstructured information sources: MEDLINE abstracts, PubMed Central full-text articles, eMedicine documents, clinical guidelines, and Wikipedia articles.

The processing pipeline in AskHermes includes the following stages: Question Analysis, Related Questions Extraction, Information Retrieval, Summarization, and Answer Presentation. AskHermes performs question classification using MMTx (MetaMap Technology Transfer) to map keywords to UMLS concepts and semantic types. Classification is achieved through supervised machine learning algorithms such as Support Vector Machines (SVMs) and Conditional Random Fields (CRFs). Summarization and answer presentation are based on clustering techniques. AskHermes is powered by open source components including JBoss Seam, Weka, Mallet, Carrot2, Lucene/Solr, and WordNet (a lexical database for the English language).

Sunday, April 28, 2013

How I Make Technology Decisions

The open source community has responded to the increasing complexity of software systems by creating many frameworks which are supposed to facilitate the work of developing software. Software developers spend a considerable amount of time researching, learning, and integrating these frameworks to build new software products. Selecting the wrong technology can cost an organization millions of dollars. In this post, I describe my approach to selecting these frameworks. I also discuss the frameworks that have made it to my software development toolbox.

Understanding the Business


The first step is to build a strong understanding of the following:

  • The business goals and challenges of the organization. For example, the healthcare industry is currently shifting to a value-based payment model in a tightening regulatory environment. Healthcare organizations are looking for a computing infrastructure that supports new demands such as the Accountable Care Organization (ACO) model, patient-centered outcomes, patient engagement, care coordination, quality measures, bundled payments, and Patient-Centered Medical Homes (PCMH).

  • The intended buyers and users of the system and their concerns. For example, what are their pain points? Which devices are they using? What are their security and privacy concerns?

  • The standards and regulations of the industry.

  • The competitive landscape in the industry. To build a system that is relevant, it is important to have some idea of the following: Who are the competitors? What are the current capabilities of their systems? What is on their roadmap? What are customers saying about their products? This knowledge can help shape a Blue Ocean Strategy.

  • Emerging technology trends.

This type of knowledge comes with industry experience and a habit of continuously paying attention to these issues. For example, on a daily basis, I read industry news as well as scientific and technical publications. As a member of the American Medical Informatics Association (AMIA), I receive the latest issue of the Journal of the American Medical Informatics Association (JAMIA), which gives me access to cutting-edge research in medical informatics. I speak at industry conferences when possible, which allows me not only to hone my presentation skills but also to attend all sessions for free or at a discounted price. For the latest in software development, I turn to publications like InfoQ, DZone, and TechCrunch.

To better understand the users and their needs and concerns, I perform early usability testing (using sketches, wireframes, or mockups) to test design ideas and obtain feedback before actual development starts. For generating innovative design ideas, I recommend the following book: Universal Methods of Design: 100 Ways to Research Complex Problems, Develop Innovative Ideas, and Design Effective Solutions by Bruce Hanington and Bella Martin.

 

Architecting the Solution


Armed with a solid understanding of the business and technological landscape as well as the domain, I can start creating a solution architecture. Software development projects can be chaotic. Based on my experience working on many software development projects across industries, I found that Domain Driven Design (DDD) can help foster a disciplined approach to software development. For more on my experience with DDD, see my previous post entitled How Not to Build A Big Ball of Mud, Part 2.

Frameworks evolve over time. So, I make sure that the architecture is framework-agnostic and focused on supporting the domain. This allows me to retrofit the system in the future with new frameworks as they emerge.


 

Due Diligence


Software development is a rapidly evolving field. I keep my eyes on the radar and try not to drink the vendors' Kool-Aid. For example, not all vendors have a good track record in supporting standards, interoperability, and cross-platform solutions.

The ThoughtWorks Technology Radar is an excellent source of information and analysis on emerging trends in software. Its contributors include software thought leaders like Martin Fowler and Rebecca Parsons. I also look at surveys of the developer community to determine the popularity, community size, and usage statistics of competing frameworks and tools. Sites like InfoQ often conduct these types of surveys, such as the recent InfoQ survey on Top JavaScript MVC Frameworks. I also like Matt Raible's Comparing JVM Web Frameworks.

I value the opinion of recognized experts in the field of interest. I read their books, blogs, and watch their presentations. Before formulating my own position, I make sure that I read expert opinions on opposing sides of the argument. For example, in deciding on a pure Java EE vs. Spring Framework approach, I read arguments by experts on both sides (experts like Arun Gupta, Java EE Evangelist at Oracle and Adrian Colyer, CTO at SpringSource).

Finally, consider a peer review of the architecture using a methodology like the Architecture Tradeoff Analysis Method (ATAM). Simply going through the exercise of explaining the architecture to stakeholders and receiving feedback can significantly help in improving it.


Rapid Prototyping 

 

It's generally a good idea to create a rapid prototype to quickly learn and demonstrate the capabilities and value of the framework to the business. This can also generate excitement in the development team, particularly if the framework can enhance the productivity of developers and make their life easier.

 

The Frameworks I've Selected


The Spring Framework

I am a big fan of the Spring Framework. I believe it is really designed to support the needs of developers from a productivity standpoint. In addition to dependency injection (DI), Aspect Oriented Programming (AOP), and Spring MVC, I like the Spring Data repository abstraction for JPA, MongoDB, Neo4j, and Hadoop. Spring supports Polyglot Persistence and Big Data today. I use Spring Roo for rapid application development, which allows me to focus on modeling the domain. I use the Roo scaffolding feature to generate a lot of the Spring configuration and Java code for the domain, repository (Roo supports JPA and MongoDB), service, and web layers (Roo supports Spring MVC, JSF, and GWT). Spring also supports unit and integration testing with the recent release of Spring MVC Test.
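
As an illustration of the Spring Data repository abstraction, the interface below is all that is needed for basic CRUD and derived queries against a JPA entity; the Patient entity and query methods are hypothetical, and Spring Data generates the repository implementation at runtime.

```java
import java.util.List;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import org.springframework.data.jpa.repository.JpaRepository;

@Entity
class Patient {                       // minimal hypothetical JPA entity
    @Id @GeneratedValue Long id;
    String lastName;
    String diagnosisCode;
}

// Queries are derived from the method names; no implementation class is written by hand.
interface PatientRepository extends JpaRepository<Patient, Long> {
    List<Patient> findByLastNameIgnoreCase(String lastName);
    List<Patient> findByDiagnosisCode(String diagnosisCode);
}
```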

I use Spring Security, which allows me to use AOP and annotations to secure methods and supports advanced features like Remember Me and regular expressions for URLs. I think that JAAS is too low-level. Spring Security allows me to meet all OWASP Top Ten requirements (see my previous post entitled Application-Level Security in Health IT Systems: A Roadmap).
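
Here is a small sketch of annotation-driven method security, enabled with @EnableGlobalMethodSecurity(prePostEnabled = true) or the equivalent XML configuration; the service, methods, and role names are hypothetical.

```java
import org.springframework.security.access.prepost.PreAuthorize;
import org.springframework.stereotype.Service;

@Service
public class MedicationOrderService {  // hypothetical service

    // Only authenticated users holding the (hypothetical) PRESCRIBER role may call this;
    // the check is applied through Spring Security's AOP-based method security.
    @PreAuthorize("hasRole('ROLE_PRESCRIBER')")
    public void prescribe(Long patientId, String medication) {
        // ... validate and persist the medication order ...
    }

    // Security expressions can also reference method arguments.
    @PreAuthorize("hasRole('ROLE_CLINICIAN') and #patientId != null")
    public void viewRecord(Long patientId) {
        // ... load and audit access to the patient record ...
    }
}
```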

Spring Social makes it easy to connect a Spring application to social network sites like Facebook, Twitter, and LinkedIn using the OAuth2 protocol. From a tooling standpoint, Spring STS supports many Spring features and lets me deploy directly to Cloud Foundry. I look forward to evaluating Grails and the Play Framework, which use convention over configuration and are built on Groovy and Scala respectively.

Thymeleaf, Twitter Bootstrap, and jQuery

I use Twitter Bootstrap because it is based on HTML5, CSS3, jQuery, and LESS, and supports a Responsive Web Design (RWD) approach. The size of the component library and the community is quite impressive.

Thymeleaf is an HTML5 templating engine and a replacement for traditional JSP. It is well integrated with Spring MVC and supports a clear division of labor between back-end and front-end developers. Twitter Bootstrap and Thymeleaf work well together.


AngularJS

For Single Page Applications (SPAs), my definitive choice is AngularJS. It provides everything I need, including a clean MVC pattern implementation, directives, view routing, deep linking (for bookmarking), dependency injection, two-way data binding, and BDD-style unit testing with Jasmine. AngularJS has its own dedicated debugging tool called Batarang. There are also several learning resources (including books) on AngularJS.

Check this page comparing the performance of AngularJS vs. KnockoutJS, as well as this survey of the popularity of top JavaScript MVC frameworks.

 

D3.js 

D3.js is my favorite for data visualization in data-intensive applications. It is based on HTML5, SVG, and JavaScript. For simple charting and plotting, I use jqPlot, which is based on jQuery. See my previous post entitled Visual Analytics for Clinical Decision Making.

 

R

I use R for statistical computing, data analysis, and predictive analytics. See my previous post entitled Statistical Computing and Data Mining with R.


Development Tools


My development tools include: Git (Distributed Version Control), Maven or Gradle (build), Jenkins (Continuous Integration), Artifactory (Repository Manager), and Sonar (source code quality management). My testing toolkit includes Mockito, DBUnit, Cucumber JVM, JMeter, and Selenium.

Sunday, March 24, 2013

Statistical Computing and Machine Learning with R

The use of predictive risk models for personalized medicine is becoming a common practice in healthcare delivery. These models can predict the health risk of patients based on their individual health profiles. Examples include models for predicting breast cancer, stroke, cardiovascular disease, Alzheimer's disease, chronic kidney disease, diabetes, hypertension, and operative mortality for patients undergoing cardiac surgery. These predictive models are created through data analysis using statistical computing.

Predictive risk modeling can be used to identify at-risk populations and provide them with proactive care, including early screening and prevention. For example, predictive risk modeling can help identify patients at risk of hospital readmission, an important Accountable Care Organization (ACO) quality measure.
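
As a purely illustrative example, a logistic regression readmission model estimates the probability of readmission from a patient's features x1, ..., xn (age, number of prior admissions, comorbidity count, and so on; the feature set here is hypothetical):

\[
P(\text{readmission} \mid x_1, \dots, x_n) \;=\; \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}}
\]

The coefficients beta are fit from historical outcomes, and patients whose predicted probability exceeds a chosen threshold can be flagged for proactive follow-up.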

Another important challenge in healthcare is to discover what works and what does not work in clinical practice. Comparative Effectiveness Research (CER), an emerging trend in Evidence Based Practice (EBP), has been defined by the Federal Coordinating Council for CER as "the conduct and synthesis of research comparing the benefits and harms of different interventions and strategies to prevent, diagnose, treat, and monitor health conditions in 'real world' settings."

Despite their inherent methodological challenges (lack of randomization leading to possible bias and confounding), observational studies (using real world clinical data) are increasingly recognized as complementary to Randomized Control Trials (RCTs) and an important tool in clinical decision making and health policy.

Statistical Computing and Machine Learning are essential components of intelligent health IT systems. Over the last few years, the free and open source R Project for Statistical Computing has emerged as one of the most popular tools for data analysis. This poll by kdnuggets.com shows the breakdown in popularity of various data mining and analytic tools.

R supports several Machine Learning algorithms including:

  • Nearest Neighbor
  • Naive Bayes
  • Decision Trees
  • Logistic Regression
  • Neural Networks
  • Support Vector Machines
  • Association Rules
  • k-Means Clustering
A technique called "ensemble methods," which consists of combining multiple models into one, can be used to achieve a higher level of accuracy than any of the component models. There are also R packages for niche methods like the Latent Class Causal Analysis (LCCA) package for R. Latent class analysis is used in behavioral health research.

The following are very useful resources for doing statistical computing and data mining with R:
 
  • RStudio: an Integrated Development Environment (IDE) for R

  • ggplot2: statistical graphics and plotting system for R

  • sqldf: a package for manipulating R data frames using SQL

  • RMySQL: R interface to the MySQL database

  • RMongo: MongoDB Database interface for R

  • RHIPE: Big Data analysis using R and Hadoop. RHIPE stands for R and Hadoop Integrated Programming Environment. This approach is referred to as D&R (Divide and Recombine) Analysis of Large Complex Data (see this tech report on D&R from the RHIPE team)

  • RHadoop:  Big Data analysis using R and Hadoop. This tool provides Hadoop MapReduce functionality in R

  • Rattle: A Graphical User Interface for Data Mining using R. This tool can export predictive models in Predictive Model Markup Language (PMML) format.

Monday, February 6, 2012

Toward Intelligent Health IT (iHIT) Systems: Getting Out of the Box

In this post, I describe a new type of application that I refer to as iHIT. iHIT stands for Intelligent Health IT.

The Architecture of Traditional Health IT systems

Traditional software architectures for health IT systems typically include the following:

  • Dependency Injection (DI)

  • Object Relational Mapping (ORM)

  • An architectural pattern for the presentation layer such as the Model View Controller (MVC) pattern

  • HTML5, CSS3, and a JavaScript library like jQuery/jQuery Mobile

  • Other architectural patterns including GoF Design Patterns, SOLID Principles, and Domain Driven Design (DDD)

  • Structured Query Language (SQL)

  • Enterprise Integration Patterns (EIPs) implemented through an Enterprise Service Bus (ESB) using HL7 messages as the "Published Language"

  • REST or SOAP-based web services.

An entire generation of developers has been trained in these techniques. They represent proven best practices accumulated over several decades of object-oriented design and relational data management. Although pervasive in today's clinical systems, these applications lack basic intelligent features such as the ability to capture and execute expert knowledge, make inferences, or make predictions about the future based on the analysis of historical data. Some of these systems actually look like glorified data entry systems.

With the explosion of medical knowledge and the availability of real-world observational EHR data, these intelligent features will become important in assisting clinicians in the medical decision-making process at the point of care by reducing their cognitive load.

Intelligent Health IT (iHIT) Systems

iHIT systems process huge quantities of both structured and unstructured data to provide clinicians with specific recommendations. iHIT systems play an important role in translating Comparative Effectiveness Research (CER) findings into clinical practice. Comparative Effectiveness Research, an emerging trend in Evidence Based Medicine (EBM), has been defined by the Federal Coordinating Council for CER as "the conduct and synthesis of research comparing the benefits and harms of different interventions and strategies to prevent, diagnose, treat and monitor health conditions in 'real world' settings." For example, based on the clinical profile of a patient, CER can help determine the best treatment option for breast cancer among the various options available, such as chemotherapy, radiation therapy, and surgery (mastectomy and lumpectomy).

The following are examples of key characteristics displayed by iHIT systems:

  • The ability to analyze patient data as well as very large historical observational data sets in order to make probability-based predictions about the future and recommend specific actions that can yield the best clinical outcomes given the clinical profile of a patient.

  • The ability to capture and execute expert knowledge such as the medical knowledge contained in Clinical Practice Guidelines (CPGs). This includes the ability to mediate between different CPGs to arrive at a specific recommendation by merging and reconciling the medical knowledge in multiple CPGs as is the case with patients with comorbidities.

  • The ability to perform automated reasoning by inferring new implicit clinical facts from existing explicit facts and by exploiting semantic relationships between concepts and entities.

  • The ability to retrieve knowledge from unstructured data sources such as the biomedical research literature from sources like PubMed in order to answer clinical questions sometimes posed in natural language.

  • The ability to learn over time (and hence become smarter) as the amount of processed data continues its exponential growth.

  • Very fast response time to queries over very large data sets.


Sounds like Artificial Intelligence (AI)? I believe we are indeed witnessing the resurgence of AI and even the ideas of the Semantic Web in the healthcare industry. As healthcare costs and quality become national priorities for many countries around the world, the boundaries of computing will continue to be pushed further. Actually, some of the underlying principles of intelligent systems were originally developed decades and even centuries ago in the field of biomedical research. William Osler (1849-1919) famously said:

Medicine is a science of uncertainty and an art of probability.

Technologically advanced and competitive industries like financial services (e.g., credit eligibility and fraud detection), online retail (e.g., recommendation engine), and logistics (e.g., delivery route optimization) have adopted some of these technologies. Health IT developers now need to embrace them as well. This will require thinking out of the box.


The Ingredients of iHIT Systems

iHIT systems represent not one, but the integration of many different technologies. Mathematical Models, Statistical Analysis, and Machine Learning algorithms play an important role in iHIT systems. Examples include:

  • Logistic Regression models

  • Decision Trees

  • Association Rules

  • Bayesian Network

  • Neural Networks

  • Random Forests

  • Time Series for temporal reasoning

  • k-means Clustering

  • Support Vector Machines (SVM)

  • Probabilistic Graphical Models (PGMs) based on methods such as Bayesian networks and Markov Networks for making clinical decisions under uncertainty.

These algorithms can be used not only for making therapeutic predictions (e.g., the future hospitalization risk of a patient with Asthma), but also for dividing a population into subgroups based on the clinical profile of patients in order to achieve the best treatment outcomes.
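
For the probabilistic graphical models listed above, the central idea is that a Bayesian network factors the joint distribution over clinical variables X1, ..., Xn into local conditional distributions, one per variable given its parents in the network, which is what keeps reasoning under uncertainty tractable:

\[
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\big(X_i \mid \mathrm{Pa}(X_i)\big)
\]

Diagnostic or prognostic queries then amount to computing posteriors such as P(disease | observed findings) by applying Bayes' rule over this factored model.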

Clinical Practice Guidelines (CPGs) are usually based on Systematic Reviews (SRs) of Randomized Controlled Trials (RCTs), which are essentially scientific experiments. According to a report titled "Clinical Practice Guidelines We Can Trust" which was published last year by the Institute of Medicine (IOM):

However, even when studies are considered to have high internal validity, they may not be generalizable to or valid for the patient population of guideline relevance. Randomized trials commonly have an under representation of important subgroups, including those with comorbidities, older persons, racial and ethnic minorities, and low-income, less educated, or low-literacy patients. Many RCTs and observational studies fail to include such "typical patients" in their samples; even when they do, there may not be sufficient numbers of such patients to assess them separately or the subgroups may not be properly analyzed for differences in outcomes.

On the other hand, observational studies using statistical analysis and machine learning algorithms operate on large real world observational data and can therefore provide feedback on the effectiveness of the actual use of different therapeutic interventions. Although very costly, RCTs are still considered the strongest form of evidence in EBM. Despite their inherent methodological challenges (lack of randomization leading to possible bias and confounding), observational studies are increasingly recognized as complementary to RCTs and an important tool in clinical decision making and health policy. iHIT systems play an important role in translating Comparative Effectiveness Research (CER) findings into clinical practice in the form of clinical decision support (CDS) interventions at the point of care.

iHIT systems also use business rules engines to capture and execute expert knowledge such as the medical knowledge contained in Clinical Practice Guidelines (CPGs). Examples include rules engines based on forward chaining inference, also known as production rule systems. These rules engines can be combined with Complex Event Processing (CEP) and Business Process Management (BPM) for intelligent decision making.
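
To make "forward chaining" concrete, here is a deliberately simplified, library-free sketch of a production rule system: rules whose conditions match facts in working memory fire and may assert new facts, and the engine loops until no rule fires. The facts and rules are hypothetical, and a real iHIT system would use a full rules engine (for example, one implementing the Rete algorithm) rather than this toy loop.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// A toy forward-chaining engine: facts are plain strings and each rule maps a
// condition over working memory to a new fact to assert when it fires.
public class ForwardChainingDemo {

    static class Rule {
        final String name;
        final Predicate<Set<String>> condition;
        final String assertion;
        Rule(String name, Predicate<Set<String>> condition, String assertion) {
            this.name = name;
            this.condition = condition;
            this.assertion = assertion;
        }
    }

    public static void main(String[] args) {
        Set<String> workingMemory = new HashSet<>(List.of("hba1c>9", "on-metformin"));

        List<Rule> rules = List.of(
            new Rule("poor-control",
                     wm -> wm.contains("hba1c>9"),
                     "glycemic-control:poor"),
            new Rule("intensify-therapy",
                     wm -> wm.contains("glycemic-control:poor") && wm.contains("on-metformin"),
                     "recommend:add-second-agent"));

        // Keep cycling until no rule asserts a new fact (quiescence).
        boolean fired = true;
        while (fired) {
            fired = false;
            for (Rule rule : rules) {
                if (rule.condition.test(workingMemory) && workingMemory.add(rule.assertion)) {
                    System.out.println("Fired " + rule.name + " -> asserted " + rule.assertion);
                    fired = true;   // a newly asserted fact may enable other rules
                }
            }
        }
    }
}
```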

iHIT systems support ontologies such as those represented in the Web Ontology Language (OWL), providing reasoning capabilities as well as the ability to navigate semantic relationships between concepts and entities.
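
To illustrate this kind of ontology-backed reasoning, here is a hedged sketch using the Apache Jena ontology API with a built-in OWL rule reasoner; the ontology file, namespace, and class name are hypothetical, and the package names follow recent Jena releases (older releases used the com.hp.hpl.jena packages).

```java
import org.apache.jena.ontology.OntModel;
import org.apache.jena.ontology.OntModelSpec;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.ResIterator;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;

public class OntologyReasoningDemo {
    public static void main(String[] args) {
        // Load a (hypothetical) OWL ontology into a model backed by a rule reasoner,
        // so queries also see inferred facts (e.g., class membership via subclass axioms).
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF);
        model.read("file:clinical-ontology.owl");

        // List all individuals that are (asserted or inferred to be) instances of the class.
        Resource moodDisorder = model.getResource("http://example.org/onto#MoodDisorder");
        ResIterator it = model.listSubjectsWithProperty(RDF.type, moodDisorder);
        while (it.hasNext()) {
            System.out.println(it.next().getURI());
        }
    }
}
```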

More advanced iHIT systems have Natural Language Processing (NLP) and Automatic Speech Recognition (ASR) capabilities in order to answer clinical questions posed in natural language. They rely on Information Retrieval techniques like probabilistic methods for scoring the relevance of a document given a query, and on supervised machine learning classification methods such as decision trees, Naive Bayes, K-Nearest Neighbors (kNN), and Support Vector Machines (SVM).

In some cases, the responsibilities of an iHIT system are performed by Intelligent Agents which are autonomous entities capable of observing the clinical environment and acting upon those observations.

For scalability and performance, iHIT systems often sit on NoSQL databases and run on massively parallel computing platforms like Apache Hadoop while leveraging the elasticity of the cloud.

Integrating these technologies is the main challenge posed by iHIT systems. An example is the integration between statistical and machine learning models, business rules, ontologies, and more traditional forms of computing such as object-oriented programming. Various solutions to these challenges have been proposed and implemented.

Human-Centered Design

Finally, iHIT systems fully embrace a human-centered design approach. They provide a seamless integration between automated decision logic and clinical workflows. They provide the clinician with detailed explanations of the rationale behind the actions they recommend. In addition, they use techniques like Visual Analytics to enhance human cognitive abilities in order to facilitate analytical reasoning over very large data sets.