Sunday, March 24, 2013

Statistical Computing and Machine Learning with R

The use of predictive risk models for personalized medicine is becoming a common practice in healthcare delivery. These models can predict the health risk of patients based on their individual health profiles. Examples include models for predicting breast cancer, stroke, cardiovascular disease, Alzheimer's disease, chronic kidney disease, diabetes, hypertension, and operative mortality for patients undergoing cardiac surgery. These predictive models are created through data analysis using statistical computing.

Predictive risk modeling can be used to identity at-risk populations and provide them with pro-active care including early screening and prevention. For example, predictive risk modeling can help identify patients at risk of hospital re-admission, an important Accountable Care Organization (ACO) quality measure.

Another important challenge in healthcare is to discover what works and what does not work in clinical practice. Comparative Effectiveness Research (CER), an emerging trend in Evidence Based Practice (EBP), has been defined by the Federal Coordinating Council for CER as "the conduct and synthesis of research comparing the benefits and harms of different interventions and strategies to prevent, diagnose, treat, and monitor health conditions in 'real world' settings."

Despite their inherent methodological challenges (lack of randomization leading to possible bias and confounding), observational studies (using real world clinical data) are increasingly recognized as complementary to Randomized Control Trials (RCTs) and an important tool in clinical decision making and health policy.

Statistical Computing and Machine Learning are essential components of intelligent health IT systems. Over the last few years, the free and open source R Project for Statistical Computing has emerged as one the most popular tools for data analysis. This poll by shows the breakdown in popularity of various data mining and analytic tools.

R supports several Machine Learning algorithms including:

  • Nearest Neighbor
  • Naive Bayes
  • Decision Trees
  • Logistic Regression
  • Neural Networks
  • Support Vector Machines
  • Association Rules
  • k-Means Clustering
A technique called "Ensemble Methods" which consists in combining multiple models into one can be used to achieve a higher level of accuracy than its components. There are also R packages for niche methods like the Latent Class Causal Analysis (LCCA) Package for R. LCA is used in behavioral health research.

The following are very useful resources for doing statistical computing and data mining with R:
  • RStudio: an Integrated Development Environment (IDE) for R

  • ggplot2: statistical graphics and plotting system for R

  • sqldf: a package for manipulating R data frames using SQL

  • RMySQL: R interface to the MySQL database

  • RMongo: MongoDB Database interface for R

  • RHIPE: Big Data analysis using R and Hadoop. RHIPE stands for R and Hadoop Integrated Programming Environment. This approach is referred to as D&R (Divide and Recombine) Analysis of Large Complex Data (see this tech report on D&R from the RHIPE team)

  • RHadoop:  Big Data analysis using R and Hadoop. This tool provides Hadoop MapReduce functionality in R

  • Rattle: A Graphical User Interface for Data Mining using R. This tool can export predictive models in Predictive Model Markup Language (PMML) format.

No comments: