*"In God we trust; all others must bring data."*

-William Edwards Deming (1900-1993),

American Engineer, Statistician, and Quality Guru

Clinical Prediction Models are increasingly used in clinical practice to provide diagnosis, prognosis, and anomaly detection. These models form a core component of Intelligent Systems in clinical care. In addition to traditional biomarkers, new predictors are emerging from research in genomics, proteomics, and imaging [1]. The predictive accuracy of Clinical Prediction Models also continues to improve thanks to novel techniques and tools from the science of Statistical Learning. In some cases, these methods provide better predictive performance than traditional regression methods, albeit at the cost of decreased interpretability. Open source statistical computing tools like R [2] have also made it relatively easy for Data Scientists to apply these novel techniques in practice.

Furthermore, the ubiquitous use of wearable sensors by an increasing number of patients (part of a larger trend called the "Internet of Things" or IoT) will generate an abundance of data that will also contribute to improvements in the predictive accuracy of these models. One interesting application of Clinical Prediction Models with wearable sensors is the use of Machine Learning algorithms for real-time anomaly detection in physiological time series.

Several Clinical Prediction Models have been published in the biomedical literature in recent years. Some have even been introduced into clinical practice. However, there are serious concerns about the credibility and validity of these models. A systematic review of the methodology and reporting of multivariable clinical prediction models reported the following:

*"The validation studies were characterized by poor design, inappropriate handling and acknowledgment of missing data and one of the most key performance measures of prediction models i.e. calibration often omitted from the publication."* [3]

In this first post of the year, I discuss the usefulness of Clinical Prediction Models as well as state-of-the-art techniques and recommended methodologies for their development and validation. I also explore some traditional and new quantitative performance measures for these models.

## Why do we need Clinical Prediction Models?

Clinical Prediction Models provide absolute risk prediction for conditions such as diabetes, kidney disease, cancer, cardiovascular disease, and depression. Other examples include predicting patient treatment response in cancer care, 30-day mortality for patients with an acute myocardial infarction (AMI), 30-day emergency admission to hospital, and 30-day readmission. Well-known Clinical Prediction Model development efforts include several risk prediction algorithms created by the QResearch project [4] in the United Kingdom (UK) and the cardiovascular risk functions developed by the Framingham Heart Study [5] in the United States (US).

These risk predictions are clinically useful for a number of reasons. They provide risk stratification for effective population health management, a key component of the accountable care organization (ACO) delivery model. They have the potential to reduce healthcare costs through early screening and the delivery of preventive services such as those recommended by the US Preventive Services Task Force (USPSTF). In clinical practice, Clinical Prediction Models can support clinician decision making during diagnostic work-up and test ordering [6].

Clinical Prediction Models enable Personalized Medicine since the predictions are based on the clinical data of individual patients. Clinical Prediction Models also support shared decision making between patients and their providers, in which the benefits and harms of various treatment options are weighed against patients' preferences and personal values.

Thanks to significant investments in biomedical research in recent years, the number of treatment options for any specific disease continues to increase. Furthermore, with the discovery of new biomarkers from imaging, genomics, and proteomics research, the number of data types that should be considered in clinical decision making will surpass the information processing capacities of the human brain [7]. An average human can only hold 7 ± 2 objects in working memory [8].

Research in noninvasive neuroimaging is improving our understanding of how the brain works and is leading to the discovery of neurological markers (neuromarkers) which are being used to create clinical prediction models for use in mental health and substance abuse treatment. The emerging field of neuroprognosis is leveraging these neuromarkers to predict patients' future relapse or treatment response to pharmacological and behavioral treatment [28]. The emerging Deep Learning techniques hold great promise in the field of medical image analysis.

A prospective study at the MAASTRO Clinic of Maastricht University Medical Center in The Netherlands compared treatment outcome predictions by experienced radiation oncologists (ROs) against those made by Clinical Prediction Models. The study found that the models *"substantially outperformed ROs’ predictions and guideline-based recommendations currently used in clinical practice"* [9]. According to Dr. Cary Oberije, a researcher at the MAASTRO Clinic who presented the findings at the 2nd Forum of the European Society for Radiotherapy and Oncology (ESTRO):

*"If models based on patient, tumor and treatment characteristics already out-perform the doctors, then it is unethical to make treatment decisions based solely on the doctors’ opinions. We believe models should be implemented in clinical practice to guide decisions."* [10]

Finally, predictive models can be used for creating simulations. Simulations are used extensively in the aerospace industry during the design and training phases of aircraft systems. In healthcare, for example, predictive models can be used in Monte Carlo simulations to model the cost of treatment for a population of patients with diabetes [30].

## The Importance of the "No Free Lunch" Theorem in Predictive Modeling

When it comes to the choice of statistical learning method for a given modeling task, I subscribe to the *"No Free Lunch"* theorem [11]. According to this theorem, no single model builder produces the best-performing model for all modeling tasks. The modeler should be familiar with and try multiple model builders and select the model with the best performance for the prediction task and data set at hand. Beyond the traditional regression analyses (linear, logistic, and Cox regression) typically used in the medical field, new sophisticated Statistical Learning methods are now available. These algorithms include Neural Networks, Support Vector Machines (SVMs), Bayesian Networks, Multivariate Adaptive Regression Splines (MARS), and Boosted Trees, to name a few. For example, Jayasurya et al. compared the performance of Bayesian Network (BN) and support vector machine (SVM) models for two-year survival prediction in lung cancer patients treated with radiotherapy. They concluded that BN models had better overall performance than SVM models when handling missing data [12], which is common in medical data.

The R [2] packages Caret [13] and Rattle [14] provide functions for fitting models using several algorithms on a data set. Caret also provides utilities for comparing their results.

## Transparent and Reproducible Modeling

The credibility of
a Clinical Prediction Model comes from a transparent, reproducible,
and peer-reviewed modeling approach based on methodological rigor.
Ideally, Clinical Prediction Models should be developed as open
source software so that anyone can evaluate their underlying quality.
Free and publicly available de-identified data sets should be
available to predictive modelers and researchers. Commercial entities should have the right to keep their models proprietary as long as there is a third-party independent validation of the model as is done so well for reliable avionics software in the aviation industry.

The author(s) of the model should document the data pre-processing steps that have been applied to the original data set during the analysis and model building process. The R package pmmlTransformations [29] provides an interoperable and computable representation of the data pre-processing steps that are applied to the input data prior to modeling. Supported data transformation elements include: normalization, discretization, value mappings, and functions.

Open source tools like knitr [15] and R Markdown [16] simplify the task of creating well-documented and reproducible models by combining the typesetting capabilities of LaTeX with R code for the dynamic generation of documents, presentations, and reports in multiple formats like HTML, PDF, and Word.

## Development Methodology and Validation

Steyerberg and Vergouwe suggest the following steps for the development of Clinical Prediction Models: *"(i) consideration of the research question and initial data inspection; (ii) coding of predictors; (iii) model specification; (iv) model estimation; (v) evaluation of model performance; (vi) internal validation; and (vii) model presentation"* [1]. An important distinction is made between internal validity and external validity. Internal validity refers to the reproducibility of the model's performance in the patient population whose clinical data were used to train the model. External validity refers to the generalizability or extrapolation of the model to previously unseen clinical data from other patient populations (e.g., patient populations from a different country, region, or clinical site).

## Validation using Resampling Techniques

Over-fitting is always a concern in model building. Traditional data splitting techniques include: random sampling, stratified random sampling (to account for severe class imbalance), and maximum dissimilarity sampling [13]. However, more recent resampling techniques (as opposed to simple random training/test splits of the data) can provide more reliable estimates of model performance. Resampling techniques include:

- K-fold cross-validation
- Repeated k-fold cross-validation
- Leave-one-out cross-validation (LOOCV)
- Repeated training/test splits or "Monte Carlo cross-validation"
- The Bootstrap and its variants such as the ".632 method" and the ".632+ method".

Model builders typically have one or more tuning parameters. For example, the number of neighbors used by a K-nearest neighbor (kNN) model builder is a tuning parameter that can affect model performance. For each candidate value of K, the training data is resampled several times, and an aggregated performance profile is generated and evaluated to determine the optimal value [17, 18].
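The tuning loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`k_fold_indices`, `knn_predict`, `cv_accuracy`), the single simulated predictor, and the candidate values of K are all made up for this post, and accuracy is used as the aggregated performance measure purely for simplicity.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle the sample indices and deal them into n_folds roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training points (one numeric predictor)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def cv_accuracy(xs, ys, k, n_folds=5, seed=0):
    """Mean held-out accuracy of a kNN classifier across the folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), n_folds, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        hits = sum(knn_predict(tx, ty, xs[i], k) == ys[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / len(scores)

# Simulated data (an assumption for illustration): two overlapping classes.
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(60)] + [rng.gauss(2, 1) for _ in range(60)]
ys = [0] * 60 + [1] * 60

# Performance profile: one cross-validated accuracy per candidate K.
profile = {k: cv_accuracy(xs, ys, k) for k in (1, 3, 5, 7, 9, 11)}
best_k = max(profile, key=profile.get)
print(profile, best_k)
```

In practice the same resampling scheme would be reused for every candidate value, as done here with a fixed seed, so that the candidates are compared on identical folds.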

An important issue to be aware of during model tuning is the so-called *"Bias-Variance Trade-off"*. Models with high variance can lead to over-fitting although they may have low bias. The challenge is to arrive at a model with low variance and low squared bias [18].

## Quantitative Measures of Model Performance

### The Root Mean Squared Error (RMSE)

In general, the choice of quantitative performance measure depends on whether or not the outcome is continuous. When the outcome is continuous, the root mean squared error (RMSE) is typically used. The RMSE summarizes the model residuals, which are the differences between the observed and the predicted values: it is the square root of the mean of the squared residuals.

### The Coefficient of Determination or R-squared

Another measure of performance in regression models is the coefficient of determination or R-squared. The R-squared can be obtained by computing the correlation coefficient between the observed and predicted values and squaring it. The R-squared takes values between 0 and 1. A value of 1 indicates a perfect fit of the model to the data.
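As a small worked example of the two regression measures above, here is a sketch in plain Python. The function names and the four observed/predicted pairs are invented for illustration; R-squared is computed as the squared correlation, which is the definition used in this post.

```python
import math

def rmse(observed, predicted):
    """Square root of the mean squared residual (observed minus predicted)."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)

# Hypothetical values, chosen so the arithmetic is easy to check by hand.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

print(rmse(observed, predicted))       # 0.5
print(r_squared(observed, predicted))  # 0.968 (up to floating-point rounding)
```

Every residual here is ±0.5, so the RMSE is exactly 0.5, while the squared correlation is 22² / (20 × 25) = 0.968.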

### Calibration

When the outcome is not numeric, as in classification models, the goal is to obtain predicted class probabilities. Calibration is a measure of how well predicted class probabilities reflect the true probability of the outcome. For example, if a model predicts a 60% chance of a positive outcome for a patient, then about 60 out of 100 "similar patients" should be observed to have a positive outcome. A calibration plot displays predicted class probabilities on the x-axis and the observed probabilities on the y-axis. Well-calibrated predictions lie on the 45-degree line. The observed probabilities can be plotted by deciles of predicted probabilities to compare their means.
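The grouping behind a calibration plot can be sketched as follows. This is a minimal illustration, not a substitute for a proper calibration analysis: `calibration_table` is a hypothetical helper, the ten patients are invented, and two groups are used instead of deciles only to keep the example small.

```python
def calibration_table(probs, outcomes, n_bins=10):
    """Sort patients by predicted probability, split them into n_bins groups of
    roughly equal size, and return (mean predicted, observed proportion) per group."""
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    rows = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # more bins than patients: skip empty groups
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(y for _, y in chunk) / len(chunk)
        rows.append((mean_pred, observed))
    return rows

# Hypothetical cohort: five patients predicted at 20% (one event)
# and five patients predicted at 80% (four events).
probs = [0.2] * 5 + [0.8] * 5
outcomes = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

table = calibration_table(probs, outcomes, n_bins=2)
for mean_pred, observed in table:
    print(round(mean_pred, 3), round(observed, 3))
```

Each returned pair is one point on the calibration plot; in this invented cohort both points fall on the 45-degree line.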

### Confusion Matrix

The Confusion Matrix
(also known as a Contingency Table or Error Matrix) is a simple
cross-tabulation of the observed and predicted classes for the data.
The Confusion Matrix can be represented as a table with two rows and
two columns that reports the number of false positives, false
negatives, true positives, and true negatives.

The confusion matrix can also display the overall *accuracy rate* or *error rate*. However, the *accuracy rate* is not a reliable measure of performance because its value can be misleading in the case of severe class imbalance (very low or very high prevalence).

### Common Measures

The following are performance metrics commonly found in the biomedical literature and equations for computing their values:

- Sensitivity or True Positive Rate (TPR): *TPR = TP/(TP + FN)*
- Specificity (SPC) or True Negative Rate: *SPC = TN/(FP + TN)*
- Precision or Positive Predictive Value: *PPV = TP/(TP + FP)*
- Negative Predictive Value: *NPV = TN/(FN + TN)*
- Fall-Out or False Positive Rate: *FPR = FP/(FP + TN)*
- Accuracy *= (TP + TN)/(TP + TN + FN + FP) = 1 – Error Rate*
- F-Measure or F-Score *= 2TP/(2TP + FP + FN)*

There is usually a trade-off between specificity and sensitivity. This trade-off can be evaluated using the Receiver Operating Characteristic (ROC) curve (more on that later). A cautionary note is that the PPV and the NPV depend on the prevalence, which can vary across patient populations.
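The measures listed above translate directly into code. The sketch below uses made-up counts and hypothetical function names; the second function applies Bayes' rule to show how the PPV of a test with fixed sensitivity and specificity changes with prevalence.

```python
def classification_measures(tp, fp, fn, tn):
    """Common measures computed from the four cells of a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "ppv": tp / (tp + fp),
        "npv": tn / (fn + tn),
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fn + fp),
        "f_score": 2 * tp / (2 * tp + fp + fn),
    }

def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """PPV implied by a given sensitivity, specificity, and prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical counts for 1,000 patients with 10% prevalence.
measures = classification_measures(tp=80, fp=30, fn=20, tn=870)
print(measures)

# Same test characteristics, two different populations.
print(ppv_at_prevalence(0.9, 0.9, 0.50))  # about 0.90
print(ppv_at_prevalence(0.9, 0.9, 0.01))  # about 0.08
```

With 90% sensitivity and 90% specificity, the PPV drops from about 0.90 at 50% prevalence to about 0.08 at 1% prevalence, which is why PPV and NPV reported in one population should not be transferred to another without checking prevalence.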

### Kappa Statistic

The Kappa statistic is a measure of the difference between the observed accuracy of a model and the expected accuracy. The latter is the accuracy that can be obtained by random chance alone. Compared with the overall accuracy, the Kappa statistic is more resilient to severe class imbalance. It can be computed using the following formula:

Kappa = (observed accuracy - expected accuracy)/(1 - expected accuracy)

Kappa ranges from -1 to 1. A value of 1 represents perfect agreement. A value of 0 indicates agreement no better than what would be obtained by random chance. Most values fall between 0 and 1.
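To make the formula concrete, here is a sketch for the two-class case. The function name and the counts are assumptions for illustration; the expected accuracy is computed from the row and column totals of the confusion matrix, which is the usual way to obtain chance agreement.

```python
def kappa(tp, fp, fn, tn):
    """Kappa for a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: P(predicted +) * P(actual +) + P(predicted -) * P(actual -)
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical classifier on 1,000 patients with 10% prevalence.
print(kappa(tp=80, fp=30, fn=20, tn=870))  # about 0.734

# A "classifier" that labels every patient negative: 90% accuracy, Kappa of 0.
print(kappa(tp=0, fp=0, fn=100, tn=900))
```

The second call shows the resilience to class imbalance mentioned above: always predicting the majority class reaches 90% accuracy in this invented cohort, yet Kappa is 0 because that accuracy is exactly what chance alone would give.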

### The Receiver Operating Characteristic Curve (ROC) and associated Area under the Curve (AUC)

The ROC curve plots the sensitivity (true-positive rate) against 1 – specificity (false-positive rate) for a range of cut-off values. The AUC is a key indicator of model performance in classification models. Larger values indicate better performance. An advantage of the ROC curve is that it is insensitive to class imbalance.
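A minimal sketch of how the ROC points and the AUC can be computed from predicted scores follows. The helper names and the eight scored patients are invented for illustration, and the area is obtained with the trapezoidal rule; a production analysis would normally rely on an established library.

```python
def roc_points(scores, labels):
    """(false-positive rate, true-positive rate) at each distinct cut-off,
    starting from (0, 0); a sample is called positive when score >= cut-off."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical predicted probabilities and true classes (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
print(curve)
print(auc(curve))  # 0.8125
```

The value 0.8125 can be checked by hand: of the 16 possible (positive, negative) pairs in this toy data set, 13 have the positive patient scored higher, and 13/16 = 0.8125.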

### Youden's J Index

The Youden’s J
Index can be calculated using the following formula:

J = Sensitivity + Specificity - 1

Youden's Index (J) is essentially the difference between the true positive rate (TPR) and the false positive rate (FPR) [19]. For a classifier that performs at least as well as chance, J ranges from 0 to 1. The optimal classification cut-off point can be taken as the point on the ROC curve that maximizes Youden's Index (the height above the chance line).
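The search for the cut-off that maximizes J can be sketched as a simple scan over the observed scores. The function name and data are hypothetical, and a sample is treated as positive when its score is at or above the cut-off, which is an assumption of this sketch.

```python
def youden_best_cutoff(scores, labels):
    """Return (cut-off, J) for the cut-off with the largest
    J = sensitivity + specificity - 1."""
    pos = sum(labels)
    neg = len(labels) - pos
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        j = tp / pos + tn / neg - 1
        if best is None or j > best[1]:
            best = (cutoff, j)
    return best

# Hypothetical predicted probabilities and true classes (1 = positive).
scores = [0.95, 0.85, 0.75, 0.65, 0.45, 0.35, 0.25, 0.15]
labels = [1, 1, 1, 0, 0, 1, 0, 0]

cutoff, j = youden_best_cutoff(scores, labels)
print(cutoff, j)  # 0.75 0.75
```

At a cut-off of 0.75, three of the four positives are detected (sensitivity 0.75) and no negative is misclassified (specificity 1), giving J = 0.75. Note that J weighs sensitivity and specificity equally, which may not match the clinical costs involved.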

### Equivocal Zones

In addition to an optimal cut-off value, an *"equivocal"* zone can be defined as well. Predictions that fall into this zone are classified as *"equivocal"* (meaning class membership is indeterminate) [20, 21].
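One simple way to express an equivocal zone in code is a band around the cut-off. The function name, the default cut-off of 0.5, and the half-width of 0.1 are all assumptions chosen for illustration, not values taken from the cited guidance.

```python
def classify_with_equivocal_zone(prob, cutoff=0.5, half_width=0.1):
    """Label a predicted probability, returning 'equivocal' when it falls
    within half_width of the cut-off."""
    if abs(prob - cutoff) < half_width:
        return "equivocal"
    return "positive" if prob >= cutoff else "negative"

for p in (0.10, 0.45, 0.55, 0.90):
    print(p, classify_with_equivocal_zone(p))
```

Performance measures such as sensitivity and specificity can then be reported on the non-equivocal predictions only, together with the proportion of samples that fell in the zone.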

### Lift Charts

The lift is a
measure of the relative performance of the model against a baseline
like random guessing or a non-informative model. A Lift Chart plots
the cumulative lift values on the y-axis against the percentage of
samples evaluated on the x-axis. The lift function in the Caret
package calculates the lift as the ratio of the percentage of samples
(in each approximately equal split of the data) predicted as positive for a given
class over the same percentage in the entire data set [13].
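The idea behind a cumulative lift value can be sketched in plain Python as follows. This is not the Caret implementation: `cumulative_lift` is a hypothetical helper, the data are invented, and the lift is defined here as the event rate among the top-ranked samples divided by the event rate in the whole data set.

```python
def cumulative_lift(scores, labels, n_groups=4):
    """Rank samples by predicted score (highest first) and return
    (fraction of samples evaluated, cumulative lift) after each group."""
    ranked = [y for _, y in sorted(zip(scores, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    base_rate = sum(ranked) / n  # event rate in the entire data set
    curve = []
    for g in range(1, n_groups + 1):
        top = ranked[: g * n // n_groups]
        rate = sum(top) / len(top)  # event rate among the top-ranked samples
        curve.append((len(top) / n, rate / base_rate))
    return curve

# Hypothetical predicted probabilities and true classes (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for fraction, lift in cumulative_lift(scores, labels):
    print(fraction, round(lift, 3))
```

In this toy example the top 25% of samples contain only positives, so the lift is 2.0 against a base rate of 50%; by construction the lift falls back to 1.0 once all samples have been evaluated.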

## Performance Measures for Regression Analysis

Steyerberg and Vergouwe suggest the following performance measures for regression analysis: *"calibration-in-the-large, or the model intercept (A); calibration slope (B); discrimination, with a concordance statistic (C); and clinical usefulness, with decision-curve analysis (D)"* [1].

For generalized linear models, calibration-in-the-large is related to the intercept and compares the mean of predictions to the mean of outcomes. The mean of predictions and the mean of outcomes are equal during internal validation with resampling techniques, but could differ during external validation on previously unseen data. During internal validation, the calibration slope can be used as a shrinkage factor. During external validation on previously unseen data, the calibration slope could be less than one due to over-fitting [6]. This methodology for assessing the calibration of a logistic regression model was first proposed by Cox in 1958 [22]. When the intercept and the slope do not significantly differ from 0 and 1 respectively, the model is considered to have good calibration.

For binary outcomes, the concordance statistic (c-statistic) is the area under the ROC curve. Next, we discuss decision-curve analysis.

## Decision Curve Analysis and Net Benefits (NB)

In clinical practice, a cut-off or threshold value of the predicted class probability is needed to assist clinicians in decision making. A default cut-off of 50% implies that benefits (e.g., remission and improved functional status and quality of life) and harms (e.g., severity of side-effects and costs) are weighted equally. Since this assumption is rarely correct in medicine, such a cut-off value would not be too useful to clinicians in practice [6]. For example, the value of true-positive classifications (e.g., patients with the disease correctly diagnosed as having the disease) and false-positive classifications (e.g., patients without the disease incorrectly diagnosed as having the disease) are typically not equal. Also the determination of benefits and harms of any decision could be driven by the specific context of the patient including the patient's preferences and values (shared decision making).

Vickers and Elkin introduced a decision-analytic approach called the *"decision curve"*, which evaluates the Net Benefit (NB) of the model over a range of cut-off values [24]. The NB can be calculated using the formula:

NB = (TP - w * FP)/N

where TP is the number of true-positive classifications, FP the number of false-positive classifications, N the total number of patients, and w (weight) the ratio of harm to benefit. The latter is calculated as the odds of the cut-off. For example, a cut-off value of 10% gives w = 0.10/0.90 = 1/9, which indicates that a true-positive (TP) is weighted 9 times as heavily as a false-positive (FP). The Decision Curve approach is an important tool for measuring value in the transition from a fee-for-service to a value-driven care delivery system.
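The Net Benefit formula can be evaluated over several cut-offs with a few lines of Python. The function name and the eight patients below are assumptions for illustration; a patient is counted as test-positive when the predicted probability is at or above the cut-off.

```python
def net_benefit(probs, outcomes, cutoff):
    """NB = (TP - w * FP) / N, with w equal to the odds of the cut-off."""
    n = len(probs)
    tp = sum(1 for p, y in zip(probs, outcomes) if p >= cutoff and y == 1)
    fp = sum(1 for p, y in zip(probs, outcomes) if p >= cutoff and y == 0)
    w = cutoff / (1 - cutoff)
    return (tp - w * fp) / n

# Hypothetical predicted probabilities and observed outcomes (1 = disease).
probs = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
outcomes = [1, 1, 0, 1, 0, 1, 0, 0]

# A decision curve is the Net Benefit evaluated over a range of cut-offs.
decision_curve = {c: net_benefit(probs, outcomes, c) for c in (0.1, 0.3, 0.5, 0.7)}
for cutoff, nb in decision_curve.items():
    print(cutoff, round(nb, 4))
```

At a cut-off of 0.5 the weight is 1, so NB = (3 - 2)/8 = 0.125. At a cut-off of 0.1 every patient is classified as positive and false positives are weighted by only 1/9, giving NB = (4 - 4/9)/8, or about 0.444. A full decision-curve analysis would also plot the "treat all" and "treat none" strategies for comparison.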

## Bending the Cost Curve

Controlling healthcare costs remains a challenge for many countries. Drummond and Holte proposed another method called the Cost Curve, which can be used to visualize and compare the performance of classifiers based on the combination of misclassification costs and class distributions [25]. The Cost Curve plots the normalized *expected misclassification cost* (NEC) on the y-axis and the probability cost *PC(+)* on the x-axis. The probability cost *PC(+)* represents the combination of the two misclassification costs and the class distribution and can be calculated with the following formula:

PC(+) = (p(+)C(-|+))/(p(+)C(-|+) + (1 - p(+))C(+|-))

where p(+) is the class distribution (the probability that a given instance is positive), C(-|+) is the cost of a false negative, and C(+|-) is the cost of a false positive. The range of PC(+) values is 0 to 1.

The *"Normalized Expected Cost" (NEC)* can be calculated with the following formula:

NEC = FN * PC(+) + FP * (1 - PC(+))

where FN is the false negative rate and FP is the false positive rate. The range of NEC values is 0 to 1. There is bidirectional point/line duality [26] between ROC curves and Cost Curves. The point (FP, TP) in ROC space is a line in cost space which joins the points (0, FP) and (1, FN) [27].
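The two cost-curve formulas above can be sketched as follows. The function names, the prevalence of 20%, and the 5:1 cost ratio are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def probability_cost(p_pos, cost_fn, cost_fp):
    """PC(+) = p(+)C(-|+) / (p(+)C(-|+) + (1 - p(+))C(+|-))."""
    weighted_fn = p_pos * cost_fn
    weighted_fp = (1 - p_pos) * cost_fp
    return weighted_fn / (weighted_fn + weighted_fp)

def normalized_expected_cost(fn_rate, fp_rate, pc_pos):
    """NEC = FN * PC(+) + FP * (1 - PC(+)), using false negative and false positive rates."""
    return fn_rate * pc_pos + fp_rate * (1 - pc_pos)

# Hypothetical setting: 20% prevalence, a false negative costs 5 times a false positive.
pc = probability_cost(p_pos=0.2, cost_fn=5.0, cost_fp=1.0)
print(round(pc, 4))  # 0.5556

# A classifier with a false negative rate of 0.25 and a false positive rate of 0.10.
print(round(normalized_expected_cost(0.25, 0.10, pc), 4))  # 0.1833

# Point/line duality: the classifier's line runs from (0, FP) to (1, FN).
print(normalized_expected_cost(0.25, 0.10, 0.0))  # 0.1
print(normalized_expected_cost(0.25, 0.10, 1.0))  # 0.25
```

Evaluating the NEC at PC(+) = 0 and PC(+) = 1 recovers the false positive and false negative rates, which are the two end points of the line that represents this classifier in cost space.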

## Model Presentation and Deployment

Model presentation techniques include traditional score charts, nomograms, and clinical rules [6]. However, Clinical Prediction Models are easier to use and maintain when deployed as scoring services (part of a service-oriented software architecture) and integrated into Clinical Decision Support (CDS) systems. The scoring service can be deployed in the cloud to allow integration with multiple client clinical systems. The Data Mining Group (DMG) Predictive Model Markup Language (PMML) specification supports the interoperable deployment of predictive models in heterogeneous software environments.

Visual Analytics or data visualization techniques can also play an important role in the effective presentation of Clinical Prediction Models to nonstatisticians, particularly in the context of shared decision making.

## References

[1] Ewout W. Steyerberg, Yvonne Vergouwe. Towards better clinical prediction models: seven steps for development and an ABCD for validation. European Heart Journal 2014 Aug 1;35(29):1925-31.

[2] R Core Team (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

[3] Collins GS, de Groot JA, Dutton S et al (2014) External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol 14:40.

[4] Hippisley-Cox J, Coupland C, Brindle P. The performance of seven QPrediction risk scores in an independent external sample of patients from general practice: a validation study. BMJ Open. 2014 Aug 28;4(8)

[5] Dawber TR, Meadors GF, Moore FEJ: Epidemiological approaches to heart disease: the Framingham Study. Am J Public Health 1951, 41:279-286.

[6] Ewout W. Steyerberg. Clinical Prediction Models. A Practical Approach to Development, Validation, and Updating. New York: Springer, 2010.

[7] Stead WW, Searle JR, Fessler HE, Smith JW, Shortliffe EH. Biomedical informatics: changing what physicians need to know and how they learn. Acad Med. 2011 Apr;86(4):429-34.

[8] Miller GA. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol Rev. 1956;63:81–97.

[9] Oberije C, Nalbantov G, Dekker A, Boersma L, Borger J, Reymen B, van Baardwijk A, Wanders R, De Ruysscher D, Meyerbeer E, Dingemans AM, Lambin P. A prospective study comparing the predictions of doctors versus models for treatment outcome of lung cancer patients: a step toward individualized care and shared decision making. Radiother Oncol. 2014 Jul;112(1):37-43

[10] European Society for Radiotherapy and Oncology (ESTRO). "Mathematical models out-perform doctors in predicting cancer patients' responses to treatment." ScienceDaily. www.sciencedaily.com/releases/2013/04/130420110651.htm (accessed January 3, 2015).

[11] Wolpert D (1996). "The Lack of a priori Distinctions Between Learning Algorithms." Neural Computation, 8(7), 1341–1390.

[12] Jayasurya K, Fung G, Yu S, Dehing-Oberije C, De Ruysscher D, Hope A, De Neve W, Lievens Y, Lambin P, Dekker AL. Comparison of Bayesian network and support vector machine models for two-year survival prediction in lung cancer patients treated with radiotherapy. Med Phys. 2010 Apr;37(4):1401-7.

[13] Max Kuhn. Contributions from Jed Wing, Steve Weston, Andre Williams, Chris Keefer and Allan Engelhardt (2012). caret: Classification and Regression Training. R package version 5.15-044. http://CRAN.R-project.org/package=caret

[14] Williams GJ (2014). rattle: Graphical user interface for data mining in R. R package version 3.1.4, URL http://rattle.togaware.com/.

[15] Yihui Xie (2014). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.8.

[16] R Studio (2013), Using R Markdown with Rstudio, https://support.rstudio.com/hc/en-us/articles/200552086-Using-R-Markdown

[17] Max Kuhn, Kjell Johnson. Applied Predictive Modeling. New York: Springer, 2013.

[18] Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning with Applications in R. New York: Springer, 2013.

[19] Youden W (1950). "Index for Rating Diagnostic Tests." Cancer, 3(1), 32–35.

[20] US Food and Drug Administration. Guidance for Industry and FDA Staff - Class II Special Controls Guidance Document: Cardiac Allograft Gene Expression Profiling Test Systems. http://www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/GuidanceDocuments/ucm187084.htm. Accessed January 10, 2015.

[21] Max Kuhn. Equivocal Zones. R Bloggers. http://www.r-bloggers.com/equivocal-zones/. Accessed January 10, 2015.

[22] Cox DR. Two further applications of a model for binary regression. Biometrika 1958; 45:562-565.

[23] Frank E. Harrell, Jr. Regression Modeling Strategies. With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer, 2010.

[24] Vickers AJ, Elkin EB. Decision Curve Analysis: a novel method for evaluating prediction models. Med Decis Making. 2006; 26(6):565-75

[25] Robert C. Holte and Chris Drummond. Cost-sensitive Classifier Evaluation using Cost Curves. Advances in Knowledge Discovery and Data Mining Lecture Notes in Computer Science Volume 5012, 2008, pp 26-29.

[26] Preparata, F. P., & Shamos, M. I. (1988). Computational Geometry, An Introduction, Text and Monographs in Computer Science. New York: Springer-Verlag.

[27] Chris Drummond, Robert C. Holte. Cost curves: An improved method for visualizing classifier performance. Mach Learn (2006) 65:95–130.

[28] Gabrieli, John D.E., Ghosh, Satrajit S., Whitfield-Gabrieli, Susan. Prediction as a Humanitarian and Pragmatic Contribution from Human Cognitive Neuroscience. Neuron, Volume 85, Issue 1, 11-26.

[29] Tridivesh Jena, Wen Ching Lin (2014). Package pmmlTransformations. R package version 1.2.2.

[30] Svetlana Levitan, Richard Cohen, Vladimir Shklover. PMML in Simulation. PMML'13, August 11 2013, Chicago, Illinois, USA.