The Path to Useful and Trustworthy Clinical Prediction Models

Introduction
Clinical prediction modeling, using data to predict patients’ clinical outcomes and improve medical decision-making, can be incredibly powerful. Modern medicine is awash in unused data. Information that could improve patients’ lives is everywhere, if only we knew how to distill it from the muck and mire of the everyday chaos of clinical care.
With it, we could:
More efficiently diagnose diseases without patients suffering through tortuous journeys in the medical system, bouncing from doctor to doctor with no answers.
More accurately predict the prognosis of diseases, providing information about what the future will hold.
Make optimal medical decisions, enabling patients and physicians to wrestle with the risks and benefits of different treatment options to make the best decisions.
Confirm what we know to clarify what we don’t, accelerating progress in basic and translational science and identifying which clinical trials are worth doing.
All of these benefits are within our reach if we could harness the ocean of data generated by the medical system every day to produce trustworthy and reliable clinical prediction models. Unfortunately, that has proven difficult.
Unrealized potential
Classical methods of statistical prediction modeling for clinical applications have existed for decades: multivariable linear regression, logistic regression, ordinal regression, survival modeling, and more. There is a vast literature covering all of the relevant statistical and clinical aspects of developing such models, from sample size calculations and model performance to decision theory and implementation. The science of prediction modeling is very well developed.
Yet, these methods have not realized their potential. They have been used well for some applications and terribly for many others. Most models are optimized for publication and not for clinical use. They are developed to advance careers and not improve patient care. Even the good ones are rarely implemented into clinical practice. Running code in statistics software and writing a paper is vastly easier than changing clinical care. Despite a well-developed science of prediction modeling, the health system carries on largely ignorant of it.
AI has made the situation worse
The rise of machine learning (ML) and artificial intelligence (AI) has only worsened the situation. AI has demonstrated blockbuster results in certain use cases, such as chatbots and applications in radiology and pathology. Yet these do not translate well to clinical prediction tasks. AI is ravenous for data because it must train an incredible number of parameters. AI also craves stability in the system it is learning to reproduce. If the behavior of the system changes, the titanic algorithms struggle to adapt, like a massive ship that turns in a slow, wide arc. Clinical medicine is the opposite of this. Data are relatively sparse, and the underlying system that generates them is built on quicksand. Structural change is the rule rather than the exception. Moreover, the science of describing the reliability and trustworthiness of AI models is not well developed. The discipline is still grappling with key questions, such as how to apply causal inference methods and how to represent uncertainty when data are limited. AI methods that have produced headline successes in some applications do not translate to the complex, dynamic, and messy world of clinical practice.
Useful and trustworthy clinical prediction models
To realize clinical prediction modeling’s potential, we need a method for producing useful predictions that patients and providers can trust.
Useful predictions:
Answer a specific and valuable question. Sometimes, prediction models produce answers to different questions than the user has in mind. Other times, models answer a question that is not valuable because the information makes no difference, either to the actions people might take or to any psychological benefit the knowledge provides.
Support decisions. People can use the predictions to decide what the best next step is. This process involves more than predictions, such as the utilities of outcomes and the context of actions. The output of the model should be readily usable by someone equipped to make the decision.
Can be implemented by the health system. If the model demands too many resources, such as data, IT infrastructure, or specialized expertise that is unavailable, then it cannot practically be implemented. Models must be designed with these constraints in mind.
Trustworthy predictions:
Provide the right kind of answers. Predictions can be in the form of discrete categories or probabilities. The question a model answers and the decisions it supports determine the best type of answer. The wrong type of answer can produce worse decisions.
Are accurate. They generally answer the modeling question correctly. This can be demonstrated by measures of overall model performance, discrimination (how well the model separates groups), and calibration (whether the value or probability of an outcome matches the observed values/probabilities).
Have quantifiable uncertainty. One can say that both the chance of getting heads on a coin toss and the chance of a team winning a match in a sport you have never heard of are 50%, but you would be very certain about that statement for the coin and very uncertain for the sports team. Similarly, predictions from models can vary in certainty. That certainty matters as you make a decision, monitor results, consider changing your mind about the right course of action, or consider whether predictions can be improved with more work on the model.
Are not overfit to the data. Models can be too good at predicting the data used to train them. They will fall apart when making predictions on new data. Methods for internal validation during model development must be used to estimate how well the model will perform on unseen data drawn from the same source population.
Perform acceptably over time. The dynamics of the health system that produce outcomes can change slowly or suddenly. Processes must be in place to monitor the model and respond when its performance degrades.
Perform acceptably in new situations. Model performance can vary tremendously when used in new situations, such as in different hospitals or different types of patients. The predictions must be shown to be trustworthy in new settings.
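To make discrimination and calibration concrete, here is a minimal sketch with toy predicted probabilities (all numbers are illustrative, not from any real model). The c-statistic measures how often an event is ranked above a non-event; calibration-in-the-large compares the mean predicted risk to the observed event rate.

```python
# Toy check of discrimination and calibration for predicted probabilities.
# The data below are invented for illustration only.

def c_statistic(probs, outcomes):
    """Probability that a randomly chosen event receives a higher
    predicted probability than a randomly chosen non-event."""
    pairs = concordant = ties = 0
    for p_event, y_event in zip(probs, outcomes):
        for p_non, y_non in zip(probs, outcomes):
            if y_event == 1 and y_non == 0:
                pairs += 1
                if p_event > p_non:
                    concordant += 1
                elif p_event == p_non:
                    ties += 1
    return (concordant + 0.5 * ties) / pairs

def calibration_in_the_large(probs, outcomes):
    """Mean predicted risk minus observed event rate; near zero is good."""
    return sum(probs) / len(probs) - sum(outcomes) / len(outcomes)

probs    = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcomes = [0,   0,   0,   1,   0,   1,   1,   1]

print(round(c_statistic(probs, outcomes), 3))          # 0.938
print(round(calibration_in_the_large(probs, outcomes), 3))
```

A model can score well on one property and poorly on the other, which is why both must be reported.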
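The coin-versus-unknown-team intuition can be quantified with a confidence interval that widens as data shrink. A sketch using the Wilson score interval for a proportion (the sample sizes are invented for illustration):

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.
    The interval widens as n shrinks, capturing growing uncertainty."""
    if n == 0:
        return (0.0, 1.0)  # no data at all: maximal uncertainty
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# A coin observed 5,000 times vs. a team seen only 4 times:
# both point estimates are 50%, but the uncertainty differs enormously.
print(wilson_interval(2500, 5000))  # narrow interval around 0.5
print(wilson_interval(2, 4))        # very wide interval around 0.5
```

The same logic applies to a model's predicted risks: two patients can receive the same point prediction backed by very different amounts of evidence.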
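Overfitting is easy to demonstrate with a deliberately extreme sketch: a "model" that memorizes its training data looks perfect on that data and collapses on held-out data, even when the features are pure noise. The synthetic data here exist only to illustrate why internal validation is non-negotiable.

```python
import random

random.seed(0)

# Pure-noise data: the "feature" carries no information about the
# outcome, so no honest model should beat chance on new patients.
data = [(random.random(), random.randint(0, 1)) for _ in range(200)]
train, test = data[:100], data[100:]

# A deliberately overfit model: it memorizes every training example.
memory = {x: y for x, y in train}

def predict(x):
    return memory.get(x, 0)  # unseen inputs get a fixed fallback guess

def accuracy(dataset):
    return sum(predict(x) == y for x, y in dataset) / len(dataset)

print(accuracy(train))  # apparent performance: perfect (1.0)
print(accuracy(test))   # held-out performance: roughly chance
```

Real overfitting is subtler than memorization, but the mechanism is the same; bootstrap optimism correction or cross-validation estimate the gap between apparent and true performance before anyone trusts the model.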
This is the path to developing valuable prediction models that help patients. These are the minimal criteria. Yet classical models and ML/AI can fail spectacularly at these points.
ML/AI models experience greater challenges compared to classical methods with implementation in the health system, providing the right kinds of answers, quantifying uncertainty, and not overfitting. ML/AI models are data hungry and complex. Vast amounts of data are required to train a model, and it takes even more to assure the operating characteristics that useful and trustworthy predictions require. This amount of data is often unavailable for clinical applications. Proponents of ML/AI argue that their models are more accurate. Even granting this point, which is debatable, the consequences of not meeting the requirements for useful and trustworthy predictions often outweigh any benefits gained from improved accuracy.
Achieving all of these properties is difficult regardless of the approach. The time, effort, and money required to do this well generally outweigh the benefit that an academic gains from publishing an analysis and moving on to the next thing. The literature is filled with descriptions of models that are accurate when predicting the training data but are otherwise useless and not worthy of anyone’s trust.
To the future!
None of these are my original thoughts. There are many, many people who want to do this well and have rigorously worked through these ideas (the TRIPOD+AI statement and its associated references are a good starting place). This post is my process of synthesizing these concepts for my own practice as both a physician and data scientist.
These points should guide the integration of prediction models into clinical practice, but they probably won’t. Or at least they will only partially. The publishing incentives to stop short of anything practical are too great. Hype is much more effective at selling to healthcare executives than a careful evaluation of how well a model performs. This is boring nerd stuff.
Yet, useful and trustworthy modeling can win because reality is unrelenting. Hype will come and go. Expensive AI products will be purchased and found to be useless. Cycles of boom and bust will roll on. Amidst the noise, the future will remain open. We won’t stop craving reliable predictions while we continuously proceed into the unknown. Useful and trustworthy models will demonstrate their virtues as time rolls on, and health systems will become better at recognizing their value and more capable of supporting their implementation. The opportunities are tremendous if only we can avoid the allure of short-term payoffs and do the hard work of demonstrating that a model truly improves patient care.