AI predictive models shown to become unreliable over time in clinical settings
Abstract: https://www.acpjournals.org/doi/10.7326/M23-0949
Editorial: https://www.acpjournals.org/doi/10.7326/M23-2345
URL goes live when the embargo lifts
Researchers from the Icahn School of Medicine at Mount Sinai Health System and the University of Michigan School of Medicine simulated three common scenarios of model implementation, and the associated changes in model performance, using data from 130,000 critical care admissions. Each scenario considers deployment of models that predict the risk for death or acute kidney injury (AKI) in the first 5 days after admission to the ICU. Scenario 1 considers implementing and then retraining a mortality prediction model; scenario 2 considers implementing an AKI model and subsequently creating a new mortality prediction model; and scenario 3 considers implementing an AKI model and a mortality prediction model simultaneously.

The authors found that the model in scenario 1 lost 9% to 39% specificity after a single retraining. The mortality model in scenario 2 lost 8% to 15% specificity when it was created after the AKI model had been in use. In scenario 3, the simultaneously implemented AKI and mortality models each reduced the effective accuracy of the other by 1% to 28%. The authors report that, in every scenario, models trained on data from populations that had already benefited from interventions prompted by model predictions performed worse than the original models.
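To illustrate the mechanism behind this degradation, the toy simulation below sketches one such feedback loop in Python. It is not the authors' simulation code: the single synthetic risk feature, the 60% treatment effect among flagged patients, and the rule of re-choosing the alert threshold at 90% sensitivity after retraining are all illustrative assumptions.

```python
# Toy sketch (not the study's code) of the feedback loop described above:
# a mortality model is deployed, clinicians act on its alerts and avert some
# deaths, and the model is then retrained on those intervention-altered outcomes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(42)

def make_cohort(n=20_000):
    """Synthetic ICU cohort: a single risk feature drives true mortality."""
    x = rng.normal(size=(n, 1))
    p_death = 1.0 / (1.0 + np.exp(-(2.0 * x[:, 0] - 2.0)))
    y = rng.binomial(1, p_death)
    return x, y

def threshold_at_90pct_sensitivity(scores, y):
    """Pick the score cutoff that flags 90% of observed deaths (assumed rule)."""
    return np.quantile(scores[y == 1], 0.10)

# 1) Train and threshold the original mortality model on pre-deployment data.
x0, y0 = make_cohort()
model_v1 = LogisticRegression().fit(x0, y0)
thr_v1 = threshold_at_90pct_sensitivity(model_v1.predict_proba(x0)[:, 1], y0)

# 2) Deploy it: clinicians intervene on flagged patients and avert ~60% of
#    the deaths that would otherwise have occurred in that group.
x1, y1 = make_cohort()
flagged = model_v1.predict_proba(x1)[:, 1] >= thr_v1
averted = flagged & (y1 == 1) & (rng.random(len(y1)) < 0.6)
y1_obs = np.where(averted, 0, y1)          # outcomes recorded in the EHR

# 3) Retrain on post-deployment data, whose labels reflect the interventions,
#    and re-pick the threshold to keep 90% sensitivity on those labels.
model_v2 = LogisticRegression().fit(x1, y1_obs)
thr_v2 = threshold_at_90pct_sensitivity(model_v2.predict_proba(x1)[:, 1], y1_obs)

# 4) Compare both models on a fresh, untreated cohort. Observed deaths in the
#    retraining data skew toward lower-risk patients, so the retrained
#    threshold loosens and specificity falls.
x2, y2 = make_cohort()
for name, model, thr in [("original", model_v1, thr_v1),
                         ("retrained", model_v2, thr_v2)]:
    pred = (model.predict_proba(x2)[:, 1] >= thr).astype(int)
    sens = recall_score(y2, pred)
    spec = recall_score(y2, pred, pos_label=0)
    print(f"{name:>9}: sensitivity={sens:.2f}  specificity={spec:.2f}")
```

In this sketch, deaths averted among flagged patients make high-risk cases look lower risk in the retraining data, so the retrained model's alert threshold loosens and specificity drops on patients who have not yet been treated.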
Based on these findings, the authors recommend that, rather than adopting a universal updating strategy, model developers simulate each model’s updating strategy at every site where the model is to be implemented. They also recommend measures to track how and when predictions influence clinical decision making, because most of the suggested mitigation strategies depend on this information being available, and without it, EHR data may be rendered unsuitable for training future models.
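One hypothetical way to capture that information is to log an audit record alongside each prediction so that later training pipelines can identify outcomes that may have been altered by model-prompted care. The schema below is illustrative only; the field names and structure are assumptions, not a standard or the authors' proposal.

```python
# Illustrative audit record for tracking when a prediction may have
# influenced care; all fields are hypothetical examples.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional
import json

@dataclass
class PredictionAuditRecord:
    patient_id: str
    model_name: str                  # e.g., "icu_mortality"
    model_version: str               # which trained artifact produced the score
    risk_score: float                # probability shown to clinicians
    threshold: float                 # alert threshold in force at prediction time
    alert_fired: bool                # did the score cross the threshold?
    acknowledged_by: Optional[str]   # clinician who viewed or acted on the alert
    intervention_ordered: bool       # was a model-prompted intervention ordered?
    predicted_at: str                # ISO-8601 timestamp of the prediction

# Example entry for a single alert.
record = PredictionAuditRecord(
    patient_id="example-0001",
    model_name="icu_mortality",
    model_version="v1.3.0",
    risk_score=0.82,
    threshold=0.60,
    alert_fired=True,
    acknowledged_by="clinician-42",
    intervention_ordered=True,
    predicted_at=datetime.now(timezone.utc).isoformat(),
)
print(json.dumps(asdict(record), indent=2))
```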
An accompanying editorial by authors from Johns Hopkins University provides important context for these findings. They note that the drift observed in the models in this study also appears in AI used in other contexts, including popular large language models such as ChatGPT. The editorial authors highlight that these models can collapse if recursively trained on their own output, and that this kind of “noise” introduced into other clinical models may further degrade clinical predictions in the future. They also suggest that addressing model drift may best begin with inspection rather than immediate correction. Finally, they suggest that, as with other interventions, clinical trials may be required to evaluate the effect of AI models on relevant clinical outcomes.