AI tools trained to detect signs of pneumonia on chest X-rays can perform worse when tested on data from outside health systems, according to a study conducted at the Icahn School of Medicine at Mount Sinai. The researchers concluded that more testing is needed before AI tools are used on real patients.
The results suggest that AI systems used in medical settings may need broader training, covering a wider range of patient populations and imaging equipment.
Researchers examined how AI models detected pneumonia in 158,000 chest X-rays from three medical institutions: the National Institutes of Health, The Mount Sinai Hospital, and Indiana University Hospital.
Pneumonia was chosen because it is a common condition with clear clinical significance and is widely studied in the research community, making it well suited to the study.
The research found that in three out of five comparisons, the performance of convolutional neural networks (CNNs), used to analyse medical imaging and provide computer-aided diagnoses, was significantly lower on X-rays from hospitals outside a network's own health system than on X-rays from the original health system.
AI disease detection challenges
On top of this, the CNNs were able to identify the hospital system where an X-ray was acquired with a high degree of accuracy, and exploited this to cheat at their predictive task, basing predictions on the prevalence of pneumonia at the training institution rather than on the images themselves.
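This shortcut behaviour can be illustrated with a toy simulation. The sketch below is not taken from the study; all numbers and feature names are illustrative assumptions. It models two hospitals with different pneumonia prevalence and a site-specific imaging artifact, then shows how a "model" that only recognises the site, and outputs that site's training prevalence, can look well calibrated while learning nothing about individual patients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative assumptions, not figures from the study):
# hospital A has 30% pneumonia prevalence, hospital B has 10%.
n = 5000
site = rng.integers(0, 2, n)                  # 0 = hospital A, 1 = hospital B
prevalence = np.where(site == 0, 0.30, 0.10)
pneumonia = rng.random(n) < prevalence

# Each "image" is reduced to two features: a site artifact (e.g. a
# scanner-specific intensity offset) and a weak true disease signal.
site_artifact = site + rng.normal(0, 0.05, n)
disease_signal = pneumonia * 0.2 + rng.normal(0, 1.0, n)

# A shortcut model that ignores the disease signal entirely: it detects
# the site from the artifact and outputs that site's training prevalence.
detected_site = (site_artifact > 0.5).astype(int)
predicted_prob = np.where(detected_site == 0, 0.30, 0.10)

site_accuracy = (detected_site == site).mean()
print(f"site detection accuracy:  {site_accuracy:.3f}")

# Per site, the shortcut matches the observed prevalence, yet it carries no
# per-patient information: within a site, every patient gets the same score.
print(f"mean predicted prob at A: {predicted_prob[site == 0].mean():.2f}")
print(f"actual prevalence at A:   {pneumonia[site == 0].mean():.2f}")
```

Because average scores per hospital match the observed prevalence, aggregate metrics on data from the training institutions can look reasonable, which is exactly why the confound is easy to miss until the model is tested at an outside health system.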
The researchers noted that a difficulty of using deep learning models in medicine is their massive number of parameters, which makes it challenging to identify the specific variables driving predictions, such as the types of CT scanners used at a hospital or the resolution quality of its imaging.
Senior author Eric Oermann, MD, Instructor in Neurosurgery at the Icahn School of Medicine at Mount Sinai, said:
“Our findings should give pause to those considering rapid deployment of artificial intelligence platforms without rigorously assessing their performance in real-world clinical settings reflective of where they are being deployed.
“Deep learning models trained to perform medical diagnosis can generalise well, but this cannot be taken for granted since patient populations and imaging techniques differ significantly across institutions.”
Also commenting on the research, first author John Zech, a medical student at the Icahn School of Medicine, said:
“If CNN systems are to be used for medical diagnosis, they must be tailored to carefully consider clinical questions, tested for a variety of real-world scenarios, and carefully assessed to determine how they impact accurate diagnosis.”