TL;DR: In a systematic review published in JAMA, Bedi and colleagues summarized evaluations of large language models (LLMs) in healthcare. The review covered 519 studies published between 2022 and 2024 and found that only 5% used real patient care data for evaluation. Almost 45% of the studies assessed the accuracy of medical knowledge, such as answering questions from medical licensing examinations, and another 19.5% focused on medical diagnosis. Few studies examined administrative tasks, such as billing and writing prescriptions. The vast majority of studies (95.4%) used accuracy as the primary dimension of evaluation, with little attention given to bias, fairness, toxicity, deployment, calibration, or uncertainty. The review's authors recommended that future evaluations use real clinical data, adopt standardized applications and metrics, and broaden their focus to include LLMs that address less common healthcare tasks.