TL;DR: In a systematic review published in JAMA, Bedi and colleagues summarized evaluations of large language models (LLMs) in healthcare. The review covered 519 studies published between 2022 and 2024 and found that only 5% used real patient care data for evaluation. Almost 45% of the studies assessed the accuracy of medical knowledge, such as answering questions from medical licensing examinations, and another 19.5% focused on medical diagnosis. Few studies examined administrative tasks, such as billing and writing prescriptions. The vast majority of studies (95.4%) used accuracy as the primary dimension of evaluation, with little attention given to bias, fairness, toxicity, deployment, calibration, or uncertainty. The review's authors recommended that future evaluations use real clinical data, adopt standardized applications and metrics, and broaden their focus to include LLMs that address less common healthcare tasks.