2Department of Emergency Medicine, Gaziantep City Hospital, Gaziantep, Türkiye
Abstract
Objective: To evaluate the diagnostic accuracy and reliability of closed-source, multimodal large language models (LLMs)—ChatGPT-4o, ChatGPT-4.5, and Gemini 2.5 Pro—in detecting acute knee fractures on radiographs compared with an emergency medicine specialist and a radiologist.
Materials and Methods: This retrospective study included 252 patients who underwent both knee radiography and CT between September 2023 and July 2025. Fracture status was determined by CT and reviewed by radiologists. Anteroposterior and lateral radiographs were independently assessed by an emergency medicine specialist, a radiologist, and three LLMs. Diagnostic performance was evaluated using sensitivity, specificity, predictive values, likelihood ratios, accuracy, and the area under the receiver operating characteristic curve (AUC). Reliability and paired differences were assessed using Cohen’s kappa and McNemar’s test.
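The sketch below is ours, not the study’s analysis code: a minimal Python example, assuming a single 2×2 table of reader calls against the CT reference standard, that computes the metrics named above. The counts are hypothetical placeholders chosen only to match the study’s sample size (252) and number of CT-positive cases (58).

# Minimal sketch (not the authors' code): diagnostic metrics from one
# hypothetical 2x2 table of reader calls vs. the CT reference standard.
from statsmodels.stats.contingency_tables import mcnemar

tp, fp, fn, tn = 40, 12, 18, 182   # hypothetical counts; n=252, 58 CT-positive
n = tp + fp + fn + tn

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                       # positive predictive value
npv = tn / (tn + fn)                       # negative predictive value
lr_pos = sensitivity / (1 - specificity)   # positive likelihood ratio
lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio
accuracy = (tp + tn) / n

# Cohen's kappa: observed vs. chance-expected agreement on the same table
# (rows = reader call, columns = CT reference).
p_obs = accuracy
p_exp = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
kappa = (p_obs - p_exp) / (1 - p_exp)

# Exact McNemar's test on the discordant cells (fp vs. fn).
p_mcnemar = mcnemar([[tp, fp], [fn, tn]], exact=True).pvalue

print(f"Se={sensitivity:.3f} Sp={specificity:.3f} PPV={ppv:.3f} NPV={npv:.3f}")
print(f"LR+={lr_pos:.2f} LR-={lr_neg:.2f} Acc={accuracy:.3f} "
      f"kappa={kappa:.2f} McNemar p={p_mcnemar:.3f}")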
Results: According to CT findings, fractures were present in 23.0% (n=58) of patients. The LLMs demonstrated low sensitivity (ChatGPT-4o, 37.9%; ChatGPT-4.5, 13.8%; Gemini 2.5 Pro, 10.3%) with only moderate overall accuracy (72–77%). In contrast, the radiologist achieved 92.1% accuracy, with high sensitivity (77.6%) and specificity (96.4%), whereas the emergency medicine specialist showed 83.7% accuracy. AUC comparisons revealed significantly higher diagnostic performance for both clinicians, particularly the radiologist, than for all LLMs (p<0.05). Consistency analysis showed moderate agreement for ChatGPT-4o, slight agreement for ChatGPT-4.5, and substantial agreement for Gemini 2.5 Pro.
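As an arithmetic cross-check (ours, not from the paper), the radiologist’s reported accuracy is consistent with the reported sensitivity, specificity, and prevalence, since accuracy is the prevalence-weighted average of the two:

# Cross-check: accuracy = Se * prevalence + Sp * (1 - prevalence)
prevalence = 58 / 252                # CT-positive fraction reported above
se, sp = 0.776, 0.964                # radiologist's reported values
accuracy = se * prevalence + sp * (1 - prevalence)
print(f"{accuracy:.3f}")             # 0.921, matching the reported 92.1%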
Conclusion: Closed-source LLMs performed worse than clinicians in diagnosing acute knee fractures on radiographs, with a high risk of missed fractures. Although they may support triage by reliably identifying normal cases, they are not sufficient for standalone diagnostic use.
