2Department of Emergency Medicine, Gaziantep City Hospital, Gaziantep, Türkiye
Abstract
Objective: To evaluate the diagnostic accuracy and reliability of closed-source, multimodal large language models (LLMs)—ChatGPT-4o, ChatGPT-4.5, and Gemini 2.5 Pro—in detecting acute knee fractures on radiographs compared with an emergency medicine specialist and a radiologist.
Materials and Methods: This retrospective study included 252 patients who underwent both knee radiography and CT between September 2023 and July 2025. Fracture status was determined by CT and reviewed by radiologists. Anteroposterior and lateral radiographs were independently assessed by an emergency medicine specialist, a radiologist, and three LLMs. Diagnostic performance was evaluated using sensitivity, specificity, predictive values, likelihood ratios, accuracy, and the area under the receiver operating characteristic curve (AUC). Reliability and paired differences were assessed using Cohen’s kappa and McNemar’s test.
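The sketch below is ours, not the study’s analysis code: a minimal Python example, assuming a single 2×2 table of reader calls against the CT reference standard, that computes the metrics named above. The counts are hypothetical placeholders chosen only to match the study’s sample size (252) and number of CT-positive cases (58).

# Minimal sketch (not the authors' code): diagnostic metrics from one
# hypothetical 2x2 table of reader calls vs. the CT reference standard.
from statsmodels.stats.contingency_tables import mcnemar

tp, fp, fn, tn = 40, 12, 18, 182   # hypothetical counts; n=252, 58 CT-positive
n = tp + fp + fn + tn

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)                       # positive predictive value
npv = tn / (tn + fn)                       # negative predictive value
lr_pos = sensitivity / (1 - specificity)   # positive likelihood ratio
lr_neg = (1 - sensitivity) / specificity   # negative likelihood ratio
accuracy = (tp + tn) / n

# Cohen's kappa: observed vs. chance-expected agreement on the same table
# (rows = reader call, columns = CT reference).
p_obs = accuracy
p_exp = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
kappa = (p_obs - p_exp) / (1 - p_exp)

# Exact McNemar's test on the discordant cells (fp vs. fn).
p_mcnemar = mcnemar([[tp, fp], [fn, tn]], exact=True).pvalue

print(f"Se={sensitivity:.3f} Sp={specificity:.3f} PPV={ppv:.3f} NPV={npv:.3f}")
print(f"LR+={lr_pos:.2f} LR-={lr_neg:.2f} Acc={accuracy:.3f} "
      f"kappa={kappa:.2f} McNemar p={p_mcnemar:.3f}")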
Results: According to CT findings, fractures were present in 23.0% (n=58) of patients. The LLMs demonstrated low sensitivity (ChatGPT-4o, 37.9%; ChatGPT-4.5, 13.8%; Gemini 2.5 Pro, 10.3%) with only moderate overall accuracy (72–77%). In contrast, the radiologist achieved 92.1% accuracy, with high sensitivity (77.6%) and specificity (96.4%), whereas the emergency medicine specialist showed 83.7% accuracy. AUC comparisons revealed significantly higher diagnostic performance for both clinicians, particularly the radiologist, than for all LLMs (p<0.05). Consistency analysis showed moderate agreement for ChatGPT-4o, slight agreement for ChatGPT-4.5, and substantial agreement for Gemini 2.5 Pro.
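As an arithmetic cross-check (ours, not from the paper), the radiologist’s reported accuracy is consistent with the reported sensitivity, specificity, and prevalence, since accuracy is the prevalence-weighted average of the two:

# Cross-check: accuracy = Se * prevalence + Sp * (1 - prevalence)
prevalence = 58 / 252                # CT-positive fraction reported above
se, sp = 0.776, 0.964                # radiologist's reported values
accuracy = se * prevalence + sp * (1 - prevalence)
print(f"{accuracy:.3f}")             # 0.921, matching the reported 92.1%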
Conclusion: Closed-source LLMs performed worse than clinicians in diagnosing acute knee fractures on radiographs, with a high risk of missed fractures. Although they may support triage by reliably identifying normal cases, they are not sufficient for standalone diagnostic use.
