Comparative performance analysis of AI-based large language models in assessing cervical vertebral maturation stages on lateral cephalometric radiographs


Erdem R., Yildirim A., Genc Y. S., BEŞER GÜL B., NARALAN M. E., Cicek O.

BMC ORAL HEALTH, vol.26, no.1, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 26 Issue: 1
  • Publication Date: 2026
  • DOI Number: 10.1186/s12903-026-08293-8
  • Journal Name: BMC ORAL HEALTH
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, EMBASE, MEDLINE, Directory of Open Access Journals, Natural Science Collection (ProQuest), Biological Science Database (ProQuest), Biomedical Reference Collection: Corporate Edition (EBSCO), Health Research Premium Collection (ProQuest)
  • Affiliated with Recep Tayyip Erdoğan University: Yes

Abstract

Background: The aim of this study was to evaluate the performance of artificial intelligence (AI)-based large language models (LLMs) in predicting cervical vertebral maturation (CVM) stages on lateral cephalometric radiographs.

Methods: This retrospective study evaluated the performance of AI-based LLMs in predicting CVM stages using 120 lateral cephalometric radiographs obtained from individuals aged 6-19 years. The radiographs, which included an equal number of samples from each CVM stage, were independently classified by two experienced orthodontists, with the consensus-established stages serving as the gold standard. Five distinct LLMs (GPT-4o, GPT-o3 pro, GPT-5, GPT-5 pro, and Grok-4) were tested in separate sessions using the same command for each image. Model performance was assessed using accuracy, correlation coefficients, Bland-Altman analysis, and mean absolute error (MAE).

Results: Exact-match accuracy of the AI-based LLMs ranged between 14% and 28%, while accuracy within ±1 stage tolerance ranged from 55% to 64%. GPT-4o demonstrated the highest correlation with the reference standard (rho = 0.616, p < 0.001), followed by GPT-5 pro (rho = 0.535). Other AI-based LLMs exhibited moderate correlations (rho = 0.3-0.4). Bland-Altman analyses indicated bias values close to zero but revealed wide limits of agreement. MAE values were comparable across AI-based LLMs, with no statistically significant differences (p > 0.05).

Conclusions: Current LLMs did not exhibit clinically acceptable agreement with expert CVM assessments, showing wide error margins that limit their clinical utility. LLMs should presently be considered only as supportive tools. Further improvements in training and multimodal model design are needed to improve their diagnostic reliability.
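
The agreement measures named in the abstract (exact-match accuracy, accuracy within ±1 stage, MAE, rank correlation, and Bland-Altman bias with limits of agreement) can be illustrated with a minimal Python sketch. This is not the authors' analysis code. It assumes CVM stages are coded as integers 1 to 6, that the reported rho is a Spearman coefficient, and that limits of agreement are taken as bias ± 1.96 standard deviations of the differences. The function names and the toy data are hypothetical and do not come from the study.

```python
from statistics import mean, stdev


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def agreement_metrics(reference, predicted):
    """Compare predicted stages against reference stages (assumed integer-coded)."""
    n = len(reference)
    diffs = [p - r for r, p in zip(reference, predicted)]
    abs_err = [abs(d) for d in diffs]
    bias = mean(diffs)      # Bland-Altman mean difference
    sd = stdev(diffs)       # sample standard deviation of the differences
    return {
        "exact_accuracy": sum(e == 0 for e in abs_err) / n,
        "within_one_accuracy": sum(e <= 1 for e in abs_err) / n,
        "mae": sum(abs_err) / n,
        # Spearman rho computed as Pearson correlation of tie-averaged ranks.
        "spearman_rho": pearson(average_ranks(reference), average_ranks(predicted)),
        "bias": bias,
        "loa_lower": bias - 1.96 * sd,
        "loa_upper": bias + 1.96 * sd,
    }


# Hypothetical toy data, not taken from the study.
reference = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
predicted = [1, 3, 3, 2, 5, 5, 2, 2, 6, 4, 4, 6]

metrics = agreement_metrics(reference, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The sketch also shows why a bias near zero can coexist with wide limits of agreement, as the Results report: over- and under-estimates cancel in the mean difference while still inflating its standard deviation.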