Can LLMs simplify operative notes? A comparative analysis in otorhinolaryngology

Kilictas, Ahmet; Gul, Oguz; Kilictas, Bilgesah; Kaba, Esat; Erdivanlı, Başar

doi:10.1007/s00405-025-09758-2

Kilictas A. U., Gul O., Kilictas B., Kaba E., Erdivanlı B.

EUROPEAN ARCHIVES OF OTO-RHINO-LARYNGOLOGY, vol. 283, no. 1, pp. 477-489, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 283 Issue: 1
Publication Date: 2026
DOI: 10.1007/s00405-025-09758-2
Journal Name: EUROPEAN ARCHIVES OF OTO-RHINO-LARYNGOLOGY
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, EMBASE, MEDLINE
Pages: pp. 477-489
Affiliated with Recep Tayyip Erdoğan University: Yes

Abstract

Introduction: Operative notes play a critical role in documenting surgical procedures and supporting medical communication. However, due to their technical language, these documents are often complex and difficult to understand for patients, non-medical individuals, and even some healthcare professionals. Large Language Models (LLMs) offer a novel opportunity to simplify such documents and make them more accessible. This study aims to quantify how six LLMs simplify otolaryngology operative notes and to compare readability, clinical accuracy and clarity.

Materials and methods: In this study, 39 fictional operative notes specific to otolaryngologic surgery were simplified using six LLMs (GPT-4, GPT-4o, Claude 3.7, Gemini 2.0, DeepSeek, and Microsoft Copilot). The outputs were analyzed using eight different readability metrics and evaluated by two expert physicians in terms of medical accuracy and comprehensibility. Correlation analyses were also conducted across clinical subgroups (rhinology, otology, head and neck surgery).

Results: Claude 3.7 produced the most complex outputs, whereas GPT-4o, Gemini, and DeepSeek generated the most readable texts. According to expert evaluations, GPT-4 achieved the highest scores for medical accuracy, while GPT-4o received the highest ratings for clarity. Model performance varied across clinical subgroups.

Conclusion: LLMs are effective tools for simplifying medical texts; however, model selection should consider the target audience and clinical context, and all outputs must be verified by medical experts. When used in a controlled and validated manner, LLMs may contribute significantly to a new era of health communication.

Level of evidence: N/A.