Human medical documentation significantly outperforms ChatGPT-4o in critical clinical dimensions: A blinded comparative assessment in paediatric orthopaedics

Abstract

Authors » Camathias C, Papp K, Betschart P, Speth B, Valderrabano V, Ammann E

Category » Primary study

Journal » Knee surgery, sports traumatology, arthroscopy : official journal of the ESSKA

Year » 2026

Purpose: This study evaluated the quality of ChatGPT-generated medical history summaries compared to human-created documentation in a paediatric orthopaedic practice setting.

Methods: A prospective, randomised, blinded comparative study was conducted involving 20 consecutive paediatric patients (mean age 14.2 ± 2.3 years; 11 males, 9 females) presenting with knee problems. Audio recordings of medical consultations were transcribed and processed by ChatGPT-4o (OpenAI) using standardised prompts. Three independent orthopaedic specialists evaluated both human-generated and AI-generated summaries using eight quality criteria: temporal consistency, spatial consistency, accident description, symptom accuracy, symptom specificity, previous interventions, writing style and overall impression. Each criterion was scored on a 6-point Likert scale.

Results: Human-created summaries received significantly higher overall ratings (5.2 ± 0.8) compared to ChatGPT-generated summaries (4.5 ± 0.8, p < 0.001, Cohen's d = 0.80). After Bonferroni correction for multiple comparisons, statistically significant differences favouring human documentation were confirmed in four of eight criteria: temporal consistency (p < 0.001), spatial consistency (p < 0.001), accident description (p < 0.001) and overall impression (p < 0.001). No significant differences were observed for writing style and documentation of previous interventions. Inter-rater reliability was moderate (ICC = 0.64). ChatGPT demonstrated frequent temporal inconsistencies (14 of 60 evaluations, 23%) and omission of relevant accident details (21 of 60 evaluations, 35%).

Conclusion: While AI-generated summaries showed acceptable stylistic quality, human documentation significantly outperformed ChatGPT in critical clinical dimensions, including temporal consistency and accuracy of complex orthopaedic presentations. Current large language models are not ready to replace human medical documentation in paediatric orthopaedic practice without careful oversight. The findings support the implementation of hybrid workflows where AI assists but does not replace human clinical judgement.

Level of Evidence: Level I.

© 2026 The Author(s). Knee Surgery, Sports Traumatology, Arthroscopy published by John Wiley & Sons Ltd on behalf of European Society of Sports Traumatology, Knee Surgery and Arthroscopy.
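
The arithmetic behind some of the reported Results figures can be illustrated with a short sketch. This is not the authors' analysis code. The family-wise significance level of 0.05 is an assumption (the abstract does not state it), the function names are hypothetical, and the reading of 60 evaluations as 20 patients times 3 raters is an inference that is consistent with the abstract but not spelled out in it.

```python
# Figures quoted in the abstract
N_CRITERIA = 8            # quality criteria compared
N_PATIENTS = 20           # consecutive paediatric patients
N_RATERS = 3              # independent orthopaedic specialists
N_EVALUATIONS = N_PATIENTS * N_RATERS  # 60, matching "of 60 evaluations"

# Assumption: conventional family-wise alpha; not stated in the abstract
ALPHA = 0.05


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def percentage(count: int, total: int) -> int:
    """Share of evaluations, rounded to the nearest whole percent."""
    return round(100 * count / total)


threshold = bonferroni_threshold(ALPHA, N_CRITERIA)
temporal_pct = percentage(14, N_EVALUATIONS)   # temporal inconsistencies
accident_pct = percentage(21, N_EVALUATIONS)   # omitted accident details

print(f"Per-criterion threshold: {threshold}")
print(f"Temporal inconsistencies: {temporal_pct}%")
print(f"Omitted accident details: {accident_pct}%")
```

Under the assumed alpha of 0.05, each of the eight criteria would need p < 0.00625 to count as significant, which the four criteria reported at p < 0.001 satisfy.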

Epistemonikos ID: 83ba691765d2a9e1f2b6d7cceaa69ae7f13b0ee9
First added on: Mar 01, 2026