Temperature-Driven Variability in Emergency Diagnostic Accuracy by a Leading Language Model

Abstract

Authors » Jarrett, P., Hill, J., Howell, M., Grabow Moore, K., Thoppil, J. J., Vargas Ortiz, L., et al.

Category » Primary study

Pre-print » medRxiv

Year » 2025

Links » DOI

Objective: To determine the impact of the temperature parameter on GPT-4o's diagnostic accuracy when evaluating emergency medicine cases, and to assess its effect on diagnostic divergence across iterations.

Methods: We conducted a simulation-based diagnostic accuracy study using four challenging emergency medicine cases adapted from the Foundations of Emergency Medicine curriculum. Each case was submitted to GPT-4o 250 times at each of five temperature settings (0.0, 0.25, 0.50, 0.75, 1.0), both with and without physical examination findings, yielding 10,000 total outputs. Each output contained exactly three differential diagnoses, one of which was designated the leading diagnosis. Diagnostic accuracy was assessed by comparing outputs against predetermined gold-standard diagnoses.

Results: At temperature 0.0, GPT-4o achieved 100% leading-diagnosis accuracy across all cases with physical exam data. As temperature increased, accuracy declined steadily, reaching 89.4% at temperature 1.0. Diagnostic divergence rose from an average of 4.5 unique diagnoses at temperature 0.0 to 26.25 at temperature 1.0 (a 5.83-fold rise, i.e. 583% of the baseline value). Sensitivity to temperature varied widely by case: ascending cholangitis was the most affected (accuracy dropping from 100% to 70.4%), while carbon monoxide poisoning maintained 100% accuracy across all settings.

Discussion: Higher temperatures introduced concerning diagnostic inconsistency rather than beneficial exploration, with substantial accuracy degradation in temperature-sensitive cases.
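The study design and the two outcome measures can be illustrated with a short Python sketch. It is not the authors' code. The function names, the exact-string matching of diagnoses, and the way unique diagnoses are pooled within one case/temperature/exam cell are assumptions made for illustration. Only the design parameters (four cases, five temperatures, two exam conditions, 250 runs) and the reported averages come from the abstract.

```python
from itertools import product

# Design parameters as stated in the abstract.
N_CASES = 4
TEMPERATURES = (0.0, 0.25, 0.50, 0.75, 1.0)
EXAM_CONDITIONS = (True, False)  # with / without physical exam findings
RUNS_PER_CELL = 250


def design_size():
    """Total number of model outputs implied by the full factorial design."""
    cells = list(product(range(N_CASES), TEMPERATURES, EXAM_CONDITIONS))
    return len(cells) * RUNS_PER_CELL


def leading_accuracy(leading_diagnoses, gold):
    """Fraction of runs whose leading diagnosis matches the gold standard.

    Assumption: diagnoses are compared as already-normalized strings.
    """
    return sum(d == gold for d in leading_diagnoses) / len(leading_diagnoses)


def unique_diagnoses(differentials):
    """Count distinct diagnoses across all runs in one design cell.

    Each run is a sequence of three differential diagnoses.
    """
    return len({d for run in differentials for d in run})


def fold_change(baseline, value):
    """Ratio of a value to its baseline."""
    return value / baseline


if __name__ == "__main__":
    print(design_size())
    # Reported average unique diagnoses at temperature 0.0 and 1.0.
    print(round(fold_change(4.5, 26.25), 2))
```

With 4 x 5 x 2 = 40 design cells and 250 runs per cell, `design_size()` reproduces the 10,000 outputs stated in the Methods.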

The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Epistemonikos ID: f48bbd6c46dc01cd1938b53322b289f6e2c4fd44
First added on: Jun 07, 2025
