The use of ChatGPT to replace anterior segment experts may be premature.
Artificial intelligence (AI) may someday be a valuable asset in medical practice by achieving equivalency with medical experts, but that time has not yet arrived in the cornea subspecialty, according to a recent study evaluating AI’s performance in cataract, cornea, and refractive surgery clinical scenarios.
ChatGPT 4.0 demonstrated suboptimal absolute agreement with expert users and a 12% hallucination or incorrect answer rate, though it performed relatively better in some areas, according to Laura Palazzolo, MD, ABO, and Gaurav Prakash, MD, FRCS. The investigators are, respectively, from New York University Grossman School of Medicine, in Huntington Station, and the Department of Ophthalmology, University of Pittsburgh School of Medicine, in Pennsylvania.
By way of background, large language models (LLMs) represent an evolving frontier in generative AI, capable of learning language patterns and nuances, including medical terminology and concepts, through extensive training data. ChatGPT 4.0, the study’s focus, is one of the most popular LLMs and was trained with 1.7 trillion parameters, the investigators explained.
Several studies have assessed LLMs’ capabilities with single-correct-response multiple-choice questions.1-5 ChatGPT’s performance in more complex, open-ended clinical scenarios has been explored through small studies on retina, glaucoma, and neuro-ophthalmology.6-8
However, Palazzolo and Prakash noted a lack of robust data on open-ended clinical cases compared among international experts and LLMs in cataract, cornea, and refractive surgeries.
To address this gap, they conducted a comparative study evaluating ChatGPT 4.0, a commercially available LLM, in technically nuanced ophthalmic clinical scenarios, comparing its performance with published expert answers.
They presented ChatGPT with open-ended clinical scenarios previously published9 on PubMed (2019 to 2023). The published experts had been instructed as follows: "Assume you are an experienced cornea, refractive, and anterior segment surgeon; analyze the given clinical scenario; and list your suggestions in bulleted points."
The investigators explained that the published expert answers (available behind a paywall and not presented to ChatGPT) were compared with ChatGPT's responses, with each ChatGPT answer treated as a bulleted point; cornea specialists performed the evaluation.
Each answer was labeled as correct, incorrect, or incomplete. The study end points were the absolute concordance rate (ACR), calculated at the level of full questions, and the subcomponent concordance rate (SCR), calculated at the level of individual bulleted points; each was the number of concordant items divided by the total number of questions or bulleted points, respectively. Hallucination or incorrect answer rates and incomplete answer rates were also measured.
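To make these definitions concrete, the following is a minimal sketch, not the authors' code, of how per-bullet labels could be rolled up into the SCR, the "AND"-gated ACR, and the bullet-level error rates; the data structure, labels, and function name are hypothetical.

```python
# Hypothetical sketch of the concordance metrics described in the study.
# Each question is represented as a list of per-bullet labels:
# "correct", "incorrect", or "incomplete" (illustrative only).

def concordance_metrics(questions):
    bullets = [label for q in questions for label in q]
    n_bullets = len(bullets)
    n_questions = len(questions)

    # Subcomponent concordance rate: correct bullets / all bullets.
    scr = sum(label == "correct" for label in bullets) / n_bullets

    # Absolute concordance rate: a question counts only if every one of
    # its bullets is correct ("AND"-gated clustering at the question level).
    acr = sum(all(label == "correct" for label in q) for q in questions) / n_questions

    # Hallucination/incorrect and incomplete rates at the bullet level.
    incorrect_rate = sum(label == "incorrect" for label in bullets) / n_bullets
    incomplete_rate = sum(label == "incomplete" for label in bullets) / n_bullets

    return scr, acr, incorrect_rate, incomplete_rate


# Example: one fully correct question and one with a flagged bullet.
print(concordance_metrics([["correct", "correct"], ["correct", "incorrect"]]))
```

Plugging the reported bullet-level counts into this kind of rollup gives back the published rates, for example 211 correct of 275 bulleted points for a 76.7% SCR.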
Because LLMs generate text in response to a prompt, they may produce factually incorrect information presented as accurate, known as hallucinations. A hallucination occurs when the AI generates an answer from insufficient data, whereas an incorrect answer stems from wrong or outdated information. Because of the black-box nature of LLMs, hallucination and incorrect answer rates were combined.
ChatGPT responded to 33 questions, yielding 275 bulleted points.
“The SCR was 76.7% (211 of 275), which dropped to 24.2% (8 of 33) when absolute (‘AND’-gated) clustering was performed at the question level for ACR (P = .02, chi-square test). ChatGPT covered all points noted by experts in only 36.4% (12 of 33) of cases,” the authors reported.
Additionally, ChatGPT concurred with the editors’ differential diagnosis in 20 of 33 cases, compared with 33 of 33 for the experts (P < .001, Fisher’s exact test). At the bulleted-point level, the hallucination/incorrect answer rate was 12.4% (34 of 275), the incomplete answer rate was 10.5% (29 of 275), and the correct-to-incorrect answer ratio was 6.2:1.
“ChatGPT 4.0 showed poor absolute agreement with expert users and had a 12% hallucination/incorrect answer rate,” Palazzolo and Prakash concluded. “Its relatively better performance in the subcomponents aligns with more optimistic published results in closed-set answers (multiple-choice questions). Open-ended clinical scenarios reflect real-world circumstances, and ChatGPT appears premature for this use.”