Digital Edition

Ophthalmology Times: December 2024
Volume 49, Issue 12

Generative AI: It’s only just begun

Key Takeaways

  • ChatGPT 4.0 showed a 12% hallucination/incorrect answer rate and poor agreement with experts in cornea subspecialty scenarios.
  • The AI model performed better in subcomponents but struggled with open-ended clinical scenarios, reflecting real-world complexities.

The use of ChatGPT to replace anterior segment experts may be premature.

(Image Credit: AdobeStock/Smart Future)

Artificial intelligence (AI) may someday be a valuable asset in medical practice by achieving equivalency with medical experts, but that time has not yet arrived in the cornea subspecialty, according to a recent study evaluating AI’s performance in cataract, cornea, and refractive surgery clinical scenarios.

ChatGPT 4.0 demonstrated suboptimal absolute agreement with expert users and a 12% hallucination or incorrect answer rate, though it performed relatively better in some areas, according to Laura Palazzolo, MD, ABO, and Gaurav Prakash, MD, FRCS. The investigators are, respectively, from New York University Grossman School of Medicine, in Huntington Station, and the Department of Ophthalmology, University of Pittsburgh School of Medicine, in Pennsylvania.

By way of background, large language models (LLMs) represent an evolving frontier in generative AI, capable of learning language patterns and nuances, including medical terminology and concepts, through extensive training data. ChatGPT 4.0, the study’s focus, is one of the most popular LLMs and was trained with 1.7 trillion parameters, the investigators explained.

Several studies have assessed LLMs’ capabilities with single-correct-response multiple-choice questions.1-5 ChatGPT’s performance in more complex, open-ended clinical scenarios has been explored through small studies on retina, glaucoma, and neuro-ophthalmology.6-8

However, Palazzolo and Prakash noted a lack of robust data comparing international experts and LLMs on open-ended clinical cases in cataract, cornea, and refractive surgery.

To address this gap, they evaluated ChatGPT 4.0, a commercially available LLM, on technically nuanced ophthalmic clinical scenarios and compared its performance with published expert answers.

They presented ChatGPT with open-ended clinical scenarios previously published9 on PubMed (2019 to 2023). The published experts had been instructed as follows: Assume you are an experienced cornea, refractive, and anterior segment surgeon, analyze the given clinical scenario, and list your suggestions in bulleted points.
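
The article does not state how ChatGPT 4.0 was queried. As an illustration only, the sketch below shows how the published instruction could be posed programmatically, assuming the OpenAI Python client rather than the ChatGPT web interface; the scenario text and model name are placeholders.

    # Illustrative sketch only: posing the published instruction to a
    # GPT-4-class model with the OpenAI Python client. The study does not
    # specify how the model was accessed; the scenario text is a placeholder.
    from openai import OpenAI

    client = OpenAI()  # reads the OPENAI_API_KEY environment variable

    instruction = (
        "Assume you are an experienced cornea, refractive, and anterior "
        "segment surgeon, analyze the given clinical scenario, and list "
        "your suggestions in bulleted points."
    )
    scenario = "<open-ended clinical scenario text>"

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": scenario},
        ],
    )
    print(response.choices[0].message.content)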

The investigators explained that the published expert answers (available behind a paywall and not presented to ChatGPT) were compared with ChatGPT’s responses, with each ChatGPT answer considered as a bulleted point. Cornea specialists evaluated the expert responses.

The answers, the study's end point measurements, were labeled as correct, incorrect, or incomplete. Two concordance rates were calculated: the absolute concordance rate (ACR), the proportion of full questions for which every bulleted point agreed with the experts, and the subcomponent concordance rate (SCR), the proportion of individual bulleted points that agreed with the experts. Hallucination or incorrect answer rates and incomplete answer rates were also measured.
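
For illustration, the sketch below shows how such end points can be computed from per-point gradings; the labels, question set, and function are hypothetical and are not taken from the study.

    # Minimal sketch of the end point calculations described above, using
    # hypothetical gradings (not the study's data). Each question maps to the
    # labels assigned to ChatGPT's bulleted points for that question.
    from typing import Dict, List

    def concordance_metrics(graded: Dict[str, List[str]]) -> Dict[str, float]:
        points = [label for labels in graded.values() for label in labels]
        n_points = len(points)
        n_questions = len(graded)

        # SCR: share of individual bulleted points graded correct.
        scr = sum(label == "correct" for label in points) / n_points

        # ACR: "AND"-gated at the question level -- a question counts only
        # if every one of its bulleted points is correct.
        acr = sum(all(label == "correct" for label in labels)
                  for labels in graded.values()) / n_questions

        # Hallucination/incorrect and incomplete rates at the point level.
        incorrect = sum(label == "incorrect" for label in points) / n_points
        incomplete = sum(label == "incomplete" for label in points) / n_points

        return {"SCR": scr, "ACR": acr,
                "hallucination_or_incorrect": incorrect,
                "incomplete": incomplete}

    # Toy example with two questions (illustrative values only):
    graded = {
        "Q1": ["correct", "correct", "incomplete"],
        "Q2": ["correct", "incorrect", "correct", "correct"],
    }
    print(concordance_metrics(graded))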

Because LLMs generate text based on the prompt given, they may produce factually incorrect information presented as though it were accurate, known as hallucinations. A hallucination occurs when the AI generates an answer from limited data, whereas an incorrect answer stems from wrong or outdated information. Because of the black box nature of LLMs, hallucination and incorrect answer rates were combined.

Expert and ChatGPT comparison

ChatGPT responded to 33 questions, yielding 275 bulleted points.

“The SCR was 76.7% (211 of 275), which dropped to 24.2% (8 of 33) when absolute (‘AND’-gated) clustering was performed at the question level for ACR (P = .02, chi-square test). ChatGPT covered all points noted by experts in only 36.4% (12 of 33) of cases,” the authors reported.

Additionally, ChatGPT concurred with the editors’ differential diagnosis in 20 of 33 cases, compared with 33 of 33 for the experts (P < .001, Fisher’s exact test). At the bulleted-point level, the hallucination/incorrect answer rate was 12.4% (34 of 275), the incomplete answer rate was 10.5% (29 of 275), and the correct-to-incorrect answer ratio was 6.2:1.
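
The differential-diagnosis comparison can be checked from the counts reported above (20 of 33 cases for ChatGPT vs 33 of 33 for the experts) by placing them in a 2 × 2 table and running Fisher's exact test. The sketch below uses SciPy and is an illustrative check assembled from the article's figures, not the authors' own analysis code.

    # Fisher's exact test on the differential-diagnosis counts reported
    # above. Illustrative check only; the 2x2 table is built from the
    # article's figures, not taken from the authors' analysis.
    from scipy.stats import fisher_exact

    table = [
        [20, 13],  # ChatGPT: concurred with the editors' diagnosis / did not
        [33, 0],   # experts: concurred / did not
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
    print(f"Fisher exact P = {p_value:.2g}")  # comes out well below .001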

“ChatGPT 4.0 showed poor absolute agreement with expert users and had a 12% hallucination/incorrect answer rate,” Palazzolo and Prakash concluded. “Its relatively better performance in the subcomponents aligns with more optimistic published results in closed-set answers (multiple-choice questions). Open-ended clinical scenarios reflect real-world circumstances, and ChatGPT appears premature for this use.”

Gaurav Prakash, MD, FRCS

E: drgauravprakash@gmail.com

Prakash has no financial interests related to the content of this article.

Laura Palazzolo, MD, ABO

E: laurabelle729@gmail.com

Palazzolo has no financial interests related to the content of this article. The data were presented at the American Society of Cataract and Refractive Surgery Annual Meeting, April 5-9, 2024, in Boston, Massachusetts (paper session: Surgical Outcomes III).

References
  1. Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312. doi:10.2196/45312
  2. Lin SY, Chan PK, Hsu WH, Kao CH. Exploring the proficiency of ChatGPT-4: an evaluation of its performance in the Taiwan advanced medical licensing examination. Digit Health. 2024;10:20552076241237678. doi:10.1177/20552076241237678
  3. Panthier C, Gatinel D. Success of ChatGPT, an AI language model, in taking the French language version of the European Board of Ophthalmology examination: a novel approach to medical knowledge assessment. J Fr Ophtalmol. 2023;46(7):706-711. doi:10.1016/j.jfo.2023.05.006
  4. Teebagy S, Colwell L, Wood E, Yaghy A, Faustina M. Improved performance of ChatGPT-4 on the OKAP examination: a comparative study with ChatGPT-3.5. J Acad Ophthalmol (2017). 2023;15(2):e184-e187. doi:10.1055/s-0043-1774399
  5. Fowler T, Pullen S, Birkett L. Performance of ChatGPT and Bard on the official part 1 FRCOphth practice questions. Br J Ophthalmol. 2024;108(10):1379-1383. doi:10.1136/bjo-2023-324091
  6. Maywood MJ, Parikh R, Deobhakta A, Begaj T. Performance assessment of an artificial intelligence chatbot in clinical vitreoretinal scenarios. Retina. 2024;44(6):954-964. doi:10.1097/IAE.0000000000004053
  7. Delsoz M, Raja H, Madadi Y, et al. The use of ChatGPT to assist in diagnosing glaucoma based on clinical case reports. Ophthalmol Ther. 2023;12(6):3121-3132. doi:10.1007/s40123-023-00805-x
  8. Madadi Y, Delsoz M, Lao PA, et al. ChatGPT assisting diagnosis of neuro-ophthalmology diseases based on case reports. medRxiv. Preprint posted online September 14, 2023. doi:10.1101/2023.09.13.23295508
  9. Nuijts RMMA, Kartal S. Epithelial ingrowth after LASIK September consultation #1. J Cataract Refract Surg. 2021;47(9):1242. doi:10.1097/j.jcrs.0000000000000764