A study of ChatGPT found the artificial intelligence tool correctly answered fewer than half of the questions from a study resource commonly used by physicians preparing for board certification in ophthalmology.
Using artificial intelligence to prepare for ophthalmic board certification through the Ophthalmic Knowledge Assessment Program (OKAP) and Written Qualifying Exam (WQE) examinations likely won’t make the process any easier, according to a new study.
Researchers found that ChatGPT correctly answered approximately half of the multiple-choice questions presented to it when prompted.
The study, published in JAMA Ophthalmology1 and led by St. Michael’s Hospital, a site of Unity Health Toronto, found ChatGPT correctly answered 46 percent of questions when the test was first conducted in January 2023. When researchers conducted the same test one month later, ChatGPT scored more than 10 percentage points higher.
In a news release, St. Michael’s Hospital noted the use of AI in medicine and exam preparation has received plenty of attention since ChatGPT became publicly available in November 2022. The technology has also raised concerns about the potential for incorrect information and cheating in academia. ChatGPT is free, available to anyone with an internet connection, and works in a conversational manner.
“ChatGPT may have an increasing role in medical education and clinical practice over time, however it is important to stress the responsible use of such AI systems,” Rajeev H. Muni, MD, MSc, FRCSC, principal investigator of the study and a researcher at the Li Ka Shing Knowledge Institute at St. Michael’s, said in the news release. “ChatGPT as used in this investigation did not answer sufficient multiple choice questions correctly for it to provide substantial assistance in preparing for board certification at this time.”
According to the news release, ChatGPT is an AI chatbot developed by OpenAI that can interact with users conversationally and act as an educational tool when used appropriately. The authors of the current study noted that responsible use of ChatGPT in medical education and clinical practice will be vital going forward.
Although a past study found that ChatGPT has knowledge equivalent to that of a third-year medical student when answering questions related to the United States Medical Licensing Examination, the performance of ChatGPT in other disciplines is unclear. The current study aimed to assess the knowledge of ChatGPT against practice questions used for board certification examinations for ophthalmology.
All questions were collected from the free trial of OphthoQuestions, which provides practice questions for the OKAP and WQE tests. Questions that required input of images or videos were excluded, while text-based questions were retained.
The researchers’ primary outcome was the performance of ChatGPT in answering the questions; secondary outcomes included whether ChatGPT provided explanations, the mean length of questions and responses, performance in answering questions without multiple-choice options, and changes in performance.
“ChatGPT is an artificial intelligence system that has tremendous promise in medical education. Though it provided incorrect answers to board certification questions in ophthalmology about half the time, we anticipate that ChatGPT’s body of knowledge will rapidly evolve,” said Marko Popovic, MD, a co-author of the study and a resident physician in the Department of Ophthalmology and Vision Sciences at the University of Toronto.
All conversations in ChatGPT were cleared before asking each question to avoid responses being influenced by past conversations. A new account was also used to avoid any past history influencing the answers. The primary analysis used the January 9 version of ChatGPT, whereas the secondary analysis used the February 13 version. All answers were manually reviewed by the authors.
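The study queried the public ChatGPT web interface directly, so no code was involved, but the fresh-context principle it describes is easy to illustrate. The following is a minimal sketch, assuming the OpenAI Python API and a placeholder model name rather than the authors’ actual workflow, in which each question is sent as its own stand-alone conversation so no earlier exchange can influence the answer:

```python
# Illustrative sketch only: the study used the public ChatGPT web interface,
# not the API, and the model name below is a placeholder assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_isolated(question: str) -> str:
    """Send a single question with no prior conversation history."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],  # one clean chat per question
    )
    return response.choices[0].message.content

questions = [
    "Which multiple-choice option is most consistent with the findings described? A) ... B) ... C) ... D) ...",
]
answers = [ask_isolated(q) for q in questions]
```

Because every call starts from an empty message list, the model sees no accumulated context, mirroring the researchers’ practice of clearing conversations and using a fresh account.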
ChatGPT answered questions from January 9 to 16, 2023, in the primary analysis and on February 17, 2023, in the secondary analysis. Of the 166 available questions, 125 text-based questions were presented to ChatGPT and analyzed. All included questions were designated high yield for board certification examinations by OphthoQuestions.
ChatGPT was experiencing high demand when responding to 44 questions (35%), and its mean (SD) response time was 17.8 (14.4) seconds. ChatGPT answered 58 of 125 questions (46.4%) correctly in January 2023. General medicine questions had the best results, with 11 of 14 (79%) answered correctly. Retina and vitreous questions had the worst results, with ChatGPT answering all of them incorrectly.
Additional insight or explanations were provided for 79 of 125 questions (63%); the proportion of questions given explanations or insights was similar between the questions answered incorrectly and correctly (difference, 5.8%; 95% CI, –11.0% to 22.0%). Length of questions was similar between questions that were answered correctly and incorrectly (difference, 21.4 characters; SE, 36.8; 95% CI, –51.5 to 94.3) and length of answers was also similar regardless of accuracy (difference, –80.0 characters; SE, 65.4; 95% CI, –209.5 to 49.5).
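These comparisons are reported as a difference between two proportions with a normal-approximation 95% confidence interval. As a rough illustration of how such an interval is computed, here is a minimal sketch with made-up counts; it assumes a simple Wald-type interval and is not the authors’ actual analysis code:

```python
from math import sqrt

def diff_in_proportions_ci(x1: int, n1: int, x2: int, n2: int, z: float = 1.96):
    """Wald-type 95% CI for the difference between two proportions (p1 - p2)."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, (diff - z * se, diff + z * se)

# Made-up counts for illustration only (not the study's data):
# 40 of 60 questions in one group vs 30 of 65 in the other received explanations.
diff, (low, high) = diff_in_proportions_ci(x1=40, n1=60, x2=30, n2=65)
print(f"difference = {diff:.1%}, 95% CI {low:.1%} to {high:.1%}")
```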
ChatGPT closely matched how trainees answer questions, and selected the same multiple-choice response as the most common answer provided by ophthalmology trainees 44 percent of the time. ChatGPT selected the multiple-choice response that was least popular among ophthalmology trainees 11 percent of the time, second least popular 18 percent of the time, and second most popular 22 percent of the time.
Andrew Mihalache, lead author of the study and an undergraduate student at Western University, noted that ChatGPT performed most accurately on general medicine questions, answering 79 percent of them correctly.
“On the other hand, its accuracy was considerably lower on questions for ophthalmology subspecialties. For instance, the chatbot answered 20 percent of questions correctly on oculoplastics and zero percent correctly from the subspecialty of retina,” he said. “The accuracy of ChatGPT will likely improve most in niche subspecialties in the future.”
ChatGPT improved in the February 2023 analysis, answering 73 of the 125 questions (58%) correctly. ChatGPT performed similarly on stand-alone questions without multiple-choice options, answering 42 of 78 (54%) correctly (difference, 4.6%; 95% CI, –9.2% to 18.3%).
Internet speed, online traffic, and delays in response time could have biased certain parameters, the authors wrote in discussing the study’s limitations. Because ChatGPT generates unique answers, a separate study could yield different results. Questions that were not text based were excluded from the study, and questions without multiple-choice options may have been answered more broadly, which could have led to an incorrect response.
The researchers concluded that ChatGPT was not able to “answer sufficient multiple-choice questions correctly for it to provide substantial assistance in preparing for board certification at this time.” However, they acknowledged that future studies should evaluate the progression of AI chatbots’ performance.
Reference
1. Mihalache A, Popovic MM, Muni RH. Performance of an artificial intelligence chatbot in ophthalmic knowledge assessment. JAMA Ophthalmol. Published online April 27, 2023. doi:10.1001/jamaophthalmol.2023.1144