@Article{info:doi/10.2196/52784,
  author="Ming, Shuai and Guo, Qingge and Cheng, Wenjun and Lei, Bo",
  title="Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study",
  journal="JMIR Med Educ",
  year="2024",
  month="Aug",
  day="13",
  volume="10",
  pages="e52784",
  keywords="ChatGPT; Chinese National Medical Licensing Examination; large language models; medical education; system role; LLM; LLMs; language model; language models; artificial intelligence; chatbot; chatbots; conversational agent; conversational agents; exam; exams; examination; examinations; OpenAI; answer; answers; response; responses; accuracy; performance; China; Chinese",
  abstract="Background: With the increasing application of large language models such as ChatGPT across industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research. Objective: The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE). Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the GPT version (3.5 or 4.0), the prompt's designation of system roles tailored to medical subspecialties, and repeated testing for coherence. A passing accuracy threshold was established as 60{\%}. $\chi^2$ tests and $\kappa$ values were employed to evaluate the model's accuracy and consistency. Results: GPT-4.0 achieved a passing accuracy of 72.7{\%}, which was significantly higher than that of GPT-3.5 (54{\%}; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9{\%} vs 19.5{\%}; P<.001). However, both models showed relatively good response coherence, with $\kappa$ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3{\%}-3.7{\%}) and GPT-3.5 (1.3{\%}-4.5{\%}) and reduced variability by 1.7{\%} and 1.8{\%}, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy across question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, whereas GPT-3.5 did so in 7 of 15 on the first response. Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role did not significantly enhance the model's reliability or answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.",
  issn="2369-3762",
  doi="10.2196/52784",
  url="https://mededu.jmir.org/2024/1/e52784"
}