Assessing the Performance of Multiple Versions of ChatGPT on the Taiwan Urology Board Exam: Accuracy, Speed, and Domain-Specific Challenges
Pei-Jhang Chiang1,3, Chin-Li Chen1, Chien-Chang Kao1,2, Ming-Hsin Yang1,2, Chih-Wei Tsao1, En Meng1
1Division of Urology, Department of Surgery, Tri-Service General Hospital; 2Graduate School of Medical Sciences, National Defense Medical Center; 3In-Service Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University
Assessing Multiple Versions of ChatGPT on the Taiwan Urology Board Exam: Accuracy, Speed, and Domain-Specific Challenges
Pei-Jhang Chiang1,3, Chin-Li Chen1, Chien-Chang Kao1,2, Ming-Hsin Yang1,2, Chih-Wei Tsao1, En Meng1
1Division of Urology, Department of Surgery, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan; 2Graduate School of Medical Sciences, National Defense Medical Center, Taipei, Taiwan; 3In-Service Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan
Purpose:
Large language models (LLMs) such as ChatGPT have demonstrated the ability to generate human-like text, but they struggle with logical reasoning, particularly in specialized domains. To address these limitations, newer iterations such as ChatGPT o1 have been developed with improved reasoning capabilities. This study evaluates the accuracy and efficiency of four ChatGPT models in answering urology board exam questions, with attention to their potential roles in medical education and clinical decision-making. By identifying the strengths and weaknesses of these models, we aim to inform the appropriate use of AI-driven tools in medical education and decision support.
Materials and Methods:
This study evaluated four versions of ChatGPT: ChatGPT 4o (training data current to October 2023), ChatGPT 4o-mini, ChatGPT o1, and ChatGPT o1-mini. One hundred twenty questions from the 2024 Taiwan Urology Board Exam, released in September 2024, were used as the test material. Each model was assessed independently, with a consistent question format to ensure standardization. The models were prompted in a zero-shot setting and answered without any prior training or adaptation specific to this exam. Accuracy and response time were recorded for each answer, and all tests were conducted under controlled conditions to minimize variability. The Friedman test was used to assess the significance of differences in performance across models.
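A minimal sketch of this zero-shot evaluation protocol is shown below. It assumes the OpenAI Python client, the public API model identifiers gpt-4o, gpt-4o-mini, o1, and o1-mini, and a simplified letter-matching scorer; the exact prompts and scoring used in the study may differ. The scipy friedmanchisquare call illustrates how the Friedman test compares the paired per-question scores of the four models.

```python
"""Illustrative sketch only: model names, prompt wording, and the scoring
helper are assumptions, not the study's exact setup."""
import time
from openai import OpenAI                    # pip install openai
from scipy.stats import friedmanchisquare    # pip install scipy

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODELS = ["gpt-4o", "gpt-4o-mini", "o1", "o1-mini"]  # assumed API identifiers


def ask(model: str, question: str) -> tuple[str, float]:
    """Send one exam question as a zero-shot prompt; return (answer text, seconds)."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Answer with the single best option (A-D).\n\n{question}",
        }],
    )
    elapsed = time.perf_counter() - start
    return resp.choices[0].message.content.strip(), elapsed


def evaluate(questions: list[dict]) -> tuple[dict, dict]:
    """Score every model on every question (1 = correct, 0 = incorrect) and log latency."""
    scores = {m: [] for m in MODELS}
    times = {m: [] for m in MODELS}
    for q in questions:                       # q = {"text": "...", "answer": "A"}
        for m in MODELS:
            answer, seconds = ask(m, q["text"])
            scores[m].append(int(answer.upper().startswith(q["answer"])))
            times[m].append(seconds)
    return scores, times


# Friedman test on the paired per-question scores of the four models:
# scores, times = evaluate(exam_questions)
# stat, p = friedmanchisquare(*(scores[m] for m in MODELS))
```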
Results:
ChatGPT o1 achieved the highest accuracy at 66.7%, followed by o1-mini at 56.7%; ChatGPT 4o and 4o-mini were lower at 55.8% and 43.3%, respectively. For response time, 4o-mini was the fastest (7.94 seconds) and o1 the slowest (19.20 seconds). In question domains such as Surgical Anatomy and Complication, o1 significantly outperformed the other models (p < 0.05). When questions were grouped by character count, performance was better on shorter questions, and differences between models became more pronounced as question length increased.
Conclusion:
ChatGPT o1 exhibited the best balance of accuracy and logical reasoning, achieving the highest overall accuracy. ChatGPT 4o also performed well, though slightly less accurately. Smaller models such as 4o-mini were faster but markedly less accurate, demonstrating a clear trade-off between speed and reliability. Understanding these trade-offs is essential for making well-informed decisions about deploying such models in specialized medical applications.