
Performance of ChatGPT on the Taiwan Urology Board Examination: Insights into Current Strengths and Shortcomings

Chung-You Tsai1,2*, Pai-Yu Cheng1,3, Shang-Ju Hsieh1, Hung-Hsiang Huang1, Jiun-Jia Li1

1 Division of Urology, Department of Surgery, Far Eastern Memorial Hospital, Taiwan.
2 Department of Electrical Engineering, Yuan Ze University, Taiwan.
3 Department of Biomedical Engineering, College of Medicine and College of Engineering, National Taiwan University, Taipei, Taiwan.

 

Purpose: To compare the performance of ChatGPT-4 and ChatGPT-3.5 on the Taiwan Urology Board Examination (TUBE), focusing on answer accuracy, explanation consistency, and uncertainty-management tactics for minimizing score penalties from incorrect responses across twelve urology domains.

Methods: A total of 450 multiple-choice questions from the TUBE (2020-2022) were presented to both models. Three urologists evaluated the accuracy and consistency of the responses, supplemented by an experiment on penalty reduction using prompt variations.
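
The penalty-aware grading and the two prompt styles compared below can be made concrete with a short sketch. This is a minimal illustration, not the authors' protocol: the prompt wording, the ask_model() placeholder, and the 0.25 penalty rate are all assumptions, since the abstract specifies only that wrong answers incur penalties and that a simple prompt was compared against strategy-based prompts.

```python
# A minimal sketch (not the authors' published code) of how the 450 TUBE
# questions might be presented and scored under negative marking. The prompt
# wording, the ask_model() helper, and the 0.25 penalty rate are illustrative
# assumptions.

SIMPLE_PROMPT = (
    "Answer the following multiple-choice question with a single letter "
    "(A-D) and a brief explanation.\n\n{question}"
)

# Strategy-based variant: the model may abstain when uncertain, trading a
# possible penalty for a guaranteed zero on that item.
STRATEGY_PROMPT = (
    "Answer the following multiple-choice question with a single letter "
    "(A-D). If you are not confident, reply 'SKIP' instead of guessing, "
    "because incorrect answers are penalized.\n\n{question}"
)


def ask_model(prompt: str) -> str:
    """Placeholder for an actual chat-completion API call."""
    raise NotImplementedError


def grade(questions: list[dict], prompt_template: str, penalty: float = 0.25) -> float:
    """Score a question set: +1 per correct answer, -penalty per wrong one."""
    score = 0.0
    for q in questions:
        reply = ask_model(prompt_template.format(question=q["text"]))
        answer = reply.strip()[:1].upper()
        if answer == "S":  # the model chose to skip: no gain, no loss
            continue
        score += 1.0 if answer == q["key"] else -penalty
    return score
```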

Results: ChatGPT-4 showed strengths in urology, achieving an overall accuracy of 57.8%, with annual accuracies of 64.7% (2020), 58.0% (2021), and 50.7% (2022), significantly surpassing ChatGPT-3.5 (OR = 2.68, 95% CI [2.05-3.52], p < 0.001). It could have passed the TUBE written exams on accuracy alone but failed on the final score because of penalties for incorrect answers. Unlike ChatGPT-3.5, ChatGPT-4 displayed a declining accuracy trend over time (slope = -7 percentage points per year, p = 0.016). Accuracy varied across the 12 urological domains, with more frequently updated knowledge domains showing lower accuracy (53.2% vs. 62.2%, OR = 0.69, p = 0.05). A high consistency rate of 91.6% in explanations across all domains indicates reliable delivery of coherent and logical information. The simple prompt outperformed the strategy-based prompts in accuracy (60% vs. 40%, p = 0.016), highlighting ChatGPT's inability to accurately self-assess uncertainty and its tendency toward overconfidence, which may hinder medical decision-making.
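
Two of the figures above can be verified with a few lines of arithmetic. This is an illustrative check, not code from the paper: the 2020-2022 accuracies come from the Results, while ChatGPT-3.5's correct/incorrect counts are hypothetical values chosen to reproduce the reported odds ratio, as the abstract gives only the OR and CI.

```python
# A worked check of two statistics reported above. The annual accuracies are
# taken directly from the abstract; the ChatGPT-3.5 counts are placeholders
# chosen to reproduce the reported OR = 2.68.
import math

# 1) Least-squares slope of ChatGPT-4's annual accuracy (2020-2022).
years = [2020, 2021, 2022]
acc = [64.7, 58.0, 50.7]  # % correct per exam year, from the abstract
mx, my = sum(years) / 3, sum(acc) / 3
slope = (sum((x - mx) * (y - my) for x, y in zip(years, acc))
         / sum((x - mx) ** 2 for x in years))
print(f"slope = {slope:.1f} points/year")  # -7.0, matching the reported slope

# 2) Odds ratio with a Wald 95% CI from a 2x2 table of correct/incorrect
# counts: a, b for ChatGPT-4 (260/450 = 57.8%); c, d for ChatGPT-3.5
# (placeholder values).
a, b = 260, 190
c, d = 152, 298
odds_ratio = (a * d) / (b * c)
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo, hi = (math.exp(math.log(odds_ratio) + s * 1.96 * se) for s in (-1, 1))
print(f"OR = {odds_ratio:.2f}, 95% CI [{lo:.2f}-{hi:.2f}]")
# OR = 2.68, 95% CI [2.05-3.52], consistent with the reported comparison.
```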

Conclusion: ChatGPT-4's high accuracy and consistent explanations on the urology board examination demonstrate its potential in medical information processing. However, its limitations in self-assessment and its overconfidence necessitate caution in its application, especially for inexperienced users. These insights call for ongoing advancement and the development of urology-specific AI applications.
