Performance of ChatGPT on the Taiwan Urology Board Examination: Insights into Current Strengths and Shortcomings

Chung-You Tsai1,2*, Pai-Yu Cheng1,3, Shang-Ju Hsieh1, Hung-Hsiang Huang1, Jiun-Jia Li1

1 Division of Urology, Department of Surgery, Far Eastern Memorial Hospital, Taiwan.
2 Department of Electrical Engineering, Yuan Ze University, Taiwan.

3 Department of Biomedical Engineering, College of Medicine and College of Engineering, National Taiwan University, Taipei, Taiwan.

 

Purpose: To compare the performance of ChatGPT-4 and ChatGPT-3.5 on the Taiwan Urology Board Examination (TUBE), focusing on answer accuracy, explanation consistency, and uncertainty-management tactics that minimize score penalties from incorrect responses across twelve urology domains.

Methods: A total of 450 multiple-choice questions from the TUBE (2020-2022) were presented to both models. Three urologists evaluated the accuracy and consistency of the responses, supplemented by an experiment on penalty reduction using prompt variations.

Results: ChatGPT-4 achieved an overall accuracy of 57.8%, with annual accuracies of 64.7% (2020), 58.0% (2021), and 50.7% (2022), significantly surpassing ChatGPT-3.5 (OR = 2.68, 95% CI [2.05-3.52], p < 0.001). On accuracy alone it would have passed the TUBE written exams, but it failed on the final score once penalties were applied. Unlike ChatGPT-3.5, ChatGPT-4 displayed a declining accuracy trend over time (slope = -7, p = 0.016). Accuracy varied across the 12 urological domains, with more frequently updated knowledge domains showing lower accuracy (53.2% vs. 62.2%, OR = 0.69, p = 0.05). A high consistency rate of 91.6% in explanations across all domains indicates reliable delivery of coherent and logical information. A simple prompt outperformed strategy-based prompts in accuracy (60% vs. 40%, p = 0.016), highlighting ChatGPT's inability to accurately self-assess uncertainty and its tendency toward overconfidence, which may hinder medical decision-making.
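For readers unfamiliar with the statistic, the odds ratio and 95% confidence interval reported above can be derived from a 2x2 table of correct/incorrect counts for the two models. The following is a minimal Python sketch of the standard log (Woolf) method, using hypothetical counts for illustration only, not the study's raw data:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and 95% CI (Woolf's log method) for a 2x2 table:
    a = model A correct, b = model A incorrect,
    c = model B correct, d = model B incorrect."""
    or_ = (a * d) / (b * c)
    # Standard error of log(OR) is sqrt of summed reciprocal cell counts.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical example: model A 260/450 correct, model B 150/450 correct.
or_, lo, hi = odds_ratio_ci(260, 190, 150, 300)
```

The resulting interval excludes 1, which is what a p-value below 0.05 for the odds ratio corresponds to.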

Conclusion: ChatGPT-4's high accuracy and consistent explanations on the urology board examination demonstrate its potential in medical information processing. However, its limitations in self-assessment and its overconfidence necessitate caution in its application, especially for inexperienced users. These insights call for ongoing advancements and the development of urology-specific AI applications.
