
Performance of ChatGPT on the Taiwan Urology Board Examination: Insights into Current Strengths and Shortcomings

Chung-You Tsai1,2*, Pai-Yu Cheng1,3, Shang-Ju Hsieh1, Hung-Hsiang Huang1, Jiun-Jia Li1

1 Division of Urology, Department of Surgery, Far Eastern Memorial Hospital, Taiwan.
2 Department of Electrical Engineering, Yuan Ze University, Taiwan.
3 Department of Biomedical Engineering, College of Medicine and College of Engineering, National Taiwan University, Taipei, Taiwan.

 

Purpose: To compare the performance of ChatGPT-4 and ChatGPT-3.5 on the Taiwan Urology Board Examination (TUBE), focusing on answer accuracy, explanation consistency, and uncertainty-management tactics for minimizing score penalties from incorrect responses across twelve urology domains.

Methods: A total of 450 multiple-choice questions from the TUBE (2020-2022) were presented to both models. Three urologists evaluated the accuracy and consistency of the responses, supplemented by an experiment on penalty reduction using prompt variations.
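
The penalty-aware grading and the two prompt styles compared below can be made concrete with a short sketch. This is a minimal illustration, not the authors' protocol: the prompt wording, the ask_model() placeholder, and the 0.25 penalty rate are all assumptions, since the abstract specifies only that wrong answers incur penalties and that a simple prompt was compared against strategy-based prompts.

```python
# A minimal sketch (not the authors' published code) of how the 450 TUBE
# questions might be presented and scored under negative marking. The prompt
# wording, the ask_model() helper, and the 0.25 penalty rate are illustrative
# assumptions.

SIMPLE_PROMPT = (
    "Answer the following multiple-choice question with a single letter "
    "(A-D) and a brief explanation.\n\n{question}"
)

# Strategy-based variant: the model may abstain when uncertain, trading a
# possible penalty for a guaranteed zero on that item.
STRATEGY_PROMPT = (
    "Answer the following multiple-choice question with a single letter "
    "(A-D). If you are not confident, reply 'SKIP' instead of guessing, "
    "because incorrect answers are penalized.\n\n{question}"
)


def ask_model(prompt: str) -> str:
    """Placeholder for an actual chat-completion API call."""
    raise NotImplementedError


def grade(questions: list[dict], prompt_template: str, penalty: float = 0.25) -> float:
    """Score a question set: +1 per correct answer, -penalty per wrong one."""
    score = 0.0
    for q in questions:
        reply = ask_model(prompt_template.format(question=q["text"]))
        answer = reply.strip()[:1].upper()
        if answer == "S":  # the model chose to skip: no gain, no loss
            continue
        score += 1.0 if answer == q["key"] else -penalty
    return score
```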

Results: ChatGPT-4 showed strengths in urology, achieving an overall accuracy of 57.8%, with annual accuracies of 64.7% (2020), 58.0% (2021), and 50.7% (2022), significantly surpassing ChatGPT-3.5 (OR = 2.68, 95% CI [2.05-3.52], p < 0.001). It could have passed the TUBE written exams on accuracy alone but failed on the final score because of penalties for incorrect answers. Unlike ChatGPT-3.5, ChatGPT-4 displayed a declining accuracy trend over time (slope = -7 percentage points per year, p = 0.016). Accuracy varied across the 12 urological domains, with more frequently updated knowledge domains showing lower accuracy (53.2% vs. 62.2%, OR = 0.69, p = 0.05). A high consistency rate of 91.6% in explanations across all domains indicates reliable delivery of coherent and logical information. The simple prompt outperformed the strategy-based prompts in accuracy (60% vs. 40%, p = 0.016), highlighting ChatGPT's inability to accurately self-assess uncertainty and its tendency toward overconfidence, which may hinder medical decision-making.
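
Two of the figures above can be verified with a few lines of arithmetic. This is an illustrative check, not code from the paper: the 2020-2022 accuracies come from the Results, while ChatGPT-3.5's correct/incorrect counts are hypothetical values chosen to reproduce the reported odds ratio, as the abstract gives only the OR and CI.

```python
# A worked check of two statistics reported above. The annual accuracies are
# taken directly from the abstract; the ChatGPT-3.5 counts are placeholders
# chosen to reproduce the reported OR = 2.68.
import math

# 1) Least-squares slope of ChatGPT-4's annual accuracy (2020-2022).
years = [2020, 2021, 2022]
acc = [64.7, 58.0, 50.7]  # % correct per exam year, from the abstract
mx, my = sum(years) / 3, sum(acc) / 3
slope = (sum((x - mx) * (y - my) for x, y in zip(years, acc))
         / sum((x - mx) ** 2 for x in years))
print(f"slope = {slope:.1f} points/year")  # -7.0, matching the reported slope

# 2) Odds ratio with a Wald 95% CI from a 2x2 table of correct/incorrect
# counts: a, b for ChatGPT-4 (260/450 = 57.8%); c, d for ChatGPT-3.5
# (placeholder values).
a, b = 260, 190
c, d = 152, 298
odds_ratio = (a * d) / (b * c)
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lo, hi = (math.exp(math.log(odds_ratio) + s * 1.96 * se) for s in (-1, 1))
print(f"OR = {odds_ratio:.2f}, 95% CI [{lo:.2f}-{hi:.2f}]")
# OR = 2.68, 95% CI [2.05-3.52], consistent with the reported comparison.
```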

Conclusion: ChatGPT-4's high accuracy and consistent explanations on the urology board examination demonstrate its potential in medical information processing. However, its limitations in self-assessment and its overconfidence necessitate caution in its application, especially for inexperienced users. These insights call for ongoing advancement and the development of urology-specific AI applications.
