分析四種人工智慧大語言模型在第四期前列腺癌風險評估與資訊檢索的表現
袁倫祥1,2、黃士維2,3、杜威4、蔡宗佑4,5*
1國立成功大學 生物醫學工程學系 2國立台灣大學醫學院附設醫院雲林分院 泌尿部 3國立台灣大學醫學院附設醫院 泌尿部 4亞東紀念醫院 外科部 泌尿外科 5元智大學電機工程學系
Analysis of the Performance of Four AI Large Language Models in Stage IV Prostate Cancer Risk Assessment and Information Retrieval
Lun-Hsiang Yuan1,2, Shi-Wei Huang2,3, Wei-Tu4, Chung-You Tsai4,5*
1Department of Biomedical Engineering, National Cheng-Kung University, Tainan, Taiwan
2Department of Urology, National Taiwan University Hospital, Yunlin Branch, Yunlin, Taiwan
3Department of Urology, National Taiwan University Hospital, Taipei, Taiwan
4Division of Urology, Department of Surgery, Far Eastern Memorial Hospital, New Taipei, Taiwan
5Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan
*Correspondence to: Chung-You Tsai, MD, PhD
Purpose:
This study aims to assess the performance of four general-purpose large language models (LLMs) in information retrieval (IR) and risk assessment (RA) tasks based on multi-modality imaging and pathology reports. Both tasks are essential for effective prostate cancer (PC) treatment.
Materials and Methods:
We used simulated text reports from computed tomography, magnetic resonance imaging, bone scans, and biopsy pathology for stage IV prostate cancer patients. Four LLMs (ChatGPT-4-turbo, Claude-3-opus, Gemini-pro-1.0, and ChatGPT-3.5-turbo) were evaluated on three RA tasks (LATITUDE, CHAARTED, TwNHI) and seven IR tasks. The IR tasks included TNM staging and the detection and quantification of bone and visceral metastases, testing the models' ability to process diverse clinical data. We queried the LLMs with the multi-modality reports via API using zero-shot chain-of-thought prompting. Performance was evaluated through repeated single-round queries and ensemble voting, with the consensus of three adjudicators serving as the gold standard. The models were assessed using six outcome metrics.
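The ensemble-voting step can be illustrated with a minimal sketch. It assumes simple majority voting over the answers from repeated single-round queries. The function name `ensemble_vote` and the tie-breaking rule (fall back to the earliest answer) are illustrative assumptions, not details taken from the study.

```python
from collections import Counter


def ensemble_vote(answers):
    """Return the most frequent answer among repeated single-round queries.

    Assumption: ties are broken in favor of the earliest answer in the list.
    The abstract does not state how ties were handled.
    """
    if not answers:
        raise ValueError("at least one answer is required")
    counts = Counter(answers)
    top = max(counts.values())
    # Scan in query order so that ties resolve to the earliest answer.
    for answer in answers:
        if counts[answer] == top:
            return answer


# Hypothetical outputs of three repeated queries for one patient.
votes = ["high-risk", "low-risk", "high-risk"]
print(ensemble_vote(votes))
```

With three queries and a binary RA label a tie cannot occur, so the tie-breaking rule only matters for multi-class tasks such as TNM staging.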
Results:
Among 350 stage IV PC patients with simulated reports, 115 (32.8%), 128 (36.5%), and 94 (27%) belonged to the high-risk groups of LATITUDE, CHAARTED, and TwNHI, respectively. Ensemble voting, based on three repeated single-round queries, consistently enhanced accuracy, with a higher likelihood of achieving non-inferior results than a single query. The four models showed minimal differences in IR tasks, with high accuracy (87.4%-94.2%) and consistency (ICC > 0.8) in TNM staging. However, RA performance differed significantly, with the following ranking: ChatGPT-4-turbo > Claude-3-opus > Gemini-pro-1.0 > ChatGPT-3.5-turbo. ChatGPT-4-turbo achieved the highest accuracy (90.1%, 90.7%, 91.6%) and consistency (ICC 0.86, 0.93, 0.76) across the three RA tasks. Moreover, its high sensitivity and negative predictive value (NPV) could facilitate ruling out high-risk patients.
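For readers unfamiliar with the metrics named above, the following sketch computes accuracy, sensitivity, and NPV for a binary RA task from gold-standard and predicted labels. The abstract does not list all six outcome metrics, so only these three are shown. The function name `binary_metrics` and the example labels are hypothetical.

```python
def binary_metrics(gold, pred):
    """Compute accuracy, sensitivity, and NPV for binary labels.

    Labels are 1 for high-risk and 0 for not high-risk.
    Returns NaN for a metric whose denominator is zero.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true high-risk cases detected
        "npv": ratio(tn, tn + fn),          # negative calls that are correct
    }


# Hypothetical labels for five patients (not study data).
gold = [1, 1, 0, 0, 0]
pred = [1, 0, 0, 0, 1]
print(binary_metrics(gold, pred))
```

A high NPV means that a patient labeled "not high-risk" by the model is rarely high-risk under the gold standard, which is why it supports ruling out.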
Conclusions:
Combined with ensemble voting, ChatGPT-4-turbo demonstrated high accuracy in RA and IR for stage IV PC, showing promising potential for clinical decision support. Further research is necessary to validate these findings across additional cancer types.
Keywords: Prostate cancer, large language model, risk assessment, information retrieval, clinical decision support, ChatGPT