#0307

Do AI Language Models Outperform Human Experts? Benchmarking Nine Models in Stage IV Prostate Cancer Risk Assessment and Feature Stratification

C. Tsai1,2, W. Tu1, S. Huang3,4

1Far Eastern Memorial Hospital, Division of Urology, Department of Surgery, New Taipei City, Taiwan
2Yuan Ze University, Department of Electrical Engineering, Taoyuan, Taiwan
3National Taiwan University Hospital, Yunlin Branch, Department of Urology, Yunlin, Taiwan
4National Taiwan University Hospital, Department of Urology, Taipei, Taiwan

Introduction:

Feature stratification (FS) and risk assessment (RA) based on multimodal imaging and pathology reports are essential in guiding treatment decisions for stage IV prostate cancer (PC). This study assesses the performance of nine large language models (LLMs) in FS and RA tasks compared to human experts.

Material and methods:

We obtained text-based clinical reports from computed tomography, magnetic resonance imaging, bone scans, and biopsy pathology for 314 patients with stage IV PC. The study assessed the performance of nine LLMs categorized by scale: large-scale (o1-preview, Claude-3.5-sonnet, ChatGPT-4o, ChatGPT-4-turbo, Gemini-1.5-pro, Meta-Llama-3.1-405B), medium-scale (Meta-Llama-3.1-70B), and small-scale (Meta-Llama-3.1-8B, Medllama3-v20). Each LLM was evaluated on three RA tasks (LATITUDE, CHAARTED, TwNHI) and seven FS tasks, including TNM staging, detection of bone and visceral metastases, and quantification of metastatic sites. The models were queried via an application programming interface (API) using zero-shot chain-of-thought prompting, and their outputs were assessed through repeated single-round queries and ensemble voting strategies. Performance was benchmarked against a gold-standard consensus from three human experts, with accuracy and consistency (measured using the intraclass correlation coefficient, ICC) as the primary evaluation metrics. Generalized estimating equations were used to compare model performance across repeated queries with that of human experts.
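The repeated-query ensemble and the ICC consistency metric described above can be sketched as follows. This is an illustrative stand-alone implementation, not the study's actual pipeline: the function names are hypothetical, and the ICC(2,1) two-way random-effects, absolute-agreement form is an assumption, since the abstract does not specify which ICC variant was used.

```python
from collections import Counter

def ensemble_vote(answers):
    """Majority vote over repeated single-round answers to one query.
    Ties are broken by first occurrence (Counter preserves insertion order)."""
    return Counter(answers).most_common(1)[0][0]

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `ratings` is a list of n subjects, each a list of k ratings
    (e.g. one binary classification per repeated query)."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between-subject
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between-query
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Five repeated queries for one patient: majority vote resolves the label.
print(ensemble_vote(["high-risk", "high-risk", "low-risk", "high-risk", "high-risk"]))  # high-risk

# Binary labels (1 = high-risk) from 3 repeated queries over 4 patients:
# perfect within-patient agreement yields ICC = 1.
print(round(icc_2_1([[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]), 3))  # 1.0
```

In practice a library implementation (e.g. `pingouin.intraclass_corr`) would also report confidence intervals, which matter when comparing a model's ICC against human experts'.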

Results:

Among the 314 patients, 115 (32.8%) were classified as LATITUDE high-risk, 128 (36.5%) as CHAARTED high-volume, and 94 (27%) as TwNHI high-risk. State-of-the-art (SOTA) LLMs achieved accuracy comparable to human experts in RA and FS tasks. Specifically, o1-preview (95.22%-96.50%) and Claude-3.5-sonnet (93.63%-96.50%) matched human expert accuracy (92.36%-96.73%) across three RA and five FS tasks, without significant differences in most comparisons. Notably, SOTA LLMs outperformed human experts in regional lymph node (N1) and distant metastasis (M1a) detection, showing higher accuracy and ICC. Closed-source LLMs consistently outperformed open-source models in both RA and FS tasks. In the LATITUDE RA task, o1-preview achieved 95.22% accuracy, while open-source models, including Meta-Llama-3.1-405B, 70B, and 8B, scored 88.54%, 72.61%, and 42.68%, respectively, emphasizing the advantage of larger proprietary models. LLM performance followed the scaling law, with large-scale models outperforming medium- and small-scale ones (Large>Medium>Small). Additionally, higher-accuracy models exhibited greater consistency across multiple queries, with SOTA LLMs surpassing human experts in ICC.



Folder: 摘要 (Abstracts)
Uploader: TUA線上教育_家琳
Organization: 台灣泌尿科醫學會 (Taiwan Urological Association)
Created: 2026-04-23 20:56:41
Last revised: 2026-04-23 20:56:52