Optimizing Large Language Models for Enhanced Benign Prostatic Hyperplasia Patient Education in Taiwan: A Cross-Model Comparison
Pei-Jhang Chiang, Chin-Li Chen, Chien-Chang Kao, Ming-Hsin Yang, Chih-Wei Tsao, En Meng
Division of Urology, Department of Surgery, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
Purpose: With the rise of large language models (LLMs), patients have begun using them as a new source of medical information. However, because LLMs are trained primarily on English or Simplified Chinese data, Traditional Chinese (TC) users may face language and cultural barriers when using these models. This study aimed to evaluate the quality of LLM responses generated with customized prompts as patient education materials on benign prostatic hyperplasia (BPH) in Taiwan and other regions where TC is the primary written language.
Materials and Methods: We used two LLMs, ChatGPT-4 and Claude 3 Opus, with customized prompts designed in TC for elderly male patients in Taiwan. Five questions were developed covering the etiology, symptoms, diagnosis, treatment, prevention, and self-care aspects of BPH. Three senior urologists independently evaluated the responses generated by each model under three prompt settings (default, custom, and context-aware custom prompts) using the DISCERN instrument for quality, the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P) for understandability and actionability, and the Chinese Readability Index Explorer (CRIE) for readability.
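The three prompt settings compared in this study (default, custom, and context-aware custom) can be illustrated with a minimal sketch. The prompt wording below is entirely hypothetical, written only to show how the settings differ in structure; the study's actual prompts were not published in this abstract.

```python
# Illustrative sketch of the three prompt settings. All system-prompt wording
# here is a hypothetical example, not the prompts used in the study.

def build_messages(setting: str, question: str) -> list:
    """Return a chat-style message list for one of the three prompt settings."""
    if setting == "default":
        # Default: the question is sent with no additional instructions.
        return [{"role": "user", "content": question}]
    if setting == "custom":
        # Custom prompt: instructs the model to answer in Traditional Chinese
        # at a patient-education reading level (hypothetical wording).
        system = ("請以繁體中文回答,使用適合一般民眾的衛教語言,"
                  "避免艱深的醫學術語。")
        return [{"role": "system", "content": system},
                {"role": "user", "content": question}]
    if setting == "context-aware":
        # Context-aware custom prompt: additionally supplies the target
        # audience and locale as context (hypothetical wording).
        system = ("你是一位台灣的泌尿科衛教人員,讀者是台灣的年長男性"
                  "攝護腺肥大病患。請以繁體中文、淺白易懂的方式回答,"
                  "並提供可實際執行的自我照護建議。")
        return [{"role": "system", "content": system},
                {"role": "user", "content": question}]
    raise ValueError(f"unknown setting: {setting}")

# The six model settings in the study = 2 models x 3 prompt settings.
for s in ("default", "custom", "context-aware"):
    print(s, len(build_messages(s, "攝護腺肥大有哪些常見症狀?")))
```

Each message list would then be sent to the respective model (ChatGPT-4 or Claude 3 Opus) through its chat interface or API.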
Results: Both ChatGPT-4 and Claude 3 Opus provided moderate- to high-quality information about BPH across the six model settings (median DISCERN score: 4 out of 5, range: 3–5), with Claude 3 Opus under context-aware custom prompts achieving the highest overall DISCERN score. All six settings showed high understandability (median PEMAT-P understandability: 84.6%, range: 83.3–84.6%) but only moderate to poor actionability (PEMAT-P actionability: 40.0% in all settings). Although the accuracy of cited sources was not evaluated, no misinformation was identified in any model response. Readability was identical across settings (CRIE score: 7 in all settings), corresponding to a seventh-grade reading level in Taiwan.
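The PEMAT-P percentages above follow the instrument's standard scoring rule: points received divided by points possible, times 100, with "not applicable" items excluded from the denominator. A minimal sketch (the item counts in the example are illustrative, not the study's actual ratings):

```python
def pemat_score(ratings: list) -> float:
    """PEMAT score = points received / applicable items * 100.
    Each item is rated 1 (agree), 0 (disagree), or None (not applicable);
    N/A items are excluded from the denominator."""
    applicable = [r for r in ratings if r is not None]
    if not applicable:
        raise ValueError("no applicable items")
    return 100.0 * sum(applicable) / len(applicable)

# Hypothetical example: 11 of 13 applicable understandability items agreed.
print(round(pemat_score([1] * 11 + [0] * 2 + [None] * 4), 1))  # → 84.6

# Hypothetical example: 2 of 5 applicable actionability items agreed.
print(pemat_score([1, 1, 0, 0, 0]))  # → 40.0
```

Under this rule, a constant actionability score of 40.0% across settings means every response satisfied the same proportion of applicable actionability items.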
Conclusions: This pioneering study evaluates the potential of LLMs to generate patient education materials on BPH in a TC environment, specifically in Taiwan. Our findings indicate that LLMs adapted with domain-relevant prompts and contextual cues can effectively produce informative content in TC. Further refinement of prompt engineering may be needed to improve the practicality and customization of the generated content.