Adequacy of prostate cancer prevention and screening recommendations provided by an artificial intelligence-powered large language model
Standard
Adequacy of prostate cancer prevention and screening recommendations provided by an artificial intelligence-powered large language model. / Chiarelli, Giuseppe; Stephens, Alex; Finati, Marco; Cirulli, Giuseppe Ottone; Beatrici, Edoardo; Filipas, Dejan K; Arora, Sohrab; Tinsley, Shane; Bhandari, Mahendra; Carrieri, Giuseppe; Trinh, Quoc-Dien; Briganti, Alberto; Montorsi, Francesco; Lughezzani, Giovanni; Buffi, Nicolò; Rogers, Craig; Abdollah, Firas.
In: INT UROL NEPHROL, Vol. 56, No. 8, 08.2024, p. 2589-2595.
Research output: Contribution to journal › Journal article › Research › peer-review
RIS
TY - JOUR
T1 - Adequacy of prostate cancer prevention and screening recommendations provided by an artificial intelligence-powered large language model
AU - Chiarelli, Giuseppe
AU - Stephens, Alex
AU - Finati, Marco
AU - Cirulli, Giuseppe Ottone
AU - Beatrici, Edoardo
AU - Filipas, Dejan K
AU - Arora, Sohrab
AU - Tinsley, Shane
AU - Bhandari, Mahendra
AU - Carrieri, Giuseppe
AU - Trinh, Quoc-Dien
AU - Briganti, Alberto
AU - Montorsi, Francesco
AU - Lughezzani, Giovanni
AU - Buffi, Nicolò
AU - Rogers, Craig
AU - Abdollah, Firas
N1 - © 2024. The Author(s), under exclusive licence to Springer Nature B.V.
PY - 2024/8
Y1 - 2024/8
N2 - PURPOSE: We aimed to assess the appropriateness of ChatGPT in providing answers related to prostate cancer (PCa) screening, comparing GPT-3.5 and GPT-4. METHODS: A committee of five reviewers designed 30 questions related to PCa screening, categorized into three difficulty levels. The questions were posed identically to both GPTs three times, varying the prompts. Each reviewer assigned a score for accuracy, clarity, and conciseness. Readability was assessed by the Flesch Kincaid Grade (FKG) and Flesch Reading Ease (FRE). The mean scores were extracted and compared using the Wilcoxon test. We compared readability across the three different prompts by ANOVA. RESULTS: In GPT-3.5, the mean scores (SD) for accuracy, clarity, and conciseness were 1.5 (0.59), 1.7 (0.45), and 1.7 (0.49), respectively, for easy questions; 1.3 (0.67), 1.6 (0.69), and 1.3 (0.65) for medium; and 1.3 (0.62), 1.6 (0.56), and 1.4 (0.56) for hard. In GPT-4, they were 2.0 (0), 2.0 (0), and 2.0 (0.14), respectively, for easy questions; 1.7 (0.66), 1.8 (0.61), and 1.7 (0.64) for medium; and 2.0 (0.24), 1.8 (0.37), and 1.9 (0.27) for hard. GPT-4 performed better than GPT-3.5 on all three qualities and at all difficulty levels. The mean FKG for GPT-3.5 and GPT-4 answers was 12.8 (1.75) and 10.8 (1.72), respectively; the mean FRE was 37.3 (9.65) and 47.6 (9.88), respectively. The second prompt achieved better results in terms of clarity (all p < 0.05). CONCLUSIONS: GPT-4 displayed superior accuracy, clarity, conciseness, and readability compared with GPT-3.5. Though prompts influenced response quality in both GPTs, their impact was significant only for clarity.
U2 - 10.1007/s11255-024-04009-5
DO - 10.1007/s11255-024-04009-5
M3 - Journal article
C2 - 38564079
VL - 56
SP - 2589
EP - 2595
JO - INT UROL NEPHROL
JF - INT UROL NEPHROL
SN - 0301-1623
IS - 8
ER -
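
Note on the readability metrics cited in the abstract: the Flesch Reading Ease (FRE) and Flesch Kincaid Grade (FKG) are fixed formulas over words-per-sentence and syllables-per-word counts. The Python sketch below is a minimal illustration of how such scores can be computed; the vowel-group syllable heuristic and the helper names (count_syllables, flesch_scores) are assumptions for illustration only and do not reflect the tooling the authors used, which the record does not specify.

import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels; real tools use dictionaries.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_scores(text: str) -> tuple[float, float]:
    # Return (Flesch Reading Ease, Flesch Kincaid Grade) for a plain-text passage.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / max(1, len(sentences))   # words per sentence
    spw = syllables / max(1, len(words))        # syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw    # Flesch Reading Ease formula
    fkg = 0.39 * wps + 11.8 * spw - 15.59       # Flesch Kincaid Grade formula
    return fre, fkg

if __name__ == "__main__":
    sample = "Prostate cancer screening usually starts with a PSA blood test."
    fre, fkg = flesch_scores(sample)
    print(f"FRE = {fre:.1f}, FKG = {fkg:.1f}")

Lower FKG and higher FRE values, as reported for GPT-4 in the abstract, indicate text that is easier to read.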