Adequacy of prostate cancer prevention and screening recommendations provided by an artificial intelligence-powered large language model
Standard
Adequacy of prostate cancer prevention and screening recommendations provided by an artificial intelligence-powered large language model. / Chiarelli, Giuseppe; Stephens, Alex; Finati, Marco; Cirulli, Giuseppe Ottone; Beatrici, Edoardo; Filipas, Dejan K; Arora, Sohrab; Tinsley, Shane; Bhandari, Mahendra; Carrieri, Giuseppe; Trinh, Quoc-Dien; Briganti, Alberto; Montorsi, Francesco; Lughezzani, Giovanni; Buffi, Nicolò; Rogers, Craig; Abdollah, Firas.
In: INT UROL NEPHROL, Vol. 56, No. 8, 08.2024, p. 2589-2595.
Research output: Contribution to journal › Journal article › Research › peer-review
RIS
TY - JOUR
T1 - Adequacy of prostate cancer prevention and screening recommendations provided by an artificial intelligence-powered large language model
AU - Chiarelli, Giuseppe
AU - Stephens, Alex
AU - Finati, Marco
AU - Cirulli, Giuseppe Ottone
AU - Beatrici, Edoardo
AU - Filipas, Dejan K
AU - Arora, Sohrab
AU - Tinsley, Shane
AU - Bhandari, Mahendra
AU - Carrieri, Giuseppe
AU - Trinh, Quoc-Dien
AU - Briganti, Alberto
AU - Montorsi, Francesco
AU - Lughezzani, Giovanni
AU - Buffi, Nicolò
AU - Rogers, Craig
AU - Abdollah, Firas
N1 - © 2024. The Author(s), under exclusive licence to Springer Nature B.V.
PY - 2024/8
Y1 - 2024/8
N2 - PURPOSE: We aimed to assess the appropriateness of ChatGPT in providing answers related to prostate cancer (PCa) screening, comparing GPT-3.5 and GPT-4. METHODS: A committee of five reviewers designed 30 questions related to PCa screening, categorized into three difficulty levels. The questions were posed identically to both GPTs three times, varying the prompts. Each reviewer assigned a score for accuracy, clarity, and conciseness. Readability was assessed by the Flesch Kincaid Grade (FKG) and Flesch Reading Ease (FRE). The mean scores were extracted and compared using the Wilcoxon test. We compared readability across the three different prompts by ANOVA. RESULTS: In GPT-3.5, the mean scores (SD) for accuracy, clarity, and conciseness were 1.5 (0.59), 1.7 (0.45), and 1.7 (0.49), respectively, for easy questions; 1.3 (0.67), 1.6 (0.69), and 1.3 (0.65) for medium; and 1.3 (0.62), 1.6 (0.56), and 1.4 (0.56) for hard. In GPT-4, they were 2.0 (0), 2.0 (0), and 2.0 (0.14), respectively, for easy questions; 1.7 (0.66), 1.8 (0.61), and 1.7 (0.64) for medium; and 2.0 (0.24), 1.8 (0.37), and 1.9 (0.27) for hard. GPT-4 performed better than GPT-3.5 on all three qualities and at all difficulty levels. The mean FKG for GPT-3.5 and GPT-4 answers was 12.8 (1.75) and 10.8 (1.72), respectively; the mean FRE was 37.3 (9.65) and 47.6 (9.88), respectively. The second prompt achieved better results in terms of clarity (all p < 0.05). CONCLUSIONS: GPT-4 displayed superior accuracy, clarity, conciseness, and readability compared with GPT-3.5. Though prompts influenced response quality in both GPTs, their impact was significant only for clarity.
U2 - 10.1007/s11255-024-04009-5
DO - 10.1007/s11255-024-04009-5
M3 - Journal article
C2 - 38564079
VL - 56
SP - 2589
EP - 2595
JO - INT UROL NEPHROL
JF - INT UROL NEPHROL
SN - 0301-1623
IS - 8
ER -
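
Note on the readability metrics cited in the abstract: the Flesch Reading Ease (FRE) and Flesch Kincaid Grade (FKG) are fixed formulas over words-per-sentence and syllables-per-word counts. The Python sketch below is a minimal illustration of how such scores can be computed; the vowel-group syllable heuristic and the helper names (count_syllables, flesch_scores) are assumptions for illustration only and do not reflect the tooling the authors used, which the record does not specify.

import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels; real tools use dictionaries.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_scores(text: str) -> tuple[float, float]:
    # Return (Flesch Reading Ease, Flesch Kincaid Grade) for a plain-text passage.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / max(1, len(sentences))   # words per sentence
    spw = syllables / max(1, len(words))        # syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw    # Flesch Reading Ease formula
    fkg = 0.39 * wps + 11.8 * spw - 15.59       # Flesch Kincaid Grade formula
    return fre, fkg

if __name__ == "__main__":
    sample = "Prostate cancer screening usually starts with a PSA blood test."
    fre, fkg = flesch_scores(sample)
    print(f"FRE = {fre:.1f}, FKG = {fkg:.1f}")

Lower FKG and higher FRE values, as reported for GPT-4 in the abstract, indicate text that is easier to read.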