Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study

Annika Meyer; Janik Riese; Thomas Streichert

doi:10.2196/50965

Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study

Standard

Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study. / Meyer, Annika; Riese, Janik; Streichert, Thomas.

In: JMIR MED EDUC, Vol. 10, 08.02.2024, p. e50965.

Research output: SCORING: Contribution to journal › SCORING: Journal article › Research › peer-review

Harvard

Meyer, A, Riese, J & Streichert, T 2024, 'Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study', JMIR MED EDUC, vol. 10, pp. e50965. https://doi.org/10.2196/50965

APA

Meyer, A., Riese, J., & Streichert, T. (2024). Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study. JMIR MED EDUC, 10, e50965. https://doi.org/10.2196/50965

Vancouver

Meyer A, Riese J, Streichert T. Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study. JMIR MED EDUC. 2024 Feb 8;10:e50965. https://doi.org/10.2196/50965

Bibtex

@article{1020631be46940cda47c57eb50543e92,

title = "Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study",

abstract = "BACKGROUND: The potential of artificial intelligence (AI)-based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance.OBJECTIVE: This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination.METHODS: To assess GPT-3.5's and GPT-4's medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022.RESULTS: GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research.CONCLUSIONS: The study results highlight ChatGPT's remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While GPT-4's predecessor (GPT-3.5) was imprecise and inconsistent, it demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population.",

keywords = "Humans, Artificial Intelligence, Education, Medical, Language, Students, Medical, Educational Measurement",

author = "Annika Meyer and Janik Riese and Thomas Streichert",

note = "{\textcopyright}Annika Meyer, Janik Riese, Thomas Streichert. Originally published in JMIR Medical Education (https://mededu.jmir.org), 08.02.2024.",

year = "2024",

month = feb,

day = "8",

doi = "10.2196/50965",

language = "English",

volume = "10",

pages = "e50965",

journal = "JMIR MED EDUC",

issn = "2369-3762",

publisher = "JMIR Publications Inc.",

}

RIS

TY - JOUR

T1 - Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study

AU - Meyer, Annika

AU - Riese, Janik

AU - Streichert, Thomas

N1 - ©Annika Meyer, Janik Riese, Thomas Streichert. Originally published in JMIR Medical Education (https://mededu.jmir.org), 08.02.2024.

PY - 2024/2/8

Y1 - 2024/2/8

N2 - BACKGROUND: The potential of artificial intelligence (AI)-based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance.OBJECTIVE: This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination.METHODS: To assess GPT-3.5's and GPT-4's medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022.RESULTS: GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research.CONCLUSIONS: The study results highlight ChatGPT's remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While GPT-4's predecessor (GPT-3.5) was imprecise and inconsistent, it demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population.

AB - BACKGROUND: The potential of artificial intelligence (AI)-based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance.OBJECTIVE: This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination.METHODS: To assess GPT-3.5's and GPT-4's medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022.RESULTS: GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research.CONCLUSIONS: The study results highlight ChatGPT's remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While GPT-4's predecessor (GPT-3.5) was imprecise and inconsistent, it demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population.

KW - Humans

KW - Artificial Intelligence

KW - Education, Medical

KW - Language

KW - Students, Medical

KW - Educational Measurement

U2 - 10.2196/50965

DO - 10.2196/50965

M3 - SCORING: Journal article

C2 - 38329802

VL - 10

SP - e50965

JO - JMIR MED EDUC

JF - JMIR MED EDUC

SN - 2369-3762

ER -