Deep Learning for Natural Language Processing in Urology: State-of-the-Art Automated Extraction of Detailed Pathologic Prostate Cancer Data From Narratively Written Electronic Health Records

Sami-Ramzi Leyh-Bannurah; Zhe Tian; Pierre I Karakiewicz; Ulrich Wolffgang; Guido Sauter; Margit Fisch; Dirk Pehrke; Hartwig Huland; Markus Graefen; Lars Budäus

doi:10.1200/CCI.18.00080

Standard

Deep Learning for Natural Language Processing in Urology: State-of-the-Art Automated Extraction of Detailed Pathologic Prostate Cancer Data From Narratively Written Electronic Health Records. / Leyh-Bannurah, Sami-Ramzi; Tian, Zhe; Karakiewicz, Pierre I; Wolffgang, Ulrich; Sauter, Guido; Fisch, Margit; Pehrke, Dirk; Huland, Hartwig; Graefen, Markus; Budäus, Lars.

In: JCO CLIN CANCER INFO, Vol. 2, 12.2018, p. 1-9.

Research output: Contribution to journal › Journal article › Research › peer-review

Harvard

Leyh-Bannurah, S-R, Tian, Z, Karakiewicz, PI, Wolffgang, U, Sauter, G, Fisch, M, Pehrke, D, Huland, H, Graefen, M & Budäus, L 2018, 'Deep Learning for Natural Language Processing in Urology: State-of-the-Art Automated Extraction of Detailed Pathologic Prostate Cancer Data From Narratively Written Electronic Health Records', JCO CLIN CANCER INFO, vol. 2, pp. 1-9. https://doi.org/10.1200/CCI.18.00080

APA

Leyh-Bannurah, S-R., Tian, Z., Karakiewicz, P. I., Wolffgang, U., Sauter, G., Fisch, M., Pehrke, D., Huland, H., Graefen, M., & Budäus, L. (2018). Deep Learning for Natural Language Processing in Urology: State-of-the-Art Automated Extraction of Detailed Pathologic Prostate Cancer Data From Narratively Written Electronic Health Records. JCO CLIN CANCER INFO, 2, 1-9. https://doi.org/10.1200/CCI.18.00080

Vancouver

Leyh-Bannurah S-R, Tian Z, Karakiewicz PI, Wolffgang U, Sauter G, Fisch M et al. Deep Learning for Natural Language Processing in Urology: State-of-the-Art Automated Extraction of Detailed Pathologic Prostate Cancer Data From Narratively Written Electronic Health Records. JCO CLIN CANCER INFO. 2018 Dec;2:1-9. https://doi.org/10.1200/CCI.18.00080

Bibtex

@article{77f03db2be0d4e339d1f5ad0f4a2c103,

title = "Deep Learning for Natural Language Processing in Urology: State-of-the-Art Automated Extraction of Detailed Pathologic Prostate Cancer Data From Narratively Written Electronic Health Records",

abstract = "PURPOSE: Entering all information from narrative documentation for clinical research into databases is time consuming, costly, and nearly impossible. Even high-volume databases do not cover all patient characteristics and drawn results may be limited. A new viable automated solution is machine learning based on deep neural networks applied to natural language processing (NLP), extracting detailed information from narratively written (eg, pathologic radical prostatectomy [RP]) electronic health records (EHRs). METHODS: Within an RP pathologic database, 3,679 RP EHRs were randomly split into 70% training and 30% test data sets. Training EHRs were automatically annotated, providing a semiautomatically annotated corpus of narratively written pathologic reports with initially context-free gold standard encodings. Primary and secondary Gleason pattern, corresponding percentages, tumor stage, nodal stage, total volume, tumor volume and diameter, and surgical margin were variables of interest. Second, state-of-the-art NLP techniques were used to train an industry-standard language model for pathologic EHRs by transfer learning. Finally, accuracy of the named entity extractors was compared with the gold standard encodings. RESULTS: Agreement rates (95% confidence interval) for primary and secondary Gleason patterns each were 91.3% (89.4 to 93.0), corresponding to the following: Gleason percentages, 70.5% (67.6 to 73.3) and 80.9% (78.4 to 83.3); tumor stage, 99.3% (98.6 to 99.7); nodal stage, 98.7% (97.8 to 99.3); total volume, 98.3% (97.3 to 99.0); tumor volume, 93.3% (91.6 to 94.8); maximum diameter, 96.3% (94.9 to 97.3); and surgical margin, 98.7% (97.8 to 99.3). Cumulative agreement was 91.3%. CONCLUSION: Our proposed NLP pipeline offers new abilities for precise and efficient data management from narrative documentation for clinical research. The scalable approach potentially allows the NLP pipeline to be generalized to other genitourinary EHRs, tumor entities, and other medical disciplines.",

keywords = "Journal Article",

author = "Sami-Ramzi Leyh-Bannurah and Zhe Tian and Karakiewicz, {Pierre I} and Ulrich Wolffgang and Guido Sauter and Margit Fisch and Dirk Pehrke and Hartwig Huland and Markus Graefen and Lars Bud{\"a}us",

year = "2018",

month = dec,

doi = "10.1200/CCI.18.00080",

language = "English",

volume = "2",

pages = "1--9",

journal = "JCO CLIN CANCER INFO",

issn = "2473-4276",

publisher = "American Society of Clinical Oncology",

}

RIS

TY - JOUR

T1 - Deep Learning for Natural Language Processing in Urology: State-of-the-Art Automated Extraction of Detailed Pathologic Prostate Cancer Data From Narratively Written Electronic Health Records

AU - Leyh-Bannurah, Sami-Ramzi

AU - Tian, Zhe

AU - Karakiewicz, Pierre I

AU - Wolffgang, Ulrich

AU - Sauter, Guido

AU - Fisch, Margit

AU - Pehrke, Dirk

AU - Huland, Hartwig

AU - Graefen, Markus

AU - Budäus, Lars

PY - 2018/12

Y1 - 2018/12

N2 - PURPOSE: Entering all information from narrative documentation for clinical research into databases is time consuming, costly, and nearly impossible. Even high-volume databases do not cover all patient characteristics and drawn results may be limited. A new viable automated solution is machine learning based on deep neural networks applied to natural language processing (NLP), extracting detailed information from narratively written (eg, pathologic radical prostatectomy [RP]) electronic health records (EHRs). METHODS: Within an RP pathologic database, 3,679 RP EHRs were randomly split into 70% training and 30% test data sets. Training EHRs were automatically annotated, providing a semiautomatically annotated corpus of narratively written pathologic reports with initially context-free gold standard encodings. Primary and secondary Gleason pattern, corresponding percentages, tumor stage, nodal stage, total volume, tumor volume and diameter, and surgical margin were variables of interest. Second, state-of-the-art NLP techniques were used to train an industry-standard language model for pathologic EHRs by transfer learning. Finally, accuracy of the named entity extractors was compared with the gold standard encodings. RESULTS: Agreement rates (95% confidence interval) for primary and secondary Gleason patterns each were 91.3% (89.4 to 93.0), corresponding to the following: Gleason percentages, 70.5% (67.6 to 73.3) and 80.9% (78.4 to 83.3); tumor stage, 99.3% (98.6 to 99.7); nodal stage, 98.7% (97.8 to 99.3); total volume, 98.3% (97.3 to 99.0); tumor volume, 93.3% (91.6 to 94.8); maximum diameter, 96.3% (94.9 to 97.3); and surgical margin, 98.7% (97.8 to 99.3). Cumulative agreement was 91.3%. CONCLUSION: Our proposed NLP pipeline offers new abilities for precise and efficient data management from narrative documentation for clinical research. The scalable approach potentially allows the NLP pipeline to be generalized to other genitourinary EHRs, tumor entities, and other medical disciplines.

KW - Journal Article

U2 - 10.1200/CCI.18.00080

DO - 10.1200/CCI.18.00080

M3 - SCORING: Journal article

C2 - 30652616

VL - 2

SP - 1

EP - 9

JO - JCO CLIN CANCER INFO

JF - JCO CLIN CANCER INFO

SN - 2473-4276

ER -