Deep Learning for Natural Language Processing in Urology: State-of-the-Art Automated Extraction of Detailed Pathologic Prostate Cancer Data From Narratively Written Electronic Health Records
Standard
Deep Learning for Natural Language Processing in Urology: State-of-the-Art Automated Extraction of Detailed Pathologic Prostate Cancer Data From Narratively Written Electronic Health Records. / Leyh-Bannurah, Sami-Ramzi; Tian, Zhe; Karakiewicz, Pierre I; Wolffgang, Ulrich; Sauter, Guido; Fisch, Margit; Pehrke, Dirk; Huland, Hartwig; Graefen, Markus; Budäus, Lars.
In: JCO CLIN CANCER INFO, Vol. 2, 12.2018, p. 1-9.
Research output: SCORING: Contribution to journal › SCORING: Journal article › Research › peer-review
RIS
TY - JOUR
T1 - Deep Learning for Natural Language Processing in Urology: State-of-the-Art Automated Extraction of Detailed Pathologic Prostate Cancer Data From Narratively Written Electronic Health Records
AU - Leyh-Bannurah, Sami-Ramzi
AU - Tian, Zhe
AU - Karakiewicz, Pierre I
AU - Wolffgang, Ulrich
AU - Sauter, Guido
AU - Fisch, Margit
AU - Pehrke, Dirk
AU - Huland, Hartwig
AU - Graefen, Markus
AU - Budäus, Lars
PY - 2018/12
Y1 - 2018/12
N2 - PURPOSE: Entering all information from narrative documentation for clinical research into databases is time consuming, costly, and nearly impossible. Even high-volume databases do not cover all patient characteristics and drawn results may be limited. A new viable automated solution is machine learning based on deep neural networks applied to natural language processing (NLP), extracting detailed information from narratively written (eg, pathologic radical prostatectomy [RP]) electronic health records (EHRs). METHODS: Within an RP pathologic database, 3,679 RP EHRs were randomly split into 70% training and 30% test data sets. Training EHRs were automatically annotated, providing a semiautomatically annotated corpus of narratively written pathologic reports with initially context-free gold standard encodings. Primary and secondary Gleason pattern, corresponding percentages, tumor stage, nodal stage, total volume, tumor volume and diameter, and surgical margin were variables of interest. Second, state-of-the-art NLP techniques were used to train an industry-standard language model for pathologic EHRs by transfer learning. Finally, accuracy of the named entity extractors was compared with the gold standard encodings. RESULTS: Agreement rates (95% confidence interval) for primary and secondary Gleason patterns each were 91.3% (89.4 to 93.0), corresponding to the following: Gleason percentages, 70.5% (67.6 to 73.3) and 80.9% (78.4 to 83.3); tumor stage, 99.3% (98.6 to 99.7); nodal stage, 98.7% (97.8 to 99.3); total volume, 98.3% (97.3 to 99.0); tumor volume, 93.3% (91.6 to 94.8); maximum diameter, 96.3% (94.9 to 97.3); and surgical margin, 98.7% (97.8 to 99.3). Cumulative agreement was 91.3%. CONCLUSION: Our proposed NLP pipeline offers new abilities for precise and efficient data management from narrative documentation for clinical research. The scalable approach potentially allows the NLP pipeline to be generalized to other genitourinary EHRs, tumor entities, and other medical disciplines.
KW - Journal Article
U2 - 10.1200/CCI.18.00080
DO - 10.1200/CCI.18.00080
M3 - SCORING: Journal article
C2 - 30652616
VL - 2
SP - 1
EP - 9
JO - JCO CLIN CANCER INFO
JF - JCO CLIN CANCER INFO
SN - 2473-4276
ER -
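
The abstract reports a random 70%/30% split of 3,679 reports and per-variable agreement rates with 95% confidence intervals. The following is a minimal Python sketch of that kind of evaluation, not the authors' code. The function names, the fixed seed, and the choice of a Wilson score interval are assumptions made for illustration; the abstract does not say which interval method was used.

```python
import math
import random


def split_records(record_ids, train_fraction=0.7, seed=0):
    """Randomly split record identifiers into training and test lists.

    The seed is a hypothetical choice so that the split is reproducible.
    """
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def agreement_rate(predicted, gold):
    """Fraction of records where the extracted value equals the gold encoding."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and equal in length")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


def wilson_interval(matches, total, z=1.96):
    """Wilson score interval for a proportion (assumed method, z=1.96 for 95%)."""
    p = matches / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # 3,679 is the report count given in the abstract; the IDs are placeholders.
    train, test = split_records(range(3679))
    print(len(train), len(test))
```

With these assumptions, a 70% cut of 3,679 records leaves 2,575 for training and 1,104 for testing. An agreement rate for one variable (for example tumor stage) would then be computed on the test records only and passed, as a match count, to `wilson_interval`.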