Learning debiased graph representations from the OMOP common data model for synthetic data generation

Standard

Learning debiased graph representations from the OMOP common data model for synthetic data generation. / Schulz, Nicolas; Carus, Jasmin; Wiederhold, Alexander Johannes; Johanns, Ole; Peters, Frederik; Rath, Nathalie; Rausch, Katharina; Holleczek, Bernd; Katalinic, Alexander; Gundler, Christopher; AI-CARE Working Group.

In: BMC MED RES METHODOL, Vol. 24, No. 1, 136, 22.06.2024, p. 136.

Research output: Contribution to journal › Journal article › Research › peer-review

Harvard

Schulz, N, Carus, J, Wiederhold, AJ, Johanns, O, Peters, F, Rath, N, Rausch, K, Holleczek, B, Katalinic, A, Gundler, C & AI-CARE Working Group 2024, 'Learning debiased graph representations from the OMOP common data model for synthetic data generation', BMC MED RES METHODOL, vol. 24, no. 1, 136, pp. 136. https://doi.org/10.1186/s12874-024-02257-8

APA

Schulz, N., Carus, J., Wiederhold, A. J., Johanns, O., Peters, F., Rath, N., Rausch, K., Holleczek, B., Katalinic, A., Gundler, C., & AI-CARE Working Group (2024). Learning debiased graph representations from the OMOP common data model for synthetic data generation. BMC MED RES METHODOL, 24(1), 136. [136]. https://doi.org/10.1186/s12874-024-02257-8

Vancouver

Bibtex

@article{a401341bf6da4a1ba662ebe5a253356c,
title = "Learning debiased graph representations from the OMOP common data model for synthetic data generation",
abstract = "Background: Generating synthetic patient data is crucial for medical research, but common approaches build up on black-box models which do not allow for expert verification or intervention. We propose a highly available method which enables synthetic data generation from real patient records in a privacy preserving and compliant fashion, is interpretable and allows for expert intervention. Methods: Our approach ties together two established tools in medical informatics, namely OMOP as a data standard for electronic health records and Synthea as a data synthetization method. For this study, data pipelines were built which extract data from OMOP, convert them into time series format, learn temporal rules by 2 statistical algorithms (Markov chain, TARM) and 3 algorithms of causal discovery (DYNOTEARS, J-PCMCI+, LiNGAM) and map the outputs into Synthea graphs. The graphs are evaluated quantitatively by their individual and relative complexity and qualitatively by medical experts. Results: The algorithms were found to learn qualitatively and quantitatively different graph representations. Whereas the Markov chain results in extremely large graphs, TARM, DYNOTEARS, and J-PCMCI+ were found to reduce the data dimension during learning. The MultiGroupDirect LiNGAM algorithm was found to not be applicable to the problem statement at hand. Conclusion: Only TARM and DYNOTEARS are practical algorithms for real-world data in this use case. As causal discovery is a method to debias purely statistical relationships, the gradient-based causal discovery algorithm DYNOTEARS was found to be most suitable.",
author = "Nicolas Schulz and Jasmin Carus and Wiederhold, {Alexander Johannes} and Ole Johanns and Frederik Peters and Nathalie Rath and Katharina Rausch and Bernd Holleczek and Alexander Katalinic and Christopher Gundler and {AI-CARE Working Group}",
year = "2024",
month = jun,
day = "22",
doi = "10.1186/s12874-024-02257-8",
language = "English",
volume = "24",
pages = "136",
journal = "BMC MED RES METHODOL",
issn = "1471-2288",
publisher = "BioMed Central Ltd.",
number = "1",

}

RIS

TY - JOUR

T1 - Learning debiased graph representations from the OMOP common data model for synthetic data generation

AU - Schulz, Nicolas

AU - Carus, Jasmin

AU - Wiederhold, Alexander Johannes

AU - Johanns, Ole

AU - Peters, Frederik

AU - Rath, Nathalie

AU - Rausch, Katharina

AU - Holleczek, Bernd

AU - Katalinic, Alexander

AU - Gundler, Christopher

AU - AI-CARE Working Group

PY - 2024/6/22

Y1 - 2024/6/22

N2 - Background: Generating synthetic patient data is crucial for medical research, but common approaches build up on black-box models which do not allow for expert verification or intervention. We propose a highly available method which enables synthetic data generation from real patient records in a privacy preserving and compliant fashion, is interpretable and allows for expert intervention. Methods: Our approach ties together two established tools in medical informatics, namely OMOP as a data standard for electronic health records and Synthea as a data synthetization method. For this study, data pipelines were built which extract data from OMOP, convert them into time series format, learn temporal rules by 2 statistical algorithms (Markov chain, TARM) and 3 algorithms of causal discovery (DYNOTEARS, J-PCMCI+, LiNGAM) and map the outputs into Synthea graphs. The graphs are evaluated quantitatively by their individual and relative complexity and qualitatively by medical experts. Results: The algorithms were found to learn qualitatively and quantitatively different graph representations. Whereas the Markov chain results in extremely large graphs, TARM, DYNOTEARS, and J-PCMCI+ were found to reduce the data dimension during learning. The MultiGroupDirect LiNGAM algorithm was found to not be applicable to the problem statement at hand. Conclusion: Only TARM and DYNOTEARS are practical algorithms for real-world data in this use case. As causal discovery is a method to debias purely statistical relationships, the gradient-based causal discovery algorithm DYNOTEARS was found to be most suitable.

AB - Background: Generating synthetic patient data is crucial for medical research, but common approaches build up on black-box models which do not allow for expert verification or intervention. We propose a highly available method which enables synthetic data generation from real patient records in a privacy preserving and compliant fashion, is interpretable and allows for expert intervention. Methods: Our approach ties together two established tools in medical informatics, namely OMOP as a data standard for electronic health records and Synthea as a data synthetization method. For this study, data pipelines were built which extract data from OMOP, convert them into time series format, learn temporal rules by 2 statistical algorithms (Markov chain, TARM) and 3 algorithms of causal discovery (DYNOTEARS, J-PCMCI+, LiNGAM) and map the outputs into Synthea graphs. The graphs are evaluated quantitatively by their individual and relative complexity and qualitatively by medical experts. Results: The algorithms were found to learn qualitatively and quantitatively different graph representations. Whereas the Markov chain results in extremely large graphs, TARM, DYNOTEARS, and J-PCMCI+ were found to reduce the data dimension during learning. The MultiGroupDirect LiNGAM algorithm was found to not be applicable to the problem statement at hand. Conclusion: Only TARM and DYNOTEARS are practical algorithms for real-world data in this use case. As causal discovery is a method to debias purely statistical relationships, the gradient-based causal discovery algorithm DYNOTEARS was found to be most suitable.

U2 - 10.1186/s12874-024-02257-8

DO - 10.1186/s12874-024-02257-8

M3 - Journal article

C2 - 38909216

VL - 24

SP - 136

JO - BMC MED RES METHODOL

JF - BMC MED RES METHODOL

SN - 1471-2288

IS - 1

M1 - 136

ER -