Bias-invariant RNA-sequencing metadata annotation
Standard
Bias-invariant RNA-sequencing metadata annotation. / Wartmann, Hannes; Heins, Sven; Kloiber, Karin; Bonn, Stefan.
In: GIGASCIENCE, Vol. 10, No. 9, 22.09.2021, p. giab064.
Research output: SCORING: Contribution to journal › SCORING: Journal article › Research › peer-review
RIS
TY - JOUR
T1 - Bias-invariant RNA-sequencing metadata annotation
AU - Wartmann, Hannes
AU - Heins, Sven
AU - Kloiber, Karin
AU - Bonn, Stefan
N1 - © The Author(s) 2021. Published by Oxford University Press GigaScience.
PY - 2021/9/22
Y1 - 2021/9/22
AB - BACKGROUND: Recent technological advances have resulted in an unprecedented increase in publicly available biomedical data, yet the reuse of the data is often precluded by experimental bias and a lack of annotation depth and consistency. Missing annotations make it impossible for researchers to find datasets specific to their needs. FINDINGS: Here, we investigate RNA-sequencing metadata prediction based on gene expression values. We present a deep-learning-based domain adaptation algorithm for the automatic annotation of RNA-sequencing metadata. We show, in multiple experiments, that our model is better at integrating heterogeneous training data compared with existing linear regression-based approaches, resulting in improved tissue type classification. By using a model architecture similar to Siamese networks, the algorithm can learn biases from datasets with few samples. CONCLUSION: Using our novel domain adaptation approach, we achieved metadata annotation accuracies up to 15.7% better than a previously published method. Using the best model, we provide a list of >10,000 novel tissue and sex label annotations for 8,495 unique SRA samples. Our approach has the potential to revive idle datasets by automated annotation, making them more searchable.
U2 - 10.1093/gigascience/giab064
DO - 10.1093/gigascience/giab064
M3 - SCORING: Journal article
C2 - 34553213
VL - 10
SP - giab064
JO - GIGASCIENCE
JF - GIGASCIENCE
SN - 2047-217X
IS - 9
ER -