A resampling strategy for studying robustness in virus detection pipelines

Standard

A resampling strategy for studying robustness in virus detection pipelines. / Kohls, Moritz; Saremi, Babak; Muchsin, Ihsan; Fischer, Nicole; Becher, Paul; Jung, Klaus.

In: COMPUT BIOL CHEM, Vol. 94, 107555, 10.2021.

Research output: SCORING: Contribution to journalSCORING: Journal articleResearchpeer-review

Harvard

APA

Vancouver

Bibtex

@article{3ada0b6a8d5849e39d4cdab1ba411184,
title = "A resampling strategy for studying robustness in virus detection pipelines",
abstract = "Next-generation sequencing is regularly used to identify viral sequences in DNA or RNA samples of infected hosts. A major step of most pipelines for virus detection is to map sequence reads against known virus genomes. Due to small differences between the sequences of related viruses, and due to several biological or technical errors, mapping underlies uncertainties. As a consequence, the resulting list of detected viruses can lack robustness. A new approach for generating artificial sequencing reads together with a strategy of resampling from the original findings is proposed that can help to assess the robustness of the originally identified list of viruses. From the original mapping result in form of a SAM file, a set of statistical distributions are derived. These are used in the resampling pipeline to generate new artificial reads which are again mapped versus the reference genomes. By summarizing the resampling procedure, the analyst receives information about whether the presence of a particular virus in the sample gains or losses evidence, and thus about the robustness of the original mapping list but also that of individual viruses in this list. To judge robustness, several indicators are derived from the resampling procedure such as the correlation between original and resampling read counts, or the statistical detection of outliers in the differences of read counts. Additionally, graphical illustrations of read count shifts via Sankey diagrams are provided. To demonstrate the use of the new approach, the resampling approach is applied to three real-world data samples, one of them with laboratory-confirmed Influenza sequences, and to artificially generated data where virus sequences have been spiked into the sequencing data of a host. By applying the resampling pipeline, several viruses drop from the original list while new viruses emerge, showing robustness of those viruses that remain in the list. The evaluation of the new approach shows that the resampling approach is helpful to analyze the viral content of a biological sample, to rate the robustness of original findings and to better show the overall distribution of findings. The method is also applicable to other virus detection pipelines based on read mapping.",
author = "Moritz Kohls and Babak Saremi and Ihsan Muchsin and Nicole Fischer and Paul Becher and Klaus Jung",
note = "Copyright {\textcopyright} 2021 Elsevier Ltd. All rights reserved.",
year = "2021",
month = oct,
doi = "10.1016/j.compbiolchem.2021.107555",
language = "English",
volume = "94",
journal = "COMPUT BIOL CHEM",
issn = "1476-9271",
publisher = "Elsevier Limited",

}

RIS

TY - JOUR

T1 - A resampling strategy for studying robustness in virus detection pipelines

AU - Kohls, Moritz

AU - Saremi, Babak

AU - Muchsin, Ihsan

AU - Fischer, Nicole

AU - Becher, Paul

AU - Jung, Klaus

N1 - Copyright © 2021 Elsevier Ltd. All rights reserved.

PY - 2021/10

Y1 - 2021/10

N2 - Next-generation sequencing is regularly used to identify viral sequences in DNA or RNA samples of infected hosts. A major step of most pipelines for virus detection is to map sequence reads against known virus genomes. Due to small differences between the sequences of related viruses, and due to several biological or technical errors, mapping underlies uncertainties. As a consequence, the resulting list of detected viruses can lack robustness. A new approach for generating artificial sequencing reads together with a strategy of resampling from the original findings is proposed that can help to assess the robustness of the originally identified list of viruses. From the original mapping result in form of a SAM file, a set of statistical distributions are derived. These are used in the resampling pipeline to generate new artificial reads which are again mapped versus the reference genomes. By summarizing the resampling procedure, the analyst receives information about whether the presence of a particular virus in the sample gains or losses evidence, and thus about the robustness of the original mapping list but also that of individual viruses in this list. To judge robustness, several indicators are derived from the resampling procedure such as the correlation between original and resampling read counts, or the statistical detection of outliers in the differences of read counts. Additionally, graphical illustrations of read count shifts via Sankey diagrams are provided. To demonstrate the use of the new approach, the resampling approach is applied to three real-world data samples, one of them with laboratory-confirmed Influenza sequences, and to artificially generated data where virus sequences have been spiked into the sequencing data of a host. By applying the resampling pipeline, several viruses drop from the original list while new viruses emerge, showing robustness of those viruses that remain in the list. The evaluation of the new approach shows that the resampling approach is helpful to analyze the viral content of a biological sample, to rate the robustness of original findings and to better show the overall distribution of findings. The method is also applicable to other virus detection pipelines based on read mapping.

AB - Next-generation sequencing is regularly used to identify viral sequences in DNA or RNA samples of infected hosts. A major step of most pipelines for virus detection is to map sequence reads against known virus genomes. Due to small differences between the sequences of related viruses, and due to several biological or technical errors, mapping underlies uncertainties. As a consequence, the resulting list of detected viruses can lack robustness. A new approach for generating artificial sequencing reads together with a strategy of resampling from the original findings is proposed that can help to assess the robustness of the originally identified list of viruses. From the original mapping result in form of a SAM file, a set of statistical distributions are derived. These are used in the resampling pipeline to generate new artificial reads which are again mapped versus the reference genomes. By summarizing the resampling procedure, the analyst receives information about whether the presence of a particular virus in the sample gains or losses evidence, and thus about the robustness of the original mapping list but also that of individual viruses in this list. To judge robustness, several indicators are derived from the resampling procedure such as the correlation between original and resampling read counts, or the statistical detection of outliers in the differences of read counts. Additionally, graphical illustrations of read count shifts via Sankey diagrams are provided. To demonstrate the use of the new approach, the resampling approach is applied to three real-world data samples, one of them with laboratory-confirmed Influenza sequences, and to artificially generated data where virus sequences have been spiked into the sequencing data of a host. By applying the resampling pipeline, several viruses drop from the original list while new viruses emerge, showing robustness of those viruses that remain in the list. The evaluation of the new approach shows that the resampling approach is helpful to analyze the viral content of a biological sample, to rate the robustness of original findings and to better show the overall distribution of findings. The method is also applicable to other virus detection pipelines based on read mapping.

U2 - 10.1016/j.compbiolchem.2021.107555

DO - 10.1016/j.compbiolchem.2021.107555

M3 - SCORING: Journal article

C2 - 34364046

VL - 94

JO - COMPUT BIOL CHEM

JF - COMPUT BIOL CHEM

SN - 1476-9271

M1 - 107555

ER -