BCB Faculty Member in the Statistics Department
Death to the Bioinformatics NGS Pipeline! (At least the first few steps.)
The age-old wisdom "garbage in, garbage out" underscores any analysis using next-generation sequencing (NGS) data. NGS errors are typically mitigated through two consecutive stages in the data analysis pipeline: base-calling followed by error-correction. Interestingly, base callers rarely utilize information available about the underlying genome sequence, whereas error-correction methods seldom utilize properties of the sequencing machine. We have demonstrated that a probabilistic approach to error correction beats all existing algorithmic methods.
Even better, an integrated approach that combines the information available to both base-calling and error-correction methods improves performance over their serial application. Menges et al. (2011) have previously proposed a base caller that borrows information from alignment to a known reference genome. In contrast, our goal is to utilize genome information in the absence of a known reference genome. Specifically, we use a hidden Markov model (HMM), where the transition distribution captures local genomic dependence and the emission distribution models the observed intensities. The combined method removes at least twice as many errors as Bustard, the standard Illumina base caller.
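
To make the HMM idea concrete, here is a toy sketch in Python. It is not the model from the talk. It assumes first-order dependence between adjacent bases, four intensity channels per cycle, and independent Gaussian emissions with made-up parameters (`signal`, `noise`, `sd`). All function names and numbers are hypothetical. Viterbi decoding then picks the most probable base sequence given the intensities.

```python
import math

BASES = "ACGT"

def log_gauss(x, mean, sd):
    """Log density of a normal distribution (assumed emission family)."""
    return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

def emission_logprob(intensities, base, signal=1.0, noise=0.1, sd=0.3):
    """Log-likelihood of one cycle's four channel intensities given a base.

    Assumption: the channel matching the base is centred at `signal`,
    the other three at `noise`; the values are illustrative only.
    """
    return sum(
        log_gauss(x, signal if channel == base else noise, sd)
        for channel, x in enumerate(intensities)
    )

def viterbi(intensity_rows, log_trans, log_init):
    """Most probable base sequence under a first-order HMM."""
    score = [log_init[b] + emission_logprob(intensity_rows[0], b) for b in range(4)]
    backpointers = []
    for row in intensity_rows[1:]:
        new_score, ptr = [], []
        for b in range(4):
            # Best predecessor for base b at this cycle.
            prev = max(range(4), key=lambda a: score[a] + log_trans[a][b])
            ptr.append(prev)
            new_score.append(score[prev] + log_trans[prev][b] + emission_logprob(row, b))
        score = new_score
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(range(4), key=lambda b: score[b])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return "".join(BASES[b] for b in reversed(path))

# Hypothetical transition matrix: after A, the genome favours G.
trans = [
    [0.10, 0.10, 0.70, 0.10],  # from A
    [0.25, 0.25, 0.25, 0.25],  # from C
    [0.25, 0.25, 0.25, 0.25],  # from G
    [0.25, 0.25, 0.25, 0.25],  # from T
]
log_trans = [[math.log(p) for p in row] for row in trans]
log_init = [math.log(0.25)] * 4

# Cycle 1 is clearly A; cycle 2 is ambiguous between C and G.
rows = [
    [1.00, 0.10, 0.10, 0.10],
    [0.10, 0.55, 0.55, 0.10],
]
called = viterbi(rows, log_trans, log_init)
print(called)
```

In this toy input the intensities at the second cycle cannot distinguish C from G, so the call there is decided by the transition distribution. This is the sense in which genomic context can inform base calling without a reference genome.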