How to cite this WIREs title:
WIREs Comput Mol Sci
Impact Factor: 8.127

Objective review of de novo stand‐alone error correction methods for NGS data


The sequencing market has grown steadily over the last few years, and each approach to reading DNA is prone to different types of errors. Multiple studies have demonstrated the impact of sequencing errors on different applications of next‐generation sequencing (NGS), making error correction a fundamental initial step. The methods in the literature use different approaches and suit different types of problems. We analyzed 50 methods divided into five main approaches (k‐spectrum, suffix arrays, multiple‐sequence alignment, read clustering, and probabilistic models). They are stand‐alone (not published as part of a suite) and de novo (they target raw, unprocessed data without an existing reference genome). These correctors handle one or more sequencing technologies using the same or different approaches. They face general challenges (sometimes with technology‐specific traits) such as repetitive regions, uncalled bases, and ploidy. Even assessing their performance is a challenge because of the differing approaches taken by various authors, the absence of a known reference (de novo), and the behavior of the third‐party tools employed in the benchmarks. This study aims to help the researcher advance the field of error correction, the educator to have a brief but comprehensive companion, and the bioinformatician to choose the right tool for the right job.

WIREs Comput Mol Sci 2016, 6:111–146. doi: 10.1002/wcms.1239

This article is categorized under: Computer and Information Science > Computer Algorithms and Programming
Main sequencing steps for Illumina.
The EM algorithm initializes the probabilities of the bases before entering the loop where it alternates between E‐step and M‐step; once the convergence threshold has been reached, the method exits and enters the correction stage; the capital P represents the probability for a base to be the real one.
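The loop described in this caption can be illustrated with a toy model. The sketch below is not taken from any of the reviewed correctors: it assumes a single alignment column of observed bases, a fixed and known substitution error rate, and a simple mixture model in which P is the probability of each base being the real one. The function names, the `error_rate`, `tol`, and `min_confidence` parameters, and the confidence-based correction rule are all assumptions made for illustration.

```python
BASES = "ACGT"

def em_base_probabilities(observed, error_rate=0.01, tol=1e-6, max_iter=100):
    """Estimate P(base is the real one) for one column of observed bases."""
    if not observed:
        raise ValueError("need at least one observed base")

    # Initialization: every base starts out equally likely.
    p = {b: 1.0 / len(BASES) for b in BASES}

    def emit(obs, true):
        # Assumed error model: a miscall is equally likely to be any other base.
        return 1.0 - error_rate if obs == true else error_rate / 3.0

    for _ in range(max_iter):
        # E-step: expected number of observations explained by each real base.
        expected = {b: 0.0 for b in BASES}
        for obs in observed:
            weights = {b: p[b] * emit(obs, b) for b in BASES}
            total = sum(weights.values())
            for b in BASES:
                expected[b] += weights[b] / total

        # M-step: re-estimate P from the expected counts.
        new_p = {b: expected[b] / len(observed) for b in BASES}

        # Convergence test: stop once P has stopped moving.
        delta = max(abs(new_p[b] - p[b]) for b in BASES)
        p = new_p
        if delta < tol:
            break
    return p

def correct_column(observed, p, min_confidence=0.9):
    """Correction stage: overwrite the column only if one base clearly dominates."""
    best = max(BASES, key=lambda b: p[b])
    if p[best] < min_confidence:
        return list(observed)  # ambiguous column, leave it untouched
    return [best] * len(observed)
```

For a column of nineteen `A` calls and one `C`, the estimate concentrates on `A` and the lone `C` is overwritten. Real probabilistic correctors typically also use per-base quality scores and model insertions and deletions, which this sketch ignores.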
(a) Clustering approach for one reference read and four related reads, each with one difference; (b) real example with the main read marked in bold, the satellites aligned to it, and the differing loci marked in bold italic.
(a) Multiple‐sequence alignment of reads versus the (prospective) reference genome; (b) example of four reads with the common k‐mer "TTACGAA" and the four basic types of errors.
Suffix trie example; (a) an error on the rightmost path results in a branch with a very low frequency (<< k/2) compared with its sibling branch (≲k/2); (b) example of a trie for a very short genome with read TAAA having an error in its third position.
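The idea in panel (a), that an erroneous base shows up as a rarely visited branch next to a heavily visited sibling, can be sketched with a small counting trie. This is an illustrative sketch only: the nested-dictionary trie, the `depth` limit, and the relative `ratio` cutoff are assumptions chosen for brevity and stand in for the frequency thresholds shown in the figure.

```python
def build_trie(reads, depth):
    """Insert every suffix of every read (truncated to `depth`) and count visits."""
    root = {"count": 0, "children": {}}
    for read in reads:
        for start in range(len(read)):
            node = root
            for base in read[start:start + depth]:
                node = node["children"].setdefault(
                    base, {"count": 0, "children": {}}
                )
                node["count"] += 1
    return root

def suspicious_branches(node, ratio=0.2, prefix=""):
    """Return (path, count, best_sibling_count) for branches far rarer than a sibling."""
    found = []
    children = node["children"]
    if len(children) > 1:
        top = max(child["count"] for child in children.values())
        for base, child in children.items():
            if child["count"] < ratio * top:
                found.append((prefix + base, child["count"], top))
    for base, child in children.items():
        found.extend(suspicious_branches(child, ratio, prefix + base))
    return found
```

With ten copies of the read `ACGT` and one copy of `ACTT`, the branches spelling `ACT` and `CT` are each visited once while their siblings are visited ten times, so both are reported as likely errors. A corrector would then rewrite the rare branch to match its frequent sibling.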
Typical distribution of k‐mers used by ksb correctors; the vertical axis shows the number of k‐mers that appear in the number of reads displayed on the horizontal axis; the first peak corresponds to erroneous k‐mers that appear in only a few reads; correct k‐mers typically exist in a number of reads close to the coverage; k‐mers found in many reads (right part of the spectrum) typically correspond to repetitive regions.
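The spectrum in this figure can be computed directly from the reads. The following is a minimal sketch, assuming reads are plain strings and that k‐mers containing an uncalled base (`N`) are skipped; the function names and the fixed `threshold` are hypothetical, and real correctors usually derive the cutoff from the shape of the histogram.

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Count, for each k-mer, the number of reads containing it, then histogram those counts."""
    read_counts = Counter()
    for read in reads:
        # A set ensures each k-mer is counted once per read, matching the figure's axes.
        kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
        for kmer in kmers:
            if "N" not in kmer:
                read_counts[kmer] += 1
    # spectrum[m] = number of distinct k-mers found in exactly m reads.
    spectrum = Counter(read_counts.values())
    return read_counts, spectrum

def weak_kmers(read_counts, threshold):
    """k-mers seen in fewer than `threshold` reads, i.e. the first peak of the spectrum."""
    return {kmer for kmer, n in read_counts.items() if n < threshold}
```

For three copies of `ACGTAC` and one erroneous read `ACGTTC` with k = 4, the two k‐mers that span the error (`CGTT` and `GTTC`) each occur in a single read and fall below a threshold of 2, while the remaining k‐mers occur in three or four reads.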
Classification of the correctors by supported sequencing technology; the letters in parentheses on the leaves are used to group the algorithms in Table .
