DocumentCode :
1787200
Title :
HErCoOl: High-Throughput Error Correction by Oligomers
Author :
Milicchio, Franco ; Prosperi, Mattia C. F.
Author_Institution :
Dept. of Eng., Univ. of Roma Tre, Rome, Italy
fYear :
2014
fDate :
27-29 May 2014
Firstpage :
227
Lastpage :
232
Abstract :
Next-generation sequencing (NGS) technologies are laying the foundations for a new paradigm in genomics and transcriptomics. It is now possible to sequence any microbial organism or meta-genomic sample within hours, and to obtain a whole human genome in less than a month. Sequencing prices are decreasing dramatically, opening the way to actual personalised medicine. NGS technologies, however, are error-prone, and correcting errors is a challenge due to multiple factors, including the data sizes (gigabyte scale) and the machine-specific, non-random characteristics of errors and error distributions. Several approaches have been proposed, yet the problem remains a challenge, especially when analysing mixtures of (closely related) species, e.g., highly variable viruses infecting a host as a swarm, like hepatitis C or human immunodeficiency virus. This work presents a novel error correction algorithm based on k-mer strings with their associated overlap graph, along with an open-source, multi-threaded implementation. The algorithm, named HErCoOl (High-throughput Error Correction by Oligomers), needs minimal tuning: only an overall error rate and, optionally, information about the genome sizes. HErCoOl was compared against other state-of-the-art methods, using empirical NGS data obtained with Roche 454 technology, focusing the benchmarks on mixtures of related species. Results show that HErCoOl improves significantly over the current methods, and that the parallelisation scales well with the size of the input for NGS platforms producing long sequence reads, such as Roche 454 or Ion Torrent. HErCoOl provides fast and efficient error correction of NGS data, especially for mixed samples. Its platform-independent, open-source, multi-threaded implementation ensures flexibility for use and integration in any NGS data analysis software.
Keywords :
biology computing; data analysis; error correction; genomics; graph theory; Ion Torrent; NGS data analysis software; NGS technology; Roche 454; Roche 454 technology; error correction algorithm; hepatitis C; high-throughput error correction by oligomers; human immunodeficiency virus; k-mer strings; meta-genomic sample; microbial organism; next-generation sequencing technology; overlap graph; transcriptomics; Benchmark testing; Bioinformatics; DNA; Error analysis; Sequential analysis; genome assembly; next generation sequencing; spectral alignment problem
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer-Based Medical Systems (CBMS), 2014 IEEE 27th International Symposium on
Conference_Location :
New York, NY
Type :
conf
DOI :
10.1109/CBMS.2014.7
Filename :
6881881