Title :
In search of perfect reads
Author :
Pal, Soumitra ; Aluru, Srinivas
Author_Institution :
Dept. of Comput. Sci. & Eng., Indian Inst. of Technol. Bombay, Mumbai, India
Abstract :
Continued advances in next-generation short-read sequencing technologies are increasing throughput and read lengths while driving down error rates, for example to within 1% for Illumina HiSeq reads. Moreover, errors are not uniformly distributed across reads, and a large percentage of reads are in fact error-free. The ability to predict such perfect reads can have a significant impact on the run-time complexity of applications. In this paper, we present a simple and fast method, based on k-spectrum analysis, to identify error-free reads. Our experiments show that if around 80% of the reads in a dataset are perfect, our method retains almost 99.9% of them with a precision of more than 90%. Although filtering out the reads that our method identifies as erroneous reduces coverage by about 7% on average, the coverage pattern across the genome remains similar. The filtration process can be customized at several levels of stringency, depending on the needs of the downstream application.
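The abstract does not spell out the method, so the following is only a minimal sketch of one common k-spectrum heuristic, not the authors' algorithm: count every k-mer across all reads, then treat a read as probably error-free when each of its k-mers occurs at least `min_count` times. The function names, the value of `k`, and the `min_count` threshold are all assumptions made for illustration; raising `min_count` corresponds loosely to a more stringent filter.

```python
from collections import Counter


def kmers(read, k):
    """Yield every length-k substring (k-mer) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_spectrum(reads, k):
    """Count k-mer occurrences over all reads (the k-spectrum).

    Assumption: reverse complements are ignored to keep the sketch short.
    """
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def is_probably_perfect(read, spectrum, k, min_count):
    """Hypothetical rule: every k-mer of the read must be 'solid',
    i.e. seen at least min_count times in the dataset."""
    if len(read) < k:
        return False  # too short to judge
    return all(spectrum[km] >= min_count for km in kmers(read, k))


def filter_reads(reads, k, min_count):
    """Keep only the reads predicted to be error-free."""
    spectrum = build_spectrum(reads, k)
    return [r for r in reads if is_probably_perfect(r, spectrum, k, min_count)]


# Toy dataset: three identical correct reads and one read with a
# single substitution (T -> A at position 4).
reads = ["ACGTTGCA"] * 3 + ["ACGTAGCA"]
kept = filter_reads(reads, k=4, min_count=2)
```

In this toy dataset the substituted base creates k-mers such as `CGTA` that occur only once, so the fourth read is filtered out while the three matching reads are kept. On real data a sequencing error typically introduces several low-count k-mers in the same way, but low-coverage regions can also produce rare k-mers, which is one reason the threshold has to be tuned.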
Keywords :
error analysis; filtration; genomics; error-free read identification; fast k-spectrum analysis based method; filtration process; next generation short-read sequencing technologies; Accuracy; Bioinformatics; Error correction; Genomics; Next generation networking; Prediction algorithms; Sequential analysis; Next generation sequencing; error correction
Conference_Title :
2014 IEEE 4th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS)
Conference_Location :
Miami, FL
Print_ISBN :
978-1-4799-5786-6
DOI :
10.1109/ICCABS.2014.6863919