Parallel Read Error Correction for Big Genomic Datasets

Abstract:

Genome sequencing with today's instruments deciphers on the order of a billion short genomic fragments per run. These fragments are a few hundred bases long and are commonly referred to as "reads". Reads contain errors due to limitations of sequencing technology. Read error correction enhances the quality of results produced by applications in areas such as genomics, metagenomics, and transcriptomics. Using error-corrected reads also improves the runtime and memory usage of such applications. Sequential error correction tools cannot cope with the large number of reads produced by modern-day sequencing instruments. A distributed-memory Parallel Spectrum-based Error Correction (PSbEC) algorithm was proposed to overcome this drawback [1]. In this work, we propose techniques to address three major shortcomings of the PSbEC algorithm. Our optimizations enhance the scope and the speedup of the PSbEC algorithm, thereby enabling error correction of big genomic datasets. More specifically, by combining our optimizations, we achieve a cumulative speedup of up to 11x. Further, we demonstrate error correction of a human dataset containing nearly 1.55 billion reads. This work stands as the first demonstration of distributed-memory genomic read error correction for a dataset consisting of more than a billion reads.
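
For readers unfamiliar with the term, the general idea behind spectrum-based error correction can be sketched in a few lines of Python. This is a minimal sequential illustration, not the PSbEC algorithm and not the optimizations proposed in this work. It assumes that "spectrum" means the table of k-mer counts over all reads, that a k-mer seen fewer than `threshold` times is suspect, and that each error is a single-base substitution. The names `kmer_spectrum`, `covering_kmers`, `correct_read`, `k`, and `threshold`, along with the toy data, are hypothetical.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer (length-k substring) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def covering_kmers(read, pos, k):
    """Return all k-mers of `read` that include position `pos`."""
    start = max(0, pos - k + 1)
    end = min(pos, len(read) - k)
    return [read[i:i + k] for i in range(start, end + 1)]


def correct_read(read, counts, k, threshold):
    """Greedy single-substitution correction (illustrative assumption).

    A position is changed only if exactly one alternative base makes
    every k-mer covering that position occur at least `threshold` times.
    """
    bases = list(read)
    for pos in range(len(bases)):
        current = "".join(bases)
        if all(counts[km] >= threshold for km in covering_kmers(current, pos, k)):
            continue  # every k-mer over this position is already trusted
        candidates = []
        for base in "ACGT":
            if base == bases[pos]:
                continue
            trial = current[:pos] + base + current[pos + 1:]
            if all(counts[km] >= threshold for km in covering_kmers(trial, pos, k)):
                candidates.append(base)
        if len(candidates) == 1:  # skip ambiguous or unfixable positions
            bases[pos] = candidates[0]
    return "".join(bases)


# Toy data: five error-free copies of a sequence plus one read with a
# substitution at index 10 (C -> T).
genome = "ACGTTGCATGCCGATAGCTA"
noisy = "ACGTTGCATGTCGATAGCTA"
reads = [genome] * 5 + [noisy]

spectrum = kmer_spectrum(reads, k=5)
print(correct_read(noisy, spectrum, k=5, threshold=2))
```

Even this toy version shows why sequential tools struggle at scale: the count table must cover every k-mer of every read before any read can be corrected, so its memory footprint grows with the dataset. That pressure is what motivates a distributed-memory approach.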