A Parallel Algorithm for Compression of Big Next-Generation Sequencing Datasets

Author

Pérez;Fahad Saeed

Author_Institution

Comput. Sci. Dept., Western Michigan Univ., Kalamazoo, MI, USA

Volume

3

fYear

2015

Firstpage

196

Lastpage

201

Abstract

The volume of big data produced by high-throughput Next-Generation Sequencing (NGS) techniques presents various challenges, such as the storage, analysis, and transmission of massive datasets. One solution for storage and transmission is compression using specialized compression algorithms. Existing specialized algorithms scale poorly as dataset size increases, and the best available solutions can take hours to compress gigabytes of data. Compression and decompression of peta-scale datasets using these techniques is prohibitively expensive in terms of time and energy. In this paper we introduce paraDSRC, a parallel implementation of the DNA Sequence Reads Compression (DSRC) application that uses a message-passing model and reduces the compression time complexity by a factor of O(1/p), where p is the number of processing units. Our experimental results show that paraDSRC achieves compression times that are 43% to 99% faster than DSRC and compression throughputs of up to 8.4 GB/s on a moderate-size cluster. For many of the datasets used in our experiments, super-linear speedups were recorded, making the implementation strongly scalable. We also show that paraDSRC is more than 25.6x faster than comparable parallel compression algorithms. The code is available for free academic use at https://github.com/PCDS/paraDSRC.
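
The O(1/p) claim rests on splitting the input so that each of the p processing units compresses roughly 1/p of the reads. The sketch below illustrates that idea only and is not the paraDSRC implementation. It assumes a FASTQ input with exactly four lines per record, uses zlib as a stand-in for the DSRC compressor, and replaces the message-passing ranks with a sequential loop. The function names partition_records, compress_chunks, and decompress_chunks are hypothetical.

```python
import zlib


def partition_records(n_records, p):
    """Split n_records into p contiguous ranges whose sizes differ by at most one."""
    base, extra = divmod(n_records, p)
    ranges = []
    start = 0
    for rank in range(p):
        # The first `extra` ranks take one additional record each.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def compress_chunks(fastq_text, p):
    """Compress each rank's share of records independently.

    Assumption: every FASTQ record spans exactly four lines.
    zlib is a stand-in here; the paper's compressor is DSRC.
    """
    lines = fastq_text.splitlines(keepends=True)
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    n_records = len(lines) // 4
    blocks = []
    # In a message-passing program each iteration would run on its own rank.
    for lo, hi in partition_records(n_records, p):
        chunk = "".join(lines[4 * lo:4 * hi]).encode("ascii")
        blocks.append(zlib.compress(chunk))
    return blocks


def decompress_chunks(blocks):
    """Decompress the blocks in rank order and concatenate them."""
    return b"".join(zlib.decompress(b) for b in blocks).decode("ascii")
```

Because every chunk is compressed on its own, the blocks can be produced in any order and reassembled by rank. The per-rank work shrinks in proportion to 1/p only when the records are of similar size, which this sketch does not check.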

Keywords

"DNA","Algorithm design and analysis","Encoding","Sequential analysis","Compression algorithms","Message passing","Throughput"

Publisher

ieee

Conference_Titel

Trustcom/BigDataSE/ISPA, 2015 IEEE

Type

conf

DOI

10.1109/Trustcom.2015.632

Filename

7345648