• DocumentCode
    3704263
  • Title

    A Parallel Algorithm for Compression of Big Next-Generation Sequencing Datasets

  • Author

    Pérez;Fahad Saeed

  • Author_Institution
    Comput. Sci. Dept., Western Michigan Univ., Kalamazoo, MI, USA
  • Volume
    3
  • fYear
    2015
  • Firstpage
    196
  • Lastpage
    201
  • Abstract
    The volume of data produced by high-throughput Next-Generation Sequencing (NGS) techniques presents challenges in the storage, analysis, and transmission of massive datasets. One solution for storage and transmission is compression using specialized compression algorithms. Existing specialized algorithms scale poorly as dataset size increases, and the best available solutions can take hours to compress gigabytes of data. Compression and decompression of petascale datasets using these techniques is prohibitively expensive in terms of time and energy. In this paper we introduce paraDSRC, a parallel implementation of the DNA Sequence Reads Compression (DSRC) application that uses a message passing model and reduces the compression time complexity by a factor of O(1/p), where p is the number of processing units. Our experimental results show that paraDSRC achieves compression times that are 43% to 99% faster than DSRC and compression throughputs of up to 8.4 GB/s on a moderate-size cluster. For many of the datasets used in our experiments, super-linear speedups were observed, indicating that the implementation is strongly scalable. We also show that paraDSRC is more than 25.6x faster than comparable parallel compression algorithms. The code is available free for academic use at https://github.com/PCDS/paraDSRC.
  • Keywords
    "DNA","Algorithm design and analysis","Encoding","Sequential analysis","Compression algorithms","Message passing","Throughput"
  • Publisher
    ieee
  • Conference_Titel
    Trustcom/BigDataSE/ISPA, 2015 IEEE
  • Type

    conf

  • DOI
    10.1109/Trustcom.2015.632
  • Filename
    7345648
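
The abstract describes dividing compression work among p processing units. The following is a minimal sketch of that general idea only, not the paraDSRC implementation: it assumes FASTQ input with four-line records, uses `zlib` as a stand-in for the DSRC codec, and uses a thread pool as a stand-in for message-passing processes. The names `split_records`, `compress_parallel`, and `decompress` are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List

RECORD_LINES = 4  # assumption: each FASTQ record is header, bases, '+', qualities


def split_records(fastq: bytes, parts: int) -> List[bytes]:
    """Split FASTQ text into at most `parts` chunks on record boundaries."""
    lines = fastq.splitlines(keepends=True)
    n_records = len(lines) // RECORD_LINES
    per_chunk = max(1, -(-n_records // parts))  # ceiling division
    step = per_chunk * RECORD_LINES
    return [b"".join(lines[i:i + step]) for i in range(0, len(lines), step)]


def compress_parallel(fastq: bytes, workers: int) -> List[bytes]:
    """Compress each chunk independently; zlib stands in for DSRC here."""
    chunks = split_records(fastq, workers)
    # Threads stand in for the message-passing ranks named in the abstract.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, chunks))


def decompress(blocks: List[bytes]) -> bytes:
    """Decompress the blocks in order and concatenate them."""
    return b"".join(zlib.decompress(block) for block in blocks)


if __name__ == "__main__":
    sample = b"".join(b"@read%d\nACGT\n+\nIIII\n" % i for i in range(10))
    blocks = compress_parallel(sample, workers=4)
    assert decompress(blocks) == sample
```

Because each chunk is compressed independently, the per-worker input shrinks roughly in proportion to 1/p, which is the source of the ideal speedup the abstract refers to. Actual speedups depend on I/O, load balance, and the codec, none of which this sketch models.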