• DocumentCode
    1301342
  • Title
    KungFQ: A Simple and Powerful Approach to Compress fastq Files
  • Author
    Grassi, E. ; Gregorio, F.D. ; Molineris, I.

  • Author_Institution
    Dept. of Genetics, Biol. & Biochem., Mol. Biotechnol. Center, Turin, Italy
  • Volume
    9
  • Issue
    6
  • fYear
    2012
  • Firstpage
    1837
  • Lastpage
    1842
  • Abstract
    Storing data derived from deep sequencing experiments has become pivotal, and standard compression algorithms do not exploit the structure of these data satisfactorily. A number of reference-based compression algorithms have been developed, but they are less adequate for new species without fully sequenced genomes or for nongenomic data. We developed a tool that takes advantage of fastq characteristics and encodes them in a binary format optimized for further compression with standard tools (such as gzip or lzma). The algorithm is straightforward and does not need any external reference file; it scans the fastq only once and has a constant memory requirement. Moreover, we added the possibility to perform lossy compression, which discards some of the original information (IDs and/or qualities) but results in smaller files; it is also possible to define a quality cutoff below which the corresponding base calls are converted to N. We achieve compression ratios of 2.82 to 7.77 on various fastq files without losing information, and 5.37 to 8.77 when discarding IDs, which are often not used in common analysis pipelines. In this paper, we compare the algorithm's performance with known tools, usually obtaining higher compression levels.
  • Keywords
    bioinformatics; data compression; storage management; KungFQ; binary format; constant memory requirement; fastq characteristics; fastq file compression; nongenomic data; reference-based compression algorithms; Bioinformatics; Compression algorithms; Decoding; Encoding; Genomics; Standards; Biology and genetics; algorithms for data and knowledge management; Algorithms; Data Compression; Databases, Genetic; Genomics; High-Throughput Nucleotide Sequencing; Internet; Software;
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type
    jour
  • DOI
    10.1109/TCBB.2012.123
  • Filename
    6313591
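
The quality cutoff and the compression ratios mentioned in the abstract can be illustrated with a minimal Python sketch. This is not KungFQ's implementation or its binary format: the function names `mask_low_quality` and `compression_ratio` are hypothetical, the Phred+33 quality offset is an assumption, and the ratio is assumed to mean original size divided by compressed size.

```python
import gzip


def mask_low_quality(seq: str, qual: str, cutoff: int, offset: int = 33) -> str:
    """Replace base calls whose Phred score is below `cutoff` with 'N'.

    Assumes qualities are ASCII-encoded with the given offset (Phred+33 here).
    """
    return "".join(
        base if ord(q) - offset >= cutoff else "N"
        for base, q in zip(seq, qual)
    )


def compression_ratio(data: bytes) -> float:
    """Original size divided by gzip-compressed size (assumed definition)."""
    return len(data) / len(gzip.compress(data))


if __name__ == "__main__":
    # 'I' encodes Phred 40 and '#' encodes Phred 2 under the Phred+33 assumption.
    print(mask_low_quality("ACGT", "II#I", cutoff=20))
    # A highly repetitive toy input; real fastq files will behave differently.
    print(compression_ratio(b"ACGT" * 1000))
```

Masking low-quality calls to N is lossy, so it trades original information for smaller files, in line with the lossy options the abstract describes.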