DocumentCode :
1301342
Title :
KungFQ: A Simple and Powerful Approach to Compress fastq Files
Author :
Grassi, E. ; Gregorio, F.D. ; Molineris, I.
Author_Institution :
Dept. of Genetics, Biol. & Biochem., Mol. Biotechnol. Center, Turin, Italy
Volume :
9
Issue :
6
fYear :
2012
Firstpage :
1837
Lastpage :
1842
Abstract :
Storing data derived from deep sequencing experiments has become pivotal, yet standard compression algorithms do not satisfactorily exploit the structure of these data. A number of reference-based compression algorithms have been developed, but they are less adequate for new species without fully sequenced genomes or for nongenomic data. We developed a tool that takes advantage of fastq characteristics and encodes them in a binary format optimized for further compression with standard tools (such as gzip or lzma). The algorithm is straightforward and does not need any external reference file; it scans the fastq file only once and has a constant memory requirement. Moreover, we added the option of lossy compression, which discards some of the original information (IDs and/or qualities) but yields smaller files; it is also possible to define a quality cutoff below which the corresponding base calls are converted to N. We achieve compression ratios of 2.82 to 7.77 on various fastq files without losing information and of 5.37 to 8.77 when discarding IDs, which are often not used in common analysis pipelines. In this paper, we compare the algorithm's performance with that of known tools, usually obtaining higher compression levels.
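The quality-cutoff option mentioned in the abstract can be illustrated with a minimal sketch. This is not KungFQ's own code: the function name `mask_low_quality`, the Phred+33 quality encoding, and the default offset are assumptions made only for illustration.

```python
def mask_low_quality(seq: str, qual: str, cutoff: int, offset: int = 33) -> str:
    """Replace base calls whose quality score is below `cutoff` with 'N'.

    Assumes Phred+33 encoded quality strings (an assumption; the record
    does not state which encoding the tool expects).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    return "".join(
        base if ord(q) - offset >= cutoff else "N"
        for base, q in zip(seq, qual)
    )


# 'I' encodes score 40 and '#' encodes score 2 under Phred+33.
masked = mask_low_quality("ACGT", "I#I#", cutoff=20)
print(masked)  # ANGN
```

Masking low-quality calls this way is lossy, and it tends to lengthen runs of identical symbols, which general-purpose compressors usually handle well.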
Keywords :
bioinformatics; data compression; storage management; KungFQ; binary format; constant memory requirement; fastq characteristics; fastq file compression; nongenomic data; reference-based compression algorithms; Bioinformatics; Compression algorithms; Decoding; Encoding; Genomics; Standards; Biology and genetics; algorithms for data and knowledge management; Algorithms; Data Compression; Databases, Genetic; Genomics; High-Throughput Nucleotide Sequencing; Internet; Software;
fLanguage :
English
Journal_Title :
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
Publisher :
IEEE
ISSN :
1545-5963
Type :
jour
DOI :
10.1109/TCBB.2012.123
Filename :
6313591