Title :
K-means clustering based compression algorithm for the high-throughput DNA sequence
Author :
Li Tan ; Jifeng Sun
Author_Institution :
Sch. of Electron. & Inf. Eng., South China Univ. of Technol., Guangzhou, China
Abstract :
This paper proposes a compression algorithm based on K-means clustering for high-throughput DNA sequences (DNAC-K). DNAC-K first clusters the sequences using the K-means clustering method, then iterates on the clusters according to the edit distances of subsequences, and finally applies Huffman coding to encode the clustering result. Experimental results on several sequencing data sets show that DNAC-K performs better than many current high-throughput DNA sequence compression algorithms.
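The abstract names two standard building blocks, edit distance and Huffman coding, without giving their details. The following is a minimal Python sketch of those two textbook techniques only; it is not the DNAC-K implementation, and the function names `edit_distance` and `huffman_code` are hypothetical, chosen for illustration. How DNAC-K combines them with K-means clustering is not specified in this record and is not reproduced here.

```python
import heapq
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing one bit to each side.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


if __name__ == "__main__":
    print(edit_distance("ACGT", "AGT"))
    code = huffman_code("AAAACCGT")
    print(code, "".join(code[s] for s in "AAAACCGT"))
```

In this sketch the more frequent a symbol is, the shorter its codeword, which is the property that makes Huffman coding useful as a final encoding stage after similar sequences have been grouped.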
Keywords :
DNA; Huffman codes; biology computing; data compression; encoding; pattern clustering; DNA sequence compression algorithms; DNAC-K; Huffman coding; K-means clustering based compression algorithm; edit distances; high-throughput DNA sequence; sequencing data sets; subsequences; Bioinformatics; Clustering algorithms; Clustering methods; Compression algorithms; Genomics; DNA sequence compression; K-means clustering; sequence alignment;
Conference_Title :
Audio, Language and Image Processing (ICALIP), 2014 International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-1-4799-3902-2
DOI :
10.1109/ICALIP.2014.7009935