DocumentCode :
1798869
Title :
K-means clustering based compression algorithm for the high-throughput DNA sequence
Author :
Li Tan ; Jifeng Sun
Author_Institution :
Sch. of Electron. & Inf. Eng., South China Univ. of Technol., Guangzhou, China
fYear :
2014
fDate :
7-9 July 2014
Firstpage :
952
Lastpage :
955
Abstract :
This paper proposes a compression algorithm based on K-means clustering for high-throughput DNA sequences (DNAC-K). DNAC-K first clusters the sequences with the K-means method, then iterates on the clusters according to the edit distances of subsequences, and finally applies Huffman coding to encode the clustering result. Experimental results on several sequencing data sets show that DNAC-K outperforms many current high-throughput DNA sequence compression algorithms.
Keywords :
DNA; Huffman codes; biology computing; data compression; encoding; pattern clustering; DNA sequence compression algorithms; DNAC-K; Huffman coding; K-means clustering based compression algorithm; edit distances; high-throughput DNA sequence; sequencing data sets; subsequences; Bioinformatics; Clustering algorithms; Clustering methods; Compression algorithms; Genomics; DNA sequence compression; K-means clustering; sequence alignment
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Audio, Language and Image Processing (ICALIP), 2014 International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-1-4799-3902-2
Type :
conf
DOI :
10.1109/ICALIP.2014.7009935
Filename :
7009935
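Illustrative sketch :
The abstract names three building blocks: grouping sequences into clusters, comparing subsequences by edit distance, and Huffman coding the result. The Python sketch below shows generic versions of those blocks only. It is not the DNAC-K algorithm itself, whose details are not given in this record. The function names (edit_distance, assign_to_clusters, huffman_codes), the use of fixed cluster centers, and the sample reads are all assumptions made for illustration.

```python
import heapq
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def assign_to_clusters(reads, centers):
    """Assign each read to the index of its nearest center by edit distance.

    Assumption: this is only the assignment step of a K-means-style loop;
    how DNAC-K picks and updates centers is not described in this record.
    """
    return [min(range(len(centers)),
                key=lambda k: edit_distance(read, centers[k]))
            for read in reads]


def huffman_codes(symbols):
    """Build a Huffman code table {symbol: bitstring} from a symbol sequence."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tiebreak, partial code table); the unique
    # tiebreak keeps the dictionaries from ever being compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


# Hypothetical sample reads and centers, chosen only for illustration.
reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
centers = ["ACGTACGT", "TTTTGGGG"]
labels = assign_to_clusters(reads, centers)
codes = huffman_codes("AAAACCGT")
```

In this sketch the frequent symbol receives the shortest code, which is the property that makes Huffman coding useful as a final encoding stage after similar sequences have been grouped.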