Title :
Sequence Similarity by Gapped LZW
Author :
Apostolico, Alberto ; Cunial, Fabio
Author_Institution :
Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA
Abstract :
Measures of sequence similarity based on some underlying notion of relative compressibility are becoming increasingly of interest in connection with massive tasks of textfile classification such as, notably, in document classification and molecular taxonomy on a genomic scale. Sequences that are similar can be expected to share a large number of common substrings, whence some successful measures in this class have been based on the substring composition of the input sequences. Among the corresponding methods, one finds suitable extensions of the bag-of-words together with more explicit resorts to data compression techniques such as LZ77. The approach presented in this paper explores the potential of LZW - the variant of LZ78 proposed by Welch -- as well as of some of its lossy variants, in this context. Whereas LZW has a faster and simpler implementation than LZ77, the vocabulary underlying LZW is significantly smaller than that of LZ77. In addition, recently introduced "gapped" variants of LZW are considered that are equally straightforward to implement but allow for a controlled number of don\´t cares to be introduced in the substrings that constitute the dictionary used in compression. This study assesses the robustness of compressibility based measures of similarity under these faster yet inherently more dispersive paradigms built around LZW.
Keywords :
data compression; encoding; LZ77; LZ78; Lempel-Ziv-Welchcompression; bag-of-words; document classification; gapped LZW; molecular taxonomy; relative compressibility; sequence similarity; textfile classification; Dictionaries; Encoding; Frequency measurement; Length measurement; Loss measurement; Vocabulary;
Conference_Titel :
Data Compression Conference (DCC), 2011
Conference_Location :
Snowbird, UT
Print_ISBN :
978-1-61284-279-0
DOI :
10.1109/DCC.2011.41