Title :
Natural Language Compression Optimized for Large Set of Files
Author :
Prochazka, Petr ; Holub, J.
Author_Institution :
Dept. of Theor. Comput. Sci., Czech Tech. Univ. in Prague, Prague, Czech Republic
Abstract :
Summary form only given. Web search engines store web pages in raw text form to build so-called snippets (short text fragments surrounding the searched pattern) or to compute so-called positional ranking functions. We address the problem of compressing a large collection of text files distributed over a cluster of computers, where individual files must be randomly accessible in very short time. The compression algorithm Set-of-Files Semi-Adaptive Two-Byte Dense Code (SF-STBDC) is based on a word-based approach and on the idea of combining two statistical models: a global model (common to all files in the set) and a local model. The latter is built as the set of changes that transform the global model into the proper model of a single compressed file. Besides a very good compression ratio, the method allows fast searching directly on the compressed text, which is an attractive property especially for search engines. Exactly the same problem (compression of a set of files using byte codes) was first stated in. Our algorithm SF-STBDC outperforms the algorithm based on (s, c)-Dense Code in compression ratio while keeping very good searching and decompression speed. The key idea behind this result is the use of Semi-Adaptive Two-Byte Dense Code, which codes small portions of the text more effectively and still allows exact setting of the number of stoppers and continuers.
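To illustrate the stopper/continuer idea behind dense codes, the following is a minimal sketch (not the authors' implementation) of a two-byte dense code over word ranks. It assumes, for illustration only, that byte values 0..s-1 are stoppers and s..255 are continuers; a codeword is either a single stopper byte (for the s most frequent words) or a continuer byte followed by a stopper byte (for the next c*s words, where c = 256 - s):

```python
def encode(rank, s):
    """Encode a word rank (0 = most frequent) as a 1- or 2-byte codeword.

    Assumed convention: byte values 0..s-1 are stoppers (end a codeword),
    values s..255 are continuers. Capacity is s + (256 - s) * s words.
    """
    c = 256 - s
    if rank < s:
        return bytes([rank])            # one-byte codeword: a single stopper
    j = rank - s
    assert j < c * s, "rank exceeds two-byte capacity"
    return bytes([s + j // s, j % s])   # continuer byte, then stopper byte

def decode(data, s):
    """Decode a concatenation of codewords back into a list of word ranks."""
    ranks, i = [], 0
    while i < len(data):
        b = data[i]
        if b < s:                       # stopper: one-byte codeword
            ranks.append(b)
            i += 1
        else:                           # continuer: two-byte codeword
            ranks.append(s + (b - s) * s + data[i + 1])
            i += 2
    return ranks
```

Because every codeword ends in a stopper byte, codeword boundaries are recognizable from the bytes themselves, which is what makes direct pattern search on the compressed text possible; the semi-adaptive variant in the paper additionally tunes s (and thus c) per file.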
Keywords :
adaptive codes; data compression; search engines; statistical analysis; word processing; SF-STBDC algorithm; compressed text; computer cluster; global model; local model; natural language compression; positional ranking functions; raw text form; set-of-files semiadaptive two byte dense code algorithm; short text surrounding the searched pattern; single compressed file model; snippets; statistical models; text file collection; web pages; web search engines; word-based approach; Computational modeling; Computer science; Data compression; Educational institutions; Engines; Natural languages; Web search;
Conference_Titel :
Data Compression Conference (DCC), 2013
Conference_Location :
Snowbird, UT
Print_ISBN :
978-1-4673-6037-1
DOI :
10.1109/DCC.2013.93