DocumentCode :
2049456
Title :
Natural Language Compression per Blocks
Author :
Prochazka, Petr ; Holub, Jan
Author_Institution :
Dept. of Theor. Comput. Sci., Czech Tech. Univ. in Prague, Prague, Czech Republic
fYear :
2011
fDate :
21-24 June 2011
Firstpage :
67
Lastpage :
75
Abstract :
We present a new natural language compression method: Semi-adaptive Two Byte Dense Code (STBDC). STBDC performs compression per block: the input is divided into several blocks, and each block is compressed separately according to its own statistical model. To avoid redundancy, the final vocabulary file is composed as a sequence of the changes in the model between each two consecutive blocks. STBDC belongs to the family of Dense Codes and keeps all their attractive properties, including very high compression and decompression speed and an acceptable compression ratio of around 32% on natural language text. Moreover, STBDC provides other properties applicable in digital libraries and other textual databases. The compression method allows direct searching of the compressed text, and the vocabulary can be used as a block index. STBDC is well suited to limited bandwidth in a client/server architecture, since it can send single compressed blocks together with only the corresponding part of the vocabulary. Furthermore, STBDC enables various approaches to updating and extending the compressed text.
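The idea of storing the vocabulary as a sequence of changes between consecutive blocks can be illustrated with a minimal sketch. This is not the STBDC format from the paper: the function names (`split_blocks`, `vocabulary_deltas`, `rebuild_vocabularies`, `blocks_containing`), the fixed block size, and the use of plain word sets instead of a frequency-ranked statistical model are all assumptions made for illustration only.

```python
def split_blocks(words, block_size):
    """Divide the word sequence into consecutive blocks of block_size words."""
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]


def vocabulary_deltas(blocks):
    """For each block, record which words enter and leave the vocabulary
    relative to the previous block (hypothetical delta representation)."""
    deltas = []
    previous = set()
    for block in blocks:
        current = set(block)
        added = sorted(current - previous)
        removed = sorted(previous - current)
        deltas.append((added, removed))
        previous = current
    return deltas


def rebuild_vocabularies(deltas):
    """Replay the deltas in order to recover the vocabulary of every block."""
    vocabularies = []
    current = set()
    for added, removed in deltas:
        current = (current - set(removed)) | set(added)
        vocabularies.append(set(current))
    return vocabularies


def blocks_containing(word, vocabularies):
    """Use the per-block vocabularies as a coarse block index:
    return the indices of the blocks in which the word occurs."""
    return [i for i, vocab in enumerate(vocabularies) if word in vocab]


words = "the cat sat on the mat the dog sat on the log".split()
blocks = split_blocks(words, 6)
deltas = vocabulary_deltas(blocks)
vocabularies = rebuild_vocabularies(deltas)
```

In this sketch, words shared by two consecutive blocks are never stored twice, and a query for a word that is absent from a block's vocabulary can skip that block without decompressing it.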
Keywords :
data compression; natural language processing; statistical analysis; STBDC; dense codes; digital libraries; natural language compression; natural language text; semi adaptive two byte dense code; statistical model; textual databases; vocabulary file; Arrays; Encoding; Indexes; Natural languages; Recycling; Vocabulary; Block indexes; Byte Codes; Natural language compression; Word-based compression;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Compression, Communications and Processing (CCP), 2011 First International Conference on
Conference_Location :
Palinuro
Print_ISBN :
978-1-4577-1458-0
Electronic_ISBN :
978-0-7695-4528-8
Type :
conf
DOI :
10.1109/CCP.2011.25
Filename :
6061029