Regional Information Center for Science and Technology

Abstract :

The paper addresses the problem of block-oriented natural language compression. Adaptive and semi-adaptive compression methods are now very common in the field of natural language compression, each with different application possibilities. Block-oriented compression is semi-adaptive with respect to a single block but adaptive with respect to the whole input. Our block-oriented compression method is based on the Dense Code idea. It achieves a very good compression ratio of around 32 % on natural language text and proved to be very fast when searching the compressed text. We show that our method has some interesting properties that could be applied to digital libraries. The compression method allows direct searching of the compressed text. Moreover, the vocabulary can be used as a block index, which makes some kinds of searching very fast. Another property is that the compressor can send single blocks together with the corresponding vocabulary, which suits limited bandwidth. In addition, the compressed file can be continuously extended without prior decompression.

Our block-oriented compression method is called Semi-adaptive Two Byte Dense Code (STBDC), and it is a semi-adaptive version of the previously proposed TBDC. An STBDC codeword is composed of one or two bytes. The values of the first byte are so-called stoppers or continuers. In the second byte any combination of bits is allowed, which is why the coding space is limited. The decomposition of the input text into blocks is based on the limit of the coding space: a block must always end when the coding space given by the number of stoppers is exhausted. The changes between consecutive blocks are encoded in the dictionary file, so that the original dictionary for the corresponding block can be easily and quickly reconstructed.
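The codeword layout and the block boundary rule described above can be illustrated with a minimal Python sketch. The abstract does not give the exact byte assignment, so the following are assumptions made for illustration only: first-byte values below `stoppers` are taken to be one-byte codewords, the remaining first-byte values are taken to be continuers followed by an unrestricted second byte, and words are identified by a vocabulary rank. The function names `capacity`, `encode_rank`, `decode_stream` and `split_into_blocks` are hypothetical and do not come from the paper. The sketch also restarts the vocabulary at each block, whereas the method described above encodes only the changes between consecutive dictionaries.

```python
from typing import Iterable, List


def capacity(stoppers: int) -> int:
    """Number of distinct codewords for a given stopper count (assumed layout)."""
    if not 0 <= stoppers <= 256:
        raise ValueError("stoppers must be in 0..256")
    continuers = 256 - stoppers
    # One-byte codewords plus two-byte codewords with a free second byte.
    return stoppers + continuers * 256


def encode_rank(rank: int, stoppers: int) -> bytes:
    """Encode a vocabulary rank as a one- or two-byte codeword."""
    if not 0 <= rank < capacity(stoppers):
        raise ValueError("rank does not fit in the coding space")
    if rank < stoppers:
        return bytes([rank])  # stopper: a complete one-byte codeword
    offset = rank - stoppers
    # Continuer in the first byte, any bit combination in the second byte.
    return bytes([stoppers + offset // 256, offset % 256])


def decode_stream(data: bytes, stoppers: int) -> List[int]:
    """Decode a concatenation of codewords back into ranks."""
    ranks = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < stoppers:
            ranks.append(first)
            i += 1
        else:
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte codeword")
            ranks.append(stoppers + (first - stoppers) * 256 + data[i + 1])
            i += 2
    return ranks


def split_into_blocks(words: Iterable[str], stoppers: int) -> List[List[str]]:
    """Close a block when a new word would exceed the coding space."""
    limit = capacity(stoppers)
    blocks: List[List[str]] = []
    current: List[str] = []
    vocab = set()
    for word in words:
        if word not in vocab and len(vocab) == limit:
            blocks.append(current)
            current, vocab = [], set()  # simplification: fresh vocabulary
        vocab.add(word)
        current.append(word)
    if current:
        blocks.append(current)
    return blocks
```

Under these assumptions, a stopper count of 128 gives 128 one-byte codewords and 128 × 256 two-byte codewords, so a block can hold at most 32896 distinct words before a new block must begin.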

Keywords :

adaptive codes; data compression; natural language processing; text analysis; STBDC codeword; adaptive compression methods; block-oriented natural language compression; coding space; digital library; semiadaptive compression methods; semiadaptive two byte dense code; Adaptation model; Data compression; Dictionaries; Encoding; Natural languages; Real time systems; Vocabulary; Dense Code; Digital Libraries; Natural Language Compression;