Title :
Developing an efficient algorithm for representation and compression of large Bengali text
Author :
Marjan, Md Abu ; Uddin, Md Palash ; Ibn Afjal, Masud ; Haque, Md Dulal
Author_Institution :
Fac. of Comput. Sci. & Eng., Hajee Mohammad Danesh Sci. & Technol. Univ., Dinajpur, Bangladesh
Abstract :
Efficient coding is one of the challenging aspects of information and communication theory. Natural languages such as Bengali are encoded using Unicode, which requires more space and therefore more time to transfer text in that language. In this paper, we propose a novel algorithm to represent Bengali text efficiently and then compress it with a better compression ratio. Each Bengali character is represented by a unique 2-digit intermediate decimal value. After all the word values are indexed and sorted, successive subtraction is performed on them in the hope of reducing the magnitude of the numbers. The new value of each word can then be encoded with very few bits. Compared with other compressors, the compression ratio of the proposed algorithm decreases substantially for large texts, which may contain more duplicate or redundant words, more words of the same length, and more words of the same length sharing a prefix (called Uposorgo in Bengali).
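The pipeline outlined in the abstract (2-digit character codes, sorted word values, successive subtraction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the code table built from the input text, the function names `build_code_table`, `encode` and `decode`, and the use of a per-word position list as the index are all assumptions made for illustration, and the final bit-level packing is omitted.

```python
from itertools import accumulate


def build_code_table(words):
    """Assign each distinct character a 2-digit decimal code (10..99).

    Assumption: the table is derived from the text itself; the paper's
    actual fixed mapping for Bengali characters is not reproduced here.
    """
    chars = sorted(set("".join(words)))
    if len(chars) > 90:
        raise ValueError("more than 90 distinct characters; 2 digits are not enough")
    return {ch: 10 + i for i, ch in enumerate(chars)}


def encode(text):
    """Return (code table, delta list, per-word positions) for a text."""
    words = text.split()
    table = build_code_table(words)
    # Each word becomes one integer: its 2-digit character codes concatenated.
    values = [int("".join(str(table[ch]) for ch in w)) for w in words]
    # Sort the distinct values; duplicate words collapse to a single entry.
    ordered = sorted(set(values))
    # Successive subtraction: keep the first value, then only the differences.
    deltas = ordered[:1] + [b - a for a, b in zip(ordered, ordered[1:])]
    # Index: where each original word sits in the sorted list.
    rank = {v: i for i, v in enumerate(ordered)}
    positions = [rank[v] for v in values]
    return table, deltas, positions


def decode(table, deltas, positions):
    """Invert encode(): rebuild the words, joined by single spaces."""
    reverse = {code: ch for ch, code in table.items()}
    ordered = list(accumulate(deltas))  # running sum undoes the subtraction
    words = []
    for p in positions:
        digits = str(ordered[p])
        words.append("".join(reverse[int(digits[i:i + 2])]
                             for i in range(0, len(digits), 2)))
    return " ".join(words)


sample = "আমার সোনার বাংলা আমার"
table, deltas, positions = encode(sample)
restored = decode(table, deltas, positions)
```

Codes start at 10 so that every code is exactly two digits and no leading zero is lost when a word is stored as an integer. Words of the same length that share a prefix produce values close to each other, so their differences are small, which is the effect the abstract attributes to large texts.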
Keywords :
data compression; indexing; information theory; natural language processing; sorting; text analysis; 2-digit intermediate decimal value; communication theory; efficient coding; large Bengali text compression; large Bengali text representation; natural languages; Unicode technology; Compounds; Compressors; Computers; Encoding; Indexes; Sorting; Standards; Bengali text compression; Bengali text representation; Compression; Decompression; compression ratio
Conference_Title :
Strategic Technology (IFOST), 2014 9th International Forum on
Conference_Location :
Cox's Bazar
Print_ISBN :
978-1-4799-6060-6
DOI :
10.1109/IFOST.2014.6991063