Title :
Text Pre-processing for Lossless Compression
Author :
Batista, Luís ; Alexandre, Luís A.
Author_Institution :
Univ. Beira Interior, Covilha
Abstract :
Textual data has a number of properties that can be exploited to improve compression. Pre-processing deals with these properties by applying transformations that make the redundancy "more visible" to the compressor. One of the most commonly used concepts in text pre-processing is capital conversion: words with capital letters are converted to their lowercase versions, and the change is signaled with a flag. This not only increases context similarities but also means that dictionaries used for word replacement need to contain only the lowercase versions of words. Word replacement consists of replacing words with shorter codes that are references to their location in a dictionary.
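The two transformations named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the marker characters (CAP_FIRST, CAP_ALL, WORD_REF), the one-character reference code, and all function names are hypothetical choices made for illustration, and the sketch assumes the control characters \x01 to \x03 never occur in the input and that the dictionary holds at most 94 lowercase words.

```python
import re

# Hypothetical marker characters, assumed never to occur in the input text.
CAP_FIRST = "\x01"  # the following word had only its first letter capitalized
CAP_ALL = "\x02"    # the following word was written entirely in capitals
WORD_REF = "\x03"   # the following character is a dictionary reference

WORD = re.compile(r"[A-Za-z]+")


def encode_caps(text):
    """Capital conversion: lowercase capitalized words and prefix a flag."""
    def flag(match):
        word = match.group(0)
        if len(word) > 1 and word.isupper():
            return CAP_ALL + word.lower()
        if word[0].isupper() and word[1:] == word[1:].lower():
            return CAP_FIRST + word.lower()
        return word  # lowercase or mixed-case words are left untouched
    return WORD.sub(flag, text)


def decode_caps(text):
    """Undo encode_caps by re-capitalizing every flagged word."""
    def unflag(match):
        marker, word = match.group(1), match.group(2)
        if marker == CAP_ALL:
            return word.upper()
        return word[0].upper() + word[1:]
    return re.sub("([\x01\x02])([a-z]+)", unflag, text)


def encode_words(text, dictionary):
    """Word replacement: swap dictionary words for a two-character code."""
    index = {word: i for i, word in enumerate(dictionary)}
    def to_ref(match):
        word = match.group(0)
        if word in index:
            # The code is the marker plus one character encoding the position.
            return WORD_REF + chr(33 + index[word])
        return word
    return WORD.sub(to_ref, text)


def decode_words(text, dictionary):
    """Undo encode_words by looking each reference up in the dictionary."""
    return re.sub(
        "\x03(.)",
        lambda m: dictionary[ord(m.group(1)) - 33],
        text,
        flags=re.DOTALL,
    )


# The dictionary only needs lowercase entries because capital conversion
# runs before word replacement.
dictionary = ["the", "quick", "fox"]
text = "The QUICK brown fox. The iPhone and the Fox."
encoded = encode_words(encode_caps(text), dictionary)
decoded = decode_caps(decode_words(encoded, dictionary))
```

Decoding applies the inverse steps in reverse order (dictionary references first, then capitalization flags), which keeps the round trip lossless under the stated assumptions. Mixed-case words such as "iPhone" are passed through unchanged in this sketch rather than being flagged.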
Keywords :
data compression; text analysis; word processing; capital conversion; lossless compression; text preprocessing; textual data; word replacement; Costs; Dictionaries; Frequency conversion; Mathematical programming; Mathematics; Testing; Vocabulary; dictionary; text pre-processing
Conference_Title :
Data Compression Conference, 2008. DCC 2008
Conference_Location :
Snowbird, UT
Print_ISBN :
978-0-7695-3121-2
DOI :
10.1109/DCC.2008.78