DocumentCode :
3226378
Title :
Text Pre-processing for Lossless Compression
Author :
Batista, Luís ; Alexandre, Luís A.
Author_Institution :
Univ. Beira Interior, Covilhã
fYear :
2008
fDate :
25-27 March 2008
Firstpage :
506
Lastpage :
506
Abstract :
Textual data has a number of properties that can be exploited to improve compression. Pre-processing exploits these properties by applying transformations that make the redundancy "more visible" to the compressor. One of the most commonly used techniques in text pre-processing is capital conversion: words containing capital letters are converted to their lowercase versions, and the change is signaled with a flag. This not only increases context similarity but also means that dictionaries used for word replacement need to contain only lowercase words. Word replacement consists of replacing words with shorter codes that reference their location in a dictionary.
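The two transformations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the flag bytes, the one-character dictionary codes, the function names, and the tiny dictionary are all hypothetical choices made for illustration, and the sketch assumes the input text does not itself contain the control characters used as flags and that the dictionary is small enough for each index to fit in a single character.

```python
import re

# Hypothetical markers (assumptions, not taken from the paper).
CAP_FLAG = "\x01"    # next word originally had an initial capital
UPPER_FLAG = "\x02"  # next word was originally all uppercase
CODE_FLAG = "\x03"   # next character encodes a dictionary index

WORD = re.compile(r"[A-Za-z]+")
TOKEN = re.compile(r"([\x01\x02]?)(?:\x03(.)|([A-Za-z]+))", re.DOTALL)


def encode(text, dictionary):
    """Apply capital conversion, then replace dictionary words with codes."""
    index = {word: i for i, word in enumerate(dictionary)}

    def repl(match):
        word = match.group(0)
        lower = word.lower()
        if word == lower:
            flag = ""
        elif word == lower.capitalize():
            flag = CAP_FLAG
        elif word == lower.upper():
            flag = UPPER_FLAG
        else:
            return word  # mixed case (e.g. "iPhone"): left untouched
        if lower in index:
            # The dictionary only needs the lowercase form of each word.
            return flag + CODE_FLAG + chr(0x80 + index[lower])
        return flag + lower

    return WORD.sub(repl, text)


def decode(data, dictionary):
    """Invert encode(): expand codes, then restore capitalization."""

    def repl(match):
        flag, code, word = match.groups()
        if code is not None:
            word = dictionary[ord(code) - 0x80]
        if flag == CAP_FLAG:
            return word.capitalize()
        if flag == UPPER_FLAG:
            return word.upper()
        return word

    return TOKEN.sub(repl, data)


dictionary = ["the", "fox"]  # lowercase entries only
text = "The quick fox saw THE other Fox. iPhone"
encoded = encode(text, dictionary)
restored = decode(encoded, dictionary)
```

Because every flagged word is stored in lowercase, "The", "THE" and "the" all map to the same dictionary entry, and the round trip through `decode` recovers the original text exactly, which is what makes the pre-processing step lossless.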
Keywords :
data compression; text analysis; word processing; capital conversion; lossless compression; text preprocessing; textual data; word replacement; Costs; Dictionaries; Frequency conversion; Mathematical programming; Mathematics; Testing; Vocabulary; dictionary; text pre-processing
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Compression Conference, 2008. DCC 2008
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
978-0-7695-3121-2
Type :
conf
DOI :
10.1109/DCC.2008.78
Filename :
4483333