Title :
Universal Text Preprocessing and Postprocessing for PPM Using Alphabet Adjustment
Author :
Alhawiti, Khaled M. ; Teahan, William J.
Abstract :
In this paper, we introduce several new universal pre-processing techniques to improve Prediction by Partial Matching (PPM) compression of UTF-8 encoded natural language text. These methods essentially 'adjust' the alphabet in some manner (for example, by expanding or reducing it) before the compression algorithm is applied to the amended text.
Keywords :
data compression; natural language processing; pattern matching; text analysis; PPM compression algorithm; UTF-8 encoded natural language text; alphabet adjustment; prediction by partial matching; universal text postprocessing; universal text preprocessing; Compression algorithms; Compressors; Computer science; Data compression; Educational institutions; Natural languages; Vocabulary; Bi-graphs; PPM; Text compression;
Conference_Title :
Data Compression Conference (DCC), 2014
Conference_Location :
Snowbird, UT
DOI :
10.1109/DCC.2014.12