DocumentCode :
1247780
Title :
Universal text preprocessing for data compression
Author :
Abel, Jürgen ; Teahan, William
Author_Institution :
Dept. of Commun. Syst., Univ. Duisburg-Essen, Duisburg, Germany
Volume :
54
Issue :
5
fYear :
2005
fDate :
5/1/2005
Firstpage :
497
Lastpage :
507
Abstract :
Several preprocessing algorithms for text files are presented which complement each other and which are performed prior to the compression scheme. The algorithms need no external dictionary and are language independent. The compression gain is compared, along with the cost in speed, for the BWT, PPM, and LZ compression schemes. The average overall compression gain is in the range of 3 to 5 percent for the text files of the Calgary Corpus and between 2 and 9 percent for the text files of the large Canterbury Corpus.
Keywords :
data compression; sorting; text analysis; BWT compression; Calgary Corpus; Canterbury Corpus; LZ compression; PPM compression; compression gain; universal text preprocessing algorithms; Compression algorithms; Costs; Decoding; Dictionaries; Encoding; Information systems; Internet; Natural languages; Text recognition; BWT; Algorithms; LZ; PPM; preprocessing; text compression
fLanguage :
English
Journal_Title :
Computers, IEEE Transactions on
Publisher :
IEEE
ISSN :
0018-9340
Type :
jour
DOI :
10.1109/TC.2005.85
Filename :
1407841