DocumentCode :
3062180
Title :
Towards a calibrated corpus for compression testing
Author :
Titchener, M.R. ; Fenwick, P.M. ; Chen, M.C.
Author_Institution :
Dept. of Comput. Sci., Auckland Univ., New Zealand
fYear :
1999
fDate :
29-31 Mar 1999
Firstpage :
554
Abstract :
Summary form only given. A mini-corpus of twelve 'calibrated' binary-data files has been produced for systematic evaluation of compression algorithms. These are generated within the framework of a deterministic theory of string complexity. Here the T-complexity of a string $x$ (measured in taugs) is defined as $C_T(x)=\sum_i \log_2(k_i+1)$, where the positive integers $k_i$ are the T-expansion parameters for the corresponding string production process. $C_T(x)$ is observed to be the logarithmic integral of the total information content $I_x$ of $x$ (measured in nats), i.e., $C_T(x)=\mathrm{li}(I_x)$. The average entropy is $\bar{H}_x=I_x/|x|$, i.e., the total information content divided by the length of $x$. Thus $C_T(x)=\mathrm{li}(\bar{H}_x\times|x|)$. Alternatively, the information rate along a string may be described by an entropy function $H_x(n)$, $0\le n\le|x|$, for the string. Assuming that $H_x(n)$ is continuously integrable along the length of $x$, then $I_x=\int_0^{|x|}H_x(n)\,\delta n$. Thus $C_T(x)=\mathrm{li}\left(\int_0^{|x|}H_x(n)\,\delta n\right)$. Solving for $H_x(n)$, that is, differentiating both sides and rearranging, we get $H_x(n)=\frac{\delta C_T(x|n)}{\delta n}\times\log_e\left(\mathrm{li}^{-1}(C_T(x|n))\right)$. With $x$ being in fact discrete, and the T-complexity function being computed in terms of the discrete T-augmentation steps, we may accordingly re-express the equation in terms of the T-prefix increments: $\delta n\approx\Delta_i|x|=k_i|p_i|$; and, from the definition of $C_T(x)$, $\delta C_T(x)$ is replaced by $\Delta_i C_T(x)=\log_2(k_i+1)$. The average slope over the $i$-th T-prefix $p_i$ increment is then simply $\frac{\Delta_i C_T(x)}{\Delta_i|x|}=\frac{\log_2(k_i+1)}{k_i|p_i|}$. The entropy function is now replaced by a discrete approximation
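The two discrete quantities defined in the abstract can be evaluated directly once the T-expansion parameters are known. The following is a minimal sketch, not the authors' implementation: the function names `t_complexity` and `average_slope` are hypothetical, and the example parameters and prefix lengths are illustrative values only, not the result of an actual T-decomposition of any string.

```python
import math


def t_complexity(ks):
    """Return C_T = sum_i log2(k_i + 1), in taugs.

    `ks` is a sequence of T-expansion parameters (positive integers).
    """
    if any(k < 1 for k in ks):
        raise ValueError("T-expansion parameters must be positive integers")
    return sum(math.log2(k + 1) for k in ks)


def average_slope(k, prefix_len):
    """Return the average slope over one T-prefix increment.

    Implements log2(k + 1) / (k * |p|), where `prefix_len` is |p|.
    """
    if k < 1 or prefix_len < 1:
        raise ValueError("k and prefix_len must be positive integers")
    return math.log2(k + 1) / (k * prefix_len)


# Illustrative (assumed) parameters, not derived from a real string.
example_ks = [1, 3, 1, 7]
example_prefix_lens = [1, 2, 8, 16]

total = t_complexity(example_ks)
slopes = [average_slope(k, p) for k, p in zip(example_ks, example_prefix_lens)]
```

Obtaining the actual `k_i` and `|p_i|` for a given string requires a T-decomposition, which this sketch does not attempt.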
Keywords :
approximation theory; computational complexity; data compression; differentiation; entropy; integral equations; string matching; T-complexity; binary-data files; compression testing; deterministic theory; differentiation; discrete approximation; entropy function; logarithmic integral; string complexity; string production process; Compression algorithms; Compressors; Computer science; Entropy; Information rates; Logic; Network address translation; Production; Testing; Tin;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Compression Conference, 1999. Proceedings. DCC '99
Conference_Location :
Snowbird, UT
ISSN :
1068-0314
Print_ISBN :
0-7695-0096-X
Type :
conf
DOI :
10.1109/DCC.1999.785711
Filename :
785711