DocumentCode
928809
Title
A convergent gambling estimate of the entropy of English
Author
Cover, Thomas M. ; King, Roger C.
Volume
24
Issue
4
fYear
1978
fDate
7/1/1978 12:00:00 AM
Firstpage
413
Lastpage
421
Abstract
In his original paper on the subject, Shannon found upper and lower bounds for the entropy of printed English based on the number of trials required for a subject to guess subsequent symbols in a given text. The guessing approach precludes asymptotic consistency of either the upper or lower bounds except for degenerate ergodic processes. Shannon's technique of guessing the next symbol is altered by having the subject place sequential bets on the next symbol of text. If $S_n$ denotes the subject's capital after $n$ bets at 27-for-1 odds, and if it is assumed that the subject knows the underlying probability distribution for the process $X$, then the entropy estimate is $\hat{H}_n(X) = (1 - (1/n)\log_{27} S_n)\log_2 27$ bits/symbol. If the subject does not know the true probability distribution for the stochastic process, then $\hat{H}_n(X)$ is an asymptotic upper bound for the true entropy. If $X$ is stationary, $E\hat{H}_n(X) \rightarrow H(X)$, $H(X)$ being the true entropy of the process. Moreover, if $X$ is ergodic, then by the Shannon-McMillan-Breiman theorem $\hat{H}_n(X) \rightarrow H(X)$ with probability one. Preliminary indications are that English text has an entropy of approximately 1.3 bits/symbol, which agrees well with Shannon's estimate.
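The entropy estimate in the abstract can be computed directly from the subject's capital. A minimal sketch (the function name is illustrative, not from the paper), assuming the 27-symbol alphabet and 27-for-1 odds described above:

```python
import math

def gambling_entropy_estimate(capital, n):
    """Entropy estimate in bits/symbol from the subject's capital after
    n sequential bets at 27-for-1 odds, per the abstract's formula:
    H_hat_n(X) = (1 - (1/n) * log_27(capital)) * log_2(27)."""
    return (1.0 - math.log(capital, 27) / n) * math.log2(27)
```

A subject whose capital never grows (capital = 1) yields the zeroth-order bound log2(27) ≈ 4.75 bits/symbol; a subject whose capital grows as 27^n (perfect prediction) yields an estimate of 0 bits/symbol.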
Keywords
Bibliographies; Entropy functions; Languages; Character generation; Entropy; Helium; Laboratories; Random processes; Random variables; Statistics; Stochastic processes; Upper bound; Yield estimation;
fLanguage
English
Journal_Title
IEEE Transactions on Information Theory
Publisher
IEEE
ISSN
0018-9448
Type
jour
DOI
10.1109/TIT.1978.1055912
Filename
1055912
Link To Document