  • DocumentCode
    1525349
  • Title
    Probability Estimation in the Rare-Events Regime
  • Author
    Wagner, Aaron B. ; Viswanath, Pramod ; Kulkarni, Sanjeev R.
  • Author_Institution
    Sch. of Electr. & Comput. Eng., Cornell Univ., Ithaca, NY, USA
  • Volume
    57
  • Issue
    6
  • fYear
    2011
  • fDate
    6/1/2011
  • Firstpage
    3207
  • Lastpage
    3229
  • Abstract
    We address the problem of estimating the probability of an observed string that is drawn i.i.d. from an unknown distribution. Motivated by models of natural language, we consider the regime in which the length of the observed string and the size of the underlying alphabet are comparably large. In this regime, the maximum likelihood distribution tends to overestimate the probability of the observed letters, so the Good-Turing probability estimator is typically used instead. We show that when used to estimate the sequence probability, the Good-Turing estimator is not consistent in this regime. We then introduce a novel sequence probability estimator that is consistent. This estimator also yields consistent estimators for other quantities of interest and a consistent universal classifier.
  • Keywords
    entropy; maximum likelihood estimation; maximum likelihood distribution; probability estimation; rare-events regime; sequence probability estimator; Approximation methods; Data models; Estimation; Markov processes; Natural languages; Probability distribution; Classification; entropy estimation; large alphabets; large number of rare events (LNRE); natural language
  • fLanguage
    English
  • Journal_Title
    Information Theory, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9448
  • Type
    jour
  • DOI
    10.1109/TIT.2011.2137210
  • Filename
    5773059