• DocumentCode
    799044
  • Title
    Universal Lossless Compression With Unknown Alphabets—The Average Case
  • Author
    Shamir, Gil I.
  • Author_Institution
    Dept. of Electr. & Comput. Eng., Utah Univ., Salt Lake City, UT
  • Volume
    52
  • Issue
    11
  • fYear
    2006
  • Firstpage
    4915
  • Lastpage
    4944
  • Abstract
    Universal compression of patterns of sequences generated by independent and identically distributed (i.i.d.) sources with unknown, possibly large, alphabets is investigated. A pattern is a sequence of indices that contains all consecutive indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alphabet symbols can be exploited to create the pattern of the sequence. This pattern can in turn be compressed by itself. It is shown that if the alphabet size k is essentially small, then the average minimax and maximin redundancies as well as the redundancy of every code for almost every source, when compressing a pattern, consist of at least 0.5 log(n/k^3) bits per each unknown probability parameter, and if all alphabet letters are likely to occur, there exist codes whose redundancy is at most 0.5 log(n/k^2) bits per each unknown probability parameter, where n is the length of the data sequences. Otherwise, if the alphabet is large, these redundancies are essentially at least Θ(n^{-2/3}) bits per symbol, and there exist codes that achieve redundancy of O(n^{-1/2}) bits per symbol. Two suboptimal low-complexity sequential algorithms for compression of patterns are presented and their description lengths analyzed, also pointing out that the pattern average universal description length can decrease below the underlying i.i.d. entropy for large enough alphabets.
  • Keywords
    data compression; entropy codes; minimax techniques; probability; sequences; source coding; average minimax-maximin redundancy; coding; data sequence; iid entropy; independent-identically distributed source; probability parameter; sequence pattern; universal lossless compression; unknown alphabet; Algorithm design and analysis; Communication system control; Computer aided software engineering; Costs; Decoding; Entropy; Gas insulated transmission lines; Minimax techniques; Pattern analysis; Statistical distributions; Average redundancy; independent and identically distributed (i.i.d.) sources; index sequences; individual redundancy; maximin redundancy; minimax redundancy; minimum description length (MDL); patterns; redundancy for most sources; redundancy–capacity theorem; sequential codes; universal coding
  • fLanguage
    English
  • Journal_Title
    Information Theory, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9448
  • Type
    jour
  • DOI
    10.1109/TIT.2006.883609
  • Filename
    1715534
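
The abstract defines a pattern as the sequence of indices obtained by numbering symbols in increasing order of first occurrence. A minimal Python sketch of that mapping follows; the function name `pattern` and the choice of 1-based indices are assumptions made for illustration and are not taken from the paper.

```python
def pattern(sequence):
    """Return the pattern of `sequence`: each symbol is replaced by the
    rank of its first occurrence (1-based, an illustrative convention)."""
    first_seen = {}   # symbol -> index assigned at its first occurrence
    indices = []
    for symbol in sequence:
        if symbol not in first_seen:
            # A new symbol receives the next unused consecutive index.
            first_seen[symbol] = len(first_seen) + 1
        indices.append(first_seen[symbol])
    return indices


if __name__ == "__main__":
    print(pattern("lossless"))  # [1, 2, 3, 3, 1, 4, 3, 3]
```

The pattern discards the identity of the symbols and keeps only their order of appearance, so two sequences over different alphabets can share the same pattern.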