• DocumentCode
    987532
  • Title

    The smallest grammar problem

  • Author

    Charikar, Moses ; Lehman, Eric ; Liu, Ding ; Panigrahy, Rina ; Prabhakaran, Manoj ; Sahai, Amit ; Shelat, Abhi

  • Author_Institution
    Dept. of Comput. Sci., Princeton Univ., NJ, USA
  • Volume
    51
  • Issue
    7
  • fYear
    2005
  • fDate
    7/1/2005 12:00:00 AM
  • Firstpage
    2554
  • Lastpage
    2576
  • Abstract
    This paper addresses the smallest grammar problem: What is the smallest context-free grammar that generates exactly one given string σ? This is a natural question about a fundamental object connected to many fields such as data compression, Kolmogorov complexity, pattern identification, and addition chains. Due to the problem's inherent complexity, our objective is to find an approximation algorithm which finds a small grammar for the input string. We focus attention on the approximation ratio of the algorithm (and implicitly, the worst-case behavior) to establish provable performance guarantees and to address shortcomings in the classical measure of redundancy in the literature. Our first results concern the hardness of approximating the smallest grammar problem. Most notably, we show that every efficient algorithm for the smallest grammar problem has approximation ratio at least 8569/8568 unless P=NP. We then bound approximation ratios for several of the best known grammar-based compression algorithms, including LZ78, BISECTION, SEQUENTIAL, LONGEST MATCH, GREEDY, and RE-PAIR. Among these, the best upper bound we show is O(n^{1/2}). We finish by presenting two novel algorithms with exponentially better ratios of O(log^3 n) and O(log(n/m*)), where m* is the size of the smallest grammar for that input. The latter algorithm highlights a connection between grammar-based compression and LZ77.
  • Keywords
    context-free grammars; data compression; pattern matching; LZ77; LZ78; MPM; RE-PAIR; SEQUITUR; approximation algorithm; context-free grammar; grammar-based compression algorithms; longest match; multilevel pattern matching; smallest grammar problem; Approximation algorithms; Compression algorithms; Compressors; Computer science; Source coding; Upper bound; hardness of approximation
  • fLanguage
    English
  • Journal_Title
    Information Theory, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0018-9448
  • Type

    jour

  • DOI
    10.1109/TIT.2005.850116
  • Filename
    1459058