• DocumentCode
    1710009
  • Title

    Discovery of patterns in LZ-78 text discrimination

  • Author

    Malyutov, Mikhail ; Cunningham, Gabriel

  • Author_Institution
    Math. Dept, Northeastern Univ., Boston, MA, USA
  • fYear
    2010
  • Firstpage
    28
  • Lastpage
    33
  • Abstract
    We continue studying a new context-free computationally simple stylometry-based text homogeneity test: the sliced conditional compression complexity (sCCC or simply CCC) of literary texts introduced and inspired by the incomputable Kolmogorov conditional complexity. Other stylometry tools can occasionally almost coincide for different authors. Our CCC-attributor is asymptotically strictly minimal for the true author, if the query texts are sufficiently large but much less than the training texts, universal compressor is good and sampling bias is avoided. This classifier simplifies the homogeneity test (partly based on compression) under insignificant difference of unconditional complexities of training and query texts verifiable via its asymptotic normality for IID and Markov sources and empirical normal plots for real literary texts. It is consistent under large text approximation as a stationary ergodic sequence due to the lower bound for the minimax compression redundancy of piecewise stationary strings, our elementary combinatorial arguments and simulation in present paper for IID sources. The CCC is based on the t-ratio measuring how many standard deviations are in the mean difference of slices´ CCC which enables evaluation of the corresponding P-value of statistical significance based on slices´ CCC asymptotic normality empirically verified by their normal plots in all cases studied and expected to be proved soon for simplified statistical models of literary texts. While good CCC performance holds for a broad class of fast adapting universal compressors introduced, the new application of one of them (LZ-78) is described in this paper (which was first outlined in less detail and without application results): the discovery of patterns contributed the most to the CCC discrimination of texts. We present an algorithm implemented in a PERL program by the second author (available by request) and results of its performance on simulated strings and on Federalist pape- - rs. A much shorter CCC program in PERL written by the second author is in our Appendix. We hope that experimenting with these two programs will help the CCC dissemination. The asymptotic CCC study was complemented so far by many literary case studies, showing consistency in attributing all Madison Federalist papers to Madison, Brodsky´s novels to Brodsky, etc., and a good discrimination between works of different authors in contrast to at least questionable performance of on the same texts. Further testing and tuning CCC-methodology and pattern discovery seems appropriate. New areas of applications for the universal compressor - based nonparametric testing such as change-point detection for arbitrary stationary ergodic sources emerge recently.
  • Keywords
    Markov processes; computational complexity; computational linguistics; data compression; minimax techniques; pattern recognition; statistical analysis; text analysis; CCC asymptotic normality; CCC attributor; LZ-78 text discrimination; Markov sources; PERL program; change point detection; context free computationally simple stylometry based text homogeneity test; elementary combinatorial arguments; incomputable Kolmogorov conditional complexity; minimax compression redundancy; piecewise stationary strings; sliced conditional compression complexity; stationary ergodic sequence; stylometry tools; text approximation; unconditional complexities; Complexity theory; Dictionaries; Entropy; Redundancy; Region 8; Testing; Training;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Computational Technologies in Electrical and Electronics Engineering (SIBIRCON), 2010 IEEE Region 8 International Conference on
  • Conference_Location
    Listvyanka
  • Print_ISBN
    978-1-4244-7625-1
  • Type

    conf

  • DOI
    10.1109/SIBIRCON.2010.5555303
  • Filename
    5555303