• DocumentCode
    2707904
  • Title

    The MARKOV EXPERT for finding episodes in time series

  • Author

    Cheng, Jimming ; Mitzenmacher, Michael

  • Author_Institution
    Harvard Univ., MA, USA
  • fYear
    2005
  • fDate
    29-31 March 2005
  • Firstpage
    454
  • Abstract
    Summary form only given. We describe a domain-independent unsupervised algorithm for segmentation of time series data into meaningful episodes, focusing on the problem of text segmentation. The VOTING EXPERTS algorithm of Cohen et al. (2002) achieves results with fairly low rates of error by combining two experts that analyze the input's frequency and entropy patterns. The MARKOV EXPERT is a new approach that improves the performance of VOTING EXPERTS by further refining those results with votes from an additional expert. The new expert applies a method inspired by the compression-based approach of Teahan et al. (2000) for Chinese text. Their supervised approach requires a large, correctly segmented training corpus. Segmentation of the input is modeled as a Markov process, with spaces inserted such that the resulting string is smallest under PPM compression with respect to the corpus. In the unsupervised setting, external corpora are not available. Thus, we draw event pattern data from a new corpus constructed by generating a preliminary segmentation using the original VOTING EXPERTS. Since VOTING EXPERTS finds episode boundaries fairly well (precision and recall around 77% and 75%), the quality of this new corpus is sufficient to allow the MARKOV EXPERT to further improve results significantly. The MARKOV EXPERT votes on possible boundaries by accumulating votes within a sliding window that moves over the input. The context within each window is compared to the corpus using a segmentation utility function. The quality of a particular segmentation is positively correlated with the frequency of the resulting suffixes and prefixes in the corpus, and negatively correlated with instances in which the current context appears intact within a word in the corpus.
  • Keywords
    Markov processes; data compression; entropy codes; text analysis; time series; MARKOV EXPERT; Markov process; PPM compression; VOTING EXPERTS; compression-based approach; corpus; domain-independent unsupervised algorithm; entropy patterns; event pattern data; performance; segmentation utility; sliding window; text segmentation; time series episodes; votes; Computational linguistics; Data compression; Entropy; Frequency; Humans; Markov processes; Pattern analysis; Robots; Voting;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Compression Conference, 2005. Proceedings. DCC 2005
  • ISSN
    1068-0314
  • Print_ISBN
    0-7695-2309-9
  • Type

    conf

  • DOI
    10.1109/DCC.2005.86
  • Filename
    1402211
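
The sliding-window voting described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the segmentation utility function, so the additive score below (suffix count plus prefix count minus intact-occurrence count) is an assumption, as are the names build_counts, split_utility, and vote_boundaries, the window size, and the vote threshold. The word list passed in stands in for the corpus that the paper builds from a preliminary VOTING EXPERTS segmentation.

```python
from collections import Counter


def build_counts(words, max_len):
    """Count word prefixes, word suffixes, and substrings found inside words."""
    prefixes, suffixes, inside = Counter(), Counter(), Counter()
    for w in words:
        for n in range(1, min(max_len, len(w)) + 1):
            prefixes[w[:n]] += 1   # strings that start a corpus word
            suffixes[w[-n:]] += 1  # strings that end a corpus word
        for i in range(len(w)):
            # substrings of length 2..max_len lying wholly inside one word
            for j in range(i + 2, min(len(w), i + max_len) + 1):
                inside[w[i:j]] += 1
    return prefixes, suffixes, inside


def split_utility(window, k, counts):
    """Assumed utility of placing a boundary after the k-th symbol of window.

    Rewards a left part that ends corpus words and a right part that starts
    them; penalizes windows that occur intact within a corpus word.
    """
    prefixes, suffixes, inside = counts
    return suffixes[window[:k]] + prefixes[window[k:]] - inside[window]


def vote_boundaries(text, words, window=4, threshold=2):
    """Slide a window over text; each window votes for its best split point."""
    counts = build_counts(words, window)
    votes = Counter()
    for start in range(len(text) - window + 1):
        chunk = text[start:start + window]
        best_k = max(range(1, window),
                     key=lambda k: split_utility(chunk, k, counts))
        # A window votes only when its best split has positive utility.
        if split_utility(chunk, best_k, counts) > 0:
            votes[start + best_k] += 1
    # Keep positions that collected enough votes across overlapping windows.
    return sorted(pos for pos, n in votes.items() if n >= threshold)


if __name__ == "__main__":
    corpus_words = ["the", "cat", "sat", "the", "cat"]  # toy stand-in corpus
    print(vote_boundaries("thecatsat", corpus_words))
```

On this toy input, every window overlapping a true word boundary votes for it, so the sketch reports boundaries at offsets 3 and 6 (the|cat|sat). Real data would need a tuned window size and threshold, and possibly a normalized utility.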