DocumentCode :
2707904
Title :
The MARKOV EXPERT for finding episodes in time series
Author :
Cheng, Jimming ; Mitzenmacher, Michael
Author_Institution :
Harvard Univ., MA, USA
fYear :
2005
fDate :
29-31 March 2005
Firstpage :
454
Abstract :
Summary form only given. We describe a domain-independent unsupervised algorithm for segmentation of time series data into meaningful episodes, focusing on the problem of text segmentation. The VOTING EXPERTS algorithm of Cohen et al. (2002) achieves results with fairly low rates of error by combining two experts that analyze the input's frequency and entropy patterns. The MARKOV EXPERT is a new approach that improves the performance of VOTING EXPERTS by further refining those results with votes from an additional expert. The new expert applies a method inspired by the compression-based approach of Teahan et al. (2000) for Chinese text. Their supervised approach requires a large, correctly segmented training corpus. Segmentation of the input is modeled as a Markov process, with spaces inserted such that the resulting string is smallest under PPM compression with respect to the corpus. In the unsupervised setting, external corpora are not available. Thus, we draw event pattern data from a new corpus constructed by generating a preliminary segmentation using the original VOTING EXPERTS. Since VOTING EXPERTS finds episode boundaries fairly well (precision and recall around 77% and 75%), the quality of this new corpus is sufficient to allow the MARKOV EXPERT to further improve results significantly. The MARKOV EXPERT votes on possible boundaries by accumulating votes within a sliding window that moves over the input. The context within each window is compared to the corpus using a segmentation utility function. The quality of a particular segmentation is positively correlated with the frequency of the resulting suffixes and prefixes in the corpus, and negatively correlated with instances in which the current context appears intact within a word in the corpus.
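The sliding-window voting described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formula for the segmentation utility, so the additive score below (suffix hits plus prefix hits minus intact-context hits) is an assumption, as are the function names `utility`, `markov_expert_votes`, and `segment`, the window size, and the vote threshold. The corpus is assumed to be a plain list of words taken from a preliminary segmentation.

```python
from collections import Counter


def utility(left, right, corpus_words):
    """Hypothetical utility for placing a boundary between `left` and `right`."""
    # Reward: corpus words that end with the left context (suffix evidence).
    suffix_hits = sum(1 for w in corpus_words if w.endswith(left))
    # Reward: corpus words that start with the right context (prefix evidence).
    prefix_hits = sum(1 for w in corpus_words if w.startswith(right))
    # Penalty: corpus words that contain the whole context unbroken.
    intact_hits = sum(1 for w in corpus_words if (left + right) in w)
    return suffix_hits + prefix_hits - intact_hits


def markov_expert_votes(text, corpus_words, window=4):
    """Slide a window over `text`; each window votes for its best split point."""
    votes = Counter()
    for start in range(len(text) - window + 1):
        chunk = text[start:start + window]
        best_split, best_score = None, None
        for k in range(1, window):
            score = utility(chunk[:k], chunk[k:], corpus_words)
            # Ties keep the earliest split (an arbitrary choice for this sketch).
            if best_score is None or score > best_score:
                best_split, best_score = start + k, score
        votes[best_split] += 1
    return votes


def segment(text, votes, threshold=2):
    """Cut `text` at every position with at least `threshold` votes."""
    cuts = [i for i in range(1, len(text)) if votes[i] >= threshold]
    pieces, prev = [], 0
    for cut in cuts:
        pieces.append(text[prev:cut])
        prev = cut
    pieces.append(text[prev:])
    return pieces


if __name__ == "__main__":
    # Toy corpus standing in for the output of a preliminary segmentation.
    corpus = ["the", "cat", "sat", "the"]
    votes = markov_expert_votes("thecat", corpus, window=4)
    print(segment("thecat", votes, threshold=2))
```

On this toy input all three windows vote for the position between "the" and "cat". In the setting the abstract describes, these votes would be combined with those of the frequency and entropy experts rather than thresholded on their own.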
Keywords :
Markov processes; data compression; entropy codes; text analysis; time series; MARKOV EXPERT; Markov process; PPM compression; VOTING EXPERTS; compression-based approach; corpus; domain-independent unsupervised algorithm; entropy patterns; event pattern data; performance; segmentation utility; sliding window; text segmentation; time series episodes; votes; Computational linguistics; Data compression; Entropy; Frequency; Humans; Markov processes; Pattern analysis; Robots; Voting;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Compression Conference, 2005. Proceedings. DCC 2005
ISSN :
1068-0314
Print_ISBN :
0-7695-2309-9
Type :
conf
DOI :
10.1109/DCC.2005.86
Filename :
1402211