DocumentCode :
2886150
Title :
Generating Diverse Realistic Data Sets for Episode Mining
Author :
Zimmermann, Armin
Author_Institution :
KU Leuven, Leuven, Belgium
fYear :
2012
fDate :
10-10 Dec. 2012
Firstpage :
611
Lastpage :
618
Abstract :
Frequent episode mining has been proposed as a data mining task with the goal of recovering sequential patterns from temporal data sequences. While several episode mining approaches have been proposed in the last fifteen years, most of the developed techniques have not been evaluated on a common benchmark data set, limiting the insights gained from experimental evaluations. In particular, it is unclear how well episodes are actually being recovered, leaving an episode mining user without guidelines in the knowledge discovery process. One reason for this can be found in non-disclosure agreements that prevent real-life data sets on which approaches have been evaluated from entering the public domain. But even easily accessible real-life data sets would not allow one to ascertain miners' abilities to identify underlying patterns. A solution to this problem can be seen in generating artificial data, which has the added advantage that the patterns are known, allowing the accuracy of mined patterns to be evaluated. Based on insights and experiences stemming from consultations with industrial partners and work with real-life data, we propose a data generator that produces diverse data sets reflecting realistic data characteristics. We discuss in detail which characteristics real-life data can be expected to have and how our generator models them. Finally, we show that we can recreate artificial data that has been used in the literature, contrast it with real-life data showing very different characteristics, and show how our generator can be used to create data with realistic characteristics.
Keywords :
data mining; pattern recognition; benchmark data set; data characteristics; episode mining; generating diverse realistic data sets; knowledge discovery process; pattern mining; temporal data sequences; Context; Data mining; Delay; Delay effects; Generators; Hidden Markov models; Noise; experimental evaluation; temporal mining;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on
Conference_Location :
Brussels
Print_ISBN :
978-1-4673-5164-5
Type :
conf
DOI :
10.1109/ICDMW.2012.92
Filename :
6406408