DocumentCode
2886150
Title
Generating Diverse Realistic Data Sets for Episode Mining
Author
Zimmermann, Albrecht
Author_Institution
KU Leuven, Leuven, Belgium
fYear
2012
fDate
10-10 Dec. 2012
Firstpage
611
Lastpage
618
Abstract
Frequent episode mining has been proposed as a data mining task with the goal of recovering sequential patterns from temporal data sequences. While several episode mining approaches have been proposed in the last fifteen years, most of the developed techniques have not been evaluated on a common benchmark data set, which limits the insights gained from experimental evaluations. In particular, it is unclear how well episodes are actually being recovered, leaving an episode mining user without guidelines in the knowledge discovery process. One reason for this is that non-disclosure agreements prevent the real-life data sets on which approaches have been evaluated from entering the public domain. But even easily accessible real-life data sets would not allow one to ascertain miners' abilities to identify underlying patterns. A solution to this problem is to generate artificial data, which has the added advantage that the patterns are known, so that the accuracy of mined patterns can be evaluated. Based on insights and experiences stemming from consultations with industrial partners and work with real-life data, we propose a data generator that produces diverse data sets reflecting realistic data characteristics. We discuss in detail which characteristics real-life data can be expected to have and how our generator models them. Finally, we show that we can recreate artificial data that has been used in the literature, contrast it with real-life data showing very different characteristics, and show how our generator can be used to create data with realistic characteristics.
Keywords
data mining; pattern recognition; benchmark data set; data characteristics; episode mining; generating diverse realistic data sets; knowledge discovery process; pattern mining; temporal data sequences; Context; Delay; Delay effects; Generators; Hidden Markov models; Noise; experimental evaluation; temporal mining
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on
Conference_Location
Brussels
Print_ISBN
978-1-4673-5164-5
Type
conf
DOI
10.1109/ICDMW.2012.92
Filename
6406408