• DocumentCode
    2886150
  • Title

    Generating Diverse Realistic Data Sets for Episode Mining

  • Author

    Zimmermann, Armin

  • Author_Institution
    KU Leuven, Leuven, Belgium
  • fYear
    2012
  • fDate
    10-10 Dec. 2012
  • Firstpage
    611
  • Lastpage
    618
  • Abstract
    Frequent episode mining has been proposed as a data mining task with the goal of recovering sequential patterns from temporal data sequences. While several episode mining approaches have been proposed in the last fifteen years, most of the developed techniques have not been evaluated on a common benchmark data set, which limits the insights gained from experimental evaluations. In particular, it is unclear how well episodes are actually being recovered, leaving an episode mining user without guidelines in the knowledge discovery process. One reason for this is that non-disclosure agreements prevent the real-life data sets on which approaches have been evaluated from entering the public domain. But even easily accessible real-life data sets would not make it possible to ascertain miners' abilities to identify underlying patterns, because the underlying patterns are not known. A solution to this problem is to generate artificial data, which has the added advantage that the patterns are known, so the accuracy of mined patterns can be evaluated. Based on insights and experiences stemming from consultations with industrial partners and work with real-life data, we propose a data generator that produces diverse data sets reflecting realistic data characteristics. We discuss in detail which characteristics real-life data can be expected to have and how our generator models them. Finally, we show that we can recreate artificial data that has been used in the literature, contrast it with real-life data showing very different characteristics, and show how our generator can be used to create data with realistic characteristics.
  • Keywords
    data mining; pattern recognition; benchmark data set; data characteristics; episode mining; generating diverse realistic data sets; knowledge discovery process; pattern mining; temporal data sequences; Context; Data mining; Delay; Delay effects; Generators; Hidden Markov models; Noise; experimental evaluation; temporal mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on
  • Conference_Location
    Brussels
  • Print_ISBN
    978-1-4673-5164-5
  • Type

    conf

  • DOI
    10.1109/ICDMW.2012.92
  • Filename
    6406408