• DocumentCode
    2534298
  • Title
    Distribution-based synthetic database generation techniques for itemset mining
  • Author
    Ramesh, Ganesh ; Zaki, Mohammed J. ; Maniatty, William A.
  • Author_Institution
    British Columbia Univ., Vancouver, BC, Canada
  • fYear
    2005
  • fDate
    25-27 July 2005
  • Firstpage
    307
  • Lastpage
    316
  • Abstract
    The resource requirements of frequent pattern mining algorithms depend mainly on the length distribution of the mined patterns in the database. Synthetic databases, which are used to benchmark the performance of algorithms, tend to have distributions that differ greatly from those observed in real datasets. In this paper we focus on the problem of synthetic database generation and propose algorithms that embed any given set of maximal pattern collections within the database. We make the following contributions: (1) A database generation technique is presented that takes k maximal itemset collections as input and constructs a database that produces these maximal collections as output when mined at k levels of support. To analyze the efficiency of the procedure, upper bounds are provided on the number of transactions in the generated database. (2) A compression method is used and extended to reduce the size of the output database, and an optimization to the generation procedure is provided that could potentially reduce the number of transactions generated. (3) Preliminary experimental results are presented to demonstrate the feasibility of the generation technique.
  • Keywords
    data mining; pattern clustering; database mining; distribution-based synthetic database generation; frequent pattern mining algorithms; itemset mining; maximal pattern collections; resource requirements; Availability; Data mining; Engineering profession; Intellectual property; Itemsets; Performance analysis; Statistical distributions; Transaction databases; US Department of Energy; Upper bound
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database Engineering and Application Symposium, 2005. IDEAS 2005. 9th International
  • ISSN
    1098-8068
  • Print_ISBN
    0-7695-2404-4
  • Type
    conf
  • DOI
    10.1109/IDEAS.2005.22
  • Filename
    1540921