• DocumentCode
    3134663
  • Title
    A Monte Carlo sampling method for drawing representative samples from large databases
  • Author
    Guo, Hong; Hou, Wen-Chi; Yan, Feng; Zhu, Qiang
  • Author_Institution
    Dept. of Comput. Sci., Southern Illinois Univ., Carbondale, IL, USA
  • fYear
    2004
  • fDate
    21-23 June 2004
  • Firstpage
    419
  • Lastpage
    420
  • Abstract
    Sampling is important in areas such as data mining, OLAP, selectivity estimation, and clustering. It has also become a necessity in social, economic, engineering, scientific, and statistical studies where databases are too large to handle. In this paper, a sampling method based on the Metropolis algorithm is proposed. Unlike conventional uniform sampling methods, this method is able to select objects consistent with the underlying probability distribution. It is a simple, efficient, and powerful method suitable for all distributions. We have performed experiments to examine the quality of the samples by comparing their statistical properties with those of the underlying population. The experimental results show that the samples selected by our method are genuinely representative.
  • Keywords
    Monte Carlo methods; data mining; sampling methods; statistical databases; statistical distributions; very large databases; Metropolis algorithm; Monte Carlo sampling method; OLAP; data clustering; economical studies; engineering studies; large databases; object selection; probability distribution; representative samples; scientific studies; selectivity estimation; social studies; statistical property comparison; statistical studies; Clustering algorithms; Data engineering; Databases; Engineering drawings; Power engineering and energy; Power generation economics
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Proceedings of the 16th International Conference on Scientific and Statistical Database Management, 2004
  • ISSN
    1099-3371
  • Print_ISBN
    0-7695-2146-0
  • Type
    conf
  • DOI
    10.1109/SSDM.2004.1311239
  • Filename
    1311239
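
The abstract names the Metropolis algorithm but this record does not describe the procedure itself. As a rough illustration only, and not the authors' method, the Python sketch below shows one generic way a Metropolis sampler can draw records in proportion to a target weight. The names `metropolis_sample`, `weight`, and `burn_in`, and the choice of a uniform random proposal over the records, are assumptions made for this example.

```python
import random


def metropolis_sample(records, weight, n_samples, burn_in=1000, seed=0):
    """Draw n_samples records with long-run frequency proportional to weight(record).

    Hypothetical helper for illustration; `weight` is any non-negative,
    unnormalized score, so no normalizing constant is ever computed.
    """
    rng = random.Random(seed)
    current = rng.choice(records)
    samples = []
    for step in range(burn_in + n_samples):
        # Symmetric proposal (assumption): pick any record uniformly at random.
        candidate = rng.choice(records)
        w_cur = weight(current)
        w_new = weight(candidate)
        # Metropolis rule: always accept a move to a heavier record,
        # accept a move to a lighter one with probability w_new / w_cur.
        # A zero-weight current state is always abandoned.
        if w_cur == 0 or rng.random() < min(1.0, w_new / w_cur):
            current = candidate
        # Discard the first burn_in steps so the start point has little influence.
        if step >= burn_in:
            samples.append(current)
    return samples
```

A call such as `metropolis_sample([0, 1, 2, 3], lambda r: r + 1, 20000)` should return record `3` in roughly 40% of the draws, since its weight is 4 out of a total of 10. Consecutive draws are correlated, so a practical sampler would typically thin the chain or use more steps than the desired sample size.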