• DocumentCode
    2369171
  • Title
    Scalable model-based clustering by working on data summaries
  • Author
    Jin, Huidong ; Wong, Man-Leung ; Leung, Kwong-Sak
  • Author_Institution
    Dept. of Inf. Syst., Lingnan Univ., Tuen Mun, China
  • fYear
    2003
  • fDate
    19-22 Nov. 2003
  • Firstpage
    91
  • Lastpage
    98
  • Abstract
    The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources. We present a two-phase scalable model-based clustering framework: first, a large data set is summarized into subclusters; then, clusters are generated directly from the summary statistics of the subclusters by a specifically designed expectation-maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably convergent EM algorithm, EMADS, which explicitly incorporates the cardinality, mean, and covariance information of each subcluster. Combined with different data summarization procedures, EMADS is used to construct two clustering systems: gEMADS and bEMADS. The experimental results demonstrate that they run several orders of magnitude faster than the classic EM algorithm with little loss of accuracy. They also generate significantly better results than other model-based clustering systems using similar computational resources.
  • Keywords
    Gaussian processes; computational complexity; covariance analysis; data mining; pattern clustering; very large databases; EM; Gaussian mixture model; bEMADS; clustering system; covariance information; data summarization procedures; expectation-maximization algorithm; gEMADS; large databases; two-phase scalable model-based clustering; Algorithm design and analysis; Bridges; Clustering algorithms; Data mining; Databases; Explosives; Information systems; Iterative algorithms; Scalability; Statistics
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2003. ICDM 2003. Third IEEE International Conference on
  • Print_ISBN
    0-7695-1978-4
  • Type
    conf
  • DOI
    10.1109/ICDM.2003.1250907
  • Filename
    1250907