• DocumentCode
    2730844
  • Title
    Distributed Data Stream Clustering: A Fast EM-based Approach
  • Author
    Aoying Zhou ; Feng Cao ; Ying Yan ; Chaofeng Sha ; Xiaofeng He
  • Author_Institution
    Fudan Univ., China
  • fYear
    2007
  • fDate
    15-20 April 2007
  • Firstpage
    736
  • Lastpage
    745
  • Abstract
    Clustering data streams has attracted considerable research effort recently. However, the problem has received little attention when the data streams are generated in a distributed fashion, even though such a scenario is very common in real-life applications. Several factors constrain the clustering of data streams in a distributed environment: the data records are noisy or incomplete because the distributed system is unreliable; the system must process a huge volume of data online; and communication is a potential bottleneck of the system. All these factors pose great challenges for clustering distributed data streams. In this paper, we propose an EM-based (Expectation Maximization) framework to effectively cluster distributed data streams with the above fundamental challenges in mind. In the presence of noisy or incomplete data records, our algorithms learn the distribution of the underlying data streams by maximizing the likelihood of the data clusters. A test-and-cluster strategy is proposed to reduce the average processing cost, which is especially effective for online clustering over large data streams. Our extensive experimental studies show that the proposed algorithms can achieve high accuracy with less communication cost, memory consumption and CPU time.
  • Keywords
    distributed processing; expectation-maximisation algorithm; pattern clustering; data streams clustering; distributed data stream clustering; distributed environment; expectation maximization framework; fast EM-based approach; online clustering; unreliable distributed system; Application software; Chaotic communication; Clustering algorithms; Costs; Gaussian processes; Iterative algorithms; Partitioning algorithms; Sensor fusion; Testing; Working environment noise;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on
  • Conference_Location
    Istanbul
  • Print_ISBN
    1-4244-0802-4
  • Type
    conf
  • DOI
    10.1109/ICDE.2007.367919
  • Filename
    4221722