Title :
Approximate Clustering on Distributed Data Streams
Author :
Zhang, Qi ; Liu, Jinze ; Wang, Wei
Author_Institution :
Dept. of Comput. Sci., Univ. of North Carolina, Chapel Hill, NC
Abstract :
We investigate the problem of clustering on distributed data streams. In particular, we consider the k-median clustering on stream data arriving at distributed sites which communicate through a routing tree. Distributed clustering on high speed data streams is a challenging task due to limited communication capacity, storage space, and computing power at each site. In this paper, we propose a suite of algorithms for computing (1 + epsiv) -approximate k-median clustering over distributed data streams under three different topology settings: topology-oblivious, height-aware, and path-aware. Our algorithms reduce the maximum per node transmission to polylog N (opposed to Omega(N) for transmitting the raw data). We have simulated our algorithms on a distributed stream system with both real and synthetic datasets composed of millions of data. In practice, our algorithms are able to reduce the data transmission to a small fraction of the original data. Moreover, our results indicate that the algorithms are scalable with respect to the data volume, approximation factor, and the number of sites.
Keywords :
computational complexity; distributed databases; pattern clustering; trees (mathematics); approximate k-median clustering; distributed data streams; height-aware algorithm; path-aware algorithm; routing tree; topology-oblivious algorithm; Approximation algorithms; Clustering algorithms; Computational modeling; Computer science; Data communication; Delay; Distributed computing; Monitoring; Routing; Topology;
Conference_Titel :
Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on
Conference_Location :
Cancun
Print_ISBN :
978-1-4244-1836-7
Electronic_ISBN :
978-1-4244-1837-4
DOI :
10.1109/ICDE.2008.4497522