Title :
Incremental document clustering using cluster similarity histograms
Author :
Hammouda, Khaled M. ; Kamel, Mohamed S.
Author_Institution :
Dept. of Syst. Design Eng., Waterloo Univ., Ont., Canada
Abstract :
Clustering of large collections of text documents is a key process in providing a higher level of knowledge about the underlying inherent classification of the documents. Web documents, in particular, are of great interest since managing, accessing, searching, and browsing large repositories of Web content requires efficient organization. Incremental clustering algorithms are always preferred to traditional clustering techniques, since they can be applied in a dynamic environment such as the Web. An incremental document clustering algorithm is introduced, which relies only on pair-wise document similarity information. Clusters are represented using a cluster similarity histogram, a concise statistical representation of the distribution of similarities within each cluster, which provides a measure of cohesiveness. The measure guides the incremental clustering process. Complexity analysis and experimental results are discussed and show that the algorithm requires less computational time than standard methods while achieving a comparable or better clustering quality.
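The abstract names the ingredients (pair-wise similarities, a per-cluster similarity histogram, and a cohesiveness measure that guides insertion) but not the exact rules. The following is a minimal sketch of one possible reading, not the authors' algorithm. The Jaccard similarity, the bin count, the `threshold` and `min_cohesion` parameters, the first-fit acceptance rule, and all function and class names are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Toy pair-wise similarity (an assumption): word-set overlap in [0, 1]."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def histogram(sims: Sequence[float], bins: int = 10) -> List[int]:
    """Count similarities (each in [0, 1]) per equal-width bin."""
    counts = [0] * bins
    for s in sims:
        # A similarity of exactly 1.0 is placed in the last bin.
        counts[min(int(s * bins), bins - 1)] += 1
    return counts


def cohesiveness(sims: Sequence[float], threshold: float = 0.5) -> float:
    """Hypothetical cohesiveness: share of pair-wise similarities at or
    above `threshold`. A cluster with no pairs counts as fully cohesive."""
    if not sims:
        return 1.0
    return sum(s >= threshold for s in sims) / len(sims)


class Cluster:
    def __init__(self) -> None:
        self.members: List[str] = []
        self.sims: List[float] = []  # all pair-wise similarities in the cluster


def add_incrementally(
    doc: str,
    clusters: List[Cluster],
    sim: Callable[[str, str], float],
    threshold: float = 0.5,
    min_cohesion: float = 0.6,
) -> Cluster:
    """Place `doc` in the first cluster whose cohesiveness would stay at or
    above `min_cohesion` (assumed rule); otherwise start a new cluster."""
    for cluster in clusters:
        new_sims = [sim(doc, member) for member in cluster.members]
        if cohesiveness(cluster.sims + new_sims, threshold) >= min_cohesion:
            cluster.members.append(doc)
            cluster.sims.extend(new_sims)
            return cluster
    fresh = Cluster()
    fresh.members.append(doc)
    clusters.append(fresh)
    return fresh


clusters: List[Cluster] = []
for text in [
    "web document clustering",
    "incremental web document clustering",
    "protein folding simulation",
]:
    add_incrementally(text, clusters, jaccard)

print([c.members for c in clusters])
```

In this toy run the two clustering-related strings share a cluster and the unrelated one starts its own. Each document is compared only with existing cluster members as it arrives, which is what makes the scheme incremental.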
Keywords :
Internet; Web sites; computational complexity; content-based retrieval; document handling; pattern clustering; Web document; cluster similarity histogram representation; complexity analysis; document clustering algorithm; pair-wise document similarity information; statistical representation; text document; Algorithm design and analysis; Clustering algorithms; Clustering methods; Content management; Design engineering; Electronic mail; Histograms; Knowledge engineering; Systems engineering and theory
Conference_Title :
Web Intelligence, 2003. WI 2003. Proceedings. IEEE/WIC International Conference on
Print_ISBN :
0-7695-1932-6
DOI :
10.1109/WI.2003.1241276