DocumentCode :
1566514
Title :
Using topic keyword clusters for automatic document clustering
Author :
Chang, Hsi-Cheng ; Hsu, Chiun-Chieh
Author_Institution :
Dept. of Electron. Eng., Hwa Hsia Inst. of Technol., Taipei, Taiwan
Volume :
1
fYear :
2005
Firstpage :
419
Abstract :
Data clustering is a technique for grouping similar data items together so that they are easier to understand. Conventional data clustering methods, including agglomerative hierarchical clustering and partitional clustering algorithms, frequently perform unsatisfactorily on large collections of text articles, and their computational complexity increases very quickly with the number of data items. This paper presents a system for automatic document clustering that identifies topic keyword clusters in the text corpus. The proposed system adopts a multi-stage process. First, an aggressive data cleaning approach is employed to reduce the noise in the free text and to identify the topic keywords within the documents. All extracted keywords are then grouped into topic keyword clusters using the k-nearest neighbor graph approach and the keyword clustering function. Finally, all documents in the corpus are clustered based on the topic keyword clusters. The proposed method was assessed against conventional data clustering methods on a Web news collection, and the results indicate that it is an efficient and effective clustering approach.
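The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, because the abstract does not specify the cleaning rules, the similarity measure, or the keyword clustering function. The sketch assumes a small stopword list plus a document-frequency threshold for cleaning, Jaccard similarity over the sets of documents containing each keyword, a mutual k-nearest-neighbor graph, and connected components as a stand-in for the keyword clustering function. All function names and parameters (extract_keywords, knn_graph, min_df, k) are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stopword list; the paper's cleaning rules are not given.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "for", "is", "as"}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def extract_keywords(docs, min_df=2):
    """Stage 1 (assumed): drop stopwords and terms seen in fewer than min_df documents."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for tok in tokenize(text):
            if tok not in STOPWORDS:
                postings[tok].add(doc_id)
    return {term: ids for term, ids in postings.items() if len(ids) >= min_df}


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets of document ids."""
    return len(a & b) / len(a | b)


def knn_graph(postings, k=2):
    """Stage 2a (assumed): link keywords that are in each other's k nearest neighbors."""
    terms = sorted(postings)
    nearest = {}
    for t in terms:
        scored = [(jaccard(postings[t], postings[u]), u) for u in terms if u != t]
        scored = [(s, u) for s, u in scored if s > 0]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        nearest[t] = {u for _, u in scored[:k]}
    return {t: {u for u in nearest[t] if t in nearest[u]} for t in terms}


def connected_components(graph):
    """Stage 2b (stand-in for the keyword clustering function): graph components."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        clusters.append(component)
    return clusters


def cluster_documents(docs, keyword_clusters):
    """Stage 3 (assumed): label each document with its best-overlapping keyword cluster."""
    labels = []
    for text in docs:
        tokens = tokenize(text)
        overlaps = [len(tokens & cluster) for cluster in keyword_clusters]
        if overlaps and max(overlaps) > 0:
            labels.append(overlaps.index(max(overlaps)))
        else:
            labels.append(-1)  # no topic keyword matched this document
    return labels


if __name__ == "__main__":
    docs = [
        "stock market prices rise",
        "stock market prices fall",
        "football team wins match",
        "football team loses match",
    ]
    keyword_clusters = connected_components(knn_graph(extract_keywords(docs)))
    print(keyword_clusters)
    print(cluster_documents(docs, keyword_clusters))
```

On this toy corpus the two finance headlines share one label and the two football headlines share another. Clustering keywords first keeps the graph size tied to the vocabulary of retained keywords rather than to the number of documents, which is one plausible reading of the efficiency claim in the abstract.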
Keywords :
computational complexity; document handling; graph theory; Web news collection; agglomerative hierarchical clustering; automatic document clustering; computation complexity; data clustering; k-nearest neighbor graph; keyword clustering function; multistage process; partitional clustering; topic keyword clusters; Cities and towns; Cleaning; Clustering algorithms; Clustering methods; Data mining; Extraterrestrial measurements; Information management; Merging; Noise reduction; Partitioning algorithms;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Technology and Applications, 2005. ICITA 2005. Third International Conference on
Print_ISBN :
0-7695-2316-1
Type :
conf
DOI :
10.1109/ICITA.2005.303
Filename :
1488841