DocumentCode :
2710528
Title :
Clustering Documents with Active Learning Using Wikipedia
Author :
Huang, Anna ; Milne, David ; Frank, Eibe ; Witten, Ian H.
Author_Institution :
Dept. of Comput. Sci., Univ. of Waikato, Hamilton
fYear :
2008
fDate :
15-19 Dec. 2008
Firstpage :
839
Lastpage :
844
Abstract :
Wikipedia has been applied as a background knowledge base to various text mining problems, but very few attempts have been made to utilize it for document clustering. In this paper we propose to exploit the semantic knowledge in Wikipedia for clustering, enabling the automatic grouping of documents with similar themes. Although clustering is intrinsically unsupervised, recent research has shown that incorporating supervision improves clustering performance, even when only limited supervision is provided. The approach presented in this paper applies supervision using active learning. We first utilize Wikipedia to create a concept-based representation of a text document, with each concept associated with a Wikipedia article. We then exploit the semantic relatedness between Wikipedia concepts to find pair-wise instance-level constraints for supervised clustering, guiding clustering in the direction indicated by the constraints. We test our approach on three standard text document datasets. Empirical results show that our basic document representation strategy yields performance comparable to previous attempts, and that adding constraints improves clustering performance further by up to 20%.
Keywords :
Web sites; data mining; knowledge representation; pattern clustering; text analysis; unsupervised learning; Wikipedia; active learning; document clustering; semantic knowledge based representation; text document dataset; text mining; document representation; text clustering
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on
Conference_Location :
Pisa
ISSN :
1550-4786
Print_ISBN :
978-0-7695-3502-9
Type :
conf
DOI :
10.1109/ICDM.2008.80
Filename :
4781188