DocumentCode :
3256231
Title :
An EM Clustering Algorithm which Produces a Dual Representation
Author :
Kim, Sun ; Wilbur, W. John
Author_Institution :
Nat. Center for Biotechnol. Inf., Nat. Inst. of Health, Bethesda, MD, USA
Volume :
2
fYear :
2011
fDate :
18-21 Dec. 2011
Firstpage :
90
Lastpage :
95
Abstract :
Clustering text documents is an important step in mining useful information on the Web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans since it cannot explain the subject of each cluster. Utilizing semantic information such as an ontology can solve this problem, but it needs a well-defined database or pre-labeled gold standard set. In this paper, we present a theme-based clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct themes, hence it converges to an optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for usual clustering performance. The experimental results show that the proposed method provides competitive performance compared to other state-of-the-art approaches. In addition, the extracted themes represent well the topics of clusters on the MEDLINE dataset.
Keywords :
Internet; data mining; medical administrative data processing; medical computing; pattern clustering; probability; text analysis; EM clustering algorithm; MEDLINE® dataset; Web resource; dual representation; information mining; multidimensional space; prelabeled gold standard set; probabilistic framework; semantic information; text based resource; text document clustering; theme based clustering algorithm; Algorithm design and analysis; Clustering algorithms; Humans; Parkinson's disease; Probabilistic logic; Vectors;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning and Applications and Workshops (ICMLA), 2011 10th International Conference on
Conference_Location :
Honolulu, HI
Print_ISBN :
978-1-4577-2134-2
Type :
conf
DOI :
10.1109/ICMLA.2011.29
Filename :
6147054