DocumentCode
2335332
Title
Document clustering and cluster topic extraction in multilingual corpora
Author
Silva, Joaquim ; Mexia, João ; Coelho, Agra ; Lopes, Gabriel
Author_Institution
Univ. Nova de Lisboa, Lisbon, Portugal
fYear
2001
fDate
2001
Firstpage
513
Lastpage
520
Abstract
A statistics-based approach for clustering documents and for extracting cluster topics is described. Relevant (meaningful) expressions (REs) automatically extracted from corpora are used as clustering base features. These features are transformed and their number is strongly reduced in order to obtain a small set of document classification features. This is achieved on the basis of principal components analysis. Model-based clustering analysis finds the best number of clusters. Then, the most important REs are extracted from each cluster and taken as document cluster topics.
Keywords
data mining; document handling; pattern clustering; cluster topic extraction; document classification features; document clustering; model-based clustering analysis; multilingual corpora; principal components analysis; relevant expressions; statistics-based approach; Agriculture; Data mining; Dispersion; Feature extraction; Instruction sets; Organizing; Probability; Size measurement
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on
Conference_Location
San Jose, CA
Print_ISBN
0-7695-1119-8
Type
conf
DOI
10.1109/ICDM.2001.989559
Filename
989559