DocumentCode
2830379
Title
Proximity Estimation and Hardness of Short-Text Corpora
Author
Errecalde, Marcelo Luis ; Ingaramo, Diego ; Rosso, Paolo
Author_Institution
Dev. & Res. Lab. in Comput. Intell., Univ. Nac. de San Luis, San Luis
fYear
2008
fDate
1-5 Sept. 2008
Firstpage
15
Lastpage
19
Abstract
In this work, we investigate the relative hardness of short-text corpora in clustering problems and how this hardness relates to traditional similarity measures. Our approach attempts to establish a connection between the hardness of a corpus and the precision level exhibited by similarity measures, according to the results obtained with different cluster validity measures on the "ideal" clustering of each corpus. We also propose a new validity measure, named contiguity error, that allowed us to observe this connection consistently across all the collections considered.
Keywords
estimation theory; pattern clustering; text analysis; clustering problems; contiguity error; proximity estimation; short-text corpora hardness; validity measurement; Clustering algorithms; Computational intelligence; Data engineering; Databases; Euclidean distance; Expert systems; Information systems; Natural languages; Noise measurement; Vocabulary; cluster validity measures; clustering; short-text corpora
fLanguage
English
Publisher
IEEE
Conference_Titel
Database and Expert Systems Application, 2008. DEXA '08. 19th International Workshop on
Conference_Location
Turin
ISSN
1529-4188
Print_ISBN
978-0-7695-3299-8
Type
conf
DOI
10.1109/DEXA.2008.87
Filename
4624685