DocumentCode :
2830379
Title :
Proximity Estimation and Hardness of Short-Text Corpora
Author :
Errecalde, Marcelo Luis ; Ingaramo, Diego ; Rosso, Paolo
Author_Institution :
Dev. & Res. Lab. in Comput. Intell., Univ. Nac. de San Luis, San Luis
fYear :
2008
fDate :
1-5 Sept. 2008
Firstpage :
15
Lastpage :
19
Abstract :
In this work, we investigate the relative hardness of short-text corpora in clustering problems and how this hardness relates to traditional similarity measures. Our approach attempts to establish a connection between the hardness of a corpus and the precision level exhibited by similarity measures, according to the results obtained with different cluster validity measures on the "ideal" clustering of each corpus. Moreover, we propose a new validity measure, named contiguity error, which allowed us to observe this connection consistently in all the collections considered.
Keywords :
estimation theory; pattern clustering; text analysis; clustering problems; contiguity error; proximity estimation; short-text corpora hardness; validity measurement; Clustering algorithms; Computational intelligence; Data engineering; Databases; Euclidean distance; Expert systems; Information systems; Natural languages; Noise measurement; Vocabulary; cluster validity measures; clustering; short-text corpora
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Database and Expert Systems Application, 2008. DEXA '08. 19th International Workshop on
Conference_Location :
Turin
ISSN :
1529-4188
Print_ISBN :
978-0-7695-3299-8
Type :
conf
DOI :
10.1109/DEXA.2008.87
Filename :
4624685