Title :
Proximity Estimation and Hardness of Short-Text Corpora
Author :
Errecalde, Marcelo Luis ; Ingaramo, Diego ; Rosso, Paolo
Author_Institution :
Dev. & Res. Lab. in Comput. Intell., Univ. Nac. de San Luis, San Luis
Abstract :
In this work, we investigate the relative hardness of short-text corpora in clustering problems and how this hardness relates to traditional similarity measures. Our approach attempts to establish a connection between the hardness of a corpus and the precision level exhibited by similarity measures, according to the results obtained with different cluster validity measures on the "ideal" clustering of each corpus. We also propose a new validity measure, named contiguity error, that allowed us to observe this connection consistently across all the collections considered.
Keywords :
estimation theory; pattern clustering; text analysis; clustering problems; contiguity error; proximity estimation; short-text corpora hardness; validity measurement; Clustering algorithms; Computational intelligence; Data engineering; Databases; Euclidean distance; Expert systems; Information systems; Natural languages; Noise measurement; Vocabulary; cluster validity measures; clustering; short-text corpora
Conference_Title :
19th International Workshop on Database and Expert Systems Application, 2008 (DEXA '08)
Conference_Location :
Turin
Print_ISBN :
978-0-7695-3299-8
DOI :
10.1109/DEXA.2008.87