• DocumentCode
    2830379
  • Title
    Proximity Estimation and Hardness of Short-Text Corpora
  • Author
    Errecalde, Marcelo Luis ; Ingaramo, Diego ; Rosso, Paolo
  • Author_Institution
    Dev. & Res. Lab. in Comput. Intell., Univ. Nac. de San Luis, San Luis
  • fYear
    2008
  • fDate
    1-5 Sept. 2008
  • Firstpage
    15
  • Lastpage
    19
  • Abstract
    In this work, we investigate the relative hardness of short-text corpora in clustering problems and how this hardness relates to traditional similarity measures. Our approach attempts to establish a connection between the hardness of a corpus and the precision level exhibited by similarity measures, according to the results obtained with different cluster validity measures on the "ideal" clustering of each corpus. We also propose a new validity measure, named contiguity error, that allowed us to observe this connection consistently across all the collections considered.
  • Keywords
    estimation theory; pattern clustering; text analysis; clustering problems; contiguity error; proximity estimation; short-text corpora hardness; validity measurement; Clustering algorithms; Computational intelligence; Data engineering; Databases; Euclidean distance; Expert systems; Information systems; Natural languages; Noise measurement; Vocabulary; cluster validity measures; clustering; short-text corpora
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database and Expert Systems Application, 2008. DEXA '08. 19th International Workshop on
  • Conference_Location
    Turin
  • ISSN
    1529-4188
  • Print_ISBN
    978-0-7695-3299-8
  • Type
    conf
  • DOI
    10.1109/DEXA.2008.87
  • Filename
    4624685