• DocumentCode
    2448948
  • Title
    The performance analysis of a Chi-square similarity measure for topic related clustering of noisy transcripts
  • Author
    Ibrahimov, Oktay ; Sethi, Ishwar ; Dimitrova, Nevenka
  • Author_Institution
    Dept. of Comput. Sci. & Eng., Oakland Univ., Rochester, MI, USA
  • Volume
    4
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    285
  • Abstract
    This paper presents a novel Chi-square similarity measure and assesses its performance against well-known similarity measures such as Cosine, Dice, and Jaccard. The Chi-square similarity measure is designed to withstand the imperfections of transcribed spoken documents. Unlike other measures, it not only searches for words that co-occur in documents but also matches the informative closeness of the common words. We assume that co-occurring words used to convey the same information should have compatible significance in the documents being matched, and we test this assumption with the Chi-square method. Experimental results on an archive of transcribed news broadcasts demonstrate the high efficacy of the proposed methodology.
  • Keywords
    database indexing; information retrieval; multimedia databases; speech recognition; Chi-square similarity measure; Cosine similarity measure; Dice similarity measure; Jaccard similarity measure; co-occurring words; informative closeness; noisy transcripts; performance analysis; topic related clustering; transcribed news broadcasts; transcribed spoken documents; Automatic speech recognition; Broadcasting; Computer science; Indexing; Information retrieval; Laboratories; Multimedia communication; Performance analysis; Testing; Vocabulary;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition, 2002. Proceedings. 16th International Conference on
  • ISSN
    1051-4651
  • Print_ISBN
    0-7695-1695-X
  • Type
    conf
  • DOI
    10.1109/ICPR.2002.1047452
  • Filename
    1047452
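
The baseline measures named in the abstract (Cosine, Dice, and Jaccard) can be sketched as follows. This is a minimal illustration using their standard textbook definitions, not the authors' implementation: the function names, the use of raw term counts for Cosine, and the set-based forms of Dice and Jaccard are assumptions made here. The proposed Chi-square measure is not reproduced because the record does not define it.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity over raw term-frequency vectors (assumed weighting)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dice(tokens_a, tokens_b):
    """Dice coefficient over the sets of distinct terms."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    total = len(set_a) + len(set_b)
    return 2 * len(set_a & set_b) / total if total else 0.0


def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient over the sets of distinct terms."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Two toy "transcripts", already tokenized (hypothetical data).
doc_a = ["news", "weather", "report"]
doc_b = ["weather", "report", "today", "sports"]

print(cosine(doc_a, doc_b), dice(doc_a, doc_b), jaccard(doc_a, doc_b))
```

All three baselines score only the overlap of co-occurring words. According to the abstract, the Chi-square measure additionally matches the informative closeness of the common words.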