• DocumentCode
    28590
  • Title
    A Similarity Measure for Text Classification and Clustering
  • Author
    Yung-Shen Lin; Jung-Yi Jiang; Shie-Jue Lee
  • Author_Institution
    Dept. of Electr. Eng., Nat. Sun Yat-Sen Univ., Kaohsiung, Taiwan
  • Volume
    26
  • Issue
    7
  • fYear
    2014
  • fDate
    Jul-14
  • Firstpage
    1575
  • Lastpage
    1590
  • Abstract
    Measuring the similarity between documents is an important operation in the text processing field. In this paper, a new similarity measure is proposed. To compute the similarity between two documents with respect to a feature, the proposed measure takes the following three cases into account: a) the feature appears in both documents, b) the feature appears in only one document, and c) the feature appears in neither document. For the first case, the similarity increases as the difference between the two involved feature values decreases. Furthermore, the contribution of the difference is normally scaled. For the second case, a fixed value is contributed to the similarity. For the last case, the feature has no contribution to the similarity. The proposed measure is extended to gauge the similarity between two sets of documents. The effectiveness of our measure is evaluated on several real-world data sets for text classification and clustering problems. The results show that the performance obtained by the proposed measure is better than that achieved by other measures.
  • Keywords
    pattern clustering; text analysis; feature values; fixed value; real-world data sets; similarity measure; text classification; text clustering; text processing field; Approximation methods; Clustering algorithms; Educational institutions; Euclidean distance; Text processing; Vectors; Document classification; accuracy; classifiers; clustering algorithms; document clustering; entropy
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2013.19
  • Filename
    6420834