• DocumentCode
    2380976
  • Title
    A proposal of extended cosine measure for distance metric learning in text classification
  • Author
    Mikawa, Kenta ; Ishida, Tomoyuki ; Goto, Masayuki
  • Author_Institution
    Dept. of Creative Sci. & Eng., Waseda Univ., Tokyo, Japan
  • fYear
    2011
  • fDate
    9-12 Oct. 2011
  • Firstpage
    1741
  • Lastpage
    1746
  • Abstract
    This paper discusses a new similarity measure between documents in the vector space model from the viewpoint of distance metric learning. Documents are represented as points in a vector space using the frequencies of the words that appear in each document. A similarity measure between two documents is useful for recognizing how they are related and can be applied to classification or clustering of the data. Usually, cosine similarity or Euclidean distance is used to measure the similarity between points in Euclidean space. However, when the vector space model is applied to document analysis, these measures do not take into account the correlation among the words that appear in documents. In general, many words that appear in documents are correlated with one another, depending on sentence structure, topic, and subject. It is therefore effective to construct a metric on the vector space that takes word correlation into account, in order to improve the performance of document classification and clustering. This paper presents a new method for obtaining a distance measure on the document vector space based on an extended cosine measure. In addition, a distance metric learning procedure is proposed that acquires a suitable metric in a supervised manner. The effectiveness of the proposal is demonstrated by simulation experiments on two text classification problems: customer reviews posted on a web site and newspaper articles.
  • Keywords
    Web sites; data structures; electronic publishing; learning (artificial intelligence); pattern classification; pattern clustering; text analysis; Web site; data classification; data clustering; distance metric learning; document classification; document clustering; document representation; extended cosine measure; newspaper article; similarity measure; text classification; vector space model; word processing; Biological system modeling; Presses; Q measurement; Vectors; metric learning; text mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Systems, Man, and Cybernetics (SMC), 2011 IEEE International Conference on
  • Conference_Location
    Anchorage, AK
  • ISSN
    1062-922X
  • Print_ISBN
    978-1-4577-0652-3
  • Type
    conf
  • DOI
    10.1109/ICSMC.2011.6083923
  • Filename
    6083923
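
The abstract describes a cosine measure extended so that correlated words contribute to document similarity. This record does not give the paper's formula, so the following is only a minimal sketch of one common generalization: cosine similarity under the inner product x^T M y, where M is a positive semidefinite metric matrix. The function name `extended_cosine`, the toy vocabulary, and the values in `M` are illustrative assumptions, not taken from the paper; in a metric learning setting M would be fitted from labeled documents rather than set by hand.

```python
import numpy as np

def extended_cosine(x, y, M):
    """Cosine similarity under the inner product <x, y>_M = x^T M y.

    Assumption: M is symmetric positive semidefinite and x, y have
    nonzero norm under M. With M = I this is the ordinary cosine.
    """
    num = x @ M @ y
    den = np.sqrt(x @ M @ x) * np.sqrt(y @ M @ y)
    return num / den

# Hypothetical 3-word vocabulary: ["cheap", "inexpensive", "broken"]
x = np.array([1.0, 0.0, 0.0])  # document using only "cheap"
y = np.array([0.0, 1.0, 0.0])  # document using only "inexpensive"

identity = np.eye(3)
# Hand-picked matrix that treats the first two words as correlated.
M = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

plain = extended_cosine(x, y, identity)  # no shared words, no credit
extended = extended_cosine(x, y, M)      # correlation in M gives credit
print(plain, extended)
```

With the identity matrix the two documents share no words and score 0; with the correlated M the off-diagonal entry lets them score 0.8, which is the kind of effect the abstract attributes to taking word correlation into account.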