Title :
A proposal of extended cosine measure for distance metric learning in text classification
Author :
Mikawa, Kenta ; Ishida, Tomoyuki ; Goto, Masayuki
Author_Institution :
Dept. of Creative Sci. & Eng., Waseda Univ., Tokyo, Japan
Abstract :
This paper discusses a new similarity measure between documents in a vector space model from the viewpoint of distance metric learning. Documents are represented as points in the vector space using the frequencies of the words that appear in each document. A similarity measure between two documents is useful for recognizing their relationship and can be applied to classification or clustering of the data. Usually, cosine similarity and Euclidean distance are used to measure the similarity between points in Euclidean space. However, when the vector space model is applied to document analysis, these measures do not take into account the correlation among the words that appear in documents. In general, many words that appear in documents are correlated with one another, depending on sentence structure, topic, and subject. Therefore, constructing a metric on the vector space that takes word correlation into account is effective for improving the performance of document classification and clustering. This paper presents a new method for acquiring a distance measure on the document vector space based on an extended cosine measure. In addition, a distance metric learning procedure is proposed to acquire a proper metric from the viewpoint of supervised learning. The effectiveness of the proposal is demonstrated by simulation experiments on text classification problems involving customer reviews posted on a web site and newspaper articles.
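Illustrative sketch (not taken from the paper): the abstract does not give the formula for the extended cosine measure, so the example below assumes one common form, in which the inner product is replaced by a bilinear form x^T M y with a symmetric positive semidefinite matrix M. With M equal to the identity it reduces to the ordinary cosine similarity. The function names, the toy vocabulary, and the values in M are all hypothetical, and the supervised learning of M proposed in the paper is not shown.

```python
import math

def bilinear(x, m, y):
    """Return x^T M y for vectors x, y and a square matrix M (nested lists)."""
    return sum(x[i] * m[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

def extended_cosine(x, y, m=None):
    """Cosine similarity under a metric matrix M (assumed form, not the paper's).

    M is assumed symmetric positive semidefinite. When M is omitted, the
    identity is used and the result is the ordinary cosine similarity.
    """
    if m is None:
        n = len(x)
        m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    denom = math.sqrt(bilinear(x, m, x) * bilinear(y, m, y))
    return bilinear(x, m, y) / denom if denom > 0 else 0.0

# Toy word-frequency vectors over a hypothetical vocabulary
# ["cheap", "inexpensive", "broken"].
doc_a = [2.0, 0.0, 0.0]
doc_b = [0.0, 3.0, 0.0]

# Hypothetical metric: "cheap" and "inexpensive" are treated as correlated.
metric = [[1.0, 0.8, 0.0],
          [0.8, 1.0, 0.0],
          [0.0, 0.0, 1.0]]

plain = extended_cosine(doc_a, doc_b)             # no shared words
weighted = extended_cosine(doc_a, doc_b, metric)  # correlation is credited
```

In this toy case the two documents share no words, so the ordinary cosine similarity is zero, while the assumed metric credits the correlation between the two distinct words and yields a positive similarity. This is the kind of effect the abstract attributes to taking word correlation into account.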
Keywords :
Web sites; data structures; electronic publishing; learning (artificial intelligence); pattern classification; pattern clustering; text analysis; Web site; data classification; data clustering; distance metric learning; document classification; document clustering; document representation; extended cosine measure; newspaper article; similarity measure; text classification; vector space model; word processing; Biological system modeling; Presses; Q measurement; Vectors; metric learning; text mining
Conference_Title :
Systems, Man, and Cybernetics (SMC), 2011 IEEE International Conference on
Conference_Location :
Anchorage, AK
Print_ISBN :
978-1-4577-0652-3
DOI :
10.1109/ICSMC.2011.6083923