DocumentCode
1862528
Title
Research on the categorization accuracy of different similarity measures on Chinese texts
Author
Li, Xiangdong ; Liu, Hangyu ; Jia, Han ; Huang, Li
Author_Institution
Sch. of Inf. Manage., Wuhan Univ., Wuhan, China
Volume
4
fYear
2011
fDate
13-15 May 2011
Firstpage
224
Lastpage
227
Abstract
This paper examines one of the most intensively studied algorithms, the k-Nearest Neighbor (kNN) algorithm. The purpose is to investigate how different similarity measures perform in kNN on Chinese texts. The two measures we focus on are the cosine value and the Jensen-Shannon divergence. We use both the Sogou corpus, whose data is extracted from the Sohu.com website, and datasets that we processed from real-world sources. The experimental results indicate that the choice of similarity metric significantly affects categorization accuracy.
Keywords
Web sites; natural language processing; text analysis; Chinese texts; Jensen-Shannon divergence; Sogou; Sohu.com; Web site; categorization accuracy; cosine value; k-nearest neighbor algorithm; similarity measure; Accuracy; Classification algorithms; Entropy; Libraries; Machine learning algorithms; Support vector machine classification; Text categorization; Chinese text categorization; KNN algorithm; Similarity; Sougou Corpus;
fLanguage
English
Publisher
IEEE
Conference_Titel
Business Management and Electronic Information (BMEI), 2011 International Conference on
Conference_Location
Guangzhou
Print_ISBN
978-1-61284-108-3
Type
conf
DOI
10.1109/ICBMEI.2011.5920956
Filename
5920956