Title :
Research on the Technique of Chinese Text Classification Based on the Single Chinese Character Feature
Author :
Zhang, Yubin ; Lu, Jianfeng ; Yang, Jingyu
Author_Institution :
Sch. of Comput. Sci. & Technol., Nanjing Univ. of Sci. & Technol., Nanjing, China
Abstract :
The vast amount of unstructured text, and the importance of the information it contains, has made text mining an active research area within data mining. Text classification is an important subtask of text mining. This paper studies Chinese text classification based on single Chinese character features. The experimental results indicate that feature selection based on single Chinese characters is an effective modeling method for Chinese text classification. Information gain is used to select features, cosine distance to measure the similarity between documents, and the k-nearest neighbor (KNN) method as the classifier. Systematic comparative experiments on the news corpus from Fudan University achieve 86.92% precision and a Macro-F score of nearly 87%.
Keywords :
data mining; pattern classification; text analysis; Chinese text classification; feature selection; k-nearest neighbor method; text information; Computer science; Electronic mail; Frequency measurement; Gain measurement; Mutual information; Statistics; Text categorization; Text mining
Conference_Title :
Pattern Recognition, 2009. CCPR 2009. Chinese Conference on
Conference_Location :
Nanjing
Print_ISBN :
978-1-4244-4199-0
DOI :
10.1109/CCPR.2009.5344011
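
Illustrative sketch :
The pipeline named in the abstract (single-character features, information gain for feature selection, cosine similarity, and a KNN classifier) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy training sentences, the function names, the use of raw character counts as weights, the presence/absence form of information gain, and the values of `k` are all hypothetical choices made for this example. The abstract does not specify the paper's weighting scheme, preprocessing, or parameters.

```python
import math
from collections import Counter

def char_features(text):
    """Count each CJK character; every single character is one feature."""
    return Counter(ch for ch in text if '\u4e00' <= ch <= '\u9fff')

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(docs, labels, ch):
    """Entropy reduction from splitting documents on presence of ch."""
    present = [y for d, y in zip(docs, labels) if ch in d]
    absent = [y for d, y in zip(docs, labels) if ch not in d]
    total = len(labels)
    conditional = sum(len(part) / total * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional

def select_features(docs, labels, k):
    """Keep the k characters with the highest information gain."""
    vocab = set().union(*docs)
    ranked = sorted(vocab,
                    key=lambda ch: (-information_gain(docs, labels, ch), ch))
    return set(ranked[:k])

def restrict(vec, selected):
    """Drop every character that was not selected as a feature."""
    return {ch: n for ch, n in vec.items() if ch in selected}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(n * b[ch] for ch, n in a.items() if ch in b)
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(query, train_vecs, train_labels, k=3):
    """Majority vote among the k training documents most similar to query."""
    ranked = sorted(zip(train_vecs, train_labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus (not the Fudan University news corpus).
train = [
    ("球队赢得比赛", "sports"),
    ("球员进球比赛精彩", "sports"),
    ("股市上涨银行利率", "finance"),
    ("银行股票市场下跌", "finance"),
]
train_docs = [char_features(text) for text, _ in train]
train_labels = [label for _, label in train]

selected = select_features(train_docs, train_labels, k=7)
train_vecs = [restrict(d, selected) for d in train_docs]

def classify(text):
    """Classify raw text with the toy model built above."""
    return knn_classify(restrict(char_features(text), selected),
                        train_vecs, train_labels, k=3)

if __name__ == "__main__":
    print(classify("比赛中球员表现"))
    print(classify("银行利率上涨"))
```

Because each Chinese character is treated as a feature, no word segmentation step is needed before counting. On a real corpus the counts would usually be replaced by a weighting such as TF-IDF, which this sketch leaves out.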