Title :
A Study on Automatic Chinese Text Classification
Author :
Luo, Xi ; Ohyama, Wataru ; Wakabayashi, Tetsushi ; Kimura, Fumitaka
Author_Institution :
Grad. Sch. of Eng., Mie Univ., Tsu, Japan
Abstract :
In this paper, we perform Chinese text classification using N-gram (uni-gram, bi-gram, and mixed uni-gram/bi-gram) frequency features instead of word frequency features to represent documents, and we propose using mixed uni-gram/bi-gram features after feature transformation. We further propose a serial approach based on feature transformation and dimension reduction techniques to improve performance. Experimental results show that the proposed approach is efficient and effective at improving Chinese text classification performance. Furthermore, we present several experiments evaluating feature selection based on part-of-speech analysis; the results show that a suitable combination of parts of speech can lead to better classification performance.
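As an illustration of the document representation described in the abstract, the following is a minimal sketch of a mixed uni-gram/bi-gram frequency feature computed over characters, which avoids word segmentation. The function names `char_ngrams` and `mixed_ngram_frequencies` are hypothetical, and the details (dropping whitespace, keeping punctuation, normalizing by the combined count) are assumptions made for this example rather than the authors' procedure. The feature transformation and dimension reduction steps are not shown.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text, skipping whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def mixed_ngram_frequencies(text):
    """Map each uni-gram and bi-gram to its relative frequency in text.

    Assumption: uni-gram and bi-gram counts are pooled and divided by
    their combined total; the paper may normalize differently.
    """
    counts = Counter(char_ngrams(text, 1)) + Counter(char_ngrams(text, 2))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Example: six characters give 6 uni-grams and 5 bi-grams (11 in total).
features = mixed_ngram_frequencies("中文文本分类")
```

In this example the character "文" occurs twice among the 11 pooled N-grams, so its relative frequency is 2/11. A classifier would use such values, aligned to a fixed vocabulary, as the entries of the document vector.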
Keywords :
classification; grammars; natural language processing; text analysis; N-gram frequency feature; part-of-speech analysis; automatic Chinese text classification; bi-gram; classification performance; dimension reduction techniques; document representation; feature transformation; uni-gram; word frequency; Kernel; Machine learning; Principal component analysis; Support vector machine classification; Text categorization; Vectors; Chinese text classification/categorization; N-gram; dimension reduction; feature selection; part-of-speech; principal component analysis; support vector machines;
Conference_Title :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4577-1350-7
ISSN :
1520-5363
DOI :
10.1109/ICDAR.2011.187