DocumentCode :
586722
Title :
Chinese text categorization using the character N-gram
Author :
Suzuki, M. ; Yamagishi, N. ; Yi-Ching Tsai
Author_Institution :
Shonan Inst. of Technol., Fujisawa, Japan
fYear :
2012
fDate :
28-31 Oct. 2012
Firstpage :
722
Lastpage :
726
Abstract :
We previously proposed the accumulation method, a language-independent text classification method based on the character N-gram, and used it to classify English, Japanese, and Korean text documents. Because the method forms its index terms from character N-grams, it does not depend on language structure. If text documents are expressed in Unicode, the accumulation method can classify them using the same algorithm. In the present paper, we classify Chinese text documents, namely newspaper articles from the People's Daily 2009-2010 data set. The highest macro-averaged F-measure of the proposed method on this data set was 92.6%, so the method also yields good results for the Chinese language. Moreover, we can construct a framework whereby the computer automatically distinguishes how difficult each document is to classify.
Keywords :
classification; indexing; natural language processing; text analysis; Chinese language; Chinese text categorization; Chinese text document classification; English text document; Japanese text document; Korean text document; Unicode; accumulation method; character N-gram; index term; language structure; language-independent text classification method; macroaveraged F-measure; newspaper article; Accuracy; Computers; Indexes; Machine learning; Text categorization; Training; Vectors;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Theory and its Applications (ISITA), 2012 International Symposium on
Conference_Location :
Honolulu, HI
Print_ISBN :
978-1-4673-2521-9
Type :
conf
Filename :
6401036