Regional Information Center for Science and Technology - Chinese text categorization using the character N-gram

DocumentCode :

586722

Title :

Chinese text categorization using the character N-gram

Author :

Suzuki, M. ; Yamagishi, N. ; Yi-Ching Tsai

Author_Institution :

Shonan Inst. of Technol., Fujisawa, Japan

fYear :

2012

fDate :

28-31 Oct. 2012

Firstpage :

722

Lastpage :

726

Abstract :

We previously proposed the accumulation method, a language-independent text classification method based on the character N-gram, and used it to classify English, Japanese, and Korean text documents. The accumulation method does not depend on language structure, because it uses character N-grams to form index terms. If text documents are expressed in Unicode, the accumulation method can classify them using the same algorithm. In the present paper, we classify Chinese text documents, namely newspaper articles from the People's Daily 2009-2010 data set. The highest macro-averaged F-measure of the proposed method on this data set was 92.6%. Thus, we obtain good results for the Chinese language. Moreover, we can construct a framework in which the computer automatically judges how difficult each document is to classify.
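
The abstract does not describe the accumulation method's scoring step, so the sketch below is not that method. It only illustrates two ingredients the abstract names: forming index terms from character N-grams over Unicode text, and computing a macro-averaged F-measure. The function names (`char_ngrams`, `index_terms`, `macro_f1`) and the sample strings are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character N-grams of `text`.

    Python strings are sequences of Unicode code points, so the same
    slicing works for Chinese, Japanese, Korean, or English text
    without any word segmentation.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def index_terms(text, n):
    """Count how often each character N-gram occurs in `text`."""
    return Counter(char_ngrams(text, n))


def macro_f1(gold, predicted):
    """Macro-averaged F-measure: unweighted mean of per-category F1."""
    categories = sorted(set(gold) | set(predicted))
    scores = []
    for c in categories:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == c and p == c)
        fp = sum(1 for g, p in pairs if g != c and p == c)
        fn = sum(1 for g, p in pairs if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # The same call handles a Chinese string and an English string.
    print(char_ngrams("人民日报", 2))
    print(char_ngrams("text", 2))
    print(index_terms("abab", 2))
    print(macro_f1(["a", "a", "b"], ["a", "b", "b"]))
```

Because the N-grams are taken over code points rather than words, no language-specific tokenizer is needed, which is the property the abstract relies on when it calls the method language-independent.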

Keywords :

classification; indexing; natural language processing; text analysis; Chinese language; Chinese text categorization; Chinese text document classification; English text document; Japanese text document; Korean text document; Unicode; accumulation method; character N-gram; index term; language structure; language-independent text classification method; macroaveraged F-measure; newspaper article; Accuracy; Computers; Indexes; Machine learning; Text categorization; Training; Vectors;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Information Theory and its Applications (ISITA), 2012 International Symposium on

Conference_Location :

Honolulu, HI

Print_ISBN :

978-1-4673-2521-9

Type :

conf

Filename :

6401036

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=586722