Regional Information Center for Science and Technology - Chinese Text Classification without Automatic Word Segmentation

DocumentCode :

3104966

Title :

Chinese Text Classification without Automatic Word Segmentation

Author :

Liu, Wei ; Allison, Ben ; Guthrie, David ; Guthrie, Louise

Year :

2007

Date :

22-24 Aug. 2007

Firstpage :

Lastpage :

Abstract :

Due to the lack of word boundaries in Asian systems of writing, machine processing of these languages often involves segmenting text into word units. This paper tests the assumption that this segmentation is a necessary step for authorship attribution and topic classification tasks in Chinese, and demonstrates that it is not. We show extensive results for both tasks, considering both single words and short phrases as features, and examining the effect of document length on classification accuracy. Our experiments show that a naïve character bigram model of text performs as well as models generated using a state-of-the-art automatic segmenter.
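To illustrate the character bigram idea mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It extracts overlapping character bigrams directly from unsegmented text and feeds them to a multinomial Naive Bayes classifier with add-one smoothing. The choice of classifier, the smoothing, the names `char_bigrams` and `BigramNaiveBayes`, and the toy training data are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Return overlapping character bigrams, skipping whitespace.

    No word segmenter is involved: every adjacent pair of characters
    becomes one feature.
    """
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


class BigramNaiveBayes:
    """Multinomial Naive Bayes over character bigrams (illustrative only)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # label -> bigram counts
        self.doc_counts = Counter()         # label -> number of documents
        self.vocab = set()                  # all bigrams seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_bigrams(text)
            self.counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_bigrams(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            total = sum(self.counts[label].values())
            # Log prior plus add-one smoothed log likelihood of each bigram.
            score = math.log(self.doc_counts[label] / total_docs)
            for g in grams:
                score += math.log((self.counts[label][g] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data: two topics, two short documents each.
train_texts = ["足球比赛", "篮球比赛", "股票市场", "股票基金"]
train_labels = ["sports", "sports", "finance", "finance"]
model = BigramNaiveBayes().fit(train_texts, train_labels)
```

With this toy model, `model.predict("足球")` selects the label whose training documents contain that bigram. A real experiment would need a proper corpus and held-out evaluation, as the paper describes.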

Keywords :

Character generation; Computer science; Context modeling; Information technology; Law; Legal factors; Missiles; Natural languages; Testing; Text categorization; Chinese Segmentation; Text Classification

Language :

English

Publisher :

IEEE

Conference_Title :

Advanced Language Processing and Web Information Technology, 2007. ALPIT 2007. Sixth International Conference on

Conference_Location :

Luoyang, Henan, China

Print_ISBN :

978-0-7695-2930-1

Type :

conf

DOI :

10.1109/ALPIT.2007.19

Filename :

4460613

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3104966