DocumentCode :
3104966
Title :
Chinese Text Classification without Automatic Word Segmentation
Author :
Liu, Wei ; Allison, Ben ; Guthrie, David ; Guthrie, Louise
fYear :
2007
fDate :
22-24 Aug. 2007
Firstpage :
45
Lastpage :
50
Abstract :
Due to the lack of word boundaries in Asian systems of writing, machine processing of these languages often involves segmenting text into word units. This paper tests the assumption that this segmentation is a necessary step for authorship attribution and topic classification tasks in Chinese, and demonstrates that it is not. We show extensive results for both tasks, considering both single words and short phrases as features, and examining the effect of document length on classification accuracy. Our experiments show that a naïve character bigram model of text performs as well as models generated using a state-of-the-art automatic segmenter.
Keywords :
Character generation; Computer science; Context modeling; Information technology; Law; Legal factors; Missiles; Natural languages; Testing; Text categorization; Chinese Segmentation; Text Classification
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Advanced Language Processing and Web Information Technology, 2007. ALPIT 2007. Sixth International Conference on
Conference_Location :
Luoyang, Henan, China
Print_ISBN :
978-0-7695-2930-1
Type :
conf
DOI :
10.1109/ALPIT.2007.19
Filename :
4460613