Title :
Chinese Text Classification without Automatic Word Segmentation
Author :
Liu, Wei ; Allison, Ben ; Guthrie, David ; Guthrie, Louise
Abstract :
Due to the lack of word boundaries in Asian systems of writing, machine processing of these languages often involves segmenting text into word units. This paper tests the assumption that this segmentation is a necessary step for authorship attribution and topic classification tasks in Chinese, and demonstrates that it is not. We show extensive results for both tasks, considering both single words and short phrases as features, and examining the effect of document length on classification accuracy. Our experiments show that a naïve character bigram model of text performs as well as models generated using a state-of-the-art automatic segmenter.
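As an illustration of the character bigram features mentioned in the abstract, the following minimal sketch counts overlapping character bigrams in unsegmented text. The function name `char_bigrams`, the whitespace handling, and the sample sentence are assumptions made for this example. This record does not describe the paper's actual feature extraction or classifier.

```python
from collections import Counter


def char_bigrams(text):
    """Count overlapping character bigrams in unsegmented text.

    Hypothetical helper for illustration only. Whitespace is dropped
    (an assumption) so that line breaks do not produce spurious bigrams.
    """
    chars = [c for c in text if not c.isspace()]
    # Pair each character with its successor: "ABC" -> "AB", "BC".
    return Counter(a + b for a, b in zip(chars, chars[1:]))


if __name__ == "__main__":
    # No word segmenter is involved: features come straight from characters.
    features = char_bigrams("我喜欢自然语言处理")
    for bigram, count in features.items():
        print(bigram, count)
```

Bigram counts of this kind could serve as the feature vector for a standard text classifier, in place of counts over words produced by a segmenter.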
Keywords :
Character generation; Computer science; Context modeling; Information technology; Law; Legal factors; Missiles; Natural languages; Testing; Text categorization; Chinese Segmentation; Text Classification
Conference_Title :
Advanced Language Processing and Web Information Technology, 2007. ALPIT 2007. Sixth International Conference on
Conference_Location :
Luoyang, Henan, China
Print_ISBN :
978-0-7695-2930-1
DOI :
10.1109/ALPIT.2007.19