DocumentCode :
2144724
Title :
Classifying Textual Components of Bilingual Documents with Decision-Tree Support Vector Machines
Author :
Lin, Xiao-Rong ; Guo, Chien-Yang ; Chang, Fu
Author_Institution :
Inst. of Inf. Sci., Acad. Sinica, Taipei, Taiwan
fYear :
2011
fDate :
18-21 Sept. 2011
Firstpage :
498
Lastpage :
502
Abstract :
In this paper, we propose a method for classifying textual entities of bilingual documents written in Chinese and English. In contrast to earlier works that performed classification at the level of text lines or documents, we apply our method at the level of textual components, because Chinese components must first be identified before they can be merged into intact characters and sent to a Chinese recognizer. To cope with a large training data set containing 365,672 samples, we employ a decision-tree support vector machine (DTSVM) method, which decomposes a given data space into small regions and trains local SVMs on those regions. By applying this method to train classifiers on various combinations of feature types, we were able to complete each training process within 3,500 seconds and achieve higher than 99.6% test accuracy in classifying a textual component as Chinese, alphanumeric, or punctuation. Moreover, the classification showed no strong bias towards any of the three categories.
Keywords :
decision trees; document image processing; natural language processing; pattern classification; support vector machines; Chinese components; Chinese recognizer; bilingual documents; decision tree support vector machines; textual components classification; Accuracy; Feature extraction; Shape; Support vector machines; Testing; Training; Training data; bilingual document; component; decision-tree support vector machine; script and language identification
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
ISSN :
1520-5363
Print_ISBN :
978-1-4577-1350-7
Electronic_ISBN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2011.106
Filename :
6065361