Title :
Identifying coordinated compound words for Vietnamese word segmentation
Author :
Ngoc Anh Tran ; Thanh Tinh Dao ; Phuong Thai Nguyen
Author_Institution :
Dept. Inf. Technol., Le Quy Don Tech. Univ., Hanoi, Vietnam
Abstract :
This paper proposes a dictionary-based method for identifying coordinated compound words in Vietnamese. Whether two contiguous simple words in a text form a coordinated compound word is decided from their properties, their parts of speech, and the similarity between their definitions in the dictionary of the Vietnamese Computational Lexicon (VCL). We also use sets of synonyms and antonyms to identify coordinated compound words (coordinated di-syllable phrases) and compile a list of them. We use a number of rules to identify three- or four-syllable phrases/idioms based on relations between coordinated di-syllable phrases. We carried out two major experiments: one for identifying and creating a list of coordinated compounds, the other for improving the accuracy of Vietnamese word segmentation. The second experiment showed that the word segmentation F-score increases by 0.11% to 0.41% (the error rate decreases by 3.32% to 12.6%). This is a new approach with high practical value.
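The pairwise decision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy lexicon, its English glosses, the part-of-speech tags, the Jaccard token overlap used as the similarity measure, the 0.5 threshold, and the function names are all assumptions made for illustration, and none of the entries are taken from the VCL.

```python
# Illustrative sketch only: every entry, gloss, tag and threshold below is
# a hypothetical stand-in, not data or logic taken from the VCL or the paper.
LEXICON = {
    "quần": {"pos": "N", "definition": "garment worn on the lower body"},
    "áo": {"pos": "N", "definition": "garment worn on the upper body"},
    "sớm": {"pos": "A", "definition": "before the usual time"},
    "muộn": {"pos": "A", "definition": "after the usual time"},
    "ăn": {"pos": "V", "definition": "take food into the mouth and swallow"},
    "bàn": {"pos": "N", "definition": "furniture with a flat top and legs"},
}

# Unordered word pairs assumed to be listed as antonyms or synonyms.
ANTONYMS = {frozenset(("sớm", "muộn"))}
SYNONYMS = set()


def definition_similarity(def_a, def_b):
    """Jaccard overlap of definition tokens (an assumed similarity measure)."""
    tokens_a = set(def_a.lower().split())
    tokens_b = set(def_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_coordinated_candidate(word1, word2, threshold=0.5):
    """Return True if two contiguous simple words look like a coordinated compound."""
    entry1 = LEXICON.get(word1)
    entry2 = LEXICON.get(word2)
    if entry1 is None or entry2 is None:
        return False  # both words must be in the dictionary
    if entry1["pos"] != entry2["pos"]:
        return False  # components are required to share a part of speech
    pair = frozenset((word1, word2))
    if pair in SYNONYMS or pair in ANTONYMS:
        return True  # a listed synonym or antonym pair is accepted directly
    similarity = definition_similarity(entry1["definition"], entry2["definition"])
    return similarity >= threshold


if __name__ == "__main__":
    for first, second in [("quần", "áo"), ("sớm", "muộn"), ("ăn", "bàn")]:
        print(first, second, is_coordinated_candidate(first, second))
```

In this sketch a pair is accepted either through the synonym/antonym sets or through definition overlap, and only when both words carry the same part-of-speech tag. A real system would replace the token overlap with whatever similarity measure the dictionary definitions support.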
Keywords :
computational linguistics; grammars; natural language processing; word processing; VCL; Vietnamese computational lexicon; Vietnamese word segmentation; coordinated compound word; coordinated di-syllable phrase; dictionary-based method; part-of-speech; Compounds; Dictionaries; Mutual information; Pattern recognition; Semantics; Testing; Training; Vietnamese Computational Lexicon; coordinated compound words; new word; similarity; word segmentation;
Conference_Title :
Soft Computing and Pattern Recognition (SoCPaR), 2013 International Conference of
Conference_Location :
Hanoi
Print_ISBN :
978-1-4799-3399-0
DOI :
10.1109/SOCPAR.2013.7054145