Title :
Vietnamese Diacritics Restoration as Sequential Tagging
Author :
Nguyen Minh Trung ; Nguyen Quoc Nhan ; Nguyen Hong Phuong
Date :
Feb. 27 2012-March 1 2012
Abstract :
Diacritics restoration is the process of recovering the original script from diacritic-free text by inserting the correct diacritics. In this paper, the problem is cast as a sequential tagging task in which each term is tagged with its own accents. We carefully evaluated two methods, conditional random fields (CRFs) and support vector machines (SVMs), on three domains of Vietnamese (written language, spoken language, and literature) and achieved promising results. We also investigated two lexical levels: learning from letters and learning from syllables. Although the former performs worse than the latter, it gives stable results across all three language domains. The letter-level approach is therefore more useful when dealing with unknown words, or when words in a sentence are reordered and repeated for stylistic and artistic effect.
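As an illustration of the tagging formulation described in the abstract, the following minimal sketch builds (input, tag) pairs at the syllable level and at the letter level from accented Vietnamese text. The tag encodings used here (the full accented syllable as the syllable tag; the combining marks, "STROKE", or "O" as the letter tag) and all function names are assumptions made for illustration; the abstract does not specify the authors' actual tag set or features, and no CRF or SVM training is shown.

```python
import unicodedata


def strip_diacritics(text):
    """Return diacritic-free text; d-with-stroke has no canonical decomposition, so map it by hand."""
    text = unicodedata.normalize("NFC", text).replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (tone marks, circumflex, breve, horn).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def syllable_examples(sentence):
    """Syllable level: each diacritic-free syllable is paired with its accented form as the tag."""
    syllables = unicodedata.normalize("NFC", sentence).split()
    return [(strip_diacritics(s), s) for s in syllables]


def letter_examples(sentence):
    """Letter level: each base letter is paired with its combining marks ('O' when it has none)."""
    pairs = []
    for ch in unicodedata.normalize("NFC", sentence):
        if ch.isspace():
            continue
        if ch in "đĐ":
            # Assumed label for the stroke, which is not a combining mark.
            pairs.append(("d" if ch == "đ" else "D", "STROKE"))
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        pairs.append((base, marks or "O"))
    return pairs


if __name__ == "__main__":
    print(syllable_examples("tiếng Việt"))
    print(letter_examples("đế"))
```

A sequence model would then be trained to predict the tag sequence from the diacritic-free inputs. In this sketch the letter-level tag set is small and closed, while the syllable-level tags grow with the vocabulary, which is one plausible reading of why the letter level copes better with unknown words.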
Keywords :
natural language processing; statistical analysis; support vector machines; Vietnamese diacritics restoration; conditional random fields; diacritic-free script; diacritics insertion; language domain; literature; sequential tagging task; spoken language; writing language; Accuracy; Context; Tagging; Training; Training data; Writing;
Conference_Title :
Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2012 IEEE RIVF International Conference on
Conference_Location :
Ho Chi Minh City
Print_ISBN :
978-1-4673-0307-1
DOI :
10.1109/rivf.2012.6169816