DocumentCode
3545408
Title
Vietnamese Diacritics Restoration as Sequential Tagging
Author
Nguyen Minh Trung ; Nguyen Quoc Nhan ; Nguyen Hong Phuong
fYear
2012
fDate
Feb. 27 2012-March 1 2012
Firstpage
1
Lastpage
6
Abstract
Diacritics restoration is the process of restoring the original script from diacritic-free text by correctly inserting diacritics. In this paper, the problem is cast as a sequential tagging task in which each term is tagged with its own accents. We carefully evaluated two methods, conditional random fields (CRFs) and support vector machines (SVMs), on three domains of Vietnamese: written language, spoken language, and literature, and achieved promising results. We also investigated two lexical levels: learning from letters and learning from syllables. Although the former performs worse than the latter, it gives stable results across all three language domains. The letter-level approach is therefore more useful when dealing with unknown words, or when words in a sentence are reordered and repeated for stylistic and artistic effect.
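The tagging formulation in the abstract can be illustrated with a minimal sketch of how training pairs might be derived from diacritized text at the two lexical levels. The abstract does not specify the tag set or the features used, so everything here is an assumption for illustration: the function names are hypothetical, the syllable-level tag is simply the diacritized syllable, and the letter-level tag is the diacritized letter (or "O" when the letter carries no diacritic). A CRF or SVM tagger would then be trained on such pairs.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with Vietnamese diacritics removed (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (tone marks, circumflex, breve, horn).
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The letter d-with-stroke has no canonical decomposition, so map it by hand.
    return no_marks.replace("đ", "d").replace("Đ", "D")


def syllable_examples(sentence: str):
    """Pair each diacritic-free syllable with its diacritized form as the tag."""
    syllables = unicodedata.normalize("NFC", sentence).split()
    return [(strip_diacritics(s), s) for s in syllables]


def letter_examples(sentence: str):
    """Pair each diacritic-free letter with its diacritized form, or "O"."""
    pairs = []
    for ch in unicodedata.normalize("NFC", sentence):
        if ch.isspace():
            continue
        base = strip_diacritics(ch)
        # Assumed tag scheme: "O" marks a letter that needs no diacritic.
        pairs.append((base, ch if ch != base else "O"))
    return pairs


if __name__ == "__main__":
    print(syllable_examples("đường phố"))
    print(letter_examples("phố"))
```

At the letter level the tag inventory is small and fixed, which is consistent with the abstract's observation that this level behaves stably on unknown words, whereas the syllable-level tag inventory grows with the vocabulary seen in training.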
Keywords
natural language processing; statistical analysis; support vector machines; Vietnamese diacritics restoration; conditional random fields; diacritic-free script; diacritics insertion; language domain; literature; sequential tagging task; spoken language; writing language; Accuracy; Context; Tagging; Training; Training data; Writing
fLanguage
English
Publisher
IEEE
Conference_Titel
Computing and Communication Technologies, Research, Innovation, and Vision for the Future (RIVF), 2012 IEEE RIVF International Conference on
Conference_Location
Ho Chi Minh City
Print_ISBN
978-1-4673-0307-1
Type
conf
DOI
10.1109/rivf.2012.6169816
Filename
6169816