DocumentCode :
2870681
Title :
A Rule-Based Method for Thai Elementary Discourse Unit Segmentation (TED-Seg)
Author :
Ketui, N. ; Theeramunkong, Thanaruk ; Onsuwan, C.
Author_Institution :
Sch. of Inf., Comput., & Commun. Technol., Thammasat Univ., Bangkok, Thailand
fYear :
2012
fDate :
8-10 Nov. 2012
Firstpage :
195
Lastpage :
202
Abstract :
Discovering discourse units in Thai, a language without explicit word and sentence boundaries, is not a straightforward task because of its high part-of-speech (POS) ambiguity and serial verb constituents. This paper introduces definitions of Thai elementary discourse units (T-EDUs), grammar rules for T-EDU segmentation, and a longest-matching-based chart parser. The T-EDU definitions are used to construct a set of context-free grammar (CFG) rules. As a result, 446 CFG rules are constructed from 1,340 T-EDUs extracted from the NE- and POS-tagged corpus Thai-NEST. These T-EDUs are evaluated by two linguists, with a kappa score of 0.68. Separately, a two-level evaluation is applied: one in an arranged setting where the text is pre-chunked, and the other in a normal setting where the original running text is used for testing. By specifying one grammar rule per T-EDU instance, it is possible to achieve perfect recall (100%) in a closed test, where the testing corpus and the training corpus are the same, but the recall of approximately 36.16% and 31.69% is obtained for the chunked and the running texts, respectively. For an open test with 3-fold cross-validation, the recall is around 67% while the precision is only 25-28%. To improve the precision, two alternative strategies are applied: left-to-right longest matching (L2R-LM) and maximal longest matching (M-LM). The results show that L2R-LM and M-LM improve the precision to 93.97% and 94.03%, respectively, for the running text in the closed test. However, the recall drops slightly to 94.18% and 92.91%. For the running text in the open test, the f-score improves to 57.70% and 54.14% for L2R-LM and M-LM, respectively.
Keywords :
context-free grammars; knowledge based systems; natural language processing; Thai elementary discourse unit segmentation; context free grammar rules; kappa score; linguists; longest matching based chart parser; maximal longest matching; part of speech ambiguity; rule based method; serial verb constituents; Compounds; Educational institutions; Grammar; Indexes; Speech; Speech recognition; Syntactics; Chart parser; Discourse unit Segmentation; Thai Elementary Discourse Unit;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Knowledge, Information and Creativity Support Systems (KICSS), 2012 Seventh International Conference on
Conference_Location :
Melbourne, VIC
Print_ISBN :
978-1-4673-4564-4
Type :
conf
DOI :
10.1109/KICSS.2012.33
Filename :
6405529