Regional Information Center for Science and Technology - A Rule-Based Method for Thai Elementary Discourse Unit Segmentation (TED-Seg)

DocumentCode :

2870681

Title :

A Rule-Based Method for Thai Elementary Discourse Unit Segmentation (TED-Seg)

Author :

Ketui, N. ; Theeramunkong, Thanaruk ; Onsuwan, C.

Author_Institution :

Sch. of Inf., Comput., & Commun. Technol., Thammasat Univ., Bangkok, Thailand

fYear :

2012

fDate :

8-10 Nov. 2012

Firstpage :

195

Lastpage :

202

Abstract :

Discovering discourse units in Thai, a language without word and sentence boundaries, is not a straightforward task because of its high part-of-speech (POS) ambiguity and serial verb constituents. This paper introduces definitions of Thai elementary discourse units (T-EDUs), grammar rules for T-EDU segmentation, and a longest-matching-based chart parser. The T-EDU definitions are used to construct a set of context-free grammar (CFG) rules. As a result, 446 CFG rules are constructed from 1,340 T-EDUs extracted from the NE- and POS-tagged corpus Thai-NEST. These T-EDUs are evaluated by two linguists, and the kappa score is 0.68. Separately, a two-level evaluation is applied: one level is carried out in an arranged situation where the text is pre-chunked, while the other is performed in a normal situation where the original running text is used for testing. By specifying one grammar rule per T-EDU instance, it is possible to achieve perfect recall (100%) in a closed environment, where the testing corpus and the training corpus are the same, but recall of approximately 36.16% and 31.69% is obtained for the chunked and the running texts, respectively. For an open test with 3-fold cross-validation, the recall is around 67% while the precision is only 25-28%. To improve the precision score, two alternative strategies are applied: left-to-right longest matching (L2R-LM) and maximal longest matching (M-LM). The results show that L2R-LM and M-LM improve the precision to 93.97% and 94.03%, respectively, for the running text in the closed test. However, the recall drops slightly, to 94.18% and 92.91%. For the running text in the open test, the f-score improves to 57.70% and 54.14% for L2R-LM and M-LM, respectively.
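
The abstract names left-to-right longest matching and reports precision, recall, and f-score, but the record gives no implementation detail. The following is a minimal sketch of the general idea only, not the authors' parser. It assumes that candidate units arrive as hypothetical `(start, end)` token spans (end exclusive), that a position with no candidate is skipped one token at a time, and that a predicted span counts as correct only when it matches a gold span exactly. The function names `l2r_longest_match` and `precision_recall_f1` are illustrative.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive (assumed format)


def l2r_longest_match(candidates: List[Span], length: int) -> List[Span]:
    """Scan left to right; at each position keep the longest candidate starting there."""
    longest_end: Dict[int, int] = {}
    for start, end in candidates:
        if end > start:  # ignore empty or inverted spans
            longest_end[start] = max(longest_end.get(start, start), end)

    selected: List[Span] = []
    pos = 0
    while pos < length:
        end = longest_end.get(pos)
        if end is None:
            pos += 1  # assumption: no candidate starts here, so skip one token
        else:
            selected.append((pos, end))
            pos = end  # continue right after the chosen span (no overlaps)
    return selected


def precision_recall_f1(predicted: List[Span], gold: List[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of spans."""
    pred_set, gold_set = set(predicted), set(gold)
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data, invented for illustration only (not taken from Thai-NEST).
candidates = [(0, 2), (0, 4), (2, 5), (4, 7), (5, 7)]
gold = [(0, 4), (4, 6), (6, 7)]
chosen = l2r_longest_match(candidates, length=7)
scores = precision_recall_f1(chosen, gold)
```

Keeping a single longest span per start position discards the overlapping alternatives that a chart parser would otherwise return, which is one plausible reason such a strategy raises precision at some cost in recall.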

Keywords :

context-free grammars; knowledge based systems; natural language processing; Thai elementary discourse unit segmentation; context free grammar rules; kappa score; linguists; longest matching based chart parser; maximal longest matching; part of speech ambiguity; rule based method; serial verb constituents; Compounds; Educational institutions; Grammar; Indexes; Speech; Speech recognition; Syntactics; Chart parser; Discourse unit Segmentation; Thai Elementary Discourse Unit;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Knowledge, Information and Creativity Support Systems (KICSS), 2012 Seventh International Conference on

Conference_Location :

Melbourne, VIC

Print_ISBN :

978-1-4673-4564-4

Type :

conf

DOI :

10.1109/KICSS.2012.33

Filename :

6405529

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2870681