DocumentCode :
2593071
Title :
A comparative study on Thai word segmentation approaches
Author :
Haruechaiyasak, Choochart ; Kongyoung, Sarawoot ; Dailey, Matthew
Author_Institution :
National Electronics and Computer Technology Center (NECTEC), Human Language Technology Laboratory (HLT), Pathumthani
Volume :
1
fYear :
2008
fDate :
14-17 May 2008
Firstpage :
125
Lastpage :
128
Abstract :
In this paper, we analyze and compare various approaches to Thai word segmentation. These approaches can be classified into two distinct types: dictionary-based (DCB) and machine-learning-based (MLB). The DCB approach relies on a set of terms for parsing and segmenting input texts, whereas the MLB approach relies on a model trained from a corpus using machine learning techniques. We compare two algorithms from the DCB approach, longest matching and maximal matching, and four algorithms from the MLB approach: naive Bayes (NB), decision tree, support vector machine (SVM), and conditional random field (CRF). In the experimental results, the DCB approach outperformed the NB, decision tree, and SVM algorithms from the MLB approach. However, the best performance was obtained by the CRF algorithm, with precision and recall of 95.79% and 94.98%, respectively.
Keywords :
Bayes methods; decision trees; learning (artificial intelligence); natural language processing; support vector machines; Thai word segmentation; conditional random field; decision tree; dictionary based word segmentation; longest-matching algorithms; machine learning based word segmentation; maximal matching; naive Bayes; support vector machine; Decision trees; Dictionaries; Information management; Information retrieval; Laboratories; Machine learning; Machine learning algorithms; Natural languages; Niobium; Support vector machines; Word segmentation; dictionary-based; machine learning algorithms; morphological analysis; tokenization;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008. ECTI-CON 2008. 5th International Conference on
Conference_Location :
Krabi
Print_ISBN :
978-1-4244-2101-5
Electronic_ISBN :
978-1-4244-2102-2
Type :
conf
DOI :
10.1109/ECTICON.2008.4600388
Filename :
4600388