Thai personal named entity extraction without using word segmentation or POS tagging

Author

Sutheebanjard, P. ; Premchaiswadi, W.

Author_Institution

Grad. Sch. of Inf. Technol., Siam Univ., Bangkok, Thailand

fYear

2009

fDate

20-22 Oct. 2009

Firstpage

221

Lastpage

226

Abstract

Named entity (NE) extraction for the Thai language is a difficult and time-consuming task because Thai sentences are written as a continuous stream of characters, with no delimiters (blank spaces) to mark word boundaries. Currently, most named entity extraction methods for Thai rely on word segmentation and part-of-speech (POS) tagging, so their accuracy depends largely on the performance of those processes. At present, there is still no fully suitable method for identifying word boundaries in Thai sentences. This paper therefore proposes a method for extracting Thai personal named entities without using word segmentation or POS tagging. The proposed method consists of three steps. First, pre-processing removes non-alphabetic characters such as parentheses and numerals. Next, personal named entities are extracted using the contextual environment, front and rear, of each personal name. Finally, in post-processing, a simple rule base is employed to identify personal names. A training corpus of 900 political news articles and a test corpus of 100 political, 100 financial, and 100 sports news articles were used in the experiments. The results showed that the F-measures in the political and financial domains were 91.442% and 91.720% respectively, which are nearly the same. Because the proposed scheme uses neither word segmentation nor POS tagging, it can significantly reduce the effort and speed up the process of building the training corpus.
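
The three steps named in the abstract (pre-processing, context-based extraction, rule-based post-processing) can be illustrated with a rough Python sketch that works directly on characters, with no word segmentation or POS tagging. This is not the authors' implementation: the abstract does not say how the front and rear contexts are obtained or what the post-processing rules are, so the marker lists FRONT_CONTEXTS and REAR_CONTEXTS, the length limit max_len, and all function names below are hypothetical placeholders chosen only for illustration.

```python
import re

# Assumption: hand-picked context markers. The paper derives its contexts
# from a training corpus; these short lists are placeholders only.
FRONT_CONTEXTS = ["นาย", "นาง", "นางสาว"]   # titles that may precede a personal name
REAR_CONTEXTS = ["กล่าว", "เปิดเผย"]         # verbs that may follow a personal name


def preprocess(text):
    """Step 1: remove brackets and digits (ASCII and Thai digits)."""
    return re.sub(r"[()\[\]{}0-9\u0E50-\u0E59]", "", text)


def extract_candidates(text):
    """Step 2: capture the characters between a front and a rear context."""
    # Try longer front markers first so "นางสาว" is not cut short by "นาง".
    fronts = sorted(FRONT_CONTEXTS, key=len, reverse=True)
    front = "|".join(re.escape(f) for f in fronts)
    rear = "|".join(re.escape(r) for r in REAR_CONTEXTS)
    pattern = re.compile(rf"(?:{front})(.+?)(?={rear})")
    return [m.group(1) for m in pattern.finditer(text)]


def postprocess(candidates, max_len=40):
    """Step 3: toy rules: normalise spaces, drop empty or overly long spans."""
    names = []
    for cand in candidates:
        name = " ".join(cand.split())
        if name and len(name) <= max_len:
            names.append(name)
    return names


def extract_person_names(text):
    """Run the three steps in order on raw, unsegmented text."""
    return postprocess(extract_candidates(preprocess(text)))


if __name__ == "__main__":
    sample = "นายสมชาย ใจดี (45) กล่าวว่าเศรษฐกิจดีขึ้น"
    print(extract_person_names(sample))
```

The sketch matches on raw character sequences throughout, which is the property the abstract emphasises; a realistic system would replace the fixed marker lists with contexts learned from annotated news text.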

Keywords

information retrieval; learning (artificial intelligence); natural language processing; text analysis; POS tagging; Thai personal named entity extraction; blank space; contextual environment; financial news article; machine learning; part of speech; political news article; rule base; sport news article; word boundary identification; word segmentation; Data mining; Entropy; Feature extraction; Guidelines; Information retrieval; Natural language processing; Natural languages; Tagging; Testing; Text recognition

fLanguage

English

Publisher

IEEE

Conference_Title

Natural Language Processing, 2009. SNLP '09. Eighth International Symposium on

Conference_Location

Bangkok

Print_ISBN

978-1-4244-4138-9

Electronic_ISBN

978-1-4244-4139-6

Type

conf

DOI

10.1109/SNLP.2009.5340914

Filename

5340914