Title :
Thai personal named entity extraction without using word segmentation or POS tagging
Author :
Sutheebanjard, P. ; Premchaiswadi, W.
Author_Institution :
Grad. Sch. of Inf. Technol., Siam Univ., Bangkok, Thailand
Abstract :
Named entity (NE) extraction for the Thai language is a difficult and time-consuming task because Thai sentences are composed of a series of words formed by a continuous stream of characters, with no delimiters (blank spaces) to mark word boundaries. Currently, most named entity extraction methods for Thai rely on word segmentation and part-of-speech (POS) tagging, so the accuracy of named entity extraction is largely determined by the accuracy of those processes. At present, there is still a lack of suitable methods for identifying word boundaries in Thai sentences. This paper therefore proposes a method to extract Thai personal named entities without using word segmentation or POS tagging. The proposed method is composed of three steps. First, pre-processing removes non-alphabetic characters such as parentheses and numerals. Then, personal named entities are extracted by using the contextual environment, front and rear, of the personal name. Finally, in post-processing, a simple rule base is employed to identify personal names. A training corpus of 900 political news articles and a test corpus of 100 political, 100 financial and 100 sport news articles were used in the experiments. The results showed that the F-measures in the political and financial domains are 91.442% and 91.720% respectively, which are nearly the same. Because the proposed scheme uses neither word segmentation nor POS tagging, it can significantly reduce the effort and time needed to build the training corpus.
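The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the contexts are learned or which rules are applied, so the context lists (`FRONT_CONTEXTS`, `REAR_CONTEXTS`), the function names, and the length-based rule below are all hypothetical stand-ins. The sketch only shows how a name can be located from its surrounding strings without first segmenting the text into words.

```python
import re

# Hypothetical context strings. In the paper these would come from the
# training corpus; the values here are illustrative assumptions only.
FRONT_CONTEXTS = ["นาย", "นางสาว", "นาง"]      # titles: Mr., Miss, Mrs.
REAR_CONTEXTS = ["กล่าวว่า", "เปิดเผยว่า"]      # "said that", "revealed that"


def preprocess(text):
    """Step 1: remove brackets and digits (ASCII and Thai numerals)."""
    return re.sub(r"[()\[\]{}0-9๐-๙]", "", text)


def extract_candidates(text):
    """Step 2: capture the characters between a front and a rear context."""
    # Longer front contexts are tried first so "นางสาว" wins over "นาง".
    fronts = sorted(FRONT_CONTEXTS, key=len, reverse=True)
    front = "|".join(re.escape(f) for f in fronts)
    rear = "|".join(re.escape(r) for r in REAR_CONTEXTS)
    pattern = re.compile(f"(?:{front})(.+?)(?={rear})")
    return [m.group(1) for m in pattern.finditer(text)]


def postprocess(candidates, max_len=40):
    """Step 3: a stand-in rule base (trim, drop empty or overly long
    candidates, remove duplicates while keeping first-seen order)."""
    names = []
    for cand in candidates:
        cand = cand.strip()
        if cand and len(cand) <= max_len and cand not in names:
            names.append(cand)
    return names


def extract_person_names(text):
    """Run the three steps in order on raw, unsegmented text."""
    return postprocess(extract_candidates(preprocess(text)))
```

Calling `extract_person_names` on a raw sentence returns the list of strings found between a known title and a known reporting phrase. The matching works directly on the character stream, which mirrors the paper's stated goal of avoiding word segmentation and POS tagging; a real system would need far richer context lists and rules than the ones assumed here.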
Keywords :
information retrieval; learning (artificial intelligence); natural language processing; text analysis; POS tagging; Thai personal named entity extraction; blank space; contextual environment; financial news article; machine learning; part of speech; political news article; rule base; sport news article; text analysis; word boundary identification; word segmentation; Data mining; Entropy; Feature extraction; Guidelines; Information retrieval; Natural language processing; Natural languages; Tagging; Testing; Text recognition;
Conference_Title :
Natural Language Processing, 2009. SNLP '09. Eighth International Symposium on
Conference_Location :
Bangkok
Print_ISBN :
978-1-4244-4138-9
Electronic_ISBN :
978-1-4244-4139-6
DOI :
10.1109/SNLP.2009.5340914