Title :
Disambiguation of Thai personal name from online news articles
Author :
Sutheebanjard, Phaisarn ; Premchaiswadi, Wichian
Author_Institution :
Grad. Sch. of Inf. Technol., Siam Univ., Bangkok, Thailand
Abstract :
Online news articles are updated daily, hourly and sometimes every minute, so the volume of data from online news is growing rapidly. These data form a large corpus for text mining. This research focuses on Thai personal names that appear in online news with slightly different spellings but actually refer to the same person. The news data collected from 30 July 2009 to 5 November 2009 contain many such name variations. The objective of this paper is to disambiguate Thai personal names by applying five string matching techniques: Guth, Levenshtein, Damerau-Levenshtein, Longest Common Substring and Longest Common Subsequence. The experimental results show that Longest Common Subsequence was the most effective technique for matching Thai personal names, with an F-score of 94.43%. The two-scan labeling technique was then used to identify the unique full Thai personal names. The results show that it reduces the 6,884 distinct personal names to 830 unique personal named entities, that is, to 12.057% of the original count.
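The matching-then-grouping idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the normalization (LCS length divided by the longer name's length), the 0.8 threshold, the function names and the sample names are all assumptions made for illustration, and a simple union-find grouping stands in for the two-scan labeling step, whose details are not given in the abstract.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization: LCS length over the longer string's length."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def group_names(names, threshold=0.8):
    """Group names whose pairwise similarity reaches the (assumed) threshold.

    Union-find is used here as a stand-in for the two-scan labeling step.
    """
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if lcs_similarity(names[i], names[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


# Hypothetical romanized names; Thai-script strings work the same way.
sample = ["Somchai Jaidee", "Somchai Jaidi", "Suda Rakthai"]
groups = group_names(sample)
```

With these sample names, the two spellings of the first name fall into one group and the third name stays on its own. The pairwise loop is quadratic in the number of names, which is acceptable for a few thousand names but would need blocking or indexing for much larger collections.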
Keywords :
DP management; data mining; desktop publishing; information resources; string matching; text analysis; Thai personal name; longest common subsequence; online news articles; text mining; two-scan labeling technique; Application software; Computer science; Couplings; Databases; Information technology; Labeling; Search engines; Terminology; online news; personal name; two-scan labeling
Conference_Title :
Computer Engineering and Technology (ICCET), 2010 2nd International Conference on
Conference_Location :
Chengdu
Print_ISBN :
978-1-4244-6347-3
DOI :
10.1109/ICCET.2010.5485879