• DocumentCode
    518226
  • Title
    Disambiguation of Thai personal name from online news articles
  • Author
    Sutheebanjard, Phaisarn ; Premchaiswadi, Wichian
  • Author_Institution
    Grad. Sch. of Inf. Technol., Siam Univ., Bangkok, Thailand
  • Volume
    3
  • fYear
    2010
  • fDate
    16-18 April 2010
  • Abstract
    Online news articles are updated daily, hourly, and sometimes every minute, so the volume of data from them is growing rapidly. These data form a large corpus for text mining. This research focuses on Thai personal names that appear in online news with slightly different spellings but actually refer to the same person. News data collected from 30 July 2009 to 5 November 2009 contain many such name variations. The objective of this paper is to disambiguate Thai personal names by applying five string matching techniques: Guth, Levenshtein, Damerau-Levenshtein, Longest Common Substring, and Longest Common Subsequence. The experimental results show that Longest Common Subsequence was the most effective technique for matching Thai personal names, with an F-score of 94.43%. A two-scan labeling technique was then used to identify the unique full Thai personal names. The results show that it reduces the 6,884 distinct personal names to 830 unique personal named entities, which is 12.057% of the original count.
  • Keywords
    DP management; data mining; desktop publishing; information resources; string matching; text analysis; Thai personal name; longest common subsequence; online news articles; text mining; two-scan labeling technique; Application software; Computer science; Couplings; Databases; Information technology; Labeling; Search engines; Terminology; online news; personal name; two-scan labeling
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Engineering and Technology (ICCET), 2010 2nd International Conference on
  • Conference_Location
    Chengdu
  • Print_ISBN
    978-1-4244-6347-3
  • Type
    conf
  • DOI
    10.1109/ICCET.2010.5485879
  • Filename
    5485879
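
The abstract names Longest Common Subsequence as the best-performing matcher for spelling variants of the same name. The following is a minimal sketch of how such a comparison might look, not the authors' implementation: the function names, the normalisation by the longer string's length, the 0.8 threshold, and the romanised sample names are all assumptions made for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b.

    Standard dynamic programming, keeping only the previous row of the table.
    """
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Characters match: extend the subsequence found so far.
                curr.append(prev[j - 1] + 1)
            else:
                # No match: carry over the better of the two neighbours.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Score in [0, 1]. Dividing by the longer length is an assumption,
    not a detail given in the abstract."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


# Hypothetical romanised spellings and a hypothetical threshold.
THRESHOLD = 0.8
same_person = lcs_similarity("somchai", "somchay") >= THRESHOLD
```

Names whose score passes the threshold would be treated as variants of one person; grouping those pairs into unique entities (the two-scan labeling step) is not shown here.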