• DocumentCode
    245112
  • Title
    Parallel Corpus Approach for Name Matching in Record Linkage
  • Author
    Sukharev, Jeffrey; Zhukov, Leonid; Popescul, Alexandrin
  • Author_Institution
    Ancestry.com, San Francisco, CA, USA
  • fYear
    2014
  • fDate
    14-17 Dec. 2014
  • Firstpage
    995
  • Lastpage
    1000
  • Abstract
    Record linkage, or entity resolution, is an important area of data mining, and name matching is a key component of record linkage systems. Alternative spellings of the same name occur commonly in many applications. We use the largest collection of genealogy person records in the world, together with user search query logs, to build name-matching models. We outline the procedure for building a crowd-sourced training set and present our method, which casts the problem of learning alternative spellings as a character-level machine translation problem. Using information retrieval evaluation methodology, we show that on our data this method substantially outperforms a number of standard, well-known phonetic and string similarity methods in terms of precision and recall. Our result can have significant practical impact in entity resolution applications.
  • Keywords
    data mining; language translation; parallel processing; query processing; string matching; character level; crowd-sourced training set; entity resolution; genealogy person records; information retrieval evaluation methodology; machine translation problem; name matching; parallel corpus approach; phonetic method; record linkage; string similarity methods; user search query logs; Buildings; Computational modeling; Couplings; Data mining; Databases; Probability; Training; Crowd Sourcing; Machine Translation; Record Linkage
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining (ICDM), 2014 IEEE International Conference on
  • Conference_Location
    Shenzhen
  • ISSN
    1550-4786
  • Print_ISBN
    978-1-4799-4303-6
  • Type
    conf
  • DOI
    10.1109/ICDM.2014.76
  • Filename
    7023436