DocumentCode :
245112
Title :
Parallel Corpus Approach for Name Matching in Record Linkage
Author :
Sukharev, Jeffrey ; Zhukov, Leonid ; Popescul, Alexandrin
Author_Institution :
Ancestry.com, San Francisco, CA, USA
fYear :
2014
fDate :
14-17 Dec. 2014
Firstpage :
995
Lastpage :
1000
Abstract :
Record linkage, or entity resolution, is an important area of data mining. Name matching is a key component of systems for record linkage. Alternative spellings of the same name are a common occurrence in many applications. We use the largest collection of genealogy person records in the world together with user search query logs to build name-matching models. The procedure for building a crowd-sourced training set is outlined together with the presentation of our method. We cast the problem of learning alternative spellings as a machine translation problem at the character level. We use information retrieval evaluation methodology to show that, on our data, this method substantially outperforms a number of standard, well-known phonetic and string similarity methods in terms of precision and recall. Our result can lead to a significant practical impact in entity resolution applications.
Keywords :
data mining; language translation; parallel processing; query processing; string matching; character level; crowd-sourced training set; data mining; entity resolution; genealogy person records; information retrieval evaluation methodology; machine translation problem; name matching; parallel corpus approach; phonetic method; record linkage; string similarity methods; user search query logs; Buildings; Computational modeling; Couplings; Data mining; Databases; Probability; Training; Crowd Sourcing; Machine Translation; Record Linkage;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Mining (ICDM), 2014 IEEE International Conference on
Conference_Location :
Shenzhen
ISSN :
1550-4786
Print_ISBN :
978-1-4799-4303-6
Type :
conf
DOI :
10.1109/ICDM.2014.76
Filename :
7023436