Title :
Data-driven name reduction for record linkage
Author :
Schraagen, M. ; Kosters, W.
Author_Institution :
Leiden Inst. of Adv. Comput. Sci., Leiden Univ., Leiden, Netherlands
Abstract :
Automatic record linkage of data containing personal names is difficult in the presence of name variation and spelling errors. This paper presents a standardization procedure for personal names to address the variation problem. A classification-tree-based model is constructed using a training set of 65,002 name-variant pairs. The method provides an efficient procedure for record linkage (3,500 records per second, F-measure 0.96 on a sample of Dutch historical civil records). The results include links with a large edit distance between the records; however, recall is lower for this category. A bootstrapping procedure is used to improve recall.
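The general idea of linking records through a standardized name can be illustrated with a minimal Python sketch. It is not the authors' method: the paper builds a classification tree model from training data, whereas this sketch assumes a small hand-written lookup table (`VARIANT_TO_STANDARD`) in its place. The function names, the table entries, and the sample records are all hypothetical. The sketch reduces each name to a standard form, links records that share the same reduced name, and scores the links with the F-measure. The edit distance function shows how a linked pair can still be far apart as raw strings.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical lookup table standing in for a learned model (variant -> standard form).
# Entries are illustrative only and are not taken from the paper's training set.
VARIANT_TO_STANDARD = {
    "jan": "johannes",
    "johan": "johannes",
    "hannes": "johannes",
    "mietje": "maria",
    "marie": "maria",
}


def reduce_name(name):
    """Map a name to its standard form; unknown names map to themselves."""
    key = name.strip().lower()
    return VARIANT_TO_STANDARD.get(key, key)


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def link_records(records):
    """Link every pair of records whose reduced names are identical."""
    buckets = defaultdict(list)
    for rec_id, name in records:
        buckets[reduce_name(name)].append(rec_id)
    links = set()
    for ids in buckets.values():
        links.update(combinations(sorted(ids), 2))
    return links


def f_measure(predicted, gold):
    """Harmonic mean of precision and recall over sets of record-id pairs."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample records as (record id, first name).
records = [(1, "Jan"), (2, "Johannes"), (3, "Mietje"), (4, "Maria"), (5, "Pieter")]
links = link_records(records)

# Hypothetical gold standard containing one link the lookup table cannot find.
gold = {(1, 2), (3, 4), (2, 5)}
score = f_measure(links, gold)
```

Because matching is an exact comparison on the reduced name, each record needs only one lookup and one hash-table insertion. That is one plausible reason a standardization approach can be fast, although the throughput reported in the abstract depends on the authors' own implementation.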
Keywords :
data handling; pattern classification; pattern matching; records management; trees (mathematics); bootstrapping procedure; classification tree based model; data automatic record linkage; data-driven name reduction; edit distance; name variation; personal names; spelling errors; standardization procedure; training set; Accuracy; Couplings; Decision trees; Joining processes; Training; Training data; Vectors;
Conference_Title :
Innovative Computing Technology (INTECH), 2012 Second International Conference on
Conference_Location :
Casablanca
Print_ISBN :
978-1-4673-2678-0
DOI :
10.1109/INTECH.2012.6457783