Title :
An efficient data cleaning algorithm based on attributes selection
Author :
He, Ling ; Zhang, Zhongnan ; Tan, Yize ; Liao, Minghong
Author_Institution :
Software Sch., Xiamen Univ., Xiamen, China
Date :
Nov. 29, 2011 to Dec. 1, 2011
Abstract :
In data cleaning, detecting approximately duplicate records in a data warehouse is an important task. However, because of the wide range of possible data inconsistencies and the sheer volume of data, deciding whether two records are equal cannot be reduced to a simple arithmetic predicate. After analyzing two methods widely used in this field, the basic sorted-neighborhood method (SNM) and the multi-pass sorted-neighborhood method (MPN), this paper proposes an improved algorithm for cleaning approximately duplicate records that is based on attributes selection.
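For orientation, the basic SNM named in the abstract can be sketched as follows: build a sort key for every record, sort on that key, then compare each record only with the few records adjacent to it in sort order. This is a minimal illustration of the generic method, not the improved algorithm proposed in the paper. The names `similarity` and `sorted_neighborhood`, the sample records, the sort key, the window size, and the 0.85 threshold are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Average string similarity over the paired fields of two records."""
    scores = [
        SequenceMatcher(None, x.lower(), y.lower()).ratio()
        for x, y in zip(a, b)
    ]
    return sum(scores) / len(scores)


def sorted_neighborhood(records, key, window=3, threshold=0.85):
    """Return index pairs (i, j), i < j, flagged as approximate duplicates."""
    # Sort record indices by the caller-supplied key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare each record only with the window - 1 records
        # that precede it in sort order.
        for j in order[max(0, pos - window + 1):pos]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical sample data: (name, address) tuples.
records = [
    ("John Smith", "12 Main Street"),
    ("Mary Jones", "5 Oak Avenue"),
    ("Jon Smith", "12 Main Street"),
]


def key(record):
    # Hypothetical sort key: name prefix plus address prefix.
    return (record[0][:3] + record[1][:5]).lower()


print(sorted_neighborhood(records, key))
```

A multi-pass variant would, in the same spirit, call `sorted_neighborhood` several times with different keys and merge the resulting pairs. How the paper selects attributes for its keys is not stated in this record, so that step is not shown.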
Keywords :
data analysis; data integrity; data warehouses; sorting; attributes selection; data inconsistencies; data warehouse; duplicate records cleaning; efficient data cleaning algorithm; multipass sorted neighborhood method; sheer data volume; simple arithmetic predicate; sorted-neighborhood method; Accuracy; Algorithm design and analysis; Approximation algorithms; Cleaning; Clustering algorithms; Databases; Educational institutions; Approximately duplicate records; SNM; data cleaning
Conference_Title :
Computer Sciences and Convergence Information Technology (ICCIT), 2011 6th International Conference on
Conference_Location :
Seogwipo
Print_ISBN :
978-1-4577-0472-7