DocumentCode
575005
Title
An efficient data cleaning algorithm based on attributes selection
Author
He, Ling ; Zhang, Zhongnan ; Tan, Yize ; Liao, Minghong
Author_Institution
Software Sch., Xiamen Univ., Xiamen, China
fYear
2011
fDate
Nov. 29 2011-Dec. 1 2011
Firstpage
375
Lastpage
379
Abstract
In data cleaning, detecting approximately duplicate records in a data warehouse is an important task. However, because of the wide range of possible data inconsistencies and the sheer volume of data, deciding whether two records are equal cannot be reduced to a simple arithmetic predicate. After analyzing the basic sorted-neighborhood method (SNM) and the multi-pass sorted-neighborhood method (MPN), both widely used in this field, this paper proposes an improved algorithm, based on attribute selection, for cleaning approximately duplicate records.
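For orientation, the basic SNM that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration of the general technique only, not the improved algorithm the paper proposes (the abstract gives no details of it). The function name `sorted_neighborhood`, the key construction, the `difflib`-based similarity measure, and the `window` and `threshold` defaults are all assumptions made for this sketch.

```python
from difflib import SequenceMatcher


def sorted_neighborhood(records, key_fields, window=3, threshold=0.85):
    """Return index pairs of records judged approximately duplicate.

    Assumed for this sketch: records are dicts, and similarity is a
    character-level ratio rather than a domain-specific matching rule.
    """

    def sort_key(rec):
        # Hypothetical key: concatenate the selected attributes, lowercased.
        return "".join(str(rec[f]).strip().lower() for f in key_fields)

    def similarity(a, b):
        # Stand-in matcher: compare the joined field values of both records.
        text_a = " ".join(str(v).lower() for v in a.values())
        text_b = " ".join(str(v).lower() for v in b.values())
        return SequenceMatcher(None, text_a, text_b).ratio()

    # Step 1 and 2: build a key per record and sort record indices by it.
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))

    # Step 3: slide a fixed-size window over the sorted order, comparing
    # each record only with the next (window - 1) records.
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


records = [
    {"name": "John Smith", "city": "Xiamen"},
    {"name": "Mary Jones", "city": "Seoul"},
    {"name": "Jon Smith", "city": "Xiamen"},
]
duplicates = sorted_neighborhood(records, ["name"], window=2)
```

The result depends heavily on which attributes form the sort key, since only records that land close together after sorting are ever compared. A multi-pass variant would typically repeat this procedure with different keys and combine the pairs found in each pass.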
Keywords
data analysis; data integrity; data warehouses; sorting; attributes selection; data inconsistencies; data warehouse; duplicate records cleaning; efficient data cleaning algorithm; multipass sorted neighborhood method; sheer data volume; simple arithmetic predicate; sorted-neighborhood method; Accuracy; Algorithm design and analysis; Approximation algorithms; Cleaning; Clustering algorithms; Databases; Educational institutions; Approximately duplicate records; SNM; data cleaning
fLanguage
English
Publisher
IEEE
Conference_Titel
Computer Sciences and Convergence Information Technology (ICCIT), 2011 6th International Conference on
Conference_Location
Seogwipo
Print_ISBN
978-1-4577-0472-7
Type
conf
Filename
6316641