  • DocumentCode
    575005
  • Title
    An efficient data cleaning algorithm based on attributes selection
  • Author
    He, Ling; Zhang, Zhongnan; Tan, Yize; Liao, Minghong
  • Author_Institution
    Software Sch., Xiamen Univ., Xiamen, China
  • fYear
    2011
  • fDate
    Nov. 29 2011-Dec. 1 2011
  • Firstpage
    375
  • Lastpage
    379
  • Abstract
    In data cleaning, detecting approximately duplicate records in a data warehouse is an important task. However, because of the wide range of possible data inconsistencies and the sheer data volume, determining whether two records are equal cannot be reduced to a simple arithmetic predicate. After analyzing the basic sorted-neighborhood method (SNM) and the multi-pass sorted-neighborhood method (MPN), both widely used in this field, this paper proposes an improved algorithm for cleaning approximately duplicate records based on attributes selection.
  • Keywords
    data analysis; data integrity; data warehouses; sorting; attributes selection; data inconsistencies; data warehouse; duplicate records cleaning; efficient data cleaning algorithm; multipass sorted neighborhood method; sheer data volume; simple arithmetic predicate; sorted-neighborhood method; Accuracy; Algorithm design and analysis; Approximation algorithms; Cleaning; Clustering algorithms; Databases; Educational institutions; Approximately duplicate records; SNM; data cleaning
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Computer Sciences and Convergence Information Technology (ICCIT), 2011 6th International Conference on
  • Conference_Location
    Seogwipo
  • Print_ISBN
    978-1-4577-0472-7
  • Type
    conf
  • Filename
    6316641