• DocumentCode
    2719208
  • Title

    Proof positive and negative in data cleaning

  • Author

    Interlandi, Matteo ; Tang, Nan

  • Author_Institution
    Qatar Computing Research Institute, Doha, Qatar
  • fYear
    2015
  • fDate
    13-17 April 2015
  • Firstpage
    18
  • Lastpage
    29
  • Abstract
    One notoriously hard data cleaning problem is, given a database, how to precisely capture which value is correct (i.e., proof positive) or wrong (i.e., proof negative). Although integrity constraints have been widely studied to capture data errors as violations, the accuracy of data cleaning using integrity constraints has long been controversial. Overall, they share one fundamental problem: given a set of data values that together form a violation, there is no evidence of which value is proof positive or negative. Hence, it is known that integrity constraints by themselves cannot guide dependable data cleaning. In this work, we introduce an automated method for proof positive and negative in data cleaning, based on Sherlock rules and reference tables. Given a tuple and reference tables, Sherlock rules tell us which attributes are proof positive, which attributes are proof negative, and (possibly) how to update them. We study several fundamental problems associated with Sherlock rules. We also present efficient algorithms for cleaning data using Sherlock rules. We experimentally demonstrate that our techniques can not only annotate data with proof positive and negative, but also repair data when enough information is available.
  • Keywords
    data integrity; Sherlock rule; data cleaning; data errors; proof negative; proof positive; reference table; Accuracy; Cleaning; Databases; Maintenance engineering; Mobile communication; Semantics; Silicon
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2015 IEEE 31st International Conference on
  • Conference_Location
    Seoul
  • Type

    conf

  • DOI
    10.1109/ICDE.2015.7113269
  • Filename
    7113269
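
Illustrative sketch (not taken from the paper): the abstract says that, given a tuple and reference tables, a Sherlock rule reports which attributes are proof positive, which are proof negative, and possibly how to update them. The Python below is one simplified reading of that idea, not the paper's formal rule definition. The names `SketchRule`, `annotate`, `wrong_ref`, and `right_ref`, and the country/capital data, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SketchRule:
    """Hypothetical, simplified stand-in for a rule over one reference table."""
    evidence: dict   # tuple attribute -> reference attribute used for matching
    target: str      # tuple attribute to be judged
    wrong_ref: str   # reference attribute holding a known-wrong value for target
    right_ref: str   # reference attribute holding the correct value for target


def annotate(row, reference, rule):
    """Return (positive, negative, repairs); all empty when the rule does not fire."""
    for ref in reference:
        # The evidence attributes must agree with this reference row.
        if any(row.get(a) != ref.get(r) for a, r in rule.evidence.items()):
            continue
        value = row.get(rule.target)
        if value == ref[rule.right_ref]:
            # Target already holds the correct value: everything involved is positive.
            return set(rule.evidence) | {rule.target}, set(), {}
        if value == ref[rule.wrong_ref]:
            # Target holds a known-wrong value: mark it negative and propose a repair.
            return set(rule.evidence), {rule.target}, {rule.target: ref[rule.right_ref]}
    # Not enough information: leave the tuple unannotated.
    return set(), set(), {}


# Assumed toy data, for illustration only.
reference = [{"country": "China", "capital": "Beijing", "largest_city": "Shanghai"}]
rule = SketchRule(
    evidence={"country": "country"},
    target="capital",
    wrong_ref="largest_city",
    right_ref="capital",
)
positive, negative, repairs = annotate(
    {"country": "China", "capital": "Shanghai"}, reference, rule
)
```

In this toy run the rule fires on the matching country, so `country` is reported as positive, `capital` as negative, and the proposed repair sets `capital` to the reference value. A tuple whose target value matches neither reference column is left unannotated, mirroring the abstract's point that repairs happen only when enough information is available.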