• DocumentCode
    3143727
  • Title
    Discovery of complex glitch patterns: A novel approach to Quantitative Data Cleaning
  • Author
    Berti-Équille, Laure ; Dasu, Tamraparni ; Srivastava, Divesh
  • Author_Institution
    Univ. of Rennes 1, Rennes, France
  • fYear
    2011
  • fDate
    11-16 April 2011
  • Firstpage
    733
  • Lastpage
    744
  • Abstract
    Quantitative Data Cleaning (QDC) is the use of statistical and other analytical techniques to detect, quantify, and correct data quality problems (or glitches). Current QDC approaches focus on addressing each category of data glitch individually. However, in real-world data, different types of data glitches co-occur in complex patterns. These patterns and interactions between glitches offer valuable clues for developing effective domain-specific quantitative cleaning strategies. In this paper, we address the shortcomings of the extant QDC methods by proposing a novel framework, the DEC (Detect-Explore-Clean) framework. It is a comprehensive approach for the definition, detection and cleaning of complex, multi-type data glitches. We exploit the distributions and interactions of different types of glitches to develop data-driven cleaning strategies that may offer significant advantages over blind strategies. The DEC framework is a statistically rigorous methodology for evaluating and scoring glitches and selecting the quantitative cleaning strategies that result in cleaned data sets that are statistically proximal to user specifications. We demonstrate the efficacy and scalability of the DEC framework on very large real-world and synthetic data sets.
  • Keywords
    data handling; statistical analysis; DEC framework; QDC methods; analytical techniques; complex glitch pattern discovery; data quality problems; data sets; detect-explore-clean framework; quantitative data cleaning; statistical techniques; user specifications; Aggregates; Cleaning; Data mining; Data structures; Joining processes; Joints; Scalability
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2011 IEEE 27th International Conference on
  • Conference_Location
    Hannover
  • ISSN
    1063-6382
  • Print_ISBN
    978-1-4244-8959-6
  • Electronic_ISBN
    1063-6382
  • Type
    conf
  • DOI
    10.1109/ICDE.2011.5767864
  • Filename
    5767864