• DocumentCode
    2080889
  • Title

    Detecting inconsistencies in distributed data

  • Author

    Fan, Wenfei ; Geerts, Floris ; Ma, Shuai ; Müller, Heiko

  • Author_Institution
    Univ. of Edinburgh, Edinburgh, UK
  • fYear
    2010
  • fDate
    1-6 March 2010
  • Firstpage
    64
  • Lastpage
    75
  • Abstract
    One of the central problems for data quality is inconsistency detection. Given a database D and a set Σ of dependencies as data quality rules, we want to identify tuples in D that violate some rules in Σ. When D is a centralized database, there have been effective SQL-based techniques for finding violations. It is, however, far more challenging when data in D is distributed, in which inconsistency detection often necessarily requires shipping data from one site to another. This paper develops techniques for detecting violations of conditional functional dependencies (CFDs) in relations that are fragmented and distributed across different sites. (1) We formulate the detection problem in various distributed settings as optimization problems, measured by either network traffic or response time. (2) We show that it is beyond reach in practice to find optimal detection methods: the detection problem is NP-complete when the data is partitioned either horizontally or vertically, and when we aim to minimize either data shipment or response time. (3) For data that is horizontally partitioned, we provide several algorithms to find violations of a set of CFDs, leveraging the structure of CFDs to reduce data shipment or increase parallelism. (4) We verify experimentally that our algorithms are scalable on large relations and complex CFDs. (5) For data that is vertically partitioned, we provide a characterization for CFDs to be checked locally without requiring data shipment, in terms of dependency preservation. We show that it is intractable to minimally refine a partition and make it dependency preserving.
  • Keywords
    SQL; computational complexity; distributed databases; optimisation; CFD; SQL based techniques; conditional functional dependencies; data quality rules; data shipment; distributed data detecting inconsistencies; inconsistency detection data quality; Cities and towns; Databases; Delay; EMP radiation effects; Marine vehicles; OFDM modulation; Optimization methods; Partitioning algorithms; Remuneration; Time measurement
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2010 IEEE 26th International Conference on
  • Conference_Location
    Long Beach, CA
  • Print_ISBN
    978-1-4244-5445-7
  • Electronic_ISBN
    978-1-4244-5444-0
  • Type

    conf

  • DOI
    10.1109/ICDE.2010.5447855
  • Filename
    5447855