• DocumentCode
    25278
  • Title
    Incremental Detection of Inconsistencies in Distributed Data
  • Author
    Wenfei Fan ; Jianzhong Li ; Nan Tang ; Wenyuan Yu
  • Author_Institution
    Lab. for Foundations of Comput. Sci. (LFCS), Univ. of Edinburgh, Edinburgh, UK
  • Volume
    26
  • Issue
    6
  • fYear
    2014
  • fDate
    Jun-14
  • Firstpage
    1367
  • Lastpage
    1383
  • Abstract
    This paper investigates incremental detection of errors in distributed data. Given a distributed database D, a set Σ of conditional functional dependencies (CFDs), the set V of violations of the CFDs in D, and updates ΔD to D, the problem is to find, with minimum data shipment, the changes ΔV to V in response to ΔD. The need for the study is evident since real-life data is often dirty, distributed, and frequently updated, and it is often prohibitively expensive to recompute the entire set of violations when D is updated. We show that the incremental detection problem is NP-complete for a database D that is partitioned either vertically or horizontally, even when Σ and D are fixed. Nevertheless, we show that it is bounded: there exist algorithms to detect errors such that their computational cost and data shipment are both linear in the size of ΔD and ΔV, independent of the size of the database D. We provide such incremental algorithms for vertically partitioned data and horizontally partitioned data, and show that the algorithms are optimal. We further propose optimization techniques for the incremental algorithm over vertical partitions to reduce data shipment. We verify experimentally, using real-life data on Amazon Elastic Compute Cloud (EC2), that our algorithms substantially outperform their batch counterparts.
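    As a rough illustration of the idea of computing ΔV from ΔD without rescanning D, the following is a minimal single-site sketch in Python. It is not the authors' algorithm and does not model partitioning or data shipment. The class name `IncrementalDetector`, the attribute names (`cc`, `zip`, `street`), and the example dependency are all hypothetical. It assumes one dependency X → Y, optionally restricted by constant conditions, and treats every tuple in a group that shares X-values but disagrees on Y as a violation.

```python
from collections import defaultdict


class IncrementalDetector:
    """Hypothetical sketch: tracks violations of one dependency X -> Y
    under single-tuple inserts and deletes, without rescanning the data."""

    def __init__(self, lhs, rhs, condition=None):
        self.lhs = tuple(lhs)              # attributes X
        self.rhs = rhs                     # attribute Y
        self.condition = condition or {}   # constant pattern, e.g. {"cc": "44"}
        # groups[x_values][y_value] = ids of tuples with those values
        self.groups = defaultdict(lambda: defaultdict(set))
        self.rows = {}

    def _applies(self, row):
        return all(row.get(a) == v for a, v in self.condition.items())

    def _key(self, row):
        return tuple(row[a] for a in self.lhs)

    @staticmethod
    def _members(group):
        return {tid for ids in group.values() for tid in ids}

    def insert(self, tid, row):
        """Return the tuple ids that become violations (added to V)."""
        if not self._applies(row):
            return set()
        self.rows[tid] = row
        group = self.groups[self._key(row)]
        was_violating = len(group) > 1
        group[row[self.rhs]].add(tid)
        if len(group) <= 1:
            return set()                   # group still agrees on Y
        if was_violating:
            return {tid}                   # only the new tuple joins V
        return self._members(group)        # whole group turns into violations

    def delete(self, tid):
        """Return the tuple ids that stop being violations (removed from V)."""
        row = self.rows.pop(tid, None)
        if row is None:
            return set()
        key = self._key(row)
        group = self.groups[key]
        was_violating = len(group) > 1
        ids = group[row[self.rhs]]
        ids.discard(tid)
        if not ids:
            del group[row[self.rhs]]
        if not group:
            del self.groups[key]
        if not was_violating:
            return set()
        if len(group) > 1:
            return {tid}                   # group still disagrees on Y
        return self._members(group) | {tid}


# Hypothetical dependency: for cc = "44", zip determines street.
detector = IncrementalDetector(["cc", "zip"], "street", condition={"cc": "44"})
first = detector.insert(1, {"cc": "44", "zip": "EH4", "street": "Mayfield"})
second = detector.insert(2, {"cc": "44", "zip": "EH4", "street": "Crichton"})
third = detector.insert(3, {"cc": "44", "zip": "EH4", "street": "Mayfield"})
removed = detector.delete(2)
```

    In this sketch each update touches only the group that shares the updated tuple's X-values, and it enumerates that group only when the whole group enters or leaves V, so the work per update is tied to the size of the update and of the resulting change rather than to the size of the stored data.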
  • Keywords
    computational complexity; distributed algorithms; distributed databases; optimisation; Amazon Elastic Compute Cloud; EC2; NP-complete problem; computational cost; conditional functional dependencies; distributed data; distributed database; incremental error detection; incremental inconsistency detection problem; minimum data shipment reduction; optimization techniques; vertical partitions; Cities and towns; Computational fluid dynamics; Database systems; Partitioning algorithms; Data; Data dependencies; General; Incremental algorithms; Miscellaneous; error detection
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2012.138
  • Filename
    6243140