• DocumentCode
    2079352
  • Title

    ProbClean: A probabilistic duplicate detection system

  • Author

    Beskales, George ; Soliman, Mohamed A. ; Ilyas, Ihab F. ; Ben-David, Shai ; Kim, Yubin

  • Author_Institution
    Sch. of Comput. Sci., Univ. of Waterloo, Waterloo, ON, Canada
  • fYear
    2010
  • fDate
    1-6 March 2010
  • Firstpage
    1193
  • Lastpage
    1196
  • Abstract
    One of the most prominent data quality problems is the existence of duplicate records. Current data cleaning systems usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. We propose ProbClean, a system that treats duplicate detection procedures as data processing tasks with uncertain outcomes. We use a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. ProbClean efficiently supports relational queries and allows new types of queries against a set of possible repairs.
  • Keywords
    data integrity; ProbClean; data cleaning systems; data quality problems; probabilistic duplicate detection system; Business; Cleaning; Computer science; Data mining; Data processing; Data warehouses; Detection algorithms; Query processing; Relational databases; Uncertainty;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2010 IEEE 26th International Conference on
  • Conference_Location
    Long Beach, CA
  • Print_ISBN
    978-1-4244-5445-7
  • Electronic_ISBN
    978-1-4244-5444-0
  • Type

    conf

  • DOI
    10.1109/ICDE.2010.5447744
  • Filename
    5447744
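
The abstract describes encoding the space of possible repairs that arise from different parameter settings of a duplicate detection algorithm. The toy sketch below illustrates that general idea only; it is not ProbClean's actual uncertainty model or query engine, which this record does not specify. Everything in it is an assumption made for illustration: the sample records, the bigram Jaccard similarity, single-linkage clustering, the threshold grid, and the equal weighting of thresholds.

```python
from itertools import combinations

# Hypothetical toy data: record id -> name string.
RECORDS = {1: "John Smith", 2: "Jon Smith", 3: "John Smyth", 4: "Mary Jones"}


def similarity(a, b):
    """Jaccard similarity over character bigrams (an illustrative choice)."""
    def grams(s):
        return {s[i:i + 2] for i in range(len(s) - 1)}
    ga, gb = grams(a.lower()), grams(b.lower())
    return len(ga & gb) / len(ga | gb)


def repair(records, threshold):
    """One repair: single-linkage clusters of pairs scoring >= threshold."""
    parent = {r: r for r in records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if similarity(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for r in records:
        clusters.setdefault(find(r), set()).add(r)
    return frozenset(frozenset(c) for c in clusters.values())


def possible_repairs(records, thresholds):
    """Store each distinct cluster once, with the thresholds that produce it."""
    support = {}
    for t in thresholds:
        for cluster in repair(records, t):
            support.setdefault(cluster, []).append(t)
    return support


def duplicate_probability(support, a, b, thresholds):
    """Share of threshold settings whose repair puts a and b together.

    Assumes every threshold in the grid is equally likely.
    """
    hits = {t for cluster, ts in support.items()
            if a in cluster and b in cluster for t in ts}
    return len(hits) / len(thresholds)


THRESHOLDS = [0.3, 0.5, 0.65, 0.9]
support = possible_repairs(RECORDS, THRESHOLDS)
print(duplicate_probability(support, 1, 2, THRESHOLDS))
```

Storing each cluster once, together with the settings under which it holds, avoids materializing every repair separately, and a question such as "how likely are records 1 and 2 to be duplicates?" can then be answered across all sampled settings rather than against a single chosen repair.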