DocumentCode
2079352
Title
ProbClean: A probabilistic duplicate detection system
Author
Beskales, George ; Soliman, Mohamed A. ; Ilyas, Ihab F. ; Ben-David, Shai ; Kim, Yubin
Author_Institution
Sch. of Comput. Sci., Univ. of Waterloo, Waterloo, ON, Canada
fYear
2010
fDate
1-6 March 2010
Firstpage
1193
Lastpage
1196
Abstract
One of the most prominent data quality problems is the existence of duplicate records. Current data cleaning systems usually produce one clean instance (repair) of the input data, by carefully choosing the parameters of the duplicate detection algorithms. Finding the right parameter settings can be hard, and in many cases, perfect settings do not exist. We propose ProbClean, a system that treats duplicate detection procedures as data processing tasks with uncertain outcomes. We use a novel uncertainty model that compactly encodes the space of possible repairs corresponding to different parameter settings. ProbClean efficiently supports relational queries and allows new types of queries against a set of possible repairs.
Keywords
data integrity; ProbClean; data cleaning systems; data quality problems; probabilistic duplicate detection system; Business; Cleaning; Computer science; Data mining; Data processing; Data warehouses; Detection algorithms; Query processing; Relational databases; Uncertainty
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Engineering (ICDE), 2010 IEEE 26th International Conference on
Conference_Location
Long Beach, CA
Print_ISBN
978-1-4244-5445-7
Electronic_ISBN
978-1-4244-5444-0
Type
conf
DOI
10.1109/ICDE.2010.5447744
Filename
5447744