DocumentCode :
2369314
Title :
Probabilistic noise identification and data cleaning
Author :
Kubica, Jeremy ; Moore, Andrew
Author_Institution :
Robotics Inst., Carnegie Mellon Univ., Pittsburgh, PA, USA
Year :
2003
Date :
19-22 Nov. 2003
Firstpage :
131
Lastpage :
138
Abstract :
Real-world data is never as perfect as we would like it to be and can often suffer from corruptions that may impact interpretations of the data, models created from the data, and decisions made based on the data. One approach to this problem is to identify and remove records that contain corruptions. Unfortunately, if only certain fields in a record have been corrupted, then usable, uncorrupted data will be lost. We present LENS, an approach for identifying corrupted fields and using the remaining noncorrupted fields for subsequent modeling and analysis. Our approach uses the data to learn a probabilistic model containing three components: a generative model of the clean records, a generative model of the noise values, and a probabilistic model of the corruption process. We provide an algorithm for the unsupervised discovery of such models and empirically evaluate both its performance at detecting corrupted fields and, as one example application, the resulting improvement this gives to a classifier.
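The three-component structure named in the abstract can be illustrated with a minimal sketch. It is not the LENS algorithm itself: the abstract does not give the distributions, so the Gaussian clean model, the uniform noise model, the fixed per-field corruption prior, the independence between fields, and every function and parameter name below are assumptions made for illustration. The parameters are given by hand here, whereas the paper learns its model from the data without supervision.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of the assumed clean-value model (a Gaussian) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def uniform_pdf(x, low, high):
    """Density of the assumed noise-value model (a uniform) at x."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def corruption_posterior(x, clean_mean, clean_std, noise_low, noise_high, p_corrupt):
    """Posterior probability that the observed value x is corrupted.

    Bayes' rule over two hypotheses: the field was corrupted (prior
    p_corrupt, value drawn from the noise model) or it was left clean
    (prior 1 - p_corrupt, value drawn from the clean model).
    """
    noisy = p_corrupt * uniform_pdf(x, noise_low, noise_high)
    clean = (1.0 - p_corrupt) * gaussian_pdf(x, clean_mean, clean_std)
    total = noisy + clean
    if total == 0.0:
        raise ValueError("value has zero density under both models")
    return noisy / total


def flag_fields(record, field_params, threshold=0.5):
    """Flag each field whose corruption posterior exceeds the threshold.

    Fields are treated independently, which is a simplification made
    for this sketch only.
    """
    return [
        corruption_posterior(x, *params) > threshold
        for x, params in zip(record, field_params)
    ]


# Hypothetical parameters for both fields: clean values ~ N(0, 1),
# noise values ~ U(-10, 10), and a 5% prior chance of corruption.
params = [(0.0, 1.0, -10.0, 10.0, 0.05)] * 2
flags = flag_fields([0.1, 9.5], params)
```

With these made-up parameters, only the second field is flagged, so the first field of the record remains usable for later modeling instead of the whole record being discarded.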
Keywords :
Gaussian noise; conformance testing; data mining; probability; corrupted field identifying; corruption process; data cleaning; generative model; noise value; probabilistic model; probabilistic noise identification; Cleaning; Data mining
Language :
English
Publisher :
IEEE
Conference_Title :
Data Mining, 2003. ICDM 2003. Third IEEE International Conference on
Print_ISBN :
0-7695-1978-4
Type :
conf
DOI :
10.1109/ICDM.2003.1250912
Filename :
1250912