Title :
A Practical Approach on Cleaning-Up Large Data Sets
Author :
Barat, Marius ; Prelipcean, Dumitru Bogdan ; Gavrilut, Dragos Teodor
Author_Institution :
Bitdefender Lab., “Al.I. Cuza” Univ., Iaşi, Romania
Abstract :
In this paper we propose a noise detection system based on similarities between instances. Given a data set whose instances belong to multiple classes, a noise instance denotes a wrongly classified record. The similarity between differently labeled instances is determined by computing distances between them using several standard metrics. To ensure that this approach is computationally feasible for very large data sets, we compute distances only between pairs of differently labeled instances that have a certain degree of similarity. This speed-up is made possible by a new clustering method called BDT Clustering, presented in this paper, which is based on a supervised learning algorithm.
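The core idea in the abstract, flagging pairs of instances that are very similar yet carry different labels, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes binary feature vectors, uses Hamming distance as one example of a standard metric, and compares all pairs by brute force rather than using BDT Clustering to narrow the candidates. The names `hamming`, `find_noise_candidates`, and `max_distance`, as well as the toy data, are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def find_noise_candidates(instances, labels, max_distance=0):
    """Return index pairs (i, j) whose labels differ but whose distance is small.

    Brute-force stand-in (assumption): every pair is compared, which is
    quadratic in the number of instances and only suitable for small data.
    """
    suspicious = []
    for i, j in combinations(range(len(instances)), 2):
        if labels[i] != labels[j] and hamming(instances[i], instances[j]) <= max_distance:
            suspicious.append((i, j))
    return suspicious


# Toy data, made up for illustration only.
instances = [
    (1, 0, 1, 1),
    (1, 0, 1, 1),  # identical to instance 0 but labeled differently
    (0, 1, 0, 0),
    (0, 1, 0, 1),
]
labels = ["clean", "malware", "malware", "malware"]

pairs = find_noise_candidates(instances, labels, max_distance=1)
print(pairs)
```

Each returned pair contains at least one record that may be mislabeled; deciding which of the two is the noise instance requires additional evidence that this sketch does not model.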
Keywords :
data handling; learning (artificial intelligence); pattern classification; pattern clustering; BDT clustering method; computing distances; label instances; large data set cleaning-up; noise detection system; supervised learning algorithm; Clustering algorithms; Clustering methods; Computer science; Machine learning algorithms; Malware; Measurement; Noise; clustering; data mining; decision making; machine learning; noise reduction;
Conference_Title :
Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2014 16th International Symposium on
Conference_Location :
Timisoara
Print_ISBN :
978-1-4799-8447-3
DOI :
10.1109/SYNASC.2014.45