Title :
Selective Sampling Designs to Improve the Performance of Classification Methods
Author :
Ghorbani, S. ; Desmarais, Michel C.
Author_Institution :
Comput. & Software Eng. Dept., Polytech., Montreal, QC, Canada
Abstract :
Selective Sampling design refers to the situation where a study has a fixed number of observations but can decide to allocate them differently among the variables during the data gathering phase, such that some variables will have a greater ratio of missing values than others. In particular, we can decide to allocate more or fewer missing values to uncertain variables: those for which the relative frequency is closer to 50% (higher uncertainty), as opposed to further from 50% (lower uncertainty). The main objective of the study is to investigate whether a Selective Sampling process helps improve the performance of classification methods. This study specifically asks: "Can Selective Sampling affect the performance of classification methods?" We focus on three classification models for binary datasets: Naïve Bayes, Logistic Regression and Tree Augmented Naïve Bayes (TAN). Three sampling schemes are defined: (1) Uniform (random samples) as a baseline, (2) Most Uncertain (higher sampling rate of uncertain items) and (3) Least Uncertain (lower sampling rate of uncertain items). We investigate the impact of these schemes on the performance of the three models on 11 different datasets. The results from 100-fold cross-validation show that Selective Sampling improves the prediction performance of the TAN model in all of the datasets and, in more than half of the datasets (54.6%), brings a higher prediction performance to the Naïve Bayes and Logistic Regression classifiers.
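The three sampling schemes described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the allocation rule, so everything here is an assumption made for illustration: uncertainty is measured by the binary entropy of each variable's relative frequency, observation rates are set proportional to that entropy (Most Uncertain) or to one minus it (Least Uncertain), and the names `binary_entropy`, `sampling_rates`, `apply_mask` and `budget_fraction` are hypothetical. It is not the authors' procedure.

```python
import numpy as np


def binary_entropy(p):
    """Entropy in bits of a binary variable with relative frequency p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # avoid log(0)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))


def sampling_rates(X, scheme, budget_fraction=0.5):
    """Per-variable observation rates for a binary data matrix X.

    Assumption: the mean rate equals budget_fraction, so the total number
    of observed cells is (approximately) the same for every scheme.
    Clipping to [0, 1] can lower the mean when rates are very uneven.
    """
    n_vars = X.shape[1]
    if scheme == "uniform":
        return np.full(n_vars, budget_fraction)
    h = binary_entropy(X.mean(axis=0))  # 1.0 at 50%, lower further away
    if scheme == "most_uncertain":
        weights = h            # sample uncertain variables more often
    elif scheme == "least_uncertain":
        weights = 1.0 - h      # sample uncertain variables less often
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    weights = weights + 1e-9   # keep the normalisation well defined
    rates = weights / weights.mean() * budget_fraction
    return np.clip(rates, 0.0, 1.0)


def apply_mask(X, rates, rng):
    """Return a float copy of X with unobserved cells set to NaN."""
    n_rows, n_vars = X.shape
    X_obs = X.astype(float).copy()
    for j in range(n_vars):
        n_missing = n_rows - int(round(rates[j] * n_rows))
        missing_rows = rng.choice(n_rows, size=n_missing, replace=False)
        X_obs[missing_rows, j] = np.nan
    return X_obs


# Usage: a toy binary dataset with one uncertain and two skewed variables.
rng = np.random.default_rng(0)
X = (rng.random((200, 3)) < np.array([0.5, 0.9, 0.1])).astype(int)
for scheme in ("uniform", "most_uncertain", "least_uncertain"):
    rates = sampling_rates(X, scheme)
    X_obs = apply_mask(X, rates, rng)
    print(scheme, np.round(rates, 2), int(np.isnan(X_obs).sum()))
```

The masked matrix could then be passed to any classifier that tolerates missing values, which is where the comparison of Naïve Bayes, Logistic Regression and TAN would take place.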
Keywords :
Bayes methods; data handling; pattern classification; performance evaluation; regression analysis; sampling methods; Naïve Bayes classifier; TAN model; binary datasets; classification method performance improvement; data gathering phase; least uncertain scheme; logistic regression classifier; most uncertain scheme; random samples; relative frequency; sampling rate; selective sampling designs; tree augmented Naive Bayes classifier; uncertain items; uncertain variables; uniform scheme; Computational modeling; Entropy; Logistics; Niobium; Predictive models; Testing; Training; Classification; Planned Missing Data Design; Selective Sampling;
Conference_Title :
Machine Learning and Applications (ICMLA), 2013 12th International Conference on
Conference_Location :
Miami, FL
DOI :
10.1109/ICMLA.2013.187