Title :
How the Choice of Wrapper Learner and Performance Metric Affects Subset Evaluation
Author :
Wald, Randall ; Khoshgoftaar, Taghi M. ; Napolitano, Antonio
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
Abstract :
Due to the widespread problem of high dimensionality (datasets with many features/independent attributes), feature selection has become an important research topic in many areas of machine learning. One form of feature selection, wrapper-based subset evaluation, has been the focus of a moderate amount of research, because its use of classification learners to search for optimal feature subsets has the potential to remove redundant features and to find subsets which directly improve classification performance. However, while the choice of learner to use within the wrapper framework has previously been studied, no paper has thoroughly investigated the role of the performance metric used within the wrapper process. Especially with imbalanced data (data where one class predominates over the others), traditional metrics such as accuracy can give a misleading view of how many instances from each class are mislabeled. While it seems intuitive that metrics which take class balance into account will affect the chosen features, no previous study has investigated this effect directly. In the present work, we test five different learners and five different performance metrics within the wrapper framework and use a newly-proposed variant of the Tanimoto Index to evaluate the similarity among the different choices of learner and metric while all other factors are held constant, using two datasets from the domain of social network profile mining. We find that while the Best Arithmetic Mean and Best Geometric Mean metrics (both of which compute the stated means of the True Positive Rate and True Negative Rate) are somewhat similar, they are still quite distinct, and no other metrics are particularly similar to one another. The five learners were also found to produce extremely dissimilar feature subsets. Thus, we show that the choice of both learner and metric has a major effect on which features are selected through wrapper-based feature selection.
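For intuition only, the sketch below (not from the paper) illustrates the two balanced metrics the abstract names and the standard Tanimoto (Jaccard) index over feature subsets; the authors' newly-proposed variant of the Tanimoto Index is not reproduced here, and all function names and values are illustrative assumptions.

```python
import math

def arithmetic_mean(tpr: float, tnr: float) -> float:
    """Arithmetic mean of True Positive Rate and True Negative Rate."""
    return (tpr + tnr) / 2.0

def geometric_mean(tpr: float, tnr: float) -> float:
    """Geometric mean of True Positive Rate and True Negative Rate."""
    return math.sqrt(tpr * tnr)

def tanimoto_index(subset_a: set, subset_b: set) -> float:
    """Standard Tanimoto/Jaccard similarity between two feature subsets
    (the paper uses a variant of this index, not shown here)."""
    if not subset_a and not subset_b:
        return 1.0  # treat two empty subsets as identical
    return len(subset_a & subset_b) / len(subset_a | subset_b)

# Example: on imbalanced data a majority-class classifier can score high
# accuracy while these balanced metrics remain low.
print(arithmetic_mean(tpr=0.10, tnr=0.99))                      # ~0.545
print(geometric_mean(tpr=0.10, tnr=0.99))                       # ~0.315
print(tanimoto_index({"f1", "f2", "f3"}, {"f2", "f3", "f4"}))   # 0.5
```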
Keywords :
data mining; learning (artificial intelligence); pattern classification; Tanimoto Index; arithmetic mean; classification learners; feature selection; geometric mean metrics; imbalanced data; machine learning; optimal feature subsets; performance metric; social network profile mining; subset evaluation; wrapper learner; wrapper process; Buildings; Feature extraction; Indexes; Measurement; Stability criteria; Support vector machines; Twitter; Wrapper feature selection; imbalanced data; performance metrics; similarity;
Conference_Titel :
Tools with Artificial Intelligence (ICTAI), 2013 IEEE 25th International Conference on
Conference_Location :
Herndon, VA
Print_ISBN :
978-1-4799-2971-9
DOI :
10.1109/ICTAI.2013.70