Title :
How ranker and learner choice affects classification performance on noisy bioinformatics data
Author :
Abu Shanab, Ahmad ; Khoshgoftaar, Taghi M. ; Wald, Randall ; Napolitano, Amri
Abstract :
One of the main characteristics of bioinformatics datasets is noise. Noise refers to incorrect or missing values in a dataset and has a detrimental effect on classification. In this study we evaluate the robustness of six classification algorithms and ten filter-based feature selection techniques, examining how each is affected by particularly challenging datasets in order to find the techniques that are less sensitive to class noise. To do so, we injected artificial noise (varying in both noise level and noise distribution) into 12 relatively noise-free bioinformatics datasets, creating a spectrum of noisy datasets with three levels of learning difficulty (Easy, Moderate, and Hard). We then used ten feature rankers from three different families, along with six classification techniques, to build predictive models. We found that the Random Forest 100 learner is the least sensitive to class noise, and is thus a good candidate for classification across all rankers and learning difficulty levels. Logistic Regression, on the other hand, gave the worst performance across all rankers and learning difficulty levels. Additionally, we found that some rankers ameliorated the difficulty of hard datasets for all learners other than Logistic Regression, meaning that with these rankers the datasets behave as though they are less noisy.
Keywords :
bioinformatics; feature selection; learning (artificial intelligence); pattern classification; random processes; regression analysis; artificial noise injection; classification algorithms; feature rankers; filter-based feature selection techniques; learner choice; learning difficulty; logistic regression; noise distribution; noise level; noise-free bioinformatic datasets; noisy bioinformatics data classification performance; random forest 100 learner; ranker choice; Gene expression; Logistics; Noise; Noise measurement; Predictive models; Support vector machines; Noise injection; difficulty of learning
Conference_Title :
Information Reuse and Integration (IRI), 2014 IEEE 15th International Conference on
DOI :
10.1109/IRI.2014.7051900