Simplifying the Utilization of Machine Learning Techniques for Bioinformatics

Author

Dittman, David J. ; Khoshgoftaar, Taghi M. ; Wald, Randall ; Napolitano, Antonio

Author_Institution

Florida Atlantic Univ., Boca Raton, FL, USA

Volume

2

fYear

2013

fDate

4-7 Dec. 2013

Firstpage

396

Lastpage

403

Abstract

Bioinformatics poses a number of challenges, such as handling datasets which exhibit extreme high dimensionality (a large number of features per sample) and datasets which are particularly difficult to work with. These datasets contain many pieces of data (features) which are irrelevant or redundant to the problem being studied, which makes analysis quite difficult. However, techniques from the domains of machine learning and data mining are well suited to combating these difficulties. Techniques like feature selection (choosing an optimal subset of features for subsequent analysis by removing irrelevant or redundant features) and classifiers (used to build inductive models in order to classify unknown instances) can assist researchers in working with such difficult datasets. Unfortunately, many practitioners of bioinformatics do not have the machine learning knowledge to choose the correct techniques in order to achieve good classification results. If the choices could be simplified or predetermined, then it would be easier to apply the techniques. This study is a comprehensive analysis of machine learning techniques on twenty-five bioinformatics datasets using six classifiers and twenty-four feature rankers. We analyzed the factors at each of four feature subset sizes, chosen for being large enough to be effective in creating inductive models but small enough to be of use for further research. Our results show that Random Forest with 100 trees is the top performing classifier and that the choice of feature ranker is of little importance as long as feature selection occurs. Statistical analysis confirms our results. Fixing these parameters in advance makes machine learning techniques more accessible to bioinformatics practitioners.
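
As a rough illustration of the ranker-based feature selection the abstract describes, the sketch below scores each feature with a simple signal-to-noise style statistic and keeps the top k. The abstract does not name the twenty-four rankers studied, so this score, the function names rank_features and select_top_k, and the toy data are all assumptions made for illustration, not the authors' method.

```python
from statistics import mean, pstdev


def rank_features(samples, labels):
    """Return feature indices ordered from most to least class-separating.

    Assumed score (not from the paper): |mean_pos - mean_neg| / (std_pos + std_neg).
    """
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, y in zip(samples, labels) if y == 1]
        neg = [row[j] for row, y in zip(samples, labels) if y == 0]
        gap = abs(mean(pos) - mean(neg))
        spread = pstdev(pos) + pstdev(neg)
        if spread > 0:
            score = gap / spread
        else:
            # Zero spread: perfectly separating if the means differ, else useless.
            score = float("inf") if gap > 0 else 0.0
        scores.append(score)
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def select_top_k(samples, labels, k):
    """Keep only the k best-ranked features, preserving their column order."""
    keep = sorted(rank_features(samples, labels)[:k])
    reduced = [[row[j] for j in keep] for row in samples]
    return keep, reduced


# Hypothetical toy data: 4 samples, 3 features; only feature 0 tracks the label.
samples = [
    [5.1, 0.2, 9.0],
    [4.9, 0.9, 9.1],
    [1.0, 0.3, 9.2],
    [1.2, 0.8, 8.9],
]
labels = [1, 1, 0, 0]

keep, reduced = select_top_k(samples, labels, k=1)
print("kept feature indices:", keep)
```

The reduced matrix would then be passed to a classifier; the abstract reports Random Forest with 100 trees as the top performer, which in scikit-learn, for instance, could be configured as RandomForestClassifier(n_estimators=100).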

Keywords

bioinformatics; data mining; feature selection; learning (artificial intelligence); pattern classification; statistical analysis; bioinformatics datasets; comprehensive analysis; feature rankers; high-dimensional data; inductive models; machine learning techniques; optimal feature subset; random forest classifier; unknown instance classification; Biological system modeling; DNA; Logistics; Lungs; Support vector machines; Vegetation; Classification

fLanguage

English

Publisher

IEEE

Conference_Title

Machine Learning and Applications (ICMLA), 2013 12th International Conference on

Conference_Location

Miami, FL

Type

conf

DOI

10.1109/ICMLA.2013.155

Filename

6786142