DocumentCode :
1829610
Title :
Simplifying the Utilization of Machine Learning Techniques for Bioinformatics
Author :
Dittman, David J. ; Khoshgoftaar, Taghi M. ; Wald, Randall ; Napolitano, Antonio
Author_Institution :
Florida Atlantic Univ., Boca Raton, FL, USA
Volume :
2
fYear :
2013
fDate :
4-7 Dec. 2013
Firstpage :
396
Lastpage :
403
Abstract :
The domain of bioinformatics poses a number of challenges, such as handling datasets which exhibit extremely high dimensionality (a large number of features per sample) and datasets which are particularly difficult to work with. These datasets contain many pieces of data (features) which are irrelevant or redundant to the problem being studied, which makes analysis quite difficult. However, techniques from the domains of machine learning and data mining are well suited to combating these difficulties. Techniques like feature selection (choosing an optimal subset of features for subsequent analysis by removing irrelevant or redundant features) and classifiers (used to build inductive models in order to classify unknown instances) can assist researchers in working with such difficult datasets. Unfortunately, many practitioners of bioinformatics do not have the machine learning knowledge needed to choose the correct techniques and achieve good classification results. If the choices could be simplified or predetermined, the techniques would be easier to apply. This study is a comprehensive analysis of machine learning techniques on twenty-five bioinformatics datasets using six classifiers and twenty-four feature rankers. We analyzed these factors at each of four feature subset sizes, chosen for being large enough to be effective in creating inductive models but small enough to be of use for further research. Our results show that Random Forest with 100 trees is the top-performing classifier and that the choice of feature ranker is of little importance as long as feature selection occurs. Statistical analysis confirms our results. Fixing these choices in advance makes machine learning techniques more accessible to bioinformatics practitioners.
Keywords :
bioinformatics; data mining; feature selection; learning (artificial intelligence); pattern classification; statistical analysis; bioinformatics datasets; comprehensive analysis; feature rankers; high-dimensional data; inductive models; machine learning techniques; optimal feature subset; random forest classifier; unknown instance classification; Biological system modeling; DNA; Logistics; Lungs; Support vector machines; Vegetation; Classification
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning and Applications (ICMLA), 2013 12th International Conference on
Conference_Location :
Miami, FL
Type :
conf
DOI :
10.1109/ICMLA.2013.155
Filename :
6786142