• DocumentCode
    1829610
  • Title
    Simplifying the Utilization of Machine Learning Techniques for Bioinformatics
  • Author
    Dittman, David J.; Khoshgoftaar, Taghi M.; Wald, Randall; Napolitano, Antonio
  • Author_Institution
    Florida Atlantic Univ., Boca Raton, FL, USA
  • Volume
    2
  • fYear
    2013
  • fDate
    4-7 Dec. 2013
  • Firstpage
    396
  • Lastpage
    403
  • Abstract
    The domain of bioinformatics faces a number of challenges, such as handling datasets that exhibit extreme levels of high dimensionality (a large number of features per sample) and datasets that are particularly difficult to work with. These datasets contain many pieces of data (features) that are irrelevant or redundant to the problem being studied, which makes analysis quite difficult. However, techniques from the domains of machine learning and data mining are well suited to combating these difficulties. Techniques like feature selection (choosing an optimal subset of features for subsequent analysis by removing irrelevant or redundant features) and classifiers (used to build inductive models in order to classify unknown instances) can assist researchers in working with such difficult datasets. Unfortunately, many practitioners of bioinformatics do not have the machine learning knowledge to choose the correct techniques in order to achieve good classification results. If the choices could be simplified or predetermined, it would be easier to apply the techniques. This study is a comprehensive analysis of machine learning techniques on twenty-five bioinformatics datasets using six classifiers and twenty-four feature rankers. We analyzed these factors at each of four feature subset sizes, chosen for being large enough to be effective in creating inductive models but small enough to be of use for further research. Our results show that Random Forest with 100 trees is the top-performing classifier and that the choice of feature ranker is of little importance as long as feature selection occurs. Statistical analysis confirms our results. Fixing these parameters in advance makes machine learning techniques more accessible to bioinformatics practitioners.
  • Keywords
    bioinformatics; data mining; feature selection; learning (artificial intelligence); pattern classification; statistical analysis; bioinformatics datasets; comprehensive analysis; feature rankers; high-dimensional data; inductive models; machine learning techniques; optimal feature subset; random forest classifier; unknown instance classification; Biological system modeling; DNA; Logistics; Lungs; Support vector machines; Vegetation; Classification
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Applications (ICMLA), 2013 12th International Conference on
  • Conference_Location
    Miami, FL
  • Type
    conf
  • DOI
    10.1109/ICMLA.2013.155
  • Filename
    6786142
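
The feature selection step described in the abstract (rank features, then keep a small top-ranked subset) can be illustrated with a minimal sketch. The abstract does not name the twenty-four rankers or the four subset sizes, so the signal-to-noise score, the helper names (`signal_to_noise`, `rank_features`, `select_top_k`), the toy data, and `k=1` below are all assumptions chosen for illustration, not the authors' method.

```python
from statistics import mean, pstdev


def signal_to_noise(values, labels):
    """Score one feature for a binary problem (labels 0/1).

    Assumed illustrative ranker: |mean_pos - mean_neg| / (std_pos + std_neg).
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    diff = abs(mean(pos) - mean(neg))
    spread = pstdev(pos) + pstdev(neg)
    if spread == 0:
        # Constant within both classes: perfectly separating or useless.
        return float("inf") if diff > 0 else 0.0
    return diff / spread


def rank_features(X, y):
    """Return feature indices ordered from highest to lowest score."""
    n_features = len(X[0])
    scores = [
        signal_to_noise([row[j] for row in X], y) for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


def select_top_k(X, ranking, k):
    """Keep only the k top-ranked feature columns of X."""
    keep = ranking[:k]
    return [[row[j] for j in keep] for row in X]


# Toy data (hypothetical): 6 samples, 3 features.
# Feature 0 separates the classes, feature 1 does not, feature 2 is noisy.
X = [
    [5.1, 0.2, 9.0],
    [4.9, 0.1, 1.0],
    [5.0, 0.3, 5.0],
    [1.0, 0.2, 8.0],
    [1.2, 0.1, 2.0],
    [0.9, 0.3, 4.0],
]
y = [1, 1, 1, 0, 0, 0]

ranking = rank_features(X, y)
X_reduced = select_top_k(X, ranking, k=1)
```

In a full pipeline, the reduced matrix would then be passed to a classifier; the abstract reports Random Forest with 100 trees as the top performer. That step is omitted here to keep the sketch free of third-party dependencies.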