• DocumentCode
    1135106
  • Title

    Feature Selection for Gene Expression Using Model-Based Entropy

  • Author

    Zhu, Shenghuo ; Wang, Dingding ; Yu, Kai ; Li, Tao ; Gong, Yihong

  • Author_Institution
    NEC Labs. America, Cupertino, CA, USA
  • Volume
    7
  • Issue
    1
  • fYear
    2010
  • Firstpage
    25
  • Lastpage
    36
  • Abstract
    Gene expression data usually contain a large number of genes but a small number of samples. Feature selection for gene expression data aims at finding a set of genes that best discriminates between biological samples of different types. Traditional machine-learning approaches to gene selection based on empirical mutual information suffer from data sparseness because of the small number of samples. To overcome the sparseness issue, we propose a model-based approach that estimates the entropy of the class variables on a fitted model instead of on the data themselves. Here, we use multivariate normal distributions to fit the data, because multivariate normal distributions have maximum entropy among all real-valued distributions with a specified mean and covariance and are widely used to approximate various distributions. If the data follow a multivariate normal distribution, the conditional distribution of the class variables given the selected features is also normal, so its entropy can be computed from the log-determinant of its covariance matrix. Because of the large number of genes, computing all possible log-determinants is not efficient. We propose several algorithms that greatly reduce the computational cost. Experiments on seven gene data sets and a comparison with five other approaches show the accuracy of the multivariate Gaussian generative model for feature selection and the efficiency of our algorithms.
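    The log-determinant computation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a naive greedy forward selection that recomputes the Gaussian conditional entropy H(Y | X_S) = 0.5 log det(Sigma_{Y|S}) + (k/2) log(2*pi*e) for every candidate gene, which is exactly the cost the paper's algorithms aim to reduce. The function names, the numeric encoding of the class variables as columns of `Y`, and the `ridge` regularizer are assumptions made for illustration.

```python
import numpy as np


def gaussian_conditional_entropy(cov, feat_idx, class_idx, ridge=1e-6):
    """Entropy of the class variables given selected features, assuming the
    joint distribution is multivariate normal with covariance `cov`.

    `ridge` is an illustrative regularizer (an assumption, not from the
    paper) that keeps the matrices invertible when samples are scarce.
    """
    feat_idx = list(feat_idx)
    class_idx = list(class_idx)
    k = len(class_idx)
    cond = cov[np.ix_(class_idx, class_idx)].astype(float)
    if feat_idx:
        s_ss = cov[np.ix_(feat_idx, feat_idx)] + ridge * np.eye(len(feat_idx))
        s_sy = cov[np.ix_(feat_idx, class_idx)]
        # Schur complement: covariance of Y conditioned on X_S.
        cond = cond - s_sy.T @ np.linalg.solve(s_ss, s_sy)
    cond = cond + ridge * np.eye(k)
    _, logdet = np.linalg.slogdet(cond)
    return 0.5 * logdet + 0.5 * k * np.log(2 * np.pi * np.e)


def greedy_select(X, Y, n_select, ridge=1e-6):
    """Naive forward selection: at each step add the feature that minimizes
    the conditional entropy of the class variables."""
    n_features = X.shape[1]
    # Joint sample covariance of features and class variables.
    cov = np.cov(np.hstack([X, Y]), rowvar=False)
    class_idx = list(range(n_features, n_features + Y.shape[1]))
    selected = []
    for _ in range(n_select):
        candidates = [j for j in range(n_features) if j not in selected]
        scores = [
            gaussian_conditional_entropy(cov, selected + [j], class_idx, ridge)
            for j in candidates
        ]
        selected.append(candidates[int(np.argmin(scores))])
    return selected
```

    With `X` as a samples-by-genes array and `Y` as a samples-by-class-variables array, `greedy_select(X, Y, n_select)` returns the column indices of the chosen genes in selection order.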
  • Keywords
    Gaussian distribution; bioinformatics; covariance matrices; entropy; genetics; molecular biophysics; physiological models; covariance matrix; data sparseness; empirical mutual information; feature selection; gene expression; machine learning; model-based entropy; multivariate Gaussian generative model; multivariate normal distributions; Bioinformatics (genome or protein) databases; Data mining; Feature extraction or construction; Feature selection; entropy; multivariate Gaussian generative model; Algorithms; Artificial Intelligence; Computer Simulation; Entropy; Gene Expression Profiling; Models, Genetic; Models, Statistical; Oligonucleotide Array Sequence Analysis; Pattern Recognition, Automated
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type

    jour

  • DOI
    10.1109/TCBB.2008.35
  • Filename
    4492763