Title :
Feature Selection and Clustering of Gene Expression Profiles Using Biological Knowledge
Author :
Mitra, Sushmita ; Ghosh, Sampreeti
Author_Institution :
Machine Intell. Unit, Indian Stat. Inst., Kolkata, India
Abstract :
This paper develops a novel feature selection algorithm governed by biological knowledge. Because gene expression data are high dimensional and redundant, dimensionality reduction is of prime concern. We employ the CLARANS algorithm (Clustering Large Applications based on RANdomized Search) for attribute clustering and dimensionality reduction based on gene ontology (GO) study. Feature selection with unsupervised learning is a difficult problem, because neither class labels nor any other guidance is available to the search. Determining the optimal number of clusters is another major issue that affects the resulting output. The use of GO analysis helps in the automated selection of biologically meaningful partitions. Tools such as Eisen plots and cluster profiles of these clusters help establish their coherence. Important representative features (or genes) are extracted from each correlated set of genes in such partitions. The algorithm is applied to high-dimensional Yeast cell-cycle, Human Multiple Tissues, and Leukemia microarray data. In the second pass, clustering on the reduced gene space validates that the inherent behavior of the original high-dimensional expression profiles is preserved. The reduced gene set forms a biologically meaningful gene space and, at the same time, decreases the computational burden. External validation of the reduced subspace, using various well-known classifiers, establishes the effectiveness of the proposed methodology.
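The attribute-clustering step described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a correlation-based distance between gene profiles (the abstract does not name the metric) and a fixed number of clusters `k`, whereas the paper selects the partition using GO analysis, which is omitted here. The function names and the parameters `num_local` and `max_neighbor` are illustrative; they mirror the two control parameters of the general CLARANS scheme (number of restarts and number of random neighbors examined before accepting a local minimum).

```python
import math
import random


def correlation_distance(a, b):
    """Return 1 - Pearson correlation between two expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 1.0  # assumption: treat a flat profile as uncorrelated
    return 1.0 - cov / math.sqrt(var_a * var_b)


def total_cost(dist, medoids):
    """Sum of each gene's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def clarans(dist, k, num_local=4, max_neighbor=50, seed=0):
    """Randomized medoid search over a precomputed distance matrix."""
    n = len(dist)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of genes")
    rng = random.Random(seed)
    best_medoids, best_cost = None, math.inf
    for _ in range(num_local):
        current = rng.sample(range(n), k)
        current_cost = total_cost(dist, current)
        failures = 0
        while failures < max_neighbor:
            # A neighbor differs from the current solution in one medoid.
            pos = rng.randrange(k)
            candidate = rng.choice([i for i in range(n) if i not in current])
            neighbor = current[:pos] + [candidate] + current[pos + 1:]
            neighbor_cost = total_cost(dist, neighbor)
            if neighbor_cost < current_cost:
                current, current_cost = neighbor, neighbor_cost
                failures = 0
            else:
                failures += 1
        if current_cost < best_cost:
            best_medoids, best_cost = sorted(current), current_cost
    return best_medoids, best_cost


def select_representative_genes(profiles, k, **kwargs):
    """Cluster genes and return (medoid indices, medoid label per gene)."""
    n = len(profiles)
    dist = [[correlation_distance(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    medoids, _ = clarans(dist, k, **kwargs)
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, labels


# Toy data (made up for illustration): three rising and three falling profiles.
profiles = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.5],
    [1.1, 2.0, 2.9, 4.2, 5.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [10.0, 8.0, 6.5, 4.0, 2.0],
    [5.2, 4.0, 3.1, 2.0, 0.9],
]
medoids, labels = select_representative_genes(profiles, k=2)
print("representative genes:", medoids)
print("cluster label per gene:", labels)
```

In this sketch the medoid of each cluster plays the role of the representative gene, so the reduced gene space is simply the rows of `profiles` indexed by `medoids`.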
Keywords :
biological tissues; biology computing; cellular biophysics; feature extraction; genetics; learning (artificial intelligence); ontologies (artificial intelligence); pattern clustering; CLARANS; GO study; attribute clustering; automatic biologically meaningful partition selection; biological knowledge; biologically meaningful gene space; clustering large applications based on randomized search; dimensionality reduction; feature extraction; feature selection algorithm; gene expression data; gene expression profile clustering; gene ontology study; gene space reduction; high dimensional data; high-dimensional expression profiles; high-dimensional yeast cell-cycle; human multiple tissues data; leukemia microarray data; redundant data; unsupervised learning; Clustering algorithms; Feature extraction; Gene expression; Indexes; Ontologies; Attribute clustering; clustering large applications based on RANdomized search (CLARANS); feature selection; gene ontology (GO); medoid;
Journal_Title :
Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on
DOI :
10.1109/TSMCC.2012.2209416