Title :
[Inside front cover]
Author :
Yuchun Tang ; Yan-Qing Zhang ; Zhen Huang
Author_Institution :
Secure Comput. Corp., Alpharetta
Abstract :
Extracting a subset of informative genes from microarray expression data is a critical data preparation step in cancer classification and other biological function analyses. Although many algorithms have been developed for this task, the support vector machine-recursive feature elimination (SVM-RFE) algorithm is one of the best gene feature selection algorithms. The SVM-RFE assumes that a smaller "filter-out" factor, which results in fewer gene features being eliminated in each recursion, should lead to extraction of a better gene subset. Our simulations have shown that this assumption is not always correct: the SVM-RFE is highly sensitive to the "filter-out" factor, which makes it an unstable algorithm. To select a set of key gene features for reliable prediction of cancer types or subtypes and for other applications, a new two-stage SVM-RFE algorithm has been developed. The first stage is designed to eliminate most of the irrelevant, redundant, and noisy genes while keeping information loss small. The second stage then performs a fine selection to obtain the final gene subset. The two-stage SVM-RFE overcomes the instability problem of the SVM-RFE and thereby achieves better algorithm utility. Our analysis of three publicly available microarray expression data sets demonstrates that the two-stage SVM-RFE is significantly more accurate and more reliable than the SVM-RFE and three correlation-based methods. Furthermore, the two-stage SVM-RFE is computationally efficient because its time complexity is O(d log_2 d), where d is the size of the original gene set.
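To make the role of the "filter-out" factor concrete, the sketch below lists the gene-set sizes visited by a coarse elimination stage followed by a fine one. It is only an illustration under stated assumptions, not the authors' implementation: the abstract does not define the factor precisely, so it is treated here as the fraction of remaining genes dropped per recursion, the second stage is assumed to drop one gene per recursion, and the function names and all sizes are hypothetical. No SVM is trained; only the elimination schedule is shown.

```python
def rfe_schedule(d, filter_out, target):
    """Gene-set sizes visited when a fraction `filter_out` of the
    remaining genes is dropped per recursion (assumed meaning of the
    "filter-out" factor), stopping once `target` genes remain."""
    if not 0 < filter_out < 1:
        raise ValueError("filter_out must lie strictly between 0 and 1")
    if not 0 < target <= d:
        raise ValueError("target must satisfy 0 < target <= d")
    sizes = [d]
    while sizes[-1] > target:
        current = sizes[-1]
        # Always drop at least one gene so the loop terminates.
        drop = max(1, int(current * filter_out))
        sizes.append(max(target, current - drop))
    return sizes


def two_stage_schedule(d, coarse_factor, stage_one_target, final_target):
    """Hypothetical two-stage schedule: coarse elimination down to
    `stage_one_target`, then one gene per recursion (an assumption)
    down to `final_target`."""
    coarse = rfe_schedule(d, coarse_factor, stage_one_target)
    fine = list(range(stage_one_target - 1, final_target - 1, -1))
    return coarse, fine


coarse, fine = two_stage_schedule(1000, 0.5, 100, 90)
print(coarse)     # sizes visited during the coarse stage
print(len(fine))  # number of single-gene eliminations in the fine stage
```

Under these assumptions, a smaller factor lengthens the schedule, since each recursion removes fewer genes and more recursions are needed to reach the same target.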
Keywords :
biology computing; cancer; genetics; medical computing; support vector machines; cancer classification; cancer type prediction; correlation-based methods; gene feature selection algorithms; microarray expression data analysis; support vector machine-recursive feature elimination; two-stage SVM-RFE gene selection strategy; Algorithms; Artificial Intelligence; Diagnosis, Computer-Assisted; Gene Expression Profiling; Humans; Neoplasm Proteins; Neoplasms; Oligonucleotide Array Sequence Analysis; Pattern Recognition, Automated; Tumor Markers, Biological
Journal_Title :
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
DOI :
10.1109/TCBB.2007.1028