Investigating the Efficacy of Nonlinear Dimensionality Reduction Schemes in Classifying Gene and Protein Expression Studies

Author

Lee, George ; Rodriguez, Carlos ; Madabhushi, Anant

Author_Institution

Dept. of Biomed. Eng., State Univ. of New Jersey, Piscataway, NJ

Volume

Issue

fYear

2008

Firstpage

368

Lastpage

384

Abstract

The recent explosion in procurement and availability of high-dimensional gene and protein expression profile data sets for cancer diagnostics has necessitated the development of sophisticated machine learning tools with which to analyze them. While some investigators are focused on identifying informative genes and proteins that play a role in specific diseases, other researchers have attempted instead to classify patients based on their expression profiles in order to prognosticate disease status. A major limitation in the ability to accurately classify these high-dimensional data sets stems from the "curse of dimensionality," occurring in situations where the number of genes or peptides significantly exceeds the total number of patient samples. Previous attempts at dealing with this issue have mostly centered on the use of a dimensionality reduction (DR) scheme, principal component analysis (PCA), to obtain a low-dimensional projection of the high-dimensional data. However, linear PCA and other linear DR methods, which rely on Euclidean distances to estimate object similarity, do not account for the inherent underlying nonlinear structure associated with most biomedical data. While some researchers have begun to explore nonlinear DR methods for computer vision problems such as face detection and recognition, to the best of our knowledge, few such attempts have been made for classification and visualization of high-dimensional biomedical data. The motivation behind this work is to identify the appropriate DR methods for analysis of high-dimensional gene and protein expression studies. Toward this end, we empirically and rigorously compare three nonlinear (Isomap, Locally Linear Embedding, and Laplacian Eigenmaps) and three linear DR schemes (PCA, Linear Discriminant Analysis, and Multidimensional Scaling) with the intent of determining a reduced subspace representation in which the individual object classes are more easily discriminable.
Owing to the inherent nonlinear structure of gene and protein expression studies, our claim is that the nonlinear DR methods provide a more truthful low-dimensional representation of the data compared to the linear DR schemes. Evaluation of the DR schemes was done by 1) assessing the discriminability of two supervised classifiers (Support Vector Machine and C4.5 Decision Trees) in the different low-dimensional data embeddings and 2) using five cluster validity measures to evaluate the size, distance, and tightness of object aggregates in the low-dimensional space. For each of the seven evaluation measures considered, a statistically significant improvement in the quality of the embeddings across 10 cancer data sets was observed when the three nonlinear DR schemes were used instead of the three linear DR techniques. Similar trends were observed when linear and nonlinear DR was applied to the high-dimensional data following feature pruning to isolate the most informative features. Qualitative evaluation of the low-dimensional data embedding obtained via the six DR methods further suggests that the nonlinear schemes are better able to identify potential novel classes (e.g., cancer subtypes) within the data.
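
The abstract's point that Euclidean distances can misrepresent similarity on nonlinearly structured data can be illustrated with a small sketch. It is not the authors' implementation and uses no data from the study: it builds a toy "open ring" of 2-D points and compares the straight-line distance between the two ends with a graph-based geodesic estimate, which is the kind of distance Isomap starts from (a neighborhood graph followed by shortest paths). All function names, the toy data, and the choice of k = 2 are assumptions made for illustration only.

```python
import heapq
import math


def make_open_ring(n=34, gap_degrees=30.0):
    """Toy data (hypothetical): n points on a unit circle with a gap, so the
    two ends are close in the plane but far apart along the curve."""
    gap = math.radians(gap_degrees)
    step = (2 * math.pi - gap) / (n - 1)
    return [(math.cos(gap / 2 + i * step), math.sin(gap / 2 + i * step))
            for i in range(n)]


def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph with Euclidean edge weights."""
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            d = math.dist(points[i], points[j])
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic_from(graph, source):
    """Dijkstra: shortest-path length from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale queue entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


if __name__ == "__main__":
    pts = make_open_ring()
    graph = knn_graph(pts, k=2)
    geodesic = geodesic_from(graph, 0)[len(pts) - 1]
    straight = math.dist(pts[0], pts[-1])
    print(f"Euclidean distance between the ends: {straight:.3f}")
    print(f"Graph geodesic between the ends:     {geodesic:.3f}")
```

On this toy curve the two end points look like near neighbors under the Euclidean metric, while the graph geodesic follows the curve and places them far apart. A linear projection built on the former would treat them as similar; a geodesic-based embedding would not.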

Keywords

biology computing; cancer; computer vision; data visualization; decision trees; genetics; learning (artificial intelligence); molecular biophysics; pattern clustering; principal component analysis; proteins; support vector machines; C4.5 decision trees; Isomap; Laplacian Eigenmaps; biomedical data processing; cancer diagnostics; gene classification; gene expression; linear PCA; linear discriminant analysis; locally linear embedding; machine learning; machine learning tools; multidimensional scaling; nonlinear dimensionality reduction schemes; peptides; prognosticate disease status; protein expression; reduced subspace representation; Bioinformatics (genome or protein) databases; Clustering; Data and knowledge visualization; Data mining; Feature extraction or construction; association rules; classification; Algorithms; Data Interpretation, Statistical; Gene Expression Profiling; Nonlinear Dynamics; Pattern Recognition, Automated; Reproducibility of Results; Sensitivity and Specificity; Software

fLanguage

English

Journal_Title

Computational Biology and Bioinformatics, IEEE/ACM Transactions on

Publisher

IEEE

ISSN

1545-5963

Type

jour

DOI

10.1109/TCBB.2008.36

Filename

4492764

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=1135117