Title :
A Comparative Study on Feature Extraction from Protein Sequences for Subcellular Localization Prediction
Author :
Yang, Wen-Yun ; Lu, Bao-Liang ; Yang, Yang
Author_Institution :
Dept. of Comput. Sci. & Eng., Shanghai Jiao Tong Univ.
Abstract :
One of the central problems in computational biology is identifying protein function in an automated, high-throughput fashion. A key step in this process is predicting the subcellular compartment to which a protein belongs, since protein localization correlates closely with function. A wide variety of methods for predicting protein subcellular localization have been proposed in recent years. They fall into two categories: sequence-based and database-based. Sequence-based methods extract useful features from amino acid sequences and strive to discover the principles behind the protein localization process. Database-based methods instead mine existing public annotation databases. This paper focuses on the sequence-based approach and exploits the discriminative information contained in amino acid sequences for protein subcellular localization. Using support vector machines (SVMs) as predictors, we compare the amino acid composition approach, the amino acid tuple approach, a voting scheme, and a new characteristic representation of proteins proposed in this paper. Our experiments are carried out on 7579 eukaryotic protein sequences from 12 subcellular locations. The highest accuracy, 82.8% under 5-fold cross-validation, is obtained by the voting scheme using five predictors. This is the best performance achieved on this dataset using a sequence-based approach. Our experiments demonstrate that there is considerable potential for improving prediction accuracy by exploiting protein sequences, which have not been fully utilized so far, and that further exploration is still needed in this direction.
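The feature types named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `composition`, `tuple_features`, and `majority_vote` are hypothetical, the tuple length `k=2` is an assumed default, and the simple majority rule is only one possible voting scheme. The SVM training step and the new representation proposed in the paper are not shown.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq):
    """Amino acid composition: fraction of each of the 20 standard residues."""
    counts = Counter(seq)
    # Non-standard symbols are ignored; "or 1" avoids division by zero.
    total = sum(counts[a] for a in AMINO_ACIDS) or 1
    return [counts[a] / total for a in AMINO_ACIDS]


def tuple_features(seq, k=2):
    """Frequencies of contiguous k-tuples of residues (k=2 is an assumption)."""
    tuples = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[t] for t in tuples) or 1
    return [counts[t] / total for t in tuples]


def majority_vote(labels):
    """Combine predictor outputs by simple majority (assumed voting rule).

    Ties go to the label that appears first in the input.
    """
    return Counter(labels).most_common(1)[0][0]
```

In this sketch each feature vector would be passed to its own classifier, and `majority_vote` would combine the locations predicted for one protein by the individual classifiers.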
Keywords :
biology computing; cellular biophysics; data mining; feature extraction; proteins; sequences; amino acid sequences; amino acid tuple approach; computational biology; eukaryotic protein sequences; protein function; protein localization process; public annotation databases; subcellular compartment; subcellular localization prediction; support vector machines; Accuracy; Amino acids; Spatial databases; Voting
Conference_Title :
Computational Intelligence and Bioinformatics and Computational Biology, 2006. CIBCB '06. 2006 IEEE Symposium on
Conference_Location :
Toronto, Ont.
Print_ISBN :
1-4244-0623-4
Electronic_ISBN :
1-4244-0624-2
DOI :
10.1109/CIBCB.2006.330991