DocumentCode :
167293
Title :
Multiclass unbalanced protein data classification using sequence features
Author :
Vani, K. Suvarna ; Sravani, T.D.
Author_Institution :
Dept. of Comput. Sci. & Eng., V.R. Siddhartha Eng. Coll., Vijayawada, India
fYear :
2014
fDate :
21-24 May 2014
Firstpage :
1
Lastpage :
8
Abstract :
Protein fold classification is one of the challenging problems in bioinformatics. This work addresses protein fold classification using sequence features, a multi-class problem with unbalanced classes. A simple and computationally inexpensive feature extraction algorithm is proposed to extract novel features from the primary sequences. It is found that the Support Vector Machine (SVM), which can be effectively extended from a binary to a multi-class classifier, does not perform well on this problem. Hence, to boost performance, the SMOTE oversampling technique of Chawla et al. [17] is applied to rebalance the data set, and classifiers such as the J48 [15] decision tree are then used to classify folds from the sequence features. The classification is performed across the four major protein structural classes as well as among the different folds within the classes. The results obtained are promising, validating the simple methodology of boosting to obtain improved performance on the fold classification problem using features derived from the sequences alone. The approach is to extract features from the protein sequences and apply the extracted feature set to the improved oversampling method, which reduces the imbalance present in the extracted feature set. To handle the multiple classes, we use boosting algorithms such as AdaBoost and LogitBoost, which handle multi-class data sets effectively.
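The rebalancing step mentioned in the abstract can be illustrated with a minimal sketch of SMOTE-style oversampling: each synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its nearest minority-class neighbours. This is not the authors' implementation. The function name `smote_like`, its parameters, and the toy feature vectors are assumptions made for illustration only.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.
    Hypothetical helper, not taken from the paper."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (t - b) for b, t in zip(base, target)])
    return synthetic


# Toy 2-D vectors standing in for sequence-derived features (made-up values).
minority = [[0.10, 0.20], [0.15, 0.25], [0.20, 0.22], [0.12, 0.30]]
new_points = smote_like(minority, n_new=6)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies. The rebalanced set would then be passed to a classifier such as a decision tree or a boosted ensemble.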
Keywords :
bioinformatics; decision trees; feature extraction; learning (artificial intelligence); molecular biophysics; molecular configurations; pattern classification; proteins; support vector machines; Adaboost; J48; Logitboost; SMOTE technique; SVM; bioinformatics; boosting algorithm; computationally inexpensive algorithm; dataset rebalance; decision tree classifier; feature extraction algorithm; fold classification problem; major protein structural classes; multiclass classifier; multiclass problem; multiclass unbalanced protein data classification; oversampling method; protein fold classification; sequence features; support vector machine; unbalanced classes; Accuracy; Amino acids; Boosting; Clustering algorithms; Feature extraction; Proteins; Vectors; AdaBoost; Feature Extraction; LogitBoost; Oversampling; Protein fold classification; SMOTE; Unbalanced data;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Intelligence in Bioinformatics and Computational Biology, 2014 IEEE Conference on
Conference_Location :
Honolulu, HI
Type :
conf
DOI :
10.1109/CIBCB.2014.6845517
Filename :
6845517