• DocumentCode
    3394079
  • Title
    A method for improving protein localization prediction from datasets with outliers
  • Author
    Tian, Jiang ; Gu, Hong ; Liu, Wenqi
  • Author_Institution
    Sch. of Electron. & Inf. Eng., Dalian Univ. of Technol., Dalian
  • fYear
    2009
  • fDate
    March 30 2009-April 2 2009
  • Firstpage
    100
  • Lastpage
    105
  • Abstract
    Large-scale genome analysis and drug discovery require an automated prediction method for protein subcellular localization, and Support Vector Machines (SVMs) effectively solve this problem in a supervised manner. However, the protein subcellular localization datasets obtained from experiments always contain outliers, which can lead to poor generalization ability and classification accuracy. To address this issue, we first analyzed the influence of Principal Component Analysis (PCA) on classification performance, and then proposed a hybrid method for predicting protein subcellular localization based on a Weighted Support Vector Machine (WSVM) and PCA. Different weights were assigned to different data points, so that the training algorithm could learn the decision boundary according to the relative importance of the data points. After dimension reduction was performed on the datasets, kernel-based possibilistic c-means (KPCM) was chosen to generate the weights for this algorithm, as it produces relatively high values for important data points but low values for outliers. Experimental results on a benchmark dataset are promising and confirm the effectiveness of the proposed method in terms of prediction accuracy.
  • Keywords
    biology computing; data reduction; genomics; learning (artificial intelligence); medical computing; molecular biophysics; pattern classification; principal component analysis; proteins; support vector machines; KPCM; PCA; WSVM; automated prediction method; classification performance; dataset outliers; decision boundary learning; dimension reduction operations; drug discovery; kernel-based possibilistic c-means; large-scale genome analysis; principal component analysis; protein localization prediction; protein subcellular localization; training algorithm; weighted support vector machine; Bioinformatics; Drugs; Genomics; Large-scale systems; Performance analysis; Prediction methods; Principal component analysis; Proteins; Support vector machine classification; Support vector machines;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computational Intelligence in Bioinformatics and Computational Biology, 2009. CIBCB '09. IEEE Symposium on
  • Conference_Location
    Nashville, TN
  • Print_ISBN
    978-1-4244-2756-7
  • Type
    conf
  • DOI
    10.1109/CIBCB.2009.4925714
  • Filename
    4925714
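
The weighting idea in the abstract (high weights for typical points, low weights for outliers) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single prototype per class, a Gaussian kernel, the standard possibilistic membership formula, and a scale parameter `eta` set to the mean initial kernel distance. The function names and the defaults `sigma`, `m`, and `n_iter` are hypothetical choices for illustration.

```python
import math


def gaussian_kernel(x, v, sigma):
    """Gaussian (RBF) kernel between two points given as sequences of floats."""
    sq = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-sq / (2.0 * sigma ** 2))


def kernel_distances(points, v, sigma):
    """Kernel-induced squared distance: ||phi(x) - phi(v)||^2 = 2 * (1 - K(x, v))."""
    return [2.0 * (1.0 - gaussian_kernel(p, v, sigma)) for p in points]


def possibilistic_weights(points, sigma=1.0, m=2.0, n_iter=20):
    """Typicality weights in (0, 1] for the points of ONE class.

    Simplified one-prototype variant of kernel possibilistic c-means;
    assumes m > 1 and a non-empty list of equal-length points.
    """
    n, dim = len(points), len(points[0])
    # Start the prototype at the plain mean of the class.
    v = [sum(p[d] for p in points) / n for d in range(dim)]
    # Assumption: eta is fixed to the mean initial kernel distance.
    eta = sum(kernel_distances(points, v, sigma)) / n
    if eta == 0.0:  # all points coincide, so every point is fully typical
        return [1.0] * n

    def typicality(dists):
        return [1.0 / (1.0 + (d / eta) ** (1.0 / (m - 1.0))) for d in dists]

    for _ in range(n_iter):
        u = typicality(kernel_distances(points, v, sigma))
        # Prototype update: points that are typical and close dominate.
        coef = [(ui ** m) * gaussian_kernel(p, v, sigma)
                for ui, p in zip(u, points)]
        total = sum(coef)
        if total == 0.0:  # numerical underflow guard
            break
        v = [sum(c * p[d] for c, p in zip(coef, points)) / total
             for d in range(dim)]
    return typicality(kernel_distances(points, v, sigma))


if __name__ == "__main__":
    # Four clustered points and one far-away outlier (toy data, not from the paper).
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
    for point, weight in zip(data, possibilistic_weights(data)):
        print(point, round(weight, 3))
```

In a full pipeline along the lines the abstract describes, these weights would be computed per class on the PCA-reduced features and then passed to a weighted SVM trainer, for example through the `sample_weight` argument of scikit-learn's `SVC.fit`.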