• DocumentCode
    33697
  • Title

    A Partial Least Squares Based Procedure for Upstream Sequence Classification in Prokaryotes

  • Author

    Mehmood, Tahir ; Bohlin, Jon ; Snipen, Lars

  • Author_Institution
    Dept. of Chem., Biotechnol. & Food Sci., Norwegian Univ. of Life Sci., Akershous, Norway
  • Volume
    12
  • Issue
    3
  • fYear
    2015
  • fDate
    May-June 1 2015
  • Firstpage
    560
  • Lastpage
    567
  • Abstract
    The upstream region of coding genes is important for several reasons, for instance for locating transcription factor binding sites and start site initiation in genomic DNA. Motivated by a recent study in which a multivariate approach was successfully applied to coding sequence modeling, we introduce a partial least squares (PLS) based procedure for classifying true upstream prokaryotic sequences against background upstream sequences. The analysis considered the upstream sequences of coding genes conserved across genomes, where the conserved coding genes were identified using the pan-genomics concept for each prokaryotic species considered. PLS uses a position specific scoring matrix (PSSM) to study the characteristics of the upstream region. Results obtained by the PLS-based method were compared with those from the Gini importance of random forest (RF) and from support vector machine (SVM), a widely used method for sequence classification. Upstream sequence classification performance was evaluated using cross validation, and the suggested approach identifies the prokaryotic upstream region significantly better than RF (p-value < 0.01) and SVM (p-value < 0.01). Further, the proposed method produced results that concur with known biological characteristics of the upstream region.
  • Keywords
    DNA; bioinformatics; cellular biophysics; classification; genetics; genomics; least squares approximations; molecular biophysics; molecular configurations; support vector machines; Gini importance; PLS; PSSM; RF; SVM; background upstream sequence; binding sites; coding genes; coding sequence modeling; genomic DNA; pan-genomics concept; partial least squares; position specific scoring matrix; prokaryotes; random forest; site initiation; support vector machine; transcription factor; true upstream prokaryotic sequence; upstream sequence classification; Bioinformatics; Encoding; Genomics; Radio frequency; Strain; Support vector machines; Vectors; Partial Least Squares; Partial least squares; classification; prokaryotes;
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type

    jour

  • DOI
    10.1109/TCBB.2014.2366146
  • Filename
    6951340