DocumentCode
33697
Title
A Partial Least Squares Based Procedure for Upstream Sequence Classification in Prokaryotes
Author
Mehmood, Tahir ; Bohlin, Jon ; Snipen, Lars
Author_Institution
Dept. of Chem., Biotechnol. & Food Sci., Norwegian Univ. of Life Sci., Akershus, Norway
Volume
12
Issue
3
fYear
2015
fDate
May-June 1 2015
Firstpage
560
Lastpage
567
Abstract
The upstream region of coding genes is important for several reasons, for instance for locating transcription factor binding sites and initiation start sites in genomic DNA. Motivated by a recent study in which a multivariate approach was successfully applied to coding sequence modeling, we introduce a partial least squares (PLS) based procedure for classifying true upstream prokaryotic sequences against background upstream sequences. The analysis considered the upstream sequences of coding genes conserved across genomes, where the conserved coding genes were identified using the pan-genomics concept for each prokaryotic species considered. PLS uses a position specific scoring matrix (PSSM) to study the characteristics of the upstream region. Results obtained with the PLS based method were compared with the Gini importance of random forest (RF) and with support vector machine (SVM), a widely used method for sequence classification. Classification performance was evaluated using cross validation, and the suggested approach identifies prokaryotic upstream regions significantly better than RF (p-value < 0.01) and SVM (p-value < 0.01). Further, the proposed method produced results that concurred with known biological characteristics of the upstream region.
Keywords
DNA; bioinformatics; cellular biophysics; classification; genetics; genomics; least squares approximations; molecular biophysics; molecular configurations; support vector machines; Gini importance; PLS; PSSM; RF; SVM; background upstream sequence; binding sites; coding genes; coding sequence modeling; genomic DNA; pan-genomics concept; partial least squares; position specific scoring matrix; prokaryotes; random forest; site initiation; support vector machine; transcription factor; true upstream prokaryotic sequence; upstream sequence classification; Bioinformatics; Encoding; Genomics; Radio frequency; Strain; Support vector machines; Vectors; Partial Least Squares; Partial least squares; classification; prokaryotes;
fLanguage
English
Journal_Title
Computational Biology and Bioinformatics, IEEE/ACM Transactions on
Publisher
ieee
ISSN
1545-5963
Type
jour
DOI
10.1109/TCBB.2014.2366146
Filename
6951340
Link To Document