DocumentCode
1848361
Title
Training Set Reduction Methods for Protein Secondary Structure Prediction in Single-Sequence Condition
Author
Aydin, Z. ; Altunbasak, Y. ; Pakatci, I.K. ; Erdogan, H.
Author_Institution
Georgia Inst. of Technol., Atlanta
fYear
2007
fDate
22-26 Aug. 2007
Firstpage
5025
Lastpage
5028
Abstract
Orphan proteins are characterized by the lack of significant sequence similarity to database proteins. To infer the functional properties of orphans, more elaborate techniques that utilize structural information are required, so protein structure prediction gains considerable importance. Secondary structure prediction algorithms designed for orphan proteins (also known as single-sequence algorithms) cannot utilize multiple alignments or alignment profiles, which are derived from similar proteins, and this limits prediction accuracy. One way to improve the performance of a single-sequence algorithm is re-training. In this approach, the models used by the algorithm are first trained on a representative set of proteins and a secondary structure prediction is computed. Then, using a distance measure, the original training set is refined by removing proteins that are dissimilar to the given protein. This step is followed by re-estimation of the model parameters and a new prediction of the secondary structure. In this paper, we compare training set reduction methods used to re-train the hidden semi-Markov models employed by the IPSSP algorithm [1]. We found that the composition-based reduction method performs best, ahead of the alignment-based and Chou-Fasman-based reduction methods. In addition, threshold-based reduction performed better than the reduction technique that selects the first 80% of the dataset proteins.
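The training set reduction step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Euclidean distance between amino acid frequency vectors, the function names, and the reading of "first 80%" as the 80% of training proteins closest to the query are all assumptions, since the abstract does not specify them.

```python
import math
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the amino acid frequency vector of a sequence."""
    counts = Counter(c for c in seq if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def composition_distance(seq_a, seq_b):
    """Euclidean distance between compositions (an assumed distance measure)."""
    pa, pb = composition(seq_a), composition(seq_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(pa, pb)))


def reduce_by_threshold(query, training, max_distance):
    """Keep training proteins whose distance to the query is within a threshold."""
    return [s for s in training if composition_distance(query, s) <= max_distance]


def reduce_by_fraction(query, training, fraction=0.8):
    """Keep the given fraction of training proteins closest to the query."""
    ranked = sorted(training, key=lambda s: composition_distance(query, s))
    keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:keep]


# Toy sequences, invented for illustration only.
query = "AAAA"
training = ["CCCC", "AAAA", "AACC", "AAAC", "ACCC"]
by_threshold = reduce_by_threshold(query, training, max_distance=0.5)
by_fraction = reduce_by_fraction(query, training, fraction=0.8)
```

In a re-training loop, the reduced set would then be used to re-estimate the model parameters before predicting the query protein's secondary structure again. The threshold variant adapts the size of the reduced set to each query, whereas the fixed-fraction variant always keeps the same number of proteins.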
Keywords
hidden Markov models; molecular biophysics; proteins; Chou-Fasman based reduction; hidden semi-Markov models; orphan proteins; secondary structure prediction; single-sequence condition; training set reduction; Accuracy; Amino acids; Data engineering; Databases; Hidden Markov models; Machine learning; Prediction algorithms; Prediction methods; Predictive models; Protein engineering; Algorithms; Amino Acid Sequence; Predictive Value of Tests; Protein Structure, Secondary; Proteins; Sequence Alignment
fLanguage
English
Publisher
IEEE
Conference_Titel
Engineering in Medicine and Biology Society, 2007. EMBS 2007. 29th Annual International Conference of the IEEE
Conference_Location
Lyon
ISSN
1557-170X
Print_ISBN
978-1-4244-0787-3
Type
conf
DOI
10.1109/IEMBS.2007.4353469
Filename
4353469