Title :
Jointly Encoding Protein Sequences and their Secondary Structure Information
Author :
Hategan, Andrea ; Tabus, Ioan
Author_Institution :
Tampere Univ. of Technol., Tampere
Abstract :
In this paper we study the problem of jointly encoding the amino acid sequence and the secondary structure information of proteins, in the format in which a growing number of proteins are currently stored in the Swiss-Prot database. The new method, dubbed ProtCompSecS, combines the compressor ProtComp, previously designed only for amino acid sequences, with a dictionary-based method, where the dictionary containing the patterns for representing the secondary structure is obtained by suitably processing the Dictionary of Protein Secondary Structure (DSSP) database. We experimented with the protein sequences of 14 complete proteomes. When comparing the performance of the ProtCompSecS algorithm with that of the ProtComp algorithm on those sequences that have annotated secondary structure information, we found, surprisingly, that encoding both the sequence and the secondary structure information is more efficient than encoding the protein sequence alone (without knowledge of the secondary structure). This is a strong argument for claiming that the secondary structure has a high descriptive value for modeling and understanding the primary structure (the amino acid sequence) of a protein.
Keywords :
biology computing; database management systems; molecular biophysics; proteins; ProtCompSecS algorithm; Swiss-Prot database; amino acid sequence; dictionary based method; protein sequences; secondary structure information; Amino acids; Biological information theory; Compression algorithms; Databases; Decision support systems; Dictionaries; Encoding; Proteins; Sequences; Signal processing algorithms
Conference_Title :
Genomic Signal Processing and Statistics, 2007. GENSIPS 2007. IEEE International Workshop on
Conference_Location :
Tuusula
Print_ISBN :
978-1-4244-0998-3
Electronic_ISBN :
978-1-4244-0999-0
DOI :
10.1109/GENSIPS.2007.4365849