A Sequence Data Mining Protocol to Identify Best Representative Sequence for Protein Domain Families

Author

Gowri, V.S. ; Shameer, Khader ; Reddy, Chilamakuri Chandra Sekhar ; Shingate, Prashant ; Sowdhamini, Ramanathan

Author_Institution

Nat. Centre for Biol. Sci. (TIFR), Bangalore, India

fYear

2010

fDate

13 Dec. 2010

Firstpage

703

Lastpage

710

Abstract

Protein domains are the compact, evolutionarily conserved units of proteins that can be used to associate function with the large number of gene products arising from whole-genome sequencing projects. Homology, inferred from sequence similarity, is usually the basis for transferring function annotation from pre-existing domain families to gene products. Sequence analysis protocols typically use a reference sequence from each family for homology searches, which reduces computational time in such large-scale data mining processes. Because protein domain families are diverse, it is important to identify a single best representative sequence from each family using a well-defined, reproducible bioinformatics protocol. We report a new bioinformatics protocol that can be used to identify the best representative sequence (BRS) of a protein domain family. The method is based on a “coverage analysis” score implemented using three different sequence search programs, and the trends obtained in reporting the best representative sequence are assessed. The highest average coverage for BRPs was 66% when searched using Hidden Markov Models. Further, it is crucial to select a BRS specific to the sequence search method when searching large sequence databases.

Keywords

bioinformatics; data mining; genomics; hidden Markov models; proteins; best representative sequence; bioinformatics protocol; coverage analysis; gene product; genome sequencing project; hidden Markov model; protein domain families; sequence analysis protocol; sequence data mining protocol; protein domain; protein family; sequence analysis; sequence data mining

fLanguage

English

Publisher

IEEE

Conference_Titel

Data Mining Workshops (ICDMW), 2010 IEEE International Conference on

Conference_Location

Sydney, NSW

Print_ISBN

978-1-4244-9244-2

Electronic_ISBN

978-0-7695-4257-7

Type

conf

DOI

10.1109/ICDMW.2010.153

Filename

5693365