DocumentCode
2192780
Title
A Sequence Data Mining Protocol to Identify Best Representative Sequence for Protein Domain Families
Author
Gowri, V.S. ; Shameer, Khader ; Reddy, Chilamakuri Chandra Sekhar ; Shingate, Prashant ; Sowdhamini, Ramanathan
Author_Institution
National Centre for Biological Sciences (TIFR), Bangalore, India
fYear
2010
fDate
13 Dec. 2010
Firstpage
703
Lastpage
710
Abstract
Protein domains are the compact, evolutionarily conserved units of proteins that can be used to associate function with the large number of gene products arising from whole-genome sequencing projects. Homology, inferred from sequence similarity, is usually the basis for transferring function annotation from pre-existing domain families to gene products. Sequence analysis protocols rely on a reference sequence for each family in homology searches, which reduces computational time in such large-scale data mining. As protein domain families are diverse in nature, it is important to identify a single best representative sequence from each family using a well-defined, reproducible bioinformatics protocol. We report a new bioinformatics protocol that can be used to identify the best representative sequence (BRS) of a protein domain family. The method is based on a “coverage analysis” score computed using three different sequence search programs, and the trends obtained in reporting the best representative sequence are assessed. The highest average coverage for BRSs was 66%, obtained when searching with Hidden Markov Models. Further, it is crucial to select a BRS specific to the sequence search method when searching large sequence databases.
Keywords
bioinformatics; data mining; genomics; hidden Markov models; proteins; best representative sequence; bioinformatics protocol; coverage analysis; gene product; genome sequencing project; protein domain families; sequence analysis protocol; sequence data mining protocol; protein domain; protein family; sequence analysis; sequence data mining
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Mining Workshops (ICDMW), 2010 IEEE International Conference on
Conference_Location
Sydney, NSW
Print_ISBN
978-1-4244-9244-2
Electronic_ISBN
978-0-7695-4257-7
Type
conf
DOI
10.1109/ICDMW.2010.153
Filename
5693365