• DocumentCode
    2192780
  • Title
    A Sequence Data Mining Protocol to Identify Best Representative Sequence for Protein Domain Families
  • Author
    Gowri, V.S. ; Shameer, Khader ; Reddy, Chilamakuri Chandra Sekhar ; Shingate, Prashant ; Sowdhamini, Ramanathan
  • Author_Institution
    Nat. Centre for Biol. Sci. (TIFR), Bangalore, India
  • fYear
    2010
  • fDate
    13 Dec. 2010
  • Firstpage
    703
  • Lastpage
    710
  • Abstract
    Protein domains are the compact, evolutionarily conserved units of proteins that can be used to associate function with the large number of gene products realised from whole-genome sequencing projects. Homology, inferred from sequence similarity, is usually the basis for transferring function annotation from pre-existing domain families to gene products. To reduce computational time in such large-scale data mining, sequence analysis protocols rely on a reference sequence from each family for homology searches. Because protein domain families are diverse, it is important to identify a single best representative sequence member from a protein domain family using a well-defined, reproducible bioinformatics protocol. We report a new bioinformatics protocol that can be used to identify the best representative sequence (BRS) of a protein domain family. The method is based on a “coverage analysis” score implemented using three different sequence search programs, and the trends obtained in reporting the best representative sequence are assessed. The highest average coverage for BRSs was 66%, obtained when searching with Hidden Markov Models. Further, it is crucial to select a BRS specific to the sequence search method when searching large sequence databases.
  • Keywords
    bioinformatics; data mining; genomics; hidden Markov models; proteins; best representative sequence; bioinformatics protocol; coverage analysis; gene product; genome sequencing project; protein domain families; sequence analysis protocol; sequence data mining protocol; protein domain; protein family; sequence analysis; sequence data mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining Workshops (ICDMW), 2010 IEEE International Conference on
  • Conference_Location
    Sydney, NSW
  • Print_ISBN
    978-1-4244-9244-2
  • Electronic_ISBN
    978-0-7695-4257-7
  • Type
    conf
  • DOI
    10.1109/ICDMW.2010.153
  • Filename
    5693365