• DocumentCode
    950291
  • Title
    Bootstrapping and normalization for enhanced evaluations of pairwise sequence comparison
  • Author
    Green, Richard E. ; Brenner, Steven E.
  • Author_Institution
    Dept. of Plant & Microbial Biol., & Molecular & Cell Biol., California Univ., Berkeley, CA, USA
  • Volume
    90
  • Issue
    12
  • fYear
    2002
  • fDate
    12/1/2002 12:00:00 AM
  • Firstpage
    1834
  • Lastpage
    1847
  • Abstract
    The exponentially growing library of known protein sequences represents molecules connected by an intricate network of evolutionary and functional relationships. To reveal these relationships, virtually every molecular biology experiment incorporates computational sequence analysis. The workhorse methods for this task make alignments between two sequences to measure their similarity. Informed use of these methods, such as NCBI BLAST, WU-BLAST, FASTA and SSEARCH, requires an understanding of their effectiveness. To permit informed sequence analysis, we have assessed the effectiveness of modern versions of these algorithms using the trusted relationships among ASTRAL sequences in the Structural Classification of Proteins (SCOP) database. We have reduced database representation artifacts through the use of a normalization method that addresses the uneven distribution of superfamily sizes. To allow for more meaningful and interpretable comparisons of results, we have implemented a bootstrapping procedure. We find that the most difficult pairwise relations to detect are those between members of larger superfamilies, and our test set is biased toward these. However, even when results are normalized, most distant evolutionary relationships elude detection.
  • Keywords
    biology computing; bootstrapping; evolution (biological); molecular biophysics; proteins; FASTA; NCBI BLAST; SSEARCH; WU-BLAST; algorithms; database representation artifacts reduction; distant evolutionary relationships; functional relationships; molecular biology experiment; sequence alignments; superfamily sizes; uneven distribution; Algorithm design and analysis; Bioinformatics; Biological information theory; Biology computing; Computational biology; Databases; Genomics; Libraries; Proteins; Sequences;
  • fLanguage
    English
  • Journal_Title
    Proceedings of the IEEE
  • Publisher
    IEEE
  • ISSN
    0018-9219
  • Type
    jour
  • DOI
    10.1109/JPROC.2002.805303
  • Filename
    1058228