DocumentCode
950291
Title
Bootstrapping and normalization for enhanced evaluations of pairwise sequence comparison
Author
Green, Richard E. ; Brenner, Steven E.
Author_Institution
Dept. of Plant & Microbial Biol., & Molecular & Cell Biol., California Univ., Berkeley, CA, USA
Volume
90
Issue
12
fYear
2002
fDate
12/1/2002 12:00:00 AM
Firstpage
1834
Lastpage
1847
Abstract
The exponentially growing library of known protein sequences represents molecules connected by an intricate network of evolutionary and functional relationships. To reveal these relationships, virtually every molecular biology experiment incorporates computational sequence analysis. The workhorse methods for this task make alignments between two sequences to measure their similarity. Informed use of these methods, such as NCBI BLAST, WU-BLAST, FASTA and SSEARCH, requires understanding of their effectiveness. To permit informed sequence analysis, we have assessed the effectiveness of modern versions of these algorithms using the trusted relationships among ASTRAL sequences in the Structural Classification of Proteins database classification of protein structures. We have reduced database representation artifacts through the use of a normalization method that addresses the uneven distribution of superfamily sizes. To allow for more meaningful and interpretable comparisons of results, we have implemented a bootstrapping procedure. We find that the most difficult pairwise relations to detect are those between members of larger superfamilies, and our test set is biased toward these. However, even when results are normalized, most distant evolutionary relationships elude detection.
Keywords
biology computing; bootstrapping; evolution (biological); molecular biophysics; proteins; FASTA; NCBI BLAST; SSEARCH; WU-BLAST; algorithms; database representation artifacts reduction; distant evolutionary relationships; functional relationships; molecular biology experiment; sequence alignments; superfamily sizes; uneven distribution; Algorithm design and analysis; Bioinformatics; Biological information theory; Biology computing; Computational biology; Databases; Genomics; Libraries; Proteins; Sequences;
fLanguage
English
Journal_Title
Proceedings of the IEEE
Publisher
IEEE
ISSN
0018-9219
Type
jour
DOI
10.1109/JPROC.2002.805303
Filename
1058228