• DocumentCode
    2428746
  • Title

    Three improvements to the BLASTP search of genome databases

  • Author

    Delaney, Shawn ; Butler, Greg ; Lam, Clement ; Thiel, Larry

  • Author_Institution
    Dept. of Comput. Sci., Concordia Univ., Montreal, Que., Canada
  • fYear
    2000
  • fDate
    2000
  • Firstpage
    14
  • Lastpage
    24
  • Abstract
    The BLASTP program is a search tool for databases of protein sequences that is widely used by biologists as a first step in investigating new genome sequences. BLASTP finds high-scoring local alignments ($q_i q_{i+1} \ldots q_{i+k} \,\|\, s_j s_{j+1} \ldots s_{j+k}$) without gaps between a query sequence $q$ and sequences $s$ in the database. The score of an alignment is the sum of the scores of the individual alignments $q_{i+t} \,\|\, s_{j+t}$ between the amino acids that make up the protein. These individual scores come from a scoring matrix modeling the rate of evolutionary mutation. Here we provide a detailed description of the original program and three separate optimisations to it. BLASTP consists of three steps, which we call neighbourhood construction, hit detection, and hit extension. The three optimisations target hit extension, since it accounts for 93% of the execution time. The first optimisation alters the data representation of the query sequence and the related code for indexing the scoring matrix. The second optimisation performs extensions in step-sizes of two rather than one. The third optimisation forestalls the calling of the hit extension step in cases that are unlikely to lead to a high-scoring alignment. Individually, the three optimisations show speed-ups of 15%, 48%, and 63%, respectively.
  • Keywords
    data structures; database indexing; medical information systems; BLASTP search; amino acids; data representation; evolutionary mutation; genome databases; genome sequences; hit detection; hit extension; indexing; local alignments; protein sequences database; query sequence; scoring matrix; search tool; Amino acids; Bioinformatics; Biological information theory; Computer science; Databases; Genetic mutations; Genomics; Indexing; Proteins; Uninterruptible power systems;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Scientific and Statistical Database Management, 2000. Proceedings. 12th International Conference on
  • Conference_Location
    Berlin
  • ISSN
    1099-3371
  • Print_ISBN
    0-7695-0686-0
  • Type

    conf

  • DOI
    10.1109/SSDM.2000.869775
  • Filename
    869775
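
The alignment score defined in the abstract (the sum of per-residue scores over an ungapped alignment of length k+1) can be illustrated with a minimal sketch. The names `toy_score` and `ungapped_score` are hypothetical, and the match/mismatch values are an assumed stand-in for a real scoring matrix; this is not the paper's code and does not reproduce any of its three optimisations.

```python
def toy_score(a: str, b: str) -> int:
    """Assumed toy scoring rule: +2 for identical residues, -1 otherwise.

    A real search would look the pair up in a substitution matrix instead.
    """
    return 2 if a == b else -1


def ungapped_score(q: str, s: str, i: int, j: int, k: int, score=toy_score) -> int:
    """Score the ungapped alignment q[i..i+k] || s[j..j+k].

    The total is the sum of the individual scores of q[i+t] against s[j+t]
    for t = 0..k, as described in the abstract.
    """
    return sum(score(q[i + t], s[j + t]) for t in range(k + 1))


if __name__ == "__main__":
    # Three matches and one mismatch under the toy rule.
    print(ungapped_score("MKVL", "MKIL", 0, 0, 3))
```

Passing a different `score` callable (for example, one backed by a lookup table) changes the scoring rule without touching the summation.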