• DocumentCode
    19565
  • Title

    Improving Retrieval Efficacy of Homology Searches Using the False Discovery Rate

  • Author

    Carroll, Hyrum D. ; Williams, Alex C. ; Davis, Anthony G. ; Spouge, John L.

  • Author_Institution
    Dept. of Comput. Sci., Middle Tennessee State Univ., Murfreesboro, TN, USA
  • Volume
    12
  • Issue
    3
  • fYear
    2015
  • fDate
    May-June 1 2015
  • Firstpage
    531
  • Lastpage
    537
  • Abstract
    Over the past few decades, discovery based on sequence homology has become a widely accepted practice. Consequently, comparative accuracy of retrieval algorithms (e.g., BLAST) has been rigorously studied for improvement. Unlike most components of retrieval algorithms, the E-value threshold criterion has yet to be thoroughly investigated. An investigation of the threshold is important as it exclusively dictates which sequences are declared relevant and irrelevant. In this paper, we introduce the false discovery rate (FDR) statistic as a replacement for the uniform threshold criterion in order to improve efficacy in retrieval systems. Using NCBI's BLAST and PSI-BLAST software packages, we demonstrate the applicability of such a replacement in both non-iterative (BLASTFDR) and iterative (PSI-BLASTFDR) homology searches. For each application, we performed an evaluation of retrieval efficacy with five different multiple testing methods on a large training database. For each algorithm, we chose the best performing method, Benjamini-Hochberg, as the default statistic. As measured by the threshold average precision, BLASTFDR yielded 14.1 percent better retrieval performance than BLAST on a large (5,161 queries) test database and PSI-BLASTFDR attained 11.8 percent better retrieval performance than PSI-BLAST. The C++ source code specific to BLASTFDR and PSI-BLASTFDR and instructions are available at http://www.cs.mtsu.edu/~hcarroll/blast_fdr/.
  • Keywords
    C++ language; bioinformatics; information retrieval; iterative methods; molecular configurations; proteins; query formulation; software packages; C++ source code; BLAST software package; Benjamini-Hochberg method; E-value threshold criterion; FDR statistic; PSI-BLAST software package; false discovery rate; homology searches; iterative homology searches; noniterative homology searches; retrieval algorithms; retrieval efficacy; sequence homology; threshold average precision; uniform threshold criterion; Bioinformatics; Computational biology; Databases; Histograms; IEEE transactions; Testing; Training; Homology search; false discovery rate; retrieval efficacy; uniform E-value thresholding;
  • fLanguage
    English
  • Journal_Title
    Computational Biology and Bioinformatics, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    1545-5963
  • Type

    jour

  • DOI
    10.1109/TCBB.2014.2366112
  • Filename
    6940294
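
The abstract names Benjamini-Hochberg as the default multiple testing method that replaces the uniform E-value threshold. As a rough illustration only, the textbook Benjamini-Hochberg step-up procedure can be sketched as below. This is not the authors' C++ implementation: the function names, the `alpha` parameter, and the conversion of E-values to p-values through the Karlin-Altschul relation P = 1 - exp(-E) are assumptions made for this example, and the record does not say how BLASTFDR derives its p-values.

```python
import math


def evalue_to_pvalue(e_value):
    """Convert an E-value to a p-value, assuming P = 1 - exp(-E).

    Hypothetical helper for this sketch; the paper may use a different mapping.
    """
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hits declared relevant at FDR level alpha.

    Standard step-up rule: sort the m p-values in ascending order, find the
    largest rank k with p_(k) <= (k / m) * alpha, and accept ranks 1..k.
    """
    m = len(p_values)
    # Indices of the hits, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank  # keep the largest rank that passes
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    # Made-up E-values for five hypothetical hits, for illustration only.
    e_values = [1e-30, 1e-5, 0.02, 0.5, 8.0]
    p_values = [evalue_to_pvalue(e) for e in e_values]
    print(benjamini_hochberg(p_values, alpha=0.05))
```

Unlike a fixed E-value cutoff, the acceptance threshold here depends on how many hits are tested and how their p-values are distributed, which is the general idea behind swapping a uniform threshold for an FDR-based one.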