• DocumentCode
    2422007
  • Title
    PIPA: A High-Throughput Pipeline for Protein Function Annotation
  • Author
    Yu, Chenggang ; Desai, Valmik ; Zavaljevski, Nela ; Reifman, Jaques
  • Author_Institution
    Telemedicine & Adv. Technol. Res. Center, US Army Med. Res. & Materiel Command, Fort Detrick, MD
  • fYear
    2008
  • fDate
    14-17 July 2008
  • Firstpage
    241
  • Lastpage
    246
  • Abstract
    We developed Pipeline for Protein Annotation (PIPA), a genome-wide protein function annotation pipeline that runs in a high performance computing environment. PIPA integrates different tools and employs the Gene Ontology (GO) to provide consistent annotation and resolve prediction conflicts. PIPA has three modules that allow for easy development of specialized databases and integration of various bioinformatics tools. The first module, the pipeline execution module, consists of programs that give the user access to and control of the pipeline's parallel execution of multiple jobs, each searching a particular database for a chunk of the input data. The execution module wraps the second module, the core pipeline module. The integrated resources, the program for terminology conversion to GO, and the consensus annotation program constitute the main components of the core module. The third module is the preprocessing module. This last module contains the program for customized generation of protein function databases and the GO-mapping generation program, which creates GO mappings for the terminology conversion program. The current implementation of PIPA annotates protein functions by combining the results of an in-house-developed database for enzyme catalytic function prediction (CatFam) and the results of multiple integrated resources, such as the 11 member databases of InterPro and the Conserved Domains Database, into common GO terms. A Web-page-based graphical user interface is developed based on the User Interface Toolkit. The pipeline is deployed on two LINUX clusters, JVN at the Army Research Laboratory Major Shared Resource Center and JAWS at the Maui High Performance Computing Center. Currently, scientists at the Naval Medical Research Center are using PIPA to predict protein functions for newly sequenced bacterial pathogens and their near-neighbor strains. Validation tests show that, on average, the CatFam database yields predictions of enzyme catalytic functions with accuracy greater than 95%. Test results of the consensus GO annotation show an improvement in performance of up to 8% when compared with annotations in which consensus is not used.
  • Keywords
    Internet; bioinformatics; database management systems; genetics; graphical user interfaces; ontologies (artificial intelligence); parallel databases; parallel programming; pipeline processing; proteins; Web page; bioinformatics tool; enzyme catalytic function prediction; gene ontology; graphical user interface; high performance computing environment; high-throughput pipeline; protein function annotation; specialized database development; terminology conversion program; Biochemistry; Bioinformatics; Databases; Genomics; High performance computing; Ontologies; Pipelines; Protein engineering; Terminology; Testing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    DoD HPCMP Users Group Conference, 2008. DOD HPCMP UGC
  • Conference_Location
    Seattle, WA
  • Print_ISBN
    978-1-4244-3323-0
  • Type
    conf
  • DOI
    10.1109/DoD.HPCMP.UGC.2008.24
  • Filename
    4755872