PIPA: A High-Throughput Pipeline for Protein Function Annotation

Author

Yu, Chenggang ; Desai, Valmik ; Zavaljevski, Nela ; Reifman, Jaques

Author_Institution

Telemedicine & Adv. Technol. Res. Center, US Army Med. Res. & Materiel Command, Fort Detrick, MD

fYear

2008

fDate

14-17 July 2008

Firstpage

241

Lastpage

246

Abstract

We developed the Pipeline for Protein Annotation (PIPA), a genome-wide protein function annotation pipeline that runs in a high-performance computing environment. PIPA integrates different tools and employs the Gene Ontology (GO) to provide consistent annotation and resolve prediction conflicts. PIPA has three modules that allow for easy development of specialized databases and integration of various bioinformatics tools. The first module, the pipeline execution module, consists of programs that give the user access to, and control of, the pipeline's parallel execution of multiple jobs, each of which searches a particular database for a chunk of the input data. The execution module wraps the second module, the core pipeline module. The integrated resources, the program for terminology conversion to GO, and the consensus annotation program constitute the main components of the core module. The third module, the preprocessing module, contains the program for customized generation of protein function databases and the GO-mapping generation program, which creates GO mappings for the terminology conversion program. The current implementation of PIPA annotates protein functions by combining the results of an in-house-developed database for enzyme catalytic function prediction (CatFam) and the results of multiple integrated resources, such as the 11 member databases of InterPro and the Conserved Domains Database, into common GO terms. A Web-page-based graphical user interface was developed using the User Interface Toolkit. The pipeline is deployed on two Linux clusters, JVN at the Army Research Laboratory Major Shared Resource Center and JAWS at the Maui High Performance Computing Center. Currently, scientists at the Naval Medical Research Center are using PIPA to predict protein functions for newly sequenced bacterial pathogens and their near-neighbor strains. Validation tests show that, on average, the CatFam database yields predictions of enzyme catalytic functions with accuracy greater than 95%. Test results of the consensus GO annotation show an improvement in performance of up to 8% when compared with annotations in which consensus is not used.
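
The flow the abstract describes (one job per database and input chunk, conversion of each tool's native terms to GO, then a consensus step) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not PIPA's actual code: the function names, the simple "at least two sources agree" voting rule, and the mapping entries are all hypothetical, and the sketch ignores the GO hierarchy, which a real consensus program would likely take into account.

```python
from collections import Counter
from itertools import product


def plan_jobs(sequence_ids, databases, chunk_size):
    """Return one (database, chunk) job per database and input chunk.

    Hypothetical helper: the abstract only says each job searches one
    database for one chunk of the input data.
    """
    chunks = [
        sequence_ids[i:i + chunk_size]
        for i in range(0, len(sequence_ids), chunk_size)
    ]
    return list(product(databases, chunks))


def to_go(hits, go_map):
    """Convert (source, native_term) hits to (source, GO id) pairs.

    Hits with no entry in the mapping table are dropped.
    """
    return [
        (source, go_map[(source, term)])
        for source, term in hits
        if (source, term) in go_map
    ]


def consensus(go_hits, min_sources=2):
    """Keep GO ids reported by at least `min_sources` distinct sources.

    Assumed voting rule, chosen only for illustration.
    """
    support = Counter(go_id for _, go_id in set(go_hits))
    return sorted(g for g, n in support.items() if n >= min_sources)


# Illustrative placeholder mapping; entries are NOT checked against
# any real GO mapping file.
GO_MAP = {
    ("catfam", "2.7.11.1"): "GO:0004674",
    ("pfam", "PF00069"): "GO:0004672",
    ("cdd", "cd00180"): "GO:0004672",
}

jobs = plan_jobs(["p1", "p2", "p3", "p4", "p5"], ["catfam", "pfam"], 2)
hits = [
    ("catfam", "2.7.11.1"),
    ("pfam", "PF00069"),
    ("cdd", "cd00180"),
    ("pfam", "PF99999"),  # unmapped term, dropped by to_go
]
go_hits = to_go(hits, GO_MAP)
agreed = consensus(go_hits)
```

Here `jobs` holds six (database, chunk) pairs that could be dispatched independently on a cluster, and `agreed` holds only the GO term supported by two different sources.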

Keywords

Internet; bioinformatics; database management systems; genetics; graphical user interfaces; ontologies (artificial intelligence); parallel databases; parallel programming; pipeline processing; proteins; Web page; bioinformatics tool; enzyme catalytic function prediction; gene ontology; graphical user interface; high performance computing environment; high-throughput pipeline; protein function annotation; specialized database development; terminology conversion program; Biochemistry; Bioinformatics; Databases; Genomics; High performance computing; Ontologies; Pipelines; Protein engineering; Terminology; Testing;

fLanguage

English

Publisher

IEEE

Conference_Titel

DoD HPCMP Users Group Conference, 2008. DOD HPCMP UGC

Conference_Location

Seattle, WA

Print_ISBN

978-1-4244-3323-0

Type

conf

DOI

10.1109/DoD.HPCMP.UGC.2008.24

Filename

4755872

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=2422007