DocumentCode :
2422007
Title :
PIPA: A High-Throughput Pipeline for Protein Function Annotation
Author :
Yu, Chenggang ; Desai, Valmik ; Zavaljevski, Nela ; Reifman, Jaques
Author_Institution :
Telemedicine & Adv. Technol. Res. Center, US Army Med. Res. & Materiel Command, Fort Detrick, MD
fYear :
2008
fDate :
14-17 July 2008
Firstpage :
241
Lastpage :
246
Abstract :
We developed Pipeline for Protein Annotation (PIPA), a genome-wide protein function annotation pipeline that runs in a high performance computing environment. PIPA integrates different tools and employs the Gene Ontology (GO) to provide consistent annotation and resolve prediction conflicts. PIPA has three modules that allow for easy development of specialized databases and integration of various bioinformatics tools. The first module, the pipeline execution module, consists of programs that enable the user access to and control of the pipeline's parallel execution of multiple jobs, each searching a particular database for a chunk of the input data. The execution module wraps the second module, the core pipeline module. The integrated resources, the program for terminology conversion to GO, and the consensus annotation program constitute the main components of the core module. The third module is the preprocessing module. This last module contains the program for customized generation of protein function databases and the GO-mapping generation program, which creates GO mappings for the terminology conversion program. The current implementation of PIPA annotates protein functions by combining the results of an inhouse-developed database for enzyme catalytic function prediction (CatFam) and the results of multiple integrated resources, such as the 11 member databases of InterPro and the Conserved Domains Database, into common GO terms. A Web-page-based graphical user interface is developed based on the User Interface Toolkit. The pipeline is deployed on two LINUX clusters, JVN at the Army Research Laboratory Major Shared Resource Center and JAWS at the Maui High Performance Computing Center. Currently, scientists at the Naval Medical Research Center are using PIPA to predict protein functions for newly sequenced bacterial pathogens and their near-neighbor strains. Validation tests show that, on average, the CatFam database yields predictions of enzyme catalytic functions with accuracy greater than 95%. Test results of the consensus GO annotation show an improvement in performance of up to 8% when compared with annotations in which consensus is not used.
Keywords :
Internet; bioinformatics; database management systems; genetics; graphical user interfaces; ontologies (artificial intelligence); parallel databases; parallel programming; pipeline processing; proteins; Web page; bioinformatics tool; enzyme catalytic function prediction; gene ontology; graphical user interface; high performance computing environment; high-throughput pipeline; protein function annotation; specialized database development; terminology conversion program; Biochemistry; Bioinformatics; Databases; Genomics; High performance computing; Ontologies; Pipelines; Protein engineering; Terminology; Testing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
DoD HPCMP Users Group Conference, 2008. DOD HPCMP UGC
Conference_Location :
Seattle, WA
Print_ISBN :
978-1-4244-3323-0
Type :
conf
DOI :
10.1109/DoD.HPCMP.UGC.2008.24
Filename :
4755872