A Tool for Creating and Parallelizing Bioinformatics Pipelines

Author

Yu, Chenggang ; Wilson, Paul A.

Author_Institution

US Army Med. Res. & Materiel Command, Detrick

fYear

2007

fDate

18-21 June 2007

Firstpage

417

Lastpage

420

Abstract

Bioinformatics pipelines enable life scientists to analyze biological data effectively through automated multi-step processes constructed from individual programs and databases. The large volume of data and the time-consuming computations require effectively parallelized pipelines to provide results within a reasonable time. To reduce researchers' programming burden for pipeline creation and parallelization, we developed the bioinformatics pipeline generation and parallelization toolkit (BioGent). A user needs only to create a pipeline definition file that describes the data processing sequence and the input/output files. A program in the BioGent toolkit, termed schedpipe, takes the definition file and executes the designed procedure. Schedpipe automatically parallelizes pipeline execution in two ways: by performing independent data processing steps on multiple CPUs, and by decomposing big datasets into small chunks and processing them in parallel. Schedpipe controls program execution on multiple CPUs through a simple application programming interface (API) of the Parallel Job Manager (PJM) library. As a part of the BioGent toolkit, PJM was developed to launch and monitor programs effectively on multiple CPUs using the Message Passing Interface (MPI) protocol. The PJM API can also be used to parallelize other serial programs. A demonstration using PJM for parallelization shows 10% to 50% savings in time compared to an indigenous parallelization through a batch queuing system.
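
The chunk-level parallelization described in the abstract can be illustrated with a minimal Python sketch. It is not the BioGent, schedpipe, or PJM code, and it does not reproduce their API: it uses a thread pool from the Python standard library as a stand-in for the MPI-managed worker CPUs, and the names split_into_chunks, gc_content, and run_chunked are hypothetical, chosen only for this example. The sketch splits a dataset into small chunks, applies one processing step to each chunk concurrently, and merges the results in the original order.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Decompose a dataset into consecutive chunks of at most chunk_size items."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def gc_content(chunk):
    """Hypothetical processing step: GC fraction of each sequence in a chunk."""
    return [(seq.count("G") + seq.count("C")) / len(seq) for seq in chunk]


def run_chunked(step, records, chunk_size, workers=4):
    """Apply step to every chunk concurrently and merge results in input order.

    Assumption: a thread pool stands in for the multiple CPUs; a real
    CPU-bound pipeline would use separate processes or MPI ranks instead.
    """
    chunks = split_into_chunks(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(step, chunks))  # map preserves chunk order
    # Flatten the per-chunk results back into one list.
    return [item for result in per_chunk for item in result]
```

Because the step treats each record independently, running it chunk by chunk gives the same result as running it once over the whole dataset, for example run_chunked(gc_content, sequences, 2) equals gc_content(sequences) for any list of non-empty sequences. That independence is what makes this kind of decomposition safe.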

Keywords

application program interfaces; batch processing (computers); biology computing; message passing; pipeline processing; queueing theory; software tools; API; BioGent toolkit; MPI protocol; Parallel Job Manager library; Schedpipe controls; application programming interface; automated multistep processes; batch queuing system; bioinformatics pipeline generation; biological data; data processing sequence; databases; independent data processing; message passing interface; multiple CPUs; parallelization toolkit; parallelized pipelines; pipeline execution; program execution; serial programs; Automatic control; Bioinformatics; Biology computing; Concurrent computing; Data analysis; Data processing; Databases; Libraries; Parallel programming; Pipelines;

fLanguage

English

Publisher

IEEE

Conference_Title

DoD High Performance Computing Modernization Program Users Group Conference, 2007

Conference_Location

Pittsburgh, PA

Print_ISBN

978-0-7695-3088-5

Type

conf

DOI

10.1109/HPCMP-UGC.2007.5

Filename

4438020