Title :
Shifting the bioinformatics computing paradigm: A case study in parallelizing genome annotation using MAKER and Work Queue
Author :
Thrasher, Andrew ; Thain, Douglas ; Emrich, Scott ; Musgrave, Zachary
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of Notre Dame, Notre Dame, IN, USA
Abstract :
Next-generation sequencing technologies have enabled various entities, ranging from large sequencing centers to individual laboratories, to sequence organisms of choice and analyze them on demand. Sequencing and analysis, however, are only part of the equation: to learn about a certain organism, scientists need to annotate it. Each of these problems is highly parallel at a basic level of computation; however, only a few applications support single parallelization frameworks such as MPI. Because of the overall increasing demand for computational analysis and the inherent parallelism available in these problems, applications should easily run on clusters, clouds, and/or grids (even simultaneously); this would enable labs of various sizes to harness the computing power available to them without forcing them to invest in a particular type of batch system. Here we describe modifications made to one particular tool, MAKER. MAKER is a tool for genome annotation that is provided as both a serial application and an MPI application. We modify it to run without MPI and to utilize a wide variety of distributed computing platforms. Further, our proposed parallel framework allows for easy explicit data transfer, which helps overcome a major limitation of bioinformatics tools that generally rely on a shared filesystem. The distributed computing framework we chose can be used, even during early stages of development, to run bioinformatics tools on clusters, grids, and clouds. We present an evaluation of our modifications using the Caenorhabditis japonica genome, comprising 180 megabases of data, and achieve a speedup of 45× using 50 workers.
Keywords :
bioinformatics; cellular biophysics; genomics; microorganisms; Caenorhabditis japonica genome; MAKER queue; bioinformatics computing paradigm; bioinformatics tools; clouds; clusters; genome annotation; grids; large sequencing centers; next generation sequencing technologies; parallel framework; parallelizing genome annotation; sequence organisms; support single parallelization frameworks; work queue; Bioinformatics; Distributed computing; Genomics; Hidden Markov models; Libraries; Pipelines; Proteins
Conference_Title :
Computational Advances in Bio and Medical Sciences (ICCABS), 2012 IEEE 2nd International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
978-1-4673-1320-9
Electronic_ISBN :
978-1-4673-1319-3
DOI :
10.1109/ICCABS.2012.6182647