Title :
Accelerating Comparative Genomics Workflows in a Distributed Environment with Optimized Data Partitioning
Author :
Choudhury, Olivia ; Hazekamp, Nicholas L. ; Thain, D. ; Emrich, S.
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of Notre Dame, Notre Dame, IN, USA
Abstract :
The advent of new sequencing technology has generated massive amounts of biological data at unprecedented rates. High-throughput bioinformatics tools are required to keep pace with this growth. Here, we implement a workflow-based model for parallelizing the data-intensive task of genome alignment and variant calling with BWA and GATK's HaplotypeCaller. We explore different approaches to partitioning data and how each affects the run time. We observe granularity-based partitioning for BWA and alignment-based partitioning for HaplotypeCaller to be the optimal choices for the pipeline. We identify the various challenges encountered while developing such an application and provide insight into addressing them. We report significant performance improvements, from 12 days to 4 hours, while running the BWA-GATK pipeline using 100 nodes for analyzing high-coverage oak tree data.
Keywords :
biology computing; distributed processing; genomics; tree data structures; workflow management software; BWA-GATK pipeline; GATK HaplotypeCaller; alignment-based partitioning; biological data; comparative genomics workflow; data intensive task; distributed environment; genome alignment; granularity-based partitioning; high-coverage oak tree data; high-throughput bioinformatics tools; optimized data partitioning; partitioning data; sequencing technology; workflow-based model; Bioinformatics; Genomics; III-V semiconductor materials; Pipelines; Runtime; Sequential analysis; BWA; Bioinformatics; Comparative Genomics; Data Partitioning; Distributed Computing; GATK; Genome Alignment; Makeflow; Variant Calling; Work Queue;
Conference_Title :
Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
Conference_Location :
Chicago, IL
DOI :
10.1109/CCGrid.2014.79