Cluster-based Apache Spark implementation of the GATK DNA analysis pipeline

Author

Hamid Mushtaq; Zaid Al-Ars

Author_Institution

Computer Engineering Laboratory, Delft University of Technology, The Netherlands

fYear

2015

Firstpage

1471

Lastpage

1477

Abstract

Rapid progress in next-generation sequencing has dramatically increased the throughput of DNA sequencing, resulting in the availability of large DNA data sets ready for analysis. However, post-sequencing DNA analysis has become the bottleneck in using these data sets, as it requires powerful and scalable tools to perform the needed analysis. A typical analysis pipeline consists of a number of steps, not all of which can readily scale on a distributed computing infrastructure. Recently, tools like Halvade, a Hadoop MapReduce solution, and Churchill, an HPC cluster-based solution, have addressed this issue of scalability in the GATK DNA analysis pipeline. In this paper, we present a framework that implements an in-memory distributed version of the GATK pipeline using Apache Spark. Our framework reduces execution time by keeping data active in memory between the map and reduce steps. In addition, it has a dynamic load balancing algorithm that better utilizes system resources using runtime statistics of the active workload. Experiments on a 4-node cluster with 64 virtual cores show that this approach is 63% faster than a Hadoop MapReduce-based solution.

Keywords

"Pipelines","DNA","Biological cells","Sparks","Sequential analysis","Scalability","Load management"

Publisher

IEEE

Conference_Titel

Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on

Type

conf

DOI

10.1109/BIBM.2015.7359893

Filename

7359893