DocumentCode
3714589
Title
Cluster-based Apache Spark implementation of the GATK DNA analysis pipeline
Author
Hamid Mushtaq;Zaid Al-Ars
Author_Institution
Computer Engineering Laboratory, Delft University of Technology, The Netherlands
fYear
2015
Firstpage
1471
Lastpage
1477
Abstract
Rapid progress in next-generation sequencing has dramatically increased the throughput of DNA sequencing, resulting in the availability of large DNA data sets ready for analysis. However, post-sequencing DNA analysis has become the bottleneck in using these data sets, as it requires powerful and scalable tools. A typical analysis pipeline consists of a number of steps, not all of which scale readily on a distributed computing infrastructure. Recently, tools like Halvade, a Hadoop MapReduce solution, and Churchill, an HPC cluster-based solution, have addressed this scalability issue in the GATK DNA analysis pipeline. In this paper, we present a framework that implements an in-memory distributed version of the GATK pipeline using Apache Spark. Our framework reduces execution time by keeping data active in memory between the map and reduce steps. In addition, it has a dynamic load balancing algorithm that better utilizes system performance using runtime statistics of the active workload. Experiments on a 4-node cluster with 64 virtual cores show that this approach is 63% faster than a Hadoop MapReduce-based solution.
Keywords
"Pipelines","DNA","Biological cells","Sparks","Sequential analysis","Scalability","Load management"
Publisher
ieee
Conference_Titel
Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on
Type
conf
DOI
10.1109/BIBM.2015.7359893
Filename
7359893