• DocumentCode
    3714589
  • Title

    Cluster-based Apache Spark implementation of the GATK DNA analysis pipeline

  • Author

    Hamid Mushtaq;Zaid Al-Ars

  • Author_Institution
    Computer Engineering Laboratory, Delft University of Technology, The Netherlands
  • fYear
    2015
  • Firstpage
    1471
  • Lastpage
    1477
  • Abstract
    Fast progress in next-generation sequencing has dramatically increased the throughput of DNA sequencing, resulting in the availability of large DNA data sets ready for analysis. However, post-sequencing DNA analysis has become the bottleneck in using these data sets, as it requires powerful and scalable tools to perform the needed analysis. A typical analysis pipeline consists of a number of steps, not all of which can readily scale on a distributed computing infrastructure. Recently, tools like Halvade, a Hadoop MapReduce solution, and Churchill, an HPC cluster-based solution, addressed this issue of scalability in the GATK DNA analysis pipeline. In this paper, we present a framework that implements an in-memory distributed version of the GATK pipeline using Apache Spark. Our framework reduces execution time by keeping data active in memory between the map and reduce steps. In addition, it has a dynamic load balancing algorithm that better utilizes system performance using runtime statistics of the active workload. Experiments on a 4-node cluster with 64 virtual cores show that this approach is 63% faster than a Hadoop MapReduce-based solution.
  • Keywords
    "Pipelines","DNA","Biological cells","Sparks","Sequential analysis","Scalability","Load management"
  • Publisher
    ieee
  • Conference_Titel
    Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on
  • Type

    conf

  • DOI
    10.1109/BIBM.2015.7359893
  • Filename
    7359893