SparkSW: Scalable Distributed Computing System for Large-Scale Biological Sequence Alignment

Author

Guoguang Zhao ; Cheng Ling ; Donghong Sun

Author_Institution

Tsinghua Univ., Beijing, China

fYear

2015

fDate

4-7 May 2015

Firstpage

845

Lastpage

852

Abstract

The Smith-Waterman (SW) algorithm is widely used for database searches owing to its high sensitivity. Its impact is reflected in the more than 8000 citations it has received over the past decades. However, the algorithm's high time and space complexity poses significant computational challenges. Apache Spark is an increasingly popular, fast big data analytics engine that has been highly successful in implementing large-scale data-intensive applications on commercial hardware. This paper presents SparkSW, the first reported system that implements the SW algorithm on an Apache Spark based distributed computing framework, running on a couple of off-the-shelf workstations. The scalability and load-balancing efficiency of the system are investigated using a realistic ultra-large database drawn from the state-of-the-art UniRef100. The experimental results indicate that 1) SparkSW balances load adaptively across parallel workloads and scales extremely well as computing resources increase, and 2) SparkSW provides a fast and universal option for highly sensitive biological sequence alignment. The success of SparkSW also shows that the Apache Spark framework offers an efficient way to cope with the ever-increasing sizes of biological sequence databases, especially those generated by second-generation sequencing technologies.
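
For orientation, the kind of computation the abstract describes can be sketched as follows. This is a minimal illustration of Smith-Waterman local alignment scoring with a linear gap penalty, not the SparkSW implementation. The function names `smith_waterman_score` and `search` and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for the example; protein searches would normally use a substitution matrix and affine gap penalties instead.

```python
def smith_waterman_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score (linear gap penalty, assumed values)."""
    # Only the previous row of the dynamic-programming matrix is kept,
    # so memory use is proportional to len(subject).
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            # Clamping at 0 is what makes the alignment local.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def search(query, database):
    """Score the query against every (id, sequence) pair, best hits first."""
    hits = [(seq_id, smith_waterman_score(query, seq)) for seq_id, seq in database]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Each database sequence is scored independently of the others, which is why a search of this kind is a natural fit for a data-parallel engine: the per-sequence scoring in `search` could in principle be distributed as a map over partitions of the database. How SparkSW actually partitions and balances the work is not stated in this record.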

Keywords

Big Data; bioinformatics; data analysis; parallel processing; resource allocation; very large databases; Apache Spark; SW algorithm; Smith-Waterman algorithm; SparkSW; UniRef100; big data analytics engine; biological sequence databases; distributed computing system; large-scale biological sequence alignment; load-balancing efficiency; off-the-shelf workstations; second-generation sequencing technologies; ultra-large database; Algorithm design and analysis; Biology; Distributed databases; Heuristic algorithms; Sparks; Distributed computing; pairwise sequence alignment

fLanguage

English

Publisher

IEEE

Conference_Titel

Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on

Conference_Location

Shenzhen

Type

conf

DOI

10.1109/CCGrid.2015.55

Filename

7152568