DocumentCode :
3078705
Title :
SparkSW: Scalable Distributed Computing System for Large-Scale Biological Sequence Alignment
Author :
Guoguang Zhao ; Cheng Ling ; Donghong Sun
Author_Institution :
Tsinghua Univ., Beijing, China
fYear :
2015
fDate :
4-7 May 2015
Firstpage :
845
Lastpage :
852
Abstract :
The Smith-Waterman (SW) algorithm is widely used for database search owing to its high sensitivity. Its impact is reflected in the more than 8000 citations it has received over the past decades. However, the algorithm's time and space complexity are prohibitively high, which poses significant computational challenges. Apache Spark is an increasingly popular, fast big data analytics engine that has been highly successful in implementing large-scale, data-intensive applications on commercial hardware. This paper presents SparkSW, the first reported system that implements the SW algorithm on an Apache Spark based distributed computing framework, using a couple of off-the-shelf workstations. The scalability and load-balancing efficiency of the system are investigated using a realistic ultra-large database drawn from the state-of-the-art UniRef100. The experimental results indicate that 1) SparkSW balances load well, adapting in parallel to the workload, and scales extremely well as computing resources increase, and 2) SparkSW provides a fast and universal option for highly sensitive biological sequence alignment. The success of SparkSW also shows that the Apache Spark framework offers an efficient way to cope with the ever-increasing sizes of biological sequence databases, especially those generated by second-generation sequencing technologies.
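For readers unfamiliar with the algorithm the abstract refers to, the following is a minimal sketch of Smith-Waterman local alignment scoring applied to a small database search. It is not SparkSW's implementation: the function names `sw_score` and `search` are hypothetical, and the match, mismatch, and linear gap values are illustrative assumptions rather than the scoring scheme used in the paper. The sketch keeps only two rows of the dynamic-programming matrix, so memory grows with the target length while time still grows with the product of the two sequence lengths.

```python
def sw_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    The scoring values are illustrative assumptions, not the paper's.
    """
    prev = [0] * (len(target) + 1)  # previous row of the DP matrix
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0 for local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            # A cell never drops below 0: an alignment may restart anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            if score > best:
                best = score
        prev = curr
    return best


def search(query, database):
    """Score the query against every database entry, best hits first."""
    scored = [(name, sw_score(query, seq)) for name, seq in database.items()]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)
```

Each database entry is scored independently of the others, which is the property that lets a database search of this kind be split across workers in a distributed framework.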
Keywords :
Big Data; bioinformatics; data analysis; parallel processing; resource allocation; very large databases; Apache Spark; SW algorithm; Smith-Waterman algorithm; SparkSW; UniRef100; big data analytics engine; biological sequence databases; distributed computing system; large-scale biological sequence alignment; load-balancing efficiency; off-the-shelf workstations; second-generation sequencing technologies; ultra-large database; Algorithm design and analysis; Biology; Distributed databases; Heuristic algorithms; Sparks; Distributed computing; pairwise sequence alignment
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International Symposium on
Conference_Location :
Shenzhen
Type :
conf
DOI :
10.1109/CCGrid.2015.55
Filename :
7152568