DocumentCode :
2786084
Title :
Optimizing Sequence Alignment in Cloud Using Hadoop and MPP Database
Author :
Vijayakumar, Senthilkumar ; Bhargavi, Anjani ; Praseeda, Uma ; Ahamed, Syed Azar
Author_Institution :
ITPB, TATA Consultancy Services Ltd., Bangalore, India
fYear :
2012
fDate :
24-29 June 2012
Firstpage :
819
Lastpage :
827
Abstract :
In bioinformatics, a sequence alignment is a way of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. This information can be used effectively for medical and biological research only if one can extract functional insight from it. To obtain functional insight, the factors to be considered while aligning sequences are optimized querying of sequences, high-speed matching, and accuracy of alignment. The FAST-All (FASTA) program, for both proteins and nucleotides, considers all these factors and follows a largely heuristic method, which contributes to the high speed of its execution. The program initially observes the pattern of word hits (word-to-word matches of a given length) and marks potential matches, rather than performing a more time-consuming, optimized search using a Smith-Waterman type of algorithm. In this paper, we propose an optimized approach to sequence alignment using the FASTA algorithm, which incorporates high-speed word-to-word matching. In the current scenario, where data grows by petabytes a day and processing requires state-of-the-art technologies, the Greenplum Massively Parallel Processing (MPP) database and Hadoop are emerging parallel technologies that form the backbone of this proposal. The complex nature of the algorithm, coupled with the data and computational parallelism of the Hadoop grid and the Massively Parallel Processing database for querying big datasets containing petabytes of sequences, improves the accuracy and speed of sequence alignment and optimizes querying from big datasets. Bioinformatics labs and centers across the globe today upload enormous amounts of data and sequences to a central location for scientific analysis. The transfer of such large datasets can also be simplified with Cloud approaches, so Cloud Computing technology is used in our implementation to ease the gathering of such sequences and data from various sources such as medical research centers, scientists, and biomedical labs around the globe. A plan for the final "publicly consumable" form of the program is to make it web-based and running on the Cloud.
Keywords :
DNA; RNA; bioinformatics; cloud computing; grid computing; parallel databases; proteins; public domain software; query processing; DNA sequence arrangement; FAST-All; FASTA algorithm; Greenplum massively parallel processing database; Hadoop grid; MPP database; RNA sequence arrangement; bioinformatics; biological research; cloud computing technology; data growth; evolutionary relationships; functional relationships; medical research; nucleotides program; parallel technologies; protein sequence arrangement; proteins program; scientific analysis; sequence alignment optimization; structural relationships; word hits; word-to-word matching; Bioinformatics; Cloud computing; Distributed databases; Genomics; Green products; Parallel processing; Cloud Computing; Computational Biology; FASTA; Greenplum MPP Database; Hadoop; Sequence Alignment;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on
Conference_Location :
Honolulu, HI
ISSN :
2159-6182
Print_ISBN :
978-1-4673-2892-0
Type :
conf
DOI :
10.1109/CLOUD.2012.34
Filename :
6253584