DocumentCode
2786084
Title
Optimizing Sequence Alignment in Cloud Using Hadoop and MPP Database
Author
Vijayakumar, Senthilkumar ; Bhargavi, Anjani ; Praseeda, Uma ; Ahamed, Syed Azar
Author_Institution
ITPB, TATA Consultancy Services Ltd., Bangalore, India
fYear
2012
fDate
24-29 June 2012
Firstpage
819
Lastpage
827
Abstract
In bioinformatics, a sequence alignment is a way of arranging DNA, RNA, or protein sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. This information can be used effectively for medical and biological research only if functional insight can be extracted from it. To obtain functional insight, the factors to consider while aligning sequences are optimized querying of sequences, high-speed matching, and accuracy of alignment. The FAST-All (FASTA) program, for both proteins and nucleotides, considers all these factors and follows a largely heuristic method, which contributes to its high speed of execution. The program initially observes the pattern of word hits (word-to-word matches of a given length) and marks potential matches rather than performing a more time-consuming, optimized search using a Smith-Waterman type of algorithm. In this paper, we propose an optimized approach to sequence alignment using the FASTA algorithm, which incorporates high-speed word-to-word matching. In the current scenario, where data grows by petabytes a day and processing requires state-of-the-art technologies, the Greenplum Massively Parallel Processing (MPP) database and Hadoop are emerging parallel technologies that form the backbone of this proposal. The complex nature of the algorithm, coupled with the data and computational parallelism of a Hadoop grid and an MPP database for querying big datasets containing petabytes of sequences, improves the accuracy and speed of sequence alignment and optimizes querying from big datasets. Bioinformatics labs and centers across the globe today upload enormous amounts of data and sequences to a central location for scientific analysis. The transfer of such large datasets can also be simplified with Cloud approaches. Cloud computing technology is therefore used in our implementation to ease the gathering of such sequences and data from various sources, such as medical research centers, scientists, and biomedical labs around the globe. The plan for the final "publicly consumable" form of the program is to make it web-based and run it on the Cloud.
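The word-hit step described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' Hadoop or Greenplum implementation; the function names (`index_words`, `diagonal_hits`, `best_diagonal`) and the word length `k` are hypothetical choices made for illustration. It indexes every length-k word of the subject sequence, looks up each query word, and counts hits per diagonal (subject offset minus query offset), so that a diagonal with many hits marks a potential match worth a more careful alignment.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map each length-k word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k=2):
    """Count word hits per diagonal (subject offset minus query offset)."""
    index = index_words(subject, k)
    hits = defaultdict(int)
    for q in range(len(query) - k + 1):
        # .get avoids creating empty entries for words absent from subject
        for s in index.get(query[q:q + k], ()):
            hits[s - q] += 1
    return dict(hits)


def best_diagonal(query, subject, k=2):
    """Return (diagonal, hit_count) for the densest diagonal, or None."""
    hits = diagonal_hits(query, subject, k)
    if not hits:
        return None
    return max(hits.items(), key=lambda kv: kv[1])
```

For example, `best_diagonal("ACGT", "TTACGTTT", k=2)` finds three word hits on diagonal 2, which reflects the query occurring in the subject at offset 2. A fuller FASTA-style pipeline would rescore and join such diagonals before any Smith-Waterman style refinement; that part is omitted here.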
Keywords
DNA; RNA; bioinformatics; cloud computing; grid computing; parallel databases; proteins; public domain software; query processing; DNA sequence arrangement; FAST-All; FASTA algorithm; Greenplum massively parallel processing database; Hadoop grid; MPP database; RNA sequence arrangement; bioinformatics; biological research; cloud computing technology; data growth; evolutionary relationships; functional relationships; medical research; nucleotides program; parallel technologies; protein sequence arrangement; proteins program; scientific analysis; sequence alignment optimization; structural relationships; word hits; word-to-word matching; Bioinformatics; Cloud computing; Distributed databases; Genomics; Green products; Parallel processing; Cloud Computing; Computational Biology; FASTA; Greenplum MPP Database; Hadoop; Sequence Alignment;
fLanguage
English
Publisher
IEEE
Conference_Titel
Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on
Conference_Location
Honolulu, HI
ISSN
2159-6182
Print_ISBN
978-1-4673-2892-0
Type
conf
DOI
10.1109/CLOUD.2012.34
Filename
6253584