DocumentCode :
1784903
Title :
MRSMRS: Mining repetitive sequences in a MapReduce setting
Author :
Cao, Hongfei ; Phinney, Michael ; Petersohn, Devin ; Merideth, Benjamin ; Shyu, Chi-Ren
Author_Institution :
Dept. of Comput. Sci., Univ. of Missouri, Columbia, MO, USA
fYear :
2014
fDate :
2-5 Nov. 2014
Firstpage :
463
Lastpage :
470
Abstract :
Recent research suggests DNA repeats play critical roles in cellular regulatory functions and disease development. Repeat variability among different species, or within the same species, is also an important indicator for the development of specific phenotypes. Similarities in repetitive sequences among different species have been shown to indicate deeply conserved functions. Patterns such as ultraconserved elements (UCEs), tandem repeats, and palindromes have been of interest. Researchers utilize various computational approaches to aid in the identification of each of these types of patterns. The challenge associated with identifying repeats across a collection of genomes arises from the amount of data stored within DNA. The human genome alone consists of more than 3.1 billion base pairs, and the intermediate data generated by alignment- and hash-based approaches are substantial. This sort of all-against-all analysis on a large collection of genomic sequence data often requires data to be reprocessed when new genomes are collected. To handle data of this scale, we utilize the Hadoop Distributed File System running on a cluster of 11 relatively inexpensive nodes, each containing a quad-core commodity processor. Furthermore, to alleviate redundant computation, intermediate data are organized in HBase, allowing us to incrementally process new genomic data without having to reprocess existing genomes. Our approach offers a cost-effective, flexible, robust, and scalable solution to the challenge of identifying various types of repetitive sequences across a collection of genomes. In this study, we benchmark our method using a collection of 6 genomes, totaling approximately 14.2 billion base pairs. Three case studies are presented, demonstrating a 10.4-fold speedup over previous state-of-the-art approaches and linear scalability.
Keywords :
DNA; bioinformatics; data mining; genomics; molecular biophysics; molecular clusters; molecular configurations; parallel processing; DNA repeats; Hadoop distributed file system; MRSMRS; alignment-based approaches; cellular regulatory functions; cluster; computational approaches; deeply conserved functions; disease development; genomic sequence data; hash-based approaches; human genome; intermediate data generation; linear scalability; mining repetitive sequences-in-a-mapreduce setting; palindromes; quad-core commodity processor; redundant computation; repeat variability; scalable solution; specific phenotypes development; ultraconserved elements; Big data; Bioinformatics; DNA; Diseases; Genomics; Phantoms; Runtime; Big Data; palindromes; repetitive sequences; sequence analysis; tandem repeats;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on
Conference_Location :
Belfast
Type :
conf
DOI :
10.1109/BIBM.2014.6999201
Filename :
6999201