MRSMRS: Mining repetitive sequences in a MapReduce setting

Author

Cao, Hongfei; Phinney, Michael; Petersohn, Devin; Merideth, Benjamin; Shyu, Chi-Ren

Author_Institution

Dept. of Comput. Sci., Univ. of Missouri, Columbia, MO, USA

fYear

2014

fDate

2-5 Nov. 2014

Firstpage

463

Lastpage

470

Abstract

Recent research suggests that DNA repeats play critical roles in cellular regulatory functions and disease development. Repeat variability among different species, or within the same species, is also an important indicator for the development of specific phenotypes, and similarities in repetitive sequences among different species have been shown to indicate deeply conserved functions. Patterns such as ultraconserved elements (UCEs), tandem repeats, and palindromes have been of particular interest, and researchers use various computational approaches to identify each of these pattern types. The challenge of identifying repeats across a collection of genomes arises from the amount of data stored within DNA: the human genome alone consists of more than 3.1 billion base pairs, and the intermediate data generated by alignment- and hash-based approaches are substantial. In addition, this sort of all-against-all analysis on a large collection of genomic sequence data often requires data to be reprocessed when new genomes are collected. To handle data of this scale, we utilize the Hadoop Distributed File System running on a cluster of 11 relatively inexpensive nodes, each containing a quad-core commodity processor. Furthermore, to avoid redundant computation, intermediate data are organized in HBase, allowing us to process new genomic data incrementally without having to reprocess existing genomes. Our approach offers a cost-effective, flexible, robust, and scalable solution to the challenge of identifying various types of repetitive sequences across a collection of genomes. In this study, we benchmark our method using a collection of 6 genomes totaling approximately 14.2 billion base pairs. Three case studies are presented, demonstrating a 10.4-fold speedup over previous state-of-the-art approaches and linear scalability.
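
The hash-based, incremental idea described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: the function names (`map_kmers`, `reduce_into_index`, `shared_repeats`), the fixed window length `k`, the toy sequences, and the in-memory dictionary standing in for HBase are all assumptions made for illustration. The map step emits every length-k substring with its location, the reduce step groups locations by substring, and a new genome is merged into the existing index without touching genomes already processed.

```python
from collections import defaultdict


def map_kmers(genome_id, sequence, k):
    """Map step: emit (k-mer, (genome_id, position)) for every window of length k."""
    sequence = sequence.upper()
    for pos in range(len(sequence) - k + 1):
        yield sequence[pos:pos + k], (genome_id, pos)


def reduce_into_index(index, pairs):
    """Reduce step: group positions by k-mer, merging into an existing index."""
    for kmer, (genome_id, pos) in pairs:
        index[kmer][genome_id].append(pos)
    return index


def shared_repeats(index, min_genomes=2):
    """Return the k-mers that occur in at least min_genomes distinct genomes."""
    return {
        kmer: dict(hits)
        for kmer, hits in index.items()
        if len(hits) >= min_genomes
    }


# Hypothetical stand-in for the persistent intermediate store (HBase in the paper).
index = defaultdict(lambda: defaultdict(list))

# Initial batch of (toy) genomes.
reduce_into_index(index, map_kmers("genomeA", "ACGTACGTTT", 4))
reduce_into_index(index, map_kmers("genomeB", "GGACGTAA", 4))

# A genome collected later is merged in; genomeA and genomeB are not reprocessed.
reduce_into_index(index, map_kmers("genomeC", "TTACGTAC", 4))

in_all_three = shared_repeats(index, min_genomes=3)
```

In a real MapReduce deployment the map and reduce steps would run as distributed tasks and the index would live in a persistent table; the dictionary here only mirrors the shape of that data.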

Keywords

DNA; bioinformatics; data mining; genomics; molecular biophysics; molecular clusters; molecular configurations; parallel processing; DNA repeats; Hadoop distributed file system; MRSMRS; alignment-based approaches; cellular regulatory functions; cluster; computational approaches; deeply conserved functions; disease development; genomic sequence data; hash-based approaches; human genome; intermediate data generation; linear scalability; mining repetitive sequences in a MapReduce setting; palindromes; quad-core commodity processor; redundant computation; repeat variability; scalable solution; specific phenotypes development; ultraconserved elements; Big data; Diseases; Phantoms; Runtime; repetitive sequences; sequence analysis; tandem repeats

fLanguage

English

Publisher

IEEE

Conference_Title

Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on

Conference_Location

Belfast

Type

conf

DOI

10.1109/BIBM.2014.6999201

Filename

6999201

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=1784903