Regional Information Center for Science and Technology - MRSMRS: Mining repetitive sequences in a MapReduce setting

DocumentCode :

1784903

Title :

MRSMRS: Mining repetitive sequences in a MapReduce setting

Author :

Hongfei Cao ; Phinney, Michael ; Petersohn, Devin ; Merideth, Benjamin ; Chi-Ren Shyu

Author_Institution :

Dept. of Comput. Sci., Univ. of Missouri, Columbia, MO, USA

fYear :

2014

fDate :

2-5 Nov. 2014

Firstpage :

463

Lastpage :

470

Abstract :

Recent research suggests that DNA repeats play critical roles in cellular regulatory functions and disease development. Repeat variability among different species, or within the same species, is also an important indicator for the development of specific phenotypes, and similarities in repetitive sequences among different species have been shown to indicate deeply conserved functions. Patterns such as ultraconserved elements (UCEs), tandem repeats, and palindromes have been of interest, and researchers use various computational approaches to identify each of these pattern types. The challenge in identifying repeats across a collection of genomes arises from the amount of data stored within DNA: the human genome alone consists of more than 3.1 billion base pairs, and the intermediate data generated by alignment- and hash-based approaches are substantial. This sort of all-against-all analysis on a large collection of genomic sequence data often requires data to be reprocessed when new genomes are collected. To handle data of this scale, we utilize the Hadoop Distributed File System running on a cluster of 11 relatively inexpensive nodes, each containing a quad-core commodity processor. Furthermore, to alleviate redundant computation, intermediate data are organized in HBase, allowing us to incrementally process new genomic data without having to reprocess existing genomes. Our approach offers a cost-effective, flexible, robust, and scalable solution to the challenge of identifying various types of repetitive sequences across a collection of genomes. In this study, we benchmark our method using a collection of 6 genomes totaling approximately 14.2 billion base pairs. Three case studies are presented, demonstrating a 10.4-fold speedup over previous state-of-the-art approaches and linear scalability.
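
The abstract mentions hash-based repeat detection in a MapReduce setting, with intermediate data kept so that new genomes can be added incrementally. The following is only a minimal single-machine sketch of that general idea, not the authors' implementation: the function names, the fixed k-mer length, and the in-memory dictionary standing in for a persistent store such as HBase are all assumptions made for illustration.

```python
from collections import defaultdict


def map_kmers(genome_id, sequence, k):
    """Map step (illustrative): emit (k-mer, (genome_id, position)) per window."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], (genome_id, i)


def add_genome(index, genome_id, sequence, k):
    """Shuffle step (illustrative): group emitted pairs by k-mer.

    `index` is a plain dict used here as a hypothetical stand-in for a
    persistent table; genomes already indexed are not reprocessed.
    """
    for kmer, location in map_kmers(genome_id, sequence, k):
        index[kmer].append(location)


def reduce_repeats(index, min_count=2):
    """Reduce step (illustrative): keep k-mers seen at least min_count times."""
    return {kmer: locs for kmer, locs in index.items() if len(locs) >= min_count}


# Toy sequences, invented for this example only.
index = defaultdict(list)
add_genome(index, "g1", "ACGTACGA", k=4)
add_genome(index, "g2", "TTACGT", k=4)  # added later; g1 is not rescanned
repeats = reduce_repeats(index)
```

In this toy run, `repeats` holds the two 4-mers shared by both sequences (`ACGT` and `TACG`) together with their locations. A real deployment would distribute the map and reduce steps across cluster nodes and would need to handle reverse complements, variable repeat lengths, and far larger intermediate data.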

Keywords :

DNA; bioinformatics; data mining; genomics; molecular biophysics; molecular clusters; molecular configurations; parallel processing; DNA repeats; Hadoop distributed file system; MRSMRS; alignment-based approaches; cellular regulatory functions; cluster; computational approaches; deeply conserved functions; disease development; genomic sequence data; hash-based approaches; human genome; intermediate data generation; linear scalability; mining repetitive sequences in a MapReduce setting; palindromes; quad-core commodity processor; redundant computation; repeat variability; scalable solution; specific phenotypes development; ultraconserved elements; Big data; Bioinformatics; DNA; Diseases; Genomics; Phantoms; Runtime; Big Data; palindromes; repetitive sequences; sequence analysis; tandem repeats

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on

Conference_Location :

Belfast

Type :

conf

DOI :

10.1109/BIBM.2014.6999201

Filename :

6999201

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1784903