• DocumentCode
    1784903
  • Title
    MRSMRS: Mining repetitive sequences in a MapReduce setting
  • Author
    Cao, Hongfei ; Phinney, Michael ; Petersohn, Devin ; Merideth, Benjamin ; Shyu, Chi-Ren
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Missouri, Columbia, MO, USA
  • fYear
    2014
  • fDate
    2-5 Nov. 2014
  • Firstpage
    463
  • Lastpage
    470
  • Abstract
    Recent research suggests that DNA repeats play critical roles in cellular regulatory functions and disease development. Repeat variability among different species, or within the same species, is also an important indicator for the development of specific phenotypes, and similarities in repetitive sequences among different species have been shown to indicate deeply conserved functions. Patterns such as ultraconserved elements (UCEs), tandem repeats, and palindromes have been of interest, and researchers use various computational approaches to identify each of these pattern types. The challenge in identifying repeats across a collection of genomes arises from the amount of data stored within DNA: the human genome alone consists of more than 3.1 billion base pairs, and the intermediate data generated by alignment- and hash-based approaches are substantial. This sort of all-against-all analysis on a large collection of genomic sequence data often requires data to be reprocessed when new genomes are collected. To handle data of this scale, we use the Hadoop Distributed File System running on a cluster of 11 relatively inexpensive nodes, each containing a quad-core commodity processor. Furthermore, to avoid redundant computation, intermediate data are organized in HBase, allowing us to process new genomic data incrementally without reprocessing existing genomes. Our approach offers a cost-effective, flexible, robust, and scalable solution to the challenge of identifying various types of repetitive sequences across a collection of genomes. In this study, we benchmark our method on a collection of 6 genomes totaling approximately 14.2 billion base pairs. Three case studies are presented, demonstrating a 10.4-fold speedup over previous state-of-the-art approaches, as well as linear scalability.
  • Keywords
    DNA; bioinformatics; data mining; genomics; molecular biophysics; molecular clusters; molecular configurations; parallel processing; DNA repeats; Hadoop distributed file system; MRSMRS; alignment-based approaches; cellular regulatory functions; cluster; computational approaches; deeply conserved functions; disease development; genomic sequence data; hash-based approaches; human genome; intermediate data generation; linear scalability; mining repetitive sequences in a MapReduce setting; palindromes; quad-core commodity processor; redundant computation; repeat variability; scalable solution; specific phenotypes development; ultraconserved elements; Big data; Diseases; Phantoms; Runtime; repetitive sequences; sequence analysis; tandem repeats
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on
  • Conference_Location
    Belfast
  • Type
    conf
  • DOI
    10.1109/BIBM.2014.6999201
  • Filename
    6999201