• DocumentCode
    249335
  • Title

    A Processing Pipeline for Cassandra Datasets Based on Hadoop Streaming

  • Author

    Dede, E. ; Sendir, B. ; Kuzlu, P. ; Weachock, J. ; Govindaraju, M. ; Ramakrishnan, Lavanya

  • Author_Institution
    Grid & Cloud Comput. Res. Lab., SUNY Binghamton, Binghamton, NY, USA
  • fYear
    2014
  • fDate
    June 27 2014-July 2 2014
  • Firstpage
    168
  • Lastpage
    175
  • Abstract
    The progressive transition in the nature of both scientific and industrial datasets has been the driving force behind the development of, and research interest in, the NoSQL data model. Loosely structured data poses a challenge to traditional data store systems, and when working with the NoSQL model, these systems are often considered impractical and expensive. As the quantity of unstructured data grows, so does the demand for a processing pipeline that is capable of seamlessly combining the NoSQL storage model and a "Big Data" processing platform such as MapReduce. Although MapReduce is the paradigm of choice for data-intensive computing, Java-based frameworks such as Hadoop require users to write MapReduce code in Java. Hadoop Streaming, on the other hand, allows users to define non-Java executables as map and reduce operations. Similarly, for legacy C/C++ applications and other non-Java executables, there is a need to allow NoSQL data stores access to the features of Hadoop Streaming. In this paper, we present approaches to solving the challenge of integrating NoSQL data stores with MapReduce for non-Java application scenarios, along with the advantages and disadvantages of each approach. We compare Hadoop Streaming with our own streaming framework, MARISSA, to show the performance implications of coupling NoSQL data stores like Cassandra with MapReduce frameworks that normally rely on file-system-based data stores.
  • Keywords
    Big Data; C++ language; Java; SQL; parallel processing; pipeline processing; Big Data processing platform; C applications; C++ applications; Cassandra datasets; Hadoop Streaming; Java-based frameworks; MARISSA; MapReduce; NoSQL data model; Data models; Databases; Pipelines; Servers; Cassandra; Hadoop; NoSQL
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data (BigData Congress), 2014 IEEE International Congress on
  • Conference_Location
    Anchorage, AK
  • Print_ISBN
    978-1-4799-5056-0
  • Type

    conf

  • DOI
    10.1109/BigData.Congress.2014.32
  • Filename
    6906775