• DocumentCode
    2545700
  • Title
    Beyond Simple Integration of RDBMS and MapReduce -- Paving the Way toward a Unified System for Big Data Analytics: Vision and Progress
  • Author
    Xiongpai Qin ; Huiju Wang ; Furong Li ; Baoyao Zhou ; Yu Cao ; Cuiping Li ; Hong Chen ; Xuan Zhou ; Xiaoyong Du ; Shan Wang
  • Author_Institution
    Key Lab. of Data Eng. & Knowledge Eng. (RUC), Beijing, China
  • fYear
    2012
  • fDate
    1-3 Nov. 2012
  • Firstpage
    716
  • Lastpage
    725
  • Abstract
    MapReduce has shown strong vitality and has penetrated both academia and industry in recent years. It can be used as an ETL tool, and it can do much more: the technique has been applied to SQL summation, OLAP, data mining, machine learning, information retrieval, multimedia data processing, scientific data processing, etc. At its core, MapReduce is a general-purpose parallel computing framework for processing large datasets. A big data analytics ecosystem built around MapReduce is emerging alongside the traditional one built around RDBMS. The objectives of RDBMS and MapReduce, and of the ecosystems built around them, overlap considerably. In some sense they do the same thing, and MapReduce can handle additional workloads, such as graph processing, that RDBMS cannot handle well. RDBMS, in turn, delivers high performance for relational data processing, where MapReduce still needs to catch up. The authors envision that the two techniques are fusing into a unified system for big data analytics. In the ongoing effort to build such a system, much of the groundwork has been laid, but some critical issues remain unresolved, and we try to identify some of them. Two of our works are presented together with experimental results. The first applies a hierarchical encoding to star schema data in Hadoop for high-performance OLAP processing. The second leverages the three replicas that HDFS naturally keeps of each block to store different data layouts and speed up queries in an OLAP workload; a cost model routes user queries to the different data layouts.
  • Keywords
    SQL; data analysis; data mining; ecology; information retrieval; learning (artificial intelligence); multimedia computing; parallel processing; relational databases; ETL tool; HDFS blocks; Hadoop; MapReduce; OLAP processing; OLAP workload; RDBMS; SQL summation; data analytics ecosystem; data layouts; ecosystems; general purpose parallel computing framework; graph processing; hierarchical encoding; large dataset processing; machine learning; multimedia data processing; relational data processing; science data processing; star schema data; user query; Data handling; Data processing; Data storage systems; Databases; Information management; Layout; Big Data Analytics; OLAP; Unified System; Vision
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cloud and Green Computing (CGC), 2012 Second International Conference on
  • Conference_Location
    Xiangtan
  • Print_ISBN
    978-1-4673-3027-5
  • Type
    conf
  • DOI
    10.1109/CGC.2012.39
  • Filename
    6382895