• DocumentCode
    1924689
  • Title
    Mastiff: A MapReduce-based System for Time-Based Big Data Analytics
  • Author
    Guo, Sijie ; Xiong, Jin ; Wang, Weiping ; Lee, Rubao
  • Author_Institution
    State Key Lab. of Comput. Archit., Inst. of Comput. Technol., Beijing, China
  • fYear
    2012
  • fDate
    24-28 Sept. 2012
  • Firstpage
    72
  • Lastpage
    80
  • Abstract
    Existing MapReduce-based warehousing systems are not specifically optimized for time-based big data analysis applications. Such applications have two characteristics: 1) data are continuously generated and must be stored persistently for a long period of time; and 2) applications usually process data within some time period, so typical queries use time-related predicates. Time-based big data analytics requires both high data loading speed and high query execution performance. However, existing systems, including current MapReduce-based solutions, do not solve this problem well because the two requirements are contradictory. We have implemented a MapReduce-based system, called Mastiff, which provides a solution to achieve both high data loading speed and high query performance. Mastiff exploits a systematic combination of a column group store structure and a lightweight helper structure. Furthermore, Mastiff uses an optimized table scan method and a column-based query execution engine to boost query performance. Based on extensive experimental results with diverse workloads, we show that Mastiff can significantly outperform existing systems including Hive, HadoopDB, and GridSQL.
  • Keywords
    data analysis; data warehouses; distributed processing; query processing; storage management; GridSQL; HadoopDB; Hive; MapReduce-based warehousing systems; Mastiff; column group store structure; column-based query execution engine; data continuous generation; data processing; data storage; helper structure; high data loading speed; high query execution performance; optimized table scan method; query performance; time-based big data analytics; time-related predicate query; Data handling; Data storage systems; Engines; Indexes; Information management; Loading; Servers; MapReduce; time-based data analytics;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster Computing (CLUSTER), 2012 IEEE International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4673-2422-9
  • Type
    conf
  • DOI
    10.1109/CLUSTER.2012.10
  • Filename
    6337767