• DocumentCode
    659619
  • Title
    Index-based join operations in Hive
  • Author
    Mofidpoor, Mahsa ; Shiri, Nematollaah ; Radhakrishnan, Thiruvengadam
  • Author_Institution
    Comput. Sci. & Software Eng., Concordia Univ., Montreal, QC, Canada
  • fYear
    2013
  • fDate
    6-9 Oct. 2013
  • Firstpage
    26
  • Lastpage
    33
  • Abstract
    Indexing techniques are crucial for the efficiency and scalability of processing queries over big data. Hive is a batch-oriented big data management engine that is well suited for OLAP and data analysis applications. For very “selective” queries whose output sizes are a small fraction of the contributing data, the brute-force approach suffers from poor performance due to redundant disk I/Os or the initiation of extra map operations. We make a first attempt at an index-based join technique to speed up the process, and integrate it into Hive by mapping our design to the conceptual optimization flow. To evaluate the performance, we create and run test queries on datasets generated using the TPC-H benchmark. Our results indicate a significant performance gain over relatively large data and/or highly selective queries having a two-way join and a single join condition.
  • Keywords
    data mining; indexing; query processing; Hive; Indexing techniques; TPC-H benchmark; batch oriented big data management engine; data OLAP applications; data analysis applications; index based join operations; selective queries; Data handling; Data structures; Information management; Optimization; Time factors; Hadoop; Join Operation; Map and Reduce functions
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data, 2013 IEEE International Conference on
  • Conference_Location
    Silicon Valley, CA
  • Type
    conf
  • DOI
    10.1109/BigData.2013.6691768
  • Filename
    6691768