• DocumentCode
    2493001
  • Title
    Using Index in the MapReduce Framework
  • Author
    An, Mingyuan ; Wang, Yang ; Wang, Weiping
  • Author_Institution
    Key Lab. of Comput. Syst. & Archit., Chinese Acad. of Sci., Beijing, China
  • fYear
    2010
  • fDate
    6-8 April 2010
  • Firstpage
    52
  • Lastpage
    58
  • Abstract
    MapReduce is a programming framework introduced by Google for large-scale data processing. It is usually used in a scan-centric fashion: all the data are split into blocks, a Map is generated for each block to scan and process the data in that block, and Reduces then merge the outputs from all the Maps. When a query processes only a subset of the data selected by a predicate, this brute-force method can incur extra I/O overhead on irrelevant data, and the overhead of initiating so many Maps may be non-trivial given that the data actually of interest to the query is comparatively small in volume. We propose an approach that integrates an index into MapReduce execution, in which only an appropriate number of Maps are generated, each of which accesses the data through the index. This approach incurs random I/O and remote data access, so the overall performance depends on both system parameters and query characteristics. We build a cost model for both index access execution and traditional full scan execution. The cost model can be used to choose between the two execution modes before a query is executed. Experiments show that index access execution can greatly outperform full scan execution when the selectivity of the predicate is low, and that the cost model predicts the actual execution cost very well, so it can be used to determine the execution plan for a query.
  • Keywords
    data structures; parallel programming; Google; MapReduce framework; index; large scale data processing; random I/O; remote data access; Computer science; Costs; Delay; Energy efficiency; Energy storage; Flash memory; Indexing; Mechanical factors; Nonvolatile memory; Tree data structures; MapReduce; access methods; cost model; index
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Conference (APWEB), 2010 12th International Asia-Pacific
  • Conference_Location
    Busan
  • Print_ISBN
    978-0-7695-4012-2
  • Electronic_ISBN
    978-1-4244-6600-9
  • Type
    conf
  • DOI
    10.1109/APWeb.2010.12
  • Filename
    5474155