• DocumentCode
    1436346
  • Title

    An Optimized High-Throughput Strategy for Constructing Inverted Files

  • Author

    Wei, Zheng ; Jaja, Joseph

  • Author_Institution
    Dept. of Electr. & Comput. Eng., Univ. of Maryland, College Park, MD, USA
  • Volume
    23
  • Issue
    11
  • fYear
    2012
  • Firstpage
    2033
  • Lastpage
    2044
  • Abstract
    Current high-throughput algorithms for constructing inverted files all follow the MapReduce framework, which presents a high-level programming model that hides the complexities of parallel programming. In this paper, we take an alternative approach and develop a novel strategy that exploits the current and emerging architectures of multicore processors. Our algorithm is based on a high-throughput pipelined strategy that produces parallel parsed streams, which are immediately consumed at the same rate by parallel indexers. We have performed extensive tests of our algorithm on a cluster of 32 nodes, and were able to achieve a throughput close to the peak throughput of the I/O system: a throughput of 280 MB/s on a single node and a throughput that ranges between 5.15 GB/s (1 Gb/s Ethernet interconnect) and 6.12 GB/s (10 Gb/s InfiniBand interconnect) on a cluster with 32 nodes for processing the ClueWeb09 data set. Such a performance represents a substantial gain over the best known MapReduce algorithms even when comparing the single node performance of our algorithm to MapReduce algorithms running on large clusters. Our results shed a light on the extent of the performance cost that may be incurred by using the simpler, higher level MapReduce programming model for large scale applications.
  • Keywords
    grammars; indexing; input-output programs; multiprocessing systems; parallel programming; pipeline processing; I/O system; MapReduce framework; high-level programming model; high-throughput algorithms; inverted files; multicore processors; optimized high-throughput pipelined strategy; parallel parsed streams; parallel programming; Clustering algorithms; Dictionaries; Indexing; Multicore processing; Program processors; Throughput; I/O throughput; Inverted files; MapReduce; cluster; multicore processors; parallel algorithms; parallel parsing and indexing; pipeline;
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    1045-9219
  • Type

    jour

  • DOI
    10.1109/TPDS.2012.43
  • Filename
    6143923