• DocumentCode
    2958879
  • Title
    Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency
  • Author
    Chhugani, Jatin ; Satish, Nadathur ; Kim, Changkyu ; Sewall, Jason ; Dubey, Pradeep
  • fYear
    2012
  • fDate
    21-25 May 2012
  • Firstpage
    378
  • Lastpage
    389
  • Abstract
    Graph-based structures are being increasingly used to model data and relations among data in a number of fields. Graph-based databases are becoming more popular as a means to better represent such data. Graph traversal is a key component in graph algorithms such as reachability and graph matching. Since the scale of data stored and queried in these databases is increasing, it is important to obtain high performing implementations of graph traversal that can efficiently utilize the processing power of modern processors. In this work, we present a scalable Breadth-First Search Traversal algorithm for modern multi-socket, multi-core CPUs. Our algorithm uses lock- and atomic-free operations on a cache-resident structure for arbitrary sized graphs to filter out expensive main memory accesses, and completely and efficiently utilizes all available bandwidth resources. We propose a work distribution approach for multi-socket platforms that ensures load-balancing while keeping cross-socket communication low. We provide a detailed analytical model that accurately projects the performance of our single- and multi-socket traversal algorithms to within 5-10% of obtained performance. Our analytical model serves as a useful tool to analyze performance bottlenecks on modern CPUs. When measured on various synthetic and real-world graphs with a wide range of graph sizes, vertex degrees and graph diameters, our implementation on a dual-socket Intel® Xeon® X5570 (Intel microarchitecture code name Nehalem) system achieves 1.5X-13.2X performance speedup over the best reported numbers. We achieve around 1 Billion traversed edges per second on a scale-free R-MAT graph with 64M vertices and 2 Billion edges on a dual-socket Nehalem system. Our optimized algorithm is useful as a building block for efficient multi-node implementations and future exascale systems, thereby allowing them to ride the trend of increasing per-node compute and bandwidth resources.
  • Keywords
    cache storage; complex networks; data structures; multiprocessing systems; network theory (graphs); performance evaluation; query processing; reachability analysis; resource allocation; tree searching; analytical model; arbitrary sized graphs; atomic-free operations; bandwidth resource utilization; cache-resident structure; cross-socket communication; data modeling; data representation; database querying; dual-socket Intel Xeon X5570 system; dual-socket Nehalem system; exascale systems; graph diameters; graph edges; graph matching; graph sizes; graph traversal algorithm; graph-based databases; graph-based structures; load balancing; lock-free operations; multisocket multicore CPU; performance bottleneck analysis; reachability; scalable breadth-first search traversal algorithm; scale-free R-MAT graph; single-node efficiency maximization; single-socket traversal algorithms; vertex degrees; work distribution approach; Arrays; Bandwidth; Instruction sets; Partitioning algorithms; Sockets; Graph traversal; bandwidth; efficient; multi-socket; single node;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International
  • Conference_Location
    Shanghai
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-4673-0975-2
  • Type
    conf
  • DOI
    10.1109/IPDPS.2012.43
  • Filename
    6267875
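
The abstract describes a breadth-first traversal that consults a compact, cache-resident structure before touching main memory. The sketch below is a minimal, single-threaded illustration of that general idea, assuming a one-bit-per-vertex visited bitmap as the filter. It is not the authors' implementation: the function name `bfs_levels`, the adjacency-list input format, and the edge counter are hypothetical choices made for this example, and the paper's lock-free multi-threading and multi-socket work distribution are not modeled.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS over an adjacency list (illustrative sketch).

    adj    -- list of neighbor lists, adj[u] = vertices adjacent to u
    source -- start vertex
    Returns (depth, edges): depth[v] is the BFS level of v (-1 if
    unreachable) and edges is the number of adjacency entries scanned.
    """
    n = len(adj)
    # Assumed filter structure: one bit per vertex, so the bitmap is far
    # smaller than the depth array and is more likely to stay in cache.
    visited = bytearray((n + 7) // 8)
    depth = [-1] * n

    visited[source >> 3] |= 1 << (source & 7)
    depth[source] = 0
    frontier = [source]
    level = 0
    edges = 0

    while frontier:
        level += 1
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                edges += 1
                byte, mask = v >> 3, 1 << (v & 7)
                if visited[byte] & mask:
                    # Already seen: skip without touching the larger
                    # depth array.
                    continue
                visited[byte] |= mask
                depth[v] = level
                next_frontier.append(v)
        frontier = next_frontier
    return depth, edges


if __name__ == "__main__":
    # Small undirected example: a 4-cycle plus one isolated vertex.
    graph = [[1, 2], [0, 3], [0, 3], [1, 2], []]
    depth, edges = bfs_levels(graph, 0)
    print(depth, edges)
```

Dividing the scanned-edge count by the elapsed wall-clock time gives a traversed-edges-per-second figure, the kind of metric the abstract quotes, although a pure-Python loop like this one will not approach the reported rates.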