Title :
Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency
Author :
Chhugani, Jatin ; Satish, Nadathur ; Kim, Changkyu ; Sewall, Jason ; Dubey, Pradeep
Abstract :
Graph-based structures are increasingly used to model data and the relations among data in a number of fields. Graph-based databases are becoming more popular as a means to better represent such data. Graph traversal is a key component in graph algorithms such as reachability and graph matching. Since the scale of data stored and queried in these databases is increasing, it is important to obtain high-performing implementations of graph traversal that can efficiently utilize the processing power of modern processors. In this work, we present a scalable breadth-first search traversal algorithm for modern multi-socket, multi-core CPUs. Our algorithm uses lock- and atomic-free operations on a cache-resident structure for arbitrarily sized graphs to filter out expensive main memory accesses, and completely and efficiently utilizes all available bandwidth resources. We propose a work distribution approach for multi-socket platforms that ensures load balancing while keeping cross-socket communication low. We provide a detailed analytical model that accurately projects the performance of our single- and multi-socket traversal algorithms to within 5-10% of obtained performance. Our analytical model serves as a useful tool to analyze performance bottlenecks on modern CPUs. When measured on various synthetic and real-world graphs with a wide range of graph sizes, vertex degrees and graph diameters, our implementation on a dual-socket Intel® Xeon® X5570 (Intel microarchitecture code name Nehalem) system achieves a 1.5X-13.2X performance speedup over the best reported numbers. We achieve around 1 billion traversed edges per second on a scale-free R-MAT graph with 64M vertices and 2 billion edges on a dual-socket Nehalem system. Our optimized algorithm is useful as a building block for efficient multi-node implementations and future exascale systems, thereby allowing them to ride the trend of increasing per-node compute and bandwidth resources.
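The abstract's central idea, a compact visited structure that filters out repeated vertex accesses during breadth-first search, can be illustrated with a minimal single-threaded sketch. This is not the authors' implementation: the function name `bfs_levels`, the adjacency-list layout, and the one-bit-per-vertex `bytearray` are assumptions made for illustration, and the sketch does not model the multi-socket work distribution or the lock- and atomic-free parallel updates described in the paper.

```python
def bfs_levels(num_vertices, edges, source):
    """Level-synchronous BFS over an undirected graph (illustrative sketch).

    Returns (depth, edges_examined), where depth[v] is the BFS level of v
    (-1 if unreachable) and edges_examined counts adjacency entries scanned.
    """
    # Adjacency lists; each undirected edge is stored in both directions.
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    # One bit per vertex. This small array stands in (by assumption) for the
    # cache-resident structure that avoids touching the larger depth array.
    visited = bytearray((num_vertices + 7) // 8)
    depth = [-1] * num_vertices

    def test_and_set(v):
        """Mark v as visited; return True if it was already marked."""
        byte, mask = v >> 3, 1 << (v & 7)
        seen = visited[byte] & mask
        visited[byte] |= mask
        return bool(seen)

    test_and_set(source)
    depth[source] = 0
    frontier = [source]
    edges_examined = 0
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                edges_examined += 1
                # The bit test filters vertices that were already reached,
                # so depth[] is written at most once per vertex.
                if not test_and_set(v):
                    depth[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return depth, edges_examined


if __name__ == "__main__":
    sample_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
    print(bfs_levels(6, sample_edges, 0))
```

A count such as `edges_examined`, divided by the elapsed traversal time, gives a traversed-edges-per-second style figure; the exact counting convention behind the numbers reported in the abstract is not stated in this record.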
Keywords :
cache storage; complex networks; data structures; multiprocessing systems; network theory (graphs); performance evaluation; query processing; reachability analysis; resource allocation; tree searching; analytical model; arbitrary sized graphs; atomic-free operations; bandwidth resource utilization; cache-resident structure; cross-socket communication; data modeling; data representation; database querying; dual-socket Intel Xeon X5570 system; dual-socket Nehalem system; exascale systems; graph diameters; graph edges; graph matching; graph sizes; graph traversal algorithm; graph-based databases; graph-based structures; load balancing; lock-free operations; multisocket multicore CPU; performance bottleneck analysis; reachability; scalable breadth-first search traversal algorithm; scale-free R-MAT graph; single-node efficiency maximization; single-socket traversal algorithms; vertex degrees; work distribution approach; Arrays; Bandwidth; Instruction sets; Partitioning algorithms; Sockets; Graph traversal; bandwidth; efficient; multi-socket; single node;
Conference_Title :
Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International
Conference_Location :
Shanghai
Print_ISBN :
978-1-4673-0975-2
DOI :
10.1109/IPDPS.2012.43