XBFS: eXploring Runtime Optimizations for Breadth-First Search on GPUs

doi:10.1145/3307681.3326606

Open Access • Proceedings Article • 10.1145/3307681.3326606

Anil Gaihre, +3 more

- 17 Jun 2019

- pp 121-131

47

TL;DR: XBFS applies runtime optimizations on GPUs to cope with the nondeterministic characteristics of BFS. Among its techniques, it adaptively chooses between four new or optimized frontier queue generation designs to suit BFS levels that present dissimilar features.

Abstract: Attracted by the enormous potentials of Graphics Processing Units (GPUs), an array of efforts has surged to deploy Breadth-First Search (BFS) on GPUs, which, however, often exploits the static mechanisms to address the challenges that are dynamic in nature. Such a mismatch prevents us from achieving the optimal performance for offloading graph traversal on GPUs. To this end, we propose XBFS that leverages the runtime optimizations atop GPUs to cope with the nondeterministic characteristics of BFS with the following three techniques: First, XBFS adaptively exploits four either new or optimized frontier queue generation designs to accommodate various BFS levels that present dissimilar features. Second, inspired by the observation that the workload associated with each vertex is not proportional to its degree in bottom-up, we design three new strategies to better balance the workload. Third, XBFS introduces the first truly asynchronous bottom-up traversal which allows BFS to visit vertices for multiple levels at a single iteration with both theoretical soundness and practical benefits. Taken together, XBFS is, on average, 3.5×, 4.9×, 11.2× and 6.1× faster than the state-of-the-art Enterprise, Tigr, Gunrock on a Quadro P6000 GPU and Ligra on a 24-core Intel Xeon Platinum 8175M CPU. Note, the CPU used for Ligra is more expensive than the GPU for XBFS.
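
The abstract contrasts per-level frontier handling with "bottom-up" traversal, so a small CPU-side sketch of the two directions may help. This is not the XBFS implementation. It is a generic level-synchronous BFS that switches direction based on frontier size. It assumes an undirected graph stored as a dictionary of adjacency lists, and the function name `bfs_direction_switching` and the `switch_fraction` threshold are hypothetical names chosen for illustration.

```python
def bfs_direction_switching(adj, source, switch_fraction=0.1):
    """Level-synchronous BFS over an undirected graph {vertex: [neighbors]}.

    switch_fraction is a hypothetical tuning knob: a level is processed
    bottom-up when the frontier holds more than this fraction of all vertices.
    Returns a dict mapping each vertex to its BFS level (-1 if unreachable).
    """
    n = len(adj)
    level = {v: -1 for v in adj}
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        next_frontier = []
        if len(frontier) <= switch_fraction * n:
            # Top-down: every frontier vertex inspects all of its neighbors.
            for u in frontier:
                for v in adj[u]:
                    if level[v] == -1:
                        level[v] = depth + 1
                        next_frontier.append(v)
        else:
            # Bottom-up: every unvisited vertex searches its neighbors for a
            # parent in the current frontier and stops at the first hit.
            for v in adj:
                if level[v] != -1:
                    continue
                for u in adj[v]:
                    if level[u] == depth:
                        level[v] = depth + 1
                        next_frontier.append(v)
                        break
        frontier = next_frontier
        depth += 1
    return level


# Small example: vertex 0 is a hub, vertex 4 hangs off vertex 1.
example_graph = {0: [1, 2, 3], 1: [0, 4], 2: [0], 3: [0], 4: [1]}
levels_top_down = bfs_direction_switching(example_graph, 0, switch_fraction=1.0)
levels_bottom_up = bfs_direction_switching(example_graph, 0, switch_fraction=0.0)
```

The early `break` in the bottom-up branch means a vertex may inspect only part of its adjacency list, which gives one intuition for the abstract's observation that bottom-up work per vertex is not proportional to degree.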

Citations

•Proceedings Article•10.1145/3342195.3387537

Subway: minimizing data transfer during out-of-GPU-memory graph processing

Amir Hossein Nodehi Sabet, +2 more

- 15 Apr 2020

TL;DR: This work designs a fast subgraph generation algorithm with a simple yet efficient subgraph representation and a GPU-accelerated implementation, and brings asynchrony to the subgraph processing, delaying the synchronization between a subgraph in the GPU memory and the rest of the graph in the CPU memory.

70

Proceedings Article•10.1145/3447786.3456247

Seastar: vertex-centric programming for graph neural networks

Yidi Wu, +7 more

- 21 Apr 2021

TL;DR: Seastar is a vertex-centric programming model for GNN training on GPUs that provides idiomatic Python constructs to ease development of novel homogeneous and heterogeneous GNN models.

62

•Journal Article•10.1145/3570638

Optimization Techniques for GPU Programming

Pieter Hijma, +4 more

- 14 Nov 2022

- ACM Computing Surveys

TL;DR: This survey discusses optimization techniques found in 450 articles published in the last 14 years and analyzes them from different perspectives, showing that the optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.

54

•Proceedings Article•10.1109/SC41405.2020.00060

C-SAW: A Framework for Graph Sampling and Random Walk on GPUs

Santosh Pandey, +4 more

- 18 Sep 2020

- arXiv: Distributed, Parallel, and Cluste...

TL;DR: C-SAW is introduced as the first framework that accelerates sampling and random walk on GPUs. It provides a generic API that lets users implement a wide range of sampling and random walk algorithms with ease.

46

Journal Article•10.14778/3425879.3425883

EMOGI: efficient memory-access for out-of-memory graph-traversal in GPUs

Seungwon Min, +5 more

- 01 Oct 2020

TL;DR: This paper addresses the open question of whether a sufficiently large number of overlapping cacheline-sized accesses can be sustained to tolerate the long latency to host memory, fully utilize the available bandwidth, and achieve favorable execution performance. It proposes EMOGI, an alternative approach for traversing graphs that do not fit in GPU memory, which uses direct cacheline-sized accesses to data stored in host memory.

44

References

•Book

Lapack Users' Guide

Ed Anderson

- 01 Feb 1995

TL;DR: The third edition of the LAPACK Users' Guide covers installation and troubleshooting of the routines and gives examples of converting from LINPACK or EISPACK to LAPACK.

3.2K

•Book

LINPACK Users' Guide

Jack Dongarra, +3 more

- 01 Jan 1987

TL;DR: Covers general matrices, band matrices, positive definite matrices, positive definite band matrices, symmetric indefinite matrices, triangular matrices, and tridiagonal matrices, along with the Cholesky decomposition, the QR decomposition, and the singular value decomposition.

1.7K

Proceedings Article•10.1145/2442516.2442530

Ligra: a lightweight graph processing framework for shared memory

Julian Shun, +1 more

- 23 Feb 2013

TL;DR: This paper presents a lightweight graph processing framework that is specific for shared-memory parallel/multicore machines, which makes graph traversal algorithms easy to write and significantly more efficient than previously reported results using graph frameworks on machines with many more cores.

964

Parallel Prefix Sum (Scan) with CUDA

Mark J. Harris

- 01 Jan 2011

TL;DR: This chapter describes how to implement the parallel prefix sum (scan) operation efficiently in CUDA.

788

•Proceedings Article•10.5555/1280094.1280110

Scan primitives for GPU computing

Shubhabrata Sengupta, +3 more

- 04 Aug 2007

TL;DR: Using the scan primitives, this work shows novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyzes the performance of the scan primitives, several sort algorithms that use them, and a graphical shallow-water fluid simulation that uses the scan framework for a tridiagonal matrix solver.

655

Related Papers (5)

Gunrock: a high-performance graph processing library on the GPU

Ligra: a lightweight graph processing framework for shared memory

Scalable GPU graph traversal

Pregel: a system for large-scale graph processing

CuSha: vertex-centric graph processing on GPUs