Proceedings Article, DOI: 10.1145/2851141.2851169
GPU multisplit
Saman Ashkiani,Andrew Davidson,Ulrich Meyer,John D. Owens +3 more
- 27 Feb 2016
pp 12
36
TL;DR: This work provides a parallel model and multiple implementations for the multisplit problem, and uses warp-synchronous programming models to avoid branch divergence and reduce memory usage, as well as hierarchical reordering of input elements to achieve better coalescing of global memory accesses.
Abstract: Multisplit is a broadly useful parallel primitive that permutes its input data into contiguous buckets or bins, where the function that categorizes an element into a bucket is provided by the programmer. Due to the lack of an efficient multisplit on GPUs, programmers often choose to implement multisplit with a sort. However, sort does more work than necessary to implement multisplit, and is thus inefficient. In this work, we provide a parallel model and multiple implementations for the multisplit problem. Our principal focus is multisplit for a small number of buckets. In our implementations, we exploit the computational hierarchy of the GPU to perform most of the work locally, with minimal usage of global operations. We also use warp-synchronous programming models to avoid branch divergence and reduce memory usage, as well as hierarchical reordering of input elements to achieve better coalescing of global memory accesses. On an NVIDIA K40c GPU, for key-only (key-value) multisplit, we demonstrate a 3.0-6.7x (4.4-8.0x) speedup over radix sort, and achieve a peak throughput of 10.0 G keys/s.
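To make the primitive concrete, here is a minimal sequential sketch of what multisplit computes: a histogram of bucket sizes, an exclusive prefix sum giving each bucket's starting offset, and a stable scatter into contiguous buckets. It is a CPU reference for the semantics only, not the paper's GPU implementation; the function name `multisplit`, the parameter `bucket_of`, and the choice to keep the output stable are assumptions made for illustration.

```python
from itertools import accumulate
from typing import Callable, List, Sequence, Tuple


def multisplit(keys: Sequence[int], num_buckets: int,
               bucket_of: Callable[[int], int]) -> Tuple[List[int], List[int]]:
    """Sequential reference: returns (permuted keys, bucket start offsets).

    Hypothetical helper for illustration; bucket_of must return a value
    in range(num_buckets) for every key.
    """
    # 1. Histogram: count how many keys fall into each bucket.
    counts = [0] * num_buckets
    for k in keys:
        counts[bucket_of(k)] += 1

    # 2. Exclusive prefix sum: first output index of each bucket.
    offsets = [0] + list(accumulate(counts))[:-1]

    # 3. Scatter: place each key at the next free slot of its bucket,
    #    preserving input order within a bucket.
    cursor = list(offsets)
    out = [0] * len(keys)
    for k in keys:
        b = bucket_of(k)
        out[cursor[b]] = k
        cursor[b] += 1
    return out, offsets


keys = [7, 2, 9, 4, 1, 8, 3]
result, starts = multisplit(keys, 3, lambda k: k % 3)
print(result)  # [9, 3, 7, 4, 1, 2, 8]
print(starts)  # [0, 2, 5]
```

Unlike a full sort, this leaves keys inside each bucket in their input order rather than ordering them by value, which is the extra work the abstract says a sort-based multisplit performs unnecessarily.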
Citations
Gunrock: GPU Graph Analytics
Yangzihao Wang,Yuechao Pan,Andrew Davidson,Yuduo Wu,Carl Yang,Leyuan Wang,Muhammad Osama,Chenshan Yuan,Weitang Liu,Andy Riffel,John D. Owens +10 more
- 23 Aug 2017
TL;DR: The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries, and better performance than any other GPU high-level graph library.
A Dynamic Hash Table for the GPU
Saman Ashkiani,Martin Farach-Colton,John D. Owens +2 more
- 21 May 2018
TL;DR: A warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional way of per-thread (or per-warp) work assignment and processing is proposed, which builds a dynamic non-blocking concurrent linked list, the slab list, that supports asynchronous, concurrent updates as well as search queries.
•Posted Content
Gunrock: GPU Graph Analytics
Yangzihao Wang,Yuechao Pan,Andrew Davidson,Yuduo Wu,Carl Yang,Leyuan Wang,Muhammad Osama,Chenshan Yuan,Weitang Liu,Andy Riffel,John D. Owens +10 more
TL;DR: Gunrock is a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier for large-scale graph analytics.
61
Optimization Techniques for GPU Programming
TL;DR: This survey discusses various optimization techniques found in 450 articles published in the last 14 years and analyzes them from different perspectives, showing that the optimizations are highly interrelated, which explains the need for techniques such as auto-tuning.
54
A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs
Elias Stehle,Hans-Arno Jacobsen +1 more
TL;DR: This work proposes a novel approach that almost halves the amount of memory transfers and, therefore, considerably lifts the memory bandwidth limitation, and builds on the efficient GPU sorting approach with a pipelined heterogeneous sorting algorithm that mitigates the overhead associated with PCIe data transfers.
39
References
A note on two problems in connexion with graphs
TL;DR: A tree is a graph with one and only one path between every two nodes, where at least one path exists between any two nodes and the length of each branch is given.
The University of Florida Sparse Matrix Collection
Timothy A. Davis,Yifan Hu +1 more
TL;DR: The University of Florida Sparse Matrix Collection, a large and actively growing set of sparse matrices that arise in real applications, is described and a new multilevel coarsening scheme is proposed to facilitate this task.
4.3K
•Book
Digraphs: Theory, Algorithms and Applications
Jørgen Bang-Jensen,Gregory Gutin +1 more
- 05 Aug 2002
TL;DR: Digraphs is an essential, comprehensive reference for undergraduate and graduate students, and researchers in mathematics, operations research and computer science, and it will also prove invaluable to specialists in related areas, such as meteorology, physics and computational biology.
2.4K
Scalable parallel programming with CUDA
John R. Nickolls,Ian Buck,Michael Garland,Kevin Skadron +3 more
- 11 Aug 2008
TL;DR: Presents a collection of slides covering the following topics: CUDA parallel programming model; CUDA toolkit and libraries; performance optimization; and application development.
NVIDIA Tesla: A Unified Graphics and Computing Architecture
TL;DR: To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture, which is massively multithreaded and programmable in C or via graphics APIs.