How does the framework perform Vertex Refinement?

B. Vertex Refinement: Efficient inter-GPU Communication. To eliminate the overhead of transferring unnecessary vertices between devices, their framework performs Vertex Refinement in two steps: offline and online.

What is the way to scale the graph processing over multiple GPUs?

To scale graph processing over multiple GPUs in their framework, the authors introduced Vertex Refinement, which collects and transfers only those vertices that are both boundary vertices and recently updated.
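The two-step idea can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the `owner` mapping, and the rule that a vertex counts as boundary when one of its edges crosses into another device's partition are all assumptions made for this example.

```python
def offline_refinement(edges, owner):
    """Offline step (assumed rule): a vertex is 'boundary' if any of its
    outgoing edges ends at a vertex owned by a different GPU."""
    boundary = set()
    for src, dst in edges:
        if owner[src] != owner[dst]:
            boundary.add(src)
    return boundary


def online_refinement(local_vertices, boundary, updated, values):
    """Online step: keep only vertices that are boundary AND were updated
    in the current iteration, paired with their new values."""
    return [(v, values[v]) for v in local_vertices
            if v in boundary and v in updated]


# Hypothetical 4-vertex ring split across two GPUs.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
owner = {0: 0, 1: 0, 2: 1, 3: 1}
values = {0: 0.0, 1: 0.5, 2: 1.5, 3: 2.0}

boundary = offline_refinement(edges, owner)
# GPU 0 owns vertices 0 and 1; both were updated this iteration,
# but only vertex 1 is a boundary vertex, so only it is transferred.
outbox_gpu0 = online_refinement([0, 1], boundary, {0, 1}, values)
```

The offline pass runs once per partitioning, while the online pass runs every iteration, which is why only the latter needs to be cheap on the device.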

What is the effect of using host as the hub?

Their experiments show that using the host as the hub is always beneficial in reducing communication traffic and overall multi-GPU processing time compared to using inbox and outbox buffers residing inside the GPUs.

What is the speedup of the graphs and benchmarks?

When averaged over all the graphs and benchmarks, their approach provides 1.81x and 1.30x speedups over ALL and 1.77x and 1.28x speedups over MS for three-GPU and two-GPU configurations respectively.

What is the effect of a virtual warp on the vertex’s connected edges?

If the virtual warp is smaller than the vertex’s number of neighbors, it has to iterate over the vertex’s connected edges, thereby holding back other virtual warps that have already finished their jobs (see the example in Figure 2(a)).

What is the effect of a portion of a virtual warp being idle?

If the virtual warp has a size that is larger than the number of neighbors for a vertex, a portion of virtual warp’s lanes stays idle during the reduction leading to underutilization (see the example in Figure 2(b)).
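Both effects can be illustrated with a simplified cost model. The sketch below is an assumption-laden approximation, not a measurement from the paper: it assumes that all virtual warps sharing a physical warp advance in lockstep, that each step lets a virtual warp touch at most `vw_size` edges, and it ignores the cost of the reduction itself. The function name is hypothetical.

```python
import math


def virtual_warp_stats(degrees, vw_size):
    """Return (steps, lane_utilization) for virtual warps that share one
    physical warp, one vertex per virtual warp.

    steps: iterations needed by the slowest virtual warp; every other
           virtual warp waits for it.
    lane_utilization: useful edge slots divided by all lane slots spent.
    """
    steps_per_vertex = [max(1, math.ceil(d / vw_size)) for d in degrees]
    steps = max(steps_per_vertex)
    total_slots = steps * vw_size * len(degrees)
    return steps, sum(degrees) / total_slots


# Four virtual warps of 8 lanes each (one 32-lane physical warp).
# The degree-16 vertex needs two steps, so the other three wait, and the
# low-degree vertices leave most of their lanes idle.
steps, utilization = virtual_warp_stats([1, 2, 16, 3], vw_size=8)
```

With these inputs only 22 of the 64 lane slots do useful work, which shows how a single fixed virtual warp size can be simultaneously too small for one vertex and too large for its neighbors in the same warp.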

How does the graph processing process with more GPUs work?

By comparing large graphs and small graphs in Figure 10, the authors observe that as the graphs get larger, with a greater number of edges, adding more GPUs produces greater reductions in graph processing time.

What is the difference between PRAM and VWC?

In both PRAM-style and VWC, assigning a fixed number of SIMD threads to process one vertex and its edges leads to thread idling due to the highly irregular vertex degree distribution.

Open Access•Proceedings Article•10.1109/PACT.2015.15

Scalable SIMD-Efficient Graph Processing on GPUs

Farzad Khorasani, +2 more

- 18 Oct 2015

- pp 39-50

159

TL;DR: Warp Segmentation is presented, a novel method that greatly enhances GPU device utilization by dynamically assigning appropriate number of SIMD threads to process a vertex with irregular-sized neighbors while employing compact CSR representation to maximize the graph size that can be kept inside the GPU global memory.

Abstract: The vast computing power of GPUs makes them an attractive platform for accelerating large scale data parallel computations such as popular graph processing applications. However, the inherent irregularity and large sizes of real-world power law graphs makes effective use of GPUs a major challenge. In this paper we develop techniques that greatly enhance the performance and scalability of vertex-centric graph processing on GPUs. First, we present Warp Segmentation, a novel method that greatly enhances GPU device utilization by dynamically assigning appropriate number of SIMD threads to process a vertex with irregular-sized neighbors while employing compact CSR representation to maximize the graph size that can be kept inside the GPU global memory. Prior works can either maximize graph sizes (VWC uses the CSR representation) or device utilization (e.g., CuSha uses the CW representation, however, CW is roughly 2.5x the size of CSR). Second, we further scale graph processing to make use of multiple GPUs while proposing Vertex Refinement to address the challenge of judiciously using the limited bandwidth available for transferring data between GPUs via the PCIe bus. Vertex refinement employs parallel binary prefix sum to dynamically collect only the updated boundary vertices inside GPUs' outbox buffers for dramatically reducing inter-GPU data transfer volume. Whereas existing multi-GPU techniques (Medusa, TOTEM) perform high degree of wasteful vertex transfers. On a single GPU, our framework delivers average speedups of 1.29x to 2.80x over VWC. When scaled to multiple GPUs, our framework achieves up to 2.71x performance improvement compared to inter-GPU vertex communication schemes used by other multi-GPU techniques (i.e., Medusa, TOTEM).

Most frequently asked questions

1. What are the contributions in "Scalable SIMD-Efficient Graph Processing on GPUs"?

In this paper the authors develop techniques that greatly enhance the performance and scalability of vertex-centric graph processing on GPUs. First, the authors present Warp Segmentation, a novel method that greatly enhances GPU device utilization by dynamically assigning an appropriate number of SIMD threads to process a vertex with irregular-sized neighbors while employing the compact CSR representation to maximize the graph size that can be kept inside the GPU global memory. Second, the authors further scale graph processing to make use of multiple GPUs while proposing Vertex Refinement to address the challenge of judiciously using the limited bandwidth available for transferring data between GPUs via the PCIe bus.

2. How many vertex values will be held by each GPU?

In addition to CSR representation buffers, each GPU will hold one Outbox buffer that is filled with updated vertex indices and vertex values of the GPU-specific division.

3. How do the authors avoid contention over the atomic variable?

The authors avoid contention over the atomic variable mainly by relying on binary prefix sum for vertex refinement and involving only one warp lane in the outbox region reservation process.
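The pattern described here can be simulated on the CPU. The following is a rough sketch under stated assumptions, not the paper's kernel: a warp of 8 lanes is used to match the illustration in Figure 7, the loop stands in for the intra-warp inclusive binary prefix sum, and a one-element list named `counter` stands in for the atomic variable. All names are hypothetical.

```python
WARP_SIZE = 8  # illustrative size, as in Figure 7


def warp_compact(flags, items, outbox, counter):
    """Simulate one warp appending its flagged items to a shared outbox.

    flags[i] is True when lane i holds an updated boundary vertex.
    counter is a one-element list standing in for the atomic variable.
    """
    # Inclusive binary prefix sum over the per-lane predicate.
    prefix, running = [], 0
    for flag in flags:
        running += 1 if flag else 0
        prefix.append(running)
    total = prefix[-1]

    # A single lane reserves `total` slots with one atomic add, instead
    # of every flagged lane issuing its own atomic operation.
    base = counter[0]
    counter[0] += total

    # Each flagged lane writes at its reserved offset.
    for lane, flag in enumerate(flags):
        if flag:
            outbox[base + prefix[lane] - 1] = items[lane]
    return total


outbox = [None] * 16
counter = [0]
flags = [False, True, True, False, False, True, False, True]
items = ["v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7"]
written = warp_compact(flags, items, outbox, counter)
```

Because the prefix sum gives every flagged lane a unique offset inside the reserved region, the number of atomic operations drops from one per updated vertex to one per warp.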

4. What is the effect of adding more GPUs on the processing time of graphs?

In addition, higher density in larger graphs signifies the reduction in the processing time when scaling to multiple GPUs by downsizing inter-device vertex transfer volumes.

Figure 6. Organization of data structures in multi-GPU processing required for Vertex Refinement. The example graph represented in the above configuration has a total of M + N + P vertices and Q + R + T edges. The letters inside the boxes stand for the number of elements in the buffer. The sizes of the inbox and outbox buffers are determined during Offline Vertex Refinement.

Table IX THE SPEEDUP OF OUR FRAMEWORK WHEN SCALING TO MORE GPUS: FROM 2 TO 3 GPUS FOR THE TOP 6 GRAPHS; AND FROM 2 TO 3 AND FROM 1 TO 2 GPUS FOR THE REST OF THE GRAPHS.

Figure 10. The scalability of our framework over graphs with different number of edges and densities for SSSP benchmark. All the graphs are Rmat created with parameters a = 0.45, b = 0.25, and c = 0.15. y axis is the processing time (lower is better).

Figure 9. Processing-time breakdown into computation time and communication time for the Vertex Refinement (VR) compared to ALL and MS. Computation time is the total duration of kernel execution, and communication time is the total duration of inbox/outbox management plus inter-device memory copies. For each benchmark, the times are normalized with respect to the longest time. Note that these times cannot be used to infer the overall speedup due to the asynchronicity of devices.

Table I THE PERCENTAGE OF USEFUL VERTEX DATA AMONG ALL THE TRANSFERRED DATA WHEN ALL THE VERTICES (ALL) OR THE MAXIMAL SUBSET OF THEM (MS) ARE COPIED FROM ONE GPU TO ANOTHER. IN THIS TWO-GPU CONFIGURATION, THE GRAPH UNDER THE EXAMINATION IS AN RMAT GRAPH WITH APPROXIMATELY 40 MILLION VERTICES AND 470 MILLION EDGES.

Figure 7. An example of online vertex refinement stages via intra-warp inclusive binary prefix sum – warp size in the figure is 8.

Citations

Journal Article•10.1109/TCAD.2018.2821565

GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing

Guohao Dai, +8 more

- 01 Apr 2019

- IEEE Transactions on Computer-Aided Desi...

TL;DR: GraphH, a PIM architecture for graph processing on the hybrid memory cube array, is proposed to tackle all four problems mentioned above, including random access pattern causing local bandwidth degradation, poor locality leading to unpredictable global data access, heavy conflicts on updating the same vertex, and unbalanced workloads across processing units.

189

Proceedings Article•10.1145/3020078.3021739

ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA Architecture

Guohao Dai, +5 more

- 22 Feb 2017

TL;DR: ForeGraph, a large-scale graph processing framework based on the multi-FPGA architecture, is proposed, which outperforms state-of-the-art FPGA-based large-scale graph processing systems by 4.54x when executing PageRank on the Twitter graph.

174

Proceedings Article•10.1145/3352460.3358318

Alleviating Irregularity in Graph Analytics Acceleration: a Hardware/Software Co-Design Approach

Mingyu Yan, +13 more

- 12 Oct 2019

TL;DR: This work proposes GraphDynS, a hardware/software co-design with decoupled datapath and data-aware dynamic scheduling that can elaborately schedule the program on-the-fly to maximize parallelism and extract data dependencies at runtime.

Proceedings Article•10.1145/3173162.3173180

Tigr: Transforming Irregular Graphs for GPU-Friendly Graph Processing

Amir Hossein Nodehi Sabet, +2 more

- 19 Mar 2018

TL;DR: Inspired by the question, Tigr is introduced -- a graph transformation framework that can effectively reduce the irregularity of real-world graphs with correctness guarantees for a wide range of graph analytics.

Journal Article•10.14778/3389133.3389137

Pangolin: an efficient and flexible graph mining system on CPU and GPU

Xuhao Chen, +3 more

- 01 Apr 2020

TL;DR: Pangolin is an efficient and flexible in-memory graph pattern mining (GPM) framework targeting shared-memory CPUs and GPUs that provides high-level abstractions for GPU processing.

References

•Proceedings Article

The PageRank Citation Ranking : Bringing Order to the Web

Lawrence Page, +3 more

- 11 Nov 1999

TL;DR: This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them, and shows how to efficiently compute PageRank for large numbers of pages.

16.4K

SNAP Datasets: Stanford Large Network Dataset Collection

Jure Leskovec, +1 more

- 01 Jun 2014

TL;DR: A collection of more than 50 large network datasets, ranging from tens of thousands of nodes and edges to tens of millions of nodes and edges, that includes social networks, web graphs, road networks, internet networks, citation networks, collaboration networks, and communication networks.

4.2K

•Journal Article•10.1145/1232722.1232727

The dynamics of viral marketing

Jure Leskovec, +2 more

- 01 May 2007

- ACM Transactions on The Web

TL;DR: While on average recommendations are not very effective at inducing purchases and do not spread very far, this work presents a model that successfully identifies communities, product, and pricing categories for which viral marketing seems to be very effective.

2.7K

Proceedings Article•10.1145/1150402.1150412

Group formation in large social networks: membership, growth, and evolution

Lars Backstrom, +3 more

- 20 Aug 2006

TL;DR: It is found that the propensity of individuals to join communities, and of communities to grow rapidly, depends in subtle ways on the underlying network structure, and decision-tree techniques are used to identify the most significant structural determinants of these properties.

2.1K

•Journal Article•10.1080/15427951.2009.10129177

Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters

Jure Leskovec, +3 more

- 01 Jan 2009

- Internet Mathematics

TL;DR: This paper employs approximation algorithms for the graph-partitioning problem to characterize as a function of size the statistical and structural properties of partitions of graphs that could plausibly be interpreted as communities, and defines the network community profile plot, which characterizes the "best" possible community—according to the conductance measure—over a wide range of size scales.

Related Papers (5)

Pregel: a system for large-scale graph processing

Ligra: a lightweight graph processing framework for shared memory

PowerGraph: distributed graph-parallel computation on natural graphs

X-Stream: edge-centric graph processing using streaming partitions

Distributed GraphLab: a framework for machine learning and data mining in the cloud