A performance analysis framework for optimizing OpenCL applications on FPGAs

doi:10.1109/HPCA.2016.7446058

Proceedings Article•10.1109/HPCA.2016.7446058

Zeke Wang, +3 more

- 12 Mar 2016

- pp 114-125

105

TL;DR: This paper presents an FPGA-based performance analysis framework that sheds light on performance bottlenecks and thus guides code tuning for OpenCL applications on FPGAs, and demonstrates that its analytical performance model can accurately predict the performance of OpenCL programs with different optimization combinations on FPGAs.

Citations

•Journal Article•10.1145/3357375

A Survey of Coarse-Grained Reconfigurable Architecture and Design: Taxonomy, Challenges, and Applications

Leibo Liu, +7 more

- 16 Oct 2019

- ACM Computing Surveys

TL;DR: The architecture and design of CGRAs are reviewed thoroughly, a novel multidimensional taxonomy is proposed, and major challenges and the corresponding state-of-the-art techniques are surveyed and analyzed.

200

Proceedings Article•10.1145/3174243.3174255

Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs

Yuan Zhou, +11 more

- 15 Feb 2018

TL;DR: Rosetta is a realistic benchmark suite for software programmable FPGAs that can be useful for the HLS research community, but can also serve as a set of design tutorials for non-expert HLS users.

137

Proceedings Article•10.1109/FCCM48280.2020.00024

Shuhai: Benchmarking High Bandwidth Memory on FPGAs

Zeke Wang, +3 more

- 03 May 2020

TL;DR: This paper benchmarks HBM on a state-of-the-art FPGA, i.e., a Xilinx Alveo U280 featuring a two-stack HBM subsystem, observes that HBM can provide up to 425 GB/s of memory bandwidth, and demonstrates the importance of unveiling the performance characteristics of HBM so as to select the best approach.

92

•Proceedings Article•10.1145/3431920.3439290

ThunderGP: HLS-based Graph Processing Framework on FPGAs

Xinyu Chen, +5 more

- 17 Feb 2021

TL;DR: In this article, the authors propose ThunderGP, an open-source HLS-based graph processing framework on FPGAs, with which developers could enjoy the performance of FPGA-accelerated graph processing by writing only a few high-level functions with no knowledge of the hardware.

90

Proceedings Article•10.23919/DATE.2017.7927161

Design Space exploration of FPGA-based accelerators with multi-level parallelism

Guanwen Zhong, +5 more

- 27 Mar 2017

TL;DR: MPSeeker is a rapid estimation framework that evaluates performance/area metrics of various accelerator options for an application at an early design phase; it can rapidly (in minutes) explore the complex design space and accurately estimate the performance/area of various design points to identify the near-optimal combination of parallelism options.

67

References

•Proceedings Article•10.5555/977395.977673

LLVM: a compilation framework for lifelong program analysis & transformation

Chris Lattner, +1 more

- 20 Mar 2004

TL;DR: The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.

5.4K

•Proceedings Article•10.1109/IISWC.2009.5306797

Rodinia: A benchmark suite for heterogeneous computing

Shuai Che, +6 more

- 04 Oct 2009

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.

3.2K

•Proceedings Article•10.1109/ISPASS.2009.4919648

Analyzing CUDA workloads using a detailed GPU simulator

Ali Bakhoda, +4 more

- 26 Apr 2009

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.

1.8K

•Journal Article•10.1145/2996868

A reconfigurable fabric for accelerating large-scale datacenter services

Andrew Putnam, +22 more

- 28 Oct 2016

- Communications of The ACM

TL;DR: The authors deployed the reconfigurable fabric in a bed of 1,632 servers and FPGAs in a production datacenter and successfully used it to accelerate the ranking portion of the Bing Web search engine by nearly a factor of two.

1K

Proceedings Article•10.1145/1345206.1345220

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Shane Ryoo, +5 more

- 20 Feb 2008

TL;DR: This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, and achieves increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations.

1K