Proceedings Article, DOI: 10.1109/HPCA.2016.7446058
A performance analysis framework for optimizing OpenCL applications on FPGAs
Zeke Wang,Bingsheng He,Wei Zhang,Shunning Jiang +3 more
- 12 Mar 2016
- pp 114-125
105
TL;DR: This paper presents an FPGA-based performance analysis framework that sheds light on performance bottlenecks and thus guides code tuning for OpenCL applications on FPGAs, and demonstrates that its analytical performance model can accurately predict the performance of OpenCL programs with different optimization combinations on FPGAs.
Abstract: Recently, FPGA vendors such as Altera and Xilinx have released OpenCL SDKs for programming FPGAs. However, the architecture of an FPGA is significantly different from that of a CPU or GPU, for which OpenCL was originally designed. Tuning OpenCL code for good performance on FPGAs is still an open problem, since the existing OpenCL tools and models designed for CPUs/GPUs are not directly applicable to FPGAs. In this paper, we present an FPGA-based performance analysis framework that can shed light on the performance bottleneck and thus guide code tuning for OpenCL applications on FPGAs. In particular, we leverage static and dynamic analysis to develop an analytical performance model that captures the key architectural features of FPGA abstractions under OpenCL. We then provide four programmer-interpretable metrics that quantify the performance potential of an OpenCL program under a given combination of optimizations, to inform the next optimization step. We evaluate our framework with a number of use cases, and demonstrate that 1) our analytical performance model can accurately predict the performance of OpenCL programs with different optimization combinations on FPGAs, and 2) our tool can be used to effectively guide code tuning toward alleviating the performance bottleneck.
Citations
A Survey of Coarse-Grained Reconfigurable Architecture and Design: Taxonomy, Challenges, and Applications
TL;DR: The architecture and design of CGRAs are reviewed thoroughly, a novel multidimensional taxonomy is proposed, and major challenges and the corresponding state-of-the-art techniques are surveyed and analyzed.
200
Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs
Yuan Zhou,Udit Gupta,Steve Dai,Ritchie Zhao,Nitish Srivastava,Hanchen Jin,Joseph Featherston,Yi-Hsiang Lai,Gai Liu,Gustavo Angarita Velasquez,Wenping Wang,Zhiru Zhang +11 more
- 15 Feb 2018
TL;DR: Rosetta is a realistic benchmark suite for software programmable FPGAs that can be useful for the HLS research community, but can also serve as a set of design tutorials for non-expert HLS users.
137
Shuhai: Benchmarking High Bandwidth Memory on FPGAs
Zeke Wang,Hongjing Huang,Jie Zhang,Gustavo Alonso +3 more
- 03 May 2020
TL;DR: This paper benchmarks HBM on a state-of-the-art FPGA, i.e., a Xilinx Alveo U280 featuring a two-stack HBM subsystem, observes that HBM is able to provide up to 425 GB/s of memory bandwidth, and demonstrates the importance of unveiling the performance characteristics of HBM so as to select the best approach.
92
ThunderGP: HLS-based Graph Processing Framework on FPGAs
Xinyu Chen,Hongshi Tan,Yao Chen,Bingsheng He,Weng-Fai Wong,Deming Chen +5 more
- 17 Feb 2021
TL;DR: In this article, the authors propose ThunderGP, an open-source HLS-based graph processing framework on FPGAs, with which developers could enjoy the performance of FPGA-accelerated graph processing by writing only a few high-level functions with no knowledge of the hardware.
90
Design Space exploration of FPGA-based accelerators with multi-level parallelism
Guanwen Zhong,Alok Prakash,Siqi Wang,Yun Liang,Tulika Mitra,Smail Niar +5 more
- 27 Mar 2017
TL;DR: MPSeeker is a rapid estimation framework that evaluates performance/area metrics of various accelerator options for an application at an early design phase. It can rapidly (in minutes) explore the complex design space and accurately estimate the performance/area of various design points to identify the near-optimal combination of parallelism options.
67
References
LLVM: a compilation framework for lifelong program analysis & transformation
Chris Lattner,Vikram Adve +1 more
- 20 Mar 2004
TL;DR: The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.
Rodinia: A benchmark suite for heterogeneous computing
Shuai Che,Michael Boyer,Jiayuan Meng,David Tarjan,Jeremy W. Sheaffer,Sang-Ha Lee,Kevin Skadron +6 more
- 04 Oct 2009
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Analyzing CUDA workloads using a detailed GPU simulator
Ali Bakhoda,George L. Yuan,Wilson W. L. Fung,Henry Wong,Tor M. Aamodt +4 more
- 26 Apr 2009
TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
A reconfigurable fabric for accelerating large-scale datacenter services
Andrew Putnam,Adrian M. Caulfield,Eric S. Chung,Derek Chiou,Kypros Constantinides,John Demme,Hadi Esmaeilzadeh,Jeremy Fowers,Gopi Prashanth Gopal,Jan Gray,Michael Haselman,Scott Hauck,Stephen F. Heil,Amir Hormati,Joo-Young Kim,Sitaram Lanka,James R. Larus,Eric C. Peterson,Simon Pope,Aaron L. Smith,Jason Thong,Phillip Yi Xiao,Doug Burger +22 more
TL;DR: The authors deployed the reconfigurable fabric in a bed of 1,632 servers and FPGAs in a production datacenter and successfully used it to accelerate the ranking portion of the Bing Web search engine by nearly a factor of two.
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Shane Ryoo,Christopher I. Rodrigues,Sara S. Baghsorkhi,Sam S. Stone,David B. Kirk,Wen-mei W. Hwu +5 more
- 20 Feb 2008
TL;DR: This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, and achieves increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations.