Proceedings Article, DOI: 10.23919/DATE.2017.7927202
A static-placement, dynamic-issue framework for CGRA loop accelerator
Zhongyuan Zhao, Weiguang Sheng, Weifeng He, Zhigang Mao, Zhaoshi Li
- 27 Mar 2017
- pp 1348-1353
7
TL;DR: A static-placement, dynamic-issue (SPDI) framework for the coarse-grained reconfigurable architecture (CGRA), comprising a compiler that statically places the operations and a hardware design, the SPDI CGRA, that automatically schedules them.
Abstract: This paper presents a static-placement, dynamic-issue (SPDI) framework for the coarse-grained reconfigurable architecture (CGRA) to tackle the inefficiencies of the static-issue, static-placement (SISP) CGRA. The framework comprises a compiler that statically places the operations and a hardware design, the SPDI CGRA, that automatically schedules them. This paper focuses on the SPDI CGRA. The new hardware model adds a token buffer, which automatically schedules the operations inside each processing element (PE), along with a router network that transforms and controls data flow among the PE array. With this design the hardware shares responsibility with the compiler, and the two cooperate on the issuing, placement and routing problems. Our evaluation shows that the framework can reach on average 1.28, 1.30 and 1.33 higher than three state-of-the-art SISP CGRAs using the REGIMap, RS compile flow and EPIMap approaches, respectively. The area overhead is nearly 0.93% per token buffer entry for each PE relative to the SISP CGRA.
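As a rough illustration of the dynamic-issue idea in the abstract, the Python sketch below models a single PE whose operations are fixed in advance (static placement) while the firing order depends on which operand tokens arrive first (dynamic issue). The class name, method names, buffer organization and firing rule are all assumptions made for illustration; the abstract does not describe the paper's actual token buffer design.

```python
from collections import defaultdict


class ProcessingElement:
    """Toy PE (hypothetical): ops are placed statically, issued dynamically."""

    def __init__(self, ops, buffer_entries=4):
        # ops: {op_name: (function, operand_count)}, fixed ahead of time,
        # standing in for the compiler's static placement.
        self.ops = ops
        self.buffer_entries = buffer_entries  # assumed buffer capacity
        self.token_buffer = defaultdict(dict)  # op_name -> {slot: value}

    def receive(self, op_name, slot, value):
        """Store an arriving operand token; return False if the buffer is full."""
        is_new = op_name not in self.token_buffer
        if is_new and len(self.token_buffer) >= self.buffer_entries:
            return False
        self.token_buffer[op_name][slot] = value
        return True

    def issue(self):
        """Fire one operation whose operands have all arrived, if any."""
        for op_name, operands in self.token_buffer.items():
            func, arity = self.ops[op_name]
            if len(operands) == arity:
                del self.token_buffer[op_name]  # free the buffer entry
                return op_name, func(*(operands[i] for i in range(arity)))
        return None  # nothing is ready this cycle


pe = ProcessingElement({
    "add": (lambda a, b: a + b, 2),
    "mul": (lambda a, b: a * b, 2),
})
pe.receive("add", 0, 3)
pe.receive("mul", 0, 4)
pe.receive("mul", 1, 5)   # mul's operands are complete before add's
first = pe.issue()        # mul fires first, although add was seen first
pe.receive("add", 1, 7)
second = pe.issue()
```

In this toy model the order of execution is not encoded in the placement; it emerges from token arrival, which is the general property the abstract attributes to the hardware scheduler.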
Citations
A Black-Box Fork-Join Latency Prediction Model for Data-Intensive Applications
TL;DR: A black-box Fork-Join model is proposed that covers a wide range of Fork-Join structures for the prediction of tail and mean latency, called ForkTail and ForkMean, respectively, and can be used as a powerful tool to aid the design of tail-and-mean-latency guaranteed job scheduling and resource provisioning, especially at high load, for datacenter applications.
25
C-MAP: Improving the Effectiveness of Mapping Method for CGRA by Reducing NoC Congestion
An Shuqian, Mingzhe Zhang, Xiaochun Ye, Da Wang, Hao Zhang, Dongrui Fan, Zhimin Tang
- 01 Aug 2019
TL;DR: C-MAP improves the effectiveness of CGRA mapping from the perspective of reducing network congestion and enhancing the continuity of the data flow, and analyzes the impact of several key considerations in CGRA instruction mapping, such as NoC workload reduction and workload balance.
4
The Principle and Progress of Dynamically Reconfigurable Computing Technologies
Shaojun Wei, Yan'an Lu
TL;DR: This paper summarizes the latest progress in key technologies of dynamically reconfigurable computing and provides an introduction to the application achievements.
3
A Multi-Domain Architectural Efficiency Metric
Sumeet Singh Nagi, Dejan Markovic
- 01 Oct 2021
TL;DR: In this article, an architectural efficiency metric that quantifies the number of instructions or the size of reconfiguration bits required to perform a computation over a range of program sizes in the architecture is proposed.
DFGAS: Exploring the Balance of HW-SW Scheduling through the DFG-Aware Scheme
Tianyu Liu, Zhihua Fan, Li Wenming, Zhen Wang, Yuhang Qiu, Shengzhong Tang, Haibin Wu, Yanhuan Liu, Xiaochun Ye, Dongrui Fan
Abstract: Coarse-Grained Reconfigurable Architectures (CGRAs) have been regarded as a promising spatial computing fabric for the ever-evolving algorithms in multiple domains. However, pure software scheduling cannot compensate for the over-serialization and load imbalance of pure static CGRA designs. To address the issues caused by limited hardware flexibility, an in-depth study of the balance between software and hardware scheduling in CGRA design is needed to achieve more precise and adaptive scheduling of dataflow. In this paper, we propose DFGAS (DFG-Aware Scheduling), a dataflow-driven CGRA that provides a comprehensive scheduling approach encompassing software prediction, runtime adaptive execution, and post-execution refinement. Prior to execution, the TimeStamp prediction algorithm, coupled with the inherent dataflow execution model, enables coarse-grained (block-level) prediction for prioritized transfer and computation on the NoC and PEs. During execution, key dataflow graph (DFG) blocks and edges are accelerated by a dynamic and adaptive dataflow mechanism, which leverages hardware-software co-design to obtain a holistic view of the entire DFG and continuously and adaptively optimizes the scheduling process. Furthermore, a complete workflow is implemented that supports refinements to the software DFG mapping results. DFGAS represents a CGRA scheduling scheme worth exploring, achieving a hardware-software co-design that balances energy efficiency and flexibility. Experiments show that DFGAS achieves a 1.35× energy efficiency improvement over a dataflow-driven CGRA and a 1.9× energy efficiency improvement over a state-of-the-art pure static CGRA.
References
LLVM: a compilation framework for lifelong program analysis & transformation
Chris Lattner, Vikram Adve
- 20 Mar 2004
TL;DR: The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.
MiBench: A free, commercially representative embedded benchmark suite
Matthew R. Guthaus, Jeff Ringenberg, Daniel J. Ernst, Todd Austin, Trevor Mudge, Richard B. Brown
- 02 Dec 2001
TL;DR: A new version of SimpleScalar that has been adapted to the ARM instruction set is used to characterize the performance of the benchmarks using configurations similar to current and next generation embedded processors.
3.7K
The Landscape of Parallel Computing Research: A View from Berkeley
Krste Asanovic, Ras Bodik, Bryan Catanzaro, Joseph Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Plishker, John Shalf, Samuel Williams, Katherine Yelick
- 18 Dec 2006
TL;DR: The parallel landscape is framed with seven questions, and the following are recommended to explore the design space rapidly: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture
Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Doug Burger, Stephen W. Keckler, Charles R. Moore
- 01 May 2003
TL;DR: Results show that high performance can be obtained in each of the three modes (ILP, TLP, and DLP), demonstrating the viability of the polymorphous coarse-grained approach for future microprocessors.
Exploiting ILP, TLP, and DLP with the polymorphous trips architecture
Karthikeyan Sankaralingam, Ramadass Nagarajan, Haiming Liu, Changkyu Kim, Jaehyuk Huh, Doug Burger, Stephen W. Keckler, Charles R. Moore
TL;DR: The Tera-op reliable intelligently adaptive processing system (TRIPS) architecture seeks to deliver system-level configurability to applications and runtime systems by employing the concept of polymorphism.
318