Compiling for the Impulse Memory Controller
Xianglong Huang, Zhenlin Wang, Kathryn S. McKinley
- 08 Sep 2001
- pp 141-150
TL;DR: Compiler cost models using dependence and locality analysis are presented that determine when to use Impulse to improve performance based on the reduction in misses, the additional cost for misses in Impulse, and the fixed cost for setting up a remapping.
Abstract: The Impulse memory controller provides an interface for remapping irregular or sparse memory accesses into dense accesses in the cache memory. This capability significantly increases processor cache and system bus utilization, and previous work shows performance improvements from a factor of 1.2 to 5 with current technology models for hand-coded kernels in a cycle-level simulator. Attaining widespread use of any specialized hardware feature requires automating its use in a compiler. In this paper, we present compiler cost models using dependence and locality analysis that determine when to use Impulse to improve performance based on the reduction in misses, the additional cost for misses in Impulse, and the fixed cost for setting up a remapping. We implement the cost models and generate the appropriate Impulse system calls in the Scale compiler framework. Our results demonstrate that our cost models correctly choose when and when not to use Impulse. We also combine and compare Impulse with our implementation of loop permutation for improving locality. If loop permutation can achieve the same dense access pattern as Impulse, we prefer it, since it has no overheads, but we show that the combination can yield better performance.
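The three quantities the abstract names (the reduction in misses, the extra cost of a miss in Impulse, and the fixed setup cost) suggest a simple break-even test. The sketch below is only an illustration of that trade-off, not the paper's actual model, which relies on dependence and locality analysis. The names `RemapCosts`, `estimate_misses`, and `should_use_impulse`, the cycle counts, and the no-reuse strided-sweep assumption are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RemapCosts:
    """Hypothetical cost parameters, all in processor cycles."""
    miss_penalty: float          # cost of an ordinary cache miss
    impulse_miss_penalty: float  # cost of a miss served through a remapped region
    setup_cost: float            # fixed cost of setting up one remapping


def estimate_misses(accesses: int, stride_elems: int, elems_per_line: int) -> int:
    """Estimate cache-line misses for one strided sweep with no reuse."""
    if stride_elems >= elems_per_line:
        # Every access touches a different cache line.
        return accesses
    # Integer ceiling of accesses * stride / elems_per_line.
    return -(-accesses * stride_elems // elems_per_line)


def should_use_impulse(accesses: int, stride_elems: int,
                       elems_per_line: int, costs: RemapCosts) -> bool:
    """Return True when the remapped (dense) sweep is estimated to be cheaper."""
    original = estimate_misses(accesses, stride_elems, elems_per_line) * costs.miss_penalty
    dense_misses = estimate_misses(accesses, 1, elems_per_line)
    remapped = dense_misses * costs.impulse_miss_penalty + costs.setup_cost
    return remapped < original


if __name__ == "__main__":
    costs = RemapCosts(miss_penalty=100, impulse_miss_penalty=150, setup_cost=5000)
    print(should_use_impulse(1000, 16, 8, costs))  # long sparse sweep
    print(should_use_impulse(20, 16, 8, costs))    # short sparse sweep
    print(should_use_impulse(1000, 1, 8, costs))   # already dense sweep
```

Under these assumed numbers, the long sparse sweep benefits from remapping, while the short sweep cannot amortize the setup cost and the already dense sweep gains nothing. That is the "when and when not to use Impulse" decision in miniature.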
Citations
Dynamic Memory Access Management for High-Performance DSP Applications Using High-Level Synthesis
TL;DR: This paper focuses on implementing memory interfacing modules that can be automatically generated by a high-level synthesis tool and that efficiently handle both predictable and random address patterns, reducing power consumption and latency.
30
Value-Profile Guided Stride Prefetching for Irregular Code
Youfeng Wu, Mauricio J. Serrano, Rakesh Krishnaiyer, Wei Li, Jesse Fang
- 08 Apr 2002
TL;DR: A novel compiler technique is presented to profile and prefetch for loads with near-constant strides; it captures not only the dominant stride values for each profiled load, but also the differences between the successive strides of the load.
A lifetime optimal algorithm for speculative PRE
Jingling Xue, Qiong Cai
TL;DR: A lifetime optimal algorithm, called MC-PRE, is presented for the first time that performs speculative PRE based on edge profiles and is capable of eliminating more partial redundancies than both LCM and CMP-PRE (especially in functions with complex control flow), and, in addition, MC-PRE inserts temporaries with shorter lifetimes than MC-PREcopt.
•Dissertation
Contribution à la prise en compte des contraintes des applications TDSI dans la synthèse de haut niveau
Bertrand Le Gal
- 01 Jan 2005
TL;DR: The concept of a behavioral-level virtual component, proposed by LESTER, allows great flexibility and a good match between algorithm and architecture.
8
Region Based Structure Layout Optimization by Selective Data Copying
Sandya Mannarswamy, Ramaswamy Govindarajan, Rishi Surendran
- 12 Sep 2009
TL;DR: The RBSL framework is described, implemented in the production compiler for C/C++ on HP-UX IA-64, and it is shown that, acting as a complement to the existing and mature WPSL transformation framework in the compiler, RBSL improves application performance in pointer-intensive SPEC benchmarks by 3% to 28% over WPSL.
6
References
A practical algorithm for exact array dependence analysis
TL;DR: A fundamental analysis step in an advanced optimizing compiler (as well as many other software tools) is data dependence analysis for arrays, which determines whether two references to an array can refer to the same element and under what conditions.
617
Improving data locality with loop transformations
TL;DR: This article presents compiler optimizations to improve data locality based on a simple yet accurate cost model and finds that performance improvements were difficult to achieve, although the optimizations improved several programs.
Impulse: building a smarter memory controller
John B. Carter, Wilson C. Hsieh, Leigh Stoller, M. Swanson, Lixin Zhang, Erik Brunvand, Al Davis, Chen-Chi Kuo, R. Kuramkote, Michael Parker, Lambert Schaelicke, Terry Tateyama
- 09 Jan 1999
TL;DR: The design of the Impulse architecture is described, and how an Impulse memory system can be used to improve the performance of memory-bound programs is shown, which improves performance for the NAS conjugate gradient benchmark by 67%.
Data transformations for eliminating conflict misses
Gabriel Rivera, Chau-Wen Tseng
- 01 May 1998
TL;DR: Experiments on a range of programs indicate PADLITE can eliminate conflicts for benchmarks, but PAD is more effective over a range of cache and problem sizes, with some SPEC95 programs improving up to 15%.
258
Cache miss equations: an analytical representation of cache misses
Soumyadip Ghosh, Margaret Martonosi, Sharad Malik
- 11 Jul 1997
TL;DR: In this article, the authors describe methods for generating and solving cache miss equations that give a detailed representation of the cache misses in loop-oriented scientific code, which can be used to guide code optimizations for improving cache performance.
Related Papers (5)
Wei Ding, Diana Guttman, Mahmut Kandemir
- 13 Dec 2014
Xavier Vera, Jingling Xue
- 02 Feb 2002
Andreas Abel, Jan Reineke
- 09 Apr 2013