Zero-cycle loads: microarchitecture support for reducing load latency
Todd Austin, Gurindar S. Sohi
- 01 Dec 1995
- pp 82-92
TL;DR: A hardware mechanism is presented that combines instruction predecode, base register caching, and fast address calculation to complete load instructions up to two cycles earlier than traditional pipeline designs, yielding zero-cycle loads on pipelines with one-cycle data cache access.
Abstract: Untolerated load instruction latencies often have a significant impact on overall program performance. As one means of mitigating this effect we present an aggressive hardware-based mechanism that provides effective support for reducing the latency of load instructions. Through the judicious use of instruction predecode, base register caching, and fast address calculation, it becomes possible to complete load instructions up to two cycles earlier than traditional pipeline designs. For a pipeline with one cycle data cache access, this results in what we term a zero-cycle load. A zero-cycle load produces a result prior to reaching the execute stage of the pipeline, allowing subsequent dependent instructions to issue unfettered by load dependencies. Programs executing on processors with support for zero-cycle loads experience significantly fewer pipeline stalls due to load instructions and increased overall performance. We present two pipeline designs supporting zero-cycle loads: one for pipelines with a single stage of instruction decode, and another for pipelines with multiple decode stages. We evaluate these designs in a number of contexts: with and without software support, in-order vs. out-of-order issue, and on architectures with many and few registers. We find that our approach is quite effective at reducing the impact of load latency, even more so on architectures with in-order issue and few registers.
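The latency arithmetic in the abstract can be illustrated with a minimal sketch. It assumes a baseline pipeline in which a load spends one cycle on address calculation followed by the data cache access; that one-cycle address calculation, the function name, and the parameter names are illustrative assumptions and are not taken from the paper.

```python
def effective_load_latency(addr_calc_cycles=1, cache_access_cycles=1, cycles_saved=0):
    """Return the load latency, in cycles, seen by a dependent instruction.

    Assumption (not from the paper): the baseline latency is the sum of the
    address-calculation and cache-access cycles. `cycles_saved` models how
    much earlier the load completes; the abstract states up to two cycles.
    """
    baseline = addr_calc_cycles + cache_access_cycles
    # Latency cannot drop below zero, however early the load completes.
    return max(baseline - cycles_saved, 0)


if __name__ == "__main__":
    for saved in (0, 1, 2):
        latency = effective_load_latency(cycles_saved=saved)
        print(f"cycles saved = {saved}: load latency = {latency}")
```

Under these assumptions, saving both cycles on a pipeline with one-cycle data cache access brings the latency to zero, which matches the abstract's definition of a zero-cycle load.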
Citations
DAISY: dynamic compilation for 100% architectural compatibility
Kemal Ebcioglu, Erik R. Altman
- 01 May 1997
TL;DR: The architectural requirements for such a VLIW, to deal with issues including self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O are discussed.
Reducing set-associative cache energy via way-prediction and selective direct-mapping
Michael D. Powell, Amit Agarwal, T. N. Vijaykumar, Babak Falsafi, Kaushik Roy
- 01 Dec 2001
TL;DR: Two previously-proposed techniques, way-prediction and selective direct-mapping, are applied to reducing L1 cache dynamic energy while maintaining high performance, and caches achieve the energy-delay of sequential access while maintaining the performance of parallel access.
The performance potential of data dependence speculation and collapsing
Yiannakis Sazeides, Stamatis Vassiliadis, James E. Smith
- 02 Dec 1996
TL;DR: In this article, two hardware methods for remedying the effects of true data dependences are studied, dependence speculation and dependence collapsing, which are used to eliminate address generation-load dependences.
135
Improving the accuracy and performance of memory communication through renaming
Gary Tyson, Todd Austin
- 01 Dec 1997
TL;DR: This work extends previous studies of data value and dependence speculation by introducing a novel modification of the processor pipeline called memory renaming, which allows the processor to speculatively fetch values when the producer of the data can be reliably determined without the need for an effective address.
133
Predictive techniques for aggressive load speculation
Glenn Reinman, Brad Calder
- 01 Nov 1998
TL;DR: This paper examines in detail the interaction and performance tradeoffs of four load speculation techniques in the presence of two mis-speculation recovery architectures, reexecution and squash, and creates a load speculation chooser that provides a performance improvement over using any one technique in isolation.
References
•Book
Compilers: Principles, Techniques, and Tools
Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman
- 01 Jan 1986
TL;DR: This book discusses the design of a Code Generator, the role of the Lexical Analyzer, and other topics related to code generation and optimization.
9.7K
An effective on-chip preloading scheme to reduce data access penalty
Jean-Loup Baer, Tien-Fu Chen
- 01 Aug 1991
TL;DR: In this article, a new hardware prefetching scheme based on the prediction of the execution of the instruction stream and associated operand references is proposed. But this scheme requires the use of a reference prediction table and its associated logic.
499
•Book
Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers
Gurindar S. Sohi
- 01 Mar 1995
TL;DR: Simulation studies show that by resolving dependencies the proposed mechanism is able to obtain a significant speedup over a simple instruction issue mechanism as well as implement precise interrupts.
242
A load-instruction unit for pipelined processors
TL;DR: It is shown, using trace-driven simulations, that the proposed mechanism, when incorporated in a design, may contribute to a significant increase in processor performance.
144
Register allocation for free: The C machine stack cache
David R. Ditzel, H. R. McLellan
- 01 Mar 1982
TL;DR: A piece of the C Machine architecture is described which effectively eliminates the register allocation problem, and improves procedure calling by drastically reducing storage references required by traditional register saving.
142