Zero-cycle loads: microarchitecture support for reducing load latency

doi:10.5555/225160.225176

Open Access • Proceedings Article • 10.5555/225160.225176

Todd Austin, +1 more

- 01 Dec 1995

- pp 82-92

121 citations

TL;DR: An aggressive hardware-based mechanism is presented that combines instruction predecode, base register caching, and fast address calculation to complete load instructions up to two cycles earlier than traditional pipeline designs, which yields a zero-cycle load on pipelines with one-cycle data cache access.

Abstract: Untolerated load instruction latencies often have a significant impact on overall program performance. As one means of mitigating this effect we present an aggressive hardware-based mechanism that provides effective support for reducing the latency of load instructions. Through the judicious use of instruction predecode, base register caching, and fast address calculation, it becomes possible to complete load instructions up to two cycles earlier than traditional pipeline designs. For a pipeline with one cycle data cache access, this results in what we term a zero-cycle load. A zero-cycle load produces a result prior to reaching the execute stage of the pipeline, allowing subsequent dependent instructions to issue unfettered by load dependencies. Programs executing on processors with support for zero-cycle loads experience significantly fewer pipeline stalls due to load instructions and increased overall performance. We present two pipeline designs supporting zero-cycle loads: one for pipelines with a single stage of instruction decode, and another for pipelines with multiple decode stages. We evaluate these designs in a number of contexts: with and without software support, in-order vs. out-of-order issue, and on architectures with many and few registers. We find that our approach is quite effective at reducing the impact of load latency, even more so on architectures with in-order issue and few registers.
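The latency arithmetic in the abstract can be illustrated with a minimal sketch. It assumes a conventional pipeline in which a load spends one cycle on address calculation in the execute stage before accessing the data cache; the function name `load_latency` and its parameters are hypothetical and do not come from the paper.

```python
def load_latency(cache_access_cycles: int,
                 cycles_saved: int = 0,
                 addr_calc_cycles: int = 1) -> int:
    """Cycles a dependent instruction waits for a load result.

    Assumption (not from the paper): the baseline latency is one cycle of
    address calculation plus the data cache access time.
    """
    baseline = addr_calc_cycles + cache_access_cycles
    # Latency cannot go below zero: the result is then ready before
    # the load reaches the execute stage.
    return max(0, baseline - cycles_saved)


# Traditional pipeline with a one-cycle data cache: two-cycle load.
traditional = load_latency(cache_access_cycles=1)

# Completing the load two cycles earlier gives a zero-cycle load.
zero_cycle = load_latency(cache_access_cycles=1, cycles_saved=2)

print(traditional, zero_cycle)  # prints "2 0"
```

Under the same assumption, a two-cycle data cache would still leave one cycle of load latency after the two-cycle saving, which is why the abstract ties the zero-cycle result to one-cycle cache access.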

Citations

Proceedings Article • 10.1145/264107.264126

DAISY: dynamic compilation for 100% architectural compatibility

Kemal Ebcioglu, +1 more

- 01 May 1997

TL;DR: The architectural requirements for a VLIW that serves as the target of dynamic compilation are discussed, including how to deal with self-modifying code, precise exceptions, and aggressive reordering of memory references in the presence of strong MP consistency and memory mapped I/O.

426 citations

Proceedings Article • 10.5555/563998.564007

Reducing set-associative cache energy via way-prediction and selective direct-mapping

Michael D. Powell, +4 more

- 01 Dec 2001

TL;DR: Two previously-proposed techniques, way-prediction and selective direct-mapping, are applied to reducing L1 cache dynamic energy while maintaining high performance, and caches achieve the energy-delay of sequential access while maintaining the performance of parallel access.

321 citations

Proceedings Article • 10.5555/243846.243890

The performance potential of data dependence speculation and collapsing

Yiannakis Sazeides, +2 more

- 02 Dec 1996

TL;DR: Two hardware methods for remedying the effects of true data dependences are studied: dependence speculation and dependence collapsing, which are used to eliminate address generation-load dependences.

135 citations

Proceedings Article • 10.5555/266800.266821

Improving the accuracy and performance of memory communication through renaming

Gary Tyson, +1 more

- 01 Dec 1997

TL;DR: This work extends previous studies of data value and dependence speculation by introducing a novel modification of the processor pipeline called memory renaming, which allows the processor to speculatively fetch values when the producer of the data can be reliably determined without the need for an effective address.

133 citations

Proceedings Article • 10.5555/290940.290969

Predictive techniques for aggressive load speculation

Glenn Reinman, +1 more

- 01 Nov 1998

TL;DR: This paper examines in detail the interaction and performance tradeoffs of four load speculation techniques in the presence of two misspeculation recovery architectures, reexecution and squash, and creates a load speculation chooser that provides a performance improvement over using any one technique in isolation.

99 citations

References

Book

Compilers: Principles, Techniques, and Tools

Alfred V. Aho, +2 more

- 01 Jan 1986

TL;DR: This book discusses the design of a Code Generator, the role of the Lexical Analyzer, and other topics related to code generation and optimization.

9.7K citations

Proceedings Article • 10.1145/125826.125932

An effective on-chip preloading scheme to reduce data access penalty

Jean-Loup Baer, +1 more

- 01 Aug 1991

TL;DR: A new hardware prefetching scheme based on predicting the execution of the instruction stream and its associated operand references is proposed; the scheme relies on a reference prediction table and its associated logic.

499 citations

Book

Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers

Gurindar S. Sohi

- 01 Mar 1995

TL;DR: Simulation studies show that by resolving dependencies the proposed mechanism is able to obtain a significant speedup over a simple instruction issue mechanism as well as implement precise interrupts.

242 citations

Journal Article • 10.1147/RD.374.0547

A load-instruction unit for pipelined processors

Richard J. Eickemeyer, +1 more

- 01 Jul 1993

- IBM Journal of Research and Development

TL;DR: It is shown, using trace-driven simulations, that the proposed mechanism, when incorporated in a design, may contribute to a significant increase in processor performance.

144 citations

Proceedings Article • 10.1145/800050.801825

Register allocation for free: The C machine stack cache

David R. Ditzel, +1 more

- 01 Mar 1982

TL;DR: A piece of the C Machine architecture is described which effectively eliminates the register allocation problem, and improves procedure calling by drastically reducing storage references required by traditional register saving.

142 citations

Related Papers (5)

A load-instruction unit for pipelined processors

Dynamic speculation and synchronization of data dependences

Streamlining data cache access with fast address calculation

The performance potential of data dependence speculation and collapsing

Effective hardware-based data prefetching for high-performance processors