TL;DR: This paper presents an architecture for simultaneous multithreading that minimizes the architectural impact on the conventional superscalar design, has minimal performance impact on a single thread executing alone, and achieves significant throughput gains when running multiple threads.
Abstract: Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor.
TL;DR: The Markov prefetcher acts as an interface between the on-chip and off-chip cache, and can be added to existing computer designs and reduces the overall execution stalls due to instruction and data memory operations by an average of 54% for various commercial benchmarks while only using two thirds the memory of a demand-fetch cache organization.
Abstract: Prefetching is one approach to reducing the latency of memory operations in modern computer systems. In this paper, we describe the Markov prefetcher. This prefetcher acts as an interface between the on-chip and off-chip cache, and can be added to existing computer designs. The Markov prefetcher is distinguished by prefetching multiple reference predictions from the memory subsystem, and then prioritizing the delivery of those references to the processor.This design results in a prefetching system that provides good coverage, is accurate and produces timely results that can be effectively used by the processor. In our cycle-level simulations, the Markov Prefetcher reduces the overall execution stalls due to instruction and data memory operations by an average of 54% for various commercial benchmarks while only using two thirds the memory of a demand-fetch cache organization.
TL;DR: This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor without requiring an unreasonably large instruction window.
Abstract: Today's high performance processors tolerate long latency operations by means of out-of-order execution. However, as latencies increase, the size of the instruction window must increase even faster if we are to continue to tolerate these latencies. We have already reached the point where the size of an instruction window that can handle these latencies is prohibitively large in terms of both design complexity and power consumption. And, the problem is getting worse. This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor without requiring an unreasonably large instruction window. Runahead execution unblocks the instruction window blocked by long latency operations allowing the processor to execute far ahead in the program path. This results in data being prefetched into caches long before it is needed. On a machine model based on the Intel/spl reg/ Pentium/spl reg/ processor, having a 128-entry instruction window, adding runahead execution improves the IPC (instructions per cycle) by 22% across a wide range of memory intensive applications. Also, for the same machine model, runahead execution combined with a 128-entry window performs within 1% of a machine with no runahead execution and a 384-entry instruction window.
TL;DR: In this article, a new hardware prefetching scheme based on the prediction of the execution of the instruction stream and associated operand references is proposed. But this scheme requires the use of a reference prediction table and its associated logic.
Abstract: Conventional cache prefetching approaches can be either hardware-based, generally by using a one-blockIookahead technique, or compiler-directed, with insertions of non-blocking prefetch instructions. We introduce a new hardware scheme based on the prediction of the execution of the instruction stream and associated operand references. It consists of a reference prediction table and a look-ahead program counter and its associated logic. With this scheme, data with regular access patterns is preloaded, independently of the stride size, and preloading of data with irregular access patterns is prevented. We evaluate our design through trace driven simulation by comparing it with a pure data cache approach under three different memory access models. Our experiments show that this scheme is very effective for reducing the data access penalty for scientific programs and that is has moderate success for other applications.
TL;DR: This work proposes creating a shorter but otherwise equivalent version of the original program by removing ineffectual computation and computation related to highly-predictable control flow by running concurrently with the full program on a chip multiprocessor or simultaneous multithreaded processor.
Abstract: Processors execute the full dynamic instruction stream to arrive at the final output of a program, yet there exist shorter instruction streams that produce the same overall effect. We propose creating a shorter but otherwise equivalent version of the original program by removing ineffectual computation and computation related to highly-predictable control flow. The shortened program is run concurrently with the full program on a chip multiprocessor simultaneous multithreaded processor, with two key advantages:1) Improved single-program performance. The shorter program speculatively runs ahead of the full program and supplies the full program with control and data flow outcomes. The full program executes efficiently due to the communicated outcomes, at the same time validating the speculative, shorter program. The two programs combined run faster than the original program alone. Detailed simulations of an example implementation show an average improvement of 7% for the SPEC95 integer benchmarks.2) Fault tolerance. The shorter program is a subset of the full program and this partial-redundancy is transparently leveraged for detecting and recovering from transient hardware faults.