TL;DR: The concept of the sphere of replication is introduced, which abstract both the physical redundancy of a lockstepped system and the logical redundancy of an SRT processor, and two mechanisms-slack fetch and branch outcome queue-are proposed and evaluated that enhance the performance of anSRT processor by allowing one thread to prefetch cache misses and branch results for the other thread.
Abstract: Smaller feature sizes, reduced voltage levels, higher transistor counts, and reduced noise margins make future generations of microprocessors increasingly prone to transient hardware faults. Most commercial fault-tolerant computers use fully replicated hardware components to detect microprocessor faults. The components are lockstepped (cycle-by-cycle synchronized) to ensure that, in each cycle, they perform the same operation on the same inputs, producing the same outputs in the absence of faults. Unfortunately, for a given hardware budget, full replication reduces performance by statically partitioning resources among redundant operations.We demonstrate that a Simultaneous and Redundantly Threaded (SRT) processor—derived from a Simultaneous Multithreaded (SMT) processor—provides transient fault coverage with significantly higher performance. An SRT processor provides transient fault coverage by running identical copies of the same program simultaneously as independent threads. An SRT processor provides higher performance because it dynamically schedules its hardware resources among the redundant copies. However, dynamic scheduling makes it difficult to implement lockstepping, because corresponding instructions from redundant threads may not execute in the same cycle or in the same order.This paper makes four contributions to the design of SRT processors. First, we introduce the concept of the sphere of replication, which abstracts both the physical redundancy of a lockstepped system and the logical redundancy of an SRT processor. This framework aids in identifying the scope of fault coverage and the input and output values requiring special handling. Second, we identify two viable spheres of replication in an SRT processor, and show that one of them provides fault detection while checking only committed stores and uncached loads. Third, we identify the need for consistent replication of load values, and propose and evaluate two new mechanisms for satisfying this requirement. Finally, we propose and evaluate two mechanisms—slack fetch and branch outcome queue—that enhance the performance of an SRT processor by allowing one thread to prefetch cache misses and branch results for the other thread. Our results with 11 SPEC95 benchmarks show that an SRT processor can outperform an equivalently sized, on-chip, hardware-replicated solution by 16% on average, with a maximum benefit of up to 29%.
TL;DR: In this article, a new hardware prefetching scheme based on the prediction of the execution of the instruction stream and associated operand references is proposed. But this scheme requires the use of a reference prediction table and its associated logic.
Abstract: Conventional cache prefetching approaches can be either hardware-based, generally by using a one-blockIookahead technique, or compiler-directed, with insertions of non-blocking prefetch instructions. We introduce a new hardware scheme based on the prediction of the execution of the instruction stream and associated operand references. It consists of a reference prediction table and a look-ahead program counter and its associated logic. With this scheme, data with regular access patterns is preloaded, independently of the stride size, and preloading of data with irregular access patterns is prevented. We evaluate our design through trace driven simulation by comparing it with a pure data cache approach under three different memory access models. Our experiments show that this scheme is very effective for reducing the data access penalty for scientific programs and that is has moderate success for other applications.
TL;DR: The Dead-Block Predictors (DBPs) are proposed, trace-based predictors that accurately identify “when” an Ll data cache block becomes evictable or “dead”, and a DBCP enables effective data prefetching in a wide spectrum of pointer-intensive, integer, and floating-point applications.
Abstract: Effective data prefetching requires accurate mechanisms to predict both “which” cache blocks to prefetch and “when” to prefetch them. This paper proposes the Dead-Block Predictors (DBPs), trace-based predictors that accurately identify “when” an Ll data cache block becomes evictable or “dead”. Predicting a dead block significantly enhances prefetching lookahead and opportunity, and enables placing data directly into Ll, obviating the need for auxiliary prefetch buffers. This paper also proposes Dead-Block Correlating Prefetchers (DBCPs), that use address correlation to predict “which” subsequent block to prefetch when a block becomes evictable. A DBCP enables effective data prefetching in a wide spectrum of pointer-intensive, integer, and floating-point applications.We use cycle-accurate simulation of an out-of-order superscalar processor and memory-intensive benchmarks to show that: (1) dead-block prediction enhances prefetching lookahead at least by an order of magnitude as compared to previous techniques, (2) a DBP can predict dead blocks on average with a coverage of 90% only mispredicting 4% of the time, (3) a DBCP offers an address prediction coverage of 86% only mispredicting 3% of the time, and (4) DBCPs improve performance by 62% on average and 282% at best in the benchmarks we studied.
TL;DR: To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead, and secondary effects such as cache pollution and increased memory bandwidth requirements must be taken into consideration.
Abstract: The expanding gap between microprocessor and DRAM performance has necessitated the use of increasingly aggressive techniques designed to reduce or hide the latency of main memory access. Although large cache hierarchies have proven to be effective in reducing this latency for the most frequently used data, it is still not uncommon for many programs to spend more than half their run times stalled on memory requests. Data prefetching has been proposed as a technique for hiding the access latency of data referencing patterns that defeat caching strategies. Rather than waiting for a cache miss to initiate a memory fetch, data prefetching anticipates such misses and issues a fetch to the memory system in advance of the actual memory reference. To be effective, prefetching must be implemented in such a way that prefetches are timely, useful, and introduce little overhead. Secondary effects such as cache pollution and increased memory bandwidth requirements must also be taken into consideration. Despite these obstacles, prefetching has the potential to significantly improve overall program execution time by overlapping computation with memory accesses. Prefetching strategies are diverse, and no single strategy has yet been proposed that provides optimal performance. The following survey examines several alternative approaches, and discusses the design tradeoffs involved when implementing a data prefetch strategy.
TL;DR: In this paper, a method and system for prefetching data for clients associated with a proxy server hierarchy and an enhanced content server data transfer to the clients is presented, where the prefetch hint information annotates requested information or objects requested by the clients.
Abstract: A method and system for prefetching data for clients associated with a proxy server hierarchy and an enhanced content server data transfer to the clients. The method for prefetching data identifies data access patterns and prefetches select information based on a dynamic interpretation of the data access patterns. The content server or proxy server identifies data access reference patterns of clients associated with the content server or the proxy server hierarchy. The data access patterns are preferably stored in a statistics table. The content server generates prefetch hint information based on inputs from the statistics table. The prefetch hint information annotates requested information or objects requested by the clients. The prefetch hint information optimally is updated as the requested information passes through the proxy server hierarchy. The prefetch hint information annotates the requested information requested by the clients. The ultimate decision to prefetch select information for the clients is made based on the prefetch hint information and prefetch hint values. The prefetch hint value is calculated based on local node-specific factors, such as storage constraints and access time.