TL;DR: This article shows that the microarchitecture needed to support threads on a CMT processor can also deliver high single-thread performance, with hardware scouting improving it by up to 40 percent.
Abstract: CMT processors offer a way to significantly improve the performance of computer systems. The return on investment for multithreading is among the highest in computer microarchitectural techniques. If you design a core from scratch to support multithreading, gains as high as 3× are possible for just a 20 percent increase in area. Even with throughput performance as the main target, we have shown that the microarchitecture necessary to support threads on a CMT can also achieve high single-thread performance. Hardware scouting, which Sun is implementing on the Rock microprocessor, can increase the single-thread performance of applications by up to 40 percent. Alternatively, scouting can be viewed as a technique that makes the on-chip caches appear much larger, and as a performance robustness technique that makes up for code tailored to different on-chip cache sizes or even a different number and levels of caches.
TL;DR: Experimental results show that store handling optimizations are crucial for mitigating the substantial performance impact of stores in commercial applications and new optimizations are demonstrated to virtually eliminate the impact of off-chip store misses.
Abstract: This paper studies the impact of off-chip store misses on processor performance for modern commercial applications. The performance impact of off-chip store misses is largely determined by the extent of their overlap with other off-chip cache misses. The epoch MLP model is used to explain and quantify how these overlaps are affected by various store handling optimizations and by the memory consistency model implemented by the processor. The extent of these overlaps is then translated to off-chip CPI. Experimental results show that store handling optimizations are crucial for mitigating the substantial performance impact of stores in commercial applications. While some previously proposed optimizations, such as store prefetching, are highly effective, they are unable to fully mitigate the performance impact of off-chip store misses, and they also leave a performance gap between the stronger and weaker memory consistency models. New optimizations, such as the Store Miss Accelerator, an optimization of Hardware Scout, and a new application of Speculative Lock Elision, are demonstrated to virtually eliminate the impact of off-chip store misses.
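The step from miss overlap to off-chip CPI can be illustrated with a toy calculation. This is a minimal sketch under a simplifying assumption that the abstract does not spell out: misses that overlap are grouped into one epoch, and each epoch costs roughly one full memory latency. The function names, the overlap flags, and all numbers are hypothetical and are not taken from the paper.

```python
def count_epochs(overlaps_previous):
    """Count epochs in a sequence of off-chip misses.

    overlaps_previous[i] is True when miss i overlaps with the miss
    before it, so both are served in the same epoch (assumed model).
    """
    return sum(1 for overlaps in overlaps_previous if not overlaps)


def offchip_cpi(num_epochs, instructions, miss_latency_cycles):
    """Approximate off-chip CPI as one full miss latency per epoch."""
    return num_epochs * miss_latency_cycles / instructions


# Hypothetical trace: 5 off-chip misses in 1000 instructions, 400-cycle latency.
baseline = [False, True, False, False, True]   # 3 epochs
optimized = [False, True, True, True, True]    # all misses overlap: 1 epoch

print(offchip_cpi(count_epochs(baseline), 1000, 400))
print(offchip_cpi(count_epochs(optimized), 1000, 400))
```

In this simplified view, an optimization that lets a store miss overlap with neighbouring misses lowers the epoch count, and therefore the off-chip CPI, without changing the number of misses.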
TL;DR: Hardware scout threading generates prefetches by speculatively executing code during stalls, without committing the results of the speculative execution to the architectural state of the processor.
Abstract: One embodiment of the present invention provides a system that generates prefetches by speculatively executing code during stalls through a technique known as "hardware scout threading." The system starts by executing code within a processor. Upon encountering a stall, the system speculatively executes the code from the point of the stall, without committing results of the speculative execution to the architectural state of the processor. If the system encounters a memory reference during this speculative execution, the system determines if a target address for the memory reference can be resolved. If so, the system issues a prefetch for the memory reference to load a cache line for the memory reference into a cache within the processor.
TL;DR: This paper provides an overview of the architectural highlights and describes the physical implementation challenges and solutions including circuit innovations in memory arrays, register files, and floating-point hardware that boost the performance and circuit robustness with low area overhead.
Abstract: This third-generation Chip-Multithreading (CMT) SPARC processor consists of 16 cores with a shared-memory architecture and supports a total of 32 main threads plus 32 scout threads. It is targeted at high-performance servers and is optimized for both single- and multi-threaded applications. The 396 mm² chip is fabricated in an 11-metal-layer 65-nm CMOS process and operates at a nominal frequency of 2.3 GHz, consuming a maximum power of 250 W at 1.2 V. This paper provides an overview of the architectural highlights and describes the physical implementation challenges and solutions, including circuit innovations in memory arrays, register files, and floating-point hardware that boost the performance and circuit robustness with low area overhead.
TL;DR: This work describes a system that generates prefetches by speculatively executing code during stalls through a technique known as "hardware scout threading"; in an SMT variation, a second thread carries out the speculative execution.
Abstract: One embodiment of the present invention provides a system that generates prefetches by speculatively executing code during stalls through a technique known as 'hardware scout threading.' The system starts by executing code within a processor. Upon encountering a stall, the system speculatively executes the code from the point of the stall, without committing results of the speculative execution to the architectural state of the processor. If the system encounters a memory reference during this speculative execution, the system determines if a target address for the memory reference can be resolved. If so, the system issues a prefetch for the memory reference to load a cache line for the memory reference into a cache within the processor. In a variation on this embodiment, the processor supports simultaneous multithreading (SMT), which enables multiple threads to execute concurrently through time-multiplexed interleaving in a single processor pipeline. In this variation, the non-speculative execution is carried out by a first thread and the speculative execution is carried out by a second thread, wherein the first thread and the second thread simultaneously execute on the processor.
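The scouting steps described above can be sketched as a small software simulation. This is a toy model under stated assumptions, not the hardware mechanism itself: the two-instruction ISA (`load`, `add`), the `Instr` type, the `scout` function, the 64-byte line size, and the rule that every value loaded during scouting is treated as unknown are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

LINE_SIZE = 64  # bytes per cache line (assumed value)


@dataclass
class Instr:
    op: str                    # "load" or "add" (toy ISA, hypothetical)
    dst: str                   # destination register
    src: Optional[str] = None  # base/source register, None means absolute
    imm: int = 0               # address offset or immediate operand


def scout(program, pc, regs, stalled_dst, cache, max_instrs=16):
    """Speculatively run from pc and return the cache lines prefetched.

    regs is never modified: all speculative results go to a shadow copy,
    so nothing is committed to the architectural state.
    """
    shadow = dict(regs)        # speculative register file
    unknown = {stalled_dst}    # registers that depend on the missing data
    prefetched = []
    for instr in program[pc:pc + max_instrs]:
        resolvable = instr.src is None or instr.src not in unknown
        base = shadow.get(instr.src, 0) if instr.src is not None else 0
        if instr.op == "load":
            if resolvable:     # target address can be resolved: prefetch it
                line = (base + instr.imm) // LINE_SIZE
                if line not in cache:
                    cache.add(line)
                    prefetched.append(line)
            unknown.add(instr.dst)  # simplification: loaded data is unknown
        elif instr.op == "add":
            if resolvable:
                shadow[instr.dst] = base + instr.imm
                unknown.discard(instr.dst)
            else:
                unknown.add(instr.dst)
    return prefetched


regs = {"r1": 0x1000, "r2": 0x2000}
program = [
    Instr("load", "r3", "r1", 0),    # misses the cache: the stall point
    Instr("load", "r4", "r3", 8),    # address depends on r3: not resolvable
    Instr("add", "r5", "r2", 128),   # r5 = 0x2080, independent of the miss
    Instr("load", "r6", "r5", 0),    # resolvable: prefetch its cache line
    Instr("load", "r7", "r4", 0),    # depends on r4: not resolvable
]
cache = set()
lines = scout(program, 1, regs, "r3", cache)
print(lines)  # [130]
```

Only the load through `r5` yields a prefetch here, because its address does not depend on the stalled load; `regs` is left unchanged, mirroring the requirement that speculative results are not committed. In the SMT variation, this loop would correspond to the second, speculative thread running alongside the stalled one.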