Hardware scout

Papers

Journal Article•10.1109/MM.2005.49•

High-performance throughput computing

Shailender Chaudhry, P. Caprioli, Sherman H. Yip, Marc Tremblay

01 May 2005-IEEE Micro

TL;DR: The authors show that the microarchitecture needed to support threads on a CMT processor can also deliver high single-thread performance; hardware scouting can improve the single-thread performance of applications by up to 40 percent.

Abstract: CMT processors offer a way to significantly improve the performance of computer systems. The return on investment for multithreading is among the highest in computer microarchitectural techniques. If you design a core from scratch to support multithreading, gains as high as 3× are possible for just a 20 percent increase in area. Even with throughput performance as the main target, we have shown that the microarchitecture necessary to support threads on a CMT can also achieve high single-thread performance. Hardware scouting, which Sun is implementing on the Rock microprocessor, can increase the single-thread performance of applications by up to 40 percent. Alternatively, scouting can be viewed as a technique that makes the on-chip caches appear much larger, or as a performance robustness technique that makes up for code tailored for different on-chip cache sizes or even a different number and levels of caches.

108 citations
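
As a quick reading of the figures in the abstract above, and assuming both the throughput gain and the area increase are measured against the same single-threaded baseline core (an assumption, not something the abstract states), a 3× gain for 1.2× the area corresponds to 2.5× the throughput per unit of area:

```latex
\frac{\text{throughput gain}}{\text{area growth}} = \frac{3.0}{1.2} = 2.5
```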

Proceedings Article•10.1109/MICRO.2005.31•

Store Memory-Level Parallelism Optimizations for Commercial Applications

Yuan Chou¹, Lawrence Spracklen¹, Santosh G. Abraham¹•Institutions (1)

Sun Microsystems¹

12 Nov 2005

TL;DR: Experimental results show that store handling optimizations are crucial for mitigating the substantial performance impact of stores in commercial applications and new optimizations are demonstrated to virtually eliminate the impact of off-chip store misses.

Abstract: This paper studies the impact of off-chip store misses on processor performance for modern commercial applications. The performance impact of off-chip store misses is largely determined by the extent of their overlap with other off-chip cache misses. The epoch MLP model is used to explain and quantify how these overlaps are affected by various store handling optimizations and by the memory consistency model implemented by the processor. The extent of these overlaps is then translated to off-chip CPI. Experimental results show that store handling optimizations are crucial for mitigating the substantial performance impact of stores in commercial applications. While some previously proposed optimizations, such as store prefetching, are highly effective, they are unable to fully mitigate the performance impact of off-chip store misses and they also leave a performance gap between the stronger and weaker memory consistency models. New optimizations, such as the Store Miss Accelerator, an optimization of Hardware Scout and a new application of Speculative Lock Elision, are demonstrated to virtually eliminate the impact of off-chip store misses.

42 citations
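
The abstract above describes translating the overlap of off-chip misses into off-chip CPI. The sketch below is one simplified way to picture that translation, not the paper's actual model: it assumes that misses issued within a fixed instruction window of the first miss in a group overlap completely and form one "epoch", and that each epoch costs one full memory latency. The function names, the window rule, and all numbers are hypothetical.

```python
def count_epochs(miss_positions, window):
    """Group off-chip misses into epochs.

    Assumption (illustrative only): a miss overlaps with the miss that
    started the current epoch if it occurs at most `window` instructions
    after it. Otherwise it starts a new epoch.
    """
    epochs = 0
    epoch_start = None
    for pos in sorted(miss_positions):
        if epoch_start is None or pos - epoch_start > window:
            epochs += 1
            epoch_start = pos
    return epochs


def offchip_cpi(miss_positions, instructions, latency_cycles, window):
    """Off-chip CPI, assuming each epoch exposes one full memory latency."""
    epochs = count_epochs(miss_positions, window)
    return epochs * latency_cycles / instructions


# Hypothetical trace: five off-chip misses in 1000 instructions.
misses = [0, 10, 20, 500, 510]
no_overlap = offchip_cpi(misses, 1000, 400, window=0)     # every miss is its own epoch
with_overlap = offchip_cpi(misses, 1000, 400, window=64)  # nearby misses share an epoch
```

Under these assumptions, anything that lets more misses fall into the same epoch (for example, prefetching a store's cache line early) lowers the epoch count and therefore the off-chip CPI.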

Patent•

Generating prefetches by speculatively executing code through hardware scout threading

Shailender Chaudhry, Marc Tremblay

19 Dec 2003

TL;DR: Hardware scout threading generates prefetches by speculatively executing code during stalls, without committing the results of the speculative execution to the architectural state of the processor.

Abstract: One embodiment of the present invention provides a system that generates prefetches by speculatively executing code during stalls through a technique known as “hardware scout threading.” The system starts by executing code within a processor. Upon encountering a stall, the system speculatively executes the code from the point of the stall, without committing results of the speculative execution to the architectural state of the processor. If the system encounters a memory reference during this speculative execution, the system determines if a target address for the memory reference can be resolved. If so, the system issues a prefetch for the memory reference to load a cache line for the memory reference into a cache within the processor.

22 citations
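
The steps in the patent abstract above (stall, run ahead without committing, prefetch only when the address can be resolved) can be illustrated with a toy software model. This is a minimal sketch under stated assumptions, not the patented hardware or the Rock implementation: the `ToyCore` class, the 4-tuple instruction format, the 64-byte line size, and the rule that prefetched lines arrive when the stall ends are all hypothetical.

```python
from dataclasses import dataclass, field

LINE_SIZE = 64  # assumed cache line size in bytes


@dataclass
class ToyCore:
    """Toy in-order core. Instructions are (op, dst, src, imm) tuples,
    where op is "load" (dst = mem[src + imm]) or "add" (dst = src + imm)."""
    memory: dict
    regs: dict
    cache: set = field(default_factory=set)        # cached line numbers
    prefetches: list = field(default_factory=list)  # addresses prefetched by scouting
    demand_misses: int = 0

    def scout(self, program, pc, window):
        """Run ahead from the stalled load at `pc` on a shadow register copy."""
        shadow = dict(self.regs)        # architectural registers stay untouched
        shadow[program[pc][1]] = None   # result of the stalled load is unknown
        in_flight = set()               # lines prefetched during this stall
        for op, dst, src, imm in program[pc + 1 : pc + 1 + window]:
            base = shadow.get(src)
            if base is None:
                shadow[dst] = None      # address or operand cannot be resolved
                continue
            if op == "load":
                addr = base + imm
                line = addr // LINE_SIZE
                if line in self.cache:
                    shadow[dst] = self.memory.get(addr, 0)
                else:
                    if line not in in_flight:
                        in_flight.add(line)
                        self.prefetches.append(addr)
                    shadow[dst] = None  # data has not returned yet
            else:
                shadow[dst] = base + imm
        self.cache |= in_flight         # assumed: prefetches land when the stall ends
        # `shadow` is discarded here, so nothing speculative is committed.

    def run(self, program, scout_window=0):
        """Execute the program; returns the number of demand misses."""
        for pc, (op, dst, src, imm) in enumerate(program):
            if op == "load":
                addr = self.regs[src] + imm
                line = addr // LINE_SIZE
                if line not in self.cache:
                    self.demand_misses += 1
                    if scout_window:
                        self.scout(program, pc, scout_window)
                    self.cache.add(line)
                self.regs[dst] = self.memory.get(addr, 0)
            else:
                self.regs[dst] = self.regs[src] + imm
        return self.demand_misses


# Independent loads: addresses are resolvable during the first stall.
independent = [("load", r, "base", off) for r, off in
               [("a", 0), ("b", 64), ("c", 128), ("d", 192)]]
# Pointer chase: the second address depends on the stalled load's result.
chase = [("load", "p", "base", 0), ("load", "q", "p", 0)]
```

In this model, running `independent` with a scout window turns three of the four demand misses into prefetches, while `chase` gains nothing because the second load's address depends on data that has not yet returned, which is the "can the target address be resolved" check from the abstract.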

Journal Article•10.1109/JSSC.2008.2007144•

Architecture and Physical Implementation of a Third Generation 65 nm, 16 Core, 32 Thread Chip-Multithreading SPARC Processor

Georgios Konstadinidis¹, Marc Tremblay², Shailender Chaudhry², Mamun Rashid², Peter F. Lai, Y. Otaguro, Y. Orginos, S. Parampalli, M. Steigerwald², S. Gundala, Rambabu Pyapali², Leonard D. Rarick², Ilyas Elkin², Yuefei Ge¹, I. Parulkar•Institutions (2)

Oracle Corporation¹, Sun Microsystems²

01 Jan 2009-IEEE Journal of Solid-state Circuits

TL;DR: This paper provides an overview of the architectural highlights and describes the physical implementation challenges and solutions including circuit innovations in memory arrays, register files, and floating-point hardware that boost the performance and circuit robustness with low area overhead.

Abstract: This third-generation Chip-Multithreading (CMT) SPARC processor consists of 16 cores with shared memory architecture and supports a total of 32 main threads plus 32 scout threads. It is targeted for high-performance servers, and is optimized for both single- and multi-threaded applications. The 396 mm² chip is fabricated in an 11 metal layer 65-nm CMOS process and operates at a nominal frequency of 2.3 GHz, consuming a maximum power of 250 W at 1.2 V. This paper provides an overview of the architectural highlights and describes the physical implementation challenges and solutions including circuit innovations in memory arrays, register files, and floating-point hardware that boost the performance and circuit robustness with low area overhead.

18 citations

Patent•

Performing hardware scout threading in a system that supports simultaneous multithreading

Shailender Chaudhry, Marc Tremblay

19 Dec 2003

TL;DR: This patent describes a system that generates prefetches by speculatively executing code during stalls, a technique known as "hardware scout threading." On a processor that supports simultaneous multithreading, the non-speculative execution and the speculative execution are carried out by two threads that execute simultaneously.

Abstract: One embodiment of the present invention provides a system that generates prefetches by speculatively executing code during stalls through a technique known as 'hardware scout threading.' The system starts by executing code within a processor. Upon encountering a stall, the system speculatively executes the code from the point of the stall, without committing results of the speculative execution to the architectural state of the processor. If the system encounters a memory reference during this speculative execution, the system determines if a target address for the memory reference can be resolved. If so, the system issues a prefetch for the memory reference to load a cache line for the memory reference into a cache within the processor. In a variation on this embodiment, the processor supports simultaneous multithreading (SMT), which enables multiple threads to execute concurrently through time-multiplexed interleaving in a single processor pipeline. In this variation, the non-speculative execution is carried out by a first thread and the speculative execution is carried out by a second thread, wherein the first thread and the second thread simultaneously execute on the processor.

15 citations
