Robust SIMD: Dynamically Adapted SIMD Width and Multi-Threading Depth
Jiayuan Meng,Jeremy W. Sheaffer,Kevin Skadron +2 more
- 21 May 2012
- pp 107-118
TL;DR: This paper proposes Robust SIMD, which provides wide SIMD and then dynamically adjusts SIMD width and multi-threading depth according to performance feedback, reducing design risks when SIMD execution is paired with hardware-managed caches.
Abstract: Architectures that aggressively exploit SIMD often have many data paths execute in lockstep and use multi-threading to hide latency. They can yield high throughput in terms of area- and energy-efficiency for many data-parallel applications. To balance productivity and performance, many recent SIMD organizations incorporate implicit cache hierarchies. Examples of such architectures include Intel's MIC, AMD's Fusion, and NVIDIA's Fermi. However, unlike software-managed streaming memories used in conventional graphics processors (GPUs), hardware-managed caches are more disruptive to SIMD execution; therefore, the interaction between implicit caching and aggressive SIMD execution may no longer follow the conventional wisdom gained from streaming memories. We show that due to more frequent memory latency divergence, lower latency in non-L1 data accesses, and relatively unpredictable L1 contention, cache hierarchies favor different SIMD widths and multi-threading depths than streaming memories. In fact, because the above effects are subject to runtime dynamics, a fixed combination of SIMD width and multi-threading depth no longer works ubiquitously across diverse applications or when cache capacities are reduced due to pollution or power saving. To address the above issues and reduce design risks, this paper proposes Robust SIMD, which provides wide SIMD and then dynamically adjusts SIMD width and multi-threading depth according to performance feedback. Robust SIMD can trade wider SIMD for deeper multi-threading by splitting a wider SIMD group into multiple narrower SIMD groups. Compared to the performance generated by running every benchmark on its individually preferred SIMD organization, the same Robust SIMD organization performs similarly (sometimes even better due to phase adaptation) and outperforms the best fixed SIMD organization by 17%. When D-cache capacity is reduced due to runtime disruptiveness, Robust SIMD offers graceful performance degradation: with 25% polluted cache lines in a 32 KB D-cache, Robust SIMD performs 1.4× better than a conventional SIMD architecture.
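The trade between SIMD width and multi-threading depth described in the abstract can be illustrated with a small software sketch. This is not the paper's hardware mechanism: the function names `split_group` and `pick_width`, and the "sample each width once and keep the best" policy, are assumptions made purely for illustration.

```python
from typing import Callable, List, Sequence


def split_group(lanes: Sequence[int], narrow_width: int) -> List[List[int]]:
    """Split one wide SIMD group into several narrower groups.

    The total number of lanes is unchanged, so halving the width
    doubles the number of groups available for multi-threading.
    """
    if narrow_width <= 0 or len(lanes) % narrow_width != 0:
        raise ValueError("narrow_width must evenly divide the group size")
    return [list(lanes[i:i + narrow_width])
            for i in range(0, len(lanes), narrow_width)]


def pick_width(candidate_widths: Sequence[int],
               measure_throughput: Callable[[int], float]) -> int:
    """Return the candidate width with the highest measured throughput.

    Hypothetical feedback policy: each width is sampled exactly once.
    """
    return max(candidate_widths, key=measure_throughput)
```

For example, `split_group(range(16), 4)` turns one 16-wide group into four 4-wide groups, and `pick_width` could then be called with a measurement callback to choose among widths such as 4, 8, and 16.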
Citations
Pannotia: Understanding irregular GPGPU graph applications
Shuai Che,Bradford M. Beckmann,Steven K. Reinhardt,Kevin Skadron +3 more
- 01 Sep 2013
TL;DR: This paper characterizes a suite of GPGPU graph applications, Pannotia, which is implemented in OpenCL and contains problems from diverse and important graph application domains and makes architectural and scheduling suggestions that will improve their execution efficiency on GPUs.
A variable warp size architecture
Timothy G. Rogers,Daniel R. Johnson,Mike O'Connor,Stephen W. Keckler +3 more
- 13 Jun 2015
TL;DR: Variable Warp Sizing (VWS) is proposed which improves the performance of divergent applications by using a small base warp size in the presence of control flow and memory divergence, and eliminates the performance degradation due to memory convergence slip that is observed when convergent applications are executed with smaller warp sizes.
A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity
TL;DR: A survey about GPUs from two perspectives is provided: architectural advances to improve performance and programmability and advances to enhance CPU–GPU integration in heterogeneous systems.
Programmable multimedia platform based on reconfigurable processor for 8K UHD TV
TL;DR: This paper introduces the world's first programmable video-processing platform for the enhancement of the video quality of the 8K (7680 × 4320) Ultra High Definition (UHD) TV that operates at a maximum rate of 60 frames per second.
Exploiting tightly-coupled cores
Daniel Bates,Alex Bradbury,Andreas Koltes,Robert Mullins +3 more
- 15 Jul 2013
TL;DR: This paper focuses on the design of a single 8-core tile, conceived as the building block for a larger many-core system, explores the tile’s ability to support a range of parallelisation opportunities, and details the control and communication mechanisms needed to exploit each core’s resources in a flexible manner.
References
The SPLASH-2 programs: characterization and methodological considerations
Steven Cameron Woo,Moriyoshi Ohara,Evan Torrie,Jaswinder Pal Singh,Anoop Gupta +4 more
- 01 May 1995
TL;DR: This paper quantitatively characterizes the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication to computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.
The Landscape of Parallel Computing Research: A View from Berkeley
Krste Asanovic,Ras Bodik,Bryan Catanzaro,Joseph Gebis,Parry Husbands,Kurt Keutzer,David A. Patterson,William Plishker,John Shalf,Samuel Williams,Katherine Yelick +10 more
- 18 Dec 2006
TL;DR: The parallel landscape is framed with seven questions, and the following are recommended to explore the design space rapidly: • The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems • The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches
Moinuddin K. Qureshi,Yale N. Patt +1 more
- 09 Dec 2006
TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
The M5 Simulator: Modeling Networked Systems
TL;DR: The M5 simulator provides features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically.
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Naveen Muralimanohar,Rajeev Balasubramonian,Norm Jouppi +2 more
- 01 Dec 2007
TL;DR: This work implements two major extensions to the CACTI cache modeling tool that focus on interconnect design for a large cache, and adopts state-of-the-art design space exploration strategies for non-uniform cache access (NUCA).