Robust SIMD: Dynamically Adapted SIMD Width and Multi-Threading Depth

doi:10.1109/IPDPS.2012.20

Open Access•Proceedings Article•10.1109/IPDPS.2012.20

Jiayuan Meng, +2 more

- 21 May 2012

- pp 107-118

29

TL;DR: This paper proposes Robust SIMD, which provides wide SIMD and then dynamically adjusts SIMD width and multi-threading depth according to performance feedback, so that one organization works across diverse applications and reduced cache capacities while lowering design risk.

Abstract: Architectures that aggressively exploit SIMD often have many data paths execute in lockstep and use multi-threading to hide latency. They can yield high throughput in terms of area- and energy-efficiency for many data-parallel applications. To balance productivity and performance, many recent SIMD organizations incorporate implicit cache hierarchies. Examples of such architectures include Intel's MIC, AMD's Fusion, and NVIDIA's Fermi. However, unlike the software-managed streaming memories used in conventional graphics processors (GPUs), hardware-managed caches are more disruptive to SIMD execution; therefore, the interaction between implicit caching and aggressive SIMD execution may no longer follow the conventional wisdom gained from streaming memories. We show that due to more frequent memory latency divergence, lower latency in non-L1 data accesses, and relatively unpredictable L1 contention, cache hierarchies favor different SIMD widths and multi-threading depths than streaming memories. In fact, because the above effects are subject to runtime dynamics, a fixed combination of SIMD width and multi-threading depth no longer works ubiquitously across diverse applications or when cache capacities are reduced due to pollution or power saving. To address the above issues and reduce design risks, this paper proposes Robust SIMD, which provides wide SIMD and then dynamically adjusts SIMD width and multi-threading depth according to performance feedback. Robust SIMD can trade wider SIMD for deeper multi-threading by splitting a wider SIMD group into multiple narrower SIMD groups. Compared to the performance generated by running every benchmark on its individually preferred SIMD organization, the same Robust SIMD organization performs similarly -- sometimes even better due to phase adaptation -- and outperforms the best fixed SIMD organization by 17%. When D-cache capacity is reduced due to runtime disruptiveness, Robust SIMD offers graceful performance degradation: with 25% polluted cache lines in a 32 KB D-cache, Robust SIMD performs 1.4× better than a conventional SIMD architecture.
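
The width-for-depth trade described in the abstract can be illustrated with a toy model. The sketch below is not the paper's mechanism: the `SimdConfig` type, the greedy `adapt` loop, and the `toy_throughput` cost model (with made-up penalty coefficients) are all hypothetical, chosen only to show how splitting one wide SIMD group into two narrower ones keeps the thread count fixed while a feedback loop searches for a better width/depth balance.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SimdConfig:
    width: int  # lanes per SIMD group
    depth: int  # SIMD groups multiplexed on one core (multi-threading depth)

    def split(self) -> "SimdConfig":
        """Trade width for depth: each wide group becomes two narrower ones."""
        if self.width < 2:
            raise ValueError("cannot split a 1-wide SIMD group")
        return SimdConfig(self.width // 2, self.depth * 2)


def toy_throughput(cfg: SimdConfig) -> float:
    """Hypothetical cost model (an assumption, not taken from the paper).

    Wider groups pay a larger divergence penalty; deeper multi-threading
    pays a larger cache-contention penalty. The coefficients are invented.
    """
    threads = cfg.width * cfg.depth
    return threads / (1.0 + 0.2 * cfg.width + 0.1 * cfg.depth)


def adapt(start: SimdConfig,
          measure: Callable[[SimdConfig], float],
          min_width: int = 1) -> SimdConfig:
    """Greedy feedback loop: keep splitting while measured throughput improves."""
    best, best_perf = start, measure(start)
    while best.width > min_width:
        candidate = best.split()
        perf = measure(candidate)
        if perf <= best_perf:
            break  # feedback says the narrower organization is no better
        best, best_perf = candidate, perf
    return best


if __name__ == "__main__":
    chosen = adapt(SimdConfig(width=16, depth=2), toy_throughput)
    print(chosen)
```

In a real design the `measure` callback would be replaced by hardware performance counters sampled at runtime; here it is a pure function so the search is deterministic. Under this toy model the loop starts at 16 lanes × 2 groups and stops once a further split no longer raises the modeled throughput.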

Citations

•Proceedings Article•10.1109/IISWC.2013.6704684

Pannotia: Understanding irregular GPGPU graph applications

Shuai Che, +3 more

- 01 Sep 2013

TL;DR: This paper characterizes Pannotia, a suite of GPGPU graph applications implemented in OpenCL that contains problems from diverse and important graph application domains, and makes architectural and scheduling suggestions to improve their execution efficiency on GPUs.

222

•Proceedings Article•10.1145/2749469.2750410

A variable warp size architecture

Timothy G. Rogers, +3 more

- 13 Jun 2015

TL;DR: Variable Warp Sizing (VWS) is proposed which improves the performance of divergent applications by using a small base warp size in the presence of control flow and memory divergence, and eliminates the performance degradation due to memory convergence slip that is observed when convergent applications are executed with smaller warp sizes.

41

Journal Article•10.1016/J.JPDC.2018.11.012

A survey of architectural approaches for improving GPGPU performance, programmability and heterogeneity

Mahmoud Khairy, +3 more

- 01 May 2019

- Journal of Parallel and Distributed Comp...

TL;DR: A survey about GPUs from two perspectives is provided: architectural advances to improve performance and programmability and advances to enhance CPU–GPU integration in heterogeneous systems.

24

Journal Article•10.1109/TCE.2015.7389807

Programmable multimedia platform based on reconfigurable processor for 8K UHD TV

Young-Hwan Park, +4 more

- 01 Nov 2015

- IEEE Transactions on Consumer Electronic...

TL;DR: This paper introduces the world's first programmable video-processing platform for the enhancement of the video quality of the 8K (7680 × 4320) Ultra High Definition (UHD) TV that operates at a maximum rate of 60 frames per second.

13

•Journal Article•10.1007/S11265-014-0944-6

Exploiting tightly-coupled cores

Daniel Bates, +3 more

- 15 Jul 2013

TL;DR: This paper focuses on the design of a single 8-core tile, conceived as the building block for a larger many-core system, explores the tile's ability to support a range of parallelisation opportunities, and details the control and communication mechanisms needed to exploit each core's resources in a flexible manner.

10
