Proceedings Article, DOI: 10.1145/2694344.2694356
DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations
Hyojin Sung, Sarita V. Adve
- 14 Mar 2015
- Vol. 50, Iss: 4, pp 545-559
TL;DR: This paper proposes DeNovoSync, a technique to support arbitrary synchronization in DeNovo by combining registration of all synchronization reads with a judicious hardware backoff that limits unnecessary registrations. Compared to MESI, DeNovoSync shows comparable or up to 22% lower execution time and up to 58% lower network traffic.
Abstract: Current shared-memory hardware is complex and inefficient. Prior work on the DeNovo coherence protocol showed that disciplined shared-memory programming models can enable more complexity-, performance-, and energy-efficient hardware than the state-of-the-art MESI protocol. DeNovo, however, severely restricted the synchronization constructs an application can support. This paper proposes DeNovoSync, a technique to support arbitrary synchronization in DeNovo. The key challenge is that DeNovo exploits race-freedom to use reader-initiated local self-invalidations (instead of conventional writer-initiated remote cache invalidations) to ensure coherence. Synchronization accesses are inherently racy and not directly amenable to self-invalidations. DeNovoSync addresses this challenge using a novel combination of registration of all synchronization reads with a judicious hardware backoff to limit unnecessary registrations. For a wide variety of synchronization constructs and applications, compared to MESI, DeNovoSync shows comparable or up to 22% lower execution time and up to 58% lower network traffic, enabling DeNovo's advantages for a much broader class of software than previously possible.
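The interaction between read registration and backoff can be illustrated with a toy model. The sketch below is not the paper's protocol: the names `SyncVar`, `sync_read`, and `spin` are hypothetical, time is counted in abstract cycles, and a fixed delay stands in for the hardware backoff policy. It assumes only what the abstract states, namely that every synchronization read registers the reader, and it counts how often registration moves between cores while they spin on one flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyncVar:
    """Toy synchronization variable with at most one registered core."""
    value: int = 0
    registrant: Optional[int] = None  # core currently registered, if any
    transfers: int = 0                # times registration moved to a new core

    def sync_read(self, core: int) -> int:
        # Assumption: every synchronization read registers the reader.
        # A read by the current registrant is a local hit and moves nothing.
        if self.registrant != core:
            self.registrant = core
            self.transfers += 1
        return self.value


def spin(num_cores: int, cycles: int, backoff: int) -> int:
    """Let `num_cores` cores spin-read one flag for `cycles` cycles.

    After a read that had to take registration from another core, the
    reader waits `backoff` extra cycles before reading again (a fixed
    delay, chosen here only for illustration).
    Returns the total number of registration transfers.
    """
    var = SyncVar()
    next_attempt = [0] * num_cores
    for cycle in range(cycles):
        for core in range(num_cores):
            if cycle < next_attempt[core]:
                continue  # this core is still backing off
            was_registered = var.registrant == core
            var.sync_read(core)
            if not was_registered:
                next_attempt[core] = cycle + 1 + backoff
    return var.transfers


if __name__ == "__main__":
    print(spin(num_cores=4, cycles=100, backoff=0))  # 400
    print(spin(num_cores=4, cycles=100, backoff=9))  # 40
```

In this model, four spinning cores without backoff steal registration from one another on every read, while a delay of nine cycles cuts the transfers tenfold. A single spinning core registers once and then hits locally, which is the case the registration scheme is meant to keep cheap.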
Citations
Stash: have your scratchpad and cache it too
Rakesh Komuravelli, Matthew D. Sinclair, Johnathan Alsop, Muhammad Huzaifa, Maria Kotsifakou, Prakalp Srivastava, Sarita V. Adve, Vikram Adve
- 13 Jun 2015
TL;DR: This work proposes an efficient heterogeneous memory system in which specialized memory components are tightly coupled in a unified and coherent address space, and shows that the stash provides better performance and energy than a cache and a scratchpad while enabling new use cases for heterogeneous systems.
Efficient GPU synchronization without scopes: saying no to complex consistency models
Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve
- 05 Dec 2015
TL;DR: This work applies the DeNovo coherence protocol to GPUs and compares it with conventional GPU coherence under the DRF and HRF consistency models, and shows that the complexity of the HRF model is neither necessary nor sufficient to obtain high performance.
Spandex: a flexible interface for efficient heterogeneous coherence
Johnathan Alsop, Matthew D. Sinclair, Sarita V. Adve
- 02 Jun 2018
TL;DR: Spandex is introduced, an improved coherence interface based on the simple and scalable DeNovo coherence protocol that directly interfaces devices with diverse coherence properties and memory demands, enabling each device to communicate in a manner appropriate for its specific access properties.
Lazy release consistency for GPUs
Johnathan Alsop, Marc S. Orr, Bradford M. Beckmann, David A. Wood
- 15 Oct 2016
TL;DR: This work proposes to adapt lazy release consistency - previously only proposed for homogeneous CPU systems - to a heterogeneous system, and uses a DeNovo-like mechanism to track ownership of synchronization variables, lazily performing coherence actions only when a synchronization variable changes locations.
Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems
Matthew D. Sinclair, Johnathan Alsop, Sarita V. Adve
- 24 Jun 2017
TL;DR: A new model is introduced, Data-Race-Free-Relaxed (DRFrlx), that extends DRF0 to provide SC-centric semantics for the common use cases of relaxed atomics, and is evaluated in CPU-GPU systems for these use cases.
References
The SPLASH-2 programs: characterization and methodological considerations
Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, Anoop Gupta
- 01 May 1995
TL;DR: This paper quantitatively characterizes the SPLASH-2 programs in terms of the fundamental properties and architectural interactions that are important for understanding them well, including computational load balance, communication-to-computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.
Simics: A full system simulation platform
Peter S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, Fredrik Larsson, A. Moestedt, Bengt Werner
TL;DR: Simics is a platform for full system simulation that can run actual firmware and completely unmodified kernel and driver code, and it provides both functional accuracy for running commercial workloads and sufficient timing accuracy to interface to detailed hardware models.
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset
Milo M. K. Martin, Daniel J. Sorin, Bradford M. Beckmann, Michael R. Marty, Min Xu, Alaa R. Alameldeen, Kevin E. Moore, Mark D. Hill, David A. Wood
TL;DR: The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers. The toolset includes a set of timing simulator modules for modeling the timing of the memory system and microprocessors.
Multifacets General Execution-Driven Multiprocessor Simulator (GEMS) Toolset
M. M. Martin
- 01 Jan 2005
TL;DR: The Wisconsin Multifacet Project has created a simulation toolset to characterize and evaluate the performance of multiprocessor hardware systems commonly used as database and web servers and has released a set of timing simulator modules for modeling the timing of the memory system and microprocessors.
Shared memory consistency models: a tutorial
TL;DR: This work describes an alternative, programmer-centric view of relaxed consistency models that characterizes them in terms of program behavior rather than the system optimizations that most of these models emphasize.