Proceedings Article, DOI: 10.1145/1250662.1250668
Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures
Weirong Zhu, Vugranam C. Sreedhar, Ziang Hu, Guang R. Gao
- 09 Jun 2007
- Vol. 35, Iss: 2, pp 35-45
117
TL;DR: The Synchronization State Buffer (SSB) is proposed, a scalable architectural design for fine-grain synchronization between concurrent threads that records and manages the states of frequently synchronized data using modest hardware support.
Abstract: Efficient fine-grain synchronization is extremely important to effectively harness the computational power of many-core architectures. However, designing and implementing fine-grain synchronization in such architectures presents several challenges, including synchronization-induced overhead, storage cost, scalability, and the level of granularity to which synchronization is applicable. This paper proposes the Synchronization State Buffer (SSB), a scalable architectural design for fine-grain synchronization that efficiently performs synchronizations between concurrent threads. The design of SSB is motivated by the following observation: at any instant during parallel execution, only a small fraction of memory locations are actively participating in synchronization. Based on this observation, we present a fine-grain synchronization design that records and manages the states of frequently synchronized data using modest hardware support. We have implemented the SSB design in the context of the 160-core IBM Cyclops-64 architecture. Using detailed simulation, we present our experience for a set of benchmarks with different workload characteristics.
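The observation behind SSB can be illustrated with a toy software model. This is a minimal sketch and not the paper's hardware design: the class name `SyncStateBuffer`, the `try_acquire` and `release` methods, the capacity of 2, and the "overflow" result (standing in for some fallback path) are all assumptions made for illustration. It shows only the core idea that state is kept for addresses that are actively being synchronized on, so a small table can serve a large address space.

```python
class SyncStateBuffer:
    """Toy model: a fixed-capacity table that holds lock state only for
    addresses that are currently being synchronized on."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # address -> id of the thread holding the lock

    def try_acquire(self, addr, tid):
        if addr in self.entries:
            return "busy"          # address is already locked
        if len(self.entries) >= self.capacity:
            return "overflow"      # assumption: caller uses a fallback path
        self.entries[addr] = tid   # allocate an entry only when needed
        return "acquired"

    def release(self, addr, tid):
        if self.entries.get(addr) != tid:
            return False           # not held by this thread
        del self.entries[addr]     # free the entry once the address is idle
        return True


ssb = SyncStateBuffer(capacity=2)
results = [
    ssb.try_acquire(0x100, tid=0),  # first lock on 0x100
    ssb.try_acquire(0x100, tid=1),  # contended
    ssb.try_acquire(0x200, tid=1),  # second active address
    ssb.try_acquire(0x300, tid=2),  # table is full
    ssb.release(0x100, tid=0),      # frees one entry
    ssb.try_acquire(0x300, tid=2),  # now fits
]
print(results)
```

Because entries are freed as soon as an address is released, the table size in this model bounds the number of simultaneously locked addresses, not the number of addresses that can ever be locked.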
Citations
Approximate Communication: Techniques for Reducing Communication Bottlenecks in Large-Scale Parallel Systems
TL;DR: Compression and approximate value prediction show great promise for reducing the communication bottleneck in bandwidth-constrained applications, while relaxed synchronization is found to provide large speedups for select error-tolerant applications, but suffers from limited general applicability and unreliable output degradation guarantees.
122
SynCron: Efficient Synchronization Support for Near-Data-Processing Architectures
Christina Giannoula, Nandita Vijaykumar, Nikela Papadopoulou, Vasileios Karakostas, Iván López Fernández, Juan Gómez-Luna, Lois Orosa, Nectarios Koziris, Georgios Goumas, Onur Mutlu
- 01 Feb 2021
TL;DR: SynCron is an end-to-end synchronization solution for near-data-processing (NDP) systems that adds low-cost hardware support near memory for synchronization acceleration and avoids the need for hardware cache coherence support.
Fine-Grained Synchronizations and Dataflow Programming on GPUs
Ang Li, Gert-Jan van den Braak, Henk Corporaal, Akash Kumar
- 08 Jun 2015
TL;DR: This paper proposes a novel approach for fine-grained inter-thread synchronizations on the shared memory of modern GPUs, demonstrates its performance, and applies it to Needleman-Wunsch, a 2D wavefront application involving massive cross-loop data dependencies.
59
Godson-T: An Efficient Many-Core Architecture for Parallel Program Executions
Dongrui Fan, Nan Yuan, Junchao Zhang, Yongbin Zhou, Wei Lin, Fenglong Song, Xiaochun Ye, He Huang, Lei Yu, Guoping Long, Hao Zhang, Lei Liu
TL;DR: This work proposes a many-core architecture, Godson-T, which features a region-based cache coherence protocol, asynchronous data transfer agents, and hardware-supported synchronization mechanisms to make highly efficient use of on-chip resources.
46
Low-Overhead, high-speed multi-core barrier synchronization
John Sartori, Rakesh Kumar
- 25 Jan 2010
TL;DR: Three barrier implementations that are hybrids of software and dedicated hardware barriers and are specifically tailored for CMPs are presented and evaluated, providing low latency comparable to that of dedicated hardware networks at a fraction of the cost.
44
References
Transactional memory: architectural support for lock-free data structures
Maurice Herlihy, J. Eliot B. Moss
- 01 May 1993
TL;DR: Simulation results show that transactional memory matches or outperforms the best known locking techniques for simple benchmarks, even in the absence of priority inversion, convoying, and deadlock.
Algorithms for scalable synchronization on shared-memory multiprocessors
TL;DR: The principal conclusion is that contention due to synchronization need not be a problem in large-scale shared-memory multiprocessors, and the existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides protection against so-called “dance hall” architectures.
The SGI Origin: a ccNUMA highly scalable server
James Laudon, Daniel E. Lenoski
- 01 May 1997
TL;DR: The motivation for building the Origin 2000 is discussed and the architecture and implementation of the multiprocessor is described, and performance results are presented for the NAS Parallel Benchmarks V2.2 and the SPLASH2 applications.
LogTM: log-based transactional memory
Kevin E. Moore, Jayaram Bobba, M.J. Moravan, Mark D. Hill, Darien Wood
- 27 Feb 2006
TL;DR: This paper presents a new implementation of transactional memory, log-based transactional memory (LogTM), that makes commits fast by storing old values to a per-thread log in cacheable virtual memory and storing new values in place.
Hazard pointers: safe memory reclamation for lock-free objects
TL;DR: Hazard pointers is presented, a memory management methodology that allows memory reclamation for arbitrary reuse and offers a lock-free solution for the ABA problem using only practical single-word instructions and guaranteeing continuous progress and availability, even in the presence of thread failures and arbitrary delays.
633