Journal Article, DOI: 10.1145/225830.223990
The SPLASH-2 programs
485
TL;DR: The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors.
Abstract: The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has t...
Citations
CleanupSpec: An "Undo" Approach to Safe Speculation
Gururaj Saileshwar, Moinuddin K. Qureshi
- 12 Oct 2019
TL;DR: CleanupSpec is a hardware-based solution that mitigates speculation-based attacks by undoing the changes to the cache sub-system caused by speculative instructions, in the event they are squashed on a mis-speculation.
141
Power and performance of read-write aware hybrid caches with non-volatile memories
Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Yuan Xie
- 20 Apr 2009
TL;DR: It is demonstrated that an RWHCA design with a conservative setup can provide a geometric-mean 55% power reduction together with a 5% IPC improvement over a baseline SRAM cache design across a collection of 30 workloads.
128
Machine Learning for Power, Energy, and Thermal Management on Multicore Processors: A Survey
TL;DR: This paper presents an overview of several research efforts that propose to use machine learning techniques for power and thermal management on single-core and multicore processors; such techniques can potentially adapt to varying system conditions and workloads.
118
A-WiNoC: Adaptive Wireless Network-on-Chip Architecture for Chip Multiprocessors
TL;DR: This paper proposes A-WiNoC, a scalable, adaptable wireless Network-on-Chip architecture that uses energy efficient wireless transceivers and improves network throughput by dynamically re-assigning channels in response to bandwidth demands from different cores.
103
Wireless networks-on-chips: architecture, wireless channel, and devices
TL;DR: It is shown that the integration of wireless interconnects with wired interconnects in NoCs can reduce overall network power by 34 percent while achieving a speedup of 2.54 on real applications.
References
Parallelism in random access machines
Steven Fortune, James C. Wyllie
- 01 May 1978
TL;DR: A model of computation based on random access machines operating in parallel and sharing a common memory is presented; it can accept in polynomial time exactly the sets accepted by nondeterministic exponential-time-bounded Turing machines.
A rapid hierarchical radiosity algorithm
Pat Hanrahan, David Salzman, Larry Aupperle
- 01 Jul 1991
TL;DR: Standard techniques for shooting and gathering can be used with the hierarchical representation to solve for equilibrium radiosities, but the paper also discusses using a brightness-weighted error criterion, in conjunction with multigridding, to progressively refine the image even more rapidly.
642
A comparison of sorting algorithms for the connection machine CM-2
Guy E. Blelloch, Charles E. Leiserson, Bruce M. Maggs, C. Greg Plaxton, Stephen J. Smith, Marco Zagha
- 01 Jun 1991
TL;DR: A fast sorting algorithm for the Connection Machine Supercomputer model CM-2 is developed and it is shown that any O(lg n)-depth family of sorting networks can be used to sort n numbers in O(lg n) time in the bounded-degree fixed interconnection network domain.
False sharing and spatial locality in multiprocessor caches
TL;DR: To mitigate false sharing and to enhance spatial locality, the layout of shared data in cache blocks is optimized in a programmer-transparent manner and it is shown that this approach can reduce the number of misses on shared data by about 10% on average.
FFTs in external or hierarchical memory
TL;DR: Advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory that require as few as two passes through the external data set, employ strictly unit stride, long vector transfers between main memory and external storage, and are well suited for vector and parallel computation are described.
266