Proceedings Article, DOI: 10.1109/ICPP.2008.29
Taming Single-Thread Program Performance on Many Distributed On-Chip L2 Caches
Lei Jin, Sangyeun Cho
- 09 Sep 2008
- pp 487-494
TL;DR: A dynamic cache management scheme is proposed that determines the home cache slice and cache bin for memory pages without any static program information. It adapts well to the behavior of multiprogrammed workloads and performs significantly better than both the private caching scheme and the shared caching scheme.
Abstract: This paper presents a two-part study on managing distributed NUCA (non-uniform cache architecture) L2 caches in a future many-core processor to obtain high single-thread program performance. The first part is a limit study in which we determine data-to-cache-slice mappings at memory page granularity, based on detailed inter-page conflict information derived from a program's memory reference trace. By considering cache access latency and cache miss rate simultaneously when mapping data to L2 cache slices, this "oracle" scheme outperforms the conventional shared caching scheme by up to 208%, with an average of 45%, on a sixteen-core processor. In the second part of the study, we propose and evaluate a dynamic cache management scheme that determines the home cache slice and cache bin for memory pages without any static program information. The dynamic scheme outperforms the shared caching scheme by up to 191%, with an average of 32%, achieving much of the performance we observed in the limit study. We also find that the proposed dynamic scheme adapts well to the behavior of multiprogrammed workloads and performs significantly better than both the private caching scheme and the shared caching scheme.
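The idea of weighing access latency against miss rate when picking a home slice for a page can be illustrated with a small sketch. This is not the paper's algorithm: the cost model, the 4x4 mesh layout, the function names (`hop_distance`, `choose_home_slice`), and the cycle counts are all assumptions chosen only for illustration.

```python
MESH_DIM = 4  # assumption: the sixteen cores are laid out as a 4x4 mesh


def hop_distance(core: int, slice_id: int, dim: int = MESH_DIM) -> int:
    """Manhattan distance between a core and a cache slice on a dim x dim mesh."""
    cx, cy = core % dim, core // dim
    sx, sy = slice_id % dim, slice_id // dim
    return abs(cx - sx) + abs(cy - sy)


def choose_home_slice(core, est_miss_rate, hop_cycles=4.0, miss_penalty_cycles=300.0):
    """Pick the slice with the lowest estimated cost per access for one page.

    est_miss_rate[s] is a hypothetical estimate of the miss rate the page
    would see if placed in slice s. The cost adds the network latency to
    reach the slice and the expected off-chip miss penalty (both in cycles,
    both assumed values).
    """
    def cost(s):
        latency = hop_distance(core, s) * hop_cycles
        expected_miss = est_miss_rate[s] * miss_penalty_cycles
        return latency + expected_miss

    # Ties are broken toward the lower slice index.
    return min(range(len(est_miss_rate)), key=lambda s: (cost(s), s))
```

With uniform miss-rate estimates, the sketch keeps the page in the requesting core's local slice. Once the local slice is estimated to be heavily conflicted, a nearby slice wins, which is the latency versus miss-rate trade-off the abstract describes.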
Citations
CloudCache: Expanding and shrinking private caches
Hyunjin Lee, Sangyeun Cho, Bruce R. Childers
- 12 Feb 2011
TL;DR: This work proposes a novel scalable cache management framework called CloudCache that creates dynamically expanding and shrinking L2 caches for working threads with fine-grained hardware monitoring and control and demonstrates that CloudCache significantly improves performance of a wide range of workloads when all or a subset of cores are occupied.
SOS: A Software-Oriented Distributed Shared Cache Management Approach for Chip Multiprocessors
Lei Jin, Sangyeun Cho
- 12 Sep 2009
TL;DR: SOS, the authors' software-oriented distributed shared cache management approach, infers a program’s data affinity hints through a novel machine learning based analysis of its L2 cache access behavior, and achieves an average speedup of 10% and up to 23% over the shared cache scheme.
Cache equalizer: a placement mechanism for chip multiprocessor distributed shared caches
Mohammad Hammoud, Sangyeun Cho, Rami Melhem
- 24 Jan 2011
TL;DR: Cache Equalizer, a novel distributed cache management scheme for large-scale chip multiprocessors (CMPs), decouples the physical locations of cache blocks from their addresses to reduce misses caused by destructive interference.
MAESTRO: Orchestrating predictive resource management in future multicore systems
Sangyeun Cho, Socrates Demetriades
- 06 Jun 2011
TL;DR: A case is made for a novel framework called MAESTRO which predictively manages system resources in shared-memory parallel computing platforms built with advanced multicore processors.
Private cache partitioning: A method to reduce the off-chip missrate of concurrently executing applications in Chip-Multiprocessors
Li Hao, Liu Tao, Liu Guanghui, Xie Lunguo
- 11 Mar 2011
TL;DR: Private Cache Partitioning is presented, a low-overhead runtime mechanism that partitions all of the private low-level caches, which are organized as a large shared cache by a distributed directory.
References
Computer Architecture: A Quantitative Approach
John L. Hennessy, David A. Patterson
- 01 Dec 1989
TL;DR: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today.
The Landscape of Parallel Computing Research: A View from Berkeley
Krste Asanovic, Ras Bodik, Bryan Catanzaro, Joseph Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Plishker, John Shalf, Samuel Williams, Katherine Yelick
- 18 Dec 2006
TL;DR: The parallel landscape is framed with seven questions, and the following are recommended to explore the design space rapidly: (1) the overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems; (2) the target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS (Million Instructions per Second) per watt, MIPS per area of silicon, and MIPS per development dollar.
SimpleScalar: an infrastructure for computer system modeling
TL;DR: The SimpleScalar tool set provides an infrastructure for simulation and architectural modeling that can model a variety of platforms ranging from simple unpipelined processors to detailed dynamically scheduled microarchitectures with multiple-level memory hierarchies.
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches
Moinuddin K. Qureshi, Yale N. Patt
- 09 Dec 2006
TL;DR: In this article, the authors propose a low-overhead, runtime mechanism that partitions a shared cache between multiple applications depending on the reduction in cache misses that each application is likely to obtain for a given amount of cache resources.
An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Changkyu Kim, Doug Burger, Stephen W. Keckler
- 01 Oct 2002
TL;DR: This paper proposes physical designs for these Non-Uniform Cache Architectures (NUCAs) and extends these physical designs with logical policies that allow important data to migrate toward the processor within the same level of the cache.