About: Average memory access time is a research topic. Over its lifetime, 99 publications have appeared on this topic, receiving 4111 citations.
TL;DR: Specific aspects of cache memories investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size.
Abstract: design issues. Specific aspects of cache memories that are investigated include: the cache fetch algorithm (demand versus prefetch), the placement and replacement algorithms, line size, store-through versus copy-back updating of main memory, cold-start versus warm-start miss ratios, multicache consistency, the effect of input/output through the cache, the behavior of split data/instruction caches, and cache size. Our discussion includes other aspects of memory system architecture, including translation lookaside buffers. Throughout the paper, we use as examples the implementation of the cache in the Amdahl 470V/6 and 470V/7, the IBM 3081, 3033, and 370/168, and the DEC VAX 11/780. An extensive bibliography is provided.
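For orientation, the quantity this topic is named after is usually computed with the textbook single-level formula, hit time plus miss ratio times miss penalty. The sketch below is not taken from the paper above; the function name and the example numbers are illustrative assumptions.

```python
def amat(hit_time, miss_ratio, miss_penalty):
    """Textbook single-level average memory access time.

    hit_time and miss_penalty share one unit (e.g. cycles);
    miss_ratio is a fraction in [0, 1].
    """
    if not 0.0 <= miss_ratio <= 1.0:
        raise ValueError("miss_ratio must lie in [0, 1]")
    return hit_time + miss_ratio * miss_penalty


# Hypothetical numbers: 1-cycle hit, 100-cycle miss penalty.
for ratio in (0.01, 0.05, 0.20):
    print(f"miss ratio {ratio:.2f}: AMAT = {amat(1, ratio, 100):.1f} cycles")
```

Design choices such as line size, replacement policy, and cache size show up in this formula mainly through the miss ratio, which is why the abstract lists them as aspects to investigate.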
TL;DR: In this paper, cache coherence is used instead of virtual memory for tracking applications' memory accesses transparently, at cache-line granularity, eliminating page faults from the application critical path when accessing remote data, and decoupling the application memory access tracking from the virtual memory page size.
Abstract: Disaggregated memory can address resource provisioning inefficiencies in current datacenters. Multiple software runtimes for disaggregated memory have been proposed in an attempt to make disaggregated memory practical. These systems rely on the virtual memory subsystem to transparently offer disaggregated memory to applications using a local memory abstraction. Unfortunately, using virtual memory for disaggregation has multiple limitations, including high overhead that comes from the use of page faults to identify what data to fetch and cache locally, and high dirty data amplification that comes from the use of page-granularity for tracking changes to the cached data (4KB or higher). In this paper, we propose a fundamentally new approach to designing software runtimes for disaggregated memory that addresses these limitations. Our main observation is that we can use cache coherence instead of virtual memory for tracking applications' memory accesses transparently, at cache-line granularity. This simple idea (1) eliminates page faults from the application critical path when accessing remote data, and (2) decouples the application memory access tracking from the virtual memory page size, enabling cache-line granularity dirty data tracking and eviction. Using this observation, we implemented a new software runtime for disaggregated memory that improves average memory access time by 1.7-5X and reduces dirty data amplification by 2-10X, compared to state-of-the-art systems.
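The effect of tracking granularity on dirty data amplification can be illustrated with simple arithmetic. The sketch below assumes amplification means bytes written back divided by bytes actually modified, and assumes a 64-byte cache line; only the 4KB page size comes from the abstract. The function names are hypothetical and this is not the paper's runtime.

```python
PAGE_SIZE = 4096   # from the abstract ("4KB or higher")
LINE_SIZE = 64     # assumption: a common cache-line size


def written_back_bytes(dirty_addresses, granule):
    """Bytes evicted when dirtiness is tracked per `granule`-byte unit."""
    dirty_units = {addr // granule for addr in dirty_addresses}
    return len(dirty_units) * granule


def amplification(dirty_addresses, granule):
    """Bytes written back per byte actually modified (assumed definition)."""
    modified = len(set(dirty_addresses))
    return written_back_bytes(dirty_addresses, granule) / modified


# Three single-byte writes, each landing on a different page.
writes = [0, 4096, 8192]
print("page-granularity bytes:", written_back_bytes(writes, PAGE_SIZE))
print("line-granularity bytes:", written_back_bytes(writes, LINE_SIZE))
```

For sparse writes like these, the finer granule sends far fewer bytes back to remote memory, which is the effect the abstract attributes to cache-line granularity tracking.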
TL;DR: The skewing scheme evaluated here does not eliminate all memory conflicts but it does improve the average performance of vector access over interleaved systems for a wide range of strides.
Abstract: The degree to which high-speed vector processors approach their peak performance levels is closely tied to the amount of interference they encounter while accessing vectors in memory. In this paper we present an evaluation of a storage scheme that reduces the average memory access time in a vector-oriented architecture. A skewing scheme is used to map vector components into parallel memory modules such that, for most vector access patterns, the number of memory conflicts is reduced over that observed in interleaved parallel memory systems. Address and data buffers are used locally in each module so that transient nonuniformities which occur in some access patterns do not degrade performance. Previous investigations into skewing techniques have attempted to provide conflict-free access for a limited subset of access patterns. The goal of this investigation is different. The skewing scheme evaluated here does not eliminate all memory conflicts but it does improve the average performance of vector access over interleaved systems for a wide range of strides. It is shown that little extra hardware is required to implement the skewing scheme. Also, far fewer restrictions are placed on the number of memory modules in the system than are present in other proposed schemes.
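The difference between interleaved and skewed storage can be sketched with a toy module-mapping function. The abstract does not give the paper's skewing function, so the code below uses a simple textbook row skew as a stand-in; the mapping, the module count, and all names are assumptions for illustration only.

```python
from collections import Counter


def interleaved(addr, modules):
    """Low-order interleaving: consecutive addresses go to consecutive modules."""
    return addr % modules


def skewed(addr, modules):
    """Illustrative row skew: each row of `modules` addresses is rotated by one."""
    return (addr + addr // modules) % modules


def max_module_load(mapping, modules, stride, length):
    """Largest number of vector elements that land in a single module."""
    hits = Counter(mapping(i * stride, modules) for i in range(length))
    return max(hits.values())


M = 8
for stride in (1, 8, 9):
    print(stride,
          max_module_load(interleaved, M, stride, M),
          max_module_load(skewed, M, stride, M))
```

With eight modules, a stride of 8 sends every element to one module under interleaving but spreads them across all modules under this skew. A stride of 9 still produces collisions under the skew, which matches the abstract's point that skewing improves the average case without eliminating all conflicts.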
TL;DR: This work considers a practical system architecture where the distributed LLC and the NoC share a voltage/frequency domain that is separate from the core domain, and proposes an average memory access time (AMAT)-based monitoring technique that is integrated with DVFS based on PID control theory.
Abstract: In chip design today and for the foreseeable future, the last-level cache and on-chip interconnect are not only performance critical but also substantial power consumers. This work focuses on employing dynamic voltage and frequency scaling (DVFS) policies for networks-on-chip (NoC) and shared, distributed last-level caches (LLC). In particular, we consider a practical system architecture where the distributed LLC and the NoC share a voltage/frequency domain that is separate from the core domain. This architecture enables the control of the relative speed between the cores and memory hierarchy without introducing synchronization delays within the NoC. DVFS for this architecture is more complex than individual link/core-based DVFS since it involves spatially distributed monitoring and control. We propose an average memory access time (AMAT)-based monitoring technique and integrate it with DVFS based on PID control theory. Simulations on PARSEC benchmarks yield 27% energy savings with a negligible impact on system performance.
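A minimal sketch of how an AMAT measurement could drive a PID-controlled frequency setting is shown below. The abstract does not describe the controller's gains, units, or update rule, so the class, the function `next_frequency`, the gain values, and the frequency limits are all hypothetical; this illustrates generic discrete PID control, not the authors' design.

```python
class PID:
    """Plain discrete PID controller (gains are illustrative assumptions)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error, dt=1.0):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def next_frequency(freq, measured_amat, target_amat, pid, f_min, f_max):
    """Raise the NoC/LLC frequency when AMAT exceeds the target, lower it otherwise."""
    error = measured_amat - target_amat  # positive: memory hierarchy is too slow
    freq += pid.update(error)
    return min(f_max, max(f_min, freq))  # clamp to the allowed range


# Hypothetical units: frequency in GHz, AMAT in core cycles.
pid = PID(kp=0.1, ki=0.0, kd=0.0)
print(next_frequency(1.0, measured_amat=12.0, target_amat=10.0,
                     pid=pid, f_min=0.5, f_max=2.0))
```

Using AMAT as the controlled variable ties the shared domain's frequency to what the cores actually experience, rather than to local link or cache utilization.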
TL;DR: A worst case timing analysis for direct mapped data caches is proposed that classifies memory accesses as predictable or unpredictable, together with a novel analysis framework that tightly bounds the impact of unpredictable memory accesses on the existing cache contents as well as the cache behavior of the unpredictable memory accesses themselves.
Abstract: Data caches significantly reduce the average memory access time and are necessary for an efficient design. Due to their direct dependency on input data, it is difficult to predict their worst case timing behavior, which is crucial for a reliable system. While simulation is too time-consuming, current worst case execution time approaches focus on instruction caches only. Current approaches to data cache analysis restrict cache behavior to predictable data accesses or classify input dependent memory accesses as non-cacheable. In this paper we propose a worst case timing analysis for direct mapped data caches that classifies memory accesses as predictable or unpredictable. For unpredictable memory accesses, a novel analysis framework is proposed that tightly bounds the impact on the existing cache contents as well as cache behavior of unpredictable memory accesses themselves. For predictable memory accesses, we use a local cache simulation and dataflow techniques. Furthermore, we describe an implementation of the analysis framework. Several experiments demonstrate its applicability. The approach targets real-time software verification but is also useful for design space exploration.
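The predictable/unpredictable split can be illustrated with a deliberately simplified bound for a direct mapped cache. In the sketch below, a predictable access has one known address and is simulated exactly, while an unpredictable access is given as a set of candidate addresses, is always counted as a miss, and marks every line it might touch as unknown. This is a coarse conservative stand-in, not the paper's analysis framework (which claims tighter bounds); the cache geometry and all names are assumptions.

```python
LINE_SIZE = 16   # bytes per cache line (assumption)
NUM_LINES = 4    # lines in the direct mapped cache (assumption)


def block(addr):
    return addr // LINE_SIZE


def line_index(addr):
    return block(addr) % NUM_LINES


def worst_case_misses(accesses):
    """Upper bound on misses for a sequence of (kind, candidate_addresses)."""
    cache = [None] * NUM_LINES  # None means the line's contents are unknown
    misses = 0
    for kind, addrs in accesses:
        if kind == "predictable":
            (addr,) = addrs  # exactly one known address
            idx = line_index(addr)
            if cache[idx] != block(addr):
                misses += 1  # not guaranteed to be cached: assume a miss
            cache[idx] = block(addr)
        else:
            misses += 1  # address unknown: assume the access itself misses
            for addr in addrs:
                cache[line_index(addr)] = None  # may have been evicted
    return misses


trace = [
    ("predictable", [0]),
    ("predictable", [0]),
    ("unpredictable", [0, 64]),  # both candidates map to line 0
    ("predictable", [0]),
]
print(worst_case_misses(trace))
```

An unpredictable access whose candidates map only to other lines leaves a previously loaded block guaranteed in the cache, so later predictable accesses to it still count as hits.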