TL;DR: In this article, when the cache contents and the user demands are fixed, the authors connect the caching problem to an index coding problem and show the optimality of the MAN scheme under the conditions that the cache placement phase is restricted to be uncoded (i.e., pieces of the files can only copied into the user's cache).
Abstract: Caching is an effective way to reduce peak-hour network traffic congestion by storing some contents at user's local cache. Maddah-Ali and Niesen (MAN) initiated a fundamental study of caching systems by proposing a scheme (with uncoded cache placement and linear network coding delivery) that is provably optimal to within a factor 4.7. In this paper, when the cache contents and the user demands are fixed, we connect the caching problem to an index coding problem and show the optimality of the MAN scheme under the conditions that (i) the cache placement phase is restricted to be uncoded (i.e, pieces of the files can only copied into the user's cache), and (ii) the number of users is no more than the number of files. As a consequence, further improvements to the MAN scheme are only possible through the use of coded cache placement.
TL;DR: This paper will present the first set of shared cache QoS techniques designed and implemented in state-of-the-art commercial servers (the Intel® Xeon® processor E5-2600 v3 product family) and describe two key technologies: Cache Monitoring Technology (CMT) to enable monitoring of shared caches usage by different applications and Cache Allocation Technology(CAT) which enables redistribution of shared Cache space between applications to address contention.
Abstract: Over the last decade, addressing quality of service (QoS) in multi-core server platforms has been growing research topic. QoS techniques have been proposed to address the shared resource contention between co-running applications or virtual machines in servers and thereby provide better isolation, performance determinism and potentially improve overall throughput. One of the most important shared resources is cache space. Most proposals for addressing shared cache contention are based on simulations and analysis and no commercial platforms were available that integrated such techniques and provided a practical solution. In this paper, we will present the first set of shared cache QoS techniques designed and implemented in state-of-the-art commercial servers (the Intel® Xeon® processor E5-2600 v3 product family). We will describe two key technologies: (i) Cache Monitoring Technology (CMT) to enable monitoring of shared cache usage by different applications and (ii) Cache Allocation Technology (CAT) which enables redistribution of shared cache space between applications to address contention. This is the first paper to describing these techniques as they moved from concept to reality, starting from early research to product implementation. We will also present case studies highlighting the value of these techniques using example scenarios of multi-programmed workloads, virtualized platforms in datacenters and communications platforms. Finally, we will describe initial software infrastructure and enabling for industry practitioners and researchers to take advantage of these technologies for their QoS needs.
TL;DR: An improved design of Newcache is presented, in terms of security, circuit design and simplicity, and it is shown that Newcache can be used as L1 data and instruction caches to improve security without impacting performance.
Abstract: Newcache is a secure cache that can thwart cache side-channel attacks to prevent the leakage of secret information All caches today are susceptible to cache side-channel attacks, despite software isolation of memory pages in virtual address spaces or virtual machines These cache attacks can leak secret encryption keys or private identity keys, nullifying any protection provided by strong cryptography Newcache uses a novel dynamic, randomized memory-to-cache mapping to thwart contention-based side-channel attacks, rather than the static mapping used by conventional set-associative caches In this article, the authors present an improved design of Newcache, in terms of security, circuit design and simplicity They show Newcache's security against a suite of cache side-channel attacks They evaluate Newcache's system performance for cloud computing, smartphone, and SPEC benchmarks and find that Newcache performs as well as conventional set-associative caches, and sometimes better They also designed a VLSI test chip with a 32-Kbyte Newcache and a 32-Kbyte, eight-way, set-associative cache and verified that the access latency, power, and area of the two caches are comparable These results show that Newcache can be used as L1 data and instruction caches to improve security without impacting performance
TL;DR: The proposed SecDCP scheme changes the size of cache partitions at run time for better performance while preventing insecure information leakage between processes, and improves performance by up to 43% and by an average of 12.5% over static cache partitioning.
Abstract: In today's multicore processors, the last-level cache is often shared by multiple concurrently running processes to make efficient use of hardware resources. However, previous studies have shown that a shared cache is vulnerable to timing channel attacks that leak confidential information from one process to another. Static cache partitioning can eliminate the cache timing channels but incurs significant performance overhead. In this paper, we propose Secure Dynamic Cache Partitioning (SecDCP), a partitioning technique that defeats cache timing channel attacks. The SecDCP scheme changes the size of cache partitions at run time for better performance while preventing insecure information leakage between processes. For cache-sensitive multiprogram workloads, our experimental results show that SecDCP improves performance by up to 43% and by an average of 12.5% over static cache partitioning.
TL;DR: It is proved that the gap between the hit rate achieved by Trend-Caching and that by the optimal caching policy with hindsight is sublinear in the number of video requests, thereby guaranteeing both fast convergence and asymptotically optimal cache hit rate.
Abstract: This paper presents Trend-Caching, a novel cache replacement method that optimizes cache performance according to the trends of video content. Trend-Caching explicitly learns the popularity trend of video content and uses it to determine which video it should store and which it should evict from the cache. Popularity is learned in an online fashion and requires no training phase, hence it is more responsive to continuously changing trends of videos. We prove that the learning regret of Trend-Caching (i.e., the gap between the hit rate achieved by Trend-Caching and that by the optimal caching policy with hindsight) is sublinear in the number of video requests, thereby guaranteeing both fast convergence and asymptotically optimal cache hit rate. We further validate the effectiveness of Trend-Caching by applying it to a movie.douban.com dataset that contains over 38 million requests. Our results show significant cache hit rate lift compared to existing algorithms, and the improvements can exceed 40% when the cache capacity is limited. Furthermore, Trend-Caching has low complexity.
TL;DR: Cliffhanger is developed, a lightweight iterative algorithm that runs on memory cache servers, which incrementally optimizes the resource allocations across and within applications based on dynamically changing workloads and designs a novel technique for dealing with performance cliffs incrementally and locally.
Abstract: Web-scale applications are heavily reliant on memory cache systems such as Memcached to improve throughput and reduce user latency. Small performance improvements in these systems can result in large end-to-end gains. For example, a marginal increase in hit rate of 1% can reduce the application layer latency by over 35%. However, existing web cache resource allocation policies are workload oblivious and first-come-first-serve. By analyzing measurements from a widely used caching service, Memcachier, we demonstrate that existing cache allocation techniques leave significant room for improvement. We develop Cliffhanger, a lightweight iterative algorithm that runs on memory cache servers, which incrementally optimizes the resource allocations across and within applications based on dynamically changing workloads. It has been shown that cache allocation algorithms underperform when there are performance cliffs, in which minor changes in cache allocation cause large changes in the hit rate. We design a novel technique for dealing with performance cliffs incrementally and locally. We demonstrate that for the Memcachier applications, on average, Cliffhanger increases the overall hit rate 1.2%, reduces the total number of cache misses by 36.7% and achieves the same hit rate with 45% less memory capacity.
TL;DR: Both the fundamental performance limits of cache networks and the practical challenges that need to be overcome in real-life scenarios are discussed.
Abstract: Caching is an essential technique to improve throughput and latency in a vast variety of applications. The core idea is to duplicate content in memories distributed across the network, which can then be exploited to deliver requested content with less congestion and delay. The traditional role of cache memories is to deliver the maximal amount of requested content locally rather than from a remote server. While this approach is optimal for single-cache systems, it has recently been shown to be significantly suboptimal for systems with multiple caches (i.e., cache networks). Instead, cache memories should be used to enable a coded multicasting gain. In this article, we survey these recent developments. We discuss both the fundamental performance limits of cache networks and the practical challenges that need to be overcome in real-life scenarios.
TL;DR: This article provides a survey on static cache analysis for real-time systems, presenting the challenges and static analysis techniques for independent programs with respect to different cache features, followed by a survey of existing tools based on static techniques for cache analysis.
Abstract: Real-time systems are reactive computer systems that must produce their reaction to a stimulus within given time bounds. A vital verification requirement is to estimate the Worst-Case Execution Time (WCET) of programs. These estimates are then used to predict the timing behavior of the overall system. The execution time of a program heavily depends on the underlying hardware, among which cache has the biggest influence. Analyzing cache behavior is very challenging due to the versatile cache features and complex execution environment. This article provides a survey on static cache analysis for real-time systems. We first present the challenges and static analysis techniques for independent programs with respect to different cache features. Then, the discussion is extended to cache analysis in complex execution environment, followed by a survey of existing tools based on static techniques for cache analysis. An outlook for future research is provided at last.
TL;DR: It is found that the only way to achieve both isolation-guarantee and strategy-proofness is through blocking, which is efficiently adapted in a new policy called FairRide, which can lead to better cache efficiency and fairness in many scenarios.
Abstract: Memory caches continue to be a critical component to many systems. In recent years, there has been larger amounts of data into main memory, especially in shared environments such as the cloud. The nature of such environments requires resource allocations to provide both performance isolation for multiple users/applications and high utilization for the systems. We study the problem of fair allocation of memory cache for multiple users with shared files. We find that, surprisingly, no memory allocation policy can provide all three desirable properties (isolation-guarantee, strategy-proofness and Pareto-efficiency) that are typically achievable by other types of resources, e.g., CPU or network. We also show that there exist policies that achieve any two of the three properties. We find that the only way to achieve both isolation-guarantee and strategy-proofness is through blocking, which we efficiently adapt in a new policy called FairRide. We implement FairRide in a popular memory-centric storage system using an efficient form of blocking, named as expected delaying, and demonstrate that FairRide can lead to better cache efficiency (2.6× over isolated caches) and fairness in many scenarios.
TL;DR: STT-MRAM-based last level cache memory (LLC) is presented, using novel power optimization with high-speed power gating (HS-PG), considering processor architectures and cache memory accesses and has high reliability to reduce the write-error rate with novel write-verify-write.
Abstract: Two performance gaps in the memory hierarchy, between CPU cache and main memory, and main memory and mass storage, will become increasingly severe bottlenecks for computing-system performance. Although it is necessary to increase memory capacity to fill these gaps, power also increases when conventional volatile memories are used. A new nonvolatile memory for this purpose has been anticipated. Storage class memory is used to fill the second gap. Many candidates exist: ReRAM, PRAM, and 3D-cross point type with resistive change RAM. However, nonvolatile last level cache (LLC) is used to fill the first gap. Advanced STT-MRAM has achieved sub-4ns read and write accesses with perpendicular magnetic tunnel junctions (p-MTJ) [1–2]. Furthermore, mature integration processes have been developed and 8Mb STT-MRAM with sub-5ns operation has shown high reliability [3]. Moreover, because of its non-volatility, STT-MRAM can reduce operation energy by more than 81% compared to SRAM for cache [1]. This paper presents STT-MRAM-based last level cache memory (LLC) including MRAM memory core, peripherals and cache logic circuits, using novel power optimization with high-speed power gating (HS-PG),considering processor architectures and cache memory accesses. The STT-MRAM-based cache has high reliability to reduce the write-error rate with novel write-verify-write. Furthermore, a read-modify-write scheme is implemented to reduce active power without penalty. Figure 7.2.1 presents a block diagram of a 4Mb STT-MRAM based cache.
TL;DR: The paper proposes a new cache demand model, Reuse Working Set (RWS), to capture only the data with good temporal locality, and uses the RWS size (RWSS) to model a workload's cache demand.
Abstract: Host-side flash caching has emerged as a promising solution to the scalability problem of virtual machine (VM) storage in cloud computing systems, but it still faces serious limitations in capacity and endurance. This paper presents CloudCache, an on-demand cache management solution to meet VM cache demands and minimize cache wear-out. First, to support on-demand cache allocation, the paper proposes a new cache demand model, Reuse Working Set (RWS), to capture only the data with good temporal locality, and uses the RWS size (RWSS) to model a workload's cache demand. By predicting the RWSS online and admitting only RWS into the cache, CloudCache satisfies the workload's actual cache demand and minimizes the induced wear-out. Second, to handle situations where a cache is insufficient for the VMs' demands, the paper proposes a dynamic cache migration approach to balance cache load across hosts by live migrating cached data along with the VMs. It includes both on-demand migration of dirty data and background migration of RWS to optimize the performance of the migrating VM. It also supports rate limiting on the cache data transfer to limit the impact to the co-hosted VMs. Finally, the paper presents comprehensive experimental evaluations using real-world traces to demonstrate the effectiveness of CloudCache.
TL;DR: A new probabilistic cache model designed for high-performance replacement policies is presented that uses absolute reuse distances instead of stack distances, and models replacement policies as abstract ranking functions, which let us model arbitrary age-based replacement policies.
Abstract: Modern processors use high-performance cache replacement policies that outperform traditional alternatives like least-recently used (LRU). Unfortunately, current cache models do not capture these high-performance policies as most use stack distances, which are inherently tied to LRU or its variants. Accurate predictions of cache performance enable many optimizations in multicore systems. For example, cache partitioning uses these predictions to divide capacity among applications in order to maximize performance, guarantee quality of service, or achieve other system objectives. Without an accurate model for high-performance replacement policies, these optimizations are unavailable to modern processors. We present a new probabilistic cache model designed for high-performance replacement policies. It uses absolute reuse distances instead of stack distances, and models replacement policies as abstract ranking functions. These innovations let us model arbitrary age-based replacement policies. Our model achieves median error of less than 1% across several high-performance policies on both synthetic and SPEC CPU2006 benchmarks. Finally, we present a case study showing how to use the model to improve shared cache performance.
TL;DR: This paper proposes a novel data-aware hybrid STT-RAM/SRAM cache architecture which stores data in the two partitions based on their bit counts and employs an asymmetric low-power 5T-SRAM structure which has high reliability for majority `one' data.
Abstract: Static Random Access Memories (SRAMs) occupy a large area of today's microprocessors, and are a prime source of leakage power in highly scaled technologies. Low leakage and high density Spin-Transfer Torque RAMs (STT-RAMs) are ideal candidates for a power-efficient memory. However, STT-RAM suffers from high write energy and latency, especially when writing ‘one’ data. In this paper we propose a novel data-aware hybrid STT-RAM/SRAM cache architecture which stores data in the two partitions based on their bit counts. To exploit the new resultant data distribution in the SRAM partition, we employ an asymmetric low-power 5T-SRAM structure which has high reliability for majority ‘one’ data. The proposed design significantly reduces the number of writes and hence dynamic energy in both STT-RAM and SRAM partitions. We employed a write cache policy and a small swap memory to control data migration between cache partitions. Our evaluation on UltraSPARC-III processor shows that utilizing STT-RAM/6T-SRAM and STT-RAM/5T-SRAM architectures for the L2 cache results in 42% and 53% energy efficiency, 9.3% and 9.1% performance improvement and 16.9% and 20.3% area efficiency respectively, with respect to SRAM-based cache running SPEC CPU 2006 benchmarks.
TL;DR: This paper proposes a utility-driven cache partitioning approach to cache resource allocation among multiple content providers, where each provider is associated with each content provider a utility that is a function of the hit rate to its content.
Abstract: In-network cache deployment is recognized as an effective technique for reducing content access delay. Caches serve content from multiple content providers, and wish to provide them differentiated services due to monetary incentives and legal obligations. Partitioning is a common approach in providing differentiated storage services. In this paper, we propose a utility-driven cache partitioning approach to cache resource allocation among multiple content providers, where we associate with each content provider a utility that is a function of the hit rate to its content. A cache is partitioned into slices with each partition being dedicated to a particular content provider. We formulate an optimization problem where the objective is to maximize the sum of weighted utilities over all content providers through proper cache partitioning, and mathematically show its convexity. We also give a formal proof that partitioning the cache yields better performance compared to sharing it. We validate the effectiveness of cache partitioning through numerical evaluations, and investigate the impact of various factors (e.g., content popularity, request rate) on the hit rates observed by contending content providers.
TL;DR: A cache memory has cache memory circuitry comprising a nonvolatile memory cell to store at least a portion of a data which is stored or is to be stored in a lower-level memory than the cache memory as mentioned in this paper.
Abstract: A cache memory has cache memory circuitry comprising a nonvolatile memory cell to store at least a portion of a data which is stored or is to be stored in a lower-level memory than the cache memory circuitry, a first redundancy code storage comprising a nonvolatile memory cell capable of storing a redundancy code of the data stored in the cache memory circuitry, and a second redundancy code storage comprising a volatile memory cell capable of storing the redundancy code.
TL;DR: The Bunker Cache is proposed, a design that maps similar data to the same cache storage location based solely on their memory address, sacrificing some application quality loss for greater efficiency.
Abstract: The cost of moving and storing data is still a fundamental concern for computer architects. Inefficient handling of data can be attributed to conventional architectures being oblivious to the nature of the values that these data bits carry. We observe the phenomenon of spatio-value similarity, where data elements that are approximately similar in value exhibit spatial regularity in memory. This is inherent to 1) the data values of real-world applications, and 2) the way we store data structures in memory. We propose the Bunker Cache, a design that maps similar data to the same cache storage location based solely on their memory address, sacrificing some application quality loss for greater efficiency. The Bunker Cache enables performance gains (ranging from 1.08x to 1.19x) via reduced cache misses and energy savings (ranging from 1.18x to 1.39x) via reduced off-chip memory accesses and lower cache storage requirements. The Bunker Cache requires only modest changes to cache indexing hardware, integrating easily into commodity systems.
TL;DR: This paper yields a survey of current generation processors on the basis of various factors effecting cache memory performance and throughput and the main focus of this paper is the study and performance analysis of the cache replacement policies on the based of simulation on several benchmarks.
Abstract: Memory hierarchy in current generation computers is formed by keeping registers inside, cache on or outside the processor and virtual memory on Hard disk. The principle of locality of reference is used to make memory hierarchy work efficiently. In recent years various advances have been made to improve the cache memory performance on the basis of hit rate, latency, speed, replacement policies and energy consumption. Cache replacement policy is one of the important design parameter which affects the overall processor performance and also become more important with recent technological moves towards highly associative cache. This paper yields a survey of current generation processors on the basis of various factors effecting cache memory performance and throughput. The main focus of this paper is the study and performance analysis of the cache replacement policies on the basis of simulation on several benchmarks.
TL;DR: A new rootkit called CacheKit is developed that hides in the cache of the normal world and is able to evade memory introspection from the secure world and has small performance impacts on the rich OS.
Abstract: With the growing importance of networked embedded devices in the upcoming Internet of Things, new attacks targeting embedded OSes are emerging. ARM processors, which power over 60% of embedded devices, introduce a hardware security extension called TrustZone to protect secure applications in an isolated secure world that cannot be manipulated by a compromised OS in the normal world. LeveragingTrustZone technology, a number of memory integrity checking schemes have been proposed in the secure world to introspect malicious memory modification of the normal world. In this paper, we first discover and verify an ARM TrustZone cache incoherence behavior, which results in the cache contents of the two worlds, secure and non-secure, potentially being different even when they are mapped to the same physical address. Furthermore, code in one TrustZone world cannot access the cache content in the other world. Based on this observation, we develop a new rootkit called CacheKit that hides in the cache of the normal world and is able to evade memory introspection from the secure world. We implement a CacheKit prototype on Cortex-A8 processors after solving a number of challenges. First, we employ the Cache-as-RAM technique to ensure that the malicious code is only loaded into the CPU cache and not RAM. Thus, the secure world cannot detect the existence of the malicious code by examining the RAM. Second, we use the ARM processor's hardware support on cache settings to keep the malicious code persistent in the cache. Third, to evade introspection that flushes cache content back into RAM, we utilize physical addresses from the I/O address range that is not backed by any real I/O devices or RAM. The experimental results show that CacheKit can successfully evade memory introspection from the secure world and has small performance impacts on the rich OS. We discuss potential countermeasures to detect this type of rootkit attack.
TL;DR: In this paper, the authors proposed replacement and checkpoint policies for SRAM and NVM based hybrid cache in NVPs whose execution is interrupted frequently, and the experimental results show that the proposed architectures and polices outperform existing cache architectures for NVP.
Abstract: Energy harvesting is one of the most promising battery alternatives to power future generation embedded systems in Internet of Things (IoT). However, energy harvesting powered embedded systems suffer from frequent execution interruption due to unstable energy supply. To bridge intermittent program execution across different power cycles, non-volatile processor (NVP) was proposed to checkpoint register contents during power failure. Together with register contents, the cache contents also need to be preserved during power failure. While pure non-volatile memory (NVM) based cache is an intuitive option, it suffers from inferior performance due to high write latency and energy overhead. In this paper, we will propose replacement and checkpoint policies for SRAM and NVM based hybrid cache in NVPs whose execution is interrupted frequently. Checkpointing aware cache replacement polices and smart checkpointing polices are proposed to achieve satisfactory performance and efficient checkpointing upon a power failure and fast resumption when power returns. The experimental results show that the proposed architectures and polices outperform existing cache architectures for NVPs.
TL;DR: This paper presents a new buffer cache architecture that adopts a small amount of non-volatile memory to maintain modified data and improves the buffer cache performance through space-efficient management techniques, such as delta-write and fragment-grouping.
Abstract: File I/O buffer caching plays an important role to narrow the wide speed gap between main memory and secondary storage. However, data loss or inconsistency may occur if the system crashes before updated data in the buffer cache is flushed to storage. Thus, most operating systems adopt a daemon that periodically flushes dirty data to secondary storage. This periodic flush degrades the caching efficiency seriously because most write requests lead to direct storage accesses. We show that periodic flush accounts for 30-70 percent of the total write traffic to storage. To remove this inefficiency, this paper presents a new buffer cache architecture that adopts a small amount of non-volatile memory to maintain modified data. This novel buffer cache architecture removes almost all storage accesses due to periodic flush operations without any loss of reliability. It also improves the buffer cache performance through space-efficient management techniques, such as delta-write and fragment-grouping. Our experimental results show that the proposed buffer cache reduces the storage write traffic by 44.3 percent and also improves the throughput by 23.6 percent on average.
TL;DR: This work designs and implements a suite of algorithms to deduce the 128-bit AES key using as input the set of (unordered) cache line numbers captured by the spy threads in an access-driven cache-based side channel attack.
Abstract: Leakage of information between two processes sharing the same processor cache has been exploited in many novel approaches targeting various cryptographic algorithms. The software implementation of AES is an specially attractive target since it makes extensive use of cache-resident table lookups. We consider two attack scenarios where either the plaintext or ciphertext is known. We employ a multi-threaded spy process and ensure that each time slice provided to the victim (running AES) is small enough so that it makes a very limited number of table accesses. We design and implement a suite of algorithms to deduce the 128-bit AES key using as input the set of (unordered) cache line numbers captured by the spy threads in an access-driven cache-based side channel attack. Our algorithms are expressed using simple relational algebraic operations and run in under a minute. Above all, our attack is highly efficient -- we demonstrate recovery of the full AES key given only about 6 -- 7 blocks of plaintext or ciphertext (theoretically even a single block would suffice). This is a substantial improvement over previous cache-based side channel attacks that require between 100 and a million encryptions. Moreover, our attack supports varying cache hit/miss observation granularities, does not need frequent interruptions of the victim and will work even if the victim makes up to 60 cache accesses before being interrupted. Finally, we develop analytic models to estimate the number of encryptions/decryptions required as a function of access granularity and compare model results with those obtained from our experiments.
TL;DR: DISH is proposed, a dictionary based cache compression scheme that reduces cache space wastage and outperforms the compression schemes, such as BDI and CPACK+Z, in terms of compression ratio, system performance, and energy efficiency.
Abstract: The effectiveness of a compressed cache depends on three features: i) the compression scheme, ii) the compaction scheme, and iii) the cache layout of the compressed cache. Skewed compressed cache (SCC) and yet another compressed cache (YACC) are two recently proposed compressed cache layouts that feature minimal storage and latency overheads, while achieving comparable performance over more complex compressed cache layouts. Both SCC and YACC use compression techniques to compress individual cache blocks, and then a compaction technique to compact multiple contiguous compressed blocks into a single data entry. The primary attribute used by these techniques for compaction is the compression factor of the cache blocks, and in this process, they waste cache space. In this paper, we propose dictionary sharing (DISH), a dictionary based cache compression scheme that reduces this wastage. DISH compresses a cache block by keeping in mind that the block is a potential candidate for the compaction process. DISH encodes a cache block with a dictionary that stores the distinct 4-byte chunks of a cache block and the dictionary is shared among multiple neighboring cache blocks. The simple encoding scheme of DISH also provides a single cycle decompression latency and it does not change the cache layout of compressed caches. Compressed cache layouts that use DISH outperforms the compression schemes, such as BDI and CPACK+Z, in terms of compression ratio (from 1.7X to 2.3X), system performance (from 7.2% to 14.3%), and energy efficiency (from 6% to 16%).
TL;DR: The achievable rate region is characterized as a function of the memory sizes and the erasure probabilities, and the proposed delivery scheme, based on the broadcasting scheme proposed by Wang and Gatzianas et al., exploits the receiver side information established during the placement phase.
Abstract: We study a content delivery problem in the context of a K-user erasure broadcast channel such that a content providing server wishes to deliver requested files to users, each equipped with a cache of a finite memory. Assuming that the transmitter has state feedback and user caches can be filled during off-peak hours reliably by decentralized cache placement, we characterize the achievable rate region as a function of the memory sizes and the erasure probabilities. The proposed delivery scheme, based on the broadcasting scheme proposed by Wang and Gatzianas et al., exploits the receiver side information established during the placement phase. Our results can be extended to centralized cache placement as well as multi-antenna broadcast channels with state feedback.
TL;DR: Simulation results of the proposed caching schemes impose moderate network overhead and show considerable improvement to the client's cache hit ratio, even under churn.
Abstract: This paper investigates cache placement on a cooperative cache built from individual client caches in an online social network or web service. We use a service that maintains a mapping between content and the clients that cache it, and propose cache placement schemes that leverage relationships between clients (for example, social links) and workload statistics, proactively placing content on clients that are likely to access it. We evaluate efficacy through simulation, comparing our schemes against commonly used cache placement algorithms as well as optimal placement. We synthesize a workload to match characteristics of online social networks. Simulation results of our proposed caching schemes impose moderate network overhead and show considerable improvement to the client's cache hit ratio, even under churn.
TL;DR: FusedCache is a two-level set-associative Racetrack memory (RM) cache design that utilizes RM's high density for providing fast uniform access at one level, and non-uniform access at the next, and provides similar performance with a dramatic 69 percent cache energy reduction.
Abstract: We propose FusedCache , a two-level set-associative Racetrack memory (RM) cache design that utilizes RM’s high density for providing fast uniform access at one level, and non-uniform access at the next. FusedCache is well suited for private L1/L2 caches enforcing alignment of L1 data with the RM access points with the remaining non-aligned data serving as L2. It uses traditional LRU eviction for L1 misses. Promotion and demotion between L1 and L2 are performed through shifts and, when necessary, background swap operations. These swap operations do not require physical stores or loads, making accesses both faster and more energy efficient. Further, unlike a traditional inclusive cache hierarchy, fused L1 cache lines naturally exist in L2 avoiding duplicated storage and tag structures, promotions, and evictions. L1 status on each track is strictly enforced by track LRU maintenance and background swapping. Our results demonstrate that compared to an iso-area L1 SRAM cache replacement, FusedCache improves application performance by 7 percent while reducing cache energy by 33 percent. Compared to an iso-capacity two level (L1/L2) SRAM cache replacement, FusedCache provides similar performance with a dramatic 69 percent cache energy reduction. Compared to a TapeCache L1 scheme, FusedCache gains a 7 percent performance improvement with a 6 percent cache energy saving which translates to a 13 percent improvement in energy-delay product.
TL;DR: In this article, the authors present a system where a cache has one or more memories and a metadata service, which is able to receive requests for data stored in the cache from a first client and from a second client.
Abstract: In one embodiment, a computer system includes a cache having one or more memories and a metadata service. The metadata service is able to receive requests for data stored in the cache from a first client and from a second client. The metadata service is further able to determine whether the performance of the cache would be improved by relocating the data stored in the cache. The metadata service is further operable to relocate the data stored in the cache when such relocation would improve the performance of the cache.
TL;DR: This work applies the Flush+Reload and Prime+Probe methods to mount cache side-channel attacks on a popular OpenSSL implementation of AES and demonstrates the effectiveness of the attack across co-located instances on the Amazon EC2 cloud.
Abstract: Cache based attacks can overcome software-level isolation techniques to recover cryptographic keys across VM-boundaries. Therefore, cache attacks are believed to pose a serious threat to public clouds. In this work, we investigate the effectiveness of cache attacks in such scenarios. Specifically, we apply the Flush+Reload and Prime+Probe methods to mount cache side-channel attacks on a popular OpenSSL implementation of AES. The attacks work across cores in the cross-VM setting and succeeds to recover the full encryption keys in a short time—suggesting a practical threat to real-life systems. Our results show that there is strong information leakage through cache in virtualized systems and the software implementations of AES must be approached with caution. Indeed, for the first time, we demonstrate the effectiveness of the attack across co-located instances on the Amazon EC2 cloud. We argue that for secure usage of world's most commonly used block cipher such as AES, one should rely on secure, constant-time hardware implementations offered by CPU vendors.
TL;DR: Recent research activities for caching networks in content oriented networks are surveyed, with focusing on important factors which affect caching network performance, i.e. content request routing, caching decision, and replacement policy of cache.
Abstract: Content oriented network is expected to be one of the most promising approaches for resolving design concept difference between content oriented network services and location oriented architecture of current network infrastructure. There have been proposed several content oriented network architectures, but research efforts for content oriented networks have just started and technical issues to be resolved are still remained. Because of content oriented feature, content data transmitted in a network can be reused by content requests from other users. Pervasive cache is one of the most important benefits brought by the content oriented network architecture, which forms interconnected caching networks. Caching network is the hottest research area and lots of research activities have been published. This paper surveys recent research activities for caching networks in content oriented networks, with focusing on important factors which affect caching network performance, i.e. content request routing, caching decision, and replacement policy of cache. And this paper also discusses future direction of caching network researches. key words: cache networks, content oriented networks, in-network cache
TL;DR: This article proposes the Blurred Persistence mechanism, a mechanism to reduce the transaction overhead of persistent memory by blurring the volatility-persistence boundary and improves cache efficiency.
Abstract: Persistent memory provides data durability in main memory and enables memory-level storage systems. To ensure consistency of such storage systems, memory writes need to be transactional and are carefully moved across the boundary between the volatile CPU cache and the persistent main memory. Unfortunately, cache management in the CPU cache is hardware-controlled. Legacy transaction mechanisms, which are designed for disk-based storage systems, are inefficient in ordered data persistence of transactions in persistent memory. In this article, we propose the Blurred Persistence mechanism to reduce the transaction overhead of persistent memory by blurring the volatility-persistence boundary. Blurred Persistence consists of two techniques. First, Execution in Log executes a transaction in the log to eliminate duplicated data copies for execution. It allows persistence of the volatile uncommitted data, which are detectable with reorganized log structure. Second, Volatile Checkpoint with Bulk Persistence allows the committed data to aggressively stay volatile by leveraging the data durability in the log, as long as the commit order across threads is kept. By doing so, it reduces the frequency of forced persistence and improves cache efficiency. Evaluations show that our mechanism improves system performance by 56.3p to 143.7p for a variety of workloads.
TL;DR: A cache contention aware VM placement and migration algorithm (CacheVM) that estimates the total cache contention degree of co-locating a given VM with a group of VMs in a PM based on the cache stack distance profiles of the VMs and chooses the VM from a PM that generates the maximum cache contentiondegree to migrate out.
Abstract: In cloud datacenters, multiple Virtual Machines (VMs) are co-located in a Physical Machine (PM) to serve different applications. Prior VM consolidation methods for cloud datacenters schedule VMs mainly based on resource (CPU and memory) constraints in PMs but neglect serious shared Last Level cache contention between VMs. This may cause severe VM performance degradation due to cache thrashing and starvation for VMs. Current cache contention aware VM consolidation strategies either estimate cache contention by coarse VM classification for each individual VM without considering co-location and (or) require the aid of hardware to monitor the online miss rate of each VM. Therefore, these strategies are insufficiently accurate and (or) difficult to adopt for VM consolidation in clouds. In this paper, we formalize the problem of cache contention aware VM placement and migration in cloud datacenters using integer linear programming. We then propose a cache contention aware VM placement and migration algorithm (CacheVM). It estimates the total cache contention degree of co-locating a given VM with a group of VMs in a PM based on the cache stack distance profiles of the VMs. Then, it places the VM to the PM with the minimum cache contention degree and chooses the VM from a PM that generates the maximum cache contention degree to migrate out. We implemented CacheVM and its comparison methods on a supercomputing cluster. Trace-driven simulation and real-testbed experiments show that CacheVM outperforms other methods in terms of the number of cache misses, execution time and throughput.