About: eDRAM is a research topic. Over its lifetime, 416 publications have been published within this topic, receiving 7782 citations. The topic is also known as embedded DRAM and embedded dynamic random-access memory.
TL;DR: This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner.
Abstract: A number of recent efforts have attempted to design accelerators for popular machine learning algorithms, such as those involving convolutional and deep neural networks (CNNs and DNNs). These algorithms typically involve a large number of multiply-accumulate (dot-product) operations. A recent project, DaDianNao, adopts a near data processing approach, where a specialized neural functional unit performs all the digital arithmetic operations and receives input weights from adjacent eDRAM banks. This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner. While the use of crossbar memory as an analog dot-product engine is well known, no prior work has designed or characterized a full-fledged accelerator based on crossbars. In particular, our work makes the following contributions: (i) We design a pipelined architecture, with some crossbars dedicated for each neural network layer, and eDRAM buffers that aggregate data between pipeline stages. (ii) We define new data encoding techniques that are amenable to analog computations and that can reduce the high overheads of analog-to-digital conversion (ADC). (iii) We define the many supporting digital components required in an analog CNN accelerator and carry out a design space exploration to identify the best balance of memristor storage/compute, ADCs, and eDRAM storage on a chip. On a suite of CNN and DNN workloads, the proposed ISAAC architecture yields improvements of 14.8×, 5.5×, and 7.5× in throughput, energy, and computational density (respectively), relative to the state-of-the-art DaDianNao architecture.
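The analog dot-product idea mentioned in this abstract can be illustrated with a minimal, idealized sketch. It assumes a noiseless crossbar in which each weight is stored as a cell conductance and each input is applied as a row voltage, so that each column current is the sum of voltage times conductance down that column. The function names, the uniform ADC model, and the numeric values are hypothetical and chosen only for illustration; this is not the ISAAC circuit or its data encoding.

```python
def crossbar_column_currents(voltages, conductances):
    """Ideal crossbar: column current I_j = sum_i V_i * G_ij.

    voltages:     list of row input voltages, one per row
    conductances: list of rows, each a list of cell conductances per column
    """
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    if len(voltages) != n_rows:
        raise ValueError("need exactly one input voltage per crossbar row")
    return [
        sum(voltages[i] * conductances[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]


def ideal_adc(current, full_scale, bits):
    """Ideal uniform ADC (assumed model): map [0, full_scale] to codes 0 .. 2**bits - 1."""
    levels = 2 ** bits - 1
    clipped = min(max(current, 0.0), full_scale)
    # Round half up to the nearest code.
    return int(clipped / full_scale * levels + 0.5)


# Illustrative values: three rows (inputs), two columns (output neurons).
inputs = [1.0, 0.0, 1.0]
weights = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
currents = crossbar_column_currents(inputs, weights)
codes = [ideal_adc(i, full_scale=15.0, bits=4) for i in currents]
```

In this sketch every column produces its dot product in one step, and the ADC is the only point where the analog result is converted back to a digital value, which is why the abstract singles out ADC overhead as a cost to reduce.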
TL;DR: The significant impact of variations on refresh time and cache power consumption for large eDRAM caches is shown and Hi-ECC, a technique that incorporates multi-bit error-correcting codes to significantly reduce refresh rate, is proposed.
Abstract: Technology advancements have enabled the integration of large on-die embedded DRAM (eDRAM) caches. eDRAM is significantly denser than traditional SRAMs, but must be periodically refreshed to retain data. Like SRAM, eDRAM is susceptible to device variations, which play a role in determining refresh time for eDRAM cells. Refresh power potentially represents a large fraction of overall system power, particularly during low-power states when the CPU is idle. Future designs need to reduce cache power without incurring the high cost of flushing cache data when entering low-power states. In this paper, we show the significant impact of variations on refresh time and cache power consumption for large eDRAM caches. We propose Hi-ECC, a technique that incorporates multi-bit error-correcting codes to significantly reduce refresh rate. Multi-bit error-correcting codes usually have a complex decoder design and high storage cost. Hi-ECC avoids the decoder complexity by using strong ECC codes to identify and disable sections of the cache with multi-bit failures, while providing efficient single-bit error correction for the common case. Hi-ECC includes additional optimizations that allow us to amortize the storage cost of the code over large data words, providing the benefit of multi-bit correction at the same storage cost as a single-bit error-correcting (SECDED) code (2% overhead). Our proposal achieves a 93% reduction in refresh power vs. a baseline eDRAM cache without error-correcting capability, and a 66% reduction in refresh power vs. a system using SECDED codes.
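The amortization argument in this abstract can be made concrete with a small arithmetic sketch. It assumes a standard Hamming-plus-parity construction for SECDED and uses the general bound that a binary BCH code correcting t errors needs at most m·t check bits, where m is the smallest integer such that the codeword fits in 2^m − 1 bits. The word sizes (64 bytes and 1 KB) and the choice t = 5 are assumptions made for illustration and are not stated in the abstract.

```python
def secded_check_bits(data_bits):
    """Check bits for a Hamming SEC code plus one overall parity bit for DED."""
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


def bch_check_bits_upper_bound(data_bits, t):
    """Upper bound m * t on check bits for a binary BCH code correcting t errors."""
    m = 1
    while 2 ** m - 1 < data_bits + m * t:
        m += 1
    return m * t


# Assumed word sizes for illustration: a 64-byte word and a 1 KB word.
small_word = 64 * 8
large_word = 1024 * 8

secded_bits = secded_check_bits(small_word)
multibit_bits = bch_check_bits_upper_bound(large_word, t=5)

secded_overhead = secded_bits / small_word
multibit_overhead = multibit_bits / large_word
```

Under these assumptions the check-bit count grows only logarithmically with word size, so spreading a multi-bit code over a much larger word keeps its relative storage cost at or below that of SECDED on a small word.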
TL;DR: The Intel Architecture core delivers higher power efficiency, higher frequency, and a wider dynamic power range, supporting smaller form factors, and offers a rich performance monitoring unit that enhances software developers' ability to optimize their applications.
Abstract: Skylake's core, processor graphics, and system on chip were designed to meet a demanding set of requirements for a wide range of power-performance points. Its coherent fabric was designed to provide high memory bandwidth from multiple memory sources. Skylake's power management, which includes Intel Speed Shift technology, was designed to provide the largest dynamic power range among prior Intel processors. The Intel Architecture core delivers higher power efficiency, higher frequency, and a wider dynamic power range, supporting smaller form factors. Skylake's Gen9 graphics provides new features designed to maximize energy efficiency and bring the best visual experience for gaming and media. Skylake offers a rich performance monitoring unit that enhances software developers' ability to optimize their applications.
TL;DR: In this article, a hierarchical differential sensing approach is employed in which an array of 1T DRAM cells is organized in rows and columns, where the rows represent words and the columns represent bits of the word, each bit column having more than one pair of balanced, true and complement local bit lines.
Abstract: In an embodiment of the present invention, a hierarchical differential sensing approach is effectuated wherein an array of 1T DRAM cells is organized in rows and columns, in which the rows represent words and the columns represent bits of the word, each bit column having more than one pair of balanced, true and complement local bit lines, the local bit lines being connected to a pair of balanced, true and complement global bit lines by way of CMOS transistor switches.
TL;DR: In this paper, the intrinsic charge-sharing operation during a dynamic memory access is used to perform analog CIM computations by reconfiguring existing eDRAM columns as charge-domain circuits, thus greatly minimizing peripheral circuit area and power overhead.
Abstract: The unprecedented growth in deep neural network (DNN) size has led to massive amounts of data movement from off-chip memory to on-chip processing cores in modern machine learning (ML) accelerators. Compute-in-memory (CIM) designs, which perform analog DNN computations within a memory array along with peripheral mixed-signal circuits, are being explored to mitigate this memory-wall bottleneck of memory latency and energy overhead. Embedded dynamic random-access memory (eDRAM) [1], [2], which integrates the 1T1C (T = transistor, C = capacitor) DRAM bitcell monolithically along with high-performance logic transistors and interconnects, can enable custom CIM designs. It offers the densest embedded bitcell, a low pJ/bit access energy, a low soft error rate, high endurance, high performance, and high bandwidth, all desired attributes for ML accelerators. In addition, the intrinsic charge-sharing operation during a dynamic memory access can be used effectively to perform analog CIM computations by reconfiguring existing eDRAM columns as charge-domain circuits, thus greatly minimizing peripheral circuit area and power overhead. Configuring a part of eDRAM as a CIM engine (for data conversion, DNN computations, and weight storage) and retaining the remaining part as a regular memory (for inputs, gradients during training, and non-CIM workload data) can help to meet the layer/kernel-dependent variable storage needs during a DNN inference/training step. Thus, the high cost/bit of eDRAM can be amortized by repurposing part of existing large-capacity, level-4 eDRAM caches [7] in high-end microprocessors into large-scale CIM engines.
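The charge-sharing operation this abstract relies on follows from charge conservation: when capacitors at different voltages are connected, the shared node settles at the capacitance-weighted average of the initial voltages. The sketch below models only that ideal relationship. The function name and the capacitance and voltage values are hypothetical, and it ignores leakage, switch resistance, and noise; it does not reproduce the circuit described in the paper.

```python
def charge_share(voltages, capacitances):
    """Final voltage after ideal charge sharing: sum(C_i * V_i) / sum(C_i)."""
    if len(voltages) != len(capacitances):
        raise ValueError("need one capacitance per voltage")
    total_charge = sum(v * c for v, c in zip(voltages, capacitances))
    total_capacitance = sum(capacitances)
    return total_charge / total_capacitance


# A conventional read, with assumed values: one cell capacitor charged to 1.0
# shares charge with a discharged bit line that has three times its capacitance.
read_voltage = charge_share([1.0, 0.0], [1.0, 3.0])

# Charge-domain accumulation, with assumed values: four equal capacitors holding
# binary partial products settle at their average, a scaled sum of the products.
accumulated = charge_share([1.0, 0.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0])
```

With equal capacitors the settled voltage is proportional to the number of cells holding a one, which is the sense in which an ordinary memory access already performs an analog accumulation.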