Scratchpad memory

Papers published on a yearly basis

Papers

Proceedings Article • 10.1145/774789.774805

Scratchpad memory: a design alternative for cache on-chip memory in embedded systems

Rajeshwari M. Banakar¹, Stefan Steinke¹, Bo-Sik Lee¹, Mahesh Balakrishnan¹, Peter Marwedel¹

Indian Institute of Technology Delhi¹

6 May 2002

TL;DR: The results establish scratchpad memory as a low-power alternative to cache in most situations, with an average energy reduction of 40% and an average area-time reduction of 46% relative to cache memory.

Abstract: In this paper we address the problem of on-chip memory selection for computationally intensive applications by proposing scratchpad memory as an alternative to cache. Area and energy for different scratchpad and cache sizes are computed using the CACTI tool, while performance is evaluated using the trace results of the simulator. The target processor chosen for evaluation was the AT91M40400. The results clearly establish scratchpad memory as a low-power alternative in most situations, with an average energy reduction of 40%. Further, the average area-time reduction for the scratchpad memory was 46% of the cache memory.

793 citations

Proceedings Article • 10.1145/378239.379049

Dynamic management of scratch-pad memory space

Mahmut Kandemir¹, J. Ramanujam², J. Irwin¹, N. Vijaykrishnan¹, Ismail Kadayif¹, A. Parikh¹

Pennsylvania State University¹, Louisiana State University²

22 Jun 2001

TL;DR: A compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework that uses both loop and data transformations is proposed; experimental results indicate significant reductions in data transfer activity between SPM and off-chip memory.

Abstract: Optimizations aimed at improving the efficiency of on-chip memories are extremely important. We propose a compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework that uses both loop and data transformations. Experimental results obtained using a generic cost model indicate significant reductions in data transfer activity between SPM and off-chip memory.
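
To make the idea of reducing SPM traffic through loop transformations concrete, here is a minimal C sketch. It is not the paper's framework or cost model: the kernel, the `SPM_WORDS` capacity, the `spm_fill` helper and the transfer counter are all hypothetical, chosen only to show how tiling plus loop interchange can cut the number of words copied from off-chip memory into the SPM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SPM_WORDS 64               /* hypothetical SPM capacity, in ints */

static int spm[SPM_WORDS];         /* stands in for the on-chip SPM */
static unsigned long words_moved;  /* counts off-chip -> SPM traffic */

void spm_reset_counter(void) { words_moved = 0; }
unsigned long spm_words_moved(void) { return words_moved; }

/* Model an explicit, compiler-inserted copy from off-chip memory to SPM. */
static void spm_fill(const int *src, size_t n)
{
    assert(n <= SPM_WORDS);
    memcpy(spm, src, n * sizeof *src);
    words_moved += n;
}

/* Original loop order: every sweep t re-streams all of a[] through the SPM. */
void sweeps_naive(const int *a, size_t n, long *out, size_t sweeps)
{
    for (size_t t = 0; t < sweeps; t++) {
        out[t] = 0;
        for (size_t base = 0; base < n; base += SPM_WORDS) {
            size_t len = (n - base < SPM_WORDS) ? n - base : SPM_WORDS;
            spm_fill(a + base, len);
            for (size_t i = 0; i < len; i++)
                out[t] += (long)spm[i] * (spm[i] + (long)t);
        }
    }
}

/* Tiled and interchanged: each tile is loaded once, then reused by all sweeps. */
void sweeps_tiled(const int *a, size_t n, long *out, size_t sweeps)
{
    for (size_t t = 0; t < sweeps; t++)
        out[t] = 0;
    for (size_t base = 0; base < n; base += SPM_WORDS) {
        size_t len = (n - base < SPM_WORDS) ? n - base : SPM_WORDS;
        spm_fill(a + base, len);
        for (size_t t = 0; t < sweeps; t++)
            for (size_t i = 0; i < len; i++)
                out[t] += (long)spm[i] * (spm[i] + (long)t);
    }
}
```

In this toy model the naive order moves `sweeps * n` words into the SPM, while the tiled order moves only `n`, and both produce the same results because the integer sums are merely reordered.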

306 citations

Proceedings Article • 10.1145/951710.951747

Compiler-decided dynamic memory allocation for scratch-pad based embedded systems

Sumesh Udayakumaran¹, Rajeev Barua¹

University of Maryland, College Park¹

30 Oct 2003

TL;DR: A dynamic allocation method for global and stack data that accounts for changing program requirements at runtime, has no software-caching tags, requires no run-time checks, has extremely low overheads, and yields 100% predictable memory access times is presented.

Abstract: This paper presents a highly predictable, low overhead and yet dynamic, memory allocation strategy for embedded systems with scratch-pad memory. A scratch-pad is a fast compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees vs cache and by its significantly lower overheads in energy consumption, area and overall runtime, even with a simple allocation scheme [4]. Existing scratch-pad allocation methods are of two types. First, software-caching schemes emulate the workings of a hardware cache in software. Instructions are inserted before each load/store to check the software-maintained cache tags. Such methods incur large overheads in runtime, code size, energy consumption and SRAM space for tags and deliver poor real-time guarantees just like hardware caches. A second category of algorithms partitions variables at compile-time into the two banks. For example, our previous work in [3] derives a provably optimal static allocation for global and stack variables and achieves a speedup over all earlier methods. However, a drawback of such static allocation schemes is that they do not account for dynamic program behavior. It is easy to see why a data allocation that never changes at runtime cannot achieve the full locality benefits of a cache. In this paper we present a dynamic allocation method for global and stack data that, for the first time, (i) accounts for changing program requirements at runtime, (ii) has no software-caching tags, (iii) requires no run-time checks, (iv) has extremely low overheads, and (v) yields 100% predictable memory access times. In this method data that is about to be accessed frequently is copied into the SRAM using compiler-inserted code at fixed and infrequent points in the program. Earlier data is evicted if necessary. When compared to a provably optimal static allocation, our results show runtime reductions ranging from 11% to 38%, averaging 31.2%, using no additional hardware support. With hardware support for pseudo-DMA and full DMA, which is already provided in some commercial systems, the runtime reductions increase to 33.4% and 34.2% respectively.
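
The copy-in and eviction scheme described in this abstract can be pictured with a small C sketch. This is only an illustration under stated assumptions, not the authors' compiler output: the `sram` buffer, `SRAM_WORDS`, the helper names and the two-phase kernel are hypothetical. The point it shows is that the transfers sit at fixed program points chosen ahead of time, so accesses inside each region go straight to SRAM with no tags and no run-time residency checks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SRAM_WORDS 8             /* hypothetical scratch-pad capacity, in ints */

static int sram[SRAM_WORDS];     /* stands in for the on-chip SRAM */

/* Hypothetical compiler-inserted transfer: bring a variable into SRAM. */
static void spm_copy_in(const int *dram_var, size_t len)
{
    memcpy(sram, dram_var, len * sizeof *dram_var);
}

/* Hypothetical compiler-inserted transfer: evict SRAM contents to DRAM. */
static void spm_copy_out(int *dram_var, size_t len)
{
    memcpy(dram_var, sram, len * sizeof *dram_var);
}

/*
 * A program with two phases. The "compiler" has decided statically that a[]
 * is hot in the first region and b[] in the second, so the copies are
 * unconditional and placed only at the region boundaries.
 */
void two_phase_kernel(int *a, int *b, size_t n, int reps)
{
    assert(n <= SRAM_WORDS);

    spm_copy_in(a, n);                 /* region 1 entry: a[] becomes resident */
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++)
            sram[i] += 1;              /* direct SRAM access, no tag check */

    spm_copy_out(a, n);                /* region boundary: evict a[] ... */
    spm_copy_in(b, n);                 /* ... and bring in b[] */
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++)
            sram[i] *= 2;

    spm_copy_out(b, n);                /* final write-back of b[] */
}
```

Because every transfer happens at a known point and every access in between hits SRAM, the access time of each load and store in this sketch is known at compile time, which is the predictability property the abstract emphasizes.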

251 citations

Proceedings Article • 10.1109/SC.2014.68

Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format

Joseph L. Greathouse¹, Mayank Daga¹

Advanced Micro Devices¹

16 Nov 2014

TL;DR: This work proposes a novel algorithm, CSR-Adaptive, which keeps the CSR format intact and maps well to GPUs, and achieves an average speedup of 14.7× over existing CSR-based algorithms and 2.3× over clSpMV cocktail, which uses an assortment of matrix formats.

Abstract: The performance of sparse matrix-vector multiplication (SpMV) is important to computational scientists. Compressed sparse row (CSR) is the most frequently used format to store sparse matrices. However, CSR-based SpMV on graphics processing units (GPUs) has poor performance due to irregular memory access patterns, load imbalance, and reduced parallelism. This has led researchers to propose new storage formats. Unfortunately, dynamically transforming CSR into these formats has significant runtime and storage overheads. We propose a novel algorithm, CSR-Adaptive, which keeps the CSR format intact and maps well to GPUs. Our implementation addresses the aforementioned challenges by (i) efficiently accessing DRAM by streaming data into the local scratchpad memory and (ii) dynamically assigning different numbers of rows to each parallel GPU compute unit. CSR-Adaptive achieves an average speedup of 14.7× over existing CSR-based algorithms and 2.3× over clSpMV cocktail, which uses an assortment of matrix formats.
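
The two ideas named in this abstract, streaming nonzeros through a scratchpad and giving each compute unit a variable number of rows, can be sketched sequentially in C. This is a CPU-side illustration and not the authors' GPU implementation: `LDS_CAP`, `build_row_blocks` and `csr_spmv_blocked` are hypothetical names, and the direct loop used for a row that exceeds the scratchpad is simply a placeholder, since the abstract does not say how such rows are handled.

```c
#include <assert.h>
#include <stddef.h>

#define LDS_CAP 4   /* hypothetical scratchpad capacity, in nonzeros */

/* Reference CSR SpMV: y = A * x. */
void csr_spmv(size_t rows, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t r = 0; r < rows; r++) {
        double sum = 0.0;
        for (size_t k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[r] = sum;
    }
}

/*
 * Group consecutive rows into blocks whose nonzeros fit in `cap` entries.
 * A row longer than `cap` gets a block to itself. Block boundaries are
 * written to blocks[] (needs rows + 1 slots); returns the block count.
 */
size_t build_row_blocks(size_t rows, const size_t *row_ptr, size_t cap,
                        size_t *blocks)
{
    assert(cap > 0);
    size_t nb = 0, start = 0;
    blocks[0] = 0;
    for (size_t r = 0; r < rows; r++) {
        size_t nnz_with_r = row_ptr[r + 1] - row_ptr[start];
        if (nnz_with_r > cap && r > start) {
            blocks[++nb] = r;   /* close the block before row r */
            start = r;
        }
    }
    if (rows > 0)
        blocks[++nb] = rows;
    return nb;
}

/* Blocked SpMV: stage each block's products in a scratchpad, then reduce. */
void csr_spmv_blocked(const size_t *row_ptr, const size_t *col_idx,
                      const double *val, const double *x, double *y,
                      const size_t *blocks, size_t nb)
{
    double lds[LDS_CAP];        /* stands in for GPU local memory */
    for (size_t b = 0; b < nb; b++) {
        size_t r0 = blocks[b], r1 = blocks[b + 1];
        size_t first = row_ptr[r0], nnz = row_ptr[r1] - first;
        if (nnz <= LDS_CAP) {
            /* Stream contiguous nonzeros in, multiplying as they arrive. */
            for (size_t k = 0; k < nnz; k++)
                lds[k] = val[first + k] * x[col_idx[first + k]];
            for (size_t r = r0; r < r1; r++) {
                double sum = 0.0;
                for (size_t k = row_ptr[r] - first; k < row_ptr[r + 1] - first; k++)
                    sum += lds[k];
                y[r] = sum;
            }
        } else {
            assert(r1 - r0 == 1);   /* oversized rows are always alone */
            double sum = 0.0;
            for (size_t k = first; k < row_ptr[r1]; k++)
                sum += val[k] * x[col_idx[k]];
            y[r0] = sum;
        }
    }
}
```

The blocks must be built with `cap` equal to `LDS_CAP`. On a GPU, each block would map to one work-group, so short rows are batched together while long rows get a work-group of their own; the CSR arrays themselves are never reformatted.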

238 citations

Proceedings Article • 10.1109/HPCA.2011.5749714

Thread block compaction for efficient SIMT control flow

Wilson W. L. Fung¹, Tor M. Aamodt¹

University of British Columbia¹

12 Feb 2011

TL;DR: This paper proposes and evaluates the benefits of extending the sharing of resources in a block of warps, already used for scratchpad memory, to exploit control flow locality among threads, and shows that this compaction mechanism provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism.

Abstract: Manycore accelerators such as graphics processor units (GPUs) organize processing units into single-instruction, multiple-data "cores" to improve throughput per unit hardware cost. Programming models for these accelerators encourage applications to run kernels with large groups of parallel scalar threads. The hardware groups these threads into warps/wavefronts and executes them in lockstep, dubbed single-instruction, multiple-thread (SIMT) by NVIDIA. While current GPUs employ a per-warp (or per-wavefront) stack to manage divergent control flow, it incurs decreased efficiency for applications with nested, data-dependent control flow. In this paper, we propose and evaluate the benefits of extending the sharing of resources in a block of warps, already used for scratchpad memory, to exploit control flow locality among threads (where such sharing may at first seem detrimental). In our proposal, warps within a thread block share a common block-wide stack for divergence handling. At a divergent branch, threads are compacted into new warps in hardware. Our simulation results show that this compaction mechanism provides an average speedup of 22% over a baseline per-warp, stack-based reconvergence mechanism, and 17% versus dynamic warp formation on a set of CUDA applications that suffer significantly from control flow divergence.
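
A rough way to see why block-wide compaction can help is to count how many warps must issue on each side of a divergent branch. The C sketch below is a counting model only, not the paper's hardware mechanism: `WARP_SIZE` is kept tiny for readability, the function names are hypothetical, and the rule that a thread must stay in its original SIMD lane when moved to a new warp is an assumption of this sketch that the abstract does not state.

```c
#include <assert.h>
#include <stddef.h>

#define WARP_SIZE 4   /* small for illustration; real GPUs use 32 or 64 */

/*
 * taken[w * WARP_SIZE + lane] is nonzero if that thread takes the branch.
 * `path` selects which side of the branch to count: 1 = taken, 0 = not taken.
 */

/* Per-warp stack baseline: a warp issues a path if any of its threads is on it. */
size_t warps_per_warp_stack(const int *taken, size_t nwarps, int path)
{
    assert(path == 0 || path == 1);
    size_t count = 0;
    for (size_t w = 0; w < nwarps; w++) {
        int any = 0;
        for (size_t lane = 0; lane < WARP_SIZE; lane++)
            if ((taken[w * WARP_SIZE + lane] != 0) == path)
                any = 1;
        count += (size_t)any;
    }
    return count;
}

/*
 * Block-wide compaction under the lane-preserving assumption: threads on the
 * same path are repacked across warps, so the number of warps needed equals
 * the occupancy of the busiest lane.
 */
size_t warps_compacted(const int *taken, size_t nwarps, int path)
{
    assert(path == 0 || path == 1);
    size_t worst = 0;
    for (size_t lane = 0; lane < WARP_SIZE; lane++) {
        size_t n = 0;
        for (size_t w = 0; w < nwarps; w++)
            if ((taken[w * WARP_SIZE + lane] != 0) == path)
                n++;
        if (n > worst)
            worst = n;
    }
    return worst;
}
```

With two warps whose branch outcomes interleave across lanes (for example `1,0,1,0` and `0,1,0,1`), the baseline issues two warps per path while compaction needs one. When the same lanes diverge in every warp (for example `1,1,0,0` in both), this model shows no gain, because the threads on each path compete for the same lanes.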

223 citations

Year	Papers
2021	23
2020	20
2019	27
2018	30
2017	36
2016	23

Related Topics (5)

Performance Metrics