Register pointer architecture for efficient embedded processors
Jongsoo Park, Sung-Boem Park, James Balfour, David Black-Schaffer, Christos Kozyrakis, William J. Dally
- 16 Apr 2007
- pp 600-605
TL;DR: The register pointer architecture (RPA) is introduced, which allows registers to be accessed indirectly through register pointers. The indirection affords additional flexibility in naming registers, which reduces the need for loop unrolling to maximize reuse of register-allocated variables.
Abstract: Conventional register file architectures cannot optimally exploit temporal locality in data references due to their limited capacity and static encoding of register addresses in instructions. In conventional embedded architectures, the register file capacity cannot be increased without resorting to longer instruction words. Similarly, loop unrolling is often required to exploit locality in the register file accesses across iterations because naming registers statically is inflexible. Both optimizations lead to significant code size increases, which is undesirable in embedded systems. In this paper, the authors introduce the register pointer architecture (RPA), which allows registers to be accessed indirectly through register pointers. Indirection allows a larger register file to be used without increasing the length of instruction words. Additional register file capacity allows many loads and stores, such as those introduced by spill code, to be eliminated, which improves performance and reduces energy consumption. Moreover, indirection affords additional flexibility in naming registers, which reduces the need to apply loop unrolling in order to maximize reuse of register-allocated variables.
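The indirection described in the abstract can be illustrated with a small software model. The sketch below is only a toy under stated assumptions: the class `RegisterPointerFile`, the `post_inc` parameter, the sizes (64 registers, 4 pointers), and the helper names are hypothetical and are not taken from the paper's instruction set. It shows two points from the abstract: an operand field only has to name a pointer rather than a physical register, and a loop body can walk through registers without being unrolled.

```python
class RegisterPointerFile:
    """Toy model (hypothetical, not the paper's ISA): registers reached via pointers."""

    def __init__(self, num_regs=64, num_ptrs=4):
        self.regs = [0] * num_regs   # large physical register file
        self.ptrs = [0] * num_ptrs   # small set of register pointers

    def set_ptr(self, p, index):
        # Point register pointer p at physical register `index`.
        self.ptrs[p] = index % len(self.regs)

    def read(self, p, post_inc=0):
        # Read the register that pointer p refers to, then advance the pointer.
        value = self.regs[self.ptrs[p]]
        self.ptrs[p] = (self.ptrs[p] + post_inc) % len(self.regs)
        return value


def operand_field_bits(num_names):
    # Bits needed in an instruction word to name one of `num_names` operands.
    return max(1, (num_names - 1).bit_length())


def dot_product(rf, base_a, base_b, n):
    # The loop body always names pointers 0 and 1; the pointers, not the
    # instruction encoding, select which physical registers are read.
    rf.set_ptr(0, base_a)
    rf.set_ptr(1, base_b)
    acc = 0
    for _ in range(n):
        acc += rf.read(0, post_inc=1) * rf.read(1, post_inc=1)
    return acc


rf = RegisterPointerFile()
rf.regs[0:4] = [1, 2, 3, 4]
rf.regs[4:8] = [5, 6, 7, 8]
result = dot_product(rf, 0, 4, 4)
```

In this model, naming one of 4 pointers takes `operand_field_bits(4)` bits while naming one of 64 registers directly would take `operand_field_bits(64)` bits, which is the sense in which a larger register file can sit behind an unchanged instruction word. With static register names, the same loop would have to be unrolled so that each iteration could name different registers.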
Citations
VEGAS: soft vector processor with scratchpad memory
Christopher Han-Yu Chou, Aaron Severance, Alex D. Brant, Zhiduo Liu, Saurabh Sant, Guy G.F. Lemieux
- 27 Feb 2011
TL;DR: VEGAS is a new soft vector architecture in which the vector processor reads and writes directly to a scratchpad memory instead of a vector register file, allowing up to 9x more data elements to fit into on-chip memory; the use of fracturable ALUs allows efficient processing of bytes, halfwords and words in the same processor instance, providing up to 4x the operations compared to existing fixed-width soft vector ALUs.
A Polymorphic Register File for matrix operations
Cătălin Ciobanu, Georgi Kuzmanov, Georgi Gaydadjiev, Alex Ramirez
- 19 Jul 2010
TL;DR: A register file organization, referred to as a Polymorphic Register File, is proposed; it allows dynamic creation of a variable number of multidimensional registers of arbitrary sizes, and its performance benefits are evaluated.
Scalability evaluation of a polymorphic register file: A CG case study
Catalin Bogdan Ciobanu, Xavier Martorell, Georgi Kuzmanov, Alex Ramirez, Georgi Gaydadjiev
- 24 Feb 2011
TL;DR: The study focuses on a heterogeneous multi-processor architecture, taking into consideration critical parameters such as cache bandwidth and memory latency; results suggest that for the Sparse Matrix Vector Multiplication kernel, absolute speedups of up to 200 times can be obtained.
Patent
Computation table for block computation
Ravi Kumar Arimilli, Balaram Sinharoy
- 16 Dec 2008
TL;DR: A compiler identifies a code section that is not a candidate for acceleration and a code block specifying an iterated operation that is a candidate for acceleration. In response to identifying the code block, the compiler creates and outputs an operation data structure, separate from the post-processed code, that identifies the iterated operation.
7
Patent
Block driven computation using a caching policy specified in an operand data structure
Ravi Kumar Arimilli, Balaram Sinharoy
- 16 Dec 2008
TL;DR: A processor has an associated memory hierarchy including a cache memory, and a computation engine computes and stores operands in the memory hierarchy in accordance with the cache policies indicated within the operand data structure.
7
References
MiBench: A free, commercially representative embedded benchmark suite
Matthew R. Guthaus, Jeff Ringenberg, Daniel J. Ernst, Todd Austin, Trevor Mudge, Richard B. Brown
- 02 Dec 2001
TL;DR: A new version of SimpleScalar that has been adapted to the ARM instruction set is used to characterize the performance of the benchmarks using configurations similar to current and next generation embedded processors.
3.7K
Software pipelining: an effective scheduling technique for VLIW machines
Monica S. Lam
- 01 Jun 1988
TL;DR: This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors, and proposes a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block.
A 160 MHz 32 b 0.5 W CMOS RISC microprocessor
J. Montanaro, R. Witek, K. Anne, A.J. Black, Elizabeth M. Cooper, Daniel W. Dobberpuhl, P. Donahue, J. Eno, A. Farell, G. Hoeppner, D. Kruckemyer, Thomas H. Lee, P. Lin, L. Madden, Daniel C. Murray, M. Pearce, S. Santhanam, K. Snyder, R. Stephany, S.C. Thierauf
- 08 Feb 1996
TL;DR: This custom VLSI implementation of a microprocessor architecture delivers 184 Dhrystone MIPS at 162 MHz, dissipating 0.5 W using a 1.5 V internal supply. Clock generation uses an on-chip PLL with a 3.68 MHz input clock to minimize high-frequency clock signals on the board.
733
Vectorizing for a SIMdD DSP architecture
Dorit Naishlos, Marina Biberstein, Shay Ben-David, Ayal Zaks
- 30 Oct 2003
TL;DR: This paper demonstrates how SIMdD (SIMD on disjoint data) supports effective vectorization of digital signal processing (DSP) benchmarks, by facilitating data reorganization and reuse.
79
Evaluating the use of register queues in software pipelined loops
TL;DR: It is shown that decoupling the architected register space from the physical register space can greatly increase the applicability of software pipelining, even as memory latencies increase.