TL;DR: Digital Integrated Circuits addresses today's most significant and compelling industry topics, including: the impact of interconnect, design for low power, issues in timing and clocking, design methodologies, and the tremendous effect of design automation on the digital design perspective.
Abstract: A digital integrated circuit is described in which the internal registers are organized into a number of serial shift paths to facilitate testing. Each path has a number of modes: USER, HOLD, SHIFT and SELF-TEST modes. These modes are controlled by shifting a control function into a control shift register. When the shifting of the control shift register stops, a command is automatically loaded from the control shift register (or another source) into a command register, which controls the serial shift paths. The provision of a separate command register allows a new control function to be shifted into the control shift register while a preceding command is still active in the command register.
TL;DR: It is shown that partitioning the register file along three axes reduces the cost of register storage and communication without significantly impacting performance, and develops a taxonomy of register architectures by partitioning across the data-parallel, instruction-level-par parallel and memory-hierarchy axes.
Abstract: Processor architectures with tens to hundreds of arithmetic units are emerging to handle media processing applications. These applications, such as image coding, image synthesis and image understanding, require arithmetic rates of up to 10/sup 11/ operations per second. As the number of arithmetic units in a processor increases to meet these demands, register storage and communication between the arithmetic units dominate the area, delay and power of the arithmetic units. In this paper, we show that partitioning the register file along three axes reduces the cost of register storage and communication without significantly impacting performance. We develop a taxonomy of register architectures by partitioning across the data-parallel, instruction-level-parallel and memory-hierarchy axes, and by optimizing the hierarchical register organization for operation on streams of data. Compared to a centralized global register file, the most compact of these organizations reduces the register file area, delay and power dissipation of a media processor by factors of 195, 230 and 430 respectively. This reduction in cost is achieved with a performance degradation of only 8% on a representative set of media processing benchmarks.
TL;DR: Two complementary techniques for reducing energy on massively-threaded processors such as GPUs are presented and it is shown that on average, across a variety of real world graphics and compute workloads, a 6-entry per-thread register file cache reduces the number of reads and writes to the main register file by 50% and 59% respectively.
Abstract: Modern graphics processing units (GPUs) use a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complicated thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we examine register file caching to replace accesses to the large main register file with accesses to a smaller structure containing the immediate register working set of active threads. Second, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Combined with register file caching, a two-level thread scheduler provides a further reduction in energy by limiting the allocation of temporary register cache resources to only the currently active subset of threads. We show that on average, across a variety of real world graphics and compute workloads, a 6-entry per-thread register file cache reduces the number of reads and writes to the main register file by 50% and 59% respectively. We further show that the active thread count can be reduced by a factor of 4 with minimal impact on performance, resulting in a 36% reduction of register file energy.
TL;DR: In this paper, an improved flash EEPROM memory-based storage subsystem includes one or more flash memory arrays, each with three data registers and a controller circuit, each of which is used to control the program operation, hold the target data value, and load the next sector's data.
Abstract: An improved flash EEPROM memory-based storage subsystem includes one or more flash memory arrays, each with three data registers and a controller circuit. During a flash program operation, one data register is used to control the program operation, a second register is used to hold the target data value, and a third register is used to load the next sector's data. Subsequent to a flash program operation, a sector's data are read from a flash array into the first data register and compared to the target data stored in the second register. When the data is verified good, the data from the third register is copied into the first and second registers for the next program operation. This creates an improved performance system that doesn't suffer data transfer latency during program operations that require data verification after the program operation is complete. Alternate embodiments perform the comparison using two register implementations and a single register implementations. The post-writer verification can be repeated and use different bias conditions for reading the data. The process can be automatic or executed by command that can specify the read conditions.
TL;DR: In this paper, the alignment and ordering of vector elements for SIMD processing is described, and a starting byte specifying the first byte of an aligned vector is determined, and then a vector is extracted from the first register and the second register, and replicated into the elements in the third register in a particular order suitable for subsequent SIMD vector processing.
Abstract: The present invention provides alignment and ordering of vector elements for SIMD processing. In the alignment of vector elements for SIMD processing, one vector is loaded from a memory unit into a first register and another vector is loaded from the memory unit into a second register. The first vector contains a first byte of an aligned vector to be generated. Then, a starting byte specifying the first byte of an aligned vector is determined. Next, a vector is extracted from the first register and the second register beginning from the first bit in the first byte of the first register continuing through the bits in the second register. Finally, the extracted vector is replicated into a third register such that the third register contains a plurality of elements aligned for SIMD processing. In the ordering of vector elements for SIMD processing, a first vector is loaded from a memory unit into a first register and a second vector is loaded from the memory unit into a second register. Then, a subset of elements are selected from the first register and the second register. The elements from the subset are then replicated into the elements in the third register in a particular order suitable for subsequent SIMD vector processing.