About: Status register is a research topic. Over the lifetime, 3441 publications have been published within this topic receiving 46191 citations. The topic is also known as: flag register & condition code register.
TL;DR: Two complementary techniques for reducing energy on massively-threaded processors such as GPUs are presented and it is shown that on average, across a variety of real world graphics and compute workloads, a 6-entry per-thread register file cache reduces the number of reads and writes to the main register file by 50% and 59% respectively.
Abstract: Modern graphics processing units (GPUs) use a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complicated thread scheduler as well as a large register file, which is expensive to access both in terms of energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we examine register file caching to replace accesses to the large main register file with accesses to a smaller structure containing the immediate register working set of active threads. Second, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Combined with register file caching, a two-level thread scheduler provides a further reduction in energy by limiting the allocation of temporary register cache resources to only the currently active subset of threads. We show that on average, across a variety of real world graphics and compute workloads, a 6-entry per-thread register file cache reduces the number of reads and writes to the main register file by 50% and 59% respectively. We further show that the active thread count can be reduced by a factor of 4 with minimal impact on performance, resulting in a 36% reduction of register file energy.
TL;DR: In this paper, an improved flash EEPROM memory-based storage subsystem includes one or more flash memory arrays, each with three data registers and a controller circuit, each of which is used to control the program operation, hold the target data value, and load the next sector's data.
Abstract: An improved flash EEPROM memory-based storage subsystem includes one or more flash memory arrays, each with three data registers and a controller circuit. During a flash program operation, one data register is used to control the program operation, a second register is used to hold the target data value, and a third register is used to load the next sector's data. Subsequent to a flash program operation, a sector's data are read from a flash array into the first data register and compared to the target data stored in the second register. When the data is verified good, the data from the third register is copied into the first and second registers for the next program operation. This creates an improved performance system that doesn't suffer data transfer latency during program operations that require data verification after the program operation is complete. Alternate embodiments perform the comparison using two register implementations and a single register implementations. The post-writer verification can be repeated and use different bias conditions for reading the data. The process can be automatic or executed by command that can specify the read conditions.
TL;DR: In this paper, the alignment and ordering of vector elements for SIMD processing is described, and a starting byte specifying the first byte of an aligned vector is determined, and then a vector is extracted from the first register and the second register, and replicated into the elements in the third register in a particular order suitable for subsequent SIMD vector processing.
Abstract: The present invention provides alignment and ordering of vector elements for SIMD processing. In the alignment of vector elements for SIMD processing, one vector is loaded from a memory unit into a first register and another vector is loaded from the memory unit into a second register. The first vector contains a first byte of an aligned vector to be generated. Then, a starting byte specifying the first byte of an aligned vector is determined. Next, a vector is extracted from the first register and the second register beginning from the first bit in the first byte of the first register continuing through the bits in the second register. Finally, the extracted vector is replicated into a third register such that the third register contains a plurality of elements aligned for SIMD processing. In the ordering of vector elements for SIMD processing, a first vector is loaded from a memory unit into a first register and a second vector is loaded from the memory unit into a second register. Then, a subset of elements are selected from the first register and the second register. The elements from the subset are then replicated into the elements in the third register in a particular order suitable for subsequent SIMD vector processing.
TL;DR: Construction of the call graph allows us to use the same register for locals of procedures that are not simultaneously active, giving us most of the advantages of a full-scale coloring without the expense.
Abstract: In previous work in global register allocation, the compiler colors a conflict graph constructed from liveness dataflow information, in order to allocate the same register to many variables that are not simultaneously live. If two procedures are in separately compiled modules, however, the compiler must do this allocation separately for each procedure. As a result, the two procedures might use different registers for the same global, or the same register for different locals.We can remove these problems if we delay the register allocation until link time. Our compiler produces object modules that can be linked and run without global register allocation, but includes with each object module a body of information describing how the module uses variables and procedures. A link-time register allocator then decides which variables are used most frequently, selects registers for them, and rewrites the code to reflect the decision that these variables reside in registers rather than in memory. Construction of the call graph allows us to use the same register for locals of procedures that are not simultaneously active, giving us most of the advantages of a full-scale coloring without the expense.When we use our method for 52 registers, our benchmarks speed up by 10 to 25 percent. Even with only 8 registers, the speedup can be nearly that large if we use previously collected profile information to guide the allocation. We cannot do much better, because programs whose variables all fit in registers rarely speed up by more than 30%. Moreover, profiling shows us that we usually remove 60% to 90% of the loads and stores of scalar variables that the program performs during its execution, and often much more.
Abstract: A security apparatus including a number input device (302), an address register (312) responsive to the number input device, an encryption schema memory (316) addressable by the address register to produce an output code and a relative address code, and address incrementing logic (310) responsive the relative address code and operative to increment the address register. The apparatus also preferably includes a PIN register (304) coupled to the number input device, a public code register (306) coupled to the number input device, and merging logic (308) merging outputs of the PIN register and the public code register to be input to the address register. The apparatus also preferably includes an output shift register operative to shift out the output code of the encryption schema memory. The encryption schema memory can be read only memory, writeable memory, or both.