TL;DR: EDDI can provide over 98% fault-coverage without any extra hardware for error detection, which is especially useful when designers cannot change the hardware, but they need dependability in the computer system.
Abstract: This paper proposes a pure software technique "error detection by duplicated instructions" (EDDI), for detecting errors during usual system operation. Compared to other error-detection techniques that use hardware redundancy, EDDI does not require any hardware modifications to add error detection capability to the original system. EDDI duplicates instructions during compilation and uses different registers and variables for the new instructions. Especially for the fault in the code segment of memory, formulas are derived to estimate the error-detection coverage of EDDI using probabilistic methods. These formulas use statistics of the program, which are collected during compilation. EDDI was applied to eight benchmark programs and the error-detection coverage was estimated. Then, the estimates were verified by simulation, in which a fault injector forced a bit-flip in the code segment of executable machine codes. The simulation results validated the estimated fault coverage and show that approximately 1.5% of injected faults produced incorrect results in eight benchmark programs with EDDI, while on average, 20% of injected faults produced undetected incorrect results in the programs without EDDI. Based on the theoretical estimates and actual fault-injection experiments, EDDI can provide over 98% fault-coverage without any extra hardware for error detection. This pure software technique is especially useful when designers cannot change the hardware, but they need dependability in the computer system. To reduce the performance overhead, EDDI schedules the instructions that are added for detecting errors such that "instruction-level parallelism" (ILP) is maximized. Performance overhead can be reduced by increasing ILP within a single super-scalar processor. The execution time overhead in a 4-way super-scalar processor is less than the execution time overhead in the processors that can issue two instructions in one cycle.
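The duplicate-and-compare idea in this abstract can be sketched on a toy register machine. This is an illustration only, not the paper's compiler pass: the instruction tuples, the `s_` shadow-register prefix, the `check` instruction, and the helper names are all hypothetical. Each computation is repeated on shadow registers, a comparison is placed before each store, and a bit-flip in the code is modelled by altering an immediate operand.

```python
SHADOW = "s_"  # hypothetical prefix marking shadow registers


class ErrorDetected(Exception):
    """Raised when a master register and its shadow disagree."""


def duplicate(program):
    """Repeat each computation on shadow registers; compare before each store."""
    out = []
    for op, *args in program:
        if op == "store":
            out.append(("check", args[0], SHADOW + args[0]))
            out.append((op, *args))
        else:
            out.append((op, *args))
            # Same opcode, but every register name is replaced by its shadow.
            out.append((op, *[SHADOW + a if isinstance(a, str) else a for a in args]))
    return out


def run(program):
    """Interpret the toy program and return the list of stored values."""
    regs, stored = {}, []
    for op, *args in program:
        if op == "li":        # load immediate
            regs[args[0]] = args[1]
        elif op == "add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "mul":
            regs[args[0]] = regs[args[1]] * regs[args[2]]
        elif op == "check":
            if regs[args[0]] != regs[args[1]]:
                raise ErrorDetected(f"{args[0]} != {args[1]}")
        elif op == "store":
            stored.append(regs[args[0]])
    return stored


def flip_immediate(program, index, bit):
    """Model a code-segment fault: flip one bit of a 'li' immediate."""
    op, dst, imm = program[index]
    faulty = list(program)
    faulty[index] = (op, dst, imm ^ (1 << bit))
    return faulty


program = [("li", "r1", 6), ("li", "r2", 7), ("mul", "r3", "r1", "r2"), ("store", "r3")]
protected = duplicate(program)

print(run(flip_immediate(program, 0, 0)))   # [49]: silent wrong result without duplication
try:
    run(flip_immediate(protected, 0, 0))
except ErrorDetected as err:
    print("detected:", err)                 # the shadow copy still computes 42
```

The sketch detects the fault only because the flipped bit affects one copy of the computation; a fault that hits the comparison or the store itself is outside what this toy model covers.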
TL;DR: The Imagine architecture supports the stream programming model with a bandwidth hierarchy tailored to the demands of media applications, reducing the global register and memory bandwidth required by typical applications by factors of 13 and 21, respectively.
Abstract: Media applications are characterized by large amounts of available parallelism, little data reuse, and a high computation-to-memory-access ratio. While these characteristics are poorly matched to conventional microprocessor architectures, they are a good fit for modern VLSI technology with its high arithmetic capacity but limited global bandwidth. The stream programming model, in which an application is coded as streams of data records passing through computation kernels, exposes both parallelism and locality in media applications that can be exploited by VLSI architectures. The Imagine architecture supports the stream programming model by providing a bandwidth hierarchy tailored to the demands of media applications. Compared to a conventional scalar processor, Imagine reduces the global register and memory bandwidth required by typical applications by factors of 13 and 21, respectively. This bandwidth efficiency enables a single-chip Imagine processor to achieve a peak performance of 16.2 GFLOPS (single-precision floating point) and sustained performance of up to 8.5 GFLOPS on media processing kernels.
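The stream programming model described here, with streams of records passing through computation kernels, can be sketched with Python generators. This is only an analogy for the programming model, not Imagine's toolchain: the kernel names, the pixel records, and the integer luma weights are assumptions chosen for illustration. The intermediate stream flows from one kernel to the next without being written to a global array, which is the locality the model is meant to expose.

```python
def kernel_to_luma(pixels):
    """Kernel 1: map each (r, g, b) record to an integer luma value."""
    for r, g, b in pixels:
        yield (299 * r + 587 * g + 114 * b) // 1000


def kernel_threshold(lumas, cutoff):
    """Kernel 2: map each luma value to a 0/1 mask bit."""
    for y in lumas:
        yield 1 if y >= cutoff else 0


def run_pipeline(pixels, cutoff=128):
    """Chain the kernels; records stream through one at a time."""
    return list(kernel_threshold(kernel_to_luma(pixels), cutoff))


pixels = [(255, 255, 255), (0, 0, 0), (200, 100, 50)]
print(run_pipeline(pixels))  # [1, 0, 0]
```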
TL;DR: The Parallel DSP Chip has a general-purpose, reduced instruction set for parallel digital signal processing. A single, pipelined instruction unit executes each instruction word, using one group of bits to control four 16-bit vector processors in parallel and another group of bits to control a 24-bit scalar processor.
Abstract: The Parallel DSP Chip has a general purpose, reduced instruction set for parallel digital signal processing. The following pertains to the preferred embodiment. Most instruction words are 32 bits long and execute at the rate of one per clock cycle. Each instruction word is executed by a single, pipelined instruction unit that controls the operation of four, 16-bit vector processors in parallel with one group of bits, and the operation of a 24-bit scalar processor with another group of bits. Thus five instructions are typically executed for every instruction word as a result of the parallel architecture. A single, linear, 16 MB, memory address space is used, simplifying program development. The storage of 8- and 16-bit operands for use by the vector processors is supported to maximize memory utilization. The Parallel DSP Chip is specifically designed to support an enhanced C compiler and has two software stack pointers, one for the vector processors and one for the scalar processor, plus an interrupt stack pointer. The Parallel DSP Chip executes a single task in parallel. Using an enhanced C compiler, simple, familiar, scalar processing programming techniques can be used, and a simple, single-task operating system can be used for software development. The basic programming concept is to define one or more arrays of four-element structures using the enhanced C compiler. One element in such a structure is provided for each of the four vector processors. The structure to be processed at any one time is selected by an address computed by the scalar processor. The same operation is applied to all of the elements of the structure by the simultaneous operation of the vector processors. To access the next structure in an array of structures, the scalar processor advances the address by the number of bytes in the structure.
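The programming concept in the last few sentences (an array of four-element structures, one element per vector processor, selected by an address that the scalar processor advances by the structure size) can be modelled in a few lines. This is a software model under stated assumptions, not the chip's instruction set: the little-endian byte order, the signed 16-bit element type, and the function names are hypothetical.

```python
import struct

FMT = "<4h"                          # assumed layout: four signed 16-bit elements
STRUCT_BYTES = struct.calcsize(FMT)  # 8 bytes per structure


def vector_step(memory, address, op):
    """Apply the same operation to all four elements of one structure."""
    lanes = struct.unpack_from(FMT, memory, address)
    struct.pack_into(FMT, memory, address, *[op(x) for x in lanes])


def process_array(memory, base, count, op):
    """Scalar-processor role: compute each structure's address and advance it."""
    address = base
    for _ in range(count):
        vector_step(memory, address, op)
        address += STRUCT_BYTES      # step to the next structure in the array
    return address


memory = bytearray(struct.pack("<8h", 1, 2, 3, 4, 5, 6, 7, 8))
end = process_array(memory, 0, 2, lambda x: x * 2)
print(end, struct.unpack("<8h", memory))  # 16 (2, 4, 6, 8, 10, 12, 14, 16)
```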
TL;DR: Several compiler techniques that aim at automatically generating high quality codes over a wide range of heterogeneous parallelism available on the CELL processor are described and results indicate that significant speedup can be achieved with a high level of support from the compiler.
Abstract: Developed for multimedia and game applications, as well as other numerically intensive workloads, the CELL processor provides support both for highly parallel codes, which have high computation and memory requirements, and for scalar codes, which require fast response time and a full-featured programming environment. This first generation CELL processor implements on a single chip a Power Architecture processor with two levels of cache, and eight attached streaming processors with their own local memories and globally coherent DMA engines. In addition to processor-level parallelism, each processing element has a Single Instruction Multiple Data (SIMD) unit that can process from 2 double precision floating points up to 16 bytes per instruction. This paper describes, in the context of a research prototype, several compiler techniques that aim at automatically generating high quality codes over a wide range of heterogeneous parallelism available on the CELL processor. Techniques include compiler-supported branch prediction, compiler-assisted instruction fetch, generation of scalar codes on SIMD units, automatic generation of SIMD codes, and data and code partitioning across the multiple processor elements in the system. Results indicate that significant speedup can be achieved with a high level of support from the compiler.
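The SIMD range quoted above, from 2 double-precision values up to 16 bytes per instruction, is consistent with a 16-byte SIMD operand. A small arithmetic sketch of what that implies for instruction counts follows; the function names and the intermediate element widths (4-byte and 2-byte) are assumptions added for illustration and do not come from the abstract.

```python
REGISTER_BYTES = 16  # 2 doubles x 8 bytes = 16 bytes per SIMD instruction


def lanes(element_bytes):
    """Number of elements one SIMD instruction can process."""
    return REGISTER_BYTES // element_bytes


def simd_instructions(n_elements, element_bytes):
    """Instructions needed for n_elements, rounding the last partial vector up."""
    per_instruction = lanes(element_bytes)
    return -(-n_elements // per_instruction)  # ceiling division


print(lanes(8), lanes(1))         # 2 16
print(simd_instructions(100, 8))  # 50
print(simd_instructions(100, 1))  # 7
```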
TL;DR: An integrated multiprocessor architecture pairs a control processor with a vector processor that has a single-instruction-multiple-data (SIMD) architecture; the control processor creates a fork in the program flow by starting the vector processor.
Abstract: An integrated multiprocessor architecture simplifies synchronization of multiple processing units. The multiple processing units constitute a general-purpose or control processor and a vector processor which has a single-instruction-multiple-data (SIMD) architecture so that multiple parallel processing units in the vector processor all complete an instruction simultaneously and do not require software synchronization. The control processor controls the vector processor and creates a fork in a program flow by starting the vector processor. An instruction set for the control processor includes special instructions that enable the control processor to access registers of the vector processor, start or halt execution by the vector processor, and test flags written by the vector processor to indicate completion of tasks. The two processors then execute separate program threads in parallel until the control processor stops the vector processor, an exception is encountered, or the vector processor completes its program thread and enters an idle state. An instruction set for the vector processor includes special instructions that interrupt the first processor to indicate a task is complete. A register coupled to and accessible by both processors stores a state bit indicating whether the vector processor is running or idle. The control processor can synchronize the separate program threads by executing a loop which polls the state bit. When the state bit indicates the vector processor is idle, the general-purpose processor can process results from the vector processor and restart the vector processor.
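The fork-and-poll synchronization described in this abstract can be imitated with two Python threads. This is a loose software analogy, not the hardware mechanism: the `SharedState` class stands in for the shared register, its `running` attribute for the state bit, and all names are hypothetical. The control thread starts the vector thread, does its own work in parallel, then polls the state bit until the vector thread goes idle.

```python
import threading
import time


class SharedState:
    """Stand-in for the register visible to both processors."""

    def __init__(self):
        self.running = False  # state bit: True = running, False = idle
        self.result = None


def vector_program(state, data):
    """Vector-processor thread: compute, publish the result, then go idle."""
    state.result = [x * x for x in data]
    state.running = False     # cleared only after the result is written


def control_program(data):
    """Control-processor thread: fork, work in parallel, poll, collect."""
    state = SharedState()
    state.running = True      # mark running before the fork
    worker = threading.Thread(target=vector_program, args=(state, data))
    worker.start()            # the fork in the program flow
    own_work = sum(data)      # control thread continues its own program
    while state.running:      # polling loop on the state bit
        time.sleep(0.001)
    worker.join()
    return own_work, state.result


print(control_program([1, 2, 3]))  # (6, [1, 4, 9])
```

The result is written before the state bit is cleared, so a control thread that observes the idle state can safely read it.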