TL;DR: This work describes the CUDA programming model, which enables high-performance computing developers to harness the computational power and memory bandwidth of the GPU in a familiar programming environment (the C programming language), and motivates its use in the biomedical imaging community.
Abstract: Graphics processing units (GPUs) originally designed for computer video cards have emerged as the most powerful chip in a high-performance workstation. Unlike multicore CPU architectures, which currently ship with two or four cores, GPU architectures are "manycore" with hundreds of cores capable of running thousands of threads in parallel. NVIDIA's CUDA is a co-evolved hardware-software architecture that enables high-performance computing developers to harness the tremendous computational power and memory bandwidth of the GPU in a familiar programming environment - the C programming language. We describe the CUDA programming model and motivate its use in the biomedical imaging community.
TL;DR: The porting of two- and three-dimensional Euler solvers from a conventional CPU implementation to the novel target platform of the Graphics Processing Unit (GPU) is described, with results found to be up to 29 times quicker on the GPU than on the CPU.
Abstract: The porting of two- and three-dimensional Euler solvers from a conventional CPU implementation to the novel target platform of the Graphics Processing Unit (GPU) is described. The motivation for such an effort is the impressive performance that GPUs offer: typically 10 times more floating point operations per second than a modern CPU, with over 100 processing cores and all at a very modest financial cost. Both codes were found to generate the same results on the GPU as the FORTRAN versions did on the CPU. The 2D solver ran up to 29 times quicker on the GPU than on the CPU; the 3D solver 16 times faster. Nomenclature: $c_v$, specific heat capacity at constant volume; $e$, specific total energy, $e = c_v T + \tfrac{1}{2} V^2$; $h_0$, specific stagnation enthalpy; $p$, pressure; $t$, time; $u, v$, Cartesian components of velocity; $V$, velocity (magnitude); $T$, temperature; $Y_p$, stagnation pressure loss coefficient, $Y_p = (p_{01} - p_0)/(p_{01} - p_2)$.
TL;DR: It is demonstrated that a novel predictive user-level scheduler based on past performance history for heterogeneous systems allows multiple applications to fully utilize all available processing resources in CPU/GPU-like systems and consistently achieve speedups ranging from 30% to 40% compared to just using the GPU in a single application mode.
Abstract: Heterogeneous architectures are currently widespread. With the advent of easy-to-program general purpose GPUs, virtually every recent desktop computer is a heterogeneous system. Combining the CPU and the GPU brings great amounts of processing power. However, such architectures are often used in a restricted way for domain-specific applications like scientific applications and games, and they tend to be used by a single application at a time. We envision future heterogeneous computing systems where all their heterogeneous resources are continuously utilized by different applications with versioned critical parts to be able to better adapt their behavior and improve execution time, power consumption, response time and other constraints at runtime. Under such a model, adaptive scheduling becomes a critical component.
In this paper, we propose a novel predictive user-level scheduler based on past performance history for heterogeneous systems. We developed several scheduling policies and present a study of their impact on system performance. We demonstrate that such a scheduler allows multiple applications to fully utilize all available processing resources in CPU/GPU-like systems and consistently achieve speedups ranging from 30% to 40% compared to just using the GPU in a single application mode.
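As a rough illustration of history-based scheduling, the Python sketch below keeps the observed run times per (task, device) pair and sends each new task to the device with the earliest predicted finish time. The class name `HistoryScheduler`, the mean-of-past-runs predictor, and the explore-untried-devices-first rule are all assumptions made for this example; the abstract does not describe the authors' actual policies.

```python
from collections import defaultdict


class HistoryScheduler:
    """Hypothetical scheduler: pick the device with the earliest predicted finish."""

    def __init__(self, devices):
        self.devices = list(devices)
        # history[(task, device)] -> observed run times in seconds
        self.history = defaultdict(list)
        # busy_until[device] -> predicted time at which the device queue drains
        self.busy_until = {d: 0.0 for d in self.devices}

    def record(self, task, device, seconds):
        """Store one observed run time for a task on a device."""
        self.history[(task, device)].append(seconds)

    def predict(self, task, device):
        """Predict the run time as the mean of past runs, or None if unseen."""
        runs = self.history[(task, device)]
        return sum(runs) / len(runs) if runs else None

    def schedule(self, task, now=0.0):
        # Assumption: a device never tried for this task is explored first.
        for d in self.devices:
            if self.predict(task, d) is None:
                return d

        def finish(d):
            return max(now, self.busy_until[d]) + self.predict(task, d)

        best = min(self.devices, key=finish)
        self.busy_until[best] = finish(best)
        return best
```

With a recorded history of 3.5 s on the CPU and 1.0 s on the GPU, this sketch queues three tasks on the GPU and then places the fourth on the CPU, because by then the predicted GPU finish time (4.0 s) exceeds the CPU's (3.5 s).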
TL;DR: In this paper, an information processing apparatus includes a CPU including a plurality of instruction processors, a monitoring unit which monitors an operating power-supplying environment, and a power saving unit which controls the number of operating instruction processors provided in the CPU in accordance with the operating power-supplying environment obtained by the monitoring unit.
Abstract: According to one embodiment, an information processing apparatus includes a CPU including a plurality of instruction processors, a monitoring unit which monitors an operating power-supplying environment, and a power saving unit which controls the number of operating instruction processors provided in the CPU in accordance with the operating power-supplying environment obtained by the monitoring unit.
TL;DR: In this paper, a computer system includes a data capture module for capturing at least one of a make, a model and a unit identification number for a vehicle involved in a loss event.
Abstract: A computer system includes a data capture module for capturing at least one of a make, a model and a unit identification number for a vehicle involved in a loss event. The system further includes a data storage module for storing the at least one of a make, a model and a unit identification number. A computer processor analyzes information related to the vehicle to detect a pattern of reported problems involving the vehicle. An output device outputs an identification of a subrogation opportunity based on the detected pattern.
TL;DR: This article investigates data-parallel computer architectures such as Graphics Processing Units (GPUs) to simulate large scale ABMs and believes that its system is the first ever completely GPU based agent simulation framework.
Abstract: Agent-based modeling is a technique for modeling dynamic systems from the bottom up. Individual elements of the system are represented computationally as agents. The system-level behaviors emerge from the micro-level interactions of the agents. Contemporary state-of-the-art agent-based modeling toolkits are essentially discrete-event simulators designed to execute serially on the Central Processing Unit (CPU). They simulate Agent-Based Models (ABMs) by executing agent actions one at a time. In addition to imposing an unnatural execution order, these toolkits have limited scalability. In this article, we investigate data-parallel computer architectures such as Graphics Processing Units (GPUs) to simulate large-scale ABMs. We have developed a series of efficient, data-parallel algorithms for handling environment updates, various agent interactions, agent death and replication, and gathering statistics. We present three fundamental innovations that provide unprecedented scalability. The first is a novel stochastic memory allocator which enables parallel agent replication in O(1) average time. The second is a technique for resolving precedence constraints for agent actions in parallel. The third is a method that uses specialized graphics hardware to gather and process statistical measures. These techniques have been implemented on a modern-day GPU, resulting in a substantial performance increase. We believe that our system is the first ever completely GPU-based agent simulation framework. Although GPUs are the focus of our current implementations, our techniques can easily be adapted to other data-parallel architectures. We have benchmarked our framework against contemporary toolkits using two popular ABMs, namely, SugarScape and StupidModel.
TL;DR: This work shows how to exploit recent massively parallel single-instruction multiple-thread (SIMT) based graphics processing unit (GPU) platforms to tackle power grid analysis with promising performance and proposes a GPU-accelerated hybrid multigrid algorithm, GpuHMD, and its implementation.
Abstract: The challenging task of analyzing on-chip power (ground) distribution networks with multi-million node complexity and beyond is key to today's large chip designs. For the first time, we show how to exploit recent massively parallel single-instruction multiple-thread (SIMT) based graphics processing unit (GPU) platforms to tackle power grid analysis with promising performance. Several key enablers including GPU-specific algorithm design, circuit topology transformation, workload partitioning, and performance tuning are embodied in our GPU-accelerated hybrid multigrid algorithm, GpuHMD, and its implementation. In particular, a proper interplay between algorithm design and SIMT architecture consideration is shown to be essential to achieve good runtime performance. Different from the standard CPU-based CAD development, care must be taken to balance between computing and memory access, reduce random memory access patterns and simplify flow control to achieve efficiency on the GPU platform. Extensive experiments on industrial and synthetic benchmarks have shown that the proposed GpuHMD engine can achieve 100× runtime speedup over a state-of-the-art direct solver and be more than 15× faster than the CPU-based multigrid implementation. The DC analysis of a 1.6 million-node industrial power grid benchmark can be accurately solved in three seconds with less than 50 MB memory on a commodity GPU. It is observed that the proposed approach scales favorably with the circuit complexity, at a rate of about one second per million nodes.
TL;DR: This book gives a tutorial introduction to the MSP430 microcontroller and closes with an introduction to the MSP430X, which extends the architecture to address more memory and provides a bridge to the ARM 7 processor.
Abstract: The MSP430 is a simple 16-bit microcontroller with a compact and economical CPU containing only 27 instructions and 16 registers. It offers other advantages which make it suitable for low power applications: a rich variety of peripherals for analog input and output; rapid processing wake up time; the treatment of data and address on equal footing. Introduction to the MSP430 combines a tutorial approach with a description of the CPU and main peripherals. The tutorial builds from a basic program for lighting LEDs to the use of a timer. It uses the C programming language from the start but programs are also developed in assembly language to show how a program interacts with the hardware. To demonstrate the special features of the MSP430, full coverage is given to the instruction set, sigma-delta analog-digital converters and timers. Finally, the book gives an introduction to the MSP430X, which extends the architecture to address more memory and which provides a bridge to the ARM 7 processor. Contents: 1. Embedded electronic systems and microcontrollers; 2. Texas MSP430; 3. Development; 4. A simple tour of the MSP430; 5. Architecture of the MSP430; 6. Functions, interrupts and low-power modes; 7. Digital input, output and displays; 8. Timers; 9. Mixed-signal systems: Analog input and output; 10. Communication; 11. The future: MSP430X; Appendices. Features: the only tutorial book on the MSP430; uses both C and assembly language; a CDROM containing a development kit to help the engineer and hobbyist program the MSP430.
TL;DR: The novelties in the approach and techniques employed to represent data, compute distances, centroids and identify the cluster elements using the GPU are presented and performance is measured using the metric: computational time per iteration.
Abstract: We exploit the parallel architecture of the Graphics Processing Unit (GPU) used in desktops to efficiently implement the traditional K-means algorithm. Our approach in clustering avoids the need for data and cluster information transfer between the GPU and CPU in between the iterations. In this paper we present the novelties in our approach and techniques employed to represent data, compute distances, centroids and identify the cluster elements using the GPU. We measure performance using the metric: computational time per iteration. Our implementation of k-means clustering on an Nvidia 5900 graphics processor is 4 to 12 times faster than the CPU and 7 to 22 times faster on the Nvidia 8500 graphics processor for various data sizes. We also achieved 12 to 64 times speed gain on the 5900 and 20 to 140 times speed gains on the 8500 graphics processor in computational time per iteration for evaluations with various cluster sizes.
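The three per-iteration steps named in the abstract (compute distances, identify cluster elements, recompute centroids) can be sketched on the CPU as follows. This is plain Python for illustration only, not the authors' GPU implementation; the function name `kmeans_step` and the choice to leave an empty cluster's centroid unchanged are assumptions.

```python
def kmeans_step(points, centroids):
    """One k-means iteration: assign each point, then recompute centroids."""
    k = len(centroids)
    dim = len(centroids[0])

    # Step 1 and 2: squared Euclidean distance to every centroid, then
    # label each point with the index of its nearest centroid.
    labels = []
    for p in points:
        d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(d2.index(min(d2)))

    # Step 3: new centroid = mean of the points assigned to it.
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, j in zip(points, labels):
        counts[j] += 1
        for i, a in enumerate(p):
            sums[j][i] += a

    new_centroids = [
        # Assumption: an empty cluster keeps its previous centroid.
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return labels, new_centroids
```

Each point's distance computation is independent of every other point's, which is what makes the assignment step a natural fit for data-parallel hardware.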
TL;DR: This work has tried to exploit the advantages of both optical tree architecture and TOAD-based switch to design an integrated all-optical circuit that can perform binary addition, addition with carry, subtract with borrow, subtract (2's complement), double, increment, decrement, and transfer operations.
Abstract: An all-optical arithmetic unit with the help of terahertz-optical-asymmetric-demultiplexer (TOAD)-based tree architecture is proposed. We describe the all-optical arithmetic unit by using a set of all-optical multiplexers, all-optical full-adders, and optical switches. The all-optical arithmetic unit can be used to build a fast central processing unit using optical hardware components. We have tried to exploit the advantages of both optical tree architecture and TOAD-based switch to design an integrated all-optical circuit that can perform binary addition, addition with carry, subtract with borrow, subtract (2's complement), double, increment, decrement, and transfer operations.
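To make the listed operations concrete, here is a purely logical Python sketch of how addition, 2's complement subtraction, increment, decrement, and double can all be derived from a one-bit full adder. It models Boolean behaviour only and says nothing about the optical TOAD implementation; the function names and the 8-bit default word width are assumptions for the example.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    total = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return total, carry_out


def add(x, y, carry_in=0, width=8):
    """Ripple-carry addition of two width-bit words: (result, carry_out)."""
    result, carry = 0, carry_in
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry


def subtract(x, y, width=8):
    """x - y computed as x + NOT(y) + 1 (2's complement)."""
    ones = (1 << width) - 1
    return add(x, ~y & ones, carry_in=1, width=width)


def increment(x, width=8):
    return add(x, 0, carry_in=1, width=width)


def decrement(x, width=8):
    # Adding the all-ones word equals subtracting 1 modulo 2**width.
    return add(x, (1 << width) - 1, width=width)


def double(x, width=8):
    return add(x, x, width=width)
```

For example, `subtract(5, 3)` returns `(2, 1)`: the result is 2, and a carry-out of 1 indicates that no borrow occurred.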
TL;DR: In this paper, a transcoding system that shares the workload of video transcoding through the use of multiple central processing unit (CPU) cores and/or one or more graphical processing units (GPU) is described.
Abstract: Embodiments of the invention as described herein provide a solution to the problems of conventional methods as stated above. In the following description, various examples are given for illustration, but none are intended to be limiting. Embodiments are directed to a transcoding system that shares the workload of video transcoding through the use of multiple central processing unit (CPU) cores and/or one or more graphical processing units (GPU), including the use of two components within the GPU: a dedicated hardcoded or programmable video decoder for the decode step and compute shaders for scaling and encoding. The system combines usage of an industry standard Microsoft DXVA method for using the GPU to accelerate video decode with a GPU encoding scheme, along with an intermediate step of scaling the video.
TL;DR: In this article, a sensor node is described that includes a sensor for measuring biological information, a CPU for acquiring data by driving the sensor, a wireless communication unit for transmitting the data acquired by the CPU, a battery for supplying the control unit, the wireless communication unit, and the sensor with electric power, a RAM for storing the data, a compression unit for compressing the data stored in the RAM when the wireless communication unit cannot carry out the transmission, and a flash memory for storing the compressed data, thereby storing as much sensing data as possible on the sensor node, which is limited in resources, and preventing loss of the sensing data.
Abstract: Provided is a sensor node including: a sensor for measuring biological information; a CPU for acquiring data by driving the sensor; a wireless communication unit for transmitting the data acquired by the CPU; a battery for supplying the control unit, the wireless communication unit, and the sensor with electric power; a RAM for storing the data; a compression unit for compressing the data stored in the RAM when the wireless communication unit cannot carry out the transmission; and a flash memory for storing the compressed data, thereby storing as much sensing data as possible on the sensor node, which is limited in resources, and preventing loss of the sensing data.
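The store-compressed-when-offline behaviour can be sketched as below. This is a simplified Python model under stated assumptions: `SensorNode`, `link_up`, and the lists standing in for RAM, flash, and the radio are hypothetical names, zlib stands in for the unspecified compression unit, and sending the backlog once the link returns is an assumed policy that the abstract does not spell out.

```python
import zlib


class SensorNode:
    """Hypothetical model of a node that compresses data it cannot transmit."""

    def __init__(self, link_up):
        self.link_up = link_up  # callable: True when the wireless link works
        self.ram = []           # raw samples awaiting transmission
        self.flash = []         # compressed blocks stored while offline
        self.sent = []          # stand-in for data handed to the radio

    def on_sample(self, sample: bytes):
        self.ram.append(sample)
        if self.link_up():
            # Assumed policy: send the stored backlog first, oldest block first.
            for block in self.flash:
                self.sent.append(zlib.decompress(block))
            self.flash.clear()
            self.sent.append(b"".join(self.ram))
        else:
            # Transmission impossible: compress RAM contents into flash.
            self.flash.append(zlib.compress(b"".join(self.ram)))
        self.ram.clear()
```

Compressing before writing to flash trades CPU time for storage, which matches the stated goal of keeping as much sensing data as possible on a resource-limited node.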
TL;DR: An exact analysis of the problem for a real-time embedded application running on a DVS-enabled CPU and using potentially multiple devices is undertaken and a provably optimal algorithm is proposed to determine the optimal CPU speed as well as device state transition decisions to minimize the system-level energy.
Abstract: Dynamic Voltage Scaling (DVS) and Dynamic Power Management (DPM) are two popular techniques commonly employed to save energy in real-time embedded systems. DVS policies aim at reducing the CPU energy, while DPM-based solutions involve putting the system components (e.g. memory or I/O devices) to low-power/sleep states at run-time, when sufficiently long idle intervals can be predicted. Despite numerous research papers that tackled the energy minimization problem using DVS or DPM separately, the interactions of these two popular techniques are not yet well understood. In this paper, we undertake an exact analysis of the problem for a real-time embedded application running on a DVS-enabled CPU and using potentially multiple devices. Specifically, by adopting a generalized system-level energy model and taking into account the non-trivial time/energy overheads involved in device transitions, we characterize the variations in different components of the system energy as a function of the CPU processing speed. Then, we propose a provably optimal algorithm to determine the optimal CPU speed as well as device state transition decisions to minimize the system-level energy. Our algorithm runs in O(m log m) time, where m is the number of devices used by the application. The evaluations with realistic system parameters indicate that our solution, which combines DVS and DPM optimally, can lead to substantial energy savings when compared to previous solutions.
TL;DR: In this paper, a processing platform and a method of controlling power consumption of a central processing unit of the processing platform are presented; the platform is able to set an upper and a lower performance state limit.
Abstract: A processing platform and a method of controlling power consumption of a central processing unit of the processing platform are presented. By operating the method the processing platform is able to set an upper performance state limit and a lower performance state limit. The upper performance state limit is based on a central processing unit activity rate value and the lower performance state limit is based on a minimum requirement of the operating system to perform operating system tasks. The performance state values vary within the range of the lower and upper limits according to a power management policy.
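A minimal sketch of the limit logic, in Python. Several things here are assumptions for illustration and are not specified by the abstract: the linear mapping from activity rate to upper limit, the convention that a higher level number means higher performance, and the function names `upper_limit` and `choose_level`.

```python
import math


def upper_limit(activity_rate, max_level):
    """Map a CPU activity rate in [0, 1] to the highest allowed level.

    Assumption: a simple linear mapping, rounded up.
    """
    rate = min(max(activity_rate, 0.0), 1.0)
    return math.ceil(rate * max_level)


def choose_level(requested, activity_rate, os_min_level, max_level):
    """Clamp the level requested by the power policy into [lower, upper]."""
    lower = os_min_level
    # The upper limit never drops below what the operating system needs.
    upper = max(upper_limit(activity_rate, max_level), lower)
    return min(max(requested, lower), upper)
```

For instance, with `max_level=4`, an activity rate of 0.5, and an OS minimum of 1, a policy request for level 4 is capped at 2, and a request for level 0 is raised to 1.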
TL;DR: In this paper, the problem of properly performing sheet transportation processing depending on whether a first fixing means or a second fixing means can be used is addressed. When the first fixing unit can be used but the second cannot, the CPU executes only the transportation processing that does not use the second fixing unit.
Abstract: PROBLEM TO BE SOLVED: To properly perform sheet transportation processing depending on whether a first fixing means can be used or a second fixing means can be used. SOLUTION: When a first fixing unit 401 cannot be used, a CPU 205 performs control not to execute first transportation processing in which the first fixing unit 401 and the second fixing unit 402 are used, and second transportation processing in which the first fixing unit 401 is used, but the second fixing unit 402 is not used. When the first fixing unit 401 can be used, but the second fixing unit 402 cannot be used, the CPU 205 performs control not to execute the first transportation processing while performing control to execute the second transportation processing. COPYRIGHT: (C)2008, JPO&INPIT
TL;DR: In this article, a method and an apparatus that determine a total number of threads to concurrently execute executable codes compiled from a single source for target processing units in response to an API (Application Programming Interface) request from an application running in a host processing unit are described.
Abstract: A method and an apparatus that determine a total number of threads to concurrently execute executable codes compiled from a single source for target processing units in response to an API (Application Programming Interface) request from an application running in a host processing unit are described. The target processing units include GPUs (Graphics Processing Unit) and CPUs (Central Processing Unit). Thread group sizes for the target processing units are determined to partition the total number of threads according to a multi-dimensional global thread number included in the API request. The executable codes are loaded to be executed in thread groups with the determined thread group sizes concurrently in the target processing units.
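One way to picture the partitioning step: given a multi-dimensional global thread count and a per-device cap on threads per group, pick a group shape that divides the global shape evenly. The Python sketch below uses a greedy largest-divisor rule; that rule, the names `partition` and `largest_divisor_at_most`, and the example caps are assumptions for illustration, not the method claimed in the source.

```python
def largest_divisor_at_most(n, limit):
    """Largest d with d <= limit that divides n (1 always qualifies)."""
    for d in range(min(n, limit), 0, -1):
        if n % d == 0:
            return d


def partition(global_size, max_group_size):
    """Split a multi-dimensional thread count into equal thread groups.

    global_size: tuple of thread counts per dimension.
    max_group_size: assumed per-device cap on threads in one group.
    Returns (group_shape, number_of_groups).
    """
    group = []
    budget = max_group_size
    for n in global_size:
        g = largest_divisor_at_most(n, budget)
        group.append(g)
        budget //= g  # remaining threads available to later dimensions

    groups = 1
    for n, g in zip(global_size, group):
        groups *= n // g
    return tuple(group), groups
```

With a global size of (1024, 768), a device assumed to allow 256 threads per group gets groups of shape (256, 1), 3072 groups in total, while a device assumed to allow only 4 gets shape (4, 1) and 196608 groups.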
TL;DR: The experimental results first confirm that a system-level memory DTM policy may significantly improve system performance and power efficiency, compared with the existing memory bandwidth throttling scheme, and observe that the second policy, called DTM-CDVFS (Coordinated Dynamic Voltage and Frequency Scaling), has much better performance than previously reported.
Abstract: Thermal management of DRAM memory has become a critical issue for server systems. We have done, to the best of our knowledge, the first study of software thermal management for the memory subsystem on real machines. Two recently proposed DTM (Dynamic Thermal Management) policies have been improved and implemented in Linux OS and evaluated on two multicore servers, a Dell PowerEdge 1950 server and a customized Intel SR1500AL server testbed. The experimental results first confirm that a system-level memory DTM policy may significantly improve system performance and power efficiency, compared with the existing memory bandwidth throttling scheme. A policy called DTM-ACG (Adaptive Core Gating) shows performance improvement comparable to that reported previously. The average performance improvements are 13.3% and 7.2% on the PowerEdge 1950 and the SR1500AL (vs. 16.3% from the previous simulation-based study), respectively. We also have surprising findings that reveal the weakness of the previous study: the CPU heat dissipation and its impact on DRAM memories, which were ignored, are significant factors. We have observed that the second policy, called DTM-CDVFS (Coordinated Dynamic Voltage and Frequency Scaling), has much better performance than previously reported for this reason. The average improvements are 10.8% and 15.3% on the two machines (vs. 3.4% from the previous study), respectively. It also significantly reduces the processor power by 15.5% and energy by 22.7% on average.
TL;DR: In this paper, the authors present an apparatus and method for processing graphic data, which includes a CPU having at least one core; a GPU configured to process graphic data; a usage level checking unit configured to check a usage level of the CPU and/or of the GPU; and a control unit that allows the graphic data to be processed in parallel by both the CPU and the GPU, or only by the GPU, according to the comparison results.
Abstract: The present invention relates to an apparatus and method for processing graphic data. According to an embodiment, the graphic data processing apparatus includes a CPU having at least one core; a GPU configured to process graphic data; a usage level checking unit configured to check a usage level of the CPU and/or a usage level of the GPU; and a control unit configured to compare the checked usage level of the CPU with a usage level reference of the CPU and/or to compare the checked usage level of the GPU with a usage level reference of the GPU, to allow the graphic data to be processed in parallel by the CPU and the GPU or only by the GPU according to the comparison results.
TL;DR: Performance results show that the GPU can outperform a modern CPU on this problem by an order of magnitude and also confirm that a GPU can be used to support a system with at least three parity disks with no performance penalty.
Abstract: Graphical Processing Units (GPUs) have been applied to more types of computations than just graphics processing for several years. Until recently, however, GPU hardware has not been capable of efficiently performing general data processing tasks. With the advent of more general-purpose extensions to GPUs, many more types of computations are now possible. One such computation that we have identified as being suitable for the GPU's unique architecture is Reed-Solomon coding in a manner appropriate for RAID-type systems. In this paper, we motivate the need for RAID with triple-disk parity and describe a pipelined architecture for using a GPU for this purpose. Performance results show that the GPU can outperform a modern CPU on this problem by an order of magnitude and also confirm that a GPU can be used to support a system with at least three parity disks with no performance penalty.
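For readers unfamiliar with Reed-Solomon parity in RAID, the Python sketch below computes three parity blocks (P, Q, R) over GF(2^8) and recovers one lost data block from P. The field polynomial 0x11d and the coefficient choice (1, 2^i, 4^i for disk i) follow a common RAID-6-style construction and are assumptions here; the abstract does not state which construction or pipeline the authors use.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by x^8+x^4+x^3+x^2+1 (0x11d)."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return product


def gf_pow(a, n):
    """Raise a to the n-th power in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def parity(stripes):
    """Compute (P, Q, R) for equal-length data blocks, one per data disk."""
    n = len(stripes[0])
    p, q, r = bytearray(n), bytearray(n), bytearray(n)
    for i, data in enumerate(stripes):
        cq = gf_pow(2, i)  # assumed Q coefficient for disk i
        cr = gf_pow(4, i)  # assumed R coefficient for disk i
        for j, byte in enumerate(data):
            p[j] ^= byte
            q[j] ^= gf_mul(cq, byte)
            r[j] ^= gf_mul(cr, byte)
    return bytes(p), bytes(q), bytes(r)


def recover_one(stripes, missing, p):
    """Rebuild the block at index `missing` from P and the surviving blocks."""
    out = bytearray(p)
    for i, data in enumerate(stripes):
        if i != missing:
            for j, byte in enumerate(data):
                out[j] ^= byte
    return bytes(out)
```

Every output byte depends only on the bytes at the same offset across the disks, so the inner loops are independent per byte offset, which is the kind of data parallelism a GPU can exploit.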
TL;DR: This work develops an SoC with 8 processor cores and 8 user RAMs targeted for power-efficient high-performance embedded applications and assigns 16 blocks to separate power domains so that they can be independently powered off.
Abstract: Power efficient SoC design for embedded applications requires several independent power-domains where the power of unused blocks can be turned off. An SoC for mobile phones defines 23 hierarchical power domains but most of the power domains are assigned for peripheral IPs that mainly use low-leakage high-Vt transistors. Since high-performance multiprocessor SoCs use leaky low-Vt transistors for CPU sections, leakage power savings of these CPU sections is a primary objective. We develop an SoC with 8 processor cores and 8 user RAMs (1 per core) targeted for power-efficient high-performance embedded applications. We assign these 16 blocks to separate power domains so that they can be independently powered off. A resume mode is also introduced where the power of the CPU is off and the user RAM is on for fast resume operation. An automatic parallelizing compiler schedules tasks for each CPU core and also performs power management for each CPU core. With the help of this compiler, each processor core can operate at a different frequency or even dynamically stop the clock to maintain processing performance while reducing average operating power consumption. The compiler also executes power-off control of unnecessary CPU cores.
TL;DR: The use of Graphics Processing Units (GPUs) for computing acoustic likelihoods in a speech recognition system shows that the GPU is 5x faster than the CPU SSE-based implementation, which led to a speed up of 35% on a large vocabulary task.
Abstract: This paper introduces the use of Graphics Processing Units (GPUs) for computing acoustic likelihoods in a speech recognition system. In addition to their high availability, GPUs provide high computing performance at low cost. We have used an NVIDIA GeForce 8800GTX programmed with CUDA (Compute Unified Device Architecture), which exposes the GPU as a parallel coprocessor. The acoustic likelihoods are computed as dot products, operations for which GPUs are highly efficient. The implementation in our speech recognition system shows that the GPU is 5x faster than the CPU SSE-based implementation. This improvement led to a speed up of 35% on a large vocabulary task.
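To show how a likelihood can reduce to a dot product, the Python sketch below rewrites the log-density of a diagonal-covariance Gaussian as w · φ(x), where φ(x) = [1, x, x²] and w is precomputed from the mean and variance. The diagonal Gaussian model and all function names are assumptions for illustration; the abstract states only that the likelihoods are computed as dot products.

```python
import math


def gaussian_to_weights(mean, var):
    """Pack a diagonal Gaussian into w so that log N(x) = dot(w, phi(x))."""
    const = -0.5 * sum(
        math.log(2 * math.pi * v) + m * m / v for m, v in zip(mean, var)
    )
    linear = [m / v for m, v in zip(mean, var)]  # coefficients of x_i
    quad = [-0.5 / v for v in var]               # coefficients of x_i^2
    return [const] + linear + quad


def expand_features(x):
    """phi(x) = [1, x_1..x_d, x_1^2..x_d^2], computed once per frame."""
    return [1.0] + list(x) + [xi * xi for xi in x]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def log_gaussian_direct(x, mean, var):
    """Reference implementation used to check the dot-product form."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )
```

The identity follows from expanding (x − μ)²/σ² = x²/σ² − 2xμ/σ² + μ²/σ². Stacking the weight vectors of many Gaussians into a matrix turns the scoring of one frame into a matrix-vector product.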
TL;DR: In this paper, the authors present a hardware implementation of a Hard Object system, where each page table entry has additional meta-data beyond what it commonly holds, and each time a data load or store is issued from the CPU, and the virtual address 032 translated to the physical address 034, the hard object system uses its additional PTE meta data 061 to perform memory access checks additional to those done in current systems.
Abstract: In accordance with one embodiment, additions to the standard computer microprocessor architecture hardware are disclosed comprising novel page table entry fields 015 062, special registers 021 022, instructions for modifying these fields 120 122 and registers 124 126, and hardware-implemented 038 runtime checks and operations involving these fields and registers. More specifically, in the above embodiment of a Hard Object system, there is additional meta-data 061 in each page table entry beyond what it commonly holds, and each time a data load or store is issued from the CPU, and the virtual address 032 translated to the physical address 034, the Hard Object system uses its additional PTE meta-data 061 to perform memory access checks additional to those done in current systems. Together with changes to software, these access checks can be arranged carefully to provide more fine-grain access control for data than do current systems.
TL;DR: In this paper, a multiprocessor system having plural heterogeneous processing units schedules instruction sets for execution on a selected one of the processing units by matching workload processing characteristics of processing units and the instruction sets.
Abstract: A multiprocessor system having plural heterogeneous processing units schedules instruction sets for execution on a selected one of the processing units by matching workload processing characteristics of processing units and the instruction sets. To establish an instruction set's processing characteristics, the homogeneous instruction set is executed on each of the plural processing units with one or more performance metrics tracked at each of the processing units to determine which processing unit most efficiently executes the instruction set. Instruction set workload processing characteristics are stored for reference in scheduling subsequent execution of the instruction set.
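The profile-then-remember idea can be sketched as follows in Python: run a workload once on every processing unit, record a metric, and store the winner for later scheduling. Using wall-clock time as the only metric, the class name `ProfilingScheduler`, and representing each unit as a callable are all assumptions for this example.

```python
import time


class ProfilingScheduler:
    """Hypothetical scheduler that remembers the fastest unit per workload."""

    def __init__(self, units):
        self.units = units   # unit name -> callable(workload) running it there
        self.best_unit = {}  # workload id -> name of the most efficient unit

    def profile(self, workload_id, workload, timer=time.perf_counter):
        """Run the workload on every unit and store which one was fastest."""
        timings = {}
        for name, run in self.units.items():
            start = timer()
            run(workload)
            timings[name] = timer() - start
        self.best_unit[workload_id] = min(timings, key=timings.get)
        return timings

    def run(self, workload_id, workload):
        """Execute on the stored best unit, profiling first if unseen."""
        if workload_id not in self.best_unit:
            self.profile(workload_id, workload)
        return self.units[self.best_unit[workload_id]](workload)
```

The `timer` parameter exists so that a deterministic clock can be injected for testing; in real use the default `time.perf_counter` would be noisy, and several profiling runs per unit would likely be needed.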
TL;DR: New schemes for efficient automatic task distribution between CPU and GPU are presented and tests and results of implementing those schemes are included with a test case and with a real-time system.
Abstract: The increase in computational power of programmable GPUs (graphics processing units) brings new concepts for using these devices for generic processing. Hence, with the use of the CPU and the GPU for data processing come new ideas that deal with the distribution of tasks among CPU and GPU, such as automatic distribution. The importance of the automatic distribution of tasks between CPU and GPU lies in three facts. First, automatic task distribution enables applications to use the best of both processors. Second, the developer does not have to decide which processor will do the work, allowing the automatic task distribution system to choose the best option for the moment. And third, sometimes, the application can be slowed down by other processes if the CPU or GPU is already overloaded. Based on these facts, this paper presents new schemes for efficient automatic task distribution between CPU and GPU. This paper also includes tests and results of implementing those schemes with a test case and with a real-time system.
TL;DR: HYDRA is a programming model and a runtime support layer which enables utilization of host processors as well as various programmable peripheral devices' processors and was able to significantly cut down the development cost of a system that uses multiple heterogeneous compute resources.
Abstract: The constant race for faster and more powerful CPUs is drawing to a close. No longer is it feasible to significantly increase the speed of the CPU without paying a crushing penalty in power consumption and production costs. Instead of increasing single thread performance, the industry is turning to multiple CPU threads or cores (such as SMT and CMP) and heterogeneous CPU architectures (such as the Cell Broadband Engine). While this is a step in the right direction, in every modern PC there is a wealth of untapped compute resources. The NIC has a CPU; the disk controller is programmable; some high-end graphics adapters are already more powerful than host CPUs. Some of these CPUs can perform some functions more efficiently than the host CPUs. Our operating systems and programming abstractions should be expanded to let applications tap into these computational resources and make the best use of them. Therefore, we propose the HYDRA framework, which lets application developers use the combined power of every compute resource in a coherent way. HYDRA is a programming model and a runtime support layer which enables utilization of host processors as well as various programmable peripheral devices' processors. We present the framework and its application for a demonstrative use-case, as well as provide a thorough evaluation of its capabilities. Using HYDRA we were able to significantly cut down the development cost of a system that uses multiple heterogeneous compute resources.
TL;DR: In this paper, a central processing unit measurement facility is virtualized in order to support concurrent use of the facility by multiple guests executing within a virtual environment; each guest of the environment has independent control over disablement/enablement of the facility for that guest.
Abstract: A central processing unit measurement facility is virtualized in order to support concurrent use of the facility by multiple guests executing within a virtual environment. Each guest of the environment has independent control over disablement/enablement of the facility for that guest.
TL;DR: In this article, a pipelined processor including an architecture for address generation interlocking is described, the processor including: an instruction grouping unit to detect a read-after-write dependency and to resolve instruction interdependency; and an instruction dispatch unit (IDU) including address generation interlock (AGI) and operand fetching logic for dispatching an instruction to at least one of a load store unit and an execution unit, where the execution unit is configured to write data into a general purpose register bank.
Abstract: A pipelined processor including an architecture for address generation interlocking, the processor including: an instruction grouping unit to detect a read-after-write dependency and to resolve instruction interdependency; an instruction dispatch unit (IDU) including address generation interlock (AGI) and operand fetching logic for dispatching an instruction to at least one of a load store unit and an execution unit; wherein the load store unit is configured with access to a data cache and to return fetched data to the execution unit; wherein the execution unit is configured to write data into a general purpose register bank; and wherein the architecture provides support for bypassing of results of a load multiple instruction for address generation while such instruction is executing in the execution unit before the general purpose register bank is written. A method and a computer system are also provided.
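The dependency check that the instruction grouping unit performs can be illustrated in software. The following is a minimal sketch, assuming each instruction is described by the register sets it writes, reads, and uses to form a memory address. The `Instr` type and both helper functions are hypothetical and only model the check, not the patented hardware or its bypass path.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Instr:
    """Register usage of one instruction (hypothetical model)."""
    name: str
    writes: FrozenSet[int] = frozenset()
    reads: FrozenSet[int] = frozenset()
    # Registers used to form a memory address (base or index registers).
    address_regs: FrozenSet[int] = frozenset()


def has_raw(earlier: Instr, later: Instr) -> bool:
    """True if `later` reads any register that `earlier` writes."""
    return bool(earlier.writes & (later.reads | later.address_regs))


def has_agi(earlier: Instr, later: Instr) -> bool:
    """True if `later` needs, for address generation, a register `earlier` writes."""
    return bool(earlier.writes & later.address_regs)


load_multiple = Instr("LM", writes=frozenset({2, 3}))
load_via_r3 = Instr("L", writes=frozenset({7}), address_regs=frozenset({3}))
add_r5 = Instr("AR", writes=frozenset({6}), reads=frozenset({5}))

print(has_agi(load_multiple, load_via_r3))
print(has_raw(load_multiple, add_r5))
```

In this model the second instruction forms its address from register 3, which the load multiple writes. The abstract describes bypassing the load multiple's results for exactly this kind of case, before the general purpose register bank is written.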
TL;DR: In this article, the authors present an architecture for data processing on a network on chip (NOC) that includes integrated processor (IP) blocks, each of a plurality of the IP blocks including at least one computer processor, each such computer processor implementing hardware threads of execution; low latency, high bandwidth application messaging interconnects; memory communications controllers; network interface controllers; and routers.
Abstract: Data processing on a network on chip (‘NOC’) that includes integrated processor (‘IP’) blocks, each of a plurality of the IP blocks including at least one computer processor, each such computer processor implementing a plurality of hardware threads of execution; low latency, high bandwidth application messaging interconnects; memory communications controllers; network interface controllers; and routers; each of the IP blocks adapted to a router through a separate one of the low latency, high bandwidth application messaging interconnects, a separate one of the memory communications controllers, and a separate one of the network interface controllers; each application messaging interconnect abstracting into an architected state of each processor, for manipulation by computer programs executing on the processor, hardware inter-thread communications among the hardware threads of execution; each memory communications controller controlling communication between an IP block and memory; each network interface controller controlling inter-IP block communications through routers.
TL;DR: In this paper, a compiler for a heterogeneous system that includes both one or more primary processors and one or more parallel co-processors is presented, and the compiler compiles the foreign macro-instructions as if they were predefined functions of the CPU, rather than as remote procedure calls.
Abstract: A compiler for a heterogeneous system that includes both one or more primary processors and one or more parallel co-processors is presented. For at least one embodiment, the primary processor(s) include a CPU and the parallel co-processor(s) include a GPU. Source code for the heterogeneous system may include code to be performed on the CPU but also code segments, referred to as “foreign macro-instructions”, that are to be performed on the GPU. An optimizing compiler for the heterogeneous system comprehends the architecture of both processors, and generates an optimized fat binary that includes machine code instructions for both the primary processor(s) and the co-processor(s). The optimizing compiler compiles the foreign macro-instructions as if they were predefined functions of the CPU, rather than as remote procedure calls. The binary is the result of compiler optimization techniques, and includes prefetch instructions to load code and/or data into the GPU memory concurrently with execution of other instructions on the CPU. Other embodiments are described and claimed.
TL;DR: In this paper, an emulator is provided that uses a dispatch table stored in main memory to convert a guest program counter into a host program counter; the dispatch table is accessed to see if it contains a particular host program counter for a particular guest program counter.
Abstract: Various technologies and techniques are disclosed for using transactional memory hardware to accelerate virtualization or emulation. State isolation can be facilitated by providing isolated private state on transactional memory hardware and storing the stack of a host that is performing an emulation in the isolated private state. Memory accesses performed by a central processing unit can be monitored by software to detect that a guest being emulated has made a self modification to its own code sequence. Transactional memory hardware can be used to facilitate dispatch table updates in multithreaded environments by taking advantage of the atomic commit feature. An emulator is provided that uses a dispatch table stored in main memory to convert a guest program counter into a host program counter. The dispatch table is accessed to see if the dispatch table contains a particular host program counter for a particular guest program counter.
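The dispatch table lookup described above can be sketched as follows. This is a minimal illustration, assuming a plain in-memory dictionary and a caller-supplied translation routine. `DispatchTable`, `resolve`, and `translate` are hypothetical names, and the sketch does not model the transactional memory hardware or the atomic commit used for multithreaded updates.

```python
from typing import Callable, Dict, Optional


class DispatchTable:
    """Maps guest program counters to host program counters (illustrative)."""

    def __init__(self) -> None:
        self._entries: Dict[int, int] = {}

    def lookup(self, guest_pc: int) -> Optional[int]:
        # Return the host program counter if the table holds one, else None.
        return self._entries.get(guest_pc)

    def insert(self, guest_pc: int, host_pc: int) -> None:
        self._entries[guest_pc] = host_pc


def resolve(table: DispatchTable, guest_pc: int, translate: Callable[[int], int]) -> int:
    """Return the host PC for guest_pc, translating and caching it on a miss."""
    host_pc = table.lookup(guest_pc)
    if host_pc is None:
        # Miss: translate the guest code (assumed to be expensive), then
        # record the result so later lookups hit the table.
        host_pc = translate(guest_pc)
        table.insert(guest_pc, host_pc)
    return host_pc


table = DispatchTable()
translated = []  # records which guest PCs actually went through translation


def fake_translate(guest_pc: int) -> int:
    # Stand-in for a real translator; the address offset is arbitrary.
    translated.append(guest_pc)
    return 0x9000 + guest_pc


first = resolve(table, 0x10, fake_translate)
second = resolve(table, 0x10, fake_translate)
```

The second call is served from the table, so the stand-in translator runs only once for that guest program counter.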