About: Instructions per second is a research topic. Over its lifetime, 207 publications have been published within this topic, receiving 4,739 citations. The topic is also known as: IPS.
TL;DR: FireSim is an open-source simulation platform that enables cycle-exact microarchitectural simulation of large scale-out clusters by combining FPGA-accelerated simulation of silicon-proven RTL designs with a scalable, distributed network simulation.
Abstract: We present FireSim, an open-source simulation platform that enables cycle-exact microarchitectural simulation of large scale-out clusters by combining FPGA-accelerated simulation of silicon-proven RTL designs with a scalable, distributed network simulation. Unlike prior FPGA-accelerated simulation tools, FireSim runs on Amazon EC2 F1, a public cloud FPGA platform, which greatly improves usability, provides elasticity, and lowers the cost of large-scale FPGA-based experiments. We describe the design and implementation of FireSim and show how it can provide sufficient performance to run modern applications at scale, to enable true hardware-software co-design. As an example, we demonstrate automatically generating and deploying a target cluster of 1,024 3.2 GHz quad-core server nodes, each with 16 GB of DRAM, interconnected by a 200 Gbit/s network with 2 microsecond latency, which simulates at a 3.4 MHz processor clock rate (less than 1,000x slowdown over real-time). In aggregate, this FireSim instantiation simulates 4,096 cores and 16 TB of memory, runs ∼14 billion instructions per second, and harnesses 12.8 million dollars worth of FPGAs, at a total cost of only ∼$100 per simulation hour to the user. We present several examples to show how FireSim can be used to explore various research directions in warehouse-scale machine design, including modeling networks with high-bandwidth and low-latency, integrating arbitrary RTL designs for a variety of commodity and specialized datacenter nodes, and modeling a variety of datacenter organizations, as well as reusing the scale-out FireSim infrastructure to enable fast, massively parallel cycle-exact single-node microarchitectural experimentation.
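The aggregate figures in this abstract follow from the per-node ones, and a short calculation makes the relationship explicit. The sketch below is only a back-of-the-envelope check: the function name `firesim_aggregates` is hypothetical, and the value of one instruction per simulated cycle per core (`ASSUMED_IPC`) is an assumption chosen for illustration, not a figure from the abstract.

```python
# Figures quoted in the abstract.
NODES = 1024
CORES_PER_NODE = 4
DRAM_PER_NODE_GB = 16
TARGET_CLOCK_HZ = 3.2e9   # clock rate of the modeled processors
SIM_CLOCK_HZ = 3.4e6      # rate at which the simulated clock advances

# Assumption (not stated in the abstract): roughly one instruction
# retired per simulated cycle on every core.
ASSUMED_IPC = 1.0


def firesim_aggregates(ipc=ASSUMED_IPC):
    """Derive cluster-wide figures from the per-node figures."""
    total_cores = NODES * CORES_PER_NODE
    return {
        "total_cores": total_cores,
        "total_dram_tb": NODES * DRAM_PER_NODE_GB / 1024,
        "slowdown": TARGET_CLOCK_HZ / SIM_CLOCK_HZ,
        "aggregate_ips": total_cores * SIM_CLOCK_HZ * ipc,
    }


if __name__ == "__main__":
    for name, value in firesim_aggregates().items():
        print(f"{name}: {value:,.1f}")
```

Under that assumption, 4,096 cores advancing at 3.4 MHz give about 13.9 billion instructions per second, in line with the quoted ∼14 billion, and the slowdown of roughly 941x is consistent with the stated "less than 1,000x".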
TL;DR: An 8-bit microprocessor made from plastic electronic technology directly on flexible plastic foil can execute user-defined programs and has attractive features for integration on everyday objects, where it could be programmed as a calculator, timer, or game controller.
Abstract: Forty years after the first silicon microprocessors, we demonstrate an 8-bit microprocessor made from plastic electronic technology directly on flexible plastic foil. The operation speed is today limited to 40 instructions per second. The power consumption is as low as 100 μW. The ALU-foil operates at a supply voltage of 10 V and back-gate voltage of 50 V. The microprocessor can execute user-defined programs: we demonstrate the execution of the multiplication of two 4-bit numbers and the calculation of the moving average of a string of incoming 6-bit numbers. To execute such dedicated tasks on the microprocessor, we create small plastic circuits that generate the sequences of appropriate instructions. The near transparency, mechanical flexibility, and low power consumption of the processor are attractive features for integration on everyday objects, where it could be programmed as, amongst other items, a calculator, timer, or game controller.
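To put the quoted speed and power figures in perspective, the sketch below derives an energy per instruction and a run time for a program of a given length. It assumes that the 40 instructions per second and the 100 μW figures apply at the same operating point, which the abstract does not state explicitly; the function names are hypothetical.

```python
# Figures quoted in the abstract.
INSTRUCTIONS_PER_SECOND = 40
POWER_WATTS = 100e-6  # "as low as 100 uW"


def energy_per_instruction_joules(power_w=POWER_WATTS,
                                  ips=INSTRUCTIONS_PER_SECOND):
    """Energy per instruction, assuming both figures hold simultaneously."""
    return power_w / ips


def runtime_seconds(n_instructions, ips=INSTRUCTIONS_PER_SECOND):
    """Wall-clock time to execute a program of n_instructions."""
    return n_instructions / ips


if __name__ == "__main__":
    print(f"energy per instruction: {energy_per_instruction_joules():.2e} J")
    # The instruction count below is an arbitrary illustrative value,
    # not the length of any program described in the abstract.
    print(f"time for 100 instructions: {runtime_seconds(100)} s")
```

Under that assumption, each instruction costs on the order of a few microjoules, and a program of 100 instructions would take 2.5 seconds.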
TL;DR: A system that determines a bound on the energy savings for an application is developed and applied to three scientific programs, two of which exhibit load imbalance: particle simulation and UMT2K.
Abstract: Power is now a first-order design constraint in large-scale parallel computing. Used carefully, dynamic voltage scaling can execute parts of a program at a slower CPU speed to achieve energy savings with a relatively small (possibly zero) time delay. However, the problem of when to change frequencies in order to optimize energy savings is NP-complete, which has led to many heuristic energy-saving algorithms. To determine how closely these algorithms approach optimal savings, we developed a system that determines a bound on the energy savings for an application. Our system uses a linear programming solver that takes as inputs the application communication trace and the cluster power characteristics and then outputs a schedule that realizes this bound. We apply our system to three scientific programs, two of which exhibit load imbalance: particle simulation and UMT2K. Results from our bounding technique show particle simulation is more amenable to energy savings than UMT2K.
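The claim that slowing the CPU can save energy with zero time delay is easiest to see with load imbalance: a lightly loaded node that would otherwise wait at a synchronization point can run slower and still arrive on time. The sketch below is a toy illustration of that slack argument only. It is not the linear programming formulation described in the abstract, and it assumes a simplified model in which dynamic power scales with the cube of frequency (so energy per cycle scales with frequency squared) and idle power is ignored. All names and workload numbers are hypothetical.

```python
def min_frequency_without_delay(cycles, deadline_s, f_max_hz):
    """Lowest frequency that still finishes `cycles` by `deadline_s`."""
    f = cycles / deadline_s
    if f > f_max_hz:
        raise ValueError("work cannot finish by the deadline even at f_max")
    return f


def relative_dynamic_energy(f_hz, f_max_hz):
    """Energy relative to running the same cycles at f_max.

    Assumed model: power ~ f**3, time ~ 1/f per cycle, so energy ~ f**2.
    """
    return (f_hz / f_max_hz) ** 2


if __name__ == "__main__":
    f_max = 2.0e9         # hypothetical top frequency, in Hz
    busy_cycles = 10e9    # hypothetical work on the most loaded node
    light_cycles = 6e9    # hypothetical work on a lightly loaded node

    # The busiest node, running at f_max, sets the synchronization time.
    deadline = busy_cycles / f_max
    f_light = min_frequency_without_delay(light_cycles, deadline, f_max)
    saving = 1.0 - relative_dynamic_energy(f_light, f_max)
    print(f"light node can run at {f_light / 1e9:.2f} GHz, "
          f"saving {saving:.0%} of its dynamic energy with no delay")
```

In this toy model the savings grow with the amount of slack, which is consistent with the abstract's observation that programs with load imbalance offer opportunities for energy savings.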
TL;DR: In this article, the authors focus on optimization techniques for enhancing the performance of caches, which are intended to hide both the low bandwidth and the long latency of main memory relative to the floating-point performance of CPUs.
Abstract: In order to mitigate the impact of the growing gap between CPU speed and main memory performance, today’s computer architectures implement hierarchical memory structures. The idea behind this approach is to hide both the low main memory bandwidth and the latency of main memory accesses, which are slow in contrast to the floating-point performance of the CPUs. At the top of the hierarchy sits a small and expensive high-speed memory, usually integrated within the processor chip, that provides data with low latency and high bandwidth; i.e., the CPU registers. Moving further away from the CPU, the layers of memory successively become larger and slower. The memory components located between the processor core and main memory are called cache memories or caches. They are intended to contain copies of main memory blocks to speed up accesses to frequently needed data [378], [392]. The next lower level of the memory hierarchy is the main memory, which is large but also comparatively slow. While external memory such as hard disk drives or remote memory components in a distributed computing environment represent the lower end of any common hierarchical memory design, this paper focuses on optimization techniques for enhancing cache performance.
TL;DR: The Yorktown Simulation Engine is a special-purpose, highly-parallel programmable machine for the gate-level simulation of logic that can simulate up to one million gates at a speed of over two billion gate simulations per second.
Abstract: The Yorktown Simulation Engine (YSE) is a special-purpose, highly-parallel programmable machine for the gate-level simulation of logic. It can simulate up to one million gates at a speed of over two billion gate simulations per second; it is estimated that the IBM 3081 processor could have been simulated on the YSE at a rate of 1000 instructions per second. This is far beyond the capabilities of existing register-level software simulators. The YSE has been designed and is being constructed at the IBM T. J. Watson Research Center. This paper introduces the YSE and describes its top-level architecture.
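The two throughput figures in this abstract can be combined into a rough cost of simulating one instruction at the gate level. The sketch below assumes the estimate of 1000 instructions per second corresponds to the machine running at its quoted rate of two billion gate simulations per second (the abstract says "over" two billion, so the result is a lower-end figure); the function name is hypothetical.

```python
# Figures quoted in the abstract.
GATE_SIMS_PER_SECOND = 2e9   # "over two billion gate simulations per second"
ESTIMATED_IPS_3081 = 1000    # estimated simulated IBM 3081 instructions/second


def gate_sims_per_instruction(gate_rate=GATE_SIMS_PER_SECOND,
                              ips=ESTIMATED_IPS_3081):
    """Gate simulations spent per simulated instruction (rough ratio)."""
    return gate_rate / ips


if __name__ == "__main__":
    print(f"{gate_sims_per_instruction():,.0f} gate simulations per instruction")
```

Under that assumption, each simulated instruction costs on the order of two million gate simulations, which gives a sense of why gate-level simulation rates expressed in instructions per second are so low.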