TL;DR: A fully free and open source software (FOSS) architecture-neutral FPGA framework comprising of Yosys for Verilog synthesis, and nextpnr for placement, routing, and bitstream generation is introduced.
Abstract: This paper introduces a fully free and open source software (FOSS) architecture-neutral FPGA framework comprising of Yosys for Verilog synthesis, and nextpnr for placement, routing, and bitstream generation. Currently, this flow supports two commercially available FPGA families, Lattice iCE40 (up to 8K logic elements) and Lattice ECP5 (up to 85K elements) and has been hardware-proven for custom-computing machines including a low-power neural-network accelerator and an OpenRISC system-on-chip capable of booting Linux. Both Yosys and nextpnr have been engineered in a highly flexible manner to support many of the features present in modern FPGAs by separating architecture-specific details from the common mapping algorithms.This framework is demonstrated on a longest-path case study to find an atypical single source-sink path occupying up to 45% of all on-chip wiring.
TL;DR: PULP (Parallel processing Ultra-Low Power platform), an architecture built on clusters of tightly-coupled OpenRISC ISA cores, with advanced techniques for fast performance and energy scalability that exploit the capabilities of the STMicroelectronics UTBB FD-SOI 28nm technology is proposed.
Abstract: Novel pervasive devices such as smart surveillance cameras and autonomous micro-UAVs could greatly benefit from the availability of a computing device supporting embedded computer vision at a very low power budget. To this end, we propose PULP (Parallel processing Ultra-Low Power platform), an architecture built on clusters of tightly-coupled OpenRISC ISA cores, with advanced techniques for fast performance and energy scalability that exploit the capabilities of the STMicroelectronics UTBB FD-SOI 28nm technology. We show that PULP performance can be scaled over a 1x-354x range, with a peak theoretical energy efficiency of 211 GOPS/W. We present performance results for several demanding kernels from the image processing and vision domain, with post-layout power modeling: a motion detection application that can run at an efficiency up to 192 GOPS/W (90 % of the theoretical peak); a ConvNet-based detector for smart surveillance that can be switched between 0.7 and 27fps operating modes, scaling energy consumption per frame between 1.2 and 12mJ on a 320 ×240 image; and FAST + Lucas-Kanade optical flow on a 128 ×128 image at the ultra-low energy budget of 14 μJ per frame at 60fps.
TL;DR: P predictive instruction-based dynamic clock adjustment as a technique to trim dynamic timing margins in pipelined microprocessors to exploit the different timing requirements for individual instructions during the dynamically varying program execution flow without the need for complex circuit-level measures.
Abstract: Static timing analysis provides the basis for setting the clock period of a microprocessor core, based on its worst-case critical path. However, depending on the design, this critical path is not always excited and therefore dynamic timing margins exist that can theoretically be exploited for the benefit of better speed or lower power consumption (through voltage scaling). This paper introduces predictive instruction-based dynamic clock adjustment as a technique to trim dynamic timing margins in pipelined microprocessors. To this end, we exploit the different timing requirements for individual instructions during the dynamically varying program execution flow without the need for complex circuit-level measures to detect and correct timing violations. We provide a design flow to extract the dynamic timing information for the design using post-layout dynamic timing analysis and we integrate the results into a custom cycle-accurate simulator. This simulator allows annotation of individual instructions with their impact on timing (in each pipeline stage) and rapidly derives the overall code execution time for complex benchmarks. The design methodology is illustrated at the microarchitecture level, demonstrating the performance and power gains possible on a 6-stage OpenRISC in-order general purpose processor core in a 28nm CMOS technology. We show that employing instruction-dependent dynamic clock adjustment leads on average to an increase in operating speed by 38% or to a reduction in power consumption by 24%, compared to traditional synchronous clocking, which at all times has to respect the worst-case timing identified through static timing analysis.
TL;DR: This paper partitions the processor core into reconfigurable modules and duplicate these modules to implement a concurrent error detection mechanism and generates precompiled configurations which include spare resources and are used to runtime repair the defective module.
Abstract: The increasing use of field programmable devices for the implementation of embedded processors and systems-on-chip even in mission-critical applications demands for fault tolerant techniques to improve reliability and extend system lifetime. Furthermore, the runtime partial reconfiguration potentials of the latest FPGA devices along with the availability of unused programmable resources in most FPGA designs provide interesting opportunities to build fault tolerant mechanisms. In this paper, we exploit the latest dynamic reconfiguration advances and propose a fault-tolerant FPGA processor architecture based on runtime reconfigurable modules. We partition the processor core into reconfigurable modules and duplicate these modules to implement a concurrent error detection mechanism. For every duplicated module we generate precompiled configurations which include spare resources and are used to runtime repair the defective module. The processor freezes upon the detection of an error and an on-chip controller coordinates the processor recovery and repair in a reconfiguration process transparent to the processor. We demonstrate the proposed approach in OpenRISC core, a widely-used open-source soft processor.
TL;DR: This paper proposes a unified processor power modeling methodology for the creation of power models at multiple granularity levels that can be quickly mapped to an ESL design flow and demonstrates the usefulness of having multiple power models.
Abstract: With power becoming a major constraint for multiprocessor embedded systems, it is becoming important for designers to characterize and model processor power dissipation. It is critical for these processor power models to be useable across various modeling abstractions in an electronic system level (ESL) design flow, to guide early design decisions. In this paper, we propose a unified processor power modeling methodology for the creation of power models at multiple granularity levels that can be quickly mapped to an ESL design flow. Our experimental results based on applying the proposed methodology on the OpenRISC and MIPS processors demonstrate the usefulness of having multiple power models. The generated models range from very high-level two-state and architectural/instruction set simulator models that can be used in transaction level models, to extremely detailed cycle-accurate models that enable early exploration of power optimization techniques. These models offer a designer tremendous flexibility to trade off estimation accuracy with estimation/simulation effort.