TL;DR: Concrete examples are given of problems that arise when defining the semantics of common floating-point implementations, along with solutions that can be implemented in analysis software.
Abstract: Current critical systems often rely heavily on floating-point computations, and thus the testing or static analysis of programs containing floating-point operators has become a priority. However, correctly defining the semantics of common implementations of floating-point is tricky, because semantics may change according to many factors beyond source-code level, such as choices made by compilers. We here give concrete examples of problems that can appear and solutions that can be implemented in analysis software.
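One well-known way in which choices below the source level can change a result is re-association of a sum. The following minimal sketch is an illustration chosen here, not an example taken from the paper; the function names are hypothetical, and it assumes Python floats are IEEE 754 binary64 values, as they are on mainstream platforms.

```python
# Illustrative only: regrouping a floating-point sum changes its value,
# so an analyzer cannot assume the grouping written in the source survives
# if the compiler is allowed to reorder operations.

def sum_left(a, b, c):
    """Evaluate (a + b) + c in double precision."""
    return (a + b) + c


def sum_right(a, b, c):
    """Evaluate a + (b + c) in double precision."""
    return a + (b + c)


big, one = 1e16, 1.0
left = sum_left(big, -big, one)    # large terms cancel first, then 1.0 is added
right = sum_right(big, -big, one)  # 1.0 is absorbed into -1e16 before cancelling
print(left, right, left == right)  # prints: 1.0 0.0 False
```

Both groupings are mathematically equal to 1, yet only the first returns it. A sound analysis therefore has to know which evaluation order the compiled program actually uses.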
TL;DR: A modification of the normal load/store RISC operations adds instructions that permit floating-point operations to be paired with load or store operations.
Abstract: A data processor includes both integer and floating-point operation units and operates as a reduced instruction set computer (RISC). A modification of the normal load/store RISC operations includes in its instruction set some instructions that permit floating-point operations to be paired with load or store operations. These operations include: vector floating point add; vector multiply accumulate; vector floating point multiply; vector multiply subtract; vector reverse subtract; vector round floating point input; vector round integer input; and vector floating point subtract.
TL;DR: The PA7100 CPU, the first precision-architecture, reduced-instruction-set-computer (PA-RISC) architecture implementation to combine an integer core and floating-point coprocessor into a single-chip format, is described.
Abstract: The PA7100 CPU, the first precision-architecture, reduced-instruction-set-computer (PA-RISC) architecture implementation to combine an integer core and floating-point coprocessor into a single-chip format, is described. It incorporates superscalar execution and supports clock rates of up to 100 MHz in standard 0.8-μm CMOS. Features such as a flexible primary cache organization and multiprocessing capability allow the device to be scaled to a variety of system applications, price ranges, and performance levels. The microprocessor instruction execution pipeline, cache design, translation look-aside buffer (TLB) for virtual address translation, floating-point unit, and system interface bus are discussed. The design, test, and verification methods used in the development of the PA7100 are reviewed.
TL;DR: Algorithms that iteratively use a kernel built from a tree of redundant binary adders are described for IEEE double extended multiplication, division, and square root; conversions between 18-digit BCD integers and 64-b binary integers; and transcendental function evaluation.
Abstract: The authors describe a numeric processor with a kernel that is a tree of redundant binary adders and effects either a 17 × 69-b multiply-and-add or a 19 × 69-b multiply with exact redundant binary output and single cycle latency. Feedback paths selectively allow a high-order or low-order part of the adder tree output to be fed back in redundant binary form to the multiplicand and/or addend inputs to the adder tree. The authors describe algorithms iteratively using this adder tree kernel for IEEE double extended multiplication, division, and square root; conversions between 18-digit BCD integers and 64-b binary integers; and transcendental function evaluation. The multiplier design described was implemented in the Cyrix 83D87 numeric coprocessor (typically 33 MHz). Results for this coprocessor as compared with competitive x87 units are included.
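To make concrete what the conversion between 18-digit BCD integers and binary integers computes, here is a minimal software sketch. It is not the adder-tree algorithm from the paper. It assumes the x87-style 10-byte packed-BCD layout (nine little-endian bytes holding two digits each, followed by a byte whose top bit is the sign), and the function names are hypothetical.

```python
def int_to_packed_bcd(value):
    """Encode an integer of at most 18 decimal digits as 10 packed-BCD bytes."""
    if abs(value) >= 10**18:
        raise ValueError("magnitude needs more than 18 decimal digits")
    sign = 0x80 if value < 0 else 0x00
    magnitude = abs(value)
    out = bytearray()
    for _ in range(9):                        # 9 bytes hold 18 digits, two per byte
        magnitude, pair = divmod(magnitude, 100)
        out.append(((pair // 10) << 4) | (pair % 10))
    out.append(sign)                          # assumed layout: byte 9 holds only the sign
    return bytes(out)


def packed_bcd_to_int(data):
    """Decode 10 packed-BCD bytes back into a Python integer."""
    if len(data) != 10:
        raise ValueError("expected exactly 10 bytes")
    magnitude = 0
    for byte in reversed(data[:9]):           # most significant digit pair first
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("invalid BCD digit")
        magnitude = magnitude * 100 + high * 10 + low
    return -magnitude if data[9] & 0x80 else magnitude


encoded = int_to_packed_bcd(-1234)
print(encoded.hex(), packed_bcd_to_int(encoded))
```

The decoder multiplies by 100 and adds once per byte, which is the kind of repeated multiply-and-add that a hardware kernel could iterate; how the described processor actually schedules these steps is not stated in the abstract.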
TL;DR: The SG++ algorithm is extended to the Intel Many Integrated Core Architecture, generating both the host and the coprocessor code, and the ease of porting an application to Intel MIC Architecture is shown: porting existing SSE code is very easy and straightforward.
Abstract: Extracting knowledge from vast datasets is a major challenge in data-driven applications, such as classification and regression, which are mostly compute bound. In this paper, we extend our SG++ algorithm to the Intel® Many Integrated Core Architecture (Intel® MIC Architecture). The ease of porting an application to Intel MIC Architecture is shown: porting existing SSE code is very easy and straightforward. We evaluate the current prototype pre-release coprocessor board codenamed Intel® "Knights Ferry". We utilize the pragma-based offloading programming model offered by the Intel® Composer XE for Intel MIC Architecture, generating both the host and the coprocessor code. We compare the achieved performance with an NVIDIA C2050 accelerator and show that the pre-release Knights Ferry coprocessor delivers better performance than the C2050 and also surpasses it in the productivity of implementing algorithms for the coprocessors.