TL;DR: In this article, the authors present results of accuracy evaluation of numerous numerical algorithms for the numerical approximation of the Inverse Laplace Transform, including Stehfest, Abate and Whitt, Vlach and Singhai.
Abstract: In this paper we present the results of an accuracy evaluation of numerous algorithms for the numerical approximation of the Inverse Laplace Transform. The selected algorithms represent diverse lines of approach to this problem and include methods by Stehfest, Abate and Whitt, Vlach and Singhai, De Hoog, Talbot, Zakian, and one in which the FFT is applied to accelerate the convergence of the Fourier series. We use the C++ and Python languages with arbitrary-precision mathematical libraries to address some crucial issues of numerical implementation. The test set includes Laplace transforms considered difficult to compute as well as others commonly applied in fractional calculus. The evaluation results allow us to conclude that the Talbot method, which involves deformed Bromwich contour integration, and the De Hoog and the Abate and Whitt methods, which use Fourier series expansion with accelerated convergence, can be regarded as general-purpose high-accuracy algorithms. They can be applied to a wide variety of inversion problems.
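To give a feel for why arbitrary precision matters for this family of methods, here is a minimal sketch of the classical Gaver-Stehfest formula, f(t) ≈ (ln 2 / t) Σ V_k F(k ln 2 / t), written with Python's standard `decimal` and `fractions` modules. The function names `stehfest_weights` and `invert_stehfest`, the default of 16 terms, and the 60-digit working precision are assumptions made for illustration; the abstract does not show the authors' implementation.

```python
from decimal import Decimal, getcontext
from fractions import Fraction
from math import factorial


def stehfest_weights(n):
    """Exact Gaver-Stehfest weights V_1..V_n as Fractions (n must be even)."""
    if n < 2 or n % 2:
        raise ValueError("n must be a positive even integer")
    half = n // 2
    weights = []
    for k in range(1, n + 1):
        total = Fraction(0)
        for j in range((k + 1) // 2, min(k, half) + 1):
            num = j ** half * factorial(2 * j)
            den = (factorial(half - j) * factorial(j) * factorial(j - 1)
                   * factorial(k - j) * factorial(2 * j - k))
            total += Fraction(num, den)
        weights.append((-1) ** (k + half) * total)
    return weights


def invert_stehfest(F, t, n=16, digits=60):
    """Approximate f(t) from its Laplace transform F(s) at a single t > 0.

    F must accept and return Decimal values. The weights alternate in sign
    and grow quickly with n, so the sum is evaluated with `digits`
    significant decimal digits to limit cancellation error.
    """
    getcontext().prec = digits
    t = Decimal(t)
    ln2 = Decimal(2).ln()
    acc = Decimal(0)
    for k, v in enumerate(stehfest_weights(n), start=1):
        vk = Decimal(v.numerator) / Decimal(v.denominator)
        acc += vk * F(k * ln2 / t)
    return acc * ln2 / t


if __name__ == "__main__":
    # F(s) = 1 / (s + 1) is the transform of f(t) = exp(-t).
    approx = invert_stehfest(lambda s: 1 / (s + 1), 1)
    print(approx, Decimal(-1).exp())
```

The weights are kept as exact fractions and only converted to `Decimal` when they are multiplied by samples of F, so the only rounding happens at the chosen working precision.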
TL;DR: This work proposes novel dynamic programming algorithms that alleviate the curse of dimensionality in problems that exhibit certain low-rank structure, and demonstrates the algorithms running in real time on board a quadcopter during a flight experiment under motion capture.
Abstract: Motion planning and control problems are embedded and essential in almost all robotics applications. These problems are often formulated as stochastic optimal control problems and solved using dynamic programming algorithms. Unfortunately, most existing algorithms that guarantee convergence to optimal solutions suffer from the curse of dimensionality: the run time of the algorithm grows exponentially with the dimension of the state space of the system. We propose novel dynamic programming algorithms that alleviate the curse of dimensionality in problems that exhibit certain low-rank structure. The proposed algorithms are based on continuous tensor decompositions recently developed by the authors. Essentially, the algorithms represent high-dimensional functions, e.g., the value function, in a compressed format, and directly perform dynamic programming computations, e.g., value iteration and policy iteration, in this format. Under certain technical assumptions, the new algorithms guarantee convergence towards optimal solutions with arbitrary precision. Furthermore, the run times of the new algorithms scale polynomially with the state dimension and polynomially with the ranks of the value function. This approach realizes substantial computational savings in "compressible" problem instances, where value functions admit low-rank approximations. We demonstrate the new algorithms in a wide range of problems, including a simulated six-dimensional agile quadcopter maneuvering example and a seven-dimensional aircraft perching example. In some of these examples, we estimate computational savings of up to 10 orders of magnitude over standard value iteration algorithms. We further demonstrate the algorithms running in real time on board a quadcopter during a flight experiment under motion capture.
TL;DR: A practical, compact, and more quantum-resistant variant of the BLISS Ideal Lattice Signature Scheme is developed, and it is demonstrated that arithmetic decoding from a uniform source to a target distribution is also an optimal non-uniform sampling method in the sense that a minimal amount of true random bits is required.
Abstract: We describe new arithmetic coding techniques and side-channel blinding countermeasures for lattice-based cryptography. Using these techniques, we develop a practical, compact, and more quantum-resistant variant of the BLISS Ideal Lattice Signature Scheme. We first show how the BLISS parameters and hash-based random oracle can be modified to be more secure against quantum pre-image attacks while optimizing signature size. Arithmetic Coding offers an information theoretically optimal compression for stationary and memoryless sources, such as the discrete Gaussian distributions often present in lattice-based cryptography. We show that this technique gives better signature sizes than the previously proposed advanced Huffman-based signature compressors. We further demonstrate that arithmetic decoding from a uniform source to a target distribution is also an optimal non-uniform sampling method in the sense that a minimal amount of true random bits is required. Performance of this new Binary Arithmetic Coding sampler is comparable to other practical samplers. The same code, tables, or circuitry can be utilized for both tasks, eliminating the need for separate sampling and compression components. We then describe simple randomized blinding techniques that can be applied to anti-cyclic polynomial multiplication to mask timing and power consumption side-channels in ring arithmetic. We further show that the Gaussian sampling process can also be blinded by a split-and-permute technique as an effective countermeasure against side-channel attacks.
TL;DR: In this paper, the authors present an implementation of arbitrary-precision numerical integration with rigorous error bounds in the Arb library, which combines adaptive bisection with adaptive Gaussian quadrature where error bounds are determined via complex magnitudes.
Abstract: We present an implementation of arbitrary-precision numerical integration with rigorous error bounds in the Arb library. Rapid convergence is ensured for piecewise complex analytic integrals by use of the Petras algorithm, which combines adaptive bisection with adaptive Gaussian quadrature where error bounds are determined via complex magnitudes without evaluating derivatives. The code is general, easy to use, and efficient, often outperforming existing non-rigorous software.
TL;DR: The simulation results indicate that the latency of the proposed novel binary multiplier systems (8-bit, 16-bit and 24-bit) is significantly shorter than that of existing implementations.
Abstract: Arithmetic Logic Units (ALUs) are very important components of the processor, which perform various arithmetic and logical operations such as multiplication, division, addition, subtraction, cubing, squaring, etc. Of all these operations, multiplication is the most elementary and most frequently used operation in ALUs. Multiplication also forms the basis of many other complex arithmetic operations such as cubing, squaring, convolution, etc. This paper presents a modified novel multi-precision binary multiplier architecture that achieves reduced latency/delay and area/hardware utilization compared with existing implementations of binary multiplication. This system will function as the second stage of a novel multi-precision binary multiplier system. The system was implemented using Xilinx 14.2 ISE and simulated with ISIM, which is available in Xilinx 14.2 ISE. The simulation results indicate that the latency of the proposed novel binary multiplier systems (8-bit, 16-bit and 24-bit) is significantly shorter than that of existing implementations.
TL;DR: A strategy for rigorous arbitrary-precision evaluation of Legendre polynomials on the unit interval and its application in the generation of Gauss-Legendre quadrature rules with simultaneous high degree and precision is described.
Abstract: We describe a strategy for rigorous arbitrary-precision evaluation of Legendre polynomials on the unit interval and its application in the generation of Gauss-Legendre quadrature rules. Our focus is on making the evaluation practical for a wide range of realistic parameters, corresponding to the requirements of numerical integration to an accuracy of about 100 to 100 000 bits. Our algorithm combines the summation by rectangular splitting of several types of expansions in terms of hypergeometric series with a fixed-point implementation of Bonnet's three-term recurrence relation. We then compute rigorous enclosures of the Gauss-Legendre nodes and weights using the interval Newton method. We provide rigorous error bounds for all steps of the algorithm. The approach is validated by an implementation in the Arb library, which achieves order-of-magnitude speedups over previous code for computing Gauss-Legendre rules with simultaneous high degree and precision.
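The two ingredients named in the abstract, Bonnet's three-term recurrence and a Newton iteration on the roots of P_n, can be sketched in ordinary double precision as follows. This is only an illustration of the underlying formulas: it uses plain floating-point Newton steps rather than the interval Newton method, provides no error bounds, and does not use Arb; the function names `legendre_pair` and `gauss_legendre` and the cosine starting guess are assumptions made for the example.

```python
import math


def legendre_pair(n, x):
    """Return (P_n(x), P_{n-1}(x)) for n >= 1 via Bonnet's recurrence:
    (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x)."""
    p_prev, p = 1.0, x
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p, p_prev


def legendre_derivative(n, x, p, p_prev):
    """P_n'(x) = n (x P_n(x) - P_{n-1}(x)) / (x^2 - 1), valid for |x| < 1."""
    return n * (x * p - p_prev) / (x * x - 1.0)


def gauss_legendre(n, max_iter=50):
    """Non-rigorous n-point Gauss-Legendre nodes and weights on [-1, 1]."""
    nodes, weights = [], []
    for i in range(1, n + 1):
        # Standard asymptotic starting guess for the i-th root.
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))
        for _ in range(max_iter):
            p, p_prev = legendre_pair(n, x)
            step = p / legendre_derivative(n, x, p, p_prev)
            x -= step
            if abs(step) < 1e-15:
                break
        p, p_prev = legendre_pair(n, x)
        dp = legendre_derivative(n, x, p, p_prev)
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


if __name__ == "__main__":
    xs, ws = gauss_legendre(5)
    # A 5-point rule is exact for polynomials up to degree 9.
    print(sum(w * x ** 8 for x, w in zip(xs, ws)), 2 / 9)
```

A rigorous version would replace each float with an enclosure and each Newton step with its interval counterpart, which is the part the paper's implementation supplies.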
TL;DR: This paper presents an extensive case study on low power ARM Cortex-A53 CPU, and shows how it can generate 1-bit, 2-bit convolutions with speedups up to 16x over an optimized 16-bit integer baseline and 2.3x better than handwritten implementations.
Abstract: State of the art deep learning models have made steady progress in the fields of computer vision and natural language processing, at the expense of growing model sizes and computational complexity. Deploying these models on low power and mobile devices poses a challenge due to their limited compute capabilities and strict energy budgets. One solution that has generated significant research interest is deploying highly quantized models that operate on low precision inputs and weights less than eight bits, trading off accuracy for performance. These models have a significantly reduced memory footprint (up to 32x reduction) and can replace multiply-accumulates with bitwise operations during compute intensive convolution and fully connected layers.
Most deep learning frameworks rely on highly engineered linear algebra libraries such as ATLAS or Intel's MKL to implement efficient deep learning operators. To date, none of the popular deep learning frameworks directly support low precision operators, partly due to a lack of optimized low precision libraries. In this paper we introduce a workflow to quickly generate high performance low precision deep learning operators for arbitrary precision that target multiple CPU architectures and include optimizations such as memory tiling and vectorization. We present an extensive case study on a low power ARM Cortex-A53 CPU, and show how we can generate 1-bit and 2-bit convolutions with speedups of up to 16x over an optimized 16-bit integer baseline and 2.3x better than handwritten implementations.
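The idea of replacing multiply-accumulates with bitwise operations can be illustrated with a bit-serial dot product: each operand vector is split into bit-planes, and every pair of planes contributes `popcount(a_plane & w_plane)` shifted by the combined bit position. The sketch below assumes unsigned (unipolar) values and uses Python integers as stand-ins for packed machine words; the helper names are hypothetical, and this is not the code the paper's workflow generates.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def pack_bitplanes(values, bits):
    """Split unsigned `bits`-bit values into bit-planes.

    Plane i is an integer whose bit j equals bit i of values[j].
    """
    planes = [0] * bits
    for j, v in enumerate(values):
        if not 0 <= v < (1 << bits):
            raise ValueError("value does not fit in the given bit width")
        for i in range(bits):
            planes[i] |= ((v >> i) & 1) << j
    return planes


def bitserial_dot(activations, weights, a_bits, w_bits):
    """Dot product computed with AND, popcount and shifts only."""
    a_planes = pack_bitplanes(activations, a_bits)
    w_planes = pack_bitplanes(weights, w_bits)
    total = 0
    for i, a_plane in enumerate(a_planes):
        for j, w_plane in enumerate(w_planes):
            total += popcount(a_plane & w_plane) << (i + j)
    return total


if __name__ == "__main__":
    acts = [3, 1, 0, 2, 3, 1]   # 2-bit activations
    wts = [1, 0, 1, 1, 1, 0]    # 1-bit weights
    print(bitserial_dot(acts, wts, a_bits=2, w_bits=1))
    print(sum(a * w for a, w in zip(acts, wts)))
```

The work grows with the product of the two bit widths, which is why this approach pays off mainly at very low precision such as the 1-bit and 2-bit cases in the case study.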
TL;DR: An implementation of large integer multiplication using Single Instruction Multiple Data (SIMD) instructions using a reduced-radix representation is proposed and the execution time and the number of instructions are compared against the GNU Multiple Precision Arithmetic Library (GMP).
Abstract: In this paper, we propose an implementation of large integer multiplication using Single Instruction Multiple Data (SIMD) instructions. We evaluated the implementation on an Intel Xeon Phi processor. The second generation Intel Xeon Phi processor, Knights Landing, has a set of Advanced Vector Extensions-512 (AVX-512) instructions. Using AVX-512, the processor can handle 512 bits at the same time and has the potential to multiply faster than a processor using Streaming SIMD Extensions (SSE) and AVX. Therefore, we applied AVX-512F (foundation) instructions to the program. In the multiplication of large integers, as the number of digits increases, various processing costs also become larger. One of these costs is carry processing. Therefore, we implemented a multiplication function using a reduced-radix representation and compared the execution time and the number of instructions against the GNU Multiple Precision Arithmetic Library (GMP). Furthermore, we used some optimization techniques for this kernel. We successfully achieved an execution time that was approximately 2.5x faster than GMP on the Knights Landing architecture.
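A reduced-radix representation stores fewer bits per limb than the machine word holds, so that partial products can be accumulated column by column and carries resolved in one pass at the end. The sketch below models this with Python integers; the 28-bit limb size is an assumption chosen for illustration (the abstract does not state the radix used), and no SIMD is involved.

```python
RADIX_BITS = 28                      # assumed limb size, for illustration only
MASK = (1 << RADIX_BITS) - 1


def to_limbs(n):
    """Split a non-negative integer into little-endian 28-bit limbs."""
    limbs = []
    while n:
        limbs.append(n & MASK)
        n >>= RADIX_BITS
    return limbs or [0]


def from_limbs(limbs):
    """Recombine little-endian limbs into an integer."""
    return sum(limb << (RADIX_BITS * i) for i, limb in enumerate(limbs))


def multiply(a, b):
    """Schoolbook product of two limb vectors with deferred carries."""
    cols = [0] * (len(a) + len(b))
    # Accumulate partial products without touching carries. Each product is
    # below 2**56, so a 64-bit lane could absorb up to 256 of them before
    # overflowing; that headroom is what the reduced radix buys.
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            cols[i + j] += x * y
    # One carry-propagation pass normalises every column back to 28 bits.
    carry = 0
    for k in range(len(cols)):
        value = cols[k] + carry
        cols[k] = value & MASK
        carry = value >> RADIX_BITS
    return cols


if __name__ == "__main__":
    x, y = 3 ** 200 + 12345, 7 ** 150
    print(from_limbs(multiply(to_limbs(x), to_limbs(y))) == x * y)
```

In a vectorised version the inner accumulation is the part that maps onto wide multiply and add instructions, while the final carry pass stays sequential but runs only once.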
TL;DR: This paper designs and implements an FPTAS with excellent computational performance and shows that it is faster than an exact algorithm even for small problem instances and small approximation factors, becoming orders of magnitude faster as the problem size increases.
Abstract: We propose a computationally efficient fully polynomial-time approximation scheme (FPTAS) to compute an approximation with arbitrary precision of the value function of convex stochastic dynamic programs, using the technique of $K$-approximation sets and functions introduced by Halman et al. [Math. Oper. Res., 34, (2009), pp. 674--685]. This paper deals with the convex case only, and it has the following contributions. First, we improve on the worst-case running time given by Halman et al. Second, we design and implement an FPTAS with excellent computational performance and show that it is faster than an exact algorithm even for small problem instances and small approximation factors, becoming orders of magnitude faster as the problem size increases. Third, we show that with careful algorithm design, the errors introduced by floating point computations can be bounded, so that we can provide a guarantee on the approximation factor over an exact infinite-precision solution. We provide an extensive computatio...
TL;DR: In this article, the authors describe a strategy for rigorous arbitrary-precision evaluation of Legendre polynomials on the unit interval and its application in the generation of Gauss-Legendre quadrature rules.
Abstract: We describe a strategy for rigorous arbitrary-precision evaluation of Legendre polynomials on the unit interval and its application in the generation of Gauss-Legendre quadrature rules. Our focus is on making the evaluation practical for a wide range of realistic parameters, corresponding to the requirements of numerical integration to an accuracy of about 100 to 100 000 bits. Our algorithm combines the summation by rectangular splitting of several types of expansions in terms of hypergeometric series with a fixed-point implementation of Bonnet's three-term recurrence relation. We then compute rigorous enclosures of the Gauss-Legendre nodes and weights using the interval Newton method. We provide rigorous error bounds for all steps of the algorithm. The approach is validated by an implementation in the Arb library, which achieves order-of-magnitude speedups over previous code for computing Gauss-Legendre rules with simultaneous high degree and precision.
TL;DR: This work proposes an accurate geometry-based method for the local repair of intersecting meshes that needs neither manipulation of the input data nor arbitrary precision arithmetic.
Abstract: Triangular meshes are important in many fields in both basic and applied research that rely on their correctness and accuracy. Many operations with meshes can lead to undesirable situations and the resulting models can be damaged and further unusable. Self-intersection and mesh-to-mesh intersection are types of operations that are often present and can cause such problems. We propose an accurate geometry-based method for local repair of intersecting meshes. The state-of-the-art methods either solve the problem inaccurately, or use methods such as arbitrary precision arithmetic or virtual perturbation to deal with the troublesome boundary cases. Our method represents a robust way to repair intersecting meshes accurately without the need to manipulate the input data or to employ arbitrary precision arithmetic. The correct solution is obtained through a careful classification of the cases that could result from the numerical imprecision of floating point arithmetic.
TL;DR: In this article, Pseudo-Spectral Collocation (PSC) is used in combination with high-order precision arithmetic for Numerical Relativity problems with high accuracy and performance requirements.
Abstract: Numerical Relativity is a mature field with many applications in Astrophysics, Cosmology and even in Fundamental Physics. As such, we are entering a stage in which new sophisticated methods adapted to open problems are being developed. In this paper, we advocate the use of Pseudo-Spectral Collocation (PSC) methods in combination with high-order precision arithmetic for Numerical Relativity problems with high accuracy and performance requirements. The PSC method provides exponential convergence (for smooth problems, as is the case in many problems in Numerical Relativity) and we can use different bit precision without the need to change the structure of the numerical algorithms. Moreover, the PSC method provides high-compression storage of the information. We introduce a series of techniques for combining these tools and show their potential in two problems in relativistic gravitational collapse: (i) The classical Choptuik collapse, estimating with arbitrary precision the location of the apparent horizon. (ii) Collapse in asymptotically anti-de Sitter spacetimes, showing that the total energy is preserved by the numerical evolution to a very high degree of precision.
TL;DR: A new algorithm is proposed that approximates the average of a set of measurements made by a sensor network with arbitrary precision, controlled by two parameters, and requires O(D) rounds, where D is the diameter of the network.
Abstract: We propose a new algorithm for the classical averaging problem for distributed wireless sensor networks. This subject has been studied extensively and there are many clever algorithms in the literature. These algorithms are based on the idea of local exchange of information. They behave well in dense networks (e.g., in networks whose connections form a complete graph), but their convergence to the real average is very slow in linear or cyclic graphs. Our solution is different. In order to calculate the average, we first build an approximate histogram of observed data; then, from this histogram, we estimate the average. In our solution, we use the extreme propagation technique and probabilistic counters. It allows us to approximate the average of a set of measurements made by a sensor network with arbitrary precision, controlled by two parameters. Our method requires O(D) rounds, where D is the diameter of the network. We study the message complexity of this algorithm and show that it is of order O(log n) for each node, where n is the size of the network.
Abstract: A complete set of radiating "outwards" eigensolutions of the Helmholtz equation, obtained by appropriately transforming the kernel of the Laplace equation through the Vekua mapping, is applied to the investigation of acoustic scattering by penetrable prolate spheroidal scatterers. The scattered field is expanded in terms of this set, thereby bypassing the standard spheroidal wave functions along with their inherent numerical deficiencies. The coefficients of the expansion are provided by the solution of linear systems, the conditioning of which calls for arbitrary precision arithmetic. Incorporating it enables a polyparametric investigation of the convergence of the current approach to the solution of the direct scattering problem. Finally, far-field pattern visualization in 3D space clarifies the preferred scattering directions for several frequencies of the incident wave, ranging from the "low" to the "resonance" region.
TL;DR: Simulation results show that design of floating point matrix multiplier is better at 45 nm than 180 nm technology node in terms of lesser delay by 43% and energy-delay product by 97.86% at 1 V.
Abstract: This paper presents a single-precision floating point (IEEE 754 standard) matrix multiplier module. This is constructed using subblocks, which include floating point adder and floating point multiplier. These subblocks are designed to achieve the goal of low power consumption. Different architectures of subblocks are compared on the basis of energy-delay product. Design and simulations have been performed for 180 and 45 nm technology node. Simulation results show that design of floating point matrix multiplier is better at 45 nm than 180 nm technology node in terms of lesser delay by 43% and energy-delay product by 97.86% at 1 V. Also, 45 nm technology cells occupy only 6.25% of the area as compared to 180 nm cells.
TL;DR: The CADNA library has been improved to enable the estimation of rounding errors in programs using quadruple precision floating-point variables, i.e. variables with a 113-bit mantissa.
Abstract: Discrete Stochastic Arithmetic (DSA) enables one to estimate rounding errors and to detect numerical instabilities in simulation programs. DSA is implemented in the CADNA library, which can analyze the numerical quality of single and double precision programs. In this article, we show how the CADNA library has been improved to enable the estimation of rounding errors in programs using quadruple precision floating-point variables, i.e. variables with a 113-bit mantissa. Although an implementation of DSA called SAM exists for arbitrary precision programs, a significant performance improvement has been obtained with CADNA compared to SAM for the numerical validation of programs with 113-bit mantissa variables. This new version of CADNA has been successfully used for the control of accuracy in quadruple precision applications, such as a chaotic sequence and the computation of multiple roots of polynomials. We also describe a new version of the PROMISE tool, based on CADNA, which aims at reducing the number of double precision variable declarations in numerical programs in favor of single precision ones, taking into account a requested accuracy of the results. The new version of PROMISE can now provide type declarations mixing single, double and quadruple precision.
TL;DR: An IP core of a coprocessor supporting computations that require integer multiple-precision arithmetic (MPA) is presented; it is designed to be scalable so that it can be used not only in scientific computing but also in embedded systems employing encryption based on MPA.
Abstract: In this paper, we present an IP core of a coprocessor supporting computations requiring integer multiple-precision arithmetic (MPA). Whilst standard 32/64-bit arithmetic is sufficient to solve many computing problems, there are still applications that require higher numerical precision. Hence, the purpose of the developed coprocessor is to support and offload the central processing unit (CPU) in such computations. The developed digital circuit of the coprocessor works with integer numbers of a precision of up to 32 kbits. Our IP core is developed using the very high speed integrated circuit hardware description language (VHDL) and simulated assuming implementation in field-programmable gate arrays (FPGAs). It exchanges data using three 64-bit data buses, whereas the code for execution on the coprocessor is fetched from a dedicated 8-bit bus (all buses in the AMBA standard, AXI Stream). The instruction set of the coprocessor currently consists of 7 instructions including multiplication, addition and subtraction. The computations can use up to 16 registers, each 32 kbits long. Simulation results assuming implementation on a Zynq system on chip (SoC) show that computing the factorial $n!$ for $n = 1000$ takes $326.4\,\mu\mathrm{s}$. Such a design currently requires 7982 look-up tables (LUTs), 10400 flip-flops (FFs), 33 block RAMs (BRAMs) and 28 DSP modules. The processor is designed to be scalable, allowing one to use the developed IP core not only in scientific computing, but also in embedded systems employing encryption based on MPA.
TL;DR: This work considers the summation of large sets of floating-point numbers on hybrid CPU-GPU platforms using MPRES, a new software library for multiple-precision computations on CPUs and CUDA compatible GPUs, and presents the addition algorithm for RNS-based representations, as well as three multiple-precision summation algorithms.
Abstract: We consider the summation of large sets of floating-point numbers on hybrid CPU-GPU platforms using MPRES, a new software library for multiple-precision computations on CPUs and CUDA compatible GPUs. This library uses an RNS-based floating-point representation, in accordance with which the multiple-precision significands are represented in a residue number system (RNS). This representation allows the computation of digits (residues) of significands in a parallel way and without carry propagation delay. We present the addition algorithm for RNS-based representations, as well as three multiple-precision summation algorithms: recursive summation, pairwise summation, and block-parallel hybrid summation. The hybrid algorithm demonstrates better performance, as it allows the full utilization of the GPU's resources.
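The carry-free property of a residue number system can be seen in a few lines: each residue is updated independently modulo its own modulus, and the conventional value is recovered only when needed through the Chinese Remainder Theorem. The sketch below works on plain non-negative integers rather than on floating-point significands, and the moduli set is an arbitrary pairwise-coprime choice made for illustration; it does not reproduce the MPRES library's number format or API. It needs Python 3.8 or later for `math.prod` and the modular inverse form of `pow`.

```python
from math import prod

# Assumed moduli, pairwise coprime: 251 (prime), 253 = 11 * 23,
# 255 = 3 * 5 * 17, 256 = 2**8. Representable range is [0, prod(MODULI)).
MODULI = (251, 253, 255, 256)


def to_rns(x):
    """Residues of x with respect to each modulus."""
    return tuple(x % m for m in MODULI)


def rns_add(a, b):
    """Component-wise addition; no carry moves between residues."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def from_rns(residues):
    """Reconstruct the integer with the Chinese Remainder Theorem."""
    big_m = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)
    return total % big_m


if __name__ == "__main__":
    a, b = 123456789, 987654321
    print(from_rns(rns_add(to_rns(a), to_rns(b))) == a + b)
```

Because every residue lane is independent, the additions map naturally onto parallel threads; the cost is moved into the conversion back to positional form and into operations such as comparison and scaling.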
TL;DR: The results of numerical experiments presented in the paper demonstrate the high accuracy and efficiency of the developed methods for computing quadrature nodes and weights, a decreased number of iterations required for finding polynomial zeros, and the elimination of truncation errors during weight computation.
Abstract: This paper presents efficient and accurate methods for computing Gauss-Jacobi nodes and weights. They include an enhancement of the standard iteration method for finding the zeros of Jacobi polynomials, a transformation of the weight function formula for increased accuracy of fractional derivative computation, and the application of arbitrary precision to mitigate the flaws of double precision arithmetic. The results of numerical experiments presented in the paper demonstrate the high accuracy and efficiency of the developed methods for computing quadrature nodes and weights, a decreased number of iterations required for finding polynomial zeros, and the elimination of truncation errors during weight computation. The accuracy of the computations depends on the precision applied, which is limited only by the available hardware.
TL;DR: The application of EFT to explicit extrapolation methods to solve initial value problems of ordinary differential equations is proposed and implemented routines can be effective for large-sized linear ODE and small-sized nonlinear ODE.
Abstract: The application of error-free transformation (EFT) is recently being developed to solve ill-conditioned problems. It can reduce the number of arithmetic operations required, compared with multiple precision arithmetic, and also be applied by using functions supported by a well-tuned BLAS library. In this paper, we propose the application of EFT to explicit extrapolation methods to solve initial value problems of ordinary differential equations. Consequently, our implemented routines can be effective for large-sized linear ODE and small-sized nonlinear ODE, especially in the case when harmonic sequence is used.
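For readers unfamiliar with error-free transformations, the basic building block is TwoSum, which returns the rounded sum of two doubles together with the exact rounding error, using only ordinary floating-point operations. The sketch below shows TwoSum and a compensated summation built on it; it illustrates the general EFT idea only and is not the extrapolation routine proposed in the paper. The function names are illustrative.

```python
from fractions import Fraction


def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and a + b = s + e exactly.

    Holds for IEEE 754 binary floating point in round-to-nearest,
    provided no overflow occurs.
    """
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e


def compensated_sum(values):
    """Sum a sequence while accumulating the rounding errors separately."""
    total, error = 0.0, 0.0
    for v in values:
        total, e = two_sum(total, v)
        error += e
    return total + error


if __name__ == "__main__":
    s, e = two_sum(0.1, 0.2)
    # Exactness check in rational arithmetic.
    print(Fraction(0.1) + Fraction(0.2) == Fraction(s) + Fraction(e))
    data = [1e16, 1.0, -1e16]
    print(sum(data), compensated_sum(data))
```

Everything here stays in hardware double precision, which is why EFT-based routines can be cheaper than general multiple precision arithmetic when only a modest amount of extra accuracy is needed.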
TL;DR: An 8-bit asynchronous wave-pipelined arithmetic logic unit has been modified with set of 8 arithmetic and 12 logical operations in order to reduce power and latency by using ASIC semi-custom design flow in cadence® environment using gpdk-180-nm technology.
Abstract: In this paper, an 8-bit asynchronous wave-pipelined arithmetic logic unit has been modified with set of 8 arithmetic and 12 logical operations. All the internal modules have been modified in order to reduce power and latency by using ASIC semi-custom design flow in cadence® environment using gpdk-180-nm technology. This modified design has achieved reduction in power by 45%, reduction in delay by 19%, reduction in area by 43%, reduction in cell count by 49% as compared to the existing ALU.
TL;DR: A new error control scheme is presented that provides the truncation numbers as well as the required digits of machine precision for the multilevel fast multipole algorithm (MLFMA) and can be used to solve low-frequency problems that would otherwise experience overflow issues.
Abstract: We present a new error control scheme that provides the truncation numbers as well as the required digits of machine precision for the multilevel fast multipole algorithm (MLFMA). The proposed method is valid for all frequencies, whereas the previous studies on error control are valid only for high-frequency problems. When combined with a multiple-precision arithmetic framework, the proposed method can be used to solve low-frequency problems that would otherwise experience overflow issues. Numerical results in the form of optimal truncation numbers and machine precisions for a variety of box sizes and desired relative error thresholds are presented and compared with the results available in the literature.
TL;DR: Evaluation of the cosine function is done via a simple Cordic-like algorithm, together with a package for handling arbitrary-precision arithmetic in the computer program Matlab, which approximates the function having hundreds of correct decimals.
Abstract: Evaluation of the cosine function is done via a simple Cordic-like algorithm, together with a package for handling arbitrary-precision arithmetic in the computer program Matlab. Approximations to the cosine function having hundreds of correct decimals are presented with a discussion around errors and implementation.
TL;DR: This article describes diffracted field calculations that use multiple precision arithmetic and parallel computing to obtain meaningful numerical data for larger disks, which is difficult with double precision or quadruple precision arithmetic.
Abstract: KOSEN students learn a variety of the latest skills and technologies. In computer science, high performance computing, such as parallel computing and high precision computing, is a current topic. The electromagnetic diffraction by a circular disk of perfect conductor has been analyzed rigorously. However, it has been difficult to obtain meaningful numerical data for larger disks using double precision or quadruple precision arithmetic. By using multiple precision arithmetic, numerical data for the current distributions and the diffracted fields can be obtained for larger disks. In addition, the equations for calculating the diffracted fields are expanded and rearranged to control the accuracy of the numerical data. Multiple precision arithmetic is costly in computing time, so parallel computing is used to speed it up. Through this high performance computing work, our KOSEN students learn numerical calculation techniques. This article describes diffracted field calculations using multiple precision arithmetic and parallel computing.
TL;DR: In this paper, a refinement algorithm is proposed to solve quadratic optimization problems to arbitrary precision, assuming a floating-point QP solver oracle and proving linear convergence of residuals and primal errors.
Abstract: Quadratic optimization problems (QPs) are ubiquitous, and solution algorithms have matured to a reliable technology. However, the precision of solutions is usually limited due to the underlying floating-point operations. This may cause inconveniences when solutions are used for rigorous reasoning. We contribute on three levels to overcome this issue. First, we present a novel refinement algorithm to solve QPs to arbitrary precision. It iteratively solves refined QPs, assuming a floating-point QP solver oracle. We prove linear convergence of residuals and primal errors. Second, we provide an efficient implementation, based on SoPlex and qpOASES, that is publicly available in source code. Third, we give precise reference solutions for the Maros and Mészáros benchmark library.
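The refinement pattern described here (repeatedly call a floating-point solver on a rescaled correction problem, and do the bookkeeping exactly) can be illustrated on a much simpler problem. The sketch below applies it to a 2x2 linear system instead of a QP: residuals are computed exactly with `fractions.Fraction`, scaled to order one, and handed to a double-precision solver that plays the role of the oracle. This is a deliberately swapped-in analogue and not the paper's QP algorithm; the function names and the fixed number of rounds are assumptions.

```python
from fractions import Fraction


def solve_2x2_float(A, b):
    """Floating-point 'oracle': Cramer's rule for a 2x2 system."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det,
            (a11 * b[1] - a21 * b[0]) / det]


def refine(A, b, rounds=5):
    """Iteratively refine the solution of A x = b using exact residuals."""
    A_exact = [[Fraction(v) for v in row] for row in A]
    b_exact = [Fraction(v) for v in b]
    x = [Fraction(0), Fraction(0)]
    for _ in range(rounds):
        # Exact residual of the current iterate.
        r = [bi - sum(aij * xj for aij, xj in zip(row, x))
             for row, bi in zip(A_exact, b_exact)]
        scale = max(abs(ri) for ri in r)
        if scale == 0:
            break
        # Rescale so the oracle always sees a right-hand side of order one.
        scaled = [float(ri / scale) for ri in r]
        d = solve_2x2_float(A, scaled)
        # Undo the scaling exactly and apply the correction.
        x = [xi + Fraction(di) * scale for xi, di in zip(x, d)]
    return x


if __name__ == "__main__":
    x = refine([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
    # The exact solution of this system is (1/11, 7/11).
    print(float(abs(x[0] - Fraction(1, 11))), float(abs(x[1] - Fraction(7, 11))))
```

Each round shrinks the error by roughly the relative accuracy of the oracle, which is the linear-convergence behaviour the abstract refers to, here in a far simpler setting.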
Abstract: As very important modules in a processor, arithmetic circuits often play a pivotal role in determining the performance and power dissipation of a demanding computation. The demand for higher speed and power efficiency, as well as the desirability for error resilience in many applications (e.g., multimedia, recognition and data analytics) has driven the development of approximate arithmetic circuit design. In this dissertation, approximate arithmetic circuits are evaluated, several fundamental approximate circuits are devised, and a high-performance and energy-efficient approximate adaptive filter is proposed using approximate distributed arithmetic (DA) circuits. Existing approximate arithmetic circuits in the literature are first reviewed, evaluated and compared to guide the selection of a suitable approximate design for a specific application with designated purposes. A low-power approximate radix-8 Booth multiplier using an approximate recoding adder is then proposed for signed multiplication. Compared with an accurate multiplier, the proposed approximate design saves as much as 44% in power and 43% in area with a mean relative error distance (MRED) of 0.43%. Compared with the other approximate Booth multipliers, the proposed design has the lowest power-delay product while providing a moderate accuracy. Moreover, an adaptive approximation approach is proposed for the design of a divider and a square root (SQR) circuit. In this design, the division/SQR is computed using a reduced-width divider/SQR circuit and a shifter by adaptively pruning the input bits. The synthesis results show that the proposed approximate divider with an MRED of 6.6% achieves more than 60% improvements in speed and power dissipation compared with an accurate design. The proposed divider is more accurate than other approximate dividers when a similar power-delay product is considered. 
By changing the width of the reduced-width SQR circuit, the approximate SQR circuit is 22.69% to 74.54% faster, and saves 30.75% to 79.34% in power with an MRED from 0.7% to 8.0% compared with an accurate design. Compared to other approximate designs, the proposed approximate divider and SQR circuit designs perform better in image processing applications. The superior control capability of the cerebellum has motivated extensive interest in the development of computational cerebellar models. Many models have been applied to motor control and image stabilization in robots. Often computationally complex, cerebellar models have rarely been implemented in dedicated hardware. In this dissertation, a fixed-point finite impulse response adaptive filter is proposed using approximate DA circuits. This design can be used in general digital signal processing applications as well as in control systems as an adaptive filter-based cerebellar model. In this design, the radix-8 Booth algorithm is used to reduce the number of partial products in the DA architecture, and the partial products are approximately generated by truncating the input data with error compensation, accumulated by using an approximate Wallace tree. At a similar accuracy, the proposed design attains on average a 55% reduction in energy per operation and a 2.2× increase in throughput per area compared with an accurate design. A saccadic system using the proposed approximate adaptive filter-based cerebellar model achieves a similar retinal slip as using an accurate filter. These results are promising for the large-scale integration of approximate circuits into high-performance and energy-efficient systems for error-resilient applications.
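The accuracy figures quoted above use the mean relative error distance (MRED): the average of |approximate - exact| / |exact| over a set of inputs. The sketch below computes it for a toy unsigned multiplier that simply zeroes the low bits of each operand. That multiplier is a hypothetical stand-in chosen for brevity; it is not the radix-8 Booth multiplier, the divider, or the SQR circuit proposed in the dissertation.

```python
def truncated_multiply(a, b, drop_bits):
    """Toy approximate multiplier: clear the low `drop_bits` bits of each operand."""
    mask = ~((1 << drop_bits) - 1)
    return (a & mask) * (b & mask)


def mred(approx_fn, exact_fn, inputs):
    """Mean relative error distance over input pairs with a nonzero exact result."""
    total, count = 0.0, 0
    for a, b in inputs:
        exact = exact_fn(a, b)
        if exact == 0:
            continue                    # relative error is undefined here
        total += abs(approx_fn(a, b) - exact) / abs(exact)
        count += 1
    return total / count


def mred_for_truncation(drop_bits, width=8):
    """MRED of the toy multiplier over all nonzero unsigned `width`-bit operands."""
    pairs = [(a, b) for a in range(1, 1 << width) for b in range(1, 1 << width)]
    return mred(lambda a, b: truncated_multiply(a, b, drop_bits),
                lambda a, b: a * b,
                pairs)
```

Calling `mred_for_truncation` with increasing `drop_bits` shows the error growing as more input bits are discarded, the same accuracy-versus-cost trade-off that the reported MRED, power, and delay numbers quantify for the actual designs.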
TL;DR: This paper seeks to develop provably correct posit arithmetic based on fixed-point big integers that can serve as a reference for other hardware-optimized implementations, as a test bed for applications to experiment with different posit bit configurations, and as a basis for analyzing the relative errors of using smaller bit sizes in the posit numbers compared to using the native 32-bit or 64-bit floating-point numbers.
Abstract: The floating-point number format is used extensively in many applications, especially scientific software. These applications rely on efficient hardware floating-point support to perform arithmetic operations. With the advent of multicore CPUs and massively parallel GPUs, the memory bandwidth available to each compute core is increasingly limited. The limited memory bandwidth is a serious bottleneck to system performance. The posit number format [12] is a promising approach to improve the accuracy of arithmetic operations with more efficient use of bit storage, hence reducing memory contention. However, robust and reliable software implementations of posit arithmetic libraries in C/C++ or Python are not readily available. In this paper, we seek to develop provably correct posit arithmetic based on fixed-point big integers. A robust and reliable implementation can then serve as a reference for other hardware-optimized implementations, as a test bed for applications to experiment with different posit bit configurations, and as a basis for analyzing the relative errors of using smaller bit sizes in the posit numbers compared to using the native 32-bit or 64-bit floating-point numbers.
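As a rough illustration of what an exact, integer-based treatment of posits involves, the sketch below decodes a posit bit pattern into an exact `fractions.Fraction`. It follows the usual posit layout (sign bit, regime run, up to `es` exponent bits, remaining fraction bits) and is not the paper's implementation. The function name and the default 8-bit, es = 0 configuration are assumptions made for the example.

```python
from fractions import Fraction


def decode_posit(bits, nbits=8, es=0):
    """Decode an nbits-wide posit pattern to an exact Fraction (None for NaR)."""
    mask = (1 << nbits) - 1
    bits &= mask
    if bits == 0:
        return Fraction(0)
    if bits == 1 << (nbits - 1):
        return None                              # NaR: "not a real"
    negative = bool(bits >> (nbits - 1))
    if negative:
        bits = (-bits) & mask                    # two's complement magnitude
    rest = nbits - 1                             # bits after the sign bit
    body = bits & ((1 << rest) - 1)
    # Regime: a run of identical bits, ended by the opposite bit or the end.
    first = (body >> (rest - 1)) & 1
    run, i = 0, rest - 1
    while i >= 0 and ((body >> i) & 1) == first:
        run += 1
        i -= 1
    k = run - 1 if first else -run
    i -= 1                                       # skip the terminating bit
    remaining = max(i + 1, 0)
    tail = body & ((1 << remaining) - 1)
    # Exponent: up to `es` bits; bits cut off by the word end count as zero.
    exp_bits = min(es, remaining)
    exponent = 0
    if exp_bits:
        exponent = (tail >> (remaining - exp_bits)) << (es - exp_bits)
    # Fraction: whatever is left, with an implicit leading 1.
    frac_bits = remaining - exp_bits
    frac = tail & ((1 << frac_bits) - 1)
    scale = k * (1 << es) + exponent
    value = (1 + Fraction(frac, 1 << frac_bits)) * Fraction(2) ** scale
    return -value if negative else value
```

Because every decoded value is an exact rational, results of posit operations at different bit widths can be compared against a reference without introducing additional rounding, which is the kind of relative-error analysis the abstract describes.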