TL;DR: An indispensable resource for instruction, professional development, and research, Computer Arithmetic: Algorithms and Hardware Designs, Second Edition combines broad coverage of the underlying theories of computer arithmetic with numerous examples of practical designs, worked-out examples, and a large collection of meaningful problems.
Abstract: Ideal for graduate and senior undergraduate courses in computer arithmetic and advanced digital design, Computer Arithmetic: Algorithms and Hardware Designs, Second Edition, provides a balanced, comprehensive treatment of computer arithmetic. It covers topics in arithmetic unit design and circuit implementation that complement the architectural and algorithmic speedup techniques used in high-performance computer architecture and parallel processing. Using a unified and consistent framework, the text begins with number representation and proceeds through basic arithmetic operations, floating-point arithmetic, and function evaluation methods. Later chapters cover broad design and implementation topics, including techniques for high-throughput, low-power, fault-tolerant, and reconfigurable arithmetic. An appendix provides a historical view of the field and speculates on its future. An indispensable resource for instruction, professional development, and research, Computer Arithmetic: Algorithms and Hardware Designs, Second Edition, combines broad coverage of the underlying theories of computer arithmetic with numerous examples of practical designs, worked-out examples, and a large collection of meaningful problems. This second edition includes a new chapter on reconfigurable arithmetic, in order to address the fact that arithmetic functions are increasingly being implemented on field-programmable gate arrays (FPGAs) and FPGA-like configurable devices.
Updated and thoroughly revised, the book offers new and expanded coverage of saturating adders and multipliers, truncated multipliers, fused multiply-add units, overlapped quotient digit selection, bipartite and multipartite tables, reversible logic, dot notation, modular arithmetic, Montgomery modular reduction, division by constants, IEEE floating-point standard formats, and interval arithmetic.
Features:
* Divided into 28 lecture-size chapters
* Emphasizes both the underlying theories of computer arithmetic and actual hardware designs
* Carefully links computer arithmetic to other subfields of computer engineering
* Includes 717 end-of-chapter problems ranging in complexity from simple exercises to mini-projects
* Incorporates many examples of practical designs
* Uses consistent standardized notation throughout
* Instructor's manual includes solutions to text problems
* An author-maintained website http://www.ece.ucsb.edu/~parhami/text_comp_arit.htm contains instructor resources, including complete lecture slides
TL;DR: INTLAB is a toolbox for Matlab supporting real and complex intervals, and vectors, full matrices and sparse matrices over those; it is designed to be very fast and achieves the anticipated speed with identical code on a variety of computers.
Abstract: INTLAB is a toolbox for Matlab supporting real and complex intervals, and vectors, full matrices and sparse matrices over those. It is designed to be very fast. In fact, it is not much slower than the fastest pure floating point algorithms using the fastest compilers available (the latter, of course, without verification of the result). Besides the basic arithmetical operations, INTLAB includes rigorous input and output, rigorous standard functions, gradients, slopes and multiple precision arithmetic. Portability is assured by implementing all algorithms in Matlab itself, with the exception of exactly one routine for switching the rounding mode downwards, upwards and to nearest. Timing comparisons show that this concept achieves the anticipated speed with identical code on a variety of computers, ranging from PCs to parallel computers. INTLAB is freeware and may be copied from our home page.
TL;DR: It is shown that the overestimation of midpoint-radius interval arithmetic compared to power set operations is uniformly bounded by a factor 1.5 in radius, and matrix operations are very fast on almost any computer, with minimal effort for the implementation.
Abstract: Infimum-supremum interval arithmetic is widely used because of ease of implementation and narrow results. In this note we show that the overestimation of midpoint-radius interval arithmetic compared to power set operations is uniformly bounded by a factor 1.5 in radius. This is true for the four basic operations as well as for vector and matrix operations, over real and over complex numbers. Moreover, we describe an implementation of midpoint-radius interval arithmetic entirely using BLAS. Therefore, in particular, matrix operations are very fast on almost any computer, with minimal effort for the implementation. Especially, with the new definition it is seemingly the first time that full advantage can be taken of the speed of vector and parallel architectures. The algorithms have been implemented in the Matlab interval toolbox INTLAB.
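The midpoint-radius operations described above can be sketched as follows. This is a minimal illustration, not INTLAB's implementation: the class name `MidRad` and the helper `exact_product_bounds` are hypothetical, and the sketch ignores the directed rounding that a rigorous implementation needs. The example compares the midpoint-radius product with the exact (power set) product of the same two intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MidRad:
    """Interval stored as midpoint and non-negative radius (hypothetical name)."""
    mid: float
    rad: float

    def __add__(self, other):
        # Midpoints add, radii add.
        return MidRad(self.mid + other.mid, self.rad + other.rad)

    def __mul__(self, other):
        # Product enclosure: <a,α>·<b,β> = <ab, |a|β + α|b| + αβ>.
        # Rounding errors are ignored here (assumption of this sketch).
        mid = self.mid * other.mid
        rad = (abs(self.mid) * other.rad
               + self.rad * abs(other.mid)
               + self.rad * other.rad)
        return MidRad(mid, rad)

    def bounds(self):
        return (self.mid - self.rad, self.mid + self.rad)


def exact_product_bounds(a, b):
    """Exact product of two intervals, taken over the corner products."""
    (al, au), (bl, bu) = a.bounds(), b.bounds()
    corners = [al * bl, al * bu, au * bl, au * bu]
    return min(corners), max(corners)


a = MidRad(2.0, 1.0)            # the interval [1, 3]
b = MidRad(2.0, 1.0)            # the interval [1, 3]
p = a * b                       # midpoint-radius enclosure of a*b
lo, hi = exact_product_bounds(a, b)
overestimation = p.rad / ((hi - lo) / 2)
```

For this input the enclosure has radius 5 against an exact radius of 4, an overestimation factor of 1.25, which is within the 1.5 bound stated in the abstract.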
TL;DR: This paper presents an approach to solving the long-standing dependency problem in interval arithmetic, called constrained interval arithmetic, which has an additive inverse and a multiplicative inverse and satisfies the distributive law.
Abstract: This paper presents an approach to solving the long-standing dependency problem in interval arithmetic. An extension to interval arithmetic, called here constrained interval arithmetic, is developed. Unlike interval arithmetic, constrained interval arithmetic has an additive inverse, a multiplicative inverse and satisfies the distributive law. This means that the algebraic structure of constrained interval arithmetic is different from that of interval arithmetic. The applicability of constrained interval arithmetic is explored.
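The dependency problem can be made concrete with a small sketch. In standard interval arithmetic, subtracting an interval from itself does not give zero, because both operands are treated as independent. The second function shows one way to respect the dependency: both occurrences of the variable share a single parameter `lam` in [0, 1]. That parameterization is an assumption made here for illustration, loosely in the spirit of the constrained approach, and is not taken from the paper; the function names are hypothetical.

```python
def sub(x, y):
    """Standard interval subtraction [a, b] - [c, d] = [a - d, b - c]."""
    return (x[0] - y[1], x[1] - y[0])


def sub_same_variable(x):
    """x - x where both operands are the SAME variable.

    Each occurrence is written as x(lam) = lo + lam * (hi - lo) with one
    shared lam in [0, 1] (illustrative assumption). The expression is linear
    in lam, so its extremes occur at lam = 0 or lam = 1.
    """
    lo, hi = x

    def value(lam):
        return lo + lam * (hi - lo)

    results = [value(lam) - value(lam) for lam in (0.0, 1.0)]
    return (min(results), max(results))


x = (1.0, 2.0)
naive = sub(x, x)                  # (-1.0, 1.0): dependency is lost
dependent = sub_same_variable(x)   # (0.0, 0.0): additive inverse recovered
```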
TL;DR: This paper describes the study of a new field programmable gate array architecture based on on-line arithmetic, dedicated to single chip implementation of numerical algorithms in low-power signal processing and digital control applications.
Abstract: This paper describes the study of a new field programmable gate array architecture based on on-line arithmetic. This architecture, called Field Programmable On-line oPerators (FPOP), is dedicated to single chip implementation of numerical algorithms in low-power signal processing and digital control applications. FPOP is based on a reprogrammable array of on-line arithmetic operators. On-line arithmetic is a digit-serial arithmetic with most significant digits first using a redundant number system. The digit-level pipeline, the small number of communication wires between the operators and the small size of the arithmetic operators lead to high-performance parallel computations. In FPOP, the basic elements are arithmetic operators such as adders, subtracters, multipliers, dividers, square-rooters, sine or cosine operators.... An equation model is then sufficient to describe the mapping of the algorithm on the circuit. The digit-serial communication mode also significantly reduces the necessary programmable routing resources compared to standard FPGAs.
TL;DR: Lower bounds for polynomials of constant degree over fields of characteristic zero were shown in this paper, where the complexity of computing a polynomial in this model was attributed to the wealth of partial derivatives it has on every affine subspace of small co-dimension.
Abstract: In this paper we prove near quadratic lower bounds for depth-3 arithmetic formulae over fields of characteristic zero. Such bounds are obtained for the elementary symmetric functions, the (trace of) iterated matrix multiplication, and the determinant. As corollaries we get the first non-trivial lower bounds for computing polynomials of constant degree, and a gap between the power of depth-3 arithmetic formulas and depth-4 arithmetic formulas. The main technical contribution relates the complexity of computing a polynomial in this model to the wealth of partial derivatives it has on every affine subspace of small co-dimension. Lower bounds for related models utilize an algebraic analog of the Nechiporuk lower bound on Boolean formulae.
TL;DR: This work presents a procedure by which additions and subtractions can be performed rapidly and accurately, and shows that these operations are thereby competitive with their floating-point equivalents.
Abstract: As an alternative to floating-point, several papers have proposed the use of a logarithmic number system, in which a real number is represented as a fixed-point logarithm. Multiplication and division therefore proceed in minimal time with no rounding error. However, the system can only offer an overall advantage if addition and subtraction can be performed with speed and accuracy at least equal to that of floating-point, but these operations require the interpolation of a non-linear function which has hitherto been either time-consuming or inaccurate. We present a procedure by which additions and subtractions can be performed rapidly and accurately, and show that these operations are thereby competitive with their floating-point equivalents. We then show that the average performance of the logarithmic system exceeds floating-point, in terms of both speed and accuracy.
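The asymmetry between multiplication and addition in a logarithmic number system can be sketched as follows. This is an illustration only: it handles positive values, stores the logarithm as a Python float rather than a fixed-point word, and evaluates the non-linear function log2(1 + 2^d) directly with `math.log2` instead of the interpolation procedure the paper proposes. All function names are hypothetical.

```python
import math


def lns_encode(x):
    """Represent a positive real by its base-2 logarithm."""
    return math.log2(x)


def lns_decode(e):
    return 2.0 ** e


def lns_mul(a, b):
    # Multiplication is a plain addition of logarithms.
    return a + b


def lns_div(a, b):
    # Division is a plain subtraction of logarithms.
    return a - b


def lns_add(a, b):
    # log2(2^a + 2^b) = hi + log2(1 + 2^(lo - hi)); the second term is the
    # non-linear function that hardware must tabulate or interpolate.
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


eight, four = lns_encode(8.0), lns_encode(4.0)
product = lns_decode(lns_mul(eight, four))    # approximately 32
quotient = lns_decode(lns_div(eight, four))   # approximately 2
total = lns_decode(lns_add(eight, four))      # approximately 12
```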
TL;DR: An arithmetic processor is disclosed whose arithmetic logic unit has a plurality of arithmetic circuits, each performing a group of associated arithmetic operations such as finite field operations or modular integer operations.
Abstract: The present disclosure provides an arithmetic processor having an arithmetic logic unit having a plurality of arithmetic circuits each for performing a group of associated arithmetic operations, such as finite field operations, or modular integer operations. The arithmetic logic unit has an operand input data bus, for receiving operand data thereon and a result data output bus for returning the results of the arithmetic operations thereon. A register file is coupled to the operand data bus and the result data bus. The register file is shared by the plurality of arithmetic circuits. Further a controller is coupled to the ALU and the register file, the controller selecting one of the plurality of arithmetic circuits in response to a mode control signal requesting an arithmetic operation and for controlling data access between the register file and the ALU and whereby the register file is shared by the arithmetic circuits.
TL;DR: The potential of using multiple precision arithmetic units to effectively support synthesis of low-power application-specific integrated circuits is explored and a new architectural scheme for collaborative addition of sets of variable precision data is proposed.
Abstract: Many modern multimedia applications such as image and video processing are characterized by a unique combination of arithmetic and computational features: fixed-point arithmetic, a variety of short data types, high degree of instruction-level parallelism, strict timing constraints, and high computational requirements. Computationally intensive algorithms usually boost a device's power dissipation, which is often key to the efficiency of many communications and multimedia applications. Although recently virtually all general-purpose processors have been equipped with multiprecision operations, the current generation of behavioral synthesis tools for application-specific systems does not utilize this power/performance optimization paradigm. In this paper, we explore the potential of using multiple precision arithmetic units to effectively support synthesis of low-power application-specific integrated circuits. We propose a new architectural scheme for collaborative addition of sets of variable precision data. We have developed a novel resource allocation and computation assignment methodology for a set of multiple precision arithmetic units. The optimization algorithms explore the trade-off of allocating low-width bus structures and executing multiple-cycle operations. Experimental results indicate strong advantages of the proposed approach.
TL;DR: This paper presents a methodology for merging distributed arithmetic with the residue number systems to achieve high-performance FPL solutions.
Abstract: Field-programmable logic (FPL) densities and performance have steadily improved, allowing DSP solutions to be integrated on a single FPL chip. The primary limitation of FPLs, in DSP-centric applications, is their intrinsically weak arithmetic performance compared to DSP microprocessors and ASICs. In some cases, distributed arithmetic (DA) has been used to mask FPL arithmetic inadequacies. The Residue Number System (RNS) has demonstrated an ability to support high-bandwidth arithmetic with limited resources. This paper presents a methodology for merging distributed arithmetic with the residue number systems to achieve high-performance FPL solutions.
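The appeal of the Residue Number System is that addition and multiplication split into small, independent, carry-free channels, one per modulus. The sketch below illustrates that property only; it does not show the paper's merger with distributed arithmetic. The moduli (7, 11, 13) and all function names are choices made for this example, and results are valid only while they stay below the product of the moduli (1001).

```python
from math import prod

MODULI = (7, 11, 13)      # pairwise coprime; dynamic range is 7*11*13 = 1001


def to_rns(x, moduli=MODULI):
    """Forward conversion: one small residue per modulus."""
    return tuple(x % m for m in moduli)


def rns_add(a, b, moduli=MODULI):
    # Each channel works independently; no carries cross between channels.
    return tuple((x + y) % m for x, y, m in zip(a, b, moduli))


def rns_mul(a, b, moduli=MODULI):
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


def from_rns(residues, moduli=MODULI):
    """Reverse conversion via the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        partial = big_m // m
        # pow(x, -1, m) is the modular inverse (Python 3.8+).
        total += r * partial * pow(partial, -1, m)
    return total % big_m


a, b = to_rns(25), to_rns(37)
sum_value = from_rns(rns_add(a, b))        # 25 + 37
product_value = from_rns(rns_mul(a, b))    # 25 * 37, still below 1001
```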
TL;DR: This work presents two proof-of-concept VLSI implementations whose power dissipation changes according to the precision of the computation performed, and explores the potential of Distributed Arithmetic computation structures for low power precision-on-demand computation.
Abstract: A recent trend in low power design has been the employment of reduced precision processing methods for decreasing arithmetic activity and average power dissipation. Such designs can trade off power and arithmetic precision as system requirements change. This work explores the potential of Distributed Arithmetic (DA) computation structures for low power precision-on-demand computation. We present two proof-of-concept VLSI implementations whose power dissipation changes according to the precision of the computation performed.
TL;DR: It is shown that a CLNS addition can be performed with approximately the same hardware as a high-radix CORDIC operation.
Abstract: This paper describes the application of high radix redundant CORDIC algorithms to complex logarithmic number system arithmetic. It shows that a CLNS addition can be performed with approximately the same hardware as a high-radix CORDIC operation. A design example comparable to single precision floating point has been designed and simulated.
TL;DR: This paper presents an efficient implementation of the pipeline FFT processor based on the radix-4 decimation-in-time algorithm with the use of digit-serial arithmetic units that can not only achieve nearly 100% hardware utilization, but also require much less memory compared with the previous digit-serial FFT processors.
Abstract: This paper presents an efficient implementation of the pipeline FFT processor based on the radix-4 decimation-in-time algorithm with the use of digit-serial arithmetic units. By splitting the sequential input sample into parallel digit-serial data streams, the proposed architecture can not only achieve nearly 100% hardware utilization, but also require much less memory compared with the previous digit-serial FFT processors. Furthermore, in FFT processors, several modules of ROM are required for the storage of twiddle factors. By exploiting the redundancy of the factors, the overall ROM size can be effectively reduced by a factor of 2.
TL;DR: The Overlap Resolution Number System is a new redundant representation of signals, with Continuous Valued Digits, that presents new methods for binary arithmetic and digital signal processing.
Abstract: The Overlap Resolution Number System (ORNS) employs bit level analog residue arithmetic, and opens up a powerful approach to digital computing. This new redundant representation of signals, with Continuous Valued Digits, presents new methods for binary arithmetic and digital signal processing. The number system is based on analog residue digits, as opposed to binary or multiple-valued digit levels. Importantly, arithmetic in ORNS is tolerant to VLSI circuit tolerances. This allows simple elementary analog circuits to be employed, targeting digital accuracy.
TL;DR: This paper explains how the hyperasymptotic expansion of late terms in divergent asymptotic expansions can be used to compute all the Stokes multipliers to arbitrary precision.
Abstract: In this paper we explain how the hyperasymptotic expansion of late terms in divergent asymptotic expansions can be used to compute all the Stokes multipliers to arbitrary precision.
TL;DR: This research lays the groundwork for future analysis of arithmetic codes with high-order context models by considering the important problem of decoding an arithmetic code stream when an initial segment of that code stream is unknown.
Abstract: Summary form only given. Arithmetic coding is a popular and efficient lossless compression technique that maps a sequence of source symbols to an interval of numbers between zero and one. We consider the important problem of decoding an arithmetic code stream when an initial segment of that code stream is unknown. We call decoding under these conditions resynchronizing an arithmetic code. This problem has importance in both error resilience and cryptology. If an initial segment of the code stream is corrupted by channel noise, then the decoder must attempt to determine the original source sequence without full knowledge of the code stream. In this case, the ability to resynchronize helps the decoder to recover from the channel errors. But in the situation of encryption one would like to have very high time complexity for resynchronization. We consider the problem of resynchronizing simple arithmetic codes. This research lays the groundwork for future analysis of arithmetic codes with high-order context models. In order for the decoder to achieve full resynchronization, the unknown, initial b bits of the code stream must be determined exactly. When the source is approximately IID, the search complexity associated with choosing the correct sequence is at least O(2^{b/2}). Therefore, when b is 100 or more, the time complexity required to achieve full resynchronization is prohibitively high. To partially resynchronize, the decoder must determine the coding interval after b bits have been output by the encoder. For a stationary source and a finite-precision static binary arithmetic coder, the complexity of determining the code interval is O(2^{2s}), where the precision is s bits.
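The mapping from a symbol sequence to a sub-interval of [0, 1) can be sketched with exact rational arithmetic. This is a textbook-style illustration under stated assumptions: a fixed two-symbol model (`PROBS`, chosen for the example), exact `Fraction` arithmetic instead of the finite-precision coder the abstract analyzes, and a decoder that is told the message length. It does not address resynchronization.

```python
from fractions import Fraction

# Static source model, chosen for this example only.
PROBS = {"a": Fraction(3, 4), "b": Fraction(1, 4)}


def cumulative(probs):
    """Lower end of each symbol's slice of [0, 1)."""
    cum, total = {}, Fraction(0)
    for sym, p in probs.items():
        cum[sym] = total
        total += p
    return cum


def encode(message, probs=PROBS):
    """Return (low, width) of the interval that identifies the message."""
    cum = cumulative(probs)
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        low += width * cum[sym]    # move to the symbol's slice
        width *= probs[sym]        # shrink by the symbol's probability
    return low, width


def decode(value, length, probs=PROBS):
    """Recover `length` symbols from any value inside the coded interval."""
    cum = cumulative(probs)
    out = []
    for _ in range(length):
        for sym, p in probs.items():
            if cum[sym] <= value < cum[sym] + p:
                out.append(sym)
                value = (value - cum[sym]) / p   # rescale to [0, 1)
                break
    return "".join(out)


low, width = encode("aab")
recovered = decode(low, 3)
```

Every step of the decoder depends on the interval state left by the previous symbols, which is why losing the start of the code stream makes decoding hard.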
TL;DR: The mathematical foundations of the Overlap Resolution Number System (ORNS) which is based on signed continuous valued digits (CVDs), a redundant number system employing residue arithmetic, are presented.
Abstract: This paper presents mathematical foundations of the Overlap Resolution Number System (ORNS) which is based on signed continuous valued digits (CVDs). ORNS is a redundant number system employing residue arithmetic. In contrast to the implementation of arithmetic by binary or multiple-valued logic circuits, arithmetic operations in this novel number system are performed by analog digit manipulation circuitry. The redundancy in an ensemble of continuous valued digits that comprises a number provides tolerance to implementation imprecisions. Processing with these analog digits is performed by carry-free arithmetic structures with systematic circuit level redundancy.
TL;DR: A class of complex number representations called Redundant Complex Number Systems (RCNSs) is presented, useful for designing VLSI signal processors with complex arithmetic capability, and the experimental fabrication of the RCNS-based complex multiplier in 0.5 μm CMOS technology is presented.
Abstract: This paper presents a class of complex number representations called Redundant Complex Number Systems (RCNSs), which are useful for designing VLSI signal processors with complex arithmetic capability. A redundant complex number system is defined as an imaginary-radix number system having a redundant integer digit set. This makes possible the construction of high-speed complex arithmetic circuits: examples include a complex-number parallel adder with no carry propagation chain, and a complex-number multiplier using fast binary-tree addition structure. This paper also presents the experimental fabrication of the RCNS-based complex multiplier in 0.5 μm CMOS technology.
TL;DR: This work investigates how the CAD algorithm can be adapted to the situation when the coefficients are inexact, or, more precisely, Mathematica arbitrary-precision floating point numbers.
Abstract: We study the problem of deciding whether a system of real polynomial equations and inequalities has solutions, and if yes finding a sample solution. For polynomials with exact rational number coefficients the problem can be solved using a variant of the cylindrical algebraic decomposition (CAD) algorithm. We investigate how the CAD algorithm can be adapted to the situation when the coefficients are inexact, or, more precisely, Mathematica arbitrary-precision floating point numbers. We investigate what changes need to be made in algorithms used by CAD, and how reliable are the results we get.
TL;DR: Efficient architectures for designing pipelined floating point units for addition/subtraction, multiplication and division that were used for 3D computer graphics applications in FPGAs are described.
Abstract: There are many methods to perform floating point arithmetic functions. Until recently, such circuits could not be efficiently implemented in FPGAs due to their small size and low speed. The limited resources of FPGAs, both in terms of logic functions and routing, seriously limit the complexity of a design that uses floating point arithmetic. We will show that alternative algorithms that take advantage of the features of FPGA architecture can yield impressive performance and very small area requirements. We also describe efficient architectures for designing pipelined floating point units for addition/subtraction, multiplication and division that were used for 3D computer graphics applications in FPGAs. By exploiting the intrinsic architecture of such devices, and overcoming the limitations of VHDL and synthesis tools, these pipelined arithmetic units allow the use of multiple floating point units on the same device.
TL;DR: A pipelined Fast Fourier Transform processor is described that includes a memory arrangement and a cascade of a first arithmetic unit, a scratch memory and a second arithmetic unit.
Abstract: A pipelined Fast Fourier Transform Processor includes, besides a memory arrangement, a cascade of a first arithmetic unit, a scratch memory and a second arithmetic unit. One of the two arithmetic units can perform only at least one first type of butterfly Fast Fourier Transform arithmetic calculation, whereas the other can perform, in addition to this first type, at least one second type of butterfly Fast Fourier Transform arithmetic calculation. This architecture optimises both timing and circuit restrictions.
TL;DR: This paper considers the problem of fast computation of n-ary products, for large n, over arbitrary precision integer or rational number domains, and proposes three different implementations of the technique.
Abstract: In this paper we consider the problem of fast computation of n-ary products, for large n, over arbitrary precision integer or rational number domains. The combination of loop unrolling, chains of recurrences techniques and analogs of binary powering allows us to obtain order-of-magnitude speed improvements for such computations. Three different implementations of the technique (in Maple, C++ and Java) are described. Many examples together with timings are given.
TL;DR: An efficient threshold scheme is proposed which overcomes the weakness of Asmuth et al.'s scheme and is shown to be perfect, and a method is proposed to generate a pairwise relatively prime integer set which satisfies the requirements of these schemes.
Abstract: An efficient threshold scheme is proposed in this paper. It overcomes the weakness of Asmuth et al.'s scheme. The (r, n)-threshold scheme proposed below only takes O(r) operations to recover the shared secret while Shamir's scheme takes O(r log r) operations. We show that our scheme is perfect. Further, we also propose a method to generate a pairwise relatively prime integer set which satisfies the requirements of Asmuth et al.'s and our schemes.
TL;DR: A short discussion of performing the basic arithmetic operations in MATHCAD is provided here, because these operations are often required during a MATHCAD work session.
Abstract: Obviously, performing the basic arithmetic operations does not belong to the main uses of MATHCAD. You can continue to use an electronic calculator for this purpose. However, because these operations are often required during a MATHCAD work session, we provide a short discussion here.
TL;DR: This thesis is an attempt to design and implement a Simple Composite Arithmetic Machine (SCAM), which will be capable of performing operations on exact numbers (rational and integer numbers), and the core of the research is the Composite Arithmetic Unit, which contains operations like Multiply, Divide, Add and Subtract.
Abstract: As we advance into the new century, computers of the future will require new
techniques for arithmetic operations, which take advantage of the modern technology
and yield accurate results. Floating-point arithmetic has been in use for nearly forty
years, but is plagued with inaccuracies and limitations which necessitate introduction
of a new concept in computer arithmetic, called Composite Arithmetic. Composite
Arithmetic combines fixed-point and floating-point arithmetic into one integrated
concept where numbers are automatically assigned the right form. This negates the
need for differentiating between integer and real numbers in programming languages
and allows for better accuracy in calculations. The concept has two main forms:
exact and inexact. The exact form deals with integers and rational numbers, while
inexact form deals with numbers that cannot be represented exactly. To develop and
implement such a concept in hardware, tools are needed that will allow for easy design
and re-design process, at a low cost. A device that meets these requirements is a
Field Programmable Gate Array. This electronic device provides a quick and easy way
of designing the system and then implementing it by downloading data to the device.
It can then be tested and reprogrammed as desired, without the need for a new
device. This thesis is an attempt to design and implement Simple Composite
Arithmetic Machine (SCAM), which will be capable of performing operations on
exact numbers (rational and integer numbers). The core of the research is Composite
Arithmetic Unit, which contains operations like Multiply, Divide, Add and Subtract.
It also can find Greatest Common Divisor and cast out common factors of two
numbers. The CAU is controlled using Control Unit and Feedback Unit, and results
are stored in the Register Unit. The SCAM is therefore a basic microarchitecture that
will form a basis for further research in this field.
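The exact form described above (rationals kept in lowest terms, with a GCD step that casts out common factors after each operation) can be sketched in software as follows. This only illustrates the arithmetic; it says nothing about the thesis's hardware design. Rationals are modelled as hypothetical `(numerator, denominator)` tuples, and all function names are assumptions of this example.

```python
def gcd(a, b):
    """Euclid's algorithm; result is non-negative."""
    while b:
        a, b = b, a % b
    return abs(a)


def normalize(n, d):
    """Cast out common factors and keep the denominator positive."""
    if d == 0:
        raise ZeroDivisionError("zero denominator")
    if d < 0:
        n, d = -n, -d
    g = gcd(n, d)
    return (n // g, d // g)


def add(x, y):
    return normalize(x[0] * y[1] + y[0] * x[1], x[1] * y[1])


def subtract(x, y):
    return normalize(x[0] * y[1] - y[0] * x[1], x[1] * y[1])


def multiply(x, y):
    return normalize(x[0] * y[0], x[1] * y[1])


def divide(x, y):
    # Dividing by a zero rational raises ZeroDivisionError in normalize().
    return normalize(x[0] * y[1], x[1] * y[0])


one_sixth, one_third = (1, 6), (1, 3)
total = add(one_sixth, one_third)        # 1/6 + 1/3 in lowest terms
quotient = divide((1, 2), (3, 4))        # (1/2) / (3/4) in lowest terms
```

Because every result is reduced immediately, the operands never carry a rounding error, which is the property the abstract contrasts with floating-point arithmetic.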
TL;DR: An overview of the interval-enhanced GNU Fortran compiler and its use in developing interval arithmetic software is given.
Abstract: Interval arithmetic provides an efficient method for representing uncertainty in data that results from physical measurements. It is also useful in evaluating the accuracy of measurements that have been processed by software. To facilitate the development of interval arithmetic software, the GNU Fortran compiler has been modified to support interval arithmetic. This paper gives an overview of the interval-enhanced GNU Fortran compiler and its use in developing interval arithmetic software. Support for interval arithmetic is also being added to GNU C and C++ compilers.
TL;DR: A new algorithm for the division of floating point expansions is presented, the set of geometric operators is extended with Bareiss' determinant on a matrix of size between 3 and 10, and running times with different determinant algorithms on different machines are compared with other multiprecision packages.
Abstract: We present in this work a new algorithm for the division of floating point expansions. A floating point expansion is a multiple precision data type developed with arithmetic operators that use the processor floating point unit for core computations instead of the integer unit. Research on this subject has arisen recently from the observation that the floating point unit is becoming a more and more efficient part of modern computers. Many simple arithmetic operators and some very useful geometric operators have already been presented on expansions. Yet previous work presented only a very simple division algorithm. We present in this work a new algorithm. We take this opportunity to extend the set of geometric operators with Bareiss' determinant on a matrix of size between 3 and 10. Running times with different determinant algorithms on different machines are compared with other multiprecision packages including GMP, CADNA and a computer geometry package working with modular arithmetic.
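Expansion arithmetic is usually built from error-free transformations that run entirely on the floating point unit. The sketch below shows the classical TwoSum step, which turns one addition into a two-term expansion (rounded sum plus exact rounding error). It is the well-known building block, not the division algorithm of this paper, and it assumes IEEE 754 binary doubles in round-to-nearest with no overflow. The name `two_sum` is chosen for this example.

```python
from fractions import Fraction


def two_sum(a, b):
    """Return (s, err) with s = fl(a + b) and s + err == a + b exactly."""
    s = a + b
    b_virtual = s - a            # the part of s that came from b
    a_virtual = s - b_virtual    # the part of s that came from a
    err = (a - a_virtual) + (b - b_virtual)
    return s, err


# The small operand is lost in the rounded sum but survives in the error term.
s, err = two_sum(1.0, 1e-20)

# Fraction converts each double exactly, so this checks the identity with no
# rounding: the two-term expansion represents the true sum.
exact = Fraction(1.0) + Fraction(1e-20)
expansion_value = Fraction(s) + Fraction(err)
```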