TL;DR: Higham gives a thorough, up-to-date treatment of the behavior of numerical algorithms in finite precision arithmetic, combining algorithmic derivations, perturbation theory, and rounding error analysis.
Abstract: From the Publisher:
What is the most accurate way to sum floating point numbers? What are the advantages of IEEE arithmetic? How accurate is Gaussian elimination and what were the key breakthroughs in the development of error analysis for the method? The answers to these and many related questions are included here.
This book gives a thorough, up-to-date treatment of the behavior of numerical algorithms in finite precision arithmetic. It combines algorithmic derivations, perturbation theory, and rounding error analysis. Software practicalities are emphasized throughout, with particular reference to LAPACK and MATLAB. The best available error bounds, some of them new, are presented in a unified format with a minimum of jargon. Because of its central role in revealing problem sensitivity and providing error bounds, perturbation theory is treated in detail.
Historical perspective and insight are given, with particular reference to the fundamental work of Wilkinson and Turing, and the many quotations provide further information in an accessible format.
The book is unique in that algorithmic developments and motivations are given succinctly and implementation details minimized, so that attention can be concentrated on accuracy and stability results. Here, in one place and in a unified notation, is error analysis for most of the standard algorithms in matrix computations. Not since Wilkinson's Rounding Errors in Algebraic Processes (1963) and The Algebraic Eigenvalue Problem (1965) has any volume treated this subject in such depth. A number of topics are treated that are not usually covered in numerical analysis textbooks, including floating point summation, block LU factorization, condition number estimation, the Sylvester equation, powers of matrices, finite precision behavior of stationary iterative methods, Vandermonde systems, and fast matrix multiplication.
Although not designed specifically as a textbook, this volume is a suitable reference for an advanced course, and could be used by instructors at all levels as a supplementary text from which to draw examples, historical perspective, statements of results, and exercises (many of which have never before appeared in textbooks). The book is designed to be a comprehensive reference and its bibliography contains more than 1100 references from the research literature.
Audience
Specialists in numerical analysis as well as computational scientists and engineers concerned about the accuracy of their results will benefit from this book. Much of the book can be understood with only a basic grounding in numerical analysis and linear algebra.
About the Author
Nicholas J. Higham is a Professor of Applied Mathematics at the University of Manchester, England. He is the author of more than 40 publications and is a member of the editorial boards of the SIAM Journal on Matrix Analysis and Applications and the IMA Journal of Numerical Analysis. His book Handbook of Writing for the Mathematical Sciences was published by SIAM in 1993.
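The publisher's opening question about the most accurate way to sum floating point numbers can be illustrated with a small sketch. The code below is not taken from the book; it assumes IEEE binary64 Python floats and compares plain left-to-right summation with compensated (Kahan) summation, using `math.fsum` as the reference. The function names and the test data are illustrative choices.

```python
import math


def naive_sum(values):
    """Left-to-right recursive summation; every addition rounds once."""
    total = 0.0
    for x in values:
        total += x
    return total


def kahan_sum(values):
    """Compensated (Kahan) summation: carry the rounding error forward."""
    total = 0.0
    compensation = 0.0  # running estimate of the low-order bits lost so far
    for x in values:
        y = x - compensation            # apply the correction to the next term
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was just lost
        total = t
    return total


# One large term followed by many terms that are each too small to change it.
data = [1.0] + [1e-16] * 1000
reference = math.fsum(data)  # correctly rounded sum, used as ground truth

print("naive:", naive_sum(data))
print("kahan:", kahan_sum(data))
print("fsum: ", reference)
```

A hand-written loop is used for the naive sum because recent Python versions may apply compensation inside the built-in `sum` for floats.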
TL;DR: A cutting plane approach combining Chvatal–Gomory cutting planes with column generation is generalized for the case of multiple stock lengths in the one-dimensional cutting stock problem.
TL;DR: A randomized procedure for rounding fractional perfect matchings to (integral) matchings is presented; if the original fractional matching satisfies any linear inequality, then with high probability the new matching satisfies that linear inequality in an approximate sense.
Abstract: We present a randomized procedure for rounding fractional perfect matchings to (integral) matchings. If the original fractional matching satisfies any linear inequality, then with high probability, the new matching satisfies that linear inequality in an approximate sense. This extends the well-known LP rounding procedure of Raghavan and Thompson, which is usually used to round fractional solutions of linear programs.
We use our rounding procedure to design an additive approximation algorithm to the Quadratic Assignment Problem. The approximation error of the algorithm is $\epsilon n^2$ and it runs in $n^{O(\log n/\epsilon^2)}$ time.
We also describe Polynomial Time Approximation Schemes (PTASs) for dense subcases of many well-known NP-hard arrangement problems, including MINIMUM LINEAR ARRANGEMENT, MINIMUM CUT LINEAR ARRANGEMENT, MAXIMUM ACYCLIC SUBGRAPH, and BETWEENNESS.
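For context, the Raghavan and Thompson procedure that this abstract extends can be sketched in a few lines: each fractional value in [0, 1] is rounded to 1 independently with probability equal to that value. This is only the classical independent rounding, not the matching-preserving procedure of the paper; the function name, the seed, and the sample vector are assumptions made for illustration.

```python
import random


def randomized_round(fractional, rng):
    """Round each x in [0, 1] to 1 with probability x, independently."""
    return [1 if rng.random() < x else 0 for x in fractional]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
x = [0.0, 0.25, 0.5, 0.75, 1.0]

# Repeat the rounding many times and average each coordinate.
trials = [randomized_round(x, rng) for _ in range(10_000)]
means = [sum(column) / len(trials) for column in zip(*trials)]
print(means)  # each empirical mean should be close to the matching entry of x
```

Because the expected value of each rounded coordinate equals its fractional value, any fixed linear function of the solution is preserved in expectation, which is the starting point for the concentration arguments behind "approximately satisfied with high probability".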
TL;DR: Interval arithmetic is used to take account of rounding errors in the computation of Viswanath's constant, the rate at which a random Fibonacci sequence increases.
Abstract: The introduction of fast and efficient software for interval arithmetic, such as the MATLAB toolbox INTLAB, has resulted in the increased popularity of the use of interval analysis. We give an introduction to interval arithmetic and explain how it is implemented in the toolbox INTLAB. A tutorial is provided for those who wish to learn how to use INTLAB. We then focus on the interval versions of some important problems in numerical analysis. A variety of techniques for solving interval linear systems of equations are discussed, and these are then tested to compare timings and accuracy. We consider univariate and multivariate interval nonlinear systems and describe algorithms that enclose all the roots. Finally, we give an application of interval analysis. Interval arithmetic is used to take account of rounding errors in the computation of Viswanath's constant, the rate at which a random Fibonacci sequence increases.
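The core idea of interval arithmetic described above can be sketched as follows. This is not INTLAB (which is a MATLAB toolbox) and does not use directed rounding modes; as a simplifying assumption it widens each computed endpoint by one unit in the last place with `math.nextafter` (Python 3.9 or later), which is enough to enclose the exact result when no overflow occurs. The `Interval` class is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        # Endpoint sums are rounded to nearest, so step each one outward
        # by one ulp to guarantee the exact sum is enclosed.
        return Interval(
            math.nextafter(self.lo + other.lo, -math.inf),
            math.nextafter(self.hi + other.hi, math.inf),
        )

    def __mul__(self, other):
        # The exact product range is spanned by the four endpoint products.
        products = [a * b for a in (self.lo, self.hi) for b in (other.lo, other.hi)]
        return Interval(
            math.nextafter(min(products), -math.inf),
            math.nextafter(max(products), math.inf),
        )

    def __contains__(self, value):
        return self.lo <= value <= self.hi


x = Interval(0.1, 0.1)
y = Interval(0.2, 0.2)
print(x + y)
print(x * y)
```

The enclosures are slightly wider than those a directed-rounding implementation would give, which is the price of the simpler approach.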
TL;DR: In this article, the authors give improved randomized rounding schemes for the geometric relaxation of the multiway cut problem, yielding a 12/11-approximation algorithm for k = 3 and a 1.3438-approximation algorithm in general.
Abstract: The multiway-cut problem is, given a weighted graph and k >= 2 terminal nodes, to find a minimum-weight set of edges whose removal separates all the terminals. The problem is NP-hard, and even NP-hard to approximate within 1+delta for some small delta > 0.
Calinescu, Karloff, and Rabani (1998) gave an algorithm with performance guarantee 3/2-1/k, based on a geometric relaxation of the problem. In this paper, we give improved randomized rounding schemes for their relaxation, yielding a 12/11-approximation algorithm for k=3 and a 1.3438-approximation algorithm in general.
Our approach hinges on the observation that the problem of designing a randomized rounding scheme for a geometric relaxation is itself a linear programming problem. The paper explores computational solutions to this problem, and gives a proof that for a general class of geometric relaxations, there are always randomized rounding schemes that match the integrality gap.
TL;DR: An improved rounding method using an SDP relaxation is presented, and improved approximation ratios for several MAX-GP problems are established, including Dense-Subgraph, Max-Cut, Max-Not-Cut, and Max-Vertex-Cover.
Abstract: Given an undirected graph G=(V,E) with |V|=n and an integer k between 0 and n, the maximization graph partition (MAX-GP) problem is to determine a subset S⊂V of k nodes such that an objective function w(S) is maximized. The MAX-GP problem can be formulated as a binary quadratic program and it is NP-hard. Semidefinite programming (SDP) relaxations of such quadratic programs have been used to design approximation algorithms with guaranteed performance ratios for various MAX-GP problems. Based on several earlier results, we present an improved rounding method using an SDP relaxation, and establish improved approximation ratios for several MAX-GP problems, including Dense-Subgraph, Max-Cut, Max-Not-Cut, and Max-Vertex-Cover.
TL;DR: This paper presents a tool for studying the propagation of rounding errors in floating-point computations, that carries out some ideas proposed in [3], [7].
Abstract: The manipulation of real numbers by computers is approximated by floating-point arithmetic, which uses a finite representation of numbers. This implies that a (small in general) rounding error may be committed at each operation. Although this approximation is accurate enough for most applications, there are some cases where results become irrelevant because of the precision lost at some stages of the computation, even when the underlying numerical scheme is stable. In this paper, we present a tool for studying the propagation of rounding errors in floating-point computations, that carries out some ideas proposed in [3], [7]. Its aim is to detect automatically a possible catastrophic loss of precision, and its source. The tool is intended to cope with real industrial problems, and we believe it is especially appropriate for critical instrumentation software. On these numerically quite simple programs, we believe our tool will bring some very helpful information, and allow us to find possible programming errors such as potentially dangerous double/float conversions, or blatant instabilities or losses of accuracy. The techniques used being those of static analysis, the tool will not compete on numerically intensive codes with a numerical analyst's study of stability. Neither is it designed for helping to find better numerical schemes. But it is automatic and, in comparison with a study of sensitivity to data, brings out the contribution of rounding errors occurring at every intermediate step of the computation. Moreover, static analyses are sure (but may be pessimistic) and consider a set of possible executions and not just one, which is the essential requirement a verification tool for critical software must meet.
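The "potentially dangerous double/float conversions" mentioned above can be made concrete with a small dynamic experiment. This is unrelated to the static analysis tool itself; it is a minimal sketch that assumes the platform's `float` is IEEE binary64 and that `struct`'s `"f"` format is IEEE binary32. The helper name `to_float32` is an illustrative choice.

```python
import struct


def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# 0.5 survives the narrowing exactly; 0.1 and 2**24 + 1 do not.
for value in (0.5, 0.1, 16777217.0):
    narrowed = to_float32(value)
    print(value, narrowed, abs(narrowed - value))
```

The last value shows that integers above 2**24 are no longer all representable once a quantity passes through single precision.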
TL;DR: The design of the LZA is modified so that the leading bits of its output are produced first and can be used to begin the normalization and the delay of the resulting architecture for double-precision format is estimated.
Abstract: We propose an architecture for the computation of the floating-point multiply-add-fused (MAF) operation A + (B × C). This architecture is based on the combined addition and rounding (using a dual adder) and on the anticipation of the normalization step before the addition. Because the normalization is performed before the addition, it is not possible to overlap the leading-zero-anticipator with the adder. Consequently, to avoid the increase in delay we modify the design of the LZA so that the leading bits of its output are produced first and can be used to begin the normalization. Moreover, parts of the addition are also anticipated. We have estimated the delay of the resulting architecture for double-precision format, considering the load introduced by long connections, and estimate a reduction of about 15% to 20% with respect to traditional implementations of the floating-point MAF unit.
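What distinguishes the fused operation A + (B × C) numerically is that it rounds once instead of twice. The sketch below does not model the proposed hardware; it emulates a single-rounding result in software by computing exactly with `fractions.Fraction` and converting to binary64 at the end. The function names and operand values are assumptions chosen to expose the difference.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """a + b*c computed exactly, then rounded once to binary64."""
    return float(Fraction(a) + Fraction(b) * Fraction(c))


def unfused_multiply_add(a, b, c):
    """a + b*c with the product rounded before the addition."""
    return a + b * c


# b*c equals 1 - 2**-60 exactly, which rounds to 1.0 in binary64.
a, b, c = -1.0, 1.0 + 2.0**-30, 1.0 - 2.0**-30

print(unfused_multiply_add(a, b, c))  # the 2**-60 term is lost with the product
print(fused_multiply_add(a, b, c))    # the 2**-60 term survives the single rounding
```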
TL;DR: Computation reduction techniques which can either be used to obtain multiplierless implementation of finite impulse response (FIR) digital filters or to further improve multiplierless implementation obtained by currently used techniques are presented.
Abstract: We present computation reduction techniques which can either be used to obtain multiplierless implementation of finite impulse response (FIR) digital filters or to further improve multiplierless implementation obtained by currently used techniques. Although presented in the FIR filtering framework, these ideas are also directly applicable to any task/application which can be expressed as multiplication of vectors by scalars. The presented approach is to remove computational redundancy by reordering computation. The reordering problem is formulated using a graph in which vertices represent coefficients and edges represent resources required in a computation using the differential coefficient defined by the difference of the vertices joined by the edge. This interpretation leads to various methods for computation reduction for which simple polynomial run time algorithms are presented. It is shown that about 20% reduction in the number of add operations per coefficient can be obtained over the conventional multiplierless implementations. It is also shown that implementations requiring less than one adder per coefficient can be obtained using the presented approaches when using nonuniformly scaled coefficients quantized from infinite precision representation by simple rounding.
TL;DR: The generalized S transform (GST), a family of reversible integer-to-integer transforms inspired by the S transform, is proposed and its practical utility is demonstrated.
Abstract: The generalized S transform (GST), a family of reversible integer-to-integer transforms inspired by the S transform, is proposed. This family of transforms is then studied in detail by considering topics such as GST parameter calculation, the effects of using different rounding operators in the GST, and the relationship between the GST and the lifting scheme. Some examples of specific transforms in the GST family are also given. In particular, a new transform in this family is introduced, and its practical utility is demonstrated.
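The classical S transform that inspires the GST can be sketched as follows. This shows only the basic two-sample transform with a floor rounding operator, not the generalized family or the new transform from the paper; the function names are illustrative.

```python
def s_transform(a, b):
    """Map an integer pair to (rounded average, difference)."""
    low = (a + b) // 2   # floor of the average
    high = a - b         # difference; has the same parity as a + b
    return low, high


def inverse_s_transform(low, high):
    """Recover the original pair exactly from (low, high)."""
    a = low + (high + 1) // 2   # the parity of high restores the bit lost by the floor
    b = a - high
    return a, b


pair = (7, 4)
coeffs = s_transform(*pair)
print(coeffs, inverse_s_transform(*coeffs))
```

Reversibility relies on the difference carrying the parity that the floor operation discards, so no information is lost even though both outputs are integers.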
TL;DR: The objective is to design on-line scheduling algorithms that ensure stability in an admissible system; the paper also presents a rounding algorithm for the linear relaxation of the offline problem and proves that the rounding technique cannot be substantially improved.
Abstract: In a wireless network, a basestation transmits data to mobiles at time-varying, mobile-dependent rates due to the ever changing nature of the communication channels. In this paper we consider a wireless system in which the channel conditions and data arrival processes are governed by an adversary. We first consider a single server and a set of users. At each time step t the server can only transmit data to one user. If user i is chosen the transmission rate is r_i(t). We say that the system is (ω, ε)-admissible if in any window of ω time steps the adversary can schedule the users so that the total data arriving to each user is at most 1 - ε times the total service it receives. Our objective is to design on-line scheduling algorithms to ensure stability in an admissible system. We first show, somewhat surprisingly, that the admissibility condition alone does not guarantee the existence of a stable online algorithm, even in a subcritical system (i.e. ε > 0). For example, if the nonzero rates in an infinite rate set can be arbitrarily small, then a subcritical system can be unstable for any deterministic online algorithm. On a positive note, we present a tracking algorithm that attempts to mimic the behavior of the adversary. This algorithm ensures stability for all (ω, ε)-admissible systems that are not excluded by our instability results. As a special case, if the rate set is finite, then the tracking algorithm is stable even for a critical system (i.e. ε = 0). Moreover, the queue sizes are independent of ε. For subcritical systems, we also show that a simpler max weight algorithm is stable as long as the user rates are bounded away from zero. The offline version of our problem resembles the problem of scheduling unrelated machines and can be modeled by an integer program.
We present a rounding algorithm for its linear relaxation and prove that the rounding technique cannot be substantially improved. We conclude by discussing the extension of our model to the network setting.
TL;DR: This paper develops and compares several heuristic methods for solving the general two-dimensional cutting stock problem, together with three rounding procedures: rounding up, truncated branch and bound, and the solution of a residual problem.
Abstract: In this paper we develop and compare several heuristic methods for solving the general two-dimensional cutting stock problem. We follow the Gilmore-Gomory column generation scheme in which at each iteration a new cutting pattern is obtained as the solution of a subproblem on one stock sheet. For solving this subproblem, in addition to classical dynamic programming, we have developed three heuristic procedures of increasing complexity, based on GRASP and Tabu Search techniques, producing solutions differing in quality and in time requirements. In order to obtain integer solutions from the fractional solutions of the Gilmore-Gomory process, we compare three rounding procedures, rounding up, truncated branch and bound and the solution of a residual problem. We have coded and tested all the combinations of algorithms and rounding procedures. The computational results obtained on a set of randomly generated test problems show their relative efficiency and allow the potential user to choose from among them, according to the available computing time.
TL;DR: This work proposes lightweight floating-point arithmetic, and demonstrates the effectiveness of the proposed scheme using the inverse discrete cosine transform (IDCT), in the context of video coding, as an example.
Abstract: To enable floating-point (FP) signal processing applications in low-power mobile devices, we propose lightweight floating-point arithmetic. It offers a wider range of precision/power/speed/area trade-offs, but is wrapped in forms that hide the complexity of the underlying implementations from both multimedia software designers and hardware designers. Libraries implemented in C++ and Verilog provide flexible and robust floating-point units with variable bit-width formats, multiple rounding modes and other features. This solution bridges the design gap between software and hardware, and accelerates the design cycle from algorithm to chip by avoiding the translation to fixed-point arithmetic. We demonstrate the effectiveness of the proposed scheme using the inverse discrete cosine transform (IDCT), in the context of video coding, as an example. Further, we implement lightweight floating-point IDCT into hardware and demonstrate the power and area reduction.
TL;DR: An empirical comparison of polynomial-time approximation algorithms and local search heuristics for the problem of minimizing total weighted completion time on unrelated parallel machines is presented.
Abstract: This paper presents an empirical comparison of polynomial-time approximation algorithms and local search heuristics for the problem of minimizing total weighted completion time on unrelated parallel machines. Algorithms with a worst-case performance guarantee are based on rounding a fractional solution to an LP-relaxation or to a convex quadratic-programming relaxation. We also investigate dominance relations among the lower bounds resulting from these relaxations.
TL;DR: It is shown that the matrix rounding using L1-discrepancy for a union of two laminar families is suitable for developing a high-quality digital-halftoning software.
Abstract: In this paper we study the problem of rounding a real-valued matrix into an integer-valued matrix to minimize an Lp-discrepancy measure between them. To define the Lp-discrepancy measure, we introduce a family F of regions (rigid submatrices) of the matrix, and consider a hypergraph defined by the family. The difficulty of the problem depends on the choice of the region family F. We first investigate the rounding problem by using integer programming problems with convex piecewise-linear objective functions, and give some nontrivial upper bounds for the Lp-discrepancy. Then, we propose "laminar family" for constructing a practical and well-solvable class of F. Indeed, we show that the problem is solvable in polynomial time if F is a union of two laminar families. Finally, we show that the matrix rounding using L1-discrepancy for a union of two laminar families is suitable for developing a high-quality digital-halftoning software.
TL;DR: This work considers the problem of designing a minimum cost access network to carry traffic from a set of endnodes to a core network and uses a primal-dual approach to obtain a solution whose cost is within O(K^2) of optimal.
Abstract: We consider the problem of designing a minimum cost access network to carry traffic from a set of endnodes to a core network. Trunks are available in K types reflecting economies of scale. A trunk type with a high initial overhead cost has a low cost per unit bandwidth and a trunk type with a low overhead cost has a high cost per unit bandwidth. We formulate the problem as an integer program. We first use a primal-dual approach to obtain a solution whose cost is within O(K^2) of optimal. Typically the value of K is small. This is the first combinatorial algorithm with an approximation ratio that is polynomial in K and is independent of the network size and the total traffic to be carried. We also explore linear program rounding techniques and prove a better approximation ratio of O(K). Both bounds are obtained under weak assumptions on the trunk costs. Our primal-dual algorithm is motivated by the work of Jain and Vazirani on facility location [7]. Our rounding algorithm is motivated by the facility location algorithm of Shmoys et al. [12].
TL;DR: In this article, a rounding operation with respect to a calculation result is modified so that the rounding operation is varied according to features detected as to each block, without recording or transmitting additional information.
Abstract: To compress and encode an image by dividing the image into units of blocks, a method in which features of each block are detected and different operations are applied according to the features has been used often so as to reduce coding distortion and improve subjective image quality, but this method requires additional information representing features of blocks to be recorded and transmitted together, thereby deteriorating the coding efficiency. A rounding operation with respect to a calculation result is modified so that the rounding operation is varied according to features detected as to each block. By so doing, adaptive compressing and coding is implemented by only the processing upon coding, without recording or transmitting additional information.
TL;DR: A high-radix digit-recurrence algorithm for the computation of the logarithm is presented, and a speed-up of over 4 times with respect to a conventional radix-2 implementation with redundant arithmetic is shown.
Abstract: A high-radix digit-recurrence algorithm for the computation of the logarithm is presented in this paper. Selection by rounding is used in iterations j ≥ 2, and selection by table in the first iteration is combined with a restricted digit-set for the second one, in order to guarantee the convergence of the algorithm. A sequential architecture is proposed, and the execution time and hardware requirements of this architecture are estimated, for a target precision of n=32 bits and a radix r=256. These estimates are obtained according to a rough model for the delay and area cost of the main logic blocks employed, and show the achievement of a speed-up by over 4 times with regard to a conventional radix-2 implementation with redundant arithmetic.
TL;DR: It is shown that the CORDIC output can be made exact to the digits by an additional rounding process, which is especially useful for DDFS applications where the CORDIC output should be truncated to the final DAC (digital-to-analog converter) width.
Abstract: The circular-mode CORDIC (coordinate rotation digital computer) algorithm is analyzed for DDFS (direct digital frequency synthesis) applications. It is shown how the CORDIC parameters should be chosen to meet given DDFS parameters. Also, three methods of CORDIC datapath quantization: rounding, truncation, and jamming, have been investigated and their error bounds are derived. Through a set of simulations, it is demonstrated that jamming has desirable characteristics in many aspects such as complexity, speed, error, and bias. Finally, it is shown that the CORDIC output can be made exact to the digits by an additional rounding process, which is especially useful for DDFS applications where the CORDIC output should be truncated to the final DAC (digital-to-analog converter) width.
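The three datapath quantization methods named above can be sketched on integers that stand for fixed-point words. The definitions used here are the common textbook ones and are assumptions about what the paper means: truncation drops the low bits, rounding adds half of the dropped weight first, and jamming drops the low bits and then forces the least significant kept bit to 1. The function names are illustrative.

```python
def truncate(x, k):
    """Drop the k low-order bits (rounds toward minus infinity)."""
    return x >> k


def round_nearest(x, k):
    """Add half of the dropped weight, then drop k bits (requires k >= 1)."""
    return (x + (1 << (k - 1))) >> k


def jam(x, k):
    """Drop k bits, then force the least significant kept bit to 1."""
    return (x >> k) | 1


for word in (8, 10, 13):
    print(word, truncate(word, 2), round_nearest(word, 2), jam(word, 2))
```

Jamming needs no carry-propagating addition, which is consistent with the abstract's remark that it compares well on complexity and speed.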
TL;DR: This work studies the data placement problem, where the goal is to place certain data objects in fixed capacity caches in a network to optimize latency of access, and presents a constant approximation, while blowing up the cache capacities by a constant factor.
Abstract: We study the data placement problem [1, 3], where the goal is to place certain data objects (with possible replication) in fixed capacity caches in a network to optimize latency of access. The locations of the caches are given and each cache has capacities both on the number of objects it can store and the number of users it can serve. Each user has a demand for a specific object. The end objective is to optimize the average user latency of accessing the objects. We present a constant approximation, while blowing up the cache capacities by a constant factor. This improves the previous results, which either ignore the bound on the number of users [1], or which need to blow up the capacities by a logarithmic factor [3]. Our solution technique involves writing an integer program for this problem and rounding its linear relaxation. We note that our result is the best possible that can be obtained by LP rounding. The problem is MAX-SNP hard as shown in [1], and the linear program has unbounded integrality gap unless we relax the capacity constraints [5].
Our basic technique is to separate the rounding into two stages:
Opening Objects: In this stage, we consider each object separately, and open copies in the network. We ignore the interaction of this object with other objects due to the cache capacity constraints. We use the capacitated facility location rounding from [5].
Packing Objects: In this stage, we pack the objects into the cache so that cache capacity constraints are satisfied. We use the GAP rounding from [4].
TL;DR: In this article, a transaction tax calculating system is described that uses a hybrid method: it refers to tables providing rules for state and local tax brackets, rounding, and exceptions, and applies those rounding and bracket tax calculation rules in conjunction.
Abstract: A transaction tax calculating system using a hybrid method of referring to tables providing rules for state and local tax brackets, rounding and exceptions and applying those rounding and bracket tax calculation rules in conjunction for rounding.
TL;DR: In this paper, a product-sum operation that adds third data to a product of first data and second data is performed by splitting the fixed-point part of the product into more significant and less significant portions, adding the less significant portion to the third data first, adding that result to the more significant portion, and then rounding.
Abstract: To perform a product-sum operation by adding third data to a product of first data and second data, a floating point multiplier first multiplies the first data by the second data, and a bit string representing a fixed-point part in the multiplication result is divided into a portion representing more significant digits in the fixed-point part and a portion representing less significant digits in the fixed-point part. Then, a floating point adder first adds less significant multiplication result data having a bit string representing the less significant digits as a fixed-point part to the third data, and then adds the addition result to more significant multiplication result data having a bit string representing the more significant digits as a fixed-point part. A rounding process is performed on the two addition results to obtain a result of the product-sum operation.
TL;DR: In this paper, the pth derivative of a polynomial expressed as a finite series of Jacobi polynomials is evaluated using a pseudospectral method, and the stability of the algorithm is compared with classical methods for evaluating derivatives.
TL;DR: In this article, the problem of rounding a real-valued matrix into an integer-valued matrix to minimize an Lp-discrepancy measure between them is studied, and it is shown that the problem is solvable in polynomial time if F is a union of two laminar families.
Abstract: In this paper we study the problem of rounding a real-valued matrix into an integer-valued matrix to minimize an Lp-discrepancy measure between them. To define the Lp-discrepancy measure, we introduce a family F of regions (rigid submatrices) of the matrix, and consider a hypergraph defined by the family. The difficulty of the problem depends on the choice of the region family F. We first investigate the rounding problem by using integer programming problems with convex piecewise-linear objective functions, and give some nontrivial upper bounds for the Lp-discrepancy. Then, we propose "laminar family" for constructing a practical and well-solvable class of F. Indeed, we show that the problem is solvable in polynomial time if F is a union of two laminar families. Finally, we show that the matrix rounding using L1-discrepancy for a union of two laminar families is suitable for developing a high-quality digital-halftoning software.
TL;DR: In this article, the authors proposed a loss-free integer transformation which can be directly coupled to an entropy coder without quantization in order to obtain a windowing method and transformation method that can be implemented advantageously with regard to hardware.
Abstract: In order to obtain an integer transformation, which provides integer initial values, the TDAC (time domain aliasing cancellation) function of an MDCT (modified discrete cosine transform) is executed explicitly in the time domain before the forward transformation. In the event of overlapping windows, this leads to a Givens rotation that can be represented by lifting matrices. Time-discrete sampled values of an audio signal are initially combined in pairs to form a vector, which is then subjected sequentially to the action of one lifting matrix after another. According to the invention, a rounding step (20, 24, 28) is executed after each multiplication (18, 22, 26) of a vector by a lifting matrix, so that only integers result on the output side. A spectral representation with integer spectral values can be obtained by transforming (14) the windowed integer sampled values with an integer transformation. The inverse imaging with an inverse rotation matrix and with corresponding inverse lifting matrices leads to a precise reconstruction. The inventive concept provides a loss-free transformation, which can be directly coupled to an entropy coder without quantization in order to obtain a windowing method and transformation method that can be implemented advantageously with regard to hardware.
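The idea of a Givens rotation factored into lifting steps, with a rounding after each multiplication, can be sketched as follows. This is a generic illustration of integer lifting, not the patented windowing or transform; it assumes the standard three-shear factorization with coefficients (cos θ − 1)/sin θ and sin θ, uses Python's built-in `round`, and requires sin θ ≠ 0. The function names are hypothetical.

```python
import math


def lifting_rotate(x, y, theta):
    """Integer approximation of a rotation by theta using three lifting steps."""
    p = (math.cos(theta) - 1.0) / math.sin(theta)
    s = math.sin(theta)
    x += round(p * y)   # first shear, rounded to an integer update
    y += round(s * x)   # second shear
    x += round(p * y)   # third shear
    return x, y


def lifting_unrotate(x, y, theta):
    """Exact inverse: undo the three steps in reverse order."""
    p = (math.cos(theta) - 1.0) / math.sin(theta)
    s = math.sin(theta)
    x -= round(p * y)
    y -= round(s * x)
    x -= round(p * y)
    return x, y


theta = 0.7
rotated = lifting_rotate(100, -37, theta)
print(rotated, lifting_unrotate(*rotated, theta))
```

Each step changes one coordinate by a rounded function of the other, so subtracting the same rounded quantity reverses it exactly; this is why the rounding does not prevent precise reconstruction.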
TL;DR: In this article, a method and system for generating numerical test cases for testing binary floating-point arithmetic units for addition and subtraction operations, in order to verify the proper operation of the units according to a specified standard.
Abstract: A method and system for generating numerical test cases for testing binary floating-point arithmetic units for addition and subtraction operations, in order to verify the proper operation of the units according to a specified standard. The space for eligible test-cases is compatible with masks which stipulate the allowable forms of the operands and the result, including constant as well as variable digits in both the exponent and significand fields. The test-cases, which are generated randomly, cover the entire solution space without excluding any eligible solutions. All standard rounding modes are supported, and if a valid solution does not exist for a given set of masks, this fact is reported. The method is general and can be applied to any standard, such as the IEEE floating-point standard, in any precision. A system according to the present invention utilizes a set of sub-generators for biased exponents and significands, and also incorporates a fixed-point generator for performing calculations common to the other generators. The method relies on searching for solutions based on feasible carry sequences, and is also capable of generating test-cases for mask-constrained carry sequences.
TL;DR: In this article, the SPW tool is used to formalize the fixed-point number system and specify the different rounding modes in fixed point arithmetic such as the directed and even rounding modes.
Abstract: This paper addresses the formalization in higher-order logic of fixed-point arithmetic based on the SPW (Signal Processing WorkSystem) tool. We encoded the fixed-point number system and specified the different rounding modes in fixed-point arithmetic such as the directed and even rounding modes. We also considered the formalization of exceptions detection and their handling like overflow and invalid operation. An error analysis is then performed to check the correctness of the rounding and to verify the basic arithmetic operations, addition, subtraction, multiplication and division against their mathematical counterparts. Finally, we showed by an example how this formalization can be used to enable the verification of the transition from the floating-point to fixed-point algorithmic levels in the design flow of signal processors.
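The directed and even rounding modes mentioned above can be illustrated with exact rational arithmetic. This sketch is neither the higher-order logic formalization nor the SPW tool; it assumes a fixed-point format with a given number of fractional bits and ignores word length and overflow. The function name and mode labels are illustrative.

```python
import math
from fractions import Fraction


def quantize(x, frac_bits, mode):
    """Round x to a multiple of 2**-frac_bits under the given rounding mode."""
    scaled = Fraction(x) * (1 << frac_bits)
    if mode == "down":            # directed rounding toward minus infinity
        n = math.floor(scaled)
    elif mode == "up":            # directed rounding toward plus infinity
        n = math.ceil(scaled)
    elif mode == "nearest_even":  # round to nearest, ties to even
        n = round(scaled)
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return Fraction(n, 1 << frac_bits)


for mode in ("down", "up", "nearest_even"):
    print(mode, quantize(Fraction(5, 8), 2, mode), quantize(Fraction(7, 8), 2, mode))
```

In each mode the result differs from the input by less than one unit of the last fractional bit, and by at most half a unit for round to nearest, which is the kind of bound an error analysis of the rounding would check.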
TL;DR: The IEEE floating-point adder (FP-adder) as discussed by the authors achieves low latency by combining various optimization techniques, including a non-standard separation into two paths, a simple rounding algorithm, unifying rounding cases for addition and subtraction, sign-magnitude computation of a difference based on one's complement subtraction and compound adders, and fast circuits for approximate counting of leading zeros from borrow-save representation.
Abstract: An IEEE floating-point adder (FP-adder) design. The adder accepts normalized numbers, supports all four IEEE rounding modes, and outputs the correctly normalized rounded sum/difference in the format required by the IEEE Standard. The latency of the design for double precision is roughly 24 logic levels, not including delays of latches between pipeline stages. Moreover, the design can be easily partitioned into two stages comprised of twelve logic levels each, and hence, can be used with clock periods that allow for twelve logic levels between latches. The FP-adder design achieves a low latency by combining various optimization techniques, including a non-standard separation into two paths, a simple rounding algorithm, unifying rounding cases for addition and subtraction, sign-magnitude computation of a difference based on one's complement subtraction, compound adders, and fast circuits for approximate counting of leading zeros from borrow-save representation. A comparison of the design with other implementations suggests a reduction in the latency by at least two logic levels as well as simplified rounding implementation. A reduced precision version of the FP adder has been verified by exhaustive testing.
TL;DR: In this article, a pseudo-random value is generated and a preselected number of pseudo-random bits are added to the result to be rounded prior to truncation being applied.
Abstract: Nondeterministic rounding of fixed point values in a digital signal processor. A pseudo-random value is generated and a preselected number of pseudo-random bits are added to the result to be rounded prior to truncation being applied. Pseudo-random numbers may be generated by means including two maximal-length pseudo-random sequence generators.
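The scheme described above, adding pseudo-random bits below the truncation point before truncating, can be sketched as follows. The abstract mentions two maximal-length generators; as a simplifying assumption this sketch uses a single 16-bit Galois linear feedback shift register with the well-known feedback mask 0xB400. The function names and the seed are illustrative.

```python
def lfsr16(state):
    """One step of a 16-bit Galois LFSR with feedback mask 0xB400."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state


def dithered_truncate(value, drop_bits, state):
    """Add drop_bits pseudo-random bits to value, then truncate them away."""
    state = lfsr16(state)
    noise = state & ((1 << drop_bits) - 1)   # pseudo-random bits below the cut
    return (value + noise) >> drop_bits, state


state = 0xACE1  # any nonzero seed
results = []
for _ in range(8):
    rounded, state = dithered_truncate(10, 2, state)
    results.append(rounded)
print(results)  # a mix of 2 and 3; the exact value 10 / 4 is 2.5
```

Values that are already exact multiples of the kept precision are unchanged, while others round up with a frequency that tracks the discarded fraction, so the rounding error averages out instead of accumulating as a bias.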