TL;DR: New results are obtained which avoid the necessity of doing special testing to get the last bit rounded correctly in accordance with all of the IEEE rounding modes in the case of division and square root.
Abstract: The additional speed and precision of the IBM RISC System/6000 floating-point unit have motivated reexamination of algorithms to perform division, square root, and the elementary functions. New results are obtained which avoid the necessity of doing special testing to get the last bit rounded correctly in accordance with all of the IEEE rounding modes in the case of division and square root. For the elementary function library, a technique is described for always getting the last bit rounded correctly in the selected rounding mode.
TL;DR: Algorithms for accurately converting floating-point numbers to decimal representation and modification of the well-known algorithm for radix-conversion of fixed-point fractions by multiplication for use in fixed-format applications.
Abstract: We present algorithms for accurately converting floating-point numbers to decimal representation. The key idea is to carry along with the computation an explicit representation of the required rounding accuracy. We begin with the simpler problem of converting fixed-point fractions. A modification of the well-known algorithm for radix-conversion of fixed-point fractions by multiplication explicitly determines when to terminate the conversion process; a variable number of digits are produced. The algorithm has these properties: No information is lost; the original fraction can be recovered from the output by rounding. No “garbage digits” are produced. The output is correctly rounded. It is never necessary to propagate carries on rounding. We then derive two algorithms for free-format output of floating-point numbers. The first simply scales the given floating-point number to an appropriate fractional range and then applies the algorithm for fractions. This is quite fast and simple to code but has inaccuracies stemming from round-off errors and oversimplification. The second algorithm guarantees mathematical accuracy by using multiple-precision integer arithmetic and handling special cases. Both algorithms produce no more digits than necessary (intuitively, the “1.3 prints as 1.2999999” problem does not occur). Finally, we modify the free-format conversion algorithm for use in fixed-format applications. Information may be lost if the fixed format provides too few digit positions, but the output is always correctly rounded. On the other hand, no “garbage digits” are ever produced, even if the fixed format specifies too many digit positions (intuitively, the “4/3 prints as 1.333333328366279602” problem does not occur).
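The fixed-point fraction step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `fraction_digits`, the use of exact `Fraction` arithmetic, and the parameter names are assumptions. The mask `M` plays the role of the "explicit representation of the required rounding accuracy": digit generation by multiplication stops as soon as the remainder is within half a unit in the last place of the input.

```python
from fractions import Fraction

def fraction_digits(f, n, b=2, B=10):
    """Radix-B digits of 0 < f < 1, where f has n radix-b digits.

    Hypothetical helper: stops once the digits produced so far identify f
    to within half a unit in its last place.
    """
    R = Fraction(f)                  # remainder still to be converted
    M = Fraction(1, b ** n) / 2      # half a unit in the last place of the input
    digits = []
    while True:
        R *= B
        U = int(R)                   # next digit candidate (R is non-negative)
        R -= U
        M *= B                       # the accuracy mask scales with the remainder
        if R < M or R > 1 - M:       # enough digits to recover f by rounding
            break
        digits.append(U)
    # The last digit is rounded; in this scheme the increment never carries.
    digits.append(U if R <= Fraction(1, 2) else U + 1)
    return digits
```

For example, the 2-bit binary fraction 0.11 (that is, 3/4) yields the single digit 7, because 0.7 already rounds back to 3/4 at 2-bit precision.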
TL;DR: In this article, a processor for performing floating point arithmetic operations is provided that includes a circuit that performs a first floating point operation on a set of operands in a first cycle and a second floating point operation on an operand and a result of the first operation during a second cycle.
Abstract: A processor for performing floating point arithmetic operations is provided that includes a circuit that performs a first floating point arithmetic operation on a set of operands in a first cycle and a second floating point arithmetic operation on an operand and a result of the first floating point arithmetic operation during a second cycle. A control circuit is provided for, in a third cycle, transferring a result of the second floating point operation to the first floating point circuit for a first floating point operation in a next successive cycle while rounding the result of the second floating point operation.
TL;DR: It is shown that regardless of the probability distribution of the input signal x_n, it is always possible to add a suitable dither signal to the input of the system so that both the quantization error and the roundoff error are uniformly distributed, white, and mutually uncorrelated.
Abstract: The author considers the characteristics of the error resulting when a continuous amplitude signal x_n is quantized and then multiplied by a constant multiplier a under fixed-point roundoff arithmetic. It is shown that the overall error of such an operation can be decomposed into two terms: one being a scaled version of the error due to the quantization of x_n and the other due to rounding off the product aQ(x_n). Exact first- and second-order moments are derived for the quantization error, the roundoff error, and the overall error as a function of the multiplier a and the distribution of x_n. Sufficient conditions are given for the quantization error and the roundoff error to be individually uniformly distributed and white up to the first- and second-order moments, and also for them to be mutually uncorrelated. It is also shown that regardless of the probability distribution of the input signal x_n, it is always possible to add a suitable dither signal to the input of the system so that both the quantization error and the roundoff error are uniformly distributed, white, and mutually uncorrelated. For Gaussian inputs, the sufficient conditions given are not satisfied.
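The two-term decomposition mentioned in this abstract can be written out as a short worked equation. The symbols e_q (quantization error), e_r (roundoff error) and R(·) (the rounding of the product) are notation assumed here for illustration; the abstract itself names only Q, a and x_n.

```latex
\begin{aligned}
Q(x_n) &= x_n + e_q, \\
R\bigl(a\,Q(x_n)\bigr) &= a\,Q(x_n) + e_r, \\
R\bigl(a\,Q(x_n)\bigr) - a\,x_n &= a\,e_q + e_r .
\end{aligned}
```

Under this notation the overall error is the quantization error scaled by the multiplier a, plus the error from rounding the product aQ(x_n).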
TL;DR: In this paper, the initial operands and the exponent and fractional results are examined to predict when the postnormalization and rounding steps can be skipped, and the fraction result format enables a prediction under each of the addition, subtraction and multiplication possibilities, and under various choices of rounding mode which are used in floating point arithmetic.
Abstract: Apparatus for enhancing certain floating point arithmetic operations, by examining the initial operands and the exponent and fractional results and predicting when the steps of postnormalization and rounding can be skipped. The fraction result format enables a prediction of normalization and rounding under each of the addition, subtraction and multiplication possibilities, and under each of the various choices of rounding mode which are used in floating point arithmetic.
TL;DR: In this paper, a processor has an exception detector for detecting Overflow and Underflow exceptions in floating-point data subjected to a rounding operation. It includes first detection means for detecting whether the rounding operation to be executed is a rounding operation by raising or a rounding operation by truncating, second detection means for detecting a number of an exponent part, and third detection means, responsive to the outputs of the first and second detection means and to overflow data of a modified mantissa part, for detecting whether an exception occurs.
Abstract: A processor has an exception detector for detecting occurrence of Overflow and Underflow exceptions in floating-point data subjected to a rounding operation, said exception detector comprising first detection means for detecting whether the rounding operation to be executed is a rounding operation by raising or a rounding operation by truncating, means for unconditionally performing the rounding operation by raising on a mantissa part of said floating-point data to provide a modified mantissa part, second detection means for detecting a number of an exponent part of said floating-point data, and third detection means, responsive to detection outputs of said first and second detection means and overflow data of said modified mantissa part, for detecting whether or not Overflow or Underflow exception occurs in said floating-point data subjected to the rounding operation.
TL;DR: A systematic derivation of a radix-4 square-root algorithm using redundant residual and result is presented; the scheme incorporates an on-the-fly conversion and rounding of the result, thus eliminating a carry-propagate step to obtain the final result.
Abstract: A systematic derivation of a radix-4 square-root algorithm using redundant residual and result is presented. Unlike other similar schemes it does not use a table lookup or PLA for the initial step, resulting in a simpler implementation without any time penalty. The scheme can be integrated with division and incorporates an on-the-fly conversion and rounding of the result, thus eliminating a carry-propagate step to obtain the final result. The result-digit selection uses 3 bits of the result and 7 bits of the estimate of the residual.
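The idea of on-the-fly conversion mentioned in this abstract can be illustrated with a small sketch. It is not taken from the paper: the function name, the signed digit set {-2, ..., 2} and the integer model of the registers are assumptions. Two forms are kept, Q and QM = Q - 1 (in units of the last place), and each new signed digit is appended to one of them, so no carry or borrow ever has to propagate through the digits already produced.

```python
def on_the_fly_convert(digits, radix=4):
    """Convert signed digits (most significant first) to a conventional value.

    Hypothetical model: q holds Q, qm holds Q - 1. Each update is a plain
    "append one non-negative digit" step on either q or qm.
    """
    q, qm = 0, -1                      # arithmetic start so that qm == q - 1 holds
    for d in digits:
        if d >= 0:
            new_q = q * radix + d                  # append d to Q
        else:
            new_q = qm * radix + (radix + d)       # append radix - |d| to QM
        if d > 0:
            new_qm = q * radix + (d - 1)           # append d - 1 to Q
        else:
            new_qm = qm * radix + (radix - 1 + d)  # append radix - 1 - |d| to QM
        q, qm = new_q, new_qm
    return q, qm
```

Having QM available alongside Q is also what makes a final decrement for rounding cheap in such schemes: it becomes a selection between two existing registers rather than a subtraction.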
TL;DR: A binary tree search algorithm based on linear programming techniques is described for solving three-dimensional controlled rounding problems; it can effectively find controlled roundings when they exist or determine that no solution exists.
Abstract: Controlled rounding is a procedure that perturbs tabular data collected from respondents in such a way as to preserve the anonymity of the respondents while maintaining the integrity of the data. The three-dimensional controlled rounding problem is NP-complete. This paper describes a binary tree search algorithm for solving three-dimensional problems based on linear programming techniques. Computational results obtained from tests with over 31,000 randomly generated tables and 292 real-life tables have shown that this algorithm can effectively find controlled roundings when they exist or determine that no solution exists. The computational results also demonstrate that the running time of the search algorithm can be cut in half by using a heuristic for initializing the simplex basis for the linear programming problems.
TL;DR: A list of specific effects is proposed which can be tested for assessing robustness; if the filter behaves properly with respect to all these tests, it can be expected to behave properly also in any of the situations it will encounter during its operation.
Abstract: In a recursive digital filter the need for applying rounding/truncation and overflow corrections to the signal (data) at appropriate locations causes the filter behaviour to differ in a more or less noticeable way from the desired one. The less pronounced such differences are, the more robust the filter is said to be. For assessing robustness, a list of specific effects is proposed which can individually be tested. If the filter behaves properly with respect to all these tests it can be expected to behave properly also in any of the situations it will encounter during its operation. The possibility of guaranteeing robustness is discussed.
TL;DR: Rounding the coefficients of each plane (or line) equation without altering the combinatorial information, a proposed way to limit the growth in coefficient size when exact arithmetic is used for operations on solids, is shown to be NP-complete.
Abstract: A standard technique in solid modeling is to represent planes (or lines) by explicit equations and to represent vertices and edges implicitly by means of combinatorial information. Numerical problems that arise from using floating-point arithmetic to implement operations on solids can be avoided by using exact arithmetic. Since the execution time of exact arithmetic operators increases with the number of bits required to represent the operands, it is important to avoid increasing the number of bits required to represent the plane (or line) equation coefficients. Set operations on solids do not increase the number of bits required. However, rotating a solid greatly increases the number of bits required, thus adversely affecting efficiency. One proposed solution to this problem is to round the coefficients of each plane (or line) equation without altering the combinatorial information. We show that such rounding is NP-complete.
TL;DR: In this paper, the fractional result of a floating point arithmetic operation is simultaneously passed to a normalize circuit and a round circuit, and the first two bit positions of the result are examined.
Abstract: A method and apparatus for processing postnormalization and rounding in parallel in floating point arithmetic circuits. The fractional result of a floating point arithmetic operation is simultaneously passed to a normalize circuit (34a) and a round circuit (34b), and the first two bit positions of the fractional result are examined. If the 2-bit format is 1.X the round circuit is activated; if the 2-bit format is 0.1X the fractional result is shifted left one position and the round circuit is activated; if the 2-bit format is in neither of the above formats the normalize circuit is activated. In no event is it necessary to activate sequentially the normalize circuit and the round circuit.
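The selection rule in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `select_path`, the integer encoding of the fraction (the top bit is the integer bit, the next bit has weight 1/2) and the returned exponent adjustment are hypothetical and do not come from the patent.

```python
def select_path(fraction, width):
    """Choose the circuit to activate from the two leading bits of the fraction.

    `fraction` is an unsigned integer of `width` bits; bit width-1 is the
    integer bit. Returns (circuit, fraction, exponent_adjust).
    """
    integer_bit = (fraction >> (width - 1)) & 1
    half_bit = (fraction >> (width - 2)) & 1
    if integer_bit == 1:
        # Format 1.X: already normalized, only rounding is needed.
        return ("round", fraction, 0)
    if half_bit == 1:
        # Format 0.1X: a single left shift normalizes, then round.
        shifted = (fraction << 1) & ((1 << width) - 1)
        return ("round", shifted, -1)
    # Neither format: a full normalization is needed and rounding is skipped.
    return ("normalize", fraction, 0)
```

In every branch exactly one of the two circuits is used, which mirrors the abstract's statement that the normalize and round circuits never have to run one after the other.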
TL;DR: In this article, a new and general stability chart for the geometrical rounding mechanism is presented, showing that there is no machine setup which can prevent every possible instability related with any number of lobes on the workpiece.
TL;DR: Numerical results obtained from processing 32,500 randomly generated tables and 292 real-life tables have demonstrated that this algorithm can efficiently find controlled roundings, provided they exist.
Abstract: Controlled rounding is a procedure whereby tabular data gathered from respondents is perturbed in such a way as to preserve the anonymity of the respondents while maintaining the integrity of the data. This paper describes an algorithm for solving three-dimensional controlled rounding problems which is based on simulated annealing, linear programming, and binary search procedures. Numerical results obtained from processing 32,500 randomly generated tables and 292 real-life tables have demonstrated that this algorithm can efficiently find controlled roundings, provided they exist. The algorithm is significantly faster than any previously known solution procedure for this class of problems. INFORMS Journal on Computing, ISSN 1091-9856, was published as ORSA Journal on Computing from 1989 to 1995 under ISSN 0899-1499.
TL;DR: A versatile floating point adder which performs high speed floating point addition or subtraction on operands supplied in a signed magnitude format includes separate exponent and mantissa data paths for processing the exponent fields and the mantissa fields of the floating point binary numbers.
Abstract: A versatile floating point adder which performs high speed floating point addition or subtraction on operands supplied in a signed magnitude format includes separate exponent and mantissa data paths for processing the exponent fields and mantissa fields of the floating point binary numbers to be added or subtracted. The exponent data path computes the absolute difference between the exponents of the floating point numbers, passes the larger exponent, and adjusts the larger exponent by an amount needed to normalize the mantissa and to reflect an overflow in the mantissa addition/subtraction and mantissa rounding operations. The mantissa data path denormalizes one of the input mantissas, adds the two mantissas after the denormalization operation, post-normalizes the resulting mantissa, and rounds the mantissa to the correct precision.
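The two data paths described in this abstract can be mimicked in software. The sketch below is a toy model, not the patented circuit: the 8-bit mantissa width `P`, the three extra bits `G`, the triple format `(sign, exponent, mantissa)` and the choice of round-to-nearest-even are all assumptions, and operands are assumed to be nonzero with normalized mantissas.

```python
P = 8   # mantissa width in bits (illustrative assumption)
G = 3   # guard, round and sticky bits kept during the addition (assumption)

def fp_add(a, b):
    """Add (sign, exponent, mantissa) triples, value = (-1)**sign * mantissa * 2**exponent."""
    if a[1] < b[1]:
        a, b = b, a
    (sa, ea, ma), (sb, eb, mb) = a, b
    # Exponent path: exponent difference, and pass the larger exponent.
    diff = ea - eb
    exp = ea
    # Mantissa path: denormalize the operand with the smaller exponent,
    # folding any shifted-out bits into a sticky bit.
    ma <<= G
    mb <<= G
    lost = mb & ((1 << diff) - 1)
    mb = (mb >> diff) | (1 if lost else 0)
    total = (-ma if sa else ma) + (-mb if sb else mb)
    if total == 0:
        return (0, 0, 0)
    sign, mag = (1 if total < 0 else 0), abs(total)
    # Post-normalize: one right shift on carry-out, left shifts after cancellation.
    if mag >= 1 << (P + G):
        mag = (mag >> 1) | (mag & 1)
        exp += 1
    while mag < 1 << (P + G - 1):
        mag <<= 1
        exp -= 1
    # Round to nearest even, adjusting the exponent if rounding overflows.
    low, mant = mag & ((1 << G) - 1), mag >> G
    half = 1 << (G - 1)
    if low > half or (low == half and mant & 1):
        mant += 1
        if mant == 1 << P:
            mant >>= 1
            exp += 1
    return (sign, exp, mant)
```

For instance, adding 255 and 0.5 in this model gives a tie at 255.5, which rounds to the even neighbour 256 and bumps the exponent, the "overflow in the mantissa rounding" case the abstract refers to.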
TL;DR: The three-dimensional, zero-restricted controlled rounding problem is shown to be NP-complete, and a sequence of relaxations of this basic rounding problem is introduced.
Abstract: Controlled rounding is a procedure that perturbs tabular data collected from respondents in such a way as to preserve the anonymity of the respondents while maintaining the integrity of the data. Controlled rounding techniques are regularly used by the United States Bureau of the Census and its counterparts in other countries. This paper discusses the complexity of the three-dimensional controlled rounding problem. In particular, the three-dimensional, zero-restricted controlled rounding problem is shown to be NP-complete. As zero-restricted controlled roundings may fail to exist, various relaxations of this basic rounding problem have been defined. The paper introduces a sequence of such relaxations and proceeds to address the existence of solutions and complexity issues for the relaxed problems.
TL;DR: In this paper, when rounding the points on a curve obtained by equal division to adjacent lattice points, the nearest lattice point is not selected; instead, if the slope at the point is greater than 1, an adjacent point with an integral X-coordinate is approximately computed, whereas if the slope is less than 1, an adjacent point with an integral Y-coordinate is approximately computed.
Abstract: In a method of drawing a parametric curve on a two-dimensional coordinate system having discrete coordinate values, when rounding the respective points on the curve obtained by equal division to adjacent lattice points, the nearest lattice point is not selected. Instead, if the slope at the point is greater than 1, an adjacent point having an integral X-coordinate is approximately computed, whereas if the slope is less than 1, an adjacent point having an integral Y-coordinate is approximately computed. Since the specific lattice points nearest to the computed points are selected, a smooth curve can be drawn without producing any redundant points. When linear interpolation is executed, it uses a straight line passing through the not-yet-rounded point rather than a line passing through the already rounded point. As a result, linear interpolation very close to the true curve is achieved consistently.
TL;DR: In this paper, the existence of a singular real matrix which rounds to a square floating-point matrix was shown to be true for a single real matrix.
TL;DR: In this article, a multiple selector logic circuit for selecting divisor multiples in 2-bit, nonrestoring divide sequences, which provides a proper and accurate quotient result and remainder, and which produces rounding and indication of exact or inexact result in conformance with ANSI/IEEE Standard 754-1985, is presented.
Abstract: A multiple selector logic circuit for selecting divisor multiples in 2-bit, non-restoring divide sequences, which provides a proper and accurate quotient result and remainder, and which produces rounding and indication of exact or inexact result in conformance with ANSI/IEEE Standard 754-1985; the multiple selector logic circuit incorporates semiconductor circuits including a multiplier table having a particular matrix of multipliers which meet the standard.
TL;DR: It is shown why the above considerations add realism to a model, and how they may be incorporated in the solution procedure of quantity discounting.
Abstract: Consideration of bounds, discontinuities and the integrality of decision variables is often helpful for responding to many practical requirements. Using the case of quantity discounting as an example, it is shown why the above considerations add realism to a model, and how they may be incorporated in the solution procedure. The bounds on the order quantity are fulfilled by establishing suitable stopping rules, while gaps in order quantities are resolved by modifying the input data. As for integrality, a result is established which shows that the feasible points for the integer optimum can be obtained by suitably rounding those for the continuous optimum. A comprehensive algorithm containing these features is proposed, and analytical proofs of the results are included.
TL;DR: The level-index number system eliminates overflow and underflow problems through its closure in finite-precision arithmetic.
Abstract: The level-index number system represents numbers in a computer by their repeated logarithms. Its chief advantage is closure in finite-precision arithmetic, thereby eradicating the problems of overflow and underflow. This talk indicates how a Wilkinson-type running error analysis can be carried out in the new system.
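To make "repeated logarithms" concrete, here is a minimal sketch of one common way to define a level-index pair for a non-negative real number: take natural logarithms until the value drops below 1, and record how many were taken (the level) and what remains (the index). The function names and the use of Python floats are assumptions for illustration; the abstract does not give an implementation.

```python
import math

def to_level_index(x):
    """Return (level, index) with 0 <= index < 1 for a non-negative x."""
    if x < 0:
        raise ValueError("this sketch only handles non-negative values")
    level = 0
    while x >= 1.0:
        x = math.log(x)      # one more logarithm, one more level
        level += 1
    return level, x

def from_level_index(level, index):
    """Invert to_level_index by exponentiating `level` times."""
    x = index
    for _ in range(level):
        x = math.exp(x)
    return x
```

Even a value as large as 1e100 only reaches level 4 under this definition, which gives some intuition for why overflow is not a practical concern in such a representation.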
TL;DR: In this paper, the effects of rounding on the significance and power of four test statistics are investigated for rounded non-normal data, and guidelines are given on how the degree of precision recommended for normal populations can be applied when the population is non-normal.
Abstract: In a previously published study, the effects of rounding on the significance and power of four test statistics were considered when the parent population was normal. Here we investigate how these tests will perform for rounded non-normal data. Guidelines are given on how the degree of precision recommended for normal populations can be applied when the population is non-normal.
TL;DR: In this paper, an approximate solution Ya of a function F with respect to a given value X is derived by referring to a relation Y=F(X) at an accuracy at which the error between Ya and the infinitely precise solution Y is smaller than the weight of a digit two places below the lowest significant digit of the final solution.
Abstract: An approximate solution Ya of a function F with respect to a given value X is derived by referring to a relation Y=F(X) at an accuracy at which an error between the approximate solution Ya and an infinitely precise solution Y is smaller than a weight of a digit in a place lower than a lowest place of significant digits for a final solution by two places. The approximate solution Ya is rounded to an interim solution Yr equal to one of possible interim solutions which is closest to the approximate solution Ya. A value Xr is derived from the interim solution Yr and an inverse function F^-1 by referring to a relation Xr=F^-1(Yr). A sticky digit S is set to 0, 1, or -1 in response to the relation between the magnitudes of the values X and Xr and other information. The sticky digit S is added to a place immediately lower than a lowest place of the interim solution Yr. A result of this addition is rounded in a designated rounding mode to obtain the final solution.
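The procedure in this abstract can be tried out on a concrete function. The sketch below assumes F is the square root (so F^-1 is squaring), works in decimal digits, takes `math.sqrt` as the source of the approximate solution Ya, and assumes interim solutions carry one digit more than the final result; none of these choices come from the patent, and the function name is hypothetical. It is only meant for small inputs where a double-precision approximation is comfortably accurate enough.

```python
import math
from decimal import Decimal, ROUND_CEILING, ROUND_FLOOR, ROUND_HALF_EVEN
from fractions import Fraction

def rounded_sqrt(x, places, mode):
    """Square root of x rounded to `places` decimals in the given decimal rounding mode."""
    x = Fraction(x)
    unit = 10 ** (places + 1)            # assumed: interim solutions have one extra digit
    ya = math.sqrt(float(x))             # approximate solution Ya
    yr = round(ya * unit)                # interim solution Yr, counted in units
    xr = Fraction(yr, unit) ** 2         # Xr = F^-1(Yr), computed exactly
    s = (x > xr) - (x < xr)              # sticky digit S: +1, 0 or -1
    # Append S one place below the lowest place of Yr, then round once.
    with_sticky = Decimal(yr * 10 + s).scaleb(-(places + 2))
    return with_sticky.quantize(Decimal(1).scaleb(-places), rounding=mode)
```

The comparison of X with Xr is what distinguishes an exact result (S = 0, for example the square root of 4) from one that lies just above or below the interim value, so directed modes such as `ROUND_CEILING` do not round an exact result away from itself.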
TL;DR: In this article, several effective algorithms which preserve the inclusion property concerning the complex zeros and which have a high computational efficiency are constructed, which combine the efficiency of ordinary floating-point iterations with the accuracy control that may be obtained by the iterations in interval arithmetic.
TL;DR: A complete formalization of the IEEE floating-point standard using the HOL (Higher-Order-Logic) specification and verification system developed at Cambridge University is presented and a technique for verifying hierarchies of communicating hardware interpreters is described.
Abstract: Due to their ever increasing complexity, it is becoming difficult to assure that hardware systems will not malfunction because of design errors. Hardware verification involves using theorem-proving techniques to verify that the design of a system satisfies its specification. This paper describes work in progress on verifying a floating-point coprocessor based on the MC68881/82, with respect to the IEEE standard floating-point arithmetic (ANSI/IEEE Std 754-1985). A complete formalization of the IEEE floating-point standard using the HOL (Higher-Order-Logic) specification and verification system developed at Cambridge University is presented. The specification of floating-point numbers, floating-point arithmetic functions, rounding schemes and postnormalization are described. In order for the specification to be adaptable to different implementations, several model parameters are used in specifying all the functions. The floating-point coprocessor is specified as a hierarchy of interpreters. The communication with the CPU is modeled as standard interprocessor communication. A technique for verifying hierarchies of communicating hardware interpreters is described.
TL;DR: In this paper, it is shown that the rounding of first order phase transitions in solids of finite size is monitored by the correlation length of the finite system, or equivalently by the so-called mass gap for the quantum Hamiltonian version of the model.
Abstract: First order phase transitions are rounded in solids of finite size. It is shown here that the above rounding is monitored by the correlation length ξL of the finite system, or equivalently by the so-called mass gap for the quantum Hamiltonian version of the model. Scaling with size is studied as a function of variable boundary conditions for the cylinder geometry (infinite strips with finite width), and a striking crossover is found in the mass gap behavior when the coupling g along the boundary becomes anti-periodic. For g>0, the rounding is exponential with size and an accurate determination of the spontaneous magnetization (order parameter) of the infinite system is obtained from numerical extrapolations.
TL;DR: In this article, the output of a bit inverting circuit is sent to a rounding circuit via a multiplexer only in a partial arithmetic process for extraction of the square root and performing a right shift by one bit after addition of 1 to the minimum digit.
Abstract: PURPOSE: To increase arithmetic processing speed for extraction of the square root by supplying the output of a bit inverting circuit to a rounding circuit via a multiplexer only in a partial arithmetic process for extraction of the square root and performing a right shift by one bit after addition of 1 to the minimum digit. CONSTITUTION: A multiplexer 8 supplies the output of a bit inverting circuit 7 to a rounding circuit 6 only in a partial arithmetic process for extraction of the square root. The output of the circuit 7 receives 1 at its minimum digit and is shifted to the right by one bit in the circuit 6. Otherwise the output of a right shift circuit 5 is supplied directly to the circuit 6 via the multiplexer 8. Thus it is possible to perform the binary calculation of 1/2*(3-AXi) where 0
TL;DR: In this paper, a method and a high speed processor incorporating that method are set forth for processing signals representing outputs generated by remainderless division algorithms (102) and remainderless square root algorithms so as to obtain rounded outputs (112) conforming to the IEEE 754-1985 binary floating point arithmetic standard.
Abstract: A method and a high speed processor (HSP) incorporating that method are set forth for processing signals representing outputs generated by remainderless division algorithms (102) and remainderless square root algorithms so as to obtain rounded outputs (112) conforming to the IEEE 754-1985 binary floating point arithmetic standard. The method and procedure of the present invention allow the solutions of floating-point computations to be rounded such that sign bits, as well as binary bits, of the rounded results are in full compliance with all guidelines of the stated standard.
TL;DR: In this paper, the revised simplex method is used to solve the relaxed 0-1 linear programming model for test construction, and is modified so that the characteristics of test construction problems are taken into account.
Abstract: Linear programming models with 0-1 variables are useful for the construction of tests from an item bank. Most solution strategies for these models start with solving the relaxed 0-1 linear programming model, allowing the 0-1 variables to take on values between 0 and 1. Then, a 0-1 solution is found by just rounding, optimal rounding, or a heuristic. In most applications, the latter can be executed very rapidly. This paper uses the revised simplex method to solve the relaxed 0-1 linear programming model for test construction. The simplex method is modified such that the characteristics of test construction problems are taken into account. The modifications were implemented in the computer program LINPROG. Two item banks, each containing 450 items, were generated to determine if central processing unit (CPU) time was gained. Computational experiments showed a gain of CPU time for most modifications. Ten tables present the results for the modifications.
TL;DR: In this paper, a novel method for evaluating the frequency of deficiency states of power systems is suggested, which uses exact generation capacity states and system load-duration and load-frequency characteristics.
Abstract: A novel method for evaluating the frequency of deficiency-states of power systems is suggested. The method uses exact generation-capacity states and system load-duration and load-frequency characteristics. The approach accurately models load daily diagrams, including multimodal diagrams. The generating units are modeled as multistate elements. Analytic expressions for the frequency of deficiency-states are derived in closed form to avoid the accumulation of rounding errors that might appear in recursive approaches. Illustrative application examples are included.
TL;DR: In this paper, normalization is performed by counting the number of leading zeros of the mantissa fu111 at a leading zero detection circuit 11; a barrel shifter 10 is operated based on the counted result, and the mantissa fn120 is generated by the normalization processing.
Abstract: PURPOSE: To accelerate processing speed and to compress the scale of hardware by generating data for rounding in parallel with the normalization processing of a mantissa. CONSTITUTION: The normalization processing is performed by counting the number of leading zeros of the mantissa fu111 at a leading zero detection circuit 11. A barrel shifter 10 is operated based on the counted result, and the mantissa fn120 is generated by performing the normalization processing of the mantissa fu111. Also, the complement of an exponent is performed by operating an adder 23 and a complementor 22, and the exponent er131 is then generated. Also, rounding processing is performed by generating the bits required for the rounding judgement from the mantissa fu111 before normalization, in a form corresponding to the destination size, at an LSB/R/S generation circuit 18. Then, the rounding judgement is performed at a rounding judging circuit 19 based on those bits, and the rounding signal generated by that judgement is added to the mantissa fn120 at an adder 20.