TL;DR: Algorithms for the four binary arithmetic operations and for rounding are presented, together with proofs of their correctness; appropriate formulas for a priori error analysis of these algorithms are presented.
Abstract: In this paper we discuss directed roundings and indicate how hardware might be designed to produce proper upward directed, downward directed, and certain commonly used symmetric roundings. Algorithms for the four binary arithmetic operations and for rounding are presented, together with proofs of their correctness; appropriate formulas for a priori error analysis of these algorithms are presented. Some of the basic applications of directed roundings are surveyed.
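As a rough illustration of what upward and downward directed rounding deliver for a single addition, the sketch below brackets the exact sum of two binary floating-point numbers between adjacent machine numbers. It is not the hardware algorithm of the paper: the function name `add_directed` is hypothetical, the exact sum is obtained with rational arithmetic, and finite inputs with no overflow are assumed.

```python
import math
from fractions import Fraction


def add_directed(a: float, b: float) -> tuple[float, float]:
    """Return (down, up): the floats bracketing the exact sum a + b.

    Assumes finite inputs and no overflow (an assumption of this sketch).
    """
    exact = Fraction(a) + Fraction(b)   # exact rational value of the sum
    nearest = a + b                     # hardware round-to-nearest result
    approx = Fraction(nearest)
    if approx == exact:
        # The sum is representable, so both directed roundings coincide.
        return nearest, nearest
    if approx < exact:
        # Nearest rounded down, so the upward result is the next float up.
        return nearest, math.nextafter(nearest, math.inf)
    # Nearest rounded up, so the downward result is the next float down.
    return math.nextafter(nearest, -math.inf), nearest
```

The pair returned is the smallest machine interval containing the exact sum, which is the building block that interval arithmetic relies on.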
TL;DR: This article tests the accuracy of 6-digit hexadecimal and 22-digit binary floating-point number representations, combined with the usual chop and round modes of arithmetic with various numbers of guard digits and with a modified round mode with guard digits, in evaluating sums, products, and inner products.
Abstract: This paper presents the statistical results of tests of the accuracy of certain arithmetic systems in evaluating sums, products and inner products, and analytic error estimates for some of the computations. The arithmetic systems studied are 6-digit hexadecimal and 22-digit binary floating point number representations combined with the usual chop and round modes of arithmetic with various numbers of guard digits, and with a modified round mode with guard digits. In a certain sense, arithmetic systems differing only in their use of binary or hexadecimal number representations are shown to be approximately statistically equivalent in accuracy. Further, the usual round mode with guard digits is shown to be statistically superior in accuracy to the usual chop mode in all cases save one. The modified round mode is found to be superior to the chop mode in all cases.
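The difference between chop and round modes can be seen on a toy running sum. The sketch below is only an analogy: it uses 6-digit decimal arithmetic from Python's `decimal` module rather than the hexadecimal and binary systems studied in the paper, it has no guard digits, and the data and the helper name `running_sum` are invented for illustration.

```python
from decimal import Context, Decimal, ROUND_DOWN, ROUND_HALF_EVEN


def running_sum(values, rounding, digits=6):
    """Accumulate values, rounding every partial sum to `digits` digits."""
    ctx = Context(prec=digits, rounding=rounding)
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, v)   # each addition is rounded in ctx
    return total


# One large term followed by ten small ones (illustrative data only).
values = [Decimal("1.23456")] + [Decimal("0.000006")] * 10
exact = sum(values)                              # exact decimal sum
chopped = running_sum(values, ROUND_DOWN)        # chop mode
rounded = running_sum(values, ROUND_HALF_EVEN)   # round mode
chop_error = abs(chopped - exact)
round_error = abs(rounded - exact)
```

On this data the chopped sum never moves off its first term, while the rounded sum overshoots slightly, so the rounding error is smaller in magnitude. A single example like this says nothing statistical; the paper's conclusions rest on many trials.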
TL;DR: The variance of error at the output of a digital control system due to input data quantization and product rounding in a digital controller is derived.
TL;DR: An add/subtract logic network for forming the sum or difference of two n-bit operands with round-off occurring simultaneously with the formation of the n-bit result is presented in this paper.
Abstract: An add/subtract logic network is presented for forming the sum or difference of two n-bit operands, with round-off occurring simultaneously with the formation of the n-bit result.
TL;DR: A floating-point error analysis is given for the standard recursive method of evaluating trigonometric polynomials, and it is shown that, by introducing a phase-shift, one can hold the error growth down to an essentially linear function of the degree.
Abstract: A floating-point error analysis is given for the standard recursive method of evaluating trigonometric polynomials. It is shown that, by introducing a phase-shift, one can hold the error growth down to an essentially linear function of the degree. Explicit computable error bounds are derived and numerically verified. Given the problem of evaluating the Fourier series F(θ) [sums with upper limit N - 1; equation lost in extraction], the most efficient known method is Clenshaw's algorithm, which is also known as the "Goertzel-Watt" algorithm (1), (2). The numerical properties of the floating-point algorithm were analyzed by Gentleman (2), and the principal conclusion was that the cumulative effect of rounding errors could become very severe whenever θ was small modulo π. By using the phase-shift φ = π/2 - θ, one can always replace the Fourier series F(θ) by the equivalent series G(φ) = F(π/2 - φ), and one can therefore always avoid evaluating a Fourier series at a small argument. If we arrange to perform the phase-shift whenever θ is in the range (-π/4, π/4) modulo π, we can guarantee that all evaluations occur with arguments in the range (π/4, 3π/4) modulo π. Under these conditions, the behavior of the Clenshaw algorithm will be shown to be quite good. The transformation consists in determining the coefficients of G(φ), and these are related to C_r, S_r by
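For orientation, the standard Clenshaw recurrence named in this abstract can be sketched as follows. This shows only the plain recursive evaluation, not the paper's phase-shifted variant or its coefficient transformation; the function name `clenshaw_trig` and the convention that `c[k]` and `s[k]` multiply cos(kθ) and sin(kθ) are assumptions of the sketch.

```python
import math


def clenshaw_trig(c, s, theta):
    """Evaluate sum_k c[k]*cos(k*theta) + s[k]*sin(k*theta), k = 0..len(c)-1.

    c and s must have the same length; s[0] is ignored because sin(0) = 0.
    """
    if len(c) != len(s) or not c:
        raise ValueError("c and s must be non-empty and of equal length")
    cos_t = math.cos(theta)
    two_cos = 2.0 * cos_t
    bc1 = bc2 = 0.0   # b_{k+1}, b_{k+2} for the cosine sum
    bs1 = bs2 = 0.0   # b_{k+1}, b_{k+2} for the sine sum
    # Backward recurrence b_k = a_k + 2*cos(theta)*b_{k+1} - b_{k+2}.
    for k in range(len(c) - 1, 0, -1):
        bc1, bc2 = c[k] + two_cos * bc1 - bc2, bc1
        bs1, bs2 = s[k] + two_cos * bs1 - bs2, bs1
    # Cosine sum is b_0 - b_1*cos(theta); sine sum is b_1*sin(theta).
    cos_part = c[0] + cos_t * bc1 - bc2
    sin_part = bs1 * math.sin(theta)
    return cos_part + sin_part
```

The recurrence needs one cosine, one sine, and one multiplication per coefficient, which is why it is attractive; its sensitivity for θ near a multiple of π is the issue the abstract addresses.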
TL;DR: Assuming binary computers with optimal upward directed rounding, a priori error analysis shows that the smallest machine-representable interval containing the square root of a machine-representable number may be obtained via Newton's method without using interval arithmetic.
Abstract: The smallest machine representable interval containing the square root of a given machine representable number is sought. Assuming binary computers with optimal upward directed rounding, it is shown by a priori methods of error analysis that this interval may be obtained via Newton's method without using interval arithmetic. Less sharp but still useful results are obtained for the cube root.
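A minimal sketch of the goal, under different assumptions from the paper: Newton's iteration supplies a candidate, and exact rational comparisons (rather than a priori error analysis with directed rounding) then tighten it to adjacent machine numbers. The function name `sqrt_enclosure` is hypothetical, and only positive finite inputs are handled.

```python
import math
from fractions import Fraction


def sqrt_enclosure(x: float) -> tuple[float, float]:
    """Return the smallest float interval (lo, hi) containing sqrt(x)."""
    if not (x > 0.0 and math.isfinite(x)):
        raise ValueError("x must be positive and finite")
    # Initial guess within a factor of two of sqrt(x), from the exponent.
    _, e = math.frexp(x)
    y = math.ldexp(1.0, e // 2)
    # A few Newton steps y <- (y + x/y)/2 in ordinary float arithmetic.
    for _ in range(8):
        y = 0.5 * (y + x / y)
    target = Fraction(x)
    lo = y
    # Exact checks: move lo until lo**2 <= x < next(lo)**2.
    while Fraction(lo) ** 2 > target:
        lo = math.nextafter(lo, 0.0)
    while Fraction(math.nextafter(lo, math.inf)) ** 2 <= target:
        lo = math.nextafter(lo, math.inf)
    if Fraction(lo) ** 2 == target:
        return lo, lo            # sqrt(x) is itself a machine number
    return lo, math.nextafter(lo, math.inf)
```

The exact comparisons make the result independent of how well the Newton steps converged; the paper's point is that, with suitable directed rounding, such after-the-fact checks are unnecessary.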
TL;DR: In this paper, a modification is proposed in the design of digital ladder structures that have transmission zeros; it reduces the number of adds required in the structure.
Abstract: A modification is proposed in the design of digital ladder structures, which have transmission zeros. This results in a reduction in the number of adds required in the structure. The location of the transmission zeros under coefficient rounding is then analyzed.
TL;DR: Methods are given for producing plots quickly on a teletypewriter or similar remote terminal; they are useful for any display that can be regarded as a form of inspection of residuals, including probability plots.
Abstract: Plots are generated which may be produced very quickly on a teletypewriter or similar remote terminal. The methods are useful for all displays which can be regarded as some form of the inspection of residuals. Versions for use in probability plotting, as an example not always thought of as examination of residuals, are also given. Examination of the variance and mean square error of rounding indicates that very little information can be lost when the data are smoothly distributed.
TL;DR: Exact expressions are given for the mean and variance of errors that result when rounding and truncation are performed on both sign-magnitude and two's complement finite accuracy number representations.
Abstract: Exact expressions are given for the mean and variance of errors that result when rounding and truncation are performed on both sign-magnitude (SM) and two's complement (TC) finite accuracy number representations. The results are significantly different from those of the often-used continuous model in the important case of one- or two-bit quantization.
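The kind of quantity discussed here can be checked by brute-force enumeration for small word lengths. The sketch below is not the paper's closed-form result: the mode names, the tie-breaking rules, and the assumption that every input in the given range is equally likely are choices made for illustration, and overflow of a rounded result past the range is ignored.

```python
from fractions import Fraction


def quantize(k: int, drop: int, mode: str) -> int:
    """Reduce integer k (in fine-LSB units) to a multiple of 2**drop."""
    step = 1 << drop
    if mode == "tc_trunc":    # two's complement truncation: floor
        return (k // step) * step
    if mode == "sm_trunc":    # sign-magnitude truncation: toward zero
        mag = (abs(k) // step) * step
        return mag if k >= 0 else -mag
    if mode == "tc_round":    # add half a step, then floor
        return ((k + step // 2) // step) * step
    if mode == "sm_round":    # round the magnitude, ties away from zero
        mag = ((abs(k) + step // 2) // step) * step
        return mag if k >= 0 else -mag
    raise ValueError(mode)


def error_moments(values, drop, mode):
    """Exact mean and variance of Q(k) - k over equally likely inputs."""
    errors = [Fraction(quantize(k, drop, mode) - k) for k in values]
    n = len(errors)
    mean = sum(errors, Fraction(0)) / n
    var = sum(((e - mean) ** 2 for e in errors), Fraction(0)) / n
    return mean, var


# Dropping two bits from 4-bit two's complement words (illustrative).
tc_mean, tc_var = error_moments(range(-8, 8), 2, "tc_trunc")
```

For this case the enumeration gives a mean of -3/2 and a variance of 5/4 in fine-LSB units, whereas the continuous uniform model would predict -2 and 4/3, a small instance of the discrepancy the abstract points to.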
TL;DR: Several serious problems in the theoretical foundations of factor analysis are reviewed in this paper, including rounding errors in the computerized extraction of factors from the perspective of psychoanalyticametric theory.
Abstract: Several serious problems in the theoretical foundations of factor analysis are reviewed. The first, factor‐naming, is discussed as a determinant of factor fame. A second problem, that of rounding errors in the computerized extraction of factors, is treated from the perspective of psychoanalyticametric theory. Results of an empirical study of the causes of rounding errors are presented and discussed.
TL;DR: This implementation is machine-independent in the sense that it will work on any computer having a floating-point number system F characterized as follows: each number x which belongs to F has a radix-beta t-digit fraction, where t ≥ 1.
Abstract: The paper describes an ALGOL 60 procedure which is an implementation of the floating-point summation technique described in Malcolm (1971). This implementation is machine-independent in the sense that it will work on any computer having a floating-point number system F characterized as follows: Each number x which belongs to F has a radix-beta t-digit fraction, where t ≥ 1. The radix beta can be any positive integer greater than 1. The number 0 is contained in F, but no assumption is made about its representation. All floating-point operations (e.g., addition and multiplication) are assumed to result in either 0 or a normalized floating-point number contained in F. The machine may do either proper rounding or chopping (truncation). (Note that this definition of F excludes machines using extra-length accumulators for intermediate arithmetic. However, this algorithm is seldom needed on such machines.)
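To make the parameters of F concrete, and to show why an accurate summation procedure is wanted at all, the sketch below reads the radix and digit count of the host's floats and compares a plain left-to-right sum with Python's `math.fsum`. This is not Malcolm's procedure or the ALGOL 60 code; `math.fsum` is a different, library-provided technique, and the three-element data set is invented for illustration.

```python
import math
import sys

beta = sys.float_info.radix      # radix of the host floating-point system
t = sys.float_info.mant_dig      # number of radix-beta digits in the fraction


def naive_sum(values):
    """Left-to-right summation, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


# The small term is lost when added to the large one (illustrative data).
data = [1e16, 1.0, -1e16]
naive = naive_sum(data)          # the 1.0 is absorbed by rounding
accurate = math.fsum(data)       # correctly rounded sum of the inputs
```

On IEEE double precision the naive loop loses the middle term entirely, while the accurate sum recovers it.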