TL;DR: INTLAB is a Matlab toolbox supporting real and complex intervals, and vectors, full matrices and sparse matrices over those; it is designed to be very fast.
Abstract: INTLAB is a toolbox for Matlab supporting real and complex intervals, and vectors, full matrices and sparse matrices over those. It is designed to be very fast. In fact, it is not much slower than the fastest pure floating point algorithms using the fastest compilers available (the latter, of course, without verification of the result). Besides the basic arithmetical operations, rigorous input and output, rigorous standard functions, gradients, slopes and multiple precision arithmetic are included in INTLAB. Portability is assured by implementing all algorithms in Matlab itself with the exception of exactly one routine for switching the rounding downwards, upwards and to nearest. Timing comparisons show that the concept used achieves the anticipated speed with identical code on a variety of computers, ranging from PCs to parallel computers. INTLAB is freeware and may be copied from our home page.
TL;DR: In this paper, the authors present diagnostic measures to predict the rate of success of less optimal integer solutions, such as those obtained through rounding or sequential rounding, which can be computed directly.
Abstract: Global Positioning System ambiguity resolution is usually based on the integer least-squares principle (Teunissen 1993). Solution of the integer least-squares problem requires both the execution of a search process and an ambiguity decorrelation step to enhance the efficiency of this search. Instead of opting for the integer least-squares principle, one might also want to consider less optimal integer solutions, such as those obtained through rounding or sequential rounding. Although these solutions are less optimal, they do have one advantage over the integer least-squares solution: they do not require a search and can therefore be computed directly. However, in order to be confident that these less optimal solutions are still good enough for the application at hand, one requires diagnostic measures to predict their rate of success. These measures of confidence are presented and it is shown how they can be computed and evaluated.
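As a rough illustration of what such a diagnostic measure can look like, the sketch below evaluates two closed-form success rates. It assumes the real-valued ambiguity estimates are unbiased and normally distributed with covariance matrix `Q`. The formulas are the standard expressions from the ambiguity-resolution literature, not quoted from this abstract: for a scalar, P = 2Φ(1/(2σ)) − 1, and for sequential rounding, the product of the same term over the conditional standard deviations. The function names are hypothetical.

```python
import math

def conditional_std_devs(Q):
    """Diagonal of the lower Cholesky factor of Q.

    Entry i is the standard deviation of ambiguity i conditioned on
    ambiguities 0..i-1 (Q must be symmetric positive definite).
    """
    n = len(Q)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = Q[i][j] - sum(G[i][k] * G[j][k] for k in range(j))
            if i == j:
                G[i][j] = math.sqrt(s)
            else:
                G[i][j] = s / G[j][j]
    return [G[i][i] for i in range(n)]

def rounding_success_scalar(sigma):
    """P(|estimate - true integer| < 1/2) = 2*Phi(1/(2*sigma)) - 1."""
    return math.erf(1.0 / (2.0 * math.sqrt(2.0) * sigma))

def sequential_rounding_success(Q):
    """Success rate of sequential rounding in the given ambiguity order."""
    p = 1.0
    for sigma in conditional_std_devs(Q):
        p *= rounding_success_scalar(sigma)
    return p
```

For a single ambiguity with a standard deviation of half a cycle, `rounding_success_scalar(0.5)` is about 0.68, so rounding alone would fail roughly one time in three. The sequential figure depends on the order in which the ambiguities are processed.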
TL;DR: A factor 2 approximation algorithm for finding a minimum-cost subgraph having at least a specified number of edges in each cut, which first solves the linear relaxation of the generalized Steiner network problem, and then iteratively rounds off the solution.
Abstract: We present a factor 2 approximation algorithm for finding a minimum-cost subgraph having at least a specified number of edges in each cut. This class of problems includes, among others, the generalized Steiner network problem, which is also known as the survivable network design problem. Our algorithm first solves the linear relaxation of this problem, and then iteratively rounds off the solution. The key idea in rounding off is that in a basic solution of the LP relaxation, at least one edge gets included at least to the extent of half. We include this edge into our integral solution and solve the residual problem.
TL;DR: In this paper, the authors considered the uncapacitated facility location problem and obtained a (1+2/e)-approximation algorithm, where $1 + 2/e \approx 1.736$, which is a significant improvement on the previously known approximation guarantees.
Abstract: We consider the uncapacitated facility location problem. In this problem, there is a set of locations at which facilities can be built; a fixed cost $f_i$ is incurred if a facility is opened at location $i$. Furthermore, there is a set of demand locations to be serviced by the opened facilities; if the demand location $j$ is assigned to a facility at location $i$, then there is an associated service cost proportional to the distance between $i$ and $j$, $c_{ij}$. The objective is to determine which facilities to open and an assignment of demand points to the opened facilities, so as to minimize the total cost. We assume that the distance function $c$ is symmetric and satisfies the triangle inequality. For this problem we obtain a (1+2/e)-approximation algorithm, where $1+2/e \approx 1.736$, which is a significant improvement on the previously known approximation guarantees.
The algorithm works by rounding an optimal fractional solution to a linear programming relaxation. Our techniques use properties of optimal solutions to the linear program, randomized rounding, as well as a generalization of the decomposition techniques of Shmoys, Tardos, and Aardal [Proceedings of the 29th ACM Symposium on Theory of Computing, El Paso, TX, 1997, pp. 265--274].
TL;DR: A mechanically verified proof of correctness of the floating point multiplication, division, and square root instructions of the AMD-K7 microprocessor is described and a theory of bit vectors and their role in floating point representations and rounding is developed.
Abstract: We describe a mechanically verified proof of correctness of the floating point multiplication, division, and square root instructions of the AMD-K7 microprocessor. The instructions are implemented in hardware and represented here by register-transfer level specifications, the primitives of which are logical operations on bit vectors. On the other hand, the statements of correctness, derived from IEEE Standard 754, are arithmetic in nature and considerably more abstract. Therefore, we begin by developing a theory of bit vectors and their role in floating point representations and rounding. We then present the hardware model and a rigorous proof of its correctness. All of our definitions, lemmas and theorems have been formally encoded in the ACL2 logic, and every step in the proof has been mechanically checked with the ACL2 prover.
TL;DR: In this paper, a separation heuristic for mixed integer programs is presented that theoretically allows one to derive several families of "strong" valid inequalities for specific models and computationally gives results as good as or better than those obtained from several existing separation routines including flow cover and integer cover inequalities.
Abstract: A separation heuristic for mixed integer programs is presented that theoretically allows one to derive several families of "strong" valid inequalities for specific models and computationally gives results as good as or better than those obtained from several existing separation routines including flow cover and integer cover inequalities. The heuristic is based on aggregation of constraints of the original formulation and mixed integer rounding inequalities.
TL;DR: Once the integer ambiguities are successfully fixed, the carrier phase measurements will start to act as if they were high-precision pseudorange measurements, thus allowing for a baseline solution with a comparable high precision.
Abstract: High precision relative GPS positioning is based on the very precise carrier phase measurements. A prerequisite for obtaining high precision relative positioning results is that the double-differenced carrier phase ambiguities become sufficiently separable from the baseline coordinates. Different approaches are in use and have been proposed to ensure a sufficient separability between these two groups of parameters. In particular the approaches that explicitly aim at resolving the integer-values of the double-differenced ambiguities have been very successful. Once the integer ambiguities are successfully fixed, the carrier phase measurements will start to act as if they were high-precision pseudorange measurements, thus allowing for a baseline solution with a comparable high precision. The fixing of the ambiguities on integer values is however a non-trivial problem, in particular if one aims at numerical efficiency. This topic has therefore been a rich source of GPS-research over the last decade or so. Starting from rather simple but time-consuming integer rounding schemes, the methods have evolved into complex and effective algorithms.
TL;DR: An efficient floating point filter for the computation of the sign of a determinant that works for arbitrary dimensions is engineered, and it is shown how to use the interval techniques for exact linear optimization problems of low dimension as they arise in geometric computing.
Abstract: We discuss interval techniques for speeding up the exact evaluation of geometric predicates and describe an efficient implementation of interval arithmetic that is strongly influenced by the rounding modes of the widely used IEEE 754 standard. Using this approach we engineer an efficient floating point filter for the computation of the sign of a determinant that works for arbitrary dimensions. Furthermore we show how to use our interval techniques for exact linear optimization problems of low dimension as they arise in geometric computing. We validate our approach experimentally, comparing it with other static, dynamic and semi-static filters.
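The filter idea can be sketched for the smallest case, a 2×2 determinant. This is a minimal illustration and not the paper's implementation: it assumes finite inputs, uses Python 3.9+ `math.nextafter` to widen each round-to-nearest result by one floating point step instead of switching IEEE 754 rounding modes, and falls back to exact rational arithmetic only when the interval contains zero. All function names are hypothetical.

```python
import math
from fractions import Fraction

def mul_interval(x, y):
    """Enclose the product of intervals x = (lo, hi) and y = (lo, hi)."""
    products = [x[0] * y[0], x[0] * y[1], x[1] * y[0], x[1] * y[1]]
    # Each product is rounded to nearest, so stepping one float outward
    # on both sides is enough to contain the exact value.
    return (math.nextafter(min(products), -math.inf),
            math.nextafter(max(products), math.inf))

def sub_interval(x, y):
    """Enclose x - y for intervals x and y, again rounded outward."""
    return (math.nextafter(x[0] - y[1], -math.inf),
            math.nextafter(x[1] - y[0], math.inf))

def det2_sign(a, b, c, d):
    """Sign (-1, 0 or 1) of a*d - b*c for finite floats a, b, c, d."""
    ad = mul_interval((a, a), (d, d))
    bc = mul_interval((b, b), (c, c))
    lo, hi = sub_interval(ad, bc)
    if lo > 0:
        return 1          # interval lies strictly above zero: sign is certain
    if hi < 0:
        return -1         # interval lies strictly below zero: sign is certain
    # Filter failed: decide exactly with rational arithmetic (slow path).
    exact = Fraction(a) * Fraction(d) - Fraction(b) * Fraction(c)
    return (exact > 0) - (exact < 0)
```

Most inputs are decided by the cheap interval test. Only near-degenerate cases such as `det2_sign(1.0, 1.0, 1.0, 1.0)` reach the exact fallback.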
TL;DR: In this paper, a multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed, where the multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form.
Abstract: A multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed. The multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form. An effective sign for the multiplier and multiplicand operands may be calculated and used to create and select a number of partial products according to Booth's algorithm. Once the partial products have been created and selected, they may be summed and the results may be output. The results may be signed or unsigned, and may represent vector or scalar quantities. When a vector multiplication is performed, the multiplier may be configured to generate and select partial products so as to effectively isolate the multiplication process for each pair of vector components. The multiplier may also be configured to sum the products of the vector components to form the vector dot product. The final product may be output in segments so as to require fewer bus lines. The segments may be rounded by adding a rounding constant. Rounding and normalization may be performed in two paths, one assuming an overflow will occur, the other assuming no overflow will occur. The multiplier may also be configured to perform iterative calculations to evaluate constant powers of an operand. Intermediate products that are formed may be rounded and normalized in two paths and then compressed and stored for use in the next iteration. An adjustment constant may also be added to increase the frequency of exactly rounded results.
TL;DR: In this article, a multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed, where the multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form.
Abstract: A multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed. The multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form. An effective sign for the multiplier and multiplicand operands may be calculated based upon each operand's most significant bit and a control signal. The effective signs may then be used to create and select a number of partial products according to Booth's algorithm. Once the partial products have been created and selected, they may be summed and the results may be output. The results may be signed or unsigned, and may represent vector or scalar quantities. When a vector multiplication is performed, the multiplier may be configured to generate and select partial products so as to effectively isolate the multiplication process for each pair of vector components. The multiplier may also be configured to sum the products of the vector components to form the vector dot product. The final product may be output in segments so as to require fewer bus lines. The segments may be rounded by adding a rounding constant. Rounding and normalization may be performed in two paths, one assuming an overflow will occur, the other assuming no overflow will occur.
TL;DR: This paper presents a framework for the approximation of a class of integer programming problems (covering problems) through generic heuristics all based on rounding of the optimal solution of a linear programming (LP) relaxation.
TL;DR: The problems discussed include estimation inaccuracy of the algorithms (especially the fundamental problem of matrix coefficient dependence), their computational complexity, as well as inadequate development of methods for analysis of interval systems with singular matrices.
Abstract: One of the simplest ways of representation of uncertain or inexact data, as well as inexact computations with them, is based on interval arithmetic. In this approach, an uncertain (real) number is represented by an interval (a continuous bounded subset) of real numbers which presumably contains the unknown exact value of the number in question. Despite its simplicity, it conforms very well to many practical situations, like tolerance handling or managing rounding errors in numerical computations. Also, the so-called α-cut method of handling fuzzy sets membership functions is based on replacing a fuzzy set problem with a set of interval problems. The purpose of this paper is to investigate possibilities of and problems with application of interval methods in (qualitative) analysis of linear mechanical systems with parameter uncertainties, in particular truss structures and frames. The paper starts with an introduction to interval arithmetic and systems of linear interval equations, including an overview of basic methods for finding interval estimates for the set of solutions of such systems. The methods are further illustrated by several examples of practical problems, solved by our hybrid system of analysis of mechanical structures. Finally, several general problems with using interval methods for analysis of such linear systems are identified, with promising avenues for further research indicated as a result. The problems discussed include estimation inaccuracy of the algorithms (especially the fundamental problem of matrix coefficient dependence), their computational complexity, as well as inadequate development of methods for analysis of interval systems with singular matrices.
TL;DR: An efficient and reliable method for computing the unit-in-the-last-place (ulp) of a double-precision floating-point number, taking advantage of the standard binary representation for floating-point numbers defined by IEEE Std 754-1985.
Abstract: We present an efficient and reliable method for computing the unit-in-the-last-place (ulp) of a double-precision floating-point number, taking advantage of the standard binary representation for floating-point numbers defined by IEEE Std 754-1985. The ulp is necessary to perform software rounding for robust rounded-interval arithmetic (RIA) operations. Hardware rounding, using two of the standard rounding modes defined by IEEE-754, may be more efficient. RIA has been used to produce robust software systems for the solution of systems of nonlinear equations, interrogation of geometric and differential properties of curves and surfaces, curve and surface intersections, and solid modeling.
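One way to read the ulp off the binary representation is sketched below. It is an illustration under stated assumptions rather than the paper's routine: the ulp is taken to be the weight of the last significand bit, the input is an IEEE 754 binary64 value, and the function name `ulp_from_bits` is hypothetical.

```python
import math
import struct

def ulp_from_bits(x):
    """Weight of the last significand bit of the double x."""
    # Reinterpret the 8 bytes of |x| as an unsigned 64-bit integer.
    bits = struct.unpack("<Q", struct.pack("<d", abs(x)))[0]
    biased_exponent = (bits >> 52) & 0x7FF
    if biased_exponent == 0x7FF:
        raise ValueError("ulp is undefined for infinities and NaNs")
    # Zero and subnormals (biased exponent 0) share the spacing of the
    # smallest normal binade, so they are treated as biased exponent 1.
    effective = max(biased_exponent, 1)
    # Remove the bias (1023) and the 52 fraction bits.
    return math.ldexp(1.0, effective - 1023 - 52)
```

On Python 3.9 or later the result can be cross-checked against the standard library's `math.ulp` for any finite input.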
TL;DR: Complete fixed-point error models that include the coefficient quantization are derived for two popular $8 \times 8$ two-dimensional IDCT architectures; one is based on distributed arithmetic, and the other is the multiplier-adder chain.
Abstract: Complete fixed-point error models that include the coefficient quantization are derived for two popular $8 \times 8$ two-dimensional (2-D) IDCT architectures; one is based on distributed arithmetic, and the other is the multiplier-adder chain. The error models are evaluated in the integer domain to accurately measure the effects of rounding. The analysis results show that the overall mean-square error performance (OMSE) is the most critical condition for meeting the IEEE specification (IEEE Std. 1180-1990) when the rounding scheme is employed. On the other hand, the mean error effects (OME and PME) are dominant for truncation. Finally, the analysis results are compared with those of bit-accurate simulation.
TL;DR: The design of a multiplier that performs either interval or floating point multiplication is presented, which requires only slightly more area and delay than a conventional floating point multiplier, and is one to two orders of magnitude faster than software implementations of interval multiplication.
Abstract: Interval arithmetic provides an efficient method for monitoring and controlling errors in numerical calculations. However, existing software packages for interval arithmetic are often too slow for numerically intensive computations. This paper presents the design of a multiplier that performs either interval or floating point multiplication. This multiplier requires only slightly more area and delay than a conventional floating point multiplier, and is one to two orders of magnitude faster than software implementations of interval multiplication.
TL;DR: The results show that large shift scheduling problems can be solved optimally with the proposed branch and cut algorithm and the implicit formulation.
Abstract: This paper presents a new branch and cut algorithm for optimal shift scheduling with multiple breaks and break windows. The proposed algorithm is based on an implicit formulation of the problem requiring a significantly smaller number of variables than the set covering formulation of Dantzig. The new algorithm, adding cuts, developing upper bounds for the variables, and employing an efficient rounding heuristic, was tested successfully with 90 test problems involving between 2160 and 32 928 shift variations and five demand profiles. Our results show that large shift scheduling problems can be solved optimally with the proposed branch and cut algorithm and the implicit formulation.
TL;DR: In this paper, a data processing approach for performing a multiply-accumulate operation A+(B*C) in response to a single instruction identifying the multiply-accumulate operation is presented.
Abstract: A data processing apparatus and method is provided for performing a multiply-accumulate operation A+(B*C) in response to a single instruction identifying said multiply-accumulate operation. The data processing operation comprises a multiplier for multiplying values B and C to generate an unrounded multiplication result, the multiplier further being arranged to generate first data required for rounding determination, and an adder for adding the unrounded multiplication result to a value A to generate an unrounded multiply-accumulate result, the adder further being arranged to generate second data required for rounding determination. Determination logic is then provided for using the first and second data to determine one or more rounding values required to produce a final multiply-accumulate result equivalent to the execution of a separate multiply instruction incorporating rounding, followed by a separate add instruction incorporating rounding. Rounding logic is then arranged to apply the one or more rounding values to generate the final multiply-accumulate result. By this approach, dedicated multiply-accumulate logic can be provided to enable fast execution of a multiply-accumulate instruction, whilst producing a result which is compliant with the IEEE 754-1985 standard.
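The difference between the two rounding behaviours can be seen numerically. In the sketch below, `mac_two_roundings` mimics a separate multiply followed by a separate add, each rounded to double precision, which is the result the abstract says the apparatus reproduces. `mac_single_rounding` rounds the exact value of A+(B*C) once, as a fused operation would. Both function names and the operand values are illustrative assumptions.

```python
from fractions import Fraction

def mac_two_roundings(a, b, c):
    """a + (b * c) with the product rounded first, then the sum."""
    product = b * c          # first rounding
    return a + product       # second rounding

def mac_single_rounding(a, b, c):
    """Exact a + b*c, rounded to the nearest double only once."""
    exact = Fraction(a) + Fraction(b) * Fraction(c)
    return float(exact)      # Fraction -> float is correctly rounded

# Operands chosen so that the exact product 1 - 2**-60 rounds to 1.0.
a = -1.0
b = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30

separate = mac_two_roundings(a, b, c)    # the low-order term is lost
fused = mac_single_rounding(a, b, c)     # the low-order term survives
```

With these operands the separately rounded result is exactly `0.0`, while the singly rounded result is `-2**-60`. Software that depends on one behaviour therefore cannot silently be given the other.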
TL;DR: A very-high radix digit-recurrence algorithm for the operation $\sqrt{x/d}$ is developed, with residual scaling and digit selection by rounding, which shows a similar cycle time and an increase of one cycle for the extended operation with respect to square-root.
Abstract: A very-high radix digit-recurrence algorithm for the operation $\sqrt{x/d}$ is developed, with residual scaling and digit selection by rounding. This is an extension of the division and square-root algorithms presented previously, and for which a combined unit was shown to provide a fast execution of these operations. The architecture of a combined unit to execute division, square-root, and $\sqrt{x/d}$ is described, with inverse square-root as a special case. A comparison with the corresponding combined division and square-root unit shows a similar cycle time and an increase of one cycle for the extended operation with respect to square-root. To obtain an exactly rounded result for the extended operation a datapath of about 2n bits is needed. An alternative is proposed which requires approximately the same width as for square-root, but produces a result with an error of less than one ulp. The area increase with respect to the division and square root unit should be no greater than 15 percent. Consequently, whenever a very high radix unit for division and square-root seems suitable, it might be profitable to implement the extended unit instead.
TL;DR: A new method of implementing efficient FIR filters is presented that involves approximation of an equiripple FIR by a rounding operation and application of the derived impulse response by a simple recursive equation.
Abstract: A new method of implementing efficient FIR filters is presented. It involves approximation of an equiripple FIR by a rounding operation and application of the derived impulse response by a simple recursive equation. The technique is extremely efficient for lowpass, highpass, bandpass, and bandstop filters with sharp transitions and low edge frequencies.
TL;DR: An algorithm for IEEE floating-point addition that accepts normalized numbers, supports all four IEEE rounding modes, and outputs the correctly rounded sum/difference in the format required by the IEEE Standard.
Abstract: We present an algorithm for IEEE floating-point addition. The latency of the addition algorithm for double precision is roughly 24 logic levels, not including delays of latches between pipeline stages. The algorithm accepts normalized numbers, supports all four IEEE rounding modes, and outputs the correctly rounded sum/difference in the format required by the IEEE Standard. The presentation of the algorithm is technology independent and can serve as basis for evaluation and comparison with other floating-point addition algorithms.
TL;DR: In this paper, the authors consider problems which stem from production planning processes in the chemical industry and investigate heuristic approaches which consist of different rounding strategies based on an optimal solution of the corresponding linear relaxation of the integer program.
Abstract: In this paper we consider problems which stem from production planning processes in the chemical industry. Many of these problems may be formulated as mixed integer linear programs. Since it is difficult to obtain an optimal solution of this model in a reasonable amount of time, the design of fast and efficient heuristics is very important for practical purposes. In this paper we investigate heuristic approaches which consist of different rounding strategies based on an optimal solution of the corresponding linear relaxation of the integer program. Computational experiences on practical data are reported.
TL;DR: A new transform, called the rounding transform (RT), is introduced, which maps an integer vector onto another integer vector by using weighted average and difference filters followed by a rounding operation.
Abstract: A new transform, called the rounding transform (RT), is introduced in this paper. This transform maps an integer vector onto another integer vector by using weighted average and difference filters followed by a rounding operation. The RT can be applied to lossless pyramid structured coding with various elementary block sizes and filters. In addition, it generalizes other mean based lossless pyramid structured coding schemes.
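To make the idea of an integer-to-integer average/difference transform concrete, here is a minimal two-sample sketch. It uses the familiar rounded-average and difference pair (often called the S-transform) purely as an illustration. It is not claimed to be the RT defined in the paper, and the function names are hypothetical.

```python
def forward(a, b):
    """Map two integers to a rounded average and a difference."""
    low = (a + b) // 2       # average, rounded down to an integer
    high = a - b             # difference, already an integer
    return low, high

def inverse(low, high):
    """Recover the original integers exactly from (low, high)."""
    # a + b and a - b have the same parity, so the bit dropped by the
    # floor in forward() can be restored from the parity of high.
    a = low + (high + 1) // 2
    b = a - high
    return a, b
```

For example, `forward(5, 2)` gives `(3, 3)` and `inverse(3, 3)` returns `(5, 2)`. The rounding loses no information, which is what makes this kind of transform usable for lossless coding.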
TL;DR: A new rounding transform is proposed called the overlapping rounding transform (ORT), defined as a two-port input/two-port output FIR filtering system with a pair of rounding operations that is applied to develop lossless subband coding systems.
Abstract: We propose a new rounding transform called the overlapping rounding transform (ORT). It is defined as a two-port input/two-port output FIR filtering system with a pair of rounding operations. The ORT is applied to develop lossless subband coding systems. The ORT approach has both a simpler representation and more possibilities for lossless subband implementation than the lifting scheme.
TL;DR: In this paper, a floating point unit includes a rounding unit that rounds the least significant bits of a sum and produces a carry in bit that selects one of the unincremented sum and the incremented sum.
Abstract: A floating point unit includes a rounding unit that rounds the two least significant bits of a sum. After a sum of the two mantissas is generated the at least one least significant bit is separated from the sum. When addition is performed, two least significant bits are separated from the sum. A half add unit may be used to generate the sum along with a set of carry data, and thus at least one least significant bit of the carry data is also separated. A rounding unit receives the separated at least one least significant bit of the sum and carry data and produces a carry in bit as well as rounded at least one least significant bit. The sum and carry data are then summed in a later stage of the floating point unit to form both an unincremented sum and an incremented sum, which are stored in a multiplexer. The carry in bit is used to select one of the unincremented sum and incremented sum. The rounded at least one least significant bit produced by the rounding unit is then united with the selected one of the unincremented sum and incremented sum.
TL;DR: In this paper, the authors proposed a code-amount control device for video-coding devices, which can control the generated code-length of each block (block length) after variable-length coding.
Abstract: The invention is directed to provide a code-amount control device for use in video-coding device, which controls the generated code-amount of each block (block length) after variable-length coding in the video-coding process to surely be less than the designated target code-amount. The code amount control device is provided with a rate control circuit for setting a target code-amount of each of blocks to be encoded, a rounding circuit for converting a sequence of quantized values into a set of the number of continuous zeroes and following thereto non-zero quantized values and a code-length table containing variable code-length values to be allocated to respective sets of the number of continuous zeroes and non-zero values. Before variable-length encoding of the image block, the rounding circuit estimates by accumulating the code length of each block with reference to the code-length table and outputs the quantized values as they are to a variable-length coding circuit when the accumulated code-length does not exceed the target code-amount. If the estimated code-length exceeds the target code-amount, the rounding circuit outputs the quantized values after rounding off the excessive portion thereof.
TL;DR: In this article, a floating point arithmetic logic unit includes two rounding units that select between an incremented, unincremented, and complemented result from a carry propagate adder, based on the equality or inequality of the exponents of the operands, the relative sizes of the mantissas and the presence of a guard bit.
Abstract: A floating point arithmetic logic unit includes two rounding units that select between an incremented, unincremented, and complemented result from a carry propagate adder. A fast rounding unit selects a result as an approximation based on the equality or inequality of the exponents of the operands, the relative sizes of the mantissas and the presence of a guard bit. The result selected by the fast rounding unit is received by a leading zero count unit, which counts the leading zeros of the result. A second slower rounding unit meanwhile makes a selection between the incremented, unincremented, and complemented results based on the rounding mode, the sign of the result and whether the result is exact. The result is inexact when both the most significant bit and the guard bit are equal to one. While the slower rounding unit may take longer to determine the appropriate selection, the result selected is the most accurate. Based on the number of leading zeros determined from the result selected by the fast rounding unit, the bits in the result selected by the slower rounding unit are left shifted thereby normalizing the result.
TL;DR: In this paper, a Tabu Search (TS) metaheuristic was applied to the design of digital filters, with coefficient values expressed as sums of powers of two, and the results of the proposed technique are compared to those obtained by simply rounding the optimal values of coefficients obtained by means of the Remez algorithm.
Abstract: This paper deals with the application of a Tabu Search (TS) metaheuristic to the design of digital filters, with coefficient values expressed as sums of powers of two. The performances of the algorithm are heavily affected by the choice of its parameters, which in turn are related to different implemented strategies. The tuning of these parameters has been performed after many tests. The results of the proposed technique are compared to those obtained by simply rounding the optimal values of coefficients obtained by means of the Remez algorithm, and to those obtained using a simulated annealing algorithm.
TL;DR: In this paper, a method for simulating the bias effect associated with MPEG-compliant rounding of pixel values including a fractional component of 0.5 to the next highest integer value is described.
Abstract: Methods and apparatus for simulating, in reduced resolution video decoders, the biasing effect associated with MPEG's specified rounding of pixel values including a fractional component of 0.5 to the next highest integer value are described. In one embodiment, the biasing effect is simulated by generating luminance and chrominance DC DCT coefficient bias values from, e.g., motion vector offset data. The DC DCT bias values are then added to the DC DCT coefficients of the luminance and chrominance blocks, respectively, which correspond to the same image block to which the motion vector data used to generate the bias values corresponds. In another embodiment, pixel values are directly adjusted to simulate the biasing effect associated with MPEG compliant rounding. In such an embodiment, luminance and chrominance pixel biasing values are generated as a function of, e.g., motion vector offset information. The bias values are added to the pixel values generated through the use of motion compensated prediction. The pixel values are then rounded by performing a non-biased rounding operation to generate integer pixel values.
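The bias in question can be made visible with a small experiment. The sketch below assumes the simplest relevant case, a two-sample average of 8-bit pixel values as in half-pel prediction, and compares rounding halves upward with an unbiased round-half-to-even rule. It only illustrates the effect being simulated, not the patented method, and the function names are hypothetical.

```python
def round_half_up(twice_value):
    """Round twice_value / 2 to an integer, sending .5 upward."""
    return (twice_value + 1) >> 1

def round_half_even(twice_value):
    """Round twice_value / 2 to an integer, sending .5 to the even neighbour."""
    q, r = divmod(twice_value, 2)
    return q if r == 0 else q + (q & 1)

def mean_bias(rounder, max_pixel=255):
    """Average of (rounded - exact) over all pixel pairs (a, b)."""
    total_twice_error = 0    # accumulate 2 * error to stay in integers
    count = 0
    for a in range(max_pixel + 1):
        for b in range(max_pixel + 1):
            s = a + b                      # exact average is s / 2
            total_twice_error += 2 * rounder(s) - s
            count += 1
    return total_twice_error / (2 * count)
```

Over all pairs of 8-bit values, `mean_bias(round_half_up)` is 0.25, whereas `mean_bias(round_half_even)` is 0. A decoder that replaces the specified rounding with an unbiased one would therefore need a compensating offset of the kind the abstract describes.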
TL;DR: In this paper, a floating point unit is described in which an adder 16 receives a first input and a second input to generate a sum, which is subject to subsequent normalization by a normalizer 60 and rounding by an incrementer 64.
Abstract: A floating point unit is described that performs addition operations. An adder 16 within the floating point unit receives a first input and a second input to generate a sum. This sum is subject to subsequent normalization by a normalizer 60 and rounding by an incrementer 64. If an operation is performed that is immediately followed by an addition operation using the result of the preceding operation, then the normalized but unrounded sum is fed back to the adder 16 together with an indication of its rounding requirement. This rounding requirement can be performed by the adder 16 in parallel with the execution of the following addition by using the carry-in bit of the adder 16 to apply any increment required to rounding of the preceding result.
TL;DR: In this article, a mixed-integer linear programming (MILP) approach is presented for the optimal control of reactive power in a power system, in comparison to the gradient projection approach.
Abstract: A mixed-integer programming approach is presented for the optimal control of reactive power in a power system, in comparison to the gradient projection approach. Although the problem is originally a nonlinear optimization problem, it can be formulated as a mixed-integer linear programming (MILP) problem without deterioration of solution accuracy to a certain extent. In order to obtain the linearized version of the problem, a new Jacobian matrix decomposition method is adopted in a unified approach of real and reactive power mismatch. The MILP code is developed by the branch and bound process search for the optimal solution. The variables for modeling transformer tap positions are handled as discrete integer ones, and other variables continuous ones. Numerical data resulting from a case study using a modified IEEE 30 bus system with contingency (line outages) show that the MILP can produce more reductions of magnitude in the operating cost, compared to the results obtained from the gradient projection programming approach based on the scheme of rounding intelligently and of weighting factor for fast convergence. The convergence characteristics of the results obtained from two optimization methods in searching for the optimal solution are also presented and discussed.