TL;DR: It is shown that a preemptive one-machine relaxation is a powerful tool for designing parallel machine scheduling algorithms that simultaneously produce good approximations and have small running times, and a general theorem relating the value of one-machine relaxations to that of the schedules obtained for the original m-machine problems is proved.
Abstract: We consider the problem of nonpreemptive scheduling to minimize average (weighted) completion time, allowing for release dates, parallel machines, and precedence constraints. Recent work has led to constant-factor approximations for this problem, based on solving a preemptive or linear programming relaxation and then using the solution to get an ordering on the jobs. We introduce several new techniques which generalize this basic paradigm. We use these ideas to obtain improved approximation algorithms for one-machine scheduling to minimize average completion time with release dates. In the process, we obtain an optimal randomized on-line algorithm for the same problem that beats a lower bound for deterministic on-line algorithms. We consider extensions to the case of parallel machine scheduling, and for this we introduce two new ideas: first, we show that a preemptive one-machine relaxation is a powerful tool for designing parallel machine scheduling algorithms that simultaneously produce good approximations and have small running times; second, we show that a non-greedy "rounding" of the relaxation yields better approximations than a greedy one. We also prove a general theorem relating the value of one-machine relaxations to that of the schedules obtained for the original m-machine problems. This theorem applies even when there are precedence constraints on the jobs. We apply this result to precedence graphs such as in-trees, out-trees, and series-parallel graphs; these are of particular interest in compiler applications that partly motivated our work.
TL;DR: The system performance impact of floating-point division latency for varying instruction issue rates is presented and the performance implications of shared multiplication hardware, shared square root, on-the-fly rounding and conversion, and fused functional units are examined.
Abstract: Floating-point division is generally regarded as a low frequency, high latency operation in typical floating-point applications. However, in the worst case, a high latency hardware floating-point divider can contribute an additional 0.50 CPI to a system executing SPECfp92 applications. This paper presents the system performance impact of floating-point division latency for varying instruction issue rates. It also examines the performance implications of shared multiplication hardware, shared square root, on-the-fly rounding and conversion, and fused functional units. Using a system level study as a basis, it is shown how typical floating-point applications can guide the designer in making implementation decisions and trade-offs.
TL;DR: In this article, a floating point arithmetic unit for performing independent multiply and add operations in the execution of a multiply-add instruction AC+B on three operands A, B, and C of p-bit precision is presented.
Abstract: A floating point arithmetic unit for performing independent multiply and add operations in the execution of a multiply-add instruction AC+B on three operands A, B, and C of p-bit precision includes a multiplier unit, a sticky collect unit, an adder unit, and a rounding unit. In addition, a risk condition detection unit provides detection of a risk condition corresponding to an occurrence of an imprecise resultant quantity prior to being rounded by the rounding unit. Upon detection of a risk condition, a trap is triggered and an extended sequence implementation unit carries out an extended multiply-add sequence and provides a multiply-add output having infinite precision prior to a final rounding. A floating point arithmetic method for performing independent multiply and add operations in the execution of a multiply-add instruction AC+B on three operands A, B, and C of p-bit precision is disclosed also.
TL;DR: A lattice rounding technique using a natural matrix norm is analyzed and its application to proving in a non-uniform model the hardness of computing 2 log log p bits of the secret keys of Diffie-Hellman and related protocols from the public keys is presented.
Abstract: We analyze a lattice rounding technique using a natural matrix norm. We present its application to proving in a non-uniform model the hardness of computing 2 log log p bits of the secret keys of Diffie-Hellman and related protocols from the public keys. Earlier it was shown that √(log p) bits are hard to compute. Efforts to extend our results to the uniform model raise a natural approximation problem for a form of weighted bin packing. If this approximation problem is easy on random inputs, then MSBs of Diffie-Hellman secrets are as hard to compute as the whole in the uniform model.
TL;DR: A validation strategy for codes ported to new computing environments is offered and illustrated based on a careful examination of the growth of machine rounding-sized initial errors in the NCAR Community Climate Model.
Abstract: The growth of machine rounding-sized initial errors in the NCAR Community Climate Model (CCM2) is studied. Growth is faster than predictability error growth that results from turbulent flow. Rounding accumulation dominates the growth when temperature differences are below the order of 10⁻³ K. Discontinuous code branches are not a source of growth when the differences are less than 10⁻³ K. Arguments are presented to explain this observation. The fast growth of differences is caused by the physical parameterizations, as they respond to the evolving states produced by the dynamical flow. Based on a careful examination of this growth of small differences, a validation strategy for codes ported to new computing environments is offered and illustrated.
TL;DR: This paper investigates the one-dimensional cutting stock problem with respect to the modified integer round-up property (MIRUP) and presents some results on subproblems having the MIRUP.
TL;DR: In this paper, a frame memory outputs at least one reference picture to a prediction picture calculation unit according to motion vector information with half-pel accuracy. From the reference picture(s), the calculation unit produces the prediction picture by performing interpolation operations if the motion vector has a half-pel component, and each interpolated pel value is rounded off to an integer according to a particular rounding algorithm specified by a calculation controller.
Abstract: A method and apparatus for interframe predictive video coding and decoding which avoid the accumulation of rounding errors and thus obtain high-quality reproduced pictures. A frame memory outputs at least one reference picture to a prediction picture calculation unit, according to motion vector information with half-pel accuracy. From the reference picture(s), the prediction picture calculation unit produces a prediction picture by performing interpolation operations if the motion vector has a half-pel component. Here, each interpolated pel value is rounded off to an integer according to a particular rounding algorithm specified by a calculation controller. More specifically, the prediction picture calculation unit supports a first algorithm that rounds off the pel values toward positive infinity and a second algorithm that rounds off them toward negative infinity. The calculation controller outputs a rounding control signal that directs the prediction picture calculation unit to select either of those two algorithms with substantially equal probabilities of occurrence.
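The effect of mixing the two rounding directions can be illustrated with a minimal Python sketch. The function names and the strict alternation by index are assumptions made for illustration; the abstract only states that the two algorithms are selected with substantially equal probabilities.

```python
def interp_half_pel(a: int, b: int, round_up: bool) -> int:
    """Average two integer pel values; a tie (x.5) is rounded in the given direction."""
    total = a + b
    # (a + b) / 2 is either an integer or ends in .5, so the two
    # directions differ only when the average is a tie.
    return (total + 1) // 2 if round_up else total // 2


def accumulated_bias(pairs, always_up: bool) -> float:
    """Sum of (rounded - exact) over all pairs.

    With always_up=False the direction alternates by index, which is a
    hypothetical stand-in for the controller's equal-probability selection.
    """
    bias = 0.0
    for i, (a, b) in enumerate(pairs):
        up = True if always_up else (i % 2 == 0)
        bias += interp_half_pel(a, b, up) - (a + b) / 2
    return bias


pairs = [(10, 11)] * 100  # every exact average is the tie 10.5
print(accumulated_bias(pairs, always_up=True))   # one fixed direction: bias grows
print(accumulated_bias(pairs, always_up=False))  # mixed directions: bias cancels
```

With a single fixed direction every tie contributes the same signed error, so the error accumulates; mixing the directions lets the contributions cancel.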
TL;DR: When the instruction is not the first predetermined instruction and the rounding mode specified by the instruction is not a dynamic override, the circuit executes the instruction using a rounding mode (110) specified by the instruction.
Abstract: A processor (100) contains a storage area (120) for a dynamic rounding mode control value, and a circuit (122, 124, 126) coupled to the storage area configured to execute an instruction using a rounding mode. When the instruction is a first predetermined instruction, a first predetermined rounding mode is used during execution of the instruction. When the instruction is not the first predetermined instruction and the rounding mode specified by the instruction is not a dynamic override, the circuit executes the instruction using a rounding mode (110) specified by the instruction. When the instruction is not the first predetermined instruction and the rounding mode specified by the instruction is the dynamic override, the circuit executes the instruction using a rounding mode specified by the dynamic rounding mode control value (115).
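The three cases in this abstract amount to a small selection rule, sketched below in Python. The mode names, the `Mode.DYNAMIC` marker for the override, and the default choice of `Mode.ZERO` as the "first predetermined rounding mode" are all assumptions for illustration and are not taken from the patent.

```python
from enum import Enum


class Mode(Enum):
    NEAREST = "nearest"
    ZERO = "zero"
    UP = "up"
    DOWN = "down"
    DYNAMIC = "dynamic"  # hypothetical encoding of the "dynamic override"


def effective_mode(is_first_predetermined: bool,
                   instr_mode: Mode,
                   dynamic_control: Mode,
                   first_mode: Mode = Mode.ZERO) -> Mode:
    """Pick the rounding mode used to execute one instruction."""
    if is_first_predetermined:
        # Case 1: the predetermined instruction always uses its fixed mode.
        return first_mode
    if instr_mode is Mode.DYNAMIC:
        # Case 3: defer to the dynamic rounding mode control value.
        return dynamic_control
    # Case 2: use the mode specified by the instruction itself.
    return instr_mode


print(effective_mode(False, Mode.DYNAMIC, Mode.UP))
```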
TL;DR: *PHDDs are applied to verify integer multipliers and floating point multipliers before the rounding stage, based on a hierarchical verification approach, to provide a compact representation for functions that map Boolean vectors into integer or floating point values.
Abstract: Data structures such as *BMDs, HDDs, and K*BMDs provide compact representations for functions which map Boolean vectors into integer values, but not floating point values. In this paper, we propose a new data structure, called Multiplicative Power Hybrid Decision Diagrams (*PHDDs), to provide a compact representation for functions that map Boolean vectors into integer or floating point values. The size of the graph to represent the IEEE floating point encoding is linear with the word size. The complexity of floating point multiplication grows linearly with the word size. The complexity of floating point addition grows exponentially with the size of the exponent part, but linearly with the size of the mantissa part. We applied *PHDDs to verify integer multipliers and floating point multipliers before the rounding stage, based on a hierarchical verification approach. For integer multipliers, our results are at least 6 times faster than *BMDs. Previous attempts at verifying floating point multipliers required manual intervention. We verified floating point multipliers before the rounding stage automatically.
TL;DR: In this article, two instructions are used to implement a procedure for rounding operands of finite but arbitrary precision in a data processor (10) with fixed width registers and a fixed width ALU.
Abstract: A method and apparatus for performing rounding in a data processor (10). In one embodiment, two instructions are used to implement a procedure for rounding operands of finite but arbitrary precision. A first instruction "rndp" performs a preliminary rounding operation by analyzing bits which are less significant than the rounding point and updating the values of the round (RND) and sticky (STK) status bits (70, 71) in a defined manner. A second instruction "rnd" performs the rounding of the most significant portion of the operand, using the contents of the RND and STK status bits (70, 71) as determined by the one or more iterations of the "rndp" instruction. By appropriate use of these two instructions, and additional add-with-carry operations, an operand of any length may be rounded at an arbitrary point using a data processor (10) with fixed width registers (40) and a fixed width ALU (30).
TL;DR: In this article, the selective dynamic rounding (SDR) algorithm is used for the optimal sizing design of truss structures subject to linear buckling constraints, and a continuous design based on the regression analysis of section effectiveness vs. area is used as a starting point for the dual step discrete optimization phase.
Abstract: The selective dynamic rounding (SDR) algorithm previously developed by the authors, and based on a dual step rounding approach, is used for the optimal sizing design of truss structures subject to linear buckling constraints. The algorithm begins with a continuous optimum followed by a progressive freezing of individual variables while solving the remaining continuous problems. The allowable member stresses are predicted by the linear regression of the tabular section properties, while the exact allowable compressive stresses are back-substituted for those variables fixed on discrete values in each intermediate mixed-discrete nonlinear problem. It is shown that a continuous design based on the regression analysis of section effectiveness vs. area is effective as a starting point for the dual step discrete optimization phase. A range of examples is used to illustrate that with “conservative” regression, discrete designs can be achieved which are significantly lighter than those in which the variables have been rounded up.
TL;DR: In this article, a semi-empirical model correlating rounding with re-oxidation and nitride mask thickness is derived from mechanical studies, and the electrical properties of devices with HTR-STI for the 0.18 μm technology are analyzed.
Abstract: In this work, we first establish the relationship between corner leakage and corner rounding through device simulation. Then, we demonstrate a novel method to produce corner rounding, using a post-CMP, high temperature re-oxidation process (HTR-STI). A semi-empirical model correlating rounding with re-oxidation and nitride mask thickness is derived from mechanical studies. Finally, we show the electrical properties of devices with HTR-STI for the 0.18 μm technology.
TL;DR: In this article, a method and apparatus for efficiently generating multiple integer quotients of integer numerators divided by a common integer denominator are implemented by multiplying a floating point approximation of the reciprocal of the integer denominator by floating point representations of the numerators, and then rounding up or down as required.
Abstract: A method and apparatus for efficiently generating multiple integer quotients of integer numerators divided by a common integer denominator are implemented by multiplying a floating point approximation of the reciprocal of the integer denominator by floating point representations of the numerators, biasing the floating point quotients before rounding up (158), or down (156) as required. First, an initial approximation of the reciprocal of the denominator is computed (102) by squaring (114) a limited precision square root of the reciprocal (112). A final reciprocal is computed using a finite power series (104). Finally, modified numerators are formed (106) by biasing the original numerators, products of the modified numerators multiplied times the reciprocal are computed, and the products are rounded up or down as required (108).
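The general idea of biased reciprocal multiplication can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the reciprocal comes from an ordinary double-precision division rather than from a square-root seed and power series, the bias of 0.5 is one possible choice, only the round-down case is shown, and the function name `batch_floor_divide` and the 2**31 input limit are hypothetical.

```python
def batch_floor_divide(numerators, denominator):
    """Floor-divide many nonnegative integers by one positive integer
    using a single floating point reciprocal.

    Assumes 0 <= n < 2**31 and 0 < denominator < 2**31, so every
    intermediate value is well inside double-precision range.
    """
    limit = 2 ** 31
    if not 0 < denominator < limit:
        raise ValueError("denominator out of supported range")
    recip = 1.0 / denominator  # computed once, reused for every numerator
    quotients = []
    for n in numerators:
        if not 0 <= n < limit:
            raise ValueError("numerator out of supported range")
        # Biasing by 0.5 keeps the exact value (n + 0.5) / d at least
        # 0.5 / d away from any integer, which is far more than the
        # rounding error of the product for inputs this small.
        quotients.append(int((n + 0.5) * recip))
    return quotients


print(batch_floor_divide([0, 6, 7, 13, 14, 1000], 7))
```

The bias is what makes truncation safe: without it, an exact quotient such as 14/7 could come out slightly below 2 after the inexact reciprocal multiplication and be truncated to 1.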
TL;DR: In this paper, a processor is presented which executes, at high speed, positive conversion processing, which converts coded data into uncoded data, and saturation calculation processing, which rounds a value to an appropriate number of bits; on decoding the instruction, the sum-product result register 6 outputs its held value to the path P1.
Abstract: A processor which executes positive conversion processing, which converts coded data into uncoded data, and saturation calculation processing, which rounds a value to an appropriate number of bits, at high speed. When a positive conversion saturation calculation instruction “MCSST D1” is decoded, the sum-product result register 6 outputs its held value to the path P1. The comparator 22 compares the magnitude of the held value of the sum-product result register 6 with the coded 32-bit integer “0x0000_00FF”. The polarity judging unit 23 judges whether the eighth bit of the value held by the sum-product result register 6 is “ON”. The multiplexer 24 outputs one of the maximum value “0x0000_00FF” generated by the constant generator 21, the zero value “0x0000_0000” generated by the zero generator 25, and the held value of the sum-product result register 6 to the data bus 18.
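The three possible multiplexer outputs (the maximum value, the zero value, or the held value) suggest a clamp to the range 0x00 to 0xFF. The Python sketch below shows that behaviour under an assumption the abstract does not spell out: negative inputs select zero and inputs above 0xFF select the maximum. The function name `saturate_to_u8` is hypothetical.

```python
def saturate_to_u8(value: int) -> int:
    """Clamp a signed integer to the unsigned 8-bit range [0x00, 0xFF]."""
    if value < 0:
        return 0x00      # assumed: negative values saturate to zero
    if value > 0xFF:
        return 0xFF      # assumed: large values saturate to the maximum
    return value         # in range: pass the held value through unchanged


print([saturate_to_u8(v) for v in (-5, 0, 128, 255, 300)])
```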
TL;DR: This paper designs a provably correct, modular, IEEE-compliant floating point unit that deals with the detection of exceptions and trapped overflow and underflow exceptions as an integral part of the rounding unit.
Abstract: Engineering design methodology recommends designing a system as follows: start with an unambiguous specification, partition the system into blocks, specify the functionality of each block, design each block separately, and glue the blocks together. Verifying the correctness of an implementation then reduces to a local verification procedure. We apply this methodology for designing a provably correct, modular, IEEE-compliant floating point unit. First, we provide a mathematical, and hopefully unambiguous, definition of IEEE Standard 754 (1985) which specifies the functionality. The design consists of an adder, a multiplier and a rounding unit, each of which is further partitioned. Our floating point unit design deals with the detection of exceptions and trapped overflow and underflow exceptions as an integral part of the rounding unit. Our abstraction level avoids bit-level arguments while still enabling the addressing of crucial implementation issues such as delay and cost.
TL;DR: The proposed scheme offers a 10× reduction in power consumption in comparison to that of conventional high speed floating point adders that use leading zero anticipation logic, for IEEE single precision floating point data format.
Abstract: We present a new architecture of a low power floating point adder, that is fast and has low latency. The functional partitioning of the adder into three distinct, controlled data paths allows activity reduction. During any given operation cycle, only one of the data paths is active, during which time, the logic assertion status of the circuit nodes of the other data paths are held at their previous states. Critical path delay and latency are reduced by incorporating speculative rounding and pseudo leading zero anticipation logic as well as data path simplifications. The proposed scheme offers a 10× reduction in power consumption in comparison to that of conventional high speed floating point adders that use leading zero anticipation logic, for IEEE single precision floating point data format. The reduction in power delay product is about 16×. The corresponding figures for double precision units are around 40× and 66× respectively.
TL;DR: Applying symmetry properties, the existence of double eigenvalues is proved and strict error bounds are computed for the 100 lowest eigenvalues of the clamped plate problem in the unit square.
Abstract: We describe a method for the calculation of the N lowest eigenvalues of fourth-order problems in H_0^2(Ω). In order to obtain small error bounds, we compute the defects in H^{-2}(Ω) and, to obtain a bound for the rest of the spectrum, we use a boundary homotopy method. As an example, we compute strict error bounds (using interval arithmetic to control rounding errors) for the 100 lowest eigenvalues of the clamped plate problem in the unit square. Applying symmetry properties, we prove the existence of double eigenvalues.
TL;DR: In this article, the authors developed a method for converting the continuous solution produced by the Arsenal Exchange Model into a feasible integer solution, using the LP relaxation of the integer goal program.
Abstract: The Air Force Studies and Analyses Agency sought a method for converting the continuous solution produced by the Arsenal Exchange Model into a feasible integer solution. The current integerization method leaves weapons unused and targets uncovered and could be improved. In the method developed, the noninteger valued variables in the continuous solution are rounded down to the nearest integer to produce a truncated solution. An integer goal program is then used to reallocate the weapons and targets made available in the rounding process. The truncated solution is then combined with the solution of the integer goal program to produce a feasible integer solution for the original problem. The revised solution method, using the LP relaxation of the integer goal program, was implemented in AEM. The implementation was used to solve four test cases. In all four cases, the revised solution method produced solutions that were closer to the continuous solution in terms of damage expectancy, target coverage, and goal performance than the solutions produced by the current method.
TL;DR: By performing all floating point operations in the same rounding mode, these algorithms can exploit software pipelining to provide better performance than simply evaluating the corresponding point elementary function at each endpoint of the argument interval and rounding.
Abstract: We present table driven algorithms for computing interval bounds on several common elementary functions. Our algorithms use directed rounding to obtain sharp bounds, within 1.5 units in the last place of the exact range of the function over the argument interval, without the explicit use of extended precision. Moreover, by performing all floating point operations in the same rounding mode, our algorithms can exploit software pipelining to provide better performance than simply evaluating the corresponding point elementary function at each endpoint of the argument interval and rounding.
TL;DR: In this article, a method and system for processing instructions in a floating point unit for executing denormalized numbers in floating point pipeline via serializing uses an instruction unit and having a control unit and a pipelined data flow unit, a shifter and a rounding unit.
Abstract: A method and system for processing instructions in a floating point unit for executing denormalized numbers in floating point pipeline via serializing uses an instruction unit and having a control unit and a pipelined data flow unit, a shifter and a rounding unit. The floating point unit has an external feedback path for providing intermediate result data from said rounding unit to an input of the pipelined data flow unit to reuse the pipeline for denormalization by passing intermediate results in the pipeline which have a denormalized condition computed after the exponent calculation of the shifting circuit directly from the rounding unit to the top of the dataflow in the pipeline via an external feedback path. The pipeline has two paths which are selected based on the presence of other instructions in the pipeline. If no other instructions are in the pipeline, a first path is taken which uses the external feedback path from the rounding unit back into the top of the dataflow. When there are instructions in the pipeline, a shifter unit performing normalization of the fraction indicates possible underflow of the exponent, and prepares to hold the exponent and the fraction in a floating point data flow register; upon detection of exponent underflow during the rounder stage and detection of any other instructions in the pipeline, the control unit forces an interrupt for serialization, and cancels execution of this instruction and other instructions in the pipeline.
TL;DR: In this article, a floating point unit is provided which conveys the rounding mode in effect upon dispatch of a particular instruction with that particular instruction into the execution pipeline of the floating-point unit.
Abstract: A floating point unit is provided which conveys the rounding mode in effect upon dispatch of a particular instruction with that particular instruction into the execution pipeline of the floating point unit. Upon dispatch of a control word update instruction into the execution pipeline, the rounding mode is updated according to the updated control word provided for the control word update instruction. Instructions subsequent to the control word update instruction thereby receive the updated rounding mode as those instructions are dispatched. The updated rounding mode is available to the subsequent instructions prior to retiring the control word update instruction. The rounding mode is therefore updated without serializing the update. If the control word update instruction modifies the value in a field other than the rounding mode, the instructions subsequent to the control word update instruction may be discarded and re-executed subsequent to updating the control word register with the updated control word. In this manner, the control word update is effectively serialized for cases in which a field other than the rounding mode is updated.
TL;DR: In this article, a reservation station with an addressable constant store enables the provision of floating point constants to arithmetic units in a floating point unit of a superscalar processor; the constants are held in an internal, extended-precision format which provides extra precision and/or range when compared with formats available external to the unit.
Abstract: A reservation station with an addressable constant store enables the provision of floating point constants to arithmetic units in a floating point unit of a superscalar processor. Floating point constant identifiers supplied with floating point instructions index into the addressable store, and addressed constants are provided in an internal, extended-precision format which provides extra precision and/or range when compared with formats available external to the floating point unit. In this way, full internal, extended-precision constants can be provided for use in microcoded floating point instruction sequences. Additionally, internal extended-precision floating point constants may be rounded in accordance with a prevailing rounding mode and format to provide external format floating point constant values for use in implementing load constant instructions.
TL;DR: This paper explores and presents an approach that allows one to construct regular, nondegenerate positional or tangent plane continuous triangular or n-sided patches, each one as a trimmed tensor product surface.
Abstract: The tensor product Bézier and NURBs surface representation is frequently exploited in computer aided geometric design. Yet, this representation is inherently rectangular, a topology that does not easily enable the skinning, filleting, and rounding of triangular regions or domains with arbitrary n-sided boundaries. Modern solid modeling systems support tensor product Bézier and NURBs surfaces with the additional ability to represent the trimmed form of these surfaces. This paper explores and presents an approach that allows one to construct regular, nondegenerate positional or tangent plane continuous triangular or n-sided patches, each one as a trimmed tensor product surface. The proposed method is demonstrated on rounding of triangular corners using positional and tangent plane continuity conditions as well as an example of a C⁰ hexagonal filleting patch. This work was supported in part by DARPA (N00014-92-J-4113) and the NSF Science and Technology Center for Computer Graphics and Scientific Visualization (ASC-89-20219). All opinions, findings, conclusions or recommendations expressed in this document are those of the authors and do not necessarily reflect the views of the sponsoring agencies. This work was supported in part by grant No. 92-00223 from the United States-Israel Binational Science Foundation (BSF), Jerusalem, Israel. However, opinions, conclusions or recommendations arising out of supported research activities are those of the author or the grantee and should not be presented as implying that they are the views of the BSF. While visiting at the University of Utah.
TL;DR: The paper presents a floating point addition algorithm and adder pipeline design employing a packet forwarding pipeline paradigm and the proposed algorithms constitute a new paradigm for handling data hazards in deeply pipelined floating point pipelines.
Abstract: For pt.I see ibid., p.140-7 (1997). The paper presents a floating point addition algorithm and adder pipeline design employing a packet forwarding pipeline paradigm. The packet forwarding format and the proposed algorithms constitute a new paradigm for handling data hazards in deeply pipelined floating point pipelines. The addition algorithm employs a four stage execution phase pipeline with each stage suitable for implementation in a short clock period, assuming about fifteen logic levels per cycle. The first two cycles are related to addition proper and are the principal focus of the paper. The last two cycles perform the rounding. The addition algorithm accepts one operand in a standard binary floating point format at the start of cycle one. Packets comprising the other operand in our packet forwarding floating point format are input at the start of cycles one and two. Output of the result occurs in the packet format after cycles two and three with the format representing a floating point value equal to the standard IEEE 754 rounded result. The same result in a standard binary floating point format is available after cycle four for retirement to a register. The packet forwarding result is thus available with an effective two cycle latency for forwarding to the start of the adder pipeline or to a cooperating multiplier pipeline accepting a packet forwarding operand. The effective latency of the proposed design is two cycles for successive dependent operations while preserving IEEE 754 binary floating point compatibility.
TL;DR: A floating-point arithmetic processing apparatus has a circuit for generating a limit value for normalization shift by subtracting an exponent of the minimum value of a normalized number from a value of an exponent of an intermediate result.
Abstract: A floating-point arithmetic processing apparatus has a circuit for generating a limit value for normalization shift by subtracting an exponent of the minimum value of a normalized number from a value of an exponent of an intermediate result, and a circuit for generating, as a normalization shift number, the smaller of a shift number necessary for making the mantissa of the intermediate result a normalized number and the limit value for normalization shift. The floating-point arithmetic processing apparatus further has a circuit having a circuit for detecting a condition for overflow before the rounding process and a circuit for generating a value in the case of overflow, so that a predetermined value is delivered as a final result only when the overflow condition is detected before the rounding process but in the other case, a result obtained by performing the normalization process and the rounding process is delivered. When no overflow takes place before the rounding process but overflow occurs after the rounding process, the result obtained by performing the normalization process and the rounding process is delivered as a final result.
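The limited normalization shift described here can be written as a minimum of two quantities. The Python sketch below is illustrative only: the mantissa is modelled as an unsigned integer of `width` bits, the parameter names are hypothetical, and clamping the result at zero when the exponent is already below the minimum is an added assumption that the abstract does not state.

```python
def normalization_shift(mantissa: int, exponent: int, width: int, emin: int) -> int:
    """Left-shift count for normalization, capped so the exponent
    does not drop below the minimum normalized exponent emin."""
    # Shift needed to bring the leading 1 to the top bit of the mantissa field.
    needed = width - mantissa.bit_length() if mantissa else 0
    # Limit value: intermediate exponent minus the minimum normalized exponent.
    limit = exponent - emin
    # Assumption: never return a negative shift count.
    return max(0, min(needed, limit))


# 24-bit mantissa with 23 leading zeros, but only 6 exponent steps remain:
print(normalization_shift(mantissa=1, exponent=-120, width=24, emin=-126))
```

When the limit is the smaller value, the mantissa is left only partially shifted, which corresponds to producing a denormalized result instead of underflowing the exponent.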
TL;DR: A procedure for determining whether a box intersects a tetrahedron is developed, and it gives a guaranteed answer (modulo rounding errors).
Abstract: A procedure for determining whether a box intersects a tetrahedron is developed. The procedure uses interval arithmetic tools and barycentric coordinates, and it gives a guaranteed answer (modulo rounding errors). The basic geometric idea uses the separating plane approach. The procedure is robust and easy to implement. Numerical examples are given for each of the possible results (i.e., inclusion, overlap and exclusion).
TL;DR: In this paper, a new solution method for a basic version of the European airport slot allocation problem is presented, based on an integer linear programming model where for each flight there is a set of 0-1 variables, each associated with one of its possible departure slots.
Abstract: This report presents a new solution method for a basic version of the European airport slot allocation problem. The approach is based on an integer linear programming model where for each flight there is a set of 0-1 variables, each associated with one of its possible departure slots. Also presented is a heuristic rounding to derive feasible integral solutions from fractional solutions after each iteration of the column generation algorithm.
TL;DR: This paper presents an algorithm for classifying a point with respect to a region bounded by a series of piecewise continuous linear, quadratic, or cubic parametric curve segments, and techniques for handling special cases when the test ray passes through the end point of a curve segment.
Abstract: Classifying a point with respect to a closed planar region is a frequently encountered problem in the field of geometric computing. One of the most popular techniques is the parity count method in which a semi-infinite ray originating from a given test point is intersected with the boundary of a region. An odd number of intersections indicates that the point is inside the region. While this method can be applied to regions bounded by complex curves, most reported work mainly confines the boundary of a region to be composed of straight lines, circular-arcs, or parabolic splines. Owing to rounding errors in floating point computation, mis-located intersections between a ray and the boundary may lead to incorrect results. For regions bounded by complex curve segments, solutions to high order polynomial equations are usually required. The effect of rounding error is thus more apparent and may cause serious inconsistency such that a test point well within a region may be classified as outside the region. This paper presents an algorithm for classifying a point with respect to a region bounded by a series of piecewise continuous linear, quadratic, or cubic parametric curve segments. Techniques for handling special cases when the test ray passes through the end point of a curve segment will be discussed. The robust point inclusion test is attained through the use of a tolerance zone associated with each end point of a curve segment. The size of the tolerance zone depends on the segment representation error and the ray-segment intersection error. A method for estimating these errors and the size of the tolerance zone will also be discussed.
TL;DR: In this article, the effect of both input and product quantisation is examined for two's-complement, fixed-point arithmetic with rounding quantisation, and mathematical models for the transmission of the quantisation noise are established.
Abstract: The effect of both input and product quantisation is examined for two's-complement, fixed-point arithmetic with rounding quantisation. Both direct-form and state-space structures are considered, and mathematical models for the transmission of the quantisation noise are established. Expressions are derived for the variance of the statistically-stationary output noise, with an example showing how a time-varying filter realisation could be superior to an equivalent time-invariant structure.
TL;DR: It is shown that nonparametric orthogonal series estimators of probability densities retain the mean integrated square error (MISE) consistency when observations are grouped to the points of a uniform grid (prebinned).
Abstract: In this correspondence we show that nonparametric orthogonal series estimators of probability densities retain the mean integrated square error (MISE) consistency when observations are grouped to the points of a uniform grid (prebinned). This kind of grouping is typical for computer rounding errors and may also be useful in data compression, before calculating estimates, e.g., using the FFT. The main result shows that MISE consistency holds for all square-integrable densities under mild conditions on the grid step size.