TL;DR: It is shown that the multiplication operation c = a · b · r^-1 in the field GF(2^k) can be implemented significantly faster in software than the standard multiplication, where r is a special fixed element of the field.
Abstract: We show that the multiplication operation c = a · b · r^-1 in the field GF(2^k) can be implemented significantly faster in software than the standard multiplication, where r is a special fixed element of the field. This operation is the finite field analogue of the Montgomery multiplication for modular multiplication of integers. We give the bit-level and word-level algorithms for computing the product, perform a thorough performance analysis, and compare the algorithm to the standard multiplication algorithm in GF(2^k). The Montgomery multiplication can be used to obtain fast software implementations of the discrete exponentiation operation, and is particularly suitable for cryptographic applications where k is large.
TL;DR: This work presents fast multiplication algorithms for matrix pairs of arbitrary dimensions, estimates the asymptotic running time as a function of the dimensions, and improves the exponents of the complexity estimates for computing basic solutions to the linear programming problem with constraints and variables.
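The bit-level operation can be illustrated with a minimal Python sketch. It assumes that field elements are stored as integers whose bit i is the coefficient of x^i, that r = x^k, and that the irreducible polynomial n(x) has constant term 1. The function names and the GF(2^8) modulus 0x11B mentioned afterwards are illustrative choices, not taken from the paper.

```python
def gf2_mul_mod(a, b, n, k):
    """Standard shift-and-add product a*b mod n(x) in GF(2^k)."""
    c = 0
    for i in reversed(range(k)):
        c <<= 1                 # multiply the accumulator by x
        if (c >> k) & 1:        # degree reached k: reduce by n(x)
            c ^= n
        if (b >> i) & 1:        # add a when coefficient b_i is 1
            c ^= a
    return c


def gf2_mont_mul(a, b, n, k):
    """Montgomery-style product a*b*x^(-k) mod n(x), assuming r = x^k."""
    c = 0
    for i in range(k):
        if (a >> i) & 1:        # add b when coefficient a_i is 1
            c ^= b
        if c & 1:               # make c divisible by x by adding n(x)
            c ^= n
        c >>= 1                 # exact division by x
    return c
```

Under these assumptions, multiplying the result of `gf2_mont_mul` by r = x^k mod n(x) (0x1B for the modulus 0x11B with k = 8) recovers the ordinary product. The reduction step inspects only the lowest bit, which is what makes the loop cheap in software.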
TL;DR: New probabilistic algorithms are presented for factoring univariate polynomials over finite fields, using fast matrix multiplication techniques and the new baby step/giant step techniques.
Abstract: New probabilistic algorithms are presented for factoring univariate polynomials over finite fields. The algorithms factor a polynomial of degree n over a finite field of constant cardinality in time O(n^1.815). Previous algorithms required time Θ(n^(2+o(1))). The new algorithms rely on fast matrix multiplication techniques. More generally, to factor a polynomial of degree n over the finite field F_q with q elements, the algorithms use O(n^1.815 log q) arithmetic operations in F_q. The new baby step/giant step techniques used in our algorithms also yield new fast practical algorithms at super-quadratic asymptotic running time, and subquadratic-time methods for manipulating normal bases of finite fields.
TL;DR: The implementation of IEEE single precision floating-point multiplication and addition is discussed, followed by a discussion of an algorithm, matrix multiplication, based on these operations, which achieves performance comparable to conventional microprocessors.
Abstract: The use of reconfigurable hardware to perform high precision operations such as IEEE floating point operations has been limited in the past by FPGA resources. We discuss the implementation of IEEE single precision floating-point multiplication and addition. Then, we assess the practical implications of using these operations in the Xilinx 4000 series FPGAs considering densities available now and scheduled for the near future. For each operation, we present space requirements and performance information. This is followed by a discussion of an algorithm, matrix multiplication, based on these operations, which achieves performance comparable to conventional microprocessors. Algorithm implementation options and their performance implications are discussed and corresponding measured results are given.
TL;DR: In this article, a Monte Carlo (MC) model was used to estimate the excess noise factor in thin p⁺-i-n⁺ GaAs avalanche photodiodes (APDs).
Abstract: A Monte Carlo (MC) model has been used to estimate the excess noise factor in thin p⁺-i-n⁺ GaAs avalanche photodiodes (APDs). Multiplication initiated both by pure electron and hole injection is studied for different lengths of multiplication region and for a range of electric fields. In each case a reduction in excess noise factor is observed as the multiplication length decreases, in good agreement with recent experimental measurements. This low noise behavior results from the higher operating electric field needed in short devices, which causes the probability distribution function for both electron and hole ionization path lengths to change from the conventionally assumed exponential shape and to exhibit a strong dead space effect. In turn this reduces the probability of higher order ionization events and narrows the probability distribution for multiplication. In addition, our simulations suggest that for a given overall multiplication, electron initiated multiplication in short devices has inherently reduced noise, despite the higher feedback from hole ionization, compared to long devices.
TL;DR: An implementation of Strassen's algorithm for matrix multiplication is reported on that uses several unconventional techniques to make the algorithm memory-friendly and often outperforms the alternative techniques.
Abstract: Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this algorithm that uses several unconventional techniques to make the algorithm memory-friendly. First, the algorithm internally uses a non-standard array layout known as Morton order that is based on a quad-tree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting the performance of the algorithm, which we can do by virtue of the cache behavior of the Morton ordering. Each technique is critical for performance, and their combination as done in our code multiplies their effectiveness. Performance comparisons of our implementation with that of competing implementations show that our implementation often outperforms the alternative techniques (up to 25%). However, we also observe wide variability across platforms and across matrix sizes, indicating that at this time, no single implementation is a clear choice for all platforms or matrix sizes. We also note that the time required to convert matrices to/from Morton order is a noticeable amount of execution time (5% to 15%). Eliminating this overhead further reduces our execution time.
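The Morton-order layout mentioned above can be sketched by interleaving the bits of the row and column indices. This is a minimal illustration, not the authors' implementation; the function name, the 16-bit index width, and the choice of placing row bits in the odd positions are assumptions.

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of (row, col) into a single Morton (Z-order) offset.

    Assumption: column bits occupy even positions and row bits odd positions,
    so each 2x2 block is stored as (0,0), (0,1), (1,0), (1,1).
    """
    idx = 0
    for i in range(bits):
        idx |= ((col >> i) & 1) << (2 * i)
        idx |= ((row >> i) & 1) << (2 * i + 1)
    return idx


# Offsets for a 4x4 matrix: each quadrant is contiguous in memory.
layout = [[morton_index(r, c) for c in range(4)] for r in range(4)]
```

Because every quadrant of the quad-tree occupies a contiguous range of offsets, the recursive sub-matrix products in a divide-and-conquer multiplication touch contiguous memory.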
TL;DR: In this paper, a multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed, where the multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form.
Abstract: A multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed. The multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form. An effective sign for the multiplier and multiplicand operands may be calculated and used to create and select a number of partial products according to Booth's algorithm. Once the partial products have been created and selected, they may be summed and the results may be output. The results may be signed or unsigned, and may represent vector or scalar quantities. When a vector multiplication is performed, the multiplier may be configured to generate and select partial products so as to effectively isolate the multiplication process for each pair of vector components. The multiplier may also be configured to sum the products of the vector components to form the vector dot product. The final product may be output in segments so as to require fewer bus lines. The segments may be rounded by adding a rounding constant. Rounding and normalization may be performed in two paths, one assuming an overflow will occur, the other assuming no overflow will occur. The multiplier may also be configured to perform iterative calculations to evaluate constant powers of an operand. Intermediate products that are formed may be rounded and normalized in two paths and then compressed and stored for use in the next iteration. An adjustment constant may also be added to increase the frequency of exactly rounded results.
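The role of Booth's algorithm in forming partial products can be illustrated in software. The following is a minimal radix-2 sketch; the patent abstract does not state the radix or the hardware details, so the function name, the 8-bit default width, and the radix are assumptions made for illustration only.

```python
def booth_multiply(a, b, n_bits=8):
    """Multiply a by b using radix-2 Booth recoding of b.

    Assumption: b fits in n_bits as a two's complement number.
    Each recoded digit is b[i-1] - b[i], which lies in {-1, 0, +1},
    so every partial product is 0, +a or -a, shifted left by i.
    """
    b_bits = b & ((1 << n_bits) - 1)   # two's complement bit pattern of b
    prev = 0                           # implicit bit b[-1] = 0
    product = 0
    for i in range(n_bits):
        cur = (b_bits >> i) & 1
        digit = prev - cur
        product += digit * (a << i)    # select and sum the partial product
        prev = cur
    return product
```

The recoding handles the sign of b without a separate correction step, which is one reason Booth-style selection suits a multiplier that must accept both signed and unsigned operands.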
TL;DR: In this paper, the authors presented a parallel m/2-bit×n/2-bit multiplication with four multiplicand selectors and four multiplicator selectors, in which the selected multiplicands and multiplicators are input into four multipliers, and then multiplication results of (lower m/2 bits of m bits)×(lower n/2 bits of n bits)
Abstract: Upon execution of four sets of m/2 bit×n/2 bit multiplication, four multiplicand selectors select m/2-bit multiplicands respectively and four multiplicator selectors select corresponding n/2-bit multiplicators respectively, then the selected m/2-bit multiplicands and n/2-bit multiplicators are input into four multipliers, and then four sets of m/2 bit×n/2 bit multiplication are executed in parallel. Upon execution of m bit×n bit multiplication, the four multiplicand selectors select upper or lower m/2-bit multiplicands respectively and the four multiplicator selectors select upper or lower n/2-bit multiplicators respectively, then the selected m/2-bit multiplicands and n/2-bit multiplicators are input into the four multipliers respectively, then multiplication results of (lower m/2 bits of m bits)×(lower n/2 bits of n bits) and (upper m/2 bits of m bits)×(upper n/2 bits of n bits) out of four multiplication results of the four multipliers are connected by a connector, and then the connected multiplication results and the other two multiplication results are added by an adder with arranging in a predetermined bit location each other respectively.
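The full-width mode described above can be checked arithmetically with a short Python sketch. It assumes unsigned operands and even m and n; the function and variable names are hypothetical, and the selectors, connector, and adder are modeled with plain integer operations.

```python
def mul_from_half_multipliers(a, b, m=16, n=16):
    """Form an m-bit by n-bit unsigned product from four (m/2)x(n/2) products."""
    hm, hn = m // 2, n // 2
    a_lo, a_hi = a & ((1 << hm) - 1), a >> hm   # multiplicand selectors
    b_lo, b_hi = b & ((1 << hn) - 1), b >> hn   # multiplicator selectors

    ll = a_lo * b_lo    # lower x lower
    lh = a_lo * b_hi    # lower x upper
    hl = a_hi * b_lo    # upper x lower
    hh = a_hi * b_hi    # upper x upper

    # ll needs at most hm + hn bits, so hh and ll can be joined by
    # concatenation (the "connector") without any carry between them.
    joined = (hh << (hm + hn)) | ll

    # The two cross products are added at their bit positions (the "adder").
    return joined + (hl << hm) + (lh << hn)
```

In the four-way parallel mode the same four multipliers would simply return `ll`, `lh`, `hl`, and `hh` for independent operand pairs instead of combining them.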
TL;DR: In this paper, a multiplier capable of performing both signed and unsigned scalar and vector multiplication is disclosed, and the multiplier is configured for use in a microprocessor and may include a partial product generator, a selection logic unit, and an adder.
Abstract: A multiplier capable of performing both signed and unsigned scalar and vector multiplication is disclosed. The multiplier is configured for use in a microprocessor and may include a partial product generator, a selection logic unit, and an adder. The multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form. The multiplier is also configured to receive a first control signal indicative of whether signed or unsigned multiplication is to be performed and a second control signal indicative of whether vector multiplication is to be performed. The multiplier is configured to calculate an effective sign for the multiplier and multiplicand operands based upon each operand's most significant bit and the control signal. The effective signs may then be used by the partial product generation unit and the selection logic to create and select a number of partial products according to Booth's algorithm. Once the partial products have been created and selected, the adder is configured to sum them and output the results, which may be signed or unsigned. When a vector multiplication is performed, the multiplier is configured to generate and select partial products so as to effectively isolate the multiplication process for each pair of vector components.
TL;DR: In this article, a simplified algorithm for calculating time response of avalanche photodiodes (APDs) is presented, which considers the time course of avalanche processes for the general case of position-dependent double-carrier multiplications including the dead space effect.
Abstract: A simplified algorithm for calculating time response of avalanche photodiodes (APDs) is presented. The algorithm considers the time course of avalanche processes for the general case of position-dependent double-carrier multiplications including the dead space effect. The algorithm is based on a discrete time setting ideally suited for computer modeling and can be applied to any APD structure. It gives a fast and accurate estimation of the time and frequency response of APDs. As an example, the present method is applied to InP-InGaAs separate absorption, grading, charge, and multiplication (SAGCM) APDs. The variation of multiplication gain with bias voltage and 3-dB electrical bandwidth at different multiplication gain obtained using the new algorithm show good agreement with experimental results. The algorithm can be used to study temperature dependence of APD characteristics and can be easily extended to calculate the excess noise factor.
TL;DR: In this paper, an algorithm for computer control of a digital plotter is given, which may be programmed without multiplication or division instructions and is efficient with respect to speed of execution and memory utilization.
Abstract: An algorithm is given for computer control of a digital plotter. The algorithm may be programmed without multiplication or division instructions and is efficient with respect to speed of execution and memory utilization.
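An integer-only line-stepping routine of the kind this abstract describes can be sketched as follows. This is the widely used error-accumulation form usually associated with Bresenham's line algorithm, written here as an assumption about what such a routine looks like rather than as the paper's exact formulation; the function name is hypothetical.

```python
def line_points(x0, y0, x1, y1):
    """Return the grid points visited when stepping from (x0, y0) to (x1, y1).

    Only additions, subtractions and comparisons are used; the doubling
    of the error term is written as an addition.
    """
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    points = []
    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = err + err          # 2 * err without a multiply instruction
        if e2 >= dy:            # step in x
            err += dy
            x0 += sx
        if e2 <= dx:            # step in y
            err += dx
            y0 += sy
    return points
```

Each loop iteration corresponds to one plotter movement: a step along x, along y, or along both at once.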
TL;DR: Two co-transformations that accomplish this goal are introduced, an approximation based on real analysis of the subtraction logarithm and a simple algebra that applies for both real and complex values and that works for both addition and subtraction.
Abstract: The real logarithmic number system, which represents a value with a sign bit and a quantized logarithm, can be generalized to create the complex logarithmic number system, which replaces the sign bit with a quantized angle in a log/polar coordinate system. Although multiplication and related operations are easy in both real and complex systems, addition and subtraction are hard, especially when interpolation is used to implement the system. Both real and complex logarithmic arithmetic benefit from the use of co-transformation, which converts an addition or subtraction from a region where interpolation is expensive to a region where it is easier. Two co-transformations that accomplish this goal are introduced. The first is an approximation based on real analysis of the subtraction logarithm. The second is based on simple algebra that applies for both real and complex values and that works for both addition and subtraction.
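The contrast between easy multiplication and hard addition in the real logarithmic number system can be made concrete with a small Python sketch. It uses unquantized floating-point logarithms for clarity, handles only nonzero values, and covers only same-sign addition; the function names are hypothetical, and the co-transformations of the paper are not implemented.

```python
import math


def lns_encode(x):
    """Represent nonzero x as (sign bit, log2 of magnitude)."""
    return (x < 0, math.log2(abs(x)))


def lns_decode(v):
    sign, exponent = v
    return -(2.0 ** exponent) if sign else 2.0 ** exponent


def lns_mul(u, v):
    """Multiplication: XOR the signs, add the logarithms."""
    return (u[0] != v[0], u[1] + v[1])


def lns_add_same_sign(u, v):
    """Addition of same-sign values needs the function log2(1 + 2^z), z <= 0.

    In hardware this function is typically tabulated or interpolated,
    which is where the cost of addition comes from.
    """
    hi, lo = max(u[1], v[1]), min(u[1], v[1])
    return (u[0], hi + math.log2(1.0 + 2.0 ** (lo - hi)))
```

Subtraction requires the analogous function log2|1 - 2^z|, which has a singularity as z approaches 0; that region is the kind of expensive case a co-transformation is meant to avoid.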
TL;DR: In this article, a multiplier configured to obtain higher frequencies of exactly rounded results by adding an adjustment constant to intermediate products generated during iterative multiplication operations is disclosed, and the results converge toward an infinitely precise result.
Abstract: A multiplier configured to obtain higher frequencies of exactly rounded results by adding an adjustment constant to intermediate products generated during iterative multiplication operations is disclosed. One such iterative multiplication operation is the Newton-Raphson iteration, which may be utilized by the multiplier to perform reciprocal calculations and reciprocal square root calculations. For each iteration, the results converge toward an infinitely precise result. To improve the frequency of the exactly rounded result, the results of the iterative calculations may be studied for a large number of differing input operands to determine the best suited value for the adjustment constant. The multiplier may also be configured to perform scalar and packed vector multiplication using the same hardware.
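The Newton-Raphson iterations referred to above use only multiplications and subtractions, as the following floating-point Python sketch shows. The function names, the caller-supplied initial guesses, and the fixed iteration count are assumptions; the adjustment constant described in the abstract is not modeled.

```python
def reciprocal(d, x0, iterations=6):
    """Approximate 1/d with the iteration x <- x * (2 - d * x).

    Assumption: the initial guess x0 satisfies 0 < x0 < 2/d, which is
    the condition for convergence.
    """
    x = x0
    for _ in range(iterations):
        x = x * (2.0 - d * x)
    return x


def reciprocal_sqrt(d, y0, iterations=6):
    """Approximate 1/sqrt(d) with the iteration y <- y * (3 - d * y * y) / 2."""
    y = y0
    for _ in range(iterations):
        y = y * (3.0 - d * y * y) * 0.5
    return y
```

For example, `reciprocal(3.0, 0.3)` starts with a relative error of 0.1, and that error is squared at every step, so a handful of iterations reaches double precision.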
TL;DR: The main building blocks used in the VLP arithmetic circuits are presented, the similarities of each arithmetic operator are shown and area/time estimates of these circuits in Xilinx FPGAs are presented.
Abstract: This paper presents the organization of an arithmetic unit for variable long-precision (VLP) operands suitable for reconfigurable computing. The reconfigurable arithmetic coprocessor (RAC) cooperates with the host computer in the VLP tasks. The main design issues addressed in the paper are: (a) mapping of the most frequent and time consuming operations of the VLP arithmetic algorithms to RAC, and (b) design of VLP algorithms that allow reduced reconfiguration time between arithmetic operations. The VLP arithmetic algorithms proposed cover multiplication, division and square root. In this paper we present the main building blocks used in the VLP arithmetic circuits, show the similarities of each arithmetic operator and present area/time estimates of these circuits in Xilinx FPGAs.
TL;DR: This work shows that a rather simple sequential cache-efficient algorithm provides significantly better performance than existing algorithms for sparse matrix multiplication, and describes a multithreaded implementation of this simple algorithm that scales well with the number of threads and CPUs.
Abstract: Several fast sequential algorithms have been proposed in the past to multiply sparse matrices. These algorithms do not explicitly address the impact of caching on performance. We show that a rather simple sequential cache-efficient algorithm provides significantly better performance than existing algorithms for sparse matrix multiplication. We then describe a multithreaded implementation of this simple algorithm and show that its performance scales well with the number of threads and CPUs. For 10% sparse, 500×500 matrices, the multithreaded version running on 4-CPU systems provides more than a 411-fold speed increase over the well-known BLAS routine and a 146-fold and 446-fold speed increase over two other recent techniques for fast sparse matrix multiplication, both of which are relatively difficult to parallelize efficiently.
TL;DR: In this article, a multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed, where the multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form.
Abstract: A multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed. The multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form. An effective sign for the multiplier and multiplicand operands may be calculated based upon each operand's most significant bit and a control signal. The effective signs may then be used to create and select a number of partial products according to Booth's algorithm. Once the partial products have been created and selected, they may be summed and the results may be output. The results may be signed or unsigned, and may represent vector or scalar quantities. When a vector multiplication is performed, the multiplier may be configured to generate and select partial products so as to effectively isolate the multiplication process for each pair of vector components. The multiplier may also be configured to sum the products of the vector components to form the vector dot product. The final product may be output in segments so as to require fewer bus lines. The segments may be rounded by adding a rounding constant. Rounding and normalization may be performed in two paths, one assuming an overflow will occur, the other assuming no overflow will occur.
TL;DR: This paper describes a restarted Lanczos algorithm that is particularly suitable for implementation on distributed machines and demonstrates the correctness and the effectiveness of the new algorithm.
Abstract: This paper describes a restarted Lanczos algorithm that is particularly suitable for implementation on distributed machines. The only communication operation it requires outside of the matrix-vector multiplication is a global sum. For most large eigenvalue problems, the global sum operation takes a small fraction of the total execution time. The majority of the computation time is spent in the matrix-vector multiplication. Efficient parallel matrix-vector multiplication routines can be found in many parallel sparse matrix packages such as AZTEC [9], BLOCK-SOLVE [10], PETSc [3], and P_SPARSLIB. For this reason, our main emphasis in this paper is to demonstrate the correctness and the effectiveness of the new algorithm.
TL;DR: This paper investigates the low power implementation issues of the Soft-Output Viterbi algorithm (SOVA), a building block for turbo codes, and develops an architecture that completes those computations with reduced power consumption.
Abstract: An important technique for reducing power consumption in VLSI systems is strength reduction, the substitution of a less-costly operation such as a shift, for a more-costly operation such as a multiplication. Using a logarithmic number representation provides several opportunities for strength reductions; in particular, multiplication is performed as the fixed-point addition of logarithms, and extracting a square root is implemented via a shift. These reductions occur transparently at the hardware level; consequently relatively little algorithmic modification is required, and they are readily applicable to adaptive filtering. For performing Givens rotations in the QR decomposition recursive least squares adaptive filter, logarithmic arithmetic is shown to compare favorably to other strength reduction techniques, such as CORDIC arithmetic, in terms of switched capacitance and numerical accuracy.
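The two strength reductions named above can be sketched with fixed-point logarithms in Python. The sketch assumes positive values only and a 16-bit fractional part; `FRAC_BITS` and the function names are illustrative choices, not details from the paper.

```python
import math

FRAC_BITS = 16  # assumed number of fractional bits in the fixed-point logarithm


def to_log(x):
    """Quantize log2(x) for x > 0 to a fixed-point integer."""
    return round(math.log2(x) * (1 << FRAC_BITS))


def from_log(l):
    """Convert a fixed-point logarithm back to a real value."""
    return 2.0 ** (l / (1 << FRAC_BITS))


def log_mul(la, lb):
    """Multiplication becomes a fixed-point addition of logarithms."""
    return la + lb


def log_sqrt(la):
    """Square root halves the logarithm: an arithmetic right shift by one bit."""
    return la >> 1
```

Python's `>>` on negative integers is an arithmetic shift, so `log_sqrt` also works for values below 1, whose logarithms are negative.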
TL;DR: A simple and efficient method that uses bit replication, instead of conventional multiplication, to achieve bit replication expansion, and shows that the optimal number of repetitions is given by ceiling (m/q) and that the method is equivalent to multiplication by the ideal gain when m/q is an integer.
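The summary above can be illustrated with a small Python sketch that expands a q-bit value to m bits by repeating its bit pattern ceiling(m/q) times and keeping the top m bits. The function name and the truncation step are assumptions made for illustration; the summary itself does not give these details.

```python
def replicate_bits(value, q, m):
    """Expand a q-bit value to m bits by bit replication.

    The q-bit pattern is repeated ceiling(m/q) times and the result is
    truncated to its most significant m bits.
    """
    reps = -(-m // q)                  # ceiling(m / q) using integer arithmetic
    wide = 0
    for _ in range(reps):
        wide = (wide << q) | value     # append one more copy of the pattern
    return wide >> (reps * q - m)      # drop the surplus low-order bits
```

When m/q is an integer, for example q = 4 and m = 8, the result equals `value * 17`, which is the ideal gain 255/15; full scale maps to full scale and zero maps to zero in every case.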
TL;DR: The Deep Thought algorithm as discussed by the authors takes a commutator presentation for a finitely generated torsion-free nilpotent group G, and produces explicit polynomials for the multiplication of elements of G.
Abstract: We describe the “Deep Thought” algorithm, which can, among other things, take a commutator presentation for a finitely generated torsion-free nilpotent group G, and produce explicit polynomials for the multiplication of elements of G. These polynomials were first shown to exist by Philip Hall, and allow for “symbolic collection” in finitely generated nilpotent groups. We discuss various practical issues in calculations in such groups, including the construction of a hybrid collector, making use of both the polynomials and ordinary collection from the left.
TL;DR: In this paper, a coprocessor including a first multiplication circuit and a second multiplication circuit with a series input to receive n bits and a series output to give n+k bits is described.
Abstract: A coprocessor including a first multiplication circuit and a second multiplication circuit with a series input to receive n bits and a series output to give n+k bits. The coprocessor also includes addition and multiplexing circuits enabling the data elements produced by the multiplication circuits to be added up with one another and with other data elements encoded on n bits. The invention makes parallel use of the multiplication circuits to carry out modular or non-modular operations on pieces of binary data having n bits or more.
TL;DR: The technique, similar in spirit to that used by Vaidya to find the best complexity bounds for linear programming, is based on reducing much of the computation involved to matrix multiplication, so complexity bounds in their most general form are a function of the complexity of matrix multiplication.
TL;DR: In this paper, it was shown that the multiplication of an n × n Toeplitz-plus-Hankel matrix by a vector requires only 4 transformations of length n plus O(n) operations.
TL;DR: This paper presented an analysis of young students' development of multiplication and division concepts based on a multimodal SOLO model, drawn from two sources of data: a two-year longitudinal study of 70 Grade 2 to 3 students' solutions to 24 multiplicative word problems, and examples from a problem-centred teaching project with Grade 3 students.
Abstract: This paper presents an analysis of young students’ development of multiplication and division concepts based on a multimodal SOLO model. The analysis is drawn from two sources of data: a two-year longitudinal study of 70 Grade 2 to 3 students’ solutions to 24 multiplicative word problems, and examples from a problem-centred teaching project with Grade 3 students. An increasingly complex range of counting, additive, and multiplicative strategies based on an equal-grouping structure demonstrated conceptual growth through ikonic and concrete symbolic modes. The solutions employed by students to solve any particular problem reflected the mathematical structure they imposed on it. A SOLO developmental model for multiplication and division is described in terms of developing structure and associated counting and calculation strategies.
TL;DR: In this paper, Bozzo et al. presented representations for inverses of Toeplitz-plus-Hankel matrices and more general T + H-Bezoutians with discrete Fourier transformations, which can be used for fast matrix-vector multiplication.
TL;DR: This work provides syntactic characterizations of the queries on finitely representable databases that preserve certain geometric conditions, such as being convex polytopes, polyhedra, and compact semi-linear sets, and proves decidability of preservation properties for conjunctive queries.
Abstract: We extend some of the classical characterization theorems of relational database theory---particularly those related to query safety---to the context where database elements come with fixed interpreted structure and where formulae over elements of that structure can be used in queries. We show that the addition of common interpreted functions, such as real addition and multiplication, to the relational calculus preserves important characterization theorems of the relational calculus and also preserves certain combinatorial properties of queries. Our main result of the first kind is that there is a syntactic characterization of the collection of safe queries over the relational calculus supplemented by a wide class of interpreted functions---a class that includes addition, multiplication, and exponentiation---and that this characterization gives us an interpreted analog of the concept of range-restricted query from the uninterpreted setting. Furthermore, our range-restricted queries are particularly intuitive for the relational calculus with real arithmetic and give a natural syntax for safe queries in the presence of polynomial functions. We use these characterizations to show that safety is decidable for Boolean combinations of conjunctive queries for a large class of interpreted structures. We show a dichotomy theorem that sets a polynomial bound on the growth of the output of a query that might refer to addition, multiplication, and exponentiation.
We apply the above results for finite databases to get results on constraint databases representing potentially infinite objects. We start by getting syntactic characterizations of the queries on constraint databases that preserve geometric conditions in the constraint data model. We consider classes of convex polytopes, polyhedra, and compact semilinear sets, the latter corresponding to many spatial applications. We show how to give an effective syntax to safe queries and prove that for conjunctive queries the preservation properties are decidable.
TL;DR: Threshold based neural networks for serial binary addition and multiplication of n-bit operands are proposed and it is shown that the serial addition can be computed with polynomially bounded weights and a maximum fan-in in the order of O(log n) in O(n/log n) serial cycles.
Abstract: This paper investigates threshold based neural networks for periodic symmetric Boolean functions and some related operations. It is shown that any n-input variable periodic symmetric Boolean function can be implemented with a feedforward linear threshold-based neural network with size of O(log n) and depth also of O(log n), both measured in terms of neurons. The maximum weight and fan-in values are in the order of O(n). Under the same assumptions on weight and fan-in values, an asymptotic bound of O(log n) for both size and depth of the network is also derived for symmetric Boolean functions that can be decomposed into a constant number of periodic symmetric Boolean subfunctions. Based on these results, neural networks for serial binary addition and multiplication of n-bit operands are also proposed. It is shown that the serial addition can be computed with polynomially bounded weights and a maximum fan-in in the order of O(log n) in O(n/log n) serial cycles. Finally, it is shown that the serial multiplication can be computed in O(n) serial cycles with O(log n) size neural gate network, and with O(n log n) latches.
TL;DR: In this paper, the authors analyzed use of the commutative property for solving multiplication problems by children who learn about multiplication in schools and by street vendors who solve multiplication problems through repeated addition.
Abstract: We analyzed use of the commutative property for solving multiplication problems by children who learn about multiplication in schools and by street vendors who solve multiplication problems through repeated addition. Subjects were Brazilian street vendors with irregular school attendance, who had received no, or very little, school instruction on multiplication, and 1st- to 3rd-grade Brazilian school children. Results from 2 studies show that use of commutativity to solve multiplication problems is closely related to use of multiplication. Street sellers who rely exclusively on repeated addition to solve such problems may, however, use multiplicative commutativity if it represents a clear reduction in the number of computational steps needed to reach a solution. But their justifications for its use are often based on knowledge about multiplication.