Abstract: If Ω denotes an open subset of R^n (n = 1, 2,…), we define an algebra G(Ω) which contains the space D′(Ω) of all distributions on Ω and such that C∞(Ω) is a subalgebra of G(Ω). The elements of G(Ω) may be considered as “generalized functions” on Ω, and they admit partial derivatives of any order that generalize exactly the derivation of distributions. The multiplication in G(Ω) therefore gives a natural meaning to any product of distributions, and we explain how these results agree with remarks of Schwartz on difficulties concerning a multiplication of distributions. More generally, if q = 1, 2,…, and f ∈ O_M(R^{2q}) (a classical Schwartz notation), then for any G1,…,Gq ∈ G(Ω) we define naturally an element f(G1,…,Gq) ∈ G(Ω). These results are applied to some differential equations and extended to the vector-valued case, which allows the multiplication of vector-valued distributions of physics.
TL;DR: The Systolic Mode of Parallel Processing is introduced, with examples: Mapping Different Filter Banks onto the Same Fixed-Size Processor Array, and Unidirectional Full-Systolic Arrays with Bidirectional Data Flow.
Abstract: The Systolic Mode of Parallel Processing. Introduction to the Underlying Concept. The Original Motivation: VLSI Implementation. The Present Trend: Efficient Algorithms for Massively Parallel Computers. A List of Known Applications. Defining and Expressing Systolic Arrays and Algorithms. Using Automata Notions. Defining Systolic Automata, Arrays, and Algorithms. Expressing Systolic Algorithms. Analysis and Comparison of Systolic Algorithms. Matrix-Vector and Matrix Multiplication. Introduction to Vectors and Matrices. Matrix-Vector Multiplication. Systolic Simulation of Feedforward Artificial Neural Networks. Matrix Multiplication. Solving Systems of Linear Algebraic Equations. Introduction to Linear Systems. Gaussian Elimination. Systolic Arrays for Triangularization and LU/QR Decomposition. Systolic Algorithms for Back Substitution. Systolic Implementation of Iterative Methods. Further Problems of Linear Algebra. Computing the Inverse of a Matrix. Generalized Elimination. Computing the Characteristic Polynomial. Matrix Transposition and Related Operations. Convolution and Linear Filters. Convolution, Correlation, FIR and IIR Filters. Semi-Systolic Realizations. Unidirectional Full-Systolic Arrays. Systolic Arrays with Bidirectional Data Flow. Bit-Level Systolic Convolver. Operations with Polynomials. Introduction. Multiplication of Polynomials and Integers. Division of Polynomials. Computing the Greatest Common Divisor. Polynomial Interpolation. Evaluation of Polynomials. Comparison Problems. Sorting. Selection and Running Order Statistics. Sorting and Order Statistics for Rank Filtering. A Data Structure: Priority Queue. Dynamic Programming and its Applications. Introduction. Implementing the Dynamic Programming Recurrence in a Two-Dimensional Systolic Array. Implementation in One-Dimensional Arrays. Further Dynamic Programming Recurrences. Computational Geometry. Convex Hull. Nearest-Neighbours Problems. Systematic Design of Systolic Algorithms.
Dependence Graphs. Systolic Array Dependence Graphs. Extracting Systolic Algorithms from Dependence Graphs. Modifying the Properties of Systolic Algorithms. Partitioning of Systolic Algorithms. Partitioning, Algorithm Mapping, Design of Flexible Systolic Structures, Time Sharing. Application of c-Slow Automata to the Realization of Parallel Structures. Examples: Mapping Different Filter Banks onto the Same Fixed-Size Processor Array. A Summary of the Technique and Alternative Approaches. References and Additional Literature. Subject Index.
TL;DR: In this paper, the product sum operation is performed by a ROM table and an adder, and the number of multiplications is reduced by utilizing inherent characteristics of the coefficients of DCT/IDCT processing.
Abstract: A one-dimensional discrete cosine transform processor of N (N: positive integer)-term input data X includes a preprocessing section for carrying out addition and subtraction of (i)th-term data x(i) and (N-i)th-term data x(N-i) of input data X, and a unit for performing a product sum operation for sets of intermediate data subjected to preprocessing by addition and sets of intermediate data subjected to preprocessing by subtraction, respectively. The product sum operation unit includes a data rearranging unit for outputting, in parallel and in order, the bits at the same bit position of a set of data, a partial sum generator for generating a partial sum by using the parallel bit data as an address, and an accumulator for accumulating outputs of the partial sum generator. A one-dimensional inverse discrete cosine transform processor of N-term input data X includes a unit for performing a product sum operation of input data, and a postprocessing section for carrying out addition and subtraction of 2-term data in a predetermined combination of an output of the product sum operation unit. The number of multiplications is reduced by utilizing inherent characteristics of the coefficients of DCT/IDCT processing. Since the product sum operation is performed by a ROM table and an adder, a faster multiplication is realized.
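The ROM-table product sum described in this abstract resembles what is usually called distributed arithmetic. The following is a minimal sketch of that general idea, assuming unsigned inputs of a fixed bit width and integer coefficients; the names `build_rom` and `da_dot` are illustrative and do not come from the patent, and the DCT-specific pre- and postprocessing is not modelled.

```python
def build_rom(coeffs):
    """Precompute partial sums: entry `addr` is the sum of coeffs[k] for every bit k set in addr."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_dot(rom, xs, bits):
    """Dot product of the coefficients with unsigned `bits`-bit inputs, using lookups, shifts and adds only."""
    acc = 0
    for b in range(bits):
        # Gather bit b of every input into one ROM address (the data rearranging step).
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> b) & 1) << k
        # Accumulate the looked-up partial sum, weighted by the bit position.
        acc += rom[addr] << b
    return acc
```

For example, `da_dot(build_rom([3, -5, 7, 2]), [10, 3, 15, 8], 4)` evaluates 3·10 − 5·3 + 7·15 + 2·8 without any multiplier. The table grows as 2^n in the number of coefficients, which is one reason such designs split long product sums into small groups.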
TL;DR: Surprisingly, this enables us to decrease the bilinear complexity of n X n matrix multiplication below the current record upper bound for the same computation over the infinite fields of complex, real, or rational numbers.
TL;DR: In this paper, it was shown that, when field elements are represented by polynomials, division over finite fields can be performed by solving a system of m linear equations over GF(q).
Abstract: Division and bit-serial multiplication in finite fields are considered. Using co-ordinates of the supporting elements it is shown that, when field elements are represented by polynomials, division over GF(q^m) can be performed by solving a system of m linear equations over GF(q). For a canonical basis representation, a relationship between the division and the discrete-time Wiener-Hopf equation of degree m over GF(q) is derived. This relationship leads to a bit-serial multiplication scheme that can be easily realised for all irreducible polynomials.
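To make the "division as a linear system" idea concrete, here is a minimal sketch for the special case q = 2, assuming field elements are stored as integer bit masks of polynomial coefficients and `poly` is an irreducible polynomial that includes its x^m term. The function names are illustrative, and plain Gauss-Jordan elimination is used in place of the Wiener-Hopf formulation of the paper.

```python
def gf2m_mul(a, b, poly, m):
    """Multiply two GF(2^m) elements held as bit masks, reducing by `poly`."""
    r = 0
    for _ in range(m):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return r


def gf2m_div(b, a, poly, m):
    """Return x with a * x = b by solving an m-by-m linear system over GF(2)."""
    if a == 0:
        raise ZeroDivisionError("division by zero in GF(2^m)")
    # Column j of the matrix holds the coordinates of a * t^j.
    cols = [gf2m_mul(a, 1 << j, poly, m) for j in range(m)]
    # Row i packs the matrix coefficients in bits 0..m-1 and the right-hand side in bit m.
    rows = []
    for i in range(m):
        row = 0
        for j in range(m):
            row |= ((cols[j] >> i) & 1) << j
        row |= ((b >> i) & 1) << m
        rows.append(row)
    # Gauss-Jordan elimination; the matrix is invertible because a is nonzero.
    for col in range(m):
        pivot = next(r for r in range(col, m) if (rows[r] >> col) & 1)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(m):
            if r != col and (rows[r] >> col) & 1:
                rows[r] ^= rows[col]
    # The matrix is now the identity, so row i carries coordinate x_i in bit m.
    x = 0
    for i in range(m):
        x |= ((rows[i] >> m) & 1) << i
    return x
```

With m = 4 and `poly = 0b10011` (x^4 + x + 1), multiplying `gf2m_div(b, a, poly, 4)` back by `a` recovers `b` for every nonzero `a`.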
TL;DR: In this article, a taxonomy of objective functions and heuristics used to solve the mapping problem is presented, and a highly parallel heuristic mapping algorithm, called Cyclic Pairwise Exchange (CPE), is developed.
Abstract: This thesis investigates the mapping problem: assign the tasks of a parallel program to the processors of a parallel computer such that the execution time is minimized.
First, a taxonomy of objective functions and heuristics used to solve the mapping problem is presented. Next, we develop a highly parallel heuristic mapping algorithm, called Cyclic Pairwise Exchange (CPE), and discuss its place in the taxonomy. CPE uses local pairwise exchanges of processor assignments to iteratively improve an initial mapping. A variety of initial mapping schemes are tested and recursive spectral bipartitioning (RSB) followed by CPE is shown to result in the best mappings. For the test cases studied here, problems arising in computational fluid dynamics and structural mechanics on unstructured triangular and tetrahedral meshes, RSB and CPE outperform methods based on simulated annealing. Much less time is required to do the mapping and the results obtained are better. Compared with random and naive mappings, RSB and CPE reduce the communication time twofold for the test problems used.
Finally, we use CPE in two applications on a CM-2. The first application is a data parallel mesh-vertex upwind finite volume scheme for solving the Euler equations on 2-D triangular unstructured meshes. CPE is used to map grid points to processors. The performance of this code is compared with a similar code on a Cray-YMP and an Intel iPSC/860. The second application is parallel sparse matrix-vector multiplication used in the iterative solution of large sparse linear systems of equations. We map rows of the matrix to processors and use an inner-product based matrix-vector multiplication. We demonstrate that this method is an order of magnitude faster than methods based on scan operations for our test cases.
TL;DR: Three ways to modify this conversion process so that the result is rounded are described, which can be done on-the-fly as the digits are produced, without the use of a carry-propagate adder.
Abstract: In implementations of operations based on digit-recurrence algorithms such as division, left-to-right multiplication and square root, the result is obtained in digit-serial form, from most significant digit to least significant. To reduce the complexity of the result-digit selection and allow the use of redundant addition, the result-digit has values from a signed-digit set. As a consequence, the result has to be converted to conventional representation, which can be done on-the-fly as the digits are produced, without the use of a carry-propagate adder. The authors describe three ways to modify this conversion process so that the result is rounded. The resulting operation is fast because no carry-propagate addition is needed. The schemes described apply also to online arithmetic operations.
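The basic on-the-fly conversion that this abstract builds on can be sketched as follows. This is a minimal illustration of the unrounded conversion only, assuming the digits arrive most significant first and lie strictly between −radix and +radix; the three rounding variants of the paper are not shown, and the function name is illustrative. Python integers stand in for the two registers, so "multiply by the radix and add a digit" models appending one digit to a register.

```python
def on_the_fly_convert(digits, radix=2):
    """Convert signed digits (most significant first) to a conventional integer.

    q holds the value of the digits seen so far and qm holds q - 1. Each step
    appends one digit in [0, radix - 1] to either q or qm, so no carry has to
    propagate through the digits already produced.
    """
    q, qm = 0, -1
    for d in digits:
        if abs(d) >= radix:
            raise ValueError("digit out of range for this radix")
        # New q: append d to q, or append (radix + d) to qm when d is negative.
        new_q = q * radix + d if d >= 0 else qm * radix + (radix + d)
        # New qm (= new q - 1): append (d - 1) to q, or (radix - 1 + d) to qm.
        new_qm = q * radix + (d - 1) if d > 0 else qm * radix + (radix - 1 + d)
        q, qm = new_q, new_qm
    return q
```

For the radix-2 digit string 1, −1, 0, 1 the function returns 8 − 4 + 0 + 1 = 5.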
TL;DR: A simple but powerful architecture based on the classical associative processor model is proposed; by distributing logic among slices of storage cells such that a number of bit-planes share a simple logic unit, it makes bit-parallel arithmetic for massively parallel processing feasible.
Abstract: A simple but powerful architecture based on the classical associative processor model is proposed. By distributing logic among slices of storage cells such that a number of bit-planes share a simple logic unit, bit-parallel arithmetic for massively parallel processing becomes feasible. For m-bit operands, this architecture enables complex operations such as multiplication and division to execute in O(m) cycles as opposed to O(m^2) for bit-serial machines. Algorithms which utilize this bit-parallel property to efficiently perform operations on floating point data have been developed. The simplicity of the architecture enables its implementation using VLSI technology, and hence allows the construction of a word-parallel, bit-parallel, massively parallel (P^3) computing system. Implementations of the fast Fourier transform and matrix multiplication are presented to illustrate the operation of this system.
TL;DR: In the context of shift-and-add algorithms, the lower order terms require fewer shifting operations and less total hardware to effect multiplication than the corresponding higher order terms.
Abstract: Method and apparatus for implementing a digital filter employing coefficients expressed as sums of 2 to an integer power. Coefficients expressed as sums of powers of 2 may be algebraically manipulated such that higher order terms are replaced by an equivalent group of lower order terms. In the context of a shift-and-add algorithm, the lower order terms require fewer shifting operations and less total hardware to effect multiplication than the corresponding higher order terms.
TL;DR: It is shown that multiplication of two N-bit integers can be performed in O(1) time on an N*N reconfigurable mesh, and that the result can be extended to provide area-time tradeoffs in the usual bit model of VLSI to satisfy AT^2 optimality over 1
Abstract: It is shown that multiplication of two N-bit integers can be performed in O(1) time on an N*N reconfigurable mesh. This result is obtained by combining the O(1) time multiplication algorithm on an N*N^2 reconfigurable mesh, the Rader transform, and decomposition of one-dimensional convolution into multidimensional convolution. Choosing the Rader transform at the expense of long word length frees one from storing twiddle factors in advance, which is needed in other designs. It is also shown that the present algorithm can be simulated on other restricted reconfigurable mesh models without asymptotic increase in time or number of processing elements. It is shown that the present result can be extended to provide area-time tradeoffs in the usual bit model of VLSI to satisfy AT^2 optimality over 1
TL;DR: In this paper, a signed arithmetic data processing system (20) detects a multiply (MUL) or a multiply-and-accumulate (MAC) instruction in which a multiplier and a multiplicand each assume their respective maximum negative values.
Abstract: A signed arithmetic data processing system (20) detects a multiply (MUL) or a multiply-and-accumulate (MAC) instruction in which a multiplier and a multiplicand each assume their respective maximum negative values. If one or both of the operands is not equal to its maximum negative value, the multiplication proceeds normally, such as in a modified Booth's multiplier/MAC (33). However, if both operands are equal to their respective maximum negative values, the data processing system (20) substitutes a maximum positive constant for the output of the multiplier/MAC (33). This substitution allows the result to be expressed with one fewer bit. The resulting error is very small and becomes insignificant in most digital signal processing algorithms, especially those based on fractional, saturation arithmetic. Alternatively, an extra bit of precision may be achieved for a given hardware size.
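The substitution described in this abstract can be illustrated in a few lines. This is a minimal sketch assuming 16-bit Q15 fractional operands and a 32-bit Q31 result; those widths, the constant names and the function `q15_mul` are assumptions for illustration and are not taken from the patent.

```python
Q15_MIN = -(1 << 15)       # -1.0 in Q15, the maximum negative operand
Q31_MAX = (1 << 31) - 1    # largest positive Q31 value


def q15_mul(a, b):
    """Fractional multiply of two Q15 operands, returning a Q31 result."""
    if a == Q15_MIN and b == Q15_MIN:
        # (-1.0) * (-1.0) = +1.0 does not fit in Q31, so the largest positive
        # constant is substituted instead of widening the result by one bit.
        return Q31_MAX
    return (a * b) << 1    # Q30 product aligned to Q31
```

Every other operand pair yields a product that already fits in 32 bits, so only this single case needs special handling; the error introduced is one unit in the last place of the Q31 result.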
TL;DR: This study found that 75% of the children were able to solve the problems using a wide variety of strategies even though they had not received formal instruction in multiplication or division for most of the 2-year period.
Abstract: Children’s solution strategies to a variety of multiplication and division word problems were analysed at four interview stages in a 2-year longitudinal study. The study followed 70 children from Year 2 into Year 3, from the time where they had received no formal instruction in multiplication and division to the stage where they were being taught basic multiplication facts. Ten problem structures, five for multiplication and five for division, were classified on the basis of differences in semantic structure. The relationship between problem condition (i.e. small or large number combinations and use of physical objects or pictures) and performance and strategy use was also examined. The results indicated that 75% of the children were able to solve the problems using a wide variety of strategies even though they had not received formal instruction in multiplication or division for most of the 2-year period. Performance level generally increased for each interview stage, but few differences were found between multiplication and division problems except for Cartesian and Factor problems. Solution strategies were classified for both multiplication and division problems at three levels:
A wide range of counting strategies were classified as counting-all, skip counting and double counting. Analysis of intuitive models revealed preference for a repeated addition model for multiplication, and a ‘building-up’ model for division.
TL;DR: Efficient algorithms are described for matrix multiplication on SIMD computers and SIMD implementations of Winograd’s algorithm are considered in the case where additions are faster than multiplications.
Abstract: Efficient algorithms are described for matrix multiplication on SIMD computers. SIMD implementations of Winograd’s algorithm are considered in the case where additions are faster than multiplications. Classical kernels and the use of Strassen’s algorithm are also considered. Actual performance figures using the MasPar family of SIMD computers are presented and discussed.
TL;DR: Simple basic carry save adders are described, with which multiplication circuits of depth 3.71 log n and majority formulae of size O(n^3.13) are constructed; the shallowest known multiplication circuits and the shortest formulae for the majority function are obtained.
Abstract: A general theory is developed for constructing the asymptotically shallowest networks and the asymptotically smallest networks (with respect to formula size) for the carry save addition of n numbers using any given basic carry save adder as a building block. Using the optimal carry save addition networks, the shallowest known multiplication circuits and the shortest formulae for the majority function (and many other symmetric Boolean functions) are obtained. In this paper simple basic carry save adders are described, with which multiplication circuits of depth 3.71 log n (the result of which is given as the sum of two numbers) and majority formulae of size O(n^3.13) are constructed. Using more complicated basic carry save adders, not described here, these results could be further improved. Our best bounds are currently 3.57 log n for depth and O(n^3.13) for formula size.
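For readers unfamiliar with carry save addition, the sketch below shows the simplest basic carry save adder, a 3-to-2 compressor built from full adders, and a naive way to chain it. It assumes non-negative integers and makes no attempt to reproduce the optimal networks of the paper; the function names are illustrative.

```python
def carry_save_add(a, b, c):
    """3-to-2 compressor: return (s, t) with s + t == a + b + c, computed bitwise."""
    s = a ^ b ^ c                               # per-bit sum, no carry chain
    t = ((a & b) | (a & c) | (b & c)) << 1      # per-bit majority becomes the carry
    return s, t


def csa_reduce(values):
    """Reduce many non-negative integers to at most two numbers with the same total."""
    vals = list(values)
    while len(vals) > 2:
        s, t = carry_save_add(vals[0], vals[1], vals[2])
        vals = vals[3:] + [s, t]                # queue order keeps the tree roughly balanced
    return vals
```

The result is delivered as the sum of two numbers, as in the abstract; a single conventional adder finishes the job. Note that each carry bit is exactly a three-input majority function, which hints at why the two problems in the abstract are linked.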
TL;DR: In this article, the authors propose new shifted remainder conditioning and sign multiplexing techniques in combination with novel circuit architecture approaches to obtain efficient divider and square-root architectures.
Abstract: This paper addresses design of high speed architectures for fixed-point, two's-complement, bit-parallel division, square-root, and multiplication operations. These architectures make use of hybrid number representations (i.e. the input and output numbers are represented using two's complement representation, and the internal numbers are represented using radix-2 redundant representation). We propose new shifted remainder conditioning and sign multiplexing techniques in combination with novel circuit architecture approaches to obtain efficient divider and square-root architectures. Our divider exploits full dynamic range of operands and eliminates the need for on-line or off-line conversion of the result to binary (this is because our nonrestoring division and square-root operators output binary quotient). Furthermore, since the binary input set is a subset of the redundant digit set, no binary-to-redundant number conversion is necessary at the input of the divider and square-root operators. We also present a fast, new conversion scheme for converting radix-2 redundant numbers to two's complement binary numbers, and use this to design a bit-parallel multiplier. This multiplier architecture requires fewer pipelining latches than conventional two's complement multipliers, and reduces the latency of the multiplication operation from (2W - 1) to about W (where W is the word-length), when pipelined at the bit-level.
TL;DR: Fuzzy inference, with two parameters of the steering condition of a steering system as inputs, is used to calculate a differentiation correction value of steering torque and to correct the control value of an assist motor.
Abstract: PURPOSE: To improve the return characteristic of the steering wheel and the driving feel at low vehicle speed by performing fuzzy inference with two parameters of a steering condition of the steering system serving as inputs, calculating a differentiation correction value of steering torque in accordance with a return condition of steering, and correcting a control value of an assist motor. CONSTITUTION: In a torque differentiating part 31, detection torque VT is differentiated and output to a multiplication arithmetic part 32; this value is multiplied by a return time differentiation gain from a fuzzy inference part 40 in the multiplication arithmetic part 32 and added to an output of a multiplication arithmetic part 15 to generate an output of an assist command part 10, which is supplied to a current control part 20. An output of the torque differentiating part 31 is input to the fuzzy inference part 40, which performs fuzzy inference with the car speed, steering torque, differentiation value of torque VT and steering angular speed serving as input parameters, to decide a return condition of steering and to calculate the return time differentiation gain output to the multiplication arithmetic part 32.
TL;DR: The architecture of a modular exponentiation unit with systolic modular multipliers, used for the parallel processing of an exponentiation using a half-sized modulus (less security requirements) or for an application of the Chinese Remainder Theorem is described.
Abstract: The described architecture of a modular exponentiation unit with systolic modular multipliers shows the following features:
• simple VLSI implementation based on systolic arrays, which are improved versions of the multipliers proposed in [Atrubi65]
• two identical systolic arrays for the implementation of Montgomery's modular multiplication method
• small data paths because of the serial operation mode
• the required number of clock cycles for a modular multiplication depends on the actual size of the operands and not on the size of the systolic arrays
• By the separation of the cells in the middle of the systolic arrays, the modular multiplier can be reconfigured such that two modular multipliers are available for the multiplication of operands with half of the size. This can be used for the parallel processing of an exponentiation using a half-sized modulus (less security requirements) or for an application of the Chinese Remainder Theorem.
• The throughput and the area demand of a chip for modular exponentiations based on this architecture can be widely affected by the selection of the design parameters (base b, number of modular multipliers, number of registers).
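The Montgomery method named in the feature list can be sketched in software as follows. This is the textbook radix-2 (base b = 2) formulation, assuming an odd modulus n > 1 and operands already reduced below n; it does not model the systolic arrays or their reconfiguration, and the names `mont_mul` and `mont_pow` are illustrative.

```python
def mont_mul(a, b, n, k):
    """Return a * b * 2^(-k) mod n for odd n < 2^k and a, b < n, one bit of a per step."""
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b
        if s & 1:             # adding the odd modulus makes s even
            s += n
        s >>= 1               # exact division by 2
    return s - n if s >= n else s


def mont_pow(base, exp, n):
    """Left-to-right square-and-multiply built on mont_mul (n odd, n > 1, exp >= 0)."""
    k = n.bit_length()
    r2 = pow(2, 2 * k, n)                  # R^2 mod n with R = 2^k
    x = mont_mul(base % n, r2, n, k)       # base in Montgomery form
    acc = mont_mul(1, r2, n, k)            # 1 in Montgomery form
    for bit in bin(exp)[2:]:
        acc = mont_mul(acc, acc, n, k)
        if bit == "1":
            acc = mont_mul(acc, x, n, k)
    return mont_mul(acc, 1, n, k)          # convert back from Montgomery form
```

The loop in `mont_mul` runs once per operand bit and needs only additions and shifts, which is the property that makes the method attractive for serial hardware. The result of `mont_pow(7, 65537, 1000003)` can be checked against Python's built-in `pow(7, 65537, 1000003)`.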
TL;DR: This is the first study to propose the use of digital partitioning for optical matrix processing; digital partitioning and digital multiplication by analog convolution are found to be the most efficient algorithms if coding time and hardware are considered.
Abstract: Digital partitioning offers a flexible means of increasing the accuracy of an optical matrix-vector processor. This algorithm can be implemented with the same architecture required for a purely analog processor, which gives optical matrix-vector processors the ability to perform high-accuracy calculations at speeds comparable with or greater than electronic computers as well as the ability to perform analog operations at a much greater speed. Digital partitioning is compared with digital multiplication by analog convolution, residue number systems, and redundant number representation in terms of the size and the speed required for an equivalent throughput as well as in terms of the hardware requirements. Digital partitioning and digital multiplication by analog convolution are found to be the most efficient algorithms if coding time and hardware are considered, and the architecture for digital partitioning permits the use of analog computations to provide the greatest throughput for a single processor. To our knowledge this is the first study to propose the use of digital partitioning for optical matrix processing.
TL;DR: It is shown that, in general, the standard Booth algorithm cannot be extended to higher radix (>2) multiplication, and a rule to modify the Booth standard radix-2 algorithm for higher-radix multiplication is presented.
Abstract: It is shown that, in general, the standard Booth algorithm cannot be extended to higher radix (>2) multiplication. A rule to modify the Booth standard radix-2 algorithm for higher-radix multiplication is presented. This rule corrects the product computed by Booth's algorithm for certain cases of high-radix bit-recoded multiplications. In addition, the modification is defined for multipliers of any size, utilizing any power-of-2-bit recoding.
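As background for what a higher-radix recoding looks like, the sketch below implements the widely used radix-4 "modified Booth" recoding with overlapping bit triplets. It is offered as a familiar reference point and is not the correction rule of this paper, which the abstract does not spell out. It assumes a two's-complement multiplier of even width; the function names are illustrative.

```python
def booth_radix4_digits(y, bits):
    """Recode a two's-complement multiplier of even width `bits` into digits in {-2..2}."""
    if bits % 2:
        raise ValueError("bits must be even")
    u = y & ((1 << bits) - 1)              # two's-complement bit pattern of y
    digits, prev = [], 0                   # prev is bit 2i-1, with bit -1 taken as 0
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(prev + b0 - 2 * b1)  # overlapping triplet (b1, b0, prev)
        prev = b1
    return digits                          # least significant digit first


def booth_radix4_mul(x, y, bits):
    """Multiply x by y using one add or subtract of x or 2x per recoded digit."""
    return sum((d * x) << (2 * i) for i, d in enumerate(booth_radix4_digits(y, bits)))
```

For instance, `booth_radix4_digits(6, 4)` gives `[-2, 2]`, since 6 = −2 + 2·4. Halving the number of partial products in this way is the usual motivation for moving beyond radix 2.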
TL;DR: A set of game cards includes a plurality of first cards, each indicating a number from 1 to 9, and a plurality of second cards, each indicating a figure generated by multiplying together the figures indicated on arbitrary first cards.
Abstract: A set of game cards includes a plurality of first cards respectively indicating each number from 1 to 9, and a plurality of second cards each indicating a figure generated by multiplying together figures, respectively indicated on arbitrary first cards. Thus, the game cards help players learn multiplication tables when they are used.
TL;DR: A systolized, modular arithmetic device for multiplying long integers modulo M consists of a control module, a series arrangement of processing modules, and a tail module; a similar procedure is proposed for exponentiation.
Abstract: Arithmetic unit for multiplying long integers modulo M and R.S.A. converter provided with such multiplication device. A systolized and modular arithmetic device has a control module, followed by a series arrangement of processing modules, followed by a tail module. For multiplying an integer P and an integer Q modulo a third integer M, a provisional product is incremented each time with Q for a -1- bit in P, preceded by a doubling of the product. For a -0- bit only the doubling ensues. Normalizing mod M is effected by adding the complement of M, W, under control of propagated carry values. A similar procedure is proposed for exponentiation of Q.
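The doubling and conditional increment described in this abstract can be sketched in software as below. It assumes non-negative P and a positive modulus M, scans P from its most significant bit, and models the normalization as plain subtraction of M rather than the carry-controlled addition of the complement W used in the device; the function name is illustrative.

```python
def modmul_double_add(p, q, m):
    """Compute (p * q) mod m by scanning the bits of p from the most significant end."""
    q %= m
    acc = 0
    for bit in bin(p)[2:]:
        acc <<= 1              # the doubling happens for every bit of P
        if bit == "1":
            acc += q           # Q is added only for a 1 bit
        while acc >= m:        # normalize mod M; at most two subtractions are needed
            acc -= m
    return acc
```

Because the running product stays below M after each step, doubling and adding Q keeps it below 3M, which bounds the normalization work per bit.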
TL;DR: In this paper, the authors proposed a feature extraction-based algorithm (FEBA) for sparse matrix-vector multiplication, which exploits any regular structure present in the sparse matrix by extracting it and processing it separately.
Abstract: The authors propose a feature-extraction-based algorithm (FEBA) for sparse matrix-vector multiplication. The key idea of FEBA is to exploit any regular structure present in the sparse matrix by extracting it and processing it separately. The order in which these structures are extracted is determined by the relative efficiency with which they can be processed. The authors have tested FEBA on IBM 3000 VF for matrices from the Harwell Boeing and OSL collection. The results obtained were on average five times faster than the ESSL routine which is based on the ITPACK storage structure.
TL;DR: An estimate of the execution time for a radix-256 unit for double-precision square root and a comparison with other implementations indicate that the proposed approach is an alternative to consider when designing a square-root unit.
Abstract: A scheme for performing higher radix square root based on prescaling of the radicand is presented to reduce the complexity of the result-digit selection. The scheme requires several steps, namely multiplication for prescaling the radicand, square root, multiplication for prescaling for the division, and division. Online algorithms are used to reduce the overall time and pipelining to reuse the different modules. An estimate of the execution time for a radix-256 unit for double-precision square root and a comparison with other implementations indicate that the proposed approach is an alternative to consider when designing a square-root unit.
TL;DR: Two efficient algorithms for the multiplication of the so-called reduced biquaternions are presented and may be used for the realization of digital filters with hypercomplex coefficients.
TL;DR: A variable-length segmentation strategy is proposed which significantly reduces the average number of additions required by the m-ary segmentation and the canonical recoding algorithms for multiplication of large binary numbers.
Abstract: We propose a variable-length segmentation strategy which significantly reduces the average number of additions required by the m-ary segmentation and the canonical recoding algorithms for multiplication of large binary numbers. This strategy produces two new algorithms: the adaptive m-ary segmentation algorithm utilizes both the speedup inherent in high-radix multiplication and the ability to skip zero bits; the adaptive m-ary segmentation canonical recoding algorithm gains additional benefit from the increased probability of zero after the canonical recoding. The average number of additions required is computed using Markov chains.
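Canonical recoding, one of the baselines mentioned in this abstract, is commonly realized as the non-adjacent form. The sketch below shows that standard recoding only, assuming a non-negative integer multiplier; the adaptive m-ary segmentation of the paper is not reproduced, and the function name is illustrative.

```python
def canonical_recode(n):
    """Return the non-adjacent form of n >= 0 as digits in {-1, 0, 1}, least significant first."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)    # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= d             # leaves n divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits
```

For example, `canonical_recode(7)` returns `[-1, 0, 0, 1]`, i.e. 7 = 8 − 1, which needs two nonzero digits instead of the three in binary 111. Each nonzero digit costs one addition or subtraction in a shift-and-add multiplier, so raising the share of zero digits directly lowers the operation count.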
TL;DR: An efficient data structure is presented which supports general unstructured sparse matrix-vector multiplications on a Distributed Array of Processors (DAP) and organises the operations in batches of massively parallel steps by a heuristic scheduling procedure performed on the host computer.
Abstract: An efficient data structure is presented which supports general unstructured sparse matrix-vector multiplications on a Distributed Array of Processors (DAP). This approach seeks to reduce the inter-processor data movements and organises the operations in batches of massively parallel steps by a heuristic scheduling procedure performed on the host computer. The resulting data structure is of particular relevance to iterative schemes for solving linear systems. Performance results for matrices taken from well-known Linear Programming (LP) test problems are presented and analysed.