TL;DR: The 2*2 case of the algorithm can be adapted to computing Givens rotations, resulting in a 17% savings in real matrix multiplications.
Abstract: Novel fast algorithms for multiplying square complex matrices are presented. The algorithms are based on concepts from fast methods of complex multiplication in which a surrogate is used for the square root of minus one. Previous methods imposed the structure of a finite ring or field on the problem. The novel algorithms also use a surrogate, but do not require the imposed structure and its inherent rounding. The number of real matrix multiplications required can be reduced from four to two for even dimension, and to $2 + 1/N^2$ for odd dimension $N$. The disadvantage of the algorithms is the imposition of a requirement on the structure of one of the two complex matrices being multiplied. The 2*2 case of the algorithm can be adapted to computing Givens rotations, resulting in a 17% savings in real matrix multiplications.
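As a point of reference for the "four real matrix multiplications" mentioned in the abstract above, the following minimal Python sketch shows the conventional baseline, (A + iB)(C + iD) = (AC - BD) + i(AD + BC). It is not the surrogate-based algorithm of the paper, whose details are not given here; the helper names `matmul`, `matadd`, and `complex_matmul_four` are illustrative assumptions.

```python
def matmul(x, y):
    """Plain real matrix product of two list-of-lists matrices."""
    inner = len(y)
    return [[sum(x[i][k] * y[k][j] for k in range(inner))
             for j in range(len(y[0]))]
            for i in range(len(x))]


def matadd(x, y, sign=1):
    """Elementwise x + sign * y."""
    return [[p + sign * q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def complex_matmul_four(a_re, a_im, b_re, b_im):
    """Baseline complex product using four real matrix multiplications.

    Returns the (real, imaginary) parts of (a_re + i a_im)(b_re + i b_im).
    """
    real = matadd(matmul(a_re, b_re), matmul(a_im, b_im), sign=-1)
    imag = matadd(matmul(a_re, b_im), matmul(a_im, b_re))
    return real, imag
```

Counting the calls to `matmul` makes the cost the abstract starts from explicit: four real products per complex product.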
TL;DR: The authors explored whether preservice elementary teachers have the same misconceptions as children about multiplication and division, and found that they are not likely to recognize the related errors students make, and that their instruction might inadvertently contribute to perpetuating the misconceptions.
Abstract: In recent years much has been written about children's and adolescents' misconceptions concerning the operations needed to solve multiplication and division word problems (e.g., Bell, 1982; Bell, Fischbein, & Greer, 1984; Bell, Swan, & Taylor, 1981; Brown, 1981; Vergnaud, 1983). Fischbein, Deri, Nello, and Marino (1985) argued that the arithmetical operations generally remain linked to implicit primitive behavioral models. They noted that the primitive models for multiplication and division may be viewed as the source of misconceptions such as "the divisor must be a whole number" or "multiplication always makes bigger and division always makes smaller." The purpose of our study was to explore whether preservice elementary teachers have the same misconceptions. If preservice teachers hold these misconceptions, they are not likely to recognize the related errors students make. In fact, their instruction might inadvertently contribute to perpetuating the misconceptions. We attempted to determine whether preservice teachers select the correct operation when they are presented with problems having data that conflict with the implicit rules of the primitive behavioral models of multiplication and division. As teacher educators, we were also interested in noting whether the preservice teachers would exhibit other misconceptions and the extent to which such misconceptions were similar to those previously noted among children.
TL;DR: It is shown that solving the DTWHE is equivalent to performing division over finite fields, and the proof provides a new interpretation of the relationship between bit-serial multiplication and DTWHEs that enables bit-serial multiplication over GF($2^m$) to be understood more easily.
Abstract: Discrete-time Wiener-Hopf equations (DTWHEs) over finite fields are considered. It is shown that solving the DTWHE is equivalent to performing division over finite fields. The proof provides a new interpretation of the relationship between bit-serial multiplication and DTWHEs. The interpretation enables bit-serial multiplication over GF($2^m$) to be understood more easily. As an example, bit-serial multiplication methods for multiplying any two elements that can be done without performing any transformation, or with only a simple transformation of the bases, are presented.
TL;DR: Based on MSD addition, parallel algorithms for multiplication and division are developed on the basis of symbolic substitution (SS), and the performance of the proposed optical arithmetic system is analyzed and compared with that of state-of-the-art electronic counterparts.
Abstract: The modified-signed-digit (MSD) number system offers parallel addition and subtraction of any two numbers, with carry propagation constrained only between two adjacent digits. Based on MSD addition, parallel algorithms for multiplication and division are developed in this paper. The optical implementations of these MSD arithmetic algorithms are developed on the basis of symbolic substitution (SS). The space-invariant nature of SS matches well with the parallel nature of the MSD arithmetic algorithms presented. The potential advantages of using these algorithms for optical computing include the significant increase in speed, full exploitation of parallelism, and higher system throughput compared with existing electronic arithmetic processors. The performance of the proposed optical arithmetic system is analyzed and compared with that of state-of-the-art electronic counterparts.
TL;DR: This paper found that the choice of operation was dominated by the numerical preference for dividing by an integer or the smaller of the two numbers; decimal points were sometimes ignored in determining this preference.
Abstract: Three hundred seventy-seven subjects in six age groups, ranging from 10 to 20 years, were asked to identify the appropriate operation for 24 multiplication and division problems involving price, speed, and measure conversion. For multiplication problems, difficulty was strongly dependent on the type of number in the preferred-multiplier role. Important differences from the general tendency were noted in those change-of-size and mixture problems in which both quantities were measured in the same units. Choice of operation also was affected by the misconception that multiplication makes bigger and division smaller (MMBDS). In division problems, choice was dominated by the numerical preference for dividing by an integer or the smaller of the two numbers; decimal points were sometimes ignored in determining this preference. MMBDS also operated, but not in all problems. In a second experiment with 42 fifteen-year-old students, half were asked to estimate answers and half to choose the correct operation for change-of-size and mixture problems. Estimating the outcome was easier than choosing the operation both for division and for multiplication by numbers less than 1. For multiplication by numbers substantially greater than 1, choosing the operation was easier.
TL;DR: The generalized proof for the multibit overlapped scanning multiplication is introduced, and the multiplication process is discussed, to establish the correctness of the decode and the actions taken to produce the multiplication of any valid scheme proposed in the past.
Abstract: Because of recent advances in technology, multibit scanning implementations can be considered that exceed three-bit and four-bit groupings. The generalized proof for the multibit overlapped scanning multiplication is introduced, and the multiplication process is discussed. The proofs are intended to establish the correctness of the decode and the actions taken to produce the multiplication of any valid scheme proposed in the past, and to dictate the correct decode and actions taken for any overlapped s-bit scanning algorithm such that s is a natural number greater than or equal to two. The multiplication is considered to be between two fractional numbers, which are represented in two's-complement form.
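To make the idea of overlapped scanning concrete, here is a minimal Python sketch of the familiar 3-bit case (radix-4 Booth recoding), in which each step examines two new multiplier bits plus one bit shared with the previous group. It is an illustration only, not the generalized s-bit scheme or proof from the paper, and it assumes two's-complement integers of even `width` rather than the fractional operands the abstract considers; the function name is hypothetical.

```python
def booth_radix4_multiply(multiplicand, multiplier, width):
    """Multiply using 3-bit overlapped scanning of a two's-complement multiplier.

    `multiplier` must fit in `width` bits (two's complement); `width` must be even.
    """
    if width <= 0 or width % 2:
        raise ValueError("width must be a positive even number")
    if not -(1 << (width - 1)) <= multiplier < (1 << (width - 1)):
        raise ValueError("multiplier does not fit in the given width")

    bits = multiplier & ((1 << width) - 1)  # two's-complement bit pattern
    product = 0
    overlap = 0  # implicit 0 to the right of the least significant bit
    for i in range(0, width, 2):
        low = (bits >> i) & 1
        high = (bits >> (i + 1)) & 1
        # Decode the 3-bit group (high, low, overlap) into a digit in {-2, ..., 2}.
        digit = -2 * high + low + overlap
        # Action: add the selected multiple of the multiplicand, weighted by 2**i.
        product += (digit * multiplicand) << i
        overlap = high  # the top bit is rescanned as the next group's overlap bit
    return product
```

For example, `booth_radix4_multiply(7, -3, 4)` scans the pattern `1101` as the digits +1 and -1 and returns -21.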
TL;DR: The authors present a regular iterative algorithm for matrix multiplication using the S. Winograd method and show that several well-known matrix multiplication arrays are directly obtained from it, differing only in the choice of iteration vector.
Abstract: The authors present a regular iterative algorithm for matrix multiplication and show that several well-known matrix multiplication arrays are directly obtained from it, differing only in the choice of iteration vector. They then present a regular iterative algorithm for matrix multiplication using the S. Winograd method (1968) and show in detail how to derive one array from this algorithmic description. Other arrays in the same family can similarly be obtained for different choices of the iteration space. The new arrays compute the product of two matrices faster than available conventional arrays and use a smaller number of processor cells.
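Assuming that "the S. Winograd method (1968)" refers to Winograd's pairing scheme for inner products, the sketch below shows how it trades multiplications for additions in a sequential setting. It says nothing about the regular iterative formulation or the processor arrays derived in the paper; the function name is illustrative and the inner dimension is assumed to be even.

```python
def winograd_matmul(a, b):
    """Matrix product via Winograd's inner-product pairing (even inner dimension).

    Uses x.y = sum_k (x[2k] + y[2k+1]) * (x[2k+1] + y[2k])
               - sum_k x[2k] * x[2k+1] - sum_k y[2k] * y[2k+1].
    """
    n = len(a[0])
    if n % 2 or len(b) != n:
        raise ValueError("inner dimension must be even and match")
    half = n // 2
    cols = len(b[0])

    # Correction terms computed once per row of a and once per column of b.
    row_term = [sum(r[2 * k] * r[2 * k + 1] for k in range(half)) for r in a]
    col_term = [sum(b[2 * k][j] * b[2 * k + 1][j] for k in range(half))
                for j in range(cols)]

    # Each output entry then needs only n/2 multiplications.
    return [[sum((a[i][2 * k] + b[2 * k + 1][j]) * (a[i][2 * k + 1] + b[2 * k][j])
                 for k in range(half)) - row_term[i] - col_term[j]
             for j in range(cols)]
            for i in range(len(a))]
```

Because the correction terms are shared across a whole row or column, the per-entry multiplication count is roughly halved for large matrices.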
TL;DR: This paper proves that computing the product of two $n \times n$ matrices over the binary field requires at least $2.5n^2 - o( n^2 )$ multiplications.
Abstract: This paper proves that computing the product of two $n \times n$ matrices over the binary field requires at least $2.5n^2 - o(n^2 )$ multiplications.
TL;DR: The proposed scheme may provide a considerable saving in the hardware logic in implementing a self-checking arithmetic logic unit (ALU) and may ultimately make feasible a single-chip self-checking microprocessor or reduced-instruction-set-computer (RISC) design.
Abstract: A novel approach to designing concurrent-error-detecting arithmetic and logic units using Berger code is presented. Several theorems are developed on Berger check predictions for arithmetic and logical operations. Specifically, the Berger check prediction is proposed for additions and subtractions with unsigned numbers as well as signed numbers. Berger check prediction for 16 logical operations and shift operations, multiplication, and division are given. The proposed scheme may provide a considerable saving in the hardware logic (or chip area) in implementing a self-checking arithmetic logic unit (ALU) and may ultimately make feasible a single-chip self-checking microprocessor or reduced-instruction-set-computer (RISC) design.
TL;DR: Generic primitives for matrix operations as defined by levels one, two and three of the BLAS are of great value in that they make user programs much simpler, and hide most of the architectural detail of importance for performance in the primitives.
Abstract: Generic primitives for matrix operations as defined by levels one, two and three of the BLAS are of great value in that they make user programs much simpler, and hide most of the architectural detail of importance for performance in the primitives. We describe generic shared memory primitives such as one-to-all and all-to-all broadcasting, and one-to-all and all-to-all personalized communication, and implementations thereof that are within a factor of two of the best known lower bounds. We describe algorithms for the multiplication of arbitrarily shaped matrices using these primitives. Of the three loops required for a standard matrix multiplication algorithm expressed in Fortran, all three can be parallelized. We show that if one loop is parallelized, then the processors shall be aligned with the loops having the most elements. Depending on the initial matrix allocation, data permutations may be required to accomplish the processor/loop alignment. This permutation is included in our analysis. We show that in parallelizing two loops the optimum aspect ratio of the processing plane is equal to the ratio of the number of matrix elements in the two loops being parallelized.
TL;DR: A new algorithm is presented for computing the product of two elements in a finite field by means of sums and products in a fixed subfield with very fast parallel implementation and a considerable reduction in the number of computations.
Abstract: A new algorithm is presented for computing the product of two elements in a finite field by means of sums and products in a fixed subfield. The algorithm is based on a normal basis representation of fields and assumes that the dimension m of the finite field over the subfield is a highly composite number. A very fast parallel implementation and a considerable reduction in the number of computations are allowed, in comparison to some methods discussed in the literature.
TL;DR: This document discusses the architecture of the MIPS-X Compiler System, the hardware and software behind it, and some of the techniques used to develop and implement this system.
Abstract: 1 Introduction.- 2 Architecture.- 3 The Compiler System.- 4 A Hardware Overview.- 5 The Execute Engine.- 6 Instruction Fetch Hardware.- 7 The External Interface.- A Exception Handling.- A.1 Interrupts.- A.2 Trap On Overflow.- A.3 Trap Instructions.- B Integer Multiplication and Division.- B.1 Multiplication and Division Support.- B.2 Multiplication.- B.3 Division.- C Opcode Map.- C.1 OP Field Bit Assignments.- C.2 Comp Func Field Bit Assignments.- C.3 Opcode Map of All Instructions.- D MIPS-X Revision 1 and 2 Pin Numbers.- D.1 Pin Mapping for Probe Card and Funsim.- D.2 Pin Map for 144 Pin PGA.- E Revision 1 and Revision 2 Differences.
TL;DR: In this paper, a multiplication, division and square root extraction apparatus which calculates the solutions to addition, division, and square root extraction functions by approximation using iteration has a multiplier, an adder-subtracter and a shifter of prescribed bit width connected to a bus.
Abstract: A multiplication, division and square root extraction apparatus which calculates the solutions to addition, division and square root extraction functions by approximation using iteration has a multiplier, an adder-subtracter and a shifter of prescribed bit width connected to a bus. Iteration is conducted by inputting the output of the multiplier to the adder-subtracter or the shifter and returning the result to the input of the multiplier via the bus. A shifter and an arithmetic and logic unit connected to a second bus connected to the aforesaid bus via a switch have a greater bit width than the prescribed bit width and are used for large scale calculations, thus preventing a reduction in processing speed.
TL;DR: The author proposes combining multipliers with these multiplication-free algorithms to construct fast methods of elementary function generation, and demonstrates the idea by combining multipliers with the CORDIC algorithm to achieve fast vector rotation.
Abstract: Virtually all numerical techniques for elementary function generation share the common property of avoiding multiplication by iteratively performing shift operations. However, with the advent of VLSI, multiplier economics are considerably less formidable than before. The author proposes combining multipliers with these multiplication-free algorithms to construct fast methods of elementary function generation. He demonstrates the idea by combining multipliers with the CORDIC algorithm to achieve fast vector rotation.
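For readers unfamiliar with the multiplication-free baseline the abstract refers to, the following Python sketch shows classical rotation-mode CORDIC, where each step uses only shifts (modeled here as multiplication by a power of two) and additions. It is the textbook algorithm, not the author's multiplier-assisted variant; floating point stands in for fixed-point hardware, and the function name and iteration count are assumptions.

```python
import math


def cordic_rotate(x, y, angle, iterations=32):
    """Rotate (x, y) counter-clockwise by `angle` radians with classical CORDIC.

    Converges for |angle| up to about 1.74 rad; larger angles need a
    preliminary quadrant reduction, which is omitted here.
    """
    # Accumulated gain of the pseudo-rotations, removed once at the end.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        t = 2.0 ** -i  # a right shift by i bits in fixed-point hardware
        x, y = x - d * y * t, y + d * x * t
        z -= d * math.atan(t)  # table lookup in hardware
    return x / gain, y / gain
```

Rotating the unit vector `(1.0, 0.0)` by an angle returns approximately its cosine and sine, which is how CORDIC generates those elementary functions.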
TL;DR: A modified CORDIC algorithm is presented that offers a considerable latency time reduction and chip area savings when compared with the original CORDIC method.
Abstract: In this contribution we present a modified CORDIC algorithm that offers a considerable latency time reduction and chip area savings when compared with the original CORDIC method. The operations used are adds, shifts, and multiplication or division.
TL;DR: A data parallel implementation of the multiplication of matrices of arbitrary shapes and sizes is presented and a systolic algorithm based on a rectangular processor layout is used by the implementation.
Abstract: A data parallel implementation of the multiplication of matrices of arbitrary shapes and sizes is presented. A systolic algorithm based on a rectangular processor layout is used by the implementation. All processors contain submatrices of the same size for a given operand. Matrix-vector multiplication is used as a primitive for local matrix-matrix multiplication in the Connection Machine system CM-2 implementation. The peak performance of the local matrix-matrix multiplication is in excess of 20 Gflops s$^{-1}$. The overall algorithm including all required data motion has a peak performance of 5.8 Gflops s$^{-1}$.
TL;DR: Various multipliers, such as the parallel or serial-parallel types and the realisation through microprogramming on an arithmetic unit, are presented, all of which are based on the above technique.
Abstract: The direct digital multiplication of complex numbers is examined. The proposed technique bypasses the classical method based on the identity (A + Bj)(C + Dj) = AC − BD + (AD + BC)j. The complex multiplication is considered as one operation and is examined in the bit-level. Various multipliers, such as the parallel or serial-parallel types and the realisation through microprogramming on an arithmetic unit, are presented, all of which are based on the above technique.
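The classical identity quoted in the abstract above can be written out directly, alongside the well-known three-multiplication rearrangement often attributed to Gauss. This Python sketch only illustrates the conventional word-level approach that the proposed bit-level technique is said to bypass; it does not reproduce that technique, and the function names are illustrative.

```python
def complex_mul_classical(a, b, c, d):
    """(a + bj)(c + dj) via the identity: 4 real multiplications, 2 additions."""
    return (a * c - b * d, a * d + b * c)


def complex_mul_three(a, b, c, d):
    """Same product with 3 real multiplications and 5 additions/subtractions."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    # k1 - k3 = ac - bd (real part), k1 + k2 = ad + bc (imaginary part)
    return (k1 - k3, k1 + k2)
```

Both functions return the pair (real part, imaginary part); for (1 + 2j)(3 + 4j) each gives (-5, 10).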
TL;DR: The Cascade hardware architecture for high/variable precision arithmetic uses a radix-16 redundant signed-digit number representation and provides a complete suite of memory management functions implemented in hardware, including a garbage collector.
Abstract: The Cascade hardware architecture for high/variable precision arithmetic is described. It uses a radix-16 redundant signed-digit number representation and directly supports single or multiple precision addition, subtraction, multiplication, division, extraction of the square root, and computation of the greatest divisor. It is object-oriented and implements an abstract class of objects, variable precision integers. It provides a complete suite of memory management functions implemented in hardware, including a garbage collector. The Cascade hardware permits free tradeoffs of space versus time.
TL;DR: This paper proposes several efficient general-purpose parallel algorithms for single-input and multiinput eigenvalue assignment problems and proposes a numerical method for the Sylvester matrix equation arising in the construction of a Luenberger observer.
TL;DR: Two algorithms are presented for parallel multiplication of two n-bit binary numbers; both use column compression to increase the speed of execution and can be easily modified to handle two's complement numbers with constant differences in time.
Abstract: Two algorithms for parallel multiplication of two n-bit binary numbers are presented. Both use column compression to increase the speed of execution. They require almost regular interconnection between only two types of cells and hence are very suitable for VLSI implementation. Both of them can also be easily modified to handle two's complement numbers with constant differences in time.
TL;DR: The technique can be used to transform a class of algorithms to specific forms that can be mapped directly onto higher-dimensional systolic networks, yet maintaining the same number of processing cells as its 1-D counterpart.
TL;DR: Discusses the implementation of a residue arithmetic circuit using multiple-valued bidirectional current-mode MOS technology and designs and fabricates the mod7 three-operand multiply adder as an integrated circuit based on 10-μm CMOS technology.
Abstract: Discusses the implementation of a residue arithmetic circuit using multiple-valued bidirectional current-mode MOS technology. Each residue digit is represented by new multiple-valued coding suitable for highly parallel computation. By the coding, mod $m_i$ multiplication can be simply performed by a shift operation. In mod $m_i$ addition, radix-5 signed-digit (SD) arithmetic is employed for a high degree of parallelism and multiple-operand addition, so that high-speed arithmetic operations can be achieved. Finally, the mod7 three-operand multiply adder is designed and fabricated as an integrated circuit based on 10-μm CMOS technology.
TL;DR: An analog arithmetic circuit for executing multiplications, divisions, compressions, expansions and combinations thereof is presented in this paper, where the arithmetic circuit can be fabricated in the form of a MOS LSI because it does not use a precise triangle waveform generator for pulse width modulation.
Abstract: An analog arithmetic circuit for executing multiplications, divisions, compressions, expansions and combinations thereof. The arithmetic circuit is provided with a ΔΣ modulator comprising an A/D converter and a first D/A converter, a second D/A converter for receiving the output from the ΔΣ modulator, and a low-pass filter which receives the output of the second D/A converter and outputs the result of an arithmetic operation. The arithmetic circuit can be fabricated in the form of a MOS LSI because it does not use a precise triangle waveform generator for pulse width modulation.
TL;DR: In this article, a delay circuit and adder group, a multiplication block group, and an adder group are integrated to form one image-processing integrated circuit device to reduce the number of parts.
Abstract: An image-processing integrated circuit device comprises a delay circuit and adder group, a multiplication block group, and an adder group. Image data in a window are fed to the delay circuit and adder group simultaneously row by row and then added up for every symmetrical positions in the window. The respective sums of the image data thus added up for every symmetrical positions are multiplied by corresponding coefficient data in the multiplication block group. Lastly, the respective results of multiplication obtained from the multiplication block group are added up by the adder group to thereby obtain a filter output. The delay circuit and adder group, the multiplication block group, and the adder group can be integrated to form one image-processing integrated circuit device. Accordingly, the number of parts is reduced. Further, not only the image data can be fed into the delay circuit and adder group simultaneously but also the multiplication in the multiplication block group can be carried out in parallel and simultaneously by using exclusive multiplication blocks each provided for symmetrical positions in the window. Accordingly, the speed in filtering processing becomes high.
TL;DR: The author proposes dedicated high-speed architectures for nibble-serial implementation of arithmetic operations (such as addition, multiplication, division, and square root) using a two's-complement fixed-point number system.
Abstract: The author proposes dedicated high-speed architectures for nibble-serial implementation of arithmetic operations (such as addition, multiplication, division, and square root) using a two's-complement fixed-point number system (all numbers assumed to lie between -1 and +1). Nibble-serial circuits are obtained by systematically applying the unfolding transformation on the corresponding bit-serial circuits. Nibble-serial arithmetic circuits input $W_1$ bits of a word or sample in a single cycle, and the complete word is input in $W_2$ cycles, where $W = W_1 W_2$ is the word length. $W_1$ need not be 4 in a nibble-serial implementation, but can be any divisor of the word length.
TL;DR: The method uses conventional multiplication, followed by partial modular reduction based on sums of residues, and it is shown that for a simple recursive system where the output of partial modular reduction is the input for the next multiplication, overflow presents few problems.
Abstract: This paper describes a method for computing a modular exponentiation, useful in performing the RSA Public Key algorithm, suitable for software or hardware implementation. The method uses conventional multiplication, followed by partial modular reduction based on sums of residues. We show that for a simple recursive system where the output of partial modular reduction is the input for the next multiplication, overflow presents few problems.
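The recursive structure described above, in which each reduced product feeds the next multiplication, is the familiar square-and-multiply loop. The Python sketch below shows that loop with an ordinary full reduction (`%`) after every multiplication; the paper's partial reduction based on sums of residues is not reproduced here, so this is only an assumed baseline, and the function name is illustrative.

```python
def mod_exp(base, exponent, modulus):
    """Compute base**exponent mod modulus by right-to-left square-and-multiply."""
    if modulus <= 0 or exponent < 0:
        raise ValueError("modulus must be positive and exponent non-negative")
    result = 1 % modulus  # handles modulus == 1
    base %= modulus
    while exponent:
        if exponent & 1:
            # Conventional multiplication followed by a reduction step.
            result = (result * base) % modulus
        # The reduced square becomes the input to the next multiplication.
        base = (base * base) % modulus
        exponent >>= 1
    return result
```

Reducing after every step keeps intermediate values below the square of the modulus, which is the overflow concern the abstract addresses in its own way.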
TL;DR: An 80-bit floating-point coprocessor which implements 24 vector/matrix instructions and 22 mathematical functions is described and the performance of 6.7 MFLOPS in the vector-matrix multiplication at 20 MHz has been attained by the use of parallel operations.
Abstract: An 80-bit floating-point coprocessor which implements 24 vector/matrix instructions and 22 mathematical functions is described. This processor can execute floating-point addition/rounding and pipelined multiplication concurrently, under the control of horizontal-type microinstructions. The SRT division method and CORDIC trigonometrical algorithm are used for a favorable cost/performance implementation. The performance of 6.7 MFLOPS in the vector-matrix multiplication at 20 MHz has been attained by the use of parallel operations. The vector/matrix instruction is about three times faster than conventional add and multiply instructions. The chip has been fabricated in 1.2-μm double-metal layer CMOS process containing 433000 transistors on an 11.6 × 14.9 mm$^2$ die.