TL;DR: The primary contribution of this paper is in introducing several state machine-based computational elements for performing sigmoid nonlinearity mappings, linear gain, and exponentiation functions, and describing an efficient method for the generation of, and conversion between, stochastic and deterministic binary signals.
Abstract: This paper examines a number of stochastic computational elements employed in artificial neural networks, several of which are introduced for the first time, together with an analysis of their operation. We briefly include multiplication, squaring, addition, subtraction, and division circuits in both unipolar and bipolar formats, the principles of which are well-known, at least for unipolar signals. We have introduced several modifications to improve the speed of the division operation. The primary contribution of this paper, however, is in introducing several state machine-based computational elements for performing sigmoid nonlinearity mappings, linear gain, and exponentiation functions. We also describe an efficient method for the generation of, and conversion between, stochastic and deterministic binary signals. The validity of the present approach is demonstrated in a companion paper through a sample application, the recognition of noisy optical characters using soft competitive learning. Network generalization capabilities of the stochastic network maintain a squared error within 10 percent of that of a floating-point implementation for a wide range of noise levels. While the accuracy of stochastic computation may not compare favorably with more conventional binary radix-based computation, the low circuit area, power, and speed characteristics may, in certain situations, make them attractive for VLSI implementation of artificial neural networks.
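The unipolar and bipolar multiplication circuits mentioned above can be illustrated with the textbook gate-level constructions: an AND gate for unipolar streams and an XNOR gate for bipolar streams. The following is a minimal sketch under that assumption, not the paper's circuits; the helper names, stream length, and seed are illustrative choices.

```python
import random

def to_stream(p, n, rng):
    """Encode a probability p in [0, 1] as n random bits with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def unipolar_value(bits):
    """Unipolar decoding: the value is the fraction of ones."""
    return sum(bits) / len(bits)

def bipolar_value(bits):
    """Bipolar decoding: the value is 2 * P(1) - 1, in [-1, 1]."""
    return 2.0 * sum(bits) / len(bits) - 1.0

def unipolar_multiply(a_bits, b_bits):
    """Bitwise AND of two independent unipolar streams multiplies their values."""
    return [a & b for a, b in zip(a_bits, b_bits)]

def bipolar_multiply(a_bits, b_bits):
    """Bitwise XNOR of two independent bipolar streams multiplies their values."""
    return [1 - (a ^ b) for a, b in zip(a_bits, b_bits)]

rng = random.Random(0)  # fixed seed so the run is repeatable
N = 100_000             # longer streams give lower variance

# Unipolar: 0.6 * 0.5, expected to come out near 0.3.
product_unipolar = unipolar_value(
    unipolar_multiply(to_stream(0.6, N, rng), to_stream(0.5, N, rng)))

# Bipolar: 0.4 * (-0.5), expected near -0.2; a value x is sent with P(1) = (x + 1) / 2.
product_bipolar = bipolar_value(
    bipolar_multiply(to_stream((0.4 + 1) / 2, N, rng),
                     to_stream((-0.5 + 1) / 2, N, rng)))
```

The results are estimates whose error shrinks roughly with the square root of the stream length, which is the accuracy trade-off the abstract alludes to.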
TL;DR: It is shown that the IEEE standard's specification of operations involving the signed infinities, signed zeros, and the exact/inexact flag are such as to make a correct and optimal implementation more efficient.
Abstract: We start with a mathematical definition of a real interval as a closed, connected set of reals. Interval arithmetic operations (addition, subtraction, multiplication, and division) are likewise defined mathematically and we provide algorithms for computing these operations assuming exact real arithmetic. Next, we define interval arithmetic operations on intervals with IEEE 754 floating point endpoints to be sound and optimal approximations of the real interval operations and we show that the IEEE standard's specification of operations involving the signed infinities, signed zeros, and the exact/inexact flag are such as to make a correct and optimal implementation more efficient. From the resulting theorems, we derive data that are sufficiently detailed to convert directly to a program for efficiently implementing the interval operations. Finally, we extend these results to the case of general intervals, which are defined as connected sets of reals that are not necessarily closed.
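A sound and optimal floating-point interval addition can be sketched as follows. Python does not expose the IEEE 754 inexact flag, so this sketch emulates it by comparing the rounded sum against the exact rational sum; that substitution, the function names, and the restriction to finite endpoints without overflow are assumptions of the example, not the paper's algorithms.

```python
import math
from fractions import Fraction

def add_down(a, b):
    """Largest float <= a + b (finite inputs, no overflow assumed)."""
    s = a + b  # rounded to nearest
    # Stand-in for the inexact flag: step down only if rounding went up.
    if Fraction(s) > Fraction(a) + Fraction(b):
        s = math.nextafter(s, -math.inf)
    return s

def add_up(a, b):
    """Smallest float >= a + b (finite inputs, no overflow assumed)."""
    s = a + b
    if Fraction(s) < Fraction(a) + Fraction(b):
        s = math.nextafter(s, math.inf)
    return s

def interval_add(x, y):
    """Add closed intervals x = (lo, hi) and y = (lo, hi) with outward rounding."""
    return (add_down(x[0], y[0]), add_up(x[1], y[1]))

result = interval_add((0.1, 0.2), (0.3, 0.4))
```

When a sum is exact, no widening happens, so the result is the tightest enclosing interval with floating-point endpoints. `math.nextafter` requires Python 3.9 or later.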
TL;DR: Results show that OEFs when used with the new inversion and multiplication algorithms provide a substantial performance increase over other reported methods.
Abstract: This contribution focuses on a class of Galois field used to achieve fast finite field arithmetic which we call an Optimal Extension Field (OEF), first introduced in [3]. We extend this work by presenting an adaptation of Itoh and Tsujii's algorithm for finite field inversion applied to OEFs. In particular, we use the facts that the action of the Frobenius map in GF(p^m) can be computed with only m-1 subfield multiplications and that inverses in GF(p) may be computed cheaply using known techniques. As a result, we show that one extension field inversion can be computed with a logarithmic number of extension field multiplications. In addition, we provide new extension field multiplication formulas which give a performance increase. Further, we provide an OEF construction algorithm together with tables of Type I and Type II OEFs along with statistics on the number of pseudo-Mersenne primes and OEFs. We apply this new work to provide implementation results using these methods to construct elliptic curve cryptosystems on both DEC Alpha workstations and Pentium-class PCs. These results show that OEFs when used with our new inversion and multiplication algorithms provide a substantial performance increase over other reported methods.
TL;DR: This work presents a commitment scheme allowing commitment to arbitrary size integers, based on any Abelian group with certain properties, most importantly that it is hard for the committer to compute its order.
Abstract: We present a commitment scheme allowing commitment to arbitrary size integers, based on any Abelian group with certain properties, most importantly that it is hard for the committer to compute its order. Potential examples include RSA and class groups. We also give efficient zero-knowledge protocols for proving knowledge of the contents of a commitment and for verifying multiplicative relations over the integers on committed values. This means that our scheme can support, for instance, the efficient interval proofs of Boudot [1]. The scheme can be seen as a modification and a generalization of an earlier scheme of Fujisaki and Okamoto [5], and in particular our results show that we can use a much larger class of RSA moduli than the safe prime products proposed in [5]. Also, we correct some mistakes in the proofs of [5] and give what appears to be the first multiplication protocol for a Fujisaki/Okamoto-like scheme with a complete proof of soundness.
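To make the idea of an integer commitment in a group of hidden order concrete, here is a toy sketch. It assumes a Pedersen-style form c = g^x · h^r mod n, which the abstract does not spell out; the modulus, the generators g and h, and the function names are hypothetical and far too small to be secure.

```python
def commit(x, r, g, h, n):
    """Toy commitment to a non-negative integer x with randomness r: g^x * h^r mod n."""
    return (pow(g, x, n) * pow(h, r, n)) % n

def verify(c, x, r, g, h, n):
    """Open a commitment by recomputing it from the revealed (x, r)."""
    return c == commit(x, r, g, h, n)

# Hypothetical toy parameters: n = 61 * 53; g and h are squares modulo n.
n = 3233
g, h = 4, 9

c1 = commit(42, 17, g, h, n)
c2 = commit(100, 5, g, h, n)

# Commitments of this form are additively homomorphic:
# multiplying them commits to the sum of the committed values.
c_sum = (c1 * c2) % n
```

In a real instantiation the committer must not know the group order, which is why the abstract points to RSA moduli and class groups; this sketch only shows the algebraic shape.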
TL;DR: A two-dimensional (2-D) integer discrete cosine transform is proposed, which needs only integer operations and shifts, is nonseparable, and requires far fewer operations than the corresponding row-column 2-D integer discrete cosine transform.
Abstract: A method is proposed to factor the type-II discrete cosine transform (DCT-II) into lifting steps and additions. After approximating the lifting matrices, we get a new type-II integer discrete cosine transform (IntDCT-II) that is free of floating-point multiplications. Based on the relationships among the various types of DCTs, we can generally factor any DCT into lifting steps and additions and then get four types of integer DCTs, which need no floating-point multiplications. By combining the polynomial transform and the one-dimensional (1-D) integer cosine transform, a two-dimensional (2-D) integer discrete cosine transform is proposed. The proposed transform needs only integer operations and shifts. Furthermore, it is nonseparable and requires far fewer operations than the corresponding row-column 2-D integer discrete cosine transform.
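The lifting idea can be illustrated on a single plane rotation, the basic building block of DCT factorizations. The sketch below uses the standard three-step lifting factorization of a rotation with coefficients quantized to fixed point, so only integer multiplies, adds, and shifts remain. It is not the paper's IntDCT-II; the precision `K` and the function names are illustrative assumptions.

```python
import math

K = 12  # fixed-point fraction bits for the lifting coefficients (assumed)

def _quantize(c):
    """Approximate a real coefficient by an integer multiple of 2**-K."""
    return round(c * (1 << K))

def _lift(v, c_fixed):
    """Integer approximation of c * v using a multiply, an add, and a shift."""
    return (c_fixed * v + (1 << (K - 1))) >> K

def lifting_rotate(x, y, theta):
    """Integer-to-integer approximation of rotating (x, y) by theta (sin(theta) != 0)."""
    p = _quantize((math.cos(theta) - 1.0) / math.sin(theta))
    u = _quantize(math.sin(theta))
    x = x + _lift(y, p)  # first lifting step
    y = y + _lift(x, u)  # second lifting step
    x = x + _lift(y, p)  # third lifting step
    return x, y

def lifting_rotate_inverse(x, y, theta):
    """Exact inverse: undo the three lifting steps in reverse order."""
    p = _quantize((math.cos(theta) - 1.0) / math.sin(theta))
    u = _quantize(math.sin(theta))
    x = x - _lift(y, p)
    y = y - _lift(x, u)
    x = x - _lift(y, p)
    return x, y
```

Each step changes one coordinate by a function of the other, so the transform is exactly invertible on integers even though every step is rounded.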
TL;DR: A package that contains procedures for automatic manipulations and transformations of univariate holonomic functions and sequences within the computer algebra system Mathematica is implemented and some different techniques for proving holonomic identities are described.
Abstract: Holonomic functions and sequences have the property that they can be represented by a finite amount of information. Moreover, these holonomic objects are closed under elementary operations like, for instance, addition or (termwise and Cauchy) multiplication. These (and other) operations can also be performed “algorithmically”. As a consequence, we can prove any identity of holonomic functions or sequences automatically. Based on this theory, the author implemented a package that contains procedures for automatic manipulations and transformations of univariate holonomic functions and sequences within the computer algebra system Mathematica. This package is introduced in detail. In addition, we describe some different techniques for proving holonomic identities.
TL;DR: In this paper, a parallel counter comprises logic for generating output bits as symmetrical functions of the input bits and a multiplication circuit is also provided in which an array of combinations of each bit of a binary number with each other bit of another binary number is generated having a reduced form.
Abstract: A parallel counter comprises logic for generating output bits as symmetrical functions of the input bits. The parallel counter can be used in a multiplication circuit. A multiplication circuit is also provided in which an array of combinations of each bit of a binary number with each other bit of another binary number is generated having a reduced form in order to reduce the steps required in array reduction.
TL;DR: In this article, the authors consider a network of processors among which elements in a finite field K can be verifiably shared in a constant number of rounds, and show how the network can securely, efficiently and in constant-round compute determinant, characteristic polynomial, rank, and the solution space of linear systems of equations.
Abstract: Consider a network of processors among which elements in a finite field K can be verifiably shared in a constant number of rounds. Assume furthermore constant-round protocols are available for generating random shared values, for secure multiplication and for addition of shared values. These requirements can be met by known techniques in all standard models of communication. In this model we construct protocols allowing the network to securely solve standard computational problems in linear algebra. In particular, we show how the network can securely, efficiently and in constant-round compute determinant, characteristic polynomial, rank, and the solution space of linear systems of equations. Constant round solutions follow for all problems which can be solved by direct application of such linear algebraic methods, such as deciding whether a graph contains a perfect match. If the basic protocols (for shared random values, addition and multiplication) we start from are unconditionally secure, then so are our protocols. Our results offer solutions that are significantly more efficient than previous techniques for secure linear algebra, they work for arbitrary fields and therefore extend the class of functions previously known to be computable in constant round and with unconditional security. In particular, we obtain an unconditionally secure protocol for computing a function f in constant round, where the protocol has complexity polynomial in the span program size of f over an arbitrary finite field.
TL;DR: In this paper, a multiplication block for a reconfigurable chip includes multiple multiplication units and a group of the selectable adder units operably interconnectable with the multiplication units.
Abstract: A multiplication block for a reconfigurable chip includes multiple multiplication units and a group of the selectable adder units operably interconnectable with the multiplication units. The adder units can be selectively connected for different configurations. The multiplication block is preferably controlled by an instruction which can put the multiplication block into different configurations.
TL;DR: Testing whether people represent analogous principles for each arithmetic operation showed that operations with longer developmental histories had strong principle representations, and the implications for a structure-mapping approach to mathematical problem solving are discussed.
Abstract: Current theories of mathematical problem solving propose that people select a mathematical operation as the solution to a problem on the basis of a structure mapping between their problem representation and the representation of the mathematical operations. The structure-mapping hypothesis requires that the problem and the mathematical representations contain analogous relations. Past research has demonstrated that the problem representation consists of functional relationships, or principles. The present study tested whether people represent analogous principles for each arithmetic operation (i.e., addition, subtraction, multiplication, and division). For each operation, college (Experiments 1 and 2) and 8th grade (Experiment 2) participants were asked to rate the degree to which a set of completed problems was a good attempt at the operation. The pattern of presented answers either violated one of four principles or did not violate any principles. The distance of the presented answers from the correct answers was independently manipulated. Consistent with the hypothesis that people represent the principles, (1) violations of the principles were rated as poorer attempts at the operation, (2) operations that are learned first (e.g., addition) had more extensive principle representations than did operations learned later (multiplication), and (3) principles that are more frequently in evidence developed more quickly. In Experiment 3, college participants rated the degree to which statements were indicative of each operation. The statements were either consistent or inconsistent with one of two principles. The participants’ ratings showed that operations with longer developmental histories had strong principle representations. The implications for a structure-mapping approach to mathematical problem solving are discussed.
TL;DR: This paper analyzes the performance of the sparse matrix–vector product with symmetric matrices originating from the FEM and describes techniques that lead to a fast implementation.
Abstract: The sparse matrix–vector product is an important computational kernel that runs ineffectively on many computers with super-scalar RISC processors. In this paper we analyse the performance of the sparse matrix–vector product with symmetric matrices originating from the FEM and describe techniques that lead to a fast implementation. It is shown how these optimisations can be incorporated into an efficient parallel implementation using message-passing. We conduct numerical experiments on many different machines and show that our optimisations speed up the sparse matrix–vector multiplication substantially.
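One common way to exploit symmetry in a sparse matrix–vector product is to store only the lower triangle and let each off-diagonal entry contribute twice. The abstract does not say which optimisations the authors use, so the following is only an illustrative sketch; the CSR layout and the function name are assumptions.

```python
def symmetric_spmv(n, row_ptr, col_idx, values, x):
    """y = A x for symmetric A, with only the lower triangle (incl. diagonal) stored in CSR."""
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            a = values[k]
            y[i] += a * x[j]      # contribution of A[i][j]
            if j != i:
                y[j] += a * x[i]  # mirrored contribution of A[j][i]
    return y

# A = [[4, 1, 0],
#      [1, 3, 2],
#      [0, 2, 5]], stored as its lower triangle only.
row_ptr = [0, 1, 3, 5]
col_idx = [0, 0, 1, 1, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = symmetric_spmv(3, row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Each stored entry is loaded once but used for two updates, which roughly halves the memory traffic for the matrix data.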
TL;DR: This work presents and compares various algorithms, including a new one, for performing multiplications by integer constants using elementary operations; such algorithms are useful in several problems, such as the Toom-Cook-like algorithms to multiply large multiple-precision integers.
Abstract: We present and compare various algorithms, including a new one, for performing multiplications by integer constants using elementary operations. Such algorithms are useful, as they occur in several problems, such as the Toom-Cook-like algorithms to multiply large multiple-precision integers, the approximate computation of consecutive values of a polynomial, and the generation of integer multiplications by compilers.
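As a baseline for what multiplication by a constant using elementary operations means, the sketch below uses the plain binary method: one shift and one addition per set bit of the constant. It is not the paper's new algorithm, and the function names are illustrative.

```python
def shift_add_plan(c):
    """Shift amounts s such that the sum of (x << s) equals c * x, for a constant c >= 0."""
    return [s for s in range(c.bit_length()) if (c >> s) & 1]

def multiply_by_constant(x, c):
    """Multiply x by the non-negative constant c using only shifts and additions."""
    return sum(x << s for s in shift_add_plan(c))

# 10 = 0b1010, so 10 * x = (x << 1) + (x << 3).
plan_for_ten = shift_add_plan(10)
```

More refined methods reduce the operation count further, for example by also allowing subtractions or by reusing intermediate results.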
TL;DR: This contribution describes a methodology used to efficiently implement elliptic curves (EC) over GF(p) on the 16-bit TI MSP430x33x family of low-cost microcontrollers.
Abstract: This contribution describes a methodology used to efficiently implement elliptic curves (EC) over GF(p) on the 16-bit TI MSP430x33x family of low-cost microcontrollers. We show that it is possible to implement EC cryptosystems in highly constrained embedded systems and still obtain acceptable performance at low cost. We modified the EC point addition and doubling formulae to reduce the number of intermediate variables while at the same time allowing for flexibility. We used a Generalized-Mersenne prime to implement the arithmetic in the underlying field. We take advantage of the special form of the moduli to minimize the number of precomputations needed to implement inversion via Fermat's little theorem and the k-ary method of exponentiation. We apply these ideas to an implementation of an elliptic curve system over GF(p), where p = 2^128 - 2^97 - 1. We show that a scalar point multiplication can be achieved in 3.4 seconds without any stored/precomputed values and the processor clocked at 1 MHz.
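Inversion via Fermat's little theorem computes a^(p-2) mod p, and the k-ary method processes the exponent k bits at a time using a small table of powers. The sketch below is a generic left-to-right k-ary exponentiation applied to the prime p = 2^128 - 2^97 - 1; it does not reproduce the paper's use of the special form of p to shrink the precomputation, and the function names and window size are assumptions.

```python
def kary_pow(base, exp, mod, k=4):
    """Left-to-right k-ary exponentiation: base**exp % mod for exp >= 0."""
    # Precompute base**0 .. base**(2**k - 1) modulo mod.
    table = [1 % mod]
    for _ in range((1 << k) - 1):
        table.append(table[-1] * base % mod)
    # Split the exponent into k-bit digits, least significant first.
    digits = []
    e = exp
    while e:
        digits.append(e & ((1 << k) - 1))
        e >>= k
    result = 1 % mod
    for d in reversed(digits):
        for _ in range(k):              # k squarings per digit
            result = result * result % mod
        result = result * table[d] % mod  # one table multiplication per digit
    return result

P = 2**128 - 2**97 - 1  # Generalized-Mersenne prime from the abstract

def fermat_inverse(a, p=P):
    """Inverse of a modulo the prime p via a**(p - 2) mod p (a not divisible by p)."""
    return kary_pow(a % p, p - 2, p)

inv = fermat_inverse(123456789)
```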
TL;DR: An equivalent bit conversion algorithm (EBCA) is proposed to eliminate the need for final carry propagation in the redundant binary to normal binary (NB) conversion step for RB multiplication, and the entire multiplication process can be made free of carry propagation from input to output.
Abstract: An equivalent bit conversion algorithm (EBCA) is proposed to eliminate the need for final carry propagation in the redundant binary (RB) to normal binary (NB) conversion step for RB multiplication. The multiplication process helps with the carry-free conversion step by eliminating certain combinations of RB product. When the EBCA is applied, conventional power-consuming carry-propagating adders are replaced by simple, minimum-sized carry-free converters, and the entire multiplication process can be made free of carry propagation from input to output. The method employed in this work reduces 40% of the total power and 30% of the total multiplication time in the final adder stage of traditional multipliers. The prototype fabricated in 0.35-μm CMOS demonstrates that the 54 b × 54 b multiplier consumes only 53.4 mW at 3.3 V for 74-MHz operation.
TL;DR: This paper summarizes and presents all distributive relations, known by now, on multiplication and addition of generalized (proper and improper) intervals.
Abstract: The arithmetic on an extended set of proper and improper intervals presents algebraic completion of the conventional interval arithmetic allowing thus efficient solution of some interval algebraic problems. In this paper we summarize and present all distributive relations, known by now, on multiplication and addition of generalized (proper and improper) intervals.
TL;DR: The authors introduced a q-generalization of sine and cosine functions, related to the ϑ functions, but possessing addition and multiplication formulas more analogous to those of ordinary sin and cos.
Abstract: We introduce a q-generalization of the sine and cosine functions, related to the ϑ functions, but (as revealed by computer experiments) possessing addition and multiplication formulas more analogous to those of ordinary sin and cos. These formulas then contribute identities to ϑ theory, and hint of a more natural formulation of ϑ functions as outgrowths of elementary functions. Nevertheless, this paper can be read without knowledge of ϑ functions; it was certainly written that way.
TL;DR: In this article, the authors investigated how some of the new features of the Xilinx Virtex FPGA may be used to support efficient and optimised implementation of matrix product based on Multiply and Accumulate (MAC); such operations are frequently used in signal applications.
Abstract: This paper investigates how some of the new features of the Xilinx Virtex FPGA may be used to support efficient and optimised implementation of matrix product based on Multiply and Accumulate (MAC); such operations are frequently used in signal applications. The principal new features that have been investigated are the Block RAM and the fully digital Delay-Locked Loop (DLL). The approach used for the matrix multiplication algorithm employs the idea used in the modified Booth encoder multiplication using Wallace Trees addition. Preliminary performance results and comparisons with similar algorithms implemented on multi-FPGA platforms have shown better performance for the proposed architecture.
TL;DR: In this article, a multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed, where the multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form.
Abstract: A multiplier capable of performing signed and unsigned scalar and vector multiplication is disclosed. The multiplier is configured to receive signed or unsigned multiplier and multiplicand operands in scalar or packed vector form. An effective sign for the multiplier and multiplicand operands may be calculated and used to create and select a number of partial products according to Booth's algorithm. Once the partial products have been created and selected, they may be summed and the results may be output. The results may be signed or unsigned, and may represent vector or scalar quantities. When a vector multiplication is performed, the multiplier may be configured to generate and select partial products so as to effectively isolate the multiplication process for each pair of vector components. The multiplier may also be configured to sum the products of the vector components to form the vector dot product. The final product may be output in segments so as to require fewer bus lines. The segments may be rounded by adding a rounding constant. Rounding and normalization may be performed in two paths, one assuming an overflow will occur, the other assuming no overflow will occur. The multiplier may also be configured to perform iterative calculations to evaluate constant powers of an operand. Intermediate products that are formed may be rounded and normalized in two paths and then compressed and stored for use in the next iteration. An adjustment constant may also be added to increase the frequency of exactly rounded results.
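Creating and selecting partial products according to Booth's algorithm can be illustrated with the common radix-4 (modified Booth) recoding, which turns an n-bit two's-complement multiplier into n/2 digits in {-2, -1, 0, 1, 2}. The abstract does not state the radix or the hardware details, so this is a sketch under that assumption with illustrative names.

```python
def booth_radix4_digits(y, n):
    """Recode an n-bit two's-complement integer y (n even) into n/2 radix-4 Booth digits.

    Digits are returned least significant first; each lies in {-2, -1, 0, 1, 2}.
    """
    if n % 2:
        raise ValueError("n must be even")
    bits = y & ((1 << n) - 1)  # two's-complement bit pattern of y

    def bit(i):
        return ((bits >> i) & 1) if i >= 0 else 0

    # Digit i inspects the overlapping bit triple (2i+1, 2i, 2i-1).
    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1) for i in range(n // 2)]

def booth_multiply(x, y, n):
    """Multiply x by the n-bit signed y by summing one partial product per Booth digit."""
    return sum(d * x * 4**i for i, d in enumerate(booth_radix4_digits(y, n)))
```

Each digit selects a partial product from 0, ±x, or ±2x, all of which are cheap to form in hardware, and halving the number of partial products is what shortens the summation stage.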
TL;DR: New techniques for efficient software implementation of binary field multiplication in normal basis are presented, which are more efficient in terms of both speed and memory compared with alternative approaches.
Abstract: Finite field arithmetic is becoming increasingly important in today's computer systems, particularly for implementing cryptographic operations. Among various arithmetic operations, finite field multiplication is of particular interest since it is a major building block for elliptic curve cryptosystems. In this paper, we present new techniques for efficient software implementation of binary field multiplication in normal basis. Our techniques are more efficient in terms of both speed and memory compared with alternative approaches.
TL;DR: A compact encoding of non-adjacent representations that allows the exponent recoding step to be skipped, and a straightforward technique for picking random numbers that already satisfy the non-adjacence property, are proposed.
Abstract: Techniques for fast exponentiation (multiplication) in various groups have been extensively studied for use in cryptographic primitives. Specifically, the coding of the exponent (multiplier) plays an important role in the performance of the algorithms used. The crucial optimization relies in general on minimizing the Hamming weight of the exponent (multiplier). This can be performed optimally with nonadjacent representations. This paper introduces a compact encoding of non-adjacent representations that allows the exponent recoding step to be skipped. Furthermore, a straightforward technique for picking random numbers that already satisfy the non-adjacence property is proposed. Several examples of application are given, in particular in the context of scalar multiplication on elliptic curves.
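For reference, the standard recoding step that the paper aims to skip can be sketched as below: it converts a non-negative integer into its non-adjacent form (NAF), a signed-digit representation with digits in {-1, 0, 1} and no two adjacent nonzero digits. This is the textbook algorithm, not the paper's compact encoding; the function names are illustrative.

```python
def naf(k):
    """Non-adjacent form of k >= 0 as a list of digits in {-1, 0, 1}, least significant first."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)  # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits

def hamming_weight(digits):
    """Number of nonzero digits, i.e. the additions needed in double-and-add."""
    return sum(1 for d in digits if d != 0)

# 7 = 8 - 1, so its NAF has two nonzero digits instead of three.
naf_of_seven = naf(7)
```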
TL;DR: In this paper, a parallel array VLIW digital signal processor is employed along with specialized complex long multiplication instructions and communication operations between the processing elements which are overlapped with computation to provide very high performance operation.
Abstract: Efficient computation of complex long multiplication results and an efficient calculation of a covariance matrix are described. A parallel array VLIW digital signal processor is employed along with specialized complex long multiplication instructions and communication operations between the processing elements which are overlapped with computation to provide very high performance operation. Successive iterations of a loop of tightly packed VLIWs may be used allowing the complex multiplication pipeline hardware to be efficiently used.
TL;DR: In this paper, a vector-level algorithm for normal basis multiplication over the extended binary field GF(2^m) is presented, which essentially eliminates the bit-wise inner products needed in the conventional approach.
Abstract: For cryptographic applications, normal bases have received considerable attention, especially for hardware implementation. In this article, we consider fast software algorithms for normal basis multiplication over the extended binary field GF(2^m). We present a vector-level algorithm which essentially eliminates the bit-wise inner products needed in the conventional approach to the normal basis multiplication. We then present another algorithm which significantly reduces the dynamic instruction counts. Both algorithms utilize the full width of the data-path of the general purpose processor on which the software is to be executed. We also consider composite fields and present an algorithm which can provide further speed-up and an added flexibility toward hardware-software co-design of processors for very large finite fields.
TL;DR: Novel encoding schemes for the implementation of higher radix multiplication are proposed, which provide more flexible multiplier designs that can be implemented in shorter pipeline stages, and the proposed designs are compared with multipliers that use traditional Booth recoding.
Abstract: Multipliers are used at many different places in microprocessor design. As the non-memory sub-blocks of the microprocessor with the largest size and delay, multipliers have a big impact on the cycle time of the microprocessor. Targeting deeper pipelines and higher clock frequencies, there is a growing demand for multiplier designs that can be split into shorter stages. For this purpose, the use of Booth recoding has been a popular method to cut down the number of partial products in a multiplier to reduce the delay of the partial product accumulation and to simplify the partition of the multiplier into several shorter stages. The complexity to pre-compute an increasing number of digit multiples of the multiplicand within the multiplier unit limits the use of Booth recoding mainly to radices 4 and 8. We propose novel encoding schemes for the implementation of higher radix multiplication. In particular we consider multiplication radix-32 and radix-256. The features provide more flexible multiplier designs that can be implemented in shorter pipeline stages. We compare the proposed designs with multipliers that use traditional Booth recoding.
TL;DR: A flexible multiplication unit and configurable carry logic circuitry suitable for incorporation into an FPGA logic block are proposed, based on a modified carry-save adder that efficiently supports multiplication, addition and multiply accumulate operations in serial or parallel form.
Abstract: FPGAs are increasingly being applied to DSP applications but are often inefficient in space and time compared with dedicated DSP chips, particularly for multiplication-based operations. To improve FPGA arithmetic performance, a flexible multiplication unit and configurable carry logic circuitry suitable for incorporation into an FPGA logic block are proposed. The multiplier unit is based on a modified carry-save adder and along with the carry logic circuitry efficiently supports multiplication, addition and multiply accumulate operations in serial or parallel form. Preliminary results indicate logic utilization for a multiplier implementation in such an FPGA is approximately a third that of the XC 4000 architecture and half that of the Virtex architecture. Propagation delays are also reduced due to the use of dedicated inter-block interconnect for all sum and carry signals and flexible routing multiplexers.
TL;DR: In this paper, a stronger lower bound of 2^{n/2}/61 was proven by a new technique, using a recently found universal family of hash functions, and a first non-trivial upper bound of 7/3 · 2^{4n/3} for the OBDD size of MUL_{n-1,n} was provided.
Abstract: Ordered binary decision diagrams (OBDDs) nowadays belong to the most common representation types for Boolean functions. Although they allow important operations such as satisfiability test and equality test to be performed efficiently, their limitation lies in the fact that they may require exponential size for important functions. Bryant [8] has shown that any OBDD-representation of the function MUL_{n-1,n}, which computes the middle bit of the product of two n-bit numbers, requires at least 2^{n/8} nodes. In this paper a stronger bound of 2^{n/2}/61 is proven by a new technique, using a recently found universal family of hash functions [23]. As a result, one cannot hope anymore to find reasonably small OBDDs even for the multiplication of relatively short integers, since for only a 64-bit multiplication millions of nodes are required. Further, a first non-trivial upper bound of 7/3 · 2^{4n/3} for the OBDD size of MUL_{n-1,n} is provided.
TL;DR: A suite of optimizations for eliminating division, modulo, and remainder operations from programs are described, analogous to strength reduction techniques used for multiplications.
Abstract: Integer division, modulo, and remainder operations are expressive and useful operations. They are logical candidates to express complex data accesses such as the wrap-around behavior in queues using ring buffers. In addition, they appear frequently in address computations as a result of compiler optimizations that improve data locality, perform data distribution, or enable parallelization. Experienced application programmers, however, avoid them because they are slow. Furthermore, while advances in both hardware and software have improved the performance of many parts of a program, few are applicable to division and modulo operations. This trend makes these operations increasingly detrimental to program performance.
This paper describes a suite of optimizations for eliminating division, modulo, and remainder operations from programs. These techniques are analogous to strength reduction techniques used for multiplications. In addition to some algebraic simplifications, we present a set of optimization techniques that eliminates division and modulo operations that are functions of loop induction variables and loop constants. The optimizations rely on algebra, integer programming, and loop transformations.
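The simplest case of eliminating a modulo that is a function of a loop induction variable is the ring-buffer index mentioned above: the per-iteration `%` is replaced by an incremented counter with a wrap-around test. The sketch below shows the before and after forms by hand; it is an illustration of the idea, not the paper's compiler transformations, and the function names are made up for the example.

```python
def ring_indices_with_mod(start, count, size):
    """Reference version: one modulo operation per loop iteration."""
    return [(start + i) % size for i in range(count)]

def ring_indices_strength_reduced(start, count, size):
    """Same index sequence with the modulo hoisted out of the loop.

    Assumes start >= 0 and size > 0.
    """
    out = []
    idx = start % size  # single modulo, executed once before the loop
    for _ in range(count):
        out.append(idx)
        idx += 1
        if idx == size:  # wrap-around replaces the per-iteration modulo
            idx = 0
    return out
```

The rewrite is valid because the index advances by exactly one per iteration, so it can cross the buffer boundary at most once per step.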
TL;DR: A Frobenius manifold is a complex manifold with a multiplication and a metric on the holomorphic tangent bundle and two distinguished vector fields satisfying a series of natural conditions.
Abstract: A Frobenius manifold is a complex manifold with a multiplication and a metric on the holomorphic tangent bundle and two distinguished vector fields, satisfying a series of natural conditions.