TL;DR: To accelerate multiple-precision multiplication, a new algorithm that reduces the number of memory accesses is proposed, and elliptic curve point multiplication for 160-bit, 192-bit, and 224-bit NIST/SECG curves over GF(p), RSA-1024, and RSA-2048 are implemented on two 8-bit microcontrollers.
Abstract: Strong public-key cryptography is often considered to be too computationally expensive for small devices if not accelerated by cryptographic hardware. We revisited this statement and implemented elliptic curve point multiplication for 160-bit, 192-bit, and 224-bit NIST/SECG curves over GF(p) and RSA-1024 and RSA-2048 on two 8-bit microcontrollers. To accelerate multiple-precision multiplication, we propose a new algorithm to reduce the number of memory accesses.
TL;DR: In this paper, an Atmel ATmega128 at 8 MHz was used to implement ECC point multiplication over fields using pseudo-Mersenne primes as standardized by NIST and SECG.
Abstract: Strong public-key cryptography is often considered to be too computationally expensive for small devices if not accelerated by cryptographic hardware. We revisited this statement and implemented elliptic curve point multiplication for 160-bit, 192-bit, and 224-bit NIST/SECG curves over GF(p) and RSA-1024 and RSA-2048 on two 8-bit microcontrollers. To accelerate multiple-precision multiplication, we propose a new algorithm to reduce the number of memory accesses. Implementation and analysis led to three observations: 1. Public-key cryptography is viable on small devices without hardware acceleration. On an Atmel ATmega128 at 8 MHz we measured 0.81 s for 160-bit ECC point multiplication and 0.43 s for an RSA-1024 operation with exponent e = 2^16 + 1. 2. The relative performance advantage of ECC point multiplication over RSA modular exponentiation increases with the decrease in processor word size and the increase in key size. 3. Elliptic curves over fields using pseudo-Mersenne primes as standardized by NIST and SECG allow for high performance implementations and show no performance disadvantage over optimal extension fields or prime fields selected specifically for a particular processor architecture.
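The abstract does not spell out the proposed algorithm, so the following is only a minimal sketch of ordinary product-scanning (column-wise) multiple-precision multiplication on 8-bit words, the kind of routine whose memory traffic such an algorithm would target. It is not the paper's method, and all function names are hypothetical.

```python
WORD_BITS = 8                      # assumed word size of an 8-bit microcontroller
WORD_MASK = (1 << WORD_BITS) - 1


def to_words(x, n):
    """Split a non-negative integer into n little-endian 8-bit words."""
    return [(x >> (WORD_BITS * i)) & WORD_MASK for i in range(n)]


def from_words(words):
    """Recombine little-endian words into a Python integer."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))


def product_scanning_mul(a, b):
    """Multiply two equal-length word arrays one result column at a time.

    Each result word is written exactly once; the running accumulator
    carries the overflow of one column into the next.
    """
    n = len(a)
    result = [0] * (2 * n)
    acc = 0
    for col in range(2 * n - 1):
        lo = max(0, col - n + 1)
        hi = min(col, n - 1)
        for i in range(lo, hi + 1):
            acc += a[i] * b[col - i]   # all partial products of this column
        result[col] = acc & WORD_MASK
        acc >>= WORD_BITS
    result[2 * n - 1] = acc            # the final carry fits in one word
    return result
```

For a 160-bit operand, `to_words(x, 20)` yields the 20 words that the routine expects; `from_words` turns the 40-word result back into an integer for checking.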
TL;DR: An in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices O(n) times, finds even near-optimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches.
Abstract: Utilizing graphics hardware for general purpose numerical computations has become a topic of considerable interest. The implementation of streaming algorithms, typified by highly parallel computations with little reuse of input data, has been widely explored on GPUs. We relax the streaming model's constraint on input reuse and perform an in-depth analysis of dense matrix-matrix multiplication, which reuses each element of input matrices O(n) times. Its regular data access pattern and highly parallel computational requirements suggest matrix-matrix multiplication as an obvious candidate for efficient evaluation on GPUs but, surprisingly, we find even near-optimal GPU implementations are pronouncedly less efficient than current cache-aware CPU approaches. We find the key cause of this inefficiency is that the GPU can fetch less data and yet execute more arithmetic operations per clock than the CPU when both are operating out of their closest caches. The lack of high bandwidth access to cached data will impair the performance of GPU implementations of any computation featuring significant input reuse.
TL;DR: A careful analysis of elliptic curve point multiplication methods that use the point halving technique of Knudsen and Schroeppel is presented and an algorithm of Knuth is adapted to allow efficient use of projective coordinates with halving-based windowing methods for point multiplication.
Abstract: We present a careful analysis of elliptic curve point multiplication methods that use the point halving technique of Knudsen and Schroeppel and compare these methods to traditional algorithms that use point doubling. The performance advantage of halving methods is clearest in the case of point multiplication kP, where P is not known in advance and smaller field inversion to multiplication ratios generally favor halving. Although halving essentially operates on affine coordinate representations, we adapt an algorithm of Knuth to allow efficient use of projective coordinates with halving-based windowing methods for point multiplication.
TL;DR: This work provides simple, faster algorithms for the detection of cliques and dominating sets of fixed order based on reductions to rectangular matrix multiplication and an improved algorithm for diamonds detection.
TL;DR: A sequential implementation of the algorithm, with a control unit which allows the independent computation of logarithm and exponential, is proposed and the execution times and hardware requirements are estimated for single and double-precision floating-point computations.
Abstract: An architecture for the computation of logarithm, exponential, and powering operations is presented in this paper, based on a high-radix composite algorithm for the computation of the powering function (X^Y). The algorithm consists of a sequence of overlapped operations: 1) digit-recurrence logarithm, 2) left-to-right carry-free (LRCF) multiplication, and 3) online exponential. A redundant number system is used and the selection in 1) and 3) is done by rounding except in the first iteration, when selection by table look-up is necessary to guarantee the convergence of the recurrences. A sequential implementation of the algorithm, with a control unit which allows the independent computation of logarithm and exponential, is proposed and the execution times and hardware requirements are estimated for single and double-precision floating-point computations. These estimates are obtained for radices from r=8 to r=1,024, according to an approximate model for the delay and area of the main logic blocks, and help determine the radix values which lead to the most efficient implementations: r=32 and r=128.
TL;DR: A high performance elliptic curve coprocessor is developed, which is optimized for a binary field recommended by NIST, and an efficient implementation of Itoh and Tsujii's method for inversion with performance comparable to the extended Euclidean algorithm is used.
Abstract: A high performance elliptic curve coprocessor is developed, which is optimized for a binary field recommended by NIST. The architecture uses a field multiplier capable of performing a field multiplication over the extension field with degree 163 in 0.060 μsec. The coprocessor uses Lopez and Dahab's projective coordinate system and is optimized specifically for Koblitz curves. An efficient implementation of Itoh and Tsujii's method for inversion with performance comparable to the extended Euclidean algorithm is used. A prototype of the processor has been implemented for the binary extension field with degree 163 on a Xilinx XCV2000E FPGA. The prototype runs at 66 MHz and performs an elliptic curve scalar multiplication in 0.233 msec on a generic curve and 0.075 msec on a Koblitz curve.
TL;DR: It has been demonstrated that implementing the array and Booth multipliers as 4x4 modules in the proposed architecture leads to a considerable improvement in their efficiency.
Abstract: An N x N bit parallel overlay multiplier architecture is designed for high speed DSP operations. The architecture is based on the vertical and crosswise algorithm of ancient Indian Vedic Mathematics. In the proposed architecture, the bits of both the multiplicand and the multiplier are grouped 4 at a time. Thus the whole multiplication operation is decomposed into 4x4 bit multiplication modules. The 4x4 multiplication modules can be implemented using any multiplier, such as an array, Booth, or Wallace multiplier, or any efficient multiplier proposed in the future. It has been demonstrated that implementing the array and Booth multipliers as 4x4 modules in the proposed architecture leads to a considerable improvement in their efficiency. Due to its efficient performance, the overlay architecture is a boon for DSP applications such as multimedia and image processing. In order to test the effect of further decomposition of the bits on the efficiency of the architecture, the 4x4 multiply modules are further decomposed into parallel 2x2 multiply modules by grouping two bits at a time for both the 4-bit multiplicand and multiplier. The results show that the efficiency nearly reaches a saturation level at the 4x4 decomposition and that further decomposition does not significantly improve the efficiency of the architecture.
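The decomposition into 4x4 modules can be illustrated in software. The sketch below is an assumption-laden model, not the paper's hardware design: it splits both operands into 4-bit groups, sends every pair of groups through a stand-in `mul4x4` module, and sums the shifted partial products. Both function names are hypothetical.

```python
def mul4x4(a, b):
    """Stand-in for one 4x4-bit multiplier module (array, Booth, Wallace, ...)."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a * b


def nibble_decomposed_mul(x, y, n_bits):
    """Multiply two n_bits-wide unsigned integers using only 4x4 products.

    n_bits must be a multiple of 4. Group i of x and group j of y
    contribute their 4x4 product at bit weight 4 * (i + j).
    """
    assert n_bits % 4 == 0
    assert 0 <= x < (1 << n_bits) and 0 <= y < (1 << n_bits)
    groups = n_bits // 4
    xs = [(x >> (4 * i)) & 0xF for i in range(groups)]
    ys = [(y >> (4 * j)) & 0xF for j in range(groups)]
    total = 0
    for i, xd in enumerate(xs):
        for j, yd in enumerate(ys):
            total += mul4x4(xd, yd) << (4 * (i + j))
    return total
```

In hardware the 4x4 products would be computed in parallel and combined by an adder network; the nested loop here only models which products are formed and where they are placed.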
TL;DR: Using the scaled modulus technique and the specialized inversion algorithm, an elliptic curve processor architecture is developed that successfully utilizes redundant representation of elements in GF(p) and provides a low-power, high speed, and small footprint specialized elliptic curve implementation.
Abstract: We introduce new modulus scaling techniques for transforming a class of primes into special forms which enable efficient arithmetic. The scaling technique may be used to improve multiplication and inversion in finite fields. We present an efficient inversion algorithm that utilizes the structure of the scaled modulus. Our inversion algorithm exhibits superior performance to the Euclidean algorithm and lends itself to efficient hardware implementation due to its simplicity. Using the scaled modulus technique and our specialized inversion algorithm, we develop an elliptic curve processor architecture. The resulting architecture successfully utilizes redundant representation of elements in GF(p) and provides a low-power, high speed, and small footprint specialized elliptic curve implementation.
TL;DR: A general-purpose processor architecture for accelerating public-key computations on server systems that demand high performance and flexibility to accommodate large numbers of secure connections with heterogeneous clients that are likely to be limited in the set of cryptographic algorithms supported is described.
Abstract: We describe a general-purpose processor architecture for accelerating public-key computations on server systems that demand high performance and flexibility to accommodate large numbers of secure connections with heterogeneous clients that are likely to be limited in the set of cryptographic algorithms supported. Flexibility is achieved in that the processor supports multiple public-key cryptosystems, namely RSA, DSA, DH, and ECC, arbitrary key sizes and, in the case of ECC, arbitrary curves over fields GF(p) and GF(2^m). At the core of the processor is a novel dual-field multiplier based on a modified carry-save adder (CSA) tree that supports both GF(p) and GF(2^m). In the case of a 64-bit integer multiplier, the necessary modifications increase its size by a mere 5%. To efficiently schedule the multiplier, we implemented a multiply-accumulate instruction that combines several steps of a multiple-precision multiplication in a single operation: multiplication, carry propagation, and partial product accumulation. We have developed a hardware prototype of the cryptographic processor in FPGA technology. If implemented in current 1.5 GHz processor technology, the processor executes 5,265 RSA-1024 op/s and 25,756 ECC-163 op/s - the given key sizes offer comparable security strength. Looking at future security levels, performance is 786 op/s for RSA-2048 and 9,576 op/s for ECC-233.
TL;DR: This book discusses teaching mathematics to deaf children: how the story began and the happy ending, as well as connecting multiplication, division and many other mathematical ideas.
Abstract: Preface. Acknowledgements. Conventions used for descriptions of interviews. Chapter 1 Introduction. Chapter 2 Counting and its creative uses. Chapter 3 Additive reasoning: connecting addition and subtraction. Chapter 4 Reading and writing numbers. Chapter 5 Multiplicative reasoning: connecting multiplication, division and many other mathematical ideas. Chapter 6 Teaching mathematics to deaf children: how the story began and the happy ending. References. Index.
TL;DR: An iterative decimal multiplier is presented, which operates at high clock frequencies and scales well to large operand sizes and uses a new decimal representation for intermediate products, which allows for a very fast two-stage iterative multiplier design.
Abstract: Decimal arithmetic is regaining popularity in the computing community due to the growing importance of commercial, financial, and Internet-based applications, which process decimal data. This paper presents an iterative decimal multiplier, which operates at high clock frequencies and scales well to large operand sizes. The multiplier uses a new decimal representation for intermediate products, which allows for a very fast two-stage iterative multiplier design. Decimal multipliers, which are synthesized using a 0.11 micron CMOS standard cell library, operate at clock frequencies close to 2 GHz. The latency of the proposed design to multiply two n-digit BCD operands is (n+8) cycles with a new multiplication able to begin every (n+1) cycles.
TL;DR: The results of numerical tests clearly indicate that the accelerated matrix multiplication routines implementing two or three disjoint product-based algorithms are comparable in computational time with an implementation of the Winograd algorithm and clearly outperform it with respect to working space and (especially) numerical stability.
TL;DR: In this paper, a protocol and a special type of multiplication gate that can be realized in a surprisingly simple and efficient way using standard homomorphic threshold ElGamal encryption are proposed.
Abstract: The invention introduces, in the framework of secure multiparty computation based on homomorphic threshold cryptosystems, a protocol and a special type of multiplication gate that can be realized in a surprisingly simple and efficient way using just standard homomorphic threshold ElGamal encryption. As addition gates are essentially for free, the conditional gate not only allows for building a circuit for any function, but actually yields efficient circuits for a wide range of tasks.
TL;DR: This work generalizes the multiplication algorithm of D.V. and G. Chudnovsky and improves the upper bounds of the bilinear complexity of multiplication in all extensions of a finite field of characteristic 2.
TL;DR: In this article, an alternative multiplication method for rhotrices is proposed, which establishes some relationships between rhotrices and matrices.
Abstract: In this note, an alternative multiplication method for rhotrices is proposed. The method establishes some relationships between rhotrices and matrices.
TL;DR: The present work appears to be unique in the context of previous similar studies on mental calculation, which invariably adopted an arithmetic task as the primary demand and suggests the application of a contextual definition of automaticity to this kind of retrieval, given the fact that it might depend both on association strength and on contextual setting variables.
Abstract: Three experiments tested the limiting conditions of multiplication facts retrieval in a number-matching task (LeFevre, Bisanz, & Mrkonjic, 1988). By presenting two digits as cue and by requiring participants to decide whether a subsequent numerical target had been present in the pair, we found interference when the target coincided with the product of the cue digits. This was evidence for obligatory activation of multiplication facts. Also, we showed that multiplication facts retrieval occurred even in the absence of any arithmetic context (i.e., a multiplication sign between the cue digits) and did not require processing resources (i.e., the process met the capacity criterion of automaticity; Jonides, 1981), whereas manipulation of the spatial relation between the two operands (cue digits) negatively affected retrieval. The present work appears to be unique in the context of previous similar studies on mental calculation, which invariably adopted an arithmetic task as the primary demand. We identify this difference as the reason for the failure of all previous studies in revealing independence of multiplication facts from attentional resources. Furthermore, we suggest the application of a contextual definition of automaticity to this kind of retrieval, given the fact that it might depend both on association strength and on contextual setting variables.
TL;DR: This paper describes a hardware implementation of an arithmetic processor which is efficient for elliptic curve (EC) cryptosystems, which are becoming increasingly popular as an alternative for public key cryptosystems based on factoring.
Abstract: This paper describes a hardware implementation of an arithmetic processor which is efficient for elliptic curve (EC) cryptosystems, which are becoming increasingly popular as an alternative for public key cryptosystems based on factoring. The modular multiplication is implemented using a Montgomery modular multiplication in a systolic array architecture, which has the advantage that the clock frequency becomes independent of the bit length m.
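As a software illustration of the Montgomery modular multiplication named above, here is a minimal integer-level sketch. It assumes an odd modulus m < 2^k and Python 3.8+ (for `pow(m, -1, r)`); it does not model the systolic array, and all function names are hypothetical.

```python
def montgomery_setup(m, k):
    """Return (R, m_prime) with R = 2^k and m * m_prime ≡ -1 (mod R).

    Assumes m is odd and m < R, so that m is invertible modulo R.
    """
    r = 1 << k
    m_prime = (-pow(m, -1, r)) % r
    return r, m_prime


def mont_mul(a_bar, b_bar, m, k, m_prime):
    """Montgomery product a_bar * b_bar * R^-1 mod m, for inputs below m."""
    r_mask = (1 << k) - 1
    t = a_bar * b_bar
    u = (t * m_prime) & r_mask      # chosen so that t + u*m is divisible by R
    s = (t + u * m) >> k            # exact division by R, a plain shift
    return s - m if s >= m else s   # s < 2m, so one subtraction suffices


def to_mont(a, m, k):
    """Map a into the Montgomery domain: a * R mod m."""
    return (a << k) % m


def from_mont(a_bar, m, k, m_prime):
    """Map back out of the Montgomery domain by multiplying with 1."""
    return mont_mul(a_bar, 1, m, k, m_prime)
```

The point of the representation is that the reduction step uses only multiplications, a mask, and a shift, with no trial division by m.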
TL;DR: Theorems for the classical as well as quantum query complexity of the OR function are established, which imply a time-space tradeoff T^2*S = Omega(N^3) for sorting N items on a quantum computer, which is optimal up to polylog factors.
Abstract: A strong direct product theorem says that if we want to compute k independent instances of a function, using less than k times the resources needed for one instance, then our overall success probability will be exponentially small in k. We establish such theorems for the classical as well as quantum query complexity of the OR function. This implies slightly weaker direct product results for all total functions. We prove a similar result for quantum communication protocols computing k instances of the Disjointness function.
Our direct product theorems imply a time-space tradeoff T^2*S=Omega(N^3) for sorting N items on a quantum computer, which is optimal up to polylog factors. They also give several tight time-space and communication-space tradeoffs for the problems of Boolean matrix-vector multiplication and matrix multiplication.
TL;DR: This method is applicable in, e.g., FIR filter banks, transforms, and polyphase form FIR filters for sample rate changes and yields good results compared to realizing the matrix multiplication by utilizing multiple coefficient multiplication techniques for the rows or columns separately.
Abstract: In this paper a novel approach for realizing constant coefficient matrix multiplication using few additions and subtractions is proposed. This method is applicable in, e.g., FIR filter banks, transforms, and polyphase form FIR filters for sample rate changes. Examples show that the proposed method yields good results compared to realizing the matrix multiplication by utilizing multiple coefficient multiplication techniques for the rows or columns separately.
TL;DR: A general form of linear functionals on this space is presented, which makes it possible to define a large family of defuzzification methods for that class of numbers.
Abstract: An ordered fuzzy number is an ordered pair of continuous real functions defined on the interval [0, 1]. Such numbers have been introduced by the author and his co-workers as an enlargement of classical fuzzy numbers by requiring a membership relation. It was done in order to define four algebraic operations between them, i.e. addition, subtraction, multiplication and division, in a way that renders them an algebra. Further, a normed topology is introduced which makes them a Banach space, and even more, a Banach algebra with unity. A general form of linear functionals on this space is presented, which makes it possible to define a large family of defuzzification methods for that class of numbers.
TL;DR: In this article, a low-cost coprocessor for smartcards is presented which supports all necessary mathematical operations for a fast calculation of the Elliptic Curve Digital Signature Algorithm (ECDSA) based on the finite field GF(2^m).
Abstract: In this article we present a low-cost coprocessor for smartcards which supports all necessary mathematical operations for a fast calculation of the Elliptic Curve Digital Signature Algorithm (ECDSA) based on the finite field GF(2^m). These ECDSA operations are GF(2^m) addition, 4-bit digit-serial multiplication in GF(2^m), inversion in GF(2^m), and inversion in GF(p). An efficient implementation of the multiplicative inversion which breaks the 11:1 limit regarding multiplications makes it possible to use affine instead of projective coordinates for point operations on elliptic curves. A bitslice architecture allows an easy adaptation for different bit lengths. A small chip area is achieved by reusing the hardware registers for different operations.
TL;DR: In this paper, the authors proposed a solution to reduce the number of times of calculation per unit time in a one-input and multi-output convolution method and a signal processor.
Abstract: PROBLEM TO BE SOLVED: To reduce the number of calculations per unit time in a one-input and multi-output convolution method and a signal processor. SOLUTION: A first coefficient is input to a real number input unit, and a second coefficient is input to an imaginary number input unit. A fast Fourier transform is then performed to generate a coefficient vector in the frequency domain. After an input signal is transformed to a signal vector in the frequency domain by the fast Fourier transform, the transform result and the coefficient vector are multiplied element by element. When the inverse fast Fourier transform is applied to the multiplication result, the real number output unit of the transform result becomes a first output signal sequence, while the imaginary number output unit becomes a second output signal sequence. COPYRIGHT: (C)2005,JPO&NCIPI
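The scheme described above can be sketched as follows, under stated assumptions: the input signal and both coefficient sequences are real and of equal length, the convolution is circular, and a naive O(n^2) DFT stands in for a real FFT to keep the example self-contained. The function names are hypothetical and the code is an illustration, not the patented implementation.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT routine)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def dual_circular_convolution(signal, h1, h2):
    """Convolve one real signal with two real filters using one inverse transform.

    h1 goes into the real part and h2 into the imaginary part of a single
    complex coefficient vector. Because the signal is real and the transform
    is linear, the real part of the result is signal (*) h1 and the
    imaginary part is signal (*) h2.
    """
    packed = [complex(a, b) for a, b in zip(h1, h2)]
    coeff_freq = dft(packed)
    signal_freq = dft(signal)
    product = [s * c for s, c in zip(signal_freq, coeff_freq)]
    y = dft(product, inverse=True)
    return [v.real for v in y], [v.imag for v in y]
```

Two separate convolutions would need two coefficient transforms and two inverse transforms; packing the filters into one complex vector halves both.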
TL;DR: It is shown that it is possible to perform long integer modular arithmetic without using multiple precision operations when t is chosen properly, and efficient arithmetic methods for the XTR cryptosystem are proposed.
Abstract: In 1999, Jerome Solinas introduced families of moduli called the generalized Mersenne numbers [8]. The generalized Mersenne numbers are expressed in a polynomial form, p = f(t), where t is a power of 2. It is shown that such p's lead to fast modular reduction methods which use only a few integer additions and subtractions. We further generalize this idea by allowing any integer for t. We show that more generalized Mersenne numbers still lead to a significant improvement over well-known modular multiplication techniques. While each generalized Mersenne number requires a dedicated implementation, more generalized Mersenne numbers allow flexible implementations that work for more than one modulus. We also show that it is possible to perform long integer modular arithmetic without using multiple precision operations when t is chosen properly. Moreover, based on our results, we propose efficient arithmetic methods for the XTR cryptosystem.
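To make the "few additions and subtractions" claim concrete, here is a minimal sketch for one well-known generalized Mersenne prime, the NIST P-192 modulus p = t^3 - t - 1 with t = 2^64. It illustrates only the original power-of-2 case, not the paper's extension to arbitrary t, and the function name is hypothetical. The reduction uses the congruences t^3 ≡ t + 1, t^4 ≡ t^2 + t, and t^5 ≡ t^2 + t + 1 (mod p).

```python
T_BITS = 64                       # t = 2^64
P192 = 2**192 - 2**64 - 1         # p = t^3 - t - 1


def reduce_p192(c):
    """Reduce c (0 <= c < p^2) modulo P192 with additions and shifts only."""
    assert 0 <= c < P192 * P192
    mask = (1 << T_BITS) - 1
    # Write c = c0 + c1*t + c2*t^2 + c3*t^3 + c4*t^4 + c5*t^5.
    c0, c1, c2, c3, c4, c5 = [(c >> (T_BITS * i)) & mask for i in range(6)]
    # Fold the high words down using the congruences for t^3, t^4, t^5.
    r = ((c0 + c3 + c5)
         + ((c1 + c3 + c4 + c5) << T_BITS)
         + ((c2 + c4 + c5) << (2 * T_BITS)))
    # The folded value is only a small multiple of p too large.
    while r >= P192:
        r -= P192
    return r
```

No division and no multiplication by the modulus is needed: the six 64-bit words of the product are regrouped into three sums, followed by a handful of conditional subtractions.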
TL;DR: This paper presents a new scheme for the hardware evaluation of elementary functions, based on a piecewise second order minimax approximation, that requires only one small rectangular multiplication and outperforms all the previously published methods in both area and speed.
Abstract: This paper presents a new scheme for the hardware evaluation of elementary functions, based on a piecewise second order minimax approximation. The novelty is that this evaluation requires only one small rectangular multiplication. Therefore the resulting architecture combines a small table size, thanks to second-order evaluation, with a short critical path: Consisting of one table lookup, the rectangular multiplication, and one addition, the critical path is shorter than that of a plain first-order evaluation. Synthesis results for several functions show that this method outperforms all the previously published methods in both area and speed for precisions ranging from 12 to 24 bits and over.
TL;DR: In this paper, the authors present a new sequential normal basis multiplier over GF(2^m), where m = 163, 233, 283, 409, 571, i.e., the five fields recommended by NIST for elliptic curve cryptography.
Abstract: We present a new sequential normal basis multiplier over GF(2^m). The gate complexity of our multiplier is significantly reduced from that of Agnew et al. and is comparable to that of Reyhani-Masoleh and Hasan, which is the lowest complexity normal basis multiplier of the same kind. On the other hand, the critical path delay of our multiplier is the same as that of Agnew et al. Therefore it is expected to have a shorter or the same critical path delay compared to that of Reyhani-Masoleh and Hasan. Moreover, our method of using a Gaussian normal basis makes it easy to find a basic multiplication table of normal elements. So one can easily construct a circuit array for large finite fields GF(2^m) where m = 163, 233, 283, 409, 571, i.e., the five fields recommended by NIST for elliptic curve cryptography.
TL;DR: This work generalizes the Montgomery method for scalar multiplication to the Jacobian of genus 2 curves to obtain an algorithm that is competitive compared to the usual methods of scalar multiplication and that has additional properties such as resistance to timing attacks.
Abstract: Using powerful tools on genus 2 curves like the Kummer variety, we generalize the Montgomery method for scalar multiplication to the Jacobian of these curves. Previously this method was only known for elliptic curves. We obtain an algorithm that is competitive compared to the usual methods of scalar multiplication and that has additional properties such as resistance to timing attacks. This algorithm has very important applications in cryptography using hyperelliptic curves and more particularly for people interested in cryptography on smart cards.
TL;DR: Participants performed 40 blocks of practice trials in which they generated answers to a subset of simple multiplication problems; response times and errors were then measured in a true-false product verification task, and the results suggest that operand multiples were suppressed, rather than strengthened, during the practice phase.
Abstract: In the integrated structures model of simple multiplication (Manly & Spoehr, 1999), it is proposed that retrieval of a multiplication fact (e.g., 2 × 7=?) activates and strengthens operand multiples representations (e.g., 4, 6, 8, etc.; 14, 21, 28, etc.). In contrast, in the phenomenon of retrieval-induced forgetting (RIF; M. C. Anderson, Bjork, & Bjork, 1994), it is suggested that operand multiples ought to be suppressed with practice. Participants (N=72) performed 40 blocks of practice trials in which they generated answers to a subset of simple multiplication problems. We then measured response times and errors in a true-false product verification task. Both true and related false equations with practiced operand multiples as presented products were solved relatively poorly in the verification task. In agreement with RIF, this suggests that operand multiples were suppressed, rather than strengthened, during the practice phase.
TL;DR: In this article, the authors show that faithful multiplication and projective modules have common properties and use a method based on Anderson's characterizations of multiplication modules which enables new proofs of theorems on multiplication modules of El Bast, Low, Smith and Smith.
Abstract: The associated ideal θ(M) and the trace ideal T(M) of a module M play analogous but distinct roles in the study of multiplication and projective modules respectively. We further investigate both and show in particular that faithful multiplication and projective modules have common properties. We use a method based on Anderson's characterizations of multiplication modules which enables new proofs of theorems on multiplication modules of El Bast, Low, Smith and Smith, and then, also enables similar results for projective modules.
TL;DR: In this article, a scheme for the hardware evaluation of elementary functions, based on a piecewise second order minimax approximation, is presented. The evaluation requires only one small rectangular multiplication, and the critical path is shorter than that of a plain first-order evaluation.
Abstract: This paper presents a new scheme for the hardware evaluation of elementary functions, based on a piecewise second order minimax approximation. The novelty is that this evaluation requires only one small rectangular multiplication. Therefore the resulting architecture combines a small table size, thanks to second-order evaluation, with a short critical path: Consisting of one table lookup, the rectangular multiplication, and one addition, the critical path is shorter than that of a plain first-order evaluation. Synthesis results for several functions show that this method outperforms all the previously published methods in both area and speed for precisions ranging from 12 to 24 bits and over.