TL;DR: The basic ideas of ℋ- and ℋ2-matrices are introduced and an algorithm that adaptively computes approximations of general matrices in the latter format is presented.
Abstract: A class of matrices (H2-matrices) has recently been introduced for storing discretisations of elliptic problems and integral operators from the BEM. These matrices have the following properties: (i) They are sparse in the sense that only few data are needed for their representation. (ii) The matrix-vector multiplication is of linear complexity. (iii) In general, sums and products of these matrices are no longer in the same set, but after truncation to the H2-matrix format these operations are again of quasi-linear complexity. We introduce the basic ideas of H- and H2-matrices and present an algorithm that adaptively computes approximations of general matrices in the latter format.
TL;DR: This result indicates that arithmetic function is related to working memory in a subsystem-specific manner: multiplication is more closely linked to phonological loop and subtraction to visuo-spatial sketchpad.
TL;DR: A new class of provably invertible mappings which can mix arithmetic operations (negation, addition, subtraction, multiplication) and boolean operations (not, xor, and, or), are highly efficient, and have desirable cryptographic properties are introduced.
Abstract: Invertible transformations over n-bit words are essential ingredients in many cryptographic constructions. When n is small (e.g., n = 8) we can compactly represent any such transformation as a lookup table, but when n is large (e.g., n = 64) we usually have to represent it as a composition of simpler operations such as linear mappings, S-P networks, Feistel structures, etc. Since these cryptographic constructions are often implemented in software on standard microprocessors, we are particularly interested in invertible univariate or multivariate transformations which can be implemented as small compositions of basic machine instructions on 32 or 64 bit words. In this paper we introduce a new class of provably invertible mappings which can mix arithmetic operations (negation, addition, subtraction, multiplication) and boolean operations (not, xor, and, or), are highly efficient, and have desirable cryptographic properties. In particular, we show that for any n the mapping x → x + (x^2 ∨ C) (mod 2^n) is a permutation with a single cycle of length 2^n iff both the least significant bit and the third least significant bit in the constant C are 1.
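The single-cycle claim can be checked by brute force for small word sizes. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and it assumes the operator inside the parentheses is a bitwise OR, i.e. the mapping is x → x + (x² OR C) mod 2^n.

```python
def step(x: int, c: int, n: int) -> int:
    """Apply x -> x + (x*x | c) modulo 2**n once (assumed form of the mapping)."""
    mask = (1 << n) - 1
    return (x + ((x * x) | c)) & mask


def has_single_cycle(c: int, n: int) -> bool:
    """Return True if the orbit of 0 visits all 2**n words before returning to 0."""
    size = 1 << n
    x = step(0, c, n)
    length = 1
    # Walk the orbit of 0; stop on the first return to 0 or after 2**n steps.
    while x != 0 and length < size:
        x = step(x, c, n)
        length += 1
    return x == 0 and length == size


if __name__ == "__main__":
    # List the constants below 16 that give a full cycle on 8-bit words.
    print([c for c in range(16) if has_single_cycle(c, 8)])
```

Exhaustive checking like this is only practical for small n; for 32- or 64-bit words the statement has to be taken from the proof.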
TL;DR: This paper surveys several classical and new zealous algorithms for manipulating formal power series, including algorithms for multiplication, division, resolution of differential equations, composition and reversion, and gives various relaxed algorithms for these operations.
TL;DR: A hardware oriented representation of the field elements is examined, comparing the resulting algorithms for field addition and multiplication operations, and it is shown that characteristic three arithmetic need not significantly under-perform comparable characteristic two alternatives.
Abstract: In this paper we examine a number of ways of implementing characteristic three arithmetic in hardware. While this type of arithmetic is not traditionally used in cryptographic systems, recent advances in Tate and Weil pairing based cryptosystems show that it is potentially valuable. We examine a hardware oriented representation of the field elements, comparing the resulting algorithms for field addition and multiplication operations, and show that characteristic three arithmetic need not significantly under-perform comparable characteristic two alternatives.
TL;DR: Several elliptic curve multiplication algorithms secure against side channel attacks (SCA) are improved and can be applied to the recommended curves found in various standards.
Abstract: We improve several elliptic curve multiplication algorithms secure against side channel attacks (SCA). While some efficient SCA-resistant algorithms were developed that apply only to special classes of curves, we are interested in algorithms that are suitable for general elliptic curves and can be applied to the recommended curves found in various standards. We compare the running time and memory usage of the improved schemes.
TL;DR: This paper examines a number of ways of implementing characteristic three arithmetic for use in cryptosystems based on the Tate pairing, and the resulting algorithms for the field addition, multiplication and cubing are compared.
Abstract: In this paper, the authors examine a number of ways of implementing characteristic three arithmetic for use in cryptosystems based on the Tate pairing. Three alternative representations of the field elements are examined, and the resulting algorithms for the field addition, multiplication and cubing are compared. Issues related to the arithmetic of supersingular elliptic curves over fields of characteristic three are also examined. Details of how to compute the Tate pairing itself are not covered, since these are well documented elsewhere.
TL;DR: A hardware solution for finite field arithmetic with application in asymmetric cryptography is presented; it is ready for future cryptographic bitlengths and allows operation at high clock frequency on moderate hardware resources.
Abstract: In this article we present a hardware solution for finite field arithmetic with application in asymmetric cryptography. It supports calculation in GF(p) as well as in GF(2^m). Addition and multiplication with interleaved modular reduction are the main functionality of the unit. Additional functions, such as shift operations and integer incrementation, allow the calculation of the multiplicative inverse and cover all operations required to implement Elliptic Curve Cryptography. Redundant number representation and efficient modular reduction make it ready for future cryptographic bitlengths and allow operation at high clock frequency on moderate hardware resources.
TL;DR: It is concluded that cipher designers may have placed too much faith in multiplication as a mixing operator, and that it should be combined with at least two other incompatible group operations.
Abstract: We present a new type of differential that is particularly suited to analyzing ciphers that use modular multiplication as a primitive operation. These differentials are partially inspired by the differential used to break Nimbus, and we generalize that result. We use these differentials to break the MultiSwap cipher that is part of the Microsoft Digital Rights Management subsystem, to derive a complementation property in the xmx cipher using the recommended modulus, and to mount a weak key attack on the xmx cipher for many other moduli. We also present weak key attacks on several variants of IDEA. We conclude that cipher designers may have placed too much faith in multiplication as a mixing operator, and that it should be combined with at least two other incompatible group operations.
TL;DR: In this paper, the authors propose a programming method that preprocesses the netlist of function blocks in a user's programmable logic design, grouping multiplication and multiplication-related functions efficiently.
Abstract: A programming method efficiently programs programmable logic devices of the type having specialized multiplier blocks that include multipliers and other arithmetic function elements. Such blocks can be used to perform certain multiplication and multiplication-related functions more efficiently than general-purpose programmable logic. In order to efficiently program devices having such specialized multiplier blocks, so that they are used to their full potential and so that the maximum number of multiplier-related functions can be accommodated on a single programmable logic device, the programming method pre-processes the netlist of function blocks in a user's programmable logic design, grouping multiplication and multiplication-related functions efficiently. The method takes into account limitations imposed by the structure of the specialized multiplier blocks, in addition to location constraints imposed by the user and location constraints dictated by the need for certain functions to be carried out near where certain other functions are carried out.
TL;DR: This paper considers fault tolerant multiplication in finite fields with detection of errors of bit-parallel and bit-serial polynomial basis multipliers over finite fields of characteristic two using the parity prediction technique.
Abstract: In many cryptographic schemes, the most time-consuming basic arithmetic operation is finite field multiplication, and its hardware implementation may require millions of logic gates. It is a complex and costly task to develop such large finite field multipliers which will always yield error-free outputs. To this end, this paper considers fault tolerant multiplication in finite fields. It deals with detection of errors of bit-parallel and bit-serial polynomial basis multipliers over finite fields of characteristic two. Our approach is to partition the multiplier structure into a number of smaller computational units and use the parity prediction technique to detect errors.
TL;DR: This paper introduces the mathematical foundations for positive and negative numbers, addition, multiplication, redundancy, radix conversions, and also the digit value integrity for circuit implementations in a novel number system based on signed continuous valued digits.
Abstract: This paper presents a novel number system based on signed continuous valued digits. Arithmetic operations in this number system are performed using simple analog circuitry, in contrast to the conventional implementation of arithmetic units by Boolean or multiple-valued logic circuits. Unlike the limited precision offered by classical analog arithmetic circuits, the ensemble of continuous valued digits that comprises a number in this system provides arbitrary implementation precision with standard analog circuitry. The number system also provides almost-carry-free arithmetic structures with digit level redundancy. In this paper, we introduce the mathematical foundations for positive and negative numbers, addition, multiplication, redundancy, radix conversions, and also the digit value integrity for circuit implementations. Potential applications are in the area of low noise and low cross-talk circuitry for arithmetic circuits used in mixed-signal systems.
TL;DR: These designs significantly reduce the energy dissipation and latency compared with the state-of-the-art FPGA-based designs and improve the energy performance of the optimized design from the recent Xilinx library by 32% to 88% without any increase in area-latency product.
Abstract: We develop new algorithms and architectures for matrix multiplication on configurable devices. These designs significantly reduce the energy dissipation and latency compared with the state-of-the-art FPGA-based designs. We derive functions to represent the impact of algorithmic level design choices on the system-wide energy dissipation, latency, and area by capturing algorithm and architecture details including features of the target FPGA. The functions are used to optimize energy performance under latency and area constraints for a family of candidate algorithms and architectures. As a result, our designs improve the energy performance of the optimized design from the recent Xilinx library by 32% to 88% without any increase in area-latency product. In terms of comprehensive metrics such as EAT (Energy-Area-Time) and E/AT (Energy/Area-Time), our designs offer superior performance compared with the Xilinx design by 50%-79% and 13%-44%, respectively. We also address how to exploit further increases in density of future FPGA devices for asymptotic improvement in latency and energy dissipation for multiplication of larger size matrices.
TL;DR: The potential speedup achieved by using decimal floating-point hardware, instead of software routines, on a high-performance superscalar architecture is addressed.
Abstract: This paper addresses the potential speedup achieved by using decimal floating-point hardware, instead of software routines, on a high-performance superscalar architecture. Software routines were written to perform decimal addition, subtraction, multiplication, and division. Cycle counts were then measured for each instruction using the Simplescalar simulator. After this, new hardware algorithms were developed, existing hardware algorithms were analyzed, and cycle counts were estimated for the same set of instructions using specialized decimal floating-point hardware. This data was then used to show the potential speedup obtained for programs with different instruction mixes and a previously developed benchmark.
TL;DR: A novel technique which allows a virtual increase of the bitlength of a crypto-coprocessor in an efficient and elegant way and two conceivable economic implementations of the MultModDiv instruction are presented.
Abstract: We present a novel technique which allows a virtual increase of the bitlength of a crypto-coprocessor in an efficient and elegant way. The proposed algorithms assume that the coprocessor is equipped with a special modular multiplication instruction. This instruction, called MultModDiv(A,B,N), computes A * B mod N and ⌊(A*B)/N⌋. In addition to the doubling algorithm, we also present two conceivable economic implementations of the MultModDiv instruction: one hardware and one software realization. The hardware realization of the MultModDiv instruction has the same performance as the modular multiplication presented in the paper. The software realization requires two calls of the modular multiplication instruction. Our most efficient algorithm needs only six calls to an n-bit MultModDiv instruction to compute a modular 2n-bit multiplication. Obviously, special variants of our algorithm, e.g., squaring, require fewer calls.
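The semantics of the instruction can be pinned down with a small reference model. This is only an illustrative sketch: the Python name `mult_mod_div` is hypothetical, it reads the second output as the integer quotient of A*B by N, and it is neither the hardware nor the software realization described in the paper, nor the six-call doubling algorithm.

```python
from typing import Tuple


def mult_mod_div(a: int, b: int, n: int) -> Tuple[int, int]:
    """Reference model: return (a*b mod n, floor(a*b / n)) for n > 0."""
    if n <= 0:
        raise ValueError("modulus must be positive")
    quotient, remainder = divmod(a * b, n)
    return remainder, quotient


if __name__ == "__main__":
    a, b, n = 123456789, 987654321, 1000003
    r, q = mult_mod_div(a, b, n)
    # The two outputs together reconstruct the full product.
    assert a * b == q * n + r
```

Returning both the remainder and the quotient is what makes the instruction useful as a building block: the pair determines the full double-length product.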
TL;DR: In this paper, the authors define analogues of homogeneous coordinate algebras for noncommutative two-tori with real multiplication, and prove that the categories of standard holomorphic vector bundles on such noncommutative tori can be described in terms of graded modules over appropriate homogeneous coordinate algebras.
Abstract: We define analogues of homogeneous coordinate algebras for noncommutative two-tori with real multiplication. We prove that the categories of standard holomorphic vector bundles on such noncommutative tori can be described in terms of graded modules over appropriate homogeneous coordinate algebras. We give a criterion for such an algebra to be Koszul and prove that the Koszul dual algebra also comes from some noncommutative two-torus with real multiplication. These results are based on the techniques of math.QA/0211262, which allow all the data to be interpreted in terms of autoequivalences of the derived categories of coherent sheaves on elliptic curves.
TL;DR: Partial bitwise permutation instructions are provided in a microprocessor or microcontroller as discussed by the authors, which can be specified by one or more of the following: a destination specifier, a previous partial value source, a destination subset specifier and a control specifier.
Abstract: Partial bitwise permutation instructions are provided in a microprocessor or microcontroller. Partial bitwise permutations may be specified by one or more of the following: a destination specifier, a previous partial value source, a destination subset specifier, and a control specifier. The array unit (3030) within the microprocessor or microcontroller includes two parallel multipliers (Marray 4100 and MParray 4200) and permutation logic (4300). The first array, Marray (4100), performs arithmetic multiplication. Marray (4100) uses Acc1 (3031), Acc2 (3032), M (3033) and sel (3034) as inputs and produces a ResultC and ResultS as outputs. The second array, MParray (4200), performs binary polynomial multiplication. Permutation logic (4300) is used to perform various permutations on low order bits of RShold based on the value in RThold (3012).
TL;DR: In this article, integer multiplication and division operators dedicated to Virtex-II FPGAs from Xilinx are presented, based on small 18×18 multiplier blocks available in the Virtex II device family.
Abstract: This paper presents integer multiplication and division operators dedicated to Virtex-II FPGAs from Xilinx. Those operators are based on small 18×18 multiplier blocks available in the Virtex-II device family. Various trade-offs are explored (computation decomposition, radix, digit sets, ...) using specific VHDL generators. The obtained operators lead to speed improvements up to 18% for multiplication and 40% for division compared to standard solutions only based on CLBs.
TL;DR: In this article, a class of Prüfer v-multiplication domains, which are closely related to Krull domains, is investigated, in the same spirit of earlier work on generalized Dedekind domains.
Abstract: This paper investigates a class of Prüfer v-multiplication domains, which are closely related to Krull domains, in the same spirit of earlier work on generalized Dedekind domains.
TL;DR: This work has studied how efficiently several field multipliers can be mapped to lookup table based FPGAs, and shown how an elliptic curve coprocessor based on the Montgomery algorithm for curve multiplication can be implemented using the generic coprocessor architecture.
Abstract: FPGAs are an attractive platform for elliptic curve cryptography hardware. Since field multiplication is the most critical operation in elliptic curve cryptography, we have studied how efficiently several field multipliers can be mapped to lookup table based FPGAs. Furthermore, we have compared different curve coordinate representations with respect to the number of required field operations, and shown how an elliptic curve coprocessor based on the Montgomery algorithm for curve multiplication can be implemented using our generic coprocessor architecture.
TL;DR: In this paper, the authors extended the recent work of Gallant et al. to a much larger family of hyperelliptic curves over finite fields that have efficiently-computable endomorphisms.
Abstract: As Koblitz curves were generalized to hyperelliptic Koblitz curves for faster point multiplication by Gunter, et al. [10], we extend the recent work of Gallant, et al. [8] to hyperelliptic curves. The extended method for speeding up point multiplication thus applies to a much larger family of hyperelliptic curves over finite fields that have efficiently-computable endomorphisms. For this special family of curves, a speedup of up to 55% (59%) can be achieved over the best general methods for a 160-bit point multiplication in the case of genus g = 2 (3).
TL;DR: In this article, a new pulse multiplication technique based on six-pulse thyristor converters is proposed, which can achieve 12-pulse, 18-pulse, and 24-pulse operations on the input current and output voltage.
Abstract: A new pulse multiplication technique based on six-pulse thyristor converters is proposed in this paper. With the proposed technique, 12-pulse, 18-pulse, and 24-pulse operations have been obtained both on the input current and on the output voltage. A control strategy over the whole range of phase angle is provided along with sophisticated input current and output voltage analysis. Experimental results from a laboratory prototype verify the proposed theory.
TL;DR: The Non Reduced Montgomery Multiplication of order s, of A and B, modulo N (odd integer) is defined by NRMM_s(A, B, N) = (AB + N·(−AB·N^(−1) mod 2^s))·2^(−s).
Abstract: We define the Non Reduced Montgomery Multiplication of order s, of A and B, modulo N (odd integer), by NRMM_s(A, B, N) = (AB + N·(−AB·N^(−1) mod 2^s))·2^(−s). Given an upper bound on A and B, with respect to N, we show how to choose the variable s in a way that guarantees that NRMM_s(A, B, N) < 2N. A few applications are demonstrated, showing the advantage of using NRMM_s with an appropriately chosen s over the classical Montgomery Multiplication.
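The definition translates almost directly into code. The sketch below is an illustration under stated assumptions, not the paper's implementation: the function name `nrmm` is hypothetical, and the parameter choice in the usage example (A, B < 2N with 2^s ≥ 4N) is just one sufficient condition that follows from AB < 4N² and the correction term being below N·2^s; the paper's own rule for choosing s may differ.

```python
def nrmm(a: int, b: int, n: int, s: int) -> int:
    """Non-reduced Montgomery multiplication of order s for an odd modulus n."""
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    r = 1 << s
    n_inv = pow(n, -1, r)          # N^(-1) mod 2^s (requires Python 3.8+)
    q = (-a * b * n_inv) % r       # -A*B*N^(-1) mod 2^s, in the range [0, 2^s)
    # a*b + n*q is a multiple of 2^s by construction, so the shift is exact.
    return (a * b + n * q) >> s


if __name__ == "__main__":
    n, s = 101, 9                  # 2^9 = 512 >= 4 * 101
    a, b = 150, 199                # both below 2N = 202
    result = nrmm(a, b, n, s)
    assert result < 2 * n                          # output stays below 2N
    assert (result << s) % n == (a * b) % n        # result = A*B*2^(-s) mod N
```

Because the output stays below 2N under this condition, it can be fed back as an input to the next multiplication without a final conditional subtraction.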
TL;DR: An optimized multiply-and-accumulate (MAC) unit is introduced that makes it possible to compute a×b+c+d with only one instruction, whereby a, b, c, d are single-precision words (unsigned integers).
Abstract: Modulo multiplication of long integers (≥ 1024 bits) is the major operation of many public-key cryptosystems like RSA or Diffie-Hellman. The efficient implementation of modulo arithmetic is a challenging task, in particular on smart cards due to their constrained resources and relatively slow clock frequency. We present the concept of an application-specific instruction set extension (ISE) for long integer arithmetic. We introduce an optimized multiply-and-accumulate (MAC) unit that makes it possible to compute a×b+c+d with only one instruction, whereby a, b, c, d are single-precision words (unsigned integers). This additional instruction is simple to incorporate into common RISC architectures like the MIPS32. Experimental results show that the inner-product operation of a multiple-precision multiplication can be accelerated by a factor of two without increasing the processor's clock frequency. We also estimate the execution time of a 1024-bit modulo exponentiation assuming that this special MAC instruction was made available. The proposed ISE is an alternative solution to a crypto co-processor especially for multi-application smart cards (e.g., Java cards) with an embedded 32-bit RISC core.
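To see why a×b+c+d is a natural primitive, note that for w-bit words the largest possible value is (2^w − 1)² + 2(2^w − 1) = 2^(2w) − 1, which fits exactly in two words. The sketch below models such an instruction in software and uses it in a textbook operand-scanning multiplication. All names (`mac`, `mp_mul`, `to_words`, `from_words`) and the 32-bit word size are assumptions for illustration; the abstract does not specify the loop structure used in the paper.

```python
W = 32                       # assumed word size in bits
MASK = (1 << W) - 1


def mac(a: int, b: int, c: int, d: int):
    """Model of the MAC instruction: (high, low) words of a*b + c + d."""
    t = a * b + c + d        # at most 2**(2*W) - 1, so two words always suffice
    return t >> W, t & MASK


def mp_mul(x, y):
    """Operand-scanning product of little-endian word lists x and y."""
    z = [0] * (len(x) + len(y))
    for i, xi in enumerate(x):
        carry = 0
        for j, yj in enumerate(y):
            # One MAC per inner-loop step: partial product + old digit + carry.
            carry, z[i + j] = mac(xi, yj, z[i + j], carry)
        z[i + len(y)] = carry
    return z


def to_words(value: int, count: int):
    """Split a non-negative integer into `count` little-endian words."""
    return [(value >> (W * k)) & MASK for k in range(count)]


def from_words(words):
    """Reassemble an integer from little-endian words."""
    return sum(word << (W * k) for k, word in enumerate(words))


if __name__ == "__main__":
    x, y = 2**127 - 1, 2**100 + 12345
    assert from_words(mp_mul(to_words(x, 4), to_words(y, 4))) == x * y
```

Each inner-loop step maps to a single `mac` call, which is the operation the proposed instruction would execute in hardware.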
TL;DR: In this paper, a non-reduced and extended Montgomery multiplication between a first integer value A and a second integer value B is presented, in which the number of iterations required is greater than the number of bits n of an odd modulo value N. At least s (> n+1) iterations of the following steps are performed, in each iteration choosing one bit, in sequence, from the first integer value A, starting from its least significant bit: adding to the content of the accumulating device S the product of the selected bit and the second integer value B.
Abstract: A method for carrying out modular arithmetic computations involving multiplication operations by utilizing a non-reduced and extended Montgomery multiplication between a first integer value A and a second integer value B, in which the number of iterations required is greater than the number of bits n of an odd modulo value N. The method comprises storing n+2 bit values in an accumulating device (S) capable of adding n+2-bit values (X) to its content and of dividing its content by 2. Whenever desired, the content of the accumulating device is set to zero. At least s (> n+1) iterations of the following steps are performed, in each iteration choosing one bit, in sequence, from the first integer value A, starting from its least significant bit: adding to the content of the accumulating device S the product of the selected bit and the second integer value B; adding to the resulting content the product of its current least significant bit and N; dividing the result by 2; and obtaining a non-reduced and extended Montgomery multiplication result by repeating these steps s-1 additional times, each time using the previous result (S).
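The three per-iteration steps (add the selected bit times B, add the least significant bit times N, halve) can be modelled in a few lines. This is a software sketch of the described loop, not the claimed device: the function name is hypothetical, the accumulator is an unbounded Python integer rather than an (n+2)-bit register, and bits of A beyond its length are taken as zero.

```python
def bit_serial_montgomery(a: int, b: int, n: int, s: int) -> int:
    """Run s iterations of the add / conditional-add-N / halve loop for odd n."""
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    acc = 0                          # accumulator S, initially zero
    for i in range(s):
        a_bit = (a >> i) & 1         # bits of A, least significant first
        acc += a_bit * b             # add selected bit times B
        acc += (acc & 1) * n         # add N if needed to make the sum even
        acc >>= 1                    # divide by 2 (exact, since the sum is even)
    return acc


if __name__ == "__main__":
    n, s = 101, 10                   # n has 7 bits, so s > n_bits + 1
    a, b = 150, 199
    result = bit_serial_montgomery(a, b, n, s)
    assert (result << s) % n == (a * b) % n   # result = A*B*2^(-s) mod N
    assert result < b + n                     # accumulator never reaches B + N
```

The second assertion reflects a simple invariant of the loop: if the accumulator is below B + N before an iteration, then after adding at most B + N and halving it is still below B + N.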
TL;DR: A general approach which specifically targets reduction of redundant computation in common digital-signal processing (DSP) tasks such as filtering and matrix multiplication is presented and it is shown that such tasks can be expressed as multiplication of vectors by scalars and this allows fast multiplication by sharing computation.
Abstract: In this paper, we present a general approach which specifically targets reduction of redundant computation in common digital-signal processing (DSP) tasks such as filtering and matrix multiplication. We show that such tasks can be expressed as multiplication of vectors by scalars and this allows fast multiplication by sharing computation. The vector scaling operation is decomposed to find the most effective precomputations which yield a fast multiplier implementation. Two decomposition approaches are presented, one based on a greedy decomposition and the other based on fixed-size lookup, and this leads to two multiplier architectures for vector-scalar products. Analog simulation of an example multiplier shows a speed advantage by a factor of about 1.85 over a conventional carry save array multiplier. Further simulations using 0.18 μ technology show up to 20% speed advantage over Booth encoded Wallace tree multipliers.
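The idea of sharing computation across a vector-scalar product can be illustrated with a fixed-size lookup. The sketch below is one possible reading of that approach, not the paper's architecture: the function name, the 4-bit chunk size, and the 16-bit operand width are all assumptions. Small multiples of the scalar are precomputed once, and every vector element is then formed from shifted table entries.

```python
def scale_vector(vec, c, chunk_bits=4, width=16):
    """Multiply each element of vec by scalar c using one shared lookup table."""
    # Precompute c*k for every chunk value k once; in hardware these multiples
    # would be built from shifts and adds rather than a general multiplier.
    table = [c * k for k in range(1 << chunk_bits)]
    mask = (1 << chunk_bits) - 1
    out = []
    for v in vec:
        if not 0 <= v < (1 << width):
            raise ValueError("vector element outside the assumed operand width")
        acc = 0
        # Each chunk of v selects a precomputed multiple, shifted into place.
        for shift in range(0, width, chunk_bits):
            acc += table[(v >> shift) & mask] << shift
        out.append(acc)
    return out


if __name__ == "__main__":
    samples = [0, 1, 255, 1234, 65535]
    assert scale_vector(samples, 37) == [v * 37 for v in samples]
```

The table cost is paid once per scalar, so the saving grows with the length of the vector being scaled.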
TL;DR: In this article, an inverse power method is examined for calculation of the effective multiplication factor in neutron transport problems, and it is shown that the method can be extended to multigroup multidimensional problems.
TL;DR: In this article, the mathematically required operations of inverse vertical DCT (321) and addition (322) are combined with a basis transform (323). The basis transform is chosen such that the combined operation is physically replaced by multiplication with a sparse matrix.
Abstract: For some video processing applications, most notably watermark detection (40), it is necessary to add or average (parts of) the two interlaced fields which make up a frame. This operation is not trivial in the MPEG domain due to the existence of frame-encoded DCT blocks. The invention provides a method and arrangement for adding the fields without requiring a frame memory or an on-the-fly inverse DCT. To this end, the mathematically required operations of inverse vertical DCT (321) and addition (322) are combined with a basis transform (323). The basis transform is chosen such that the combined operation is physically replaced by multiplication with a sparse matrix (32). Said sparse matrix multiplication can easily be executed on-the-fly. The inverse basis transform (35) is postponed until after the desired addition (33, 34) has been completed.
TL;DR: This paper applies multidimensional scaling (MDS) to the domain of multiplication problems, to uncover their underlying similarity structure, and shows greater similarity between problems containing larger operands and suggests that tie problems (e.g., 7 x 7) hold special status.
Abstract: Adults consistently make errors in solving simple multiplication problems. These errors have been explained with reference to the interference between similar problems. In this paper, we apply multidimensional scaling (MDS) to the domain of multiplication problems, to uncover their underlying similarity structure. A tree-sorting task was used to obtain perceived dissimilarity ratings. The derived representation shows greater similarity between problems containing larger operands and suggests that tie problems (e.g., 7 x 7) hold special status. A version of the generalized context model (Nosofsky, 1986) was used to explore the derived MDS solution. The similarity of multiplication problems made an important contribution to producing a model consistent with human performance, as did the frequency with which such problems arise in textbooks, suggesting that both factors may be involved in the explanation of errors.