TL;DR: In this article, a multiplication-free one-bit transform (1BT) for low-complexity block-based motion estimation is presented, which can be implemented in integer arithmetic using addition and shifts only, reducing the computational complexity, processing time, and power consumption.
Abstract: A multiplication-free one-bit transform (1BT) for low-complexity block-based motion estimation is presented in this letter. A novel filter kernel is utilized to construct the 1BT of image frames using addition and shift operations only. It is shown that the proposed approach provides the same motion estimation accuracy at the macro-block level and even better accuracy for smaller block sizes compared to previously proposed 1BT methods. Because the proposed 1BT approach does not require multiplication operations, it can be implemented in integer arithmetic using addition and shifts only, reducing computational complexity, processing time, and power consumption.
TL;DR: Lower bounds for problems related to realizing multiplication by constants with shifts, adders, and subtracters are presented and have applications in proving the optimality of solutions obtained by heuristics.
Abstract: Lower bounds for problems related to realizing multiplication by constants with shifts, adders, and subtracters are presented. These lower bounds are straightforwardly calculated and have applications in proving the optimality of solutions obtained by heuristics.
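To make the shift-and-add setting concrete, here is a minimal sketch of multiplying by a constant using only shifts, additions, and subtractions. It uses non-adjacent form (NAF) recoding, a common heuristic chosen for illustration; it is not taken from the paper. The function names are hypothetical. The operation count it returns is only an upper bound of the kind that lower bounds such as those in the abstract can be compared against.

```python
def naf_digits(c):
    """Non-adjacent form of a positive integer c, least significant digit first.

    Each digit is -1, 0, or 1, and no two adjacent digits are both nonzero.
    """
    digits = []
    while c > 0:
        if c & 1:
            d = 2 - (c & 3)  # 1 if c mod 4 == 1, -1 if c mod 4 == 3
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def mul_const_shift_add(x, c):
    """Return (x * c, number of nonzero NAF digits) using shifts, adds, subtracts."""
    acc = 0
    nonzero = 0
    for i, d in enumerate(naf_digits(c)):
        if d == 1:
            acc += x << i
            nonzero += 1
        elif d == -1:
            acc -= x << i
            nonzero += 1
    return acc, nonzero
```

For example, the constant 7 is recoded as 8 - 1, so `mul_const_shift_add(5, 7)` computes `(5 << 3) - 5`.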
TL;DR: A practical algorithm for computing least solutions of systems of equations over the integers with addition, multiplication with positive constants, maximum and minimum, based on strategy iteration is presented.
Abstract: We present a practical algorithm for computing least solutions of systems of equations over the integers with addition, multiplication with positive constants, maximum and minimum. The algorithm is based on strategy iteration. Its run-time (w.r.t. the uniform cost measure) is independent of the sizes of occurring numbers. We apply our technique to solve systems of interval equations. In particular, we show how arbitrary intersections as well as full interval multiplication in interval equations can be dealt with precisely.
TL;DR: By optimizing and balancing the critical delay of sub-components, two high performance GCM implementations are presented on Xilinx Virtex-4 devices and provide a good criterion to minimize the influence of technology mapping.
Abstract: This paper addresses efficient and high-throughput implementations of AES-GCM optimized for FPGAs. Two main components, the AES engine and the modular multiplication over GF(2^m), are discussed and their complexities on FPGAs are shown. Instead of discussing the complexities by using AND and XOR gates as primitives, we present the complexity analysis directly based on FPGA primitives, e.g., Look-Up-Tables (LUTs). For the modular multiplier, the straightforward multiplication is used to get a speed-efficient design while Karatsuba's algorithm is used to get an area-efficient design. For the AES engine, the composite field approach is adopted and then inner-round pipelining technology is applied. The estimated resource consumption returned by the complexity analysis provides a good criterion to minimize the influence of technology mapping. By optimizing and balancing the critical delay of sub-components, two high performance GCM implementations are presented on Xilinx Virtex-4 devices.
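The trade-off between straightforward and Karatsuba multiplication mentioned above can be sketched in software. The following is an illustrative model only, not the paper's hardware design: polynomials over GF(2) are stored as Python integers (bit i is the coefficient of x^i), the function names and the base-case threshold of 8 bits are assumptions, and the reduction modulo the field polynomial is omitted.

```python
def clmul(a, b):
    """Schoolbook carry-less (GF(2)[x]) multiplication of two bit-packed polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def karatsuba_gf2(a, b, n):
    """Multiply two GF(2)[x] polynomials of fewer than n bits with Karatsuba splitting."""
    if n <= 8:  # assumed threshold below which schoolbook is used
        return clmul(a, b)
    h = n // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = karatsuba_gf2(a0, b0, h)
    hi = karatsuba_gf2(a1, b1, n - h)
    # Three recursive products instead of four; addition in GF(2) is XOR.
    mid = karatsuba_gf2(a0 ^ a1, b0 ^ b1, n - h)
    return lo ^ ((mid ^ lo ^ hi) << h) ^ (hi << (2 * h))
```

The saving comes from computing the cross terms as `mid ^ lo ^ hi`, which replaces one sub-multiplication with a few XORs.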
TL;DR: This paper identifies two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method and proposes high-performance and area-efficient designs using each method.
Abstract: Field-programmable gate arrays (FPGAs) have become an attractive option for accelerating scientific applications. Many scientific operations such as matrix-vector multiplication and dot product involve the reduction of a sequentially produced stream of values. Unfortunately, because of the pipelining in FPGA-based floating-point units, data hazards may occur during these sequential reduction operations. Improperly designed reduction circuits can adversely impact the performance, impose unrealistic buffer requirements, and consume a significant portion of the FPGA. In this paper, we identify two basic methods for designing serial reduction circuits: the tree-traversal method and the striding method. Using accumulation as an example, we analyze the design trade-offs among the number of adders, buffer size, and latency. We then propose high-performance and area-efficient designs using each method. The proposed designs reduce multiple sets of sequentially delivered floating-point values without stalling the pipeline or imposing unrealistic buffer requirements. Using a Xilinx Virtex-II Pro FPGA as the target device, we implemented our designs and present performance and area results.
TL;DR: A circuit architecture that can handle multiple data lengths using the same circuits and that improves the Montgomery multiplication algorithm in order to maximize the performance of the multiplication unit in the FPGA.
Abstract: This paper describes a modular exponentiation processing method and circuit architecture that can extract the maximum performance from FPGA resources. The proposed modular exponentiation architecture comprises three main techniques. The first improves the Montgomery multiplication algorithm in order to maximize the performance of the multiplication unit in the FPGA. The second improves and balances the circuit delay. The third ensures fast, scalable use of the effective FPGA resources. We propose a circuit architecture that can handle multiple data lengths using the same circuits. In addition, our architecture can perform fast operations using small-scale resources; in particular, it can complete 512-bit modular exponentiation in 0.26 ms on the XC4VF12-10SF363, which has the fewest logic resources in the Virtex-4 series of FPGAs. The design is also very compact, using approximately 4000 SLICEs. Moreover, 1024-, 1536- and 2048-bit modular exponentiations can be processed in the same circuit thanks to this scalability.
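For readers unfamiliar with Montgomery multiplication, the following is a minimal software sketch of the textbook algorithm and its use in modular exponentiation. It does not reproduce the improved variant or the circuit described above; the function name `mont_pow` and the choice R = 2^k with k equal to the bit length of the modulus are assumptions made for illustration. It requires an odd modulus greater than 1 and Python 3.8 or later (for `pow(n, -1, r)`).

```python
def mont_pow(base, exp, n):
    """Compute base**exp mod n (n odd, n > 1) with textbook Montgomery multiplication."""
    k = n.bit_length()
    r = 1 << k                       # R = 2^k > n, coprime to n because n is odd
    r_mask = r - 1
    n_prime = (-pow(n, -1, r)) % r   # n * n_prime == -1 (mod R)

    def mont_mul(a, b):
        # Returns a * b * R^{-1} mod n using only multiplications, masks, and shifts.
        t = a * b
        m = ((t & r_mask) * n_prime) & r_mask
        u = (t + m * n) >> k         # exact division by R
        return u - n if u >= n else u

    x = (base % n) * r % n           # base in Montgomery form
    acc = r % n                      # 1 in Montgomery form
    while exp:
        if exp & 1:
            acc = mont_mul(acc, x)
        x = mont_mul(x, x)
        exp >>= 1
    return mont_mul(acc, 1)          # convert back out of Montgomery form
```

The point of the technique is that the reduction inside `mont_mul` needs no trial division by `n`, which is what makes it attractive for hardware multipliers.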
TL;DR: In this article, an optimal hardware implementation of the FFT/IFFT operation is presented that minimizes the number of clock cycles required to compute the FFT/IFFT while at the same time minimizing the number of complex multipliers needed.
Abstract: We disclose an optimal hardware implementation of the FFT/IFFT operation that minimizes the number of clock cycles required to compute the FFT/IFFT while at the same time minimizing the number of complex multipliers needed. An input module combines a plurality of inputs after applying a multiplication factor to each of the inputs. At least one multiplicand generator generates multiplicands. At least two complex multiplier modules perform complex multiplications with at least one of the complex multiplier modules receiving an output from the input module. A map module receives outputs of the at least two complex multiplier modules, the map module selecting and applying a multiplication factor to each of the outputs received to generate multiple outputs. Finally, an accumulation module receives and performs an accumulation task on each of the multiple outputs of the map module thereby generating a corresponding number of multiple outputs.
TL;DR: A method and apparatus for accelerating scalar multiplication in an elliptic curve cryptosystem (ECC) over prime fields is provided in this article, where multiplication operations within an ECC point operation are identified and modified utilizing an equivalent point representation that inserts multiples of two.
Abstract: A method and apparatus for accelerating scalar multiplication in an elliptic curve cryptosystem (ECC) over prime fields is provided. Multiplication operations within an ECC point operation are identified and modified utilizing an equivalent point representation that inserts multiples of two. Algebraic substitutions of the multiplication operations with squaring operations and other cheaper field operations are performed. Scalar multiplication can also be protected against simple side-channel attacks by balancing the number of multiplication operations and squaring operations and providing novel atomic structures to implement the ECC operation. In addition, a new coordinate system is defined to enable more effective operation of ECC in multiprocessor environments.
TL;DR: A method for all-optical, tunable pulse repetition-rate multiplication of a mode-locked laser based on spectral line-by-line control is demonstrated with very high fidelity.
Abstract: We demonstrate a method for all-optical, tunable pulse repetition-rate multiplication of a mode-locked laser based on spectral line-by-line control. In particular, two-to-five-times repetition-rate multiplication of a 9 GHz source is achieved with very high fidelity.
TL;DR: The role of executive and phonological working memory resources in simple arithmetic was investigated in two experiments, in which participants had to solve simple multiplication problems (e.g., 4×8) or simple division problems under no-load, phonological-load, and executive-load conditions.
Abstract: The role of executive and phonological working memory resources in simple arithmetic was investigated in two experiments. Participants had to solve simple multiplication problems (e.g., 4×8; Experiment 1) or simple division problems (e.g., 42÷7; Experiment 2) under no-load, phonological-load, and executive-load conditions. The choice/no-choice method was used to investigate strategy execution and strategy selection independently. Results for strategy execution showed that executive working memory resources were involved in direct memory retrieval of both multiplication and division facts. Executive working memory resources were also involved in the use of nonretrieval strategies. Phonological working memory resources, on the other hand, tended to be involved in nonretrieval strategies only. Results for strategy selection showed no effects of working memory load. Finally, correlation analyses showed that both strategy execution and strategy selection correlated with individual-difference variables, such as gender, math anxiety, associative strength, calculator use, arithmetic skill, and math experience.
TL;DR: Based on a recently proposed Toeplitz matrix-vector product approach, a subquadratic computational complexity scheme is presented for multiplications in binary extended finite fields using type I and II optimal normal bases.
Abstract: Based on a recently proposed Toeplitz matrix-vector product approach, a subquadratic computational complexity scheme is presented for multiplications in binary extended finite fields using type I and II optimal normal bases.
TL;DR: The first MEMOCODE hardware/software co-design contest posed the following problem: optimize matrix-matrix multiplication in such a way that it is split between the FPGA and PowerPC on a Xilinx Virtex IIPro30. The authors' solution was implemented on a Xilinx XUP development board with 256 MB of DRAM.
Abstract: The first MEMOCODE hardware/software co-design contest posed the following problem: optimize matrix-matrix multiplication in such a way that it is split between the FPGA and PowerPC on a Xilinx Virtex IIPro30. In this paper we discuss our solution, which we implemented on a Xilinx XUP development board with 256 MB of DRAM. The design was done by the five authors over a span of approximately 3 weeks, though of the 15 possible man-weeks, about 9 were actually spent working on this problem. All hardware design was done using Bluespec SystemVerilog (BSV), with the exception of an imported Verilog multiplication unit, necessary only due to the limitations of the Xilinx FPGA toolflow optimizations.
TL;DR: In this paper, the authors recall the hybrid method of Gura et al. for multi-precision multiplication, which exploits the increased number of registers available on modern architectures in order to avoid duplicated loads from memory.
Abstract: In this paper we recall the hybrid method of Gura et al. for multi-precision multiplication [4] which is an improvement on the basic Comba method and which exploits the increased number of registers available on modern architectures in order to avoid duplicated loads from memory. We then show how to improve and generalise the method for application across a wide range of processor types, setting some new records in the process.
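As background, the basic Comba (product-scanning) method that the hybrid method builds on can be sketched as follows. This is a plain illustration of column-wise accumulation, not the hybrid method of Gura et al. or the improved version in the paper; the limb width of 32 bits and the function names are assumptions.

```python
def comba_mul(a, b, w=32):
    """Multiply two little-endian limb arrays column by column (product scanning)."""
    n, m = len(a), len(b)
    mask = (1 << w) - 1
    out = [0] * (n + m)
    acc = 0  # running column accumulator, carries into the next column
    for k in range(n + m - 1):
        # Sum every partial product a[i] * b[j] with i + j == k.
        for i in range(max(0, k - m + 1), min(k, n - 1) + 1):
            acc += a[i] * b[k - i]
        out[k] = acc & mask  # each output limb is written exactly once
        acc >>= w
    out[n + m - 1] = acc
    return out


def limbs_to_int(limbs, w=32):
    """Helper for checking results: rebuild the integer from little-endian limbs."""
    return sum(v << (w * i) for i, v in enumerate(limbs))
```

Because each output limb is stored once, the method trades memory writes for a wider accumulator, which is where register availability becomes relevant.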
TL;DR: Experimental results presented here show that due to an increase in the number of parity bits, the area overhead tends to increase linearly, but the probability of error detection approaches unity fairly quickly, e.g., for eight parity bits.
Abstract: The detection of errors in arithmetic operations is an important issue. This paper discusses the detection of multiple-bit errors due to faults in bit-serial and bit-parallel polynomial basis (PB) multipliers over binary extension fields. Our approach is based on multiple parity bits. Experimental results presented here show that due to an increase in the number of parity bits, the area overhead tends to increase linearly, but the probability of error detection approaches unity fairly quickly, e.g., for eight parity bits. In bit-serial implementation of a GF(2^163) PB multiplier using eight parity bits, the area overhead and the probability of error detection are 10.29% and 0.996, respectively. This is achieved without any increase in the computation time of the GF(2^163) PB multiplier.
TL;DR: This paper proposes a modification to Barrett's algorithm that leads to a significant reduction (25% to 75%) in multiplications and additions.
Abstract: It is widely acknowledged that efficient modular multiplication is a key to high-performance implementation of public-key cryptography, be it classical RSA, Diffie-Hellman, or (hyper-) elliptic curve algorithms. In the recent decade, practitioners have relied mainly on two popular methods: Montgomery Multiplication and regular long-integer multiplication in combination with Barrett's modular reduction technique. In this paper, we propose a modification to Barrett's algorithm that leads to a significant reduction (25% to 75%) in multiplications and additions.
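For context, classical Barrett reduction (the starting point that the paper modifies) can be sketched as below. This is the textbook version, not the proposed modification; the function names and the choice of precomputing floor(4^k / n) with k the bit length of n are assumptions for illustration.

```python
def barrett_setup(n):
    """Precompute k = bit length of n and mu = floor(2^(2k) / n) for modulus n > 1."""
    k = n.bit_length()
    mu = (1 << (2 * k)) // n
    return k, mu


def barrett_reduce(x, n, k, mu):
    """Return x mod n for 0 <= x < n*n using multiplications and shifts only."""
    q = (x * mu) >> (2 * k)  # estimate of floor(x / n), never too large
    r = x - q * n
    while r >= n:            # the estimate is off by a small amount at most
        r -= n
    return r
```

The division by `n` is paid once in `barrett_setup`; every later reduction costs two multiplications plus a short correction loop.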
TL;DR: By a suitable extension of fast univariate Euclidean division to triangular families of polynomials, this work obtains theoretical and practical improvements over a direct recursive approach and reaches quasi-linear complexity for a family of special cases.
Abstract: We study arithmetic operations for triangular families of polynomials, concentrating on multiplication in dimension zero. By a suitable extension of fast univariate Euclidean division, we obtain theoretical and practical improvements over a direct recursive approach; for a family of special cases, we reach quasi-linear complexity. The main outcome we have in mind is the acceleration of higher-level algorithms, by interfacing our low-level implementation with languages such as AXIOM or Maple. We show the potential for huge speed-ups, by comparing two AXIOM implementations of van Hoeij and Monagan's modular GCD algorithm.
TL;DR: It is shown that multiplication complexities of n-term Karatsuba-Like formulae of GF(2)[x] (7
Abstract: For original paper see P.L. Montgomery, ibid., vol.54, no.3, p.362-369, (2005). We show that multiplication complexities of n-term Karatsuba-Like formulae of GF(2)[x] (7
TL;DR: In this paper, the authors analyze the best speeds obtainable for single-scalar multiplication with variable base point by combining a huge range of options: many choices of coordinate systems and formulas for individual group operations, including new formulas for tripling on Edwards curves; double-base chains with many different doubling/tripling ratios, including standard base-2 chains as an extreme case; and many precomputation strategies, going beyond Dimitrov, Imbert, Mishra (Asiacrypt 2005) and Doche and Imbert (Indocrypt 2006).
Abstract: This paper analyzes the best speeds that can be obtained for single-scalar multiplication with variable base point by combining a huge range of options: – many choices of coordinate systems and formulas for individual group operations, including new formulas for tripling on Edwards curves; – double-base chains with many different doubling/tripling ratios, including standard base-2 chains as an extreme case; – many precomputation strategies, going beyond Dimitrov, Imbert, Mishra (Asiacrypt 2005) and Doche and Imbert (Indocrypt 2006). The analysis takes account of speedups such as S − M tradeoffs and includes recent advances such as inverted Edwards coordinates. The main conclusions are as follows. Optimized precomputations and triplings save time for single-scalar multiplication in Jacobian coordinates, Hessian curves, and tripling-oriented Doche/Icart/Kohel curves. However, even faster single-scalar multiplication is possible in Jacobi intersections, Edwards curves, extended Jacobi-quartic coordinates, and inverted Edwards coordinates, thanks to extremely fast doublings and additions; there is no evidence that double-base chains are worthwhile for the fastest curves. Inverted Edwards coordinates are the speed leader.
TL;DR: A formal proof is proposed which shows that the recoding technique diminishes the number of additions in a sublinear way and proves Lefevre's conjecture that the multiplication by an integer constant is achievable in sublinear time.
Abstract: This paper explores the use of the double-base number system (DBNS) for constant integer multiplication. The DBNS recoding scheme represents integers (in this case, constants) in a multiple-radix way in the hope of minimizing the number of additions to be performed during constant multiplication. On the theoretical side, we propose a formal proof which shows that our recoding technique diminishes the number of additions in a sublinear way. Therefore, we prove Lefevre's conjecture that the multiplication by an integer constant is achievable in sublinear time. In a second part, we investigate various strategies and we provide numerical data showcasing the potential interest of our approach.
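To illustrate what a double-base representation looks like, here is a small sketch of the well-known greedy DBNS expansion, which writes a positive integer as a sum of terms 2^a * 3^b. It is only an example of the number system; the recoding scheme analyzed in the paper may differ, and the function names are hypothetical.

```python
def largest_2_3(n):
    """Return (v, (a, b)) where v = 2^a * 3^b is the largest such value <= n (n >= 1)."""
    best, best_ab = 1, (0, 0)
    p3, b = 1, 0
    while p3 <= n:
        a = (n // p3).bit_length() - 1  # largest a with 2^a * 3^b <= n
        v = p3 << a
        if v > best:
            best, best_ab = v, (a, b)
        p3 *= 3
        b += 1
    return best, best_ab


def dbns_greedy(n):
    """Greedy double-base expansion: list of (a, b) exponents whose terms sum to n."""
    terms = []
    while n > 0:
        v, ab = largest_2_3(n)
        terms.append(ab)
        n -= v
    return terms
```

For instance, 127 needs seven nonzero bits in binary, but the greedy expansion finds 127 = 108 + 18 + 1, that is, three terms.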
TL;DR: This paper presents some structure results for Gaussian normal bases of characteristic three finite fields, uses the results to devise faster multiplication algorithms, and compares the speed of the Boneh-Franklin and Sakai-Kasahara identity-based encryption schemes.
Abstract: Fast arithmetic for characteristic three finite fields is desirable in pairing-based cryptography because there is a suitable family of elliptic curves over these fields having embedding degree 6. In this paper we present some structure results for Gaussian normal bases of these fields, and use the results to devise faster multiplication algorithms. We carefully compare multiplication in these fields using polynomial bases and Gaussian normal bases. Finally, we compare the speed of encryption and decryption for the Boneh-Franklin and Sakai-Kasahara identity-based encryption schemes at the 128-bit security level, in the case where supersingular elliptic curves with embedding degrees 2, 4 and 6 are employed.
TL;DR: This paper presents the row–column multiplication of rhotrices that are of high dimension, an extension of the same multiplication carried out on rhotrices of dimension three, considered to be the base rhotrices.
Abstract: This paper presents the row–column multiplication of rhotrices that are of high dimension. This is an extension of the same multiplication carried out on rhotrices of dimension three, considered to be the base rhotrices.
TL;DR: This tutorial introduces the reader to a set of general techniques to improve the performance of numerical code, focusing on optimizations for the computer's memory hierarchy.
Abstract: The complexity of modern computing platforms has made it extremely difficult to write numerical code that achieves the best possible performance. Straightforward implementations based on algorithms that minimize the operations count often fall short in performance by at least one order of magnitude. This tutorial introduces the reader to a set of general techniques to improve the performance of numerical code, focusing on optimizations for the computer's memory hierarchy. Further, program generators are discussed as a way to reduce the implementation and optimization effort. Two running examples are used to demonstrate these techniques: matrix-matrix multiplication and the discrete Fourier transform.
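One of the memory-hierarchy techniques commonly applied to matrix-matrix multiplication is loop blocking (tiling). The sketch below shows only the loop structure, in pure Python for readability; it is not taken from the tutorial, the block size of 2 is an arbitrary assumption, and a real implementation would be written in a compiled language with the block size tuned to the cache.

```python
def matmul_blocked(A, B, bs=2):
    """Blocked matrix product C = A * B on lists of lists, with block size bs."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    # The outer three loops walk over blocks; the inner three stay within one block,
    # so the same small tiles of A, B, and C are reused while they are still cached.
    for ii in range(0, n, bs):
        for kk in range(0, k, bs):
            for jj in range(0, m, bs):
                for i in range(ii, min(ii + bs, n)):
                    for p in range(kk, min(kk + bs, k)):
                        a = A[i][p]
                        for j in range(jj, min(jj + bs, m)):
                            C[i][j] += a * B[p][j]
    return C
```

The arithmetic is identical to the naive triple loop; only the order in which the operands are visited changes.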
TL;DR: In this article, a fast Fourier transform based method was proposed to reduce the number of F_{3^m}-multiplications for multiplication in F_{3^{6m}} from 18 in recent implementations to 15.
Abstract: Efficient computation of the Tate pairing is an important part of pairing-based cryptography. Recently, with the introduction of the Duursma-Lee method, special attention has been given to the fields of characteristic 3. In particular, multiplication in F_{3^{6m}}, where m is prime, is an important operation in the above method. In this paper we propose a new method to reduce the number of F_{3^m}-multiplications for multiplication in F_{3^{6m}} from 18 in recent implementations to 15. The method is based on the fast Fourier transform and its explicit formulas are given. The execution times of our software implementations for F_{3^{6m}} show the efficiency of our results.
TL;DR: With adjacent-scale multiplication in the odd Gabor transform domain, a sharpened edge response is obtained, which more effectively resists the adverse influence of noise contamination on the performance of the edge detector.
TL;DR: A low-power, area-efficient four-way 32-bit multifunction arithmetic unit has been developed for programmable shaders for handheld 3D graphics systems and unified into a single arithmetic platform with maximum four-cycle latency and single-cycle throughput.
Abstract: A low-power, area-efficient four-way 32-bit multifunction arithmetic unit has been developed for programmable shaders for handheld 3D graphics systems. It adopts the logarithmic number system (LNS) at the arithmetic core for the single-cycle throughput and the small-size low-power unification of various complicated arithmetic operations such as power, logarithm, trigonometric functions, vector-SIMD multiplication, division, square root and vector dot product. 24-region and 16-region piecewise linear logarithmic and antilogarithmic converters are proposed with 0.8% and 0.02% maximum conversion error, respectively. All the supported operations are implemented with less than 6.3% operation error and unified into a single arithmetic platform with maximum four-cycle latency and single-cycle throughput. A 93 K gate test chip is fabricated using one-poly five-metal 0.18-μm CMOS technology. It operates at 210 MHz with maximum power consumption of 15.3 mW at 1.8 V.
TL;DR: The proof technique is new and relies on the factorization algorithm for multivariate black-box polynomials, on lower bounds on the length of linear locally decodable codes with 2 queries, and on a theorem regarding the structure of identically zero depth-3 circuits with four multiplication gates.
Abstract: In this paper we consider the problem of constructing a small arithmetic circuit for a polynomial for which we have oracle access. Our focus is on n-variate polynomials, over a finite field F, that have depth-3 arithmetic circuits (with an addition gate at the top) with two multiplication gates of degree at most d. We obtain the following results: 1. Multilinear case: When the circuit is multilinear (multiplication gates compute multilinear polynomials) we give an algorithm that outputs, with probability 1 − o(1), all the depth-3 circuits with two multiplication gates computing the polynomial. The running time of the algorithm is poly(n, |F|). 2. General case: When the circuit is not multilinear we give a quasi-polynomial (in n, d, |F|) time algorithm that outputs, with probability 1 − o(1), a succinct representation of the polynomial. In particular, if the depth-3 circuit for the polynomial is not of small depth-3 rank (namely, after removing the g.c.d. of the two multiplication gates, the remaining linear functions span a not too small linear space) then we output the depth-3 circuit itself. In case that the rank is small we output a depth-3 circuit with a quasi-polynomial number of multiplication gates. Prior to our work there have been several interpolation algorithms for restricted models. However, all the techniques used there completely fail when dealing with depth-3 circuits with even just two multiplication gates. Our proof technique is new and relies on the factorization algorithm for multivariate black-box polynomials, on lower bounds on the length of linear locally decodable codes with 2 queries, and on a theorem regarding the structure of identically zero depth-3 circuits with four multiplication gates. ∗Preliminary version appeared in [Shp07]. †Faculty of Computer Science, Technion, Haifa 32000, Israel. Email: shpilka@cs.technion.ac.il. This research was supported by the Israel Science Foundation (grant number 439/06).
TL;DR: In this paper a reconfigurable architecture for a cryptographic processor is presented, and an SPA-resistant algorithm for point multiplication is implemented and shown to be faster than the double-and-add method.
Abstract: There has been a lot of interest in recent years in the problems faced by cryptosystems due to side channel attacks. Algorithms for elliptic curve point scalar multiplication such as the double-and-add method are prone to such attacks. By making use of special addition chains, it is possible to implement a simple power analysis (SPA) resistant cryptosystem. In this paper a reconfigurable architecture for a cryptographic processor is presented. An SPA-resistant algorithm for point multiplication is implemented and is shown to be faster than the double-and-add method. Post place and route results for the processor are given.
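The SPA weakness of double-and-add can be seen in a toy model. The sketch below records the sequence of group operations for double-and-add and for the Montgomery ladder, a well-known regular alternative used here purely for illustration; the paper's own special addition chains are not reproduced. The group is modelled by plain integers under addition modulo 97, which is an assumption made so the example stays self-contained, and all names are hypothetical.

```python
def double_and_add(k, P, add, dbl, zero):
    """Left-to-right double-and-add; returns (k*P, trace of operations)."""
    trace = []
    R = zero
    for bit in bin(k)[2:]:
        R = dbl(R)
        trace.append("D")
        if bit == "1":          # the extra addition happens only for 1 bits
            R = add(R, P)
            trace.append("A")
    return R, "".join(trace)


def montgomery_ladder(k, P, add, dbl, zero):
    """Montgomery ladder; one addition and one doubling per bit regardless of its value."""
    trace = []
    R0, R1 = zero, P            # invariant: R1 - R0 == P
    for bit in bin(k)[2:]:
        if bit == "1":
            R0 = add(R0, R1)
            R1 = dbl(R1)
        else:
            R1 = add(R0, R1)
            R0 = dbl(R0)
        trace.append("AD")
    return R0, "".join(trace)


# Toy group standing in for elliptic curve points: integers modulo 97 under addition.
def toy_add(a, b):
    return (a + b) % 97


def toy_dbl(a):
    return (2 * a) % 97
```

With scalars 11 and 8 (both four bits long), double-and-add produces different traces, which is what a simple power analysis exploits, while the ladder's trace is the same for both.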
TL;DR: The Cayley-Dickson process is used in connection with square array representations of the Cayley-Dickson algebras, which give a convenient representation of the octonion division algebra and a description of octonion multiplication.
Abstract: The Cayley-Dickson process is used in connection with square array representations of the Cayley-Dickson algebras. This involves an array operation on square arrays distinct from matrix multiplication. The arrays give a convenient representation of the octonion division algebra and a description of octonion multiplication. The connection between this description of pure octonion multiplication and seven dimensional real space using products related to the commutator and associator of the octonions is extended to the other Cayley-Dickson algebras and the appropriate real vector space.
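The Cayley-Dickson process itself can be sketched recursively. The code below uses one common sign convention, (a, b)(c, d) = (ac - d*b, da + bc*), with conjugation (a, b)* = (a*, -b); other conventions exist, and the square array representation described in the abstract is not reproduced here. Elements are flat coefficient lists whose length is a power of two, and all function names are hypothetical.

```python
def cd_conj(x):
    """Conjugate of a Cayley-Dickson element given as a flat coefficient list."""
    if len(x) == 1:
        return x[:]
    h = len(x) // 2
    return cd_conj(x[:h]) + [-v for v in x[h:]]


def cd_mul(x, y):
    """Product of two elements of the same dimension (1, 2, 4, 8, ...)."""
    if len(x) == 1:
        return [x[0] * y[0]]
    h = len(x) // 2
    a, b = x[:h], x[h:]
    c, d = y[:h], y[h:]
    # (a, b)(c, d) = (ac - conj(d) b, da + b conj(c))
    first = [p - q for p, q in zip(cd_mul(a, c), cd_mul(cd_conj(d), b))]
    second = [p + q for p, q in zip(cd_mul(d, a), cd_mul(b, cd_conj(c)))]
    return first + second


def basis(i, dim):
    """Unit basis element e_i in the algebra of the given dimension."""
    e = [0] * dim
    e[i] = 1
    return e
```

Length 2 gives the complex numbers, length 4 the quaternions (where i·j = k but j·i = -k), and length 8 the octonions, whose product is no longer associative.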
TL;DR: Jillian, a third-grade teacher, had such an experience when she asked her students to solve the following problem:
Abstract: Teaching Children Mathematics / December 2007/January 2008 Assessing student learning is a very important part of teaching. As teachers, we are continually searching for assessments that give us valid information about what our students are learning. Occasionally, the assessments we use can surprise us with the results they yield. Jillian, a third-grade teacher, had such an experience when she asked her students to solve the following problem: