TL;DR: The Dot-Product Engine (DPE) is developed as a high-density, high-power-efficiency accelerator for approximate matrix-vector multiplication, together with a conversion algorithm that maps arbitrary matrix values to memristor conductances in a realistic crossbar array.
Abstract: Vector-matrix multiplication dominates the computation time and energy for many workloads, particularly neural network algorithms and linear transforms (e.g., the Discrete Fourier Transform). Utilizing the natural current accumulation feature of memristor crossbars, we developed the Dot-Product Engine (DPE) as a high-density, high-power-efficiency accelerator for approximate matrix-vector multiplication. We first invented a conversion algorithm to map arbitrary matrix values appropriately to memristor conductances in a realistic crossbar array, accounting for device physics and circuit issues to reduce computational errors. Accurate device resistance programming in large arrays is enabled by closed-loop pulse tuning and access transistors. To validate our approach, we simulated and benchmarked one of the state-of-the-art neural networks for pattern recognition on the DPEs. The result shows no accuracy degradation compared to the software approach (99% pattern recognition accuracy on the MNIST data set) with only a 4-bit DAC/ADC requirement, while the DPE can achieve a speed-efficiency product of 1,000× to 10,000× compared to a custom digital ASIC.
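As a rough illustration of the crossbar idea in this abstract, the sketch below maps a signed matrix linearly onto an assumed conductance range and computes a vector-matrix product with ideal uniform quantizers standing in for the DAC and ADC. The conductance bounds, the plain linear mapping and all function names are illustrative assumptions; the paper's conversion algorithm also accounts for device physics and circuit issues, which are ignored here.

```python
G_MIN, G_MAX = 1e-6, 1e-4  # assumed conductance range in siemens (hypothetical values)

def map_to_conductance(matrix):
    """Linearly map matrix values onto [G_MIN, G_MAX]; returns (G, scale, offset)."""
    lo = min(min(row) for row in matrix)
    hi = max(max(row) for row in matrix)
    scale = (G_MAX - G_MIN) / (hi - lo) if hi != lo else 1.0
    offset = G_MIN - scale * lo
    g = [[scale * w + offset for w in row] for row in matrix]
    return g, scale, offset

def quantize(value, bits, full_scale):
    """Ideal uniform quantizer on [0, full_scale], standing in for a DAC or ADC."""
    levels = (1 << bits) - 1
    clipped = max(0.0, min(value, full_scale))
    return round(clipped / full_scale * levels) * full_scale / levels

def crossbar_matvec(matrix, x, bits=4, v_max=1.0):
    """Approximate y_j = sum_i W[i][j] * x[i] for inputs x[i] in [0, v_max]."""
    g, scale, offset = map_to_conductance(matrix)
    v = [quantize(xi, bits, v_max) for xi in x]        # DAC: inputs become row voltages
    i_full = G_MAX * v_max * len(x)                    # largest possible column current
    y = []
    for j in range(len(matrix[0])):
        current = sum(g[i][j] * v[i] for i in range(len(v)))  # currents add on column j
        current = quantize(current, bits, i_full)             # ADC reads the column
        y.append((current - offset * sum(v)) / scale)         # undo the affine mapping
    return y
```

Calling `crossbar_matvec(W, x, bits=4)` on a small matrix gives a visibly approximate result, while a larger `bits` value approaches the exact product, which is the trade-off the abstract refers to.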
TL;DR: A protocol for secure multi-party computation of arithmetic circuits over a finite field is proposed; it is based on an arithmetic view of oblivious transfer, with careful consistency checks and other techniques to obtain malicious security.
Abstract: We consider the task of secure multi-party computation of arithmetic circuits over a finite field. Unlike Boolean circuits, arithmetic circuits allow natural computations on integers to be expressed easily and efficiently. In the strongest setting of malicious security with a dishonest majority --- where any number of parties may deviate arbitrarily from the protocol --- most existing protocols require expensive public-key cryptography for each multiplication in the preprocessing stage of the protocol, which leads to a high total cost. We present a new protocol that overcomes this limitation by using oblivious transfer to perform secure multiplications in general finite fields with reduced communication and computation. Our protocol is based on an arithmetic view of oblivious transfer, with careful consistency checks and other techniques to obtain malicious security at a cost of less than 6 times that of semi-honest security. We describe a highly optimized implementation together with experimental results for up to five parties. By making extensive use of parallelism and SSE instructions, we improve upon previous runtimes for MPC over arithmetic circuits by more than 200 times.
TL;DR: This work develops basic arithmetic operations (addition, subtraction, multiplication and division) and some algebraic operations, such as maximum, minimum, square and square root, for continuous Z-numbers.
TL;DR: It is shown that a variant of tensor rank due to Tao gives a quantitative understanding of the notion of unstable tensor from geometric invariant theory.
Abstract: In 2003, Cohn and Umans described a framework for proving upper bounds on the exponent $\omega$ of matrix multiplication by reducing matrix multiplication to group algebra multiplication, and in 2005 Cohn, Kleinberg, Szegedy, and Umans proposed specific conjectures for how to obtain $\omega=2$. In this paper we rule out obtaining $\omega=2$ in this framework from abelian groups of bounded exponent. To do this we bound the size of tricolored sum-free sets in such groups, extending the breakthrough results of Croot, Lev, Pach, Ellenberg, and Gijswijt on cap sets. As a byproduct of our proof, we show that a variant of tensor rank due to Tao gives a quantitative understanding of the notion of unstable tensor from geometric invariant theory.
TL;DR: This paper proves in a mathematically rigorous manner that in partial product perforation the imposed errors are bounded and predictable, depending only on the input distribution, and shows that the technique outperforms state-of-the-art approximation techniques in terms of power dissipation and error.
Abstract: Approximate computing has received significant attention as a promising strategy to decrease power consumption of inherently error tolerant applications. In this paper, we focus on hardware-level approximation by introducing the partial product perforation technique for designing approximate multiplication circuits. We prove in a mathematically rigorous manner that in partial product perforation, the imposed errors are bounded and predictable, depending only on the input distribution. Through extensive experimental evaluation, we apply the partial product perforation method on different multiplier architectures and expose the optimal architecture–perforation configuration pairs for different error constraints. We show that, compared with the respective exact design, the partial product perforation delivers reductions of up to 50% in power consumption, 45% in area, and 35% in critical delay. In addition, the product perforation method is compared with the state-of-the-art approximation techniques, i.e., truncation, voltage overscaling, and logic approximation, showing that it outperforms them in terms of power dissipation and error.
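A minimal software sketch of the perforation idea, assuming an unsigned shift-and-add multiplier in which `k` consecutive partial products starting at index `j` are simply omitted. The function name, the parameters `j` and `k`, and the unsigned model are illustrative assumptions; the paper applies the technique to several hardware multiplier architectures.

```python
def perforated_multiply(a, b, j, k, width=8):
    """Shift-and-add product of unsigned a and b that skips partial products j..j+k-1."""
    total = 0
    for i in range(width):
        if j <= i < j + k:
            continue  # perforated: this partial product is never generated
        if (b >> i) & 1:
            total += a << i
    return total

def perforation_error_bound(a, j, k):
    """Largest possible error in this model: all k skipped bits of b are set."""
    return a * (((1 << k) - 1) << j)
```

In this model the error equals `a` times the value of the skipped bits of `b`, so it is bounded by `perforation_error_bound(a, j, k)` and its distribution follows directly from the distribution of the inputs.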
TL;DR: This paper presents a novel and efficient quantum approach to securely compute the summation and multiplication of multiparty private inputs; it ensures unconditional security and perfect privacy protection based on the physical principles of quantum mechanics.
Abstract: As a fundamental primitive, Secure Multiparty Summation and Multiplication can be used to build complex secure protocols for other multiparty computations, especially numerical computations. However, there is still a lack of systematic and efficient quantum methods to compute Secure Multiparty Summation and Multiplication. In this paper, we present a novel and efficient quantum approach to securely compute the summation and multiplication of multiparty private inputs, respectively. Compared to classical solutions, our proposed approach can ensure unconditional security and perfect privacy protection based on the physical principles of quantum mechanics.
TL;DR: The results demonstrate that solving single-digit addition and subtraction, but not multiplication, is associated with horizontal shifts of attention, and support the idea that mental movements to the left or right of a sequential representation of numbers are elicited during single-digit arithmetic.
TL;DR: The proposed processor employs extensive pipelining techniques for the Karatsuba–Ofman method to achieve high-throughput multiplication, supports the recommended NIST curve P256, and is based on an extended NIST reduction scheme.
Abstract: In this paper, an exportable application-specific instruction-set elliptic curve cryptography processor based on redundant signed digit representation is proposed. The processor employs extensive pipelining techniques for the Karatsuba–Ofman method to achieve high-throughput multiplication. Furthermore, an efficient modular adder without comparison and a high-throughput modular divider, which results in a short datapath for maximized frequency, are implemented. The processor supports the recommended NIST curve P256 and is based on an extended NIST reduction scheme. The proposed processor performs single-point multiplication employing points in affine coordinates in 2.26 ms and runs at a maximum frequency of 160 MHz in a Xilinx Virtex 5 (XC5VLX110T) field-programmable gate array.
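For readers unfamiliar with the Karatsuba–Ofman method mentioned here, the following is a plain recursive software sketch on non-negative Python integers. It only shows the three-multiplication recurrence; the pipelined hardware datapath and the redundant signed digit representation of the paper are not modelled, and the cutoff `threshold_bits` is an arbitrary assumption.

```python
def karatsuba(x, y, threshold_bits=32):
    """Multiply non-negative integers with three half-size products per level."""
    if x.bit_length() <= threshold_bits or y.bit_length() <= threshold_bits:
        return x * y  # small operands: fall back to the built-in multiply
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    z2 = karatsuba(x_hi, y_hi, threshold_bits)  # high * high
    z0 = karatsuba(x_lo, y_lo, threshold_bits)  # low * low
    # The cross terms come from one extra product instead of two.
    z1 = karatsuba(x_hi + x_lo, y_hi + y_lo, threshold_bits) - z2 - z0
    return (z2 << (2 * half)) + (z1 << half) + z0
```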
TL;DR: In this paper, the authors present the first implementation of the 3D SpGEMM formulation that exploits multiple (intranode and internode) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency.
Abstract: Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdos--Renyi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first implementation of the 3D SpGEMM formulation that exploits multiple (intranode and internode) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrencies. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.
TL;DR: This paper analyzes the communication lower bounds and compares the communication costs of various classic parallel algorithms in the context of sparse-dense matrix-matrix multiplication and presents new communication-avoiding algorithms based on a 1D decomposition, called 1.5D.
Abstract: Multiplication of a sparse matrix with a dense matrix is a building block of an increasing number of applications in many areas such as machine learning and graph algorithms. However, most previous work on parallel matrix multiplication considered only both dense or both sparse matrix operands. This paper analyzes the communication lower bounds and compares the communication costs of various classic parallel algorithms in the context of sparse-dense matrix-matrix multiplication. We also present new communication-avoiding algorithms based on a 1D decomposition, called 1.5D, which -- while suboptimal in dense-dense and sparse-sparse cases -- outperform the 2D and 3D variants both theoretically and in practice for sparse-dense multiplication. Our analysis separates one-time costs from per iteration costs in an iterative machine learning context. Experiments demonstrate speedups up to 100x over a baseline 3D SUMMA implementation and show parallel scaling over 10 thousand cores.
TL;DR: A highly optimized software implementation of standards-compliant elliptic curve cryptography (ECC) for wireless sensor nodes equipped with an 8-bit AVR microcontroller and its implementation of scalar multiplication has a highly regular execution profile, which helps to protect against certain side-channel attacks.
Abstract: In this paper, we introduce a highly optimized software implementation of standards-compliant elliptic curve cryptography (ECC) for wireless sensor nodes equipped with an 8-bit AVR microcontroller. We exploit the state-of-the-art optimizations and propose novel techniques to further push the performance envelope of a scalar multiplication on the NIST P-192 curve. To illustrate the performance of our ECC software, we develop prototype implementations of different cryptographic schemes for securing communication in a wireless sensor network, including elliptic curve Diffie–Hellman (ECDH) key exchange, the elliptic curve digital signature algorithm (ECDSA), and the elliptic curve Menezes–Qu–Vanstone (ECMQV) protocol. We obtain record-setting execution times for fixed-base, point variable-base, and double-base scalar multiplication. Compared with the related work, our ECDH key exchange achieves a performance gain of roughly 27% over the best previously published result using the NIST P-192 curve on the same platform, while our ECDSA performs twice as fast as the ECDSA implementation of the well-known TinyECC library. We also evaluate the impact of Karatsuba’s multiplication technique on the overall execution time of a scalar multiplication. In addition to offering high performance, our implementation of scalar multiplication has a highly regular execution profile, which helps to protect against certain side-channel attacks. Our results show that NIST-compliant ECC can be implemented efficiently enough to be suitable for resource-constrained sensor nodes.
TL;DR: A novel division technique called CORDIV is proposed that exploits correlation between the input parameters and not only has lower cost than previous stochastic dividers, but is also significantly more accurate.
Abstract: Stochastic computing (SC) encodes data in the signal probabilities associated with pseudo-random bit-streams. It enables very low-area and low-power arithmetic operations using standard VLSI circuits, and it is also highly error-tolerant. While addition, subtraction and multiplication have extremely simple SC implementations, this is not true for division. Known stochastic dividers employ sequential logic circuits whose accuracy, convergence properties, etc., are unsatisfactory or not well understood. As a result, division is usually avoided or approximated in SC design. We first review and analyze in depth the existing design approaches to stochastic division. We then propose a novel division technique called CORDIV that exploits correlation between the input parameters. CORDIV not only has lower cost than previous stochastic dividers, but is also significantly more accurate. Area is reduced mainly because CORDIV requires less overhead for stochastic number conversion. We provide experimental data showing a typical 3x reduction in area and about a 10x improvement in accuracy.
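The behavioural sketch below shows one way correlation can be exploited for stochastic division. It assumes both bit-streams are generated from the same random sequence, so a 1 in the dividend stream implies a 1 in the divisor stream, and it models the divider as "pass the dividend bit when the divisor bit is 1, otherwise hold the last output". This is an assumed reading of the correlated-division idea, not a gate-level description of the circuit in the paper, and all names are hypothetical.

```python
import random

def correlated_divide(px, py, n_bits=20000, seed=0):
    """Estimate px / py (with 0 <= px <= py, py > 0) from correlated bit-streams."""
    rng = random.Random(seed)
    held = 0   # last output bit, as kept by a one-bit register
    ones = 0
    for _ in range(n_bits):
        r = rng.random()           # shared random number makes the streams correlated
        x = 1 if r < px else 0     # dividend stream
        y = 1 if r < py else 0     # divisor stream
        if y:
            held = x               # divisor bit is 1: sample the dividend bit
        ones += held               # divisor bit is 0: repeat the held value
    return ones / n_bits
```

Whenever the divisor bit is 1, the dividend bit is 1 with probability `px / py`, so the fraction of ones in the output stream approximates the quotient.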
TL;DR: In this article, a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM) is proposed, which correctly describes both the interprocessor communication volume along a critical path in a parallel computation and also the volume of data moving through the memory hierarchy in a sequential computation.
Abstract: We propose a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM), a key computational kernel in scientific computing and data analysis whose performance is often communication bound. This model correctly describes both the interprocessor communication volume along a critical path in a parallel computation and also the volume of data moving through the memory hierarchy in a sequential computation. We show that identifying a communication-optimal algorithm for particular input matrices is equivalent to solving a hypergraph partitioning problem. Our approach is nonzero structure dependent, meaning that we seek the best algorithm for the given input matrices. In addition to our three-dimensional fine-grained model, we also propose coarse-grained one-dimensional and two-dimensional models that correspond to simpler SpGEMM algorithms. We explore the relations between our models theoretically, and we study their performance experimentally in the context of three applications that use SpGEMM as a key computation. For each application, we find that at least one coarse-grained model is as communication efficient as the fine-grained model. We also observe that different applications have affinities for different algorithms. Our results demonstrate that hypergraphs are an accurate model for reasoning about the communication costs of SpGEMM as well as a practical tool for exploring the SpGEMM algorithm design space.
TL;DR: A recently proposed arithmetic circuit representation, the Probabilistic Sentential Decision Diagram (PSDD), is considered, and it is shown that PSDDs support a polytime multiplication operator but do not support a polytime operator for summing out variables.
Abstract: We consider tractable representations of probability distributions and the polytime operations they support. In particular, we consider a recently proposed arithmetic circuit representation, the Probabilistic Sentential Decision Diagram (PSDD). We show that PSDDs support a polytime multiplication operator, while they do not support a polytime operator for summing out variables. A polytime multiplication operator makes PSDDs suitable for a broader class of applications compared to arithmetic circuits, which do not in general support multiplication. As one example, we show that PSDD multiplication leads to a very simple but effective compilation algorithm for probabilistic graphical models: represent each model factor as a PSDD, and then multiply them.
TL;DR: The paper presents the efficiency of the Urdhva Tiryagbhyam (vertical and crosswise) Vedic multiplication method, which differs from conventional multiplication and is described as the most efficient algorithm, giving minimum delay for numbers of any size.
Abstract: This paper describes the design of a high-speed Vedic multiplier that uses the techniques of Vedic mathematics, based on 16 sutras (algorithms), to improve performance. In this paper the efficiency of the Urdhva Tiryagbhyam (vertical and crosswise) Vedic method for multiplication, which is different from the process of normal multiplication, is presented. Urdhva Tiryagbhyam is the most efficient algorithm that gives minimum delay for multiplication for all types of numbers irrespective of their size. The Vedic multiplier is coded in Verilog HDL and simulated and synthesized using XILINX software 12.2 on a Spartan 3E kit. Further, the design of an array multiplier is compared with the proposed multiplier in terms of delay, memory and power consumption.
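A small software sketch of the vertical-and-crosswise scheme may help: every output column collects all digit products whose positions add up to that column, and carries are resolved afterwards. The digit-list representation (least significant digit first) and the function name are assumptions made for illustration; the paper itself describes a Verilog hardware design.

```python
def urdhva_multiply(a_digits, b_digits, base=10):
    """Multiply two digit lists (least significant digit first) column by column."""
    columns = [0] * (len(a_digits) + len(b_digits) - 1)
    for i, a in enumerate(a_digits):
        for j, b in enumerate(b_digits):
            columns[i + j] += a * b  # crosswise products that share a column
    result, carry = [], 0
    for column_sum in columns:
        carry, digit = divmod(column_sum + carry, base)
        result.append(digit)
    while carry:
        carry, digit = divmod(carry, base)
        result.append(digit)
    return result
```

For example, `urdhva_multiply([3, 2, 1], [6, 5, 4])` multiplies 123 by 456; in hardware the column sums are independent of each other, which is where the claimed speed advantage comes from.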
TL;DR: In this article, the authors reverse-engineer the S-Box and reveal its hidden structure, which is based on a sort of 2-round Feistel Network where exclusive-or is replaced by a finite field multiplication.
Abstract: The Russian Federation's standardization agency has recently published a hash function called Streebog and a 128-bit block cipher called Kuznyechik. Both of these algorithms use the same 8-bit S-Box but its design rationale was never made public.
In this paper, we reverse-engineer this S-Box and reveal its hidden structure. It is based on a sort of 2-round Feistel Network where exclusive-or is replaced by a finite field multiplication. This structure is hidden by two different linear layers applied before and after. In total, five different 4-bit S-Boxes, a multiplexer, two 8-bit linear permutations and two finite field multiplications in a field of size $2^{4}$ are needed to compute the S-Box.
The knowledge of this decomposition allows a much more efficient hardware implementation by dividing the area and the delay by 2.5 and 8 respectively. However, the small 4-bit S-Boxes do not have very good cryptographic properties. In fact, one of them has a probability 1 differential.
We then generalize the method we used to partially recover the linear layers used to whiten the core of this S-Box and illustrate it with a generic decomposition attack against 4-round Feistel Networks whitened with unknown linear layers. Our attack exploits a particular pattern arising in the Linear Approximations Table of such functions.
TL;DR: A framework of how students develop their ability to construct and coordinate arithmetical units is laid out, which explains precisely why Aiden and Emma have these difficulties and what changed in order for them to resolve them.
TL;DR: In this paper, the authors studied the randomness complexity of multiplication algorithms secure in the d-probing model and provided new theoretical characterizations and constructions, new practical constructions and a new efficient algorithmic tool to analyze the security of such schemes.
Abstract: Many cryptographic algorithms are vulnerable to side channel analysis and several leakage models have been introduced to better understand these flaws. In 2003, Ishai, Sahai and Wagner introduced the d-probing security model, in which an attacker can observe at most d intermediate values during a processing. They also proposed an algorithm that securely performs the multiplication of 2 bits in this model, using only $d(d+1)/2$ random bits to protect the computation. We study the randomness complexity of multiplication algorithms secure in the d-probing model. We propose several contributions: we provide new theoretical characterizations and constructions, new practical constructions and a new efficient algorithmic tool to analyze the security of such schemes.
We start with a theoretical treatment of the subject: we propose an algebraic model for multiplication algorithms and exhibit an algebraic characterization of the security in the d-probing model. Using this characterization, we prove a linear in d lower bound and a quasi-linear non-constructive upper bound for this randomness cost. Then, we construct a new generic algorithm to perform secure multiplication in the d-probing model that only uses $d + d^2/4$ random bits.
From a practical point of view, we consider the important cases $d \le 4$ that are actually used in current real-life implementations and we build algorithms with a randomness complexity matching our theoretical lower bound for these small-order cases. Finally, still using our algebraic characterization, we provide a new dedicated verification tool, based on information set decoding, which aims at finding attacks on algorithms for fixed order d at a very low computational cost.
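To make the $d(d+1)/2$ randomness cost of the baseline concrete, here is a sketch of the classic Ishai-Sahai-Wagner multiplication of two bits that are each split into $d+1$ XOR shares. It shows the 2003 baseline mentioned in the abstract, not the paper's improved algorithms; the helper names are hypothetical, and a software simulation like this says nothing about real probing resistance.

```python
import secrets
from functools import reduce
from operator import xor

def share_bit(bit, d):
    """Split a bit into d+1 shares whose XOR equals the bit."""
    shares = [secrets.randbits(1) for _ in range(d)]
    shares.append(reduce(xor, shares, bit))
    return shares

def isw_and(a, b):
    """Return (shares of a AND b, number of fresh random bits consumed)."""
    d = len(a) - 1
    r = [[0] * (d + 1) for _ in range(d + 1)]
    random_bits_used = 0
    for i in range(d + 1):
        for j in range(i + 1, d + 1):
            r[i][j] = secrets.randbits(1)  # one fresh random bit per pair i < j
            random_bits_used += 1
            r[j][i] = (r[i][j] ^ (a[i] & b[j])) ^ (a[j] & b[i])
    c = []
    for i in range(d + 1):
        c_i = a[i] & b[i]
        for j in range(d + 1):
            if j != i:
                c_i ^= r[i][j]
        c.append(c_i)
    return c, random_bits_used
```

XOR-ing the output shares cancels every random bit and leaves the product of the two secret bits, while the counter confirms that exactly $d(d+1)/2$ random bits are drawn.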
TL;DR: In this article, a work-efficient multithreaded algorithm for sparse matrix-sparse vector multiplication (SpMSpV) where the matrix, the input vector, and the output vector are all sparse is presented.
Abstract: We design and develop a work-efficient multithreaded algorithm for sparse matrix-sparse vector multiplication (SpMSpV) where the matrix, the input vector, and the output vector are all sparse. SpMSpV is an important primitive in the emerging GraphBLAS standard and is the workhorse of many graph algorithms including breadth-first search, bipartite graph matching, and maximal independent set. As thread counts increase, existing multithreaded SpMSpV algorithms can spend more time accessing the sparse matrix data structure than doing arithmetic. Our shared-memory parallel SpMSpV algorithm is work efficient in the sense that its total work is proportional to the number of arithmetic operations required. The key insight is to avoid having each thread individually scan the list of matrix columns.
Our algorithm is simple to implement and operates on existing column-based sparse matrix formats. It performs well on diverse matrices and vectors with heterogeneous sparsity patterns. A high-performance implementation of the algorithm attains up to 15x speedup on a 24-core Intel Ivy Bridge processor and up to 49x speedup on a 64-core Intel KNL manycore processor. In contrast to implementations of existing algorithms, the performance of our algorithm is sustained on a variety of different input types, including matrices representing scale-free and high-diameter graphs.
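The sequential sketch below shows the SpMSpV operation itself on a column-based matrix, touching only the columns selected by nonzeros of the input vector, so the work is proportional to the number of multiplications performed. The dictionary-based storage and the function name are assumptions chosen for brevity; the paper's contribution is the multithreaded organisation of this computation, which is not reproduced here.

```python
def spmspv(columns, x):
    """Compute y = A x, where columns[j] lists (row, value) pairs of column j
    and x maps column index -> nonzero value. Returns a sparse dict for y."""
    y = {}
    for j, x_j in x.items():                  # visit only nonzeros of x
        for i, a_ij in columns.get(j, ()):    # and only nonzeros of column j
            y[i] = y.get(i, 0) + a_ij * x_j   # accumulate into the sparse output
    return y
```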
TL;DR: Two optimised multiplier architectures for large integer multiplication are proposed, including a low-latency hardware architecture of an integer-FFT multiplier and the use of low Hamming weight (LHW) parameters to create a novel hardware architecture for large integer multiplication in integer-based FHE schemes.
Abstract: Large integer multiplication is a major performance bottleneck in fully homomorphic encryption (FHE) schemes over the integers. In this paper two optimised multiplier architectures for large integer multiplication are proposed. The first of these is a low-latency hardware architecture of an integer-FFT multiplier. Secondly, the use of low Hamming weight (LHW) parameters is applied to create a novel hardware architecture for large integer multiplication in integer-based FHE schemes. The proposed architectures are implemented, verified and compared on the Xilinx Virtex-7 FPGA platform. Finally, the proposed implementations are employed to evaluate the large multiplication in the encryption step of FHE over the integers. The analysis shows a speed improvement factor of up to 26.2 for the low-latency design compared to the corresponding original integer-based FHE software implementation. When the proposed LHW architecture is combined with the low-latency integer-FFT accelerator to evaluate a single FHE encryption operation, the performance results show that a speed improvement by a factor of approximately 130 is possible.
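One reason low Hamming weight operands help can be seen in software: if one operand has only a few set bits, the product reduces to a handful of shifted additions. The sketch below assumes the LHW operand is given as a list of its set-bit positions; this representation and the function name are illustrative assumptions, and the paper's hardware architecture is not described at this level in the abstract.

```python
def lhw_multiply(a, set_bit_positions):
    """Multiply a by the integer whose binary form has ones at set_bit_positions."""
    product = 0
    for position in set_bit_positions:
        product += a << position  # one shift-and-add per set bit
    return product
```

With a Hamming weight of, say, 3 the loop performs three additions regardless of how many bits the operand spans.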
TL;DR: The proposed spectral element method implementation is based on sparse matrix storage of local shape function derivatives calculated at Gauss–Lobatto–Legendre points and it has been found that computation on multicore GPU is up to 14 times faster than on single CPU.
TL;DR: This paper presents a novel hardware architecture that calculates the elliptic curve point multiplication (ECPM) with the best performance reported so far for ECC point multiplication for arbitrary prime field curves without the use of FPGA reconfiguration.
Abstract: Elliptic curve cryptography (ECC) is widely used as an efficient mechanism to secure private data using public-key protocols. This paper focuses on ECC over prime fields (GF(p)). We present a novel hardware architecture that calculates the elliptic curve point multiplication (ECPM). Our processor supports arbitrary prime fields with sizes up to 1024 bits. Different standards that use curves in short Weierstrass form are supported. A Xilinx Virtex-7 implementation of the proposed hardware architecture takes from 0.69 ms for a 192-bit point multiplication up to 9.7 ms for 512-bit. The implementation takes only 20 DSP slices and 6816 LUTs. To the authors' knowledge, this is the best performance reported so far for ECC point multiplication for arbitrary prime field curves without the use of FPGA reconfiguration.
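As a reference point for what point multiplication computes, the sketch below implements textbook double-and-add in affine coordinates on a short Weierstrass curve $y^2 = x^3 + ax + b$ over GF(p). The toy curve in the usage note ($a = 2$, $b = 2$, $p = 17$, base point $(5, 1)$ of order 19) is a common classroom example and is far too small for real use. The code is not constant-time and does not reflect the paper's architecture; `None` is used here to represent the point at infinity.

```python
def ec_add(P, Q, a, p):
    """Add two affine points on y^2 = x^3 + a*x + b over GF(p)."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None  # P + (-P) is the point at infinity
    if P == Q:
        slope = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p   # tangent line
    else:
        slope = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p          # chord line
    x3 = (slope * slope - x1 - x2) % p
    y3 = (slope * (x1 - x3) - y1) % p
    return (x3, y3)

def ec_scalar_mult(k, P, a, p):
    """Compute k*P by scanning the bits of k from least to most significant."""
    result, addend = None, P
    while k:
        if k & 1:
            result = ec_add(result, addend, a, p)
        addend = ec_add(addend, addend, a, p)  # doubling step
        k >>= 1
    return result
```

For instance, `ec_scalar_mult(2, (5, 1), 2, 17)` doubles the base point of the toy curve. The modular inverse via `pow(x, -1, p)` requires Python 3.8 or later.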
TL;DR: This entry concerns the book Border as Method, or, the Multiplication of Labor by Sandro Mezzadra and Brett Neilson (Duke University Press, 2013); the abstract gives only its bibliographic details.
Abstract: Border as Method, or, the Multiplication of Labor. By Sandro Mezzadra and Brett Neilson. xiv and 379 pp.; bibliog., index. Durham, N.C.: Duke University Press, 2013. $24.62 (paper), ISBN 0822355035...
TL;DR: In this article, it is shown that in the honest majority setting, and in the dishonest majority setting with preprocessing, any gate-by-gate protocol must communicate $\Omega(n)$ bits for every multiplication gate, where n is the number of players.
Abstract: Many information-theoretic secure protocols are known for general secure multi-party computation, in the honest majority setting, and in the dishonest majority setting with preprocessing. All known protocols that are efficient in the circuit size of the evaluated function follow the same "gate-by-gate" design pattern: we work through an arithmetic (Boolean) circuit on secret-shared inputs, such that after we process a gate, the output of the gate is represented as a random secret sharing among the players. This approach usually allows non-interactive processing of addition gates but requires communication for every multiplication gate. Thus, while information-theoretic secure protocols are very efficient in terms of computational work, they seem to require more communication and more rounds than computationally secure protocols. Whether this is inherent is an open and probably very hard problem. However, in this work we show that it is indeed inherent for protocols that follow the "gate-by-gate" design pattern. We present the following results: In the honest majority setting, as well as for dishonest majority with preprocessing, any gate-by-gate protocol must communicate $\Omega(n)$ bits for every multiplication gate, where n is the number of players. In the honest majority setting, we show that one cannot obtain a bound that also grows with the field size. Moreover, for a constant number of players, amortizing over several multiplication gates does not allow us to save on the computational work, and, in a restricted setting, we show that this also holds for communication.
All our lower bounds are met up to a constant factor by known protocols that follow the typical gate-by-gate paradigm. Our results imply that a fundamentally new approach must be found in order to improve the communication complexity of known protocols, such as BGW, GMW, SPDZ etc.
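The gate-by-gate pattern can be illustrated with additive secret sharing over a prime field: addition gates are purely local, whereas a multiplication gate needs values to be opened, which is where communication enters. The sketch uses Beaver multiplication triples handed out by a trusted dealer as a stand-in for the preprocessing phase; that choice, the modulus and all function names are assumptions for illustration, and the code simulates all players in one process without any security guarantees.

```python
import secrets

P = 2**61 - 1  # prime modulus chosen for the example (assumption)

def share(x, n):
    """Split x into n additive shares modulo P."""
    parts = [secrets.randbelow(P) for _ in range(n - 1)]
    parts.append((x - sum(parts)) % P)
    return parts

def reconstruct(shares):
    """Opening a value: every player reveals its share (communication)."""
    return sum(shares) % P

def add_gate(xs, ys):
    """Addition is local: each player adds its own two shares."""
    return [(x + y) % P for x, y in zip(xs, ys)]

def make_triple(n):
    """Trusted-dealer preprocessing: shares of random a, b and of c = a*b."""
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    return share(a, n), share(b, n), share(a * b % P, n)

def mul_gate(xs, ys, triple):
    """Multiplication opens d = x - a and e = y - b, then finishes locally."""
    a_sh, b_sh, c_sh = triple
    d = reconstruct([(x - a) % P for x, a in zip(xs, a_sh)])
    e = reconstruct([(y - b) % P for y, b in zip(ys, b_sh)])
    zs = [(c + d * b + e * a) % P for a, b, c in zip(a_sh, b_sh, c_sh)]
    zs[0] = (zs[0] + d * e) % P  # the public term d*e is added by one player only
    return zs
```

The two `reconstruct` calls inside `mul_gate` are the per-gate communication that the lower bounds in the abstract are about; `add_gate` has none.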
TL;DR: In this article, a change of realization of arithmetic, without altering the remaining structures of a given equation, plays the same role as a symmetry transformation, and an appropriate construction of arithmetic turns out to be particularly important for dynamical systems in fractal space-times.
Abstract: Arithmetic operations can be defined in various ways, even if one assumes commutativity and associativity of addition and multiplication, and distributivity of multiplication with respect to addition. In consequence, whenever one encounters ‘plus’ or ‘times’ one has certain freedom of interpreting this operation. This leads to some freedom in definitions of derivatives, integrals and, thus, practically all equations occurring in natural sciences. A change of realization of arithmetic, without altering the remaining structures of a given equation, plays the same role as a symmetry transformation. An appropriate construction of arithmetic turns out to be particularly important for dynamical systems in fractal space-times. Simple examples from classical and quantum, relativistic and nonrelativistic physics are discussed, including the eigenvalue problem for a quantum harmonic oscillator. It is explained why the change of arithmetic is not equivalent to the usual change of variables.
TL;DR: A new class of probabilistic graphical models is proposed that approximates the full joint probability distribution of discrete multivariate random variables by relying only on integer addition/multiplication and binary bit shift operations.
TL;DR: Two possible architectures are proposed for a Vedic real multiplier based on the URDHVA TIRYAKBHYAM (vertically and crosswise) sutra of Indian Vedic mathematics, and an expression for the path delay of an N×N Vedic real multiplier with the minimum path delay architecture is developed.
Abstract: Complex number multiplication is a key arithmetic operation that must be performed at high speed and with low power consumption in high-performance systems such as wireless communications. Hence, in this paper, two possible architectures are proposed for a Vedic real multiplier based on the URDHVA TIRYAKBHYAM (vertically and crosswise) sutra of Indian Vedic mathematics, and an expression for the path delay of an N×N Vedic real multiplier with the minimum path delay architecture is developed. Then, architectures for the four-real-multiplier and three-real-multiplier solutions of the complex multiplier are presented. The architecture of the Vedic real multiplier with minimum path delay is used in the implementation of the complex multiplier. The architectures for the four-multiplier and three-multiplier solutions of the complex multiplier for 32 × 32 bit complex number multiplication are coded in VHDL and implemented through Xilinx ISE 13.4 navigator and Modelsim 5.6. Finally, the results are compared with those of the four and three real multiplier solutions using the conventional Booth and Array multipliers.
TL;DR: A novel Double Operand Scanning (DOS) method is introduced to speed up multi-precision squaring with non-redundant representations on SIMD architectures; it is compatible with separated Montgomery algorithms and highly efficient for the RSA cryptosystem.
Abstract: Advanced modern processors support Single Instruction Multiple Data (SIMD) instructions (e.g. Intel-AVX, ARM-NEON), and a massive body of research has been conducted on vector-parallel implementations of modular arithmetic, which is a crucial component of modern public-key cryptography such as RSA, ElGamal, DSA and ECC. In this paper, we introduce a novel Double Operand Scanning (DOS) method to speed up multi-precision squaring with non-redundant representations on SIMD architecture. The DOS technique partly doubles the operands and computes the squaring operation without Read-After-Write (RAW) dependencies between source and destination variables. Furthermore, we present Karatsuba Cascade Operand Scanning (KCOS) multiplication and Karatsuba Double Operand Scanning (KDOS) squaring by adopting additive and subtractive Karatsuba’s methods, respectively. The proposed multiplication and squaring methods are compatible with separated Montgomery algorithms and are highly efficient for the RSA cryptosystem. Finally, our proposed multiplication/squaring, separated Montgomery multiplication/squaring and RSA encryption outperform the best-known results by 22/41%, 25/33% and 30% on the Cortex-A15 platform.
TL;DR: It is proved that border rank algorithms for tensors with symmetry come in families that include representatives with normal forms, which will be useful both to develop new efficient algorithms and to prove lower complexity bounds.
Abstract: We establish basic information about border rank algorithms for the matrix multiplication tensor and other tensors with symmetry. We prove that border rank algorithms for tensors with symmetry (such as matrix multiplication and the determinant polynomial) come in families that include representatives with normal forms. These normal forms will be useful both to develop new efficient algorithms and to prove lower complexity bounds. We derive a border rank version of the substitution method used in proving lower bounds for tensor rank. We use this border-substitution method and a normal form to improve the lower bound on the border rank of matrix multiplication by one, to $2n^2 - n + 1$. We also point out difficulties that will be formidable obstacles to future progress on lower complexity bounds for tensors because of the "wild" structure of the Hilbert scheme of points.
TL;DR: The proposed designs for IDDMM are well suited to be implemented in modern FPGAs, making use of available dedicated multipliers and memory blocks reducing drastically the FPGA's standard logic while keeping an acceptable performance compared with other implementation approaches.
Abstract: This study presents a scalable hardware architecture for modular multiplication in prime fields GF($p$). A novel iterative digit-digit Montgomery multiplication (IDDMM) algorithm is proposed and two hardware architectures that compute that algorithm are described. The input operands (multiplicand, multiplier and modulus) are represented using as radix $\beta = 2^k$. Multiplication over GF($p$) is possible using almost the same hardware since the complexity of the multiplier's kernel module depends mainly on $k$ and not on $p$. The novel hardware architectures of GF($p$) multipliers were evaluated on three Xilinx FPGA families. Design trade-offs were analysed considering different operand sizes commonly used in cryptography and different digit sizes. The proposed designs for IDDMM are well suited to be implemented in modern FPGAs, making use of available dedicated multipliers and memory blocks, drastically reducing the FPGA's standard logic while keeping an acceptable performance compared with other implementation approaches. From the Virtex5 implementation, the proposed MM multiplier reaches a throughput of 242 Mbps using only 219 FPGA slices and achieving a 1024-bit modular multiplication in 4.21 μs. This is 26 times less area resources than similar related works in the literature with an improved efficiency of 7x.
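For orientation, the sketch below shows a generic digit-serial Montgomery multiplication with radix $\beta = 2^k$: one digit of the first operand is consumed per iteration and the running sum is kept divisible by $\beta$ so that it can be shifted down exactly. It is the standard word-serial formulation, not the digit-digit IDDMM algorithm of the paper (which also scans the second operand and the modulus digit by digit), and the function name and argument order are assumptions.

```python
def montgomery_multiply(a, b, p, k, n):
    """Return a * b * beta^(-n) mod p with beta = 2^k.
    Assumes p is odd, 0 <= a, b < p and beta^n > p."""
    beta_mask = (1 << k) - 1
    p_prime = (-pow(p, -1, 1 << k)) % (1 << k)   # -p^(-1) mod beta
    s = 0
    for i in range(n):
        a_i = (a >> (k * i)) & beta_mask         # i-th radix-beta digit of a
        s += a_i * b
        q = ((s & beta_mask) * p_prime) & beta_mask
        s = (s + q * p) >> k                     # s + q*p is divisible by beta
    if s >= p:
        s -= p                                   # single final correction
    return s
```

The per-iteration quotient digit `q` depends only on the lowest digit of the running sum, which is why the size of the kernel is governed by `k` rather than by the size of `p`, as the abstract notes.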