TL;DR: This work suggests SELL-$C$-$\sigma$, a variant of Sliced ELLPACK, as a SIMD-friendly data format that combines long-standing ideas from general-purpose graphics processing units and vector computer programming, and shows its suitability on a variety of hardware platforms.
Abstract: Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-$C$-$\sigma$, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from general-purpose graphics processing units and vector computer programming. We discuss the advantages of SELL-$C$-$\sigma$ compared to established formats like Compressed Row Storage and ELLPACK and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi, and Nvidia Tesla K20) for a wi...
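The following is a minimal sketch of how a SELL-$C$-$\sigma$-style layout could be built and used for spMVM. The abstract above does not spell out the storage details, so the layout here is an assumption based on the usual description of sliced ELLPACK formats: rows are sorted by length inside windows of $\sigma$ rows, grouped into chunks of $C$ rows, zero-padded to the longest row of the chunk, and stored column-major within the chunk. The names `build_sell_c_sigma` and `sell_spmv` are hypothetical.

```python
def build_sell_c_sigma(rows, C, sigma):
    """Build a SELL-C-sigma-like structure (illustrative layout, see lead-in).

    rows:  rows[i] is a list of (column, value) pairs for matrix row i.
    C:     chunk height, i.e. the number of rows processed together.
    sigma: sorting window; rows are sorted by length only inside each window.
    """
    n = len(rows)
    perm = []
    for start in range(0, n, sigma):
        window = list(range(start, min(start + sigma, n)))
        window.sort(key=lambda r: len(rows[r]), reverse=True)
        perm.extend(window)

    chunks = []
    for start in range(0, n, C):
        chunk_rows = perm[start:start + C]
        width = max(len(rows[r]) for r in chunk_rows)
        cols, vals = [], []
        # Column-major inside the chunk: entry j of all C lanes is contiguous.
        for j in range(width):
            for lane in range(C):
                if lane < len(chunk_rows) and j < len(rows[chunk_rows[lane]]):
                    c, v = rows[chunk_rows[lane]][j]
                else:
                    c, v = 0, 0.0  # zero padding
                cols.append(c)
                vals.append(v)
        chunks.append((chunk_rows, width, cols, vals))
    return chunks


def sell_spmv(chunks, C, x, n):
    """Compute y = A x; the inner lane loop is what a SIMD unit would execute."""
    y = [0.0] * n
    for chunk_rows, width, cols, vals in chunks:
        acc = [0.0] * C
        for j in range(width):
            base = j * C
            for lane in range(C):
                acc[lane] += vals[base + lane] * x[cols[base + lane]]
        for lane, r in enumerate(chunk_rows):
            y[r] = acc[lane]  # undo the row permutation on output
    return y
```

In this sketch, `C = 1` degenerates to a row-by-row layout without padding, while a chunk spanning all rows with `sigma = 1` behaves like plain ELLPACK; a larger `sigma` reduces padding at the cost of a wider row permutation.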
TL;DR: The DBCSR (Distributed Block Compressed Sparse Row) library for scalable sparse matrix–matrix multiplication and its use in the CP2K program for linear-scaling quantum-chemical calculations is presented.
Abstract: Efficient parallel multiplication of sparse matrices is key to enabling many large-scale calculations. This article presents the DBCSR (Distributed Block Compressed Sparse Row) library for scalable sparse matrix–matrix multiplication and its use in the CP2K program for linear-scaling quantum-chemical calculations. The library combines several approaches to implement sparse matrix multiplication in a way that performs well and is demonstrably scalable. Parallel communication has well-defined limits. Data volume decreases as O(1/P) with increasing process counts P and every process communicates with at most O(P) others. Local sparse matrix multiplication is handled efficiently using a combination of techniques: blocking elements together in an application-relevant way, an autotuning library for small matrix multiplications, cache-oblivious recursive multiplication, and multithreading. Additionally, on-the-fly filtering not only increases sparsity but also avoids performing calculations that fall below the filtering threshold. We demonstrate and analyze the performance of the DBCSR library and its various scaling behaviors.
TL;DR: This paper proposes an efficient implementation of point multiplication on Koblitz curves targeting extremely-constrained, secure applications and introduces a new technique for point addition in affine coordinates that requires fewer registers.
Abstract: Recently, considerable research has been performed in cryptography and security to optimize the area, power, timing, and energy needed for the point multiplication operations over binary elliptic curves. In this paper, we propose an efficient implementation of point multiplication on Koblitz curves targeting extremely-constrained, secure applications. We utilize the Gaussian normal basis (GNB) representation of field elements over GF(2^m) and employ an efficient bit-level GNB multiplier. One advantage of this GNB multiplier is that we are able to reduce the hardware complexity through sharing the addition/accumulation with other field additions. We exploit the special property of the normal basis representation that squarings can be implemented very efficiently in hardware by rewiring alone. We introduce a new technique for point addition in affine coordinates which requires fewer registers. Based on this technique, we propose an extremely small processor architecture for point multiplication. Through application-specific integrated circuit (ASIC) implementations, we evaluate the area, performance, and energy consumption of the proposed crypto-processor. Utilizing two different working frequencies, it is shown that the proposed architecture reaches better results than previous works, making it suitable for extremely-constrained, secure environments.
TL;DR: This article investigated whether these errors can be prevented by presenting addition and multiplication problems in such a way that students are encouraged to compare the problems at a deeper level and found that the contrast group clearly outperformed the sequential group.
TL;DR: Findings show that the learning of multiplications is particularly interference prone because of feature overlap and that people who are sensitive to this parameter therefore encounter difficulties in memorizing arithmetic facts.
Abstract: The difficulty in memorizing arithmetic facts is a general and persistent hallmark of math learning disabilities. It has recently been suggested that hypersensitivity to interference could prevent a person from storing arithmetic facts. The similarity between arithmetic facts would provoke interference, and learners who are hypersensitive to interference would therefore encounter difficulties in storing arithmetic facts in long-term memory. In this study, we created a measure of the interference weight for each multiplication by measuring the overlap of digits between multiplications. First, we tested whether the interference parameter could predict performance across multiplications by analyzing the data from undergraduates published by Campbell (1997). The interference parameter substantially predicted performance across multiplications. Similarly, the performance across multiplications was substantially determined by the interference parameter in 3rd-grade children, 5th-grade children, and undergraduates we tested. Second, we tested whether people with poor arithmetic facts abilities were particularly sensitive to the interference parameter. We tested this hypothesis in typical development by analyzing the data from the 3rd-grade children, 5th-grade children, and undergraduates. We analyzed data with regard to atypical development from a published case study of dyscalculia as well as from 4th-grade children, with either poor or good multiplication skills, tested twice 1 year apart. Results showed that the individual sensitivity to the interference parameter determined part of the individual differences in multiplication performance in all data sets. These findings show that the learning of multiplications is particularly interference prone because of feature overlap and that people who are sensitive to this parameter therefore encounter difficulties in memorizing arithmetic facts.
TL;DR: In this paper, the complexity of solving nonsingular linear systems of equations with structured matrices has been studied and the authors present a Boolean complexity analysis for the problem of polynomial multiplication and division.
Abstract: We estimate the Boolean complexity of multiplication of structured matrices by a vector and the solution of nonsingular linear systems of equations with these matrices. We study four basic and most popular classes, that is, Toeplitz, Hankel, Cauchy and Vandermonde matrices, for which the cited computational problems are equivalent to the task of polynomial multiplication and division and polynomial and rational multipoint evaluation and interpolation. The Boolean cost estimates for the latter problems have been obtained by Kirrinnis in [10], except for rational interpolation, and we supply them now. All known Boolean cost estimates from [10] for these problems rely on using Kronecker product. This implies the d-fold precision increase for the d-th degree output, but we avoid such an increase by relying on distinct techniques based on employing FFT. Furthermore we simplify the analysis and make it more transparent by combining the representations of our tasks and algorithms both via structured matrices and via polynomials and rational functions. This also enables further extensions of our estimates to cover Trummer's important problem and computations with the popular classes of structured matrices that generalize the four cited basic matrix classes.
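The equivalence between structured matrix-vector products and polynomial multiplication mentioned above can be illustrated for the Toeplitz case. This is a small sketch, not the paper's algorithm: it uses a naive quadratic polynomial product where a fast method would use the FFT, and the function names and the indexing convention $T_{ij} = t_{i-j}$ are assumptions made for the example.

```python
def poly_mul(a, b):
    """Naive product of two polynomials given as coefficient lists (low degree first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def toeplitz_matvec(t, x):
    """Multiply the n-by-n Toeplitz matrix T[i][j] = t_{i-j} by x.

    t lists the 2n-1 diagonal values t_{-(n-1)}, ..., t_0, ..., t_{n-1}.
    The result is the middle slice of the coefficients of t(z) * x(z).
    """
    n = len(x)
    assert len(t) == 2 * n - 1
    product = poly_mul(t, x)
    return product[n - 1:2 * n - 1]
```

Coefficient $i + n - 1$ of the product equals $\sum_j t_{i-j} x_j$, which is exactly row $i$ of $Tx$, so any fast polynomial multiplication routine can be substituted for `poly_mul`.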
TL;DR: The main goal of this paper is to discuss security issues that may arise using either the Gallant–Lambert–Vanstone or Galbraith–Lin–Scott approaches to elliptic curve cryptography.
Abstract: The fastest implementations of elliptic curve cryptography in recent years have been achieved on curves endowed with nontrivial efficient endomorphisms, using techniques due to Gallant–Lambert–Vanstone (GLV) and Galbraith–Lin–Scott (GLS). In such implementations, a scalar multiplication $[k]P$ is computed as a double multiplication $[k_1]P + [k_2]\psi(P)$, for $\psi$ an efficient endomorphism and $k_1, k_2$ appropriate half-size scalars. To compute a random scalar multiplication, one can either select the scalars $k_1, k_2$ at random, hoping that the resulting $k = k_1 + k_2 \lambda$ is close to uniform, or pick a uniform $k$ instead and decompose it as $k_1 + k_2 \lambda$ afterwards. The main goal of this paper is to discuss security issues that may arise using either approach.
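The second approach (pick $k$, then decompose it) can be sketched with the textbook lattice method: run the extended Euclidean algorithm on $(n, \lambda)$ to find two short vectors $(a, b)$ with $a + b\lambda \equiv 0 \pmod n$, then round $k$ to the nearest lattice point. This is an illustrative sketch only, not the procedure analysed in the paper. The function names are hypothetical, and the additive group of integers modulo $n$ with $\psi(P) = \lambda P$ stands in for an elliptic curve.

```python
from fractions import Fraction


def glv_basis(n, lam):
    """Two short vectors (a, b) with a + b*lam = 0 (mod n), via extended Euclid."""
    # Invariant: every pair (r, t) satisfies r = s*n + t*lam for some integer s.
    seq = [(n, 0), (lam, 1)]
    while seq[-1][0] != 0:
        (r0, t0), (r1, t1) = seq[-2], seq[-1]
        q = r0 // r1
        seq.append((r0 - q * r1, t0 - q * t1))
    # First remainder below sqrt(n).
    i = next(j for j, (r, _) in enumerate(seq) if r * r < n)
    v1 = (seq[i][0], -seq[i][1])
    candidates = [seq[i - 1]] + ([seq[i + 1]] if i + 1 < len(seq) else [])
    r, t = min(candidates, key=lambda rt: rt[0] ** 2 + rt[1] ** 2)
    return v1, (r, -t)


def glv_decompose(k, n, lam):
    """Return (k1, k2) with k = k1 + k2*lam (mod n), using rounding to the lattice."""
    (a1, b1), (a2, b2) = glv_basis(n, lam)
    det = a1 * b2 - a2 * b1
    c1 = round(Fraction(b2 * k, det))
    c2 = round(Fraction(-b1 * k, det))
    k1 = k - c1 * a1 - c2 * a2
    k2 = -c1 * b1 - c2 * b2
    return k1, k2


def double_multiplication(k, P, n, lam):
    """Toy group Z_n: [k]P = k*P mod n and psi(P) = lam*P mod n."""
    k1, k2 = glv_decompose(k, n, lam)
    return (k1 * P + k2 * (lam * P)) % n
```

The congruence $k \equiv k_1 + k_2\lambda \pmod n$ holds for any rounding choice; how small $k_1$ and $k_2$ turn out depends on the lengths of the basis vectors. For $n = 101$, $\lambda = 10$ and $k = 57$ the sketch yields $(k_1, k_2) = (-4, -4)$.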
TL;DR: One of the newly proposed methods attains the best average result in experiments on a large set of matrices and obtains a parallel efficiency of 90 percent, while on average it performs close to 60 percent.
Abstract: The sparse matrix-vector multiplication is an important computational kernel, but is hard to efficiently execute even in the sequential case. The problems--namely low arithmetic intensity, inefficient cache use, and limited memory bandwidth--are magnified as the core count on shared-memory parallel architectures increases. Existing techniques are discussed in detail, and categorized chiefly based on their distribution types. Based on this, new parallelization techniques are proposed. The theoretical scalability and memory usage of the various strategies are analyzed, and experiments on multiple NUMA architectures confirm the validity of the results. One of the newly proposed methods attains the best average result in experiments on a large set of matrices. In one of the experiments it obtains a parallel efficiency of 90 percent, while on average it performs close to 60 percent.
TL;DR: This paper presents novel randomized constructions of matrices with the restricted isometry property (RIP) that support fast matrix-vector multiplication and need fewer rows than previously known; in compressed sensing, the RIP is a sufficient condition for the efficient reconstruction of a nearly $k$-sparse vector $x \in \mathbb{C}^d$ from linear measurements $\Phi x$.
Abstract: In this paper, we present novel constructions of matrices with the restricted isometry property (RIP) that support fast matrix-vector multiplication. Our guarantees are the best known, and can also be used to obtain the best known guarantees for fast Johnson-Lindenstrauss transforms. In compressed sensing, the restricted isometry property is a sufficient condition for the efficient reconstruction of a nearly $k$-sparse vector $x \in \mathbb{C}^d$ from $m$ linear measurements $\Phi x$. It is desirable for $m$ to be small, and further it is desirable for $\Phi$ to support fast matrix-vector multiplication. Among other applications, fast multiplication improves the runtime of iterative recovery algorithms which repeatedly multiply by $\Phi$ or $\Phi^*$. The main contribution of this work is a novel randomized construction of RIP matrices $\Phi \in \mathbb{C}^{m \times d}$, preserving the $\ell_2$ norms of all $k$-sparse vectors with distortion $1 + \varepsilon$, where the matrix-vector multiply $\Phi x$ can be computed in nearly linear time. The number of rows $m$ is on the order of $\varepsilon^{-2} k \log d \log^2(k \log d)$, an improvement on previous analyses by a logarithmic factor. Our construction, together with a connection between RIP matrices and the Johnson-Lindenstrauss lemma in [Krahmer-Ward, SIAM. J. Math. Anal. 2011], also implies fast Johnson-Lindenstrauss embeddings with asymptotically fewer rows than previously known. Our construction is actually a recipe for improving any existing family of RIP matrices. Briefly, we apply an appropriate sparse hash matrix with sign flips to any suitable family of RIP matrices. We show that the embedding properties of the original family are maintained, while at the same time improving the number of rows. The main tool in our analysis is a recent bound for the supremum of certain types of Rademacher chaos processes in [Krahmer-Mendelson-Rauhut, Comm. Pure Appl. Math. to appear].
TL;DR: This is the first parallel solution for Caputo fractional reaction–diffusion equation with explicit difference method and the experimental results show that the parallel solution compares well with the analytic solution.
Abstract: The computational complexity of the Caputo fractional reaction–diffusion equation is $O(MN^2)$, compared with $O(MN)$ for the traditional reaction–diffusion equation, where $M$ and $N$ are the numbers of time steps and grid points, respectively. An efficient parallel solution for the Caputo fractional reaction–diffusion equation with the explicit difference method is proposed. The parallel solution, which is implemented with the MPI parallel programming model, consists of three procedures: preprocessing, parallel solver and postprocessing. The parallel solver involves the parallel tridiagonal matrix-vector multiplication, vector-vector addition and constant-vector multiplication. The sum of constant-vector multiplications is optimized. To the authors' knowledge, this is the first parallel solution for the Caputo fractional reaction–diffusion equation. The experimental results show that the parallel solution compares well with the analytic solution. The parallel solution on a single Intel Xeon X5540 CPU runs more than three times faster than the serial solution on a single X5540 CPU core, and scales quite well on a distributed memory cluster system.
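The tridiagonal matrix-vector product at the core of the solver is easy to split by rows, which is what makes it a natural fit for distributed memory. The sketch below is serial Python rather than MPI and is not the authors' implementation: each row block stands in for the work of one process, which would need only one neighbouring vector entry on each side of its block. Function names and the three-array storage are assumptions for illustration.

```python
def tridiag_matvec_rows(lower, diag, upper, x, lo, hi):
    """Rows lo..hi-1 of y = A x for a tridiagonal A.

    lower[i] = A[i+1][i], diag[i] = A[i][i], upper[i] = A[i][i+1].
    """
    n = len(diag)
    y = []
    for i in range(lo, hi):
        acc = diag[i] * x[i]
        if i > 0:
            acc += lower[i - 1] * x[i - 1]
        if i < n - 1:
            acc += upper[i] * x[i + 1]
        y.append(acc)
    return y


def tridiag_matvec(lower, diag, upper, x, parts=1):
    """Full product assembled from `parts` independent row blocks."""
    n = len(diag)
    bounds = [n * p // parts for p in range(parts + 1)]
    y = []
    for lo, hi in zip(bounds, bounds[1:]):
        y.extend(tridiag_matvec_rows(lower, diag, upper, x, lo, hi))
    return y
```

Because a block touches only `x[lo - 1]` and `x[hi]` outside its own range, the communication per block is two values regardless of the block size.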
TL;DR: A new approach to design multiplierless constant rotators based on a combined coefficient selection and shift-and-add implementation (CCSSI) that provides an extended design space that offers a larger number of alternatives with respect to previous works.
Abstract: This paper presents a new approach to design multiplierless constant rotators. The approach is based on a combined coefficient selection and shift-and-add implementation (CCSSI) for the design of the rotators. First, complete freedom is given to the selection of the coefficients, i.e., no constraints to the coefficients are set in advance and all the alternatives are taken into account. Second, the shift-and-add implementation uses advanced single constant multiplication (SCM) and multiple constant multiplication (MCM) techniques that lead to low-complexity multiplierless implementations. Third, the design of the rotators is done by a joint optimization of the coefficient selection and shift-and-add implementation. As a result, the CCSSI provides an extended design space that offers a larger number of alternatives with respect to previous works. Furthermore, the design space is explored in a simple and efficient way. The proposed approach has wide applications in numerous hardware scenarios. This includes rotations by single or multiple angles, rotators in single or multiple branches, and different scaling of the outputs. Experimental results for various scenarios are provided. In all of them, the proposed approach achieves significant improvements with respect to state of the art.
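To make the shift-and-add idea concrete, the sketch below multiplies by a fixed integer constant using only shifts, additions and subtractions. It uses the canonical signed digit (non-adjacent form) recoding as a simple stand-in; the advanced SCM and MCM techniques and the coefficient selection of the CCSSI approach are not reproduced here, and the function names are hypothetical.

```python
def csd_digits(c):
    """Signed digits in {-1, 0, 1} of a positive constant c, least significant first.

    No two adjacent digits are nonzero (non-adjacent form).
    """
    assert c > 0
    digits = []
    while c != 0:
        if c & 1:
            d = 2 - (c & 3)  # +1 if c mod 4 == 1, -1 if c mod 4 == 3
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def shift_add_multiply(x, c):
    """Compute x * c with shifts, adds and subtracts only."""
    acc = 0
    for shift, d in enumerate(csd_digits(c)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc
```

For example, 45 is recoded as $64 - 16 - 4 + 1$, so a multiplication by 45 costs three adders or subtractors. A constant rotator would apply such networks to the integer-scaled cosine and sine coefficients of its rotation angle.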
TL;DR: Chinese Remainder Theorem (CRT) is employed to reduce the computational complexity of the large-number element-by-element modular multiplication of the leveled fully homomorphic encryption scheme.
Abstract: Gentry introduced the first plausible fully homomorphic encryption (FHE) scheme, which was considered a major breakthrough in cryptography. Several FHE schemes have been proposed to make FHE more efficient for practical applications since then. The leveled fully homomorphic scheme is among the most well-known schemes. In the leveled FHE scheme, large-number matrix-vector multiplication is a crucial part of the encryption algorithm. In this paper, the Chinese Remainder Theorem (CRT) is employed to reduce the computational complexity of the large-number element-by-element modular multiplication. The first step is called decomposition, in which each large-number element in the matrix and vector is decomposed into many small words. The next step is the vector operation, which performs the modular multiplications and additions of the decomposed small words. Finally, the matrix-vector multiplication results can be obtained through reconstruction. We compare the CRT-based method with the Number Theory Library (NTL), showing that the proposed method is about 7.8 times faster when executing on CPU. In addition, it is observed that the vector operation takes up to 99.6% of the total computation time and the reconstruction only takes 0.4%. Therefore, GPU acceleration is employed to speed up the vector operations. Experimental results show that the GPU implementation of the CRT-based method is 35.2 times faster than the same method implemented on CPU and is 273.6 times faster than the NTL library on CPU.
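The three steps named above (decomposition, vector operation, reconstruction) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the moduli are four Mersenne primes chosen only because they are pairwise coprime, the final reduction modulo `q` after reconstruction is an assumed design, and all function names are hypothetical.

```python
from math import prod

# Pairwise coprime word-sized moduli (Mersenne primes), chosen for illustration.
MODULI = [2**13 - 1, 2**17 - 1, 2**19 - 1, 2**31 - 1]


def decompose(value, moduli):
    """Decomposition: one small residue per modulus."""
    return [value % m for m in moduli]


def reconstruct(residues, moduli):
    """Reconstruction of the unique value below prod(moduli) with these residues."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the modular inverse
    return total % M


def matvec_mod_q(A, x, q, moduli=MODULI):
    """Compute (A x) mod q for entries in [0, q) using small-word arithmetic."""
    n = len(x)
    # The exact dot products must fit below the product of the moduli.
    assert prod(moduli) > n * (q - 1) ** 2
    A_res = [[decompose(a, moduli) for a in row] for row in A]
    x_res = [decompose(v, moduli) for v in x]
    result = []
    for row in A_res:
        # Vector operation: independent small modular dot products per modulus.
        dots = [
            sum(row[j][k] * x_res[j][k] for j in range(n)) % m
            for k, m in enumerate(moduli)
        ]
        result.append(reconstruct(dots, moduli) % q)
    return result
```

The per-modulus dot products are independent of each other, which is the property that makes the vector operation step a good candidate for GPU execution.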
TL;DR: In this paper, elliptic curve cryptography (ECC) is used to make a fast and very low-power software implementation of a public-key cryptography algorithm on the ARM Cortex-M0+.
Abstract: In this work, elliptic curve cryptography (ECC) is used to make a fast and very low-power software implementation of a public-key cryptography algorithm on the ARM Cortex-M0+. An optimization of the Lopez-Dahab field multiplication method is proposed, which aims to reduce the number of memory accesses, as this is a slow operation on the target platform. A mixed C and assembly implementation was made; a random point multiplication requires 34.16 μJ, whereas our fixed point multiplication requires 20.63 μJ. Our implementation's energy consumption beats all other software implementations, on any platform, by a factor of at least 3.3.
TL;DR: This work investigates different solving techniques, corresponding to different attacker models, for a side-channel attack on field multiplication in GF(2^n), particularly GF(2^128), and refines the attack for particular implementations of the multiplication.
Abstract: In this paper, we study the side-channel security of the field multiplication in GF(2^n). We particularly focus on GF(2^128) multiplication, which is the one used in the authentication part of AES-GCM, but the proposed attack also applies to other binary extensions. In a hardware implementation using a 128-bit multiplier, the full 128-bit secret is manipulated at once. In this context, classical DPA attacks based on the divide and conquer strategy cannot be applied. In this work, the algebraic structure of the multiplication is leveraged to recover bits of information about the secret multiplicand without having to perform any key-guess. To do so, the leakage corresponding to the writing of the multiplication output into a register is considered. It is assumed to follow a Hamming weight/distance leakage model. Under these particular, yet easily met, assumptions, we exhibit a connection between the key recovery problem and some classical coding and Learning Parities with Noise problems with certain instance parameters. In our case, the noise is very high, but the length of the secret is rather short. In this work we investigate different solving techniques corresponding to different attacker models and eventually refine the attack when considering particular implementations of the multiplication.
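As a small illustration of the objects involved, the sketch below multiplies in a binary field and computes the Hamming weight of the result, which is the quantity a Hamming weight leakage model assumes to be observable (with noise). It is not the attack itself. To keep the numbers checkable, the example uses the 8-bit field GF(2^8) with the AES reduction polynomial rather than the 128-bit field of AES-GCM, and the function names are hypothetical.

```python
def gf2n_mul(a, b, modulus, n):
    """Multiply a and b in GF(2^n); `modulus` includes the x^n term."""
    # Carry-less (XOR) schoolbook product.
    result = 0
    for i in range(n):
        if (b >> i) & 1:
            result ^= a << i
    # Reduce from the top bit down to bit n.
    for bit in range(2 * n - 2, n - 1, -1):
        if (result >> bit) & 1:
            result ^= modulus << (bit - n)
    return result


def hamming_weight(value):
    """Number of set bits, the modelled leakage of a register write."""
    return bin(value).count("1")


AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1
product = gf2n_mul(0x57, 0x83, AES_POLY, 8)
leak = hamming_weight(product)
```

For a fixed secret operand, multiplication is linear over GF(2) in the other operand, so each output bit is a parity of secret bits; this linearity is what connects noisy Hamming weight observations to coding and Learning Parities with Noise problems.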
TL;DR: This paper shows that a change of realization of arithmetic, without altering the remaining structures of a given equation, plays the same role as a symmetry transformation; it also explains why the change of arithmetic is not equivalent to the usual change of variables and why it may have implications for the Bell theorem.
Abstract: Arithmetic operations can be defined in various ways, even if one assumes commutativity and associativity of addition and multiplication, and distributivity of multiplication with respect to addition. In consequence, whenever one encounters 'plus' or 'times' one has certain freedom of interpreting this operation. This leads to some freedom in definitions of derivatives, integrals and, thus, practically all equations occurring in natural sciences. A change of realization of arithmetic, without altering the remaining structures of a given equation, plays the same role as a symmetry transformation. An appropriate construction of arithmetic turns out to be particularly important for dynamical systems in fractal space-times. Simple examples from classical and quantum, relativistic and nonrelativistic physics are discussed, including the eigenvalue problem for a quantum harmonic oscillator. It is explained why the change of arithmetic is not equivalent to the usual change of variables, and why it may have implications for the Bell theorem.
TL;DR: A new diagonal storage format, a hybrid of the diagonal format (DLA) and the compressed sparse row format (CSR) (HDC) will be presented, which overcomes the inefficiency of DLA in storing irregular matrices and the imbalances of CSR in storing non-zero elements.
Abstract: Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra, which is an important issue in scientific computing and engineering practice. Much effort has been put into accelerating SpMV, and a few parallel solutions have been proposed. This paper focuses on a special type of SpMV, namely sparse quasi-diagonal matrix-vector multiplication (SQDMV). The sparse quasi-diagonal matrix is the key to solving many differential equations, and very little research has been done in this field. This paper discusses data structures and algorithms for SQDMV that are efficiently implemented on the compute unified device architecture (CUDA) platform for the fine-grained parallel architecture of the graphics processing unit (GPU). A new diagonal storage format, HDC, a hybrid of the diagonal format (DLA) and the compressed sparse row format (CSR), is presented, which overcomes the inefficiency of DLA in storing irregular matrices and the imbalances of CSR in storing non-zero elements. Furthermore, HDC can adjust the storage bandwidth of the diagonal to adapt to different discrete degrees of sparse matrix, so as to get a higher compression ratio than DLA and CSR, and reduce the computational complexity. Our implementation on a GPU shows that the performance of HDC is better than that of other formats, especially for matrices with some discrete points outside the main diagonal. In addition, we combine the different parts of HDC to make a unified kernel to get a better compression ratio and a higher speedup ratio on the GPU.
TL;DR: These designs are suitable to construct most of the elliptic curve and pairing based cryptographic processors and show that these designs are efficiently optimized for area × time per bit value with a high throughput rate.
Abstract: This paper presents radix-4 and radix-8 Booth encoded modular multipliers over general $F_p$ based on the interleaved multiplication algorithm. An existing bit-serial interleaved multiplication algorithm is modified using radix-4, radix-8 and Booth recoding techniques. The modified radix-4 and radix-8 versions of interleaved multiplication result in 50% and 75% reduction in the required number of clock cycles for one modular multiplication over the corresponding bit-serial interleaved multipliers, while maintaining a competitive critical path delay. The proposed architectures are implemented in Verilog HDL and synthesized by targeting the Virtex-6 FPGA platform. Due to an efficient utilization of optimized addition chains available in FPGAs and exploiting the parallelism among operations, the proposed radix-4 and radix-8 multipliers compute one 256 × 256 bit modular multiplication in 1.49 µs and 0.93 µs respectively, which are 35% and 94% improvement over the corresponding bit-serial version. Further, this work also presents a thorough comparison on the basis of area, throughput, and area × time per bit value, which shows that these designs are efficiently optimized for area × time per bit value with a high throughput rate. Thus, these designs are suitable to construct most of the elliptic curve and pairing based cryptographic processors.
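The interleaved algorithm scans one operand from its most significant end and reduces modulo $p$ after every step, so intermediate values never grow much beyond $p$. The sketch below shows the bit-serial form and a plain radix-4 form that consumes two bits per step. It is a simplified illustration: Booth recoding is omitted, the `%` operator stands in for the few conditional subtractions a hardware design would use, and the function name is hypothetical.

```python
def interleaved_modmul(a, b, p, bits, radix_bits=1):
    """Return (a * b mod p, number of iterations).

    a is scanned from the most significant digit; radix_bits = 1 gives the
    bit-serial algorithm and radix_bits = 2 a radix-4 variant.
    """
    assert 0 <= a < (1 << bits) and 0 <= b < p
    steps = -(-bits // radix_bits)  # ceiling division
    mask = (1 << radix_bits) - 1
    r = 0
    for step in range(steps - 1, -1, -1):
        digit = (a >> (step * radix_bits)) & mask
        r = (r << radix_bits) + digit * b  # shift partial result, add digit * b
        r %= p  # interleaved reduction keeps r below p
    return r, steps
```

With 256-bit operands the bit-serial loop runs 256 times and the radix-4 loop 128 times, which corresponds to the 50% cycle reduction quoted for the radix-4 design.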
TL;DR: In this article, the authors describe basic software techniques to improve the performance of Montgomery modular multiplication on 8-bit AVR-based microcontrollers and present a new variant of the widely-used hybrid method for multiple-precision multiplication that is 10.6% faster than the original hybrid technique.
Abstract: Modular multiplication of large integers is a performance-critical arithmetic operation of many public-key cryptosystems such as RSA, DSA, Diffie-Hellman (DH) and their elliptic curve-based variants ECDSA and ECDH. The computational cost of modular multiplication and related operations (e.g. exponentiation) poses a practical challenge to the widespread deployment of public-key cryptography, especially on embedded devices equipped with 8-bit processors (smart cards, wireless sensor nodes, etc.). In this paper, we describe basic software techniques to improve the performance of Montgomery modular multiplication on 8-bit AVR-based microcontrollers. First, we present a new variant of the widely-used hybrid method for multiple-precision multiplication that is 10.6% faster than the original hybrid technique of Gura et al. Then, we discuss different hybrid Montgomery multiplication algorithms, including Hybrid Finely Integrated Product Scanning (HFIPS), and introduce a novel approach for Montgomery multiplication, which we call Hybrid Separated Product Scanning (HSPS). Finally, we show how to perform the modular subtraction of Montgomery reduction in a regular fashion without execution of conditional statements so as to counteract Simple Power Analysis (SPA) attacks. Our AVR implementation of the HFIPS and HSPS method outperforms the Montgomery multiplication of the MIRACL Crypto SDK by up to 21.58% and 14.24%, respectively, and is twice as fast as the modular multiplication of the TinyECC library.
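For orientation, the sketch below shows Montgomery multiplication on whole integers, including a final subtraction that is selected with a mask instead of a branch. It is not the HFIPS or HSPS word-level scheduling of the paper, and Python integers are not constant-time, so the mask only illustrates the idea of a regular final subtraction. The function names are hypothetical.

```python
def montgomery_params(n, bits):
    """R = 2^bits and n' = -n^{-1} mod R for an odd modulus n < R."""
    R = 1 << bits
    assert n % 2 == 1 and n < R
    n_neg_inv = (-pow(n, -1, R)) % R
    return R, n_neg_inv


def montgomery_multiply(a_bar, b_bar, n, bits):
    """Return a_bar * b_bar * R^{-1} mod n for inputs in [0, n)."""
    R, n_neg_inv = montgomery_params(n, bits)
    t = a_bar * b_bar
    m = ((t & (R - 1)) * n_neg_inv) & (R - 1)
    u = (t + m * n) >> bits  # exact division by R; u lies in [0, 2n)
    # Regular final subtraction: always subtract n, then add it back
    # under a mask that is all ones exactly when the result went negative.
    d = u - n
    mask = d >> (bits + 1)  # -1 if d < 0, else 0
    return d + (n & mask)
```

Operands are first mapped to the Montgomery domain as `a_bar = a * R % n`; multiplying a Montgomery-domain value by 1 with the same routine maps it back.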
TL;DR: This paper proposes a generic technique for non-standard tiling generation, generates this tiling for field multiplication in NIST-specified curves, and compares it with the standard schoolbook algorithm to highlight the improvement.
Abstract: High-speed DSP blocks present in modern FPGAs can be used to implement prime field multiplication to accelerate elliptic curve scalar multiplication in prime fields. However, compared to logic slices, DSP blocks are scarce resources, hence their usage needs to be optimized. The asymmetric 25 × 18 signed multipliers in FPGAs open a new paradigm for multiplier design, where operand decomposition becomes equivalent to a tiling problem. Previous literature has reported that for asymmetric multipliers, it is possible to generate a tiling (known as non-standard tiling) which requires fewer DSP blocks than the standard tiling generated by the schoolbook algorithm. In this paper, we propose a generic technique for such tiling generation and generate this tiling for field multiplication in NIST-specified curves. We compare our technique with the standard schoolbook algorithm to highlight the improvement. The acceleration in ECC scalar multiplication due to the optimized field multiplier is experimentally validated for P-256. The impact of this accelerated scalar multiplication is shown for the key encapsulation algorithm PSEC-KEM (Provably Secure Key Encapsulation Mechanism).
TL;DR: This paper generalizes the algorithm for relaxed power series multiplication to the cases when K is replaced by an effective ring of positive characteristic or by an effective ring of characteristic zero, which is also torsion-free as a Z-module and comes with an additional algorithm for partial division by integers.
Abstract: In previous work, we have introduced several fast algorithms for relaxed power series multiplication (also known under the name on-line multiplication) up to a given order n. The fastest currently known algorithm works over an effective base field K with sufficiently many $2^p$-th roots of unity and has algebraic time complexity O(n log n e^{2[EQUATION]}). In this paper, we will generalize this algorithm to the cases when K is replaced by an effective ring of positive characteristic or by an effective ring of characteristic zero, which is also torsion-free as a Z-module and comes with an additional algorithm for partial division by integers. In particular, we may take K to be any effective field. We will also present an asymptotically faster algorithm for relaxed multiplication of p-adic numbers.
TL;DR: This paper presents a (b,2)-way KA decomposition for digit-serial multiplication with low-space complexity and derives a novel scalable SPB/GPB multiplier, suitable for very-large-scale integration (VLSI) implementations.
Abstract: Shifted polynomial basis (SPB) and generalized polynomial basis (GPB) are two variations of polynomial basis representation. SPB/GPB have potential for efficient bit-level and digit-level implementations of multiplication over binary extension fields. This paper presents a (b,2)-way KA decomposition for digit-serial multiplication with low-space complexity. Based on the proposed parallel (b,2)-way KA scheme, we derive a novel scalable SPB/GPB multiplier. Analytical results show that the proposed multiplier could achieve the desired trade-off between space and time complexities. Our proposed multiplier is modular, regular, and suitable for very-large-scale integration (VLSI) implementations. It involves significantly less area complexity, less computation time and less energy consumption compared to the existing digit-serial and scalable multipliers.
TL;DR: Synthesis results show that the proposed multiplier has the lowest area as compared to other tree-based multipliers, without compromising on the speed of the original Wallace multiplier.
Abstract: Multiplication is one of the most commonly used operations in arithmetic. Multipliers based on the Wallace reduction tree provide an area-efficient strategy for high-speed multiplication. A number of modifications have been proposed in the literature to optimize the area of the Wallace multiplier. This paper proposes a reduced-area Wallace multiplier without compromising on the speed of the original Wallace multiplier. Designs are synthesized using Synopsys Design Compiler in 90 nm process technology. Synthesis results show that the proposed multiplier has the lowest area as compared to other tree-based multipliers. The speed of the proposed and reference multipliers is almost the same.
TL;DR: The regular recoding approach is used, and the speed record for a single-core constant-time point multiplication on standardized binary elliptic curves at the \(128\)-bit security level is set.
Abstract: In this paper we introduce new methods for computing constant-time variable-base point multiplications over the Galbraith-Lin-Scott (GLS) and the Koblitz families of elliptic curves. Using a left-to-right double-and-add and a right-to-left halve-and-add Montgomery ladder over a GLS curve, we present some of the fastest timings yet reported in the literature for point multiplication. In addition, we combine these two procedures to compute a multi-core protected scalar multiplication. Furthermore, we designed a novel regular \(\tau \)-adic scalar expansion for Koblitz curves. As a result, using the regular recoding approach, we set the speed record for a single-core constant-time point multiplication on standardized binary elliptic curves at the \(128\)-bit security level.
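The regular structure of a left-to-right Montgomery ladder can be sketched independently of the curve arithmetic. The example below is an illustration only: it is written against abstract `add` and `double` callbacks and exercised on the additive group of integers modulo $n$, not on a GLS or Koblitz curve, and it does not cover the halve-and-add or $\tau$-adic methods of the paper. All names are hypothetical.

```python
def montgomery_ladder(k, P, add, double, identity, bits):
    """Compute [k]P with exactly one addition and one doubling per scalar bit."""
    r0, r1 = identity, P  # invariant: r1 - r0 == P
    for i in range(bits - 1, -1, -1):
        if (k >> i) & 1:
            r0, r1 = add(r0, r1), double(r1)
        else:
            r0, r1 = double(r0), add(r0, r1)
    return r0


# Toy group: integers modulo N under addition.
N = 1000003


def toy_add(a, b):
    return (a + b) % N


def toy_double(a):
    return (2 * a) % N
```

Both branches perform the same two operations, which is what gives the ladder its uniform operation sequence; a constant-time implementation would additionally replace the branch with a conditional swap, which Python cannot express faithfully.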
TL;DR: A new double point multiplication algorithm based on differential addition chains that has a uniform structure and has some degree of built-in resistance against side channel analysis attacks is presented.
Abstract: We present a new double point multiplication algorithm based on differential addition chains. Our proposed scheme has a uniform structure and has some degree of built-in resistance against side channel analysis attacks. We discuss deploying our scheme in a hardware implementation of single point multiplication on binary elliptic curves with efficiently computable endomorphisms. Based on operation counts, we expect to gain accelerations of 30% and 18% for computing single point multiplication with and without availability of parallel multipliers, respectively, and these results are verified in our implementations.
TL;DR: A novel architecture to realize a million-bit multiplication scheme based on the Schönhage-Strassen Algorithm using Number Theoretical Transform (NTT) makes use of an innovative cache architecture along with processing elements customized to match the computation and access patterns of the NTT-based recursive multiplication algorithm.
TL;DR: The algorithm and architecture of a BCD parallel multiplier that exploits some properties of two different redundant BCD codes to speed up its computation are presented and an area improvement is shown roughly in the range 20-35 percent for similar target delays with respect to the fastest implementation.
Abstract: We present the algorithm and architecture of a BCD parallel multiplier that exploits some properties of two different redundant BCD codes to speed up its computation: the redundant BCD excess-3 code (XS-3), and the overloaded BCD representation (ODDS). In addition, new techniques are developed to reduce significantly the latency and area of previous representative high-performance implementations. Partial products are generated in parallel using a signed-digit radix-10 recoding of the BCD multiplier with the digit set [-5, 5], and a set of positive multiplicand multiples (0X, 1X, 2X, 3X, 4X, 5X) coded in XS-3. This encoding has several advantages. First, it is a self-complementing code, so that a negative multiplicand multiple can be obtained by just inverting the bits of the corresponding positive one. Also, the available redundancy allows a fast and simple generation of multiplicand multiples in a carry-free way. Finally, the partial products can be recoded to the ODDS representation by just adding a constant factor into the partial product reduction tree. Since the ODDS uses a similar 4-bit binary encoding as non-redundant BCD, conventional binary VLSI circuit techniques, such as binary carry-save adders and compressor trees, can be adapted efficiently to perform decimal operations. To show the advantages of our architecture, we have synthesized an RTL model for $16\times 16$-digit and $34\times 34$-digit multiplications and performed a comparative survey of the previous most representative designs. We show that the proposed decimal multiplier has an area improvement roughly in the range 20-35 percent for similar target delays with respect to the fastest implementation.
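The self-complementing property of excess-3 mentioned in this abstract can be checked with a few lines of Python. This is a minimal sketch that assumes plain decimal digits 0-9 rather than the redundant digit range used in the paper, and the helper names (`to_xs3`, `invert4`, `nines_complement_xs3`) are hypothetical. Inverting the four bits of the XS-3 code of a digit d yields the XS-3 code of 9 - d, i.e. the nine's complement; a full negation would additionally need the usual +1 correction, which is not shown here.

```python
def to_xs3(digit):
    """Encode a decimal digit 0-9 as its 4-bit excess-3 code (digit + 3)."""
    if not 0 <= digit <= 9:
        raise ValueError("expected a decimal digit in 0..9")
    return digit + 3


def invert4(code):
    """Bitwise inversion restricted to 4 bits."""
    return ~code & 0xF


def nines_complement_xs3(digits):
    """Nine's complement of a digit list, obtained purely by bit inversion.

    Each digit is encoded in XS-3, its bits are inverted, and the result
    is decoded again by subtracting the excess of 3.
    """
    return [invert4(to_xs3(d)) - 3 for d in digits]
```

For example, `nines_complement_xs3([1, 2, 5])` gives the digits of 999 - 125.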
TL;DR: This paper presents a method based on matrix-matrix multiplication concepts for determining the approximate (sparse) inverses of sparse matrices that can be combined with iterative solvers and sparse approximation techniques as well.
Abstract: This paper presents a method based on matrix-matrix multiplication concepts for determining the approximate (sparse) inverses of sparse matrices. The suggested method is a development on the well-known Schulz iteration and it can successfully be combined with iterative solvers and sparse approximation techniques as well. A detailed discussion on the convergence rate of this scheme is furnished. Results of numerical experiments are also reported to illustrate the performance of the proposed method.
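For orientation, the classical Schulz iteration that this abstract builds on is $X_{k+1} = X_k (2I - A X_k)$, which uses only matrix-matrix products. The sketch below shows that classical scheme, not the paper's improved variant. It assumes a small dense matrix stored as a list of rows, ignores sparsity and dropping strategies, and uses the standard starting guess $X_0 = A^T / (\|A\|_1 \|A\|_\infty)$; the function names are hypothetical.

```python
def matmul(a, b):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def schulz_inverse(a, iterations=30):
    """Approximate the inverse of a square matrix via X <- X (2I - A X)."""
    n = len(a)
    # Maximum absolute column sum and maximum absolute row sum.
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    # Starting guess X0 = A^T / (||A||_1 * ||A||_inf).
    scale = norm_1 * norm_inf
    x = [[a[j][i] / scale for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        ax = matmul(a, x)
        # correction = 2I - A X
        correction = [[(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
                      for i in range(n)]
        x = matmul(x, correction)
    return x
```

The residual $I - A X_k$ is squared at every step, so a fixed count of 30 iterations is far more than a well-conditioned small example needs; a practical code would stop once the residual norm falls below a tolerance.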
TL;DR: The improvement in speed compared with earlier reported matrix multipliers (conventional algorithm, systolic array and PNTT-based implementation) was found to be ~67, ~56 and ~65%, respectively.
Abstract: A transistor-level implementation of an improved matrix multiplier for high-speed digital signal processing applications, based on matrix element transformation and multiplication, is reported in this study. The improvement in speed was achieved by rearranging the matrix elements into a two-dimensional array of processing elements interconnected as a mesh. The edges of each row and column were interconnected in a torus structure, facilitating simultaneous implementation of several multiplications. The functionality of the circuitry was verified, and the performance parameters, for example propagation delay and dynamic switching power consumption, were calculated using SPICE Spectre in 90 nm CMOS technology. The proposed methodology ensures a substantial reduction in propagation delay compared with the conventional algorithm, systolic array and pseudo number theoretic transformation (PNTT)-based implementation, which are the most commonly used techniques for matrix multiplication. The propagation delay of the implemented 4 × 4 matrix multiplier was only ~2 μs, and its power consumption was only ~3.12 mW. The improvement in speed compared with earlier reported matrix multipliers (conventional algorithm, systolic array and PNTT-based implementation) was found to be ~67, ~56 and ~65%, respectively.
TL;DR: This paper focuses on a further optimization of multiprecision multiplication, the most expensive operation involved in PKC, using several innovative methods: carry-once, optimized multiplication and accumulation (MAC), unbalanced comb, and optimized comb-window.