Top 444 papers published in the topic of Multiplication in 2014

Showing papers on "Multiplication" published in 2014

A Unified Sparse Matrix Data Format for Efficient General Sparse Matrix-Vector Multiplication on Modern Processors with Wide SIMD Units

Moritz Kreutzer, Georg Hager, Gerhard Wellein, Holger Fehske, Alan R. Bishop

11 Sep 2014-SIAM Journal on Scientific Computing

TL;DR: This work suggests SELL-$C$-$\sigma$, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from general-purpose graphics processing units and vector computer programming and shows its suitability on a variety of hardware platforms.

Abstract: Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-$C$-$\sigma$, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from general-purpose graphics processing units and vector computer programming. We discuss the advantages of SELL-$C$-$\sigma$ compared to established formats like Compressed Row Storage and ELLPACK and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi, and Nvidia Tesla K20) for a wi...

248 citations

Journal Article•10.1016/J.PARCO.2014.03.012•

Sparse matrix multiplication: The distributed block-compressed sparse row library

Urban Borštnik¹, Joost VandeVondele², Valéry Weber¹, Jürg Hutter¹•Institutions (2)

University of Zurich¹, ETH Zurich²

1 May 2014

TL;DR: The DBCSR (Distributed Block Compressed Sparse Row) library for scalable sparse matrix–matrix multiplication and its use in the CP2K program for linear-scaling quantum-chemical calculations is presented.

Abstract: Efficient parallel multiplication of sparse matrices is key to enabling many large-scale calculations. This article presents the DBCSR (Distributed Block Compressed Sparse Row) library for scalable sparse matrix–matrix multiplication and its use in the CP2K program for linear-scaling quantum-chemical calculations. The library combines several approaches to implement sparse matrix multiplication in a way that performs well and is demonstrably scalable. Parallel communication has well-defined limits. Data volume decreases with $O(1/\sqrt{P})$ with increasing process counts $P$ and every process communicates with at most $O(\sqrt{P})$ others. Local sparse matrix multiplication is handled efficiently using a combination of techniques: blocking elements together in an application-relevant way, an autotuning library for small matrix multiplications, cache-oblivious recursive multiplication, and multithreading. Additionally, on-the-fly filtering not only increases sparsity but also avoids performing calculations that fall below the filtering threshold. We demonstrate and analyze the performance of the DBCSR library and its various scaling behaviors.

232 citations

Journal Article•10.1109/TCSI.2013.2283691•

Efficient Algorithm and Architecture for Elliptic Curve Cryptography for Extremely Constrained Secure Applications

Reza Azarderakhsh¹, Kimmo Järvinen², Mehran Mozaffari-Kermani³•Institutions (3)

University of Waterloo¹, Aalto University², Rochester Institute of Technology³

06 Jan 2014-IEEE Transactions on Circuits and Systems I-regular Papers

TL;DR: This paper proposes an efficient implementation of point multiplication on Koblitz curves targeting extremely-constrained, secure applications and introduces a new technique for point addition in affine coordinates that requires fewer registers.

Abstract: Recently, considerable research has been performed in cryptography and security to optimize the area, power, timing, and energy needed for the point multiplication operations over binary elliptic curves. In this paper, we propose an efficient implementation of point multiplication on Koblitz curves targeting extremely-constrained, secure applications. We utilize the Gaussian normal basis (GNB) representation of field elements over GF($2^m$) and employ an efficient bit-level GNB multiplier. One advantage of this GNB multiplier is that we are able to reduce the hardware complexity by sharing the addition/accumulation with other field additions. Exploiting a special property of the normal basis representation, squarings are implemented very efficiently in hardware by rewiring alone. We introduce a new technique for point addition in affine coordinates that requires fewer registers. Based on this technique, we propose an extremely small processor architecture for point multiplication. Through application-specific integrated circuit (ASIC) implementations, we evaluate the area, performance, and energy consumption of the proposed crypto-processor. Utilizing two different working frequencies, it is shown that the proposed architecture reaches better results compared to the previous works, making it suitable for extremely-constrained, secure environments.

86 citations

Journal Article•10.1016/J.LEARNINSTRUC.2014.04.006•

Delayed benefits of learning elementary algebraic transformations through contrasted comparisons

Esther Ziegler¹, Elsbeth Stern¹•Institutions (1)

ETH Zurich¹

01 Oct 2014-Learning and Instruction

TL;DR: This article investigated whether these errors can be prevented by presenting addition and multiplication problems in such a way that students are encouraged to compare the problems at a deeper level and found that the contrast group clearly outperformed the sequential group.

74 citations

Journal Article•10.1037/XGE0000029•

The detrimental effect of interference in multiplication facts storing: typical development and individual differences.

Alice De Visscher¹, Marie-Pascale Noël¹•Institutions (1)

Université catholique de Louvain¹

01 Apr 2014-Journal of Experimental Psychology: General

TL;DR: Findings show that the learning of multiplications is particularly interference prone because of feature overlap and that people who are sensitive to this parameter therefore encounter difficulties in memorizing arithmetic facts.

Abstract: The difficulty in memorizing arithmetic facts is a general and persistent hallmark of math learning disabilities. It has recently been suggested that hypersensitivity to interference could prevent a person from storing arithmetic facts. The similarity between arithmetic facts would provoke interference, and learners who are hypersensitive to interference would therefore encounter difficulties in storing arithmetic facts in long-term memory. In this study, we created a measure of the interference weight for each multiplication by measuring the overlap of digits between multiplications. First, we tested whether the interference parameter could predict performance across multiplications by analyzing the data from undergraduates published by Campbell (1997). The interference parameter substantially predicted performance across multiplications. Similarly, the performance across multiplications was substantially determined by the interference parameter in 3rd-grade children, 5th-grade children, and undergraduates we tested. Second, we tested whether people with poor arithmetic facts abilities were particularly sensitive to the interference parameter. We tested this hypothesis in typical development by analyzing the data from the 3rd-grade children, 5th-grade children, and undergraduates. We analyzed data with regard to atypical development from a published case study of dyscalculia as well as from 4th-grade children, with either poor or good multiplication skills, tested twice 1 year apart. Results showed that the individual sensitivity to the interference parameter determined part of the individual differences in multiplication performance in all data sets. These findings show that the learning of multiplications is particularly interference prone because of feature overlap and that people who are sensitive to this parameter therefore encounter difficulties in memorizing arithmetic facts.

67 citations

Proceedings Article•10.1145/2631948.2631954•

Nearly optimal computations with structured matrices

Victor Y. Pan¹, Elias P. Tsigaridas²•Institutions (2)

City University of New York¹, University of Paris²

28 Jul 2014

TL;DR: In this paper, the complexity of solving nonsingular linear systems of equations with structured matrices has been studied and the authors present a Boolean complexity analysis for the problem of polynomial multiplication and division.

Abstract: We estimate the Boolean complexity of multiplication of structured matrices by a vector and the solution of nonsingular linear systems of equations with these matrices. We study four basic and most popular classes, that is, Toeplitz, Hankel, Cauchy and Vandermonde matrices, for which the cited computational problems are equivalent to the task of polynomial multiplication and division and polynomial and rational multipoint evaluation and interpolation. The Boolean cost estimates for the latter problems have been obtained by Kirrinnis in [10], except for rational interpolation, and we supply them now. All known Boolean cost estimates from [10] for these problems rely on using Kronecker product. This implies the d-fold precision increase for the d-th degree output, but we avoid such an increase by relying on distinct techniques based on employing FFT. Furthermore we simplify the analysis and make it more transparent by combining the representations of our tasks and algorithms both via structured matrices and via polynomials and rational functions. This also enables further extensions of our estimates to cover Trummer's important problem and computations with the popular classes of structured matrices that generalize the four cited basic matrix classes.
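
The equivalence between Toeplitz matrix-vector multiplication and polynomial multiplication mentioned in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it uses schoolbook polynomial multiplication for clarity, where the paper relies on FFT-based techniques, and the function names and the storage convention for the Toeplitz entries are assumptions made for this example.

```python
def poly_mul(a, b):
    """Schoolbook product of two polynomials given as coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def toeplitz_matvec(t, v):
    """Multiply an n x n Toeplitz matrix T by v via one polynomial product.

    Assumed convention: t has length 2n - 1 and T[i][j] = t[i - j + n - 1].
    The product T v is then the middle slice of the coefficients of t(z) * v(z).
    """
    n = len(v)
    assert len(t) == 2 * n - 1
    return poly_mul(t, v)[n - 1:2 * n - 1]


def toeplitz_dense(t, n):
    """Build the dense matrix for the same convention (for checking only)."""
    return [[t[i - j + n - 1] for j in range(n)] for i in range(n)]
```

Replacing `poly_mul` with an FFT-based product is what turns the quadratic cost of this sketch into a nearly linear one.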

60 citations

Book Chapter•10.1007/978-3-662-45611-8_14•

GLV/GLS Decomposition, Power Analysis, and Attacks on ECDSA Signatures with Single-Bit Nonce Bias

Diego F. Aranha¹, Pierre-Alain Fouque², Benoît Gérard, Jean-Gabriel Kammerer³, Mehdi Tibouchi, Jean-Christophe Zapalowicz⁴•Institutions (4)

State University of Campinas¹, Institut Universitaire de France², University of Rennes³, French Institute for Research in Computer Science and Automation⁴

7 Dec 2014

TL;DR: The main goal of this paper is to discuss security issues that may arise using either the Gallant–Lambert–Vanstone or Galbraith–Lin–Scott approaches to elliptic curve cryptography.

Abstract: The fastest implementations of elliptic curve cryptography in recent years have been achieved on curves endowed with nontrivial efficient endomorphisms, using techniques due to Gallant–Lambert–Vanstone (GLV) and Galbraith–Lin–Scott (GLS). In such implementations, a scalar multiplication $[k]P$ is computed as a double multiplication $[k_1]P + [k_2]\psi(P)$, for $\psi$ an efficient endomorphism and $k_1, k_2$ appropriate half-size scalars. To compute a random scalar multiplication, one can either select the scalars $k_1, k_2$ at random, hoping that the resulting $k = k_1 + k_2 \lambda$ is close to uniform, or pick a uniform $k$ instead and decompose it as $k_1 + k_2 \lambda$ afterwards. The main goal of this paper is to discuss security issues that may arise using either approach.

59 citations

Journal Article•10.1109/TPDS.2013.31•

High-Level Strategies for Parallel Shared-Memory Sparse Matrix-Vector Multiplication

A. N. Yzelman¹, Dirk Roose²•Institutions (2)

Katholieke Universiteit Leuven¹, University of Copenhagen Faculty of Science²

01 Jan 2014-IEEE Transactions on Parallel and Distributed Systems

TL;DR: One of the newly proposed methods attains the best average result in experiments on a large set of matrices and obtains a parallel efficiency of 90 percent, while on average it performs close to 60 percent.

Abstract: The sparse matrix-vector multiplication is an important computational kernel, but is hard to execute efficiently even in the sequential case. The problems (namely low arithmetic intensity, inefficient cache use, and limited memory bandwidth) are magnified as the core count on shared-memory parallel architectures increases. Existing techniques are discussed in detail, and categorized chiefly based on their distribution types. Based on this, new parallelization techniques are proposed. The theoretical scalability and memory usage of the various strategies are analyzed, and experiments on multiple NUMA architectures confirm the validity of the results. One of the newly proposed methods attains the best average result in experiments on a large set of matrices. In one of the experiments it obtains a parallel efficiency of 90 percent, while on average it performs close to 60 percent.
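
For orientation, the kernel under discussion can be written in a few lines in its simplest sequential form. The sketch below assumes the common compressed row storage layout (arrays named `row_ptr`, `col_idx`, and `values` here for illustration) and does not reproduce any of the parallelization strategies the paper proposes.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A x for a sparse matrix A stored in compressed row storage.

    row_ptr[i] .. row_ptr[i + 1] delimits the nonzeros of row i;
    col_idx[k] and values[k] give the column and value of the k-th nonzero.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # One multiply-add per nonzero, with an indirect read of x:
            # this is the source of the low arithmetic intensity.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Each nonzero contributes two floating-point operations but requires reading a value, a column index, and an entry of `x`, which is why memory bandwidth rather than arithmetic tends to limit this loop.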

57 citations

Proceedings Article•10.5555/2634074.2634185•

New constructions of RIP matrices with fast multiplication and fewer rows

Jelani Nelson¹, Eric Price², Mary Wootters³•Institutions (3)

Harvard University¹, Massachusetts Institute of Technology², University of Michigan³

5 Jan 2014

TL;DR: This paper presents novel constructions of matrices with the restricted isometry property (RIP) that support fast matrix-vector multiplication, with a number of rows that improves on previous analyses by a logarithmic factor.

Abstract: In this paper, we present novel constructions of matrices with the restricted isometry property (RIP) that support fast matrix-vector multiplication. Our guarantees are the best known, and can also be used to obtain the best known guarantees for fast Johnson-Lindenstrauss transforms. In compressed sensing, the restricted isometry property is a sufficient condition for the efficient reconstruction of a nearly $k$-sparse vector $x \in \mathbb{C}^d$ from $m$ linear measurements $\Phi x$. It is desirable for $m$ to be small, and further it is desirable for $\Phi$ to support fast matrix-vector multiplication. Among other applications, fast multiplication improves the runtime of iterative recovery algorithms which repeatedly multiply by $\Phi$ or $\Phi^*$. The main contribution of this work is a novel randomized construction of RIP matrices $\Phi \in \mathbb{C}^{m \times d}$, preserving the $\ell_2$ norms of all $k$-sparse vectors with distortion $1 + \varepsilon$, where the matrix-vector multiply $\Phi x$ can be computed in nearly linear time. The number of rows $m$ is on the order of $\varepsilon^{-2} k \log d \log^2(k \log d)$, an improvement on previous analyses by a logarithmic factor. Our construction, together with a connection between RIP matrices and the Johnson-Lindenstrauss lemma in [Krahmer-Ward, SIAM. J. Math. Anal. 2011], also implies fast Johnson-Lindenstrauss embeddings with asymptotically fewer rows than previously known. Our construction is actually a recipe for improving any existing family of RIP matrices. Briefly, we apply an appropriate sparse hash matrix with sign flips to any suitable family of RIP matrices. We show that the embedding properties of the original family are maintained, while at the same time improving the number of rows. The main tool in our analysis is a recent bound for the supremum of certain types of Rademacher chaos processes in [Krahmer-Mendelson-Rauhut, Comm. Pure Appl. Math. to appear].

53 citations

Journal Article•10.1007/S11227-014-1123-Z•

An efficient parallel solution for Caputo fractional reaction–diffusion equation

Chunye Gong¹, Weimin Bao¹, Guojian Tang¹, Bo Yang¹, Jie Liu¹•Institutions (1)

National University of Defense Technology¹

01 Jun 2014-The Journal of Supercomputing

TL;DR: This is the first parallel solution for Caputo fractional reaction–diffusion equation with explicit difference method and the experimental results show that the parallel solution compares well with the analytic solution.

Abstract: The computational complexity of Caputo fractional reaction–diffusion equation is $O(MN^2)$ compared with $O(MN)$ of traditional reaction–diffusion equation, where $M$, $N$ are the number of time steps and grid points. An efficient parallel solution for Caputo fractional reaction–diffusion equation with explicit difference method is proposed. The parallel solution, which is implemented with the MPI parallel programming model, consists of three procedures: preprocessing, parallel solver and postprocessing. The parallel solver involves the parallel tridiagonal matrix-vector multiplication, vector-vector addition and constant-vector multiplication. The sum of constant-vector multiplication is optimized. To the authors' knowledge, this is the first parallel solution for Caputo fractional reaction–diffusion equation. The experimental results show that the parallel solution compares well with the analytic solution. The parallel solution on a single Intel Xeon X5540 CPU runs more than three times faster than the serial solution on a single X5540 CPU core, and scales quite well on a distributed memory cluster system.

48 citations

Journal Article•10.1109/TCSI.2014.2304664•

Low-Complexity Multiplierless Constant Rotators Based on Combined Coefficient Selection and Shift-and-Add Implementation (CCSSI)

Mario Garrido¹, Fahad Qureshi¹, Oscar Gustafsson¹•Institutions (1)

Linköping University¹

04 Mar 2014-IEEE Transactions on Circuits and Systems

TL;DR: A new approach to design multiplierless constant rotators based on a combined coefficient selection and shift-and-add implementation (CCSSI) that provides an extended design space that offers a larger number of alternatives with respect to previous works.

Abstract: This paper presents a new approach to design multiplierless constant rotators. The approach is based on a combined coefficient selection and shift-and-add implementation (CCSSI) for the design of the rotators. First, complete freedom is given to the selection of the coefficients, i.e., no constraints to the coefficients are set in advance and all the alternatives are taken into account. Second, the shift-and-add implementation uses advanced single constant multiplication (SCM) and multiple constant multiplication (MCM) techniques that lead to low-complexity multiplierless implementations. Third, the design of the rotators is done by a joint optimization of the coefficient selection and shift-and-add implementation. As a result, the CCSSI provides an extended design space that offers a larger number of alternatives with respect to previous works. Furthermore, the design space is explored in a simple and efficient way. The proposed approach has wide applications in numerous hardware scenarios. This includes rotations by single or multiple angles, rotators in single or multiple branches, and different scaling of the outputs. Experimental results for various scenarios are provided. In all of them, the proposed approach achieves significant improvements with respect to state of the art.
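
The shift-and-add idea behind multiplierless constant multiplication can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard canonical signed digit (non-adjacent form) recoding of an integer constant, which is a textbook technique and not necessarily one of the SCM or MCM methods the paper employs, and the names `csd_digits`, `shift_add_multiply`, and `rotate` are hypothetical.

```python
def csd_digits(c):
    """Return (shift, sign) terms so that c == sum(sign * 2**shift).

    Uses the non-adjacent form, in which no two consecutive digits are nonzero.
    Intended for a non-negative integer constant c.
    """
    terms = []
    shift = 0
    while c:
        if c & 1:
            sign = 2 - (c & 3)  # +1 when c % 4 == 1, -1 when c % 4 == 3
            terms.append((shift, sign))
            c -= sign
        c >>= 1
        shift += 1
    return terms


def shift_add_multiply(x, c):
    """Multiply x by the constant c using only shifts, additions and subtractions."""
    return sum(sign * (x << shift) for shift, sign in csd_digits(c))


def rotate(x, y, c, s):
    """Rotate the point (x, y) by the integer coefficient pair (c, s).

    Computes (x + jy) * (c + js) without a general-purpose multiplier.
    """
    xr = shift_add_multiply(x, c) - shift_add_multiply(y, s)
    yr = shift_add_multiply(y, c) + shift_add_multiply(x, s)
    return xr, yr
```

For example, the constant 7 is recoded as 8 - 1, so multiplying by 7 costs one shift and one subtraction instead of two additions. Which coefficient pair `(c, s)` best approximates a given angle is exactly the selection problem the paper optimizes jointly with the adder structure.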

Proceedings Article•10.1109/ISCAS.2014.6865755•

Accelerating leveled fully homomorphic encryption using GPU

Wei Wang, Zhilu Chen, Xinming Huang

1 Jun 2014

TL;DR: Chinese Remainder Theorem (CRT) is employed to reduce the computational complexity of the large-number element-by-element modular multiplication of the leveled fully homomorphic encryption scheme.

Abstract: Gentry introduced the first plausible fully homomorphic encryption (FHE) scheme, which was considered a major breakthrough in cryptography. Several FHE schemes have been proposed to make FHE more efficient for practical applications since then. The leveled fully homomorphic scheme is among the most well-known schemes. In the leveled FHE scheme, large-number matrix-vector multiplication is a crucial part of the encryption algorithm. In this paper, the Chinese Remainder Theorem (CRT) is employed to reduce the computational complexity of the large-number element-by-element modular multiplication. The first step is called decomposition, in which each large-number element in the matrix and vector is decomposed into many small words. The next step is vector operation that performs the modular multiplications and additions of the decomposed small words. Finally the matrix-vector multiplication results can be obtained through reconstruction. We compare the CRT-based method with the Number Theory Library (NTL), showing the proposed method is about 7.8 times faster when executing on CPU. In addition, it is observed that vector operation takes up to 99.6% of the total computation time and the reconstruction only takes 0.4%. Therefore GPU acceleration is employed to speed up the vector operations. Experimental results show that the GPU implementation of the CRT-based method is 35.2 times faster than the same method implemented on CPU and is 273.6 times faster than the NTL library on CPU.
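
The three steps named in the abstract (decomposition, vector operation, reconstruction) can be sketched in plain Python. This is a minimal illustration, not the paper's CPU or GPU implementation: the choice of moduli (four Mersenne primes) and the function names `decompose`, `reconstruct`, and `crt_matvec` are assumptions for the example, and the sketch computes an exact integer product rather than the scheme's modular one.

```python
from math import prod

# Assumed pairwise-coprime moduli (Mersenne primes). Their product must exceed
# every entry of the true result for the reconstruction to be exact.
MODULI = [2**13 - 1, 2**17 - 1, 2**19 - 1, 2**31 - 1]
M = prod(MODULI)


def decompose(x):
    """Decomposition: replace a large number by its small residues."""
    return [x % m for m in MODULI]


def reconstruct(residues):
    """Reconstruction: recover the number from its residues via the CRT."""
    total = 0
    for r, m in zip(residues, MODULI):
        partial = M // m
        total += r * partial * pow(partial, -1, m)
    return total % M


def crt_matvec(matrix, vector):
    """Matrix-vector product computed independently per modulus."""
    vec_res = [decompose(v) for v in vector]
    result = []
    for row in matrix:
        acc = [0] * len(MODULI)
        for a, v_res in zip(row, vec_res):
            a_res = decompose(a)
            # Vector operation: small modular multiply-adds, one per modulus.
            for i, m in enumerate(MODULI):
                acc[i] = (acc[i] + a_res[i] * v_res[i]) % m
        result.append(reconstruct(acc))
    return result
```

The inner loop over moduli touches only small words and has no dependencies between moduli, which is the part the abstract reports as dominating the runtime and therefore as the natural target for GPU acceleration.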

Proceedings Article•10.1145/2593069.2593238•

Ultra Low-Power implementation of ECC on the ARM Cortex-M0+

Ruan de Clercq¹, Leif Uhsadel¹, Anthony Van Herrewege¹, Ingrid Verbauwhede¹•Institutions (1)

Katholieke Universiteit Leuven¹

1 Jun 2014

TL;DR: In this paper, elliptic curve cryptography (ECC) is used to make a fast and very low-power software implementation of a public-key cryptography algorithm on the ARM Cortex-M0+.

Abstract: In this work, elliptic curve cryptography (ECC) is used to make a fast, and very low-power software implementation of a public-key cryptography algorithm on the ARM Cortex-M0+. An optimization of the Lopez-Dahab field multiplication method is proposed, which aims to reduce the number of memory accesses, as this is a slow operation on the target platform. A mixed C and assembly implementation was made; a random point multiplication requires 34.16 μJ, whereas our fixed point multiplication requires 20.63 μJ. Our implementation's energy consumption beats all other software implementations, on any platform, by a factor of at least 3.3.

Book Chapter•10.1007/978-3-662-45608-8_17•

Side-Channel Analysis of Multiplications in GF($2^{128}$)

Sonia Belaïd¹,², Pierre-Alain Fouque³,⁴, Benoît Gérard•Institutions (4)

École Normale Supérieure¹, Thales Communications², University of Rennes³, Institut Universitaire de France⁴

7 Dec 2014

TL;DR: This work investigates different solving techniques corresponding to different attacker models and refines the attack for particular implementations of the multiplication, focusing on GF($2^{128}$) multiplication.

Abstract: In this paper, we study the side-channel security of the field multiplication in GF($2^n$). We particularly focus on GF($2^{128}$) multiplication, which is the one used in the authentication part of AES-GCM, but the proposed attack also applies to other binary extensions. In a hardware implementation using a 128-bit multiplier, the full 128-bit secret is manipulated at once. In this context, classical DPA attacks based on the divide and conquer strategy cannot be applied. In this work, the algebraic structure of the multiplication is leveraged to recover bits of information about the secret multiplicand without having to perform any key-guess. To do so, the leakage corresponding to the writing of the multiplication output into a register is considered. It is assumed to follow a Hamming weight/distance leakage model. Under these particular, yet easily met, assumptions, we exhibit a nice connection between the key recovery problem and some classical coding and Learning Parities with Noise problems with certain instance parameters. In our case, the noise is very high, but the length of the secret is rather short. In this work we investigate different solving techniques corresponding to different attacker models and eventually refine the attack when considering particular implementations of the multiplication.

Journal Article•10.1007/S40509-015-0056-4•

Relativity of arithmetic as a fundamental symmetry of physics

Marek Czachor¹,²•Institutions (2)

Vrije Universiteit Brussel¹, Gdańsk University of Technology²

30 Dec 2014-arXiv: Mathematical Physics

TL;DR: In this paper, a change of realization of arithmetic, without altering the remaining structures of a given equation, plays the same role as symmetry transformation, and explains why the change of arithmetic is not equivalent to the usual change of variables, and why it may have implications for the Bell theorem.

Abstract: Arithmetic operations can be defined in various ways, even if one assumes commutativity and associativity of addition and multiplication, and distributivity of multiplication with respect to addition. In consequence, whenever one encounters 'plus' or 'times' one has certain freedom of interpreting this operation. This leads to some freedom in definitions of derivatives, integrals and, thus, practically all equations occurring in natural sciences. A change of realization of arithmetic, without altering the remaining structures of a given equation, plays the same role as a symmetry transformation. An appropriate construction of arithmetic turns out to be particularly important for dynamical systems in fractal space-times. Simple examples from classical and quantum, relativistic and nonrelativistic physics are discussed, including the eigenvalue problem for a quantum harmonic oscillator. It is explained why the change of arithmetic is not equivalent to the usual change of variables, and why it may have implications for the Bell theorem.

Journal Article•10.1177/1094342013501126•

Optimization of quasi-diagonal matrix-vector multiplication on GPU

Yang Wangdong¹, Kenli Li¹, Yan Liu², Lin Shi², Lanjun Wan²•Institutions (2)

Hunan City University¹, Hunan University²

1 May 2014

TL;DR: A new diagonal storage format, a hybrid of the diagonal format (DLA) and the compressed sparse row format (CSR) (HDC) will be presented, which overcomes the inefficiency of DLA in storing irregular matrices and the imbalances of CSR in storing non-zero elements.

Abstract: Sparse matrix-vector multiplication (SpMV) is of singular importance in sparse linear algebra, which is an important issue in scientific computing and engineering practice. Much effort has been put into accelerating SpMV, and a few parallel solutions have been proposed. This paper focuses on a special type of SpMV, namely sparse quasi-diagonal matrix-vector multiplication (SQDMV). The sparse quasi-diagonal matrix is the key to solving many differential equations, and very little research has been done in this field. This paper discusses data structures and algorithms for SQDMV that are efficiently implemented on the compute unified device architecture (CUDA) platform for the fine-grained parallel architecture of the graphics processing unit (GPU). A new diagonal storage format, a hybrid of the diagonal format (DLA) and the compressed sparse row format (CSR) (HDC), will be presented, which overcomes the inefficiency of DLA in storing irregular matrices and the imbalances of CSR in storing non-zero elements. Furthermore, HDC can adjust the storage bandwidth of the diagonal to adapt to different discrete degrees of sparse matrix, so as to get a higher compression ratio than DLA and CSR, and reduce the computational complexity. Our implementation in a GPU shows that the performance of HDC is better than that of other formats, especially for matrices with some discrete points outside the main diagonal. In addition, we combine the different parts of HDC to make a unified kernel to get a better compression ratio and a higher speedup ratio in the GPU.

Proceedings Article•10.1109/FPL.2014.6927452•

Radix-4 and radix-8 booth encoded interleaved modular multipliers over general $F_p$

Khalid Javeed¹, Xiaojun Wang¹•Institutions (1)

Dublin City University¹

20 Oct 2014

TL;DR: These designs are suitable to construct most of the elliptic curve and pairing based cryptographic processors and show that these designs are efficiently optimized for area × time per bit value with a high throughput rate.

Abstract: This paper presents radix-4 and radix-8 Booth encoded modular multipliers over general $F_p$ based on the interleaved multiplication algorithm. An existing bit serial interleaved multiplication algorithm is modified using radix-4, radix-8 and Booth recoding techniques. The modified radix-4 and radix-8 versions of interleaved multiplication result in 50% and 75% reduction in required number of clock cycles for one modular multiplication over the corresponding bit serial interleaved multipliers, while maintaining a competitive critical path delay. The proposed architectures are implemented in Verilog HDL and synthesized for a Virtex-6 FPGA platform. Due to an efficient utilization of optimized addition chains available in FPGAs and exploiting the parallelism among operations, the proposed radix-4 and radix-8 multipliers compute one 256 × 256 bit modular multiplication in 1.49 µs and 0.93 µs respectively, which are 35% and 94% improvement over the corresponding bit serial version. Further, this work also presents a thorough comparison on the basis of area, throughput, and area × time per bit value, which shows that these designs are efficiently optimized for area × time per bit value with a high throughput rate. Thus, these designs are suitable to construct most of the elliptic curve and pairing based cryptographic processors.
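
The interleaved multiplication idea can be sketched in software as follows. This is an illustration only, under stated assumptions: the radix-4 variant below consumes plain two-bit digits and reduces with Python's `%` operator, so it omits the Booth recoding and the hardware reduction logic that the paper describes, and the function names are hypothetical.

```python
def interleaved_modmul(x, y, p, nbits):
    """Bit-serial interleaved modular multiplication: one bit of x per step."""
    assert 0 <= x < p and 0 <= y < p and p.bit_length() <= nbits
    acc = 0
    for i in reversed(range(nbits)):  # most significant bit first
        acc <<= 1                     # double the running product
        if (x >> i) & 1:
            acc += y                  # add the multiplicand when the bit is set
        # Here acc < 3p, so at most two subtractions bring it back below p.
        if acc >= p:
            acc -= p
        if acc >= p:
            acc -= p
    return acc


def interleaved_modmul_radix4(x, y, p, nbits):
    """Radix-4 variant: two bits of x per step, so half as many iterations."""
    assert 0 <= x < p and 0 <= y < p and p.bit_length() <= nbits
    acc = 0
    for i in reversed(range((nbits + 1) // 2)):
        digit = (x >> (2 * i)) & 3    # plain radix-4 digit in 0..3 (no Booth recoding)
        acc = (4 * acc + digit * y) % p
    return acc
```

Halving the number of loop iterations in the radix-4 version corresponds to the 50% reduction in clock cycles quoted in the abstract; processing three bits per step would give the radix-8 analogue.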

Book Chapter•10.1007/978-3-319-06734-6_14•

New Speed Records for Montgomery Modular Multiplication on 8-Bit AVR Microcontrollers

Zhe Liu¹, Johann Großschädl¹•Institutions (1)

University of Luxembourg¹

28 May 2014

TL;DR: In this article, the authors describe basic software techniques to improve the performance of Montgomery modular multiplication on 8-bit AVR-based microcontrollers and present a new variant of the widely-used hybrid method for multiple-precision multiplication that is 10.6% faster than the original hybrid technique.

Abstract: Modular multiplication of large integers is a performance-critical arithmetic operation of many public-key cryptosystems such as RSA, DSA, Diffie-Hellman (DH) and their elliptic curve-based variants ECDSA and ECDH. The computational cost of modular multiplication and related operations (e.g. exponentiation) poses a practical challenge to the widespread deployment of public-key cryptography, especially on embedded devices equipped with 8-bit processors (smart cards, wireless sensor nodes, etc.). In this paper, we describe basic software techniques to improve the performance of Montgomery modular multiplication on 8-bit AVR-based microcontrollers. First, we present a new variant of the widely-used hybrid method for multiple-precision multiplication that is 10.6% faster than the original hybrid technique of Gura et al. Then, we discuss different hybrid Montgomery multiplication algorithms, including Hybrid Finely Integrated Product Scanning (HFIPS), and introduce a novel approach for Montgomery multiplication, which we call Hybrid Separated Product Scanning (HSPS). Finally, we show how to perform the modular subtraction of Montgomery reduction in a regular fashion without execution of conditional statements so as to counteract Simple Power Analysis (SPA) attacks. Our AVR implementation of the HFIPS and HSPS method outperforms the Montgomery multiplication of the MIRACL Crypto SDK by up to 21.58% and 14.24%, respectively, and is twice as fast as the modular multiplication of the TinyECC library.
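
As background for the abstract, Montgomery multiplication with a final subtraction that avoids a conditional statement can be sketched as below. This is a word-level illustration on Python integers, not the paper's multiple-precision AVR code or its HFIPS/HSPS product-scanning methods; the helper names are hypothetical, and Python integers give no real timing or power guarantees, so the sketch only shows the arithmetic idea.

```python
def montgomery_setup(p, nbits):
    """Return R = 2**nbits and -p^{-1} mod R for an odd modulus p < R."""
    r = 1 << nbits
    p_inv_neg = (-pow(p, -1, r)) % r
    return r, p_inv_neg


def mont_mul(a, b, p, nbits, p_inv_neg):
    """Return a * b * R^{-1} mod p for 0 <= a, b < p."""
    mask_r = (1 << nbits) - 1
    t = a * b
    m = ((t & mask_r) * p_inv_neg) & mask_r
    u = (t + m * p) >> nbits  # exact division by R; 0 <= u < 2p
    # Final subtraction without an if-statement: subtract p, then add it
    # back only when the difference went negative (select is -1 or 0).
    d = u - p
    select = d >> (nbits + 1)
    return d + (p & select)


def to_mont(a, p, r):
    """Convert a into the Montgomery domain."""
    return (a * r) % p


def from_mont(a, p, nbits, p_inv_neg):
    """Convert a out of the Montgomery domain."""
    return mont_mul(a, 1, p, nbits, p_inv_neg)
```

A typical use converts both operands once with `to_mont`, chains many `mont_mul` calls (for example inside an exponentiation), and converts back once with `from_mont`, so the conversion cost is amortized.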

Proceedings Article•10.1145/2593069.2593234•

Tile Before Multiplication: An Efficient Strategy to Optimize DSP Multiplier for Accelerating Prime Field ECC for NIST Curves

Debapriya Basu Roy¹, Debdeep Mukhopadhyay¹, Masami Izumi, Junko Takahashi•Institutions (1)

Indian Institute of Technology Kharagpur¹

1 Jun 2014

TL;DR: This paper proposes a generic technique for non-standard tiling generation and generates this tiling for field multiplication in NIST specified curves and compares it with standard school book algorithm to highlight the improvement.

Abstract: High-speed DSP blocks present in modern FPGAs can be used to implement prime field multiplication to accelerate Elliptic Curve scalar multiplication in prime fields. However, compared to logic slices, DSP blocks are scarce resources, hence their usage needs to be optimized. The asymmetric 25 × 18 signed multipliers in FPGAs open a new paradigm for multiplier design, where operand decomposition becomes equivalent to a tiling problem. Previous literature has reported that for an asymmetric multiplier, it is possible to generate a tiling (known as non-standard tiling) which requires fewer DSP blocks than the standard tiling generated by the school book algorithm. In this paper, we propose a generic technique for such tiling generation and generate this tiling for field multiplication in NIST specified curves. We compare our technique with the standard school book algorithm to highlight the improvement. The acceleration in ECC scalar multiplication due to the optimized field multiplier is experimentally validated for P-256. The impact of this accelerated scalar multiplication is shown for the key encapsulation algorithm PSEC-KEM (Provably Secure Key Encapsulation Mechanism).
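
The tiling view of operand decomposition can be sketched in Python. The sketch below covers only the standard (school book) tiling, not the non-standard tiling the paper generates. It assumes that a 25 × 18 signed multiplier handles 24 × 17 unsigned operands; the function names and the simple area bound are illustrative and not taken from the paper.

```python
from math import ceil

def schoolbook_tiles(n_bits, wa=24, wb=17):
    """Number of wa x wb tiles used by the standard (school book) tiling."""
    return ceil(n_bits / wa) * ceil(n_bits / wb)

def area_lower_bound(n_bits, wa=24, wb=17):
    """Counting bound: no tiling can cover n x n bit products with fewer tiles."""
    return ceil(n_bits * n_bits / (wa * wb))

def tiled_multiply(a, b, n_bits, wa=24, wb=17):
    """Multiply two unsigned n_bits-wide integers one tile at a time."""
    total = 0
    for i in range(0, n_bits, wa):
        a_chunk = (a >> i) & ((1 << wa) - 1)
        for j in range(0, n_bits, wb):
            b_chunk = (b >> j) & ((1 << wb) - 1)
            total += (a_chunk * b_chunk) << (i + j)   # one DSP-sized product
    return total
```

Under these assumptions a 256-bit operand pair needs `schoolbook_tiles(256) == 176` tiles, while the counting bound is `area_lower_bound(256) == 161`, which is the gap a non-standard tiling tries to narrow.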

Proceedings Article•10.1145/2608628.2608657•

Faster relaxed multiplication

Joris van der Hoeven¹•Institutions (1)

École Polytechnique¹

23 Jul 2014

TL;DR: This paper generalizes the algorithm for relaxed power series multiplication to the cases when K is replaced by an effective ring of positive characteristic or by an effective ring of characteristic zero, which is also torsion-free as a Z-module and comes with an additional algorithm for partial division by integers.

Abstract: In previous work, we have introduced several fast algorithms for relaxed power series multiplication (also known under the name on-line multiplication) up to a given order $n$. The fastest currently known algorithm works over an effective base field K with sufficiently many $2^p$-th roots of unity and has algebraic time complexity $O(n \log n \, e^{2\sqrt{\log 2 \log \log n}})$. In this paper, we will generalize this algorithm to the cases when K is replaced by an effective ring of positive characteristic or by an effective ring of characteristic zero, which is also torsion-free as a Z-module and comes with an additional algorithm for partial division by integers. In particular, we may take K to be any effective field. We will also present an asymptotically faster algorithm for relaxed multiplication of p-adic numbers.
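
To make the on-line constraint concrete, here is a naive quadratic-time Python sketch, not the fast algorithm of the paper. The class and function names are hypothetical. The defining property is that coefficient c_n of the product is produced as soon as a_0..a_n and b_0..b_n are known, so the inputs may themselves depend on earlier outputs.

```python
class RelaxedProduct:
    """Naive on-line product of two power series, fed one coefficient at a time."""

    def __init__(self):
        self.a, self.b = [], []

    def next_coefficient(self, a_n, b_n):
        """Feed a_n and b_n, return c_n = sum of a_i * b_(n-i) for i = 0..n."""
        self.a.append(a_n)
        self.b.append(b_n)
        n = len(self.a) - 1
        return sum(self.a[i] * self.b[n - i] for i in range(n + 1))

def catalan(count):
    """Solve f = 1 + x*f^2 on-line: f_(n+1) is coefficient n of f*f."""
    prod, f = RelaxedProduct(), [1]
    while len(f) < count:
        f.append(prod.next_coefficient(f[-1], f[-1]))
    return f
```

The `catalan` helper shows why relaxed multiplication matters: each new coefficient of `f` is only available after the previous product coefficient has been computed, which rules out an ordinary off-line multiplication.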

Journal Article•10.1109/TCSI.2014.2335031•

Low-Complexity Digit-Serial and Scalable SPB/GPB Multipliers Over Large Binary Extension Fields Using (b,2)-Way Karatsuba Decomposition

Chiou-Yng Lee¹, Chun-Sheng Yang², Bimal Kumar Meher³, Pramod Kumar Meher⁴, Jeng-Shyang Pan²•Institutions (4)

Lunghwa University of Science and Technology¹, Harbin Institute of Technology², Silicon Institute of Technology³, Nanyang Technological University⁴

15 Oct 2014-IEEE Transactions on Circuits and Systems I: Regular Papers

TL;DR: This paper presents a (b,2)-way KA decomposition for digit-serial multiplication with low-space complexity and derives a novel scalable SPB/GPB multiplier, suitable for very-large-scale integration (VLSI) implementations.

Abstract: Shifted polynomial basis (SPB) and generalized polynomial basis (GPB) are two variations of polynomial basis representation. SPB/GPB have potential for efficient bit-level and digit-level implementations of multiplication over binary extension fields. This paper presents a (b,2)-way KA decomposition for digit-serial multiplication with low-space complexity. Based on the proposed parallel (b,2)-way KA scheme, we derive a novel scalable SPB/GPB multiplier. Analytical results show that the proposed multiplier could achieve the desired trade-off between space and time complexities. Our proposed multiplier is modular, regular, and suitable for very-large-scale integration (VLSI) implementations. It involves significantly less area complexity, less computation time and less energy consumption compared to the existing digit-serial and scalable multipliers.
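
As background for the Karatsuba (KA) decomposition mentioned above, the following Python sketch shows plain recursive 2-way Karatsuba for polynomials over GF(2), stored as integers with bit i holding the coefficient of x^i. It is not the (b,2)-way scheme or the SPB/GPB multiplier of the paper, and it omits reduction modulo the field polynomial. The names and the base-case threshold of 8 are assumptions.

```python
def clmul(a, b):
    """School book carry-less product of two GF(2)[x] polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a          # addition in GF(2)[x] is XOR
        a <<= 1
        b >>= 1
    return r

def karatsuba_gf2(a, b, n):
    """Product of two GF(2)[x] polynomials of degree < n, 2-way Karatsuba."""
    if n <= 8:
        return clmul(a, b)
    h = (n + 1) // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = karatsuba_gf2(a0, b0, h)
    hi = karatsuba_gf2(a1, b1, n - h)
    mid = karatsuba_gf2(a0 ^ a1, b0 ^ b1, h)   # three sub-products instead of four
    return lo ^ ((mid ^ lo ^ hi) << h) ^ (hi << (2 * h))
```

Each level replaces four half-size products by three, at the price of a few extra XORs, which is the space/time trade-off that digit-serial KA designs tune.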

Journal Article•10.1155/2014/343960•

Low-Area Wallace Multiplier

Shahzad Asif¹, Yinan Kong¹•Institutions (1)

Macquarie University¹

01 Jan 2014-VLSI Design

TL;DR: Synthesis results show that the proposed multiplier has the lowest area as compared to other tree-based multipliers, without compromising on the speed of the original Wallace multiplier.

Abstract: Multiplication is one of the most commonly used operations in arithmetic. Multipliers based on the Wallace reduction tree provide an area-efficient strategy for high-speed multiplication. A number of modifications have been proposed in the literature to optimize the area of the Wallace multiplier. This paper proposes a reduced-area Wallace multiplier without compromising on the speed of the original Wallace multiplier. Designs are synthesized using Synopsys Design Compiler in 90 nm process technology. Synthesis results show that the proposed multiplier has the lowest area compared to other tree-based multipliers. The speed of the proposed and reference multipliers is almost the same.
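
For readers unfamiliar with the Wallace reduction tree, here is a behavioural Python sketch of the classic row-wise scheme, not the reduced-area variant proposed in the paper. Partial-product rows are compressed three at a time by carry-save adders until two rows remain, and a single carry-propagate addition finishes the product. The function name and the returned level count are illustrative.

```python
def wallace_multiply(a, b, n_bits):
    """Return (a*b, number of carry-save levels) for unsigned n_bits operands."""
    # One partial-product row per multiplier bit.
    rows = [(a << i) if (b >> i) & 1 else 0 for i in range(n_bits)]
    levels = 0
    while len(rows) > 2:
        nxt = []
        for k in range(0, len(rows) - 2, 3):
            x, y, z = rows[k:k + 3]
            nxt.append(x ^ y ^ z)                           # sum bits
            nxt.append(((x & y) | (x & z) | (y & z)) << 1)  # carry bits
        nxt.extend(rows[len(rows) - len(rows) % 3:])        # rows left ungrouped
        rows = nxt
        levels += 1
    return sum(rows), levels                                # final adder
```

For 8-bit operands the row count shrinks as 8, 6, 4, 3, 2, so four carry-save levels precede the final adder.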

Book Chapter•10.1007/978-3-319-13051-4_20•

Fast Point Multiplication Algorithms for Binary Elliptic Curves with and without Precomputation

Thomaz Oliveira¹, Diego F. Aranha², Julio López², Francisco Rodríguez-Henríquez¹•Institutions (2)

CINVESTAV¹, State University of Campinas²

14 Aug 2014

TL;DR: The regular recoding approach is used, and the speed record for a single-core constant-time point multiplication on standardized binary elliptic curves at the $128$-bit security level is set.

Abstract: In this paper we introduce new methods for computing constant-time variable-base point multiplications over the Galbraith-Lin-Scott (GLS) and the Koblitz families of elliptic curves. Using a left-to-right double-and-add and a right-to-left halve-and-add Montgomery ladder over a GLS curve, we present some of the fastest timings yet reported in the literature for point multiplication. In addition, we combine these two procedures to compute a multi-core protected scalar multiplication. Furthermore, we designed a novel regular $\tau $-adic scalar expansion for Koblitz curves. As a result, using the regular recoding approach, we set the speed record for a single-core constant-time point multiplication on standardized binary elliptic curves at the $128$-bit security level.
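
The regular structure of a left-to-right double-and-add Montgomery ladder can be sketched over any additive group. The Python function below is a hedged illustration with hypothetical names. It takes the group operations as arguments instead of implementing GLS or Koblitz curve arithmetic, and the `if bit` swaps stand in for the branch-free conditional swap that a real constant-time implementation would use.

```python
def montgomery_ladder(k, P, add, double, zero, bits):
    """Return k*P using exactly one double and one add per scalar bit."""
    r0, r1 = zero, P                 # invariant: r1 - r0 == P
    for i in reversed(range(bits)):
        bit = (k >> i) & 1
        if bit:                      # placeholder for a constant-time swap
            r0, r1 = r1, r0
        r0, r1 = double(r0), add(r0, r1)
        if bit:
            r0, r1 = r1, r0
    return r0
```

A quick way to exercise it is with integers modulo m as the group, for example `add = lambda x, y: (x + y) % m` and `double = lambda x: (2 * x) % m`, where the result must equal `(k * P) % m`.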

Journal Article•10.1109/TC.2013.112•

A New Double Point Multiplication Algorithm and Its Application to Binary Elliptic Curves with Endomorphisms

Reza Azarderakhsh¹, Koray Karabina²•Institutions (2)

Rochester Institute of Technology¹, Florida Atlantic University²

01 Oct 2014-IEEE Transactions on Computers

TL;DR: A new double point multiplication algorithm based on differential addition chains that has a uniform structure and has some degree of built-in resistance against side channel analysis attacks is presented.

Abstract: We present a new double point multiplication algorithm based on differential addition chains. Our proposed scheme has a uniform structure and has some degree of built-in resistance against side channel analysis attacks. We discuss deploying our scheme in a hardware implementation of single point multiplication on binary elliptic curves with efficiently computable endomorphisms. Based on operation counts, we expect to gain accelerations of 30% and 18% for computing single point multiplication with and without availability of parallel multipliers, respectively, and these results are verified in our implementations.
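
Double point multiplication means computing kP + lQ in one pass. The Python sketch below uses the classical simultaneous (Straus-Shamir) method, which is a different and simpler technique than the differential addition chains of the paper. It is included only to show what the operation computes. The names and the generic group interface are assumptions.

```python
def double_point_mul(k, l, P, Q, add, zero, bits):
    """Return k*P + l*Q with one doubling and one table addition per bit."""
    table = [zero, P, Q, add(P, Q)]          # index = bit of k + 2 * bit of l
    acc = zero
    for i in reversed(range(bits)):
        acc = add(acc, acc)                  # doubling
        idx = ((k >> i) & 1) | (((l >> i) & 1) << 1)
        acc = add(acc, table[idx])           # adding zero keeps the loop uniform
    return acc
```

On a curve with an efficiently computable endomorphism, a single scalar multiplication can be rewritten as such a double multiplication with two half-length scalars, which is the application the abstract refers to.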

Journal Article•10.1016/J.MICPRO.2014.06.003•

A million-bit multiplier architecture for fully homomorphic encryption

Yarkin Doröz¹, Erdinc Ozturk², Berk Sunar¹•Institutions (2)

Worcester Polytechnic Institute¹, Istanbul Commerce University²

01 Nov 2014-Microprocessors and Microsystems

TL;DR: This paper presents a novel architecture to realize a million-bit multiplication scheme based on the Schönhage-Strassen Algorithm using the Number Theoretical Transform (NTT); it makes use of an innovative cache architecture along with processing elements customized to match the computation and access patterns of the NTT-based recursive multiplication algorithm.
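
The principle of NTT-based integer multiplication can be sketched in a few lines of Python. This is a software illustration under stated assumptions, not the hardware architecture or the parameter set of the paper: it uses the well-known NTT-friendly prime 998244353 with primitive root 3, base-256 digits, and an iterative radix-2 transform. All function names are hypothetical.

```python
P = 998244353          # NTT-friendly prime, 119 * 2^23 + 1
G = 3                  # primitive root modulo P

def ntt(a, invert=False):
    """In-place iterative radix-2 NTT; len(a) must be a power of two."""
    n = len(a)
    j = 0
    for i in range(1, n):                      # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(G, (P - 1) // length, P)   # primitive length-th root of unity
        if invert:
            w_len = pow(w_len, -1, P)
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w % P
                a[k], a[k + length // 2] = (u + v) % P, (u - v) % P
                w = w * w_len % P
        length <<= 1
    if invert:
        n_inv = pow(n, -1, P)
        for i in range(n):
            a[i] = a[i] * n_inv % P

def ntt_multiply(x, y):
    """Multiply non-negative integers via NTT convolution of base-256 digits."""
    xs = list(x.to_bytes((x.bit_length() + 7) // 8 or 1, "little"))
    ys = list(y.to_bytes((y.bit_length() + 7) // 8 or 1, "little"))
    # Convolution sums must stay below P, which bounds the operand size.
    assert min(len(xs), len(ys)) <= 15351
    n = 1
    while n < len(xs) + len(ys):
        n <<= 1
    fa = xs + [0] * (n - len(xs))
    fb = ys + [0] * (n - len(ys))
    ntt(fa)
    ntt(fb)
    fc = [p * q % P for p, q in zip(fa, fb)]   # pointwise product
    ntt(fc, invert=True)
    return sum(c << (8 * i) for i, c in enumerate(fc))   # carry propagation
```

The size assertion shows the limit of this toy parameter choice: true million-bit operands need a larger modulus or smaller digits, which is one of the design decisions a dedicated architecture has to make.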

Journal Article•10.7757/PERSNEWMUSI.52.1.0134•

Reexamining PC-Set Multiplication, Complex Multiplication, and Transpositional Combination to Determine Their Formal and Functional Equivalence

Ciro Scotto

01 Jan 2014-Perspectives of New Music

Journal Article•10.1109/TC.2014.2315626•

Fast Radix-10 Multiplication Using Redundant BCD Codes

Alvaro Vazquez¹, Elisardo Antelo¹, Javier D. Bruguera¹•Institutions (1)

University of Santiago de Compostela¹

01 Aug 2014-IEEE Transactions on Computers

TL;DR: The algorithm and architecture of a BCD parallel multiplier that exploits some properties of two different redundant BCD codes to speedup its computation are presented and an area improvement is shown roughly in the range 20-35 percent for similar target delays with respect to the fastest implementation.

Abstract: We present the algorithm and architecture of a BCD parallel multiplier that exploits some properties of two different redundant BCD codes to speedup its computation: the redundant BCD excess-3 code (XS-3), and the overloaded BCD representation (ODDS). In addition, new techniques are developed to reduce significantly the latency and area of previous representative high-performance implementations. Partial products are generated in parallel using a signed-digit radix-10 recoding of the BCD multiplier with the digit set [-5, 5], and a set of positive multiplicand multiples (0X, 1X, 2X, 3X, 4X, 5X) coded in XS-3. This encoding has several advantages. First, it is a self-complementing code, so that a negative multiplicand multiple can be obtained by just inverting the bits of the corresponding positive one. Also, the available redundancy allows a fast and simple generation of multiplicand multiples in a carry-free way. Finally, the partial products can be recoded to the ODDS representation by just adding a constant factor into the partial product reduction tree. Since the ODDS uses a similar 4-bit binary encoding as non-redundant BCD, conventional binary VLSI circuit techniques, such as binary carry-save adders and compressor trees, can be adapted efficiently to perform decimal operations. To show the advantages of our architecture, we have synthesized a RTL model for $16\times 16$-digit and $34\times 34$-digit multiplications and performed a comparative survey of the previous most representative designs. We show that the proposed decimal multiplier has an area improvement roughly in the range 20-35 percent for similar target delays with respect to the fastest implementation.
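
The self-complementing property of the excess-3 code can be checked with a short Python sketch. It uses the conventional non-redundant XS-3 code (digit d stored as d + 3), not the redundant XS-3 or ODDS representations of the paper, and the helper names are hypothetical. Inverting the four bits of every digit yields the nine's complement, so a negative multiple only needs bit inversion plus a +1 correction.

```python
def to_xs3(value, digits):
    """Encode a non-negative decimal number as 4-bit XS-3 codes, least digit first."""
    return [(value // 10**i) % 10 + 3 for i in range(digits)]

def from_xs3(codes):
    """Decode a list of XS-3 codes back to an integer."""
    return sum((c - 3) * 10**i for i, c in enumerate(codes))

def invert_bits(codes):
    """Bitwise inversion of each 4-bit code."""
    return [c ^ 0xF for c in codes]
```

For example, `from_xs3(invert_bits(to_xs3(2718, 4)))` gives 7281, which is 9999 - 2718; adding 1 produces the ten's complement used to represent -2718 modulo 10^4.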

Journal Article•10.1080/03081087.2013.825910•

On finding robust approximate inverses for large sparse matrices

Fazlollah Soleymani¹•Institutions (1)

Islamic Azad University¹

23 Sep 2014-Linear & Multilinear Algebra

TL;DR: This paper presents a method based on matrix-matrix multiplication concepts for determining the approximate (sparse) inverses of sparse matrices that can be combined with iterative solvers and sparse approximation techniques as well.

Abstract: This paper presents a method based on matrix-matrix multiplication concepts for determining the approximate (sparse) inverses of sparse matrices. The suggested method is a development on the well-known Schulz iteration and it can successfully be combined with iterative solvers and sparse approximation techniques as well. A detailed discussion on the convergence rate of this scheme is furnished. Results of numerical experiments are also reported to illustrate the performance of the proposed method.
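
The classical Schulz iteration that the paper builds on is X_{k+1} = X_k (2I - A X_k), which uses only matrix-matrix products. The dense, pure-Python sketch below illustrates that baseline, not the paper's improved scheme or its sparse approximation step. The starting guess A^T / (||A||_1 ||A||_inf) is a standard choice, and the function names and iteration count are assumptions.

```python
def matmul(A, B):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def schulz_inverse(A, iterations=30):
    """Approximate the inverse of a nonsingular square matrix A."""
    n = len(A)
    norm1 = max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in A)
    # Scaled transpose as the initial approximation.
    X = [[A[j][i] / (norm1 * norm_inf) for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        AX = matmul(A, X)
        R = [[(2.0 if i == j else 0.0) - AX[i][j] for j in range(n)] for i in range(n)]
        X = matmul(X, R)                     # X <- X (2I - A X)
    return X
```

The residual I - A X is squared at every step, so convergence is quadratic once the residual norm drops below one; for large sparse matrices the iterates fill in, which is why sparse approximation techniques are combined with the iteration.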

Journal Article•10.1049/IET-CDS.2013.0117•

Improved matrix multiplier design for high-speed digital signal processing applications

Prabir Saha¹, Arindam Banerjee², Partha Bhattacharyya³, Anup Dandapat¹•Institutions (3)

National Institute of Technology, Meghalaya¹, JIS College of Engineering², Indian Institute of Engineering Science and Technology, Shibpur³

16 Jan 2014-IET Circuits, Devices & Systems

TL;DR: The improvement in speed compared with earlier reported matrix multipliers (conventional algorithm, systolic array and PNTT-based implementation) was found to be ~67%, ~56% and ~65%, respectively.

Abstract: A transistor-level implementation of an improved matrix multiplier for high-speed digital signal processing applications based on matrix element transformation and multiplication is reported in this study. The improvement in speed was achieved by rearranging the matrix elements into a two-dimensional array of processing elements interconnected as a mesh. The edges of each row and column were interconnected in a torus structure, facilitating simultaneous implementation of several multiplications. The functionality of the circuitry was verified and the performance parameters, for example propagation delay and dynamic switching power consumption, were calculated with SPICE Spectre in 90 nm CMOS technology. The proposed methodology ensures a substantial reduction in propagation delay compared with the conventional algorithm, systolic array and pseudo number theoretic transformation (PNTT)-based implementation, which are the most commonly used techniques for matrix multiplication. The propagation delay of the implemented 4 × 4 matrix multiplier was only ~2 μs, whereas its power consumption was only ~3.12 mW. The improvement in speed compared with earlier reported matrix multipliers (conventional algorithm, systolic array and PNTT-based implementation) was found to be ~67%, ~56% and ~65%, respectively.

Journal Article•10.1002/SEC.779•

Binary and prime field multiplication for public key cryptography on embedded microprocessors

Hwajeong Seo¹, Yeoncheol Lee¹, Hyunjin Kim¹, Taehwan Park¹, Howon Kim¹•Institutions (1)

Pusan National University¹

01 Apr 2014-Security and Communication Networks

TL;DR: This paper focuses on a further optimization of multiprecision multiplication, the most expensive operation involved in PKC, using several innovative methods: carry-once, optimized multiplication and accumulation (MAC), unbalanced comb, and optimized comb-window.

Abstract: Embedded microprocessors are used in a wide variety of platforms, including radio frequency identification (RFID) systems, sensor networks, and smartphones. Unfortunately, as practical use of microprocessors has increased, so have the security problems associated with them. Although public key cryptography (PKC) can mitigate these problems, standard implementations of PKC also impose a steep computational cost on resource-constrained devices. To reduce this cost, researchers have proposed alternative implementations that accelerate multiprecision multiplication, the most expensive operation involved in PKC. In this paper, we focus on a further optimization of this same operation, using several innovative methods: carry-once, optimized multiplication and accumulation (MAC), unbalanced comb, and optimized comb-window. These methods yield further performance improvements of 2%, 17%, 4.5%, and 9.5%, respectively, on representative modern microprocessors including ATmega128 and MSP430. Copyright © 2013 John Wiley & Sons, Ltd.
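
As background for multiprecision multiplication built on multiply-accumulate (MAC) steps, here is a Python sketch of the textbook product-scanning (column-wise) method on 8-bit words. It is a baseline illustration, not the carry-once or comb methods of the paper, and the helper names and the word size default are assumptions.

```python
def to_words(x, count, w=8):
    """Split x into count little-endian w-bit words."""
    return [(x >> (w * i)) & ((1 << w) - 1) for i in range(count)]

def from_words(words, w=8):
    """Recombine little-endian w-bit words into an integer."""
    return sum(v << (w * i) for i, v in enumerate(words))

def product_scanning(a_words, b_words, w=8):
    """Column-wise multiplication of two little-endian word arrays."""
    n, m = len(a_words), len(b_words)
    mask = (1 << w) - 1
    result, acc = [], 0
    for col in range(n + m - 1):
        for i in range(max(0, col - m + 1), min(col, n - 1) + 1):
            acc += a_words[i] * b_words[col - i]   # multiply-accumulate (MAC)
        result.append(acc & mask)                  # one store per result word
        acc >>= w                                  # carry into the next column
    result.append(acc)                             # most significant word
    return result
```

Each result word is written exactly once, which is why column-wise methods suit register-starved microcontrollers; the price is an accumulator wider than two words.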
