Top 426 papers published in the topic of Multiplication in 2017

Showing papers on "Multiplication" published in 2017

Book Chapter•10.1007/978-3-319-70694-8_15•

Homomorphic Encryption for Arithmetic of Approximate Numbers

[...]

Jung Hee Cheon¹, Andrey Kim¹, Miran Kim², Yongsoo Song¹•Institutions (2)

Seoul National University¹, University of California²

3 Dec 2017

TL;DR: A method to construct a homomorphic encryption scheme for approximate arithmetic that supports an approximate addition and multiplication of encrypted messages, together with a new rescaling procedure for managing the magnitude of plaintext.

Abstract: We suggest a method to construct a homomorphic encryption scheme for approximate arithmetic. It supports an approximate addition and multiplication of encrypted messages, together with a new rescaling procedure for managing the magnitude of plaintext. This procedure truncates a ciphertext into a smaller modulus, which leads to rounding of plaintext. The main idea is to add a noise following significant figures which contain a main message. This noise is originally added to the plaintext for security, but considered to be a part of error occurring during approximate computations that is reduced along with plaintext by rescaling. As a result, our decryption structure outputs an approximate value of plaintext with a predetermined precision.
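Illustrative note: the rescaling step described above can be mimicked with plain fixed-point integers. The sketch below involves no encryption at all; the scaling factor `DELTA` and the helper names are assumptions made for illustration and are not taken from the paper.

```python
# Toy fixed-point model of the rescaling idea (no encryption involved).
# DELTA and all helper names are illustrative assumptions.
DELTA = 2 ** 20  # scaling factor: a real x is stored as round(x * DELTA)


def encode(x: float) -> int:
    """Store a real number as a scaled integer."""
    return round(x * DELTA)


def decode(m: int) -> float:
    """Recover an approximate real number from a scaled integer."""
    return m / DELTA


def multiply(m1: int, m2: int) -> int:
    """Product of two scaled integers; the result carries scale DELTA**2."""
    return m1 * m2


def rescale(m: int) -> int:
    """Divide by DELTA with rounding, bringing the scale back to DELTA.

    The discarded low-order digits are where the accumulated error lives.
    """
    return (m + DELTA // 2) // DELTA


a, b = encode(3.14159), encode(2.71828)
approx = decode(rescale(multiply(a, b)))
```

Without the `rescale` call the scale would square after every multiplication, so the stored integers would roughly double in bit length each time.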

1,933 citations

Posted Content•

Quantum resource estimates for computing elliptic curve discrete logarithms

[...]

Martin Roetteler, Michael Naehrig, Krysta M. Svore, Kristin E. Lauter

21 Jun 2017-arXiv: Quantum Physics

TL;DR: The authors give precise quantum resource estimates for Shor's algorithm to compute discrete logarithms on elliptic curves over prime fields, and the results suggest that ECC is an easier target than RSA.

Abstract: We give precise quantum resource estimates for Shor's algorithm to compute discrete logarithms on elliptic curves over prime fields. The estimates are derived from a simulation of a Toffoli gate network for controlled elliptic curve point addition, implemented within the framework of the quantum computing software tool suite LIQ$Ui|\rangle$. We determine circuit implementations for reversible modular arithmetic, including modular addition, multiplication and inversion, as well as reversible elliptic curve point addition. We conclude that elliptic curve discrete logarithms on an elliptic curve defined over an $n$-bit prime field can be computed on a quantum computer with at most $9n + 2\lceil\log_2(n)\rceil+10$ qubits using a quantum circuit of at most $448 n^3 \log_2(n) + 4090 n^3$ Toffoli gates. We are able to classically simulate the Toffoli networks corresponding to the controlled elliptic curve point addition as the core piece of Shor's algorithm for the NIST standard curves P-192, P-224, P-256, P-384 and P-521. Our approach allows gate-level comparisons to recent resource estimates for Shor's factoring algorithm. The results also support estimates given earlier by Proos and Zalka and indicate that, for current parameters at comparable classical security levels, the number of qubits required to tackle elliptic curves is less than for attacking RSA, suggesting that indeed ECC is an easier target than RSA.
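Illustrative note: the two upper bounds quoted in the abstract can be evaluated directly for the NIST curve sizes it lists. The function names below are assumptions; only the formulas come from the abstract.

```python
import math


def qubit_bound(n: int) -> int:
    """Upper bound on qubits: 9n + 2*ceil(log2(n)) + 10."""
    return 9 * n + 2 * math.ceil(math.log2(n)) + 10


def toffoli_bound(n: int) -> float:
    """Upper bound on Toffoli gates: 448 n^3 log2(n) + 4090 n^3."""
    return 448 * n**3 * math.log2(n) + 4090 * n**3


# Bit sizes of the NIST prime curves named in the abstract.
for n in (192, 224, 256, 384, 521):
    print(f"n={n}: qubits <= {qubit_bound(n)}, Toffoli gates <= {toffoli_bound(n):.3e}")
```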

212 citations

Journal Article•10.1145/3017994•

Sparse Matrix-Vector Multiplication on GPGPUs

[...]

Salvatore Filippone¹, Valeria Cardellini², Davide Barbieri², Alessandro Fanfarillo²•Institutions (2)

Cranfield University¹, University of Rome Tor Vergata²

09 Jan 2017-ACM Transactions on Mathematical Software

TL;DR: This article provides a review of the techniques for implementing the SpMV kernel on GPGPUs that have appeared in the literature of the last few years, and discusses the issues and tradeoffs that have been encountered by the various researchers.

Abstract: The multiplication of a sparse matrix by a dense vector (SpMV) is a centerpiece of scientific computing applications: it is the essential kernel for the solution of sparse linear systems and sparse eigenvalue problems by iterative methods. The efficient implementation of the sparse matrix-vector multiplication is therefore crucial and has been the subject of an immense amount of research, with interest renewed with every major new trend in high-performance computing architectures. The introduction of General-Purpose Graphics Processing Units (GPGPUs) is no exception, and many articles have been devoted to this problem. With this article, we provide a review of the techniques for implementing the SpMV kernel on GPGPUs that have appeared in the literature of the last few years. We discuss the issues and tradeoffs that have been encountered by the various researchers, and a list of solutions, organized in categories according to common features. We also provide a performance comparison across different GPGPU models and on a set of test matrices coming from various application domains.
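Illustrative note: for readers unfamiliar with the SpMV kernel itself, here is a minimal sequential sketch using the compressed sparse row (CSR) layout. The choice of CSR and all names are assumptions for illustration; the article reviews many storage formats and GPU-specific techniques that this CPU sketch does not attempt to reproduce.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i inside
    col_idx (column indices) and values (nonzero entries).
    """
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect access into x
        y[i] = acc
    return y


# The 3x3 matrix [[4, 0, 1], [0, 2, 0], [3, 0, 5]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

The indirect access `x[col_idx[k]]` is the irregular memory pattern that makes this kernel hard to run efficiently on wide parallel hardware.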

139 citations

Proceedings Article•10.1145/3061639.3062210•

CFPU: Configurable Floating Point Multiplier for Energy-Efficient Computing

[...]

Mohsen Imani¹, Daniel Peroni¹, Tajana Rosing¹•Institutions (1)

University of California, San Diego¹

18 Jun 2017

TL;DR: This paper proposes a novel approximate floating point multiplier, called CFPU, which significantly reduces energy and improves performance of multiplication at the expense of accuracy, and shows that it can outperform a standard FPU when at least 4% of multiplications are performed in approximate mode.

Abstract: Many applications, such as machine learning and data sensing, are statistical in nature and can tolerate some level of inaccuracy in their computation. Approximate computation is a viable method to save energy and increase performance by trading accuracy for energy. There are a number of proposed approximate solutions; however, they are limited to a small range of applications because they cannot control the error rate of their output. In this paper, we propose a novel approximate floating point multiplier, called CFPU, which significantly reduces energy and improves performance of multiplication at the expense of accuracy. Our design approximately models multiplication by replacing the most costly step of the operation with a lower energy alternative. In order to tune the level of approximation, CFPU dynamically identifies the inputs which will produce the largest approximation error and processes them in precise CFPU mode. We showed that our CFPU can outperform a standard FPU when at least 4% of multiplications are performed in approximate mode. In our tested applications this percentage of multiplications is substantially higher, leading to significant energy savings. Our experimental evaluation on an AMD Southern Islands GPU shows that replacing traditional FPUs with the proposed CFPU results in 77% energy savings and 3.5× energy-delay product improvement over eight general OpenCL applications while providing acceptable quality of service. In addition, for the same level of accuracy, the CFPU provides 2.4× energy-delay product improvement compared to state-of-the-art approximate multipliers.
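Illustrative note: the abstract does not say which step is replaced or how inputs are screened. The sketch below is one plausible software reading, offered purely as an assumption: the costly step is taken to be the mantissa multiplication, it is skipped when the leading `tune_bits` bits of one operand's fractional mantissa are all zeros or all ones, and the exact product is used otherwise. The function name and the `tune_bits` parameter are hypothetical and do not describe the CFPU circuit.

```python
import math


def approx_mul(a: float, b: float, tune_bits: int = 3):
    """Return (product, used_approximation) for finite inputs.

    Assumed scheme: write |b| = (1 + frac) * 2**e with 0 <= frac < 1.
    If frac is close to 0, approximate |b| by 2**e; if close to 1,
    by 2**(e + 1); otherwise fall back to the exact product.
    """
    if a == 0.0 or b == 0.0:
        return a * b, False
    m, e = math.frexp(abs(b))            # |b| = m * 2**e with 0.5 <= m < 1
    frac, e = 2.0 * m - 1.0, e - 1       # |b| = (1 + frac) * 2**e
    top = int(frac * (1 << tune_bits))   # leading tune_bits bits of frac
    sign = math.copysign(1.0, b)
    if top == 0:                         # mantissa near 1.0: drop it
        return sign * math.ldexp(a, e), True
    if top == (1 << tune_bits) - 1:      # mantissa near 2.0: round up
        return sign * math.ldexp(a, e + 1), True
    return a * b, False                  # large error expected: exact mode


p_small, used_small = approx_mul(3.0, 2.03)   # mantissa of 2.03 is near 1.0
p_exact, used_exact = approx_mul(3.0, 2.5)    # mantissa 1.25 fails the check
```

Raising `tune_bits` tightens the error bound but sends more multiplications to the exact path, which is the kind of accuracy and energy trade-off the abstract describes.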

89 citations

Book Chapter•10.1007/978-3-319-70700-6_12•

Linear-Time Zero-Knowledge Proofs for Arithmetic Circuit Satisfiability

[...]

Jonathan Bootle¹, Andrea Cerulli¹, Essam Ghadafi², Jens Groth¹, Mohammad Hajiabadi³, Sune K. Jakobsen¹•Institutions (3)

University College London¹, University of the West of England², University of California, Berkeley³

3 Dec 2017

TL;DR: This work gives computationally efficient zero-knowledge proofs of knowledge for arithmetic circuit satisfiability over a large field in an ideal linear commitment model where the prover may commit to secret vectors of field elements, and the verifier can receive certified linear combinations of those vectors.

Abstract: We give computationally efficient zero-knowledge proofs of knowledge for arithmetic circuit satisfiability over a large field. For a circuit with N addition and multiplication gates, the prover only uses $\mathcal {O}(N)$ multiplications and the verifier only uses $\mathcal {O}(N)$ additions in the field. If the commitments we use are statistically binding, our zero-knowledge proofs have unconditional soundness, while if the commitments are statistically hiding we get computational soundness. Our zero-knowledge proofs also have sub-linear communication if the commitment scheme is compact.

85 citations

Journal Article•10.1109/TIE.2016.2625241•

An Efficient and Flexible Hardware Implementation of the Dual-Field Elliptic Curve Cryptographic Processor

[...]

Zilong Liu¹, Dongsheng Liu¹, Xuecheng Zou¹•Institutions (1)

Huazhong University of Science and Technology¹

01 Mar 2017-IEEE Transactions on Industrial Electronics

TL;DR: An efficient and flexible dual-field ECC processor which can support arbitrary elliptic curve standards and algorithms using the hardware–software approach is presented.

Abstract: Elliptic curve cryptography (ECC) has been widely used for digital signatures to ensure security in communication. It is important for an ECC processor to support a variety of ECC standards to be compatible with different security applications. Thus, a flexible processor which can support different standards and algorithms is desired. In this paper, an efficient and flexible dual-field ECC processor using the hardware–software approach is presented. The proposed processor can support arbitrary elliptic curves. An elaborate modular arithmetic logic unit is designed. It can perform basic modular arithmetic operations and achieve high efficiency. Based on our designed instruction set, the processor can be programmed to perform various point operations based on different algorithms. To demonstrate the flexibility of our processor, a point multiplication algorithm with power analysis resistance is adopted. Our design is implemented on a field-programmable gate array platform and also as an application-specific integrated circuit. When implemented in a 55 nm CMOS process, the processor takes between 0.60 ms (163-bit ECC) and 6.75 ms (571-bit ECC) to finish one point multiplication. Compared to other related works, the merits of our ECC processor are its high hardware efficiency and flexibility.
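Illustrative note: the abstract does not name the power-analysis-resistant point multiplication it adopts. A common choice with a regular operation pattern is the Montgomery ladder, sketched here on a tiny textbook curve as an assumption. The curve parameters and function names are illustrative, and Python code like this is not constant-time, so it shows only the structure of the algorithm.

```python
# Toy curve y^2 = x^3 + 2x + 2 over F_17 (illustrative only, not secure).
P, A = 17, 2
G = (5, 1)     # a point of order 19 on this curve
INF = None     # point at infinity


def add(p, q):
    """Affine point addition, including doubling and the identity."""
    if p is INF:
        return q
    if q is INF:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P == 0:
        return INF
    if p == q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P, -1, P) % P
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)


def ladder(k, point, bits=8):
    """Montgomery ladder: one addition and one doubling per key bit.

    The invariant r1 - r0 == point holds after every iteration.
    """
    r0, r1 = INF, point
    for i in reversed(range(bits)):
        if (k >> i) & 1:
            r0, r1 = add(r0, r1), add(r1, r1)
        else:
            r0, r1 = add(r0, r0), add(r0, r1)
    return r0
```

Because every bit triggers the same pair of operations, the sequence of additions and doublings does not depend on the scalar, unlike plain double-and-add. Note that `pow(x, -1, P)` requires Python 3.8 or later.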

81 citations

Proceedings Article•10.1109/ICPP.2017.19•

High-Performance and Memory-Saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU

[...]

Yusuke Nagasaka¹, Akira Nukada¹, Satoshi Matsuoka¹•Institutions (1)

Tokyo Institute of Technology¹

1 Aug 2017

TL;DR: This work proposes a new fast SpGEMM algorithm that requires a small amount of memory and achieves high performance, reducing memory usage on average by 14.7% in single precision and 10.9% in double precision, which allows larger matrices to be computed.

Abstract: Sparse general matrix-matrix multiplication (SpGEMM) is one of the key kernels of preconditioners such as the algebraic multigrid method and of graph algorithms. However, the performance of SpGEMM is quite low on modern processors due to random memory access to both input and output matrices. In addition, the number and the pattern of non-zero elements in the output matrix, which are important for achieving locality, are unknown before execution. Moreover, state-of-the-art GPU implementations of SpGEMM require large amounts of memory for temporary results, limiting the matrix size computable in fast GPU device memory. We propose a new fast SpGEMM algorithm that requires a small amount of memory and achieves high performance. Calculation of the pattern and values of the output matrix is optimized by using the GPU's on-chip shared memory and a hash table. Additionally, our algorithm launches multiple kernels running concurrently to improve the utilization of GPU resources. The kernel for calculating each row of the output matrix is chosen based on the number of non-zero elements. Performance evaluation using matrices from the University of Florida Sparse Matrix Collection on NVIDIA's Pascal generation GPU shows that our approach achieves speedups of up to 4.3× in single precision and 4.4× in double precision compared to existing SpGEMM libraries. Furthermore, the memory usage is reduced by 14.7% in single precision and 10.9% in double precision on average, allowing larger matrices to be computed.
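Illustrative note: the idea of accumulating each output row in a hash table can be shown with a short sequential sketch. A Python `dict` stands in for the hash table here; the row-of-dicts storage and the function name are assumptions, and nothing below models the GPU shared memory or the kernel selection described in the abstract.

```python
def spgemm_rowwise(a_rows, b_rows):
    """Compute C = A @ B row by row.

    Each matrix is a list of rows; each row is a dict {column: value}.
    One hash table per output row collects partial sums, so neither the
    number nor the positions of the output nonzeros must be known upfront.
    """
    c_rows = []
    for a_row in a_rows:
        acc = {}  # hash table: output column -> running sum
        for k, a_val in a_row.items():
            for j, b_val in b_rows[k].items():
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        c_rows.append(acc)
    return c_rows


A = [{0: 1.0, 2: 2.0}, {1: 3.0}]               # 2 x 3
B = [{1: 4.0}, {0: 5.0}, {1: 6.0, 2: 7.0}]     # 3 x 3
C = spgemm_rowwise(A, B)
```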

81 citations

Journal Article•10.1016/J.JSC.2016.07.025•

Sparse FGLM algorithms

[...]

Jean-Charles Faugère, Chenqi Mou

01 May 2017-Journal of Symbolic Computation

TL;DR: A deterministic top-level algorithm is proposed that automatically detects which method to use depending on the input; the implementation is able to handle ideals of degree over 40000 and outperforms the Magma and Singular ones.

79 citations

Book Chapter•10.1007/978-3-319-63688-7_8•

Secure Arithmetic Computation with Constant Computational Overhead

[...]

Benny Applebaum¹, Ivan Damgård², Yuval Ishai³, Michael Nielsen², Lior Zichron¹•Institutions (3)

Tel Aviv University¹, Aarhus University², Technion – Israel Institute of Technology³

20 Aug 2017

TL;DR: This work studies the complexity of securely evaluating an arithmetic circuit over a finite field in the setting of secure two-party computation with semi-honest adversaries, and presents the first protocol that makes only a constant (amortized) number of field operations per gate.

Abstract: We study the complexity of securely evaluating an arithmetic circuit over a finite field $\mathbb {F}$ in the setting of secure two-party computation with semi-honest adversaries. In all existing protocols, the number of arithmetic operations per multiplication gate grows either linearly with $\log |\mathbb {F}|$ or polylogarithmically with the security parameter. We present the first protocol that only makes a constant (amortized) number of field operations per gate. The protocol uses the underlying field $\mathbb {F}$ as a black box, and its security is based on arithmetic analogues of well-studied cryptographic assumptions.

74 citations

Proceedings Article•10.5555/3039686.3039828•

Faster online matrix-vector multiplication

[...]

Kasper Green Larsen¹, Ryan Williams²•Institutions (2)

Aarhus University¹, Stanford University²

16 Jan 2017

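TL;DR: This paper gives a substantially faster randomized algorithm for the Online Boolean Matrix-Vector Multiplication (OMV) problem, running in [EQUATION] time and improving on previous combinatorial algorithms that run in $O(n^3/\log^2 n)$ time; it also gives a cell probe data structure with worst case [EQUATION] time per query vector, where $w$ is the word size.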

Abstract: We consider the Online Boolean Matrix-Vector Multiplication (OMV) problem studied by Henzinger et al. [STOC'15]: given an $n \times n$ Boolean matrix $M$, we receive $n$ Boolean vectors $v_1, \ldots, v_n$ one at a time, and are required to output $Mv_i$ (over the Boolean semiring) before seeing the vector $v_{i+1}$, for all $i$. Previous known algorithms for this problem are combinatorial, running in $O(n^3/\log^2 n)$ time. Henzinger et al. conjecture there is no $O(n^{3-\varepsilon})$ time algorithm for OMV, for all $\varepsilon > 0$; their OMV conjecture is shown to imply strong hardness results for many basic dynamic problems. We give a substantially faster method for computing OMV, running in [EQUATION] randomized time. In fact, after seeing [EQUATION] vectors, we already achieve [EQUATION] amortized time for matrix-vector multiplication. Our approach gives a way to reduce matrix-vector multiplication to solving a version of the Orthogonal Vectors problem, which in turn reduces to "small" algebraic matrix-matrix multiplication. Applications include faster independent set detection, partial match retrieval, and 2-CNF evaluation. We also show how a modification of our method gives a cell probe data structure for OMV with worst case [EQUATION] time per query vector, where $w$ is the word size. This result rules out an unconditional proof of the OMV conjecture using purely information-theoretic arguments.
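Illustrative note: to make the problem statement concrete, here is the straightforward baseline for OMV, with each matrix row and each vector packed into a Python integer used as a bitset. This is not the paper's algorithm; the encoding and names are assumptions chosen to keep the sketch short.

```python
def omv_naive(rows, vectors):
    """Answer OMV queries one at a time over the Boolean semiring.

    rows[i] is an int whose bit j equals M[i][j]; each vector is an int
    whose bit j equals v[j]. The generator yields M v_i (bit i of the
    result is entry i) before it looks at the next vector.
    """
    for v in vectors:
        out = 0
        for i, row in enumerate(rows):
            if row & v:          # OR over j of (M[i][j] AND v[j])
                out |= 1 << i
        yield out


# M = [[1, 0], [1, 1]]; queries v_1 = (0, 1) and v_2 = (1, 0).
answers = list(omv_naive([0b01, 0b11], [0b10, 0b01]))
```

Using a generator mirrors the online constraint: the answer for one vector is produced before the next vector is consumed.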

72 citations

Journal Article•10.19086/DA.1245•

On cap sets and the group-theoretic approach to matrix multiplication

[...]

Jonah Blasiak, Thomas Church, Henry Cohn, Joshua A. Grochow, Eric Naslund, Will Sawin, Christopher Umans

16 Jan 2017

TL;DR: This paper rules out obtaining ω = 2 in the group-theoretic framework of Cohn and Umans from abelian groups of bounded exponent by bounding the size of tricolored sum-free sets in such groups, and shows that a variant of tensor rank due to Tao gives a quantitative understanding of the notion of unstable tensor from geometric invariant theory.

Abstract: In 2003, Cohn and Umans described a framework for proving upper bounds on the exponent ω of matrix multiplication by reducing matrix multiplication to group algebra multiplication, and in 2005 Cohn, Kleinberg, Szegedy, and Umans proposed specific conjectures for how to obtain ω = 2. In this paper we rule out obtaining ω = 2 in this framework from abelian groups of bounded exponent. To do this we bound the size of tricolored sum-free sets in such groups, extending the breakthrough results of Croot, Lev, Pach, Ellenberg, and Gijswijt on cap sets. As a byproduct of our proof, we show that a variant of tensor rank due to Tao gives a quantitative understanding of the notion of unstable tensor from geometric invariant theory.

Journal Article•10.1109/TC.2017.2677426•

Evaluation of Large Integer Multiplication Methods on Hardware

[...]

Ciara Rafferty¹, Maire O'Neill¹, Neil Hanley¹•Institutions (1)

Queen's University Belfast¹

01 Aug 2017-IEEE Transactions on Computers

TL;DR: A comparison of hardware architectures for large integer multiplication is presented and it is shown that hardware designs of combination multipliers, at a cost of additional hardware resource usage, can offer lower latency compared to individual multiplier designs.

Abstract: Multipliers requiring large bit lengths have a major impact on the performance of many applications, such as cryptography, digital signal processing (DSP) and image processing. Novel, optimised designs of large integer multiplication are needed as previous approaches, such as schoolbook multiplication, may not be as feasible due to the large parameter sizes. Parameter bit lengths of up to millions of bits are required for use in cryptography, such as in lattice-based and fully homomorphic encryption (FHE) schemes. This paper presents a comparison of hardware architectures for large integer multiplication. Several multiplication methods and combinations thereof are analysed for suitability in hardware designs, targeting the FPGA platform. In particular, the first hardware architecture combining Karatsuba and Comba multiplication is proposed. Moreover, a hardware complexity analysis is conducted to give results independent of any particular FPGA platform. It is shown that hardware designs of combination multipliers, at a cost of additional hardware resource usage, can offer lower latency compared to individual multiplier designs. Indeed, the proposed novel combination hardware design of the Karatsuba-Comba multiplier offers lowest latency for integers greater than 512 bits. For large multiplicands, greater than 16,384 bits, the hardware complexity analysis indicates that the NTT-Karatsuba-Schoolbook combination is most suitable.
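Illustrative note: the combination idea (a recursive method on top, a simpler multiplier at the bottom) can be sketched in software with Karatsuba recursion and a small-operand base case. Python's built-in multiplication stands in for the schoolbook or Comba unit here, and the threshold value is an arbitrary assumption; none of this models the hardware architectures evaluated in the paper.

```python
def karatsuba(x: int, y: int, threshold_bits: int = 64) -> int:
    """Multiply non-negative integers with three half-size products."""
    if threshold_bits < 4:
        raise ValueError("threshold_bits must be at least 4 to guarantee termination")
    n = max(x.bit_length(), y.bit_length())
    if n <= threshold_bits:
        return x * y  # base case: stand-in for a schoolbook or Comba multiplier
    half = n // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    hi = karatsuba(x_hi, y_hi, threshold_bits)
    lo = karatsuba(x_lo, y_lo, threshold_bits)
    # (x_hi + x_lo)(y_hi + y_lo) - hi - lo equals x_hi*y_lo + x_lo*y_hi.
    mid = karatsuba(x_hi + x_lo, y_hi + y_lo, threshold_bits) - hi - lo
    return (hi << (2 * half)) + (mid << half) + lo
```

Each level replaces four half-size products with three, at the cost of extra additions; the threshold decides where the simpler multiplier takes over, which is the design parameter a combination multiplier tunes.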

Journal Article•10.1109/TIT.2016.2614685•

An Efficient Framework for Unconditionally Secure Multiparty Computation

[...]

Ashish Choudhury¹, Arpita Patra²•Institutions (2)

International Institute of Information Technology¹, Indian Institute of Science²

01 Jan 2017-IEEE Transactions on Information Theory

TL;DR: This paper proposes a new and simple framework for generating shared and private random multiplication triples with unconditional security and presents the first ever MPC protocols with a linear (in the number of parties) communication overhead per multiplication gate in the circuit.

Abstract: Threshold unconditionally secure multiparty computation (MPC) allows a set of $n$ mutually distrusting parties to securely compute an agreed function $f$ over some finite field in the presence of a computationally unbounded adversary, who can maliciously corrupt any $t$ out of the $n$ parties. Most of the known efficient MPC protocols are designed in the offline–online framework introduced in a seminal work by Beaver in CRYPTO 1991. In this framework, the parties generate shared random and private multiplication-triples during the offline phase, which are used later in the online phase for securely evaluating the multiplication gates in the circuit representing $f$. The efficiency of the MPC protocols in this framework then relies on efficient ways of implementing the offline phase. In this paper, we propose a new and simple framework for generating shared and private random multiplication triples with unconditional security. The existing protocols approach this problem by first producing shared pairs of private and random values, followed by securely computing the shared product of each pair of values. The latter task involves a multiplication protocol for shared values that are typically communication intensive. Our framework takes a completely different approach and shuns the use of multiplication protocol. Namely, we ask the parties to verifiably share random multiplication triples and then securely extract shared random multiplication triples unknown to the adversary, from the shared triples. Realizing our framework in the asynchronous and hybrid network setting (in a hybrid network, it is assumed that the network is synchronous up to a certain "point" and asynchronous after that point onward; we assume a hybrid network with just one synchronous round in the beginning), we present the first ever MPC protocols with a linear (in the number of parties) communication overhead per multiplication gate in the circuit representing $f$. These are significant improvements over the best known existing MPC protocols in the asynchronous and hybrid network setting with communication complexity $\mathcal{O}(n^{2})$ and $\mathcal{O}(n^{3})$, respectively. Our framework when applied to the synchronous setting results in round-efficient MPC protocols.
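Illustrative note: the role a multiplication triple plays in the online phase can be shown with a small sketch. It uses additive secret sharing and a trusted dealer purely for brevity; both are assumptions, as are the modulus and all names. It does not model malicious corruption, threshold sharing, or the triple-generation framework that is the paper's actual contribution.

```python
import random

Q = 2**61 - 1  # prime field modulus (illustrative choice)


def share(value, n, rng):
    """Split value into n additive shares modulo Q."""
    parts = [rng.randrange(Q) for _ in range(n - 1)]
    parts.append((value - sum(parts)) % Q)
    return parts


def reconstruct(parts):
    return sum(parts) % Q


def beaver_multiply(x_sh, y_sh, triple):
    """Turn shares of x and y into shares of x*y using a triple (a, b, c = a*b)."""
    a_sh, b_sh, c_sh = triple
    # Opening d = x - a and e = y - b leaks nothing about x and y,
    # because a and b are uniformly random and unknown to the parties.
    d = reconstruct([(x - a) % Q for x, a in zip(x_sh, a_sh)])
    e = reconstruct([(y - b) % Q for y, b in zip(y_sh, b_sh)])
    # x*y = c + d*b + e*a + d*e; every term except d*e is linear in the shares.
    z_sh = [(c + d * b + e * a) % Q for a, b, c in zip(a_sh, b_sh, c_sh)]
    z_sh[0] = (z_sh[0] + d * e) % Q  # the public term is added by one party only
    return z_sh


rng = random.Random(0)
n_parties = 3
a, b = rng.randrange(Q), rng.randrange(Q)
triple = (share(a, n_parties, rng), share(b, n_parties, rng), share(a * b % Q, n_parties, rng))
x, y = 123456789, 987654321
z_sh = beaver_multiply(share(x, n_parties, rng), share(y, n_parties, rng), triple)
```

The online step needs only two openings and local linear operations, which is why the cost of generating the triples dominates the overall protocol.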

Journal Article•10.1002/CTA.2295•

Low latency flexible FPGA implementation of point multiplication on elliptic curves over GF(p)

[...]

Khalid Javeed¹,², Xiaojun Wang²,³•Institutions (3)

COMSATS Institute of Information Technology¹, Dublin City University², Nanjing University of Information Science and Technology³

01 Feb 2017-International Journal of Circuit Theory and Applications

Book Chapter•10.1007/978-3-319-63697-9_14•

Private Multiplication over Finite Fields

[...]

Sonia Belaïd¹, Fabrice Benhamouda², Alain Passelègue³, Emmanuel Prouff⁴, Adrian Thillard, Damien Vergnaud⁵•Institutions (5)

Thales Communications¹, IBM², University of California, Los Angeles³, University of Paris⁴, École Normale Supérieure⁵

20 Aug 2017

TL;DR: An algebraic characterization of the privacy for multiplication in any finite field is introduced, a novel algebraic characterization for non-interference (a stronger security notion in this setting) is proposed, and two generic constructions of multiplication circuits in finite fields that achieve non-interference in the probing model are presented.

Abstract: The notion of privacy in the probing model, introduced by Ishai, Sahai, and Wagner in 2003, is nowadays frequently involved to assess the security of circuits manipulating sensitive information. However, provable security in this model still comes at the cost of a significant overhead both in terms of arithmetic complexity and randomness complexity. In this paper, we deal with this issue for circuits processing multiplication over finite fields. Our contributions are manifold. Extending the work of Belaid, Benhamouda, Passelegue, Prouff, Thillard, and Vergnaud at Eurocrypt 2016, we introduce an algebraic characterization of the privacy for multiplication in any finite field and we propose a novel algebraic characterization for non-interference (a stronger security notion in this setting). Then, we present two generic constructions of multiplication circuits in finite fields that achieve non-interference in the probing model. Denoting by d the number of probes used by the adversary, the first proposal reduces the number of bilinear multiplications (i.e., of general multiplications of two non-constant values in the finite field) to only $2d+1$ whereas the state-of-the-art was $O(d^2)$. The second proposal reduces the randomness complexity to d random elements in the underlying finite field, hence improving the $O(d \log d)$ randomness complexity achieved by Belaid et al. in their paper. This construction is almost optimal since we also prove that d / 2 is a lower bound. Eventually, we show that both algebraic constructions can always be instantiated in large enough finite fields. Furthermore, for the important cases $d \in \{2,3\}$, we illustrate that they perform well in practice by presenting explicit realizations for finite fields of practical interest.
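Illustrative note: for context on the $O(d^2)$ baseline mentioned in the abstract, the sketch below shows a classic quadratic-cost masked multiplication in the style of Ishai, Sahai, and Wagner, adapted to additive shares over a small prime field. The adaptation, the modulus, and all names are assumptions. This is not either of the paper's constructions, and plain Python cannot express the order-of-evaluation constraints that probing security depends on.

```python
import random

Q = 257  # small prime modulus (illustrative choice)


def isw_style_multiply(a_sh, b_sh, rng):
    """Return additive shares of a*b from d+1 additive shares of a and of b.

    Uses (d+1)**2 bilinear multiplications a_i * b_j, which is the
    quadratic cost that the abstract's first construction reduces to 2d+1.
    """
    n = len(a_sh)
    r = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r[i][j] = rng.randrange(Q)                       # fresh randomness
            r[j][i] = (a_sh[i] * b_sh[j] - r[i][j] + a_sh[j] * b_sh[i]) % Q
    return [
        (a_sh[i] * b_sh[i] + sum(r[i][j] for j in range(n) if j != i)) % Q
        for i in range(n)
    ]


rng = random.Random(1)
a_sh = [rng.randrange(Q), rng.randrange(Q)]
a_sh.append((5 - sum(a_sh)) % Q)   # three shares of a = 5 (d = 2)
b_sh = [rng.randrange(Q), rng.randrange(Q)]
b_sh.append((7 - sum(b_sh)) % Q)   # three shares of b = 7
c_sh = isw_style_multiply(a_sh, b_sh, rng)
```

Summing the output shares gives $\sum_i a_i b_i + \sum_{i<j} (a_i b_j + a_j b_i)$, which is exactly the product of the two share sums, since each random value cancels against its partner.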

Proceedings Article•10.1109/IPDPS.2017.76•

A Work-Efficient Parallel Sparse Matrix-Sparse Vector Multiplication Algorithm

[...]

Ariful Azad¹, Aydin Buluc¹•Institutions (1)

Lawrence Berkeley National Laboratory¹

1 May 2017

TL;DR: A work-efficient multithreaded algorithm for sparse matrix-sparse vector multiplication (SpMSpV) where the matrix, the input vector, and the output vector are all sparse.

Abstract: We design and develop a work-efficient multithreaded algorithm for sparse matrix-sparse vector multiplication (SpMSpV) where the matrix, the input vector, and the output vector are all sparse. SpMSpV is an important primitive in the emerging GraphBLAS standard and is the workhorse of many graph algorithms including breadth-first search, bipartite graph matching, and maximal independent set. As thread counts increase, existing multithreaded SpMSpV algorithms can spend more time accessing the sparse matrix data structure than doing arithmetic. Our shared-memory parallel SpMSpV algorithm is work efficient in the sense that its total work is proportional to the number of arithmetic operations required. The key insight is to avoid having each thread individually scan the list of matrix columns. Our algorithm is simple to implement and operates on existing column-based sparse matrix formats. It performs well on diverse matrices and vectors with heterogeneous sparsity patterns. A high-performance implementation of the algorithm attains up to 15x speedup on a 24-core Intel Ivy Bridge processor and up to 49x speedup on a 64-core Intel KNL manycore processor. In contrast to implementations of existing algorithms, the performance of our algorithm is sustained on a variety of different input types, including matrices representing scale-free and high-diameter graphs.
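Illustrative note: the sequential core of SpMSpV on a column-based format can be sketched in a few lines. The compressed sparse column (CSC) layout, the dict-based sparse vectors, and the names are assumptions for illustration; the multithreaded scheme that is the paper's contribution is not shown.

```python
def spmspv_csc(col_ptr, row_idx, values, x):
    """Compute y = A @ x where A is in CSC form and x is a sparse dict.

    Only the columns selected by the nonzeros of x are visited, so the
    work is proportional to the number of multiply-add operations needed.
    """
    y = {}
    for j, x_j in x.items():
        for k in range(col_ptr[j], col_ptr[j + 1]):
            i = row_idx[k]
            y[i] = y.get(i, 0.0) + values[k] * x_j   # scatter into output
    return y


# The 3x3 matrix [[4, 0, 1], [0, 2, 0], [3, 0, 5]] in CSC form.
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 2]
values = [4.0, 3.0, 2.0, 1.0, 5.0]
y = spmspv_csc(col_ptr, row_idx, values, {2: 2.0})
```

With a single nonzero in the input vector, only one column is touched, which shows why scanning every column per thread would waste work.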

Journal Article•10.1109/TPDS.2017.2656893•

Exploiting Locality in Sparse Matrix-Matrix Multiplication on Many-Core Architectures

[...]

Kadir Akbudak¹, Cevdet Aykanat¹•Institutions (1)

Bilkent University¹

01 Aug 2017-IEEE Transactions on Parallel and Distributed Systems

TL;DR: Results show the validity of the models and methods proposed for enhancing the locality in parallel SpGEMM operations on a wide range of sparse matrices from real applications.

Abstract: Exploiting spatial and temporal localities is investigated for efficient row-by-row parallelization of the general sparse matrix-matrix multiplication (SpGEMM) operation of the form $C=A\,B$ on many-core architectures. Hypergraph and bipartite graph models are proposed for 1D rowwise partitioning of matrix $A$ to evenly partition the work across threads with the objective of reducing the number of $B$-matrix words to be transferred from the memory and between different caches. A hypergraph model is proposed for $B$-matrix column reordering to exploit spatial locality in accessing entries of thread-private temporary arrays, which are used to accumulate results for $C$-matrix rows. A similarity graph model is proposed for $B$-matrix row reordering to increase temporal reuse of these accumulation array entries. The proposed models and methods are tested on a wide range of sparse matrices from real applications, and the experiments were carried out on a 60-core Intel Xeon Phi processor, as well as a two-socket Xeon processor. Results show the validity of the models and methods proposed for enhancing the locality in parallel SpGEMM operations.

Proceedings Article•10.1109/ICPP.2017.51•

Towards Highly Efficient DGEMM on the Emerging SW26010 Many-Core Processor

[...]

Lijuan Jiang¹, Chao Yang¹, Yulong Ao¹, Wanwang Yin, Wenjing Ma¹, Qiao Sun¹, Fangfang Liu¹, Rongfen Lin, Peng Zhang¹•Institutions (1)

Chinese Academy of Sciences¹

1 Aug 2017

TL;DR: This paper presents a detailed methodology of implementing and optimizing the double-precision general format matrix-matrix multiplication (DGEMM) kernel on the emerging SW26010 processor, which is used to build the Sunway TaihuLight supercomputer.

Abstract: The matrix-matrix multiplication is an essential building block that can be found in various scientific and engineering applications. High-performance implementations of the matrix-matrix multiplication on state-of-the-art processors may be of great importance for both the vendors and the users. In this paper, we present a detailed methodology of implementing and optimizing the double-precision general format matrix-matrix multiplication (DGEMM) kernel on the emerging SW26010 processor, which is used to build the Sunway TaihuLight supercomputer. We propose a three level blocking algorithm to orchestrate data on the memory hierarchy and expose parallelism on different hardware levels, and design a collective data sharing scheme by using the register communication mechanism to exchange data efficiently among different cores. On top of those, further optimizations are done based on a data-thread mapping method for efficient data distribution, a double buffering scheme for asynchronous DMA data transfer, and an instruction scheduling method for maximizing the pipeline usage. Experiment results show that the proposed DGEMM implementation can fully exploit the unique hardware features provided by SW26010 and can sustain up to 95% of the peak performance.
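Illustrative note: blocking for a memory hierarchy can be illustrated with a single level of tiling in plain Python. The block size and names are assumptions; the paper's three-level scheme, register communication, DMA double buffering, and instruction scheduling are specific to SW26010 and are not modelled here.

```python
def blocked_matmul(A, B, block=2):
    """Compute C = A @ B on lists of lists using square tiles.

    Each (i0, j0, k0) tile triple touches only small sub-blocks of A, B,
    and C, which is what lets a real kernel keep them in fast memory.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                for i in range(i0, min(i0 + block, n)):
                    for kk in range(k0, min(k0 + block, k)):
                        a = A[i][kk]
                        for j in range(j0, min(j0 + block, m)):
                            C[i][j] += a * B[kk][j]
    return C


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0], [0, 1], [1, 1]]
C = blocked_matmul(A, B)
```

The `min(...)` bounds handle matrix dimensions that are not multiples of the block size, as in the 3 by 3 example above.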

Proceedings Article•10.1109/ICCONS.2017.8250742•

High speed vedic multiplier used vedic mathematics

[...]

Dravik KishorBhai Kahar¹, Harsh Mehta¹•Institutions (1)

Parul Institute of Engineering and Technology¹

15 Jun 2017

TL;DR: A Vedic multiplication algorithm is designed using the Vedic mathematics formula Urdhava Tiryakbhyam, which means vertically and crosswise, and achieves less time delay than other algorithms.

Abstract: The multiplier is a main building block of every processor, and a faster multiplier improves the speed of a Digital Signal Processor (DSP), especially in applications where the time delay must be reduced. In the proposed method, we design a Vedic multiplication algorithm using the Vedic mathematics formula Urdhava Tiryakbhyam, which means vertically and crosswise. Vedic mathematics is mainly based on 16 Sutras and was rediscovered in the early 20th century. In ancient India, people used this Sutra effectively for decimal number multiplication. The same basic concept of Vedic mathematics is applied to the multiplication of binary numbers to make it usable in digital hardware systems. The speed of the computation is increased and the processing time is reduced because the combinational path delay is lower than in existing multipliers. Our proposed multiplication algorithm achieves less time delay than other algorithms.[1]
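Illustrative note: the vertically and crosswise pattern applied to binary numbers can be sketched as a column-by-column sum of bit products. The sketch is sequential software with assumed names; in hardware the column sums are formed in parallel, and nothing here reflects the timing results of the paper.

```python
def to_bits(value, n):
    """Little-endian list of the n low bits of value."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def urdhva_multiply(a_bits, b_bits):
    """Multiply two n-bit numbers given as little-endian bit lists.

    Column col gathers every product a[i] & b[col - i]: the outermost
    columns are the "vertical" products and the inner ones are the
    "crosswise" products. Carries ripple from each column to the next.
    """
    n = len(a_bits)
    result, carry = [], 0
    for col in range(2 * n - 1):
        total = carry
        for i in range(max(0, col - n + 1), min(col, n - 1) + 1):
            total += a_bits[i] & b_bits[col - i]
        result.append(total & 1)
        carry = total >> 1
    result.append(carry)  # the final carry always fits in one bit
    return result


product = from_bits(urdhva_multiply(to_bits(13, 4), to_bits(11, 4)))
```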

Journal Article•10.13154/TOSC.V2017.I4.188-211•

Shorter Linear Straight-Line Programs for MDS Matrices

Thorsten Kranz¹, Gregor Leander¹, Ko Stoffelen², Friedrich Wiemer¹•Institutions (2)

Ruhr University Bochum¹, Radboud University Nijmegen²

15 Dec 2017-IACR Cryptology ePrint Archive

TL;DR: In this paper, the AES MixColumns operation is investigated with respect to the number of XOR operations, and new MDS matrices for lightweight symmetric primitives are proposed.

Abstract: Recently a lot of attention has been paid to the search for efficiently implementable MDS matrices for lightweight symmetric primitives. Most previous work concentrated on locally optimizing the multiplication with single matrix elements. Separate from this line of work, several heuristics were developed to find shortest linear straight-line programs. Solving this problem actually corresponds to globally optimizing multiplications by matrices. In this work we combine those two, so far largely independent, lines of work. As a result, we achieve implementations of known, locally optimized, and new MDS matrices that significantly outperform all implementations from the literature. Interestingly, almost all previous locally optimized constructions behave very similarly with respect to the globally optimized implementation. As a side effect, our work reveals the so far best implementation of the AES MixColumns operation with respect to the number of XOR operations needed.
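
The difference between computing each output independently and sharing intermediate XORs in a straight-line program can be seen on a toy example. The 3x4 binary matrix below is a hypothetical illustration, not an MDS matrix and not taken from the paper; the helper names are assumptions as well.

```python
def naive_xor_count(rows):
    """XORs needed when every output row is computed on its own."""
    return sum(bin(r).count("1") - 1 for r in rows)


def run_slp(program, n_inputs):
    """Evaluate a linear straight-line program symbolically.

    Each value is a bitmask recording which inputs are XORed into it
    (bit i set means input x_i takes part).
    """
    values = {f"x{i}": 1 << i for i in range(n_inputs)}
    for dest, left, right in program:
        values[dest] = values[left] ^ values[right]
    return values


# Toy matrix: y0 = x0^x1^x2, y1 = x1^x2^x3, y2 = x0^x1^x2^x3.
ROWS = [0b0111, 0b1110, 0b1111]

# Straight-line program that reuses t0 and y0 instead of recomputing them.
PROGRAM = [
    ("t0", "x1", "x2"),
    ("y0", "x0", "t0"),
    ("y1", "t0", "x3"),
    ("y2", "y0", "x3"),
]
```

Here the row-by-row approach costs `naive_xor_count(ROWS)` XORs, while the program has `len(PROGRAM)` instructions, one XOR each.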

Journal Article•10.1016/J.VLSI.2017.02.003•

An efficient VLSI architecture design for logarithmic multiplication by using the improved operand decomposition

Durgesh Nandan¹, Jitendra Kanungo¹, Anurag Mahajan¹•Institutions (1)

Jaypee University of Engineering and Technology¹

01 Jun 2017-Integration

TL;DR: This paper proposes a novel approach based on the Improved Operand Decomposition (IOD) to make an efficient logarithmic multiplier and subsequent achievement through scale realization at the cost of minimal hardware and delay.

Journal Article•10.1145/3121432•

Tight and Rigorous Error Bounds for Basic Building Blocks of Double-Word Arithmetic

Mioara Joldes¹, Jean-Michel Muller², Valentina Popescu²•Institutions (2)

Centre national de la recherche scientifique¹, University of Lyon²

10 Oct 2017-ACM Transactions on Mathematical Software

TL;DR: In this article, the authors analyze several basic building blocks of double-word arithmetic, including the addition of a double-word number and a floating-point number, the multiplication of two double-word numbers, and the division of two double-word numbers.

Abstract: We analyze several classical basic building blocks of double-word arithmetic (frequently called “double-double arithmetic” in the literature): the addition of a double-word number and a floating-point number, the addition of two double-word numbers, the multiplication of a double-word number by a floating-point number, the multiplication of two double-word numbers, the division of a double-word number by a floating-point number, and the division of two double-word numbers. For multiplication and division we get better relative error bounds than the ones previously published. For addition of two double-word numbers, we show that the previously published bound was incorrect, and we provide a new relative error bound. We introduce new algorithms for division. We also give examples that illustrate the tightness of our bounds.
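
To make the first building block concrete, here is a minimal sketch of adding a floating-point number to a double-word number, built from the well-known error-free transformations 2Sum and Fast2Sum. It assumes IEEE 754 binary64 with round-to-nearest, which is what Python floats normally provide. The function names are illustrative, and the abstract does not confirm that this is exactly the variant analyzed in the paper.

```python
def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and s + e == a + b exactly."""
    s = a + b
    a_virtual = s - b
    b_virtual = s - a_virtual
    return s, (a - a_virtual) + (b - b_virtual)


def fast_two_sum(a, b):
    """Same result as two_sum, but assumes |a| >= |b| (or a == 0)."""
    s = a + b
    return s, b - (s - a)


def dw_plus_fp(xh, xl, y):
    """Add the float y to the double-word number xh + xl."""
    sh, sl = two_sum(xh, y)      # exact sum of the high word and y
    v = xl + sl                  # fold the two small terms together (rounded)
    return fast_two_sum(sh, v)   # renormalize into a (high, low) pair
```

The result is again a pair whose exact sum approximates `xh + xl + y` to roughly twice the working precision.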

Proceedings Article•10.23919/FPL.2017.8056834•

Generic and universal parallel matrix summation with a flexible compression goal for Xilinx FPGAs

Thomas B. Preußer¹•Institutions (1)

Dresden University of Technology¹

1 Sep 2017

TL;DR: In this paper, a generic implementation of a bit matrix compressor for Xilinx FPGAs is described, which does not require a generator tool and is agnostic of the aspect ratio of the input matrix.

Abstract: Bit matrix compression is a highly relevant operation in computer arithmetic. Essentially being a multi-operand addition, it is the key operation behind fast multiplication and many higher-level operations such as multiply-accumulate, the computation of the dot product or the implementation of FIR filters. Compressor implementations have been constantly evolving for greater efficiency both in general and in the context of concrete applications or specific implementation technologies. This paper builds on this history and describes a generic implementation of a bit matrix compressor for Xilinx FPGAs, which does not require a generator tool. It contributes FPGA-oriented metrics for the evaluation of elementary parallel bit counters, a systematic analysis and partial decomposition of previously proposed counters and a fully implemented construction heuristic with a flexible compression target matching the device capabilities. The generic implementation is agnostic of the aspect ratio of the input matrix and can be used for multiplication the same way as it can be for single-column population count operations.
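
What "compressing a bit matrix" means can be sketched in a few lines. The example below uses only full adders (3:2 counters) and processes the columns one after another until at most two bits remain per column, so that a single final addition would finish the job. It is a simplified software model under assumed names; the FPGA-specific counters and the construction heuristic of the paper are not represented.

```python
def full_adder(a, b, c):
    """3:2 counter: three bits of equal weight -> (sum bit, carry bit)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compress(columns):
    """Reduce a bit matrix until no column holds more than two bits.

    columns[i] is the list of bits of weight 2**i.
    """
    cols = [list(c) for c in columns]
    i = 0
    while i < len(cols):
        while len(cols[i]) > 2:
            a, b, c = cols[i].pop(), cols[i].pop(), cols[i].pop()
            s, carry = full_adder(a, b, c)
            cols[i].append(s)              # sum bit keeps its weight
            if i + 1 == len(cols):
                cols.append([])
            cols[i + 1].append(carry)      # carry bit moves one column up
        i += 1
    return cols


def value(columns):
    """Integer represented by a bit matrix."""
    return sum(sum(col) << i for i, col in enumerate(columns))
```

Each full adder leaves the represented value unchanged, so `value(compress(m)) == value(m)` for any input matrix `m`.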

Proceedings Article•10.1145/3093172.3093228•

Increasing the Efficiency of Sparse Matrix-Matrix Multiplication with a 2.5D Algorithm and One-Sided MPI

Alfio Lazzaro¹, Joost VandeVondele, Jürg Hutter¹, Ole Schütt²•Institutions (2)

University of Zurich¹, Swiss Federal Laboratories for Materials Science and Technology²

26 Jun 2017

TL;DR: The DBCSR sparse matrix library is extended, which is the basic building block for linear scaling electronic structure theory and low scaling correlated methods in CP2K, and is specifically designed to efficiently perform block-sparse matrix-matrix multiplication of matrices with a relatively large occupation.

Abstract: Matrix-matrix multiplication is a basic operation in linear algebra and an essential building block for a wide range of algorithms in various scientific fields. Theory and implementation for the dense, square matrix case are well-developed. If matrices are sparse, with application-specific sparsity patterns, the optimal implementation remains an open question. Here, we explore the performance of communication-reducing 2.5D algorithms and one-sided MPI communication in the context of linear scaling electronic structure theory. In particular, we extend the DBCSR sparse matrix library, which is the basic building block for linear scaling electronic structure theory and low scaling correlated methods in CP2K. The library is specifically designed to efficiently perform block-sparse matrix-matrix multiplication of matrices with a relatively large occupation. Here, we compare the performance of the original implementation based on Cannon's algorithm and MPI point-to-point communication, with an implementation based on MPI one-sided communications (RMA), in both a 2D and a 2.5D approach. The 2.5D approach trades memory and auxiliary operations for reduced communication, which can lead to a speedup if communication is dominant. The 2.5D algorithm is somewhat easier to implement with one-sided communications. A detailed description of the implementation is provided, also for non-ideal processor topologies, since this is important for actual applications. Given the importance of the precise sparsity pattern, and even the actual matrix data, which decides the effective fill-in upon multiplication, the tests are performed within the CP2K package with application benchmarks. Results show a substantial boost in performance for the RMA-based 2.5D algorithm, up to 1.80x, which is observed to increase with the number of processes involved in the parallelization.
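
For readers unfamiliar with the 2D baseline mentioned here, the sketch below simulates the data movement of Cannon's algorithm in plain Python. It makes strong simplifying assumptions: a square q x q process grid, one dense scalar per process instead of a sparse block, and list rotations in place of MPI messages. It is not DBCSR code and does not show the 2.5D replication.

```python
def cannon_matmul(a, b):
    """Simulate Cannon's algorithm on a q x q grid, one element per process."""
    q = len(a)
    # Initial skew: row i of A shifts left by i, column j of B shifts up by j.
    a_loc = [[a[i][(j + i) % q] for j in range(q)] for i in range(q)]
    b_loc = [[b[(i + j) % q][j] for j in range(q)] for i in range(q)]
    c = [[0] * q for _ in range(q)]
    for _ in range(q):
        # Every process multiplies the pair of elements it currently holds.
        for i in range(q):
            for j in range(q):
                c[i][j] += a_loc[i][j] * b_loc[i][j]
        # Shift A one step left along rows and B one step up along columns.
        a_loc = [[a_loc[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b_loc = [[b_loc[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return c
```

After q multiply-and-shift rounds, process (i, j) has seen every matching pair from row i of A and column j of B, so it holds the finished entry of C.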

Posted Content•

Energy-Efficient Time-Domain Vector-by-Matrix Multiplier for Neurocomputing and Beyond

Mohammad Bavandpour¹, Mohammad Reza Mahmoodi¹, Dmitri B. Strukov¹•Institutions (1)

University of California, Santa Barbara¹

29 Nov 2017-arXiv: Hardware Architecture

TL;DR: In this article, the authors propose an energy-efficient mixed-signal approach for vector-by-matrix multiplication in the time domain, where multi-bit values of the input and output vector elements are represented with time-encoded digital signals, while multi-bit matrix weights are realized with current sources, e.g. transistors biased in the subthreshold regime.

Abstract: We propose an extremely energy-efficient mixed-signal approach for performing vector-by-matrix multiplication in the time domain. In such an implementation, multi-bit values of the input and output vector elements are represented with time-encoded digital signals, while multi-bit matrix weights are realized with current sources, e.g. transistors biased in the subthreshold regime. With our approach, multipliers can be chained together to implement large-scale circuits completely in the time domain. Multiplier operation does not rely on energy-taxing static currents, which are typical for peripheral and input/output conversion circuits of conventional mixed-signal implementations. As a case study, we have designed a multilayer perceptron, based on two layers of 10x10 four-quadrant vector-by-matrix multipliers, in a 55-nm process with embedded NOR flash memory technology, which allows for compact implementation of adjustable current sources. Our analysis, based on memory cell measurements, shows that at high computing speed the drain-induced barrier lowering is a major factor limiting multiplier precision to ~6 bit. Post-layout estimates for a conservative 6-bit digital input/output NxN multiplier designed in a 55-nm process, including I/O circuitry for converting between digital and time domain representations, show ~7 fJ/Op for N>200, which can be further lowered well below 1 fJ/Op for a more optimal and aggressive design.

Book Chapter•10.1007/978-3-319-72565-9_9•

How to (Pre-)Compute a Ladder

Thomaz Oliveira¹, Julio López², Huseyin Hisil³, Armando Faz-Hernández², Francisco Rodríguez-Henríquez¹•Institutions (3)

CINVESTAV¹, State University of Campinas², Yaşar University³

16 Aug 2017

TL;DR: This paper describes a ladder variant that accelerates the fixed-point multiplication function inherent to the Diffie-Hellman key pair generation phase and is the first proposal of a Montgomery ladder procedure for prime elliptic curves that admits the extensive use of pre-computation.

Abstract: In the RFC 7748 memorandum, the Internet Research Task Force specified a Montgomery-ladder scalar multiplication function based on two recently adopted elliptic curves, “curve25519” and “curve448”. The purpose of this function is to support the Diffie-Hellman key exchange algorithm that will be included in the forthcoming version of the Transport Layer Security cryptographic protocol. In this paper, we describe a ladder variant that accelerates the fixed-point multiplication function inherent to the Diffie-Hellman key pair generation phase. Our proposal combines a right-to-left version of the Montgomery ladder with the pre-computation of constant values directly derived from the base-point and its multiples. To our knowledge, this is the first proposal of a Montgomery ladder procedure for prime elliptic curves that admits the extensive use of pre-computation. In exchange for very modest memory resources and a small extra programming effort, the proposed ladder obtains significant speedups for software implementations. Moreover, our proposal fully complies with the RFC 7748 specification. A software implementation of the X25519 and X448 functions using our pre-computable ladder yields an acceleration factor of roughly 1.20 and 1.25 when implemented on the Haswell and the Skylake micro-architectures, respectively.
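
For orientation, the sketch below follows the conventional left-to-right Montgomery ladder for curve25519 as laid out in the RFC 7748 pseudocode. It is not the right-to-left, pre-computation-based variant proposed in the paper. It works directly on integers, so byte decoding and scalar clamping are left out, and the function name `ladder` is an assumption. Python integers are not constant-time, so this is for illustration only.

```python
P = 2**255 - 19   # field prime of curve25519
A24 = 121665      # curve constant a24 from RFC 7748


def cswap(swap, a, b):
    """Swap a and b when swap == 1, using masking instead of a branch."""
    dummy = (-swap) & (a ^ b)
    return a ^ dummy, b ^ dummy


def ladder(k, u, bits=255):
    """Return the u-coordinate of k times the point with u-coordinate u."""
    x_1, x_2, z_2, x_3, z_3, swap = u % P, 1, 0, u % P, 1, 0
    for t in reversed(range(bits)):
        k_t = (k >> t) & 1
        swap ^= k_t
        x_2, x_3 = cswap(swap, x_2, x_3)
        z_2, z_3 = cswap(swap, z_2, z_3)
        swap = k_t
        a, b = (x_2 + z_2) % P, (x_2 - z_2) % P
        aa, bb = a * a % P, b * b % P
        e = (aa - bb) % P
        c, d = (x_3 + z_3) % P, (x_3 - z_3) % P
        da, cb = d * a % P, c * b % P
        x_3 = (da + cb) ** 2 % P            # differential addition
        z_3 = x_1 * (da - cb) ** 2 % P
        x_2 = aa * bb % P                   # doubling
        z_2 = e * (aa + A24 * e) % P
    x_2, x_3 = cswap(swap, x_2, x_3)
    z_2, z_3 = cswap(swap, z_2, z_3)
    return x_2 * pow(z_2, P - 2, P) % P     # convert back from projective form
```

Two parties with hypothetical secret scalars `a` and `b` arrive at the same shared value because `ladder(a, ladder(b, 9))` and `ladder(b, ladder(a, 9))` agree, where 9 is the u-coordinate of the curve25519 base point.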

Journal Article•10.1109/TPDS.2017.2686384•

Improving Execution Concurrency of Large-Scale Matrix Multiplication on Distributed Data-Parallel Platforms

Rong Gu¹, Yun Tang¹, Chen Tian¹, Hucheng Zhou², Guanru Li², Xudong Zheng², Yihua Huang¹•Institutions (2)

Nanjing University¹, Microsoft²

01 Sep 2017-IEEE Transactions on Parallel and Distributed Systems

TL;DR: This paper presents a novel parallel execution strategy, CRMM (Concurrent Replication-based Matrix Multiplication), along with a parallel algorithm, Marlin, for large-scale matrix multiplication on data-parallel platforms, and proposes a number of novel system-level optimizations.

Abstract: Matrix multiplication is a dominant but very time-consuming operation in many big data analytic applications. Thus its performance optimization is an important and fundamental research issue. The performance of large-scale matrix multiplication on distributed data-parallel platforms is determined by both computation and IO costs. For existing matrix multiplication execution strategies, when the execution concurrency scales up above a threshold, their execution performance deteriorates quickly because the increase of the IO cost outweighs the decrease of the computation cost. This paper presents a novel parallel execution strategy, CRMM (Concurrent Replication-based Matrix Multiplication), along with a parallel algorithm, Marlin, for large-scale matrix multiplication on data-parallel platforms. The CRMM strategy exploits higher execution concurrency for sub-block matrix multiplication with the same IO cost. To further improve the performance of Marlin, we also propose a number of novel system-level optimizations, including increasing the concurrency of local data exchange by calling native library in batch, reducing the overhead of block matrix transformation, and reducing disk heavy shuffle operations by exploiting the semantics of matrix computation. We have implemented Marlin as a library along with a set of related matrix operations on Spark and also contributed Marlin to the open-source community. For large-sized matrix multiplication, Marlin outperforms existing systems including Spark MLlib, SystemML and SciDB, with about $1.29\times$, $3.53\times$ and $2.21\times$ speedup on average, respectively. The evaluation upon a real-world DNN workload also indicates that Marlin outperforms the above systems by about $12.8\times$, $5.1\times$ and $27.2\times$ speedup, respectively.

Journal Article•10.1109/TC.2017.2701365•

Optimization of Constant Matrix Multiplication with Low Power and High Throughput

Martin Kumm¹, Martin Hardieck¹, Peter Zipf¹•Institutions (1)

University of Kassel¹

01 Dec 2017-IEEE Transactions on Computers

TL;DR: A heuristic is proposed which evaluates the most attractive graph topologies for CMM circuits which considers both adder depth and pipelining for the first time and shows that these reductions translate to superior results in terms of delay and power consumption compared to the state-of-the-art.

Abstract: Constant matrix multiplication (CMM), i.e., the multiplication of a constant matrix with a vector, is a common operation in digital signal processing. It is a generalization of multiple constant multiplication (MCM) where a single variable is multiplied by a constant vector. Like MCM, CMM can be reduced to additions/subtractions and bit shifts. Finding a circuit with a minimal number of add/subtract operations is known as the CMM problem. While this leads to a reduction in circuit area, it may be less efficient for power consumption or throughput. It is well studied for the MCM problem that a) reducing the adder depth (AD) leads to a reduced power consumption and b) pipeline resources have to be considered during optimization to enhance throughput without wasting area. This paper addresses the optimization of CMM circuits which considers both adder depth and pipelining for the first time. For that, a heuristic is proposed which evaluates the most attractive graph topologies. It is shown that the proposed method requires 12.5% fewer adders with minimal AD and 38.5% fewer pipelined operations. Synthesis results for recent FPGAs show that these reductions also translate to superior results in terms of delay and power consumption compared to the state-of-the-art.
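
The trade-off between the number of adders and the adder depth can be seen on a small hand-made example. The constant matrix [[3, 5], [7, 5]] and both decompositions below are hypothetical illustrations chosen for this sketch; they are not taken from the paper, and neither is claimed to be optimal.

```python
def cmm_shared(x0, x1):
    """[[3, 5], [7, 5]] times (x0, x1): 4 add/subtract operations, adder depth 3."""
    t0 = (x0 << 1) + x0      # 3*x0                       (depth 1)
    t1 = (x1 << 2) + x1      # 5*x1                       (depth 1)
    y0 = t0 + t1             # 3*x0 + 5*x1                (depth 2)
    y1 = y0 + (x0 << 2)      # 7*x0 + 5*x1, reuses y0     (depth 3)
    return y0, y1


def cmm_shallow(x0, x1):
    """Same product: 5 add/subtract operations, adder depth 2."""
    t0 = (x0 << 1) + x0      # 3*x0                       (depth 1)
    t1 = (x1 << 2) + x1      # 5*x1                       (depth 1)
    t2 = (x0 << 3) - x0      # 7*x0                       (depth 1)
    return t0 + t1, t2 + t1  # both outputs at depth 2
```

Reusing `y0` saves one operation but lengthens the longest chain of adders, which is the kind of tension between area and adder depth that the abstract describes.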

Journal Article•10.1016/J.NEUCOM.2017.02.009•

A novel parallel multiplier using spiking neural P systems with dendritic delays

Carlos Díaz¹, Thania Frias¹, Giovanny Sanchez¹, Hector Perez¹, Karina Toscano¹, Gonzalo Duchen¹•Institutions (1)

Instituto Politécnico Nacional¹

24 May 2017-Neurocomputing

TL;DR: This paper develops a novel parallel multiplier that is based on spiking neural P systems and is capable of multiplying two natural numbers with many digits in parallel and employs the divide and conquer strategy to optimize the processing time of the arithmetic operations.

Journal Article•10.1049/IET-CDS.2015.0337•

Efficient and low-complexity hardware architecture of Gaussian normal basis multiplication over GF($2^m$) for elliptic curve cryptosystems

Bahram Rashidi, Sayed Masoud Sayedi, Reza Rezaeian Farashahi

01 Mar 2017-IET Circuits, Devices & Systems

TL;DR: Comparison results verify that the proposed architecture of Gaussian normal basis (GNB) multiplier over binary finite field GF($2^m$) has better performance in terms of speed and hardware utilisation.

Abstract: In this paper, an efficient high-speed architecture of a Gaussian normal basis (GNB) multiplier over the binary finite field GF($2^m$) is presented. The structure is constructed by using some regular modules for computation of exponentiation by powers of 2 and low-cost blocks for multiplication by normal elements of the binary field. For the powers of 2 exponents, the modules are implemented by some simple cyclic shifts in the normal basis representation. As a result, the multiplier has a simple structure with a low critical path delay. The efficiency of the proposed multiplier is examined in terms of area and time complexity based on its implementation on the Virtex-4 field programmable gate array family and also its application specific integrated circuit design in 180 nm complementary metal-oxide-semiconductor technology. Comparison results with other structures of the GNB multiplier verify that the proposed architecture has better performance in terms of speed and hardware utilisation.
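
The remark that powers of 2 reduce to cyclic shifts in a normal basis can be checked on a very small field. The sketch below uses GF($2^3$) with the modulus $x^3 + x^2 + 1$, whose root generates a normal basis. The field size, the modulus, and all names are assumptions made for illustration and have no connection to the parameters used in the paper.

```python
MODULUS = 0b1101  # x^3 + x^2 + 1, irreducible over GF(2)
M = 3


def gf_mul(a, b):
    """Multiply two elements of GF(2^3) given in polynomial-basis form."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):       # reduce as soon as the degree reaches M
            a ^= MODULUS
    return result


BETA = 0b010  # the root x of the modulus
NORMAL_BASIS = [BETA]  # becomes (beta, beta^2, beta^4)
for _ in range(M - 1):
    NORMAL_BASIS.append(gf_mul(NORMAL_BASIS[-1], NORMAL_BASIS[-1]))


def from_normal(coords):
    """Element with coordinates (c0, c1, c2) over (beta, beta^2, beta^4)."""
    element = 0
    for c, basis_element in zip(coords, NORMAL_BASIS):
        if c:
            element ^= basis_element
    return element


def square_by_shift(coords):
    """Squaring in a normal basis: rotate the coordinates by one position."""
    return coords[-1:] + coords[:-1]
```

Squaring maps each basis element to the next one and the last back to the first, so the coordinate vector only rotates and no field arithmetic is needed.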
