Journal Article, DOI: 10.1142/S0129626411000266
High Precision Integer Multiplication with a GPU Using Strassen's Algorithm with Multiple FFT Sizes
Niall Emmart, Charles C. Weems
54
TL;DR: This paper discusses how the Strassen algorithm was adapted to operate within the limitations of the GPU and how it dealt with other issues encountered in the implementation process, including details of the memory layout of the authors' FFTs.
Abstract: We have improved our prior implementation of Strassen's algorithm for high performance multiplication of very large integers on a general purpose graphics processor (GPU). A combination of algorithmic and implementation optimizations results in a speedup of up to a factor of 13.9 over our previous work, running on an NVIDIA 295. We have also reoptimized the implementation for an NVIDIA 480, from which we obtain a speedup of up to a factor of 19 in comparison with a single Core i7 processor core of the same technology generation. To provide a fairer chip-to-chip comparison, we also determined total GPU throughput on a set of multiplications relative to all of the cores on a multicore chip running in parallel. We find that the GTX 480 provides a factor of six higher throughput than all four cores/eight threads of the Core i7. This paper discusses how we adapted the algorithm to operate within the limitations of the GPU and how we dealt with other issues encountered in the implementation process, including details of the memory layout of our FFTs. Our earlier work applied Strassen's algorithm to fixed-size segments of the operands and used Karatsuba's algorithm on top of it to handle different operand sizes. We are now able to apply Strassen's algorithm directly to operands ranging in size from 255K bits to 16,320K bits.
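The FFT-based multiplication that the abstract refers to can be illustrated with a minimal CPU sketch. This is not the authors' GPU implementation: it is a plain Python number-theoretic transform (an FFT over a prime field), and the modulus `P = 998244353`, the generator `G = 3`, the 8-bit digit size, and the function names `ntt` and `multiply` are all assumptions chosen for illustration. The idea is to split each operand into small digits, transform both digit vectors, multiply pointwise, transform back, and propagate carries.

```python
# Minimal sketch of FFT-style integer multiplication (assumed parameters,
# not the paper's): an NTT modulo the prime P = 119 * 2**23 + 1.
P = 998244353  # assumed NTT-friendly prime
G = 3          # a primitive root modulo P


def ntt(values, invert=False):
    """Iterative radix-2 transform of a power-of-two-length list modulo P."""
    a = list(values)
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes of doubling length.
    length = 2
    while length <= n:
        w_len = pow(G, (P - 1) // length, P)
        if invert:
            w_len = pow(w_len, P - 2, P)  # modular inverse of the root
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % P
                a[k] = (u + v) % P
                a[k + half] = (u - v) % P
                w = w * w_len % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        a = [x * n_inv % P for x in a]
    return a


def multiply(x, y, digit_bits=8):
    """Multiply two non-negative integers via a cyclic convolution of digits."""
    if x == 0 or y == 0:
        return 0
    mask = (1 << digit_bits) - 1

    def digits(v):
        out = []
        while v:
            out.append(v & mask)
            v >>= digit_bits
        return out

    a, b = digits(x), digits(y)
    # Every convolution coefficient must stay below P to be recovered exactly.
    if min(len(a), len(b)) * mask * mask >= P:
        raise ValueError("operands too large for this modulus and digit size")
    n = 1
    while n < len(a) + len(b):
        n <<= 1
    fa = ntt(a + [0] * (n - len(a)))
    fb = ntt(b + [0] * (n - len(b)))
    coeffs = ntt([p * q % P for p, q in zip(fa, fb)], invert=True)
    # Summing shifted coefficients performs the carry propagation.
    result = 0
    for i, c in enumerate(coeffs):
        result += c << (i * digit_bits)
    return result
```

With these assumed parameters the sketch is exact only while the shorter operand has fewer than about 15,000 eight-bit digits. Handling operands in the multi-megabit range the abstract describes would need a larger modulus or a different digit size, in addition to the GPU-specific memory layout the paper discusses.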
Citations
Accelerating fully homomorphic encryption using GPU
Wei Wang, Yin Hu, Lianmu Chen, Xinming Huang, Berk Sunar
- 01 Sep 2012
TL;DR: The GH-FHE primitives for the small setting with a dimension of 2048 on NVIDIA C2050 GPU are implemented and the experimental results show the speedup factors of 7.68, 7.4 and 6.59 for encryption, decryption and recrypt respectively, when compared with the existing CPU implementation.
149
Exploring the Feasibility of Fully Homomorphic Encryption
TL;DR: Two optimizations coupled with a novel precomputation technique are introduced, drastically reducing the computation latency for all FHE primitives, and the GH FHE scheme is implemented on two GPUs to further speed up the operations.
142
cuHE: A Homomorphic Encryption Accelerator Library
Wei Dai, Berk Sunar
- 03 Sep 2015
TL;DR: In this article, a CUDA GPU library is proposed to accelerate evaluations with homomorphic schemes defined over polynomial rings enabled with a number of optimizations including algebraic techniques for efficient evaluation, memory minimization techniques, memory and thread scheduling and low level CUDA hand-tuned assembly optimizations.
Accelerating Fully Homomorphic Encryption in Hardware
TL;DR: This contribution presents the first full realization of FHE in hardware based on the Gentry-Halevi fully homomorphic encryption scheme, using an optimized multi-million bit multiplier based on the Schönhage-Strassen multiplication algorithm.
126
•Posted Content
cuHE: A Homomorphic Encryption Accelerator Library.
Wei Dai, Berk Sunar
TL;DR: In this paper, a CUDA GPU library is proposed to accelerate evaluations with homomorphic schemes defined over polynomial rings enabled with a number of optimizations including algebraic techniques for efficient evaluation, memory minimization techniques, memory and thread scheduling and low level CUDA hand-tuned assembly optimizations.
References
An algorithm for the machine calculation of complex Fourier series
J.W. Cooley, John W. Tukey
TL;DR: Good generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series, applicable to certain problems in which one must multiply an N-vector by an N X N matrix which can be factored into m sparse matrices.
Modular multiplication without trial division
TL;DR: A method for multiplying two integers modulo N while avoiding division by N, a representation of residue classes so as to speed modular multiplication without affecting the modular addition and subtraction algorithms.
Schnelle Multiplikation großer Zahlen
Arnold Schönhage, Volker Strassen
TL;DR: Two ways of implementing the algorithm are considered: multitape Turing machines and logical nets (with step=binary logical element.)
1K
•Journal Article
Multiplication of Multidigit Numbers on Automata
Anatoly A. Karatsuba, Yu. Ofman
996
FFTs in external or hierarchical memory
TL;DR: Advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory that require as few as two passes through the external data set, employ strictly unit stride, long vector transfers between main memory and external storage, and are well suited for vector and parallel computation are described.
266
Related Papers (5)
Arnold Schönhage, Volker Strassen
Craig Gentry, Shai Halevi
- 15 May 2011
Dan Boneh, Craig Gentry
- 01 Jan 2009
Craig Gentry
- 31 May 2009