High Precision Integer Multiplication with a GPU Using Strassen's Algorithm with Multiple FFT Sizes

doi:10.1142/S0129626411000266

Journal Article

Niall Emmart, +1 more

- 21 Nov 2011

- Parallel Processing Letters

- Vol. 21, Iss: 03, pp 359-375

54

TL;DR: This paper discusses how the Strassen algorithm was adapted to operate within the limitations of the GPU and how it dealt with other issues encountered in the implementation process, including details of the memory layout of the authors' FFTs.

Abstract: We have improved our prior implementation of Strassen's algorithm for high-performance multiplication of very large integers on a general-purpose graphics processor (GPU). A combination of algorithmic and implementation optimizations results in a factor of up to 13.9 speed improvement over our previous work, running on an NVIDIA 295. We have also reoptimized the implementation for an NVIDIA 480, from which we obtain a factor of up to 19 speedup in comparison with a Core i7 processor core of the same technology generation. To provide a fairer chip-to-chip comparison, we also determined total GPU throughput on a set of multiplications relative to all of the cores on a multicore chip running in parallel. We find that the GTX 480 provides a factor of six higher throughput than all four cores/eight threads of the Core i7. This paper discusses how we adapted the algorithm to operate within the limitations of the GPU and how we dealt with other issues encountered in the implementation process, including details of the memory layout of our FFTs. Our earlier work applied Strassen's algorithm to fixed-size segments of the operands and used Karatsuba's algorithm on top of it to handle different operand sizes. We are now able to apply Strassen's algorithm directly to operands ranging in size from 255K bits to 16,320K bits.
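The general idea behind FFT-based integer multiplication, which the abstract refers to as Strassen's algorithm, can be illustrated with a minimal CPU sketch: split each operand into small limbs, transform both limb sequences, multiply pointwise, transform back, and propagate carries. The modulus 998244353, the primitive root 3, the 8-bit limb size, and the names `ntt`, `to_limbs`, and `fft_multiply` are all illustrative assumptions and are not taken from the paper; this is not the authors' GPU implementation and it ignores the memory-layout concerns the paper discusses.

```python
# Illustrative sketch only: FFT-style integer multiplication using a
# number-theoretic transform (NTT). All constants below are assumptions
# chosen for the example, not parameters from the paper.
MOD = 998244353   # NTT-friendly prime: 119 * 2**23 + 1
ROOT = 3          # a primitive root modulo MOD
LIMB_BITS = 8     # small limbs keep convolution sums below MOD
LIMB_MASK = (1 << LIMB_BITS) - 1


def ntt(a, invert=False):
    """In-place iterative radix-2 NTT; len(a) must be a power of two."""
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes of doubling length.
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)  # modular inverse of the root
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD


def to_limbs(x):
    """Split a non-negative integer into little-endian 8-bit limbs."""
    limbs = []
    while x:
        limbs.append(x & LIMB_MASK)
        x >>= LIMB_BITS
    return limbs or [0]


def fft_multiply(x, y):
    """Multiply two non-negative integers via NTT-based convolution."""
    a, b = to_limbs(x), to_limbs(y)
    n = 1
    while n < len(a) + len(b):
        n <<= 1
    # The transform length must divide 2**23, and every convolution
    # coefficient must stay below MOD so it is recovered exactly.
    assert n <= 1 << 23
    assert min(len(a), len(b)) * LIMB_MASK * LIMB_MASK < MOD
    a += [0] * (n - len(a))
    b += [0] * (n - len(b))
    ntt(a)
    ntt(b)
    c = [p * q % MOD for p, q in zip(a, b)]
    ntt(c, invert=True)
    # Recombine coefficients; Python's big integers absorb the carries.
    result = 0
    for i, coeff in enumerate(c):
        result += coeff << (LIMB_BITS * i)
    return result
```

With these illustrative parameters the shorter operand is limited to roughly 15,000 limbs (about 120K bits), which is why practical implementations use larger moduli or multiple transforms to reach operand sizes such as those reported in the abstract.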

Citations

Proceedings Article•10.1109/HPEC.2012.6408660

Accelerating fully homomorphic encryption using GPU

Wei Wang, +4 more

- 01 Sep 2012

TL;DR: The GH-FHE primitives for the small setting with a dimension of 2048 are implemented on an NVIDIA C2050 GPU, and the experimental results show speedup factors of 7.68, 7.4, and 6.59 for encryption, decryption, and recrypt, respectively, compared with the existing CPU implementation.

149

Journal Article•10.1109/TC.2013.154

Exploring the Feasibility of Fully Homomorphic Encryption

Wei Wang, +4 more

- 01 Mar 2015

- IEEE Transactions on Computers

TL;DR: Two optimizations coupled with a novel precomputation technique are introduced, drastically reducing the computation latency for all FHE primitives, and the GH FHE scheme is implemented on two GPUs to further speed up the operations.

142

Book Chapter•10.1007/978-3-319-29172-7_11

cuHE: A Homomorphic Encryption Accelerator Library

Wei Dai, +1 more

- 03 Sep 2015

TL;DR: In this article, a CUDA GPU library is proposed to accelerate evaluations with homomorphic schemes defined over polynomial rings enabled with a number of optimizations including algebraic techniques for efficient evaluation, memory minimization techniques, memory and thread scheduling and low level CUDA hand-tuned assembly optimizations.

133

Journal Article•10.1109/TC.2014.2345388

Accelerating Fully Homomorphic Encryption in Hardware

Yarkin Doröz, +2 more

- 01 Jun 2015

- IEEE Transactions on Computers

TL;DR: This contribution presents the first full realization of FHE in hardware based on the Gentry-Halevi fully homomorphic encryption scheme using an optimized multi-million bit multiplier based on the Schönhage-Strassen multiplication algorithm.

126

Posted Content

cuHE: A Homomorphic Encryption Accelerator Library.

Wei Dai, +1 more

- 01 Jan 2015

- IACR Cryptology ePrint Archive

TL;DR: In this paper, a CUDA GPU library is proposed to accelerate evaluations with homomorphic schemes defined over polynomial rings enabled with a number of optimizations including algebraic techniques for efficient evaluation, memory minimization techniques, memory and thread scheduling and low level CUDA hand-tuned assembly optimizations.

85

References

Journal Article•10.1090/S0025-5718-1965-0178586-1

An algorithm for the machine calculation of complex Fourier series

J.W. Cooley, +1 more

- 01 Apr 1965

- Mathematics of Computation

TL;DR: Good generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series, applicable to certain problems in which one must multiply an N-vector by an N × N matrix which can be factored into m sparse matrices.

13.1K

Journal Article•10.1090/S0025-5718-1985-0777282-X

Modular multiplication without trial division

Peter L. Montgomery

- 01 Apr 1985

- Mathematics of Computation

TL;DR: A method for multiplying two integers modulo N while avoiding division by N, using a representation of residue classes that speeds up modular multiplication without affecting the modular addition and subtraction algorithms.

2.8K

Journal Article•10.1007/BF02242355

Schnelle Multiplikation großer Zahlen

Arnold Schönhage, +1 more

- 01 Sep 1971

- Computing

TL;DR: Two ways of implementing the algorithm are considered: multitape Turing machines and logical nets (with step = binary logical element).

1K

Journal Article

Multiplication of Multidigit Numbers on Automata

Anatoly A. Karatsuba, +1 more

- 01 Jan 1963

- Soviet physics. Doklady

996

Journal Article•10.1007/BF00162341

FFTs in external or hierarchical memory

David H. Bailey

- 01 Mar 1990

- The Journal of Supercomputing

TL;DR: Advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory that require as few as two passes through the external data set, employ strictly unit stride, long vector transfers between main memory and external storage, and are well suited for vector and parallel computation are described.

266