TL;DR: A new implementation of the real-valued split-radix FFT is presented, an algorithm that uses fewer operations than any other real-valued power-of-2-length FFT.
Abstract: This tutorial paper describes the methods for constructing fast algorithms for the computation of the discrete Fourier transform (DFT) of a real-valued series. The application of these ideas to all the major fast Fourier transform (FFT) algorithms is discussed, and the various algorithms are compared. We present a new implementation of the real-valued split-radix FFT, an algorithm that uses fewer operations than any other real-valued power-of-2-length FFT. We also compare the performance of inherently real-valued transform algorithms such as the fast Hartley transform (FHT) and the fast cosine transform (FCT) to real-valued FFT algorithms for the computation of power spectra and cyclic convolutions. Comparisons of these techniques reveal that the alternative techniques always require more additions than a method based on a real-valued FFT algorithm and result in computer code of equal or greater length and complexity.
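One standard way to exploit real-valued input, sketched here for illustration, is to pack a length-N real series into a length-N/2 complex series, run one complex FFT, and then separate the result. This is not the split-radix implementation the paper presents; it is the textbook packing trick, and the function names `complex_fft` and `real_fft` are hypothetical. The sketch assumes N is a power of 2 with N >= 2.

```python
import cmath

def complex_fft(a):
    """Plain recursive radix-2 FFT; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even = complex_fft(a[0::2])
    odd = complex_fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def real_fft(x):
    """DFT bins 0..N/2 of a real series x, using one FFT of length N/2."""
    n = len(x)
    h = n // 2
    # Pack even-indexed samples into the real part, odd-indexed into the imaginary part.
    z = [complex(x[2 * i], x[2 * i + 1]) for i in range(h)]
    Z = complex_fft(z)
    X = []
    for k in range(h):
        zc = Z[(h - k) % h].conjugate()
        e = (Z[k] + zc) / 2        # DFT of the even-indexed samples
        o = (Z[k] - zc) / 2j       # DFT of the odd-indexed samples
        X.append(e + cmath.exp(-2j * cmath.pi * k / n) * o)
    # Nyquist bin: the twiddle factor at k = N/2 equals -1.
    X.append(complex(Z[0].real - Z[0].imag))
    return X
```

Only bins 0 through N/2 are returned, because the remaining bins of a real series are the complex conjugates of these. A power spectrum follows by taking `abs(v) ** 2` of each returned value.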
TL;DR: A new N = 2^n fast Fourier transform algorithm is presented, which has fewer multiplications and additions than radix 2^n, n = 1, 2, 3 algorithms, and has the same number of multiplications as the Rader-Brenner algorithm but far fewer additions.
Abstract: A new N = 2^n fast Fourier transform algorithm is presented, which has fewer multiplications and additions than radix 2^n, n = 1, 2, 3 algorithms, has the same number of multiplications as the Rader-Brenner algorithm but far fewer additions, is numerically better conditioned, and is performed ‘in place’ by a repetitive use of a ‘butterfly’-type structure.
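Assuming the algorithm in this entry is of the split-radix type named in the first entry, the decomposition behind it can be sketched as follows: the even-indexed samples go through one FFT of length N/2, while the samples at indices 4m+1 and 4m+3 go through two FFTs of length N/4. The recursive, out-of-place form below is chosen for readability only. It does not reproduce the in-place butterfly structure or the operation counts claimed in the abstract, and the name `split_radix_fft` is hypothetical.

```python
import cmath

def split_radix_fft(x):
    """Recursive split-radix DFT sketch; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n == 2:
        return [complex(x[0] + x[1]), complex(x[0] - x[1])]
    u = split_radix_fft(x[0::2])    # even-indexed samples, length N/2
    z = split_radix_fft(x[1::4])    # samples at indices 4m+1, length N/4
    zp = split_radix_fft(x[3::4])   # samples at indices 4m+3, length N/4
    out = [0j] * n
    q = n // 4
    for k in range(q):
        w1 = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor w^k
        w3 = cmath.exp(-6j * cmath.pi * k / n)   # twiddle factor w^(3k)
        s = w1 * z[k] + w3 * zp[k]
        d = w1 * z[k] - w3 * zp[k]
        out[k] = u[k] + s
        out[k + 2 * q] = u[k] - s
        out[k + q] = u[k + q] - 1j * d
        out[k + 3 * q] = u[k + q] + 1j * d
    return out
```

Each pass of the loop produces four outputs from two twiddle multiplications, which is the source of the savings over a pure radix-2 or radix-4 split.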
TL;DR: This paper presents three optimizations for the Number Theoretic Transform (NTT) used for polynomial multiplication and proposes an optimization of the ring-LWE encryption system that reduces the number of NTT operations from five to four, resulting in a 20% speed-up.
Abstract: In this paper we propose an efficient and compact processor for a ring-LWE based encryption scheme. We present three optimizations for the Number Theoretic Transform (NTT) used for polynomial multiplication: we avoid pre-processing in the negative wrapped convolution by merging it with the main algorithm, we reduce the fixed computation cost of the twiddle factors, and we propose an advanced memory access scheme. These optimization techniques reduce both the cycle and memory requirements. Finally, we also propose an optimization of the ring-LWE encryption system that reduces the number of NTT operations from five to four, resulting in a 20% speed-up. We use these computational optimizations along with several architectural optimizations to design an instruction-set ring-LWE cryptoprocessor. For dimension 256, our processor performs encryption/decryption operations in 20/9 μs on a Virtex 6 FPGA and only requires 1349 LUTs, 860 FFs, 1 DSP-MULT and 2 BRAMs. Similarly, for dimension 512, the processor takes 48/21 μs for performing encryption/decryption operations and only requires 1536 LUTs, 953 FFs, 1 DSP-MULT and 3 BRAMs. Our processors are therefore more than three times smaller than the current state-of-the-art hardware implementations, whilst running somewhat faster.
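For orientation, the negative wrapped convolution mentioned in the abstract can be sketched in its textbook, unoptimized form: scale the coefficients by powers of psi (a primitive 2n-th root of unity modulo a prime q), apply a length-n NTT, multiply pointwise, invert, and undo the scaling. The result is the polynomial product modulo x^n + 1. The separate scaling passes shown here are the pre-processing that the paper reports merging into the transform; that merged variant is not reproduced. The function names and the toy parameters used in the usage note are assumptions for illustration, not values from the paper.

```python
def ntt(a, omega, q):
    """Recursive radix-2 NTT of a modulo prime q.

    omega must be a primitive len(a)-th root of unity mod q,
    and len(a) must be a power of 2.
    """
    n = len(a)
    if n == 1:
        return [a[0] % q]
    omega_sq = omega * omega % q
    even = ntt(a[0::2], omega_sq, q)
    odd = ntt(a[1::2], omega_sq, q)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % q
        out[k] = (even[k] + t) % q
        out[k + n // 2] = (even[k] - t) % q
        w = w * omega % q
    return out

def negacyclic_multiply(a, b, psi, q):
    """Product of polynomials a and b modulo (x^n + 1, q).

    psi is a primitive 2n-th root of unity mod the prime q.
    """
    n = len(a)
    omega = psi * psi % q
    # Modular inverses via Fermat's little theorem (q is prime).
    psi_inv = pow(psi, q - 2, q)
    omega_inv = pow(omega, q - 2, q)
    n_inv = pow(n, q - 2, q)
    # Pre-processing: scale coefficient i by psi^i.
    a_s = [x * pow(psi, i, q) % q for i, x in enumerate(a)]
    b_s = [x * pow(psi, i, q) % q for i, x in enumerate(b)]
    A = ntt(a_s, omega, q)
    B = ntt(b_s, omega, q)
    C = [x * y % q for x, y in zip(A, B)]
    c_s = ntt(C, omega_inv, q)
    # Post-processing: divide by n and by psi^i.
    return [x * n_inv % q * pow(psi_inv, i, q) % q for i, x in enumerate(c_s)]
```

A toy call such as `negacyclic_multiply([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1], 3, 17)` works because 3 has order 16 modulo 17, so it is a primitive 2n-th root of unity for n = 8. Practical ring-LWE parameters are far larger.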
TL;DR: This algorithm belongs to that class of recently proposed 2^n-FFT's which present the same arithmetic complexity (the lowest among those previously published) and can easily be applied to real and real-symmetric data with reduced arithmetic complexity by removing all redundancy in the algorithm.
Abstract: A new algorithm is presented for the fast computation of the discrete Fourier transform. This algorithm belongs to that class of recently proposed 2^n-FFT's which present the same arithmetic complexity (the lowest among those previously published). Moreover, this algorithm has the advantage of being performed "in-place," by repetitive use of a "butterfly"-type structure, without any data reordering inside the algorithm. Furthermore, it can easily be applied to real and real-symmetric data with reduced arithmetic complexity by removing all redundancy in the algorithm.
TL;DR: The implementation of the FFT on vector computers is described, and in the final section it is demonstrated how savings can be achieved in the case of two-dimensional transforms.