TL;DR: A method to construct a homomorphic encryption scheme for approximate arithmetic that supports an approximate addition and multiplication of encrypted messages, together with a new rescaling procedure for managing the magnitude of plaintext.
Abstract: We suggest a method to construct a homomorphic encryption scheme for approximate arithmetic. It supports an approximate addition and multiplication of encrypted messages, together with a new rescaling procedure for managing the magnitude of plaintext. This procedure truncates a ciphertext into a smaller modulus, which leads to rounding of plaintext. The main idea is to add a noise following significant figures which contain a main message. This noise is originally added to the plaintext for security, but considered to be a part of error occurring during approximate computations that is reduced along with plaintext by rescaling. As a result, our decryption structure outputs an approximate value of plaintext with a predetermined precision.
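The effect of rescaling on the plaintext can be illustrated on ordinary fixed-point integers, with no encryption or noise involved. The sketch below is only an illustration under that simplification: the scale factor `SCALE` and the function names are hypothetical and do not come from the paper.

```python
SCALE = 2**20  # hypothetical scaling factor (an assumption for illustration)

def encode(x: float) -> int:
    """Represent a real number as an integer scaled by SCALE."""
    return round(x * SCALE)

def decode(m: int) -> float:
    """Recover the approximate real value from a scaled integer."""
    return m / SCALE

def mul_and_rescale(a: int, b: int) -> int:
    """Multiply two scaled integers, then rescale back to a single SCALE."""
    prod = a * b                         # the product carries scale SCALE**2
    return (prod + SCALE // 2) // SCALE  # divide by SCALE with rounding

x, y = 3.14159, 2.71828
approx = decode(mul_and_rescale(encode(x), encode(y)))
print(approx, x * y)  # the two values agree up to a small rounding error
```

Without the rescaling step the scale would square with every multiplication, so the integers would grow exponentially with the multiplicative depth.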
TL;DR: In this article, the authors give precise quantum resource estimates for Shor's algorithm to compute discrete logarithms on elliptic curves over prime fields, and find that, at comparable classical security levels, attacking ECC requires fewer qubits than attacking RSA.
Abstract: We give precise quantum resource estimates for Shor's algorithm to compute discrete logarithms on elliptic curves over prime fields. The estimates are derived from a simulation of a Toffoli gate network for controlled elliptic curve point addition, implemented within the framework of the quantum computing software tool suite LIQ$Ui|\rangle$. We determine circuit implementations for reversible modular arithmetic, including modular addition, multiplication and inversion, as well as reversible elliptic curve point addition. We conclude that elliptic curve discrete logarithms on an elliptic curve defined over an $n$-bit prime field can be computed on a quantum computer with at most $9n + 2\lceil\log_2(n)\rceil+10$ qubits using a quantum circuit of at most $448 n^3 \log_2(n) + 4090 n^3$ Toffoli gates. We are able to classically simulate the Toffoli networks corresponding to the controlled elliptic curve point addition as the core piece of Shor's algorithm for the NIST standard curves P-192, P-224, P-256, P-384 and P-521. Our approach allows gate-level comparisons to recent resource estimates for Shor's factoring algorithm. The results also support estimates given earlier by Proos and Zalka and indicate that, for current parameters at comparable classical security levels, the number of qubits required to tackle elliptic curves is less than for attacking RSA, suggesting that indeed ECC is an easier target than RSA.
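The two upper bounds stated in the abstract can be evaluated directly for the NIST curve sizes. This is a minimal sketch that only plugs numbers into those formulas; the function name is an illustrative choice.

```python
import math

def shor_ecdlp_bounds(n):
    """Upper bounds from the abstract for an n-bit prime field.

    Returns (qubits, toffoli_gates):
      qubits  <= 9n + 2*ceil(log2(n)) + 10
      toffoli <= 448 n^3 log2(n) + 4090 n^3
    """
    qubits = 9 * n + 2 * math.ceil(math.log2(n)) + 10
    toffoli = 448 * n**3 * math.log2(n) + 4090 * n**3
    return qubits, toffoli

for n in (192, 224, 256, 384, 521):
    q, t = shor_ecdlp_bounds(n)
    print(f"n = {n:3d}: at most {q} qubits, about {t:.3g} Toffoli gates")
```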
TL;DR: This article provides a review of the techniques for implementing the SpMV kernel on GPGPUs that have appeared in the literature of the last few years, and discusses the issues and tradeoffs that have been encountered by the various researchers.
Abstract: The multiplication of a sparse matrix by a dense vector (SpMV) is a centerpiece of scientific computing applications: it is the essential kernel for the solution of sparse linear systems and sparse eigenvalue problems by iterative methods. The efficient implementation of the sparse matrix-vector multiplication is therefore crucial and has been the subject of an immense amount of research, with interest renewed with every major new trend in high-performance computing architectures. The introduction of General-Purpose Graphics Processing Units (GPGPUs) is no exception, and many articles have been devoted to this problem. With this article, we provide a review of the techniques for implementing the SpMV kernel on GPGPUs that have appeared in the literature of the last few years. We discuss the issues and tradeoffs that have been encountered by the various researchers, and provide a list of solutions, organized in categories according to common features. We also provide a performance comparison across different GPGPU models and on a set of test matrices coming from various application domains.
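For reference, the SpMV kernel itself is short. The sketch below is a plain sequential version over the common CSR (compressed sparse row) format, not any of the GPGPU variants the article surveys; the array names `row_ptr`, `col_idx` and `values` are conventional but chosen here for illustration.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Non-zeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect access into x
        y[i] = acc
    return y

# The 3x3 matrix [[4, 0, 1], [0, 2, 0], [3, 0, 5]] in CSR form.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
print(y)
```

The indirect read `x[col_idx[k]]` is the irregular memory access that makes efficient implementations on parallel hardware non-trivial.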
TL;DR: This paper proposes a novel approximate floating point multiplier, called CFPU, which significantly reduces energy and improves performance of multiplication at the expense of accuracy, and shows that it can outperform a standard FPU when at least 4% of multiplications are performed in approximate mode.
Abstract: Many applications, such as machine learning and data sensing, are statistical in nature and can tolerate some level of inaccuracy in their computation. Approximate computation is a viable method to save energy and increase performance by trading accuracy for energy. There are a number of proposed approximate solutions; however, they are limited to a small range of applications because they cannot control the error rate of their output. In this paper, we propose a novel approximate floating point multiplier, called CFPU, which significantly reduces energy and improves performance of multiplication at the expense of accuracy. Our design approximately models multiplication by replacing the most costly step of the operation with a lower energy alternative. In order to tune the level of approximation, CFPU dynamically identifies the inputs which will produce the largest approximation error and processes them in precise CFPU mode. We show that our CFPU can outperform a standard FPU when at least 4% of multiplications are performed in approximate mode. In our tested applications this percentage of multiplications is substantially higher, leading to significant energy savings. Our experimental evaluation on AMD Southern Island GPU shows that replacing traditional FPUs with the proposed CFPU results in 77% energy savings and 3.5× energy-delay product improvement over eight general OpenCL applications while providing acceptable quality of service. In addition, for the same level of accuracy, the CFPU provides 2.4× energy-delay product improvement compared to state-of-the-art approximate multipliers.
TL;DR: This work gives computationally efficient zero-knowledge proofs of knowledge for arithmetic circuit satisfiability over a large field in an ideal linear commitment model where the prover may commit to secret vectors of field elements, and the verifier can receive certified linear combinations of those vectors.
Abstract: We give computationally efficient zero-knowledge proofs of knowledge for arithmetic circuit satisfiability over a large field. For a circuit with N addition and multiplication gates, the prover only uses \(\mathcal {O}(N)\) multiplications and the verifier only uses \(\mathcal {O}(N)\) additions in the field. If the commitments we use are statistically binding, our zero-knowledge proofs have unconditional soundness, while if the commitments are statistically hiding we get computational soundness. Our zero-knowledge proofs also have sub-linear communication if the commitment scheme is compact.
TL;DR: An efficient and flexible dual-field ECC processor which can support arbitrary elliptic curve standards and algorithms using the hardware–software approach is presented.
Abstract: Elliptic curve cryptography (ECC) has been widely used for digital signatures to ensure security in communication. It is important for an ECC processor to support a variety of ECC standards to be compatible with different security applications. Thus, a flexible processor which can support different standards and algorithms is desired. In this paper, an efficient and flexible dual-field ECC processor using the hardware–software approach is presented. The proposed processor can support arbitrary elliptic curves. An elaborate modular arithmetic logic unit is designed. It can perform basic modular arithmetic operations and achieve high efficiency. Based on our designed instruction set, the processor can be programmed to perform various point operations based on different algorithms. To demonstrate the flexibility of our processor, a point multiplication algorithm with power analysis resistance is adopted. Our design is implemented on a field-programmable gate array platform and also as an application-specific integrated circuit. When implemented in a 55 nm CMOS process, the processor takes between 0.60 ms (163-bit ECC) and 6.75 ms (571-bit ECC) to finish one point multiplication. Compared to other related works, the merits of our ECC processor are its high hardware efficiency and flexibility.
TL;DR: This work proposes a new fast SpGEMM algorithm that requires a small amount of memory and achieves high performance, and reduces the memory usage by 14.7% in single precision and 10.9% in double precision on average, allowing larger matrices to be computed.
Abstract: Sparse general matrix-matrix multiplication (SpGEMM) is one of the key kernels of preconditioners such as the algebraic multigrid method and of graph algorithms. However, the performance of SpGEMM is quite low on modern processors due to random memory access to both input and output matrices. In addition, the number and the pattern of non-zero elements in the output matrix, which are important for achieving locality, are unknown before execution. Moreover, state-of-the-art GPU implementations of SpGEMM require large amounts of memory for temporary results, limiting the matrix size computable on fast GPU device memory. We propose a new fast SpGEMM algorithm requiring a small amount of memory and achieving high performance. Calculation of the pattern and values of the output matrix is optimized by using the GPU's on-chip shared memory and a hash table. Additionally, our algorithm launches multiple kernels running concurrently to improve the utilization of GPU resources. The kernels for the calculation of each row of the output matrix are chosen based on the number of non-zero elements. Performance evaluation using matrices from the University of Florida Sparse Matrix Collection on NVIDIA's Pascal generation GPU shows that our approach achieves speedups of up to 4.3x in single precision and 4.4x in double precision compared to existing SpGEMM libraries. Furthermore, the memory usage is reduced by 14.7% in single precision and 10.9% in double precision on average, allowing larger matrices to be computed.
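The role of the hash table can be seen in a sequential row-by-row SpGEMM, where each output row is accumulated in a table keyed by column index. This is only a CPU sketch of that general idea, assuming CSR inputs and using a Python `dict` as the hash table; it does not reproduce the paper's GPU kernels, shared-memory usage, or row grouping.

```python
def spgemm_csr(a, b):
    """Compute C = A @ B, where a, b and the result are CSR triples
    (row_ptr, col_idx, values)."""
    a_ptr, a_col, a_val = a
    b_ptr, b_col, b_val = b
    c_ptr, c_col, c_val = [0], [], []
    for i in range(len(a_ptr) - 1):
        acc = {}  # hash table: output column -> partial sum for row i
        for k in range(a_ptr[i], a_ptr[i + 1]):
            j, a_ij = a_col[k], a_val[k]
            # Scale row j of B by A[i][j] and merge it into the table.
            for t in range(b_ptr[j], b_ptr[j + 1]):
                acc[b_col[t]] = acc.get(b_col[t], 0.0) + a_ij * b_val[t]
        for col in sorted(acc):  # emit the row with sorted column indices
            c_col.append(col)
            c_val.append(acc[col])
        c_ptr.append(len(c_col))
    return c_ptr, c_col, c_val

# A = [[1, 0], [2, 3]], B = [[0, 4], [5, 0]]  =>  C = [[0, 4], [15, 8]]
A = ([0, 1, 3], [0, 0, 1], [1.0, 2.0, 3.0])
B = ([0, 1, 2], [1, 0], [4.0, 5.0])
C = spgemm_csr(A, B)
print(C)
```

The number of entries in `acc` is not known until the row has been processed, which is the difficulty with output size that the abstract mentions.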
TL;DR: A deterministic top-level algorithm is proposed that automatically detects which method to use depending on the input; it is able to handle ideals of degree over 40000 and outperforms the Magma and Singular ones.
TL;DR: In this article, the complexity of securely evaluating an arithmetic circuit over a finite field is studied in the setting of secure two-party computation with semi-honest adversaries, and the first protocol that makes only a constant (amortized) number of field operations per gate is presented.
Abstract: We study the complexity of securely evaluating an arithmetic circuit over a finite field \(\mathbb {F}\) in the setting of secure two-party computation with semi-honest adversaries. In all existing protocols, the number of arithmetic operations per multiplication gate grows either linearly with \(\log |\mathbb {F}|\) or polylogarithmically with the security parameter. We present the first protocol that only makes a constant (amortized) number of field operations per gate. The protocol uses the underlying field \(\mathbb {F}\) as a black box, and its security is based on arithmetic analogues of well-studied cryptographic assumptions.
TL;DR: In this paper, a randomized algorithm for OMV running in [EQUATION] time is proposed, improving on previous combinatorial $O(n^3/\log^2 n)$ time algorithms, together with a cell probe data structure with worst case [EQUATION] time per query vector, where $w$ is the word size.
Abstract: We consider the Online Boolean Matrix-Vector Multiplication (OMV) problem studied by Henzinger et al. [STOC'15]: given an $n \times n$ Boolean matrix $M$, we receive $n$ Boolean vectors $v_1, \ldots, v_n$ one at a time, and are required to output $Mv_i$ (over the Boolean semiring) before seeing the vector $v_{i+1}$, for all $i$. Previously known algorithms for this problem are combinatorial, running in $O(n^3/\log^2 n)$ time. Henzinger et al. conjecture there is no $O(n^{3-\varepsilon})$ time algorithm for OMV, for all $\varepsilon > 0$; their OMV conjecture is shown to imply strong hardness results for many basic dynamic problems. We give a substantially faster method for computing OMV, running in [EQUATION] randomized time. In fact, after seeing [EQUATION] vectors, we already achieve [EQUATION] amortized time for matrix-vector multiplication. Our approach gives a way to reduce matrix-vector multiplication to solving a version of the Orthogonal Vectors problem, which in turn reduces to "small" algebraic matrix-matrix multiplication. Applications include faster independent set detection, partial match retrieval, and 2-CNF evaluation. We also show how a modification of our method gives a cell probe data structure for OMV with worst case [EQUATION] time per query vector, where $w$ is the word size. This result rules out an unconditional proof of the OMV conjecture using purely information-theoretic arguments.
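To make the problem statement concrete, here is a naive baseline for OMV, not the paper's algorithm. It stores each matrix row as an integer bitset, so that entry $i$ of $Mv$ over the Boolean semiring is 1 exactly when row $i$ and $v$ share a set bit. The bitset encoding (bit $j$ of an integer is column $j$) is an assumption made for this sketch.

```python
def bool_matvec(rows, v):
    """Boolean product M v: entry i is 1 iff row i AND v is non-zero."""
    return [1 if (r & v) else 0 for r in rows]

def omv(rows, vectors):
    """Answer each query before looking at the next one (online setting)."""
    for v in vectors:
        yield bool_matvec(rows, v)

# M = [[1, 1, 0], [0, 0, 1], [0, 0, 0]], with bit j encoding column j.
rows = [0b011, 0b100, 0b000]
answers = list(omv(rows, [0b001, 0b100]))
print(answers)
```

Each query costs $n$ word operations per $w$ columns, which gives cubic total time up to the word-size factor; the results in the abstract improve on this kind of baseline.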
TL;DR: In this article, the authors rule out obtaining ω = 2 in the Cohn-Umans framework from abelian groups of bounded exponent by bounding the size of tricolored sum-free sets in such groups, and show that a variant of tensor rank due to Tao gives a quantitative understanding of the notion of unstable tensor from geometric invariant theory.
Abstract: In 2003, Cohn and Umans described a framework for proving upper bounds on the exponent ω of matrix multiplication by reducing matrix multiplication to group algebra multiplication, and in 2005 Cohn, Kleinberg, Szegedy, and Umans proposed specific conjectures for how to obtain ω = 2. In this paper we rule out obtaining ω = 2 in this framework from abelian groups of bounded exponent. To do this we bound the size of tricolored sum-free sets in such groups, extending the breakthrough results of Croot, Lev, Pach, Ellenberg, and Gijswijt on cap sets. As a byproduct of our proof, we show that a variant of tensor rank due to Tao gives a quantitative understanding of the notion of unstable tensor from geometric invariant theory.
TL;DR: A comparison of hardware architectures for large integer multiplication is presented and it is shown that hardware designs of combination multipliers, at a cost of additional hardware resource usage, can offer lower latency compared to individual multiplier designs.
Abstract: Multipliers requiring large bit lengths have a major impact on the performance of many applications, such as cryptography, digital signal processing (DSP) and image processing. Novel, optimised designs of large integer multiplication are needed as previous approaches, such as schoolbook multiplication, may not be as feasible due to the large parameter sizes. Parameter bit lengths of up to millions of bits are required for use in cryptography, such as in lattice-based and fully homomorphic encryption (FHE) schemes. This paper presents a comparison of hardware architectures for large integer multiplication. Several multiplication methods and combinations thereof are analysed for suitability in hardware designs, targeting the FPGA platform. In particular, the first hardware architecture combining Karatsuba and Comba multiplication is proposed. Moreover, a hardware complexity analysis is conducted to give results independent of any particular FPGA platform. It is shown that hardware designs of combination multipliers, at a cost of additional hardware resource usage, can offer lower latency compared to individual multiplier designs. Indeed, the proposed novel combination hardware design of the Karatsuba-Comba multiplier offers lowest latency for integers greater than 512 bits. For large multiplicands, greater than 16,384 bits, the hardware complexity analysis indicates that the NTT-Karatsuba-Schoolbook combination is most suitable.
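As a software illustration of one of the methods compared, the sketch below implements recursive Karatsuba multiplication on non-negative Python integers. It is not a model of the proposed hardware: the `cutoff_bits` threshold is a hypothetical parameter, and the base case uses Python's built-in multiplication where the paper's combination design would use a Comba multiplier.

```python
def karatsuba(x, y, cutoff_bits=64):
    """Multiply non-negative integers with three recursive half-size products."""
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y  # base case: small operand, multiply directly
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask  # x = x1 * 2^half + x0
    y1, y0 = y >> half, y & mask  # y = y1 * 2^half + y0
    z2 = karatsuba(x1, y1, cutoff_bits)
    z0 = karatsuba(x0, y0, cutoff_bits)
    # (x0 + x1)(y0 + y1) - z2 - z0 equals x1*y0 + x0*y1, saving one product.
    z1 = karatsuba(x0 + x1, y0 + y1, cutoff_bits) - z2 - z0
    return (z2 << (2 * half)) + (z1 << half) + z0

a, b = 3**500, 7**400
print(karatsuba(a, b) == a * b)
```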
TL;DR: This paper proposes a new and simple framework for generating shared and private random multiplication triples with unconditional security and presents the first ever MPC protocols with a linear (in the number of parties) communication overhead per multiplication gate in the circuit.
Abstract: Threshold unconditionally secure multiparty computation (MPC) allows a set of $n$ mutually distrusting parties to securely compute an agreed function $f$ over some finite field in the presence of a computationally unbounded adversary, who can maliciously corrupt any $t$ out of the $n$ parties. Most of the known efficient MPC protocols are designed in the offline–online framework introduced in a seminal work by Beaver in CRYPTO 1991. In this framework, the parties generate shared random and private multiplication triples during the offline phase, which are used later in the online phase for securely evaluating the multiplication gates in the circuit representing $f$. The efficiency of the MPC protocols in this framework then relies on efficient ways of implementing the offline phase. In this paper, we propose a new and simple framework for generating shared and private random multiplication triples with unconditional security. The existing protocols approach this problem by first producing shared pairs of private and random values, followed by securely computing the shared product of each pair of values. The latter task involves a multiplication protocol for shared values, which is typically communication intensive. Our framework takes a completely different approach and shuns the use of a multiplication protocol. Namely, we ask the parties to verifiably share random multiplication triples and then securely extract shared random multiplication triples unknown to the adversary from the shared triples. Realizing our framework in the asynchronous and hybrid network settings (in a hybrid network, the network is assumed to be synchronous up to a certain "point" and asynchronous from that point onward; we assume a hybrid network with just one synchronous round in the beginning), we present the first ever MPC protocols with a linear (in the number of parties) communication overhead per multiplication gate in the circuit representing $f$. These are significant improvements over the best known existing MPC protocols in the asynchronous and hybrid network settings, with communication complexity $\mathcal{O}(n^{2})$ and $\mathcal{O}(n^{3})$, respectively. Our framework, when applied to the synchronous setting, results in round-efficient MPC protocols.
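How a multiplication triple is consumed in the online phase can be sketched as follows. This is a simplified illustration of Beaver's technique only, with several assumptions that differ from the paper: additive (not threshold) secret sharing, a semi-honest setting, a hypothetical prime modulus `P`, and a triple produced by a trusted dealer, which is exactly the step the paper's offline phase replaces with a secure protocol.

```python
import random

P = 2**61 - 1  # hypothetical prime field modulus (assumption)

def share(x, n):
    """Split x into n additive shares modulo P."""
    parts = [random.randrange(P) for _ in range(n - 1)]
    parts.append((x - sum(parts)) % P)
    return parts

def reconstruct(shares):
    return sum(shares) % P

def beaver_multiply(xs, ys, triple):
    """Turn shares of x and y into shares of x*y using a triple (a, b, c = a*b)."""
    a_sh, b_sh, c_sh = triple
    n = len(xs)
    # The parties publicly open d = x - a and e = y - b; both look random.
    d = reconstruct([(xs[i] - a_sh[i]) % P for i in range(n)])
    e = reconstruct([(ys[i] - b_sh[i]) % P for i in range(n)])
    # Local step: x*y = c + d*b + e*a + d*e, with d*e added by one party only.
    zs = [(c_sh[i] + d * b_sh[i] + e * a_sh[i]) % P for i in range(n)]
    zs[0] = (zs[0] + d * e) % P
    return zs

n = 4
a, b = random.randrange(P), random.randrange(P)
triple = (share(a, n), share(b, n), share(a * b % P, n))  # trusted dealer (assumption)
x, y = 123456789, 987654321
product = reconstruct(beaver_multiply(share(x, n), share(y, n), triple))
print(product == x * y % P)
```

The only interaction in the online step is opening `d` and `e`, which is why the cost of an MPC protocol in this framework is dominated by how the triples are produced.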
TL;DR: An algebraic characterization of privacy for multiplication in any finite field is introduced, a novel algebraic characterization of non-interference (a stronger security notion in this setting) is proposed, and two generic constructions of multiplication circuits in finite fields that achieve non-interference in the probing model are presented.
Abstract: The notion of privacy in the probing model, introduced by Ishai, Sahai, and Wagner in 2003, is nowadays frequently involved to assess the security of circuits manipulating sensitive information. However, provable security in this model still comes at the cost of a significant overhead both in terms of arithmetic complexity and randomness complexity. In this paper, we deal with this issue for circuits processing multiplication over finite fields. Our contributions are manifold. Extending the work of Belaid, Benhamouda, Passelegue, Prouff, Thillard, and Vergnaud at Eurocrypt 2016, we introduce an algebraic characterization of the privacy for multiplication in any finite field and we propose a novel algebraic characterization for non-interference (a stronger security notion in this setting). Then, we present two generic constructions of multiplication circuits in finite fields that achieve non-interference in the probing model. Denoting by d the number of probes used by the adversary, the first proposal reduces the number of bilinear multiplications (i.e., of general multiplications of two non-constant values in the finite field) to only \(2d+1\) whereas the state-of-the-art was \(O(d^2)\). The second proposal reduces the randomness complexity to d random elements in the underlying finite field, hence improving the \(O(d \log d)\) randomness complexity achieved by Belaid et al. in their paper. This construction is almost optimal since we also prove that d / 2 is a lower bound. Eventually, we show that both algebraic constructions can always be instantiated in large enough finite fields. Furthermore, for the important cases \(d \in \{2,3\}\), we illustrate that they perform well in practice by presenting explicit realizations for finite fields of practical interest.
TL;DR: A work-efficient multithreaded algorithm for sparse matrix-sparse vector multiplication (SpMSpV) where the matrix, the input vector, and the output vector are all sparse.
Abstract: We design and develop a work-efficient multithreaded algorithm for sparse matrix-sparse vector multiplication (SpMSpV) where the matrix, the input vector, and the output vector are all sparse. SpMSpV is an important primitive in the emerging GraphBLAS standard and is the workhorse of many graph algorithms including breadth-first search, bipartite graph matching, and maximal independent set. As thread counts increase, existing multithreaded SpMSpV algorithms can spend more time accessing the sparse matrix data structure than doing arithmetic. Our shared-memory parallel SpMSpV algorithm is work efficient in the sense that its total work is proportional to the number of arithmetic operations required. The key insight is to avoid having each thread individually scan the list of matrix columns. Our algorithm is simple to implement and operates on existing column-based sparse matrix formats. It performs well on diverse matrices and vectors with heterogeneous sparsity patterns. A high-performance implementation of the algorithm attains up to 15x speedup on a 24-core Intel Ivy Bridge processor and up to 49x speedup on a 64-core Intel KNL manycore processor. In contrast to implementations of existing algorithms, the performance of our algorithm is sustained on a variety of different input types, including matrices representing scale-free and high-diameter graphs.
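The operation itself can be sketched sequentially on a column-based (CSC) matrix: only the columns selected by the non-zeros of the input vector are visited, so the work is proportional to the number of multiplications. This single-threaded sketch does not reproduce the paper's multithreaded algorithm; representing the sparse vectors as Python dictionaries is an assumption made for brevity.

```python
def spmspv_csc(col_ptr, row_idx, values, x):
    """Compute y = A @ x with A in CSC format and x a sparse {index: value} dict."""
    y = {}  # sparse accumulator for the output vector
    for j, x_j in x.items():
        # Visit only column j of A, for each non-zero x_j.
        for k in range(col_ptr[j], col_ptr[j + 1]):
            i = row_idx[k]
            y[i] = y.get(i, 0.0) + values[k] * x_j
    return y

# The 3x3 matrix [[4, 0, 1], [0, 2, 0], [3, 0, 5]] in CSC form.
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 2]
values = [4.0, 3.0, 2.0, 1.0, 5.0]
y = spmspv_csc(col_ptr, row_idx, values, {2: 2.0})
print(y)
```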
TL;DR: Results show the validity of the models and methods proposed for enhancing the locality in parallel SpGEMM operations on a wide range of sparse matrices from real applications.
Abstract: Exploiting spatial and temporal localities is investigated for efficient row-by-row parallelization of the general sparse matrix-matrix multiplication (SpGEMM) operation of the form $C=A\,B$ on many-core architectures. Hypergraph and bipartite graph models are proposed for 1D rowwise partitioning of matrix $A$ to evenly partition the work across threads with the objective of reducing the number of $B$-matrix words to be transferred from the memory and between different caches. A hypergraph model is proposed for $B$-matrix column reordering to exploit spatial locality in accessing entries of thread-private temporary arrays, which are used to accumulate results for $C$-matrix rows. A similarity graph model is proposed for $B$-matrix row reordering to increase temporal reuse of these accumulation array entries. The proposed models and methods are tested on a wide range of sparse matrices from real applications, and the experiments were carried out on a 60-core Intel Xeon Phi processor as well as a two-socket Xeon processor. Results show the validity of the models and methods proposed for enhancing the locality in parallel SpGEMM operations.
TL;DR: This paper presents a detailed methodology of implementing and optimizing the double-precision general format matrix-matrix multiplication (DGEMM) kernel on the emerging SW26010 processor, which is used to build the Sunway TaihuLight supercomputer.
Abstract: The matrix-matrix multiplication is an essential building block that can be found in various scientific and engineering applications. High-performance implementations of the matrix-matrix multiplication on state-of-the-art processors may be of great importance for both the vendors and the users. In this paper, we present a detailed methodology of implementing and optimizing the double-precision general format matrix-matrix multiplication (DGEMM) kernel on the emerging SW26010 processor, which is used to build the Sunway TaihuLight supercomputer. We propose a three level blocking algorithm to orchestrate data on the memory hierarchy and expose parallelism on different hardware levels, and design a collective data sharing scheme by using the register communication mechanism to exchange data efficiently among different cores. On top of those, further optimizations are done based on a data-thread mapping method for efficient data distribution, a double buffering scheme for asynchronous DMA data transfer, and an instruction scheduling method for maximizing the pipeline usage. Experiment results show that the proposed DGEMM implementation can fully exploit the unique hardware features provided by SW26010 and can sustain up to 95% of the peak performance.
TL;DR: A Vedic multiplication algorithm is designed using the Vedic mathematics formula Urdhava Tiryakbhyam, which means "vertically and crosswise", and achieves a lower time delay than other algorithms.
Abstract: The multiplier is a main building block of every processor, and a faster multiplier improves the speed of a Digital Signal Processor (DSP). In some special applications the time delay needs to be reduced. In the proposed method, we design a Vedic multiplication algorithm using the Vedic mathematics formula Urdhava Tiryakbhyam, which means "vertically and crosswise". Vedic mathematics is mainly based on 16 Sutras and was rediscovered in the early 20th century. In ancient India, people used this Sutra effectively for the multiplication of decimal numbers. The same basic concept of Vedic mathematics is applied to the multiplication of binary numbers to make it usable in digital hardware systems. The speed of the computation process is increased and the processing time is reduced due to a decrease in combinational path delay compared to existing multipliers. With our proposed multiplication algorithm, we get a lower time delay than with other algorithms.
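The "vertically and crosswise" pattern can be sketched in software for binary operands: each output column collects all cross products whose bit positions add up to that column, and the carry moves to the next column. This is an illustrative sequential model with hypothetical helper names; in hardware the cross products of a column are generated in parallel.

```python
def to_bits(x, n):
    """Return the n least-significant bits of x, least-significant first."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

def urdhva_multiply(a_bits, b_bits):
    """Multiply two equal-length bit lists column by column."""
    n = len(a_bits)
    result, carry = [], 0
    for col in range(2 * n - 1):
        s = carry
        # Crosswise step: all pairs (i, col - i) that fall inside both operands.
        for i in range(max(0, col - n + 1), min(col, n - 1) + 1):
            s += a_bits[i] & b_bits[col - i]
        result.append(s & 1)  # the low bit stays in this column
        carry = s >> 1        # the rest moves to the next column
    while carry:
        result.append(carry & 1)
        carry >>= 1
    return result

product = from_bits(urdhva_multiply(to_bits(13, 4), to_bits(11, 4)))
print(product)
```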
TL;DR: This work combines locally optimized MDS matrix constructions with heuristics for finding shortest linear straight-line programs, obtaining implementations of known and new MDS matrices for lightweight symmetric primitives that outperform those in the literature, including the best implementation so far of the AES MixColumns operation with respect to the number of XOR operations.
Abstract: Recently a lot of attention has been paid to the search for efficiently implementable MDS matrices for lightweight symmetric primitives. Most previous work concentrated on locally optimizing the multiplication with single matrix elements. Separate from this line of work, several heuristics were developed to find shortest linear straight-line programs. Solving this problem actually corresponds to globally optimizing multiplications by matrices. In this work we combine those two, so far largely independent, lines of work. As a result, we achieve implementations of known, locally optimized, and new MDS matrices that significantly outperform all implementations from the literature. Interestingly, almost all previous locally optimized constructions behave very similarly with respect to the globally optimized implementation. As a side effect, our work reveals the best implementation so far of the AES MixColumns operation with respect to the number of XOR operations needed.
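For orientation, the AES MixColumns operation multiplies each state column by a fixed MDS matrix over GF(2^8). The sketch below is the textbook byte-level formulation, written for clarity rather than for a low XOR count; it is not the optimized straight-line program from the paper.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF

def mix_column(col):
    """Apply the MixColumns matrix (rows are rotations of 2, 3, 1, 1)."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,  # 2*a0 + 3*a1 + a2 + a3
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,  # a0 + 2*a1 + 3*a2 + a3
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),  # a0 + a1 + 2*a2 + 3*a3
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),  # 3*a0 + a1 + a2 + 2*a3
    ]

print([hex(v) for v in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

Every operation here is linear over GF(2), so the whole map is a 32x32 binary matrix, and the question studied in the paper is how few XORs suffice to evaluate it.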
TL;DR: This paper proposes a novel approach based on the Improved Operand Decomposition (IOD) to make an efficient logarithmic multiplier and subsequent achievement through scale realization at the cost of minimal hardware and delay.
TL;DR: In this article, the authors analyze several basic building blocks of double-word arithmetic (the sum, product, and quotient of a double-word number and a floating-point number, and of two double-word numbers), giving improved or corrected relative error bounds and new algorithms for division.
Abstract: We analyze several classical basic building blocks of double-word arithmetic (frequently called “double-double arithmetic” in the literature): the addition of a double-word number and a floating-point number, the addition of two double-word numbers, the multiplication of a double-word number by a floating-point number, the multiplication of two double-word numbers, the division of a double-word number by a floating-point number, and the division of two double-word numbers. For multiplication and division we get better relative error bounds than the ones previously published. For addition of two double-word numbers, we show that the previously published bound was incorrect, and we provide a new relative error bound. We introduce new algorithms for division. We also give examples that illustrate the tightness of our bounds.
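A double-word number is an unevaluated sum of two floating-point numbers, a high part and a low part. The sketch below shows the classical error-free transformations 2Sum and Fast2Sum and one common formulation of adding a floating-point number to a double-word number. The abstract does not spell out the algorithms, so the function names and this particular formulation are assumptions; the code relies on Python floats being IEEE 754 binary64 with round-to-nearest.

```python
def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def fast_two_sum(a, b):
    """Same result as two_sum, but requires |a| >= |b| (or a == 0)."""
    s = a + b
    return s, b - (s - a)

def dw_plus_fp(xh, xl, y):
    """Add the float y to the double-word number (xh, xl)."""
    sh, sl = two_sum(xh, y)      # exact sum of the high part and y
    v = xl + sl                  # combine the two small terms (one rounding)
    return fast_two_sum(sh, v)   # renormalize into (high, low)

zh, zl = dw_plus_fp(1.0, 2.0**-60, 2.0**-70)
print(zh, zl)
```

In this example the ordinary floating-point sum would return exactly 1.0 and lose both small terms, while the double-word result keeps them in the low part.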
TL;DR: In this paper, a generic implementation of a bit matrix compressor for Xilinx FPGAs is described, which does not require a generator tool and is agnostic of the aspect ratio of the input matrix.
Abstract: Bit matrix compression is a highly relevant operation in computer arithmetic. Essentially being a multi-operand addition, it is the key operation behind fast multiplication and many higher-level operations such as multiply-accumulate, the computation of the dot product or the implementation of FIR filters. Compressor implementations have been constantly evolving for greater efficiency both in general and in the context of concrete applications or specific implementation technologies. This paper is building on this history and describes a generic implementation of a bit matrix compressor for Xilinx FPGAs, which does not require a generator tool. It contributes FPGA-oriented metrics for the evaluation of elementary parallel bit counters, a systematic analysis and partial decomposition of previously proposed counters and a fully implemented construction heuristic with a flexible compression target matching the device capabilities. The generic implementation is agnostic of the aspect ratio of the input matrix and can be used for multiplication the same way as it can be for single-column population count operations.
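The idea of compressing a bit matrix before a final carry-propagate addition can be sketched with the generic 3:2 counter (a column of full adders), which turns three operands into two without propagating carries. This is the textbook carry-save scheme on non-negative integers, not the FPGA-specific counters or the construction heuristic of the paper.

```python
def carry_save_add(a, b, c):
    """3:2 compression: a + b + c == s + carry, computed bitwise."""
    s = a ^ b ^ c                                # sum bit of each column
    carry = ((a & b) | (a & c) | (b & c)) << 1   # carry bit, moved one column left
    return s, carry

def compress(operands):
    """Reduce a list of operands to two rows, then do one ordinary addition."""
    ops = list(operands)
    while len(ops) > 2:
        a, b, c = ops.pop(), ops.pop(), ops.pop()
        ops.extend(carry_save_add(a, b, c))  # three rows in, two rows out
    return sum(ops)  # the final carry-propagate addition

total = compress([13, 7, 255, 1024, 99])
print(total)
```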
TL;DR: The DBCSR sparse matrix library is extended, which is the basic building block for linear scaling electronic structure theory and low scaling correlated methods in CP2K, and is specifically designed to efficiently perform block-sparse matrix-matrix multiplication of matrices with a relatively large occupation.
Abstract: Matrix-matrix multiplication is a basic operation in linear algebra and an essential building block for a wide range of algorithms in various scientific fields. Theory and implementation for the dense, square matrix case are well-developed. If matrices are sparse, with application-specific sparsity patterns, the optimal implementation remains an open question. Here, we explore the performance of communication reducing 2.5D algorithms and one-sided MPI communication in the context of linear scaling electronic structure theory. In particular, we extend the DBCSR sparse matrix library, which is the basic building block for linear scaling electronic structure theory and low scaling correlated methods in CP2K. The library is specifically designed to efficiently perform block-sparse matrix-matrix multiplication of matrices with a relatively large occupation. Here, we compare the performance of the original implementation based on Cannon's algorithm and MPI point-to-point communication, with an implementation based on MPI one-sided communications (RMA), in both a 2D and a 2.5D approach. The 2.5D approach trades memory and auxiliary operations for reduced communication, which can lead to a speedup if communication is dominant. The 2.5D algorithm is somewhat easier to implement with one-sided communications. A detailed description of the implementation is provided, also for non ideal processor topologies, since this is important for actual applications. Given the importance of the precise sparsity pattern, and even the actual matrix data, which decides the effective fill-in upon multiplication, the tests are performed within the CP2K package with application benchmarks. Results show a substantial boost in performance for the RMA based 2.5D algorithm, up to 1.80x, which is observed to increase with the number of processes involved in the parallelization.
TL;DR: In this article, the authors propose an energy-efficient mixed-signal approach for vector-by-matrix multiplication in the time domain, where multi-bit values of the input and output vector elements are represented with time-encoded digital signals, while multi-bit matrix weights are realized with current sources, e.g. transistors biased in the subthreshold regime.
Abstract: We propose an extremely energy-efficient mixed-signal approach for performing vector-by-matrix multiplication in a time domain. In such implementation, multi-bit values of the input and output vector elements are represented with time-encoded digital signals, while multi-bit matrix weights are realized with current sources, e.g. transistors biased in subthreshold regime. With our approach, multipliers can be chained together to implement large-scale circuits completely in a time domain. Multiplier operation does not rely on energy-taxing static currents, which are typical for peripheral and input/output conversion circuits of the conventional mixed-signal implementations. As a case study, we have designed a multilayer perceptron, based on two layers of 10x10 four-quadrant vector-by-matrix multipliers, in 55-nm process with embedded NOR flash memory technology, which allows for compact implementation of adjustable current sources. Our analysis, based on memory cell measurements, shows that at high computing speed the drain-induced barrier lowering is a major factor limiting multiplier precision to ~6 bit. Post-layout estimates for a conservative 6-bit digital input/output NxN multiplier designed in 55 nm process, including I/O circuitry for converting between digital and time domain representations, show ~7 fJ/Op for N>200, which can be further lowered well below 1 fJ/Op for more optimal and aggressive design.
TL;DR: This paper describes a ladder variant that accelerates the fixed-point multiplication function inherent to the Diffie-Hellman key pair generation phase and is the first proposal of a Montgomery ladder procedure for prime elliptic curves that admits the extensive use of pre-computation.
Abstract: In the RFC 7748 memorandum, the Internet Research Task Force specified a Montgomery-ladder scalar multiplication function based on two recently adopted elliptic curves, “curve25519” and “curve448”. The purpose of this function is to support the Diffie-Hellman key exchange algorithm that will be included in the forthcoming version of the Transport Layer Security cryptographic protocol. In this paper, we describe a ladder variant that accelerates the fixed-point multiplication function inherent to the Diffie-Hellman key pair generation phase. Our proposal combines a right-to-left version of the Montgomery ladder with the pre-computation of constant values directly derived from the base point and its multiples. To our knowledge, this is the first proposal of a Montgomery ladder procedure for prime elliptic curves that admits the extensive use of pre-computation. In exchange for very modest memory resources and a small extra programming effort, the proposed ladder obtains significant speedups for software implementations. Moreover, our proposal fully complies with the RFC 7748 specification. A software implementation of the X25519 and X448 functions using our pre-computable ladder yields an acceleration factor of roughly 1.20 and 1.25 when implemented on the Haswell and Skylake micro-architectures, respectively.
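The general idea of combining a right-to-left scan with values precomputed from a fixed base can be sketched in a much simpler setting. The Python below is an analogy only, not the paper's ladder and not X25519: it uses modular exponentiation in place of elliptic curve scalar multiplication, the names `precompute_table` and `fixed_base_pow` are hypothetical, and the code is not constant-time, so it should not be used for secret scalars.

```python
def precompute_table(g, p, bits):
    """Return [g^(2^0), g^(2^1), ..., g^(2^(bits-1))] mod p.

    Computed once for a fixed base g, mirroring the precomputed
    multiples of a fixed base point in the elliptic-curve setting.
    """
    table = []
    cur = g % p
    for _ in range(bits):
        table.append(cur)
        cur = (cur * cur) % p
    return table


def fixed_base_pow(table, k, p):
    """Compute g^k mod p by scanning k from its least significant bit.

    Assumes 0 <= k < 2^len(table). No squarings happen at run time:
    each set bit of k costs one multiplication by a table entry.
    """
    acc = 1
    i = 0
    while k:
        if k & 1:
            acc = (acc * table[i]) % p
        k >>= 1
        i += 1
    return acc
```

Because the scan runs from the least significant bit upward, the i-th step only needs the i-th table entry, which depends on the base alone and not on the scalar. That independence is what makes precomputation possible for a fixed base; a left-to-right scan would instead have to double a running value that depends on the scalar.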
TL;DR: This paper presents a novel parallel execution strategy, CRMM (Concurrent Replication-based Matrix Multiplication), along with a parallel algorithm, Marlin, for large-scale matrix multiplication on data-parallel platforms, and proposes a number of novel system-level optimizations.
Abstract: Matrix multiplication is a dominant but very time-consuming operation in many big data analytic applications. Thus its performance optimization is an important and fundamental research issue. The performance of large-scale matrix multiplication on distributed data-parallel platforms is determined by both computation and IO costs. For existing matrix multiplication execution strategies, when the execution concurrency scales up above a threshold, their execution performance deteriorates quickly because the increase of the IO cost outweighs the decrease of the computation cost. This paper presents a novel parallel execution strategy, CRMM (Concurrent Replication-based Matrix Multiplication), along with a parallel algorithm, Marlin, for large-scale matrix multiplication on data-parallel platforms. The CRMM strategy exploits higher execution concurrency for sub-block matrix multiplication with the same IO cost. To further improve the performance of Marlin, we also propose a number of novel system-level optimizations, including increasing the concurrency of local data exchange by calling the native library in batch, reducing the overhead of block matrix transformation, and reducing disk-heavy shuffle operations by exploiting the semantics of matrix computation. We have implemented Marlin as a library along with a set of related matrix operations on Spark and also contributed Marlin to the open-source community. For large-sized matrix multiplication, Marlin outperforms existing systems including Spark MLlib, SystemML and SciDB, with about $1.29\times$, $3.53\times$ and $2.21\times$ speedup on average, respectively. The evaluation on a real-world DNN workload also indicates that Marlin outperforms the above systems by about $12.8\times$, $5.1\times$ and $27.2\times$, respectively.
TL;DR: A heuristic is proposed that evaluates the most attractive graph topologies for CMM circuits, considering both adder depth and pipelining for the first time; the resulting reductions translate to superior delay and power consumption compared to the state of the art.
Abstract: Constant matrix multiplication (CMM), i.e., the multiplication of a constant matrix with a vector, is a common operation in digital signal processing. It is a generalization of multiple constant multiplication (MCM), where a single variable is multiplied by a constant vector. Like MCM, CMM can be reduced to additions/subtractions and bit shifts. Finding a circuit with a minimal number of add/subtract operations is known as the CMM problem. While this leads to a reduction in circuit area, it may be less efficient for power consumption or throughput. It is well studied for the MCM problem that a) reducing the adder depth (AD) leads to reduced power consumption and b) pipeline resources have to be considered during optimization to enhance throughput without wasting area. This paper addresses the optimization of CMM circuits, considering both adder depth and pipelining for the first time. For that, a heuristic is proposed which evaluates the most attractive graph topologies. It is shown that the proposed method requires 12.5% fewer adders with minimal AD and 38.5% fewer pipelined operations. Synthesis results for recent FPGAs show that these reductions also translate to superior results in terms of delay and power consumption compared to the state of the art.
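The reduction of CMM to additions, subtractions, and bit shifts can be made concrete with a small worked example. The Python sketch below is illustrative and not taken from the paper: the constant matrix [[5, 3], [10, 7]] and the function name `cmm_shift_add` are assumptions chosen for the example, and the adder graph shown is one hand-picked realization, not the output of the proposed heuristic.

```python
def cmm_shift_add(x0, x1):
    """Compute [[5, 3], [10, 7]] @ [x0, x1] with shifts, adds, and subtracts only.

    Shifts by a constant are free wiring in hardware, so the cost of
    this realization is counted in add/subtract operations: five in total.
    """
    t0 = (x0 << 2) + x0          # 5*x0            (adder 1, depth 1)
    t1 = (x1 << 1) + x1          # 3*x1            (adder 2, depth 1)
    t2 = (x1 << 3) - x1          # 7*x1            (subtractor 3, depth 1)
    y0 = t0 + t1                 # 5*x0 + 3*x1     (adder 4, depth 2)
    y1 = (t0 << 1) + t2          # 10*x0 + 7*x1    (adder 5, depth 2)
    return y0, y1


def cmm_direct(x0, x1):
    """Reference result using ordinary multiplication."""
    return 5 * x0 + 3 * x1, 10 * x0 + 7 * x1
```

The intermediate term `t0` is shared between both outputs, which is the kind of reuse that reduces the adder count. Both outputs are reached after two adder stages, so this realization has an adder depth of 2; a different grouping of the same terms could use the same number of adders but a larger depth, which is the trade-off the abstract refers to.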
TL;DR: This paper develops a novel parallel multiplier that is based on spiking neural P systems and is capable of multiplying two natural numbers with many digits in parallel and employs the divide and conquer strategy to optimize the processing time of the arithmetic operations.
TL;DR: Comparison results verify that the proposed architecture of a Gaussian normal basis (GNB) multiplier over the binary finite field $GF(2^m)$ has better performance in terms of speed and hardware utilisation.
Abstract: In this paper, an efficient high-speed architecture of a Gaussian normal basis (GNB) multiplier over the binary finite field $GF(2^m)$ is presented. The structure is constructed by using some regular modules for computation of exponentiation by powers of 2 and low-cost blocks for multiplication by normal elements of the binary field. For the powers of 2 exponents, the modules are implemented by some simple cyclic shifts in the normal basis representation. As a result, the multiplier has a simple structure with a low critical path delay. The efficiency of the proposed multiplier is examined in terms of area and time complexity based on its implementation on the Virtex-4 field programmable gate array family and also its application specific integrated circuit design in 180 nm complementary metal-oxide-semiconductor technology. Comparison results with other structures of the GNB multiplier verify that the proposed architecture has better performance in terms of speed and hardware utilisation.