TL;DR: In this work, a comparatively simple fully homomorphic encryption (FHE) scheme based on the learning with errors (LWE) problem is described, with a new technique for building FHE schemes called the approximate eigenvector method.
Abstract: We describe a comparatively simple fully homomorphic encryption (FHE) scheme based on the learning with errors (LWE) problem. In previous LWE-based FHE schemes, multiplication is a complicated and expensive step involving “relinearization”. In this work, we propose a new technique for building FHE schemes that we call the approximate eigenvector method. In our scheme, for the most part, homomorphic addition and multiplication are just matrix addition and multiplication. This makes our scheme both asymptotically faster and (we believe) easier to understand.
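A short worked derivation may help show why plain matrix operations suffice here. The notation is an assumption, not taken from the abstract: suppose a ciphertext matrix \(C_i\) encrypting \(\mu_i\) satisfies \(C_i \vec v = \mu_i \vec v + \vec e_i\) for a secret vector \(\vec v\) and a small error \(\vec e_i\), so that \(\vec v\) is an approximate eigenvector with eigenvalue \(\mu_i\). Then, as a sketch:

```latex
\begin{align*}
(C_1 + C_2)\,\vec v &= (\mu_1 + \mu_2)\,\vec v + (\vec e_1 + \vec e_2),\\
(C_1 C_2)\,\vec v   &= C_1(\mu_2 \vec v + \vec e_2)
                     = \mu_1 \mu_2\,\vec v + \mu_2\,\vec e_1 + C_1 \vec e_2 .
\end{align*}
```

Under these assumptions the sum and product are again approximate-eigenvector ciphertexts, provided the new error terms \(\vec e_1 + \vec e_2\) and \(\mu_2 \vec e_1 + C_1 \vec e_2\) stay small, which requires \(\mu_2\) and the entries of \(C_1\) to be small.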
TL;DR: This work obtains the first communication-optimal algorithm for all dimensions of rectangular matrices by combining the dimension-splitting technique with the recursive BFS/DFS approach, and shows significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.
Abstract: Communication-optimal algorithms are known for square matrix multiplication. Here, we obtain the first communication-optimal algorithm for all dimensions of rectangular matrices. Combining the dimension-splitting technique of Frigo, Leiserson, Prokop and Ramachandran (1999) with the recursive BFS/DFS approach of Ballard, Demmel, Holtz, Lipshitz and Schwartz (2012) allows for a communication-optimal as well as cache and network-oblivious algorithm. Moreover, the implementation is simple: approximately 50 lines of code for the shared-memory version. Since the new algorithm minimizes communication across the network, between NUMA domains, and between levels of cache, it performs well in practice on both shared and distributed-memory machines. We show significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.
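The dimension-splitting idea can be sketched serially as follows. This is a minimal illustration rather than the authors' implementation: the function names, the list-of-lists representation, and the `cutoff` parameter are all assumptions, and the parallel BFS/DFS scheduling that the abstract relies on is omitted. At each step the largest of the three dimensions is halved.

```python
def naive_matmul(A, B):
    """Base case: plain triple-loop product of list-of-lists matrices."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def split_matmul(A, B, cutoff=2):
    """Multiply an m x k matrix by a k x n matrix by recursively
    halving whichever of m, k, n is currently largest."""
    m, k, n = len(A), len(B), len(B[0])
    if max(m, k, n) <= cutoff:
        return naive_matmul(A, B)
    if m >= k and m >= n:
        # Split the rows of A; the two results stack vertically.
        h = m // 2
        return split_matmul(A[:h], B, cutoff) + split_matmul(A[h:], B, cutoff)
    if n >= k:
        # Split the columns of B; the two results sit side by side.
        h = n // 2
        left = split_matmul(A, [row[:h] for row in B], cutoff)
        right = split_matmul(A, [row[h:] for row in B], cutoff)
        return [l + r for l, r in zip(left, right)]
    # Split the shared dimension k; the two partial products are summed.
    h = k // 2
    P = split_matmul([row[:h] for row in A], B[:h], cutoff)
    Q = split_matmul([row[h:] for row in A], B[h:], cutoff)
    return [[x + y for x, y in zip(pr, qr)] for pr, qr in zip(P, Q)]
```

Only the split along the shared dimension needs an extra addition of partial results; the other two splits produce independent output blocks.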
TL;DR: Two new parallel algorithms are obtained and it is proved that they match the expected communication cost lower bound, and hence they are optimal.
Abstract: Parallel algorithms for sparse matrix-matrix multiplication typically spend most of their time on inter-processor communication rather than on computation, and hardware trends predict the relative cost of communication will only increase. Thus, sparse matrix multiplication algorithms must minimize communication costs in order to scale to large processor counts. In this paper, we consider multiplying sparse matrices corresponding to Erdős-Rényi random graphs on distributed-memory parallel machines. We prove a new lower bound on the expected communication cost for a wide class of algorithms. Our analysis of existing algorithms shows that, while some are optimal for a limited range of matrix density and number of processors, none is optimal in general. We obtain two new parallel algorithms and prove that they match the expected communication cost lower bound, and hence they are optimal.
TL;DR: A practical algorithm for the exact multiplication of square n × n matrices is proposed, with asymptotic arithmetic complexity \(O(n^{2.7743})\).
Abstract: A method for deriving bilinear algorithms for matrix multiplication is proposed. New estimates for the bilinear complexity of a number of problems of the exact and approximate multiplication of rectangular matrices are obtained. In particular, the estimate for the boundary rank of multiplying 3 × 3 matrices is improved and a practical algorithm for the exact multiplication of square n × n matrices is proposed. The asymptotic arithmetic complexity of this algorithm is \(O(n^{2.7743})\).
TL;DR: The paper presents two possible interpretations and realization ways of interval multiplication and division: the possibilistic, unconditional interpretation that is of great meaning for fuzzy arithmetic and fuzzy systems, and the probabilistic, conditional interpretation that requires either knowledge of probability density distributions or assumptions concerning these distributions.
Abstract: The paper presents two possible interpretations and realization ways of interval multiplication and division: the possibilistic, unconditional interpretation that is of great meaning for fuzzy arithmetic and fuzzy systems, and the probabilistic, conditional interpretation that requires either knowledge of probability density distributions or assumptions concerning these distributions. The possibilistic interpretation has a great significance not only for fuzzy arithmetic but also for other sciences that use it such as Computing with Words, Grey Systems, etc. These two interpretations are explained in frame of a new, multidimensional RDM interval-arithmetic. The possibility of realization of interval-arithmetic operations in two ways is an argument for reconciliation of two competing scientific groups that propagate two approaches to uncertainty modeling: the probabilistic and possibilistic one. For many years Professor Zadeh has been claiming in his publications that both approaches are not contradictory but rather complementary.
TL;DR: Low-cost finite impulse response (FIR) designs are presented using the concept of faithfully rounded truncated multipliers and nonuniform coefficient quantization with proper filter order is proposed to minimize total area cost.
Abstract: Low-cost finite impulse response (FIR) designs are presented using the concept of faithfully rounded truncated multipliers. We jointly consider the optimization of bit width and hardware resources without sacrificing the frequency response and output signal precision. Nonuniform coefficient quantization with proper filter order is proposed to minimize total area cost. Multiple constant multiplication/accumulation in a direct FIR structure is implemented using an improved version of truncated multipliers. Comparisons with previous FIR design approaches show that the proposed designs achieve the best area and power results.
TL;DR: In this article, a crossbar array of resistive memory devices serves as a memory array that stores the coefficients of a matrix and, combined with input and output analog circuits, is used to perform matrix-vector multiplication approximation operations.
Abstract: An apparatus that performs the mathematical matrix-vector multiplication approximation operations using crossbar arrays of resistive memory devices (e.g. memristor, resistive random-access memory, spintronics, etc.). A crossbar array formed by resistive memory devices serves as a memory array that stores the coefficients of a matrix. Combined with input and output analog circuits, the crossbar array system realizes the method of performing matrix-vector multiplication approximation operations with significant performance, area and energy advantages over existing methods and designs. This invention also includes an extended method that realizes the auto-associative neural network recall function using the resistive memory crossbar architecture.
TL;DR: By considering the tradeoff of designs among three components, the proposed 16 × 16-digit multiplier achieves about 11 percent less delay with 2 percent less area compared to the current fastest design.
Abstract: Decimal multiplication is one of the most important decimal arithmetic operations, with growing demand in commercial, financial, and scientific computing. In this paper, we propose a parallel decimal multiplication algorithm with three components: partial product generation, partial product reduction, and a final digit-set conversion. First, a redundant number system is applied to recode not only the multiplier, but also multiples of the multiplicand in signed-digit (SD) numbers. Furthermore, we present a multioperand SD addition algorithm to reduce the partial product array. Finally, a digit-set conversion algorithm with a hybrid prefix network to decrease the number of logic gates on the critical path is discussed. An analysis of the timing delay and an HDL model synthesized under 90 nm technology show that, by considering the tradeoff of designs among three components, the proposed 16 × 16-digit multiplier achieves about 11 percent less delay with 2 percent less area compared to the current fastest design.
TL;DR: A memristive multiplier circuit demonstrates a fast and highly sensitive pattern recognition for highly complex inputs.
Abstract: Memristors-based resistive logic computation units are introduced. By controlling the memristors' conditional set operation adaptively to one of the input polarities, bipolar signal multiplication of an input and a stored reference bit is performed by unipolar memristor devices and control switches. The multiplication result is registered in an output nonvolatile memristor so that the computed output can be accessed anytime later on by reading the output memristor's state. A memristive multiplier circuit demonstrates a fast and highly sensitive pattern recognition for highly complex inputs.
TL;DR: In this paper, a generalization of the law of free multiplication to non-Hermitian matrices is discussed, along with a couple of examples illustrating how to use these methods in practice.
Abstract: We review methods to calculate eigenvalue distributions of products of large random matrices. We discuss a generalization of the law of free multiplication to non-Hermitian matrices and give a couple of examples illustrating how to use these methods in practice. In particular we calculate eigenvalue densities of products of Gaussian Hermitian and non-Hermitian matrices including combinations of GUE and Ginibre matrices.
TL;DR: A circuit-synthesis procedure exploits spectral properties of multiplication operators and constructs optimized circuits from the traces of the execution of an appropriate GCD algorithm, reducing gate counts and circuit latency by up to 4-5 times.
Abstract: A major obstacle to implementing Shor's quantum number-factoring algorithm is the large size of modular-exponentiation circuits. We reduce this bottleneck by customizing reversible circuits for modular multiplication to individual runs of Shor's algorithm. Our circuit-synthesis procedure exploits spectral properties of multiplication operators and constructs optimized circuits from the traces of the execution of an appropriate GCD algorithm. Empirically, gate counts are reduced by 4-5 times, and circuit latency is reduced by larger factors.
TL;DR: A fully homomorphic method and system for randomizing an input, wherein all computations are over a commutative ring, is described; it can also be used for verifying that a returned result of a calculation performed by a third party is valid for any of the calculations described herein.
Abstract: A fully homomorphic method and system for randomizing an input, wherein all computations are over a commutative ring, is described. Equivalent methods for performing the randomization using matrices and polynomials are detailed, as well as ways to mix the matrix and polynomial functions. Addition, multiplication, and division of the matrix and polynomial functions are further described. By performing computations of the functions modulo N over a ring \(\mathbb{Z}_N\), the functions are usable as encryption functions. The method and system can also be used for verifying that a returned result of a calculation performed by a third party is valid for any of the calculations described herein. Related methods, systems, and apparatus are also described.
TL;DR: A novel software multiplier for performing a polynomial multiplication of two 64-bit binary polynomials based on the VMULL instruction included in the NEON engine supported in many ARM processors is described, obtaining a fast software multiplication in the binary field \(\mathbb{F}_{2^m}\), which is up to 45% faster compared to the best known algorithm.
Abstract: Efficient algorithms for binary field operations are required in several cryptographic operations such as digital signatures over binary elliptic curves and encryption. The main performance-critical operation in these fields is the multiplication, since most processors do not support instructions to carry out a polynomial multiplication. In this paper we describe a novel software multiplier for performing a polynomial multiplication of two 64-bit binary polynomials based on the VMULL instruction included in the NEON engine supported in many ARM processors. This multiplier is then used as a building block to obtain a fast software multiplication in the binary field \(\mathbb{F}_{2^m}\), which is up to 45% faster compared to the best known algorithm. We also illustrate the performance improvement in point multiplication on binary elliptic curves using the new multiplier, improving the performance of standard NIST curves at the 128- and 256-bit levels of security. The impact on the GCM authenticated encryption scheme is also studied, with new speed records. We present timing results of our software implementation on the ARM Cortex-A8, A9 and A15 processors.
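For readers unfamiliar with the operation being accelerated, the following is a minimal bit-serial sketch of carry-less (binary polynomial) multiplication. It only illustrates what a 64 × 64-bit polynomial multiplier computes; it is not the NEON/VMULL-based method of the paper, and the function name `clmul64` is hypothetical. Bit i of each integer is taken to be the coefficient of \(x^i\).

```python
def clmul64(a, b):
    """Carry-less product of two binary polynomials of degree < 64.

    Works like schoolbook shift-and-add multiplication, except that
    partial products are combined with XOR (addition in GF(2)), so no
    carries propagate between bit positions.
    """
    if not (0 <= a < 1 << 64 and 0 <= b < 1 << 64):
        raise ValueError("operands must fit in 64 bits")
    result = 0
    while b:
        if b & 1:          # current coefficient of b is 1
            result ^= a    # add (XOR) the shifted copy of a
        a <<= 1            # multiply a by x
        b >>= 1            # move to the next coefficient of b
    return result
```

For example, `clmul64(0b11, 0b11)` yields `0b101`, reflecting \((x+1)^2 = x^2 + 1\) over GF(2). A field multiplication in \(\mathbb{F}_{2^m}\) would additionally reduce the result modulo an irreducible polynomial, which is not shown here.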
TL;DR: This work compares two finite impulse response (FIR) filter architectures for FPGAs for which the coefficients can be reconfigured during run-time and finds that if the input word size is greater than approximately half the number of coefficients, the LUT based multiplication scheme needs fewer resources than the DA architecture, and vice versa.
Abstract: This work compares two finite impulse response (FIR) filter architectures for FPGAs for which the coefficients can be reconfigured during run-time. One is a recently proposed filter architecture based on distributed arithmetic (DA) and the other is based on a LUT multiplication scheme. Instead of using the common internal configuration access port (ICAP) for reconfiguration, which is able to change the logic as well as the routing, it is sufficient to reconfigure only the logic in the regarded architectures. This is realized by using the configurable look-up table (CFGLUT) primitive of Xilinx, which allows reconfiguration times that are orders of magnitude faster than using ICAP. The resulting FIR filter architectures achieve reconfiguration times of typically less than 100 ns. They can be reconfigured with arbitrary coefficients that are only limited by their length and word size. As their resource consumption depends on different parameters of the filter, a detailed comparison is done. It turns out that if the input word size is greater than approximately half the number of coefficients, the LUT based multiplication scheme needs fewer resources than the DA architecture, and vice versa.
TL;DR: A novel method, "Carry-Once", is proposed, which reduces the number of intermediate result computations by the size of the result accumulation; it improves all multi-precision multiplication techniques that use intermediate result computation and shows a speed improvement of up to 2.5% compared with the best known results.
Abstract: In this paper, we revisit the previous multi-precision multiplication techniques including "operand-scanning", "hybrid-scanning", "operand-caching", "consecutive operand-caching" and "product-scanning." In particular, the former four methods execute an intermediate result computation, which is the process of updating the results with a newly computed result by performing a number of addition operations. This operation is expensive, so an efficient implementation is required to boost the performance. For this reason, we propose a novel method, i.e., "Carry-Once", which reduces the number of intermediate result computations by the size of the result accumulation. The main idea is gathering carry values and updating the values at once. This method improves all multi-precision multiplication techniques having intermediate result computation and shows a performance enhancement in terms of speed of up to 2.5%, compared with the best known results.
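As background, the product-scanning technique named above can be sketched as follows. This is the classic column-wise method, not the proposed "Carry-Once" method; the 32-bit limb size and all function names are assumptions made for illustration. Operands are little-endian lists of limbs, and each result limb is finished before the next column starts, so no stored partial result is ever revisited.

```python
WORD_BITS = 32                      # assumed limb width
MASK = (1 << WORD_BITS) - 1


def to_limbs(value, count):
    """Split a non-negative integer into `count` little-endian limbs."""
    return [(value >> (WORD_BITS * i)) & MASK for i in range(count)]


def from_limbs(limbs):
    """Recombine little-endian limbs into one integer."""
    return sum(limb << (WORD_BITS * i) for i, limb in enumerate(limbs))


def product_scanning(a, b):
    """Multiply limb lists a and b one result column at a time."""
    n, m = len(a), len(b)
    result = [0] * (n + m)
    acc = 0                         # running column sum plus carried-in bits
    for col in range(n + m - 1):
        # Accumulate every partial product a[i] * b[j] with i + j == col.
        for i in range(max(0, col - m + 1), min(col, n - 1) + 1):
            acc += a[i] * b[col - i]
        result[col] = acc & MASK    # low word is the finished result limb
        acc >>= WORD_BITS           # the rest carries into the next column
    result[n + m - 1] = acc         # final carry is the top limb
    return result
```

In Python the accumulator is an arbitrary-precision integer; a real implementation on a small processor would keep it in a few registers and handle the carries explicitly, which is where techniques such as the one in the abstract matter.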
TL;DR: Sparse matrix-vector multiplication (SpMxV) is a kernel operation widely used in iterative linear solvers, in which the same sparse matrix is multiplied by a dense vector repeatedly.
Abstract: Sparse matrix-vector multiplication (SpMxV) is a kernel operation widely used in iterative linear solvers. The same sparse matrix is multiplied by a dense vector repeatedly in these solvers. Matric...
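To make the SpMxV kernel concrete, here is a minimal sketch using the common compressed sparse row (CSR) layout. The choice of CSR and the names `values`, `col_idx`, and `row_ptr` are assumptions for illustration; the abstract does not say which storage format the paper uses.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values  : nonzero entries of A, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row r occupies values[row_ptr[r]:row_ptr[r + 1]]
    x       : dense input vector
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0])  # [7.0, 6.0, 19.0]
```

Because an iterative solver reuses the same matrix across many products, any preprocessing of the three arrays is paid once and amortized over the iterations.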
TL;DR: In this article, the authors investigated pointwise multipliers on vector-valued function spaces with Muckenhoupt weights and proved that the characteristic function of the half-space is a pointwise multiplier on Bessel-potential spaces with values in a UMD Banach space.
Abstract: We investigate pointwise multipliers on vector-valued function spaces over $\mathbb{R}^d$, equipped with Muckenhoupt weights. The main result is that in the natural parameter range, the characteristic function of the half-space is a pointwise multiplier on Bessel-potential spaces with values in a UMD Banach space. This is proved for a class of power weights, including the unweighted case, and extends the classical result of Shamir and Strichartz. The multiplication estimate is based on the paraproduct technique and a randomized Littlewood-Paley decomposition. An analogous result is obtained for Besov and Triebel-Lizorkin spaces.
TL;DR: This work investigates the optimization of pipelined MCM circuits which include ternary adders, and shows experimentally that 27% fewer operations are needed on average by using ternary adders, resulting in 15% slice and 10% ALM reductions, respectively.
Abstract: The scaling operation, i.e., the multiplication with a single constant, is a frequently used operation in many kinds of numeric algorithms. The multiple constant multiplication (MCM) is a generalization where a variable is multiplied by several constants. This kind of operation is heavily used, e.g., in digital filters or discrete transforms. It was shown in recent work that small, fast and power efficient MCM implementations can be realized by using the fast carry chains of FPGAs rather than wasting specialized embedded multipliers. However, in the work so far, only common two-input adders were used. As FPGAs today support ternary adders, i.e., adders with three inputs, this work investigates the optimization of pipelined MCM circuits which include ternary adders. It is shown experimentally that 27% fewer operations are needed on average by using ternary adders, resulting in 15% slice (Xilinx) and 10% ALM (Altera) reductions, respectively.
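A small software analogy may clarify what an MCM adder graph looks like and why three-input adders reduce the operation count. The constants 21 and 37, the particular decompositions, and the function names are chosen here purely for illustration and do not come from the paper; pipelining is not modeled. Shifts are treated as free wiring, so only additions count as operations.

```python
def mcm_two_input(x):
    """21*x and 37*x using only two-input additions (three in total)."""
    x5 = (x << 2) + x          # adder 1: 5x  = 4x + x (shared term)
    x21 = (x << 4) + x5        # adder 2: 21x = 16x + 5x
    x37 = (x << 5) + x5        # adder 3: 37x = 32x + 5x
    return x21, x37


def mcm_ternary(x):
    """The same outputs using two three-input additions."""
    x21 = (x << 4) + (x << 2) + x   # ternary adder 1: 16x + 4x + x
    x37 = (x << 5) + (x << 2) + x   # ternary adder 2: 32x + 4x + x
    return x21, x37
```

In this particular decomposition the ternary version uses two operations instead of three; the savings reported in the abstract are averages over benchmark constant sets and depend on the optimization method.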
TL;DR: This paper examined preservice elementary teachers' knowledge for teaching the associative property (AP) of multiplication and found that most elementary teachers were unable to use concrete contexts (e.g., pictorial representations and word problems) to illustrate AP of multiplication conceptually, particularly due to a fragile understanding of the meaning of multiplication.
TL;DR: The results indicate that differences in educational practices might affect the neural bases of symbolic arithmetic, and suggests that the multiplication problem-size effect might be a verbal retrieval effect in Chinese as compared to American participants.
Abstract: Multiplication problems involving large numbers (e.g., 9 × 8) are more difficult to solve than problems involving small numbers (e.g., 2 × 3). Behavioral research indicates that this problem-size effect might be due to different factors across countries and educational systems. However, there is no neuroimaging evidence supporting this hypothesis. Here, we compared the neural correlates of the multiplication problem-size effect in adults educated in China and the United States. We found a greater neural problem-size effect in Chinese than American participants in bilateral superior temporal regions associated with phonological processing. However, we found a greater neural problem-size effect in American than Chinese participants in right intra-parietal sulcus (IPS) associated with calculation procedures. Therefore, while the multiplication problem-size effect might be a verbal retrieval effect in Chinese as compared to American participants, it may instead stem from the use of calculation procedures in American as compared to Chinese participants. Our results indicate that differences in educational practices might affect the neural bases of symbolic arithmetic.
TL;DR: From the application-specific integrated circuit and field-programmable gate array synthesis results, it is found that the proposed design provides significantly lower area-delay and power-delay complexities than the best of the existing designs.
Abstract: This paper presents an area-time-efficient systolic structure for multiplication over \(GF(2^m)\) based on irreducible all-one polynomial (AOP). We have used a novel cut-set retiming to reduce the duration of the critical-path to one XOR gate delay. It is further shown that the systolic structure can be decomposed into two or more parallel systolic branches, where the pair of parallel systolic branches has the same input operand, and they can share the same input operand registers. From the application-specific integrated circuit and field-programmable gate array synthesis results we find that the proposed design provides significantly lower area-delay and power-delay complexities than the best of the existing designs.
TL;DR: An application to the problem of cancer cell line image classification is presented that uses the notion of a co-difference matrix that is analogous to a covariance matrix except that the vector products are based on the new proposed framework.
Abstract: A new framework for signal processing is introduced based on a novel vector product definition that permits a multiplier-free implementation. First a new product of two real numbers is defined as the sum of their absolute values, with the sign determined by the product of the hard-limited numbers. This new product of real numbers is used to define a similar product of vectors in \(\mathbb{R}^N\). The new vector product of two identical vectors reduces to a scaled version of the \(\ell_1\) norm of the vector. The main advantage of this framework is that it yields multiplication-free computationally efficient algorithms for performing some important tasks in signal processing. An application to the problem of cancer cell line image classification is presented that uses the notion of a co-difference matrix that is analogous to a covariance matrix except that the vector products are based on our new proposed framework. Results show the effectiveness of this approach when the proposed co-difference matrix is compared with a covariance matrix.
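The product described above can be sketched in a few lines. The function names are hypothetical, and treating the sign of zero as zero is an assumption, since the abstract does not say how zero operands are handled.

```python
def sign(v):
    """Hard limiter: -1, 0, or +1 (sign(0) = 0 is an assumption)."""
    return (v > 0) - (v < 0)


def mf_product(a, b):
    """Multiplier-free 'product': |a| + |b|, signed like a * b would be.

    The multiplication by sign(a) * sign(b) only ever involves -1, 0, or
    +1, so in hardware it amounts to sign logic, not a real multiplier.
    """
    return sign(a) * sign(b) * (abs(a) + abs(b))


def mf_vector_product(x, y):
    """Sum of element-wise multiplier-free products of two vectors."""
    return sum(mf_product(a, b) for a, b in zip(x, y))
```

For a vector with no zero entries, `mf_vector_product(x, x)` equals twice the sum of absolute values, which matches the statement that the product of two identical vectors is a scaled \(\ell_1\) norm. For example, `mf_vector_product([1, -2, 3], [1, -2, 3])` gives 12.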
TL;DR: A new key distribution protocol is proposed that focuses on reducing computation complexity by performing fewer multiplication operations using a ternary-tree approach during key updating, and on reducing the amount of information communicated to the group members during the update operations in the key content.
Abstract: Secure multimedia multicast applications involve group communications where group membership requires secured dynamic key generation and updating operations. Such operations usually consume high computation time and therefore designing a key distribution protocol with reduced computation time is necessary for multicast applications. In this paper, we propose a new key distribution protocol that focuses on two aspects. The first one aims at the reduction of computation complexity by performing fewer multiplication operations using a ternary-tree approach during key updating. Moreover, it aims to optimize the number of multiplication operations by using the existing Karatsuba divide and conquer approach for fast multiplication. The second aspect aims at reducing the amount of information communicated to the group members during the update operations in the key content. The proposed algorithm has been evaluated based on computation and communication complexity and a comparative performance analysis of various key distribution protocols is provided. Moreover, it has been observed that the proposed algorithm reduces the computation and communication time significantly.
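The Karatsuba divide-and-conquer approach mentioned above can be sketched as follows for non-negative integers. This is a generic textbook version, not the protocol's own code; the function name and the `cutoff_bits` threshold are assumptions. It replaces the four half-size products of schoolbook splitting with three.

```python
def karatsuba(x, y, cutoff_bits=32):
    """Multiply non-negative integers x and y with Karatsuba's method."""
    if x < 0 or y < 0:
        raise ValueError("operands must be non-negative")
    # Small operands: fall back to the built-in multiplication.
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask     # x = x_hi * 2**half + x_lo
    y_hi, y_lo = y >> half, y & mask     # y = y_hi * 2**half + y_lo
    z2 = karatsuba(x_hi, y_hi, cutoff_bits)
    z0 = karatsuba(x_lo, y_lo, cutoff_bits)
    # One extra product recovers the cross term x_hi*y_lo + x_lo*y_hi.
    z1 = karatsuba(x_hi + x_lo, y_hi + y_lo, cutoff_bits) - z2 - z0
    return (z2 << (2 * half)) + (z1 << half) + z0
```

The saving comes from computing the cross term with a single recursive product plus two subtractions, which gives roughly \(O(n^{1.585})\) word operations instead of \(O(n^2)\) for n-word operands.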
TL;DR: A rationalized algorithm for calculating the product of sedenions is presented which reduces the number of underlying multiplications and can compute the same result in only 122 multiplications (or multipliers - in hardware implementation case) and 298 additions.
TL;DR: In this paper, a method of matrix multiplication was chosen and its performance was analyzed; the chosen method was found to be very powerful when dealing with large matrices and when implemented using parallel computing based on OpenMP libraries.
Abstract: Multiplication of large matrices requires a lot of computation time as its complexity is \(O(n^3)\). Because most image processing applications require higher computational throughputs with minimum time, many sequential and parallel algorithms are developed. In this paper, a method of matrix multiplication was chosen and analyzed. A performance analysis was performed, and it was seen that the chosen method was very powerful when dealing with large matrices and when implemented using parallel computing based on OpenMP libraries.
TL;DR: This paper shows how some of these ideal lattices can be constructed from polynomial codes (generalization of cyclic codes) via Construction A, and illustrates how these lattices enable multiplication.
Abstract: As a first step towards distributed computations in a wireless network, we introduce ideal lattices, that is lattices built over an ideal of a ring of integers in a number field, as a tool for constructing lattice codes at the physical layer. These lattices are not only additive groups as all lattices, but they are also equipped with a multiplication, which enables polynomial operations at each node of the wireless network. In this paper, we show how some of these ideal lattices can be constructed from polynomial codes (generalization of cyclic codes) via Construction A, and illustrate how these lattices enable multiplication.
TL;DR: In this article, an analysis and reconstruction of Joncourt's table of triangular numbers is presented, which was an alternative to other methods for the computation of squares, the extraction of square roots, and even the multiplication.
Abstract: This is an analysis and reconstruction of Joncourt's table of triangular numbers, one of only very few such tables, which was an alternative to other methods for the computation of squares, the extraction of square roots, and even the multiplication.
TL;DR: The mapped blocked row format is proposed: a bitmapped sparse matrix format that stores entries as blocks without a fill overhead, thereby offering blocking without additional storage and bandwidth overheads.
Abstract: The problem of obtaining high computational throughput from sparse matrix multiple-vector multiplication routines is considered. Current sparse matrix formats and algorithms have high bandwidth requirements and poor reuse of cache and register loaded entries, which restrict their performance. We propose the mapped blocked row format: a bitmapped sparse matrix format that stores entries as blocks without a fill overhead, thereby offering blocking without additional storage and bandwidth overheads. An efficient algorithm decodes bitmaps using de Bruijn sequences and minimizes the number of conditionals evaluated. Performance is compared with that of popular formats, including vendor implementations of sparse BLAS. Our sparse matrix multiple-vector multiplication algorithm achieves high throughput on all platforms and is implemented using platform neutral optimizations.
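The de Bruijn decoding step can be illustrated with the widely known 32-bit trick for locating set bits without a per-bit conditional. This is a generic sketch under stated assumptions: the 32-bit bitmap width, the constant `0x077CB531`, and all names are illustrative, and the paper's actual bitmap layout and block sizes are not given in the abstract.

```python
DEBRUIJN32 = 0x077CB531   # a 32-bit de Bruijn sequence B(2, 5)

# Build the lookup table: shifting the sequence left by i places a unique
# 5-bit pattern in the top bits, which is mapped back to i.
TABLE = [0] * 32
for i in range(32):
    TABLE[((DEBRUIJN32 << i) & 0xFFFFFFFF) >> 27] = i


def lowest_set_bit(word):
    """Index of the lowest set bit of a nonzero 32-bit word."""
    isolated = word & -word                       # keep only the lowest set bit
    return TABLE[((isolated * DEBRUIJN32) & 0xFFFFFFFF) >> 27]


def decode_bitmap(bitmap):
    """List the positions of all set bits in a 32-bit bitmap, low to high."""
    positions = []
    while bitmap:
        positions.append(lowest_set_bit(bitmap))
        bitmap &= bitmap - 1                      # clear the lowest set bit
    return positions
```

The loop runs once per stored entry rather than once per bitmap position, so sparse blocks cost proportionally less to decode. For example, `decode_bitmap(0b1010010)` returns `[1, 4, 6]`.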
TL;DR: This work proposes two optimization techniques for high-speed implementations of the required multiplications with the least possible number of FPGA components and a formulation for the pipelined multiple constant multiplication problem is presented.
Abstract: Two-dimensional finite impulse response (FIR) filters are an important component in many image and video processing systems. The processing of complex video applications in real time requires high computational power, which can be provided using field programmable gate arrays (FPGAs) due to their inherent parallelism. The most resource-intensive components in computing FIR filters are the multiplications of the folding operation. This work proposes two optimization techniques for high-speed implementations of the required multiplications with the least possible number of FPGA components. Both methods use integer linear programming formulations which can be optimally solved by standard solvers. In the first method, a formulation for the pipelined multiple constant multiplication problem is presented. In the second method, also multiplication structures based on look-up tables are taken into account. Due to the low coefficient word size in video processing filters of typically 8 to 12 bits, an optimal solution is found for most of the filters in the benchmark used. A complexity reduction of 8.5% for a Xilinx Virtex 6 FPGA could be achieved compared to state-of-the-art heuristics.