TL;DR: A general multiparty computation protocol secure against an active adversary corrupting up to $$n-1$$ of the n players is proposed, which may be used to compute securely arithmetic circuits over any finite field $$\mathbb {F}_{p^k}$$.
Abstract: We propose a general multiparty computation protocol secure against an active adversary corrupting up to $$n-1$$ of the n players. The protocol may be used to compute securely arithmetic circuits over any finite field $$\mathbb {F}_{p^k}$$. Our protocol consists of a preprocessing phase that is independent of both the function to be computed and the inputs, and a much more efficient online phase where the actual computation takes place. The online phase is unconditionally secure and has total computational and communication complexity linear in n, the number of players, where earlier work was quadratic in n. Moreover, the work done by each player is only a small constant factor larger than what one would need to compute the circuit in the clear. We show this is optimal for computation in large fields. In practice, for 3 players, a secure 64-bit multiplication can be done in 0.05 ms. Our preprocessing is based on a somewhat homomorphic cryptosystem. We extend a scheme by Brakerski et al., so that we can perform distributed decryption and handle many values in parallel in one ciphertext. The computational complexity of our preprocessing phase is dominated by the public-key operations: we need $$O(n^2/s)$$ operations per secure multiplication, where s is a parameter that increases with the security parameter of the cryptosystem. Earlier work in this model needed $$\varOmega(n^2)$$ operations. In practice, the preprocessing prepares a secure 64-bit multiplication for 3 players in about 13 ms.
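To illustrate how a cheap online phase can consume material from a preprocessing phase, here is a minimal sketch of multiplication on additively shared values using a preshared multiplication triple (the standard Beaver-triple technique). This is an assumption about the general approach, not the protocol from the abstract: the modulus `P`, the trusted-dealer `make_triple`, and all function names are hypothetical, and the authentication needed for security against an active adversary is omitted.

```python
import secrets

P = 2**61 - 1  # illustrative prime modulus, not taken from the abstract


def share(x, n):
    """Split x into n additive shares modulo P."""
    parts = [secrets.randbelow(P) for _ in range(n - 1)]
    parts.append((x - sum(parts)) % P)
    return parts


def reconstruct(shares):
    """Open a shared value by summing all shares modulo P."""
    return sum(shares) % P


def make_triple(n):
    """Stand-in for preprocessing: a trusted dealer shares a, b and c = a*b."""
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    return share(a, n), share(b, n), share(a * b % P, n)


def multiply(xs, ys, triple):
    """Online phase: multiply shared x and y using one preprocessed triple."""
    a_sh, b_sh, c_sh = triple
    # Players open the masked values eps = x - a and delta = y - b.
    eps = reconstruct([(x - a) % P for x, a in zip(xs, a_sh)])
    delta = reconstruct([(y - b) % P for y, b in zip(ys, b_sh)])
    # Local linear work: x*y = c + eps*b + delta*a + eps*delta.
    zs = [(c + eps * b + delta * a) % P for a, b, c in zip(a_sh, b_sh, c_sh)]
    zs[0] = (zs[0] + eps * delta) % P  # the public term is added by one player
    return zs
```

In this sketch each player does a constant number of field operations per multiplication and opens two values, which is the kind of per-player cost the abstract describes for the online phase.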
TL;DR: It is demonstrated that the parallel SpGEMM methods, which use two-dimensional block data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory-efficient in the general case.
Abstract: Generalized sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. Here we show that SpGEMM also yields efficient algorithms for general sparse-matrix indexing in distributed memory, provided that the underlying SpGEMM implementation is sufficiently flexible and scalable. We demonstrate that our parallel SpGEMM methods, which use two-dimensional block data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory-efficient in the general case. This algorithm is the first to yield increasing speedup on an unbounded number of processors; our experiments show scaling up to thousands of processors in a variety of test scenarios.
TL;DR: The GH-FHE primitives for the small setting with a dimension of 2048 on NVIDIA C2050 GPU are implemented and the experimental results show the speedup factors of 7.68, 7.4 and 6.59 for encryption, decryption and recrypt respectively, when compared with the existing CPU implementation.
Abstract: As a major breakthrough, in 2009 Gentry introduced the first plausible construction of a fully homomorphic encryption (FHE) scheme. FHE allows the evaluation of arbitrary functions directly on encrypted data on untrusted servers. In 2010, Gentry and Halevi presented the first FHE implementation on an IBM x3500 server. However, this implementation remains impractical due to the high latency of encryption and recryption. The Gentry-Halevi (GH) FHE primitives utilize multi-million-bit modular multiplications and additions, which are time-consuming tasks for a general purpose computer. In the GH-FHE implementation, the most computationally intensive arithmetic operation is modular multiplication. In this paper, the million-bit modular multiplication is computed in two steps. For large number multiplication, Strassen's FFT based algorithm is employed and accelerated on a graphics processing unit (GPU) through its massive parallelism. Subsequently, Barrett modular reduction algorithm is applied to implement modular reduction. As an experimental study, we implement the GH-FHE primitives for the small setting with a dimension of 2048 on NVIDIA C2050 GPU. The experimental results show the speedup factors of 7.68, 7.4 and 6.59 for encryption, decryption and recrypt respectively, when compared with the existing CPU implementation.
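The two-step structure described above (a large multiplication followed by Barrett reduction) can be sketched as follows. This is a minimal illustration only: Python's built-in big-integer multiplication stands in for the GPU-accelerated FFT multiplication, and the function names and the choice of `k` are assumptions, not details from the paper.

```python
def barrett_setup(m):
    """Precompute k = bit length of m and mu = floor(4^k / m)."""
    k = m.bit_length()
    mu = (1 << (2 * k)) // m
    return k, mu


def barrett_reduce(x, m, k, mu):
    """Return x mod m for 0 <= x < m*m, using shifts and multiplications only."""
    q = (x * mu) >> (2 * k)  # estimate of floor(x / m), never too large
    r = x - q * m
    while r >= m:  # the estimate is off by a small amount at most
        r -= m
    return r


def mod_mul(a, b, m, k, mu):
    """Step 1: full product (stand-in for FFT multiplication). Step 2: Barrett."""
    return barrett_reduce(a * b, m, k, mu)
```

The precomputed `mu` depends only on the modulus, so it is computed once and reused for every reduction with the same modulus.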
TL;DR: This paper proposes an improved second order (ISO) algorithm for training radial basis function (RBF) networks that can normally reach smaller training/testing error with far fewer RBF units.
Abstract: This paper proposes an improved second order (ISO) algorithm for training radial basis function (RBF) networks. Besides the traditional parameters, including centers, widths and output weights, the input weights on the connections between input layer and hidden layer are also adjusted during the training process. More accurate results can be obtained by increasing variable dimensions. Initial centers are chosen from training patterns and other parameters are generated randomly in a limited range. Taking advantage of the fast convergence and powerful search ability of second order algorithms, the proposed ISO algorithm can normally reach smaller training/testing error with far fewer RBF units. During the computation process, the quasi Hessian matrix and gradient vector are accumulated as the sum of related sub matrices and vectors, respectively. Only one Jacobian row is stored and used for multiplication, instead of storing and multiplying the entire Jacobian matrix. Memory reduction benefits the computation speed and allows the training of problems with a basically unlimited number of patterns. Several practical discrete and continuous classification problems are applied to test the properties of the proposed ISO training algorithm.
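The row-by-row accumulation mentioned above can be sketched as follows, assuming the usual definitions: quasi Hessian Q = JᵀJ and gradient g = Jᵀe, each built up as a sum over patterns. The function name and argument layout are hypothetical, and the rest of the training algorithm is not shown.

```python
def accumulate_quasi_hessian(jacobian_rows, errors, n_params):
    """Accumulate Q = J^T J and g = J^T e one Jacobian row at a time.

    jacobian_rows may be any iterable (for example a generator), so only
    the current row needs to be held in memory.
    """
    Q = [[0.0] * n_params for _ in range(n_params)]
    g = [0.0] * n_params
    for j_row, e in zip(jacobian_rows, errors):
        for a in range(n_params):
            g[a] += j_row[a] * e  # contribution of this pattern to the gradient
            for b in range(n_params):
                Q[a][b] += j_row[a] * j_row[b]  # rank-one sub matrix j^T j
    return Q, g
```

Memory use is proportional to the square of the number of parameters and does not grow with the number of training patterns.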
TL;DR: A new algorithm for multiplying an $n\times n^k$ matrix by an $n^k\times n$ matrix, which is better than all known algorithms for rectangular matrix multiplication and, for $k=1$, recovers exactly the complexity of the algorithm by Coppersmith and Winograd.
Abstract: Let $\alpha$ be the maximal value such that the product of an $n\times n^\alpha$ matrix by an $n^\alpha\times n$ matrix can be computed with $n^{2+o(1)}$ arithmetic operations. In this paper we show that $\alpha>0.30298$, which improves the previous record $\alpha>0.29462$ by Coppersmith (Journal of Complexity, 1997). More generally, we construct a new algorithm for multiplying an $n\times n^k$ matrix by an $n^k\times n$ matrix, for any value $k\neq 1$. The complexity of this algorithm is better than all known algorithms for rectangular matrix multiplication. In the case of square matrix multiplication (i.e., for $k=1$), we recover exactly the complexity of the algorithm by Coppersmith and Winograd (Journal of Symbolic Computation, 1990). These new upper bounds can be used to improve the time complexity of several known algorithms that rely on rectangular matrix multiplication. For example, we directly obtain an $O(n^{2.5302})$-time algorithm for the all-pairs shortest paths problem over directed graphs with small integer weights, where $n$ denotes the number of vertices, and also improve the time complexity of sparse square matrix multiplication.
TL;DR: In this article, a new parallel algorithm based on Strassen's fast matrix multiplication algorithm is presented, which is communication-optimal and exhibits perfect strong scaling within the maximum possible range.
Abstract: Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA '11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n=94080, where the number of processors ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms.
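For reference, the serial Strassen recursion that such a parallel algorithm builds on can be sketched as below. This is only the textbook sequential scheme for square matrices whose dimension is a power of two; it does not model the parallelization or the communication pattern, and the helper names are hypothetical.

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def split(A):
    """Split a 2h x 2h matrix into four h x h quadrants."""
    h = len(A) // 2
    return ([r[:h] for r in A[:h]], [r[h:] for r in A[:h]],
            [r[:h] for r in A[h:]], [r[h:] for r in A[h:]])


def strassen(A, B):
    """Multiply square matrices (dimension a power of two) with 7 recursive products."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Each level replaces eight block products with seven, which is where the asymptotic advantage over the classical algorithm comes from.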
TL;DR: In this article, the authors contribute a perspective on mathematical embodied cognition consistent with a phenomenological understanding of perception and body motion, based on the analysis of 4 selected episodes in 1 session of an undergraduate mathematics class.
Abstract: In this article we contribute a perspective on mathematical embodied cognition consistent with a phenomenological understanding of perception and body motion. It is based on the analysis of 4 selected episodes in 1 session of an undergraduate mathematics class. The theme of this particular class session was the geometric interpretation of the addition and multiplication of complex numbers. On the basis of these episodes, the article examines 2 conjectures: (a) The mathematical insights developed by an individual or a group are expressed in and constituted by perceptuo-motor activity, and (b) the learning of mathematical ideas is shaped in nondeterministic ways by the setting or learning environment.
TL;DR: A new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication is obtained, and it exhibits perfect strong scaling within the maximum possible range.
Abstract: Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice.
A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA'11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range.
Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n=94080, where the number of nodes ranges from 49 to 7203.
Our parallelization approach generalizes to other fast matrix multiplication algorithms.
TL;DR: In this article, it is shown that α > 0.30298, which improves the previous record of α > 0.29462 by Coppersmith.
Abstract: Let \alpha be the maximal value such that the product of an n x n^{\alpha} matrix by an n^{\alpha} x n matrix can be computed with n^{2+o(1)} arithmetic operations. In this paper we show that \alpha>0.30298, which improves the previous record \alpha>0.29462 by Coppersmith (Journal of Complexity, 1997). More generally, we construct a new algorithm for multiplying an n x n^k matrix by an n^k x n matrix, for any value k \neq 1. The complexity of this algorithm is better than all known algorithms for rectangular matrix multiplication. In the case of square matrix multiplication (i.e., for k=1), we recover exactly the complexity of the algorithm by Coppersmith and Winograd (Journal of Symbolic Computation, 1990).
These new upper bounds can be used to improve the time complexity of several known algorithms that rely on rectangular matrix multiplication. For example, we directly obtain a O(n^{2.5302})-time algorithm for the all-pairs shortest paths problem over directed graphs with small integer weights, improving over the O(n^{2.575})-time algorithm by Zwick (JACM 2002), and also improve the time complexity of sparse square matrix multiplication.
TL;DR: This is the first FPGA implementation of point multiplication on binary Edwards and generalized Hessian curves represented by ω-coordinates, and it is demonstrated how parallelization in higher levels can be performed by full resource utilization of computing point addition and point-doubling formulas.
Abstract: Efficient implementation of point multiplication is crucial for elliptic curve cryptographic systems. This paper presents the implementation results of an elliptic curve crypto-processor over binary fields GF(2^m) on binary Edwards and generalized Hessian curves using Gaussian normal basis (GNB). We demonstrate how parallelization in higher levels can be performed by full resource utilization of computing point addition and point-doubling formulas for both binary Edwards and generalized Hessian curves. Then, we employ the ω-coordinate differential formulations for computing point multiplication. Using a lookup-table (LUT)-based pipelined and efficient digit-level GNB multiplier, we evaluate the LUT complexity and time-area tradeoffs of the proposed crypto-processor on an FPGA. We also compare the implementation results of point multiplication on these curves with the ones on the traditional binary generic curve. To the best of the authors' knowledge, this is the first FPGA implementation of point multiplication on binary Edwards and generalized Hessian curves represented by ω-coordinates.
TL;DR: An experimental analysis of the scalability, based on OpenCL descriptions of the proposed algorithms, suggests that further advantage can be obtained from the proposed RNS approach for GPUs and EC curves supported by underlying finite fields of smaller size, regarding implementations on general purpose multi-cores.
Abstract: Acceleration of cryptographic applications on massive parallel computing platforms, such as Graphic Processing Units (GPUs), becomes a real challenge concerning practical implementations. In this paper, we propose a parallel algorithm for Elliptic Curve (EC) point multiplication in order to compute EC cryptography on these platforms. The proposed approach relies on the usage of the Residue Number System (RNS) to extract parallelism on high-precision integer arithmetic. Results suggest a maximum throughput of 9827 EC multiplications per second and minimum latency of 29.2 ms for a 224-bit underlying field, in a commercial Nvidia 285 GTX GPU. Performances up to an order of magnitude better in latency and 122% in throughput are achieved regarding other approaches reported in the related art. An experimental analysis of the scalability, based on OpenCL descriptions of the proposed algorithms, suggests that further advantage can be obtained from the proposed RNS approach for GPUs and EC curves supported by underlying finite fields of smaller size, regarding implementations on general purpose multi-cores.
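The way a Residue Number System exposes parallelism can be sketched as follows: a large integer is represented by its residues modulo pairwise coprime moduli, and a multiplication becomes independent small multiplications, one per modulus. The moduli, names, and the CRT-based conversion below are illustrative assumptions and are not taken from the paper.

```python
from math import prod

# Pairwise coprime moduli chosen for illustration (251, 11*23, 3*5*17, 2^8).
MODULI = (251, 253, 255, 256)


def to_rns(x):
    """Represent x by its residues modulo each channel modulus."""
    return tuple(x % m for m in MODULI)


def rns_mul(a, b):
    """Multiply channel by channel; each channel is independent of the others."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))


def from_rns(residues):
    """Recover the integer (modulo the product of the moduli) via the CRT."""
    M = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the inverse mod m
    return total % M
```

The result is exact as long as the true product is smaller than the product of the moduli. On a GPU, each channel could be mapped to its own thread, which is the kind of parallelism the abstract refers to.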
TL;DR: This paper investigates the efficient computation of the Optimal-Ate pairing over Barreto-Naehrig curves in software at different security levels on ARM processors, exploiting state-of-the-art techniques and proposing new optimizations to speed up the computation in the tower field and curve arithmetic.
Abstract: As hardware capabilities increase, low-power devices such as smartphones represent a natural environment for the efficient implementation of cryptographic pairings. Few works in the literature have considered such platforms despite their growing importance in a post-PC world. In this paper, we investigate the efficient computation of the Optimal-Ate pairing over Barreto-Naehrig curves in software at different security levels on ARM processors. We exploit state-of-the-art techniques and propose new optimizations to speed up the computation in the tower field and curve arithmetic. In particular, we extend the concept of lazy reduction to inversion in extension fields, analyze an efficient alternative for the sparse multiplication used inside Miller’s algorithm and reduce further the cost of point/line evaluation formulas in affine and projective homogeneous coordinates. In addition, we study the efficiency of using M-type sextic twists in the pairing computation and carry out a detailed comparison between affine and projective coordinate systems. Our implementations on various mass-market smartphones and tablets significantly improve the state-of-the-art of pairing computation on ARM-powered devices, outperforming by at least a factor of 37 the best previous results in the literature.
TL;DR: This paper discusses efficient implementations of sparse matrix-vector multiplication on NVIDIA's Fermi architecture, the first to introduce conventional L1 caches to GPUs, and focuses on the compressed sparse row (CSR) format for developing general purpose code.
Abstract: Sparse matrix-vector multiplication is an integral part of many scientific algorithms. Several studies have shown that it is a bandwidth-limited operation on current hardware. On cache-based architectures the main factors that influence performance are spatial locality in accessing the matrix, and temporal locality in re-using the elements of the vector.
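A minimal serial sketch of sparse matrix-vector multiplication in the compressed sparse row (CSR) format shows where the two kinds of locality arise. The array names `values`, `col_idx`, and `row_ptr` follow common CSR convention and are assumptions here; this is not the GPU kernel discussed in the paper.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, stored row by row
    col_idx : column index of each nonzero
    row_ptr : row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    """
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # values and col_idx are read sequentially (spatial locality);
            # x is read at scattered positions, so reuse of its elements
            # across rows (temporal locality) depends on the sparsity pattern.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

Each nonzero is read once and used in a single multiply-add, which is consistent with the operation being bandwidth-limited.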
TL;DR: RPAG outperforms previous methods which are based on pipelining the solutions of conventional MCM algorithms and often produces better results compared to the prominent Hcub algorithm with minimal total AD constraint.
Abstract: This paper addresses the direct optimization of pipelined adder graphs (PAGs) for high speed multiple constant multiplication (MCM). The optimization opportunities are described and a definition of the pipelined multiple constant multiplication (PMCM) problem is given. It is shown that the PMCM problem is a generalization of the MCM problem with limited adder depth (AD). A novel algorithm to solve the PMCM problem heuristically, called RPAG, is presented. RPAG outperforms previous methods which are based on pipelining the solutions of conventional MCM algorithms. A flexible cost evaluation is used which enables the optimization for FPGA or ASIC targets on high or low abstraction levels. Results for both technologies are given and compared with the most recent methods. Even for the special case of limited AD it is shown that RPAG often produces better results compared to the prominent Hcub algorithm with minimal total AD constraint.
TL;DR: This paper presents a new method for parallel binary finite field multiplication which results in subquadratic space complexity and shows that block recombination can be used for efficient implementation of the GHASH function of Galois Counter Mode (GCM).
Abstract: In this paper, we present a new method for parallel binary finite field multiplication which results in subquadratic space complexity. The method is based on decomposing the building blocks of the Fan-Hasan subquadratic Toeplitz matrix-vector multiplier. We reduce the space complexity of their architecture by recombining the building blocks. In comparison to other similar schemes available in the literature, our proposal presents a better space complexity while having the same time complexity. We also show that block recombination can be used for efficient implementation of the GHASH function of Galois Counter Mode (GCM).
TL;DR: The results support the view that addition and multiplication facts are stored in an interrelated semantic network and that RIF of competing addition facts is an intrinsic process of multiplication fact retrieval.
Abstract: Retrieval-induced forgetting (RIF) is a widely studied phenomenon of human memory, but RIF of arithmetic facts remains relatively unexplored. In 2 experiments, we investigated RIF of simple addition facts (2 + 3 = 5) from practice of their multiplication counterparts (2 × 3 = 6). In both experiments, robust RIF expressed in response times occurred only for high-strength small-number addition facts with sums ≤ 10, indicating that RIF from multiplication practice was interference dependent. RIF of addition-fact memory was produced by multiplication retrieval (2 × 3 = ?) but not multiplication study (2 × 3 = 6), supporting an inhibitory mechanism of RIF in arithmetic memory. Finally, RIF occurred with multiplication practiced in word format (three × four) and addition tested later in digit format (3 + 4), which provides evidence that digit and written-word formats for arithmetic accessed a common semantic retrieval network. The results support the view that addition and multiplication facts are stored in an interrelated semantic network and that RIF of competing addition facts is an intrinsic process of multiplication fact retrieval.
TL;DR: In this article, a hardware optimized sparse matrix representation referred to as the Compressed Variable-Length Bit Vector (CVBV) format is used to take advantage of the capabilities of FPGAs and reduce storage and bandwidth requirements across the matrices.
Abstract: A universal single-bitstream FPGA library or ASIC implementation accelerates matrix-vector multiplication processing multiple matrix encodings including dense and multiple sparse formats. A hardware-optimized sparse matrix representation referred to herein as the Compressed Variable-Length Bit Vector (CVBV) format is used to take advantage of the capabilities of FPGAs and reduce storage and bandwidth requirements across the matrices compared to that typically achieved when using the Compressed Sparse Row (CSR) format in typical CPU- and GPU-based approaches. Also disclosed is a class of sparse matrix formats that are better suited for FPGA implementations than existing formats reducing storage and bandwidth requirements. A partitioned CVBV format is described to enable parallel decoding.
TL;DR: This paper investigated children's understanding of area measurement, including the concept of area and the area formula of a rectangle, as well as their strategic knowledge for solving area measurement problems, and found that the children who had a good understanding of the concepts of the area and formula exhibited competency in identifying geometric shapes, using formulas for determining areas, and self correcting mistakes.
TL;DR: The synthesis results show that the computation for calculating the 32-point Fast Fourier transform is efficient in terms of speed.
Abstract: The Fast Fourier Transform (FFT) is one of the fundamental operations in the field of digital signal and image processing. Important applications of the FFT include signal analysis, sound filtering, data compression, partial differential equations, multiplication of large integers, and image filtering. The FFT is an efficient implementation of the discrete Fourier transform (DFT). This paper concentrates on the development of an FFT based on the Decimation-In-Time (DIT) Radix-2 algorithm. The design is described in VHDL and synthesized with the Xilinx Synthesis Tool on a Virtex kit. The input of the FFT is given by a PS/2 keyboard using a test bench, and the output is displayed as waveforms in the Xilinx Design Suite 12.1. The synthesis results show that the computation of the 32-point Fast Fourier transform is efficient in terms of speed.
TL;DR: This article reviewed research on children and adults' use of shortcut procedures that make use of the inverse relation between multiplication and division on two kinds of problems: inversion problems and associativity problems.
Abstract: Researchers have speculated that children find it more difficult to acquire conceptual understanding of the inverse relation between multiplication and division than that between addition and subtraction. We reviewed research on children and adults’ use of shortcut procedures that make use of the inverse relation on two kinds of problems: inversion problems (e.g., \( {9} \times {24} \div {24} \)) and associativity problems (e.g., \( {9} \times {24} \div {8} \)). Both can be solved more easily if the division of the second and third numbers is performed before the multiplication of the first and second numbers. The findings we reviewed suggest that understanding and use of the inverse relation between multiplication and division develops relatively slowly and is difficult for both children and adults to implement in shortcut procedures if they are not flexible problem solvers. We use the findings to expand an existing model, highlight some similarities and differences in solvers’ use of conceptual knowledge across operations, and discuss educational implications of the findings.
TL;DR: One particular number system is developed, termed hyper-dual numbers, which produces exact first- and second-derivative information, which is demonstrated on an unstructured, parallel, unsteady Reynolds-Averaged Navier-Stokes solver.
Abstract: Automatic Differentiation techniques are typically derived based on the chain rule of differentiation. Other methods can be derived based on the inherent mathematical properties of generalized complex numbers that enable first-derivative information to be carried in the non-real part of the number. These methods are capable of producing effectively exact derivative values. However, when second-derivative information is desired, generalized complex numbers are not sufficient. Higher-dimensional extensions of generalized complex numbers, with multiple non-real parts, can produce accurate second-derivative information provided that multiplication is commutative. One particular number system is developed, termed hyper-dual numbers, which produces exact first- and second-derivative information. The accuracy of these calculations is demonstrated on an unstructured, parallel, unsteady Reynolds-Averaged Navier-Stokes solver.
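A minimal sketch of hyper-dual arithmetic, assuming the usual definition with two non-real units satisfying e1² = e2² = 0 and e1·e2 ≠ 0. Only addition and multiplication are implemented, so it applies to polynomial functions; the class and function names are hypothetical.

```python
class HyperDual:
    """Number of the form re + e1*eps1 + e2*eps2 + e12*eps1*eps2."""

    def __init__(self, re, e1=0.0, e2=0.0, e12=0.0):
        self.re, self.e1, self.e2, self.e12 = re, e1, e2, e12

    @staticmethod
    def _lift(v):
        return v if isinstance(v, HyperDual) else HyperDual(v)

    def __add__(self, other):
        o = self._lift(other)
        return HyperDual(self.re + o.re, self.e1 + o.e1,
                         self.e2 + o.e2, self.e12 + o.e12)

    __radd__ = __add__

    def __mul__(self, other):
        o = self._lift(other)
        # Terms containing eps1^2 or eps2^2 vanish; multiplication is commutative.
        return HyperDual(
            self.re * o.re,
            self.re * o.e1 + self.e1 * o.re,
            self.re * o.e2 + self.e2 * o.re,
            self.re * o.e12 + self.e1 * o.e2 + self.e2 * o.e1 + self.e12 * o.re,
        )

    __rmul__ = __mul__


def derivatives(f, x):
    """Return (f(x), f'(x), f''(x)) by evaluating f at x + eps1 + eps2."""
    r = f(HyperDual(x, 1.0, 1.0, 0.0))
    return r.re, r.e1, r.e12
```

For example, `derivatives(lambda x: x * x * x + 2 * x, 2.0)` evaluates the function and both derivatives in a single pass, with no step size and therefore no subtractive cancellation.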
TL;DR: In this paper, the authors introduce a deterministic win-stay-lose-shift rule into the spatial public goods game, according to which a player will change its current strategy only if its payoff is below a predefined aspiration level.
Abstract: We introduce a deterministic win-stay-lose-shift rule into the spatial public goods game, according to which a player will change its current strategy only if its payoff is below a predefined aspiration level. Simulation results on the square lattice and scale-free network indicate that the aspiration level greatly affects the evolution of cooperation. For small multiplication factors, the frequency of cooperation increases to 0.5 as the aspiration level increases. For large multiplication factors, intermediate levels of aspiration prove optimal for the successful evolution of public cooperation. Some qualitative analyses are provided to explain the above results. Besides, we have found that there exists a ping-pong vibration of cooperation at some specific values of multiplication factors and aspiration levels.
TL;DR: With the proposed “sloppy” operations, this work obtains a reduction in delay, area and power dissipation, and the error introduced is still acceptable for applications such as image processing.
Abstract: Sometimes reducing the precision of a numerical processor, by introducing errors, can lead to significant performance (delay, area and power dissipation) improvements without compromising the overall quality of the processing. In this work, we show how to perform the two basic operations, addition and multiplication, in an imprecise manner by simplifying the hardware implementation. With the proposed “sloppy” operations, we obtain a reduction in delay, area and power dissipation, and the error introduced is still acceptable for applications such as image processing.
TL;DR: A high speed 4x4 bit Vedic Multiplier (VM) based on the Vertically & Crosswise method of Vedic mathematics, a general multiplication formula equally applicable to all cases of multiplication, is presented.
Abstract: The need for high speed multipliers is increasing as the need for high speed processors increases. A multiplier is one of the key hardware blocks in most fast processing systems; it is not only a high delay block but also a major source of power dissipation. A conventional processor requires substantially more hardware resources and processing time for multiplication than for addition and subtraction. This paper presents a high speed 4x4 bit Vedic Multiplier (VM) based on the Vertically & Crosswise method of Vedic mathematics, a general multiplication formula equally applicable to all cases of multiplication. It is based on generating all partial products and their sum in one step. The coding is done in VHDL (Very High Speed Integrated Circuit Hardware Description Language), while the synthesis and simulation are done using the EDA (Electronic Design Automation) tool Xilinx ISE 12.1i. The combinational path delay of the 4x4 bit Vedic multiplier obtained after synthesis is compared with that of normal multipliers, and it is found that the proposed Vedic multiplier circuit seems to have better performance in
TL;DR: A new method to speed up $\mathbb{F}_p$-arithmetic in hardware for pairing-friendly curves, such as the well-known Barreto-Naehrig (BN) curves, using Montgomery reduction in a polynomial ring combined with a coefficient reduction phase using a pseudo-Mersenne number is described.
Abstract: This paper describes a new method to speed up $\mathbb{F}_p$-arithmetic in hardware for pairing-friendly curves, such as the well-known Barreto-Naehrig (BN) curves. We explore the characteristics of the modulus defined by these curves and choose curve parameters such that $\mathbb{F}_p$ multiplication becomes more efficient. The proposed algorithm uses Montgomery reduction in a polynomial ring combined with a coefficient reduction phase using a pseudo-Mersenne number. As an application, we show that the performance of pairings on BN curves in hardware can be significantly improved, resulting in a factor 2.5 speedup compared with state-of-the-art hardware implementations.
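As background for the reduction step, here is a minimal sketch of ordinary integer Montgomery multiplication. It is the plain integer variant, not the polynomial-ring version with pseudo-Mersenne coefficient reduction that the paper proposes; the function names and the choice R = 2^k are assumptions for illustration.

```python
def montgomery_setup(p, k):
    """For odd p < 2^k, return -p^{-1} mod R, where R = 2^k."""
    R = 1 << k
    return (-pow(p, -1, R)) % R


def redc(t, p, k, p_neg_inv):
    """Montgomery reduction: return t * R^{-1} mod p for 0 <= t < R * p."""
    mask = (1 << k) - 1
    m = ((t & mask) * p_neg_inv) & mask  # makes t + m*p divisible by R
    u = (t + m * p) >> k                 # exact division by R as a shift
    return u - p if u >= p else u


def mont_mul(a, b, p, k, p_neg_inv):
    """Return a * b mod p via the Montgomery domain (inputs reduced mod p)."""
    R = 1 << k
    a_bar, b_bar = (a * R) % p, (b * R) % p          # map into Montgomery form
    prod_bar = redc(a_bar * b_bar, p, k, p_neg_inv)  # equals a*b*R mod p
    return redc(prod_bar, p, k, p_neg_inv)           # map back to a*b mod p
```

In a long computation the operands stay in Montgomery form, so the two conversions in `mont_mul` are paid once rather than per multiplication, and each reduction needs only multiplications, masks, and shifts.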
TL;DR: In this paper, the author generalizes one divisibility of Perrin-Riou's Iwasawa main conjecture for Heegner points on elliptic curves to abelian varieties of $\mathrm{GL}_2$-type.
Abstract: In an earlier paper the author proved one divisibility of Perrin-Riou's Iwasawa main conjecture for Heegner points on elliptic curves. In the present paper, that result is generalized to abelian varieties of $\mathrm{GL}_2$-type (i.e. abelian varieties with real multiplication defined over totally real fields) under the hypothesis that the abelian variety is associated to a Hilbert modular form via a construction of Zhang.
TL;DR: A new sparse matrix storage format that takes advantage of the diagonal structure of matrices for stencil operations on structured grids and is specifically optimized for the case of higher degrees of freedom, where formats such as DIA are forced to explicitly represent many zero elements in the sparse matrix.
Abstract: In this paper, we address efficient sparse matrix-vector multiplication for matrices arising from structured grid problems with high degrees of freedom at each grid node. Sparse matrix-vector multiplication is a critical step in the iterative solution of sparse linear systems of equations arising in the solution of partial differential equations using uniform grids for discretization. With uniform grids, the resulting linear system Ax = b has a matrix A that is sparse with a very regular structure. The specific focus of this paper is on sparse matrices that have a block structure due to the large number of unknowns at each grid point. Sparse matrix storage formats such as Compressed Sparse Row (CSR) and Diagonal format (DIA) are not the most effective for such matrices. In this work, we present a new sparse matrix storage format that takes advantage of the diagonal structure of matrices for stencil operations on structured grids. Unlike other formats such as the Diagonal storage format (DIA), we specifically optimize for the case of higher degrees of freedom, where formats such as DIA are forced to explicitly represent many zero elements in the sparse matrix. We develop efficient sparse matrix-vector multiplication for structured grid computations on GPU architectures using CUDA [25].
TL;DR: This paper proposes a novel method, i.e., “consecutive operand caching”, which reduces the number of required load instructions by caching the operands and boosts the speed of multi-precision multiplication by 3.85%, as compared to previous best known results.
Abstract: In this paper, we revisit the “operand caching” method for multi-precision multiplication, which reduces the number of required load instructions by caching the operands [6]. With the previous method, we can achieve high performance in terms of multiplication speed with modern micro-processors. However, this method does not provide full operand caching when changing the row of partial products. To overcome this problem, we propose a novel method, i.e., “consecutive operand caching”. We divide partial products and reconstruct them yielding common operands between previous and new partial products. Finally, we reduce the number of load instructions and boost the speed of multi-precision multiplication by 3.85%, as compared to previous best known results.
TL;DR: Goren and Kassaei obtain new results on the geometry of Hilbert modular varieties in positive characteristic and use them to develop a theory of canonical subgroups for abelian varieties with real multiplication.
Abstract: We obtain new results on the geometry of Hilbert modular varieties in positive characteristic and morphisms between them. Using these results and methods of rigid geometry, we develop a theory of canonical subgroups for abelian varieties with real multiplication. To cite this article: E.Z. Goren, P.L Kassaei, C. R. Acad. Sci. Paris, Ser. I 347 (2009).
TL;DR: In this paper, a truncated version of the classical Fast Fourier Transform (TFT) was introduced for polynomial multiplication with real coefficients, which has the nice property of eliminating the jumps in the complexity at powers of two.
Abstract: In a previous paper [vdH04], we introduced a truncated version of the classical Fast Fourier Transform. When applied to polynomial multiplication, this algorithm has the nice property of eliminating the “jumps” in the complexity at powers of two. When applied to the multiplication of multivariate polynomials or truncated multivariate power series, a non-trivial asymptotic factor was gained with respect to the best previously known algorithms. In the present note, we correct two errors which slipped into the previous paper and we give a new application to the multiplication of polynomials with real coefficients. We also give some further hints on how to implement the TFT in practice.