TL;DR: The basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library are presented.
Abstract: We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with multilevel memories. A simple but effective algorithm for executing this operation results. Implementations on a broad selection of architectures are shown to achieve near-peak performance.
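As a rough illustration of why multilevel memories matter for this operation, the following is a minimal sketch of generic loop blocking in pure Python. It is not the GotoBLAS algorithm (which packs panels of the operands and uses a specific loop ordering and hand-tuned inner kernel); the function name `blocked_matmul` and the `block` parameter are hypothetical choices made for this example only.

```python
def blocked_matmul(A, B, block=2):
    """Multiply dense matrices (lists of rows), visiting one tile at a time.

    Generic cache-blocking sketch; `block` is an illustrative tile size.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):          # tile of rows of A and C
        for p0 in range(0, k, block):      # tile of the shared dimension
            for j0 in range(0, n, block):  # tile of columns of B and C
                for i in range(i0, min(i0 + block, m)):
                    for p in range(p0, min(p0 + block, k)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + block, n)):
                            C[i][j] += a * B[p][j]
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
result = blocked_matmul(A, B)
```

The idea behind the tiling is that each small tile of the operands is reused several times while it is still resident in a fast level of memory.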
TL;DR: It is shown that proving exponential lower bounds on depth four arithmetic circuits implies exponential lower bounds for unrestricted depth arithmetic circuits, and that for exponential sized circuits additional depth beyond four does not help.
Abstract: We show that proving exponential lower bounds on depth four arithmetic circuits implies exponential lower bounds for unrestricted depth arithmetic circuits. In other words, for exponential sized circuits additional depth beyond four does not help. We then show that a complete black-box derandomization of the identity testing problem for depth four circuits with multiplication gates of small fanin implies a nearly complete derandomization of general identity testing.
TL;DR: This paper develops and analyzes two new algorithms that scale significantly better than existing kernels on the multiplication of sparse matrices (SpGEMM) and considers their algorithms first as the sequential kernel of a scalable parallel sparse matrix multiplication algorithm and second as part of a polyalgorithm that would execute different kernels depending on the sparsity of the input matrices.
Abstract: Multicore processors are marking the beginning of a new era of computing where massive parallelism is available and necessary. Slightly slower but easy to parallelize kernels are becoming more valuable than sequentially faster kernels that are unscalable when parallelized. In this paper, we focus on the multiplication of sparse matrices (SpGEMM). We first present the issues with existing sparse matrix representations and multiplication algorithms that make them unscalable to thousands of processors. Then, we develop and analyze two new algorithms that overcome these limitations. We consider our algorithms first as the sequential kernel of a scalable parallel sparse matrix multiplication algorithm and second as part of a polyalgorithm for SpGEMM that would execute different kernels depending on the sparsity of the input matrices. Such a sequential kernel requires a new data structure that exploits the hypersparsity of the individual submatrices owned by a single processor after the 2D partitioning. We experimentally evaluate the performance and characteristics of our algorithms and show that they scale significantly better than existing kernels.
TL;DR: This paper presents an algorithm for calculating a faithful rounding of a vector of floating-point numbers, which adapts to the condition number of the sum, and proves certain constants used in the algorithm to be optimal.
Abstract: Given a vector of floating-point numbers with exact sum $s$, we present an algorithm for calculating a faithful rounding of $s$, i.e., the result is one of the immediate floating-point neighbors of $s$. If the sum $s$ is a floating-point number, we prove that this is the result of our algorithm. The algorithm adapts to the condition number of the sum, i.e., it is fast for mildly conditioned sums with slowly increasing computing time proportional to the logarithm of the condition number. All statements are also true in the presence of underflow. The algorithm does not depend on the exponent range. Our algorithm is fast in terms of measured computing time because it allows good instruction-level parallelism, it neither requires special operations such as access to mantissa or exponent, it contains no branch in the inner loop, nor does it require some extra precision: The only operations used are standard floating-point addition, subtraction, and multiplication in one working precision, for example, double precision. Certain constants used in the algorithm are proved to be optimal.
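To make the "addition and subtraction only" claim more concrete, here is a small sketch of how a floating-point number can be split into a high-order part and a remainder without touching its mantissa or exponent. This is an illustration under stated assumptions (IEEE 754 double precision, round-to-nearest, `sigma` a power of two no smaller than `|p|`); the name `extract_scalar` and the chosen values are hypothetical, and the sketch is not claimed to be the routine used in the paper.

```python
from fractions import Fraction


def extract_scalar(sigma, p):
    """Split p into q (a multiple of ulp(sigma)) and a small remainder r."""
    q = (sigma + p) - sigma  # rounding sigma + p drops the low-order bits of p
    r = p - q                # the bits that were dropped
    return q, r


sigma = 2.0 ** 10            # hypothetical choice for this example
p = 0.1
q, r = extract_scalar(sigma, p)

# Check with exact rational arithmetic that nothing was lost in this instance.
exact_split = Fraction(q) + Fraction(r) == Fraction(p)
```

High-order parts produced this way share a common grid, so they can be added without rounding error as long as the sum stays within the working precision.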
TL;DR: This paper details the design of a new high-speed pipelined application-specific instruction set processor (ASIP) for elliptic curve cryptography (ECC) using field-programmable gate-array (FPGA) technology.
Abstract: This paper details the design of a new high-speed pipelined application-specific instruction set processor (ASIP) for elliptic curve cryptography (ECC) using field-programmable gate-array (FPGA) technology. Different levels of pipelining were applied to the data path to explore the resulting performances and find an optimal pipeline depth. Three complex instructions were used to reduce the latency by reducing the overall number of instructions, and a new combined algorithm was developed to perform point doubling and point addition using the application specific instructions. An implementation for the United States Government National Institute of Standards and Technology-recommended curve over $\mathrm{GF}(2^{163})$ is shown, which achieves a point multiplication time of 33.05 μs at 91 MHz on a Xilinx Virtex-E FPGA, the fastest figure reported in the literature to date. Using the more modern Xilinx Virtex-4 technology, a point multiplication time of 19.55 μs was achieved, which translates to over 51120 point multiplications per second.
TL;DR: This survey explains how some useful arithmetic operations can be sped up from quadratic time to essentially linear time.
Abstract: This survey explains how some useful arithmetic operations can be sped up from quadratic time to essentially linear time.
TL;DR: It is shown that sparse algorithms are not as scalable as their dense counterparts, because in general, there are not enough non-trivial arithmetic operations to hide the communication costs as well as the sparsity overheads.
Abstract: We identify the challenges that are special to parallel sparse matrix-matrix multiplication (PSpGEMM). We show that sparse algorithms are not as scalable as their dense counterparts, because in general, there are not enough non-trivial arithmetic operations to hide the communication costs as well as the sparsity overheads. We analyze the scalability of 1D and 2D algorithms for PSpGEMM. While the 1D algorithm is a variant of existing implementations, 2D algorithms presented are completely novel. Most of these algorithms are based on the previous research on parallel dense matrix multiplication. We also provide results from preliminary experiments with 2D algorithms.
TL;DR: A reduced-bit multiplication algorithm based on the ancient Vedic multiplication formulae, Urdhva tiryakbhyam and Nikhilam, is proposed and is further optimized by use of some general arithmetic operations such as expansion and bit-shifting to take advantage of bit-reduction in multiplication.
Abstract: A reduced-bit multiplication algorithm based on the ancient Vedic multiplication formulae is proposed in this paper. Both the Vedic multiplication formulae, Urdhva tiryakbhyam and Nikhilam, are first discussed in detail. Urdhva tiryakbhyam, being a general multiplication formula, is equally applicable to all cases of multiplication. It is applied to the digital arithmetic and is shown to yield a multiplier architecture which is very similar to the popular array multiplier. Due to its structure, it leads to a high carry propagation delay in case of multiplication of large numbers. Nikhilam Sutra, on the other hand, is more efficient in the multiplication of large numbers as it reduces the multiplication of two large numbers to that of two smaller numbers. The framework of the proposed algorithm is taken from this Sutra and is further optimized by use of some general arithmetic operations such as expansion and bit-shifting to take advantage of bit-reduction in multiplication. We illustrate the proposed algorithm by reducing a general 4×4-bit multiplication to a single 2×2-bit multiplication operation.
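The reduction that Nikhilam performs can be sketched with the identity a·b = B·(a − (B − b)) + (B − a)·(B − b), where B is a base close to both operands. The code below is a minimal illustration assuming B = 2^n, so that the multiplication by B is a shift; the function name `nikhilam_multiply` is hypothetical, and the sketch does not include the expansion and bit-reduction steps of the proposed algorithm.

```python
def nikhilam_multiply(a, b, n_bits):
    """Multiply a and b via their deficits from the base 2**n_bits."""
    base = 1 << n_bits
    deficit_a = base - a             # how far a falls short of the base
    deficit_b = base - b
    left = a - deficit_b             # equals b - deficit_a
    # Only the two (small) deficits are multiplied; the rest is shift and add.
    return (left << n_bits) + deficit_a * deficit_b


# 14 * 13 with a 4-bit base of 16: the deficits are 2 and 3.
product = nikhilam_multiply(14, 13, 4)
```

When both operands are close to the base, the deficits need far fewer bits than the operands themselves, which is where the saving comes from.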
TL;DR: In this article, the authors propose a new architecture of multiplier-and-accumulator (MAC) for high speed multiplication and accumulation arithmetic; performance is improved by combining multiplication with accumulation and devising a hybrid type of carry save adder (CSA).
Abstract: In this paper, we propose a new architecture of multiplier-and-accumulator (MAC) for high speed multiplication and accumulation arithmetic. By combining multiplication with accumulation and devising a hybrid type of carry save adder (CSA), the performance was improved. Since the accumulator, which has the largest delay in the MAC, was removed and its function was included in the CSA, the overall performance is improved. The proposed CSA tree uses the 1's complement-based radix-2 modified Booth algorithm (MBA) and has a modified array for the sign extension in order to increase the bit density of operands. The CSA propagates the carries by the least significant bits of the partial products and generates the least significant bits in advance to decrease the number of input bits of the final adder. Also, the proposed MAC accumulates the intermediate results as sum and carry bits rather than as the output of the final adder, improving performance by optimizing the efficiency of the pipeline scheme. The proposed architecture was synthesized with and 90nm standard CMOS library after designing it. We analyzed results such as hardware resources, delay, and pipeline, based on theoretical and experimental estimation. We used Sakurai's alpha power law for the delay modeling. The proposed MAC is superior to the standard design in many ways, and its performance is twice that of the previous research at a similar clock frequency.
TL;DR: An innovative methodology for accelerating the elliptic curve point formulae over prime fields using the substitution of multiplication with squaring and other cheaper operations, by exploiting the fact that field squaring is generally less costly than multiplication.
Abstract: We present an innovative methodology for accelerating the elliptic curve point formulae over prime fields. This flexible technique uses the substitution of multiplication with squaring and other cheaper operations, by exploiting the fact that field squaring is generally less costly than multiplication. Applying this substitution to the traditional formulae, we obtain faster point operations in unprotected sequential implementations. We show the significant impact our methodology has in protecting against Simple Side-Channel Attacks (SSCA). We modify the ECC point formulae to achieve a faster atomic structure when applying atomicity side-channel protection. In contrast to previous atomic operations that assumed squarings are undistinguishable from multiplications, our new atomic structure offers true SSCA-protection because it includes squaring in its formulation. We also extend our implementation to parallel architectures such as SIMD (Single-Instruction Multiple-Data). With the introduction of a new coordinate system and with the flexibility of our methodology, we present, to our knowledge, the fastest formulae for SIMD-based schemes that are capable of executing 3 and 4 operations simultaneously. Finally, a new parallel SSCA-protected scheme is proposed for multiprocessor/parallel architectures by applying the atomic structure presented in this work. Our parallel and atomic operations are shown to be significantly faster than previous implementations.
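One well-known identity of the kind described here trades a general field multiplication for squarings and additions: 2ab = (a + b)² − a² − b². The following is a minimal sketch over a small prime field; the helper names and the modulus 101 are hypothetical, and the sketch does not reproduce the paper's point formulae or its atomic structure.

```python
def field_square(x, p):
    """Squaring in the prime field of order p."""
    return (x * x) % p


def double_product_via_squarings(a, b, p):
    """Return 2*a*b mod p using only squarings, additions and subtractions."""
    return (field_square(a + b, p) - field_square(a, p) - field_square(b, p)) % p


p = 101  # toy modulus chosen only for the example
value = double_product_via_squarings(17, 29, p)
```

The substitution pays off when a² and b² are already needed elsewhere in the same formula, so that one multiplication is effectively replaced by one squaring.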
TL;DR: The performance, including the data transfer between CPU and GPU, of Jacobi's iterative method for solving linear equations shows that the proposed algorithm is practical for some real applications.
Abstract: Recently, GPUs have acquired programmability to perform general purpose computation fast by running tens of thousands of threads concurrently. This paper presents a new algorithm for dense matrix-vector multiplication on NVIDIA CUDA architecture. The experimental results on GeForce 8800GTX show that the proposed algorithm runs up to 15.69 (resp., 32.88) times faster than the sgemv routine in NVIDIA's BLAS library CUBLAS 1.1 (resp., Intel Math Kernel Library 9.1 on one core of a 2.0 GHz Intel Xeon E5335 CPU with SSE3 SIMD instructions) for matrices with order 16 to 12800. The performance, including the data transfer between CPU and GPU, of Jacobi's iterative method for solving linear equations shows that the proposed algorithm is practical for some real applications.
TL;DR: An efficient bit-level-pipelined bit-parallel systolic design for binary field multiplication which requires fewer gates and registers and involves nearly half the time-complexity of the corresponding existing design is derived.
Abstract: Novel systolic and super-systolic architectures are presented for polynomial basis multiplication over $\mathrm{GF}(2^m)$ based on irreducible trinomials. By suitable cut-set retiming, we have derived here an efficient bit-level-pipelined bit-parallel systolic design for binary field multiplication which requires fewer gates and registers and involves nearly half the time-complexity of the corresponding existing design. We have also suggested a digit-level-pipelined design, which involves lower latency and fewer registers compared with the bit-level-pipelined structure. Moreover, we have proposed a super-systolic design consisting of a set of systolic arrays in a systolic-pipeline and a pipelined systolic-block design consisting of pipelined blocks of concurrent systolic arrays. The super-systolic designs have the same average computation time and the same critical path as the proposed bit-level-pipelined design, but can be used to reduce the latency by a factor $O(\sqrt{m})$ at the cost of marginally higher number of XOR gates and bit-registers. The hardware complexities of proposed super-systolic designs are nearly three times that of the existing bit-parallel structures, but offer very high throughput compared with the others for large values of m. For the field orders m = 233 and m = 409, the proposed structures offer, respectively, ten and eleven times more throughput than the others.
TL;DR: The ALFG presented, offers functions beyond a normal ALU such as shift, parity, XOR, rotate and multiplication along with basic operations.
Abstract: In this work, we make use of a new computing paradigm, quantum cellular automata (QCA), to describe the design and layout of an arithmetic and logical function generator (ALFG). The ALFG is made up of several blocks whose outputs are multiplexed for an intended functionality. The design of the ALFG was made using the basic elements of the technology, such as the inverter and the majority voter. The layout and simulation results are presented using the QCADesigner tool. A total of 16 arithmetic and logical operations can be performed on a pair of 4-bit vectors. The ALFG presented offers functions beyond a normal ALU, such as shift, parity, XOR, rotate and multiplication, along with basic operations. The total area consumed by the ALFG is 11.37 $\mu\mathrm{m}^2$, with the aid of 9 clocks to give the final output from the generator. The building blocks of the ALFG can be added into a 4-bit processor in order to make it feasible for additional applications.
TL;DR: This work presents tile assembly model systems that factor numbers nondeterministically using $\Theta(1)$ distinct components and explores bounds on the probability of such a nondeterministic system succeeding and proves that the probability can be made arbitrarily close to 1.
TL;DR: In this article, a hardware accelerator based on a unified arithmetic operator able to perform the operations required by a given algorithm is proposed, together with a compact coprocessor for the field $\mathbb{F}_{3^{97}}$ given by $\mathbb{F}_3[x]/(x^{97}+x^{12}+2)$.
Abstract: Since their introduction in constructive cryptographic applications, pairings over (hyper)elliptic curves are at the heart of an ever increasing number of protocols. With software implementations being rather slow, the study of hardware architectures became an active research area. In this paper, we discuss several algorithms to compute the $\eta_T$ pairing in characteristic three and suggest further improvements. These algorithms involve addition, multiplication, cubing, inversion, and sometimes cube root extraction over $\mathbb{F}_{3^m}$. We propose a hardware accelerator based on a unified arithmetic operator able to perform the operations required by a given algorithm. We describe the implementation of a compact coprocessor for the field $\mathbb{F}_{3^{97}}$ given by $\mathbb{F}_3[x]/(x^{97}+x^{12}+2)$, which compares favorably with other solutions described in the open literature.
TL;DR: Galois field implementations are central to the design of many reliable and secure systems, and preferred approaches limit most applications to performing computation in either $\mathrm{GF}(2^8)$ or $\mathrm{GF}(2^{16})$.
TL;DR: By designing the signal to be centrally symmetric in the time-domain, the symmetric samples can be added prior to multiplication with the replica sample, thereby saving one multiplication per two received samples.
Abstract: In this paper, we describe the signal design for primary synchronization channel adopted for E-UTRA cellular system. By designing the signal to be centrally symmetric in the time-domain, the symmetric samples can be added prior to multiplication with the replica sample, thereby saving one multiplication per two received samples. Furthermore, a pair of centrally symmetric P-SCH signals can be generated such that one signal is a complex conjugated version of the other signal. The multiplication complexity of detecting both these signals can be reduced to that of just one of the signals.
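The saving from central symmetry can be sketched as follows: when the replica satisfies `replica[i] == replica[n - 1 - i]`, the two received samples that share a coefficient are added first and multiplied once. This is a minimal illustration with hypothetical function names and made-up sample values; it assumes the replica coefficients are already in the form used for multiplication (for example, already conjugated) and that symmetry is taken about the midpoint of the sequence.

```python
def correlate_direct(received, replica):
    """One multiplication per received sample."""
    return sum(r * c for r, c in zip(received, replica))


def correlate_symmetric(received, replica):
    """Add mirrored samples first: one multiplication per two samples."""
    n = len(replica)
    half = n // 2
    total = sum((received[i] + received[n - 1 - i]) * replica[i]
                for i in range(half))
    if n % 2:  # an odd-length replica has an unpaired middle coefficient
        total += received[half] * replica[half]
    return total


replica = [1 + 2j, 3 - 1j, 3 - 1j, 1 + 2j]   # centrally symmetric by construction
received = [2 + 1j, -1 + 4j, 5j, 3 - 2j]     # arbitrary example samples
direct = correlate_direct(received, replica)
folded = correlate_symmetric(received, replica)
```

Both functions return the same correlation value; the folded version performs half as many complex multiplications for an even-length replica.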
TL;DR: TinyECCK (Tiny Elliptic Curve Cryptosystem with Koblitz curve – a kind of TinyOS package supporting elliptic curve operations) is presented, which is the first implementation of a Koblitz curve on sensor motes as far as the authors know; comparisons with existing software implementations of ECC show that it outperforms them in terms of running time, code size and supporting services.
Abstract: In this paper, we revisit a generally accepted opinion: implementing Elliptic Curve Cryptosystem (ECC) over $\mathrm{GF}(2^m)$ on sensor motes using small word size is not appropriate because XOR multiplication over $\mathrm{GF}(2^m)$ is not efficiently supported by current low-powered microprocessors. Although there are some implementations over $\mathrm{GF}(2^m)$ on sensor motes, their performance is not satisfactory enough for use in wireless sensor networks (WSNs). We have found that a field multiplication over $\mathrm{GF}(2^m)$ involves a number of redundant memory accesses and that its inefficiency originates from this problem. Moreover, the field reduction process also requires many redundant memory accesses. Therefore, we propose some techniques for reducing unnecessary memory accesses. With the proposed strategies, the running time of field multiplication and reduction over $\mathrm{GF}(2^{163})$ can be decreased by 21.1% and 24.7%, respectively. These savings noticeably decrease execution times spent in Elliptic Curve Digital Signature Algorithm (ECDSA) operations (signing and verification) by around 15–19%. We present TinyECCK (Tiny Elliptic Curve Cryptosystem with Koblitz curve – a kind of TinyOS package supporting elliptic curve operations), which is the first implementation of a Koblitz curve on sensor motes as far as we know. Through comparisons with existing software implementations of ECC built in C or a hybrid of C and inline assembly on sensor motes, we show that TinyECCK outperforms them in terms of running time, code size and supporting services. Furthermore, we show that a field multiplication over $\mathrm{GF}(2^m)$ can be faster than that over $\mathrm{GF}(p)$ on an 8-bit Atmega128 processor by comparing TinyECCK with TinyECC, a well-known ECC implementation over $\mathrm{GF}(p)$. TinyECCK with sect163k1 can generate a signature and verify it in 1.37 s and 2.32 s, respectively, on a Micaz mote with 13,748 bytes of ROM and 1,004 bytes of RAM.
TL;DR: A review of the current state-of-the-art is presented with emphasis on Schottky diode based frequency multiplier technology in this article, which has enabled sensitive heterodyne receivers in the terahertz range that will be flown on an upcoming ESA mission to answer fundamental questions about our universe.
Abstract: Compact, robust and broadband sources in the terahertz range are extremely important in diverse applications such as spectroscopy, imaging, communication, and radar. A review of the current state-of-the-art is presented with emphasis on Schottky diode based frequency multiplier technology. Frequency multiplier circuit chips fabricated on few micrometer thick GaAs membranes and packaged in low-loss waveguide circuits have demonstrated tens of microwatts of output power up to 1.9 THz. This breakthrough has enabled sensitive heterodyne receivers in the terahertz range that will be flown on an upcoming ESA mission to answer fundamental questions about our universe. Index Terms – THz technology, frequency multipliers, sources, Schottky diode, waveguide, GaAs membrane * This work was supported by the Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA and the Observatoire de Paris, Laboratoire d’Etude du Rayonnement et de la Matière en Astrophysique, France.
TL;DR: In this article, the authors give an $O(N \log N \cdot 2^{O(\log^* N)})$ algorithm for multiplication of two N-bit integers using polynomial multiplication.
Abstract: We give an $O(N \log N \cdot 2^{O(\log^* N)})$ algorithm for multiplying two N-bit integers that improves the $O(N \log N \log\log N)$ algorithm by Schönhage-Strassen. Both these algorithms use modular arithmetic. Recently, Fürer gave an $O(N \log N \cdot 2^{O(\log^* N)})$ algorithm which however uses arithmetic over complex numbers as opposed to modular arithmetic. In this paper, we use multivariate polynomial multiplication along with ideas from Fürer's algorithm to achieve this improvement in the modular setting. Our algorithm can also be viewed as a p-adic version of Fürer's algorithm. Thus, we show that the two seemingly different approaches to integer multiplication, modular and complex arithmetic, are similar.
TL;DR: In this paper, the authors propose adaptive runtime tuning mechanisms to improve the parallel performance of sparse matrix-vector (SpMV) multiplication, a widely used kernel in scientific applications, on distributed memory systems.
Abstract: Sparse matrix-vector (SpMV) multiplication is a widely used kernel in scientific applications. In these applications, the SpMV multiplication is usually deeply nested within multiple loops and thus executed a large number of times. We have observed that there can be significant performance variability, due to irregular memory access patterns. Static performance optimizations are difficult because the patterns may be known only at runtime. In this paper, we propose adaptive runtime tuning mechanisms to improve the parallel performance on distributed memory systems. Our adaptive iteration-to-process mapping mechanism balances computational load at runtime with negligible overhead (1% on average), and our runtime communication selection algorithm searches for the best communication method for a given data distribution and mapping. Actual runs on 26 real matrices show that our runtime tuning system reduces execution time up to 68.8% (30.9% on average) over a base block-distributed parallel algorithm on distributed systems with 32 nodes.
TL;DR: In this article, the authors introduce the concept of nilpotent submodules and prove that a faithful multiplication module is von Neumann regular if and only if it has no nonzero nilpotent elements and its Krull dimension is zero.
Abstract: All rings are commutative with identity, and all modules are unital. The purpose of this article is to investigate multiplication von Neumann regular modules. For this reason we introduce the concept of nilpotent submodules generalizing nilpotent ideals and then prove that a faithful multiplication module is von Neumann regular if and only if it has no nonzero nilpotent elements and its Krull dimension is zero. We also give a new characterization for the radical of a submodule of a multiplication module and show in particular that the radical of any submodule of a Noetherian multiplication module is a finite intersection of prime submodules.
TL;DR: This work introduces a novel paradigm for the construction of ramp schemes with strong multiplication that allows the secret to be chosen in an extension field, whereas the shares lie in a base field, which leads to a ramp scheme with strong multiplication from which protocols can be constructed for atomic secure multiplication with communication equal to a linear number of field elements in the size of the network.
Abstract: In this work we introduce a novel paradigm for the construction of ramp schemes with strong multiplication that allows the secret to be chosen in an extension field, whereas the shares lie in a base field. When applied to the setting of Shamir's scheme, for example, this leads to a ramp scheme with strong multiplication from which protocols can be constructed for atomic secure multiplication with communication equal to a linear number of field elements in the size of the network.
This is also achieved by the results from Cramer, Damgaard and de Haan from EUROCRYPT 2007. However, our new ramp scheme has an improved privacy bound that is essentially optimal and leads to a significant mathematical simplification of the earlier results on atomic secure multiplication.
As a result, by considering high degree rational points on algebraic curves, this can now be generalized to algebraic geometric ramp schemes with strong multiplication over a constant size field, which in turn leads to low communication atomic secure multiplication where the base field can now be taken constant, as opposed to earlier work.
TL;DR: Evaluation of two compression schemes for sparse matrix-vector multiplication kernel showed that they can significantly improve the performance of the multithreaded version of the kernel and achieve good scalability for large matrices.
Abstract: The sparse matrix-vector multiplication kernel exhibits limited potential for taking advantage of modern shared memory architectures due to its large memory bandwidth requirements. To decrease memory contention and improve the performance of the kernel we propose two compression schemes. The first, called CSR-DU, targets the reduction of the matrix structural data by applying coarse grain delta encoding for the column indices. The second scheme, called CSR-VI, targets the reduction of the numerical values using indirect indexing and can only be applied to matrices which contain a small number of unique values. Evaluation of both methods on a rich matrix set showed that they can significantly improve the performance of the multithreaded version of the kernel and achieve good scalability for large matrices.
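The idea of shrinking CSR structural data by storing column indices as differences can be sketched as follows. This is a deliberately simplified illustration with hypothetical function names: it delta-encodes one row's column indices and decodes them again, and it leaves out the coarse-grain units and variable-width storage that CSR-DU uses.

```python
def delta_encode_row(column_indices):
    """Store each column index as its distance from the previous one."""
    deltas = []
    previous = 0
    for col in column_indices:
        deltas.append(col - previous)
        previous = col
    return deltas


def delta_decode_row(deltas):
    """Recover the original column indices by a running sum."""
    columns = []
    current = 0
    for d in deltas:
        current += d
        columns.append(current)
    return columns


row_columns = [1000, 1001, 1005, 1030]   # sorted column indices of one row
encoded = delta_encode_row(row_columns)
decoded = delta_decode_row(encoded)
```

After the first entry, the deltas in this example fit in a single byte each, whereas the original indices need at least two bytes; fewer bytes of index data means less memory traffic per nonzero.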
TL;DR: It is shown that Q is interpretable in Q^-.
Abstract: Q^- is a weaker variant of Robinson arithmetic Q in which addition and multiplication are partial functions, i.e. ternary relations that are graphs of possibly non-total functions. We show that Q is interpretable in Q^-. This gives an alternative answer to a question of A. Grzegorczyk whether Q^- is essentially undecidable.
TL;DR: It is demonstrated that, based on a checksum relationship maintained in the middle of the computation, fail-stop process failures in the outer product version of matrix-matrix multiplication can be tolerated without checkpointing or message logging.
Abstract: It has been proved in previous algorithm-based fault tolerance that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is used. However, whether this checksum relationship can be maintained in the middle of the computation or not remains open. In this paper, we first demonstrate that this checksum relationship is not maintained in the middle of the computation for most algorithms for matrix-matrix multiplication. We then prove that, however, for the outer product version of the matrix-matrix multiplication algorithm, this checksum relationship can be maintained in the middle of the computation. Based on this checksum relationship maintained in the middle of the computation, we demonstrate that fail-stop process failures (which are often tolerated by checkpointing or message logging) in the outer product version of matrix-matrix multiplication can be tolerated without checkpointing or message logging.
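The claim about the outer product version can be checked on a small sequential example: if A carries an extra row of column sums and B an extra column of row sums, then every partial result of the rank-1 update loop is itself a full checksum matrix. The sketch below uses hypothetical helper names and plain Python lists; it illustrates only the invariant, not the distributed recovery scheme.

```python
def with_column_checksum(A):
    """Append a row holding the sum of each column."""
    return [row[:] for row in A] + [[sum(col) for col in zip(*A)]]


def with_row_checksum(B):
    """Append to each row the sum of its entries."""
    return [row + [sum(row)] for row in B]


def is_full_checksum(C):
    """Last column equals the row sums and last row equals the column sums."""
    rows_ok = all(row[-1] == sum(row[:-1]) for row in C)
    cols_ok = all(C[-1][j] == sum(row[j] for row in C[:-1])
                  for j in range(len(C[0])))
    return rows_ok and cols_ok


def outer_product_steps(A, B):
    """Yield the partial product after each rank-1 (outer product) update."""
    Ac, Br = with_column_checksum(A), with_row_checksum(B)
    C = [[0] * len(Br[0]) for _ in Ac]
    for k in range(len(Br)):
        for i in range(len(Ac)):
            for j in range(len(Br[0])):
                C[i][j] += Ac[i][k] * Br[k][j]
        yield [row[:] for row in C]


steps = list(outer_product_steps([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
invariant_holds = all(is_full_checksum(C) for C in steps)
```

Because the invariant holds after every update, a lost row or column of the partial result could in principle be rebuilt from the checksums at that point rather than from a checkpoint.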
TL;DR: This work proposes an exact algorithm based on the breadth-first search that finds the minimum number of operations solution of mid-size MCM instances in a reasonable time and is compared with the previously proposed efficient heuristics.
Abstract: This paper addresses the multiplication of one data sample with multiple constants using addition/subtraction and shift operations, i.e., the multiple constant multiplications (MCM) problem. The MCM problem finds itself and its variants in many applications, such as digital finite impulse response (FIR) filters, linear signal transforms, and computer arithmetic. Although many efficient algorithms have been proposed to implement the MCM using the fewest number of operations, due to the NP-hardness of the problem, they have been heuristics, i.e., they cannot guarantee the minimum solution. In this work, we propose an exact algorithm based on the breadth-first search that finds the minimum number of operations solution of mid-size MCM instances in a reasonable time. The proposed exact algorithm has been tested on a set of instances including FIR filter and randomly generated instances, and compared with the previously proposed efficient heuristics. It is observed from the experimental results that, even though the previously proposed heuristics obtain similar results with the minimum number of operations solutions, there are instances for which the exact algorithm finds better solutions than the prominent heuristics.
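A small hand-worked instance shows what an MCM solution looks like: one input is multiplied by several constants using only shifts, additions and subtractions, and intermediate results are shared. The constants 5 and 45 and the function name are hypothetical choices for illustration; the decomposition was picked by hand and is not the output of the exact algorithm described above.

```python
def times_5_and_45(x):
    """Compute 5*x and 45*x with two shift-and-add operations in total."""
    x5 = (x << 2) + x      # 4x + x   = 5x   (first adder)
    x45 = (x5 << 3) + x5   # 40x + 5x = 45x  (second adder, reuses 5x)
    return x5, x45


outputs = times_5_and_45(3)
```

Computing 45x on its own from x would not benefit from the 5x term, which is why sharing partial products across constants is the core of the problem.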
TL;DR: In this article, the authors present a computerized game system for multiplayer tournaments, a method of operating the system, a module to be used in conjunction with the system and a method for playing the game.
Abstract: The present invention discloses a computerized gaming system for multiplayer tournaments, a method of operating the system, a module to be used in conjunction with the system, and a method of playing the game. The method of operating the system comprises a) setting and accommodating a configuration comprising one or more predefined and/or configurable rules for multiplication of stakes and/or a winning pool, conditions thereof, one or more parameters of said multiplication, and one or more tournament limits; b) facilitating obtaining a multiplication proposal from a player; c) obtaining game-related and/or player-related data from the gaming server and/or the billing unit and/or the storage unit, and providing automated authorization of the proposal in accordance with said obtained data and said accommodated configuration; d) presenting the authorized proposal to one or more other players and facilitating obtaining one or more respective replies, e) automated informing the game server and/or the billing unit with regard to the multiplication proposal and reply thereof; and f) automated executing the game rules corresponding to the obtained replies.
TL;DR: In this article, the authors investigate the potential benefits of using Galbraith-Lin-Scott elliptic curves in the case where q is a power of 2 and show that they offer significant acceleration for curves over binary fields, in both doubling- and halving-based approaches.
Abstract: Galbraith, Lin and Scott recently constructed efficiently-computable endomorphisms for a large family of elliptic curves defined over $\mathbb{F}_{q^2}$ and showed, in the case where q is prime, that the Gallant-Lambert-Vanstone point multiplication method for these curves is significantly faster than point multiplication for general elliptic curves over prime fields. In this paper, we investigate the potential benefits of using Galbraith-Lin-Scott elliptic curves in the case where q is a power of 2. The analysis differs from the q prime case because of several factors, including the availability of the point halving strategy for elliptic curves over binary fields. Our analysis and implementations show that Galbraith-Lin-Scott offers significant acceleration for curves over binary fields, in both doubling- and halving-based approaches. Experimentally, the acceleration surpasses that reported for prime fields (for the platform in common), a somewhat counterintuitive result given the relative costs of point addition and doubling in each case.
TL;DR: A matrix multiplication module (10) and matrix multiplication method are provided that use a variable number of multiplier-accumulator (MAC) units (20) based on the number of data elements of the matrices that are available or needed for processing at a particular point or stage in the computation process.
Abstract: A matrix multiplication module (10) and matrix multiplication method are provided that use a variable number of multiplier-accumulator (MAC) units (20) based on the number of data elements of the matrices that are available or needed for processing at a particular point or stage in the computation process. As more data elements become available or are needed, more multiplier-accumulator units (20) are used to perform the necessary multiplication and addition operations. To multiply an NxM matrix by an MxN matrix, the total (maximum) number of used MAC units is "2*N-1". The number of MAC units used starts with one (1) and increases by two at each computation stage, that is, at the beginning of reading of data elements for each new row of the first matrix. The sequence of the number of MAC units is {1, 3, 5, ..., 2*N-1} for computation stages, each of which corresponds to reading of data elements for each new row of the left hand matrix, also called the first matrix.