Top 297 papers published in the topic of Multiplication in 2008

Showing papers on "Multiplication" published in 2008

Journal Article•10.1145/1356052.1356053•

Anatomy of high-performance matrix multiplication

Kazushige Goto¹, Robert A. van de Geijn¹•Institutions (1)

16 May 2008-ACM Transactions on Mathematical Software

TL;DR: The basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library are presented.

Abstract: We present the basic principles that underlie the high-performance implementation of the matrix-matrix multiplication that is part of the widely used GotoBLAS library. Design decisions are justified by successively refining a model of architectures with multilevel memories. A simple but effective algorithm for executing this operation results. Implementations on a broad selection of architectures are shown to achieve near-peak performance.
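
The idea of organizing a matrix product around a memory hierarchy can be illustrated with a simple blocked loop nest. This is only a minimal sketch under stated assumptions, not the GotoBLAS kernel: the function name `blocked_matmul` and the `block` parameter are hypothetical, and a real implementation would pack panels and use a tuned inner kernel.

```python
def blocked_matmul(a, b, block=2):
    """Compute C = A * B by visiting the operands in block x block tiles.

    a is an m x k list of rows, b is a k x n list of rows.
    Tiling keeps a small working set of A, B and C in use at a time,
    which is the property a cache hierarchy rewards.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):            # tile of rows of C
        for p0 in range(0, k, block):        # tile of the shared dimension
            for j0 in range(0, n, block):    # tile of columns of C
                for i in range(i0, min(i0 + block, m)):
                    row_c = c[i]
                    for p in range(p0, min(p0 + block, k)):
                        a_ip = a[i][p]
                        row_b = b[p]
                        for j in range(j0, min(j0 + block, n)):
                            row_c[j] += a_ip * row_b[j]
    return c


if __name__ == "__main__":
    print(blocked_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], block=1))
```

The block size is the tuning knob here; in this sketch it changes only the visiting order, never the result.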

885 citations

Proceedings Article•10.1109/FOCS.2008.32•

Arithmetic Circuits: A Chasm at Depth Four

Manindra Agrawal¹, V. Vinay•Institutions (1)

Indian Institute of Technology Kanpur¹

25 Oct 2008

TL;DR: It is shown that proving exponential lower bounds on depth four arithmetic circuits implies exponential lower bounds for unrestricted depth arithmetic circuits, and that for exponential sized circuits additional depth beyond four does not help.

Abstract: We show that proving exponential lower bounds on depth four arithmetic circuits implies exponential lower bounds for unrestricted depth arithmetic circuits. In other words, for exponential sized circuits additional depth beyond four does not help. We then show that a complete black-box derandomization of the identity testing problem for depth four circuits with multiplication gates of small fanin implies a nearly complete derandomization of general identity testing.

255 citations

Proceedings Article•10.1109/IPDPS.2008.4536313•

On the representation and multiplication of hypersparse matrices

Aydin Buluc¹, John R. Gilbert¹•Institutions (1)

University of California, Santa Barbara¹

14 Apr 2008

TL;DR: This paper develops and analyzes two new algorithms that scale significantly better than existing kernels on the multiplication of sparse matrices (SpGEMM) and considers their algorithms first as the sequential kernel of a scalable parallel sparse matrix multiplication algorithm and second as part of a polyalgorithm that would execute different kernels depending on the sparsity of the input matrices.

Abstract: Multicore processors are marking the beginning of a new era of computing where massive parallelism is available and necessary. Slightly slower but easy to parallelize kernels are becoming more valuable than sequentially faster kernels that are unscalable when parallelized. In this paper, we focus on the multiplication of sparse matrices (SpGEMM). We first present the issues with existing sparse matrix representations and multiplication algorithms that make them unscalable to thousands of processors. Then, we develop and analyze two new algorithms that overcome these limitations. We consider our algorithms first as the sequential kernel of a scalable parallel sparse matrix multiplication algorithm and second as part of a polyalgorithm for SpGEMM that would execute different kernels depending on the sparsity of the input matrices. Such a sequential kernel requires a new data structure that exploits the hypersparsity of the individual submatrices owned by a single processor after the 2D partitioning. We experimentally evaluate the performance and characteristics of our algorithms and show that they scale significantly better than existing kernels.

255 citations

Journal Article•10.1137/050645671•

Accurate Floating-Point Summation Part I: Faithful Rounding

Siegfried M. Rump, Takeshi Ogita, Shin'ichi Oishi

01 Oct 2008-SIAM Journal on Scientific Computing

TL;DR: This paper presents an algorithm for calculating a faithful rounding of a vector of floating-point numbers, which adapts to the condition number of the sum, and proves certain constants used in the algorithm to be optimal.

Abstract: Given a vector of floating-point numbers with exact sum $s$, we present an algorithm for calculating a faithful rounding of $s$, i.e., the result is one of the immediate floating-point neighbors of $s$. If the sum $s$ is a floating-point number, we prove that this is the result of our algorithm. The algorithm adapts to the condition number of the sum, i.e., it is fast for mildly conditioned sums with slowly increasing computing time proportional to the logarithm of the condition number. All statements are also true in the presence of underflow. The algorithm does not depend on the exponent range. Our algorithm is fast in terms of measured computing time because it allows good instruction-level parallelism, it neither requires special operations such as access to mantissa or exponent, it contains no branch in the inner loop, nor does it require some extra precision: The only operations used are standard floating-point addition, subtraction, and multiplication in one working precision, for example, double precision. Certain constants used in the algorithm are proved to be optimal.

213 citations

Journal Article•10.1109/TVLSI.2007.912228•

Fast Elliptic Curve Cryptography on FPGA

W.N. Chelton¹, Mohammed Benaissa¹•Institutions (1)

University of Sheffield¹

01 Feb 2008-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: This paper details the design of a new high-speed pipelined application-specific instruction set processor (ASIP) for elliptic curve cryptography (ECC) using field-programmable gate-array (FPGA) technology.

Abstract: This paper details the design of a new high-speed pipelined application-specific instruction set processor (ASIP) for elliptic curve cryptography (ECC) using field-programmable gate-array (FPGA) technology. Different levels of pipelining were applied to the data path to explore the resulting performances and find an optimal pipeline depth. Three complex instructions were used to reduce the latency by reducing the overall number of instructions, and a new combined algorithm was developed to perform point doubling and point addition using the application specific instructions. An implementation for the United States Government National Institute of Standards and Technology-recommended curve over $GF(2^{163})$ is shown, which achieves a point multiplication time of 33.05 μs at 91 MHz on a Xilinx Virtex-E FPGA, the fastest figure reported in the literature to date. Using the more modern Xilinx Virtex-4 technology, a point multiplication time of 19.55 μs was achieved, which translates to over 51120 point multiplications per second.

176 citations

Fast multiplication and its applications

Daniel J. Bernstein

1 Jan 2008

TL;DR: This survey explains how some useful arithmetic operations can be sped up from quadratic time to essentially linear time.

Abstract: This survey explains how some useful arithmetic operations can be sped up from quadratic time to essentially linear time.

161 citations

Proceedings Article•10.1109/ICPP.2008.45•

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication

Aydin Buluc¹, John R. Gilbert¹•Institutions (1)

University of California, Santa Barbara¹

9 Sep 2008

TL;DR: It is shown that sparse algorithms are not as scalable as their dense counterparts, because in general, there are not enough non-trivial arithmetic operations to hide the communication costs as well as the sparsity overheads.

Abstract: We identify the challenges that are special to parallel sparse matrix-matrix multiplication (PSpGEMM). We show that sparse algorithms are not as scalable as their dense counterparts, because in general, there are not enough non-trivial arithmetic operations to hide the communication costs as well as the sparsity overheads. We analyze the scalability of 1D and 2D algorithms for PSpGEMM. While the 1D algorithm is a variant of existing implementations, 2D algorithms presented are completely novel. Most of these algorithms are based on the previous research on parallel dense matrix multiplication. We also provide results from preliminary experiments with 2D algorithms.

137 citations

Journal Article•

A Reduced-Bit Multiplication Algorithm for Digital Arithmetic

Harpreet S. Dhillon, Abhijit Mitra

25 Jul 2008-World Academy of Science, Engineering and Technology, International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering

TL;DR: A reduced-bit multiplication algorithm based on the ancient Vedic multiplication formulae, Urdhva tiryakbhyam and Nikhilam, is proposed and is further optimized by use of some general arithmetic operations such as expansion and bit-shifting to take advantage of bit-reduction in multiplication.

Abstract: A reduced-bit multiplication algorithm based on the ancient Vedic multiplication formulae is proposed in this paper. Both the Vedic multiplication formulae, Urdhva tiryakbhyam and Nikhilam, are first discussed in detail. Urdhva tiryakbhyam, being a general multiplication formula, is equally applicable to all cases of multiplication. It is applied to the digital arithmetic and is shown to yield a multiplier architecture which is very similar to the popular array multiplier. Due to its structure, it leads to a high carry propagation delay in case of multiplication of large numbers. Nikhilam Sutra, on the other hand, is more efficient in the multiplication of large numbers as it reduces the multiplication of two large numbers to that of two smaller numbers. The framework of the proposed algorithm is taken from this Sutra and is further optimized by use of some general arithmetic operations such as expansion and bit-shifting to take advantage of bit-reduction in multiplication. We illustrate the proposed algorithm by reducing a general 4×4-bit multiplication to a single 2×2-bit multiplication operation.
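
The arithmetic identity usually associated with the Nikhilam Sutra can be sketched as follows. This is a minimal illustration, not the authors' reduced-bit algorithm: the function name `nikhilam_multiply` is hypothetical, and a power-of-two base is assumed here so that multiplying by the base corresponds to a bit shift in hardware.

```python
def nikhilam_multiply(a: int, b: int, n_bits: int) -> int:
    """Multiply two n_bits-wide non-negative integers via deficits from a base.

    With base B = 2**n_bits and deficits da = B - a, db = B - b:
        a * b = B * (a - db) + da * db
    When a and b are close to B, the remaining product da * db
    involves much smaller numbers than a * b.
    """
    base = 1 << n_bits
    if not (0 <= a < base and 0 <= b < base):
        raise ValueError("operands must fit in n_bits")
    da = base - a            # deficit of a from the base
    db = base - b            # deficit of b from the base
    left = a - db            # same value as b - da
    return left * base + da * db


if __name__ == "__main__":
    # 13 * 14 with a 4-bit base of 16: deficits 3 and 2, left part 11.
    print(nikhilam_multiply(13, 14, 4))
```

For 13 × 14 the only true multiplication left is 3 × 2, which fits in 2 bits each; the rest is subtraction and a shift by the base.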

110 citations

Journal Article•

New VLSI Architecture of Parallel Multiplier-Accumulator Based on Radix-2 Modified Booth Algorithm

Young-Ho Seo, Dong-Wook Kim

01 Jan 2008-Journal of the Institute of Electronics Engineers of Korea

TL;DR: A new multiplier-and-accumulator (MAC) architecture for high-speed multiplication and accumulation arithmetic is proposed; performance is improved by combining multiplication with accumulation and devising a hybrid type of carry save adder (CSA).

Abstract: In this paper, we propose a new architecture of multiplier-and-accumulator (MAC) for high speed multiplication and accumulation arithmetic. By combining multiplication with accumulation and devising a hybrid type of carry save adder (CSA), the performance was improved. Since the accumulator, which has the largest delay in the MAC, was removed and its function was included in the CSA, the overall performance is improved. The proposed CSA tree uses a 1's complement-based radix-2 modified Booth algorithm (MBA) and has a modified array for the sign extension in order to increase the bit density of operands. The CSA propagates the carries by the least significant bits of the partial products and generates the least significant bits in advance to decrease the number of input bits of the final adder. Also, the proposed MAC accumulates the intermediate results as sum and carry bits rather than as the output of the final adder, improving performance by optimizing the efficiency of the pipeline scheme. The proposed architecture was synthesized with and 90nm standard CMOS library after designing it. We analyzed results such as hardware resources, delay, and pipeline, based on theoretical and experimental estimation. We used Sakurai's alpha power law for the delay modeling. The proposed MAC has superior properties to the standard design in many ways, and its performance is twice that of previous research at a similar clock frequency.

88 citations

Journal Article•10.1109/TC.2007.70815•

Fast and Flexible Elliptic Curve Point Arithmetic over Prime Fields

Patrick Longa¹, Ali Miri¹•Institutions (1)

University of Ottawa¹

01 Mar 2008-IEEE Transactions on Computers

TL;DR: An innovative methodology for accelerating the elliptic curve point formulae over prime fields using the substitution of multiplication with squaring and other cheaper operations, by exploiting the fact that field squaring is generally less costly than multiplication.

Abstract: We present an innovative methodology for accelerating the elliptic curve point formulae over prime fields. This flexible technique uses the substitution of multiplication with squaring and other cheaper operations, by exploiting the fact that field squaring is generally less costly than multiplication. Applying this substitution to the traditional formulae, we obtain faster point operations in unprotected sequential implementations. We show the significant impact our methodology has in protecting against Simple Side-Channel Attacks (SSCA). We modify the ECC point formulae to achieve a faster atomic structure when applying atomicity side-channel protection. In contrast to previous atomic operations that assumed squarings are undistinguishable from multiplications, our new atomic structure offers true SSCA-protection because it includes squaring in its formulation. We also extend our implementation to parallel architectures such as SIMD (Single-Instruction Multiple-Data). With the introduction of a new coordinate system and with the flexibility of our methodology, we present, to our knowledge, the fastest formulae for SIMD-based schemes that are capable of executing 3 and 4 operations simultaneously. Finally, a new parallel SSCA-protected scheme is proposed for multiprocessor/parallel architectures by applying the atomic structure presented in this work. Our parallel and atomic operations are shown to be significantly faster than previous implementations.
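
One well-known identity for trading a field multiplication for squarings is 2ab = (a + b)² − a² − b². The sketch below illustrates it over a prime field. It is an assumption-laden example rather than the authors' point formulae: the function name `double_product_via_squares` and the small prime used in the demonstration are chosen here for illustration only.

```python
def double_product_via_squares(a: int, b: int, p: int) -> int:
    """Return 2*a*b mod p using only squarings, additions and subtractions.

    Uses (a + b)^2 - a^2 - b^2 = 2ab. If a^2 and b^2 are already needed
    elsewhere in a point formula, the product 2ab then costs one extra
    squaring instead of one multiplication.
    """
    a_sq = (a * a) % p               # squaring
    b_sq = (b * b) % p               # squaring
    sum_sq = ((a + b) * (a + b)) % p # squaring of the sum
    return (sum_sq - a_sq - b_sq) % p


if __name__ == "__main__":
    p = 97  # small illustrative prime, not a cryptographic field size
    print(double_product_via_squares(12, 34, p), (2 * 12 * 34) % p)
```

The saving only materializes when squaring is cheaper than a general multiplication and the individual squares are reused, which is the situation the abstract describes.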

87 citations

Proceedings Article•10.1109/IPDPS.2008.4536350•

Faster matrix-vector multiplication on GeForce 8800GTX

Noriyuki Fujimoto¹•Institutions (1)

Osaka University¹

14 Apr 2008

TL;DR: The performance, including the data transfer between CPU and GPU, of Jacobi's iterative method for solving linear equations shows that the proposed algorithm is practical for some real applications.

Abstract: Recently GPUs have acquired programmability to perform general purpose computation fast by running tens of thousands of threads concurrently. This paper presents a new algorithm for dense matrix-vector multiplication on NVIDIA CUDA architecture. The experimental results on GeForce 8800GTX show that the proposed algorithm runs up to 15.69 (resp., 32.88) times faster than the sgemv routine in NVIDIA's BLAS library CUBLAS 1.1 (resp., Intel Math Kernel Library 9.1 on one core of a 2.0 GHz Intel Xeon E5335 CPU with SSE3 SIMD instructions) for matrices with order 16 to 12800. The performance, including the data transfer between CPU and GPU, of Jacobi's iterative method for solving linear equations shows that the proposed algorithm is practical for some real applications.

Journal Article•10.1109/TCSI.2008.916622•

Systolic and Super-Systolic Multipliers for Finite Field $GF(2^{m})$ Based on Irreducible Trinomials

Pramod Kumar Meher¹•Institutions (1)

Nanyang Technological University¹

01 May 2008-IEEE Transactions on Circuits and Systems I-regular Papers

TL;DR: An efficient bit-level-pipelined bit-parallel systolic design for binary field multiplication which requires fewer gates and registers and involves nearly half the time-complexity of the corresponding existing design is derived.

Abstract: Novel systolic and super-systolic architectures are presented for polynomial basis multiplication over $GF(2^{m})$ based on irreducible trinomials. By suitable cut-set retiming, we have derived here an efficient bit-level-pipelined bit-parallel systolic design for binary field multiplication which requires fewer gates and registers and involves nearly half the time-complexity of the corresponding existing design. We have also suggested a digit-level-pipelined design, which involves lower latency, and fewer registers compared with the bit-level-pipelined structure. Moreover, we have proposed a super-systolic design consisting of a set of systolic arrays in a systolic-pipeline and a pipelined systolic-block design consisting of pipelined blocks of concurrent systolic arrays. The super-systolic designs have the same average computation time and the same critical path as the proposed bit-level-pipelined design, but can be used to reduce the latency by a factor $O(\sqrt{m})$ at the cost of marginally higher number of XOR gates and bit-registers. The hardware complexities of proposed super-systolic designs are nearly three times that of the existing bit-parallel structures, but offer very high throughput compared with the others for large values of m. For the field orders m = 233 and m = 409, the proposed structures offer, respectively, ten and eleven times more throughput than the others.

Proceedings Article•10.1109/NEMS.2008.4484438•

QCA based multiplexing of 16 arithmetic & logical subsystems-A paradigm for nano computing

V.C. Teja¹, S. Polisetti², S. Kasavajjala³•Institutions (3)

Birla Institute of Technology and Science¹, Qualcomm², STMicroelectronics³

11 Apr 2008

TL;DR: The ALFG presented offers functions beyond a normal ALU, such as shift, parity, XOR, rotate and multiplication, along with basic operations.

Abstract: In this work, we make use of a new computing paradigm, quantum cellular automata (QCA), to describe the design and layout of an arithmetic and logical function generator (ALFG). The ALFG is made up of several blocks whose outputs are multiplexed for an intended functionality. The design of the ALFG was made using the basic elements of the technology, such as the inverter and majority voter. The layout and simulation results are presented using the QCADesigner tool. A total of 16 arithmetic and logical operations can be performed on a pair of 4-bit vectors. The ALFG presented offers functions beyond a normal ALU, such as shift, parity, XOR, rotate and multiplication, along with basic operations. The total area consumed by the ALFG is 11.37 μm², with the aid of 9 clocks to give the final output from the generator. The building blocks of the ALFG can be added into a 4-bit processor in order to make it feasible for additional applications.

Journal Article•10.1016/J.TCS.2007.07.051•

Nondeterministic polynomial time factoring in the tile assembly model

Yuriy Brun¹•Institutions (1)

University of Southern California¹

10 Apr 2008-Theoretical Computer Science

TL;DR: This work presents tile assembly model systems that factor numbers nondeterministically using Θ(1) distinct components, explores bounds on the probability of such a nondeterministic system succeeding, and proves that the probability can be made arbitrarily close to 1.

Journal Article•10.1109/TC.2008.103•

Algorithms and Arithmetic Operators for Computing the ηT Pairing in Characteristic Three

Jean-Luc Beuchat¹, Nicolas Brisebarre², Jérémie Detrey, Eiji Okamoto¹, Masaaki Shirase³, Tsuyoshi Takagi³•Institutions (3)

University of Tsukuba¹, École normale supérieure de Lyon², Future University Hakodate³

01 Nov 2008-IEEE Transactions on Computers

TL;DR: A hardware accelerator based on a unified arithmetic operator able to perform the operations required by a given algorithm is proposed, and a compact coprocessor is described for the field $\mathbb{F}_{3^{97}}$ given by $\mathbb{F}_3[x]/(x^{97}+x^{12}+2)$.

Abstract: Since their introduction in constructive cryptographic applications, pairings over (hyper)elliptic curves are at the heart of an ever increasing number of protocols. With software implementations being rather slow, the study of hardware architectures became an active research area. In this paper, we discuss several algorithms to compute the $\eta_T$ pairing in characteristic three and suggest further improvements. These algorithms involve addition, multiplication, cubing, inversion, and sometimes cube root extraction over $\mathbb{F}_{3^m}$. We propose a hardware accelerator based on a unified arithmetic operator able to perform the operations required by a given algorithm. We describe the implementation of a compact coprocessor for the field $\mathbb{F}_{3^{97}}$ given by $\mathbb{F}_3[x]/(x^{97}+x^{12}+2)$, which compares favorably with other solutions described in the open literature.

Optimizing Galois Field Arithmetic for Diverse Processor Architectures and Applications

Kevin M. Greenan, Ethan L. Miller, E. Schwarz, Cruz Santa

1 Jan 2008

TL;DR: Galois field implementations are central to the design of many reliable and secure systems; current memory and CPU cache sizes and preferred approaches limit most applications to performing computation in either $GF(2^{8})$ or $GF(2^{16})$.

Abstract: Galois field implementations are central to the design of many reliable and secure systems, with many systems implementing them in software. The two most common Galois field operations are addition and multiplication; typically, multiplication is far more expensive than addition. In software, multiplication is generally done with a look-up to a pre-computed table, limiting the size of the field and resulting in uneven performance across architectures and applications. [...] encrypt blocks of data in the current NIST advanced encryption standard [14]. Current memory, CPU cache sizes and preferred approaches limit most applications to performing computation in either $GF(2^{8})$ or $GF(2^{16})$. The goal of our research is to study the multiplication performance of these common fields, propose an alternate representation for arbitrary-sized fields and compare performance across all representations on different CPU architectures and for different workloads. Multiplication in GF(2.) is usually done using pre-computed look-up tables, while addition of two elements [...]
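
A common form of the table-based multiplication mentioned in the abstract uses logarithm and antilogarithm tables. The sketch below shows it for $GF(2^{8})$ next to a table-free shift-and-XOR version. It is an illustration under stated assumptions, not the authors' implementation: the reduction polynomial `0x11D`, the generator 2, and all function names are choices made here.

```python
PRIM_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed; 2 generates the field)


def build_tables():
    """Precompute antilog (EXP) and log (LOG) tables for GF(2^8)."""
    exp = [0] * 510          # doubled so LOG[a] + LOG[b] needs no modulo
    log = [0] * 256          # log[0] is unused; zero is handled separately
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x <<= 1              # multiply by the generator 2
        if x & 0x100:
            x ^= PRIM_POLY   # reduce back into 8 bits
    for i in range(255, 510):
        exp[i] = exp[i - 255]
    return exp, log


EXP, LOG = build_tables()


def gf_add(a: int, b: int) -> int:
    """Addition in GF(2^8) is a bitwise XOR."""
    return a ^ b


def gf_mul(a: int, b: int) -> int:
    """Table-based multiplication: add logarithms, then look up the antilog."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def gf_mul_shift_xor(a: int, b: int) -> int:
    """Table-free reference: shift-and-XOR with reduction at each step."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


if __name__ == "__main__":
    print(gf_mul(2, 128), gf_mul_shift_xor(2, 128))
```

The tables here hold a few hundred entries; the same scheme for $GF(2^{16})$ needs tables with tens of thousands of entries, which is the kind of memory and cache pressure the abstract refers to.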

Proceedings Article•10.1109/ISSSTA.2008.85•

Primary Synchronization Signal in E-UTRA

Branislav Popovic¹, Fredrik Berggren¹•Institutions (1)

Huawei¹

5 Sep 2008

TL;DR: By designing the signal to be centrally symmetric in the time-domain, the symmetric samples can be added prior to multiplication with the replica sample, thereby saving one multiplication per two received samples.

Abstract: In this paper, we describe the signal design for the primary synchronization channel adopted for the E-UTRA cellular system. By designing the signal to be centrally symmetric in the time-domain, the symmetric samples can be added prior to multiplication with the replica sample, thereby saving one multiplication per two received samples. Furthermore, a pair of centrally symmetric P-SCH signals can be generated such that one signal is a complex conjugated version of the other signal. The multiplication complexity of detecting both these signals can be reduced to that of just one of the signals.
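
The saving from central symmetry can be sketched as a correlator that adds each pair of mirrored received samples before the single multiplication with the shared replica sample. This is a minimal illustration, assuming a replica with `replica[k] == replica[n - 1 - k]`; the function names and the toy sequences are hypothetical and are not the E-UTRA synchronization sequences.

```python
def correlate_direct(received, replica):
    """Plain correlation: one complex multiplication per received sample."""
    return sum(r * d.conjugate() for r, d in zip(received, replica))


def correlate_symmetric(received, replica):
    """Correlation with a centrally symmetric replica.

    Mirrored samples share the same replica value, so they are added first
    and multiplied once: about n/2 multiplications instead of n.
    """
    n = len(replica)
    half = n // 2
    acc = 0j
    for k in range(half):
        acc += (received[k] + received[n - 1 - k]) * replica[k].conjugate()
    if n % 2:  # odd length: the middle sample has no mirror partner
        acc += received[half] * replica[half].conjugate()
    return acc


if __name__ == "__main__":
    replica = [1 + 1j, 2 - 1j, 3j, 2 - 1j, 1 + 1j]   # centrally symmetric toy replica
    received = [1 - 2j, 3 + 1j, -1 + 4j, 2 + 2j, -3 + 1j]
    print(correlate_direct(received, replica))
    print(correlate_symmetric(received, replica))
```

Both functions return the same correlation value for a symmetric replica; only the number of multiplications differs.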

Journal Article•10.1093/IETISY/E91-D.5.1338•

TinyECCK: Efficient Elliptic Curve Cryptography Implementation over $GF(2^{m})$ on 8-Bit Micaz Mote

Seog Chung Seo¹, Dong-Guk Han², Hyung Chan Kim³, Seokhie Hong¹•Institutions (3)

Korea University¹, Electronics and Telecommunications Research Institute², Columbia University³

01 May 2008-The IEICE transactions on information and systems

TL;DR: TinyECCK (Tiny Elliptic Curve Cryptosystem with Koblitz curve, a kind of TinyOS package supporting elliptic curve operations) is presented, which is the first implementation of a Koblitz curve on sensor motes as far as the authors know; comparisons with existing software implementations of ECC show that it outperforms them in terms of running time, code size and supporting services.

Abstract: In this paper, we revisit a generally accepted opinion: implementing an Elliptic Curve Cryptosystem (ECC) over $GF(2^{m})$ on sensor motes using a small word size is not appropriate because XOR multiplication over $GF(2^{m})$ is not efficiently supported by current low-powered microprocessors. Although there are some implementations over $GF(2^{m})$ on sensor motes, their performance is not satisfactory enough to be used for wireless sensor networks (WSNs). We have found that a field multiplication over $GF(2^{m})$ involves a number of redundant memory accesses and that its inefficiency originates from this problem. Moreover, the field reduction process also requires many redundant memory accesses. Therefore, we propose some techniques for reducing unnecessary memory accesses. With the proposed strategies, the running time of field multiplication and reduction over $GF(2^{163})$ can be decreased by 21.1% and 24.7%, respectively. These savings noticeably decrease execution times spent in Elliptic Curve Digital Signature Algorithm (ECDSA) operations (signing and verification) by around 15–19%. We present TinyECCK (Tiny Elliptic Curve Cryptosystem with Koblitz curve, a kind of TinyOS package supporting elliptic curve operations), which is the first implementation of a Koblitz curve on sensor motes as far as we know. Through comparisons with existing software implementations of ECC built in C or a hybrid of C and inline assembly on sensor motes, we show that TinyECCK outperforms them in terms of running time, code size and supporting services. Furthermore, we show that a field multiplication over $GF(2^{m})$ can be faster than that over GF(p) on an 8-bit Atmega128 processor by comparing TinyECCK with TinyECC, a well-known ECC implementation over GF(p). TinyECCK with sect163k1 can generate a signature and verify it in 1.37 and 2.32 secs on a Micaz mote with 13,748 bytes of ROM and 1,004 bytes of RAM.

Journal Article•10.1515/FREQ.2008.62.5-6.118•

Terahertz Sources Based on Frequency Multiplication and Their Applications

Alain Maestrini, John Ward, Goutam Chattopadhyay, Erich Schlecht, Imran Mehdi

01 Jun 2008-Frequenz

TL;DR: The current state of the art in terahertz sources is reviewed with emphasis on Schottky diode based frequency multiplier technology, which has enabled sensitive heterodyne receivers in the terahertz range that will be flown on an upcoming ESA mission.

Abstract: Compact, robust and broadband sources in the terahertz range are extremely important in diverse applications such as spectroscopy, imaging, communication, and radar. A review of the current state-of-the-art is presented with emphasis on Schottky diode based frequency multiplier technology. Frequency multiplier circuit chips fabricated on few micrometer thick GaAs membranes and packaged in low-loss waveguide circuits have demonstrated tens of microwatts of output power up to 1.9 THz. This breakthrough has enabled sensitive heterodyne receivers in the terahertz range that will be flown on an upcoming ESA mission to answer fundamental questions about our universe. Index Terms: THz technology, frequency multipliers, sources, Schottky diode, waveguide, GaAs membrane. This work was supported by the Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, USA and the Observatoire de Paris, Laboratoire d’Etude du Rayonnement et de la Matière en Astrophysique, France.

Proceedings Article•10.1145/1374376.1374447•

Fast integer multiplication using modular arithmetic

Anindya De¹, Piyush P. Kurur¹, Chandan Saha¹, Ramprasad Saptharishi¹•Institutions (1)

Indian Institute of Technology Kanpur¹

17 May 2008

TL;DR: An $O(N \cdot \log N \cdot 2^{O(\log^* N)})$ algorithm for multiplying two N-bit integers is given, using modular arithmetic and multivariate polynomial multiplication.

Abstract: We give an $O(N \cdot \log N \cdot 2^{O(\log^* N)})$ algorithm for multiplying two N-bit integers that improves the $O(N \cdot \log N \cdot \log\log N)$ algorithm by Schonhage-Strassen. Both these algorithms use modular arithmetic. Recently, Furer gave an $O(N \cdot \log N \cdot 2^{O(\log^* N)})$ algorithm which however uses arithmetic over complex numbers as opposed to modular arithmetic. In this paper, we use multivariate polynomial multiplication along with ideas from Furer's algorithm to achieve this improvement in the modular setting. Our algorithm can also be viewed as a p-adic version of Furer's algorithm. Thus, we show that the two seemingly different approaches to integer multiplication, modular and complex arithmetic, are similar.

Proceedings Article•10.1145/1375527.1375558•

Adaptive runtime tuning of parallel sparse matrix-vector multiplication on distributed memory systems

Seyong Lee¹, Rudolf Eigenmann¹•Institutions (1)

Purdue University¹

7 Jun 2008

TL;DR: In this paper, the authors propose adaptive runtime tuning mechanisms to improve the parallel performance of sparse matrix-vector (SpMV) multiplication, a widely used kernel in scientific applications, on distributed memory systems.

Abstract: Sparse matrix-vector (SpMV) multiplication is a widely used kernel in scientific applications. In these applications, the SpMV multiplication is usually deeply nested within multiple loops and thus executed a large number of times. We have observed that there can be significant performance variability, due to irregular memory access patterns. Static performance optimizations are difficult because the patterns may be known only at runtime. In this paper, we propose adaptive runtime tuning mechanisms to improve the parallel performance on distributed memory systems. Our adaptive iteration-to-process mapping mechanism balances computational load at runtime with negligible overhead (1% on average), and our runtime communication selection algorithm searches for the best communication method for a given data distribution and mapping. Actual runs on 26 real matrices show that our runtime tuning system reduces execution time up to 68.8% (30.9% on average) over a base block-distributed parallel algorithm on distributed systems with 32 nodes.

Journal Article•10.1080/00927870802186805•

Idempotent and Nilpotent Submodules of Multiplication Modules

Majid M. Ali¹•Institutions (1)

Sultan Qaboos University¹

11 Dec 2008-Communications in Algebra

TL;DR: In this article, the authors introduce the concept of nilpotent submodules and prove that a faithful multiplication module is von Neumann regular if and only if it has no nonzero nilpotent elements and its Krull dimension is zero.

Abstract: All rings are commutative with identity, and all modules are unital. The purpose of this article is to investigate multiplication von Neumann regular modules. For this reason we introduce the concept of nilpotent submodules generalizing nilpotent ideals and then prove that a faithful multiplication module is von Neumann regular if and only if it has no nonzero nilpotent elements and its Krull dimension is zero. We also give a new characterization for the radical of a submodule of a multiplication module and show in particular that the radical of any submodule of a Noetherian multiplication module is a finite intersection of prime submodules.

Book Chapter•10.1007/978-3-540-78967-3_26•

Strongly multiplicative ramp schemes from high degree rational points on curves

Hao Chen¹, Ronald Cramer², Robbert de Haan, Ignacio Cascudo Pueyo³•Institutions (3)

East China Normal University¹, Leiden University², University of Oviedo³

13 Apr 2008

TL;DR: This work introduces a novel paradigm for the construction of ramp schemes with strong multiplication that allows the secret to be chosen in an extension field, whereas the shares lie in a base field, which leads to a ramp scheme with strong multiplication from which protocols can be constructed for atomic secure multiplication with communication equal to a linear number of field elements in the size of the network.

Abstract: In this work we introduce a novel paradigm for the construction of ramp schemes with strong multiplication that allows the secret to be chosen in an extension field, whereas the shares lie in a base field. When applied to the setting of Shamir's scheme, for example, this leads to a ramp scheme with strong multiplication from which protocols can be constructed for atomic secure multiplication with communication equal to a linear number of field elements in the size of the network. This is also achieved by the results from Cramer, Damgaard and de Haan from EUROCRYPT 2007. However, our new ramp scheme has an improved privacy bound that is essentially optimal and leads to a significant mathematical simplification of the earlier results on atomic secure multiplication. As a result, by considering high degree rational points on algebraic curves, this can now be generalized to algebraic geometric ramp schemes with strong multiplication over a constant size field, which in turn leads to low communication atomic secure multiplication where the base field can now be taken constant, as opposed to earlier work.
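
The multiplication property that the abstract builds on can be illustrated with plain Shamir sharing. The sketch below is not the ramp scheme of the paper: it assumes an ordinary prime field (the modulus `P`, the helper names `share` and `reconstruct`, and the parameters are all chosen here for illustration). It shows that multiplying two parties' shares pointwise yields shares of the product on a polynomial of degree 2t, so 2t+1 shares recover it.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1, an assumed field modulus

def share(secret, t, n, rng):
    """Shamir sharing: evaluate a random degree-t polynomial at 1..n."""
    coeffs = [secret % P] + [rng.randrange(P) for _ in range(t)]
    return {
        i: sum(c * pow(i, e, P) for e, c in enumerate(coeffs)) % P
        for i in range(1, n + 1)
    }

def reconstruct(shares):
    """Lagrange interpolation at 0 over the field of size P."""
    total = 0
    for i, y_i in shares.items():
        num, den = 1, 1
        for j in shares:
            if j != i:
                num = num * (-j) % P
                den = den * (i - j) % P
        total = (total + y_i * num * pow(den, -1, P)) % P
    return total

rng = random.Random(0)
t, n = 2, 5                      # n = 2t + 1 parties
a, b = 1234, 5678
shares_a = share(a, t, n, rng)
shares_b = share(b, t, n, rng)

# Each party multiplies its two shares locally, with no communication.
product_shares = {i: shares_a[i] * shares_b[i] % P for i in shares_a}
recovered = reconstruct(product_shares)
```

Here `recovered` equals `a * b` modulo `P`, while any t+1 of the original shares suffice to reconstruct `a` or `b` alone.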

Proceedings Article•10.1109/ICPP.2008.62•

Improving the Performance of Multithreaded Sparse Matrix-Vector Multiplication Using Index and Value Compression

Kornilios Kourtis¹, Georgios Goumas¹, Nectarios Koziris¹•Institutions (1)

National and Kapodistrian University of Athens¹

9 Sep 2008

TL;DR: Evaluation of two compression schemes for the sparse matrix-vector multiplication kernel showed that they can significantly improve the performance of the multithreaded version of the kernel and achieve good scalability for large matrices.

Abstract: The sparse matrix-vector multiplication kernel exhibits limited potential for taking advantage of modern shared memory architectures due to its large memory bandwidth requirements. To decrease memory contention and improve the performance of the kernel we propose two compression schemes. The first, called CSR-DU, targets the reduction of the matrix structural data by applying coarse grain delta encoding for the column indices. The second scheme, called CSR-VI, targets the reduction of the numerical values using indirect indexing and can only be applied to matrices which contain a small number of unique values. Evaluation of both methods on a rich matrix set showed that they can significantly improve the performance of the multithreaded version of the kernel and achieve good scalability for large matrices.
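
The two ideas in the abstract, delta-encoded column indices and indirectly indexed values, can be sketched in a few lines. This is a simplified illustration, not the CSR-DU or CSR-VI storage formats themselves: the per-row delta encoding, the function names, and the example matrix are all assumptions made here.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Reference y = A @ x for a matrix stored in plain CSR."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

def delta_encode_columns(row_ptr, col_idx):
    """Within each row, store differences between consecutive column indices."""
    deltas = []
    for i in range(len(row_ptr) - 1):
        prev = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            deltas.append(col_idx[k] - prev)
            prev = col_idx[k]
    return deltas

def index_values(values):
    """Replace each value by an index into a table of unique values."""
    unique, position, val_idx = [], {}, []
    for v in values:
        if v not in position:
            position[v] = len(unique)
            unique.append(v)
        val_idx.append(position[v])
    return unique, val_idx

def compressed_spmv(row_ptr, col_delta, unique_vals, val_idx, x):
    """y = A @ x, decoding column deltas and value indices on the fly."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        col = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            col += col_delta[k]
            y[i] += unique_vals[val_idx[k]] * x[col]
    return y

# Example 3x4 matrix:
# [[0, 2, 2, 0],
#  [0, 5, 0, 0],
#  [2, 0, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [1, 2, 1, 0, 3]
values = [2.0, 2.0, 5.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0]

col_delta = delta_encode_columns(row_ptr, col_idx)
unique_vals, val_idx = index_values(values)
y_plain = csr_spmv(row_ptr, col_idx, values, x)
y_packed = compressed_spmv(row_ptr, col_delta, unique_vals, val_idx, x)
```

In a real implementation the deltas and value indices would be stored in narrower integer types than the originals, which is where the reduction in memory traffic comes from; Python lists do not show that saving.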

Journal Article•

An Interpretation of Robinson Arithmetic in its Grzegorczyk's Weaker Variant

Vítězslav Švejdar¹•Institutions (1)

Charles University in Prague¹

01 May 2008-Fundamenta Informaticae

TL;DR: It is shown that Q is interpretable in Q^-.

Abstract: Q^- is a weaker variant of Robinson arithmetic Q in which addition and multiplication are partial functions, i.e. ternary relations that are graphs of possibly non-total functions. We show that Q is interpretable in Q^-. This gives an alternative answer to a question of A. Grzegorczyk whether Q^- is essentially undecidable.

Proceedings Article•10.1109/IPDPS.2008.4536158•

Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments

Zizhong Chen¹•Institutions (1)

Jacksonville State University¹

14 Apr 2008

TL;DR: It is demonstrated that fail-stop process failures in the outer product version of matrix-matrix multiplication can be tolerated without checkpointing or message logging, based on a checksum relationship that is maintained in the middle of the computation.

Abstract: It has been proved in previous work on algorithm-based fault tolerance that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is used. However, whether this checksum relationship can be maintained in the middle of the computation or not remains open. In this paper, we first demonstrate that this checksum relationship is not maintained in the middle of the computation for most algorithms for matrix-matrix multiplication. We then prove that, however, for the outer product version of the matrix-matrix multiplication algorithm, this checksum relationship can be maintained in the middle of the computation. Based on this checksum relationship maintained in the middle of the computation, we demonstrate that fail-stop process failures (which are often tolerated by checkpointing or message logging) in the outer product version of matrix-matrix multiplication can be tolerated without checkpointing or message logging.
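
A small sketch can make the checksum relationship concrete. It is an illustration on tiny integer matrices, not the distributed algorithm of the paper, and the helper names are assumptions. The left matrix gets an extra row of column sums, the right matrix an extra column of row sums, and the product is accumulated as a sum of outer products; the partial result is checked after every step, and a lost row is rebuilt from the checksum row.

```python
def with_column_checksum(A):
    """Append a row holding the sum of each column."""
    return [row[:] for row in A] + [[sum(col) for col in zip(*A)]]

def with_row_checksum(B):
    """Append a column holding the sum of each row."""
    return [row[:] + [sum(row)] for row in B]

def is_full_checksum(C):
    """True if the last row and last column hold column and row sums."""
    body = C[:-1]
    rows_ok = all(sum(row[:-1]) == row[-1] for row in C)
    cols_ok = all(
        sum(r[j] for r in body) == C[-1][j] for j in range(len(C[0]))
    )
    return rows_ok and cols_ok

def outer_product_steps(Ac, Br):
    """Yield the partial product after each rank-1 (outer product) update."""
    m, n = len(Ac), len(Br[0])
    C = [[0] * n for _ in range(m)]
    for k in range(len(Br)):
        for i in range(m):
            for j in range(n):
                C[i][j] += Ac[i][k] * Br[k][j]
        yield [row[:] for row in C]

def recover_row(C, lost):
    """Rebuild one lost data row from the checksum row and surviving rows."""
    survivors = [r for r in range(len(C) - 1) if r != lost]
    return [C[-1][j] - sum(C[r][j] for r in survivors) for j in range(len(C[0]))]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
partials = list(outer_product_steps(with_column_checksum(A), with_row_checksum(B)))
checks = [is_full_checksum(C) for C in partials]
rebuilt = recover_row(partials[0], 0)   # pretend row 0 was lost after step 1
```

Every entry of `checks` is true because each rank-1 update is itself a full checksum matrix, so their running sum is too; that is what allows `recover_row` to work on an intermediate result.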

Proceedings Article•10.1109/NORCHP.2008.4738280•

An Exact Breadth-First Search Algorithm for the Multiple Constant Multiplications Problem

Levent Aksoy, Ece Olcay Gunes, Paulo Flores¹•Institutions (1)

INESC-ID¹

1 Nov 2008

TL;DR: This work proposes an exact algorithm based on the breadth-first search that finds the minimum number of operations solution of mid-size MCM instances in a reasonable time and is compared with the previously proposed efficient heuristics.

Abstract: This paper addresses the multiplication of one data sample with multiple constants using addition/subtraction and shift operations, i.e., the multiple constant multiplications (MCM) problem. The MCM problem finds itself and its variants in many applications, such as digital finite impulse response (FIR) filters, linear signal transforms, and computer arithmetic. Although many efficient algorithms have been proposed to implement the MCM using the fewest number of operations, due to the NP-hardness of the problem, they have been heuristics, i.e., they cannot guarantee the minimum solution. In this work, we propose an exact algorithm based on the breadth-first search that finds the minimum number of operations solution of mid-size MCM instances in a reasonable time. The proposed exact algorithm has been tested on a set of instances including FIR filter and randomly generated instances, and compared with the previously proposed efficient heuristics. It is observed from the experimental results that, even though the previously proposed heuristics obtain similar results with the minimum number of operations solutions, there are instances for which the exact algorithm finds better solutions than the prominent heuristics.
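
To make the MCM problem concrete, the sketch below runs a toy breadth-first search over sets of realized constants. It is not the exact algorithm of the paper: the state representation, the shift bound, and the bound on intermediate values are assumptions made here, so the result is minimal only within those bounds. Shifts are treated as free and each addition or subtraction counts as one operation.

```python
from itertools import product

def odd_part(v):
    """Strip factors of two, since shifts are treated as free."""
    while v and v % 2 == 0:
        v //= 2
    return v

def successors(realized, max_shift, limit):
    """Odd constants reachable from `realized` with one add or subtract."""
    found = set()
    for a, b in product(realized, repeat=2):
        for s in range(max_shift + 1):
            for v in ((a << s) + b, abs((a << s) - b)):
                v = odd_part(v)
                if 0 < v <= limit and v not in realized:
                    found.add(v)
    return found

def min_operations(targets):
    """Fewest add/subtract operations realizing all targets, level by level."""
    goal = {odd_part(abs(t)) for t in targets} - {0, 1}
    if not goal:
        return 0
    max_shift = max(goal).bit_length()   # assumed bound on shift amounts
    limit = 1 << (max_shift + 1)         # assumed bound on intermediate values
    frontier = {frozenset({1})}
    ops = 0
    while not any(goal <= state for state in frontier):
        frontier = {
            state | {v}
            for state in frontier
            for v in successors(state, max_shift, limit)
        }
        ops += 1
    return ops

# Sharing an intermediate result: 5x is reused to build 21x.
x = 3
five_x = (x << 2) + x                 # one addition
twenty_one_x = (five_x << 2) + x      # one more addition
shared_cost = min_operations([5, 21])
```

Computing 5x and 21x independently from their binary digits would take three additions; reusing 5x brings the total down to two, which is the kind of saving MCM algorithms look for.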

Patent•

System for computerized multiplayer tournament gaming and a method thereof

Idan Losica

4 Sep 2008

TL;DR: In this article, the authors present a computerized game system for multiplayer tournaments, a method of operating the system, a module to be used in conjunction with the system and a method for playing the game.

Abstract: The present invention discloses a computerized gaming system for multiplayer tournaments, a method of operating the system, a module to be used in conjunction with the system, and a method of playing the game. The method of operating the system comprises a) setting and accommodating a configuration comprising one or more predefined and/or configurable rules for multiplication of stakes and/or a winning pool, conditions thereof, one or more parameters of said multiplication, and one or more tournament limits; b) facilitating obtaining a multiplication proposal from a player; c) obtaining game-related and/or player-related data from the gaming server and/or the billing unit and/or the storage unit, and providing automated authorization of the proposal in accordance with said obtained data and said accommodated configuration; d) presenting the authorized proposal to one or more other players and facilitating obtaining one or more respective replies; e) automatically informing the game server and/or the billing unit with regard to the multiplication proposal and reply thereof; and f) automatically executing the game rules corresponding to the obtained replies.

Posted Content•

Analyzing the Galbraith-Lin-Scott Point Multiplication Method for Elliptic Curves over Binary Fields.

Darrel Hankerson¹, Koray Karabina², Alfred Menezes²•Institutions (2)

Auburn University¹, University of Waterloo²

01 Jan 2008-IACR Cryptology ePrint Archive

TL;DR: In this article, the authors investigate the potential benefits of using Galbraith-Lin-Scott elliptic curves in the case where q is a power of 2, and show that the method offers significant acceleration for curves over binary fields in both doubling- and halving-based approaches.

Abstract: Galbraith, Lin and Scott recently constructed efficiently-computable endomorphisms for a large family of elliptic curves defined over F_{q^2} and showed, in the case where q is prime, that the Gallant-Lambert-Vanstone point multiplication method for these curves is significantly faster than point multiplication for general elliptic curves over prime fields. In this paper, we investigate the potential benefits of using Galbraith-Lin-Scott elliptic curves in the case where q is a power of 2. The analysis differs from the q prime case because of several factors, including the availability of the point halving strategy for elliptic curves over binary fields. Our analysis and implementations show that Galbraith-Lin-Scott offers significant acceleration for curves over binary fields, in both doubling- and halving-based approaches. Experimentally, the acceleration surpasses that reported for prime fields (for the platform in common), a somewhat counterintuitive result given the relative costs of point addition and doubling in each case.

Patent•

High Speed and Efficient Matrix Multiplication Hardware Module

Yassir Salama¹, Assem Salama¹, Dennis Fitzgerald¹•Institutions (1)

Wilmington University¹

4 Jul 2008

TL;DR: A matrix multiplication module (10) and matrix multiplication method are provided that use a variable number of multiplier-accumulator (MAC) units (20), based on how many data elements of the matrices are available or needed for processing at a particular point or stage in the computation process.

Abstract: A matrix multiplication module (10) and matrix multiplication method are provided that use a variable number of multiplier-accumulator (MAC) units (20), based on how many data elements of the matrices are available or needed for processing at a particular point or stage in the computation process. As more data elements become available or are needed, more multiplier-accumulator units (20) are used to perform the necessary multiplication and addition operations. To multiply an NxM matrix by an MxN matrix, the total (maximum) number of MAC units used is 2*N-1. The number of MAC units used starts with one (1) and increases by two at each computation stage, that is, at the beginning of reading of data elements for each new row of the first matrix. The sequence of the number of MAC units is {1, 3, 5, ..., 2*N-1} for computation stages, each of which corresponds to reading of data elements for each new row of the left-hand matrix, also called the first matrix.
