TL;DR: In this paper, the APSP problem for weighted directed graphs with small integer edge weights is solved in O(n^{2+μ}) time, where μ satisfies the equation ω(1, μ, 1) = 1 + 2μ and ω(1, μ, 1) is the exponent of the multiplication of an n × n^μ matrix by an n^μ × n matrix.
Abstract: We present two new algorithms for solving the All Pairs Shortest Paths (APSP) problem for weighted directed graphs. Both algorithms use fast matrix multiplication algorithms. The first algorithm solves the APSP problem for weighted directed graphs in which the edge weights are integers of small absolute value in O(n^{2+μ}) time, where μ satisfies the equation ω(1, μ, 1) = 1 + 2μ and ω(1, μ, 1) is the exponent of the multiplication of an n × n^μ matrix by an n^μ × n matrix. Currently, the best available bounds on ω(1, μ, 1), obtained by Coppersmith, imply that μ 0 is an error parameter and W is the largest edge weight in the graph, after the edge weights are scaled so that the smallest non-zero edge weight in the graph is 1. It returns estimates of all the distances in the graph with a stretch of at most 1 + ϵ. Corresponding paths can also be found efficiently.
TL;DR: In this article, a scalable and unified architecture for a Montgomery multiplication module which operates in both types of finite fields, GF(p) and GF(2^m), is described. The authors exploit the concurrency in the Montgomery multiplication operation by employing a pipelining design methodology.
Abstract: We describe a scalable and unified architecture for a Montgomery multiplication module which operates in both types of finite fields, GF(p) and GF(2^m). The unified architecture requires only slightly more area than that of the multiplier architecture for the field GF(p). The multiplier is scalable, which means that a fixed-area multiplication module can handle operands of any size, and also, the wordsize can be selected based on the area and performance requirements. We utilize the concurrency in the Montgomery multiplication operation by employing a pipelining design methodology. The upper limit on the precision of the scalable and unified Montgomery multiplier is dictated only by the available memory to store the operands and internal results, and the module is capable of performing infinite-precision Montgomery multiplication in both types of finite fields.
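As a point of reference for the operation this architecture accelerates, the following is a minimal software sketch of bit-serial (radix-2) Montgomery multiplication over the integers modulo an odd n, that is, the GF(p) case only. The function name `mont_mul` and its parameters are illustrative assumptions and do not come from the paper; the word-level scalability and pipelining described in the abstract are not modeled.

```python
def mont_mul(a: int, b: int, n: int, k: int) -> int:
    """Return a * b * 2^(-k) mod n for odd n < 2^k and 0 <= a, b < n.

    Hypothetical helper: processes one bit of `a` per iteration.
    """
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b   # add b when bit i of a is set
        if s & 1:                 # adding the odd modulus makes s even
            s += n
        s >>= 1                   # exact division by 2
    # s stays below 2n throughout, so one conditional subtraction suffices
    return s - n if s >= n else s


# Usage: convert to Montgomery form, multiply, convert back.
n, k = 97, 7                      # 2^7 = 128 > 97
r2 = pow(2, 2 * k, n)             # (2^k)^2 mod n
a_m = mont_mul(45, r2, n, k)      # 45 * 2^k mod n
b_m = mont_mul(76, r2, n, k)      # 76 * 2^k mod n
p_m = mont_mul(a_m, b_m, n, k)    # (45 * 76) * 2^k mod n
product = mont_mul(p_m, 1, n, k)  # back to the ordinary representation
```

The loop uses only additions and one-bit shifts, which is what makes the operation attractive for hardware. A dual-field design would presumably replace the carry-propagating additions with XORs for GF(2^m); that variant is not shown here.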
TL;DR: For speeding up an elliptic scalar multiplication, this work developed a novel configuration of a multiplier over GF(2^m), which enables the multiplication of any bit length by using the data conversion method.
Abstract: We describe the implementation of an elliptic curve cryptographic (ECC) coprocessor over GF(2^m) on an FPGA and also the result of simulations evaluating its LSI implementation. This coprocessor is suitable for server systems that require efficient ECC operations for various parameters. For speeding up an elliptic scalar multiplication, we developed a novel configuration of a multiplier over GF(2^m), which enables the multiplication of any bit length by using our data conversion method. The FPGA implementation of the coprocessor with our multiplier, operating at 3 MHz, takes 80 ms for 163-bit elliptic scalar multiplication on a pseudo-random curve and takes 45 ms on a Koblitz curve. The 0.25 µm ASIC implementation of the coprocessor, operating at 66 MHz and having a hardware size of 165 Kgates, would take 1.1 ms for 163-bit elliptic scalar multiplication on a pseudo-random curve and would take 0.65 ms on a Koblitz curve.
TL;DR: This work covers making sense of numbers, counting and coming to know numbers, the move towards a system of symbols, addition and subtraction, the empty number line, multiplication and division, written calculations, and teaching approaches.
Abstract: 1. Making sense of numbers. 2. Counting and coming to know numbers. 3. Towards a system of symbols. 4. Addition and subtraction. 5. The empty number line. 6. Multiplication and division. 7. Written calculations. 8. Teaching approaches.
TL;DR: In this article, the authors describe the implementation of an elliptic curve cryptographic (ECC) coprocessor over GF(2^m) on an FPGA and also the result of simulations evaluating its LSI implementation.
Abstract: We describe the implementation of an elliptic curve cryptographic (ECC) coprocessor over GF(2^m) on an FPGA and also the result of simulations evaluating its LSI implementation. This coprocessor is suitable for server systems that require efficient ECC operations for various parameters. For speeding up an elliptic scalar multiplication, we developed a novel configuration of a multiplier over GF(2^m), which enables the multiplication of any bit length by using our data conversion method. The FPGA implementation of the coprocessor with our multiplier, operating at 3 MHz, takes 80 ms for 163-bit elliptic scalar multiplication on a pseudo-random curve and takes 45 ms on a Koblitz curve. The 0.25 µm ASIC implementation of the coprocessor, operating at 66 MHz and having a hardware size of 165 Kgates, would take 1.1 ms for 163-bit elliptic scalar multiplication on a pseudo-random curve and would take 0.65 ms on a Koblitz curve.
TL;DR: It is demonstrated that adults’ use of procedural strategies for simple addition is substantially influenced by retrieval interference.
Abstract: Simple addition (e.g., 3 + 2, 7 + 9) may be performed by direct memory retrieval or by such procedures as counting or transformation. The distribution of associations (DOA) model of strategy choice (Siegler, 1988) predicts that procedure use should increase as retrieval interference increases. To test this, 100 undergraduates performed simple addition problems, either after blocks of simple multiplication (high-interference context) or after blocks of simple division problems (low-interference context). Addition took longer and was more error prone after multiplication; in particular, there were more multiplication confusion errors on the relatively easy, small-number addition problems (e.g., 3 + 2 = 6, 4 + 3 = 12), but not on the more difficult, large-number additions. Consistent with the DOA, participants reported greater use of procedures for addition after multiplication, but more so for small addition problems. The findings demonstrate that adults’ use of procedural strategies for simple addition is substantially influenced by retrieval interference.
TL;DR: A division circuit and a square-root circuit are presented that extend traditional stochastic algebra and are able to process analog input signals with a simple and complete processing system.
Abstract: Among the different pulse codification techniques, stochastic pulse codification has its own arithmetic based on the similarity between Boolean algebra and statistical algebra. Summation and multiplication are the two basic arithmetic operations treated in depth in the literature. In this paper we present two digital stochastic circuits that extend traditional stochastic algebra: a division circuit and a square-root circuit, and the interfaces between the analog and stochastic domain. As a result, we are able to process analog input signals with a simple and complete processing system. These circuits can be implemented in low-cost and low-power digital programmable devices.
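To illustrate the link between Boolean and statistical algebra mentioned above, here is a minimal simulation of the standard stochastic multiplication: two independent random bit streams whose probability of a 1 encodes the operand are combined with an AND gate. This shows only the classical multiplication, not the division or square-root circuits of the paper; the function name, stream length, and seed are illustrative assumptions.

```python
import random


def stochastic_multiply(p_a: float, p_b: float,
                        n_bits: int = 100_000, seed: int = 0) -> float:
    """Estimate p_a * p_b by AND-ing two independent stochastic bit streams."""
    rng = random.Random(seed)
    ones = 0
    for _ in range(n_bits):
        bit_a = rng.random() < p_a   # unipolar encoding: P(bit = 1) = p_a
        bit_b = rng.random() < p_b   # drawn independently of bit_a
        ones += bit_a and bit_b      # AND gate: 1 only when both bits are 1
    return ones / n_bits             # fraction of ones approximates the product


estimate = stochastic_multiply(0.3, 0.5)  # should land near 0.15
```

The estimate is only approximate: its accuracy improves with the stream length, which is the usual trade-off of stochastic arithmetic between circuit simplicity and precision.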
TL;DR: A pipelined linear array of processor elements (PEs) for matrix computations in an efficient manner is presented in this article, where each PE includes arithmetic circuitry for performing multiply, combine and accumulate operations and a register file for storing inputs and outputs of the arithmetic circuitry.
Abstract: A pipelined linear array of processor elements (PEs) for performing matrix computations in an efficient manner. The linear array generally includes a head PE and a set of regular PEs, the head PE being a functional superset of the regular PE, with interconnections between nearest neighbor PEs in the array and a feedback path from a non-neighbor regular PE back to the head PE. Each PE includes arithmetic circuitry for performing multiply, combine and accumulate operations, and a register file for storing inputs and outputs of the arithmetic circuitry. The head PE further includes a non-linear function generator. Each PE is pipelined such that the latency for an arithmetic operation to complete is a multiple of the period with which new operations can be initiated. A Very Large Instruction Word (VLIW) program or other type of program may be used to control the array. The array is particularly efficient at performing complex matrix operations, such as, e.g., the solution of a set of linear equations, matrix inversion, matrix-matrix multiplication, and computation of covariance and cross correlation.
TL;DR: In this paper, an algorithm for GF(2^n) multiplication is proposed which alleviates the problem that look-up table-based algorithms either require a large memory space or do not fully utilize the resources of the processor on which the software is executed.
Abstract: Many cryptographic systems use multiplication in the finite field GF(2^n) for their underlying computations. In the recent past, a number of look-up table-based algorithms have been proposed for the software implementation of GF(2^n) multiplication. Look-up table-based algorithms can provide speed advantages, but they either require a large memory space or do not fully utilize the resources of the processor on which the software is executed. In this work, an algorithm for GF(2^n) multiplication is proposed which can alleviate this problem. In each iteration of the proposed algorithm, a group of bits of one of the input operands is examined and two look-up tables are accessed. The group size determines the table sizes, but does not affect the utilization of the processor resources. It can be used for both software and hardware realizations and is particularly suitable for implementations in memory-constrained environments, such as smart cards and embedded cryptosystems.
TL;DR: Fast finite field and elliptic curve (EC) algorithms are proposed for embedding cryptographic functions on high-performance devices on which most instructions take just one cycle.
Abstract: In this paper, we propose fast finite field and elliptic curve (EC) algorithms useful for embedding cryptographic functions on high-performance devices on which most instructions take just one cycle. In such a case, integer multiplications and additions have the same computational cost, so that computational cost analyses previously done in the traditional manner may be invalid, and in some cases new algorithms should be introduced for fast computation. In our implementation, the column-major method for field multiplication and the BP inversion algorithm are used for fast field arithmetic, and the mixed coordinates method is used for efficient EC exponentiation. We give here analyses of various algorithms that are useful for implementing EC exponentiation on a CalmRISC microcontroller with a MAC2424 coprocessor, as well as new exact analyses of the BP (Bailey-Paar) inversion algorithm and EC exponentiation. Using the techniques shown in this paper, we implemented EC exponentiation for various coordinate systems, and the best result took 122 ms, assuming a 50 ns clock cycle.
TL;DR: The authors present two systolic architectures to speed up the computation of modular multiplication in RSA cryptosystems: a double-layer architecture, and a non-interlaced architecture that eliminates the one-clock-cycle gap between iterations by pairing off the double-layer architecture.
Abstract: The authors present two systolic architectures to speed up the computation of modular multiplication in RSA cryptosystems. In the double-layer architecture, the main operation of Montgomery's algorithm is partitioned into two parallel operations after using the precomputation of the quotient bit. In the non-interlaced architecture, we eliminate the one-clock-cycle gap between iterations by pairing off the double-layer architecture. We compare our architectures with some previously proposed Montgomery-based systolic architectures, on the basis of both modular multiplication and modular exponentiation. The comparisons indicate that our architectures offer the highest speed, lower hardware complexity, and lower power consumption.
TL;DR: Investigates different architectures implementing bit-parallel constant-coefficient multiplication in FPGA structures, and a novel algorithm for the conversion from two's-complement to CSD representation is presented.
Abstract: Investigates different architectures implementing bit-parallel constant-coefficient multiplication in FPGA structures. First, multiplierless multiplication (MM) architectures employing canonic sign digit (CSD) and sub-structure sharing methods are addressed, and a novel algorithm for the conversion from two's-complement to CSD representation is presented. In the second part of this paper, lookup table-based multiplication (LM) is investigated. Correspondingly, the usage of different memory modules and finding the optimal combination of the memory and adders are considered. The LM architecture also considers reduction of the address width for each memory cell and the possibility of memory sub-structure sharing. Finally, implementation results for the Xilinx XC4000 and Virtex families are presented. As a result, MM generally surpasses the LM architecture. However, the actual choice between these two architectures is coefficient- and input parameter-dependent.
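For readers unfamiliar with CSD, the sketch below shows a textbook conversion from an integer to canonic signed-digit form (equivalent to the non-adjacent form) and its use for multiplierless multiplication with shifts, additions, and subtractions. This is the standard conversion, not the novel algorithm presented in the paper, and the names `to_csd` and `csd_multiply` are hypothetical.

```python
from typing import List


def to_csd(x: int) -> List[int]:
    """Return the CSD digits of x, least significant first, each in {-1, 0, 1}."""
    digits = []
    while x != 0:
        if x & 1:
            d = 2 - (x & 3)   # +1 if x = 1 (mod 4), -1 if x = 3 (mod 4)
            x -= d            # after this, x is divisible by 4
        else:
            d = 0
        digits.append(d)
        x >>= 1
    return digits


def csd_multiply(y: int, coeff: int) -> int:
    """Multiply y by a constant using only shifts, additions and subtractions."""
    return sum(d * (y << i) for i, d in enumerate(to_csd(coeff)))


digits_of_7 = to_csd(7)   # 7 = 8 - 1, so two nonzero digits instead of three
```

Because no two adjacent CSD digits are nonzero, a constant multiplier built this way needs at most about half as many adders as the coefficient has bits, which is why the representation suits multiplierless FPGA designs.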
TL;DR: In this paper, a dot product operator (30) uses adder trees (10) of L-1 adders and no multiplication circuits, where L is the length of the parallel dot product operator.
Abstract: A dot product operator (30) uses adder trees (10) of L-1 adders and no multiplication circuits, where L is the length of the parallel dot product operator. Exclusive-or gates (12) provide the function of multiplication by ±1, with the carry-in ports of adders (14, 16, 18, 20, 32, 34, 38, 44) being used to form the two's complement, resulting in an extremely efficient design in terms of area and power.
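The arithmetic trick described above can be sketched in software: negating a two's-complement word is the same as inverting its bits (an XOR with all ones) and adding 1, and that 1 can be supplied through an adder's carry-in. The sketch below is a simplified model that accumulates sequentially rather than in an adder tree; the 16-bit word width and the name `pm1_dot` are assumptions made for illustration.

```python
WIDTH = 16                    # assumed word width
MASK = (1 << WIDTH) - 1


def pm1_dot(samples, signs):
    """Dot product with coefficients in {+1, -1}, using no multiplications.

    signs[i] is 0 for a +1 coefficient and 1 for a -1 coefficient.
    """
    acc = 0
    for x, s in zip(samples, signs):
        flipped = (x & MASK) ^ (MASK if s else 0)  # XOR gates invert the bits when s = 1
        acc = (acc + flipped + s) & MASK           # s is fed in as the adder carry-in
    # reinterpret the WIDTH-bit result as a signed value
    return acc - (1 << WIDTH) if acc >> (WIDTH - 1) else acc


result = pm1_dot([3, -5, 7, 2], [0, 1, 1, 0])      # 3 + 5 - 7 + 2
```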
TL;DR: In this article, it was shown that if S is a commutative R-algebra and ψ: M → N is an R-module homomorphism, where M is a multiplication R-module and N is an S-module, then Sψ(M) is a multiplication S-module.
Abstract: Let R be a commutative ring with identity. An R-module (ideal of R) A is called a multiplication module (ideal) if for each submodule N of A there exists an ideal I of R with N = IA. We give several characterizations of multiplication modules. Using the method of idealization we show how to reduce questions concerning multiplication modules to multiplication ideals. For example, we show that if S is a commutative R-algebra and ψ: M → N an R-module homomorphism where M is a multiplication R-module and N is an S-module, then Sψ(M) is a multiplication S-module.
TL;DR: In this paper, an instruction decoder is provided which is responsive to a multiply instruction to control the multiplying circuit to generate a multiplication result for the computation M×N, where M and N are W bit data words.
Abstract: The present invention provides an apparatus and method for processing data using a multiplying circuit for performing a multiplication of a W/2 bit data value by a W bit data value. An instruction decoder is provided which is responsive to a multiply instruction to control the multiplying circuit to generate a multiplication result for the computation M×N, where M and N are W bit data words. The multiplying circuit is arranged to execute a first operation in which the data word N is multiplied by the most significant W/2 bits of the data word M to generate a first intermediate result having 3W/2 bits, and to then execute a second operation in which the data word N is multiplied by the least significant W/2 bits of the data word M to generate a second intermediate result having 3W/2 bits. The first intermediate result is shifted by W/2 with respect to the second intermediate result and added to the second intermediate result to generate the multiplication result. By performing the two parts of the multiplication in reverse order to the conventional approach, it has been found that the complexity of the circuitry can be reduced, and a reduction in power consumption can be achieved.
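The two-step decomposition described above can be checked with a short sketch. It assumes unsigned operands and an even word width W; the function name `split_multiply` is hypothetical, and the sketch models only the arithmetic, not the circuit or its power behavior.

```python
def split_multiply(m: int, n: int, w: int = 32) -> int:
    """Compute m * n for unsigned w-bit words using two (w/2 x w)-bit products."""
    half = w // 2
    m_hi = m >> half                  # most significant w/2 bits of M
    m_lo = m & ((1 << half) - 1)      # least significant w/2 bits of M
    first = n * m_hi                  # first operation: at most 3w/2 bits
    second = n * m_lo                 # second operation: at most 3w/2 bits
    return (first << half) + second   # shift the first result by w/2 and add


value = split_multiply(0xDEADBEEF, 0x12345678)
```

The identity behind it is M × N = (M_hi × N) · 2^(W/2) + M_lo × N, so the sum is exact regardless of the order in which the two partial products are formed.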
TL;DR: This paper addresses the issue of implementing matrix-matrix multiplication on heterogeneous platforms with a (polynomial) column-based heuristic, which turns out to be very satisfactory: the theoretical performance guarantee for the heuristic is derived, and its practical usefulness is assessed through MPI experiments.
Abstract: In this paper, we address the issue of implementing matrix-matrix multiplication on heterogeneous platforms. We target two different classes of heterogeneous computing resources: heterogeneous networks of workstations, and collections of heterogeneous clusters. Intuitively, the problem is to load balance the work with different-speed resources while minimizing the communication volume. We formally state this problem and prove its NP-completeness. Next we introduce a (polynomial) column-based heuristic, which turns out to be very satisfactory: we derive a theoretical performance guarantee for the heuristic, and we assess its practical usefulness through MPI experiments.
TL;DR: In this paper, the pseudo-linear principle is applied to solving nonlinear equations (ODEs, PDEs, difference equations, etc.) using pseudo-addition and pseudo-multiplication.
TL;DR: In this article, an implementation of a multi-dimensional Galois field multiplier and a method of Galois field multi-dimensional multiplication are disclosed, which support many communication standards having various symbol sizes, different GFs, and different primitive polynomials in a cost-efficient manner.
Abstract: An implementation of a multi-dimensional Galois field multiplier and a method of Galois field multi-dimensional multiplication which are able to support many communication standards having various symbol sizes, different GFs, and different primitive polynomials, in a cost-efficient manner is disclosed. The key to allow a single implementation to perform for all different GF sizes is to align the input data such that the Galois field symbols of the operands are aligned to the left most significant bit (MSB) position of the input data field. Similarly, the primitive polynomial used to create a selected Galois field is aligned to the left MSB position. A polynomial multiply is performed. The product polynomial is then conditionally divided by the primitive polynomial starting with the most significant bit, the condition being if the left most bit of the product is a 1. In other words, if the product polynomial has an MSB of 1, then divide the product with the primitive polynomial. Perform this step until the MSB is 0. In addition, for fields smaller than a maximum size Galois field, the sequence of conditional divisions is further conditioned with a predetermined mask in dependence upon the size of the GF. The resultant product is aligned to the left MSB.
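The core sequence above (polynomial multiply, then conditional division by the primitive polynomial from the most significant bit downward) can be sketched as follows. This is a simplified, right-aligned software model: the left-MSB alignment and the size-dependent mask that let one hardware unit serve several field sizes are not reproduced, and the name `gf_multiply` and its parameters are assumptions.

```python
def gf_multiply(a: int, b: int, poly: int, m: int) -> int:
    """Multiply a and b in GF(2^m); `poly` includes the x^m term."""
    # Polynomial (carry-less) multiply: XOR shifted copies of a.
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    # Conditional division: starting from the top bit, subtract (XOR) the
    # shifted polynomial whenever the current leading bit is 1.
    for bit in range(2 * m - 2, m - 1, -1):
        if (prod >> bit) & 1:
            prod ^= poly << (bit - m)
    return prod


# The same routine serves different field sizes and polynomials.
small = gf_multiply(0b0010, 0b1000, 0b10011, 4)   # GF(2^4), x^4 + x + 1
large = gf_multiply(0x80, 0x02, 0x11D, 8)         # GF(2^8), x^8 + x^4 + x^3 + x^2 + 1
```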
TL;DR: A method for the multiplication of an arbitrary vector by a symmetric centrosymmetric matrix, requiring (5/4)n^2 + O(n) floating-point operations, rather than the 2n^2 operations needed in the case of an arbitrary matrix.
TL;DR: In this paper, four students participated in a 2-year study (fifth and sixth grades) that examined the development of their understanding of multiplication of fractions and found that students consistently drew on their informal knowledge of partitioning to reconceptualize and partition units to solve problems involving multiplication of fractions in meaningful ways.
TL;DR: In this article, when the multiplication factor changes, a clock state control circuit stops the output of the clock to the outside, switches the clock to a clock other than those output by the PLL oscillation circuit, changes the multiplication factor in the PLL oscillation circuit, switches the clock back to the clock output by the PLL oscillation circuit after the PLL output clock is stabilized, and restarts output of the clock to the outside.
Abstract: In a clock control circuit, a multiplication factor setting unit outputs a multiplication factor. A buffer circuit holds a previous multiplication factor and the multiplication factor output by the multiplication factor setting unit and compares the two multiplication factors. When the multiplication factors are different from each other, a clock state control circuit provides control to stop the output of the clock to the outside, switch the clock to a clock other than those output by the PLL oscillation circuit, change the multiplication factor in the PLL oscillation circuit, switch the clock to the clock output by the PLL oscillation circuit after the PLL output clock is stabilized, and restart output of the clock to the outside.
TL;DR: Theoretical analysis demonstrates that the keystream sequences generated by SSC2 have long period, large linear complexity, and good statistical distribution.
Abstract: SSC2 is a fast software stream cipher designed for wireless handsets with limited computational capabilities. It supports various private key sizes from 4 bytes to 16 bytes. All operations in SSC2 are word-oriented; no complex operations such as multiplication, division, and exponentiation are involved. SSC2 has a very compact structure that makes it easy to implement on 8-, 16-, and 32-bit processors. Theoretical analysis demonstrates that the keystream sequences generated by SSC2 have long period, large linear complexity, and good statistical distribution.
TL;DR: An internally analog, externally digital architecture for matrix-vector multiplication is presented, which allows for high data throughput and minimal latency, and is tailored for high-density and low power VLSI implementation.
Abstract: An internally analog, externally digital architecture for matrix-vector multiplication is presented. Fully parallel processing allows for high data throughput and minimal latency. The analog architecture incorporates an array of charge-mode analog computational cells with dynamic storage and row-parallel flash analog-to-digital converters (ADC). Each of the cells includes a dynamic storage element and a charge injection device computing binary inner product of two arguments. The matrix elements are stored in the array of computational cells in bit-parallel fashion, and the input vector is presented bit-serially. Digital post-processing is then performed on the ADC outputs to construct the resulting product with precision higher than that of each conversion. The analog architecture is tailored for high-density and low power VLSI implementation, and matrix dimensions of 128 × 512 and ADC resolution of 6 bits for an overall resolution in excess of 8 bits are feasible on a 3 mm × 3 mm chip in standard CMOS 0.5 µm technology.
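The data flow described above (matrix bits stored in parallel, input vector bits presented serially, binary inner products recombined digitally) can be modeled with exact integer arithmetic. The sketch assumes unsigned operands and an ideal converter, so the effect of the 6-bit ADC is not modeled; `bitplane_matvec` and its parameters are hypothetical names.

```python
def bitplane_matvec(matrix, vector, w_bits, x_bits):
    """Matrix-vector product assembled from binary inner products."""
    result = [0] * len(matrix)
    for j in range(x_bits):                        # input vector, one bit at a time
        x_plane = [(x >> j) & 1 for x in vector]
        for i in range(w_bits):                    # matrix bit planes (parallel in hardware)
            for r, row in enumerate(matrix):
                # binary inner product: count positions where both bits are 1
                count = sum(((w >> i) & 1) & xb for w, xb in zip(row, x_plane))
                result[r] += count << (i + j)      # post-processing: weight by 2^(i+j)
    return result


y = bitplane_matvec([[3, 5], [7, 1]], [2, 6], w_bits=3, x_bits=3)
```

Each `count` plays the role of one analog row output before conversion; summing the weighted counts is what lets the final result carry more precision than any single conversion.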
TL;DR: In this article, a new multiplication region structure is described that achieves very low multiplication noise by impact ionization engineering (I^2E), which utilizes heterojunctions to achieve greater localization of impact ionization than spatially uniform structures.
Abstract: The avalanche photodiode (APD) is frequently the photodetector of choice for high-bit-rate, long-haul fiber optic communications, owing to its internal gain, which provides a sensitivity margin compared to PIN photodiodes. Since the multiplication region of an APD plays a critical role in determining the gain, the multiplication noise, and the gain-bandwidth product, numerous research programs have focused on optimizing the multiplication region in order to improve the APD performance. We describe a new multiplication region structure that achieves very low multiplication noise by impact ionization engineering (I^2E), which utilizes heterojunctions to achieve greater localization of impact ionization than spatially uniform structures. By placing thin layers with relatively low threshold energy (multiplication layer) on each side of a region with higher ionization coefficients (the separation layer), impact ionization is enhanced at the edges in the twin multiplication layers and is suppressed in the center, where the carriers are energized in transit.
TL;DR: This paper describes how long-wordlength (1024-bit) modular exponentiation may be implemented on a standard 32-bit microprocessor core with a total execution time of under 1 second.
Abstract: This paper describes how long-wordlength (1024-bit) modular exponentiation may be implemented on a standard 32-bit microprocessor core with a total execution time of under 1 second. The design does not use a long-wordlength arithmetic co-processor. Instead all arithmetic operations are reduced to 32-bit additions, subtractions and binary shifts, and the processor is augmented with a small hardware enhancement to significantly accelerate accumulation of shifted multi-precision numbers. Target performance is achieved by trading fast arithmetic hardware for extra RAM, to facilitate pre-computation of digit multiples and powers. Signed sliding window algorithms are introduced for exponentiation, multiplication and reduction operations, and attention is paid to the integration of enhanced security features such as blinding and verification.
TL;DR: A combinatorial complex multiplier has been designed for use in a pipelined fast Fourier transform processor and a new architecture based on distributed arithmetic, Wallace-trees, and carry-lookahead adders has been developed.
Abstract: A combinatorial complex multiplier has been designed for use in a pipelined fast Fourier transform processor. The performance in terms of throughput of the processor is limited by the multiplication. Therefore, the multiplier is optimized to make the input-to-output delay as short as possible. A new architecture based on distributed arithmetic, Wallace-trees, and carry-lookahead adders has been developed. The multiplier has been fabricated using standard cells in a 0.5-µm process and verified for functionality, speed, and power consumption. Running at 40 MHz, a multiplier with input wordlengths of 16+16 times 10+10 bits consumes 54% less power compared to a distributed arithmetic array multiplier fabricated under equal conditions.
TL;DR: In this paper, a scheme for arithmetic operations in finite fields and group operations over elliptic curves capable of realizing a very fast implementation is proposed, in which the multiplicative inverse calculation and the multiplication in the finite field GF(2^{2n}) are realized as combinations of multiplications, additions, and a multiplicative inverse calculation in the subfield GF(2^n).
Abstract: A scheme for arithmetic operations in finite field and group operations over elliptic curves capable of realizing a very fast implementation. According to this scheme, by using a normal basis [α α+1], the multiplicative inverse calculation and the multiplication in the finite field GF(2^{2n}) can be realized as combinations of multiplications, additions and a multiplicative inverse calculation in the subfield GF(2^n). Also, by using a standard basis [1 α], the multiplication, the square calculation, and the multiplicative inverse calculation in the finite field GF(2^{2n}) can be realized as combinations of multiplications, additions and a multiplicative inverse calculation in the subfield GF(2^n). These arithmetic operations can be utilized for calculating rational expressions expressing group operations over elliptic curves that are used in information security techniques such as elliptic curve cryptosystems.
TL;DR: In this paper, different architectures for real-time image convolution with constant coefficients are considered. The choice between these architectures depends on the given coefficient values; however, in most cases multiplierless convolution (MC) is preferable.
Abstract: In this paper, different architectures for real-time image convolution with constant coefficients are considered. Accordingly, look-up-table (LUT) based multiplication/convolution, LUT based distributed arithmetic (DA) convolution and multiplierless convolution (MC) implementations in FPGA structures have been investigated. As one result, the choice between these architectures depends on the given coefficient values; however, in most cases MC is preferable. Furthermore, the change of coefficient values in real-time systems is also considered. This work is a contribution to worldwide intense research on developing reconfigurable and user dedicated custom computing machines (CCM).
TL;DR: In this article, the modular exponentiation function used in public key encryption and decryption systems is implemented in a standalone engine having at its core modular multiplication circuits which operate in two phases which share overlapping hardware structures.
Abstract: The modular exponentiation function used in public key encryption and decryption systems is implemented in a standalone engine having at its core modular multiplication circuits which operate in two phases which share overlapping hardware structures. The partitioning of large arrays in the hardware structure, for multiplication and addition, into smaller structures results in a multiplier design comprising a series of nearly identical processing elements linked together in a chained fashion. As a result of the two-phase operation and the chaining together of partitioned processing elements, the overall structure is operable in a pipelined fashion to improve throughput and speed. The chained processing elements are constructed so as to provide a partitionable chain with separate parts for processing factors of the modulus. In this mode, the system is particularly useful for exploiting characteristics of the Chinese Remainder Theorem to perform rapid exponentiation operations. A checksum mechanism is also provided to insure accurate operation without impacting speed and without significantly increasing complexity. While the present disclosure is directed to a complex system which includes a number of features, the present application is particularly directed to the structure and linking of a plurality of almost identical processing elements.
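The benefit of processing the factors of the modulus separately can be illustrated with a small sketch of exponentiation via the Chinese Remainder Theorem, recombined with Garner's formula. It shows only the arithmetic idea, not the hardware chain; the name `crt_modexp` is hypothetical, the primes are toy values, and `pow(q, -1, p)` assumes Python 3.8 or later.

```python
def crt_modexp(base: int, exp: int, p: int, q: int) -> int:
    """Compute base^exp mod (p*q) for distinct primes p and q via the CRT."""
    m_p = pow(base, exp, p)           # half-size exponentiation modulo p
    m_q = pow(base, exp, q)           # half-size exponentiation modulo q
    q_inv = pow(q, -1, p)             # inverse of q modulo p (Python 3.8+)
    h = (q_inv * (m_p - m_q)) % p     # Garner recombination
    return m_q + h * q


result = crt_modexp(65, 2753, 61, 53)  # toy parameters, modulus 61 * 53 = 3233
```

The two smaller exponentiations are independent of each other, which is what makes it natural to run them on separate parts of a partitioned processing chain.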
TL;DR: To the best of the authors' knowledge, this design is the first published one that deals with detecting exceptions and trapped overflow and underflow exceptions as an integral part of the rounding unit in a floating-point unit.
Abstract: Engineering design methodology recommends designing a system as follows: Start with an unambiguous specification, partition the system into blocks, specify the functionality of each block, design each block separately, and glue the blocks together. Verifying the correctness of an implementation then reduces to a local verification procedure. We apply this methodology for designing a provably correct IEEE rounding unit that can be used for various operations, such as addition and multiplication. First, we provide a mathematical and, hopefully, unambiguous definition of the IEEE Standard which specifies the functionality. We give explicit and concise rules for gluing the rounding unit with a floating-point adder and multiplier. We then present floating-point addition and multiplication algorithms that use the rounding unit. To the best of our knowledge, our design is the first publication that deals with detecting exceptions and trapped overflow and underflow exceptions as an integral part of the rounding unit in a floating point unit. Our abstraction level avoids bit-level representations and arguments to help clarify the functionality of the algorithm.