TL;DR: A simplified model of attention is proposed which is applicable to feed-forward neural networks and can solve the synthetic "addition" and "multiplication" long-term memory problems for sequence lengths which are both longer and more widely varying than the best published results for these tasks.
Abstract: We propose a simplified model of attention which is applicable to feed-forward neural networks and demonstrate that the resulting model can solve the synthetic "addition" and "multiplication" long-term memory problems for sequence lengths which are both longer and more widely varying than the best published results for these tasks.
TL;DR: This article addresses computation with Z-numbers, the main critical problem that naturally arises in processing Z-number-based information; the Z-number is described as a more adequate concept for describing real-world information.
TL;DR: In this article, a graph-induced multilinear encoding scheme from lattices was proposed, in which the arithmetic operations that are allowed are restricted through an explicitly defined directed graph (somewhat similar to the asymmetric variant of previous schemes).
Abstract: Graded multilinear encodings have found extensive applications in cryptography ranging from non-interactive key exchange protocols, to broadcast and attribute-based encryption, and even to software obfuscation. Despite seemingly unlimited applicability, essentially only two candidate constructions are known (GGH and CLT). In this work, we describe a new graph-induced multilinear encoding scheme from lattices. In a graph-induced multilinear encoding scheme the arithmetic operations that are allowed are restricted through an explicitly defined directed graph (somewhat similar to the “asymmetric variant” of previous schemes). Our construction encodes Learning With Errors (LWE) samples in short square matrices of higher dimensions. Addition and multiplication of the encodings correspond naturally to addition and multiplication of the LWE secrets. Security of the new scheme is not known to follow from LWE hardness (or any other “nice” assumption); at present it requires making new hardness assumptions.
TL;DR: In this article, it is shown that the conjecture that there is no truly subcubic (O(n^{3-ε})) time algorithm for the online Boolean matrix-vector multiplication problem implies tight hardness results for many dynamic problems.
Abstract: Consider the following Online Boolean Matrix-Vector Multiplication problem: We are given an n x n matrix M and will receive n column-vectors of size n, denoted by v_1, ..., v_n, one by one. After seeing each vector v_i, we have to output the product Mv_i before we can see the next vector. A naive algorithm can solve this problem using O(n^3) time in total, and its running time can be slightly improved to O(n^3/log^2 n) [Williams SODA'07]. We show that a conjecture that there is no truly subcubic (O(n^{3-ε})) time algorithm for this problem can be used to exhibit the underlying polynomial time hardness shared by many dynamic problems. For a number of problems, such as subgraph connectivity, Pagh's problem, d-failure connectivity, decremental single-source shortest paths, and decremental transitive closure, this conjecture implies tight hardness results. Thus, proving or disproving this conjecture will be very interesting as it will either imply several tight unconditional lower bounds or break through a common barrier that blocks progress with these problems. This conjecture might also be considered as strong evidence against any further improvement for these problems since refuting it will imply a major breakthrough for combinatorial Boolean matrix multiplication and other long-standing problems if the term "combinatorial algorithms" is interpreted as "Strassen-like algorithms" [Ballard et al. SPAA'11]. The conjecture also leads to hardness results for problems that were previously based on diverse problems and conjectures -- such as 3SUM, combinatorial Boolean matrix multiplication, triangle detection, and multiphase -- thus providing a uniform way to prove polynomial hardness results for dynamic algorithms; some of the new proofs are also simpler or even become trivial. The conjecture also leads to stronger and new, non-trivial, hardness results, e.g., for the fully-dynamic densest subgraph and diameter problems.
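To make the problem statement concrete, the following is a minimal Python sketch of the naive cubic-time approach mentioned in the abstract: each vector is answered as soon as it arrives, before the next one is read. The function name `online_boolean_mv` and the small example matrix are illustrative assumptions, not taken from the paper.

```python
def online_boolean_mv(matrix, vectors):
    """Yield the Boolean product M*v for each vector as it arrives."""
    n = len(matrix)
    for v in vectors:
        # Entry i is the OR over j of (M[i][j] AND v[j]); this costs O(n^2)
        # per vector, hence O(n^3) for n vectors.
        yield [any(matrix[i][j] and v[j] for j in range(n)) for i in range(n)]


# Hypothetical 3 x 3 instance, with 0/1 entries standing in for Booleans.
M = [[1, 0, 0],
     [0, 1, 1],
     [0, 0, 0]]
stream = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
results = [[int(x) for x in row] for row in online_boolean_mv(M, stream)]
```

Using a generator mirrors the online constraint: the caller only obtains the next output after the previous vector has been fully processed.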
TL;DR: The implementation is fully general and the optimization strategy adaptively processes the SpGEMM workload row-wise to substantially improve performance by decreasing the work complexity and utilizing the memory hierarchy more effectively.
Abstract: Sparse matrix--matrix multiplication (SpGEMM) is a key operation in numerous areas from information to the physical sciences. Implementing SpGEMM efficiently on throughput-oriented processors, such as the graphics processing unit (GPU), requires the programmer to expose substantial fine-grained parallelism while conserving the limited off-chip memory bandwidth. Balancing these concerns, we decompose the SpGEMM operation into three highly parallel phases: expansion, sorting, and contraction, and introduce a set of complementary bandwidth-saving performance optimizations. Our implementation is fully general and our optimization strategy adaptively processes the SpGEMM workload row-wise to substantially improve performance by decreasing the work complexity and utilizing the memory hierarchy more effectively.
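The three phases named in the abstract (expansion, sorting, contraction) can be sketched sequentially as follows. This is only an illustration of the decomposition under assumed conventions: matrices are lists of `(row, col, value)` triples, the name `spgemm_esc` is hypothetical, and none of the GPU parallelism or bandwidth optimizations of the paper are modeled.

```python
from itertools import groupby


def spgemm_esc(a, b):
    """Multiply sparse matrices given as lists of (row, col, value) triples."""
    # Index B by row so each nonzero A[i][k] can find the entries of row k of B.
    b_rows = {}
    for k, j, val in b:
        b_rows.setdefault(k, []).append((j, val))

    # Expansion: emit one partial product per matching pair of nonzeros.
    expanded = [(i, j, a_val * b_val)
                for i, k, a_val in a
                for j, b_val in b_rows.get(k, [])]

    # Sorting: bring partial products with the same (row, col) next to each other.
    expanded.sort(key=lambda t: (t[0], t[1]))

    # Contraction: sum each run of duplicates into a single output entry.
    return [(i, j, sum(v for _, _, v in group))
            for (i, j), group in groupby(expanded, key=lambda t: (t[0], t[1]))]


# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]] in triple form.
A = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]
B = [(0, 0, 4), (1, 0, 5), (1, 1, 6)]
C = spgemm_esc(A, B)
```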
TL;DR: In this article, the authors present the first implementation of the 3D SpGEMM formulation that exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency.
Abstract: Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdos-Renyi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first ever implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.
TL;DR: In this paper, it is shown that it is possible and indeed feasible to use secure multiparty computation (SMC) for calculating the probability of a collision between two satellites, using basic floating point arithmetic operators (addition and multiplication) for multiparty computations.
Abstract: In this paper, we show that it is possible and, indeed, feasible to use secure multiparty computation (SMC) for calculating the probability of a collision between two satellites. For this purpose, we first describe basic floating point arithmetic operators (addition and multiplication) for multiparty computations. The operators are implemented on the Sharemind SMC engine. We discuss the implementation details, provide methods for evaluating example elementary functions (inverse, square root, exponentiation of e, error function). Using these primitives, we implement a satellite conjunction analysis algorithm and give benchmark results for the primitives as well as the conjunction analysis itself.
TL;DR: An algorithm for general sparse matrix-matrix multiplication (SpGEMM) on many-core architectures, such as GPUs, is presented, implemented by iterative row merging, similar to merge sort.
Abstract: We present an algorithm for general sparse matrix-matrix multiplication (SpGEMM) on many-core architectures, such as GPUs. SpGEMM is implemented by iterative row merging, similar to merge sort, exc...
TL;DR: In this paper, the authors provide apparatuses and methods for performing multi-variable bit-length multiplication operations in a memory array, including AND operations, OR operations and shift operations without transferring data via an input/output (IO) line.
Abstract: Examples of the present disclosure provide apparatuses and methods for performing multi-variable bit-length multiplication operations in a memory. An example method comprises performing a multiplication operation on a first vector and a second vector. The first vector includes a number of first elements stored in a group of memory cells coupled to a first access line and a number of sense lines of a memory array. The second vector includes a number of second elements stored in a group of memory cells coupled to a second access line and the number of sense lines of the memory array. The example multiplication operation can include performing a number of AND operations, OR operations and SHIFT operations without transferring data via an input/output (I/O) line.
TL;DR: In this article, a server module evaluates a circuit based on concealed inputs provided by respective participant modules, to provide a concealed output, so that no party to the transaction (including the server module) discovers any other party's non-concealed inputs.
Abstract: A server module evaluates a circuit based on concealed inputs provided by respective participant modules, to provide a concealed output. By virtue of this approach, no party to the transaction (including the server module) discovers any other party's non-concealed inputs. In a first implementation, the server module evaluates a garbled Boolean circuit. This implementation also uses a three-way oblivious transfer technique to provide a concealed input from one of the participant modules to the server module. In a second implementation, the server module evaluates an arithmetic circuit based on ciphertexts that have been produced using a fully homomorphic encryption technique. This implementation modifies multiplication operations that are performed in the evaluation of the arithmetic circuit by a modifier factor; this removes bounds placed on the number of the multiplication operations that can be performed.
TL;DR: A new and efficient Montgomery modular multiplication architecture based on a new digit serial computation that relaxes the high-radix partial multiplication to a binary multiplication and performs several multiplications of consecutive zero bits in one clock cycle instead of several clock cycles is presented.
Abstract: Modular exponentiation with a large modulus and exponent is a fundamental operation in many public-key cryptosystems. This operation is usually accomplished by repeating modular multiplications. Montgomery modular multiplication has been widely used to relax the quotient determination. The carry–save adder has been employed to reduce the critical path. This paper presents and evaluates a new and efficient Montgomery modular multiplication architecture based on a new digit serial computation. The proposed architecture relaxes the high-radix partial multiplication to a binary multiplication. It also performs several multiplications of consecutive zero bits in one clock cycle instead of several clock cycles. Moreover, the right-to-left and left-to-right modular exponentiation architectures have been modified to use the proposed modular multiplication architecture as their structural unit. We provide the implementation results on a Xilinx Virtex 5 FPGA demonstrating that the total computation time and throughput rate of the proposed architectures outperform most results so far in the literature.
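For readers unfamiliar with Montgomery multiplication, the sketch below shows the textbook word-level reduction (REDC) in Python, where the quotient is replaced by masks and shifts with respect to R = 2^bits. It is not the digit-serial hardware architecture of the paper; the function names and the small modulus 97 are assumptions chosen for illustration, and the three-argument `pow` with exponent -1 requires Python 3.8 or later.

```python
def montgomery_setup(modulus, bits):
    """Precompute constants for R = 2**bits; modulus must be odd and below R."""
    r = 1 << bits
    # n_prime satisfies modulus * n_prime == -1 (mod R).
    n_prime = (-pow(modulus, -1, r)) % r
    return r, n_prime


def montgomery_multiply(a_bar, b_bar, modulus, bits, n_prime):
    """Return a_bar * b_bar * R^-1 mod modulus without dividing by modulus."""
    mask = (1 << bits) - 1
    t = a_bar * b_bar
    m = ((t & mask) * n_prime) & mask   # makes t + m*modulus divisible by R
    u = (t + m * modulus) >> bits       # exact division by R
    return u - modulus if u >= modulus else u


N, BITS = 97, 8                          # hypothetical toy parameters
R, N_PRIME = montgomery_setup(N, BITS)
x, y = 42, 77
x_bar, y_bar = (x * R) % N, (y * R) % N  # convert operands to Montgomery form
prod_bar = montgomery_multiply(x_bar, y_bar, N, BITS, N_PRIME)
product = montgomery_multiply(prod_bar, 1, N, BITS, N_PRIME)  # convert back
```

In a modular exponentiation, operands stay in Montgomery form across all the repeated multiplications, so the two conversions are paid only once.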
TL;DR: In this paper, a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor is proposed, where the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector.
Abstract: Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect result. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms. The source code of this work is downloadable at this https URL
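As background for the CSR format and the segmented-sum view of SpMV, here is a minimal sequential Python sketch. It is a plain reference computation, not the speculative CPU-GPU algorithm of the paper; the function name `csr_spmv` and the example matrix are illustrative assumptions.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    # Step 1: one product per stored nonzero, in storage order.
    products = [v * x[c] for v, c in zip(values, col_idx)]
    # Step 2: segmented sum, where row r owns products[row_ptr[r]:row_ptr[r + 1]].
    return [sum(products[row_ptr[r]:row_ptr[r + 1]])
            for r in range(len(row_ptr) - 1)]


# CSR encoding of [[1, 0, 2], [0, 0, 0], [0, 3, 4]]; the middle row is empty.
row_ptr = [0, 2, 2, 4]
col_idx = [0, 2, 1, 2]
values = [1, 2, 3, 4]
y = csr_spmv(row_ptr, col_idx, values, [1, 2, 3])
```

Empty rows, such as the middle one here, show up as empty segments and contribute a zero to the result.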
TL;DR: This paper proposes efficient and high speed architectures to implement point multiplication on binary Edwards and generalized Hessian curves and employs a newly proposed digit-level hybrid-double Gaussian normal basis multiplier to reduce the latency of point multiplication.
Abstract: High-performance and fast implementation of point multiplication is crucial for elliptic curve cryptographic systems. Recently, considerable research has investigated the implementation of point multiplication on different curves over binary extension fields. In this paper, we propose efficient and high speed architectures to implement point multiplication on binary Edwards and generalized Hessian curves. We perform a data-flow analysis and investigate the maximum number of parallel multipliers to be employed to reduce the latency of point multiplication on these curves. Then, we modify the addition and doubling formulations and employ a newly proposed digit-level hybrid-double Gaussian normal basis multiplier to remove the data dependencies and hence reduce the latency of point multiplication. To the best of our knowledge, this is the first time that one employs hybrid-double multiplication technique to reduce the computation time of point multiplication. Moreover, we have implemented our proposed architectures for point multiplication on FPGA and obtained the results of timing and area. Our results indicate that the proposed scheme is one step forward to improve the performance of point multiplication on binary Edwards and generalized Hessian curves.
TL;DR: In this paper, a carefully optimized implementation of a ring-LWE encryption scheme for 8-bit AVR processors like the ATxmega128 is presented, which needs 590 k, 672 k, and 276 k clock cycles for key generation, encryption, and decryption, respectively, at medium-term security.
Abstract: Public-key cryptography based on the “ring-variant” of the Learning with Errors (ring-LWE) problem is both efficient and believed to remain secure in a post-quantum world. In this paper, we introduce a carefully-optimized implementation of a ring-LWE encryption scheme for 8-bit AVR processors like the ATxmega128. Our research contributions include several optimizations for the Number Theoretic Transform (NTT) used for polynomial multiplication. More concretely, we describe the Move-and-Add (MA) and the Shift-Add-Multiply-Subtract-Subtract (SAMS2) techniques to speed up the performance-critical multiplication and modular reduction of coefficients, respectively. We take advantage of incompletely-reduced intermediate results to minimize the total number of reduction operations and use a special coefficient-storage method to decrease the RAM footprint of NTT multiplications. In addition, we propose a byte-wise scanning strategy to improve the performance of a discrete Gaussian sampler based on the Knuth-Yao random walk algorithm. For medium-term security, our ring-LWE implementation needs 590 k, 672 k, and 276 k clock cycles for key-generation, encryption, and decryption, respectively. On the other hand, for long-term security, the execution times of key-generation, encryption, and decryption amount to 2.2 M, 2.6 M, and 686 k cycles, respectively. These results set new speed records for ring-LWE encryption on an 8-bit processor and outperform related RSA and ECC implementations by an order of magnitude.
TL;DR: The architecture and the implementation of a high-performance scalable elliptic curve cryptography processor (ECP) that can support all five NIST recommended prime curves without the need to reconfigure the hardware is presented.
Abstract: The architecture and the implementation of a high-performance scalable elliptic curve cryptography processor (ECP) are presented. The proposed ECP is able to support all five prime field elliptic curves recommended by the National Institute of Standards and Technology (NIST). The design takes advantage of the high-performance capabilities of the DSP48E slices available in Xilinx field-programmable gate arrays (FPGAs) to achieve high speed and low hardware resource utilization. The proposed design parallelizes the underlying prime field operations to reduce the latency of the elliptic curve point multiplication (ECPM) operation. Prime field inversion is performed efficiently using the same arithmetic blocks as the ones used for prime field multiplication and addition/subtraction. To the best of the authors' knowledge, the proposed scalable ECP is the fastest and smallest ECP that can support all five NIST recommended prime curves without the need to reconfigure the hardware. It can compute the ECPM between 1.709 and 28.04 ms using a Xilinx Virtex-5 FPGA.
TL;DR: It is shown that the HRR operations of addition, multiplication, and permutation can be realized with sparse vectors, making an energy-efficient implementation possible and proposing a processor that has both data and instructions embedded in the same high-dimensional vector.
Abstract: Computing with high-dimensional vectors in a manner that resembles computing with numbers is based on Plate's Holographic Reduced Representation (HRR) and is used to model human cognition. Here we examine its hardware realization under constraints suggested by the properties of the brain's circuits. The sparseness of neural firing suggests that the vectors should be sparse. We show that the HRR operations of addition, multiplication, and permutation can be realized with sparse vectors, making an energy-efficient implementation possible. Furthermore, we propose a processor that has both data and instructions embedded in the same high-dimensional vector. The operation is highlighted with a sequence memory example.
TL;DR: This paper proposes a distributed solution named Distributed Kernelized ELM (DK-ELM), which realizes an implementation of ELM with kernels on MapReduce, and experimental results show that DK-ELM has good scalability for massive learning applications.
TL;DR: This paper presents a new design paradigm for programmable FIR filters that exploits the extended double base number system (EDBNS): owing to its sparsity and innate abstraction of the sum of binary shifted partial products, the sharing of adders can be maximized by a direct mapping from the quasi-minimum EDBNS.
Abstract: Coefficient multipliers are the stumbling blocks in programmable finite impulse response (FIR) digital filters. As the filter coefficients change either dynamically or periodically, the search for common subexpressions for multiplierless implementation needs to be performed over the entire gamut of integers of the desired precision, and the amount of shifts associated with each identified common subexpression needs to be memorized. The complexity of a quality search is thus beyond the existing design algorithms based on conventional binary and signed digit representations. This paper presents a new design paradigm for the programmable FIR filters by exploiting the extended double base number system (EDBNS). Due to its sparsity and innate abstraction of the sum of binary shifted partial products, the sharing of adders in the time-multiplexed multiple constant multiplication block of the programmable FIR filters can be maximized by a direct mapping from the quasi-minimum EDBNS. The multiplexing cost can be further reduced by merging double base terms. Logic synthesis results on more than one hundred programmable filters with filter taps ranging from 10 to 100 and coefficient word lengths of 8, 12, and 16 bits show that the average logic complexity and critical path delay of the programmable FIR filters designed by our proposed algorithm have been reduced by up to 47.81% and 14.32%, respectively over the existing design methods.
TL;DR: This paper provides a solution to the problem of determining the maximal period of unimodular matrix generators of pseudo-random numbers, formulates the necessary and sufficient condition to attain the maximum period, and presents a family of specific generators in the MIXMAX family with superior performance and excellent statistical properties.
TL;DR: In this article, commutators of bilinear Calderón-Zygmund operators and multiplication by functions in a certain subspace of the space of functions of bounded mean oscillations are shown to be compact on appropriate products of weighted Lebesgue spaces.
Abstract: Commutators of bilinear Calderón-Zygmund operators and multiplication by functions in a certain subspace of the space of functions of bounded mean oscillations are shown to be compact on appropriate products of weighted Lebesgue spaces.
TL;DR: This paper surveys bit-parallel multipliers for finite field GF according to quadratic and subquadratic arithmetic complexities of the underlying algorithms, various bases used for representing the field elements, and design approaches that rely on polynomial and matrix operations.
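TL;DR: In this paper, randomized algorithms are presented to compute the sumset (Minkowski sum) of two integer sets and to multiply two sparse univariate integer polynomials; the sumset algorithm is used as part of the sparse multiplication algorithm, and the techniques are based on sparse interpolation algorithms and results from analytic number theory.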
Abstract: We present randomized algorithms to compute the sumset (Minkowski sum) of two integer sets, and to multiply two univariate integer polynomials given by sparse representations. Our algorithm for sumset has cost softly linear in the combined size of the inputs and output. This is used as part of our sparse multiplication algorithm, whose cost is softly linear in the combined size of the inputs, output, and the sumset of the supports of the inputs. As a subroutine, we present a new method for computing the coefficients of a sparse polynomial, given a set containing its support. Our multiplication algorithm extends to multivariate Laurent polynomials over finite fields and rational numbers. Our techniques are based on sparse interpolation algorithms and results from analytic number theory.
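The relationship between the sumset of the supports and the sparse product can be illustrated with a naive quadratic-time Python sketch. This is not the randomized, softly linear algorithm of the paper; the dict-based representation (exponent mapped to coefficient) and the function names are assumptions made for the example.

```python
def sumset(a, b):
    """Return the sorted Minkowski sum {x + y : x in a, y in b}."""
    return sorted({x + y for x in a for y in b})


def sparse_poly_mul(f, g):
    """Multiply sparse polynomials stored as {exponent: nonzero coefficient}."""
    product = {}
    for ef, cf in f.items():
        for eg, cg in g.items():
            product[ef + eg] = product.get(ef + eg, 0) + cf * cg
    # Cancellation can produce zero coefficients, which are dropped here.
    return {e: c for e, c in product.items() if c != 0}


f = {0: 1, 5: 2}     # 1 + 2x^5
g = {0: 1, 5: -2}    # 1 - 2x^5
h = sparse_poly_mul(f, g)                  # 1 - 4x^10
candidates = sumset(f.keys(), g.keys())    # possible exponents of the product
```

The exponent 5 appears in the sumset of the supports but cancels in the product, which is one way to see why the stated cost depends on the sumset size and not only on the output size.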
TL;DR: In this paper, a speculative segmented sum strategy for the CSR-based SpMV algorithm is proposed, where the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector.
Abstract: A speculative segmented sum strategy for the CSR-based SpMV. Utilizing both GPU cores and CPU cores in a heterogeneous processor. No format conversion or tuning overhead for input sparse matrices in the CSR format. High speedup over the CSR-vector algorithm running irregular matrices. No performance penalty for most regular matrices. Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect result. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms.
TL;DR: A promising algorithm for sparse-matrix sparse-vector multiplication (SpMSpV) on the GPU is implemented and the scalability of three approaches -- no sorting, merge sorting, and radix sorting -- in solving this problem is examined.
Abstract: We implement a promising algorithm for sparse-matrix sparse-vector multiplication (SpMSpV) on the GPU. An efficient k-way merge lies at the heart of finding a fast parallel SpMSpV algorithm. We examine the scalability of three approaches -- no sorting, merge sorting, and radix sorting -- in solving this problem. For breadth-first search (BFS), we achieve a 1.26x speedup over state-of-the-art sparse-matrix dense-vector (SpMV) implementations. The algorithm seems generalizable to single-source shortest path (SSSP), sparse-matrix sparse-matrix multiplication, and other core graph primitives such as maximal independent set and bipartite matching.
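The role of the merge step can be seen in a small sequential Python sketch of column-based SpMSpV: each nonzero of the input vector selects one matrix column, and the selected columns are then merged by row index. The data layout (a dict of columns) and the name `spmspv` are assumptions for illustration; a generic sort stands in for the merge and radix sorting variants, and nothing here reflects the GPU implementation.

```python
from itertools import groupby


def spmspv(columns, x):
    """Compute y = A @ x with A as {col: [(row, value), ...]} and x as {index: value}."""
    # Gather: scale column j of A by x[j] for every nonzero x[j].
    partial = [(row, a * xj)
               for j, xj in x.items()
               for row, a in columns.get(j, [])]
    # Merge: sort by row so contributions to the same output entry are adjacent.
    partial.sort(key=lambda t: t[0])
    # Reduce each run of equal row indices to one output nonzero.
    return {row: sum(v for _, v in group)
            for row, group in groupby(partial, key=lambda t: t[0])}


# Columns of [[1, 0, 2], [0, 0, 0], [0, 3, 4]].
A_cols = {0: [(0, 1)], 1: [(2, 3)], 2: [(0, 2), (2, 4)]}
y = spmspv(A_cols, {0: 5, 2: 1})
```

Only the columns selected by the nonzeros of `x` are touched, which is the advantage over a dense-vector SpMV when `x` is very sparse, as in the frontier of a breadth-first search.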
TL;DR: In this article, the authors present new speed records for multiprecision multiplication on the AVR ATmega family of 8-bit microcontrollers, taking only 1,969 cycles for the multiplication of two 160-bit integers.
Abstract: This paper presents new speed records for multiprecision multiplication on the AVR ATmega family of 8-bit microcontrollers. For example, our software takes only 1,969 cycles for the multiplication of two 160-bit integers; this is more than 15 % faster than that demonstrated in previous work. For 256-bit inputs, our software is not only the first to break through the 6,000-cycle barrier; with only 4,771 cycles it also breaks through the 5,000-cycle barrier and is more than 21 % faster than previous work. We achieve these speed records by carefully optimizing the Karatsuba multiplication technique for AVR ATmega. One might expect that subquadratic-complexity Karatsuba multiplication is only faster than algorithms with quadratic complexity for large inputs. This paper shows that it is in fact faster than fully unrolled product-scanning multiplication already for surprisingly small inputs, starting at 48 bits. Our results thus make Karatsuba multiplication the method of choice for high-performance implementations of elliptic-curve cryptography on AVR ATmega microcontrollers.
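The Karatsuba technique referred to above replaces four half-size multiplications with three. The Python sketch below shows the recursion on arbitrary-precision integers; it is a generic illustration, not the optimized AVR assembly of the paper, and the `limb_bits` cutoff is an assumed stand-in for the 8-bit word size.

```python
def karatsuba(x, y, limb_bits=8):
    """Multiply non-negative integers using three recursive half-size products."""
    if x < (1 << limb_bits) or y < (1 << limb_bits):
        return x * y                      # base case: an operand fits in one limb
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    low = karatsuba(x_lo, y_lo, limb_bits)
    high = karatsuba(x_hi, y_hi, limb_bits)
    # (x_hi + x_lo) * (y_hi + y_lo) - high - low equals the sum of the cross terms.
    middle = karatsuba(x_hi + x_lo, y_hi + y_lo, limb_bits) - high - low
    return (high << (2 * half)) + (middle << half) + low


a, b = 3 ** 100, 7 ** 56                  # two operands of roughly 160 bits
c = karatsuba(a, b)
```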
TL;DR: In this paper, DNA is introduced as a functional material for ternary computing; in particular, three-valued oligonucleotide inputs are used to construct a 3 × 3 multiplication table. The system consists of two three-valued inputs of −1, 0, +1 and a fluorophore/quencher functional hairpin acting as computational and reporter module.
Abstract: Non-Boolean computations implementing operations on multi-valued variables beyond base 2 allow enhanced computational complexity. We introduce DNA as a functional material for ternary computing, and in particular demonstrate the use of three-valued oligonucleotide inputs to construct a 3 × 3 multiplication table. The system consists of two three-valued inputs of −1; 0; +1 and a fluorophore/quencher functional hairpin acting as computational and reporter module. The interaction of the computational hairpin module with the different values of the inputs yields a 3 × 3 multiplication matrix consisting of nine nanostructures that are read out by three distinct fluorescence intensities. By combining three different hairpin computational modules, each modified with a different fluorophore/quencher pair, and using different sets of inputs, the parallel operation of three multiplication tables is demonstrated.
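The logical content of the 3 × 3 multiplication table can be written down directly: nine input pairs map onto only three distinct outputs, which matches the description of nine nanostructures read out by three fluorescence intensities. The Python sketch below shows just this arithmetic; the names are illustrative and nothing about the DNA implementation is modeled.

```python
TRITS = (-1, 0, 1)

# One entry per pair of three-valued inputs.
table = {(a, b): a * b for a in TRITS for b in TRITS}

# Group the nine input pairs by the output value they produce.
by_output = {out: sorted(pair for pair, val in table.items() if val == out)
             for out in TRITS}
```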
TL;DR: In this paper, the authors characterize the communication cost of a sparse matrix-matrix multiplication algorithm in terms of the size of a cut of an associated hypergraph that encodes the computation for a given input nonzero structure.
Abstract: The performance of parallel algorithms for sparse matrix-matrix multiplication is typically determined by the amount of interprocessor communication performed, which in turn depends on the nonzero structure of the input matrices. In this paper, we characterize the communication cost of a sparse matrix-matrix multiplication algorithm in terms of the size of a cut of an associated hypergraph that encodes the computation for a given input nonzero structure. Obtaining an optimal algorithm corresponds to solving a hypergraph partitioning problem. Our hypergraph model generalizes several existing models for sparse matrix-vector multiplication, and we can leverage hypergraph partitioners developed for that computation to improve application-specific algorithms for multiplying sparse matrices.
TL;DR: In this article, a task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA) is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC).
Abstract: A task-based formulation of Scalable Universal Matrix Multiplication Algorithm (SUMMA), a popular algorithm for matrix multiplication (MM), is applied to the multiplication of hierarchy-free, rank-structured matrices that appear in the domain of quantum chemistry (QC). The novel features of our formulation are: (1) concurrent scheduling of multiple SUMMA iterations, and (2) fine-grained task-based composition. These features make it tolerant of the load imbalance due to the irregular matrix structure and eliminate all artifactual sources of global synchronization. Scalability of iterative computation of square-root inverse of block-rank-sparse QC matrices is demonstrated; for full-rank (dense) matrices the performance of our SUMMA formulation usually exceeds that of the state-of-the-art dense MM implementations (ScaLAPACK and Cyclops Tensor Framework).
TL;DR: It is argued that knowledge about operations between natural numbers needs to be inhibited for students to overcome the natural number bias and to reason with numbers beyond the scope of natural numbers.
Abstract: This study investigates the hypothesis that there is a natural number bias that influences how students understand the effects of arithmetical operations involving both Arabic numerals and numbers that are represented by symbols for missing numbers. It also investigates whether this bias correlates with other aspects of students’ understanding of the number concept beyond natural numbers. Natural number bias has been characterized as the interference of natural number knowledge in reasoning about non-natural numbers. Quantitative data is presented showing that in the case of operations between numbers and missing numbers this bias acts in two main ways. First, it shapes students’ anticipations about the expected outcome of each operation, that is, that the result of addition or multiplication “must” be bigger than the initial numbers and the result of subtraction or division “must” be smaller. Second, it causes students to think that missing numbers stand mostly for natural numbers; this tendency would lead students to make decisions about the general results of operations by substituting only natural numbers for the missing number symbols. It is argued that knowledge about operations between natural numbers needs to be inhibited for students to overcome the natural number bias and to reason with numbers beyond the scope of natural numbers.