Top 435 papers published on the topic of Multiplication in 2021

Showing papers on "Multiplication" published in 2021

Journal Article•10.1016/J.CMA.2020.113609•

The Arithmetic Optimization Algorithm

Laith Abualigah¹, Ali Diabat²,³, Seyedali Mirjalili⁴, Mohamed Abd Elaziz⁵,⁶, Amir H. Gandomi⁷•Institutions (7)

Amman Arab University¹, New York University², New York University Abu Dhabi³, Yonsei University⁴, Wuhan University of Technology⁵, Zagazig University⁶, University of Technology, Sydney⁷

01 Apr 2021-Computer Methods in Applied Mechanics and Engineering

TL;DR: Experimental results show that the AOA provides very promising results in solving challenging optimization problems compared with eleven other well-known optimization algorithms.

2,204 citations

Posted Content•

An optical neural network using less than 1 photon per multiplication.

Tianyu Wang, Ronald P. Caputo¹, Shi-Yuan Ma, Logan G. Wright, Tatsuhiro Onodera, Brian C. Richard, Peter L. McMahon•Institutions (1)

University of Massachusetts Amherst¹

27 Apr 2021-arXiv: Optics

TL;DR: In this paper, an optical neural network achieves 99% accuracy on handwritten-digit classification using 3.2 detected photons per weight multiplication and 90% accuracy using ~0.64 photons (~$2.4 \times 10^{-19}$ J of optical energy).

Abstract: Deep learning has rapidly become a widespread tool in both scientific and commercial endeavors. Milestones of deep learning exceeding human performance have been achieved for a growing number of tasks over the past several years, across areas as diverse as game-playing, natural-language translation, and medical-image analysis. However, continued progress is increasingly hampered by the high energy costs associated with training and running deep neural networks on electronic processors. Optical neural networks have attracted attention as an alternative physical platform for deep learning, as it has been theoretically predicted that they can fundamentally achieve higher energy efficiency than neural networks deployed on conventional digital computers. Here, we experimentally demonstrate an optical neural network achieving 99% accuracy on handwritten-digit classification using ~3.2 detected photons per weight multiplication and ~90% accuracy using ~0.64 photons (~$2.4 \times 10^{-19}$ J of optical energy) per weight multiplication. This performance was achieved using a custom free-space optical processor that executes matrix-vector multiplications in a massively parallel fashion, with up to ~0.5 million scalar (weight) multiplications performed at the same time. Using commercially available optical components and standard neural-network training methods, we demonstrated that optical neural networks can operate near the standard quantum limit with extremely low optical powers and still achieve high accuracy. Our results provide a proof-of-principle for low-optical-power operation, and with careful system design including the surrounding electronics used for data storage and control, open up a path to realizing optical processors that require only $10^{-16}$ J total energy per scalar multiplication -- which is orders of magnitude more efficient than current digital processors.
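The energy figures quoted in this abstract can be sanity-checked with a few lines of arithmetic. The sketch below divides the quoted optical energy by the quoted photon count to get the implied energy per photon, then converts that to a wavelength with E = hc/λ. The wavelength is an inference from those two numbers, not a value stated in the abstract, and the variable names are illustrative.

```python
# Values quoted in the abstract above.
PHOTONS_PER_MULTIPLICATION = 0.64
ENERGY_PER_MULTIPLICATION_J = 2.4e-19

# Physical constants (exact SI definitions).
PLANCK_H = 6.62607015e-34      # J*s
SPEED_OF_LIGHT = 299_792_458   # m/s

# Implied energy carried by a single photon.
energy_per_photon_j = ENERGY_PER_MULTIPLICATION_J / PHOTONS_PER_MULTIPLICATION

# Wavelength implied by E = h*c / lambda (an inference, not stated in the abstract).
implied_wavelength_nm = PLANCK_H * SPEED_OF_LIGHT / energy_per_photon_j * 1e9

# Ratio between the projected total energy per multiplication (1e-16 J)
# and the optical energy alone.
total_to_optical_ratio = 1e-16 / ENERGY_PER_MULTIPLICATION_J

print(f"energy per photon: {energy_per_photon_j:.3e} J")
print(f"implied wavelength: {implied_wavelength_nm:.0f} nm")
print(f"total/optical energy ratio: {total_to_optical_ratio:.0f}x")
```

The implied wavelength falls in the visible range, and the ratio shows that the projected total energy budget is a few hundred times larger than the optical energy alone, consistent with the abstract's point that the surrounding electronics dominate.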

151 citations

Journal Article•10.1109/TC.2020.2992113•

An Improved Logarithmic Multiplier for Energy-Efficient Neural Computing

Mohammad Saeed Ansari¹, Bruce F. Cockburn¹, Jie Han¹•Institutions (1)

University of Alberta¹

01 Apr 2021-IEEE Transactions on Computers

TL;DR: This article proposes an improved logarithmic multiplier (ILM) that, unlike existing designs, rounds both inputs to their nearest powers of two by using a proposed nearest-one detector (NOD) circuit.

Abstract: Multiplication is the most resource-hungry operation in neural networks (NNs). Logarithmic multipliers (LMs) simplify multiplication to shift and addition operations and thus reduce the energy consumption. Since implementing the logarithm in a compact circuit often introduces approximation, some accuracy loss is inevitable in LMs. However, this inaccuracy accords with the inherent error tolerance of NNs and their associated applications. This article proposes an improved logarithmic multiplier (ILM) that, unlike existing designs, rounds both inputs to their nearest powers of two by using a proposed nearest-one detector (NOD) circuit. Considering that the output of the NOD uses a one-hot representation, some entries in the truth table of a conventional adder cannot occur. Hence, a compact adder is designed for the reduced truth table. The 8×8 ILM achieves up to 17.48 percent saving in power consumption compared to a recent LM in the literature while being almost 8 percent more accurate. Moreover, the evaluation of the ILM for two benchmark NN workloads shows up to 21.85 percent reduction in energy consumption compared to the NNs implemented with other LMs. Interestingly, using the ILM increases the classification accuracy of the considered NNs by up to 1.4 percent compared to a NN implementation that uses exact multipliers.
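To illustrate how rounding both operands to their nearest powers of two turns a multiplication into shifts and additions, here is a minimal software sketch. It is an assumption-laden illustration, not the ILM circuit from the paper: each unsigned input x is written as 2^k + q with 2^k the nearest power of two (q may be negative), and the product is approximated by dropping the q_a·q_b term. The function names and the tie-breaking rule are hypothetical.

```python
def nearest_power_of_two(x):
    """Return (k, q) with x == 2**k + q, where 2**k is the power of two nearest to x.

    Assumes x > 0. Ties (e.g. x = 3 or x = 6) round down; this choice is
    an illustrative assumption.
    """
    k = x.bit_length() - 1                      # position of the leading one
    if x - (1 << k) > (1 << (k + 1)) - x:       # closer to the next power of two
        k += 1
    return k, x - (1 << k)


def approx_multiply(a, b):
    """Approximate a * b for unsigned integers using only shifts and additions.

    (2**ka + qa) * (2**kb + qb) is expanded and the small qa * qb term is
    dropped, so the absolute error is exactly qa * qb.
    """
    if a == 0 or b == 0:
        return 0
    ka, qa = nearest_power_of_two(a)
    kb, qb = nearest_power_of_two(b)
    return (1 << (ka + kb)) + (qa << kb) + (qb << ka)


if __name__ == "__main__":
    for a, b in [(8, 16), (7, 7), (3, 5), (100, 200)]:
        print(a, b, "exact:", a * b, "approx:", approx_multiply(a, b))
```

Because the residual q can be positive or negative, the dropped term can push the result either above or below the exact product, so the error is not biased in a single direction.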

102 citations

Proceedings Article•10.1145/3460120.3484556•

QuickSilver: Efficient and Affordable Zero-Knowledge Proofs for Circuits and Polynomials over Any Field

Kang Yang, Pratik Sarkar¹, Chenkai Weng², Xiao Wang²•Institutions (2)

Boston University¹, Northwestern University²

12 Nov 2021

TL;DR: In this article, the authors proposed several constant-round ZK protocols in the circuit-based model, which improve the concrete efficiency and enable sublinear amortized communication for circuits with some notion of relaxed uniformity.

Abstract: Zero-knowledge (ZK) proofs with an optimal memory footprint have attracted a lot of attention, because such protocols can easily prove very large computation with a small memory requirement. Such ZK protocol only needs O(M) memory for both parties, where M is the memory required to verify the statement in the clear. In this paper, we propose several new constant-round ZK protocols in this setting, which improve the concrete efficiency and, at the same time, enable sublinear amortized communication for circuits with some notion of relaxed uniformity. In the circuit-based model, where the computation is represented as a circuit over a field, our ZK protocol achieves a communication complexity of 1 field element per non-linear gate for any field size while keeping the computation very cheap. We implemented our protocol, which shows extremely high efficiency and affordability. Compared to the previous best-known implementation, we achieve 6x--7x improvement in computation and 3x--7x improvement in communication. When running on intro-level AWS instances, our protocol only needs one US dollar to prove one trillion AND gates (or 2.5 US dollars for one trillion multiplication gates over a 61-bit field). In the setting where part of the computation can be represented as a set of polynomials with a "degree-separated" format, we can achieve communication sublinear to the polynomial size: the communication only depends on the total number of distinct variables in all the polynomials and the highest degree of all polynomials, independent of the number of multiplications to compute all polynomials. Using the improved ZK protocol, we can prove matrix multiplication with communication proportional to the input size, rather than the number of multiplications. Proving the multiplication of two 1024 x 1024 matrices, our implementation, with one thread and 1 GB of memory, only needs 10 seconds and communicates 25 MB.

87 citations

Journal Article•10.1371/JOURNAL.PONE.0255703•

Advanced arithmetic optimization algorithm for solving mechanical engineering design problems

Jeffrey O. Agushaka¹,², Absalom E. Ezugwu²•Institutions (2)

Federal University Lafia¹, University of KwaZulu-Natal²

24 Aug 2021-PLOS ONE

TL;DR: In this paper, the authors proposed an improved version of the AOA called nAOA algorithm, which uses the high-density values that the natural logarithm and exponential operators can generate, to enhance the exploratory ability of AOA.

Abstract: The distributive power of the arithmetic operators: multiplication, division, addition, and subtraction, gives the arithmetic optimization algorithm (AOA) its unique ability to find the global optimum for optimization problems used to test its performance. Several other mathematical operators exist with the same or better distributive properties, which can be exploited to enhance the performance of the newly proposed AOA. In this paper, we propose an improved version of the AOA called nAOA algorithm, which uses the high-density values that the natural logarithm and exponential operators can generate, to enhance the exploratory ability of the AOA. The addition and subtraction operators carry out the exploitation. The candidate solutions are initialized using the beta distribution, and the random variables and adaptations used in the algorithm have beta distribution. We test the performance of the proposed nAOA with 30 benchmark functions (20 classical and 10 composite test functions) and three engineering design benchmarks. The performance of nAOA is compared with the original AOA and nine other state-of-the-art algorithms. The nAOA shows efficient performance for the benchmark functions and was second only to GWO for the welded beam design (WBD), compression spring design (CSD), and pressure vessel design (PVD).

87 citations

Proceedings Article•10.1109/HPCA51647.2021.00027•

Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework

Sung-En Chang¹, Yanyu Li¹, Mengshu Sun¹, Runbin Shi², Hayden K.-H. So², Xuehai Qian³, Yanzhi Wang¹, Xue Lin¹•Institutions (3)

Northeastern University¹, University of Hong Kong², University of Southern California³

1 Feb 2021

TL;DR: This paper proposes an FPGA-centric mixed scheme quantization (MSQ) with an ensemble of the proposed sum-of-power-of-2 (SP2) and fixed-point schemes.

Abstract: Deep Neural Networks (DNNs) have achieved extraordinary performance in various application domains. To support diverse DNN models, efficient implementations of DNN inference on edge-computing platforms, e.g., ASICs, FPGAs, and embedded systems, are extensively investigated. Due to the huge model size and computation amount, model compression is a critical step to deploy DNN models on edge devices. This paper focuses on weight quantization, a hardware-friendly model compression approach that is complementary to weight pruning. Unlike existing methods that use the same quantization scheme for all weights, we propose the first solution that applies different quantization schemes for different rows of the weight matrix. It is motivated by (1) the fact that the distributions of the weights in the different rows are not the same; and (2) the potential of achieving better utilization of heterogeneous FPGA hardware resources. To achieve that, we first propose a hardware-friendly quantization scheme named sum-of-power-of-2 (SP2) suitable for Gaussian-like weight distribution, in which the multiplication arithmetic can be replaced with logic shifter and adder, thereby enabling highly efficient implementations with the FPGA LUT resources. In contrast, the existing fixed-point quantization is suitable for Uniform-like weight distribution and can be implemented efficiently by DSP. Then to fully explore the resources, we propose an FPGA-centric mixed scheme quantization (MSQ) with an ensemble of the proposed SP2 and the fixed-point schemes. Combining the two schemes can maintain, or even increase, accuracy due to better matching with weight distributions. For the FPGA implementations, we develop a parameterized architecture with heterogeneous Generalized Matrix Multiplication (GEMM) cores: one using LUTs for computations with SP2 quantized weights and the other utilizing DSPs for fixed-point quantized weights. Given the partition ratio among the two schemes based on resource characterization, MSQ quantization training algorithm derives an optimally quantized model for the FPGA implementation. We evaluate our FPGA-centric quantization framework across multiple application domains. With optimal SP2/fixed-point ratios on two FPGA devices, i.e., Zynq XC7Z020 and XC7Z045, we achieve performance improvement of 2.1× to 4.1× compared to solely exploiting DSPs for all multiplication operations. In addition, the CNN implementations with the proposed MSQ scheme can achieve higher accuracy and comparable hardware utilization efficiency compared to the state-of-the-art designs.
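The idea that a sum-of-power-of-2 weight lets a shifter and an adder replace a multiplier can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the SP2 scheme as defined in the paper: a weight is approximated as ±(2^-a + 2^-b) with small positive integer shifts a and b chosen by brute force, activations are non-negative integers, and the names `quantize_sp2` and `sp2_multiply` are hypothetical.

```python
from itertools import combinations_with_replacement


def quantize_sp2(w, max_shift=4):
    """Return (sign, a, b) such that sign * (2**-a + 2**-b) is closest to w.

    Brute-force search over shift pairs 1 <= a <= b <= max_shift; the search
    range is an illustrative assumption.
    """
    sign = -1 if w < 0 else 1
    a, b = min(
        combinations_with_replacement(range(1, max_shift + 1), 2),
        key=lambda ab: abs(abs(w) - (2.0 ** -ab[0] + 2.0 ** -ab[1])),
    )
    return sign, a, b


def sp2_multiply(x, sign, a, b):
    """Multiply a non-negative integer activation x by an SP2 weight.

    Uses two right shifts and one addition instead of a multiplier. Right
    shifts truncate, so the result is exact only when x is a multiple of 2**b.
    """
    return sign * ((x >> a) + (x >> b))


if __name__ == "__main__":
    sign, a, b = quantize_sp2(0.625)            # 0.625 == 2**-1 + 2**-3
    print(sign, a, b, sp2_multiply(64, sign, a, b))
```

Here 0.625 is represented exactly, so multiplying 64 by it yields 40 through `(64 >> 1) + (64 >> 3)`.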

82 citations

Journal Article•10.1007/S11227-021-03980-5•

Efficient number theoretic transform implementation on GPU for homomorphic encryption

Özgün Özerk¹, Can Elgezen¹, Ahmet Can Mert¹, Erdinc Ozturk¹, Erkay Savas¹•Institutions (1)

Sabancı University¹

13 Jul 2021-The Journal of Supercomputing

TL;DR: This work presents an efficient and fast implementation of NTT, inverse NTT and NTT-based polynomial multiplication operations for GPU platforms, and demonstrates that the GPU implementation can be utilized as an actual accelerator.

Abstract: Lattice-based cryptography forms the mathematical basis for current homomorphic encryption schemes, which allows computation directly on encrypted data. Homomorphic encryption enables privacy-preserving applications such as secure cloud computing; yet, its practical applications suffer from the high computational complexity of homomorphic operations. Fast implementations of the homomorphic encryption schemes heavily depend on efficient polynomial arithmetic, multiplication of very large degree polynomials over polynomial rings, in particular. Number theoretic transform (NTT) accelerates large polynomial multiplication significantly, and therefore, it is the core arithmetic operation in the majority of homomorphic encryption scheme implementations. Therefore, practical homomorphic applications require efficient and fast implementations of NTT in different computing platforms. In this work, we present an efficient and fast implementation of NTT, inverse NTT and NTT-based polynomial multiplication operations for GPU platforms. To demonstrate that our GPU implementation can be utilized as an actual accelerator, we experimented with the key generation, the encryption and the decryption operations of the Brakerski/Fan–Vercauteren (BFV) homomorphic encryption scheme implemented in Microsoft’s SEAL homomorphic encryption library on GPU, all of which heavily depend on the NTT-based polynomial multiplication. Our GPU implementations improve the performance of these three BFV operations by up to 141.95×, 105.17× and 90.13×, respectively, on Tesla V100 GPU compared to the highly optimized SEAL library running on an Intel i9-7900X CPU.
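As a rough picture of what NTT-based polynomial multiplication involves, the following CPU-only sketch transforms both polynomials, multiplies pointwise, and transforms back. It is not the GPU implementation described above: it uses a plain recursive radix-2 transform, the commonly used prime 998244353 with primitive root 3 (chosen here for convenience, not taken from the paper), and zero padding to compute an ordinary product rather than the reduction modulo X^n + 1 typical of homomorphic encryption schemes.

```python
MOD = 998244353   # prime of the form 119 * 2**23 + 1 (assumed for this sketch)
ROOT = 3          # primitive root modulo MOD


def ntt(a, invert=False):
    """Recursive radix-2 number theoretic transform; len(a) must be a power of two.

    With invert=True the inverse roots are used, but the result is NOT yet
    scaled by 1/n; poly_mul applies that scaling.
    """
    n = len(a)
    if n == 1:
        return a[:]
    w_n = pow(ROOT, (MOD - 1) // n, MOD)         # primitive n-th root of unity
    if invert:
        w_n = pow(w_n, MOD - 2, MOD)             # modular inverse via Fermat
    even = ntt(a[0::2], invert)
    odd = ntt(a[1::2], invert)
    out = [0] * n
    w = 1
    for k in range(n // 2):
        t = w * odd[k] % MOD
        out[k] = (even[k] + t) % MOD             # butterfly: upper half
        out[k + n // 2] = (even[k] - t) % MOD    # butterfly: lower half
        w = w * w_n % MOD
    return out


def poly_mul(f, g):
    """Multiply two coefficient lists modulo MOD using forward and inverse NTT."""
    result_len = len(f) + len(g) - 1
    size = 1
    while size < result_len:
        size *= 2
    fa = ntt(f + [0] * (size - len(f)))
    fb = ntt(g + [0] * (size - len(g)))
    pointwise = [x * y % MOD for x, y in zip(fa, fb)]
    c = ntt(pointwise, invert=True)
    n_inv = pow(size, MOD - 2, MOD)              # scale by 1/size
    return [x * n_inv % MOD for x in c[:result_len]]


if __name__ == "__main__":
    print(poly_mul([1, 2], [3, 4]))   # (1 + 2x)(3 + 4x)
```

The transform costs O(n log n) modular operations, compared with O(n²) for schoolbook multiplication, which is why it dominates the running time of the schemes mentioned above.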

70 citations

Journal Article•10.1016/J.NEUCOM.2021.08.064•

No Routing Needed Between Capsules

Adam Byerly¹,², Tatiana Kalganova¹, Ian Dear¹•Institutions (2)

Brunel University London¹, Bradley University²

19 Aug 2021-Neurocomputing

TL;DR: This study shows that a simple convolutional neural network using HVCs performs as well as the prior best performing capsule network on MNIST using 5.5× fewer parameters, 4× fewer training epochs, no reconstruction sub-network, and requiring no routing mechanism.

69 citations

Journal Article•10.1109/TIT.2021.3050853•

Secure Coded Multi-Party Computation for Massive Matrix Operations

Hanzaleh Akbari-Nodehi¹, Mohammad Ali Maddah-Ali¹•Institutions (1)

Sharif University of Technology¹

25 Jan 2021-IEEE Transactions on Information Theory

TL;DR: In this article, the authors consider a secure multi-party computation problem (MPC), where the goal is to offload the computation of an arbitrary polynomial function of some massive private matrices (inputs) to a cluster of workers.

Abstract: In this article, we consider a secure multi-party computation problem (MPC), where the goal is to offload the computation of an arbitrary polynomial function of some massive private matrices (inputs) to a cluster of workers. The workers are not reliable. Some of them may collude to gain information about the input data (semi-honest workers). The system is initialized by sharing a (randomized) function of each input matrix to each server. Since the input matrices are massive, each share’s size is assumed to be at most $1/k$ fraction of the input matrix, for some $k \in \mathbb{N}$. The objective is to minimize the number of workers needed to perform the computation task correctly, such that even if an arbitrary subset of $t-1$ workers, for some $t \in \mathbb{N}$, collude, they cannot gain any information about the input matrices. We propose a sharing scheme, called polynomial sharing, and show that it admits basic operations such as adding and multiplication of matrices and transposing a matrix. By concatenating the procedures for basic operations, we show that any polynomial function of the input matrices can be calculated, subject to the problem constraints. We show that the proposed scheme can offer order-wise gain in terms of the number of workers needed, compared to the approaches formed by the concatenation of job splitting and conventional MPC approaches.

66 citations

Journal Article•10.46586/TCHES.V2021.I2.159-188•

NTT Multiplication for NTT-unfriendly Rings: New Speed Records for Saber and NTRU on Cortex-M4 and AVX2

Chi-Ming Marvin Chung, Vincent Hwang, Matthias J. Kannwischer¹, Gregor Seiler², Cheng-Jhih Shih, Bo-Yin Yang³•Institutions (3)

Max Planck Society¹, IBM², Academia Sinica³

23 Feb 2021

66 citations

Journal Article•10.1109/ACCESS.2021.3096189•

Accelerating Fully Homomorphic Encryption Through Architecture-Centric Analysis and Optimization

Wonkyung Jung¹, Eojin Lee², Sangpyo Kim¹, Jongmin Kim¹, Nam-Hoon Kim¹, Keewoo Lee¹, Chohong Min³, Jung Hee Cheon¹, Jung Ho Ahn¹•Institutions (3)

Seoul National University¹, Samsung², Ewha Womans University³

12 Jul 2021-IEEE Access

TL;DR: This paper describes how HE Mul of HEAAN is performed in a manner friendly to non-crypto experts and conducts a disciplined analysis of its computational and memory-access characteristics, through which parallelism is extracted in the key functions composing HE Mul.

Abstract: Homomorphic Encryption (HE) has drawn significant attention as a privacy-preserving approach for cloud computing because it allows computation on encrypted messages called ciphertexts. Among the numerous HE schemes proposed thus far, HE for Arithmetic of Approximate Numbers (HEAAN) is rapidly gaining in popularity across a wide range of applications, as it supports messages that can tolerate approximate computations with no limit on the number of arithmetic operations applicable to the ciphertexts. A critical shortcoming of HE is the high computation complexity of ciphertext arithmetic; specifically, HE multiplication (HE Mul) is more than 10,000 times slower than the corresponding multiplication between unencrypted messages. This has led to a large body of HE acceleration studies, including those that exploit FPGAs; however, a rigorous analysis of the computational complexity and data access patterns of HE Mul is lacking. Moreover, the proposals mostly focused on designs with small parameter sizes, making it difficult to accurately estimate the performance of the HE accelerators when conducting a series of complex arithmetic operations. In this paper, we first describe how HE Mul of HEAAN is performed in a manner friendly to non-crypto experts. Then, we conduct a disciplined analysis of its computational and memory-access characteristics, through which we (1) extract parallelism in the key functions composing HE Mul and (2) demonstrate how to map the parallelism effectively to popular parallel processing platforms, CPUs and GPUs, by applying a series of optimizations such as transposing matrices and pinning data to threads. This leads to performance improvements of HE Mul on a CPU and a GPU by $2.06\times$ and $4.05\times$, respectively, over the reference HEAAN running on a CPU with 24 threads.

Journal Article•10.1109/TC.2020.2988404•

Area-Optimized Accurate and Approximate Softcore Signed Multiplier Architectures

Salim Ullah¹, Hendrik Schmidl¹, Siva Satyendra Sahoo¹, Semeen Rehman², Akash Kumar¹•Institutions (2)

Dresden University of Technology¹, Vienna University of Technology²

01 Mar 2021-IEEE Transactions on Computers

TL;DR: This work proposes a novel implementation technique for designing resource-efficient and low-power accurate and approximate multipliers which are optimized for FPGA-based systems.

Abstract: Multiplication is one of the most extensively used arithmetic operations in a wide range of applications. In order to provide resource-efficient and high-performance multipliers, previous works have proposed different designs of accurate and approximate multipliers, mainly for ASIC-based systems. However, the architectural differences between ASIC- and FPGA-based systems limit the effectiveness of these multipliers for FPGA-based systems. Moreover, most of these multiplier designs are valid only for unsigned numbers. To bridge this gap, we propose a novel implementation technique for designing resource-efficient and low-power accurate and approximate signed multipliers which are optimized for FPGA-based systems. Compared to Vivado's area-optimized multiplier IPs, the designs obtained using our proposed technique occupy 47 to 63 percent less area (lookup tables). To accelerate further research in this direction and reproduce the presented results, the RTL and behavioral models of our proposed methodology are available as an open-source library (online: https://cfaed.tu-dresden.de/pd-downloads).

Journal Article•10.1109/TIT.2021.3068165•

CodedSketch: A Coding Scheme for Distributed Computation of Approximated Matrix Multiplication

Tayyebeh Jahani-Nezhad¹, Mohammad Ali Maddah-Ali¹•Institutions (1)

Sharif University of Technology¹

23 Mar 2021-IEEE Transactions on Information Theory

TL;DR: This paper proposes CodedSketch, a distributed straggler-resistant scheme to compute an approximation of the multiplication of two massive matrices, which provides an upper-bound on the recovery threshold as a function of the required accuracy of computation and the probability that the required accuracy can be violated.

Abstract: In this paper, we propose CodedSketch, as a distributed straggler-resistant scheme to compute an approximation of the multiplication of two massive matrices. The objective is to reduce the recovery threshold, defined as the total number of worker nodes that the master node needs to wait for to be able to recover the final result. To exploit the fact that only an approximated result is required, in reducing the recovery threshold, some sorts of pre-compression are required. However, compression inherently involves some randomness that would lose the structure of the matrices. On the other hand, considering the structure of the matrices is crucial to reduce the recovery threshold. In CodedSketch, we use count–sketch, as a hash-based compression scheme, on the rows of the first and columns of the second matrix, and a structured polynomial code on the columns of the first and rows of the second matrix. This arrangement allows us to exploit the gain of both in reducing the recovery threshold. To increase the accuracy of computation, multiple independent count–sketches are needed. This independency allows us to theoretically characterize the accuracy of the result and establish the recovery threshold achieved by the proposed scheme. To guarantee the independency of resulting count–sketches in the output, while keeping its cost on the recovery threshold minimum, we use another layer of structured codes. The proposed scheme provides an upper-bound on the recovery threshold as a function of the required accuracy of computation and the probability that the required accuracy can be violated. In addition, it provides an upper-bound on the recovery threshold for the case that the result of the multiplication is sparse, and the exact result is required.

Journal Article•10.1109/TETC.2021.3050989•

The Effects of Approximate Multiplication on Convolutional Neural Networks

Min Soo Kim, Alberto Antonio Del Barrio Garcia¹, HyunJin Kim², Nader Bagherzadeh³•Institutions (3)

Complutense University of Madrid¹, Dankook University², University of California, Irvine³

12 Jan 2021-IEEE Transactions on Emerging Topics in Computing

TL;DR: The study identifies the critical factors in the convolution, fully-connected, and batch normalization layers that allow more accurate CNN predictions despite the errors from approximate multiplication, and provides an arithmetic explanation of why bfloat16 multiplication performs well on CNNs.

Abstract: This paper analyzes the effects of approximate multiplication when performing inferences on deep convolutional neural networks (CNNs). The approximate multiplication can reduce the cost of the underlying circuits so that CNN inferences can be performed more efficiently in hardware accelerators. The study identifies the critical factors in the convolution, fully-connected, and batch normalization layers that allow more accurate CNN predictions despite the errors from approximate multiplication. The same factors also provide an arithmetic explanation of why bfloat16 multiplication performs well on CNNs. The experiments are performed with recognized network architectures to show that the approximate multipliers can produce predictions that are nearly as accurate as the FP32 references, without additional training. For example, the ResNet and Inception-v4 models with Mitch-w6 multiplication produce Top-5 errors that are within 0.2% of the FP32 references. A brief cost comparison of Mitch-w6 against bfloat16 is presented, where a MAC operation saves up to 80% of energy compared to the bfloat16 arithmetic. The most far-reaching contribution of this paper is the analytical justification that multiplications can be approximated while additions need to be exact in CNN MAC operations.

Journal Article•10.1109/JSAIT.2021.3056377•

Analog Lagrange Coded Computing

Mahdi Soleymani¹, Hessam Mahdavifar¹, A. Salman Avestimehr²•Institutions (2)

University of Michigan¹, University of Southern California²

2 Feb 2021

TL;DR: In this paper, the authors proposed a novel extension of LCC to the analog domain, referred to as analog LCC (ALCC), where all the operations in the proposed ALCC protocol are done over the infinite fields of $\mathbb{R}/\mathbb{C}$ but for practical implementations floating-point numbers are used.

Abstract: A distributed computing scenario is considered, where the computational power of a set of worker nodes is used to perform a certain computation task over a dataset that is dispersed among the workers. Lagrange coded computing (LCC), proposed by Yu et al., leverages the well-known Lagrange polynomial to perform polynomial evaluation of the dataset in such a scenario in an efficient parallel fashion while keeping the privacy of data amidst possible collusion of workers. This solution relies on quantizing the data into a finite field, so that Shamir’s secret sharing, as one of its main building blocks, can be employed. Such a solution, however, is not properly scalable with the size of dataset, mainly due to computation overflows. To address such a critical issue, we propose a novel extension of LCC to the analog domain, referred to as analog LCC (ALCC). All the operations in the proposed ALCC protocol are done over the infinite fields of $\mathbb{R}/\mathbb{C}$ but for practical implementations floating-point numbers are used. We characterize the privacy of data in ALCC, against any subset of colluding workers up to a certain size, in terms of the distinguishing security (DS) and the mutual information security (MIS) metrics. Also, the accuracy of outcome is characterized in a practical setting assuming operations are performed using floating-point numbers. Consequently, a fundamental trade-off between the accuracy of the outcome of ALCC and its privacy level is observed and is numerically evaluated. Moreover, we implement the proposed scheme to perform matrix-matrix multiplication over a batch of matrices. It is observed that ALCC is superior compared to the state-of-the-art LCC, implemented using fixed-point numbers, assuming both schemes use an equal number of bits to represent data symbols.

Journal Article•10.3390/MATH9111163•

Synchronization in Finite-Time Analysis of Clifford-Valued Neural Networks with Finite-Time Distributed Delays

Grienggrai Rajchakit, Ramalingam Sriraman, Chee Peng Lim, Panu Sam-ang, Porpattama Hammachukiattikul

21 May 2021

TL;DR: Finite-time synchronization of Clifford-valued neural networks with finite-time distributed delays is explored on the basis of a new Lyapunov–Krasovskii functional, suitable controller and new computational techniques.

Abstract: In this paper, we explore the finite-time synchronization of Clifford-valued neural networks with finite-time distributed delays. To address the problem associated with non-commutativity pertaining to the multiplication of Clifford numbers, the original n-dimensional Clifford-valued drive and response systems are firstly decomposed into the corresponding 2m-dimensional real-valued counterparts. On the basis of a new Lyapunov–Krasovskii functional, suitable controller and new computational techniques, finite-time synchronization criteria are formulated for the corresponding real-valued drive and response systems. The feasibility of the main results is verified by a numerical example.

Journal Article•10.1109/TCOMM.2020.3032196•

Coded Computing for Resilient, Secure, and Privacy-Preserving Distributed Matrix Multiplication

Qian Yu¹, A. Salman Avestimehr¹•Institutions (1)

University of Southern California¹

01 Jan 2021-IEEE Transactions on Communications

TL;DR: Entangled polynomial codes can be extended to the secure, private, and batch computation settings, providing unified frameworks that order-wise reduce the total computational costs by achieving subcubic recovery thresholds.

Abstract: Coded computing is a new framework to address fundamental issues in large scale distributed computing, by injecting structured randomness and redundancy. We first provide an overview of coded computing and summarize some recent advances. Then we focus on distributed matrix multiplication and consider a common scenario where each worker is assigned a fraction of the multiplication task. In particular, by partitioning two input matrices into $m$-by-$p$ and $p$-by-$n$ subblocks, a single multiplication task can be viewed as computing linear combinations of $pmn$ submatrix products, which can be assigned to $pmn$ workers. Such block-partitioning-based designs have been widely studied under the topics of secure, private, and batch computation, where state-of-the-art schemes all require computing at least a “cubic” ($pmn$) number of submatrix multiplications. Entangled polynomial codes, first presented for straggler mitigation, provide a powerful method for breaking the cubic barrier. They achieve a subcubic recovery threshold, i.e., the final product can be recovered from any subset of multiplication results with a size order-wise smaller than $pmn$. We show that entangled polynomial codes can be further extended to also include these three important settings, providing unified frameworks that order-wise reduce the total computational costs by achieving subcubic recovery thresholds.
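The "cubic" baseline mentioned in this abstract can be made concrete with a small uncoded example: split the first matrix into an m-by-p grid of blocks and the second into a p-by-n grid, compute every one of the p·m·n block products (one per hypothetical worker), and sum them into the output blocks. This sketch shows only that baseline, not entangled polynomial codes; all function names are illustrative and the matrix dimensions are assumed to divide evenly.

```python
def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matadd(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(X, Y)]


def split(M, row_blocks, col_blocks):
    """Split M into a row_blocks x col_blocks grid of equally sized blocks."""
    h = len(M) // row_blocks
    w = len(M[0]) // col_blocks
    return [
        [[row[j * w:(j + 1) * w] for row in M[i * h:(i + 1) * h]]
         for j in range(col_blocks)]
        for i in range(row_blocks)
    ]


def blocked_product(A, B, m, p, n):
    """Compute A @ B from p*m*n block products; return (product, task count)."""
    Ab, Bb = split(A, m, p), split(B, p, n)
    # One task per (i, k, j) triple: this is the "cubic" number of products.
    tasks = {
        (i, k, j): matmul(Ab[i][k], Bb[k][j])
        for i in range(m) for k in range(p) for j in range(n)
    }
    C = []
    for i in range(m):
        # Output block (i, j) is the sum over k of the (i, k, j) products.
        row_of_blocks = []
        for j in range(n):
            acc = tasks[(i, 0, j)]
            for k in range(1, p):
                acc = matadd(acc, tasks[(i, k, j)])
            row_of_blocks.append(acc)
        # Stitch the blocks of this block-row back into full matrix rows.
        for r in range(len(row_of_blocks[0])):
            C.append([v for block in row_of_blocks for v in block[r]])
    return C, len(tasks)


if __name__ == "__main__":
    A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
    C, num_tasks = blocked_product(A, B, m=2, p=2, n=2)
    print(num_tasks, C == matmul(A, B))
```

In this uncoded layout the master must wait for all p·m·n results; the coded schemes surveyed in the paper aim to recover the product from order-wise fewer of them.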

Journal Article•10.1007/S10489-021-02438-8•

Novel translation knowledge graph completion model based on 2D convolution

Jianzhou Feng¹, Qikai Wei¹, Jinman Cui¹, Jing Chen¹•Institutions (1)

Yanshan University¹

02 Jul 2021-Applied Intelligence

TL;DR: This study proposes a simple translation-based model that relies on the fact that the multiplication of subjects and relations is approximately equal to the object to achieve state-of-the-art results compared with the baseline.

Abstract: The knowledge graph completion task involves predicting missing entities and relations in a knowledge graph. Many models have achieved good results, but they have become increasingly complex. In this study, we propose a simple translation-based model that relies on the fact that the multiplication of subjects and relations is approximately equal to the object. First, we utilize embeddings to represent entities and relations. Second, we perform vector multiplication on subject embedding and relation embedding to generate a 2D matrix and achieve full fusion of embedding at the element level. Third, we adopt a convolutional neural network on the 2D matrix. Thereafter, we can generate feature maps, which are then spliced into a 1D feature vector. The feature vector is transformed into predicted object embedding through a fully connected operation. Finally, we use the scoring function to score the candidate triples. Experimental results strongly demonstrate that the translation knowledge graph completion model based on 2D convolution achieves state-of-the-art results compared with the baseline.
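
The second and third steps of this abstract can be sketched in a few lines, under the assumption that the "vector multiplication" producing a 2D matrix is an outer product of the subject and relation embeddings. The embedding values, the $2 \times 2$ filter, and the function names are all hypothetical; the fully connected layer and the scoring function are omitted.

```python
def outer(s, r):
    """Outer product: entry (i, j) is s[i] * r[j], pairing every element of s with every element of r."""
    return [[si * rj for rj in r] for si in s]

def conv2d_valid(X, K):
    """2D cross-correlation of matrix X with filter K (stride 1, no padding)."""
    kh, kw = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b] for a in range(kh) for b in range(kw))
             for j in range(len(X[0]) - kw + 1)]
            for i in range(len(X) - kh + 1)]

subject = [1.0, 2.0, 3.0]     # toy subject embedding (assumed values)
relation = [0.5, -1.0, 2.0]   # toy relation embedding (assumed values)
M = outer(subject, relation)  # 3x3 matrix fusing the two embeddings element by element
K = [[1.0, 0.0], [0.0, 1.0]]  # toy convolution filter (assumed values)
feature_map = conv2d_valid(M, K)
features = [v for row in feature_map for v in row]  # feature map spliced into a 1D vector
```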

Proceedings Article•

Beyond Fully-Connected Layers with Quaternions: Parameterization of Hypercomplex Multiplications with 1/n Parameters

Aston Zhang, Yi Tay¹, Shuai Zhang², Alvin T. S. Chan³, Anh Tuan Luu³, Siu Cheung Hui³, Jie Fu⁴•Institutions (4)

Google¹, École Polytechnique Fédérale de Lausanne², Nanyang Technological University³, Université de Montréal⁴

3 May 2021

TL;DR: In this paper, the authors propose to parameterize hypercomplex multiplications, allowing models to learn multiplication rules from data regardless of whether such rules are predefined, which can be applied to the LSTM and Transformer models on natural language inference, machine translation, text style transfer, and subject verb agreement.

Abstract: Recent works have demonstrated reasonable success of representation learning in hypercomplex space. Specifically, “fully-connected layers with Quaternions” (4D hypercomplex numbers), which replace real-valued matrix multiplications in fully-connected layers with Hamilton products of Quaternions, both enjoy parameter savings with only 1/4 learnable parameters and achieve comparable performance in various applications. However, one key caveat is that hypercomplex space only exists at very few predefined dimensions (4D, 8D, and 16D). This restricts the flexibility of models that leverage hypercomplex multiplications. To this end, we propose parameterizing hypercomplex multiplications, allowing models to learn multiplication rules from data regardless of whether such rules are predefined. As a result, our method not only subsumes the Hamilton product, but also learns to operate on any arbitrary nD hypercomplex space, providing more architectural flexibility using arbitrarily 1/n learnable parameters compared with the fully-connected layer counterpart. Experiments of applications to the LSTM and Transformer models on natural language inference, machine translation, text style transfer, and subject verb agreement demonstrate architectural flexibility and effectiveness of the proposed approach.
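
To make the 1/4 parameter saving concrete, the sketch below writes the Hamilton product of two quaternions as a real $4 \times 4$ matrix acting on a 4-vector: the matrix has 16 entries but only 4 free parameters. This covers only the fixed quaternion case mentioned in the abstract, not the learned parameterization the paper proposes; the function names are illustrative.

```python
def hamilton_matrix(q):
    """4x4 real matrix that left-multiplies a quaternion (r, x, y, z) by q = (a, b, c, d)."""
    a, b, c, d = q
    return [[a, -b, -c, -d],
            [b,  a, -d,  c],
            [c,  d,  a, -b],
            [d, -c,  b,  a]]

def hamilton_product(q, v):
    """Hamilton product q * v computed as a matrix-vector multiplication."""
    M = hamilton_matrix(q)
    return [sum(M[i][j] * v[j] for j in range(4)) for i in range(4)]

i_unit, j_unit = [0, 1, 0, 0], [0, 0, 1, 0]
ij = hamilton_product(i_unit, j_unit)  # i * j
ji = hamilton_product(j_unit, i_unit)  # j * i, which differs in sign (non-commutative)
```

A real-valued fully connected layer mapping 4 inputs to 4 outputs would learn all 16 entries independently, whereas the quaternion layer learns only `a, b, c, d` and reuses them with fixed signs.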

Proceedings Article•10.1109/CONIT51480.2021.9498358•

Mutation-based Arithmetic Optimization Algorithm for Global Optimization

Sumika Chauhan¹, Govind Vashishtha¹•Institutions (1)

Sant Longowal Institute of Engineering and Technology¹

25 Jun 2021

TL;DR: In this article, a mutation strategy is incorporated into the AOA to balance its exploration and exploitation search mechanisms; the resulting method is named the mutation-based arithmetic optimization algorithm (m-AOA).

Abstract: Real-life optimization problems need an algorithm that efficiently explores the search area to obtain a globally optimal solution. The arithmetic optimization algorithm (AOA) is a recently developed algorithm that performs its search using the basic arithmetic operations: addition, multiplication, division, and subtraction. However, the AOA gets stuck at a local optimum for some functions because of an inadequate balance between diversification and intensification. Therefore, in the present work, a balance between the exploration and exploitation search mechanisms is provided by incorporating a mutation strategy into the AOA; the result is named the mutation-based arithmetic optimization algorithm (m-AOA). The proposed algorithm's performance is tested on twenty-three benchmark functions having different characteristics. The results are compared with those of other optimization algorithms in terms of performance parameters, including the average, standard deviation, median, worst, and best values. The comparison suggests that the proposed method outperforms the other algorithms.

Journal Article•10.4310/ARKIV.2021.V59.N2.A2•

Multiplication in Sobolev spaces, revisited

A. Behzadan¹, Michael Holst²•Institutions (2)

California State University, Sacramento¹, University of California, San Diego²

01 Jan 2021-Arkiv för Matematik

TL;DR: In this article, the authors re-examine some of the classical pointwise multiplication theorems in Sobolev-Slobodeckij spaces, in part motivated by a simple counter-example that illustrates how certain multiplication theorems fail when a bounded domain is replaced by $\mathbb{R}^n$, and examine why the same failure is not encountered in Bessel potential spaces.

Abstract: In this article, we re-examine some of the classical pointwise multiplication theorems in Sobolev-Slobodeckij spaces, in part motivated by a simple counter-example that illustrates how certain multiplication theorems fail in Sobolev-Slobodeckij spaces when a bounded domain is replaced by $\mathbb{R}^n$. We identify the source of the failure, and examine why the same failure is not encountered in Bessel potential spaces. To analyze the situation, we begin with a survey of the classical multiplication results stated and proved in the 1977 article of Zolesio, and carefully distinguish between the case of spaces defined on all of $\mathbb{R}^n$ and spaces defined on a bounded domain (with e.g. a Lipschitz boundary). However, the survey we give has a few new wrinkles; the proofs we include are based almost exclusively on interpolation theory rather than Littlewood-Paley theory and Besov spaces, and some of the results we give and their proofs, including the results for negative exponents, do not appear in the literature in this form. We also include a particularly important variation of one of the multiplication theorems that is relevant to the study of nonlinear PDE systems arising in general relativity and other areas. The conditions for multiplication to be continuous in the case of Sobolev-Slobodeckij spaces are somewhat subtle and intertwined, and as a result, the multiplication theorems of Zolesio in 1977 have been cited (more than once) in the standard literature in slightly more generality than what is actually proved by Zolesio, and in cases that allow for the construction of counter-examples such as the one included here.

Journal Article•10.1109/ACCESS.2021.3079425•

10T SRAM Computing-in-Memory Macros for Binary and Multibit MAC Operation of DNN Edge Processors

Van Truong Nguyen¹, Jie-Seok Kim¹, Jong-Wook Lee¹•Institutions (1)

Kyung Hee University¹

12 May 2021-IEEE Access

TL;DR: In this paper, a 10T static random access memory (SRAM) bit-cell is proposed for fully parallel computing and high throughput using 32 parallel binary MAC operations, which achieves 409.6 GOPS throughput, 1001.7 TOPS/W energy efficiency, and a 169.9 TOPS/mm² throughput area efficiency.

Abstract: Computing-in-memory (CIM) is a promising approach to reduce latency and improve the energy efficiency of the multiply-and-accumulate (MAC) operation under a memory wall constraint for artificial intelligence (AI) edge processors. This paper proposes an approach focusing on scalable CIM designs using a new ten-transistor (10T) static random access memory (SRAM) bit-cell. Using the proposed 10T SRAM bit-cell, we present two SRAM-based CIM (SRAM-CIM) macros supporting multibit and binary MAC operations. The first design achieves fully parallel computing and high throughput using 32 parallel binary MAC operations. Advanced circuit techniques such as an input-dependent dynamic reference generator and an input-boosted sense amplifier are presented. Fabricated in a 28 nm CMOS process, this design achieves 409.6 GOPS throughput, 1001.7 TOPS/W energy efficiency, and a 169.9 TOPS/mm² throughput area efficiency. The proposed approach effectively solves previous problems such as write disturb, throughput, and the power consumption of an analog-to-digital converter (ADC). The second design supports multibit MAC operation (4-b weight, 4-b input, and 8-b output) to increase the inference accuracy. We propose an architecture that divides a 4-b weight and 4-b input multiplication into four 2-b multiplications in parallel, which increases the signal margin by $16\times$ compared to conventional 4-b multiplication. Besides, the capacitive digital-to-analog converter (CDAC) area issue is effectively addressed using the intrinsic bit-line capacitance existing in the SRAM-CIM architecture. The proposed approach of realizing four 2-b parallel multiplications using the CDAC is successfully demonstrated with a modified LeNet-5 neural network. These results demonstrate that the proposed 10T bit-cell is promising for realizing robust and scalable SRAM-CIM designs, which is essential for realizing fully parallel edge computing.
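
The arithmetic identity behind splitting a 4-b by 4-b multiplication into four 2-b multiplications can be checked with a short sketch. It assumes unsigned operands and models only the digital recombination of the partial products; the analog bit-line accumulation and the CDAC described in the abstract are not modelled, and the function name is hypothetical.

```python
def mul4_via_2bit(w, x):
    """Multiply two unsigned 4-bit values using four 2-bit x 2-bit partial products."""
    assert 0 <= w < 16 and 0 <= x < 16
    w_hi, w_lo = w >> 2, w & 0b11  # upper and lower 2-bit halves of the weight
    x_hi, x_lo = x >> 2, x & 0b11  # upper and lower 2-bit halves of the input
    # The four 2-bit products are independent, so hardware can evaluate them in parallel.
    hh, hl, lh, ll = w_hi * x_hi, w_hi * x_lo, w_lo * x_hi, w_lo * x_lo
    # Recombine with binary weights 16, 4, 4, and 1.
    return (hh << 4) + ((hl + lh) << 2) + ll

product = mul4_via_2bit(13, 11)
```

The largest result, 15 × 15 = 225, fits in the 8-b output mentioned in the abstract.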

Journal Article•10.1080/03772063.2021.1912655•

High-Speed Hybrid Multiplier Design Using a Hybrid Adder with FPGA Implementation

V. Thamizharasan¹, N. Kasthuri²•Institutions (2)

Erode Sengunthar Engineering College¹, Kongu Engineering College²

21 Apr 2021-IETE Journal of Research

TL;DR: Electronic devices are expected to provide low power dissipation and high speed in a compact area; their speed depends on arithmetic operations, of which multiplication is one.

Abstract: The major role of electronic devices is providing low power dissipation and high speed with compact area. The speed of electronic devices depends on arithmetic operations. Multiplication is the imp...

Journal Article•10.1109/TVLSI.2020.3034046•

A Multirate Fully Parallel LDPC Encoder for the IEEE 802.11n/ac/ax QC-LDPC Codes Based on Reduced Complexity XOR Trees

Ahmed Mahdi¹, N. Kanistras², Vassilis Paliouras²•Institutions (2)

University of Thessaly¹, University of Patras²

01 Jan 2021-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: An encoding method based on a two-step encoding algorithm for the 12 quasi-cyclic low-density parity-check (QC-LDPC) codes specified in the IEEE 802.11n/ac/ax standards allows the implementation of full-parallel architectures that execute the encoding process within a single clock cycle, or more for pipelined implementations, for any of the supported codes.

Abstract: This article proposes an encoding method based on a two-step encoding algorithm for the 12 quasi-cyclic low-density parity-check (QC-LDPC) codes specified in the IEEE 802.11n/ac/ax standards. The proposed approach jointly considers all codes of the particular set, instead of targeting each code separately. The proposed algorithm performs multiplication by inverse matrices. The complexity of the multiplications is significantly reduced by the introduced encoding method. It allows the implementation of full-parallel architectures that execute the encoding process within a single clock cycle, or more for pipelined implementations, for any of the supported codes. A corresponding VLSI encoding architecture based on XOR-gate trees is also proposed. The proposed solution exploits the structure and features of the involved matrices to extract common subexpressions (CSs) using common subexpression sharing techniques (CSST). Such expressions arise from common features of the original matrices and the corresponding inverses, identified in this article. Innovative subexpression extraction procedures that target the specific codes as a set are introduced here. Furthermore, illustrative single-clock hardware encoders derived by the proposed technique are integrated into 90- and 45-nm technologies at 1 GHz occupying 125 and 107 KGates, respectively, achieving throughput rates up to 1.62 Tbps.
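
Multiplication by a binary matrix over GF(2) reduces to XOR-ing selected input bits, which is why such an encoder can be built from XOR-gate trees. The sketch below shows that reduction for a small, made-up matrix; it is not one of the IEEE 802.11 parity-check matrices, and the bitmask representation and function name are assumptions made for illustration.

```python
def gf2_matvec(rows, x):
    """Multiply a binary matrix by bit-vector x over GF(2).

    Each matrix row is an int bitmask; bit k of the mask selects bit k of x.
    Output bit i is the XOR (parity) of the selected bits, i.e. one XOR tree per row.
    """
    out = 0
    for i, row in enumerate(rows):
        parity = bin(row & x).count("1") & 1  # XOR of the selected bits of x
        out |= parity << i
    return out

rows = [0b1011, 0b0110, 0b1101]  # toy 3x4 binary matrix (assumed values)
y = gf2_matvec(rows, 0b1010)
```

In this toy matrix, rows 0 and 2 both contain bits 0 and 3, so a hardware implementation could compute `x0 XOR x3` once and share it between the two trees, which is the kind of common-subexpression sharing the abstract refers to.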

10.1063/5.0071449•

Reduction of the number of multiplication operations in digital signal processing algorithms by classical methods of discrete Fourier transform

Adeliya Yu. Burova, Anatoly V. Ryapukhin

1 Jan 2021

Journal Article•10.1109/TETC.2021.3109127•

PLAM: a Posit Logarithm-Approximate Multiplier

Raul Murillo¹, Alberto Antonio Del Barrio Garcia¹, Guillermo Botella¹, Min Soo Kim², HyunJin Kim³, Nader Bagherzadeh²•Institutions (3)

Complutense University of Madrid¹, University of California, Irvine², Dankook University³

06 Sep 2021-IEEE Transactions on Emerging Topics in Computing

TL;DR: In this article, the authors proposed a Posit Logarithm-Approximate Multiplication (PLAM) scheme to reduce the complexity of posit multipliers, one of the most power-hungry arithmetic units.

Abstract: The Posit Number System was introduced in 2017 as a replacement for floating-point numbers. Since then, the community has explored its application in several areas, such as deep learning, and produced some unit designs which are still far from being competitive with their floating-point counterparts. This paper proposes a Posit Logarithm-Approximate Multiplication (PLAM) scheme to significantly reduce the complexity of posit multipliers, one of the most power-hungry arithmetic units. The impact of this approach is evaluated in deep neural network inference, where there are no significant accuracy drops. Compared with state-of-the-art posit multipliers, experiments show that the proposed technique reduces the area, power, and delay of 32-bit hardware multipliers up to 72.86%, 81.79%, and 17.01%, respectively.
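
A logarithm-approximate multiplier replaces the multiplication of fractions with an addition. The sketch below applies the classic Mitchell approximation, $\log_2(1 + f) \approx f$, to ordinary positive Python floats. It is an assumption that this captures the spirit of PLAM; posit decoding (regime, exponent, fraction fields) is not modelled, and the function names are hypothetical.

```python
import math

def split_pow2(v):
    """Write v > 0 as (1 + f) * 2**k with 0 <= f < 1 and return (k, f)."""
    mantissa, exponent = math.frexp(v)  # v = mantissa * 2**exponent, 0.5 <= mantissa < 1
    return exponent - 1, 2.0 * mantissa - 1.0

def mitchell_mul(a, b):
    """Approximate a * b for positive a, b by adding approximate base-2 logarithms."""
    ka, fa = split_pow2(a)
    kb, fb = split_pow2(b)
    f = fa + fb                                # log2(a*b) is approximately ka + kb + f
    if f < 1.0:
        return math.ldexp(1.0 + f, ka + kb)    # antilog: (1 + f) * 2**(ka + kb)
    return math.ldexp(f, ka + kb + 1)          # fraction overflow: carry into the exponent

approx = mitchell_mul(3.0, 3.0)
```

The approximation is exact when either operand is a power of two and underestimates otherwise, with a worst-case relative error of 1/9 (about 11.1%), reached at 3 × 3, where the sketch returns 8 instead of 9.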

Journal Article•10.1109/TCSII.2021.3068971•

Low-Power Ternary Multiplication Using Approximate Computing

Sunmean Kim¹, Yesung Kang¹, Seunghan Baek¹, Youngchang Choi¹, Seokhyeong Kang¹•Institutions (1)

Pohang University of Science and Technology¹

26 Mar 2021-IEEE Transactions on Circuits and Systems II: Express Briefs

TL;DR: A carry-truncated ternary multiplier, error compensation circuits, and $2 \times 2$ ternary multipliers with various accuracies are proposed using the low-power design methodology with carbon nanotube FETs, and the energy benefit of the proposed $6 \times 6$ approximate ternary multipliers has been verified using HSPICE simulation.

Abstract: We propose a novel approximate computing technique for low-power ternary multiplication. A carry-truncated ternary multiplier, error compensation circuits, and $2 \times 2$ ternary multipliers with various accuracies are proposed using the low-power design methodology with carbon nanotube FETs. An accuracy-configurable design method is proposed to design energy-efficient $6 \times 6$ approximate ternary multipliers. The energy benefit of the proposed $6 \times 6$ approximate ternary multipliers has been verified using HSPICE simulation. The proposed approximate design shows 82.8% power-delay product with 41.8% mean absolute percentage error improvement over the previous approximate multiplier-based design. Image processing applications are conducted using the proposed approximate designs to confirm that the accuracy of ternary multiplication satisfies the user’s requirement.

Proceedings Article•10.1145/3447818.3461472•

Distributed-memory parallel algorithms for sparse times tall-skinny-dense matrix multiplication

Oguz Selvitopi¹, Benjamin Brock², Israt Nisa¹, Alok Tripathy², Katherine Yelick², Aydin Buluc¹•Institutions (2)

Lawrence Berkeley National Laboratory¹, University of California, Berkeley²

3 Jun 2021

TL;DR: In this paper, the performance of various techniques for performing sparse times dense matrix multiplication (SpMM) as a distributed computation across many nodes is evaluated with a focus on GPU accelerators, and it is shown that the best choice of algorithm and implementation technique depends not only on the cost of communication for particular matrix sizes and dimensions, but also on the performance of local SpMM operations.

Abstract: Sparse times dense matrix multiplication (SpMM) finds its applications in well-established fields such as computational linear algebra as well as emerging fields such as graph neural networks. In this study, we evaluate the performance of various techniques for performing SpMM as a distributed computation across many nodes by focusing on GPU accelerators. We examine how the actual local computational performance of state-of-the-art SpMM implementations affects computational efficiency as dimensions change when we scale to large numbers of nodes, which proves to be an unexpectedly important bottleneck. We also consider various distribution strategies, including A-Stationary, B-Stationary, and C-Stationary algorithms, 1.5D and 2D algorithms, and RDMA-based and bulk synchronous methods of data transfer. Our results show that the best choice of algorithm and implementation technique depends not only on the cost of communication for particular matrix sizes and dimensions, but also on the performance of local SpMM operations. Our evaluations reveal that with the involvement of GPU accelerators, the best design choices for SpMM differ from the conventional algorithms that are known to perform well for dense matrix-matrix or sparse matrix-sparse matrix multiplies.
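
For readers unfamiliar with the local kernel whose performance this abstract discusses, here is a minimal single-node SpMM sketch with the sparse matrix stored in compressed sparse row (CSR) form. The CSR layout is an assumption (the abstract does not name a storage format), the toy data are made up, and nothing here models distribution across nodes, GPUs, or communication.

```python
def spmm_csr(indptr, indices, data, B):
    """C = A @ B, where A is sparse in CSR form and B is a dense list-of-lists matrix."""
    n_rows, n_cols = len(indptr) - 1, len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Nonzeros of row i live in positions indptr[i] .. indptr[i+1]-1.
        for k in range(indptr[i], indptr[i + 1]):
            a, row_b = data[k], B[indices[k]]
            for j in range(n_cols):
                C[i][j] += a * row_b[j]  # scale one dense row of B and accumulate
    return C

# A = [[1, 0, 2],
#      [0, 0, 3],
#      [4, 5, 0]]
indptr, indices, data = [0, 2, 3, 5], [0, 2, 2, 0, 1], [1.0, 2.0, 3.0, 4.0, 5.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy dense right-hand matrix
C = spmm_csr(indptr, indices, data, B)
```

Each nonzero of A touches a full row of B, so the work per nonzero grows with the number of columns of B.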

Proceedings Article•10.1109/ICCECE51280.2021.9342343•

An optimization of im2col, an important method of CNNs, based on continuous address access

Haoyu Wang, Chengguang Ma

15 Jan 2021

TL;DR: In this article, an im2col algorithm acceleration method based on continuous memory address read was proposed to speed up the convolutional neural network (CNN) in the case of a single stride.

Abstract: Convolutional neural networks (CNNs) are now widely used in various common tasks such as image classification, semantic segmentation, and face recognition. Convolution layers are the core layers of CNNs, so the computing speed of the convolution layer directly affects the computing speed of the entire network and thereby its real-time performance. The current general convolutional layer acceleration method is to use the image to column (im2col) algorithm to split the input image into a column matrix, then use general matrix multiplication (GEMM) to multiply the column matrix by the convolution kernel. This operation can greatly improve the computing speed of the convolutional layer because most computing platforms have more mature optimizations for GEMM. However, DSPs are very fast at vector multiplication and addition, and in the inference of the convolutional layer, the memory access of the im2col algorithm consumes far more time than the GEMM. This has become a bottleneck for further optimization of computing speed. In this article, I will present an im2col algorithm acceleration method for the single-stride case based on continuous memory address reads. With this method, the speed of the im2col algorithm can be increased by more than 10 times when processing a single-stride convolutional layer. This is a portable method. In this article, I'll show the optimization effects on Xtensa BBE64ep DSP cores and stm32f4 processors.
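
The baseline that this abstract starts from, im2col followed by a matrix product, can be sketched as below for a single-channel image with stride 1 and no padding. The sketch is not the paper's optimized method; it only shows that with stride 1 every patch row is a contiguous slice of an image row, which is the property the continuous-address idea builds on. Function names and data are illustrative.

```python
def im2col_stride1(img, kh, kw):
    """Return one flattened kh*kw patch per output position (stride 1, no padding)."""
    H, W = len(img), len(img[0])
    cols = []
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            patch = []
            for a in range(kh):
                patch.extend(img[i + a][j:j + kw])  # contiguous slice of one image row
            cols.append(patch)
    return cols

def conv_via_gemm(img, kernel):
    """Convolution layer (cross-correlation) computed as im2col plus dot products."""
    kh, kw = len(kernel), len(kernel[0])
    k_flat = [v for row in kernel for v in row]
    cols = im2col_stride1(img, kh, kw)
    out_w = len(img[0]) - kw + 1
    flat = [sum(k * p for k, p in zip(k_flat, patch)) for patch in cols]  # 1 x K times K x N
    return [flat[r * out_w:(r + 1) * out_w] for r in range(len(img) - kh + 1)]

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
out = conv_via_gemm(img, kernel)
```

Note that neighbouring patches overlap heavily, so im2col copies most input values several times; that memory traffic is the cost the abstract identifies as the bottleneck.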

Journal Article•10.1093/IMANUM/DRZ074•

A log-barrier Newton-CG method for bound constrained optimization with complexity guarantees

Michael O'Neill¹, Stephen J. Wright¹•Institutions (1)

University of Wisconsin-Madison¹

22 Jan 2021-IMA Journal of Numerical Analysis

TL;DR: In this article, the authors describe an algorithm based on a logarithmic barrier function, Newton's method, and linear conjugate gradients that obtains an approximate minimizer of a smooth function over the nonnegative orthant.

Abstract: We describe an algorithm based on a logarithmic barrier function, Newton's method, and linear conjugate gradients that obtains an approximate minimizer of a smooth function over the nonnegative orthant. We develop a bound on the complexity of the approach, stated in terms of the required accuracy and the cost of a single gradient evaluation of the objective function and/or a matrix-vector multiplication involving the Hessian of the objective. The approach can be implemented without explicit calculation or storage of the Hessian.
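
The remark that the Hessian never needs to be formed can be illustrated with a toy Newton-CG step on a log-barrier function. The sketch assumes a convex quadratic objective $f(x) = \tfrac{1}{2} x^\top A x - b^\top x$ with barrier term $-\mu \sum_i \log x_i$, and a plain conjugate gradient loop that sees the Hessian only through matrix-vector products. It is not the authors' algorithm: there is no negative-curvature handling, line search, or complexity safeguard, and all names and values are hypothetical.

```python
def barrier_grad(A, b, x, mu):
    """Gradient of 0.5*x'Ax - b'x - mu*sum(log x_i) at a point x > 0."""
    Ax = [sum(aij * xj for aij, xj in zip(row, x)) for row in A]
    return [Ax[i] - b[i] - mu / x[i] for i in range(len(x))]

def barrier_hvp(A, x, mu, v):
    """Product (A + mu*diag(1/x_i^2)) v; the barrier Hessian is never stored as a matrix."""
    Av = [sum(aij * vj for aij, vj in zip(row, v)) for row in A]
    return [Av[i] + mu * v[i] / x[i] ** 2 for i in range(len(x))]

def conjugate_gradient(hvp, rhs, iters=50, tol=1e-12):
    """Solve H d = rhs using only products H v (H assumed positive definite)."""
    d = [0.0] * len(rhs)
    r = list(rhs)           # residual rhs - H d, with d = 0 initially
    p = list(r)             # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        if rs < tol:
            break
        Hp = hvp(p)
        alpha = rs / sum(pi * hi for pi, hi in zip(p, Hp))
        d = [di + alpha * pi for di, pi in zip(d, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hp)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return d

A = [[2.0, 0.5], [0.5, 1.0]]  # toy positive definite matrix (assumed values)
b = [1.0, -1.0]
x, mu = [1.0, 1.0], 0.1       # strictly positive iterate and barrier parameter
g = barrier_grad(A, b, x, mu)
step = conjugate_gradient(lambda v: barrier_hvp(A, x, mu, v), [-gi for gi in g])
```

In this toy problem the Hessian of $f$ is simply $A$; for a general smooth objective the product would come from automatic differentiation or a finite difference of gradients, which is where the per-iteration cost quoted in the abstract arises.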
