TL;DR: Experimental results show that the AOA provides very promising results in solving challenging optimization problems compared with eleven other well-known optimization algorithms.
TL;DR: In this paper, an optical neural network achieves 99% accuracy on handwritten-digit classification using 3.2 detected photons per weight multiplication and 90% accuracy using ~0.64 photons (~$2.4 \times 10^{-19}$ J of optical energy).
Abstract: Deep learning has rapidly become a widespread tool in both scientific and commercial endeavors. Milestones of deep learning exceeding human performance have been achieved for a growing number of tasks over the past several years, across areas as diverse as game-playing, natural-language translation, and medical-image analysis. However, continued progress is increasingly hampered by the high energy costs associated with training and running deep neural networks on electronic processors. Optical neural networks have attracted attention as an alternative physical platform for deep learning, as it has been theoretically predicted that they can fundamentally achieve higher energy efficiency than neural networks deployed on conventional digital computers. Here, we experimentally demonstrate an optical neural network achieving 99% accuracy on handwritten-digit classification using ~3.2 detected photons per weight multiplication and ~90% accuracy using ~0.64 photons (~$2.4 \times 10^{-19}$ J of optical energy) per weight multiplication. This performance was achieved using a custom free-space optical processor that executes matrix-vector multiplications in a massively parallel fashion, with up to ~0.5 million scalar (weight) multiplications performed at the same time. Using commercially available optical components and standard neural-network training methods, we demonstrated that optical neural networks can operate near the standard quantum limit with extremely low optical powers and still achieve high accuracy. Our results provide a proof-of-principle for low-optical-power operation, and with careful system design including the surrounding electronics used for data storage and control, open up a path to realizing optical processors that require only $10^{-16}$ J total energy per scalar multiplication -- which is orders of magnitude more efficient than current digital processors.
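As a quick sanity check on the figures quoted above, the statement that ~0.64 photons correspond to ~$2.4 \times 10^{-19}$ J can be converted into an implied wavelength through $E = hc/\lambda$. The abstract does not state the operating wavelength, so the sketch below only derives what the two quoted numbers imply; the function names are illustrative and not taken from the paper.

```python
# Physical constants (exact SI defining values).
PLANCK_H = 6.62607015e-34   # Planck constant, J*s
LIGHT_C = 299_792_458.0     # speed of light, m/s


def implied_wavelength_nm(total_energy_j: float, photons: float) -> float:
    """Wavelength (nm) implied by spreading total_energy_j over `photons` photons."""
    energy_per_photon = total_energy_j / photons
    return PLANCK_H * LIGHT_C / energy_per_photon * 1e9


def optical_energy_j(photons: float, wavelength_nm: float) -> float:
    """Optical energy (J) carried by `photons` photons at the given wavelength."""
    return photons * PLANCK_H * LIGHT_C / (wavelength_nm * 1e-9)


# Figures quoted in the abstract: ~0.64 photons is about 2.4e-19 J per multiplication.
wavelength = implied_wavelength_nm(2.4e-19, 0.64)
print(f"implied wavelength: {wavelength:.0f} nm")
print(f"energy at 3.2 photons: {optical_energy_j(3.2, wavelength):.2e} J")
```

Under these numbers the implied wavelength is roughly 530 nm (visible green light), and the 99%-accuracy operating point of ~3.2 photons corresponds to about $1.2 \times 10^{-18}$ J of optical energy per weight multiplication.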
TL;DR: This article proposes an improved logarithmic multiplier (ILM) that, unlike existing designs, rounds both inputs to their nearest powers of two by using a proposed nearest-one detector (NOD) circuit.
Abstract: Multiplication is the most resource-hungry operation in neural networks (NNs). Logarithmic multipliers (LMs) simplify multiplication to shift and addition operations and thus reduce the energy consumption. Since implementing the logarithm in a compact circuit often introduces approximation, some accuracy loss is inevitable in LMs. However, this inaccuracy accords with the inherent error tolerance of NNs and their associated applications. This article proposes an improved logarithmic multiplier (ILM) that, unlike existing designs, rounds both inputs to their nearest powers of two by using a proposed nearest-one detector (NOD) circuit. Considering that the output of the NOD uses a one-hot representation, some entries in the truth table of a conventional adder cannot occur. Hence, a compact adder is designed for the reduced truth table. The 8×8 ILM achieves up to 17.48 percent saving in power consumption compared to a recent LM in the literature while being almost 8 percent more accurate. Moreover, the evaluation of the ILM for two benchmark NN workloads shows up to 21.85 percent reduction in energy consumption compared to the NNs implemented with other LMs. Interestingly, using the ILM increases the classification accuracy of the considered NNs by up to 1.4 percent compared to a NN implementation that uses exact multipliers.
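The idea of rounding both operands to their nearest powers of two can be sketched behaviourally in a few lines. This is a minimal software model under stated assumptions, not the authors' NOD circuit or compact adder: it assumes unsigned integer operands, breaks ties towards the larger power of two, and approximates the product by dropping the residual cross term. All function names are hypothetical.

```python
def nearest_pow2_exponent(n: int) -> int:
    """Exponent k such that 2**k is the power of two nearest to n (ties round up)."""
    if n <= 0:
        raise ValueError("operand must be a positive integer")
    k = n.bit_length() - 1              # position of the leading one
    lower, upper = 1 << k, 1 << (k + 1)
    return k + 1 if (n - lower) >= (upper - n) else k


def log_multiply(a: int, b: int) -> int:
    """Approximate a*b using only power-of-two scalings (shifts) and additions."""
    if a == 0 or b == 0:
        return 0
    ka, kb = nearest_pow2_exponent(a), nearest_pow2_exponent(b)
    qa, qb = a - (1 << ka), b - (1 << kb)   # signed residuals, may be negative
    # Exact product is 2**(ka+kb) + qa*2**kb + qb*2**ka + qa*qb.
    # The approximation drops the qa*qb term; each remaining product is a shift in hardware.
    return (1 << (ka + kb)) + qa * (1 << kb) + qb * (1 << ka)


# Mean relative error over all nonzero unsigned 8-bit operand pairs.
errors = [abs(a * b - log_multiply(a, b)) / (a * b)
          for a in range(1, 256) for b in range(1, 256)]
print(f"mean relative error: {sum(errors) / len(errors):.4f}")
```

Because the residuals are measured from the nearest power of two rather than from the leading one, they can be negative, so the dropped cross term has mixed sign instead of always underestimating the product.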
TL;DR: In this article, the authors proposed several constant-round ZK protocols in the circuit-based model, which improve the concrete efficiency and enable sublinear amortized communication for circuits with some notion of relaxed uniformity.
Abstract: Zero-knowledge (ZK) proofs with an optimal memory footprint have attracted a lot of attention, because such protocols can easily prove very large computations with a small memory requirement. Such a ZK protocol needs only O(M) memory for both parties, where M is the memory required to verify the statement in the clear. In this paper, we propose several new constant-round ZK protocols in this setting, which improve the concrete efficiency and, at the same time, enable sublinear amortized communication for circuits with some notion of relaxed uniformity. In the circuit-based model, where the computation is represented as a circuit over a field, our ZK protocol achieves a communication complexity of 1 field element per non-linear gate for any field size while keeping the computation very cheap. We implemented our protocol, which shows extremely high efficiency and affordability. Compared to the previous best-known implementation, we achieve 6x--7x improvement in computation and 3x--7x improvement in communication. When running on intro-level AWS instances, our protocol needs only one US dollar to prove one trillion AND gates (or 2.5 US dollars for one trillion multiplication gates over a 61-bit field). In the setting where part of the computation can be represented as a set of polynomials with a "degree-separated" format, we can achieve communication sublinear in the polynomial size: the communication depends only on the total number of distinct variables in all the polynomials and the highest degree of all polynomials, independent of the number of multiplications needed to compute all polynomials. Using the improved ZK protocol, we can prove matrix multiplication with communication proportional to the input size, rather than the number of multiplications. Proving the multiplication of two 1024 x 1024 matrices, our implementation, with one thread and 1 GB of memory, needs only 10 seconds and communicates 25 MB.
TL;DR: In this paper, the authors proposed an improved version of the AOA called nAOA algorithm, which uses the high-density values that the natural logarithm and exponential operators can generate, to enhance the exploratory ability of AOA.
Abstract: The distributive power of the arithmetic operators: multiplication, division, addition, and subtraction, gives the arithmetic optimization algorithm (AOA) its unique ability to find the global optimum for optimization problems used to test its performance. Several other mathematical operators exist with the same or better distributive properties, which can be exploited to enhance the performance of the newly proposed AOA. In this paper, we propose an improved version of the AOA called nAOA algorithm, which uses the high-density values that the natural logarithm and exponential operators can generate, to enhance the exploratory ability of the AOA. The addition and subtraction operators carry out the exploitation. The candidate solutions are initialized using the beta distribution, and the random variables and adaptations used in the algorithm have beta distribution. We test the performance of the proposed nAOA with 30 benchmark functions (20 classical and 10 composite test functions) and three engineering design benchmarks. The performance of nAOA is compared with the original AOA and nine other state-of-the-art algorithms. The nAOA shows efficient performance for the benchmark functions and was second only to GWO for the welded beam design (WBD), compression spring design (CSD), and pressure vessel design (PVD).
TL;DR: Zhang et al. propose an FPGA-centric mixed scheme quantization (MSQ) with an ensemble of the proposed sum-of-power-of-2 (SP2) and the fixed-point schemes.
Abstract: Deep Neural Networks (DNNs) have achieved extraordinary performance in various application domains. To support diverse DNN models, efficient implementations of DNN inference on edge-computing platforms, e.g., ASICs, FPGAs, and embedded systems, are extensively investigated. Due to the huge model size and computation amount, model compression is a critical step to deploy DNN models on edge devices. This paper focuses on weight quantization, a hardware-friendly model compression approach that is complementary to weight pruning. Unlike existing methods that use the same quantization scheme for all weights, we propose the first solution that applies different quantization schemes for different rows of the weight matrix. It is motivated by (1) the fact that the weight distributions of different rows are not the same; and (2) the potential of achieving better utilization of heterogeneous FPGA hardware resources. To achieve that, we first propose a hardware-friendly quantization scheme named sum-of-power-of-2 (SP2) suitable for Gaussian-like weight distributions, in which the multiplication arithmetic can be replaced with a logic shifter and adder, thereby enabling highly efficient implementations with the FPGA LUT resources. In contrast, the existing fixed-point quantization is suitable for Uniform-like weight distributions and can be implemented efficiently by DSPs. Then, to fully explore the resources, we propose an FPGA-centric mixed scheme quantization (MSQ) with an ensemble of the proposed SP2 and the fixed-point schemes. Combining the two schemes can maintain, or even increase, accuracy due to better matching with weight distributions. For the FPGA implementations, we develop a parameterized architecture with heterogeneous Generalized Matrix Multiplication (GEMM) cores, one using LUTs for computations with SP2 quantized weights and the other utilizing DSPs for fixed-point quantized weights. Given the partition ratio among the two schemes based on resource characterization, the MSQ quantization training algorithm derives an optimally quantized model for the FPGA implementation. We evaluate our FPGA-centric quantization framework across multiple application domains. With optimal SP2/fixed-point ratios on two FPGA devices, i.e., Zynq XC7Z020 and XC7Z045, we achieve performance improvement of 2.1$\times$ to 4.1$\times$ compared to solely exploiting DSPs for all multiplication operations. In addition, the CNN implementations with the proposed MSQ scheme can achieve higher accuracy and comparable hardware utilization efficiency compared to the state-of-the-art designs.
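The claim that SP2 weights let a multiplication be replaced by shifts and an addition can be illustrated with a small integer model. This is a sketch under assumptions that are not in the abstract: weights are treated as positive integers of the form $2^p + 2^q$ (a scaled stand-in for the paper's actual level set), ties are broken towards the smaller level, and all function names are hypothetical.

```python
from itertools import combinations_with_replacement


def sp2_levels(max_exp: int) -> dict:
    """Map every value 2**p + 2**q (0 <= p <= q <= max_exp) to its exponent pair."""
    return {(1 << p) + (1 << q): (p, q)
            for p, q in combinations_with_replacement(range(max_exp + 1), 2)}


def quantize_sp2(weight: int, max_exp: int) -> tuple:
    """Exponent pair of the SP2 level nearest to `weight` (ties go to the smaller level)."""
    levels = sp2_levels(max_exp)
    nearest = min(levels, key=lambda level: (abs(level - weight), level))
    return levels[nearest]


def sp2_multiply(x: int, exponents: tuple) -> int:
    """Multiply x by the SP2 weight 2**p + 2**q using two shifts and one addition."""
    p, q = exponents
    return (x << p) + (x << q)


exponents = quantize_sp2(12, max_exp=3)   # 12 = 2**2 + 2**3 is itself an SP2 level
print(exponents, sp2_multiply(5, exponents))
```

The shift-and-add form is what makes such weights a natural fit for LUT-based logic, while fixed-point weights still need a general multiplier such as a DSP block.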
TL;DR: This work presents an efficient and fast implementation of NTT, inverse NTT and NTT-based polynomial multiplication operations for GPU platforms, and demonstrates that the GPU implementation can be utilized as an actual accelerator.
Abstract: Lattice-based cryptography forms the mathematical basis for current homomorphic encryption schemes, which allow computation directly on encrypted data. Homomorphic encryption enables privacy-preserving applications such as secure cloud computing; yet, its practical applications suffer from the high computational complexity of homomorphic operations. Fast implementations of the homomorphic encryption schemes heavily depend on efficient polynomial arithmetic, in particular the multiplication of very large degree polynomials over polynomial rings. The number theoretic transform (NTT) accelerates large polynomial multiplication significantly, and therefore, it is the core arithmetic operation in the majority of homomorphic encryption scheme implementations. Consequently, practical homomorphic applications require efficient and fast implementations of NTT on different computing platforms. In this work, we present an efficient and fast implementation of NTT, inverse NTT and NTT-based polynomial multiplication operations for GPU platforms. To demonstrate that our GPU implementation can be utilized as an actual accelerator, we experimented with the key generation, the encryption and the decryption operations of the Brakerski/Fan–Vercauteren (BFV) homomorphic encryption scheme implemented in Microsoft’s SEAL homomorphic encryption library on GPU, all of which heavily depend on the NTT-based polynomial multiplication. Our GPU implementations improve the performance of these three BFV operations by up to 141.95$\times$, 105.17$\times$ and 90.13$\times$, respectively, on a Tesla V100 GPU compared to the highly optimized SEAL library running on an Intel i9-7900X CPU.
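To make the role of the NTT concrete, the following is a minimal CPU-side sketch of NTT-based polynomial multiplication in pure Python. It is not the paper's GPU implementation: it assumes the commonly used NTT-friendly prime 998244353 with primitive root 3, uses a simple recursive radix-2 transform, and computes an ordinary (zero-padded) product rather than the negacyclic product modulo $x^n + 1$ used in BFV.

```python
MOD = 998_244_353   # NTT-friendly prime: 119 * 2**23 + 1
ROOT = 3            # a primitive root modulo MOD


def ntt(values, invert=False):
    """Recursive radix-2 NTT; len(values) must be a power of two (no 1/n scaling)."""
    n = len(values)
    if n == 1:
        return list(values)
    w_n = pow(ROOT, (MOD - 1) // n, MOD)   # primitive n-th root of unity
    if invert:
        w_n = pow(w_n, MOD - 2, MOD)       # modular inverse via Fermat's little theorem
    even = ntt(values[0::2], invert)
    odd = ntt(values[1::2], invert)
    out = [0] * n
    w = 1
    for i in range(n // 2):
        t = w * odd[i] % MOD
        out[i] = (even[i] + t) % MOD            # butterfly, upper half
        out[i + n // 2] = (even[i] - t) % MOD   # butterfly, lower half
        w = w * w_n % MOD
    return out


def intt(values):
    """Inverse NTT, including the final multiplication by n^-1 mod MOD."""
    n_inv = pow(len(values), MOD - 2, MOD)
    return [v * n_inv % MOD for v in ntt(values, invert=True)]


def poly_multiply(a, b):
    """Product of two coefficient lists modulo MOD via pointwise multiplication."""
    size = 1
    while size < len(a) + len(b) - 1:
        size *= 2
    fa = ntt(a + [0] * (size - len(a)))
    fb = ntt(b + [0] * (size - len(b)))
    product = intt([x * y % MOD for x, y in zip(fa, fb)])
    return product[: len(a) + len(b) - 1]


# (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
print(poly_multiply([1, 2, 3], [4, 5]))
```

The butterflies inside each transform stage are independent of one another, which is the structure a GPU implementation can exploit by assigning them to parallel threads.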
TL;DR: This study shows that a simple convolutional neural network using HVCs performs as well as the prior best performing capsule network on MNIST using 5.5× fewer parameters, 4× fewer training epochs, no reconstruction sub-network, and requiring no routing mechanism.
TL;DR: In this article, the authors consider a secure multi-party computation problem (MPC), where the goal is to offload the computation of an arbitrary polynomial function of some massive private matrices (inputs) to a cluster of workers.
Abstract: In this article, we consider a secure multi-party computation problem (MPC), where the goal is to offload the computation of an arbitrary polynomial function of some massive private matrices (inputs) to a cluster of workers. The workers are not reliable. Some of them may collude to gain information about the input data (semi-honest workers). The system is initialized by sharing a (randomized) function of each input matrix to each server. Since the input matrices are massive, each share's size is assumed to be at most a $1/k$ fraction of the input matrix, for some $k \in \mathbb{N}$. The objective is to minimize the number of workers needed to perform the computation task correctly, such that even if an arbitrary subset of $t-1$ workers, for some $t \in \mathbb{N}$, collude, they cannot gain any information about the input matrices. We propose a sharing scheme, called polynomial sharing, and show that it admits basic operations such as addition and multiplication of matrices and transposing a matrix. By concatenating the procedures for basic operations, we show that any polynomial function of the input matrices can be calculated, subject to the problem constraints. We show that the proposed scheme can offer order-wise gain in terms of the number of workers needed, compared to the approaches formed by the concatenation of job splitting and conventional MPC approaches.
TL;DR: This paper describes how HE Mul of HEAAN is performed in a manner friendly to non-crypto experts and conducts a disciplined analysis of its computational and memory-access characteristics, through which parallelism is extracted in the key functions composing HE Mul.
Abstract: Homomorphic Encryption (HE) has drawn significant attention as a privacy-preserving approach for cloud computing because it allows computation on encrypted messages called ciphertexts. Among the numerous HE schemes proposed thus far, HE for Arithmetic of Approximate Numbers (HEAAN) is rapidly gaining in popularity across a wide range of applications, as it supports messages that can tolerate approximate computations with no limit on the number of arithmetic operations applicable to the ciphertexts. A critical shortcoming of HE is the high computation complexity of ciphertext arithmetic; specifically, HE multiplication (HE Mul) is more than 10,000 times slower than the corresponding multiplication between unencrypted messages. This has led to a large body of HE acceleration studies, including those that exploit FPGAs; however, a rigorous analysis of the computational complexity and data access patterns of HE Mul is lacking. Moreover, the proposals mostly focused on designs with small parameter sizes, making it difficult to accurately estimate the performance of the HE accelerators when conducting a series of complex arithmetic operations. In this paper, we first describe how HE Mul of HEAAN is performed in a manner friendly to non-crypto experts. Then, we conduct a disciplined analysis of its computational and memory-access characteristics, through which we (1) extract parallelism in the key functions composing HE Mul and (2) demonstrate how to map the parallelism effectively to popular parallel processing platforms, CPUs and GPUs, by applying a series of optimizations such as transposing matrices and pinning data to threads. This leads to performance improvements of HE Mul on a CPU and a GPU by $2.06\times$ and $4.05\times$, respectively, over the reference HEAAN running on a CPU with 24 threads.
TL;DR: This work proposes a novel implementation technique for designing resource-efficient and low-power accurate and approximate multipliers which are optimized for FPGA-based systems.
Abstract: Multiplication is one of the most extensively used arithmetic operations in a wide range of applications. In order to provide resource-efficient and high-performance multipliers, previous works have proposed different designs of accurate and approximate multipliers, mainly for ASIC-based systems. However, the architectural differences between ASIC- and FPGA-based systems limit the effectiveness of these multipliers for FPGA-based systems. Moreover, most of these multiplier designs are valid only for unsigned numbers. To bridge this gap, we propose a novel implementation technique for designing resource-efficient and low-power accurate and approximate signed multipliers which are optimized for FPGA-based systems. Compared to Vivado's area-optimized multiplier IPs, the designs obtained using our proposed technique occupy 47 to 63 percent less area (Lookup Tables). To accelerate further research in this direction and reproduce the presented results, the RTL and behavioral models of our proposed methodology are available as an open-source library (available online: https://cfaed.tu-dresden.de/pd-downloads).
TL;DR: This paper proposes CodedSketch, as a distributed straggler-resistant scheme to compute an approximation of the multiplication of two massive matrices, which provides an upper-bound on the recovery threshold as a function of the required accuracy of computation and the probability that the expected accuracy can be violated.
Abstract: In this paper, we propose CodedSketch, a distributed straggler-resistant scheme to compute an approximation of the multiplication of two massive matrices. The objective is to reduce the recovery threshold, defined as the total number of worker nodes that the master node needs to wait for to be able to recover the final result. To exploit the fact that only an approximated result is required, some form of pre-compression is required in reducing the recovery threshold. However, compression inherently involves some randomness that would lose the structure of the matrices. On the other hand, considering the structure of the matrices is crucial to reduce the recovery threshold. In CodedSketch, we use count–sketch, as a hash-based compression scheme, on the rows of the first and columns of the second matrix, and a structured polynomial code on the columns of the first and rows of the second matrix. This arrangement allows us to exploit the gain of both in reducing the recovery threshold. To increase the accuracy of computation, multiple independent count–sketches are needed. This independence allows us to theoretically characterize the accuracy of the result and establish the recovery threshold achieved by the proposed scheme. To guarantee the independence of the resulting count–sketches in the output, while keeping its cost on the recovery threshold minimal, we use another layer of structured codes. The proposed scheme provides an upper bound on the recovery threshold as a function of the required accuracy of computation and the probability that the required accuracy can be violated. In addition, it provides an upper bound on the recovery threshold for the case that the result of the multiplication is sparse, and the exact result is required.
TL;DR: The study identifies the critical factors in the convolution, fully-connected, and batch normalization layers that allow more accurate CNN predictions despite the errors from approximate multiplication, and provides an arithmetic explanation of why bfloat16 multiplication performs well on CNNs.
Abstract: This paper analyzes the effects of approximate multiplication when performing inferences on deep convolutional neural networks (CNNs). The approximate multiplication can reduce the cost of the underlying circuits so that CNN inferences can be performed more efficiently in hardware accelerators. The study identifies the critical factors in the convolution, fully-connected, and batch normalization layers that allow more accurate CNN predictions despite the errors from approximate multiplication. The same factors also provide an arithmetic explanation of why bfloat16 multiplication performs well on CNNs. The experiments are performed with recognized network architectures to show that the approximate multipliers can produce predictions that are nearly as accurate as the FP32 references, without additional training. For example, the ResNet and Inception-v4 models with Mitch-w6 multiplication produce Top-5 errors that are within 0.2% of the FP32 references. A brief cost comparison of Mitch-w6 against bfloat16 is presented, where a MAC operation saves up to 80% of energy compared to the bfloat16 arithmetic. The most far-reaching contribution of this paper is the analytical justification that multiplications can be approximated while additions need to be exact in CNN MAC operations.
TL;DR: In this paper, the authors proposed a novel extension of LCC to the analog domain, referred to as analog LCC (ALCC), where all the operations in the proposed ALCC protocol are done over the infinite fields of $\mathbb{R}/\mathbb{C}$ but for practical implementations floating-point numbers are used.
Abstract: A distributed computing scenario is considered, where the computational power of a set of worker nodes is used to perform a certain computation task over a dataset that is dispersed among the workers. Lagrange coded computing (LCC), proposed by Yu et al., leverages the well-known Lagrange polynomial to perform polynomial evaluation of the dataset in such a scenario in an efficient parallel fashion while keeping the privacy of data amidst possible collusion of workers. This solution relies on quantizing the data into a finite field, so that Shamir’s secret sharing, as one of its main building blocks, can be employed. Such a solution, however, does not scale properly with the size of the dataset, mainly due to computation overflows. To address this critical issue, we propose a novel extension of LCC to the analog domain, referred to as analog LCC (ALCC). All the operations in the proposed ALCC protocol are done over the infinite fields of $\mathbb{R}/\mathbb{C}$ but for practical implementations floating-point numbers are used. We characterize the privacy of data in ALCC, against any subset of colluding workers up to a certain size, in terms of the distinguishing security (DS) and the mutual information security (MIS) metrics. Also, the accuracy of the outcome is characterized in a practical setting assuming operations are performed using floating-point numbers. Consequently, a fundamental trade-off between the accuracy of the outcome of ALCC and its privacy level is observed and is numerically evaluated. Moreover, we implement the proposed scheme to perform matrix-matrix multiplication over a batch of matrices. It is observed that ALCC is superior compared to the state-of-the-art LCC, implemented using fixed-point numbers, assuming both schemes use an equal number of bits to represent data symbols.
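For readers unfamiliar with the finite-field building block mentioned above, the following is a minimal sketch of Shamir's secret sharing with reconstruction by Lagrange interpolation. It illustrates the classical finite-field scheme that LCC builds on, not ALCC itself; the prime $2^{61} - 1$, the share indices $1, \dots, n$, and the function names are assumptions chosen for illustration.

```python
import secrets

PRIME = 2**61 - 1   # a Mersenne prime, used here as the field size (assumption)


def make_shares(secret: int, threshold: int, num_shares: int):
    """Split `secret` so that any `threshold` shares recover it and fewer reveal nothing."""
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret % PRIME] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):      # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, num_shares + 1)]


def reconstruct(shares) -> int:
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        # Division in the field is multiplication by the modular inverse.
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret


shares = make_shares(12345, threshold=3, num_shares=5)
print(reconstruct(shares[:3]), reconstruct(shares[2:]))
```

Every value must be reduced modulo the field size, which is why real-valued data has to be quantized first; this is the step whose overflow behaviour motivates moving to the analog domain.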
TL;DR: Finite-time synchronization of Clifford-valued neural networks with finite-time distributed delays is explored on the basis of a new Lyapunov–Krasovskii functional, suitable controller and new computational techniques.
Abstract: In this paper, we explore the finite-time synchronization of Clifford-valued neural networks with finite-time distributed delays. To address the problem associated with non-commutativity pertaining to the multiplication of Clifford numbers, the original n-dimensional Clifford-valued drive and response systems are firstly decomposed into the corresponding 2m-dimensional real-valued counterparts. On the basis of a new Lyapunov–Krasovskii functional, suitable controller and new computational techniques, finite-time synchronization criteria are formulated for the corresponding real-valued drive and response systems. The feasibility of the main results is verified by a numerical example.
TL;DR: Entangled polynomial codes can be further extended to also include these three important settings, providing unified frameworks that order-wise reduce the total computational costs by achieving subcubic recovery thresholds.
Abstract: Coded computing is a new framework to address fundamental issues in large scale distributed computing, by injecting structured randomness and redundancy. We first provide an overview of coded computing and summarize some recent advances. Then we focus on distributed matrix multiplication and consider a common scenario where each worker is assigned a fraction of the multiplication task. In particular, by partitioning two input matrices into $m$-by-$p$ and $p$-by-$n$ subblocks, a single multiplication task can be viewed as computing linear combinations of $pmn$ submatrix products, which can be assigned to $pmn$ workers. Such block-partitioning-based designs have been widely studied under the topics of secure, private, and batch computation, where the state-of-the-art schemes all require computing at least a “cubic” ($pmn$) number of submatrix multiplications. Entangled polynomial codes, first presented for straggler mitigation, provide a powerful method for breaking the cubic barrier. They achieve a subcubic recovery threshold, i.e., recovering the final product from any subset of multiplication results with a size order-wise smaller than $pmn$. We show that entangled polynomial codes can be further extended to also include these three important settings, providing unified frameworks that order-wise reduce the total computational costs by achieving subcubic recovery thresholds.
TL;DR: This study proposes a simple translation-based model that relies on the fact that the multiplication of subjects and relations is approximately equal to the object to achieve state-of-the-art results compared with the baseline.
Abstract: The knowledge graph completion task involves predicting missing entities and relations in a knowledge graph. Many models have achieved good results, but they have become increasingly complex. In this study, we propose a simple translation-based model that relies on the fact that the multiplication of subjects and relations is approximately equal to the object. First, we utilize embeddings to represent entities and relations. Second, we perform vector multiplication on subject embedding and relation embedding to generate a 2D matrix and achieve full fusion of embedding at the element level. Third, we adopt a convolutional neural network on the 2D matrix. Thereafter, we can generate feature maps, which are then spliced into a 1D feature vector. The feature vector is transformed into predicted object embedding through a fully connected operation. Finally, we use the scoring function to score the candidate triples. Experimental results strongly demonstrate that the translation knowledge graph completion model based on 2D convolution achieves state-of-the-art results compared with the baseline.
TL;DR: In this paper, the authors propose to parameterize hypercomplex multiplications, allowing models to learn multiplication rules from data regardless of whether such rules are predefined, which can be applied to the LSTM and Transformer models on natural language inference, machine translation, text style transfer, and subject verb agreement.
Abstract: Recent works have demonstrated reasonable success of representation learning in hypercomplex space. Specifically, “fully-connected layers with Quaternions” (4D hypercomplex numbers), which replace real-valued matrix multiplications in fully-connected layers with Hamilton products of Quaternions, both enjoy parameter savings with only 1/4 learnable parameters and achieve comparable performance in various applications. However, one key caveat is that hypercomplex space only exists at very few predefined dimensions (4D, 8D, and 16D). This restricts the flexibility of models that leverage hypercomplex multiplications. To this end, we propose parameterizing hypercomplex multiplications, allowing models to learn multiplication rules from data regardless of whether such rules are predefined. As a result, our method not only subsumes the Hamilton product, but also learns to operate on any arbitrary nD hypercomplex space, providing more architectural flexibility using arbitrarily 1/n learnable parameters compared with the fully-connected layer counterpart. Experiments of applications to the LSTM and Transformer models on natural language inference, machine translation, text style transfer, and subject verb agreement demonstrate architectural flexibility and effectiveness of the proposed approach.
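The Hamilton product that the abstract refers to is a fixed, predefined multiplication rule on 4D numbers, and it can be written out directly. The sketch below is a plain-Python illustration of that fixed rule and of the weight sharing behind the 1/4 parameter count; it does not implement the paper's learned parameterization, and the function names are hypothetical.

```python
def hamilton_product(p, q):
    """Hamilton product of quaternions p = (a1, b1, c1, d1) and q = (a2, b2, c2, d2)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,   # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,   # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,   # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,   # k component
    )


def left_multiplication_matrix(p):
    """4x4 real matrix M with M @ q == hamilton_product(p, q): 16 entries from 4 parameters."""
    a, b, c, d = p
    return [
        [a, -b, -c, -d],
        [b,  a, -d,  c],
        [c,  d,  a, -b],
        [d, -c,  b,  a],
    ]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return tuple(sum(m * v for m, v in zip(row, vector)) for row in matrix)


i, j = (0, 1, 0, 0), (0, 0, 1, 0)
print(hamilton_product(i, j))   # i * j
print(hamilton_product(j, i))   # j * i, differs in sign: the product is non-commutative
```

The 4x4 matrix reuses the same four numbers in every row, which is where the parameter saving over an unconstrained real-valued layer comes from; a learned parameterization replaces this hard-coded sign and position pattern with one fitted from data.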
TL;DR: This article balances the exploration and exploitation search mechanisms by incorporating a mutation strategy into the AOA; the resulting method is named the mutation-based arithmetic optimization algorithm (m-AOA).
Abstract: Real-life optimization problems need an algorithm that efficiently explores the search area to obtain a global optimal solution. The arithmetic optimization algorithm (AOA) is a recently developed algorithm that performs its search using the basic mathematical operations: addition, multiplication, division, and subtraction. However, the AOA gets stuck at a local optimum for some functions due to an inadequate balance between diversification and intensification. Therefore, in the present work, a balance is provided between the exploration and exploitation search mechanisms by incorporating a mutation strategy into the AOA; the result is named the mutation-based arithmetic optimization algorithm (m-AOA). The proposed algorithm's performance is tested on twenty-three benchmark functions having different characteristics. The results obtained are compared with other optimization algorithms in terms of performance parameters, including average, standard deviation, median, worst and best values. The comparison suggests that the proposed method outperforms the other algorithms.
TL;DR: In this article, the authors re-examine some of the classical pointwise multiplication theorems in Sobolev-Slobodeckij spaces, in part motivated by a simple counter-example that illustrates how certain multiplication theorems fail when a bounded domain is replaced by $\mathbb{R}^n$, and examine why the same failure is not encountered in Bessel potential spaces.
Abstract: In this article, we re-examine some of the classical pointwise multiplication theorems in Sobolev-Slobodeckij spaces, in part motivated by a simple counter-example that illustrates how certain multiplication theorems fail in Sobolev-Slobodeckij spaces when a bounded domain is replaced by $\mathbb{R}^n$. We identify the source of the failure, and examine why the same failure is not encountered in Bessel potential spaces. To analyze the situation, we begin with a survey of the classical multiplication results stated and proved in the 1977 article of Zolesio, and carefully distinguish between the case of spaces defined on all of $\mathbb{R}^n$ and spaces defined on a bounded domain (with e.g. a Lipschitz boundary). However, the survey we give has a few new wrinkles; the proofs we include are based almost exclusively on interpolation theory rather than Littlewood-Paley theory and Besov spaces, and some of the results we give and their proofs, including the results for negative exponents, do not appear in the literature in this form. We also include a particularly important variation of one of the multiplication theorems that is relevant to the study of nonlinear PDE systems arising in general relativity and other areas. The conditions for multiplication to be continuous in the case of Sobolev-Slobodeckij spaces are somewhat subtle and intertwined, and as a result, the multiplication theorems of Zolesio in 1977 have been cited (more than once) in the standard literature in slightly more generality than what is actually proved by Zolesio, and in cases that allow for the construction of counter-examples such as the one included here.
TL;DR: In this paper, a 10T static random access memory (SRAM) bit-cell is proposed for scalable computing-in-memory designs; the first design achieves fully parallel computing and high throughput using 32 parallel binary MAC operations, reaching 409.6 GOPS throughput, 1001.7 TOPS/W energy efficiency, and a 169.9 TOPS/mm2 throughput area efficiency.
Abstract: Computing-in-memory (CIM) is a promising approach to reduce latency and improve the energy efficiency of the multiply-and-accumulate (MAC) operation under a memory wall constraint for artificial intelligence (AI) edge processors. This paper proposes an approach focusing on scalable CIM designs using a new ten-transistor (10T) static random access memory (SRAM) bit-cell. Using the proposed 10T SRAM bit-cell, we present two SRAM-based CIM (SRAM-CIM) macros supporting multibit and binary MAC operations. The first design achieves fully parallel computing and high throughput using 32 parallel binary MAC operations. Advanced circuit techniques such as an input-dependent dynamic reference generator and an input-boosted sense amplifier are presented. Fabricated in a 28 nm CMOS process, this design achieves 409.6 GOPS throughput, 1001.7 TOPS/W energy efficiency, and a 169.9 TOPS/mm2 throughput area efficiency. The proposed approach effectively solves previous problems such as write disturb, throughput, and the power consumption of an analog-to-digital converter (ADC). The second design supports multibit MAC operation (4-b weight, 4-b input, and 8-b output) to increase the inference accuracy. We propose an architecture that divides 4-b weight and 4-b input multiplication into four 2-b multiplications in parallel, which increases the signal margin by $16\times$ compared to conventional 4-b multiplication. Besides, the capacitive digital-to-analog converter (CDAC) area issue is effectively addressed using the intrinsic bit-line capacitance existing in the SRAM-CIM architecture. The proposed approach of realizing four parallel 2-b multiplications using the CDAC is successfully demonstrated with a modified LeNet-5 neural network. These results demonstrate that the proposed 10T bit-cell is promising for realizing robust and scalable SRAM-CIM designs, which is essential for realizing fully parallel edge computing.
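The decomposition of a 4-b by 4-b multiplication into four 2-b multiplications rests on a simple arithmetic identity, which the following minimal sketch checks in software. It assumes unsigned operands and models only the arithmetic, not the analog CDAC recombination described in the abstract; the function names are hypothetical.

```python
def split_2b(x):
    """Split an unsigned 4-bit value into (high 2 bits, low 2 bits)."""
    if not 0 <= x <= 15:
        raise ValueError("expected an unsigned 4-bit value")
    return x >> 2, x & 0b11


def mul4_via_2b(w, a):
    """Multiply two unsigned 4-bit values using four 2-bit x 2-bit products."""
    wh, wl = split_2b(w)
    ah, al = split_2b(a)
    # Four independent 2-b products; in hardware these could run in parallel.
    hh, hl, lh, ll = wh * ah, wh * al, wl * ah, wl * al
    # Recombine with binary weights 16, 4, 4, 1 (assumed digital recombination).
    return (hh << 4) + ((hl + lh) << 2) + ll


if __name__ == "__main__":
    print(mul4_via_2b(13, 11))  # 143
```

Each partial product fits in 4 bits (at most 3 x 3 = 9), so the recombined result always fits in the 8-b output mentioned above.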
TL;DR: The major role of electronic devices is to provide low power dissipation and high speed in a compact area; their speed depends on arithmetic operations, among which multiplication plays a major role.
Abstract: The major role of electronic devices is providing low power dissipation and high speed with compact area. The speed of electronic devices depends on arithmetic operations. Multiplication is the imp...
TL;DR: An encoding method based on a two-step encoding algorithm for the 12 quasi-cyclic low-density parity-check (QC-LDPC) codes specified in the IEEE 802.11n/ac/ax standards allows the implementation of full-parallel architectures that execute the encoding process within a single clock cycle, or more for pipelined implementations, for any of the supported codes.
Abstract: This article proposes an encoding method based on a two-step encoding algorithm for the 12 quasi-cyclic low-density parity-check (QC-LDPC) codes specified in the IEEE 802.11n/ac/ax standards. The proposed approach jointly considers all codes of the particular set, instead of targeting each code separately. The proposed algorithm performs multiplication by inverse matrices. The complexity of the multiplications is significantly reduced by the introduced encoding method. It allows the implementation of full-parallel architectures that execute the encoding process within a single clock cycle, or more for pipelined implementations, for any of the supported codes. A corresponding VLSI encoding architecture based on XOR-gate trees is also proposed. The proposed solution exploits the structure and features of the involved matrices to extract common subexpressions (CSs) using common subexpression sharing techniques (CSST). Such expressions arise from common features of the original matrices and the corresponding inverses, identified in this article. Innovative subexpression extraction procedures that target the specific codes as a set are introduced here. Furthermore, illustrative single-clock hardware encoders derived by the proposed technique are integrated into 90- and 45-nm technologies at 1 GHz occupying 125 and 107 KGates, respectively, achieving throughput rates up to 1.62 Tbps.
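To illustrate why multiplication by a binary matrix maps onto XOR-gate trees, and what sharing a common subexpression means, here is a small sketch over GF(2). The 3 x 4 matrix is a made-up toy, not one of the IEEE 802.11n/ac/ax matrices, and the function names are hypothetical.

```python
from functools import reduce
from operator import xor

# Toy binary matrix, stored as the column indices of the ones in each row.
# Row 0 = b0^b1^b2, row 1 = b1^b2^b3, row 2 = b0^b3 (illustrative only).
ROWS = [[0, 1, 2], [1, 2, 3], [0, 3]]


def gf2_matvec(rows, bits):
    """Multiply a binary matrix by a bit vector over GF(2): one XOR tree per row."""
    return [reduce(xor, (bits[j] for j in cols), 0) for cols in rows]


def gf2_matvec_shared(bits):
    """Same product for ROWS, with the common subexpression b1^b2 computed once."""
    b0, b1, b2, b3 = bits
    t = b1 ^ b2  # shared by rows 0 and 1
    return [b0 ^ t, t ^ b3, b0 ^ b3]


if __name__ == "__main__":
    print(gf2_matvec(ROWS, [1, 0, 1, 1]))  # [0, 0, 0]
```

The direct form uses five two-input XORs for this matrix, while the shared form uses four; the savings reported in the abstract come from applying this kind of sharing at a much larger scale.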
TL;DR: In this article, the authors proposed a Posit Logarithm-Approximate Multiplication (PLAM) scheme to reduce the complexity of posit multipliers, one of the most power-hungry arithmetic units.
Abstract: The Posit Number System was introduced in 2017 as a replacement for floating-point numbers. Since then, the community has explored its application in several areas, such as deep learning, and produced some unit designs which are still far from being competitive with their floating-point counterparts. This paper proposes a Posit Logarithm-Approximate Multiplication (PLAM) scheme to significantly reduce the complexity of posit multipliers, one of the most power-hungry arithmetic units. The impact of this approach is evaluated in deep neural network inference, where there are no significant accuracy drops. Compared with state-of-the-art posit multipliers, experiments show that the proposed technique reduces the area, power, and delay of 32-bit hardware multipliers up to 72.86%, 81.79%, and 17.01%, respectively.
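A common form of logarithm-approximate multiplication is Mitchell's approximation, $\log_2(1+f) \approx f$ for $0 \le f < 1$, which replaces the fraction multiplier with an adder. The sketch below applies it to ordinary positive floating-point numbers purely for illustration; treating it as representative of PLAM is an assumption, and it does not model posit encoding or the hardware design.

```python
import math


def mitchell_mul(x, y):
    """Approximate x * y for positive x, y using log2(1 + f) ~= f (Mitchell)."""
    if x <= 0 or y <= 0:
        raise ValueError("this sketch handles positive operands only")
    mx, ex = math.frexp(x)  # x = mx * 2**ex with 0.5 <= mx < 1
    my, ey = math.frexp(y)
    # Rewrite each operand as (1 + f) * 2**k with 0 <= f < 1.
    fx, kx = 2.0 * mx - 1.0, ex - 1
    fy, ky = 2.0 * my - 1.0, ey - 1
    # Adding approximate logarithms: exponents add, fractions add.
    f, k = fx + fy, kx + ky
    if f >= 1.0:  # fraction overflow carries into the exponent
        f -= 1.0
        k += 1
    # Approximate antilogarithm: 2**(k + f) ~= (1 + f) * 2**k.
    return math.ldexp(1.0 + f, k)


if __name__ == "__main__":
    print(mitchell_mul(2.0, 4.0))  # 8.0 (exact for powers of two)
    print(mitchell_mul(3.0, 3.0))  # 8.0 (exact product is 9.0)
```

The approximation never overestimates, and its worst-case relative error is about 11.1%, reached when both fractions equal 0.5 as in the second example.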
TL;DR: A carry-truncated ternary multiplier, error compensation circuits, and $2 \times 2$ ternary multipliers with various accuracies are proposed using the low-power design methodology with carbon nanotube FETs, and the energy benefit of the proposed $6 \times 6$ approximate ternary multipliers has been verified using HSPICE simulation.
Abstract: We propose a novel approximate computing technique for low-power ternary multiplication. A carry-truncated ternary multiplier, error compensation circuits, and $2 \times 2$ ternary multipliers with various accuracies are proposed using the low-power design methodology with carbon nanotube FETs. An accuracy-configurable design method is proposed to design energy-efficient $6 \times 6$ approximate ternary multipliers. The energy benefit of the proposed $6 \times 6$ approximate ternary multipliers has been verified using HSPICE simulation. The proposed approximate design shows an 82.8% power-delay product improvement and a 41.8% mean absolute percentage error improvement over the previous approximate multiplier-based design. Image processing applications are conducted using the proposed approximate designs to confirm that the accuracy of ternary multiplication satisfies the user's requirement.
TL;DR: In this paper, various techniques for performing sparse times dense matrix multiplication (SpMM) as a distributed computation across many nodes are evaluated with a focus on GPU accelerators, and it is shown that the best choice of algorithm and implementation technique depends not only on the cost of communication for particular matrix sizes and dimensions, but also on the performance of local SpMM operations.
Abstract: Sparse times dense matrix multiplication (SpMM) finds its applications in well-established fields such as computational linear algebra as well as emerging fields such as graph neural networks. In this study, we evaluate the performance of various techniques for performing SpMM as a distributed computation across many nodes by focusing on GPU accelerators. We examine how the actual local computational performance of state-of-the-art SpMM implementations affects computational efficiency as dimensions change when we scale to large numbers of nodes, which proves to be an unexpectedly important bottleneck. We also consider various distribution strategies, including A-Stationary, B-Stationary, and C-Stationary algorithms, 1.5D and 2D algorithms, and RDMA-based and bulk synchronous methods of data transfer. Our results show that the best choice of algorithm and implementation technique depends not only on the cost of communication for particular matrix sizes and dimensions, but also on the performance of local SpMM operations. Our evaluations reveal that with the involvement of GPU accelerators, the best design choices for SpMM differ from the conventional algorithms that are known to perform well for dense matrix-matrix or sparse matrix-sparse matrix multiplies.
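For readers unfamiliar with the local SpMM operation the abstract refers to, the following is a minimal single-node sketch using the compressed sparse row (CSR) layout in plain Python. It assumes CSR storage and a row-major dense matrix, and does not represent the GPU kernels or the distribution strategies evaluated in the study; the function name is hypothetical.

```python
def spmm_csr(indptr, indices, data, B):
    """Compute C = A @ B, with sparse A in CSR form and dense B as a list of rows."""
    n_rows = len(indptr) - 1
    n_cols = len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Only the stored nonzeros of row i of A are visited.
        for p in range(indptr[i], indptr[i + 1]):
            a = data[p]
            b_row = B[indices[p]]
            for j in range(n_cols):
                C[i][j] += a * b_row[j]
    return C


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 0, 0],
    #      [0, 3, 0]]
    indptr, indices, data = [0, 2, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
    B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(spmm_csr(indptr, indices, data, B))
    # [[11.0, 14.0], [0.0, 0.0], [9.0, 12.0]]
```

The work is proportional to the number of nonzeros in A times the number of columns of B, which is why the width of the dense matrix influences local performance as dimensions change.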
TL;DR: In this article, an im2col algorithm acceleration method based on continuous memory address read was proposed to speed up the convolutional neural network (CNN) in the case of a single stride.
Abstract: Convolutional neural networks (CNNs) are now widely used in various common tasks such as image classification, semantic segmentation, and face recognition. Convolution layers are the core layers of CNNs; the computing speed of the convolution layer directly affects the computing speed of the entire network, thereby affecting the real-time performance. The current general convolutional layer acceleration method is to use the image to column (im2col) algorithm to split the input image into a column matrix, then use the general matrix multiplication (GEMM) to perform matrix multiplication on the column vector and the convolution kernel. This operation can greatly improve the computing speed of the convolutional layer because most computing platforms have more mature optimizations for GEMM. However, DSP is very fast for vector multiplication and addition. In the inference of the convolutional layer, the memory access of the im2col algorithm consumes far more time than the GEMM. This has become a bottleneck for further optimization of computing speed. In this article, I will present an im2col algorithm acceleration method in the case of a single stride based on continuous memory address read. With this method, the speed of the im2col algorithm can be increased by more than 10 times when processing a single-step convolutional layer. This is a portable method. In this article, I'll show the optimization effects on Xtensa BBE64ep DSP cores and stm32f4 processors.
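The im2col-plus-GEMM pipeline described above can be sketched in a few lines for a single-channel image with stride 1. This is only an illustration of why stride 1 permits contiguous reads (each row of the column matrix is assembled from adjacent input elements); it is not the author's DSP implementation, and the function names are hypothetical.

```python
def im2col_stride1(img, kh, kw):
    """Build the (kh*kw) x (oh*ow) column matrix for a stride-1 convolution."""
    h, w = len(img), len(img[0])
    oh, ow = h - kh + 1, w - kw + 1
    cols = []
    for di in range(kh):
        for dj in range(kw):
            row = []
            for i in range(oh):
                # Stride 1: the ow values needed here sit at consecutive
                # addresses, so they can be copied as one contiguous slice.
                row.extend(img[i + di][dj:dj + ow])
            cols.append(row)
    return cols, oh, ow


def conv2d_via_gemm(img, kernel):
    """Cross-correlate img with kernel by multiplying the flat kernel with im2col."""
    kh, kw = len(kernel), len(kernel[0])
    cols, oh, ow = im2col_stride1(img, kh, kw)
    flat_k = [v for r in kernel for v in r]
    # Row vector (flat kernel) times column matrix: one dot product per output.
    out = [sum(k * c for k, c in zip(flat_k, col)) for col in zip(*cols)]
    return [out[i * ow:(i + 1) * ow] for i in range(oh)]


if __name__ == "__main__":
    img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(conv2d_via_gemm(img, [[1, 0], [0, 1]]))  # [[6, 8], [12, 14]]
```

As is conventional for CNN layers, the kernel is applied without flipping (cross-correlation).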
TL;DR: In this article, the authors describe an algorithm based on a logarithmic barrier function, Newton's method, and linear conjugate gradients that obtains an approximate minimizer of a smooth function over the nonnegative orthant.
Abstract: We describe an algorithm based on a logarithmic barrier function, Newton's method, and linear conjugate gradients that obtains an approximate minimizer of a smooth function over the nonnegative orthant. We develop a bound on the complexity of the approach, stated in terms of the required accuracy and the cost of a single gradient evaluation of the objective function and/or a matrix-vector multiplication involving the Hessian of the objective. The approach can be implemented without explicit calculation or storage of the Hessian.
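The three ingredients named in the abstract (a logarithmic barrier, Newton's method, and linear conjugate gradients driven only by Hessian-vector products) can be combined in a toy sketch as follows. This is a simplified illustration under strong assumptions: a convex objective, a fixed barrier parameter `mu`, and a basic fraction-to-boundary step rule. It is not the algorithm analyzed in the paper and carries none of its complexity guarantees; all names are hypothetical.

```python
def cg(hvp, b, iters=50, rel_tol=1e-20):
    """Solve H d = b for symmetric positive definite H using only products H v."""
    d, r = [0.0] * len(b), list(b)
    p, rs = list(r), sum(v * v for v in b)
    rs0 = rs
    for _ in range(iters):
        if rs <= rel_tol * rs0:
            break
        Hp = hvp(p)
        alpha = rs / sum(pi * hi for pi, hi in zip(p, Hp))
        d = [di + alpha * pi for di, pi in zip(d, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hp)]
        rs_new = sum(v * v for v in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return d


def barrier_newton_cg(grad_f, hvp_f, x, mu=1e-8, steps=60):
    """Approximately minimize f(x) - mu * sum(log x_i) over x > 0."""
    for _ in range(steps):
        # Gradient of the barrier function.
        g = [gf - mu / xi for gf, xi in zip(grad_f(x), x)]

        # Hessian-vector product; the Hessian itself is never formed or stored.
        def hvp(v, x=x):
            return [hv + mu * vi / (xi * xi)
                    for hv, vi, xi in zip(hvp_f(x, v), v, x)]

        d = cg(hvp, [-gi for gi in g])
        # Fraction-to-boundary rule keeps every coordinate strictly positive.
        t = 1.0
        for xi, di in zip(x, d):
            if di < 0:
                t = min(t, 0.9 * xi / -di)
        x = [xi + t * di for xi, di in zip(x, d)]
    return x


if __name__ == "__main__":
    c = [1.0, -1.0]  # f(x) = 0.5 * ||x - c||^2, constrained minimizer is (1, 0)
    x = barrier_newton_cg(lambda x: [xi - ci for xi, ci in zip(x, c)],
                          lambda x, v: list(v), [1.0, 1.0])
    print(x)
```

On this toy problem the first coordinate settles near 1 and the second near `mu`, approaching the constrained minimizer (1, 0) from the interior of the orthant.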