TL;DR: The Neural Cache architecture re-purposes cache structures as massively parallel compute units for deep neural network inference, and can execute convolutional, fully connected, and pooling layers entirely in-cache.
Abstract: This paper presents the Neural Cache architecture, which re-purposes cache structures to transform them into massively parallel compute units capable of running inferences for Deep Neural Networks. Techniques to perform in-situ arithmetic in SRAM arrays, create efficient data mappings, and reduce data movement are proposed. The Neural Cache architecture is capable of fully executing convolutional, fully connected, and pooling layers in-cache. The proposed architecture also supports quantization in-cache. Our experimental results show that the proposed architecture can improve inference latency by 18.3X over a state-of-the-art multi-core CPU (Xeon E5) and by 7.7X over a server-class GPU (Titan Xp) for the Inception v3 model. Neural Cache improves inference throughput by 12.4X over the CPU (2.2X over the GPU), while reducing power consumption by 50% over the CPU (53% over the GPU).
TL;DR: The design of a complete, stored-program digital optical computer is described; a fully functional, proof-of-principle prototype can be built using LiNbO3 directional couplers as logic elements and fiber-optic delay lines as memory elements.
Abstract: The design of a complete, stored-program digital optical computer is described. A fully functional, proof-of-principle prototype can be achieved by using LiNbO3 directional couplers as logic elements and fiber-optic delay lines as memory elements. The key design issues are computation in a realm where propagation delays are much greater than logic delays and implementation of circuits without flip-flops. The techniques developed to address these issues yield architectures that do not change as their clocking speed is scaled upward and the size is scaled downward proportionally; these are called speed-scalable architectures. Signal amplitude restoration and resynchronization are accomplished by the novel technique of switching in a fresh copy of the system clock. Device characteristics that are important to the proof-of-principle demonstration are discussed, including the special properties and limitations that are important when designing with them. Design principles are exemplified by the design of an n-bit counter. Following this, the design for a stored-program bit-serial computer is described. We estimate that the described prototype architecture can be operated in the 100-MHz region with off-the-shelf components, and in the 0.1-1 THz region with foreseeable future components.
TL;DR: This work explores another approach, based on the embedded multipliers available in modern FPGAs and the use of high-performance FPGAs, which exhibits a 15-fold improvement in throughput/hardware cost ratio over previously published results.
Abstract: Currently, the best known algorithm for factoring the modulus of the RSA public key cryptosystem is the Number Field Sieve. One of its important phases usually combines a sieving technique and a method for checking smoothness of mid-size numbers. For factoring these mid-size numbers, the Elliptic Curve Method (ECM) is an attractive solution. As ECM is highly regular and many parallel computations are required, hardware-based platforms were shown to be more cost-effective than software solutions. The few papers dealing with implementation of ECM on FPGA are all based on bit-serial architectures. They use only general-purpose logic and low-cost FPGAs, which appear to be the best performance/cost solution. This work explores another approach, based on the exploitation of embedded multipliers available in modern FPGAs and the use of high-performance FPGAs. The proposed architecture, based on a fully parallel and pipelined modular multiplier circuit, exhibits a 15-fold improvement in throughput/hardware cost ratio over previously published results.
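The abstract does not say which modular multiplication algorithm the pipelined circuit implements. As a hedged illustration only, the sketch below uses Montgomery multiplication, a common choice for multiplier-based hardware because it replaces division by the modulus with multiplications, masks, and shifts. The function names (`montgomery_setup`, `redc`, `mont_mul`) and the parameter `k` (word width in bits) are hypothetical and are not taken from the paper.

```python
def montgomery_setup(n, k):
    """Precompute n' = -n^{-1} mod 2^k for an odd modulus n < 2^k (assumed setup)."""
    if n % 2 == 0 or n >= (1 << k):
        raise ValueError("n must be odd and smaller than 2**k")
    r = 1 << k
    return (-pow(n, -1, r)) % r  # modular inverse via pow requires Python 3.8+


def redc(t, n, k, n_prime):
    """Montgomery reduction: returns t * 2^{-k} mod n, for 0 <= t < n * 2^k."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask   # multiply and mask only, no division
    u = (t + m * n) >> k                # exact: low k bits of t + m*n are zero
    return u - n if u >= n else u


def mont_mul(a, b, n, k):
    """Compute (a * b) mod n through the Montgomery domain."""
    n_prime = montgomery_setup(n, k)
    a_bar = (a << k) % n                # convert operands to Montgomery form
    b_bar = (b << k) % n
    prod_bar = redc(a_bar * b_bar, n, k, n_prime)  # a*b*2^k mod n
    return redc(prod_bar, n, k, n_prime)           # convert back: a*b mod n
```

In an ECM computation the operands would stay in Montgomery form across many multiplications, so the conversions in `mont_mul` would be paid once per curve rather than once per product; they are folded in here only to keep the sketch self-checking.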
TL;DR: This study serves as a good reference for designers who wish to accomplish high-performance, low-power implementations of clockless digital VLSI circuits.
Abstract: We present various 4-bit × 4-bit unsigned multipliers designed using the delay-insensitive null convention logic (NCL) paradigm. They represent bit-serial, iterative, and fully parallel multiplication architectures. NCL is a self-timed logic paradigm in which control is inherent in each datum. NCL follows the so-called weak conditions of Seitz's delay-insensitive signaling scheme. Like other delay-insensitive logic methods, the NCL paradigm assumes that forks in wires are isochronic. NCL uses symbolic completeness of expression to achieve delay-insensitive behavior. Simulation results show a large variance in circuit performance in terms of power, area, and speed. This study serves as a good reference for designers who wish to accomplish high-performance, low-power implementations of clockless digital VLSI circuits.
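To make "control is inherent in each datum" concrete, the following is a minimal behavioural sketch of a dual-rail NCL AND function. It assumes the usual dual-rail encoding (NULL, DATA0, DATA1) and gates with hysteresis that assert when their set condition holds and de-assert only when all inputs return to zero. The class names `HysteresisGate` and `DualRailAnd` are hypothetical, and this is not one of the multiplier circuits from the paper.

```python
# Dual-rail values as (rail0, rail1); (1, 1) is illegal and never produced here.
NULL = (0, 0)
DATA0 = (1, 0)
DATA1 = (0, 1)


class HysteresisGate:
    """Gate with state-holding: sets on set_fn, resets only when all inputs are 0."""

    def __init__(self, set_fn):
        self.set_fn = set_fn
        self.out = 0

    def update(self, *inputs):
        if self.set_fn(*inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        # otherwise hold the previous output (hysteresis)
        return self.out


class DualRailAnd:
    """Input-complete AND: output stays NULL until both operands carry data."""

    def __init__(self):
        # rail1 of the result: both inputs are DATA1
        self.z1 = HysteresisGate(lambda a0, a1, b0, b1: a1 and b1)
        # rail0 of the result: both inputs are data and at least one is DATA0
        self.z0 = HysteresisGate(
            lambda a0, a1, b0, b1: (a0 and b0) or (a0 and b1) or (a1 and b0)
        )

    def update(self, a, b):
        wires = (a[0], a[1], b[0], b[1])
        return (self.z0.update(*wires), self.z1.update(*wires))
```

Driving the gate with `DATA0` on one input while the other is still `NULL` leaves the output at `NULL`; the result appears only once both operands have arrived, and it returns to `NULL` only after both inputs do. That request-and-acknowledge behaviour is what lets the circuit run without a clock.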
TL;DR: This work implemented a feedforward neural network on an FPGA (field programmable gate array) to find the minimum precision required to maintain a recognition rate of at least 95% for two characters within an optical character recognition application.
Abstract: This work implemented a feedforward neural network on an FPGA (field programmable gate array). A study was conducted to find the minimum precision required to maintain a recognition rate of at least 95% for two characters within an optical character recognition application. To reduce the circuit size, a bit-serial architecture was realised to perform the arithmetic operations. This resulted in an optimal use of the FPGA resources.
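The link between precision and circuit size can be illustrated with a minimal software model of bit-serial addition. It assumes operands are fed least-significant bit first through a single full adder with a one-bit carry register. The function name `bit_serial_add` and its interface are hypothetical and are not taken from the paper.

```python
def bit_serial_add(a, b, width):
    """Add two unsigned `width`-bit integers one bit per simulated clock cycle.

    Returns (sum modulo 2**width, carry_out).
    """
    carry = 0    # models the single carry flip-flop
    result = 0
    for i in range(width):          # one loop iteration = one clock cycle
        a_bit = (a >> i) & 1        # LSB-first serial input streams
        b_bit = (b >> i) & 1
        s = a_bit ^ b_bit ^ carry   # full-adder sum output
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))  # full-adder carry
        result |= s << i
    return result, carry
```

In this model the hardware cost (one full adder plus one carry bit) does not depend on `width`; only the cycle count does. Lowering the required precision therefore shortens each operation without changing the adder itself.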