TL;DR: The approach presented is based on a reformulation of the solution to modular multiplication within the context of RSA exponentiation, and the resulting RSA units exhibit the highest data rates reported in the literature to date, reflecting the very low and word length independent critical path delay achieved.
Abstract: Modified Montgomery multiplication and associated RSA modular exponentiation algorithms and circuit architectures are presented. These modified multipliers use carry save adders (CSAs) to perform large word length additions. These have the attraction that, when repeatedly used to perform RSA modular exponentiation, the (carry save) format of the output words is compatible with that required by the multiplier inputs. This avoids the repeated interim output/input format conversion, needed when previously reported Montgomery multipliers are used for RSA modular exponentiation. Thus, the lengthy and costly conventional additions required at each stage are avoided. As a consequence, the critical path delay and, hence, the data throughput rate of the resulting Montgomery multiplier architectures are also word length independent. The approach presented is based on a reformulation of the solution to modular multiplication within the context of RSA exponentiation. Two algorithmic variants are presented, one based on a five-to-two CSA and the other on a four-to-two CSA plus multiplexer. The practical application of the approach has been demonstrated by using this to design special purpose RSA processing units with 512-bit and 1024-bit key sizes. The resulting RSA units exhibit the highest data rates reported in the literature to date, reflecting the very low and word length independent critical path delay achieved.
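The idea of keeping the running sum in carry-save form, so that no carry-propagating addition is needed inside the loop, can be illustrated with a minimal sketch. It is not the paper's five-to-two or four-to-two design: it is the textbook radix-2 Montgomery loop with two cascaded 3-to-2 carry-save adders, the operands `a` and `b` are ordinary integers rather than carry-save pairs, and the function names `csa` and `montgomery_csa` are chosen here for illustration.

```python
def csa(x, y, z):
    """3-to-2 carry-save adder: returns (s, c) with s + c == x + y + z."""
    s = x ^ y ^ z                               # bitwise sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1      # carries, moved one bit left
    return s, c


def montgomery_csa(a, b, n, k):
    """Return a * b * 2^(-k) mod n for odd n < 2^k and 0 <= a, b < n.

    The accumulator is held as a carry-save pair (s, c); a single
    carry-propagating addition is performed only after the loop.
    """
    s, c = 0, 0
    for i in range(k):
        a_i = (a >> i) & 1
        s, c = csa(s, c, b if a_i else 0)
        # c always has a zero low bit after csa, so s alone gives the parity.
        q = s & 1
        s, c = csa(s, c, n if q else 0)
        # s + c is now even and c's low bit is zero, so both shifts are exact.
        s >>= 1
        c >>= 1
    total = s + c                               # the only full-width addition
    return total - n if total >= n else total
```

In hardware each `csa` call has a delay of one full adder regardless of operand width, which is the property behind the word length independent critical path described in the abstract.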
TL;DR: In this article, the effects of hardware controlled energy efficiency features for the Intel Skylake-SP processor were analyzed and it was shown that data has a significant impact on processor power consumption which causes a large error in energy models relying only on instructions.
Abstract: The overwhelming majority of High Performance Computing (HPC) systems and server infrastructure uses Intel x86 processors. This makes an architectural analysis of these processors relevant for a wide audience of administrators and performance engineers. In this paper, we describe the effects of hardware controlled energy efficiency features for the Intel Skylake-SP processor. Due to the prolonged micro-architecture cycles, which extend the previous Tick-Tock scheme by Intel, our findings will also be relevant for succeeding architectures. The findings of this paper include the following: C-state latencies increased significantly over the Haswell-EP processor generation. The mechanism that controls the uncore frequency has a latency of approximately 10 ms and it is not possible to truly fix the uncore frequency to a specific level. The out-of-order throttling for workloads using 512-bit wide vectors also occurs at low processor frequencies. Data has a significant impact on processor power consumption, which causes a large error in energy models relying only on instructions.
TL;DR: The new smart card is in the final design stage; the first test chips should be available by the end of 1990, and CORSAIR achieves up to 40 (8-bit) MIPS with a clock speed of 6 MHz.
Abstract: Algorithms best suited for flexible smart card applications are based on public key cryptosystems -- RSA, zero-knowledge protocols ... Their practical implementation (execution in ? 1 second) entails a computing power beyond the reach of classical smart cards, since large integers (512 bits) have to be manipulated in complex ways (exponentiation). CORSAIR achieves up to 40 (8-bit) MIPS with a clock speed of 6 MHz. This allows X^E mod M, with 512-bit operands, to be computed in less than 1.5 seconds (0.4 seconds for a signature). The new smart card is in the final design stage; the first test chips should be available by the end of 1990.
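The operation X^E mod M mentioned above is commonly computed by square-and-multiply. The following is a minimal sketch of the right-to-left binary method, offered only as an illustration of the arithmetic involved; the abstract does not say which exponentiation method CORSAIR uses, and the function name `mod_exp` is an assumption made here.

```python
def mod_exp(x, e, m):
    """Right-to-left square-and-multiply: returns x**e mod m for e >= 0, m >= 1."""
    result = 1 % m          # handles m == 1
    base = x % m
    while e > 0:
        if e & 1:           # current exponent bit is set: multiply it in
            result = (result * base) % m
        base = (base * base) % m   # square for the next bit
        e >>= 1
    return result
```

For a 512-bit exponent this takes 512 modular squarings and, on average, about 256 modular multiplications, each on 512-bit operands, which is why the workload is demanding for an 8-bit processor.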
TL;DR: Extensive evaluations on real-world graphs demonstrate the performance superiority over existing approaches of SCAN-XP, which runs approximately 100 times faster than SCAN.
Abstract: The structural graph clustering method SCAN, proposed by Xu et al., is successfully used in many applications because it not only detects densely connected nodes as clusters but also extracts sparsely connected nodes as hubs or outliers. However, it is difficult to apply SCAN to large-scale graphs, since SCAN needs to evaluate the density for all adjacent nodes included in the given graphs. To address this problem, we present a novel algorithm, SCAN-XP, that runs on the Intel Xeon Phi. We designed SCAN-XP to make the best use of the hardware potential of the Intel Xeon Phi by employing the following approaches. First, SCAN-XP avoids the bottlenecks that arise from parallel graph computations by providing good load balance among the cores of the Intel Xeon Phi. Second, SCAN-XP effectively exploits the 512-bit SIMD instructions implemented in the Intel Xeon Phi to speed up the density evaluations. As a result, SCAN-XP detects clusters, hubs, and outliers in large-scale graphs with much shorter computation time than SCAN. Specifically, SCAN-XP runs approximately 100 times faster than SCAN; for graphs with 100 million edges, SCAN-XP completes in a few seconds. Extensive evaluations on real-world graphs demonstrate the performance superiority of SCAN-XP over existing approaches.
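The density evaluation that dominates SCAN's cost can be sketched as follows. This assumes the structural similarity usually attributed to the original SCAN paper, the size of the shared closed neighborhood divided by the geometric mean of the two closed neighborhood sizes; the adjacency-set representation and the function names are choices made for this illustration, and none of SCAN-XP's load balancing or SIMD techniques are shown.

```python
import math


def structural_similarity(adj, u, v):
    """Similarity of u and v over closed neighborhoods (each node includes itself)."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))


def edge_similarities(adj):
    """Evaluate the similarity once per undirected edge (u < v)."""
    return {
        (u, v): structural_similarity(adj, u, v)
        for u in adj
        for v in adj[u]
        if u < v
    }


# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
sims = edge_similarities(adj)
```

Every edge requires a set intersection, so the work grows with both the edge count and the node degrees, which is the cost the abstract identifies as the obstacle for large graphs.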
TL;DR: Hardware architectures for the 512-bit hash function Whirlpool, which is one of the ISO/IEC 10118-3 standard algorithms, are proposed and the performances of the proposed architectures are evaluated using a 0.18-μm CMOS standard cell library.
Abstract: Hardware architectures for the 512-bit hash function Whirlpool, which is one of the ISO/IEC 10118-3 standard algorithms, are proposed and the performances of the proposed architectures are evaluated using a 0.18-μm CMOS standard cell library. The fastest implementation achieved a throughput of 9.59 Gbps with a gate count of 167.4 K, which is two times faster than the fastest conventional implementation on an FPGA platform. A compact implementation obtained 38.9 Kgates with 2.49 Gbps. The FIPS 180-2 standard hash functions SHA-256 and SHA-512, which are the most popular algorithms in practical use, were also synthesized using the same ASIC library for performance comparisons. The small and fast SHA-256 implementations achieved 11.0 Kgates with 726 Mbps and 30.7 Kgates with 1.97 Gbps, respectively. The gate count and throughput are both approximately 1/4 those of Whirlpool, and thus the hardware efficiencies, defined as throughput per gate, are almost the same for SHA-256/-512 and Whirlpool in the present implementations. However, Whirlpool is more flexible than SHA-256/-512 in terms of the variety of hardware architectures. The various architectures for the datapath and primitive function blocks are also described in the present paper.
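The claim that the hardware efficiencies are almost the same can be checked from the figures quoted in the abstract. The short sketch below computes throughput per gate for the four implementations whose numbers are given; the dictionary labels are descriptive names chosen here, and no SHA-512 entry appears because the abstract quotes no SHA-512 figures.

```python
# (throughput in Mbps, gate count in Kgates), taken from the abstract
designs = {
    "Whirlpool, fastest": (9590.0, 167.4),
    "Whirlpool, compact": (2490.0, 38.9),
    "SHA-256, small": (726.0, 11.0),
    "SHA-256, fast": (1970.0, 30.7),
}


def efficiency(mbps, kgates):
    """Throughput per gate; Mbps divided by Kgates gives kbps per gate."""
    return mbps / kgates


eff = {name: efficiency(*figures) for name, figures in designs.items()}
for name, value in eff.items():
    print(f"{name}: {value:.1f} kbps/gate")
```

All four values fall between roughly 57 and 66 kbps per gate, consistent with the abstract's statement.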