TL;DR: In this paper, the authors show that a $390 mass-market quad-core 2.4GHz Intel Westmere (Xeon E5620) CPU can create 109000 signatures per second and verify 71000 signatures per second on an elliptic curve at a 2^128 security level.
Abstract: This paper shows that a $390 mass-market quad-core 2.4GHz Intel Westmere (Xeon E5620) CPU can create 109000 signatures per second and verify 71000 signatures per second on an elliptic curve at a 2^128 security level. Public keys are 32 bytes, and signatures are 64 bytes. These performance figures include strong defenses against software side-channel attacks: there is no data flow from secret keys to array indices, and there is no data flow from secret keys to branch conditions.
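The two side-channel rules in the abstract above (no secret-dependent array index, no secret-dependent branch) can be illustrated with a table lookup that touches every entry and selects the wanted one with a bit mask. This is only a sketch of the data-flow pattern, not the authors' code: the function name `ct_lookup` and the 64-bit entry width are assumptions, and Python integers give no real timing guarantees, so a production version would be written in C or assembly.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # assumed entry width: 64 bits


def ct_lookup(table, secret_index):
    """Return table[secret_index] without indexing or branching on the secret.

    Every entry is read in a fixed order; a mask that is all ones only for
    the matching position selects the result.
    """
    result = 0
    for position, entry in enumerate(table):  # loop bound is public
        diff = position ^ secret_index        # zero only at the wanted position
        # (diff - 1) is negative only when diff == 0, so the arithmetic shift
        # yields -1 (all ones) for a match and 0 otherwise.
        mask = ((diff - 1) >> 63) & MASK64
        result |= entry & mask
    return result


table = [10, 20, 30, 40]
print([ct_lookup(table, i) for i in range(4)])
```

The loop count and the memory access pattern depend only on the table length, which is public, so the secret index influences nothing but the arithmetic inside the loop.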
TL;DR: The proposed strategy is based on entropy slices, which allow exploiting parallelism in the entropy decoding stage while maintaining high coding efficiency, and it achieves real-time performance for 1920×1080p50 and 2560×1600 video resolutions.
Abstract: In this paper we propose and evaluate a parallelization strategy for the emerging HEVC video coding standard. The proposed strategy is based on entropy slices, which allow exploiting parallelism in the entropy decoding stage while maintaining high coding efficiency. Our approach requires videos to be encoded with one entropy slice per LCU row in order to decode multiple LCU rows in a wavefront parallel manner. Evaluations performed on a PC with 12 Intel Xeon cores running at 3.3 GHz show that it is possible to achieve real-time performance for 1920×1080p50 (53.1 fps) and 2560×1600 (29.5 fps) video resolutions with speedups of 5.2× and 6.3× compared to sequential execution, respectively.
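A rough sketch of what decoding LCU rows "in a wavefront parallel manner" means for scheduling follows. It assumes each LCU depends on its left neighbour and on the top-right LCU of the row above, which is the usual wavefront dependency and is not stated in the abstract. It also assumes one time step per LCU and one worker per row. The function name is hypothetical.

```python
def wavefront_schedule(rows, cols):
    """Earliest time step at which each LCU can be decoded.

    Assumed dependencies: the LCU to the left in the same row and the
    top-right LCU in the previous row (clamped at the last column).
    """
    step = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            ready = 0
            if c > 0:
                ready = max(ready, step[r][c - 1] + 1)
            if r > 0:
                ready = max(ready, step[r - 1][min(c + 1, cols - 1)] + 1)
            step[r][c] = ready
    return step


schedule = wavefront_schedule(rows=3, cols=4)
for row in schedule:
    print(row)
total_steps = schedule[-1][-1] + 1
print("parallel steps:", total_steps, "sequential steps:", 3 * 4)
```

Under these assumptions each row starts two steps after the row above it, so a frame of 3 rows and 4 columns finishes in 8 steps instead of 12. The gain grows with the number of rows that can be kept busy.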
TL;DR: This paper explores optimization techniques for geometric multigrid on existing and emerging multicore systems including the Opteron-based Cray XE6, Intel® Xeon® E5-2670 and X5550 processor-based Infiniband clusters, as well as the new Intel® Xeon Phi™ coprocessor (Knights Corner).
Abstract: Multigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems used in a number of different application areas. In this paper, we explore optimization techniques for geometric multigrid on existing and emerging multicore systems including the Opteron-based Cray XE6, Intel® Xeon® E5-2670 and X5550 processor-based Infiniband clusters, as well as the new Intel® Xeon Phi™ coprocessor (Knights Corner). Our work examines a variety of novel techniques including communication-aggregation, threaded wavefront-based DRAM communication-avoiding, dynamic threading decisions, SIMDization, and fusion of operators. We quantify performance through each phase of the V-cycle for both single-node and distributed-memory experiments and provide detailed analysis for each class of optimization. Results show our optimizations yield significant speedups across a variety of subdomain sizes while simultaneously demonstrating the potential of multi- and manycore processors to dramatically accelerate single-node performance. However, our analysis also indicates that improvements in networks and communication will be essential to reap the potential of manycore processors in large-scale multigrid calculations.
TL;DR: A very efficient, mixed-precision hybrid CPU-GPU implementation of the 1D implicit PIC algorithm exploiting a fundamental feature of the method, the segregation of particle-orbit computations from the field solver, while remaining fully self-consistent.
TL;DR: This work proposes to study the impact on the energy footprint of two advanced algorithmic strategies in the context of high performance dense linear algebra libraries, one of which, mixed precision algorithms with iterative refinement, makes it possible to run at the peak performance of single precision floating-point arithmetic while achieving double precision accuracy.
Abstract: We propose to study the impact on the energy footprint of two advanced algorithmic strategies in the context of high performance dense linear algebra libraries: (1) mixed precision algorithms with iterative refinement, which make it possible to run at the peak performance of single precision floating-point arithmetic while achieving double precision accuracy, and (2) a tree reduction technique, which exposes more parallelism when factorizing tall and skinny matrices for solving overdetermined systems of linear equations or calculating the singular value decomposition. Integrated within the PLASMA library using tile algorithms, which will eventually supersede the block algorithms from LAPACK, both strategies further excel in performance in the presence of a dynamic task scheduler while targeting multicore architectures. Energy consumption measurements are reported along with parallel performance numbers on a dual-socket quad-core Intel Xeon as well as a quad-socket quad-core Intel Sandy Bridge chip, both providing component-based energy monitoring at all levels of the system, through the Power Pack framework and the Running Average Power Limit model, respectively.
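The idea behind strategy (1) can be sketched in a few lines: solve in low precision, then compute the residual in double precision and solve for a correction, again in low precision. The code below is an illustration under assumptions, not PLASMA code. Single precision is emulated by rounding through `struct`, the helper names are hypothetical, and the system is re-factorized for each correction for brevity, whereas a real library would reuse the single precision factors.

```python
import struct


def to_single(x):
    """Round a double to the nearest IEEE 754 single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_single(a, b):
    """Gaussian elimination with partial pivoting, every result rounded to single."""
    n = len(b)
    m = [[to_single(v) for v in row] + [to_single(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = to_single(m[row][col] / m[col][col])
            for k in range(col, n + 1):
                m[row][k] = to_single(m[row][k] - to_single(factor * m[col][k]))
    x = [0.0] * n
    for row in reversed(range(n)):  # back substitution
        acc = m[row][n]
        for k in range(row + 1, n):
            acc = to_single(acc - to_single(m[row][k] * x[k]))
        x[row] = to_single(acc / m[row][row])
    return x


def refine(a, b, iterations=3):
    """Iterative refinement: residual in double, correction solve in single."""
    x = solve_single(a, b)
    for _ in range(iterations):
        residual = [bi - sum(aij * xj for aij, xj in zip(row, x))
                    for row, bi in zip(a, b)]
        correction = solve_single(a, residual)
        x = [xi + di for xi, di in zip(x, correction)]
    return x


a = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
exact = [1.0 / 11.0, 7.0 / 11.0]
for label, x in (("single only", solve_single(a, b)), ("refined", refine(a, b))):
    print(label, max(abs(xi - ei) for xi, ei in zip(x, exact)))
```

For a well-conditioned matrix such as this one, a few refinement steps bring the error down to the level of double precision even though all the expensive solves ran in single precision.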
TL;DR: This paper shows the great potential of lightweight cryptography in fast and timing-attack resistant software implementations in cloud computing, demonstrated by bitslice implementations of the PRESENT and Piccolo lightweight block ciphers.
Abstract: This paper shows the great potential of lightweight cryptography in fast and timing-attack resistant software implementations in cloud computing by exploiting bitslice implementation. This is demonstrated by bitslice implementations of the PRESENT and Piccolo light-weight block ciphers. In particular, bitsliced PRESENT-80/128 achieves 4.73 cycles/byte and Piccolo-80 achieves 4.57 cycles/byte including data conversion on an Intel Xeon E3-1280 processor (Sandy Bridge microarchitecture). It is also expected that bitslice implementation offers resistance to side channel attacks such as cache timing attacks and cross-VM attacks in a multi-tenant cloud environment. Lightweight cryptography is not limited to constrained devices, and this work opens the way to its application in cloud computing.
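To make "bitslice implementation" concrete, the sketch below evaluates the 4-bit PRESENT S-box on many inputs at once using only bitwise operations. Bit j of every input is packed into one integer (a "slice"), so a single AND or OR acts on all inputs in parallel and no table is indexed by data. It uses a naive minterm expansion chosen for clarity, not the optimized gate circuit a fast implementation would use, and the function names are hypothetical.

```python
# PRESENT 4-bit S-box, used here only as a reference table.
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def to_slices(nibbles):
    """Pack bit j of nibble number p into bit p of slice j."""
    slices = [0, 0, 0, 0]
    for pos, value in enumerate(nibbles):
        for bit in range(4):
            slices[bit] |= ((value >> bit) & 1) << pos
    return slices


def from_slices(slices, count):
    """Inverse of to_slices for `count` nibbles."""
    return [sum(((slices[bit] >> pos) & 1) << bit for bit in range(4))
            for pos in range(count)]


def sbox_bitsliced(slices, count):
    """Apply the S-box to all packed nibbles using only bitwise operations."""
    full = (1 << count) - 1
    out = [0, 0, 0, 0]
    for value in range(16):  # loop over public constants, not over data
        match = full         # bit p stays set where nibble p equals `value`
        for bit in range(4):
            plane = slices[bit] if (value >> bit) & 1 else ~slices[bit] & full
            match &= plane
        for bit in range(4):
            if (SBOX[value] >> bit) & 1:
                out[bit] |= match
    return out


inputs = list(range(16))
outputs = from_slices(sbox_bitsliced(to_slices(inputs), len(inputs)), len(inputs))
print(outputs == SBOX)
```

All branches and loop bounds depend on the constants 0..15 and on the S-box table, never on the processed data, which is the property that makes bitslicing attractive against cache timing attacks.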
TL;DR: The MAPLE architecture is described, its design space is explored with a simulator, how to automatically map application kernels to the hardware is illustrated, and its performance improvement and energy benefits over classic server-based implementations are presented.
Abstract: Applications that use learning and classification algorithms operate on large amounts of unstructured data, and have stringent performance constraints. For such applications, the performance of general purpose processors scales poorly with data size because of their limited support for fine-grained parallelism and absence of software-managed caches. The large intermediate data in these applications also limits achievable performance on many-core processors such as GPUs. To accelerate such learning applications, we present a programmable accelerator that can execute multiple learning and classification algorithms. To architect such an accelerator, we profile five representative workloads, and find that their computationally intensive portions can be formulated as matrix or vector operations generating large amounts of intermediate data, which are then reduced by a secondary operation such as array ranking, finding max/min and aggregation. Our proposed accelerator, called MAPLE, has hundreds of simple processing elements (PEs) laid out in a two-dimensional grid, with two key features. First, it uses dynamic in-memory processing where on-chip memory blocks perform the secondary reduction operations. Second, MAPLE uses banked off-chip memory, and organizes its PEs into independent groups each with its own off-chip memory bank. These two features allow MAPLE to scale its performance with data size. We also present an Atom based energy-efficient heterogeneous system with MAPLE as the accelerator that satisfies the application’s performance requirements at a lower system power. This article describes the MAPLE architecture, explores its design space with a simulator, illustrates how to automatically map application kernels to the hardware, and presents its performance improvement and energy benefits over classic server-based implementations. 
We implement a 512-PE FPGA prototype of MAPLE and find that it is 1.5-10x faster than a 2.5 GHz quad-core Xeon processor despite running at a modest 125 MHz clock rate. With MAPLE connected to a 1.6 GHz dual-core Atom, we show an energy improvement of 38-84% over the Xeon server coupled to a 1.3 GHz 240-core Tesla GPU.
TL;DR: This work modified the less memory-intensive "Anderson method" to give faster convergence to 3D-RISM calculations, and reduced the total computational time by a factor of 8: 1.4 times from the modified Anderson method and 5.7 times from the GPU.
Abstract: A fast algorithm is proposed to solve the three-dimensional reference interaction site model (3D-RISM) theory on a graphics processing unit (GPU). 3D-RISM theory is a powerful tool for investigating biomolecular processes in solution; however, such calculations are often both memory-intensive and time-consuming. We sought to accelerate these calculations using GPUs, but to work around the problem of limited memory size in GPUs, we modified the less memory-intensive "Anderson method" to give faster convergence to 3D-RISM calculations. Using this method on a Tesla C2070 GPU, we reduced the total computational time by a factor of 8 (1.4 times from the modified Anderson method and 5.7 times from the GPU), compared to calculations on an Intel Xeon machine (eight cores, 3.33 GHz) with the conventional method.
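The reported factor of 8 is the product of the two independent gains, not their sum. A quick check of that arithmetic, using only the two figures quoted in the abstract (the helper name is hypothetical):

```python
def combined_speedup(*factors):
    """Independent speedups multiply: total = f1 * f2 * ..."""
    total = 1.0
    for factor in factors:
        total *= factor
    return total


algorithm_gain = 1.4  # modified Anderson method
gpu_gain = 5.7        # GPU versus CPU
total = combined_speedup(algorithm_gain, gpu_gain)
print(f"combined speedup: {total:.2f}x")  # 7.98x, reported as a factor of 8
```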
TL;DR: A novel implementation of the genetic algorithm exploiting a multi-GPU cluster where every GPU evolves a single island and the overall GPU performance of the proposed GA reaches 5.67 TFLOPS.
Abstract: This paper introduces a novel implementation of the genetic algorithm exploiting a multi-GPU cluster. The proposed implementation employs an island-based genetic algorithm where every GPU evolves a single island. The individuals are processed by CUDA warps, which enables the solution of large knapsack instances and eliminates undesirable thread divergence. The MPI interface is used to exchange genetic material among isolated islands and collect statistical data. The characteristics of the proposed GAs are investigated on a two-node cluster composed of 14 Fermi GPUs and 4 six-core Intel Xeon processors. The overall GPU performance of the proposed GA reaches 5.67 TFLOPS.
TL;DR: This paper presents a performance evaluation of concurrent lock-free implementations of four popular data structures on GPUs and achieves speedup of up to 7.4, 11.3, 30.7, and 30.8, respectively on the GPU compared to the Xeon server.
Abstract: Graphics processing units (GPUs) have emerged as a strong candidate for high-performance computing. While regular data-parallel computations with little or no synchronization are easy to map on the GPU architectures, it is a challenge to scale up computations on dynamically changing pointer-linked data structures. The traditional lock-based implementations are known to offer poor scalability due to high lock contention in the presence of thousands of active threads, which is common in GPU architectures. In this paper, we present a performance evaluation of concurrent lock-free implementations of four popular data structures on GPUs. We implement a set using lock-free linked list, hash table, skip list, and priority queue. On the first three data structures, we evaluate the performance of different mixes of addition, deletion, and search operations. The priority queue is designed to support retrieval and deletion of the minimum element and addition operations to the set. We evaluate the performance of these lock-free data structures on a Tesla C2070 Fermi GPU and compare it with the performance of multi-threaded lock-free implementations for CPU running on a 24-core Intel Xeon server. The linked list, hash table, skip list, and priority queue implementations achieve speedup of up to 7.4, 11.3, 30.7, and 30.8, respectively on the GPU compared to the Xeon server.
TL;DR: In this article, a data-driven load-balancing algorithm for fast multipole methods is proposed, where the most time-consuming stages of the FMMs are broken into smaller tasks and the algorithm can then be represented as a Directed Acyclic Graph (DAG) where nodes represent tasks and edges represent dependencies among them.
Abstract: Fast multipole methods have O(N) complexity, are compute bound, and require very little synchronization, which makes them a favorable algorithm on next-generation supercomputers. Their most common application is to accelerate N-body problems, but they can also be used to solve boundary integral equations. When the particle distribution is irregular and the tree structure is adaptive, load-balancing becomes a non-trivial question. A common strategy for load-balancing FMMs is to use the work load from the previous step as weights to statically repartition the next step. The authors discuss in the paper another approach based on data-driven execution to efficiently tackle this challenging load-balancing problem. The core idea consists of breaking the most time-consuming stages of the FMMs into smaller tasks. The algorithm can then be represented as a Directed Acyclic Graph (DAG) where nodes represent tasks, and edges represent dependencies among them. The execution of the algorithm is performed by asynchronously scheduling the tasks using the QUARK runtime environment, in a way such that data dependencies are not violated for numerical correctness purposes. This asynchronous scheduling results in an out-of-order execution. The performance results of the data-driven FMM execution outperform the previous strategy and show linear speedup on a quad-socket quad-core Intel Xeon system.
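The data-driven execution described above can be sketched as a ready queue over a task graph: a task becomes runnable as soon as all of its prerequisites have finished. The code below is a generic illustration and does not use the QUARK API. The task names are the conventional FMM operator labels (P2M, M2M, M2L, L2L, L2P, P2P), which the abstract does not mention, and the dependency chain is simplified to one task per stage.

```python
from collections import deque


def schedule(tasks, prerequisites):
    """Return an execution order that never runs a task before its prerequisites.

    `prerequisites` maps a task to the tasks that must finish first.
    Raises ValueError if the graph contains a cycle.
    """
    remaining = {task: len(prerequisites.get(task, ())) for task in tasks}
    dependents = {task: [] for task in tasks}
    for task, needed in prerequisites.items():
        for prerequisite in needed:
            dependents[prerequisite].append(task)

    ready = deque(task for task in tasks if remaining[task] == 0)
    order = []
    while ready:
        task = ready.popleft()  # a real runtime hands this to an idle core
        order.append(task)
        for dependent in dependents[task]:
            remaining[dependent] -= 1
            if remaining[dependent] == 0:
                ready.append(dependent)
    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order


tasks = ["P2M", "M2M", "M2L", "L2L", "L2P", "P2P"]
prerequisites = {"M2M": ["P2M"], "M2L": ["M2M"], "L2L": ["M2L"], "L2P": ["L2L"]}
print(schedule(tasks, prerequisites))
```

In this toy graph the near-field task P2P has no prerequisites, so it can run alongside the far-field chain. This is the kind of out-of-order overlap that asynchronous scheduling exposes.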
TL;DR: This paper provides low level micro-benchmark results to demonstrate processor, memory, I/O, and network performance; standard HPC benchmarks; and performance on data intensive applications to demonstrate Gordon's performance on typical workloads.
Abstract: The Gordon data intensive supercomputer entered service in 2012 as an allocable computing system in the NSF Extreme Science and Engineering Discovery Environment (XSEDE) program. Gordon has several innovative features that make it ideal for data intensive computing including: 1,024 compute nodes based on Intel's Sandy Bridge (Xeon E5) processor; 64 I/O nodes with an aggregate of 300 TB of high performance flash (SSD); large, virtual SMP "supernodes" of up to 2 TB DRAM; a dual-rail, QDR InfiniBand, 3D torus network based on commodity hardware and open source software; and a 100 GB/s Lustre based parallel file system, with over 4 PB of disk space. In this paper we present the motivation, design, and performance of Gordon. We provide: low level micro-benchmark results to demonstrate processor, memory, I/O, and network performance; standard HPC benchmarks; and performance on data intensive applications to demonstrate Gordon's performance on typical workloads. We highlight the inherent risks in, and describe mitigation strategies for, deploying a data intensive supercomputer like Gordon which embodies significant innovative technologies. Finally we present our experiences thus far in supporting users and managing Gordon.
TL;DR: The solution of dense symmetric and Hermitian eigenproblems by the LAPACK divide and conquer algorithm on modern heterogeneous systems is considered, which overcomes performance bottlenecks faced by current implementations that are optimized for a homogeneous multicore.
Abstract: With the raw computing power of graphics processing units (GPUs) being more widely available in commodity multicore systems, there is an imminent need to harness their power for important numerical libraries such as LAPACK. In this paper, we consider the solution of dense symmetric and Hermitian eigenproblems by the LAPACK divide and conquer algorithm on such modern heterogeneous systems. We focus on how to make the best use of the individual strengths of the massively parallel manycore GPUs and multicore CPUs. The resulting algorithm overcomes performance bottlenecks faced by current implementations that are optimized for a homogeneous multicore. On a dual socket quad-core Intel Xeon 2.33 GHz with an NVIDIA GTX 280 GPU, we typically obtain up to about a tenfold improvement in performance for the complete dense problem. The techniques described here thus represent an example of how to develop numerical software to efficiently use heterogeneous architectures. As heterogeneity becomes more common in the architecture design, the significance of and need for this work are expected to grow.
TL;DR: An interactive compilation feedback system that guides programmers in iteratively modifying their application source code helps leverage the compiler's ability to generate loop-parallel code, resulting in scalable parallelized code that runs up to 8.3 times faster on an eight-core Intel Xeon 5570 system and up to 12.5 times faster on a quad-core IBM POWER6 system.
Abstract: The performance of many parallel applications relies not on instruction-level parallelism but on loop-level parallelism. Unfortunately, automatic parallelization of loops is a fragile process; many different obstacles affect or prevent it in practice. To address this predicament we developed an interactive compilation feedback system that guides programmers in iteratively modifying their application source code. This helps leverage the compiler's ability to generate loop-parallel code. We employ our system to modify two sequential benchmarks dealing with image processing and edge detection, resulting in scalable parallelized code that runs up to 8.3 times faster on an eight-core Intel Xeon 5570 system and up to 12.5 times faster on a quad-core IBM POWER6 system. Benchmark performance varies significantly between the systems. This suggests that semi-automatic parallelization should be combined with target-specific optimizations. Furthermore, comparing the first benchmark to manually-parallelized, hand-optimized pthreads and OpenMP versions, we find that code generated using our approach typically outperforms the pthreads code (within 93-339%). It also performs competitively against the OpenMP code (within 75-111%). The second benchmark outperforms manually-parallelized and optimized OpenMP code (within 109-242%).
TL;DR: This paper presents a comparison of OpenMP and OpenCL based on the parallel implementation of algorithms from various fields of computer applications and observes that the OpenCL programming model is a good option for mapping threads onto different processing cores.
Abstract: This paper presents a comparison of OpenMP and OpenCL based on the parallel implementation of algorithms from various fields of computer applications. The focus of our study is on benchmark performance comparing OpenMP and OpenCL. We observed that the OpenCL programming model is a good option for mapping threads onto different processing cores. Balancing all available cores and allocating a sufficient amount of work among all computing units can lead to improved performance. In our simulation, we used the Fedora operating system on a system with an Intel Xeon dual-core processor with a thread count of 24, coupled with an NVIDIA Quadro FX 3800 as the graphics processing unit.
TL;DR: A comparison between ARM and Xeon is presented to evaluate whether ARM is the future building block for HPC; the results indicate that although ARM has lower peak power, Xeon still offers a better tradeoff from the user's point of view.
Abstract: Most High Performance Computing (HPC) systems today are known as "power hungry" because they aim at computing speed regardless of energy consumption. Some scientific applications still demand more speed, and the community expects to reach exascale by the end of the decade. Nevertheless, to reach exascale we need to search for alternatives to cope with energy constraints. A promising step forward in this direction is the usage of low power processors such as ARM. ARM processors target low power consumption, in contrast with Xeon processors, which are conventional in HPC and aim at computing speed. This paper presents a comparison between ARM and Xeon to evaluate whether ARM is the future building block for HPC. We choose to use time-to-solution, peak power, and energy-to-solution to evaluate both processors from the user's perspective. The results indicate that although ARM has lower peak power, Xeon still offers a better tradeoff from the user's point of view.
TL;DR: This paper proposes the architecture of an FPGA-based inverted index search engine, as well as implementations of essential components, including decoder, matcher and ranker, using FPGAs as the implementation platform for power efficient index serving.
Abstract: Web search engines now use tens of thousands of index servers that consume a huge amount of power. In this paper, we investigate FPGAs as the implementation platform for power efficient index serving. We propose the architecture of an FPGA-based inverted index search engine, as well as implementations of essential components, including decoder, matcher and ranker. We successfully boot up the FPGA-based search engine and run experiments on real-world data from a commercial search engine. The targeted FPGA-based hardware index server could achieve up to 19.52X power efficiency and 7.17X price efficiency over an Intel Xeon server with highly optimized software. This is the first complete work using FPGAs to implement query processing for Web search engines.
TL;DR: Two extensions of both right-looking and left-looking one-sided factorization algorithms to utilize the computing power of current heterogeneous architectures are proposed and are now parts of the MAGMA software package, a set of the state-of-the-art dense linear algebra routines for a multicore with GPUs.
Abstract: One-sided dense matrix factorizations are important computational kernels in many scientific and engineering simulations. In this paper, we propose two extensions of both right-looking (LU and QR) and left-looking (Cholesky) one-sided factorization algorithms to utilize the computing power of current heterogeneous architectures. We first describe a new class of non-GPU-resident algorithms that factorize only a submatrix of a coefficient matrix on a GPU at a time. We then extend the algorithms to use multiple GPUs attached to a multicore. These extensions not only enable the factorization of a matrix that does not fit in the aggregated memory of the multiple GPUs at once, but also provide potential of fully utilizing the computing power of the architectures. Since data movement is expensive on the current architectures, these algorithms are designed to minimize the data movement at multiple levels. To demonstrate the effectiveness of these algorithms, we present their performance on a single compute node of the Keeneland system, which consists of twelve Intel Xeon processors and three NVIDIA GPUs. The performance results show both negligible overheads and scalable performance of our non-GPU-resident and multi-GPU algorithms, respectively. These extensions are now parts of the MAGMA software package, a set of the state-of-the-art dense linear algebra routines for a multicore with GPUs.
TL;DR: A novel multi-frame and multi-slice parallel video encoding approach with simultaneous encoding of predicted frames that leads to speedups comparable to those obtained by state-of-the-art approaches, but without the disadvantage of requiring bidirectional frames.
Abstract: This paper describes a novel multi-frame and multi-slice parallel video encoding approach with simultaneous encoding of predicted frames. The approach, when applied to H.264 encoding, leads to speedups comparable to those obtained by state-of-the-art approaches, but without the disadvantage of requiring bidirectional frames. The new approach uses a number of slices equal to or greater than the number of cores used and supports three motion estimation modes. Their combination leads to various tradeoffs between speedup and visual quality loss. For an H.264 baseline profile encoder based on Intel IPP code samples running on a system with two quad-core Xeon processors (8 cores in total), our experiments show an average speedup of 7.20×, with an average quality loss of 0.22 dB (compared to a non-parallelized version) for the most efficient motion estimation mode, and an average speedup of 7.95×, with a quality loss of 1.85 dB for the faster motion estimation mode.
TL;DR: A novel simulator for PDP systems accelerated by the use of the computational power of GPUs is introduced, and how to achieve up to a 7x speedup on an NVIDIA Tesla C1060 compared to an optimized multicore version on an Intel 4-core i5 Xeon for large systems is shown.
Abstract: Population Dynamics P systems (PDP systems, in short) provide a new formal bio-inspired modeling framework, which has been successfully used by ecologists. These models are validated using software tools against actual measurements. The goal is to use P systems simulations to adopt a priori management strategies for real ecosystems.
Software for PDP systems is still in an early stage. The simulation of PDP systems is both computationally and data intensive for large models. Therefore, the development of efficient simulators is needed for this field. In this paper, we introduce a novel simulator for PDP systems accelerated by the use of the computational power of GPUs. We discuss the implementation of each part of the simulator, and show how to achieve up to a 7x speedup on an NVIDIA Tesla C1060 compared to an optimized multicore version on an Intel 4-core i5 Xeon for large systems. Other results and testing methodologies are also included.
TL;DR: This is the first implementation of a general purpose particle transport simulation with unstructured grid on GPU and shows that the performance speedup of NVIDIA M2050 GPU with double precision floating operations ranges from 11.03 to 17.96 compared with the serial implementation on Intel Xeon X5355 and Core Q6600.
TL;DR: Experimental results show the compiler can generate scalable parallel code with execution times that are comparable to hand-written VENICE assembly code, thus avoiding the process of writing assembly code used by previous SVPs.
Abstract: This paper describes the compiler design for VENICE, a new soft vector processor (SVP). The compiler is a new back-end target for Microsoft Accelerator, a high-level data parallel library for C++ and C#. This allows us to automatically compile high-level programs into VENICE assembly code, thus avoiding the process of writing assembly code used by previous SVPs. Experimental results show the compiler can generate scalable parallel code with execution times that are comparable to hand-written VENICE assembly code. On data-parallel applications, VENICE at 100MHz on an Altera DE3 platform runs at speeds comparable to one core of a 3.5GHz Intel Xeon W3690 processor, beating it in performance on four of six benchmarks by up to 3.2x.
TL;DR: To improve energy efficiency, two energy-saving techniques in the runtime in charge of scheduling the computations are incorporated, to block idle threads and enable the transition to a more energy-friendly state of the general-purpose cores.
Abstract: We investigate the balance between the time-to-solution and the energy consumption of a task-parallel execution of the Cholesky and LU factorizations on a hybrid platform, equipped with a multi-core processor and several GPUs. To improve energy efficiency, we incorporate two energy-saving techniques in the runtime in charge of scheduling the computations, to block idle threads and enable the transition to a more energy-friendly state of the general-purpose cores. Experiments on an Intel Xeon-based platform connected to an NVIDIA Tesla server report an average reduction of the energy consumption close to 9% (38% when only the consumption associated with the application is considered), for a minor increase in the execution time of the algorithm.
TL;DR: A novel approach to sparse matrix-vector multiplication (SpMV) based on the proposed scan, unlike the existing ones that make use of backward segmented operations, uses forward ones for more efficient caching.
Abstract: We present a novel parallel algorithm for computing the scan operations on x86 multicore processors. The best known existing parallel scan for the same platform requires the number of processors to be a power of two, but this constraint is removed in our proposed method. In the design of the algorithm, architectural considerations for x86 multicore processors are taken into account so that the rate of cache misses is reduced and the cost of thread synchronization and management is minimized. Results from tests made on a machine with a dual-socket × quad-core Intel Xeon E5405 showed that the proposed solution outperformed the best known parallel reference. A novel approach to sparse matrix-vector multiplication (SpMV) based on the proposed scan is then explained. The approach, unlike the existing ones that make use of backward segmented operations, uses forward ones for more efficient caching. An implementation of the proposed SpMV was tested against the SpMV in Intel's Math Kernel Library (MKL) and merits were found in the proposed approach.
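For readers unfamiliar with scan (prefix sum), the sketch below shows a generic two-pass chunked scan that works for any number of workers, not only powers of two. It is a textbook scheme written sequentially for clarity and is not the authors' algorithm. In a threaded version, each iteration over `k` in the first and last loops would run on its own core. The function name is hypothetical.

```python
def chunked_inclusive_scan(values, workers):
    """Inclusive prefix sum computed chunk by chunk.

    Pass 1 scans each chunk locally, a short serial step turns the chunk
    totals into offsets, and pass 2 adds each chunk's offset.
    """
    n = len(values)
    # Contiguous chunks whose sizes differ by at most one element.
    bounds = [n * k // workers for k in range(workers + 1)]
    out = list(values)

    totals = []
    for k in range(workers):                      # pass 1: independent per chunk
        running = 0
        for i in range(bounds[k], bounds[k + 1]):
            running += out[i]
            out[i] = running
        totals.append(running)

    offsets = []
    running = 0
    for total in totals:                          # exclusive scan of chunk totals
        offsets.append(running)
        running += total

    for k in range(workers):                      # pass 2: independent per chunk
        for i in range(bounds[k], bounds[k + 1]):
            out[i] += offsets[k]
    return out


print(chunked_inclusive_scan([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], workers=3))
```

Each worker touches one contiguous range in both passes, which keeps memory accesses sequential. The only synchronization points are the two boundaries around the short serial step.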
TL;DR: A novel parallel architecture for accelerating quadrature methods used for pricing complex multi-dimensional options, such as discrete barrier, Bermudan and American options is presented and a parametrizable automated system is presented for generating hardware quadratures evaluation cores with an arbitrary number of dimensions.
Abstract: This paper presents a novel parallel architecture for accelerating quadrature methods used for pricing complex multi-dimensional options, such as discrete barrier, Bermudan and American options. We explore different designs of the quadrature evaluation core including optimized pipelined hardware designs in reconfigurable logic and a compute unified device architecture (CUDA)-based graphics processing unit (GPU) design. A parametrizable automated system is presented for generating hardware quadrature evaluation cores with an arbitrary number of dimensions. The performance and energy consumption of field-programmable gate arrays (FPGAs), GPUs, and central processing units (CPUs) are compared across different number of dimensions and precisions. Our evaluation shows that the 100 MHz Virtex-4 xc4vlx160 FPGA design is 4.6 times faster and 25.9 times more energy efficient than a multi-threaded optimized software implementation running on a Xeon W3504 dual-core CPU. It is also 2.6 times faster and 25.4 times more energy efficient than a GPU with comparable silicon process technology.
TL;DR: Self-virtualization is proposed, which provides the operating system with the capability to turn on and off virtualization on demand, without disturbing running applications, which enables computer systems to reap most benefits from virtualization without sacrificing performance.
Abstract: Virtualization has recently gained popularity largely due to its promise in increasing utilization, improving availability and enhancing security. Very often, the role of computer systems needs to change as the business environment changes. Initially, the system may only need to host one operating system and seek full execution speed. Later, it may be required to add other functionalities such as allowing easy software/hardware maintenance, surviving system failures and hosting multiple operating systems. Virtualization allows these functionalities to be supported easily and effectively. However, virtualization techniques generally incur a non-negligible performance penalty. Fortunately, many virtualization-enabled features such as online software/hardware maintenance and fault tolerance do not require virtualization to be on standby all the time. Based on this observation, this paper proposes a technique, called Self-virtualization, which provides the operating system with the capability to turn virtualization on and off on demand, without disturbing running applications. This technique enables computer systems to reap most benefits from virtualization without sacrificing performance. This paper presents the design and implementation of Mercury, a working prototype based on Linux and the Xen virtual machine monitor. The performance measurement shows that Mercury incurs very little overhead: about 0.2 ms on a 3 GHz Xeon CPU to complete a mode switch, and negligible performance degradation compared to Linux. Keywords: dependability, performance, self-virtualization, dynamic virtualization.
TL;DR: This work shows that the Lanczos method can be specialized to compute only extremal eigenvalues and presents an architecture that can achieve a sustained single-precision floating-point performance of 175 GFLOPs on a Virtex6-SX475T for a dense matrix of size 335×335.
Abstract: Iterative numerical algorithms with high memory bandwidth requirements but medium-size data sets (matrix size ~ a few hundred) are highly appropriate for FPGA acceleration. This paper presents a streaming architecture comprising floating-point operators coupled with high-bandwidth on-chip memories for the Lanczos method, an iterative algorithm for symmetric eigenvalue computation. We show that the Lanczos method can be specialized to compute only extremal eigenvalues and present an architecture which can achieve a sustained single-precision floating-point performance of 175 GFLOPs on a Virtex6-SX475T for a dense matrix of size 335×335. We perform a quantitative comparison with parallel implementations of the Lanczos method using the optimized Intel MKL and CUBLAS libraries for multi-core and GPU, respectively. We find that for a range of matrices the FPGA implementation outperforms both multi-core and GPU: a speedup of 8.2-27.3× (13.4× geo. mean) over an Intel Xeon X5650 and 26.2-116× (52.8× geo. mean) over an Nvidia C2050 when the FPGA is solving a single eigenvalue problem, and a speedup of 41-520× (103× geo. mean) and 131-2220× (408× geo. mean), respectively, when it is solving multiple eigenvalue problems.
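For readers unfamiliar with the method, the following is a minimal software sketch of Lanczos tridiagonalization followed by a bisection search for the two extremal eigenvalues only. It is a textbook formulation written for clarity, not the paper's streaming FPGA architecture; the start vector, the breakdown tolerance, the absence of reorthogonalization, and all function names are assumptions made for this example.

```python
import math

def matvec(a, v):
    """Dense matrix-vector product, the dominant cost of each Lanczos step."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def lanczos(a, steps):
    """Return the diagonal and off-diagonal of the Lanczos tridiagonal matrix."""
    n = len(a)
    norm = math.sqrt(sum((i + 1) ** 2 for i in range(n)))
    q = [(i + 1) / norm for i in range(n)]  # assumed (arbitrary) start vector
    q_prev = [0.0] * n
    alpha, beta, b = [], [], 0.0
    for _ in range(steps):
        w = matvec(a, q)
        al = dot(w, q)
        # Three-term recurrence: orthogonalize against the last two vectors only.
        w = [wi - al * qi - b * pi for wi, qi, pi in zip(w, q, q_prev)]
        alpha.append(al)
        b = math.sqrt(dot(w, w))
        if b < 1e-12:  # Krylov subspace is exhausted
            break
        beta.append(b)
        q_prev, q = q, [wi / b for wi in w]
    return alpha, beta[:len(alpha) - 1]

def count_below(alpha, beta, x):
    """Sturm count: number of eigenvalues of the tridiagonal matrix below x."""
    count, d = 0, 1.0
    for i, a in enumerate(alpha):
        off = beta[i - 1] ** 2 if i > 0 else 0.0
        d = a - x - off / d
        if d == 0.0:
            d = 1e-30  # avoid division by zero on the next step
        if d < 0.0:
            count += 1
    return count

def kth_eigenvalue(alpha, beta, k, iters=200):
    """Bisection for the k-th smallest eigenvalue (k counted from 1)."""
    m = len(alpha)
    radius = [(beta[i - 1] if i > 0 else 0.0) + (beta[i] if i < m - 1 else 0.0)
              for i in range(m)]
    lo = min(a - r for a, r in zip(alpha, radius))  # Gershgorin bounds
    hi = max(a + r for a, r in zip(alpha, radius))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if count_below(alpha, beta, mid) >= k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def extremal_eigenvalues(alpha, beta):
    """Smallest and largest eigenvalues of the tridiagonal matrix."""
    return kth_eigenvalue(alpha, beta, 1), kth_eigenvalue(alpha, beta, len(alpha))
```

Because only the smallest and largest eigenvalues are requested, the bisection never computes the interior of the spectrum, which is the kind of specialization the abstract refers to.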
TL;DR: This work assembled a financial analytics workload for derivative pricing, an important area of computational finance, and finds that large caches on both architectures, out-of-order cores on Intel® Xeon Phi™, and large compute and memory bandwidth on Intel® Xeon Phi™ deliver a high level of performance on financial analytics.
Abstract: In the past 20 years, computerization has driven explosive growth in the volume of financial markets and in the variety of traded financial instruments. Increasingly sophisticated mathematical and statistical methods and rapidly expanding computational power to drive them have given rise to the field of computational finance. The wide applicability of these models, their computational intensity, and their real-time constraints require high-throughput parallel architectures. In this work, we have assembled a financial analytics workload for derivative pricing, an important area of computational finance. We characterize and compare our workload's performance on two modern, parallel architectures: the Intel® Xeon Phi™ Processor 2680, and the recently announced Intel® Xeon Phi™ 'Knights Corner' coprocessor. In addition to analysis of the peak performance of the workloads on each architecture, we also quantify the impact of several levels of compiler and algorithmic optimization. Overall, we find that large caches on both architectures, out-of-order cores on Intel® Xeon Phi™, and large compute and memory bandwidth on Intel® Xeon Phi™ deliver a high level of performance on financial analytics.
TL;DR: The initial scheme and its evolutions, the attacks it had to face and the countermeasures applied are presented; the system is shown to remain practical, and a software implementation of the signing primitive is described.
Abstract: CFS is the first practical code-based signature scheme. In the present paper, we present the initial scheme and its evolutions, the attacks it had to face and the countermeasures applied. We will show that, all things considered, the system remains practical, and we present a software implementation of the signing primitive. For eighty bits of security, our implementation produces a signature in 1.3 seconds on a single core of an Intel Xeon W3670 at 3.20 GHz. Moreover, the computation is easy to distribute, and we can take full advantage of multicore processors, reducing the signature time to a fraction of a second in software.
TL;DR: A fast parallel implementation of MD simulation with the Morse potential on Tianhe-1A, a petascale heterogeneous supercomputer, shows that large-scale MD simulations can benefit enormously from GPU acceleration on petascale supercomputing platforms.
Abstract: Molecular Dynamics (MD) simulations have been widely used in the study of macromolecules. To ensure an acceptable level of statistical accuracy, a relatively large number of particles is needed, which calls for high-performance implementations of MD. These days heterogeneous systems, with their high performance potential, low power consumption, and high price-performance ratio, offer a viable alternative for running MD simulations. In this paper we introduce a fast parallel implementation of MD simulation with the Morse potential on Tianhe-1A, a petascale heterogeneous supercomputer. Our code achieves a speedup of 3.6× on one NVIDIA Tesla M2050 GPU (containing 14 Streaming Multiprocessors) compared to a 2.93 GHz six-core Intel Xeon X5670 CPU. In addition, our code runs faster on 1024 compute nodes (with two CPUs and one GPU inside a node) than on 4096 GPU-excluded nodes, effectively rendering one GPU more efficient than six six-core CPUs. Our work shows that large-scale MD simulations can benefit enormously from GPU acceleration on petascale supercomputing platforms. Our performance results are achieved by using (1) a patch-cell design to exploit parallelism across the simulation domain, (2) a new GPU kernel developed by taking advantage of Newton's Third Law to reduce redundant force computation on GPUs, and (3) two optimization methods: a dynamic load-balancing strategy that adjusts the workload and a communication-overlapping method that overlaps the communications between CPUs and GPUs.
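The idea behind item (2) can be sketched in a few lines: each pair of particles is visited once, and the force computed for one particle is applied with the opposite sign to its partner, which halves the number of pair evaluations. The code below is a serial illustration only, not the paper's GPU kernel or its patch-cell decomposition. It assumes the common Morse form V(r) = D(1 - exp(-a(r - r0)))^2, and the parameter names `depth`, `a`, and `r0` and the default values are hypothetical.

```python
import math

def morse_forces(positions, depth=1.0, a=1.0, r0=1.0):
    """Pairwise Morse forces, visiting each pair (i, j) with i < j exactly once."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            delta = [pi - pj for pi, pj in zip(positions[i], positions[j])]
            r = math.sqrt(sum(d * d for d in delta))
            e = math.exp(-a * (r - r0))
            dv_dr = 2.0 * depth * a * (1.0 - e) * e  # derivative of V with respect to r
            for k in range(3):
                f = -dv_dr * delta[k] / r            # force on particle i
                forces[i][k] += f
                forces[j][k] -= f                    # Newton's third law: equal and opposite
    return forces
```

On a GPU the two updates for a pair land in different particles' accumulators, so a real kernel has to resolve those concurrent writes in some way; this sketch sidesteps the issue by running serially.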