Top 134 papers published in the topic of Xeon in 2013

Journal Article•10.1007/S11227-012-0825-3•

GPU-accelerated preconditioned iterative linear solvers

Ruipeng Li¹, Yousef Saad¹•Institutions (1)

01 Feb 2013-The Journal of Supercomputing

TL;DR: This work gives an overview of preliminary experience in developing a high-performance iterative linear solver accelerated by GPU coprocessors, and discusses techniques for speeding up sparse matrix-vector product (SpMV) kernels and for finding suitable preconditioning methods.

Abstract: This work is an overview of our preliminary experience in developing a high-performance iterative linear solver accelerated by GPU coprocessors. Our goal is to illustrate the advantages and difficulties encountered when deploying GPU technology to perform sparse linear algebra computations. Techniques for speeding up sparse matrix-vector product (SpMV) kernels and finding suitable preconditioning methods are discussed. Our experiments with an NVIDIA Tesla M2070 show that for unstructured matrices SpMV kernels can be up to 8 times faster on the GPU than the Intel MKL on the host Intel Xeon X5675 processor. Overall performance of the GPU-accelerated incomplete Cholesky (IC) factorization preconditioned CG method can outperform its CPU counterpart by a smaller factor, up to 3, and the GPU-accelerated incomplete LU (ILU) factorization preconditioned GMRES method can achieve a speed-up nearing 4. However, with better suited preconditioning techniques for GPUs, this performance can be further improved.
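
For readers unfamiliar with the SpMV kernel named in the abstract, the following is a minimal sequential sketch of y = Ax with A held in compressed sparse row (CSR) form. It is plain Python meant only to show the access pattern, not the authors' GPU kernel; the names `spmv_csr`, `row_ptr`, `col_idx`, and `values` are illustrative assumptions.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Return y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i;
    col_idx[k] and values[k] give the column and value of nonzero k.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read through col_idx: this indirect access is what
            # makes SpMV irregular and memory-bound.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Hypothetical 3x3 example: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

Each row is independent of the others, which is why the kernel is a natural candidate for parallel hardware; the indirect reads of `x` are what make it hard to run at full memory bandwidth.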

345 citations

Proceedings Article•10.1145/2464996.2465013•

Efficient sparse matrix-vector multiplication on x86-based many-core processors

Xing Liu¹, Mikhail Smelyanskiy², Edmond Chow¹, Pradeep Dubey²•Institutions (2)

Georgia Institute of Technology¹, Intel²

10 Jun 2013

TL;DR: This work describes an efficient implementation of SpMV on the Intel® Xeon Phi™ Coprocessor, codenamed Knights Corner (KNC), that exploits the salient architectural features of KNC, such as large caches and hardware support for irregular memory accesses.

Abstract: Sparse matrix-vector multiplication (SpMV) is an important kernel in many scientific applications and is known to be memory bandwidth limited. On modern processors with wide SIMD and large numbers of cores, we identify and address several bottlenecks which may limit performance even before memory bandwidth: (a) low SIMD efficiency due to sparsity, (b) overhead due to irregular memory accesses, and (c) load-imbalance due to non-uniform matrix structures. We describe an efficient implementation of SpMV on the Intel® Xeon Phi™ Coprocessor, codenamed Knights Corner (KNC), that addresses the above challenges. Our implementation exploits the salient architectural features of KNC, such as large caches and hardware support for irregular memory accesses. By using a specialized data structure with careful load balancing, we attain performance on average close to 90% of KNC's achievable memory bandwidth on a diverse set of sparse matrices. Furthermore, we demonstrate that our implementation is 3.52x and 1.32x faster, respectively, than the best available implementations on dual Intel® Xeon® Processor E5-2680 and the NVIDIA Tesla K20X architecture.
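
Bottleneck (c), load imbalance from non-uniform matrix structure, can be illustrated with a small sketch: instead of giving every thread the same number of rows, split the rows so that each chunk holds roughly the same number of nonzeros. This is a generic CSR-based heuristic shown under stated assumptions, not the specialized data structure or balancing scheme of the paper; `partition_rows_by_nnz` and the example matrix are hypothetical.

```python
from bisect import bisect_left


def partition_rows_by_nnz(row_ptr, n_parts):
    """Split rows into n_parts contiguous chunks with roughly equal nonzeros.

    row_ptr is the CSR row-pointer array (cumulative nonzero counts).
    Returns n_parts + 1 row boundaries; chunk p covers rows
    bounds[p] .. bounds[p + 1] - 1.
    """
    n_rows = len(row_ptr) - 1
    total_nnz = row_ptr[-1]
    bounds = [0]
    for p in range(1, n_parts):
        target = total_nnz * p / n_parts
        # First row boundary whose cumulative nonzero count reaches the target.
        cut = bisect_left(row_ptr, target)
        # Keep boundaries monotone and inside the matrix.
        cut = min(max(cut, bounds[-1]), n_rows)
        bounds.append(cut)
    bounds.append(n_rows)
    return bounds


# Hypothetical matrix: one dense row (6 nonzeros) followed by six sparse rows.
row_ptr = [0, 6, 7, 8, 9, 10, 11, 12]
bounds = partition_rows_by_nnz(row_ptr, 2)
nnz_per_part = [row_ptr[bounds[p + 1]] - row_ptr[bounds[p]] for p in range(2)]
```

In this example an equal-row split (rows 0-2 and 3-6) would assign 8 and 4 nonzeros, whereas the nonzero-based split assigns 6 to each part. A single very dense row can still dominate one chunk, which is one reason finer-grained schemes exist.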

282 citations

Proceedings Article•10.1109/IPDPS.2013.44•

Exploring SIMD for Molecular Dynamics, Using Intel® Xeon® Processors and Intel® Xeon Phi Coprocessors

Simon J. Pennycook¹, Christopher J. Hughes², Mikhail Smelyanskiy², Stephen A. Jarvis¹•Institutions (2)

University of Warwick¹, Intel²

20 May 2013

TL;DR: This work demonstrates the importance of effective SIMD utilisation for molecular dynamics codes on current and future hardware, and the considerable performance increase afforded by Intel® Xeon Phi™ coprocessors for highly parallel workloads.

Abstract: We analyse gather-scatter performance bottlenecks in molecular dynamics codes and the challenges that they pose for obtaining benefits from SIMD execution. This analysis informs a number of novel code-level and algorithmic improvements to Sandia's miniMD benchmark, which we demonstrate using three SIMD widths (128-, 256- and 512-bit). The applicability of these optimisations to wider SIMD is discussed, and we show that the conventional approach of exposing more parallelism through redundant computation is not necessarily best. In single precision, our optimised implementation is up to 5x faster than the original scalar code running on Intel® Xeon® processors with 256-bit SIMD, and adding a single Intel® Xeon Phi™ coprocessor provides up to an additional 2x performance increase. These results demonstrate: (i) the importance of effective SIMD utilisation for molecular dynamics codes on current and future hardware; and (ii) the considerable performance increase afforded by the use of Intel® Xeon Phi™ coprocessors for highly parallel workloads.

165 citations

Proceedings Article•10.1109/IPDPS.2013.113•

Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor

Alexander Heinecke¹, Karthikeyan Vaidyanathan², Mikhail Smelyanskiy², Alexander Kobotov², Roman S. Dubtsov², Greg Henry², Aniruddha G. Shet², George Z. Chrysos², Pradeep Dubey²•Institutions (2)

Information Technology University¹, Intel²

20 May 2013

TL;DR: This paper describes how several flavors of the Linpack benchmark are accelerated on Intel's recently released Intel® Xeon Phi™ co-processor (code-named Knights Corner) in both native and hybrid configurations.

Abstract: Dense linear algebra has been traditionally used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators. In this paper we describe how several flavors of the Linpack benchmark are accelerated on Intel's recently released Intel® Xeon Phi™ co-processor (code-named Knights Corner) in both native and hybrid configurations. Our native DGEMM implementation takes full advantage of Knights Corner's salient architectural features and successfully utilizes close to 90% of its peak compute capability. Our native Linpack implementation running entirely on Knights Corner employs novel dynamic scheduling and achieves close to 80% efficiency, the highest published co-processor efficiency. Similarly to native, our single-node hybrid implementation of Linpack also achieves nearly 80% efficiency. Using dynamic scheduling and an enhanced look-ahead scheme, this implementation scales well to a 100-node cluster, on which it achieves over 76% efficiency while delivering a total performance of 107 TFLOPS.

157 citations

Journal Article•10.1007/S11227-011-0672-7•

Parallel data mining techniques on Graphics Processing Unit with Compute Unified Device Architecture (CUDA)

Liheng Jian¹, Cheng Wang², Ying Liu¹, Shenshen Liang¹, Weidong Yi¹, Yong Shi³•Institutions (3)

Chinese Academy of Sciences¹, Agilent Technologies², University of Nebraska Omaha³

01 Jun 2013-The Journal of Supercomputing

TL;DR: Three techniques to speed up fundamental problems in data mining algorithms on the CUDA platform are proposed: a scalable thread scheduling scheme for irregular patterns, a parallel distributed top-k scheme, and a parallel high dimension reduction scheme.

Abstract: Recent development in Graphics Processing Units (GPUs) has enabled inexpensive high performance computing for general-purpose applications. The Compute Unified Device Architecture (CUDA) programming model provides programmers with adequate C-like APIs to better exploit the parallel power of the GPU. Data mining is widely used and has significant applications in various domains. However, current data mining toolkits cannot meet the speed requirements of applications with large-scale databases. In this paper, we propose three techniques to speed up fundamental problems in data mining algorithms on the CUDA platform: a scalable thread scheduling scheme for irregular patterns, a parallel distributed top-k scheme, and a parallel high dimension reduction scheme. They play a key role in our CUDA-based implementation of three representative data mining algorithms, CU-Apriori, CU-KNN, and CU-K-means. These parallel implementations significantly outperform the other state-of-the-art implementations on an HP xw8600 workstation with a Tesla C1060 GPU and a quad-core Intel Xeon CPU. Our results show that the GPU + CUDA parallel architecture is feasible and promising for data mining applications.

104 citations

Proceedings Article•10.1145/2493123.2462903•

NUMA-aware shared-memory collective communication for MPI

Shigang Li¹, Torsten Hoefler², Marc Snir³•Institutions (3)

University of Science and Technology Beijing¹, ETH Zurich², University of Illinois at Urbana–Champaign³

17 Jun 2013

TL;DR: The design and optimizations of MPI collectives for clusters of NUMA nodes are investigated, performance models for collective communication using shared memory are developed, and several algorithms for various collectives are developed.

Abstract: As the number of cores per node keeps increasing, it becomes increasingly important for MPI to leverage shared memory for intranode communication. This paper investigates the design and optimizations of MPI collectives for clusters of NUMA nodes. We develop performance models for collective communication using shared memory, and we develop several algorithms for various collectives. Experiments are conducted on both Xeon X5650 and Opteron 6100 InfiniBand clusters. The measurements agree with the model and indicate that different algorithms dominate for short vectors and long vectors. We compare our shared-memory allreduce with several traditional MPI implementations -- Open MPI, MPICH2, and MVAPICH2 -- that utilize system shared memory to facilitate interprocess communication. On a 16-node Xeon cluster and 8-node Opteron cluster, our implementation achieves on average 2.5X and 2.3X speedup over MVAPICH2, respectively. Our techniques enable an efficient implementation of collective operations on future multi- and manycore systems.

97 citations

Book Chapter•10.1007/978-3-642-40047-6_56•

Assessing the performance of OpenMP programs on the Intel Xeon Phi

Dirk Schmidl¹, Tim Cramer¹, Sandra Wienke¹, Christian Terboven¹, Matthias S. Müller¹•Institutions (1)

RWTH Aachen University¹

26 Aug 2013

TL;DR: This work compares a Xeon-based two-socket compute node with the Xeon Phi stand-alone in scalability and performance using OpenMP codes and shows significant differences in absolute application performance and scalability.

Abstract: The Intel Xeon Phi has been introduced as a new type of compute accelerator that is capable of executing native x86 applications. It supports programming models that are well-established in the HPC community, namely MPI and OpenMP, thus removing the necessity to refactor codes for using accelerator-specific programming paradigms. Because of its native x86 support, the Xeon Phi may also be used stand-alone, meaning codes can be executed directly on the device without the need for interaction with a host. In this sense, the Xeon Phi resembles a big SMP on a chip if its 240 logical cores are compared to a common Xeon-based compute node offering up to 32 logical cores. In this work, we compare a Xeon-based two-socket compute node with the Xeon Phi stand-alone in scalability and performance using OpenMP codes. Considering both as individual SMP systems, they come at a very similar price and power envelope, but our results show significant differences in absolute application performance and scalability. We also show to what extent common programming idioms for the Xeon multi-core architecture are applicable to the Xeon Phi many-core architecture, and which challenges the changing ratio of core count to single-core performance poses for the application programmer.

77 citations

Proceedings Article•10.1109/IJCNN.2013.6707074•

Exploring the design space of specialized multicore neural processors

Tarek M. Taha¹, Raqibul Hasan¹, Chris Yakopcic¹, Mark McLean•Institutions (1)

University of Dayton¹

1 Aug 2013

TL;DR: This study examines the design of several novel specialized multicore neural processors based on SRAM cores and memristor devices and indicates that these specialized systems can be between two and five orders of magnitude more energy efficient than traditional HPC systems.

Abstract: This study examines the design of several novel specialized multicore neural processors. Systems based on SRAM cores and memristor devices were examined. Detailed circuit simulations were used to ensure that the systems could be compared accurately. Two types of memristor cores were examined: digital and analog cores. Novel circuits were designed for both of these memristor systems. Additionally, full system evaluations of multicore processors based on these cores and specialized routing circuits were developed. Our results show that the memristor systems yield the highest throughput and lowest power. We compared these specialized systems to more traditional HPC systems. Two commodity high performance processors were examined: a six-core Intel Xeon processor and an NVIDIA Tesla M2070 GPGPU. Care was taken to ensure the code on each platform was very efficient (multi-threaded on the Xeon processor, and a high device utilization CUDA program on the GPGPU). Our results indicate that the specialized systems can be between two and five orders of magnitude more energy efficient than the traditional HPC systems. Additionally, the specialized cores take up much less die area, allowing in some cases a reduction from 179 Xeon six-core processor chips to 1 memristor-based multicore chip and a corresponding reduction in power from 17 kW down to 0.07 W.

64 citations

Proceedings Article•10.1109/IPDPSW.2013.231•

Compiler-Based Data Prefetching and Streaming Non-temporal Store Generation for the Intel(R) Xeon Phi(TM) Coprocessor

Rakesh Krishnaiyer¹, Emre Kultursay², Pankaj Chawla, Serguei Preis, Anatoly Zvezdin, Hideki Saito¹•Institutions (2)

Intel¹, Pennsylvania State University²

20 May 2013

TL;DR: The results show that the Intel® Composer XE 2013 compiler can make effective use of software prefetching instructions to hide memory latencies and special store instructions to save bandwidth on streaming non-temporal store operations to achieve significant performance improvements.

Abstract: The Intel® Xeon Phi™ coprocessor has software prefetching instructions to hide memory latencies and special store instructions to save bandwidth on streaming non-temporal store operations. In this work, we provide details on compiler-based generation of these instructions and evaluate their impact on the performance of the Intel® Xeon Phi™ coprocessor using a wide range of parallel applications with different characteristics. Our results show that the Intel® Composer XE 2013 compiler can make effective use of these mechanisms to achieve significant performance improvements.

54 citations

Journal Article•10.1016/J.JPDC.2013.03.015•

Parallel multi-dimensional range query processing with R-trees on GPU

Jinwoong Kim¹, Sul-Gi Kim¹, Beomseok Nam¹•Institutions (1)

Ulsan National Institute of Science and Technology¹

01 Aug 2013-Journal of Parallel and Distributed Computing

TL;DR: An extensive experimental study shows that the MPTS R-tree traversal algorithm on an NVIDIA Tesla M2090 GPU consistently outperforms the traditional recursive R-tree search algorithm on Intel Xeon E5506 processors.

52 citations

Journal Article•10.1002/NME.4452•

Generation of large finite-element matrices on multiple graphics processors

Adam Dziekonski¹, P. Sypek¹, Adam Lamecki¹, Michal Mrozowski¹•Institutions (1)

Gdańsk University of Technology¹

13 Apr 2013-International Journal for Numerical Methods in Engineering

TL;DR: This paper proposes to generate the large sparse linear systems arising in finite‐element analysis in an iterative manner on several GPUs and to use the graphics accelerators concurrently with CPUs performing collection and addition of the matrix fragments using a fast multithreaded procedure.

Abstract: This paper presents techniques for generating very large finite-element matrices on a multicore workstation equipped with several graphics processing units (GPUs). To overcome the low memory size limitation of the GPUs, and at the same time to accelerate the generation process, we propose to generate the large sparse linear systems arising in finite-element analysis in an iterative manner on several GPUs and to use the graphics accelerators concurrently with CPUs performing collection and addition of the matrix fragments using a fast multithreaded procedure. The scheduling of the threads is organized in such a way that the CPU operations do not affect the performance of the process, and the GPUs are idle only when data are being transferred from GPU to CPU. This approach is verified on two workstations: the first consists of two 6-core Intel Xeon X5690 processors with two Fermi GPUs, each a GeForce GTX 590 with two graphics processors and 1.5 GB of fast RAM; the second workstation is equipped with two Tesla C2075 boards carrying 6 GB of RAM each and two 12-core Opteron 6174s. For the latter setup, we demonstrate the fast generation of sparse finite-element matrices as large as 10 million unknowns, with over 1 billion nonzero entries. Compared with the single-threaded and multithreaded CPU implementations, the GPU-based version of the algorithm based on the ideas presented in this paper reduces the finite-element matrix-generation time in double precision by factors of 100 and 30, respectively. Copyright © 2012 John Wiley & Sons, Ltd.

Proceedings Article•10.1109/ICPP.2013.87•

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Arunmoezhi Ramachandran¹, Jerome Vienne², Rob F. Van der Wijngaart³, Lars Koesterke², Ilya Sharapov³•Institutions (3)

University of Texas at Dallas¹, University of Texas at Austin², Intel³

1 Oct 2013

TL;DR: This work uses the NPB-OpenMP version to examine the performance of Intel's new Xeon Phi co-processor, focusing in particular on the many-core aspect of the Xeon Phi architecture.

Abstract: NAS parallel benchmarks (NPB) are a set of applications commonly used to evaluate parallel systems. We use the NPB-OpenMP version to examine the performance of Intel's new Xeon Phi co-processor and focus in particular on the many-core aspect of the Xeon Phi architecture. A first analysis studies the scalability up to 244 threads on 61 cores and the impact of affinity settings on scaling. It also compares performance characteristics of Xeon Phi and traditional Xeon CPUs. The application of several well-established optimization techniques allows us to identify common bottlenecks that can specifically impede performance on the Xeon Phi but are not as severe on multi-core CPUs. We also find that many of the OpenMP-parallel loops are too short (in terms of the number of loop iterations) for a balanced execution by 244 threads. New or redesigned benchmarks will be needed to accommodate the greatly increased number of cores and threads. At the end, we summarize our findings in a set of recommendations for performance optimization for Xeon Phi.
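
The remark that loops can be "too short" for 244 threads can be made concrete with a small calculation. The sketch below assumes equal-cost iterations and a static schedule in which the busiest thread receives ceil(n_iters / n_threads) iterations; the trip count of 100 is a hypothetical value chosen for illustration, not a figure from the paper, and `static_schedule_efficiency` is an illustrative name.

```python
import math


def static_schedule_efficiency(n_iters, n_threads):
    """Fraction of total thread-time spent on useful work when n_iters
    equal-cost iterations are divided statically among n_threads."""
    if n_iters == 0:
        return 0.0
    # The loop finishes only when the busiest thread is done.
    longest = math.ceil(n_iters / n_threads)
    return n_iters / (longest * n_threads)


# Hypothetical loop of 100 iterations on a 16-thread CPU node
# versus the 244 threads studied in the abstract.
eff_cpu = static_schedule_efficiency(100, 16)    # 100 / (7 * 16)
eff_phi = static_schedule_efficiency(100, 244)   # 100 / (1 * 244)
```

Under these assumptions the 100-iteration loop keeps about 89% of the 16 threads' time busy but only about 41% of the 244 threads' time, since 144 threads receive no work at all.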

Proceedings Article•10.1145/2451116.2451137•

Regularities considered harmful: forcing randomness to memory accesses to reduce row buffer conflicts for multi-core, multi-bank systems

Heekwon Park¹, Seungjae Baek¹, Jongmoo Choi², Donghee Lee³, Sam H. Noh⁴•Institutions (4)

University of Pittsburgh¹, Dankook University², Seoul National University³, Hongik University⁴

16 Mar 2013

TL;DR: A novel kernel-level memory allocator, called M3 (M-cube, Multi-core Multi-bank Memory allocator), is proposed. It assigns each core its own memory container, a unit of memory comprising the minimum number of page frames that cover all the banks of the memory organization, and it disperses the pages that threads access randomly across banks to reduce row-buffer conflicts.

Abstract: We propose a novel kernel-level memory allocator, called M3 (M-cube, Multi-core Multi-bank Memory allocator), that has the following two features. First, it introduces and makes use of a notion of a memory container, which is defined as a unit of memory that comprises the minimum number of page frames that can cover all the banks of the memory organization, by exclusively assigning a container to a core so that each core achieves bank parallelism as much as possible. Second, it orchestrates page frame allocation so that pages that threads access are dispersed randomly across multiple banks so that each thread's access pattern is randomized. The development of M3 is based on a tool that we develop to fully understand the architectural characteristics of the underlying memory organization. Using an extension of this tool, we observe that the same application that accesses pages in a random manner outperforms one that accesses pages in a regular pattern such as sequential or same-ordered accesses. This is because such randomized accesses reduce inter-thread access interference on the row-buffer in memory banks. We implement M3 in the Linux kernel version 2.6.32 on an Intel Xeon system that has 16 cores and 32GB DRAM. Performance evaluation with various workloads shows that M3 improves the overall performance for memory-intensive benchmarks by up to 85% with an average of about 40%.

Proceedings Article•10.1109/DSD.2013.108•

Evaluating the Hardware Performance of a Million-Bit Multiplier

Yarkin Doröz, Erdinc Ozturk¹, Berk Sunar•Institutions (1)

Istanbul Commerce University¹

4 Sep 2013

TL;DR: Estimates show that the performance of the novel architecture designed to realize a million-bit multiplication architecture matches that of previously reported software implementations on a high-end 3 GHz Intel Xeon processor, while requiring only a tiny fraction of the area.

Abstract: In this work we present the first full and complete evaluation of a very large multiplication scheme in custom hardware. We designed a novel architecture to realize a million-bit multiplication architecture based on the Schönhage-Strassen Algorithm and the Number Theoretical Transform (NTT). The construction makes use of an innovative cache architecture along with processing elements customized to match the computation and access patterns of the FFT-based recursive multiplication algorithm. When synthesized using a 90nm TSMC library operating at a frequency of 666 MHz, our architecture is able to compute the product of integers in excess of a million bits in 7.74 milliseconds. Estimates show that the performance of our design matches that of previously reported software implementations on a high-end 3 GHz Intel Xeon processor, while requiring only a tiny fraction of the area.
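
To show the general idea behind transform-based integer multiplication, here is a minimal software sketch: split each operand into small digits, convolve the digit sequences with a number-theoretic transform, and recombine. It is a textbook single-prime NTT, not the Schönhage-Strassen recursion or the hardware architecture evaluated in the paper. The modulus 998244353 with primitive root 3 and the 8-bit digit size are assumptions chosen for simplicity; with them the result is exact only while the shorter operand has fewer than about 15,000 digits (roughly 120,000 bits), so million-bit operands would need a larger modulus or several primes combined by the Chinese remainder theorem.

```python
MOD = 998244353  # prime equal to 119 * 2**23 + 1
ROOT = 3         # primitive root modulo MOD


def ntt(a, invert=False):
    """In-place iterative NTT; len(a) must be a power of two."""
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes of doubling length.
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length *= 2
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD


def to_digits(value, digit_bits):
    """Little-endian digits of a non-negative integer in base 2**digit_bits."""
    mask = (1 << digit_bits) - 1
    digits = []
    while value:
        digits.append(value & mask)
        value >>= digit_bits
    return digits or [0]


def multiply(x, y, digit_bits=8):
    """Multiply non-negative integers via NTT convolution of their digits."""
    a, b = to_digits(x, digit_bits), to_digits(y, digit_bits)
    size = 1
    while size < len(a) + len(b):
        size *= 2
    a += [0] * (size - len(a))
    b += [0] * (size - len(b))
    ntt(a)
    ntt(b)
    c = [p * q % MOD for p, q in zip(a, b)]
    ntt(c, invert=True)
    # Shifting and summing the coefficients also propagates the carries.
    return sum(coef << (digit_bits * i) for i, coef in enumerate(c))


x = 123456789012345678901234567890
y = 987654321098765432109876543210
product = multiply(x, y)
```

The pointwise product in the transform domain replaces the quadratic digit-by-digit multiplication, which is the source of the asymptotic advantage for very large operands.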

Proceedings Article•10.1145/2503210.2503242•

Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors

Jongsoo Park, Ganesh Bikshandi, Karthikeyan Vaidyanathan, Ping Tak Peter Tang¹, Pradeep Dubey, Daehyun Kim•Institutions (1)

Intel¹

17 Nov 2013

TL;DR: This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations by leveraging a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and by aggressively optimizing data movements in node-local computations, exploiting caches.

Abstract: This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5x that achievable on the same number of Intel® Xeon® nodes. It is a challenge to fully utilize the compute capability presented by many-core wide-vector processors for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movements in node-local computations, exploiting caches. Our coordination of low-communication algorithm and massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication limited.

Book Chapter•10.1007/978-1-4614-8745-6_3•

Intel® Xeon Phi™ Coprocessors

Jim Jeffers¹•Institutions (1)

Intel¹

1 Jan 2013

TL;DR: This chapter provides an overview of the Intel Xeon Phi Coprocessor, including its hardware and software architecture and the key usages that enable highly parallel applications, like many in geoscience, to achieve new levels of performance while using familiar, standard programming models.

Abstract: Intel recently launched the Intel® Xeon Phi™ Coprocessor to enhance the performance of the growing category of highly parallel scientific applications. This chapter provides an overview of the Intel Xeon Phi Coprocessor, including its hardware and software architecture and the key usages that enable these highly parallel applications, like many in geoscience, to achieve new levels of performance while using familiar, standard programming models.

Journal Article•10.1063/1.4873137•

Graphics processing units accelerated semiclassical initial value representation molecular dynamics

Dario Tamascelli¹, Francesco Saverio Dambrosio¹, Riccardo Conte², Michele Ceotto¹•Institutions (2)

University of Milan¹, Emory University²

17 Dec 2013-arXiv: Computational Physics

TL;DR: This paper presents a Graphics Processing Unit (GPU) implementation of the Semiclassical Initial Value Representation (SC-IVR) propagator for vibrational molecular spectroscopy calculations. It reports a significant reduction in computational time and power consumption, and the semiclassical GPU calculations are shown to be environmentally friendly.

Abstract: This paper presents a Graphics Processing Unit (GPU) implementation of the Semiclassical Initial Value Representation (SC-IVR) propagator for vibrational molecular spectroscopy calculations. The time-averaging formulation of the SC-IVR for power spectrum calculations is employed. Details about the GPU implementation of the semiclassical code are provided. Four molecules with an increasing number of atoms are considered and the GPU-calculated vibrational frequencies perfectly match the benchmark values. The computational time scaling of two GPUs (NVIDIA Tesla C2075 and Kepler K20) versus two CPUs (Intel Core i5 and Intel Xeon E5-2687W), respectively, and the critical issues related to the GPU implementation are discussed. The resulting reduction in computational time and power consumption is significant, and semiclassical GPU calculations are shown to be environmentally friendly.

Proceedings Article•10.1145/2485922.2485977•

QuickRec: prototyping an Intel architecture extension for record and replay of multithreaded programs

Gilles Pokam¹, Klaus Danne¹, Cristiano Pereira¹, Rolf Kassa¹, Tim Kranich¹, Shiliang Hu¹, Justin Gottschlich¹, Nima Honarmand², Nathan Dautenhahn², Samuel T. King², Josep Torrellas²•Institutions (2)

Intel¹, University of Illinois at Urbana–Champaign²

23 Jun 2013

TL;DR: It is demonstrated that RnR can be implemented efficiently on a real multicore IA system, and it is shown that the rate of memory log generation is insignificant, and that the recording hardware has negligible performance overhead.

Abstract: There has been significant interest in hardware-assisted deterministic Record and Replay (RnR) systems for multithreaded programs on multiprocessors. However, no proposal has implemented this technique in a hardware prototype with full operating system support. Such an implementation is needed to assess RnR practicality. This paper presents QuickRec, the first multicore Intel Architecture (IA) prototype of RnR for multithreaded programs. QuickRec is based on QuickIA, an Intel emulation platform for rapid prototyping of new IA extensions. QuickRec is composed of a Xeon server platform with FPGA-emulated second-generation Pentium cores, and Capo3, a full software stack for managing the recording hardware from within a modified Linux kernel. This paper's focus is understanding and evaluating the implementation issues of RnR on a real platform. Our effort leads to some lessons learned, as well as to some pointers for future research. We demonstrate that RnR can be implemented efficiently on a real multicore IA system. In particular, we show that the rate of memory log generation is insignificant, and that the recording hardware has negligible performance overhead. However, the software stack incurs an average recording overhead of nearly 13%, which must be reduced to enable always-on use of RnR.

Journal Article•10.1145/2450153.2450154•

High-performance bidiagonal reduction using tile algorithms on homogeneous multicore architectures

Hatem Ltaief¹, Piotr Luszczek², Jack Dongarra³•Institutions (3)

King Abdullah University of Science and Technology¹, University of Tennessee², Oak Ridge National Laboratory³

03 May 2013-ACM Transactions on Mathematical Software

TL;DR: The new high-performance BRD achieves up to a 30-fold speedup against the state-of-the-art open source and commercial numerical software packages, namely LAPACK, compiled with optimized and multithreaded BLAS from MKL as well as Intel MKL version 10.2.

Abstract: This article presents a new high-performance bidiagonal reduction (BRD) for homogeneous multicore architectures. This article is an extension of the high-performance tridiagonal reduction implemented by the same authors [Luszczek et al., IPDPS 2011] to the BRD case. The BRD is the first step toward computing the singular value decomposition of a matrix, which is one of the most important algorithms in numerical linear algebra due to its broad impact in computational science. The high performance of the BRD described in this article comes from the combination of four important features: (1) tile algorithms with tile data layout, which provide an efficient data representation in main memory; (2) a two-stage reduction approach that allows most of the computation during the first stage (reduction to band form) to be cast into calls to Level 3 BLAS and reduces the memory traffic during the second stage (reduction from band to bidiagonal form) by using high-performance kernels optimized for cache reuse; (3) a data dependence translation layer that maps the general algorithm with column-major data layout into the tile data layout; and (4) a dynamic runtime system that efficiently schedules the newly implemented kernels across the processing units and ensures that the data dependencies are not violated. A detailed analysis is provided to understand the critical impact of the tile size on the total execution time, which also corresponds to the matrix bandwidth size after the reduction of the first stage. The performance results show a significant improvement over currently established alternatives. The new high-performance BRD achieves up to a 30-fold speedup on a 16-core Intel Xeon machine with a 12000 × 12000 matrix size against the state-of-the-art open source and commercial numerical software packages, namely LAPACK, compiled with optimized and multithreaded BLAS from MKL as well as Intel MKL version 10.2.


Proceedings Article•10.1145/2493123.2462921•

COSMIC: middleware for high performance and reliable multiprocessing on xeon phi coprocessors


Srihari Cadambi¹, Giuseppe Coviello¹, Cheng-Hong Li¹, Rajat Phull¹, Kunal Rao¹, Murugan Sankaradass¹, Srimat Chakradhar¹•Institutions (1)

Princeton University¹

17 Jun 2013

TL;DR: A new, user-level middleware called COSMIC is proposed that improves performance and reliability of multiprocessing on coprocessors like the Xeon Phi, and increases multiprocessing reliability by exploiting programmer-specified per-process coprocessor memory requirements to completely avoid memory oversubscription and crashes.


Abstract: It is remarkably easy to offload processing to Intel's newest manycore coprocessor, the Xeon Phi: it supports a popular ISA (x86-based), a popular OS (Linux) and a popular programming model (OpenMP). Unfortunately, easy portability does not automatically ensure high performance. Additional programmer effort is necessary to leverage the new performance-oriented hardware features. But programmer optimizations alone are insufficient. Multiprocessing is also necessary to improve hardware utilization, and Linux makes it easy for processes to share the manycore coprocessor. However, multiprocessing inefficiencies can easily offset gains made by the programmer. Our experiments on a production, high-performance Xeon server with multiple Xeon Phi coprocessors show that multiprocessing on coprocessors not only slows down the processes but also introduces unreliability (some processes crash unexpectedly). We propose a new, user-level middleware called COSMIC that improves performance and reliability of multiprocessing on coprocessors like the Xeon Phi. COSMIC seamlessly fits in the existing Xeon Phi software stack and is transparent to programmers. It manages Xeon Phi processes that execute parallel regions offloaded to the coprocessors. Offloads typically have programmer-driven performance directives like thread and affinity requirements. Unlike the existing Xeon Phi software stack, COSMIC does fair scheduling of both processes and offloads, and takes into account conflicting requirements of offloads belonging to different processes. By doing so, COSMIC has two clear benefits. First, it improves multiprocessing performance by preventing thread and memory oversubscription, by avoiding inter-offload interference and by reducing load imbalance on coprocessors and cores. Second, it increases multiprocessing reliability by exploiting programmer-specified per-process coprocessor memory requirements to completely avoid memory oversubscription and crashes. Our experiments on several representative Xeon Phi workloads show that, in a multiprocessing environment, COSMIC improves average core utilization by up to 3 times, reduces make-span by up to 52%, reduces average process latency (turn-around-time) by 70%, and completely eliminates process crashes.


Journal Article•10.1016/J.CAGEO.2012.07.017•

Accelerating the discontinuous Galerkin method for seismic wave propagation simulations using the graphic processing unit (GPU)-single-GPU implementation


Dawei Mu¹, Po Chen¹, Liqiang Wang¹•Institutions (1)

University of Wyoming¹

01 Feb 2013-Computers & Geosciences

TL;DR: This is the first study that explores the potential of accelerating the ADER-DG method for seismic wave-propagation simulations using a GPU and the implementation obtained a speedup factor of about 24.3 for the single-precision version of the GPU code.


Journal Article•10.1016/J.COMPFLUID.2012.04.024•

Automatically optimized core mapping to subdomains of domain decomposition method on multicore parallel environments


Satoshi Ito¹, Kazuya Goto, Kenji Ono¹•Institutions (1)

University of Tokyo¹

10 Jul 2013-Computers & Fluids

TL;DR: The proposed mapping of subdomains to CPU cores was evaluated on a massively parallel Intel Xeon PC cluster, and several benchmark tests confirmed that it reduces communication time and achieves higher parallel performance than running without the mapping.


Proceedings Article•10.5555/2523721.2523741•

L1-bandwidth aware thread allocation in multicore SMT processors


Josue Feliu, Julio Sahuquillo, Salvador Petit, José Duato

7 Oct 2013

TL;DR: This work addresses the L1 cache bandwidth problem in SMT processors experimentally on real hardware and proposes two L1 bandwidth aware thread to core (t2c) allocation policies, namely Static and Dynamic t2c allocation, respectively.


Abstract: Improving the utilization of shared resources is a key issue to increase performance in SMT processors. Recent work has focused on resource sharing policies to enhance the processor performance, but their proposals mainly concentrate on novel hardware mechanisms that adapt to the dynamic resource requirements of the running threads. This work addresses the L1 cache bandwidth problem in SMT processors experimentally on real hardware. Unlike previous work, this paper concentrates on thread allocation, by selecting the proper pair of co-runners to be launched to the same core. The relation between L1 bandwidth requirements of each benchmark and its performance (IPC) is analyzed. We found that for individual benchmarks, performance is strongly connected to L1 bandwidth consumption, and this observation remains valid when several co-runners are launched to the same SMT core. Based on these findings we propose two L1 bandwidth aware thread to core (t2c) allocation policies, namely Static and Dynamic t2c allocation, respectively. The aim of these policies is to properly balance L1 bandwidth requirements of the running threads among the processor cores. Experiments on a Xeon E5645 processor show that the proposed policies significantly improve the performance of the Linux OS kernel regardless of the number of cores considered.
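
The balancing idea in this abstract can be illustrated with a small sketch. The abstract does not spell out the Static or Dynamic policy, so the heuristic below (pair the most bandwidth-hungry thread with the least hungry one) is only an assumption about what "balancing L1 bandwidth among cores" might look like, with two hardware threads per core. The function name and the sample bandwidth figures are hypothetical.

```python
def pair_threads_by_l1_bandwidth(l1_bandwidth):
    """Pair threads so each 2-way SMT core sees a similar total L1 demand.

    l1_bandwidth maps a thread name to its measured L1 bandwidth (any unit).
    Returns one (heavy_thread, light_thread) tuple per core.
    This is an illustrative heuristic, not the policy from the paper.
    """
    if len(l1_bandwidth) % 2 != 0:
        raise ValueError("expected an even number of threads")
    # Rank threads from highest to lowest L1 bandwidth demand.
    ranked = sorted(l1_bandwidth, key=l1_bandwidth.get, reverse=True)
    pairs = []
    low, high = 0, len(ranked) - 1
    while low < high:
        # Co-schedule the heaviest remaining thread with the lightest one.
        pairs.append((ranked[low], ranked[high]))
        low += 1
        high -= 1
    return pairs


# Hypothetical per-thread L1 bandwidth measurements.
demand = {"a": 90, "b": 10, "c": 60, "d": 40}
pairs = pair_threads_by_l1_bandwidth(demand)
per_core_total = [demand[x] + demand[y] for x, y in pairs]
```

With these sample values both cores end up with the same combined demand, which is the kind of balance the abstract describes. A dynamic variant would presumably re-measure bandwidth periodically and re-pair, but that is not shown here.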


Book Chapter•10.1007/978-3-319-10214-6_2•

Performance Evaluation of the Intel Sandy Bridge Based NASA Pleiades Using Scientific and Engineering Applications


Subhash Saini¹, Johnny Chang¹, Haoqiang Jin¹•Institutions (1)

Ames Research Center¹

18 Nov 2013

TL;DR: A performance evaluation of Pleiades based on the Intel Xeon E5-2670 processor, a fourth-generation eight-core Sandy Bridge architecture, is presented and compared with the previous third-generation Nehalem architecture.


Abstract: We present a performance evaluation of Pleiades based on the Intel Xeon E5-2670 processor, a fourth-generation eight-core Sandy Bridge architecture, and compare it with the previous third generation Nehalem architecture. Several architectural features have been incorporated in Sandy Bridge: (a) four memory channels as opposed to three in Nehalem; (b) memory speed increased from 1333 MHz to 1600 MHz; (c) ring to connect on-chip L3 cache with cores, system agent, memory controller, and QPI agent and I/O controller to increase the scalability; (d) new AVX unit with wider vector registers of 256 bit; (e) integration of PCI-Express 3.0 controllers into the I/O subsystem on chip; (f) new Turbo Boost version 2.0 where base frequency of processor increased from 2.6 to 3.2 GHz; and (g) QPI link rate from 6.4 to 8 GT/s and two QPI links to second socket. We critically evaluate these new features using several low-level benchmarks, and four full-scale scientific and engineering applications.
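
Items (a) and (b) of the abstract together imply a higher theoretical peak memory bandwidth per socket. The arithmetic below is a rough sketch, assuming the quoted 1333 and 1600 figures are DDR3 transfer rates in MT/s and that each channel is 64 bits (8 bytes) wide. The function name is hypothetical, and the results are theoretical peaks, not measurements from the paper.

```python
def peak_dram_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth of one socket in GB/s (10^9 bytes/s).

    Assumes every channel moves bytes_per_transfer bytes per transfer
    (8 bytes for a 64-bit DDR3 channel).
    """
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


# Nehalem: three channels at 1333 MT/s; Sandy Bridge: four channels at 1600 MT/s.
nehalem = peak_dram_bandwidth_gb_s(3, 1333)
sandy_bridge = peak_dram_bandwidth_gb_s(4, 1600)
ratio = sandy_bridge / nehalem
```

Under these assumptions the peak rises from about 32 GB/s to 51.2 GB/s per socket, a factor of roughly 1.6. Sustained bandwidth in low-level benchmarks is normally lower than these peaks.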


Proceedings Article•10.1109/RTAS.2013.6531100•

Bringing theory into practice: A userspace library for multicore real-time scheduling


Malcolm S. Mollison¹, James H. Anderson¹•Institutions (1)

University of North Carolina at Chapel Hill¹

9 Apr 2013

TL;DR: This paper describes the first userspace scheduler that supports preemptive, dynamic-priority, migrating real-time tasks on multicore hardware, and reports empirical latency and overhead measurements.


Abstract: As multicore computing hardware has become more ubiquitous, real-time scheduling theory aimed at multicore systems has become increasingly sophisticated and diverse. Real-time operating systems (RTOSs) are ill-suited for this kind of rapid change, and the slow-moving RTOS ecosystem is falling further and further behind advances in real-time scheduling theory. Thus, supporting new functionality in a layer of middleware software running in userspace (i.e., outside the RTOS kernel) has been proposed. In this paper, we describe the first userspace scheduler that supports preemptive, dynamic-priority, migrating real-time tasks on multicore hardware, and report empirical latency and overhead measurements. On an eight-core Intel Xeon platform, these measurements are in the range of ones to tens of microseconds under most tested configurations. We believe that this approach may prove superior to a kernel-based approach for supporting a subset of future real-world realtime applications.


Journal Article•10.1109/TPDS.2012.217•

Robust and Scalable String Pattern Matching for Deep Packet Inspection on Multicore Processors


Yi-Hua E. Yang¹, Viktor K. Prasanna²•Institutions (2)

Xilinx¹, University of Southern California²

01 Nov 2013-IEEE Transactions on Parallel and Distributed Systems

TL;DR: A head-body finite automaton (HBFA) is proposed, which implements SPM in two parts: a head DFA (H-DFA) and a body NFA (B-NFA), which achieves 3x to 8x throughput when matching real-life large dictionaries against inputs with high match ratios.


Abstract: Conventionally, dictionary-based string pattern matching (SPM) has been implemented as Aho-Corasick deterministic finite automaton (AC-DFA). Due to its large memory footprint, a large-dictionary AC-DFA can experience poor cache performance when matching against inputs with high match ratio on multicore processors. We propose a head-body finite automaton (HBFA), which implements SPM in two parts: a head DFA (H-DFA) and a body NFA (B-NFA). The H-DFA matches the dictionary up to a predefined prefix length in the same way as AC-DFA, but with a much smaller memory footprint. The B-NFA extends the matching to full dictionary lengths in a compact variable-stride branch data structure, accelerated by single-instruction multiple-data (SIMD) operations. A branch grafting mechanism is proposed to opportunistically advance the state of the H-DFA with the matching progress in the B-NFA. Compared with a fully populated AC-DFA, our HBFA prototype has <1/5 construction time, requires <1/20 runtime memory, and achieves 3x to 8x throughput when matching real-life large dictionaries against inputs with high match ratios. The throughput scales up 27x to over 34 Gbps on a 32-core Intel Manycore Testing Lab machine based on the Intel Xeon X7560 processors.
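
For readers unfamiliar with the baseline, the following is a minimal sketch of textbook Aho-Corasick matching using a trie with failure links. It is neither the fully populated AC-DFA the abstract compares against (which precomputes a transition for every state and input symbol) nor the proposed HBFA. The function names and the sample dictionary are illustrative only.

```python
from collections import deque


def build_aho_corasick(patterns):
    """Build trie transitions, failure links, and per-state outputs."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns that end at each state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first traversal so shallower states are finished first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            queue.append(child)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_matches(text, automaton):
    """Return (start_index, pattern) for every dictionary hit in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches


automaton = build_aho_corasick(["he", "she", "his", "hers"])
hits = find_matches("ushers", automaton)
```

Here the per-state dictionaries keep memory small at the cost of following failure links during matching. A fully populated DFA trades that for one table lookup per input symbol, which is the memory footprint the abstract identifies as a cache problem for large dictionaries.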


Book Chapter•10.1007/978-3-642-38750-0_18•

Offload Compiler Runtime for the Intel® Xeon Phi™ Coprocessor


Chris J. Newburn¹, Rajiv Deodhar¹, Dmitriev Serguei N¹, Ravi Murty¹, Ravi Narayanaswamy¹, John A. Wiegert¹, Francisco Chinchilla¹, Russell W. McGuire¹•Institutions (1)

Intel¹

16 Jun 2013

TL;DR: The paper addresses why offload to a coprocessor is useful, how it is specified, and what the conditions for the profitability of offload are, and enumerates the key performance features for this heterogeneous computing stack, related to initialization, data movement and invocation.


Abstract: The Intel® Xeon Phi™ coprocessor platform enables offload of computation from a host processor to a coprocessor that is a fully-functional Intel® Architecture CPU. This paper presents the C/C++ and Fortran compiler offload runtime for that coprocessor. The paper addresses why offload to a coprocessor is useful, how it is specified, and what the conditions for the profitability of offload are. It also serves as a guide to potential third-party developers of offload runtimes, such as a gcc-based offload compiler, ports of existing commercial offloading compilers to Intel® Xeon Phi™ coprocessor such as CAPS®, and third-party offload library vendors that Intel is working with, such as NAG® and MAGMA®. It describes the software architecture and design of the offload compiler runtime. It enumerates the key performance features for this heterogeneous computing stack, related to initialization, data movement and invocation. Finally, it evaluates the performance impact of those features for a set of directed micro-benchmarks and larger workloads.


Proceedings Article•10.1109/FPT.2013.6718346•

Implementation of high performance hardware architecture of OpenSURF algorithm on FPGA


Xitian Fan¹, Chenlu Wu¹, Wei Cao¹, Xuegong Zhou¹, Shengye Wang¹, Lingli Wang¹•Institutions (1)

Fudan University¹

1 Dec 2013

TL;DR: A sliding window method is proposed to extract feature points in parallel at selected scale levels and the time cost in feature extraction can be greatly reduced, and data reuse strategy is proposed in orientation generation and descriptor generation to reduce the memory access times.


Abstract: This paper proposes a high performance hardware architecture of the Speeded Up Robust Features (SURF) algorithm based on OpenSURF. In order to achieve a high processing frame rate, the hardware architecture is designed with several characteristics. Firstly, a sliding window method is proposed to extract feature points in parallel at selected scale levels. As a result, the time cost in feature extraction can be greatly reduced. Secondly, a data reuse strategy is proposed in orientation generation and descriptor generation to reduce the memory access times. In this way, 3.87x and 2.25x speedups are achieved respectively. Thirdly, the integral image is segmented and buffered in different memory blocks in order to support multiple data accesses in one clock cycle, which further reduces the overall computation time of our implementation. The hardware architecture is implemented on an XC6VSX475T FPGA at 156 MHz and its maximal frame rate for VGA format images can reach 356 frames per second (fps), which is 6.25 times the frame rate of OpenSURF running on a server with a Xeon 5650 processor, and 6 times the reported frame rate of the recent implementation on three Virtex4 FPGAs [8].


Proceedings Article•10.1145/2484762.2484794•

Optimizing the PCIT algorithm on stampede's Xeon and Xeon Phi processors for faster discovery of biological networks


Lars Koesterke¹, Kent Milfeld¹, Matthew W. Vaughn¹, Dan Stanzione¹, James E. Koltes², Nathan T. Weeks², James M. Reecy²•Institutions (2)

University of Texas at Austin¹, Iowa State University²

22 Jul 2013

TL;DR: The PCIT algorithm is re-implemented with exemplary parallel, vector, I/O, memory and instruction optimizations for today's multi- and many-core architectures, targeting the processor architectures of the Stampede supercomputer, but will also benefit other architectures.


Abstract: The PCIT method is an important technique for detecting interactions between networks. The PCIT algorithm has been used in the biological context to infer complex regulatory mechanisms and interactions in genetic networks, in genome wide association studies, and in other similar problems. In this work, the PCIT algorithm is re-implemented with exemplary parallel, vector, I/O, memory and instruction optimizations for today's multi- and many-core architectures. The evolution and performance of the new code targets the processor architectures of the Stampede supercomputer, but will also benefit other architectures. The Stampede system consists of an Intel Xeon E5 processor base system with an innovative component comprised of Intel Xeon Phi Coprocessors. Optimized results and an analysis are presented for both the Xeon and the Xeon Phi.


Proceedings Article•

A 22nm 2.5MB slice on-die L3 cache for the next generation Xeon® processor


Wei Chen¹, Szu-Liang Chen¹, Siufu Chiu¹, Raghuraman Ganesan¹, Venkata Lukka¹, Wei Wing Mar¹, Stefan Rusu¹•Institutions (1)

Intel¹

11 Jun 2013

TL;DR: Per-die post silicon RWA fusing and per-die post silicon redundancy flow have significantly improved L3 cache Vccmin by more than 150 mV.


Abstract: The 20-way set associative 2.5MB slice ported L3 cache for the multi-core Xeon® Processor uses 0.108 µm² cell in a 22nm tri-gate technology with 2.7TB maximum bandwidth. It is protected by double-error correction/triple-error detection ECC. The basic building block is designed to support floorplan style on each processor with large L3 cache. On die fuse storage enables high resolution repair coverage. Programmable transient voltage collapse write assist (TVC-WA) and wordline underdrive read assist (WLUD-RA) circuits (RWA) are used to achieve an aggressive Vccmin goal. Per-die post silicon RWA fusing and per-die post silicon redundancy flow have significantly improved L3 cache Vccmin by more than 150 mV. Shutoff per slice features have reduced SRAM leakage.


