Top 147 papers published on the topic of Xeon in 2014

Journal Article•10.1016/J.JCP.2014.08.024•

Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer

Chuanfu Xu¹, Xiaogang Deng¹, Lilun Zhang¹, Jianbin Fang², Guangxue Wang, Yi Jiang, Wei Cao¹, Yonggang Che¹, Yongxian Wang¹, Zhenghua Wang¹, Wei Liu¹, Xinghua Cheng¹•Institutions (2)

National University of Defense Technology¹, Delft University of Technology²

01 Dec 2014-Journal of Computational Physics

TL;DR: This paper ports and optimizes the high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer, proposes a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlaps the collaborative computation and communication as far as possible using advanced CUDA and MPI features.

95 citations

Proceedings Article•10.1109/SC.2014.82•

Efficient shared-memory implementation of high-performance conjugate gradient benchmark and its application to unstructured matrices

Jongsoo Park¹, Mikhail Smelyanskiy¹, Karthikeyan Vaidyanathan¹, Alexander Heinecke¹, Dhiraj D. Kalamkar¹, Xing Liu², Md. Mostofa Ali Patwary¹, Yutong Lu³, Pradeep Dubey¹•Institutions (3)

Intel¹, Georgia Institute of Technology², National University of Defense Technology³

16 Nov 2014

TL;DR: This work presents a shared-memory implementation of the Gauss-Seidel smoother on Xeon Phi that balances parallelism, data access locality, CG convergence rate, and communication overhead, and demonstrates that the optimizations benefit not only the original HPCG dataset, which is based on a structured 3D grid, but also a wide range of unstructured matrices.

Abstract: A new sparse high performance conjugate gradient benchmark (HPCG) has been recently released to address challenges in the design of sparse linear solvers for the next generation extreme-scale computing systems. Key computation, data access, and communication pattern in HPCG represent building blocks commonly found in today's HPC applications. While it is a well known challenge to efficiently parallelize Gauss-Seidel smoother, the most time-consuming kernel in HPCG, our algorithmic and architecture-aware optimizations deliver 95% and 68% of the achievable bandwidth on Xeon and Xeon Phi, respectively. Based on available parallelism, our Xeon Phi shared-memory implementation of Gauss-Seidel smoother selectively applies block multi-color reordering. Combined with MPI parallelization, our implementation balances parallelism, data access locality, CG convergence rate, and communication overhead. Our implementation achieved 580 TFLOPS (82% parallelization efficiency) on Tianhe-2 system, ranking first on the most recent HPCG list in July 2014. In addition, we demonstrate that our optimizations not only benefit HPCG original dataset, which is based on structured 3D grid, but also a wide range of unstructured matrices.
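
The multi-color idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain two-color (red-black) ordering rather than block multi-color reordering, and the 1D Poisson test problem and the function names are assumptions chosen for brevity.

```python
def red_black_gauss_seidel(b, sweeps):
    """Approximately solve 2*x[i] - x[i-1] - x[i+1] = b[i] with zero boundaries.

    Unknowns are split into two colors (even and odd indices). Within one
    color no unknown depends on another of the same color, so the inner loop
    could be executed in parallel; here it runs sequentially for clarity.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        for color in (0, 1):
            for i in range(color, n, 2):
                left = x[i - 1] if i > 0 else 0.0
                right = x[i + 1] if i < n - 1 else 0.0
                x[i] = (b[i] + left + right) / 2.0
    return x


def residual_norm(x, b):
    """Maximum absolute residual of the tridiagonal system above."""
    n = len(b)
    worst = 0.0
    for i in range(n):
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        worst = max(worst, abs(b[i] - (2.0 * x[i] - left - right)))
    return worst


rhs = [1.0] * 8
solution = red_black_gauss_seidel(rhs, sweeps=500)
print(residual_norm(solution, rhs))
```

Reordering changes the order in which unknowns are updated, which is why the abstract lists CG convergence rate among the quantities that have to be balanced against parallelism.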

64 citations

Proceedings Article•10.1145/2554688.2554790•

FPGA-based biophysically-meaningful modeling of olivocerebellar neurons

Georgios Smaragdos¹, Sebastian Isaza², Martijn F. van Eijk³, Ioannis Sourdis⁴, Christos Strydis¹•Institutions (4)

Erasmus University Rotterdam¹, University of Antioquia², Delft University of Technology³, Chalmers University of Technology⁴

26 Feb 2014

TL;DR: This work ports a highly detailed ION cell network model, originally coded in Matlab, onto an FPGA chip by translating it to HLS C code for the Xilinx Vivado toolflow and applying various algorithmic and arithmetic optimizations.

Abstract: The Inferior-Olivary nucleus (ION) is a well-charted region of the brain, heavily associated with sensorimotor control of the body. It comprises ION cells with unique properties which facilitate sensory processing and motor-learning skills. Various simulation models of ION-cell networks have been written in an attempt to unravel their mysteries. However, simulations become rapidly intractable when biophysically plausible models and meaningful network sizes (>=100 cells) are modeled. To overcome this problem, in this work we port a highly detailed ION cell network model, originally coded in Matlab, onto an FPGA chip. It was first converted to ANSI C code and extensively profiled. It was, then, translated to HLS C code for the Xilinx Vivado toolflow and various algorithmic and arithmetic optimizations were applied. The design was implemented in a Virtex 7 (XC7VX485T) device and can simulate a 96-cell network at real-time speed, yielding a speedup of x700 compared to the original Matlab code and x12.5 compared to the reference C implementation running on an Intel Xeon 2.66GHz machine with 20GB RAM. For a 1,056-cell network (non-real-time), an FPGA speedup of x45 against the C code can be achieved, demonstrating the design's usefulness in accelerating neuroscience research. Limited by the available on-chip memory, the FPGA can maximally support a 14,400-cell network (non-real-time) with online parameter configurability for cell state and network size. The maximum throughput of the FPGA ION-network accelerator can reach 2.13 GFLOPS.

52 citations

Proceedings Article•10.1109/IPDPS.2014.82•

Enabling and Scaling a Global Shallow-Water Atmospheric Model on Tianhe-2

Wei Xue¹, Chao Yang, Haohuan Fu¹, Xinliang Wang¹, Yangtong Xu¹, Lin Gan¹, Yutong Lu², Xiaoqian Zhu²•Institutions (2)

Tsinghua University¹, National University of Defense Technology²

19 May 2014

TL;DR: This paper presents a hybrid algorithm for the petascale global simulation of atmospheric dynamics on Tianhe-2, the world's current top-ranked supercomputer developed by China's National University of Defense Technology; the algorithm enables flexible domain partition between an arbitrary number of processors and accelerators.

Abstract: This paper presents a hybrid algorithm for the petascale global simulation of atmospheric dynamics on Tianhe-2, the world's current top-ranked supercomputer developed by China's National University of Defense Technology (NUDT). Tianhe-2 is equipped with both Intel Xeon CPUs and Intel Xeon Phi accelerators. A key idea of the hybrid algorithm is to enable flexible domain partition between an arbitrary number of processors and accelerators, so as to achieve a balanced and efficient utilization of the entire system. We also present an asynchronous and concurrent data transfer scheme to reduce the communication overhead between CPU and accelerators. The acceleration of our global atmospheric model is conducted to improve the use of the Intel MIC architecture. For the single-node test on Tianhe-2 against two Intel Ivy Bridge CPUs (24 cores), we can achieve 2.07x, 3.18x, and 4.35x speedups when using one, two, and three Intel Xeon Phi accelerators respectively. The average performance gain from SIMD vectorization on the Intel Xeon Phi processors is around 5x (out of the 8x theoretical case). Based on successful computation-communication overlapping, large-scale tests indicate that a nearly ideal weak-scaling efficiency of 93.5% is obtained when we gradually increase the number of nodes from 6 to 8,664 (nearly 1.7 million cores). In the strong-scaling test, the parallel efficiency is about 77% when the number of nodes increases from 1,536 to 8,664 for a fixed 65,664 × 5,664 × 6 mesh with 77.6 billion unknowns.
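
As a worked reading of the strong-scaling and SIMD figures quoted above: parallel efficiency is conventionally the measured speedup divided by the ideal speedup, which for strong scaling is the ratio of node counts. The sketch below inverts that definition to recover the implied speedup. The function names are hypothetical, and the derived speedup is an inference from the quoted efficiency, not a number reported in the abstract.

```python
def implied_speedup(efficiency, base_nodes, scaled_nodes):
    """Speedup implied by a strong-scaling efficiency between two node counts."""
    ideal = scaled_nodes / base_nodes
    return efficiency * ideal


def simd_utilization(measured_gain, theoretical_gain):
    """Fraction of the theoretical SIMD gain that was actually obtained."""
    return measured_gain / theoretical_gain


# Figures quoted in the abstract: 77% efficiency from 1,536 to 8,664 nodes,
# and roughly 5x SIMD gain out of a theoretical 8x.
strong_scaling_speedup = implied_speedup(0.77, 1536, 8664)
simd_fraction = simd_utilization(5.0, 8.0)
print(f"implied strong-scaling speedup: {strong_scaling_speedup:.2f}x")
print(f"SIMD utilization: {simd_fraction:.1%}")
```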

51 citations

Proceedings Article•10.1109/ISSCC.2014.6757356•

5.4 Ivytown: A 22nm 15-core enterprise Xeon® processor family

Stefan Rusu¹, Harry Muljono¹, David J. Ayers¹, Simon M. Tam¹, Wei Chen¹, Aaron K. Martin¹, Shenggao Li¹, Sujal Vora¹, Raj Varada¹, Edward Wang¹•Institutions (1)

Intel¹

6 Mar 2014

TL;DR: The next-generation enterprise Xeon® server processor has 15 dual-threaded 64b Ivybridge cores and 37.5MB shared L3 cache and CMOS muxes embedded in the ring bus are programmably operable in a 2-or-3-columns configuration.

Abstract: The next-generation enterprise Xeon® server processor has 15 dual-threaded 64b Ivybridge cores [1] and 37.5MB shared L3 cache. The system interface includes two on-chip memory controllers, each with two memory channels and supports multiple system topologies. The processor has 4.31B transistors in a high-κ metal-gate tri-gate 22nm CMOS technology with 9 metal layers [2]. The design supports a wide array of product offerings with thermal design power ranging from 40 to 150W and frequencies ranging from 1.4 to 3.8GHz. Fig. 5.4.1(a) shows the processor block diagram. The floorplan (Fig. 5.4.1(b)) is driven by the ring bus routability and latency, as well as the chop requirements to smaller core counts. The cores and associated L3 cache are organized in columns of five, with the ring bus segment embedded. The fully populated die has 15-cores in three columns. The 10-core chop removes the rightmost 3rd column and its dedicated top and bottom IOs. CMOS muxes embedded in the ring bus are programmably operable in a 2-or-3-columns configuration. The 6-core chop removes the 2nd and 4th rows from the 10-core die.

48 citations

Proceedings Article•10.1109/ULTSYM.2014.0555•

A multi-threaded version of Field II

Jørgen Arendt Jensen¹•Institutions (1)

Technical University of Denmark¹

23 Oct 2014

TL;DR: A multi-threaded version of Field II has been developed, which can automatically use the multi-core capabilities of modern CPUs and is fully compatible with older versions; only a single command has been added for setting the number of threads to use.

Abstract: A multi-threaded version of Field II has been developed, which can automatically use the multi-core capabilities of modern CPUs. The memory allocation routines were rewritten to minimize the number of dynamic allocations and to make pre-allocations possible for each thread. This ensures that the simulation job can be automatically partitioned and the interdependence between threads minimized. The new code has been compared to Field II version 3.22, October 27, 2013 (latest free-ware version). A 64 element 5 MHz focused array transducer was simulated. One million point scatterers randomly distributed in a plane of 20 × 50 mm (width × depth) with random Gaussian amplitudes were simulated using the command calc_scat. Dual Intel Xeon CPU E5-2630 2.60 GHz CPUs were used under Ubuntu Linux 10.02 and Matlab version 2013b. Each CPU holds 6 cores with hyper-threading, corresponding to a total of 24 hyper-threading cores. The averaged simulation time for 10 realizations for the old version was 85.1 s. A single thread run for the new version took 27.7 s; a speed-up of 3.1. Employing all 24 cores gave a simulation time of 3.27 s for the one million scatterers corresponding to a speed-up factor of 26 times. The speed-up in general depends on the transducer, scatterers and simulation, and it varies across applications between 13 and 30. The program is fully compatible with older versions, and only a single command has been added for setting the number of threads to use. The division of labor is automatically handled by the program. For a phantom with 100,000 scatterers, it is now possible to simulate a full 128 line image in around 42 seconds with full precision.
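
The speed-up figures in the abstract follow directly from the quoted timings. The short check below reproduces them; the variable and function names are illustrative only, and the third ratio is derived here rather than quoted from the abstract.

```python
# Timings quoted in the abstract, in seconds.
OLD_VERSION = 85.1       # Field II 3.22, averaged over 10 realizations
NEW_ONE_THREAD = 27.7    # new code, single thread
NEW_ALL_THREADS = 3.27   # new code, all 24 hardware threads


def speedup(baseline, improved):
    """Ratio of baseline run time to improved run time."""
    return baseline / improved


single_thread_gain = speedup(OLD_VERSION, NEW_ONE_THREAD)
total_gain = speedup(OLD_VERSION, NEW_ALL_THREADS)
threading_gain = speedup(NEW_ONE_THREAD, NEW_ALL_THREADS)

print(f"old vs. new, one thread:  {single_thread_gain:.1f}x")
print(f"old vs. new, all threads: {total_gain:.0f}x")
print(f"new code, 1 vs. 24 threads: {threading_gain:.1f}x")
```

The last ratio separates the gain that comes from threading alone from the gain that comes from the rewritten memory handling.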

45 citations

Journal Article•10.1016/J.IS.2014.01.005•

Parallel Online Spatial and Temporal Aggregations on Multi-core CPUs and Many-Core GPUs

Jianting Zhang¹, Simin You², Le Gruenwald³•Institutions (3)

City College of New York¹, The Graduate Center, CUNY², University of Oklahoma³

01 Aug 2014-Information Systems

TL;DR: The designs, implementations and experiments show the feasibility of building a high-performance OLAP system for processing large-scale taxi trip data for real-time, interactive data explorations and open paths to achieving even higher OLAP query efficiency for large-scale applications.

39 citations

Proceedings Article•10.1109/SC.2014.60•

Efficient implementation of many-body quantum chemical methods on the Intel® Xeon Phi™ coprocessor

Edoardo Aprà¹, Michael Klemm², Karol Kowalski¹•Institutions (2)

Pacific Northwest National Laboratory¹, Intel²

16 Nov 2014

TL;DR: The implementation and performance of the highly accurate CCSD(T) quantum chemistry method on the Intel® Xeon Phi coprocessor within the context of the NWChem computational chemistry package is presented.

Abstract: This paper presents the implementation and performance of the highly accurate CCSD(T) quantum chemistry method on the Intel® Xeon Phi™ coprocessor within the context of the NWChem computational chemistry package. The widespread use of highly correlated methods in electronic structure calculations is contingent upon the interplay between advances in theory and the possibility of utilizing the ever-growing computer power of emerging heterogeneous architectures. We discuss the design decisions of our implementation as well as the optimizations applied to the compute kernels and data transfers between host and coprocessor. We show the feasibility of adopting the Intel® Many Integrated Core Architecture and the Intel Xeon Phi coprocessor for developing efficient computational chemistry modeling tools. Remarkable scalability is demonstrated by benchmarks. Our solution scales up to a total of 62560 cores with the concurrent utilization of Intel® Xeon® processors and Intel Xeon Phi coprocessors.

38 citations

Journal Article•10.1016/J.COMPFLUID.2014.04.021•

Solving seven-equation model for compressible two-phase flow using multiple GPUs

Shan Liang¹, Wei Liu¹, Li Yuan¹•Institutions (1)

Chinese Academy of Sciences¹

22 Jul 2014-Computers & Fluids

TL;DR: Numerical tests against several one- and two-dimensional compressible two-phase flow problems with high density and high pressure ratios demonstrate that the application of an HLLC-type approximate Riemann solver in conjunction with the third-order TVD Runge–Kutta method is accurate and robust.

36 citations

Proceedings Article•10.5555/2616606.2617013•

Unveiling Eurora - thermal and power characterization of the most energy-efficient supercomputer in the world

Andrea Bartolini¹, Matteo Cacciari¹, Carlo Cavazzoni, Giampietro Tecchiolli, Luca Benini¹•Institutions (1)

University of Bologna¹

24 Mar 2014

TL;DR: A novel, low-overhead monitoring infrastructure capable of tracking in detail and in real time the thermal and power characteristics of Eurora's components with fine-grained resolution is presented.

Abstract: Eurora (EURopean many integrated cORe Architecture) is today the most energy efficient supercomputer in the world. Ranked 1st in the Green500 in July 2013, it is a prototype built by Eurotech and Cineca toward next-generation Tier-0 systems in the PRACE 2IP EU project. Eurora's outstanding energy-efficiency is achieved by adopting a direct liquid cooling solution and a heterogeneous architecture with best-in-class general purpose HW components (Intel Xeon E5, Intel Xeon Phi and NVIDIA Kepler K20). In this paper we present a novel, low-overhead monitoring infrastructure capable of tracking in detail and in real time the thermal and power characteristics of Eurora's components with fine-grained resolution. Our experiments give insights into Eurora's thermal/power trade-offs and highlight opportunities for run-time power/thermal management and optimization.

35 citations

Journal Article•10.1002/CPE.3132•

Data-driven execution of fast multipole methods

Hatem Ltaief¹, Rio Yokota¹•Institutions (1)

King Abdullah University of Science and Technology¹

10 Aug 2014-Concurrency and Computation: Practice and Experience

TL;DR: The authors discuss in the paper another approach based on data‐driven execution to efficiently tackle this challenging load balancing problem, which consists of breaking the most time‐consuming stages of the FMMs into smaller tasks.

Abstract: Fast multipole methods (FMMs) have O(N) complexity, are compute bound, and require very little synchronization, which makes them a favorable algorithm on next-generation supercomputers. Their most common application is to accelerate N-body problems, but they can also be used to solve boundary integral equations. When the particle distribution is irregular and the tree structure is adaptive, load balancing becomes a non-trivial question. A common strategy for load balancing FMMs is to use the work load from the previous step as weights to statically repartition the next step. The authors discuss in the paper another approach based on data-driven execution to efficiently tackle this challenging load balancing problem. The core idea consists of breaking the most time-consuming stages of the FMMs into smaller tasks. The algorithm can then be represented as a directed acyclic graph where nodes represent tasks and edges represent dependencies among them. The execution of the algorithm is performed by asynchronously scheduling the tasks using the queueing and runtime for kernels runtime environment, in a way such that data dependencies are not violated for numerical correctness purposes. This asynchronous scheduling results in an out-of-order execution. The performance results of the data-driven FMM execution outperform the previous strategy and show linear speedup on a quad-socket quad-core Intel Xeon system. Copyright © 2013 John Wiley & Sons, Ltd.
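
The data-driven execution described above, in which tasks form a directed acyclic graph and a task runs once its dependencies have completed, can be sketched in a few lines. This is a sequential illustration under stated assumptions: it does not use the runtime environment named in the abstract, the function `run_data_driven` is hypothetical, and the stage names (P2M, M2M, M2L, L2L, L2P, P2P) are the conventional FMM kernel labels, used here only as placeholders.

```python
from collections import deque


def run_data_driven(tasks, deps):
    """Run tasks in an order that respects their dependencies.

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> list of prerequisite task names
    Returns the order in which tasks were executed.
    """
    remaining = {name: len(deps.get(name, [])) for name in tasks}
    dependents = {name: [] for name in tasks}
    for name, prerequisites in deps.items():
        for prerequisite in prerequisites:
            dependents[prerequisite].append(name)

    # Tasks with no unmet dependencies are ready; a real runtime would hand
    # them to worker threads instead of running them one by one.
    ready = deque(name for name, count in remaining.items() if count == 0)
    order = []
    while ready:
        name = ready.popleft()
        tasks[name]()
        order.append(name)
        for child in dependents[name]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order


log = []
stage_names = ["P2M", "M2M", "M2L", "L2L", "L2P", "P2P"]
stage_tasks = {name: (lambda n=name: log.append(n)) for name in stage_names}
stage_deps = {"M2M": ["P2M"], "M2L": ["M2M"], "L2L": ["M2L"], "L2P": ["L2L"]}
execution_order = run_data_driven(stage_tasks, stage_deps)
print(execution_order)
```

Because P2P has no prerequisites in this toy graph, it may run before the far-field chain has finished, which is the out-of-order behavior the abstract refers to.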

Journal Article•10.1016/J.PROCS.2014.05.020•

FPGA-based Acceleration of Detecting Statistical Epistasis in GWAS

Lars Wienbrandt¹, Jan Christian Kässens¹, Jorge González-Domínguez², Bertil Schmidt², David Ellinghaus¹, Manfred Schimmler¹•Institutions (2)

University of Kiel¹, University of Mainz²

1 Jan 2014

TL;DR: This work shows how to benefit from FPGA technology for highly parallel creation of contingency tables in a systolic chain with a subsequent statistical test to study genotype-by-genotype interactions (epistasis).

Abstract: Genotype-by-genotype interactions (epistasis) are believed to be a significant source of unexplained genetic variation causing complex chronic diseases but have been ignored in genome-wide association studies (GWAS) due to the computational burden of analysis. In this work we show how to benefit from FPGA technology for highly parallel creation of contingency tables in a systolic chain with a subsequent statistical test. We present the implementation for the FPGA-based hardware platform RIVYERA S6-LX150 containing 128 Xilinx Spartan6-LX150 FPGAs. For performance evaluation we compare against the method iLOCi[9]. iLOCi claims to outperform other available tools in terms of accuracy. However, analysis of a dataset from the Wellcome Trust Case Control Consortium (WTCCC) with about 500,000 SNPs and 5,000 samples still takes about 19 hours on a MacPro workstation with two Intel Xeon quad-core CPUs, while our FPGA-based implementation requires only 4 minutes.
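
To make the contingency-table step concrete, here is a minimal software sketch for a single SNP pair. It assumes genotypes are coded 0, 1 or 2 (an encoding not stated in the abstract), the function name is hypothetical, and the subsequent statistical test is omitted because the abstract does not specify it. The FPGA design builds such tables in a systolic chain; this sketch only shows what one table contains.

```python
def contingency_table(snp_a, snp_b):
    """Count genotype combinations for two SNPs over the same samples.

    snp_a, snp_b: equal-length sequences of genotype codes in {0, 1, 2}.
    Returns a 3x3 table where table[i][j] counts samples with
    genotype i at the first SNP and genotype j at the second.
    """
    if len(snp_a) != len(snp_b):
        raise ValueError("both SNPs must cover the same samples")
    table = [[0] * 3 for _ in range(3)]
    for genotype_a, genotype_b in zip(snp_a, snp_b):
        table[genotype_a][genotype_b] += 1
    return table


# Toy data: four cases and three controls (assumed values for illustration).
case_table = contingency_table([0, 1, 2, 1], [0, 1, 2, 2])
control_table = contingency_table([0, 0, 1], [0, 1, 1])
print(case_table)
print(control_table)
```

With about 500,000 SNPs there are on the order of 10^11 such pairs, which is the computational burden the abstract refers to.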

Proceedings Article•10.1109/BIGDATA.2014.7004219•

Parallel Breadth First Search on GPU clusters

Zhisong Fu, Harish Kumar Dasari¹, Bradley R. Bebee, Martin Berzins¹, Bryan B. Thompson•Institutions (1)

University of Utah¹

1 Oct 2014

TL;DR: Previous research on GPUs and on scalable graph processing on supercomputers is extended and it is demonstrated that a high-performance parallel graph machine can be created using commodity GPUs and networking hardware.

Abstract: Fast, scalable, low-cost, and low-power execution of parallel graph algorithms is important for a wide variety of commercial and public sector applications. Breadth First Search (BFS) imposes an extreme burden on memory bandwidth and network communications and has been proposed as a benchmark that may be used to evaluate current and future parallel computers. Hardware trends and manufacturing limits strongly imply that many-core devices, such as NVIDIA® GPUs and the Intel® Xeon Phi®, will become central components of such future systems. GPUs are well known to deliver the highest FLOPS/watt and enjoy a very significant memory bandwidth advantage over CPU architectures. Recent work has demonstrated that GPUs can deliver high performance for parallel graph algorithms and, further, that it is possible to encapsulate that capability in a manner that hides the low level details of the GPU architecture and the CUDA language but preserves the high throughput of the GPU. We extend previous research on GPUs and on scalable graph processing on supercomputers and demonstrate that a high-performance parallel graph machine can be created using commodity GPUs and networking hardware.

Proceedings Article•10.1109/IPDPSW.2014.194•

Training Large Scale Deep Neural Networks on the Intel Xeon Phi Many-Core Coprocessor

Lei Jin¹, Zhaokang Wang¹, Rong Gu¹, Chunfeng Yuan¹, Yihua Huang¹•Institutions (1)

Nanjing University¹

19 May 2014

TL;DR: This paper proposes a many-core algorithm, based on a parallel method and used on Intel Xeon Phi many-core systems, to speed up the unsupervised training process of Sparse Autoencoder and Restricted Boltzmann Machine, and suggests that the Intel Xeon Phi can offer an efficient but more general-purpose way to parallelize the deep learning algorithm compared to GPU.

Abstract: As a new area of machine learning research, the deep learning algorithm has attracted a lot of attention from the research community. It may bring human beings to a higher cognitive level of data. Its unsupervised pre-training step allows us to find high-dimensional representations or abstract features which work much better than the principal component analysis (PCA) method. However, it will face problems when being applied to deal with large scale data due to its intensive computation from many levels of training process against large scale data. The sequential deep learning algorithms usually can not finish the computation in an acceptable time. In this paper, we propose a many-core algorithm which is based on a parallel method and is used in the Intel Xeon Phi many-core systems to speed up the unsupervised training process of Sparse Autoencoder and Restricted Boltzmann Machine (RBM). Using the sequential training algorithm as a baseline to compare, we adopted several optimization methods to parallelize the algorithm. The experimental results show that our fully-optimized algorithm gains more than 300-fold speedup on parallelized Sparse Autoencoder compared with the original sequential algorithm on the Intel Xeon Phi coprocessor. Also, we ran the fully-optimized code on both the Intel Xeon Phi coprocessor and an expensive Intel Xeon CPU. Our method on the Intel Xeon Phi coprocessor is 7 to 10 times faster than the Intel Xeon CPU for this application. In addition to this, we compared our fully-optimized code on the Intel Xeon Phi with a Matlab code running on single Intel Xeon CPU. Our method on the Intel Xeon Phi runs 16 times faster than the Matlab implementation. The result also suggests that the Intel Xeon Phi can offer an efficient but more general-purposed way to parallelize the deep learning algorithm compared to GPU. It also achieves faster speed with better parallelism than the Intel Xeon CPU.

Journal Article•10.1016/J.MICPRO.2014.06.003•

A million-bit multiplier architecture for fully homomorphic encryption

Yarkin Doröz¹, Erdinc Ozturk², Berk Sunar¹•Institutions (2)

Worcester Polytechnic Institute¹, Istanbul Commerce University²

01 Nov 2014-Microprocessors and Microsystems

TL;DR: A novel architecture to realize a million-bit multiplication scheme based on the Schönhage-Strassen Algorithm using Number Theoretical Transform (NTT) makes use of an innovative cache architecture along with processing elements customized to match the computation and access patterns of the NTT-based recursive multiplication algorithm.
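
The principle behind NTT-based multiplication can be shown with a small software sketch. It is not the architecture from the paper: it works on decimal digits with a single well-known NTT-friendly prime (998244353, primitive root 3) instead of the Schönhage-Strassen parameters, handles non-negative integers only, and all function names are assumptions.

```python
MOD = 998244353  # prime of the form 119 * 2**23 + 1
ROOT = 3         # primitive root modulo MOD


def ntt(values, invert=False):
    """Iterative number-theoretic transform; len(values) must be a power of two."""
    a = list(values)
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes.
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u = a[k]
                v = a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length *= 2
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a


def multiply(x, y):
    """Multiply two non-negative integers via pointwise products in NTT space."""
    xs = [int(d) for d in reversed(str(x))]
    ys = [int(d) for d in reversed(str(y))]
    size = 1
    while size < len(xs) + len(ys):
        size *= 2
    fx = ntt(xs + [0] * (size - len(xs)))
    fy = ntt(ys + [0] * (size - len(ys)))
    product = ntt([p * q % MOD for p, q in zip(fx, fy)], invert=True)
    # Recombine digit-wise products; Python integers absorb the carries.
    return sum(c * 10 ** i for i, c in enumerate(product))


print(multiply(123456789, 987654321))
```

The transform turns the digit convolution into pointwise multiplications, which is what makes very long operands tractable.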

Journal Article•10.1088/1748-0221/9/04/P04005•

First evaluation of the CPU, GPGPU and MIC architectures for real time particle tracking based on Hough transform at the LHC

V. Halyo¹, Patrick LeGresley¹, Paul Lujan¹, V. Karpusenko, Andrey Vladimirov•Institutions (1)

Princeton University¹

07 Apr 2014-Journal of Instrumentation

TL;DR: In this paper, a new tracking algorithm based on the Hough transform was evaluated for the first time on multi-core Intel i7-3770 and Intel Xeon E5-2697v2 CPUs, an NVIDIA Tesla K20c GPU, and an Intel Xeon Phi 7120 coprocessor.

Abstract: Recent innovations focused around parallel processing, either through systems containing multiple processors or processors containing multiple cores, hold great promise for enhancing the performance of the trigger at the LHC and extending its physics program. The flexibility of the CMS/ATLAS trigger system allows for easy integration of computational accelerators, such as NVIDIA's Tesla Graphics Processing Unit (GPU) or Intel's Xeon Phi, in the High Level Trigger. These accelerators have the potential to provide faster or more energy efficient event selection, thus opening up possibilities for new complex triggers that were not previously feasible. At the same time, it is crucial to explore the performance limits achievable on the latest generation multicore CPUs with the use of the best software optimization methods. In this article, a new tracking algorithm based on the Hough transform will be evaluated for the first time on multi-core Intel i7-3770 and Intel Xeon E5-2697v2 CPUs, an NVIDIA Tesla K20c GPU, and an Intel Xeon Phi 7120 coprocessor. Preliminary time performance will be presented.
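
For readers unfamiliar with the Hough transform, the voting scheme can be sketched as follows. This is a generic straight-line version in the (theta, rho) parameterization, not the tracking algorithm evaluated in the article; the function name, bin sizes and toy hits are assumptions.

```python
import math
from collections import Counter


def hough_lines(points, n_theta=180, rho_step=1.0):
    """Accumulate votes for lines rho = x*cos(theta) + y*sin(theta).

    Each point votes once per theta bin; bins that collect many votes
    correspond to lines passing through many points.
    """
    votes = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(t, round(rho / rho_step))] += 1
    return votes


# Ten hits on the horizontal line y = 5 (theta = 90 degrees, rho = 5).
hits = [(float(x), 5.0) for x in range(10)]
accumulator = hough_lines(hits)
print(accumulator[(90, 5)])
```

Every point votes independently of the others, which is why the method maps naturally onto the multicore, GPU and coprocessor platforms compared in the article. Neighboring bins can tie with the true one at coarse resolution, so a practical implementation would still need a peak-finding step.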

Proceedings Article•10.1145/2568058.2568067•

High level transforms for SIMD and low-level computer vision algorithms

Lionel Lacassagne, Daniel Etiemble, Ali Hassan Zahraee, Alain Dominguez¹, Pascal Vezolle²•Institutions (2)

Intel¹, IBM²

16 Feb 2014

TL;DR: This paper presents a review of algorithmic transforms for IBM, Intel and ARM SIMD multicore processors to accelerate the implementation of low level image processing algorithms and shows that these optimizations provide a significant acceleration.

Abstract: This paper presents a review of algorithmic transforms called High Level Transforms for IBM, Intel and ARM SIMD multicore processors to accelerate the implementation of low level image processing algorithms. We show that these optimizations provide a significant acceleration. A first evaluation of 512-bit SIMD Xeon Phi is also presented. We focus on the point that the combination of optimizations leading to the best execution time cannot be predicted, and thus, systematic benchmarking is mandatory. Once the best configuration is found for each architecture, a comparison of these performances is presented. The Harris points detection operator is selected as being representative of low level image processing and computer vision algorithms. Being composed of five convolutions, it is more complex than a simple filter and enables more opportunities to combine optimizations. The presented work can scale across a wide range of codes using 2D stencils and convolutions.

Book Chapter•10.1007/978-3-319-07518-1_23•

Fast and Energy-efficient Breadth-First Search on a Single NUMA System

Yuichiro Yasui¹, Katsuki Fujisawa¹, Yukinori Sato•Institutions (1)

Kyushu University¹

22 Jun 2014

TL;DR: The computational complexity of the bottom-up step, a major bottleneck in NUMA-optimized BFS, is investigated and the relationship between vertex out-degree and bottom-up performance is clarified.

Abstract: Breadth-first search (BFS) is an important graph analysis kernel. The Graph500 benchmark measures a computer's BFS performance using the traversed edges per second (TEPS) ratio. Our previous nonuniform memory access (NUMA)-optimized BFS reduced memory accesses to remote RAM on a NUMA architecture system; its performance was 11 GTEPS (giga TEPS) on a 4-way Intel Xeon E5-4640 system. Herein, we investigated the computational complexity of the bottom-up, a major bottleneck in NUMA-optimized BFS. We clarify the relationship between vertex out-degree and bottom-up performance. In November 2013, our new implementation achieved a Graph500 benchmark performance of 37.66 GTEPS (fastest for a single node) on an SGI Altix UV1000 (one-rack) and 31.65 GTEPS (fastest for a single server) on a 4-way Intel Xeon E5-4650 system. Furthermore, we achieved the highest Green Graph500 performance of 153.17 MTEPS/W (mega TEPS per watt) on an Xperia-A SO-04E with a Qualcomm Snapdragon S4 Pro APQ8064.
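
The abstract refers to the bottom-up step without defining it. The sketch below contrasts a conventional top-down step, where frontier vertices scan their neighbors, with a bottom-up step, where unvisited vertices look for any neighbor in the frontier. It assumes an undirected graph stored as adjacency lists and is a plain sequential illustration, without the NUMA optimizations of the paper; all names are hypothetical.

```python
def top_down_step(graph, frontier, parent):
    """Each frontier vertex claims its unvisited neighbors."""
    next_frontier = []
    for u in frontier:
        for v in graph[u]:
            if parent[v] is None:
                parent[v] = u
                next_frontier.append(v)
    return next_frontier


def bottom_up_step(graph, frontier, parent):
    """Each unvisited vertex searches its neighbors for a frontier member."""
    in_frontier = set(frontier)
    next_frontier = []
    for v in range(len(graph)):
        if parent[v] is None:
            for u in graph[v]:
                if u in in_frontier:
                    parent[v] = u
                    next_frontier.append(v)
                    break  # one parent is enough, so the scan stops early
    return next_frontier


def bfs(graph, source, step):
    """Level-synchronous BFS using the given step function."""
    parent = [None] * len(graph)
    parent[source] = source
    frontier = [source]
    while frontier:
        frontier = step(graph, frontier, parent)
    return parent


def depths(parent, source):
    """Distance of every vertex from the source (graph assumed connected)."""
    result = []
    for v in range(len(parent)):
        d = 0
        while v != source:
            v = parent[v]
            d += 1
        result.append(d)
    return result


# Small undirected graph with edges 0-1, 0-2, 1-3, 2-3, 3-4.
toy_graph = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
top_down_depths = depths(bfs(toy_graph, 0, top_down_step), 0)
bottom_up_depths = depths(bfs(toy_graph, 0, bottom_up_step), 0)
print(top_down_depths, bottom_up_depths)
```

The early `break` is what ties the cost of the bottom-up step to vertex degree: a vertex with many neighbors is likely to find a parent after scanning only a few of them.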

Journal Article•10.1007/S11227-014-1262-2•

Accelerating solid–fluid interaction based on the immersed boundary method on multicore and GPU architectures

Pedro Valero-Lara

01 Nov 2014-The Journal of Supercomputing

TL;DR: This work proposes several approaches to accelerate the solid–fluid interaction through the use of the Immersed Boundary method on multicore and GPU architectures, focusing on memory management and workload mapping.

Abstract: This work proposes several approaches to accelerate the solid–fluid interaction through the use of the Immersed Boundary method on multicore and GPU architectures. Different optimizations on both architectures have been proposed, focusing on memory management and workload mapping. We have chosen two different test scenarios which consist of single-solid and multiple-solid simulations. The performance analysis has been carried out on an intensive set of test cases to analyze the proposed optimizations using multiple CPUs (2) and GPUs (4). An effective performance is obtained for single-solid executions using one CPU (Intel Xeon E5520), achieving a peak speedup of 5.5. A higher benefit is reached on multiple solids, obtaining a top speedup of approximately 5.9 and 9 using one CPU (8 cores) and two CPUs (16 cores), respectively. On the GPU (Kepler K20c) architecture, two different approaches are presented as the best alternative: one for single-solid executions and one for multiple-solid executions. The best approach obtained for single-solid executions achieves a speedup of approximately 17 with respect to the sequential counterpart. In contrast, for multiple-solid executions the benefit is much higher, this type of problem being much more suitable for the GPU, reaching a peak speedup of 68, 115 and 162 using 1, 2 and 4 GPUs, respectively.

Journal Article•10.1016/J.CPC.2014.02.014•

ppohDEM: Computational performance for open source code of the discrete element method

Daisuke Nishiura¹, Miki Matsuo¹, Hide Sakaguchi¹•Institutions (1)

Japan Agency for Marine-Earth Science and Technology¹

01 May 2014-Computer Physics Communications

TL;DR: This work uses OpenMP and MPI to parallelize DEM for efficient operation on many types of memory, including shared memory, and at any scale, from small PC clusters to supercomputers, and describes a new algorithm for the descending storage method (DSM) based on a sort technique that makes creation of contact candidate pair lists more efficient.

Journal Article•10.1002/CPE.3252•

Discovery of biological networks using an optimized partial correlation coefficient with information theory algorithm on Stampede's Xeon and Xeon Phi processors

Lars Koesterke¹, James E. Koltes², Nathan T. Weeks², Kent Milfeld¹, Matthew W. Vaughn¹, James M. Reecy², Dan Stanzione¹•Institutions (2)

University of Texas at Austin¹, Iowa State University²

10 Sep 2014-Concurrency and Computation: Practice and Experience

TL;DR: The PCIT algorithm is re‐implemented with exemplary parallel, vector, input/output (I/O), memory, and instruction optimizations for today's multi‐core and many‐core architectures.

Abstract: The partial correlation coefficient with information theory (PCIT) method is an important technique for detecting interactions between networks. The PCIT algorithm has been used in the biological context to infer complex regulatory mechanisms and interactions in genetic networks, in genome-wide association studies, and in other similar problems. In this work, the PCIT algorithm is re-implemented with exemplary parallel, vector, input/output (I/O), memory, and instruction optimizations for today's multi-core and many-core architectures. The evolution and performance of the new code target the processor architectures of the Stampede supercomputer but will also benefit other architectures. The Stampede system consists of an Intel Xeon E5 processor base system with an innovative component consisting of Intel Xeon Phi coprocessors. Optimized results and an analysis are presented for both the Xeon and the Xeon Phi. Copyright © 2014 John Wiley & Sons, Ltd.

Journal Article•10.1002/CPE.3027•

Visual exploration of data by using multidimensional scaling on multicore CPU, GPU, and MPI cluster

Piotr Pawliczek¹, Piotr Pawliczek², Witold Dzwinel¹, David A. Yuen³•Institutions (3)

AGH University of Science and Technology¹, University of Texas at Austin², University of Minnesota³

10 Mar 2014-Concurrency and Computation: Practice and Experience

TL;DR: This work demonstrates that the performance of the novel efficient parallel algorithms for MDS mapping based on virtual particle dynamics implemented in compute unified device architecture environment on a PC equipped with a modern GPU board is considerably faster than its MPI/OpenMP parallel implementation on the modern midrange professional cluster.

Abstract: Visual and interactive data exploration requires fast and reliable tools for embedding an original data space in 3(2)-dimensional Euclidean space. Multidimensional scaling (MDS) is a good candidate. However, owing to at least O(M^2) memory and time complexity, MDS is computationally demanding for interactive visualization of data sets consisting of the order of 10^4 objects on computer systems ranging from a PC with a multicore CPU and a graphics processing unit (GPU) board to midrange MPI clusters. To interactively explore data sets of that size, we have developed novel efficient parallel algorithms for MDS mapping based on virtual particle dynamics. We demonstrate that our MDS algorithms implemented in the compute unified device architecture (CUDA) environment on a PC equipped with a modern GPU board (Tesla M2090, GeForce GTX 480) are considerably faster than their MPI/OpenMP parallel implementation on a modern midrange professional cluster (10 nodes, each equipped with 2x Intel Xeon X5670 CPUs). We also show that the hybridized two-level MPI/CUDA implementation, run on a cluster of GPU nodes, can additionally provide a linear speedup. Copyright 2013 John Wiley & Sons, Ltd.
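To give a feel for the quadratic memory cost mentioned in this abstract, the following back-of-the-envelope sketch estimates the size of a pairwise dissimilarity matrix for M objects. It assumes 4-byte single-precision values, which is an assumption made here for illustration and not a detail stated in the abstract; the function names are likewise hypothetical.

```python
def dense_matrix_bytes(m, bytes_per_value=4):
    """Bytes needed for a full M x M dissimilarity matrix."""
    return m * m * bytes_per_value


def packed_triangle_bytes(m, bytes_per_value=4):
    """Bytes needed when only the strict upper triangle is stored,
    exploiting the symmetry of the dissimilarity matrix."""
    return m * (m - 1) // 2 * bytes_per_value


m = 10**4  # data set size of the order discussed in the abstract
dense_mb = dense_matrix_bytes(m) / 1e6
packed_mb = packed_triangle_bytes(m) / 1e6

print(f"dense:  {dense_mb:.1f} MB")
print(f"packed: {packed_mb:.1f} MB")
```

Because the cost is quadratic, a tenfold increase in M multiplies both figures by roughly one hundred, which is why data sets of this size are demanding for interactive use.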

Proceedings Article•10.1051/SNAMC/201404213•

GEANT4-MT : bringing multi-threading into GEANT4 production

Sunil Ahn, John Apostolakis¹, Makoto Asai², D. Brandt², Gene Cooperman³, G. Cosmo¹, Andrea Dotti², Xin Dong³, Soon Yung Jun⁴, Andrzej Nowak¹•Institutions (4)

CERN¹, SLAC National Accelerator Laboratory², Northeastern University³, Fermilab⁴

1 Jun 2014

TL;DR: The revision of GEANT4-MT for inclusion in the production-level release scheduled for end of 2013 is reported on, which has involved significant re-engineering of the prototype in order to incorporate it into the main GEANT4 development line, and the porting of GEANT4-MT threading code to additional platforms.

Abstract: GEANT4-MT is the multi-threaded version of the GEANT4 particle transport code (1, 2). The key goals for the design of GEANT4-MT have been a) to reduce the memory footprint of the multi-threaded application compared to the use of separate jobs and processes; b) to create an easy migration path for existing applications; and c) to use many threads or cores efficiently, by scaling up to tens and potentially hundreds of workers. The first public release of a GEANT4-MT prototype was made in 2011. We report on the revision of GEANT4-MT for inclusion in the production-level release scheduled for end of 2013. This has involved significant re-engineering of the prototype in order to incorporate it into the main GEANT4 development line, and the porting of GEANT4-MT threading code to additional platforms. In order to make the porting of applications as simple as possible, refinements addressed the needs of standalone applications. Further adaptations were created to improve the fit with the frameworks of High Energy Physics (HEP) experiments. We report on performance measurements on Intel Xeon™ and AMD Opteron™, and on the first trials of GEANT4-MT on the Intel Many Integrated Core (MIC) architecture, in the form of the Xeon Phi™ co-processor (3). These indicate near-linear scaling through about 200 threads on 60 cores, when holding fixed the number of events per thread.

Proceedings Article•10.1109/HPEC.2014.7040979•

High level programming of FPGAs for HPC and data centric applications

Oren Segal¹, Nasibeh Nasiri¹, Martin Margala¹, Wim Vanderbauwhede²•Institutions (2)

University of Massachusetts Lowell¹, University of Glasgow²

1 Sep 2014

TL;DR: Two high performance computing applications (Lava Molecular Dynamics and Nearest-Neighbours) and a data centric application (Document Classification) were compiled using Altera's OpenCL compiler and programmed on a Nallatech FPGA board.

Abstract: Heterogeneous computing offers a promising solution for high performance and energy efficient computing. Until recently the high performance heterogeneous computing arena was dominated by discrete GPUs, but in recent years new solutions based on devices such as APUs and FPGAs have emerged. These new solutions show promise for further improvements in energy efficiency. FPGA-based heterogeneous computing is an especially promising direction since it allows for the creation of custom hardware solutions for data centric parallel applications. One of the main issues delaying widespread adoption of FPGAs as mainstream high performance computing devices is the difficulty in programming them. Altera's OpenCL implementation for FPGAs provides a high level of abstraction and increased ease of programmability of FPGAs. Two high performance computing applications (Lava Molecular Dynamics and Nearest-Neighbours) and a data centric application (Document Classification) were compiled using Altera's OpenCL compiler and programmed on a Nallatech FPGA board. Hardware utilization, kernel execution time and total execution time are reported. Speedups of up to 5.3x, 4.3x and 1.3x over the dual Xeon processor implementations were achieved for LavaMD, Nearest-Neighbours and Document Classification, respectively.

Journal Article•10.1145/2693714.2693716•

PEACH2: An FPGA-based PCIe network device for Tightly Coupled Accelerators

Yuetsu Kodama¹, Toshihiro Hanawa², Taisuke Boku¹, Mitsuhisa Sato¹•Institutions (2)

University of Tsukuba¹, University of Tokyo²

03 Dec 2014-ACM Sigarch Computer Architecture News

TL;DR: This paper describes the design and implementation of the PEACH2 chip, with respect to its routing mechanism and its DMA controller using FPGA, and evaluates it on a new platform that uses the latest Xeon CPU, IvyBridge, achieving 2.3 GBytes/sec between GPUs over nodes.

Abstract: In recent years, heterogeneous clusters using accelerators have often been used for high performance computing systems. In such clusters, inter-node communication between accelerators requires several memory copies via CPU memory, and the communication latency incurred severely reduces performance. To solve this problem, we have been proposing a Tightly Coupled Accelerators (TCA) architecture intended to reduce the communication latency between accelerators over different nodes. In the TCA architecture, PCI Express packets are used for communication among GPUs over nodes. We developed a communication chip, which we call the PEACH2 chip, to help implement the TCA architecture. In this paper, we describe the details of the design and implementation of the PEACH2 chip, with respect to its routing mechanism and its DMA controller using FPGA. We evaluated the PEACH2 on a new platform that uses the latest Xeon CPU, IvyBridge, and achieved 2.3 GBytes/sec between GPUs over nodes, while the performance was only 880 MBytes/sec on the previous platform with SandyBridge.

Proceedings Article•10.1145/2597652.2597682•

A programming system for Xeon Phis with runtime SIMD parallelization

Xin Huo¹, Bin Ren¹, Gagan Agrawal¹•Institutions (1)

Ohio State University¹

10 Jun 2014

TL;DR: This paper considers the problem of accelerating applications involving different communication patterns on Xeon Phis, with an emphasis on effectively using available SIMD parallelism, and offers an API for both shared memory and SIMD parallelization, and demonstrates its implementation.

Abstract: The Intel Xeon Phi offers a promising solution to coprocessing, since it is based on the popular x86 instruction set. However, to fully utilize its potential, applications must be vectorized to leverage the wide SIMD lanes, in addition to effective large-scale shared memory parallelism. Compared to the SIMT execution model on GPGPUs with CUDA or OpenCL, SIMD parallelism with an SSE-like instruction set imposes many restrictions, and has generally not benefitted applications involving branches, irregular accesses, or even reductions in the past. In this paper, we consider the problem of accelerating applications involving different communication patterns on Xeon Phis, with an emphasis on effectively using available SIMD parallelism. We offer an API for both shared memory and SIMD parallelization, and demonstrate its implementation. We use implementations of overloaded functions as a mechanism for providing SIMD code, which is assisted by runtime data reordering and our methods to effectively manage control flow. Our extensive evaluation with 6 popular applications shows large gains over the SIMD parallelization achieved by the production (ICC) compiler, and we even outperform OpenMP for MIMD parallelism.

Proceedings Article•10.1109/SC.2014.22•

A unified programming model for intra- and inter-node offloading on Xeon Phi clusters

Matthias Noack¹, Florian Wende¹, Thomas Steinke¹, Frank Cordes•Institutions (1)

Zuse Institute Berlin¹

16 Nov 2014

TL;DR: The effectiveness of the HAM-Offload framework is demonstrated by using it to enable a real-world application from the field of molecular dynamics to use multiple local and remote Xeon Phis.

Abstract: Standard offload programming models for the Xeon Phi, e.g. Intel LEO and OpenMP 4.0, are restricted to a single compute node and hence a limited number of coprocessors. Scaling applications across a Xeon Phi cluster/supercomputer thus requires hybrid programming approaches, usually MPI+X. In this work, we present a framework based on heterogeneous active messages (HAM-Offload) that provides the means to offload work to local and remote (co)processors using a unified offload API. Since HAM-Offload provides primitives similar to those of current local offload frameworks, existing applications can be easily ported to overcome the single-node limitation while keeping the convenient offload programming model. We demonstrate the effectiveness of the framework by using it to enable a real-world application from the field of molecular dynamics to use multiple local and remote Xeon Phis. The evaluation shows good scaling behavior. Compared with LEO, performance is equal for large offloads and significantly better for small offloads.

Journal Article•10.1007/S00450-012-0227-Z•

Modeling power and energy of the task-parallel Cholesky factorization on multicore processors

Pedro Alonso, Manuel F. Dolz, Rafael Mayo, Enrique S. Quintana-Ortí

01 May 2014-Computer Science - Research and Development

TL;DR: This model assumes a task-parallel execution of the Cholesky factorization process, with concurrency leveraged via a run-time such as those recently proposed in projects like SMPSs, PLASMA or libflame, and decomposes the power usage into its system, static and dynamic components.

Abstract: In this paper we introduce a model for the total energy consumption of the Cholesky factorization on a multicore processor. Our model assumes a task-parallel execution of the factorization process, with concurrency leveraged via a run-time such as those recently proposed in projects like SMPSs, PLASMA or libflame, and decomposes the power usage into its system, static and dynamic components. A few simple experiments provide experimental data (parameters) with enough accuracy to assemble the model, which can then be used to estimate the actual power dissipation and energy consumption of the global algorithm. Experimental results on an 8-core platform equipped with Intel Xeon processors reveal the precision of the model.

Posted Content•10.5194/GMDD-7-8941-2014•

Intel Xeon Phi accelerated Weather Research and Forecasting (WRF) Goddard microphysics scheme

Jarno Mielikainen¹, B. Huang¹, A. H.-L. Huang¹•Institutions (1)

University of Wisconsin-Madison¹

12 Dec 2014-Geoscientific Model Development Discussions

TL;DR: The results show that the optimizations improved the performance of the Goddard microphysics scheme on Xeon Phi 7120P by a factor of 4.7× and reduced the scheme's share of the total WRF processing time from 20.0 to 7.5%.

Abstract: The Weather Research and Forecasting (WRF) model is a numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. WRF development is done in collaboration around the globe. Furthermore, the WRF is used by academic atmospheric scientists, weather forecasters at the operational centers and so on. The WRF contains several physics components. The most time consuming one is the microphysics. One microphysics scheme is the Goddard cloud microphysics scheme. It is a sophisticated cloud microphysics scheme in the WRF model. The Goddard microphysics scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Goddard scheme incorporates a large number of improvements. Thus, we have optimized the Goddard scheme code. In this paper, we present our results of optimizing the Goddard microphysics scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The Intel MIC is capable of executing a full operating system and entire programs rather than just kernels as the GPU does. The MIC coprocessor supports all important Intel development tools. Thus, the development environment is one familiar to a vast number of CPU developers. However, getting maximum performance out of MICs will require using some novel optimization techniques. Those optimization techniques are discussed in this paper. The results show that the optimizations improved performance of the Goddard microphysics scheme on Xeon Phi 7120P by a factor of 4.7×. In addition, the optimizations reduced the Goddard microphysics scheme's share of the total WRF processing time from 20.0 to 7.5%. Furthermore, the same optimizations improved performance on Intel Xeon E5-2670 by a factor of 2.8× compared to the original code.

Proceedings Article•10.1109/DS-RT.2014.19•

Efficient Neighbor Searching for Agent-Based Simulation on GPU

Xiaosong Li¹, Wentong Cai¹, Stephen John Turner¹•Institutions (1)

Nanyang Technological University¹

1 Oct 2014

TL;DR: An enhanced neighbor sharing strategy is introduced that greatly speeds up this procedure compared with CPU implementations; it is developed from a global-memory-only implementation and then gradually improved by efficiently utilizing the much faster shared memory.

Abstract: This paper introduces a strategy to accelerate neighbor searching in agent-based simulations on GPU platforms. Because of their autonomous nature, agents can be processed by threads concurrently on the GPU, and the overall simulation can be accelerated as a consequence. Each agent simultaneously carries out a sense-think-act cycle in every time step. Neighbor searching is a crucial part of the sensing stage. Detecting and accessing neighbors is a memory intensive task and often becomes the major time consumer in an agent-based simulation. Our contribution, an enhanced neighbor sharing strategy, greatly speeds up this procedure compared with CPU implementations. The strategy is developed from a global-memory-only implementation, and then gradually improved by efficiently utilizing the much faster shared memory. In our case studies, speedups of 89.08 and 11.51 are obtained on an NVIDIA Tesla K20 GPU compared with the sequential implementation and the OpenMP parallel implementation, respectively, on an Intel Xeon E5-2670 CPU.
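For readers unfamiliar with the sensing stage described in this abstract, the sketch below shows a generic uniform-grid (cell list) neighbor search in 2D. It is a sequential CPU illustration only and is not the paper's GPU neighbor sharing strategy; the function names, the 2D setting, and the choice of cell size equal to the sensing radius are all assumptions made here.

```python
import math
from collections import defaultdict
from itertools import product


def build_grid(positions, radius):
    """Bin each agent index into the square cell (side = radius) containing it."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        grid[(math.floor(x / radius), math.floor(y / radius))].append(i)
    return grid


def find_neighbors(i, positions, grid, radius):
    """Return sorted indices of agents within `radius` of agent i, excluding i.

    Because the cell side equals the radius, every neighbor must lie in the
    3x3 block of cells centred on the agent's own cell.
    """
    x, y = positions[i]
    cx, cy = math.floor(x / radius), math.floor(y / radius)
    found = []
    for dx, dy in product((-1, 0, 1), repeat=2):
        for j in grid.get((cx + dx, cy + dy), ()):
            if j != i and math.dist(positions[i], positions[j]) <= radius:
                found.append(j)
    return sorted(found)


# Four hypothetical agents; agent 2 is far away from the others.
positions = [(0.1, 0.1), (0.4, 0.2), (2.5, 2.5), (0.9, 0.9)]
grid = build_grid(positions, 1.0)
print(find_neighbors(0, positions, grid, 1.0))
```

Each agent scans only nearby cells instead of all other agents, which is the usual motivation for spatial binning before any GPU-specific memory optimization is applied.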
