TL;DR: This paper ports and optimizes the high-order multi-block structured CFD software HOSTA on the GPU-accelerated TianHe-1A supercomputer, proposes a gather/scatter optimization to minimize PCI-e data transfer times for ghost and singularity data of 3D grid blocks, and overlaps the collaborative computation and communication as far as possible using some advanced CUDA and MPI features.
TL;DR: This work presents a shared-memory implementation of the Gauss-Seidel smoother on Xeon Phi that balances parallelism, data access locality, CG convergence rate, and communication overhead, and demonstrates that the optimizations benefit not only the original HPCG dataset, which is based on a structured 3D grid, but also a wide range of unstructured matrices.
Abstract: A new sparse high performance conjugate gradient benchmark (HPCG) has been recently released to address challenges in the design of sparse linear solvers for the next generation extreme-scale computing systems. Key computation, data access, and communication pattern in HPCG represent building blocks commonly found in today's HPC applications. While it is a well known challenge to efficiently parallelize Gauss-Seidel smoother, the most time-consuming kernel in HPCG, our algorithmic and architecture-aware optimizations deliver 95% and 68% of the achievable bandwidth on Xeon and Xeon Phi, respectively. Based on available parallelism, our Xeon Phi shared-memory implementation of Gauss-Seidel smoother selectively applies block multi-color reordering. Combined with MPI parallelization, our implementation balances parallelism, data access locality, CG convergence rate, and communication overhead. Our implementation achieved 580 TFLOPS (82% parallelization efficiency) on Tianhe-2 system, ranking first on the most recent HPCG list in July 2014. In addition, we demonstrate that our optimizations not only benefit HPCG original dataset, which is based on structured 3D grid, but also a wide range of unstructured matrices.
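To illustrate the multi-color reordering idea mentioned in the abstract above, here is a minimal sketch of a two-color (red-black) Gauss-Seidel sweep on a 1D Poisson problem. It is not the authors' Xeon Phi implementation: the function name, the 1D model problem and the choice of two colors are assumptions made for illustration. HPCG itself uses a 27-point stencil on a 3D grid, which needs more than two colors.

```python
def red_black_gauss_seidel(u, f, h, sweeps):
    """Red-black Gauss-Seidel for -u'' = f on a 1D grid with fixed end values.

    Points of one color depend only on points of the other color, so all
    updates within a color are independent and could run in parallel.
    """
    n = len(u)
    for _ in range(sweeps):
        for start in (1, 2):  # odd interior points first, then even ones
            for i in range(start, n - 1, 2):
                u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u


# Usage: with f = 0 and boundary values 0 and 1 the solution is a straight line.
n = 9
u = [0.0] * n
u[-1] = 1.0
red_black_gauss_seidel(u, [0.0] * n, 1.0 / (n - 1), 200)
```

Coloring exposes parallelism at the cost of changing the update order, which can affect the convergence rate. That is the trade-off the abstract refers to when it says the reordering is applied selectively.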
TL;DR: This work ports a highly detailed ION cell network model, originally coded in Matlab, onto an FPGA chip; the model was translated to HLS C code for the Xilinx Vivado toolflow, and various algorithmic and arithmetic optimizations were applied.
Abstract: The Inferior-Olivary nucleus (ION) is a well-charted region of the brain, heavily associated with sensorimotor control of the body. It comprises ION cells with unique properties which facilitate sensory processing and motor-learning skills. Various simulation models of ION-cell networks have been written in an attempt to unravel their mysteries. However, simulations become rapidly intractable when biophysically plausible models and meaningful network sizes (>=100 cells) are modeled. To overcome this problem, in this work we port a highly detailed ION cell network model, originally coded in Matlab, onto an FPGA chip. It was first converted to ANSI C code and extensively profiled. It was then translated to HLS C code for the Xilinx Vivado toolflow and various algorithmic and arithmetic optimizations were applied. The design was implemented in a Virtex 7 (XC7VX485T) device and can simulate a 96-cell network at real-time speed, yielding a speedup of x700 compared to the original Matlab code and x12.5 compared to the reference C implementation running on an Intel Xeon 2.66GHz machine with 20GB RAM. For a 1,056-cell network (non-real-time), an FPGA speedup of x45 against the C code can be achieved, demonstrating the design's usefulness in accelerating neuroscience research. Limited by the available on-chip memory, the FPGA can maximally support a 14,400-cell network (non-real-time) with online parameter configurability for cell state and network size. The maximum throughput of the FPGA ION-network accelerator can reach 2.13 GFLOPS.
TL;DR: This paper presents a hybrid algorithm for the petascale global simulation of atmospheric dynamics on Tianhe-2, the world's current top-ranked supercomputer developed by China's National University of Defense Technology, that enables flexible domain partition between an arbitrary number of processors and accelerators.
Abstract: This paper presents a hybrid algorithm for the petascale global simulation of atmospheric dynamics on Tianhe-2, the world's current top-ranked supercomputer developed by China's National University of Defense Technology (NUDT). Tianhe-2 is equipped with both Intel Xeon CPUs and Intel Xeon Phi accelerators. A key idea of the hybrid algorithm is to enable flexible domain partition between an arbitrary number of processors and accelerators, so as to achieve a balanced and efficient utilization of the entire system. We also present an asynchronous and concurrent data transfer scheme to reduce the communication overhead between CPU and accelerators. The acceleration of our global atmospheric model is conducted to improve the use of the Intel MIC architecture. For the single-node test on Tianhe-2 against two Intel Ivy Bridge CPUs (24 cores), we can achieve 2.07x, 3.18x, and 4.35x speedups when using one, two, and three Intel Xeon Phi accelerators respectively. The average performance gain from SIMD vectorization on the Intel Xeon Phi processors is around 5x (out of the 8x theoretical case). Based on successful computation-communication overlapping, large-scale tests indicate that a nearly ideal weak-scaling efficiency of 93.5% is obtained when we gradually increase the number of nodes from 6 to 8,664 (nearly 1.7 million cores). In the strong-scaling test, the parallel efficiency is about 77% when the number of nodes increases from 1,536 to 8,664 for a fixed 65,664 × 65,664 × 6 mesh with 77.6 billion unknowns.
TL;DR: The next-generation enterprise Xeon® server processor has 15 dual-threaded 64b Ivybridge cores and 37.5MB shared L3 cache and CMOS muxes embedded in the ring bus are programmably operable in a 2-or-3-columns configuration.
Abstract: The next-generation enterprise Xeon® server processor has 15 dual-threaded 64b Ivybridge cores [1] and 37.5MB shared L3 cache. The system interface includes two on-chip memory controllers, each with two memory channels, and supports multiple system topologies. The processor has 4.31B transistors in a high-κ metal-gate tri-gate 22nm CMOS technology with 9 metal layers [2]. The design supports a wide array of product offerings with thermal design power ranging from 40 to 150W and frequencies ranging from 1.4 to 3.8GHz. Fig. 5.4.1(a) shows the processor block diagram. The floorplan (Fig. 5.4.1(b)) is driven by the ring bus routability and latency, as well as the chop requirements to smaller core counts. The cores and associated L3 cache are organized in columns of five, with the ring bus segment embedded. The fully populated die has 15 cores in three columns. The 10-core chop removes the rightmost 3rd column and its dedicated top and bottom IOs. CMOS muxes embedded in the ring bus are programmably operable in a 2-or-3-columns configuration. The 6-core chop removes the 2nd and 4th rows from the 10-core die.
TL;DR: A multi-threaded version of Field II has been developed, which can automatically use the multi-core capabilities of modern CPUs and is fully compatible with older versions; only a single command has been added for setting the number of threads to use.
Abstract: A multi-threaded version of Field II has been developed, which can automatically use the multi-core capabilities of modern CPUs. The memory allocation routines were rewritten to minimize the number of dynamic allocations and to make pre-allocations possible for each thread. This ensures that the simulation job can be automatically partitioned and the interdependence between threads minimized. The new code has been compared to Field II version 3.22, October 27, 2013 (latest free-ware version). A 64-element 5 MHz focused array transducer was simulated. One million point scatterers randomly distributed in a plane of 20 × 50 mm (width × depth) with random Gaussian amplitudes were simulated using the command calc_scat. Dual Intel Xeon E5-2630 2.60 GHz CPUs were used under Ubuntu Linux 10.02 and Matlab version 2013b. Each CPU holds 6 cores with hyper-threading, corresponding to a total of 24 hyper-threading cores. The averaged simulation time for 10 realizations for the old version was 85.1 s. A single-thread run for the new version took 27.7 s, a speed-up of 3.1. Employing all 24 cores gave a simulation time of 3.27 s for the one million scatterers, corresponding to a speed-up factor of 26. The speed-up in general depends on the transducer, scatterers and simulation, and it varies across applications between 13 and 30. The program is fully compatible with older versions, and only a single command has been added for setting the number of threads to use. The division of labor is automatically handled by the program. For a phantom with 100,000 scatterers, it is now possible to simulate a full 128 line image in around 42 seconds with full precision.
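The speed-up figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the three timings reported above; the helper name `speedup` and the derived threading-only ratio are additions for illustration and are not stated in the abstract.

```python
def speedup(baseline_s, new_s):
    """Ratio of the baseline run time to the new run time."""
    return baseline_s / new_s


old_version_s = 85.1      # Field II 3.22, average of 10 realizations
new_one_thread_s = 27.7   # new version, single thread
new_all_threads_s = 3.27  # new version, all 24 hyper-threading cores

single_thread_gain = speedup(old_version_s, new_one_thread_s)
total_gain = speedup(old_version_s, new_all_threads_s)
# Derived here, not reported in the abstract: gain from threading alone.
threading_gain = speedup(new_one_thread_s, new_all_threads_s)

print(f"{single_thread_gain:.2f} {total_gain:.1f} {threading_gain:.2f}")
```

Separating the two factors shows that roughly a factor of 3 comes from the rewritten single-thread code and the remainder from multi-threading.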
TL;DR: The designs, implementations and experiments show the feasibility of building a high-performance OLAP system for processing large-scale taxi trip data for real-time, interactive data explorations and open paths to achieving even higher OLAP query efficiency for large-scale applications.
TL;DR: The implementation and performance of the highly accurate CCSD(T) quantum chemistry method on the Intel® Xeon Phi coprocessor within the context of the NWChem computational chemistry package is presented.
Abstract: This paper presents the implementation and performance of the highly accurate CCSD(T) quantum chemistry method on the Intel® Xeon Phi™ coprocessor within the context of the NWChem computational chemistry package. The widespread use of highly correlated methods in electronic structure calculations is contingent upon the interplay between advances in theory and the possibility of utilizing the ever-growing computer power of emerging heterogeneous architectures. We discuss the design decisions of our implementation as well as the optimizations applied to the compute kernels and data transfers between host and coprocessor. We show the feasibility of adopting the Intel® Many Integrated Core Architecture and the Intel Xeon Phi coprocessor for developing efficient computational chemistry modeling tools. Remarkable scalability is demonstrated by benchmarks. Our solution scales up to a total of 62560 cores with the concurrent utilization of Intel® Xeon® processors and Intel Xeon Phi coprocessors.
TL;DR: Numerical tests against several one- and two-dimensional compressible two-phase flow problems with high density and high pressure ratios demonstrate that the application of an HLLC-type approximate Riemann solver in conjunction with the third-order TVD Runge–Kutta method is accurate and robust.
TL;DR: A novel, low-overhead monitoring infrastructure capable of tracking in detail and in real time the thermal and power characteristics of Eurora's components with fine-grained resolution is presented.
Abstract: Eurora (EURopean many integrated cORe Architecture) is today the most energy efficient supercomputer in the world. Ranked 1st in the Green500 in July 2013, it is a prototype built by Eurotech and Cineca toward next-generation Tier-0 systems in the PRACE 2IP EU project. Eurora's outstanding energy-efficiency is achieved by adopting a direct liquid cooling solution and a heterogeneous architecture with best-in-class general purpose HW components (Intel Xeon E5, Intel Xeon Phi and NVIDIA Kepler K20). In this paper we present a novel, low-overhead monitoring infrastructure capable of tracking in detail and in real time the thermal and power characteristics of Eurora's components with fine-grained resolution. Our experiments give insights into Eurora's thermal/power trade-offs and highlight opportunities for run-time power/thermal management and optimization.
TL;DR: The authors discuss in the paper another approach based on data‐driven execution to efficiently tackle this challenging load balancing problem, which consists of breaking the most time‐consuming stages of the FMMs into smaller tasks.
TL;DR: This work shows how to benefit from FPGA technology for highly parallel creation of contingency tables in a systolic chain with a subsequent statistical test to study genotype-by-genotype interactions (epistasis).
Abstract: Genotype-by-genotype interactions (epistasis) are believed to be a significant source of unexplained genetic variation causing complex chronic diseases but have been ignored in genome-wide association studies (GWAS) due to the computational burden of analysis. In this work we show how to benefit from FPGA technology for highly parallel creation of contingency tables in a systolic chain with a subsequent statistical test. We present the implementation for the FPGA-based hardware platform RIVYERA S6-LX150 containing 128 Xilinx Spartan6-LX150 FPGAs. For performance evaluation we compare against the method iLOCi[9]. iLOCi claims to outperform other available tools in terms of accuracy. However, analysis of a dataset from the Wellcome Trust Case Control Consortium (WTCCC) with about 500,000 SNPs and 5,000 samples still takes about 19 hours on a MacPro workstation with two Intel Xeon quad-core CPUs, while our FPGA-based implementation requires only 4 minutes.
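For readers unfamiliar with the data structure at the heart of this abstract, the following sketch builds the contingency table for one SNP pair in software. It assumes genotypes coded as 0, 1 or 2 and a boolean case/control flag per sample; the function name and encoding are illustrative assumptions and say nothing about how the FPGA systolic chain is organized. The subsequent statistical test is not shown.

```python
def pair_contingency_table(geno_a, geno_b, is_case):
    """Count samples per (status, genotype A, genotype B) cell.

    Genotypes are assumed to be coded 0, 1, 2; is_case holds booleans.
    Returns table[status][a][b] with status 0 = control and 1 = case.
    """
    table = [[[0] * 3 for _ in range(3)] for _ in range(2)]
    for a, b, case in zip(geno_a, geno_b, is_case):
        table[1 if case else 0][a][b] += 1
    return table


# Usage with four hypothetical samples.
table = pair_contingency_table([0, 1, 2, 1], [0, 1, 2, 1],
                               [True, False, True, False])
```

An exhaustive pairwise analysis needs one such table for every SNP pair. With about 500,000 SNPs that is on the order of 10^11 tables, which is where the computational burden comes from.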
TL;DR: Previous research on GPUs and on scalable graph processing on supercomputers is extended and it is demonstrated that a high-performance parallel graph machine can be created using commodity GPUs and networking hardware.
Abstract: Fast, scalable, low-cost, and low-power execution of parallel graph algorithms is important for a wide variety of commercial and public sector applications. Breadth First Search (BFS) imposes an extreme burden on memory bandwidth and network communications and has been proposed as a benchmark that may be used to evaluate current and future parallel computers. Hardware trends and manufacturing limits strongly imply that many-core devices, such as NVIDIA® GPUs and the Intel® Xeon Phi®, will become central components of such future systems. GPUs are well known to deliver the highest FLOPS/watt and enjoy a very significant memory bandwidth advantage over CPU architectures. Recent work has demonstrated that GPUs can deliver high performance for parallel graph algorithms and, further, that it is possible to encapsulate that capability in a manner that hides the low level details of the GPU architecture and the CUDA language but preserves the high throughput of the GPU. We extend previous research on GPUs and on scalable graph processing on supercomputers and demonstrate that a high-performance parallel graph machine can be created using commodity GPUs and networking hardware.
TL;DR: A many-core algorithm which is based on a parallel method and is used in the Intel Xeon Phi many-core systems to speed up the unsupervised training process of Sparse Autoencoder and Restricted Boltzmann Machine and suggests that the Intel Xeon Phi can offer an efficient but more general-purposed way to parallelize the deep learning algorithm compared to GPU.
Abstract: As a new area of machine learning research, the deep learning algorithm has attracted a lot of attention from the research community. It may bring human beings to a higher cognitive level of data. Its unsupervised pre-training step allows us to find high-dimensional representations or abstract features which work much better than the principal component analysis (PCA) method. However, it will face problems when being applied to deal with large scale data due to its intensive computation from many levels of training process against large scale data. The sequential deep learning algorithms usually can not finish the computation in an acceptable time. In this paper, we propose a many-core algorithm which is based on a parallel method and is used in the Intel Xeon Phi many-core systems to speed up the unsupervised training process of Sparse Autoencoder and Restricted Boltzmann Machine (RBM). Using the sequential training algorithm as a baseline to compare, we adopted several optimization methods to parallelize the algorithm. The experimental results show that our fully-optimized algorithm gains more than 300-fold speedup on parallelized Sparse Autoencoder compared with the original sequential algorithm on the Intel Xeon Phi coprocessor. Also, we ran the fully-optimized code on both the Intel Xeon Phi coprocessor and an expensive Intel Xeon CPU. Our method on the Intel Xeon Phi coprocessor is 7 to 10 times faster than the Intel Xeon CPU for this application. In addition to this, we compared our fully-optimized code on the Intel Xeon Phi with a Matlab code running on single Intel Xeon CPU. Our method on the Intel Xeon Phi runs 16 times faster than the Matlab implementation. The result also suggests that the Intel Xeon Phi can offer an efficient but more general-purposed way to parallelize the deep learning algorithm compared to GPU. It also achieves faster speed with better parallelism than the Intel Xeon CPU.
TL;DR: A novel architecture to realize a million-bit multiplication scheme based on the Schönhage-Strassen Algorithm using Number Theoretical Transform (NTT) makes use of an innovative cache architecture along with processing elements customized to match the computation and access patterns of the NTT-based recursive multiplication algorithm.
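As background for the NTT-based multiplication mentioned above, here is a minimal software sketch of multiplying two integers through a number theoretic transform. It is a simplified stand-in, not the architecture described in the paper: it uses a single-level transform over the prime field modulo 998244353 (primitive root 3) and base-10 digits, whereas the Schönhage-Strassen algorithm works recursively in rings modulo 2^n + 1. All function names are illustrative.

```python
MOD = 998244353  # prime equal to 119 * 2**23 + 1
ROOT = 3         # primitive root modulo MOD


def ntt(a, invert=False):
    """In-place iterative radix-2 NTT; len(a) must be a power of two."""
    n = len(a)
    j = 0
    for i in range(1, n):  # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:  # butterfly passes
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u, v = a[k], a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD
    return a


def multiply(x, y):
    """Multiply two non-negative integers by convolving their decimal digits."""
    a = [int(d) for d in reversed(str(x))]
    b = [int(d) for d in reversed(str(y))]
    size = 1
    while size < len(a) + len(b):
        size <<= 1
    a += [0] * (size - len(a))
    b += [0] * (size - len(b))
    ntt(a)
    ntt(b)
    c = ntt([p * q % MOD for p, q in zip(a, b)], invert=True)
    result = 0
    for coeff in reversed(c):  # evaluate the product polynomial at 10
        result = result * 10 + coeff
    return result
```

The butterfly passes access memory with strides that double at every stage. This regular but non-local access pattern is the kind of behavior a customized cache architecture would target.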
TL;DR: In this paper, a new tracking algorithm based on the Hough transform was evaluated for the first time on multi-core Intel i7-3770 and Intel Xeon E5-2697v2 CPUs, an NVIDIA Tesla K20c GPU, and an Intel Xeon Phi 7120 coprocessor.
Abstract: Recent innovations focused around parallel processing, either through systems containing multiple processors or processors containing multiple cores, hold great promise for enhancing the performance of the trigger at the LHC and extending its physics program. The flexibility of the CMS/ATLAS trigger system allows for easy integration of computational accelerators, such as NVIDIA's Tesla Graphics Processing Unit (GPU) or Intel's Xeon Phi, in the High Level Trigger. These accelerators have the potential to provide faster or more energy efficient event selection, thus opening up possibilities for new complex triggers that were not previously feasible. At the same time, it is crucial to explore the performance limits achievable on the latest generation multicore CPUs with the use of the best software optimization methods. In this article, a new tracking algorithm based on the Hough transform will be evaluated for the first time on multi-core Intel i7-3770 and Intel Xeon E5-2697v2 CPUs, an NVIDIA Tesla K20c GPU, and an Intel Xeon Phi 7120 coprocessor. Preliminary time performance will be presented.
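To make the Hough transform idea concrete, the sketch below votes 2D hits into a (theta, rho) accumulator for straight lines. It is a generic textbook formulation under assumed parameters (bin counts, `rho_max`, function name), not the tracking algorithm evaluated in the article, whose track parametrization is not given in the abstract.

```python
import math


def hough_accumulate(points, n_theta=180, n_rho=200, rho_max=10.0):
    """Vote each (x, y) hit into a (theta, rho) accumulator for straight lines.

    A line is parametrized as rho = x*cos(theta) + y*sin(theta). Hits lying on
    a common line vote into the same cell, so peaks mark track candidates.
    """
    acc = [[0] * n_rho for _ in range(n_theta)]
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            r = int((rho + rho_max) / (2.0 * rho_max) * n_rho)
            if 0 <= r < n_rho:
                acc[t][r] += 1
    return acc


# Usage: four collinear hits produce a cell that collects all four votes.
hits = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
acc = hough_accumulate(hits)
peak = max(max(row) for row in acc)
```

Every hit votes independently of all others, which is what makes the method a natural fit for multicore CPUs, GPUs and coprocessors.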
TL;DR: This paper presents a review of algorithmic transforms for IBM, Intel and ARM SIMD multicore processors to accelerate the implementation of low level image processing algorithms and shows that these optimizations provide a significant acceleration.
Abstract: This paper presents a review of algorithmic transforms called High Level Transforms for IBM, Intel and ARM SIMD multicore processors to accelerate the implementation of low level image processing algorithms. We show that these optimizations provide a significant acceleration. A first evaluation of 512-bit SIMD Xeon Phi is also presented. We focus on the point that the combination of optimizations leading to the best execution time cannot be predicted, and thus, systematic benchmarking is mandatory. Once the best configuration is found for each architecture, a comparison of these performances is presented. The Harris points detection operator is selected as being representative of low level image processing and computer vision algorithms. Being composed of five convolutions, it is more complex than a simple filter and enables more opportunities to combine optimizations. The presented work can scale across a wide range of codes using 2D stencils and convolutions.
TL;DR: The computational complexity of the bottom-up approach, a major bottleneck in NUMA-optimized BFS, is investigated and the relationship between vertex out-degree and bottom-up performance is clarified.
Abstract: Breadth-first search (BFS) is an important graph analysis kernel. The Graph500 benchmark measures a computer's BFS performance using the traversed edges per second (TEPS) ratio. Our previous nonuniform memory access (NUMA)-optimized BFS reduced memory accesses to remote RAM on a NUMA architecture system; its performance was 11 GTEPS (giga TEPS) on a 4-way Intel Xeon E5-4640 system. Herein, we investigated the computational complexity of the bottom-up approach, a major bottleneck in NUMA-optimized BFS. We clarify the relationship between vertex out-degree and bottom-up performance. In November 2013, our new implementation achieved a Graph500 benchmark performance of 37.66 GTEPS (fastest for a single node) on an SGI Altix UV1000 (one-rack) and 31.65 GTEPS (fastest for a single server) on a 4-way Intel Xeon E5-4650 system. Furthermore, we achieved the highest Green Graph500 performance of 153.17 MTEPS/W (mega TEPS per watt) on an Xperia-A SO-04E with a Qualcomm Snapdragon S4 Pro APQ8064.
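The bottom-up step discussed in this abstract can be sketched as part of a level-synchronous BFS that switches direction depending on frontier size. This is a simplified illustration on an undirected adjacency list: the switching rule, the threshold parameter and the function name are assumptions, and none of the NUMA-specific optimizations are represented.

```python
def bfs_levels(adj, source, bottom_up_fraction=0.05):
    """Level-synchronous BFS mixing top-down and bottom-up steps.

    adj is an adjacency list of an undirected graph. A step runs bottom-up
    when the frontier holds more than bottom_up_fraction of all vertices
    (a simplified, hypothetical switching rule).
    """
    n = len(adj)
    level = [-1] * n
    level[source] = 0
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        next_frontier = set()
        if len(frontier) > bottom_up_fraction * n:
            # Bottom-up: every unvisited vertex searches for a parent.
            for v in range(n):
                if level[v] == -1:
                    for u in adj[v]:
                        if u in frontier:
                            level[v] = depth
                            next_frontier.add(v)
                            break  # first parent found, skip remaining edges
        else:
            # Top-down: every frontier vertex expands its neighbors.
            for u in frontier:
                for v in adj[u]:
                    if level[v] == -1:
                        level[v] = depth
                        next_frontier.add(v)
        frontier = next_frontier
    return level


# Usage on a small hypothetical graph.
graph = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
levels = bfs_levels(graph, 0)
```

In the bottom-up branch the inner loop ends as soon as one parent is found, so the number of edges scanned per vertex depends on its degree and on how soon a frontier neighbor appears in its list.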
TL;DR: This work proposes several approaches to accelerate the solid–fluid interaction through the use of the Immersed Boundary method on multicore and GPU architectures, focusing on memory management and workload mapping.
Abstract: This work proposes several approaches to accelerate the solid-fluid interaction through the use of the Immersed Boundary method on multicore and GPU architectures. Different optimizations on both architectures have been proposed, focusing on memory management and workload mapping. We have chosen two different test scenarios which consist of single-solid and multiple-solid simulations. The performance analysis has been carried out on an intensive set of test cases to analyze the proposed optimizations using multiple CPUs (2) and GPUs (4). An effective performance is obtained for single-solid executions using one CPU (Intel Xeon E5520), achieving a peak speedup of 5.5. A higher benefit is reached on multiple solids, with top speedups of approximately 5.9 and 9 using one CPU (8 cores) and two CPUs (16 cores), respectively. On the GPU (Kepler K20c) architecture, two different approaches are presented as the best alternative: one for single-solid executions and one for multiple-solid executions. The best approach for single-solid executions achieves a speedup of approximately 17 with respect to the sequential counterpart. In contrast, for multiple-solid executions the benefit is much higher, as this type of problem is much more suitable for GPUs, reaching peak speedups of 68, 115 and 162 using 1, 2 and 4 GPUs, respectively.
TL;DR: This work uses OpenMP and MPI to parallelize DEM for efficient operation on many types of memory, including shared memory, and at any scale, from small PC clusters to supercomputers, and describes a new algorithm for the descending storage method (DSM) based on a sort technique that makes creation of contact candidate pair lists more efficient.
TL;DR: The PCIT algorithm is re‐implemented with exemplary parallel, vector, input/output (I/O), memory, and instruction optimizations for today's multi‐core and many‐core architectures.
TL;DR: This work demonstrates that the performance of the novel efficient parallel algorithms for MDS mapping based on virtual particle dynamics implemented in compute unified device architecture environment on a PC equipped with a modern GPU board is considerably faster than its MPI/OpenMP parallel implementation on the modern midrange professional cluster.
Abstract: Visual and interactive data exploration requires fast and reliable tools for embedding of an original data space in 3(2)-dimensional Euclidean space. Multidimensional scaling (MDS) is a good candidate. However, owing to at least O(M^2) memory and time complexity, MDS is computationally demanding for interactive visualization of data sets consisting of order of 10^4 objects on computer systems, ranging from PC with multicore CPU processor, graphics processing unit (GPU) board to midrange MPI clusters. To explore interactively data sets of that size, we have developed novel efficient parallel algorithms for MDS mapping based on virtual particle dynamics. We demonstrate that the performance of our MDS algorithms implemented in compute unified device architecture environment on a PC equipped with a modern GPU board (Tesla M2090, GeForce GTX 480) is considerably faster than its MPI/OpenMP parallel implementation on the modern midrange professional cluster (10 nodes, each equipped with 2x Intel Xeon X5670 CPUs). We also show that the hybridized two-level MPI/CUDA implementation, run on a cluster of GPU nodes, can additionally provide a linear speedup. Copyright 2013 John Wiley & Sons, Ltd.
TL;DR: The revision of GEANT4-MT for inclusion in the production-level release scheduled for end of 2013 is reported on, which has involved significant re-engineering of the prototype in order to incorporate it into the main GEANT4 development line, and the porting of GEANT4-MT threading code to additional platforms.
Abstract: GEANT4-MT is the multi-threaded version of the GEANT4 particle transport code.(1, 2) The key goals for the design of GEANT4-MT have been a) to reduce the memory footprint of the multi-threaded application compared to the use of separate jobs and processes; b) to create an easy migration of the existing applications; and c) to use efficiently many threads or cores, by scaling up to tens and potentially hundreds of workers. The first public release of a GEANT4-MT prototype was made in 2011. We report on the revision of GEANT4-MT for inclusion in the production-level release scheduled for end of 2013. This has involved significant re-engineering of the prototype in order to incorporate it into the main GEANT4 development line, and the porting of GEANT4-MT threading code to additional platforms. In order to make the porting of applications as simple as possible, refinements addressed the needs of standalone applications. Further adaptations were created to improve the fit with the frameworks of High Energy Physics (HEP) experiments. We report on performance measurements on Intel Xeon™ and AMD Opteron™, and on the first trials of GEANT4-MT on the Intel Many Integrated Cores (MIC) architecture, in the form of the Xeon Phi™ co-processor.(3) These indicate near-linear scaling through about 200 threads on 60 cores, when holding fixed the number of events per thread.
TL;DR: Two high performance computing applications (Lava Molecular Dynamics and Nearest-Neighbours) and a data centric application (Document Classification) were compiled using Altera's OpenCL compiler and programmed on a Nallatech FPGA board.
Abstract: Heterogeneous computing offers a promising solution for high performance and energy efficient computing. Until recently the high performance heterogeneous computing arena was dominated by discrete GPUs but in recent years, new solutions based on devices such as APUs and FPGAs have emerged. These new solutions show promise for further improvements in energy efficiency. FPGA based heterogeneous computing is an especially promising direction since it allows for the creation of custom hardware solutions for data centric parallel applications. One of the main issues delaying widespread adoption of FPGAs as mainstream high performance computing devices is the difficulty in programming them. Altera's OpenCL implementation for FPGAs provides a high level of abstraction and increased ease of programmability of FPGAs. Two high performance computing applications (Lava Molecular Dynamics and Nearest-Neighbours) and a data centric application (Document Classification) were compiled using Altera's OpenCL compiler and programmed on a Nallatech FPGA board. Hardware utilization, kernel execution time and total execution time are reported. Speed-ups of up to 5.3x, 4.3x and 1.3x over the Dual Xeon processor implementations were achieved for LavaMD, Nearest-Neighbours and Document Classification, respectively.
TL;DR: This paper describes the design and implementation of the PEACH2 chip, with respect to its routing mechanism and its DMA controller using FPGA, and evaluates it on a new platform that uses the latest Xeon CPU, IvyBridge, achieving 2.3 GBytes/sec between GPUs over nodes.
Abstract: In recent years, heterogeneous clusters using accelerators are often used for high performance computing systems. In such clusters, inter-node communication between accelerators requires several memory copies via CPU memory, and the communication latency incurred severely reduces performance. To solve this problem, we have been proposing a Tightly Coupled Accelerators (TCA) architecture intended to reduce the communication latency between accelerators over different nodes. In the TCA architecture, PCI Express packets are used for communication among GPUs over nodes. We developed a communication chip, which we call the PEACH2 chip, to help implement the TCA architecture. In this paper, we describe the details of the design and implementation of the PEACH2 chip, with respect to its routing mechanism and its DMA controller using FPGA. We evaluated the PEACH2 on a new platform that uses the latest Xeon CPU, IvyBridge, and achieved 2.3 GBytes/sec between GPUs over nodes, while the performance was only 880 MBytes/sec on the previous platform with SandyBridge.
TL;DR: This paper considers the problem of accelerating applications involving different communication patterns on Xeon Phis, with an emphasis on effectively using available SIMD parallelism, and offers an API for both shared memory and SIMD parallelization, and demonstrates its implementation.
Abstract: The Intel Xeon Phi offers a promising solution to coprocessing, since it is based on the popular x86 instruction set. However, to fully utilize its potential, applications must be vectorized to leverage the wide SIMD lanes, in addition to effective large-scale shared memory parallelism. Compared to the SIMT execution model on GPGPUs with CUDA or OpenCL, SIMD parallelism with a SSE-like instruction set imposes many restrictions, and has generally not benefitted applications involving branches, irregular accesses, or even reductions in the past. In this paper, we consider the problem of accelerating applications involving different communication patterns on Xeon Phis, with an emphasis on effectively using available SIMD parallelism. We offer an API for both shared memory and SIMD parallelization, and demonstrate its implementation. We use implementations of overloaded functions as a mechanism for providing SIMD code, which is assisted by runtime data reordering and our methods to effectively manage control flow. Our extensive evaluation with 6 popular applications shows large gains over the SIMD parallelization achieved by the production (ICC) compiler, and we even outperform OpenMP for MIMD parallelism.
TL;DR: The effectiveness of the HAM-Offload framework is demonstrated by using it to enable a real-world application from the field of molecular dynamics to use multiple local and remote Xeon Phis.
Abstract: Standard offload programming models for the Xeon Phi, e.g. Intel LEO and OpenMP 4.0, are restricted to a single compute node and hence a limited number of coprocessors. Scaling applications across a Xeon Phi cluster/supercomputer thus requires hybrid programming approaches, usually MPI+X. In this work, we present a framework based on heterogeneous active messages (HAM-Offload) that provides the means to offload work to local and remote (co)processors using a unified offload API. Since HAM-Offload provides similar primitives as current local offload frameworks, existing applications can be easily ported to overcome the single-node limitation while keeping the convenient offload programming model. We demonstrate the effectiveness of the framework by using it to enable a real-world application from the field of molecular dynamics to use multiple local and remote Xeon Phis. The evaluation shows good scaling behavior. Compared with LEO, performance is equal for large offloads and significantly better for small offloads.
TL;DR: This model assumes a task-parallel execution of the Cholesky factorization process, with concurrency leveraged via a runtime such as those recently proposed in projects like SMPSs, PLASMA or libflame, and decomposes the power usage into its system, static and dynamic components.
Abstract: In this paper we introduce a model for the total energy consumption of the Cholesky factorization on a multicore processor. Our model assumes a task-parallel execution of the factorization process, with concurrency leveraged via a runtime such as those recently proposed in projects like SMPSs, PLASMA or libflame, and decomposes the power usage into its system, static and dynamic components. A few simple experiments provide experimental data (parameters) with enough accuracy to assemble the model, which can then be used to estimate the actual power dissipation and energy consumption of the global algorithm. Experimental results on an 8-core platform equipped with Intel Xeon processors reveal the precision of the model.
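Illustration (not from the paper): the decomposition of power into system, static and dynamic components can be sketched as a sum of power times duration over slices of an execution trace. This is a minimal sketch under stated assumptions. The `Interval` structure, the function name `total_energy`, the per-core dynamic term, and all numeric values are hypothetical and are not taken from the authors' model.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """One slice of an execution trace (hypothetical structure)."""
    seconds: float    # duration of the slice
    busy_cores: int   # cores executing a task during the slice


def total_energy(trace, p_system, p_static, p_dynamic_per_core):
    """Return energy in joules by summing power * time over every slice.

    Assumption: dynamic power scales linearly with the number of busy
    cores, while system and static power are drawn for the whole run.
    """
    energy = 0.0
    for iv in trace:
        power = p_system + p_static + p_dynamic_per_core * iv.busy_cores
        energy += power * iv.seconds
    return energy


# Made-up trace: 8 cores busy for 2 s, 4 cores for 1 s, 1 core for 0.5 s.
trace = [Interval(2.0, 8), Interval(1.0, 4), Interval(0.5, 1)]
joules = total_energy(trace, p_system=50.0, p_static=20.0,
                      p_dynamic_per_core=10.0)
```

In a task-parallel run the number of busy cores varies over time, so the dynamic component is the only term that depends on the schedule in this simplified form.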
TL;DR: The results show that the optimizations improved performance of the Goddard microphysics scheme on Xeon Phi 7120P by a factor of 4.7× and reduced the Goddard microphysics scheme's share of the total WRF processing time from 20.0 to 7.5%.
Abstract: The Weather Research and Forecasting (WRF) model is a numerical weather prediction system designed to serve both atmospheric research and operational forecasting needs. WRF development is done in collaboration around the globe. Furthermore, the WRF is used by academic atmospheric scientists, weather forecasters at the operational centers, and so on. The WRF contains several physics components. The most time-consuming one is the microphysics. One microphysics scheme is the Goddard cloud microphysics scheme, a sophisticated cloud microphysics scheme in the WRF model. The Goddard microphysics scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. Compared to the earlier microphysics schemes, the Goddard scheme incorporates a large number of improvements. Thus, we have optimized the Goddard scheme code. In this paper, we present our results of optimizing the Goddard microphysics scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high-performance on-die bidirectional interconnect. The Intel MIC is capable of executing a full operating system and entire programs rather than just kernels as the GPU does. The MIC coprocessor supports all important Intel development tools. Thus, the development environment is one familiar to a vast number of CPU developers. However, getting maximum performance out of MICs will require using some novel optimization techniques. Those optimization techniques are discussed in this paper. The results show that the optimizations improved performance of the Goddard microphysics scheme on Xeon Phi 7120P by a factor of 4.7×. In addition, the optimizations reduced the Goddard microphysics scheme's share of the total WRF processing time from 20.0 to 7.5%. Furthermore, the same optimizations improved performance on Intel Xeon E5-2670 by a factor of 2.8× compared to the original code.
TL;DR: An enhanced neighbor sharing strategy is introduced that greatly speeds up neighbor searching compared with CPU implementations; it is developed from a global-memory-only implementation and then gradually improved by efficiently utilizing the much faster shared memory.
Abstract: This paper introduces a strategy to accelerate neighbor searching in agent-based simulations on GPU platforms. Because of their autonomous nature, agents can be processed by threads concurrently on a GPU, and the overall simulation can be accelerated as a result. Each agent simultaneously carries out a sense-think-act cycle in every time step. Neighbor searching is a crucial part of the sensing stage. Detecting and accessing neighbors is a memory-intensive task and often becomes the major time consumer in an agent-based simulation. Our contribution, an enhanced neighbor sharing strategy, greatly speeds up this procedure compared with CPU implementations. The strategy is developed from a global-memory-only implementation, and then gradually improved by efficiently utilizing the much faster shared memory. In our case studies, speedups of 89.08 and 11.51 are obtained on an NVIDIA Tesla K20 GPU compared with the sequential implementation and the OpenMP parallel implementation, respectively, on an Intel Xeon E5-2670 CPU.