TL;DR: ParaView was developed to analyze extremely large datasets using distributed memory computing resources and has become an integral tool in many national laboratories, universities and industry, and has won several awards related to high performance computation.
Abstract: ParaView is an open-source, multi-platform data analysis and visualization application. ParaView users can quickly build visualizations to analyze their data using qualitative and quantitative techniques. The data exploration can be done interactively in 3D or programmatically using ParaView's batch processing capabilities. ParaView was developed to analyze extremely large datasets using distributed memory computing resources. It can be run on supercomputers to analyze datasets of petascale size as well as on laptops for smaller data, has become an integral tool in many national laboratories, universities and industry, and has won several awards related to high performance computation. ParaView is open-source (BSD licensed, commercial software friendly). As with any successful open-source project, ParaView is supported by an active user and developer community.
TL;DR: This paper presents a simulation-driven prediction model that predicts job performance on the Apache Spark platform, and evaluates it using four real-life applications to show that the model can achieve high prediction accuracy.
Abstract: Apache Spark is an open source distributed data processing platform that uses distributed memory abstraction to process large volumes of data efficiently. However, performance of a particular job on the Apache Spark platform can vary significantly depending on the input data type and size, design and implementation of the algorithm, and computing capability, making it extremely difficult to predict the performance metrics of a job such as execution time, memory footprint, and I/O cost. To address this challenge, in this paper, we present a simulation driven prediction model that can predict job performance with high accuracy for the Apache Spark platform. Specifically, as Apache Spark jobs often consist of multiple sequential stages, the presented prediction model simulates the execution of the actual job by using only a fraction of the input data, and collects execution traces (e.g., I/O overhead, memory consumption, execution time) to predict job performance for each execution stage individually. We evaluated our prediction framework using four real-life applications on a 13 node cluster, and experimental results show that the model can achieve high prediction accuracy.
TL;DR: Current results show several emerging memory technologies related to nonvolatile memory (NVM) devices are being investigated as an alternative for DRAM and could offer high-bandwidth I/O, larger main memory capacities, persistent data structures, and new approaches for application resilience and output postprocessing.
Abstract: For extreme-scale high-performance computing systems, system-wide power consumption has been identified as one of the key constraints moving forward, where DRAM main memory systems account for about 30 to 50 percent of a node's overall power consumption. As the benefits of device scaling for DRAM memory slow, it will become increasingly difficult to keep memory capacities balanced with increasing computational rates offered by next-generation processors. However, several emerging memory technologies related to nonvolatile memory (NVM) devices are being investigated as an alternative for DRAM. Moving forward, NVM devices could offer solutions for HPC architectures. Researchers are investigating how to integrate these emerging technologies into future extreme-scale HPC systems and how to expose these capabilities in the software stack and applications. Current results show several of these strategies could offer high-bandwidth I/O, larger main memory capacities, persistent data structures, and new approaches for application resilience and output postprocessing, such as transaction-based incremental checkpointing and in situ visualization, respectively.
TL;DR: This paper describes the implementation of a parallel fast multipole method for evaluating potentials for discrete and continuous source distributions and discusses several algorithmic improvements and performance optimizations including cache locality, vectorization, shared memory parallelism and use of coprocessors.
Abstract: We describe our implementation of a parallel fast multipole method for evaluating potentials for discrete and continuous source distributions. The first requires summation over the source points and the second requires integration over a continuous source density. Both problems require $O(N^2)$ complexity when computed directly; however, they can be accelerated to $O(N)$ time using FMM. In our PVFMM software library, we use kernel independent FMM and this allows us to compute potentials for a wide range of elliptic kernels. Our method is high order, adaptive and scalable. In this paper, we discuss several algorithmic improvements and performance optimizations including cache locality, vectorization, shared memory parallelism and use of coprocessors. Our distributed memory implementation uses a space-filling curve for partitioning data and a hypercube communication scheme. We present convergence results for Laplace, Stokes and Helmholtz (low wavenumber) kernels for both particle and volume FMM. We measure efficiency of our method in terms of CPU cycles per unknown for different accuracies and different kernels. We also demonstrate scalability of our implementation up to several thousand processor cores on the Stampede platform at the Texas Advanced Computing Center.
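For reference, the direct particle summation that FMM accelerates can be sketched as follows. This is a minimal illustration, assuming the 3D Laplace kernel $1/(4\pi r)$; the function name `direct_laplace_potential` is hypothetical and is not part of the PVFMM API. The doubly nested loop is what makes the direct cost grow quadratically with the number of points.

```python
import math


def direct_laplace_potential(points, charges):
    """Direct evaluation of u_i = sum over j != i of q_j / (4*pi*|x_i - x_j|).

    `points` is a list of 3D coordinate tuples and `charges` the matching
    source strengths. Every target visits every source, so the cost is
    quadratic in the number of points.
    """
    n = len(points)
    potentials = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # skip the singular self-interaction
            r = math.dist(points[i], points[j])
            potentials[i] += charges[j] / (4.0 * math.pi * r)
    return potentials
```

A baseline of this kind is commonly used to check the accuracy of an accelerated method on a small subset of target points.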
TL;DR: Quartz is a performance emulator for persistent memory that emulates a wide range of NVM latencies and bandwidth characteristics for performance evaluation of emerging byte-addressable NVMs and their impact on application performance.
Abstract: Next-generation non-volatile memory (NVM) technologies, such as phase-change memory and memristors, can enable computer systems infrastructure to continue keeping up with the voracious appetite of data-centric applications for large, cheap, and fast storage. Persistent memory has emerged as a promising approach to accessing emerging byte-addressable non-volatile memory through processor load/store instructions. Due to lack of commercially available NVM, system software researchers have mainly relied on emulation to model persistent memory performance. However, existing emulation approaches are either too simplistic, or too slow to emulate large-scale workloads, or require special hardware. To fill this gap and encourage wider adoption of persistent memory, we developed a performance emulator for persistent memory, called Quartz. Quartz enables an efficient emulation of a wide range of NVM latencies and bandwidth characteristics for performance evaluation of emerging byte-addressable NVMs and their impact on application performance (without modifying or instrumenting their source code) by leveraging features available in commodity hardware. Our emulator is implemented on the three latest Intel Xeon-based processor architectures: Sandy Bridge, Ivy Bridge, and Haswell. To assist researchers and engineers in evaluating design decisions with emerging NVMs, we extend Quartz for emulating the application execution on future systems with two types of memory: fast, regular volatile DRAM and slower persistent memory. We evaluate the effectiveness of our approach by using a set of specially designed memory-intensive benchmarks and real applications. The accuracy of the proposed approach is validated by running these programs both on our emulation platform and a multisocket (NUMA) machine that can support a range of memory latencies.
We show that Quartz can emulate a range of performance characteristics with low overhead and good accuracy (with emulation errors 0.2% - 9%).
TL;DR: This paper explores accelerating the performance of K-means clustering using three approaches: 1) shared memory using OpenMP, 2) distributed memory with message passing (MPI), and 3) heterogeneous computing with NVIDIA Graphics Processing Units (GPUs) programmed with CUDA-C.
Abstract: K-Means clustering is a popular unsupervised machine learning method which has been used in diverse applications including image processing, information retrieval, social sciences and weather forecasting. However, clustering is computationally expensive especially when applied to large datasets. In this paper, we explore accelerating the performance of K-means clustering using three approaches: 1) shared memory using OpenMP, 2) distributed memory with message passing (MPI), and 3) heterogeneous computing with NVIDIA Graphics Processing Units (GPUs) programmed with CUDA-C. While others have looked at accelerating K-means clustering, this is the first study that compares these different approaches. In addition, K-means performance is very sensitive to the initial means chosen. We evaluate different initializations in parallel and choose the best one to use for the entire algorithm. We evaluate results on a range of images from small (300×300 pixels) to large (1164×1200 pixels). Our results show that all three parallel programming approaches give speed-up, with the best results obtained by OpenMP for smaller images and CUDA-C for larger ones. Each of these approaches gives approximately thirty times overall speed-up compared to a sequential implementation of K-means. In addition, our parallel initialization gives an additional 1.5 to 2.5 times speed-up over the accelerated parallel versions.
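The idea of trying several initializations and keeping the best can be sketched in plain Python. This is a sequential illustration under stated assumptions, not the paper's OpenMP, MPI, or CUDA-C code: the selection criterion (lowest within-cluster sum of squares after running K-means from each candidate) and all function names are assumptions made here. The candidate runs are independent of one another, which is what makes them natural to evaluate in parallel.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, means):
    """Index of the nearest mean for every point."""
    return [min(range(len(means)), key=lambda k: sq_dist(pt, means[k]))
            for pt in points]


def update(points, labels, means):
    """Centroid of each cluster; an empty cluster keeps its old mean."""
    new_means = []
    for k, old in enumerate(means):
        members = [pt for pt, lab in zip(points, labels) if lab == k]
        if members:
            new_means.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_means.append(old)
    return new_means


def kmeans(points, means, iters=20):
    """Lloyd iterations from the given initial means.

    Returns (means, labels, inertia), where inertia is the within-cluster
    sum of squared distances.
    """
    means = list(means)
    for _ in range(iters):
        new_means = update(points, assign(points, means), means)
        if new_means == means:
            break  # converged: the means no longer move
        means = new_means
    labels = assign(points, means)
    inertia = sum(sq_dist(pt, means[lab]) for pt, lab in zip(points, labels))
    return means, labels, inertia


def best_of_inits(points, k, n_inits=8, seed=0):
    """Run K-means from several random initializations; keep the lowest inertia."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_inits):
        result = kmeans(points, rng.sample(points, k))
        if best is None or result[2] < best[2]:
            best = result
    return best
```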
TL;DR: STRUMPACK (STRUctured Matrices PACKage) is a distributed memory library for computations with dense structured matrices, where a matrix is considered structured if its off-diagonal blocks can be approximated by a rank-deficient matrix with low numerical rank.
Abstract: We present a distributed-memory library for computations with dense structured matrices. A matrix is considered structured if its off-diagonal blocks can be approximated by a rank-deficient matrix with low numerical rank. Here, we use Hierarchically Semi-Separable representations (HSS). Such matrices appear in many applications, e.g., finite element methods, boundary element methods, etc. Exploiting this structure allows for fast solution of linear systems and/or fast computation of matrix-vector products, which are the two main building blocks of matrix computations. The compression algorithm that we use, that computes the HSS form of an input dense matrix, relies on randomized sampling with a novel adaptive sampling mechanism. We discuss the parallelization of this algorithm and also present the parallelization of structured matrix-vector product, structured factorization and solution routines. The efficiency of the approach is demonstrated on large problems from different academic and industrial applications, on up to 8,000 cores.
This work is part of a more global effort, the STRUMPACK (STRUctured Matrices PACKage) software package for computations with sparse and dense structured matrices. Hence, although useful in their own right, the routines also represent a step in the direction of a distributed-memory sparse solver.
TL;DR: This paper details the Power ISA transactional memory architecture, the POWER8 implementation of this architecture, and two practical uses of it, Transactional Lock Elision (TLE) and Thread-Level Speculation (TLS), and provides performance results for these uses.
Abstract: With multi-core processors, parallel programming has taken on greater importance. Traditional parallel programming techniques based on critical sections controlled by locking have several well-known drawbacks. To allow for more efficient parallel programming with higher performance, the IBM POWER8™ processor implements a hardware transactional memory facility. Transactional memory allows groups of load and store operations to execute and commit as a single atomic unit without the use of traditional locks, thereby improving performance and simplifying the parallel programming model. The POWER8 transactional memory facility provides a robust capability to execute transactions that can survive interrupts. It also allows non-speculative accesses within transactions, which facilitates debugging and thread-level speculation. Unique challenges caused by implementing transactional memory on top of the Power ISA (Instruction Set Architecture) weakly consistent memory model are addressed. We detail the Power ISA transactional memory architecture, the POWER8 implementation of this architecture, and two practical uses of this architecture—Transactional Lock Elision (TLE) and Thread-Level Speculation (TLS)—and provide performance results for these uses.
TL;DR: In this article, a sparse linear system solver based on a multifrontal variant of Gaussian elimination is presented, which exploits low-rank approximation of the resulting dense frontal matrices.
Abstract: We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination, and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups up to 7 fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK -- STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices.
TL;DR: In this article, the authors use a neuroscience-inspired model of learning, spike-timing-dependent plasticity, to develop a bio-inspired approach for programming memory devices, which naturally gives rise to an inference engine.
Abstract: Cognitive tasks are essential for the modern applications of electronics, and rely on the capability to perform inference. The Von Neumann bottleneck is an important issue for such tasks, and emerging memory devices offer an opportunity to overcome this issue by fusing computing and memory, in nonvolatile instant ON/OFF systems. A vision for accomplishing this is to use brain-inspired architectures, which excel at inference and do not differentiate between computing and memory. In this work, we use a neuroscience-inspired model of learning, spike-timing-dependent plasticity, to develop a bioinspired approach for programming memory devices, which naturally gives rise to an inference engine. The method is then adapted to different memory devices, including multi-valued memories (cumulative memristive device, phase-change memory) and stochastic binary memories (conductive bridge memory, spin transfer torque magnetic tunnel junction). By means of system-level simulations, we investigate several applications, including image recognition and pattern detection within video and auditory data. We compare the results of the different devices. Stochastic binary devices require the use of redundancy, the extent of which depends tremendously on the considered task. A theoretical analysis allows us to understand how the various devices differ, and ties the inference engine to the machine learning algorithm of expectation-maximization. Monte Carlo simulations demonstrate an exceptional robustness of the inference engine with respect to device variations and other issues. A theoretical analysis explains the roots of this robustness. These results highlight a possible new bioinspired paradigm for programming emerging memory devices, allowing the natural learning of a complex inference engine. The physics of the memory devices plays an active role. The results open the way for a reinvention of the role of memory, when solving inference tasks.
TL;DR: The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.
TL;DR: AccFFT extends existing FFT libraries for CUDA-enabled Graphics Processing Units (GPUs) to distributed memory clusters and uses an overlapping communication method to reduce the overhead of PCIe transfers from/to GPU.
Abstract: We present a new library for parallel distributed Fast Fourier Transforms (FFT). The importance of FFT in science and engineering and the advances in high performance computing necessitate further improvements. AccFFT extends existing FFT libraries for CUDA-enabled Graphics Processing Units (GPUs) to distributed memory clusters. We use an overlapping communication method to reduce the overhead of PCIe transfers from/to GPU. We present numerical results on the Maverick platform at the Texas Advanced Computing Center (TACC) and on the Titan system at the Oak Ridge National Laboratory (ORNL). We present the scaling of the library up to 4,096 K20 GPUs of Titan.
TL;DR: This work addresses the problem of automated memory partitioning for emerging architectures, taking into account potentially parallel data accesses to physically independent banks, using the Z-polyhedral model for program analysis and adopting a partitioning scheme based on integer lattices.
Abstract: Emerging architectures, such as reconfigurable hardware platforms, provide the unprecedented opportunity of customizing the memory infrastructure based on application access patterns. This work addresses the problem of automated memory partitioning for such architectures, taking into account potentially parallel data accesses to physically independent banks. Targeted at affine static control parts (SCoPs), the technique relies on the Z-polyhedral model for program analysis and adopts a partitioning scheme based on integer lattices. The approach enables the definition of a solution space including previous works as particular cases. The problem of minimizing the total amount of memory required across the partitioned banks, referred to as storage minimization throughout the article, is tackled by an optimal approach yielding asymptotically zero memory waste or, as an alternative, an efficient approach ensuring arbitrarily small waste. The article also presents a prototype toolchain and a detailed step-by-step case study demonstrating the impact of the proposed technique along with extensive comparisons with alternative approaches in the literature.
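To make the bank-conflict problem concrete, here is a deliberately simplified one-dimensional sketch. It assumes a cyclic mapping `bank(i) = i mod N` and a loop body that reads `a[i + d]` for a fixed set of offsets `d`; the paper's lattice-based scheme is more general, and the function names below are hypothetical. The accesses of one iteration are conflict-free for every `i` exactly when the offsets are pairwise distinct modulo `N`.

```python
def conflict_free(offsets, n_banks):
    """True if a[i + d], for all d in offsets, hit pairwise distinct banks.

    Under bank(i) = i mod n_banks this holds for every i exactly when the
    offsets are pairwise distinct modulo n_banks.
    """
    distinct = set(offsets)
    return len({d % n_banks for d in distinct}) == len(distinct)


def min_cyclic_banks(offsets):
    """Smallest bank count for which the cyclic mapping is conflict-free."""
    n = max(1, len(set(offsets)))
    while not conflict_free(offsets, n):
        n += 1
    return n


def place(index, n_banks):
    """(bank, local address) of element `index` under the cyclic mapping."""
    return index % n_banks, index // n_banks
```

For example, a stencil reading `a[i]` and `a[i + 2]` conflicts with two banks, since both offsets are even, and needs three banks under this mapping.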
TL;DR: This paper develops a runtime system called U-tube, which provides a framework to adopt memory hotplug or ballooning for dynamic memory allocation, implements fine-grained memory hotplug in Xen, and makes a quantitative comparison between memory hotplug and ballooning.
Abstract: In virtualization environments, static memory allocation for virtual machines (VMs) can lead to severe service level agreement (SLA) violations or inefficient use of memory. Dynamic memory allocation mechanisms such as ballooning and memory hotplug were proposed to handle the dynamics of memory demands. However, these mechanisms so far have not been quantitatively or comparatively studied. In this paper, we first develop a runtime system called U-tube, which provides a framework to adopt memory hotplug or ballooning for dynamic memory allocation. We then implement fine-grained memory hotplug in Xen. We demonstrate the effectiveness of U-tube for dynamic memory management through two case studies: dynamic memory balancing and memory overcommitment. With these two case studies, we make a quantitative comparison between memory hotplug and ballooning. The experiments show that there is no absolute winner for different scenarios. Our findings can be very useful for practitioners to choose the suitable dynamic memory management techniques in different scenarios.
TL;DR: In this paper, the authors present new algorithms for massively parallel granular dynamics simulations on distributed memory architectures using a domain partitioning approach, where collisions are modelled with hard contacts to hide their micro-dynamics and thus to extend the time and length scales.
Abstract: This article presents new algorithms for massively parallel granular dynamics simulations on distributed memory architectures using a domain partitioning approach. Collisions are modelled with hard contacts in order to hide their micro-dynamics and thus to extend the time and length scales that can be simulated. The global multi-contact problem is solved using a non-linear block Gauss-Seidel method that is conforming to the subdomain structure. The parallel algorithms employ a sophisticated protocol between processors that delegate algorithmic tasks such as contact treatment and position integration uniquely and robustly to the processors. Communication overhead is minimized through aggressive message aggregation, leading to excellent strong and weak scaling. The robustness and scalability are assessed on three clusters including two peta-scale supercomputers with up to 458,752 processor cores. The simulations can reach unprecedented resolution of up to ten billion ($10^{10}$) non-spherical particles and contacts.
TL;DR: The results show that the Mammoth system can reduce the job execution time by more than 40 percent in typical cases, without requiring any modifications of the Hadoop programs, and can obtain similar performance as Spark when the memory is sufficient.
Abstract: The MapReduce platform has been widely used for large-scale data processing and analysis recently. It works well if the hardware of a cluster is well configured. However, our survey has indicated that common hardware configurations in small- and medium-size enterprises may not be suitable for such tasks. This situation is more challenging for memory-constrained systems, in which the memory is a bottleneck resource compared with the CPU power and thus does not meet the needs of large-scale data processing. The traditional high performance computing (HPC) system is an example of the memory-constrained system according to our survey. In this paper, we have developed Mammoth, a new MapReduce system, which aims to improve MapReduce performance using global memory management. In Mammoth, we design a novel rule-based heuristic to prioritize memory allocation and revocation among execution units (mapper, shuffler, reducer, etc.), to maximize the holistic benefits of the Map/Reduce job when scheduling each memory unit. We have also developed a multi-threaded execution engine, which is based on Hadoop but runs in a single JVM on a node. In the execution engine, we have implemented the algorithm of memory scheduling to realize global memory management, based on which we further developed the techniques such as sequential disk accessing, multi-cache and shuffling from memory, and solved the problem of full garbage collection in the JVM. We have conducted extensive experiments to compare Mammoth against the native Hadoop platform. The results show that the Mammoth system can reduce the job execution time by more than 40 percent in typical cases, without requiring any modifications of the Hadoop programs. When a system is short of memory, Mammoth can improve the performance by up to 5.19 times, as observed for I/O intensive applications, such as PageRank. We also compared Mammoth with Spark. 
Although Spark can achieve better performance than Mammoth for interactive and iterative applications when the memory is sufficient, our experimental results show that for batch processing applications, Mammoth can adapt better to various memory environments and outperform Spark when the memory is insufficient, and can obtain similar performance as Spark when the memory is sufficient. Given the growing importance of supporting large-scale data processing and analysis and the proven success of the MapReduce platform, the Mammoth system can have a promising potential and impact.
TL;DR: This work develops a comprehensive approach to eliminate timing channels in the memory controller that has two key elements: shape the memory access behavior of each thread so that it has an unchanging memory access pattern, and mathematically show that the proposed system yields zero information leakage.
Abstract: Trusted applications frequently execute in tandem with untrusted applications on personal devices and in cloud environments. Since these co-scheduled applications share hardware resources, the latencies encountered by the untrusted application betray information about whether the trusted applications are accessing shared resources or not. Prior studies have shown that such information leaks can be used by the untrusted application to decipher keys or launch covert-channel attacks. Prior work has also proposed techniques to eliminate information leakage in various shared resources. The best known solution to eliminate information leakage in the memory system incurs high performance penalties. This work develops a comprehensive approach to eliminate timing channels in the memory controller that has two key elements: (i) We shape the memory access behavior of each thread so that it has an unchanging memory access pattern. (ii) We show how efficient memory access pipelines can be constructed to process the resulting memory accesses without introducing any resource conflicts. We mathematically show that the proposed system yields zero information leakage. We then show that various page mapping policies can impact the throughput of our secure memory system. We also introduce techniques to re-order requests from different threads to boost performance without leaking information. Our best solution offers throughput that is 27% lower than that of an optimized non-secure baseline, and that is 69% higher than the best known competing scheme.
TL;DR: In this paper, a hybrid parallelization approach that uses both threading (shared memory) and message-passing (distributed memory) is proposed to overcome computational and memory constraints on the size of datasets that can be analyzed.
Abstract: We consider parallel computation for Gaussian process calculations to overcome computational and memory constraints on the size of datasets that can be analyzed. Using a hybrid parallelization approach that uses both threading (shared memory) and message-passing (distributed memory), we implement the core linear algebra operations used in spatial statistics and Gaussian process regression in an R package called bigGP that relies on C and MPI. The approach divides the covariance matrix into blocks such that the computational load is balanced across processes while communication between processes is limited. The package provides an API enabling R programmers to implement Gaussian process-based methods by using the distributed linear algebra operations without any C or MPI coding. We illustrate the approach and software by analyzing an astrophysics dataset with n = 67,275 observations.
TL;DR: Extensive experiments on several real-world graphs show that GraphMap outperforms existing distributed memory-based systems for various iterative graph algorithms.
Abstract: In recent years, systems researchers have devoted considerable effort to the study of large-scale graph processing. Existing distributed graph processing systems such as Pregel, based solely on distributed memory for their computations, fail to provide seamless scalability when the graph data and their intermediate computational results no longer fit into the memory; and most distributed approaches for iterative graph computations do not consider utilizing secondary storage a viable solution. This paper presents GraphMap, a distributed iterative graph computation framework that maximizes access locality and speeds up distributed iterative graph computations by effectively utilizing secondary storage. GraphMap has three salient features: (1) It distinguishes data states that are mutable during iterative computations from those that are read-only in all iterations to maximize sequential access and minimize random access. (2) It entails a two-level graph partitioning algorithm that enables balanced workloads and locality-optimized data placement. (3) It contains a proposed suite of locality-based optimizations that improve computational efficiency. Extensive experiments on several real-world graphs show that GraphMap outperforms existing distributed memory-based systems for various iterative graph algorithms.
TL;DR: Major improvements to the computational infrastructure of the CTF subchannel code are described so that full-core, pincell-resolved simulations can now be performed in much shorter run-times, either in stand-alone mode or as part of coupled-code multi-physics calculations.
TL;DR: In this article, the authors consider the problem of tree-shaped task graphs with multiple processors and design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan.
Abstract: This article investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents some large data. A task can only be executed if all input and output data fit into memory, and a data item can only be removed from memory after the completion of the task that uses it as input. Such trees arise in the multifrontal method of sparse matrix factorization. The peak memory needed for the processing of the entire tree depends on the execution order of the tasks. With one processor, the objective of the tree traversal is to minimize the required memory. This problem was well studied, and optimal polynomial algorithms were proposed. Here, we extend the problem by considering multiple processors, which is of obvious interest in the application area of matrix factorization. With multiple processors comes the additional objective to minimize the time needed to traverse the tree, that is, to minimize the makespan. Not surprisingly, this problem proves to be much harder than the sequential one. We study the computational complexity of this problem and provide inapproximability results even for unit weight trees. We design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan. Some of these heuristics are able to process a tree while keeping the memory usage under a given memory limit. The different heuristics are evaluated in an extensive experimental evaluation using realistic trees.
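The dependence of peak memory on execution order can be illustrated for the single-processor case with a short sketch. It assumes a simplified model in which each task only holds its input data (its children's outputs) and its own output while running, and it considers postorder traversals only; the data layout and the function name are hypothetical. Within this model, visiting children in decreasing order of (subtree peak minus output size) minimizes the postorder peak, which follows from a pairwise exchange argument.

```python
def postorder_peak(tree, out_size, node, sort_children=True):
    """Peak memory of a postorder traversal of the subtree rooted at `node`.

    `tree` maps a node to the list of its children, and `out_size[v]` is the
    size of the data sent from `v` to its parent. With sort_children=False
    the children are visited in the order in which they are listed.
    """
    children = tree.get(node, [])
    peaks = {c: postorder_peak(tree, out_size, c, sort_children) for c in children}
    order = list(children)
    if sort_children:
        order.sort(key=lambda c: peaks[c] - out_size[c], reverse=True)
    held = 0  # outputs of finished children, kept until `node` runs
    peak = 0
    for c in order:
        peak = max(peak, held + peaks[c])
        held += out_size[c]
    # Running `node` needs all of its inputs plus its own output in memory.
    return max(peak, held + out_size[node])
```

In a tree where the root has a cheap child `b` listed before a child `a` whose subtree is memory-hungry, visiting `a` first avoids holding the output of `b` while the expensive subtree is processed.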
TL;DR: This paper proposes a massive-thread EMT program (MT-EMTP) and an efficient node-mapping structure that transforms the original power system admittance matrix into a block-node diagonal sparse format to exploit the massive-thread parallel GPU architecture.
Abstract: The electromagnetic transient (EMT) simulation of a large-scale power system consumes so much computational power that parallel programming techniques are urgently needed in this area. For example, realistic-sized power systems include thousands of buses, generators, and transmission lines. Massive-thread computing is one of the key developments that can increase the EMT computational capabilities substantially when the processing unit has enough hardware cores. Compared to the traditional CPU, the graphic-processing unit (GPU) has many more cores with distributed memory which can offer higher data throughput. This paper proposes a massive-thread EMT program (MT-EMTP) and develops massive-thread parallel modules for linear passive elements, the universal line model, and the universal machine model for offline EMT simulation. An efficient node-mapping structure is proposed to transform the original power system admittance matrix into a block-node diagonal sparse format to exploit the massive-thread parallel GPU architecture. The developed MT-EMTP program has been tested on large-scale power systems of up to 2458 three-phase buses with detailed component modeling. The simulation results and execution times are compared with mainstream commercial software, EMTP-RV, to show the improvement in performance with equivalent accuracy.
TL;DR: Under weak scaling and for certain classes of graphs, the graph contraction algorithm scales consistently better than the four other algorithms; it uses significantly less memory than two of them and is of the same order in terms of memory as the other two.
Abstract: Introduces a graph contraction based distributed-memory connected component algorithm. Four alternative distributed-memory connected component algorithms are presented. Theoretical and experimental analysis is presented for the five algorithms. Classes of problems under which the algorithms are most applicable are identified. Novel algorithm shows better scalability across the range of scientific computing graphs used herein. Connected component labeling is a key step in a wide range of applications, such as community detection in social networks and coherent structure identification in massively-parallel scientific simulations. There have been several distributed-memory connected component algorithms described in the literature; however, little has been done regarding their scalability analysis. Theoretical and experimental results are presented for five algorithms: three that are direct implementations of previous approaches, one that is an implementation of a previous approach that is optimized to reduce communication, and one that is a novel approach based on graph contraction. Under weak scaling and for certain classes of graphs, the graph contraction algorithm scales consistently better than the four other algorithms. Furthermore, it uses significantly less memory than two of the alternative methods and is of the same order in terms of memory as the other two.
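For readers unfamiliar with the task itself, connected component labeling can be sketched on a single process with a union-find structure. This is only a sequential baseline for illustration, not any of the five distributed-memory algorithms compared in the paper; the function name `connected_components` and the example graph are assumptions.

```python
def connected_components(num_vertices, edges):
    """Label each vertex with the smallest vertex id in its component."""
    parent = list(range(num_vertices))

    def find(v):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            # Keep the smaller id as root so the final label is the minimum id.
            if ru < rv:
                parent[rv] = ru
            else:
                parent[ru] = rv

    return [find(v) for v in range(num_vertices)]


# Hypothetical graph: {0, 1, 2} and {3, 4} are connected, vertex 5 is isolated.
labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
print(labels)  # [0, 0, 0, 3, 3, 5]
```

Two vertices belong to the same component exactly when they receive the same label, which is the output a distributed algorithm must also produce, with the added cost of communicating labels across processes.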
TL;DR: The proposed code generation framework addresses a major memory bottleneck of prior techniques that limits the scalability of the generated code and is demonstrated on computations that are mixed regular/irregular, completely regular, and completely irregular.
Abstract: Many applications feature a mix of irregular and regular computational structures. For example, codes using adaptive mesh refinement (AMR) typically use a collection of regular blocks, where the number of blocks and the relationship between blocks is irregular. The computational structure in such applications generally involves regular (affine) loop computations within some number of innermost loops, while outer loops exhibit irregularity due to data-dependent control flow and indirect array access patterns. Prior approaches to distributed memory parallelization do not handle such computations effectively. They either target loop nests that are completely affine using polyhedral frameworks, or treat all loops as irregular. Consequently, the generated distributed memory code contains artifacts that disrupt the regular nature of previously affine innermost loops of the computation. This hampers subsequent optimizations to improve on-node performance. We propose a code generation framework that can effectively transform such applications for execution on distributed memory systems. Our approach generates distributed memory code which preserves program properties that enable subsequent polyhedral optimizations. Simultaneously, it addresses a major memory bottleneck of prior techniques that limits the scalability of the generated code. The effectiveness of the proposed framework is demonstrated on computations that are mixed regular/irregular, completely regular, and completely irregular.
TL;DR: The Chebyshev and Krylov-subspace methods for calculating the quantum walk propagation are provided, as well as methods for visualization and data analysis, and the performance behavior of pyCTQW on a distributed memory cluster is demonstrated.
TL;DR: A hardware-assisted distributed memory system may include software configurable shared memory regions in the local memory of each of multiple processor cores, which are accessed through a network of on-chip atomic transaction engine (ATE) instances, one per core, over a private interconnect matrix that connects them together.
Abstract: A hardware-assisted Distributed Memory System may include software configurable shared memory regions in the local memory of each of multiple processor cores. Accesses to these shared memory regions may be made through a network of on-chip atomic transaction engine (ATE) instances, one per core, over a private interconnect matrix that connects them together. For example, each ATE instance may issue Remote Procedure Calls (RPCs), with or without responses, to an ATE instance associated with a remote processor core in order to perform operations that target memory locations controlled by the remote processor core. Each ATE instance may process RPCs (atomically) that are received from other ATE instances or that are generated locally. For some operation types, an ATE instance may execute the operations identified in the RPCs itself using dedicated hardware. For other operation types, the ATE instance may interrupt its local processor core to perform the operations.
TL;DR: The extensive evaluation demonstrates that SparkRDF can efficiently implement non-selective joins faster than both current state-of-the-art distributed and centralized stores, while being able to process other queries in real time, scaling linearly to the amount of data.
Abstract: With the explosive growth of semantic data on the Web over the past years, many large-scale RDF knowledge bases with billions of facts are being generated. This poses significant challenges for the storage and querying of big RDF graphs. Current systems still have many limitations in processing big RDF graphs, including scalability and real-time performance. In this paper, we introduce SparkRDF, an elastic discreted RDF graph processing engine with distributed memory. To reduce the high I/O and communication cost in distributed processing platforms, SparkRDF implements SPARQL queries on Spark, a novel in-memory distributed computing framework for big data processing. All the intermediate results are modeled as Resilient Discreted SubGraphs, which are cached in the distributed memory to support fast iterative join operations. To cut down the search space and avoid memory overhead, we split the RDF graph into small Multi-layer Elastic SubGraphs based on the relations and classes. For SPARQL query optimization, SparkRDF deploys a series of optimization strategies, leading to an effective reduction in the size of intermediate results, the number of joins and the cost of communication. Our extensive evaluation demonstrates that SparkRDF can efficiently implement non-selective joins faster than both current state-of-the-art distributed and centralized stores, while being able to process other queries in real time, scaling linearly to the amount of data.
TL;DR: A dataflow-based design approach to implement video codecs on embedded multi-core platforms based on distributed memory and Transport-Triggered Architecture processor cores is presented and software synthesis techniques to improve dynamic dataflow implementations are described.
Abstract: Multimedia applications and embedded platforms are both becoming very complex in order to improve user experience. Thus, multimedia developers need high-level methods to automate time-consuming and error-prone tasks. Dynamic dataflow modeling is attractive to describe complex applications, such as video codecs, at a high level of abstraction. This paper presents a dataflow-based design approach to implement video codecs on embedded multi-core platforms. First, we introduce a custom architecture model to design low-power multi-core chips based on distributed memory and Transport-Triggered Architecture processor cores. Then, we describe software synthesis techniques to improve dynamic dataflow implementations. This methodology has been implemented into open-source tools and demonstrated on video decoders based on the MPEG-4 Visual standard and the new High Efficiency Video Coding standard. The simulations achieve real-time decoding (40 FPS) of high-definition (720p) MPEG-4 Visual video sequences on a custom multi-core platform clocked at 1 GHz, which is an improvement of more than 100% over previously proposed implementations.
TL;DR: New algorithms for massively parallel granular dynamics simulations on distributed memory architectures using a domain partitioning approach that can reach unprecedented resolution of up to ten billion non-spherical particles and contacts are presented.
Abstract: This article presents new algorithms for massively parallel granular dynamics simulations on distributed memory architectures using a domain partitioning approach. Collisions are modelled with hard contacts in order to hide their micro-dynamics and thus to extend the time and length scales that can be simulated. The multi-contact problem is solved using a non-linear block Gauss-Seidel method that conforms to the subdomain structure. The parallel algorithms employ a sophisticated protocol between processors that delegates algorithmic tasks such as contact treatment and position integration uniquely and robustly to the processors. Communication overhead is minimized through aggressive message aggregation, leading to excellent strong and weak scaling. The robustness and scalability are assessed on three clusters, including two peta-scale supercomputers with up to 458752 processor cores. The simulations can reach unprecedented resolution of up to ten billion non-spherical particles and contacts.