Scispace (Formerly Typeset)
Showing papers on "Distributed memory" published in 2015
Book•
The ParaView Guide: A Parallel Visualization Application

Utkarsh Ayachit
22 Jan 2015
TL;DR: ParaView was developed to analyze extremely large datasets using distributed memory computing resources and has become an integral tool in many national laboratories, universities and industry, and has won several awards related to high performance computation.
Abstract: ParaView is an open-source, multi-platform data analysis and visualization application. ParaView users can quickly build visualizations to analyze their data using qualitative and quantitative techniques. The data exploration can be done interactively in 3D or programmatically using ParaView's batch processing capabilities. ParaView was developed to analyze extremely large datasets using distributed memory computing resources. It can be run on supercomputers to analyze datasets of petascale size as well as on laptops for smaller data. It has become an integral tool in many national laboratories, universities and industry, and has won several awards related to high performance computation. ParaView is open-source (BSD licensed, commercial software friendly). As with any successful open-source project, ParaView is supported by an active user and developer community.

1,526 citations

Proceedings Article•10.1109/HPCC-CSS-ICESS.2015.246•
Performance Prediction for Apache Spark Platform

Kewen Wang1, Mohammad Maifi Hasan Khan1•
University of Connecticut1
24 Aug 2015
TL;DR: This paper presents a simulation-driven prediction model that predicts job performance for the Apache Spark platform and evaluates it using four real-life applications, showing that the model can achieve high prediction accuracy.
Abstract: Apache Spark is an open source distributed data processing platform that uses a distributed memory abstraction to process large volumes of data efficiently. However, performance of a particular job on the Apache Spark platform can vary significantly depending on the input data type and size, design and implementation of the algorithm, and computing capability, making it extremely difficult to predict the performance metrics of a job such as execution time, memory footprint, and I/O cost. To address this challenge, in this paper, we present a simulation-driven prediction model that can predict job performance with high accuracy for the Apache Spark platform. Specifically, as Apache Spark jobs often consist of multiple sequential stages, the presented prediction model simulates the execution of the actual job by using only a fraction of the input data, and collects execution traces (e.g., I/O overhead, memory consumption, execution time) to predict job performance for each execution stage individually. We evaluated our prediction framework using four real-life applications on a 13 node cluster, and experimental results show that the model can achieve high prediction accuracy.
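The sample-then-extrapolate idea in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each stage's execution time grows linearly with input size; the abstract does not describe the paper's actual model, and the function name `predict_stage_times` and the stage labels are hypothetical.

```python
def predict_stage_times(sample_times, sample_fraction):
    """Extrapolate per-stage times measured on a data sample to the full input.

    Assumption (not from the paper): each stage scales linearly with input size.
    """
    if not 0.0 < sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be in (0, 1]")
    return {stage: t / sample_fraction for stage, t in sample_times.items()}


# Hypothetical trace: seconds per stage when running on 10% of the input.
sample_trace = {"load": 1.5, "shuffle": 4.0, "reduce": 2.5}
predicted = predict_stage_times(sample_trace, sample_fraction=0.1)
total_predicted = sum(predicted.values())
```

Predicting each stage separately, as the abstract describes, would let a different scaling rule be substituted per stage where linear growth does not hold.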

172 citations

Journal Article•10.1109/MCSE.2015.4•
Opportunities for Nonvolatile Memory Systems in Extreme-Scale High-Performance Computing

Jeffrey S. Vetter1, Sparsh Mittal1•
Oak Ridge National Laboratory1
25 Feb 2015-Computing in Science and Engineering
TL;DR: Current results show several emerging memory technologies related to nonvolatile memory (NVM) devices are being investigated as an alternative for DRAM and could offer high-bandwidth I/O, larger main memory capacities, persistent data structures, and new approaches for application resilience and output postprocessing.
Abstract: For extreme-scale high-performance computing systems, system-wide power consumption has been identified as one of the key constraints moving forward, where DRAM main memory systems account for about 30 to 50 percent of a node's overall power consumption. As the benefits of device scaling for DRAM memory slow, it will become increasingly difficult to keep memory capacities balanced with increasing computational rates offered by next-generation processors. However, several emerging memory technologies related to nonvolatile memory (NVM) devices are being investigated as an alternative for DRAM. Moving forward, NVM devices could offer solutions for HPC architectures. Researchers are investigating how to integrate these emerging technologies into future extreme-scale HPC systems and how to expose these capabilities in the software stack and applications. Current results show several of these strategies could offer high-bandwidth I/O, larger main memory capacities, persistent data structures, and new approaches for application resilience and output postprocessing, such as transaction-based incremental checkpointing and in situ visualization, respectively.

113 citations

Journal Article•10.4208/CICP.020215.150515SW•
PVFMM: A Parallel Kernel Independent FMM for Particle and Volume Potentials

Dhairya Malhotra1, George Biros•
University of Texas at Austin1
14 Sep 2015-Communications in Computational Physics
TL;DR: This paper describes the implementation of a parallel fast multipole method for evaluating potentials for discrete and continuous source distributions and discusses several algorithmic improvements and performance optimizations including cache locality, vectorization, shared memory parallelism and use of coprocessors.
Abstract: We describe our implementation of a parallel fast multipole method for evaluating potentials for discrete and continuous source distributions. The first requires summation over the source points and the second requires integration over a continuous source density. Both problems require $O(N^2)$ complexity when computed directly; however, they can be accelerated to $O(N)$ time using FMM. In our PVFMM software library, we use kernel independent FMM and this allows us to compute potentials for a wide range of elliptic kernels. Our method is high order, adaptive and scalable. In this paper, we discuss several algorithmic improvements and performance optimizations including cache locality, vectorization, shared memory parallelism and use of coprocessors. Our distributed memory implementation uses a space-filling curve for partitioning data and a hypercube communication scheme. We present convergence results for Laplace, Stokes and Helmholtz (low wavenumber) kernels for both particle and volume FMM. We measure efficiency of our method in terms of CPU cycles per unknown for different accuracies and different kernels. We also demonstrate scalability of our implementation up to several thousand processor cores on the Stampede platform at the Texas Advanced Computing Center.
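For context, the quadratic-cost direct evaluation that an FMM accelerates can be sketched as follows. This is only the baseline particle summation for the 3D Laplace kernel, one of the kernels the abstract names; it is not PVFMM's API, and the function name `direct_potential` is hypothetical.

```python
import math


def direct_potential(sources, charges, targets):
    """Directly evaluate u(x) = sum_j q_j / (4*pi*|x - y_j|) at each target.

    Cost is O(len(targets) * len(sources)); coincident points are skipped.
    """
    result = []
    for x in targets:
        total = 0.0
        for y, q in zip(sources, charges):
            r = math.dist(x, y)
            if r > 0.0:
                total += q / (4.0 * math.pi * r)
        result.append(total)
    return result


sources = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
charges = [1.0, 1.0]
potentials = direct_potential(sources, charges, [(0.0, 1.0, 0.0)])
```

With N sources and N targets the nested loops perform N^2 kernel evaluations, which is the cost the abstract says FMM reduces to linear time.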

107 citations

Proceedings Article•10.1145/2814576.2814806•
Quartz: A Lightweight Performance Emulator for Persistent Memory Software

Haris Volos1, Guilherme Magalhaes1, Ludmila Cherkasova1, Jun Li1•
Hewlett-Packard1
24 Nov 2015
TL;DR: Quartz is a performance emulator for persistent memory that emulates a wide range of NVM latencies and bandwidth characteristics for performance evaluation of emerging byte-addressable NVMs and their impact on application performance.
Abstract: Next-generation non-volatile memory (NVM) technologies, such as phase-change memory and memristors, can enable computer systems infrastructure to continue keeping up with the voracious appetite of data-centric applications for large, cheap, and fast storage. Persistent memory has emerged as a promising approach to accessing emerging byte-addressable non-volatile memory through processor load/store instructions. Due to the lack of commercially available NVM, system software researchers have mainly relied on emulation to model persistent memory performance. However, existing emulation approaches are either too simplistic, or too slow to emulate large-scale workloads, or require special hardware. To fill this gap and encourage wider adoption of persistent memory, we developed a performance emulator for persistent memory, called Quartz. Quartz enables an efficient emulation of a wide range of NVM latencies and bandwidth characteristics for performance evaluation of emerging byte-addressable NVMs and their impact on application performance (without modifying or instrumenting their source code) by leveraging features available in commodity hardware. Our emulator is implemented on the three latest Intel Xeon-based processor architectures: Sandy Bridge, Ivy Bridge, and Haswell. To assist researchers and engineers in evaluating design decisions with emerging NVMs, we extend Quartz for emulating the application execution on future systems with two types of memory: fast, regular volatile DRAM and slower persistent memory. We evaluate the effectiveness of our approach by using a set of specially designed memory-intensive benchmarks and real applications. The accuracy of the proposed approach is validated by running these programs both on our emulation platform and a multisocket (NUMA) machine that can support a range of memory latencies. We show that Quartz can emulate a range of performance characteristics with low overhead and good accuracy (with emulation errors 0.2% - 9%).

99 citations

Proceedings Article•10.1109/HPEC.2015.7322467•
Accelerating K-Means clustering with parallel implementations and GPU computing

Janki Bhimani1, Miriam Leeser1, Ningfang Mi1•
Northeastern University1
12 Nov 2015
TL;DR: This paper explores accelerating the performance of K-means clustering using three approaches: 1) shared memory using OpenMP, 2) distributed memory with message passing (MPI), and 3) heterogeneous computing with NVIDIA Graphics Processing Units (GPUs) programmed with CUDA-C.
Abstract: K-Means clustering is a popular unsupervised machine learning method which has been used in diverse applications including image processing, information retrieval, social sciences and weather forecasting. However, clustering is computationally expensive especially when applied to large datasets. In this paper, we explore accelerating the performance of K-means clustering using three approaches: 1) shared memory using OpenMP, 2) distributed memory with message passing (MPI), and 3) heterogeneous computing with NVIDIA Graphics Processing Units (GPUs) programmed with CUDA-C. While others have looked at accelerating K-means clustering, this is the first study that compares these different approaches. In addition, K-means performance is very sensitive to the initial means chosen. We evaluate different initializations in parallel and choose the best one to use for the entire algorithm. We evaluate results on a range of images from small (300×300 pixels) to large (1164×1200 pixels). Our results show that all three parallel programming approaches give speed-up, with the best results obtained by OpenMP for smaller images and CUDA-C for larger ones. Each of these approaches gives approximately thirty times overall speed-up compared to a sequential implementation of K-means. In addition, our parallel initialization gives an additional 1.5 to 2.5 times speed-up over the accelerated parallel versions.
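The idea of trying several initializations concurrently and keeping the best can be sketched in plain Python. This is a simplified illustration, not the paper's OpenMP, MPI or CUDA-C code: it runs Lloyd's algorithm to a fixed iteration count for every seed and ranks the runs by within-cluster sum of squares. All function names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_of(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))


def kmeans(points, k, seed, iterations=20):
    """Run Lloyd's algorithm from one random initialization; return (inertia, means)."""
    rng = random.Random(seed)
    means = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, means[c]))
            clusters[nearest].append(p)
        # Keep the previous mean when a cluster ends up empty.
        means = [mean_of(c) if c else means[i] for i, c in enumerate(clusters)]
    inertia = sum(min(sq_dist(p, m) for m in means) for p in points)
    return inertia, means


def best_of_initializations(points, k, seeds):
    """Evaluate every seed concurrently and return (best_run, all_runs)."""
    with ThreadPoolExecutor() as pool:
        runs = list(pool.map(lambda s: kmeans(points, k, s), seeds))
    return min(runs, key=lambda run: run[0]), runs


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
best, runs = best_of_initializations(points, k=2, seeds=range(4))
```

Threads are used here only to show the structure; in CPython, CPU-bound work like this would need processes or native code to gain real speed-up.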

74 citations

Posted Content•
A distributed-memory package for dense Hierarchically Semi-Separable matrix computations using randomization

François-Henry Rouet1, Xiaoye S. Li1, Pieter Ghysels1, Artem Napov2•
Lawrence Berkeley National Laboratory1, Université libre de Bruxelles2
18 Mar 2015-arXiv: Mathematical Software
TL;DR: This paper presents a distributed-memory library, part of STRUMPACK (STRUctured Matrices PACKage), for computations with dense structured matrices, where a matrix is considered structured if its off-diagonal blocks can be approximated by a rank-deficient matrix with low numerical rank.
Abstract: We present a distributed-memory library for computations with dense structured matrices. A matrix is considered structured if its off-diagonal blocks can be approximated by a rank-deficient matrix with low numerical rank. Here, we use Hierarchically Semi-Separable representations (HSS). Such matrices appear in many applications, e.g., finite element methods, boundary element methods, etc. Exploiting this structure allows for fast solution of linear systems and/or fast computation of matrix-vector products, which are the two main building blocks of matrix computations. The compression algorithm that we use, which computes the HSS form of an input dense matrix, relies on randomized sampling with a novel adaptive sampling mechanism. We discuss the parallelization of this algorithm and also present the parallelization of structured matrix-vector product, structured factorization and solution routines. The efficiency of the approach is demonstrated on large problems from different academic and industrial applications, on up to 8,000 cores. This work is part of a more global effort, the STRUMPACK (STRUctured Matrices PACKage) software package for computations with sparse and dense structured matrices. Hence, although useful in their own right, the routines also represent a step in the direction of a distributed-memory sparse solver.

73 citations

Journal Article•10.1147/JRD.2014.2380199•
Transactional memory support in the IBM POWER8 processor

Hung Q. Le, Guy Lynn Guthrie, Derek Edward Williams, Maged M. Michael, Brad Frey, William J. Starke, Cathy May, Rei Odaira, Takuya Nakaike 
02 Feb 2015-IBM Journal of Research and Development
TL;DR: This paper details the Power ISA transactional memory architecture, the POWER8 implementation of this architecture, and two practical uses of it, Transactional Lock Elision (TLE) and Thread-Level Speculation (TLS), and provides performance results for these uses.
Abstract: With multi-core processors, parallel programming has taken on greater importance. Traditional parallel programming techniques based on critical sections controlled by locking have several well-known drawbacks. To allow for more efficient parallel programming with higher performance, the IBM POWER8™ processor implements a hardware transactional memory facility. Transactional memory allows groups of load and store operations to execute and commit as a single atomic unit without the use of traditional locks, thereby improving performance and simplifying the parallel programming model. The POWER8 transactional memory facility provides a robust capability to execute transactions that can survive interrupts. It also allows non-speculative accesses within transactions, which facilitates debugging and thread-level speculation. Unique challenges caused by implementing transactional memory on top of the Power ISA (Instruction Set Architecture) weakly consistent memory model are addressed. We detail the Power ISA transactional memory architecture, the POWER8 implementation of this architecture, and two practical uses of this architecture—Transactional Lock Elision (TLE) and Thread-Level Speculation (TLS)—and provide performance results for these uses.

72 citations

Posted Content•
An efficient multi-core implementation of a novel HSS-structured multifrontal solver using randomized sampling

Pieter Ghysels, Xiaoye S. Li, François-Henry Rouet, Samuel Williams, Artem Napov 
25 Feb 2015-arXiv: Mathematical Software
TL;DR: In this article, a sparse linear system solver based on a multifrontal variant of Gaussian elimination is presented, which exploits low-rank approximation of the resulting dense frontal matrices.
Abstract: We present a sparse linear system solver that is based on a multifrontal variant of Gaussian elimination, and exploits low-rank approximation of the resulting dense frontal matrices. We use hierarchically semiseparable (HSS) matrices, which have low-rank off-diagonal blocks, to approximate the frontal matrices. For HSS matrix construction, a randomized sampling algorithm is used together with interpolative decompositions. The combination of the randomized compression with a fast ULV HSS factorization leads to a solver with lower computational complexity than the standard multifrontal method for many applications, resulting in speedups up to 7 fold for problems in our test suite. The implementation targets many-core systems by using task parallelism with dynamic runtime scheduling. Numerical experiments show performance improvements over state-of-the-art sparse direct solvers. The implementation achieves high performance and good scalability on a range of modern shared memory parallel systems, including the Intel Xeon Phi (MIC). The code is part of a software package called STRUMPACK -- STRUctured Matrices PACKage, which also has a distributed memory component for dense rank-structured matrices.

72 citations

Bioinspired Programming of Memory Devices for Implementing an Inference Engine
Emerging memory structures have several characteristics that are suitable for neuromorphic implementations. This paper connects the behavior of these "new" memory devices to achieving inference engines and provides a connection between the behavior of different devices and the learning algorithms.

Damien Querlioz, Olivier Bichler, Adrien F. Vincent, Christian Gamrat
1 Jan 2015
TL;DR: In this article, the authors use a neuroscience-inspired model of learning, spike-timing-dependent plasticity, to develop a bio-inspired approach for programming memory devices, which naturally gives rise to an inference engine.
Abstract: Cognitive tasks are essential for the modern applications of electronics, and rely on the capability to perform inference. The Von Neumann bottleneck is an important issue for such tasks, and emerging memory devices offer an opportunity to overcome this issue by fusing computing and memory, in nonvolatile instant ON/OFF systems. A vision for accomplishing this is to use brain-inspired architectures, which excel at inference and do not differentiate between computing and memory. In this work, we use a neuroscience-inspired model of learning, spike-timing-dependent plasticity, to develop a bioinspired approach for programming memory devices, which naturally gives rise to an inference engine. The method is then adapted to different memory devices, including multivalued memories (cumulative memristive device, phase-change memory) and stochastic binary memories (conductive bridge memory, spin transfer torque magnetic tunnel junction). By means of system-level simulations, we investigate several applications, including image recognition and pattern detection within video and auditory data. We compare the results of the different devices. Stochastic binary devices require the use of redundancy, the extent of which depends tremendously on the considered task. A theoretical analysis allows us to understand how the various devices differ, and ties the inference engine to the machine learning algorithm of expectation-maximization. Monte Carlo simulations demonstrate an exceptional robustness of the inference engine with respect to device variations and other issues. A theoretical analysis explains the roots of this robustness. These results highlight a possible new bioinspired paradigm for programming emerging memory devices, allowing the natural learning of a complex inference engine. The physics of the memory devices plays an active role. The results open the way for a reinvention of the role of memory, when solving inference tasks.

66 citations

Journal Article•10.1016/J.JCP.2015.08.016•
A scalable parallel black oil simulator on distributed memory parallel computers

Kun Wang, Hui Liu, Zhangxin Chen
15 Nov 2015-Journal of Computational Physics
TL;DR: The parallel simulator is designed to overcome the performance issues of common simulators that are implemented for personal computers and workstations and is capable of simulating extremely large-scale black oil problems with tens of millions of grid blocks using thousands of MPI processes on parallel computers.
Posted Content•
AccFFT: A library for distributed-memory FFT on CPU and GPU architectures

Amir Gholami, Judith Hill, Dhairya Malhotra, George Biros
26 Jun 2015-arXiv: Distributed, Parallel, and Cluster Computing
TL;DR: AccFFT extends existing FFT libraries for CUDA-enabled Graphics Processing Units (GPUs) to distributed memory clusters and uses an overlapping communication method to reduce the overhead of PCIe transfers from/to GPU.
Abstract: We present a new library for parallel distributed Fast Fourier Transforms (FFT). The importance of FFT in science and engineering and the advances in high performance computing necessitate further improvements. AccFFT extends existing FFT libraries for CUDA-enabled Graphics Processing Units (GPUs) to distributed memory clusters. We use an overlapping communication method to reduce the overhead of PCIe transfers from/to GPU. We present numerical results on the Maverick platform at the Texas Advanced Computing Center (TACC) and on the Titan system at the Oak Ridge National Laboratory (ORNL). We present the scaling of the library up to 4,096 K20 GPUs of Titan.
Journal Article•10.1145/2675359•
Improving Multibank Memory Access Parallelism with Lattice-Based Partitioning

Alessandro Cilardo1, Luca Gallo1•
University of Naples Federico II1
09 Jan 2015-ACM Transactions on Architecture and Code Optimization
TL;DR: This work addresses the problem of automated memory partitioning for emerging architectures, taking into account potentially parallel data accesses to physically independent banks, using the Z-polyhedral model for program analysis and adopting a partitioning scheme based on integer lattices.
Abstract: Emerging architectures, such as reconfigurable hardware platforms, provide the unprecedented opportunity of customizing the memory infrastructure based on application access patterns. This work addresses the problem of automated memory partitioning for such architectures, taking into account potentially parallel data accesses to physically independent banks. Targeted at affine static control parts (SCoPs), the technique relies on the Z-polyhedral model for program analysis and adopts a partitioning scheme based on integer lattices. The approach enables the definition of a solution space including previous works as particular cases. The problem of minimizing the total amount of memory required across the partitioned banks, referred to as storage minimization throughout the article, is tackled by an optimal approach yielding asymptotically zero memory waste or, as an alternative, an efficient approach ensuring arbitrarily small waste. The article also presents a prototype toolchain and a detailed step-by-step case study demonstrating the impact of the proposed technique along with extensive comparisons with alternative approaches in the literature.
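A small worked example may help show what conflict-free multibank partitioning looks like. The sketch below is a textbook-style illustration, not the paper's algorithm or toolchain: it assumes a 5-point stencil access pattern and the hypothetical bank mapping `(i + 2*j) mod 5`, under which the elements sharing a bank form an integer lattice in the index space.

```python
def bank_of(i, j, num_banks=5):
    """Map array element (i, j) to a bank with the linear form (i + 2*j) mod num_banks."""
    return (i + 2 * j) % num_banks


def stencil_banks(i, j):
    """Banks touched by one 5-point stencil iteration centered at (i, j)."""
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    return [bank_of(i + di, j + dj) for di, dj in offsets]


# Check that every iteration reads its five operands from five distinct banks.
conflict_free = all(
    len(set(stencil_banks(i, j))) == 5
    for i in range(1, 20)
    for j in range(1, 20)
)
```

Relative to the center element, the four neighbors land on banks offset by -1, +1, -2 and +2 modulo 5, so all five accesses can proceed in parallel.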
Journal Article•10.1109/TPDS.2014.2320915•
Hotplug or Ballooning: A Comparative Study on Dynamic Memory Management Techniques for Virtual Machines

Haikun Liu1, Hai Jin1, Xiaofei Liao1, Wei Deng1, Bingsheng He2, Cheng-Zhong Xu3 •
Huazhong University of Science and Technology1, Nanyang Technological University2, Wayne State University3
01 May 2015-IEEE Transactions on Parallel and Distributed Systems
TL;DR: This paper develops a runtime system called U-tube, which provides a framework to adopt memory hotplug or ballooning for dynamic memory allocation, implements fine-grained memory hotplug in Xen, and makes a quantitative comparison between memory hotplug and ballooning.
Abstract: In virtualization environments, static memory allocation for virtual machines (VMs) can lead to severe service level agreement (SLA) violations or inefficient use of memory. Dynamic memory allocation mechanisms such as ballooning and memory hotplug were proposed to handle the dynamics of memory demands. However, these mechanisms so far have not been quantitatively or comparatively studied. In this paper, we first develop a runtime system called U-tube, which provides a framework to adopt memory hotplug or ballooning for dynamic memory allocation. We then implement fine-grained memory hotplug in Xen. We demonstrate the effectiveness of U-tube for dynamic memory management through two case studies: dynamic memory balancing and memory overcommitment. With these two case studies, we make a quantitative comparison between memory hotplug and ballooning. The experiments show that there is no absolute winner for different scenarios. Our findings can be very useful for practitioners to choose the suitable dynamic memory management techniques in different scenarios.
Journal Article•10.1007/S40571-015-0047-6•
Ultrascale simulations of non-smooth granular dynamics

Tobias Preclik1, Ulrich Rüde1•
University of Erlangen-Nuremberg1
30 May 2015-Computational particle mechanics
TL;DR: In this paper, the authors present new algorithms for massively parallel granular dynamics simulations on distributed memory architectures using a domain partitioning approach, where collisions are modelled with hard contacts to hide their micro-dynamics and thus to extend the time and length scales.
Abstract: This article presents new algorithms for massively parallel granular dynamics simulations on distributed memory architectures using a domain partitioning approach. Collisions are modelled with hard contacts in order to hide their micro-dynamics and thus to extend the time and length scales that can be simulated. The global multi-contact problem is solved using a non-linear block Gauss-Seidel method that conforms to the subdomain structure. The parallel algorithms employ a sophisticated protocol between processors that delegates algorithmic tasks such as contact treatment and position integration uniquely and robustly to the processors. Communication overhead is minimized through aggressive message aggregation, leading to excellent strong and weak scaling. The robustness and scalability are assessed on three clusters including two peta-scale supercomputers with up to 458,752 processor cores. The simulations can reach unprecedented resolution of up to ten billion ($10^{10}$) non-spherical particles and contacts.
Journal Article•10.1109/TPDS.2014.2345068•
Mammoth: Gearing Hadoop Towards Memory-Intensive MapReduce Applications

Xuanhua Shi1, Chen Ming1, Ligang He2, Xu Xie1, Lu Lu1, Hai Jin1, Yong Chen3, Song Wu1 •
Huazhong University of Science and Technology1, University of Warwick2, Texas Tech University3
01 Aug 2015-IEEE Transactions on Parallel and Distributed Systems
TL;DR: The results show that the Mammoth system can reduce the job execution time by more than 40 percent in typical cases, without requiring any modifications of the Hadoop programs, and can obtain similar performance as Spark when the memory is sufficient.
Abstract: The MapReduce platform has been widely used for large-scale data processing and analysis recently. It works well if the hardware of a cluster is well configured. However, our survey has indicated that common hardware configurations in small- and medium-size enterprises may not be suitable for such tasks. This situation is more challenging for memory-constrained systems, in which the memory is a bottleneck resource compared with the CPU power and thus does not meet the needs of large-scale data processing. The traditional high performance computing (HPC) system is an example of the memory-constrained system according to our survey. In this paper, we have developed Mammoth, a new MapReduce system, which aims to improve MapReduce performance using global memory management. In Mammoth, we design a novel rule-based heuristic to prioritize memory allocation and revocation among execution units (mapper, shuffler, reducer, etc.), to maximize the holistic benefits of the Map/Reduce job when scheduling each memory unit. We have also developed a multi-threaded execution engine, which is based on Hadoop but runs in a single JVM on a node. In the execution engine, we have implemented the algorithm of memory scheduling to realize global memory management, based on which we further developed techniques such as sequential disk accessing, multi-cache and shuffling from memory, and solved the problem of full garbage collection in the JVM. We have conducted extensive experiments to compare Mammoth against the native Hadoop platform. The results show that the Mammoth system can reduce the job execution time by more than 40 percent in typical cases, without requiring any modifications of the Hadoop programs. When a system is short of memory, Mammoth can improve the performance by up to 5.19 times, as observed for I/O intensive applications, such as PageRank. We also compared Mammoth with Spark. Although Spark can achieve better performance than Mammoth for interactive and iterative applications when the memory is sufficient, our experimental results show that for batch processing applications, Mammoth can adapt better to various memory environments and outperform Spark when the memory is insufficient, and can obtain performance similar to Spark when the memory is sufficient. Given the growing importance of supporting large-scale data processing and analysis and the proven success of the MapReduce platform, the Mammoth system can have a promising potential and impact.
Proceedings Article•10.1145/2830772.2830795•
Avoiding information leakage in the memory controller with fixed service policies

Ali Shafiee1, Akhila Gundu1, Manjunath Shevgoor1, Rajeev Balasubramonian1, Mohit Tiwari2 •
University of Utah1, University of Texas at Austin2
5 Dec 2015
TL;DR: This work develops a comprehensive approach to eliminate timing channels in the memory controller by shaping the memory access behavior of each thread into an unchanging pattern and constructing memory access pipelines that process the resulting accesses without resource conflicts, and it shows mathematically that the proposed system yields zero information leakage.
Abstract: Trusted applications frequently execute in tandem with untrusted applications on personal devices and in cloud environments. Since these co-scheduled applications share hardware resources, the latencies encountered by the untrusted application betray information about whether the trusted applications are accessing shared resources or not. Prior studies have shown that such information leaks can be used by the untrusted application to decipher keys or launch covert-channel attacks. Prior work has also proposed techniques to eliminate information leakage in various shared resources. The best known solution to eliminate information leakage in the memory system incurs high performance penalties. This work develops a comprehensive approach to eliminate timing channels in the memory controller that has two key elements: (i) We shape the memory access behavior of each thread so that it has an unchanging memory access pattern. (ii) We show how efficient memory access pipelines can be constructed to process the resulting memory accesses without introducing any resource conflicts. We mathematically show that the proposed system yields zero information leakage. We then show that various page mapping policies can impact the throughput of our secure memory system. We also introduce techniques to re-order requests from different threads to boost performance without leaking information. Our best solution offers throughput that is 27% lower than that of an optimized non-secure baseline, and that is 69% higher than the best known competing scheme.
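The intuition behind a fixed service policy can be shown with a toy simulation. The sketch below assumes a plain round-robin slot schedule, which is a simplification chosen for illustration and not the paper's pipeline design; the function name `completion_times` and the workloads are hypothetical. Because each thread is served only in its own slots, the observer's completion times do not depend on what the other thread does.

```python
def completion_times(arrivals_by_thread, observer, num_slots):
    """Serve requests with a fixed round-robin slot schedule.

    arrivals_by_thread[t] is a sorted list of arrival slots for thread t.
    Slot s belongs to thread s % num_threads; if the owner has nothing
    pending, the slot is left unused (a real controller could issue a
    dummy access there). Returns the slots at which the observer's
    requests complete.
    """
    num_threads = len(arrivals_by_thread)
    next_request = [0] * num_threads
    done = []
    for slot in range(num_slots):
        owner = slot % num_threads
        pending = arrivals_by_thread[owner]
        idx = next_request[owner]
        if idx < len(pending) and pending[idx] <= slot:
            next_request[owner] += 1
            if owner == observer:
                done.append(slot)
    return done


observer_arrivals = [0, 1, 5]
# The co-scheduled thread is idle in one run and saturated in the other.
quiet = completion_times([observer_arrivals, []], observer=0, num_slots=20)
busy = completion_times([observer_arrivals, list(range(20))], observer=0, num_slots=20)
```

Both runs give the observer identical timings, so latency reveals nothing about the other thread; the price is that slots go unused whenever their owner has no pending request, which is consistent with the throughput loss the abstract reports relative to a non-secure baseline.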
Journal Article•10.18637/JSS.V063.I10•
Parallelizing Gaussian Process Calculations in R

Christopher J. Paciorek, Benjamin Lipshitz, Wei Zhuo, Prabhat, Cari G. Kaufman, Rollin C. Thomas 
10 Feb 2015-Journal of Statistical Software
TL;DR: In this paper, a hybrid parallelization approach that uses both threading (shared memory) and message-passing (distributed memory) is proposed to overcome computational and memory constraints on the size of datasets that can be analyzed.
Abstract: We consider parallel computation for Gaussian process calculations to overcome computational and memory constraints on the size of datasets that can be analyzed. Using a hybrid parallelization approach that uses both threading (shared memory) and message-passing (distributed memory), we implement the core linear algebra operations used in spatial statistics and Gaussian process regression in an R package called bigGP that relies on C and MPI. The approach divides the covariance matrix into blocks such that the computational load is balanced across processes while communication between processes is limited. The package provides an API enabling R programmers to implement Gaussian process-based methods by using the distributed linear algebra operations without any C or MPI coding. We illustrate the approach and software by analyzing an astrophysics dataset with n = 67,275 observations.
Proceedings Article•10.1145/2807591.2807604•
Scaling iterative graph computations with GraphMap

Kisung Lee1, Ling Liu2, Karsten Schwan2, Calton Pu2, Qi Zhang2, Yang Zhou2, Emre Yigitoglu2, Pingpeng Yuan3 •
Louisiana State University1, Georgia Institute of Technology2, Huazhong University of Science and Technology3
15 Nov 2015
TL;DR: Extensive experiments on several real-world graphs show that GraphMap outperforms existing distributed memory-based systems for various iterative graph algorithms.
Abstract: In recent years, systems researchers have devoted considerable effort to the study of large-scale graph processing. Existing distributed graph processing systems such as Pregel, based solely on distributed memory for their computations, fail to provide seamless scalability when the graph data and their intermediate computational results no longer fit into the memory; and most distributed approaches for iterative graph computations do not consider utilizing secondary storage a viable solution. This paper presents GraphMap, a distributed iterative graph computation framework that maximizes access locality and speeds up distributed iterative graph computations by effectively utilizing secondary storage. GraphMap has three salient features: (1) It distinguishes data states that are mutable during iterative computations from those that are read-only in all iterations to maximize sequential access and minimize random access. (2) It entails a two-level graph partitioning algorithm that enables balanced workloads and locality-optimized data placement. (3) It contains a proposed suite of locality-based optimizations that improve computational efficiency. Extensive experiments on several real-world graphs show that GraphMap outperforms existing distributed memory-based systems for various iterative graph algorithms.
Journal Article•10.1016/J.ANUCENE.2014.11.005•
Optimization and parallelization of the thermal–hydraulic subchannel code CTF for high-fidelity multi-physics applications

Robert Salko1, Rodney C. Schmidt2, Maria Avramova3•
Oak Ridge National Laboratory1, Sandia National Laboratories2, Pennsylvania State University3
01 Oct 2015-Annals of Nuclear Energy
TL;DR: Major improvements to the computational infrastructure of the CTF subchannel code are described so that full-core, pincell-resolved simulations can now be performed in much shorter run-times, either in stand-alone mode or as part of coupled-code multi-physics calculations.
Journal Article•10.1145/2779052•
Parallel Scheduling of Task Trees with Limited Memory

Lionel Eyraud-Dubois1, Loris Marchal2, Oliver Sinnen3, Frédéric Vivien2•
University of Bordeaux1, University of Lyon2, University of Auckland3
29 Jun 2015
TL;DR: In this article, the authors consider the execution of tree-shaped task graphs on multiple processors and design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan.
Abstract: This article investigates the execution of tree-shaped task graphs using multiple processors. Each edge of such a tree represents some large data. A task can only be executed if all input and output data fit into memory, and a data item can only be removed from memory after the completion of the task that uses it as input. Such trees arise in the multifrontal method of sparse matrix factorization. The peak memory needed for the processing of the entire tree depends on the execution order of the tasks. With one processor, the objective of the tree traversal is to minimize the required memory. This problem was well studied, and optimal polynomial algorithms were proposed. Here, we extend the problem by considering multiple processors, which is of obvious interest in the application area of matrix factorization. With multiple processors comes the additional objective to minimize the time needed to traverse the tree, that is, to minimize the makespan. Not surprisingly, this problem proves to be much harder than the sequential one. We study the computational complexity of this problem and provide inapproximability results even for unit weight trees. We design a series of practical heuristics achieving different trade-offs between the minimization of peak memory usage and makespan. Some of these heuristics are able to process a tree while keeping the memory usage under a given memory limit. The different heuristics are evaluated in an extensive experimental evaluation using realistic trees.
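The dependence of peak memory on execution order, as described in the abstract above, can be illustrated with a small sketch. This is not the authors' algorithm or any of their heuristics; it only evaluates the memory model stated in the abstract for a single processor. The function name `peak_memory`, the dictionary-based tree encoding, and the example sizes are all hypothetical choices made for illustration.

```python
def peak_memory(parent, out_size, order):
    """Peak memory of a sequential (one-processor) traversal of a task tree.

    parent[t]   -- parent of task t, or None for the root (assumed encoding)
    out_size[t] -- size of the data on the edge from t to its parent
    order       -- execution order; every task must come after its children
    """
    # Build the child lists from the parent map.
    children = {t: [] for t in parent}
    for task, par in parent.items():
        if par is not None:
            children[par].append(task)

    resident = 0   # data currently held in memory
    peak = 0
    done = set()
    for task in order:
        if any(c not in done for c in children[task]):
            raise ValueError(f"task {task!r} scheduled before one of its children")
        # Inputs (children's outputs) are already resident; the output
        # must also fit while the task runs.
        during = resident + out_size[task]
        peak = max(peak, during)
        # Inputs can be freed only once the consuming task has completed.
        resident = during - sum(out_size[c] for c in children[task])
        done.add(task)
    return peak


if __name__ == "__main__":
    # Hypothetical tree: root r has children a and b; a has leaves a1, a2.
    parent = {"r": None, "a": "r", "b": "r", "a1": "a", "a2": "a"}
    out_size = {"r": 0, "a": 1, "b": 5, "a1": 4, "a2": 4}
    print(peak_memory(parent, out_size, ["a1", "a2", "a", "b", "r"]))
    print(peak_memory(parent, out_size, ["b", "a1", "a2", "a", "r"]))
```

In this toy tree, finishing the subtree under `a` before starting `b` keeps the large output of `b` out of memory while the leaves are processed, which is why the two orders give different peaks.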
Proceedings Article•10.1109/PESGM.2015.7285591•
Parallel massive-thread electromagnetic transient simulation on GPU

Zhiyin Zhou1, Venkata Dinavahi1•
University of Alberta1
26 Jul 2015
TL;DR: A massive-thread EMT program (MT-EMTP) is proposed, together with an efficient node-mapping structure that transforms the original power system admittance matrix into a block-node diagonal sparse format to exploit the massive-thread parallel GPU architecture.
Abstract: The electromagnetic transient (EMT) simulation of a large-scale power system consumes so much computational power that parallel programming techniques are urgently needed in this area. For example, realistic-sized power systems include thousands of buses, generators, and transmission lines. Massive-thread computing is one of the key developments that can increase the EMT computational capabilities substantially when the processing unit has enough hardware cores. Compared to the traditional CPU, the graphics processing unit (GPU) has many more cores with distributed memory, which can offer higher data throughput. This paper proposes a massive-thread EMT program (MT-EMTP) and develops massive-thread parallel modules for linear passive elements, the universal line model, and the universal machine model for offline EMT simulation. An efficient node-mapping structure is proposed to transform the original power system admittance matrix into a block-node diagonal sparse format to exploit the massive-thread parallel GPU architecture. The developed MT-EMTP program has been tested on large-scale power systems of up to 2458 three-phase buses with detailed component modeling. The simulation results and execution times are compared with mainstream commercial software, EMTP-RV, to show the improvement in performance with equivalent accuracy.
Journal Article•10.1016/J.PARCO.2015.02.005•
Evaluation of connected-component labeling algorithms for distributed-memory systems

Jeremy Iverson1, Chandrika Kamath1, George Karypis2•
Lawrence Livermore National Laboratory1, University of Minnesota2
1 May 2015
TL;DR: Under weak scaling and for certain classes of graphs, the graph contraction algorithm scales consistently better than the four other algorithms; it also uses significantly less memory than two of the alternative methods and is of the same order in terms of memory as the other two.
Abstract: Introduces a graph contraction based distributed-memory connected component algorithm. Four alternative distributed-memory connected component algorithms are presented. Theoretical and experimental analysis is presented for the five algorithms. Classes of problems under which the algorithms are most applicable are identified. Novel algorithm shows better scalability across the range of scientific computing graphs used herein. Connected component labeling is a key step in a wide range of applications, such as community detection in social networks and coherent structure identification in massively-parallel scientific simulations. There have been several distributed-memory connected component algorithms described in literature; however, little has been done regarding their scalability analysis. Theoretical and experimental results are presented for five algorithms: three that are direct implementations of previous approaches, one that is an implementation of a previous approach that is optimized to reduce communication, and one that is a novel approach based on graph contraction. Under weak scaling and for certain classes of graphs, the graph contraction algorithm scales consistently better than the four other algorithms. Furthermore, it uses significantly less memory than two of the alternative methods and is of the same order in terms of memory as the other two.
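For readers unfamiliar with the underlying task, connected component labeling can be sketched on a single process with a union-find structure. This is a serial baseline only, not one of the five distributed-memory algorithms evaluated in the paper and not the graph contraction approach; the function name `label_components` and the edge-list input format are assumptions made for illustration.

```python
def label_components(num_vertices, edges):
    """Label each vertex with the smallest vertex id in its component.

    Serial union-find sketch; vertices are assumed to be 0..num_vertices-1
    and edges an iterable of (u, v) pairs.
    """
    parent = list(range(num_vertices))

    def find(v):
        # Follow parent links to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            # Always keep the smaller id as the root, so the final label
            # of a component is its minimum vertex id.
            if root_u < root_v:
                parent[root_v] = root_u
            else:
                parent[root_u] = root_v

    return [find(v) for v in range(num_vertices)]


if __name__ == "__main__":
    print(label_components(6, [(0, 1), (1, 2), (4, 5)]))
```

In a distributed-memory setting each process holds only part of the edge list, so labels computed locally must still be reconciled across processes; the cost of that reconciliation is what the scalability analysis in the paper is concerned with.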
Proceedings Article•10.1145/2688500.2688515•
Distributed memory code generation for mixed Irregular/Regular computations

Mahesh Ravishankar1, Roshan Dathathri1, Venmugil Elango1, Louis-Noël Pouchet1, J. Ramanujam2, Atanas Rountev1, P. Sadayappan1 •
Ohio State University1, Louisiana State University2
24 Jan 2015
TL;DR: The proposed code generation framework addresses a major memory bottleneck of prior techniques that limits the scalability of the generated code and is demonstrated on computations that are mixed regular/irregular, completely regular, and completely irregular.
Abstract: Many applications feature a mix of irregular and regular computational structures. For example, codes using adaptive mesh refinement (AMR) typically use a collection of regular blocks, where the number of blocks and the relationship between blocks is irregular. The computational structure in such applications generally involves regular (affine) loop computations within some number of innermost loops, while outer loops exhibit irregularity due to data-dependent control flow and indirect array access patterns. Prior approaches to distributed memory parallelization do not handle such computations effectively. They either target loop nests that are completely affine using polyhedral frameworks, or treat all loops as irregular. Consequently, the generated distributed memory code contains artifacts that disrupt the regular nature of previously affine innermost loops of the computation. This hampers subsequent optimizations to improve on-node performance. We propose a code generation framework that can effectively transform such applications for execution on distributed memory systems. Our approach generates distributed memory code which preserves program properties that enable subsequent polyhedral optimizations. Simultaneously, it addresses a major memory bottleneck of prior techniques that limits the scalability of the generated code. The effectiveness of the proposed framework is demonstrated on computations that are mixed regular/irregular, completely regular, and completely irregular.
Journal Article•10.1016/J.CPC.2014.09.011•
pyCTQW: A continuous-time quantum walk simulator on distributed memory computers

Josh Izaac1, Jingbo Wang1•
University of Western Australia1
01 Jan 2015-Computer Physics Communications
TL;DR: The Chebyshev and Krylov-subspace methods for calculating the quantum walk propagation are provided, as well as methods for visualization and data analysis, and the performance behavior of pyCTQW on a distributed memory cluster is demonstrated.
Patent•
Distributed shared memory using interconnected atomic transaction engines at respective memory interfaces

Rishabh Jain1, Erik Schlanger1•
Business International Corporation1
23 Sep 2015
TL;DR: A hardware-assisted distributed memory system may include software-configurable shared memory regions in the local memory of each of multiple processor cores; these regions are accessed through a network of on-chip atomic transaction engine (ATE) instances, one per core, over a private interconnect matrix that connects them together.
Abstract: A hardware-assisted Distributed Memory System may include software configurable shared memory regions in the local memory of each of multiple processor cores. Accesses to these shared memory regions may be made through a network of on-chip atomic transaction engine (ATE) instances, one per core, over a private interconnect matrix that connects them together. For example, each ATE instance may issue Remote Procedure Calls (RPCs), with or without responses, to an ATE instance associated with a remote processor core in order to perform operations that target memory locations controlled by the remote processor core. Each ATE instance may process RPCs (atomically) that are received from other ATE instances or that are generated locally. For some operation types, an ATE instance may execute the operations identified in the RPCs itself using dedicated hardware. For other operation types, the ATE instance may interrupt its local processor core to perform the operations.
Proceedings Article•10.2118/176045-MS•
A Parallel Framework for Reservoir Simulators on Distributed-memory Supercomputers

Hui Liu, Kun Wang, Zhangxin Chen, Kirk E. Jordan, Jia Luo, Hui Deng 
20 Oct 2015
Proceedings Article•10.1109/WI-IAT.2015.186•
SparkRDF: Elastic Discreted RDF Graph Processing Engine With Distributed Memory

Xi Chen1, Huajun Chen1, Ningyu Zhang1, Songyang Zhang1•
Zhejiang University1
6 Dec 2015
TL;DR: The extensive evaluation demonstrates that SparkRDF can efficiently implement non-selective joins faster than both current state-of-the-art distributed and centralized stores, while being able to process other queries in real time, scaling linearly to the amount of data.
Abstract: With the explosive growth of semantic data on the Web over the past years, many large-scale RDF knowledge bases with billions of facts are being generated. This poses significant challenges for the storage and querying of big RDF graphs. Current systems still have many limitations in processing big RDF graphs, including scalability and real-time performance. In this paper, we introduce SparkRDF, an elastic discreted RDF graph processing engine with distributed memory. To reduce the high I/O and communication cost in distributed processing platforms, SparkRDF implements SPARQL queries on Spark, a novel in-memory distributed computing framework for big data processing. All the intermediate results are modeled as Resilient Discreted SubGraph, which are cached in the distributed memory to support fast iterative join operations. To cut down the search space and avoid the overhead of memory, we split the RDF graph into small Multi-layer Elastic SubGraph based on the relations and classes. For SPARQL query optimization, SparkRDF deploys a series of optimization strategies, leading to effective reduction in the size of intermediate results, the number of joins and the cost of communication. Our extensive evaluation demonstrates that SparkRDF can efficiently implement non-selective joins faster than both current state-of-the-art distributed and centralized stores, while being able to process other queries in real time, scaling linearly to the amount of data.
Journal Article•10.1007/S11265-014-0953-5•
Embedded Multi-Core Systems Dedicated to Dynamic Dataflow Programs

Herve Yviquel1, Alexandre Sanchez1, Pekka Jääskeläinen2, Jarmo Takala2, Mickaël Raulet1, Emmanuel Casseau3 •
Intelligence and National Security Alliance1, Tampere University of Technology2, University of Rennes3
1 Jul 2015
TL;DR: A dataflow-based design approach to implement video codecs on embedded multi-core platforms based on distributed memory and Transport-Triggered Architecture processor cores is presented and software synthesis techniques to improve dynamic dataflow implementations are described.
Abstract: Multimedia applications and embedded platforms are both becoming very complex in order to improve user experience. Thus, multimedia developers need high-level methods to automate time-consuming and error-prone tasks. Dynamic dataflow modeling is attractive to describe complex applications, such as video codecs, at a high level of abstraction. This paper presents a dataflow-based design approach to implement video codecs on embedded multi-core platforms. First, we introduce a custom architecture model to design low-power multi-core chips based on distributed memory and Transport-Triggered Architecture processor cores. Then, we describe software synthesis techniques to improve dynamic dataflow implementations. This methodology has been implemented into open-source tools and demonstrated on video decoders based on the MPEG-4 Visual standard and the new High Efficiency Video Coding standard. The simulations achieve real-time decoding (40 FPS) of high definition (720p) MPEG-4 Visual video sequences on a custom multi-core platform clocked at 1 GHz, which is an improvement of more than 100% over previously proposed implementations.
Posted Content•
Ultrascale Simulations of Non-smooth Granular Dynamics

[...]

Tobias Preclik1, Ulrich Rüde1•
University of Erlangen-Nuremberg1
23 Jan 2015-arXiv: Computational Engineering, Finance, and Science
TL;DR: New algorithms for massively parallel granular dynamics simulations on distributed memory architectures using a domain partitioning approach that can reach unprecedented resolution of up to ten billion non-spherical particles and contacts are presented.
Abstract: This article presents new algorithms for massively parallel granular dynamics simulations on distributed memory architectures using a domain partitioning approach. Collisions are modelled with hard contacts in order to hide their micro-dynamics and thus to extend the time and length scales that can be simulated. The multi-contact problem is solved using a non-linear block Gauss-Seidel method that conforms to the subdomain structure. The parallel algorithms employ a sophisticated protocol between processors that delegates algorithmic tasks such as contact treatment and position integration uniquely and robustly to the processors. Communication overhead is minimized through aggressive message aggregation, leading to excellent strong and weak scaling. The robustness and scalability are assessed on three clusters including two peta-scale supercomputers with up to 458752 processor cores. The simulations can reach unprecedented resolution of up to ten billion non-spherical particles and contacts.