Top 134 papers published in the topic of Distributed memory in 2020

Showing papers on "Distributed memory" published in 2020

Journal Article•10.1016/J.SIMPA.2020.100037•

AMGCL – A C++ library for efficient solution of large sparse linear systems

Denis Demidov¹•Institutions (1)

1 Nov 2020

TL;DR: AMGCL provides an efficient, flexible, and extensible implementation of several iterative solvers and preconditioners on top of different backends allowing the acceleration of the solution with the help of OpenMP, OpenCL, or CUDA technologies.

Abstract: AMGCL is a header-only C++ library for the solution of large sparse linear systems with algebraic multigrid. The method may be used as a black-box solver for computational problems in various fields, since it does not require any information about the underlying geometry. AMGCL provides an efficient, flexible, and extensible implementation of several iterative solvers and preconditioners on top of different backends allowing the acceleration of the solution with the help of OpenMP, OpenCL, or CUDA technologies. Most algorithms have both shared memory and distributed memory implementations. The library is published under a permissive MIT license.

41 citations

Journal Article•10.1063/1.5141358•

GronOR: Massively parallel and GPU-accelerated non-orthogonal configuration interaction for large molecular systems

T. P. Straatsma¹, Ria Broer, Shirin Faraji, Remco W. A. Havenith, L. E. Aguilar Suarez, R. K. Kathir, M. Wibowo, C. de Graaf•Institutions (1)

National Center for Computational Sciences¹

14 Feb 2020-Journal of Chemical Physics

TL;DR: Algorithm and implementation details, parallel and accelerated performance benchmarks, and an analysis of the sensitivity of the accuracy of results and computational performance to thresholds used in the calculations are presented.

Abstract: GronOR is a program package for non-orthogonal configuration interaction calculations for an electronic wave function built in terms of anti-symmetrized products of multi-configuration molecular fragment wave functions. The two-electron integrals that have to be processed may be expressed in terms of atomic orbitals or in terms of an orbital basis determined from the molecular orbitals of the fragments. The code has been specifically designed for execution on distributed memory massively parallel and Graphics Processing Unit (GPU)-accelerated computer architectures, using an MPI+OpenACC/OpenMP programming approach. The task-based execution model used in the implementation allows for linear scaling with the number of nodes on the largest pre-exascale architectures available, provides hardware fault resiliency, and enables effective execution on systems with distinct central processing unit-only and GPU-accelerated partitions. The code interfaces with existing multi-configuration electronic structure codes that provide optimized molecular fragment orbitals, configuration interaction coefficients, and the required integrals. Algorithm and implementation details, parallel and accelerated performance benchmarks, and an analysis of the sensitivity of the accuracy of results and computational performance to thresholds used in the calculations are presented.

23 citations

Journal Article•10.1016/J.ADVWATRES.2020.103734•

Efficient extraction of pore networks from massive tomograms via geometric domain decomposition

Zohaib Atiq Khan¹, Zohaib Atiq Khan², Ali Elkamel³, Ali Elkamel², Jeff T. Gostick²•Institutions (3)

University of Engineering and Technology, Lahore¹, University of Waterloo², Khalifa University³

01 Nov 2020-Advances in Water Resources

TL;DR: This study presents an efficient workflow to extract pore networks from large porous domains using watershed segmentation with geometric domain decomposition, relying on highly optimized modules such as Dask and Numba to obtain the maximum performance.

23 citations

Journal Article•10.3389/FDATA.2020.00030•

LocationSpark: In-memory Distributed Spatial Query Processing and Optimization.

Mingjie Tang¹, Yongyang Yu², Ahmed R. Mahmood³, Qutaibah M. Malluhi⁴, Mourad Ouzzani⁵, Walid G. Aref³•Institutions (5)

Chinese Academy of Sciences¹, Facebook², Purdue University³, Qatar University⁴, Qatar Computing Research Institute⁵

16 Oct 2020

TL;DR: LocationSpark uses a distributed query scheduler with a new cost model to minimize the cost of spatial query processing; each local computation node optimizes and selects its best local query execution plan based on the indexes and the nature of the spatial queries in that node.

Abstract: Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this paper, we present new techniques for spatial query processing and optimization in an in-memory and distributed setup to address scalability. More specifically, we introduce new techniques that handle the query skew that commonly happens in practice and minimize communication costs accordingly. We propose a distributed query scheduler that uses a new cost model to minimize the cost of spatial query processing. The scheduler generates query execution plans that minimize the effect of query skew. The query scheduler utilizes new spatial indexing techniques based on bitmap filters to forward queries to the appropriate local nodes. Each local computation node is responsible for optimizing and selecting its best local query execution plan based on the indexes and the nature of the spatial queries in that node. All the proposed spatial query processing and optimization techniques are prototyped inside Spark, a distributed memory-based computation system. Our prototype system is termed LocationSpark. The experimental study is based on real datasets and demonstrates that LocationSpark can enhance distributed spatial query processing by up to an order of magnitude over existing in-memory and distributed spatial systems.

21 citations

Journal Article•10.1109/TCI.2020.3008782•

Distributed Iterative CT Reconstruction Using Multi-Agent Consensus Equilibrium

Venkatesh Sridhar¹, Xiao Wang², Gregery T. Buzzard¹, Charles A. Bouman¹•Institutions (2)

Purdue University¹, Boston Children's Hospital²

17 Jul 2020-IEEE Transactions on Computational Imaging

TL;DR: In this article, a multi-agent consensus equilibrium (MACE) algorithm is proposed for distributed MBIR reconstruction across a large number of parallel nodes, where each node stores only a sparse subset of views and a small portion of the system matrix, and each parallel node performs a local sparse-view reconstruction.

Abstract: Model-Based Image Reconstruction (MBIR) methods significantly enhance the quality of computed tomographic (CT) reconstructions relative to analytical techniques, but are limited by high computational cost. In this article, we propose a multi-agent consensus equilibrium (MACE) algorithm for distributing both the computation and memory of MBIR reconstruction across a large number of parallel nodes. In MACE, each node stores only a sparse subset of views and a small portion of the system matrix, and each parallel node performs a local sparse-view reconstruction, which based on repeated feedback from other nodes, converges to the global optimum. Our distributed approach can also incorporate advanced denoisers as priors to enhance reconstruction quality. In this case, we obtain a parallel solution to the serial framework of Plug-n-play (PnP) priors, which we call MACE-PnP. In order to make MACE practical, we introduce a partial update method that eliminates nested iterations and prove that it converges to the same global solution. Finally, we validate our approach on a distributed memory system with real CT data. We also demonstrate an implementation of our approach on a massive supercomputer that can perform large-scale reconstruction in real-time.

19 citations

Proceedings Article•10.1109/BIGDATA50022.2020.9378050•

HeAT – a Distributed and GPU-accelerated Tensor Framework for Data Analytics

Markus Götz¹, Charlotte Debus², Daniel Coquelin³, Kai Krajsek³, Claudia Comito³, Philipp Knechtges², Björn Hagemeier³, Michael Tarnawa³, Simon Hanselmann¹, Martin Siggel², Achim Basermann², Achim Streit¹•Institutions (3)

Karlsruhe Institute of Technology¹, German Aerospace Center², Forschungszentrum Jülich³

10 Dec 2020

TL;DR: HeAT, an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API, is introduced, which achieves speedups of up to two orders of magnitude.

Abstract: To cope with the rapid growth in available data, the efficiency of data analysis and machine learning libraries has recently received increased attention. Although great advancements have been made in traditional array-based computations, most are limited by the resources available on a single computation node. Consequently, novel approaches must be made to exploit distributed resources, e.g. distributed memory architectures. To this end, we introduce HeAT, an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API. HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload on arbitrarily large high-performance computing systems via MPI. It provides both low-level array computations, as well as assorted higher-level algorithms. With HeAT, it is possible for a NumPy user to take full advantage of their available resources, significantly lowering the barrier to distributed data analysis. When compared to similar frameworks, HeAT achieves speedups of up to two orders of magnitude.

18 citations

Book Chapter•10.1007/978-3-030-68035-0_12•

Parallel/Distributed Generative Adversarial Neural Networks for Data Augmentation of COVID-19 Training Images

Jamal Toutouh¹, Mathias Esteban², Sergio Nesmachnow²•Institutions (2)

Massachusetts Institute of Technology¹, University of the Republic²

2 Sep 2020

TL;DR: An approach using parallel/distributed generative adversarial networks for image data augmentation is applied to generate COVID-19 training samples for computational intelligence methods; results indicate that the proposed model is able to generate accurate images and that the $3\times 3$ version of the distributed GAN trains more robustly, yielding better and more diverse images.

Abstract: This article presents an approach using parallel/distributed generative adversarial networks for image data augmentation, applied to generate COVID-19 training samples for computational intelligence methods. This is a relevant problem nowadays, considering the recent COVID-19 pandemic. Computational intelligence and learning methods are useful tools to assist physicians in the process of diagnosing diseases and acquire valuable medical knowledge. A specific generative adversarial network approach trained using a co-evolutionary algorithm is implemented, including a three-level parallel approach combining distributed memory and fine-grained parallelization using CPU and GPU. The experimental evaluation of the proposed method was performed on the high performance computing infrastructure provided by National Supercomputing Center, Uruguay. The main experimental results indicate that the proposed model is able to generate accurate images and the $3\times 3$ version of the distributed GAN has better robustness properties of its training process, allowing to generate better and more diverse images.

17 citations

Posted Content•10.5555/3433701.3433732•

Distributed-Memory DMRG via Sparse and Dense Parallel Tensor Contractions

Ryan Levy, Edgar Solomonik, Bryan K. Clark¹•Institutions (1)

University of Illinois at Urbana–Champaign¹

10 Jul 2020-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: It is demonstrated that despite having limited concurrency, DMRG is weakly scalable with the use of efficient parallel tensor contraction mechanisms, and enables higher accuracy calculations via larger tensors for quantum state approximation.

Abstract: The Density Matrix Renormalization Group (DMRG) algorithm is a powerful tool for solving eigenvalue problems to model quantum systems. DMRG relies on tensor contractions and dense linear algebra to compute properties of condensed matter physics systems. However, its efficient parallel implementation is challenging due to limited concurrency, large memory footprint, and tensor sparsity. We mitigate these problems by implementing two new parallel approaches that handle block sparsity arising in DMRG, via Cyclops, a distributed memory tensor contraction library. We benchmark their performance on two physical systems using the Blue Waters and Stampede2 supercomputers. Our DMRG performance is improved by up to 5.9X in runtime and 99X in processing rate over ITensor, at roughly comparable computational resource use. This enables higher accuracy calculations via larger tensors for quantum state approximation. We demonstrate that despite having limited concurrency, DMRG is weakly scalable with the use of efficient parallel tensor contraction mechanisms.

15 citations

Journal Article•10.1016/J.IJMULTIPHASEFLOW.2020.103293•

Development of a CPU/GPU portable software library for Lagrangian–Eulerian simulations of liquid sprays

Wenjun Ge¹, Ramanan Sankaran¹, Jacqueline H. Chen²•Institutions (2)

Oak Ridge National Laboratory¹, Sandia National Laboratories²

01 Jul 2020-International Journal of Multiphase Flow

TL;DR: Grit is a performance-portable library for tracking Lagrangian particles in parallel on central processing unit (CPU) and graphics processing unit (GPU) accelerated high performance computing (HPC) architectures; a conservative formulation has been developed and implemented in Grit for phase coupling with thermodynamic consistency.

14 citations

Proceedings Article•10.1109/SC41405.2020.00089•

Pencil: A Pipelined Algorithm for Distributed Stencils

Hengjie Wang¹, Aparna Chandramowlishwaran¹•Institutions (1)

University of California, Irvine¹

1 Nov 2020

TL;DR: In this article, a pipelined stencil algorithm called Pencil is proposed for distributed memory machines; it applies to practical CFD problems that span multiple iteration spaces.

Abstract: Stencil computations are at the core of various Computational Fluid Dynamics (CFD) applications and have been well-studied for several decades. Typically they’re highly memory-bound and as a result, numerous tiling algorithms have been proposed to improve their performance. Although efficient, most of these algorithms are designed for single iteration spaces on shared-memory machines. However, in CFD, we are confronted with multi-block structured grids composed of multiple connected iteration spaces distributed across many nodes. In this paper, we propose a pipelined stencil algorithm called Pencil for distributed memory machines that applies to practical CFD problems that span multiple iteration spaces. Based on an in-depth analysis of cache tiling on a single node, we first identify both the optimal combination of MPI and OpenMP for temporal tiling and the best tiling approach, which outperforms the state-of-the-art automatic parallelization tool Pluto by up to $1.92 \times$. Then, we adopt DeepHalo to decouple the multiple connected iteration spaces so that temporal tiling can be applied to each space. Finally, we achieve overlap by pipelining the computation and communication without sacrificing the advantage from temporal cache tiling. Pencil is evaluated using 4 stencils across 6 numerical schemes on two distributed memory machines with Omni-Path and InfiniBand networks. On the Omni-Path system, Pencil exhibits outstanding weak and strong scalability for up to 128 nodes and outperforms MPI+OpenMP Funneled with space tiling by $1.33\times$ to $3.41\times$ on a multi-block grid with 32 nodes.
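
The decoupling idea mentioned in the abstract can be illustrated with a small, generic sketch. It is not the paper's DeepHalo or Pencil implementation: the 1D three-point stencil, the two-block split, and all function names here are assumptions chosen for illustration. Each block is given a halo as deep as the number of time steps, so it can advance several steps with no exchange; the halo goes stale one layer per step, and only the owned cells are kept.

```python
def jacobi_step(u):
    """One 3-point averaging sweep; the two end cells are held fixed."""
    inner = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def run(u, steps):
    """Apply `steps` sweeps to the whole array."""
    for _ in range(steps):
        u = jacobi_step(u)
    return u


def run_two_blocks_deep_halo(u, steps):
    """Split u in half, give each half `steps` extra halo cells, and advance
    both halves independently with no exchange in between."""
    mid = len(u) // 2
    assert steps <= mid and mid + steps <= len(u), "halo must fit inside the grid"
    left = run(u[: mid + steps], steps)    # owns cells [0, mid)
    right = run(u[mid - steps:], steps)    # owns cells [mid, n)
    # Halo cells go stale one layer per step, so only owned cells are kept.
    return left[:mid] + right[steps:]


grid = [float((7 * i) % 11) for i in range(32)]
reference = run(grid, 4)
blocked = run_two_blocks_deep_halo(grid, 4)
max_error = max(abs(a - b) for a, b in zip(reference, blocked))
```

The trade-off in this sketch is redundant computation in the halo region in exchange for fewer synchronization points, which is what makes temporal tiling inside each block possible.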

14 citations

Proceedings Article•10.1109/PYHPC51966.2020.00007•

Data Engineering for HPC with Python

Vibhatha Abeykoon, Niranda Perera, Chathura Widanage, Supun Kamburugamuve, Thejaka Amila Kanewala¹, Hasara Maithree², Pulasthi Wickramasinghe, Ahmet Uyar, Geoffrey C. Fox•Institutions (2)

Indiana University¹, University of Moratuwa²

1 Nov 2020

TL;DR: In this article, the authors present a distributed Python API based on table abstraction for representing and processing data, which adopts high performance compute kernels in C++, with an in-memory table representation with Cython-based Python bindings.

Abstract: Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine learning applications. There are many structures such as tables, graphs, and trees to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format to load and process data. In this paper, we present a distributed Python API based on table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high performance compute kernels in C++, with an in-memory table representation with Cython-based Python bindings. In the core system, we use MPI for distributed memory computations with a data-parallel approach for processing large datasets in HPC clusters.

Proceedings Article•10.1109/QCE49297.2020.00035•

Cache Blocking Technique to Large Scale Quantum Computing Simulation on Supercomputers

Jun Doi¹, Hiroshi Horii¹•Institutions (1)

IBM¹

1 Oct 2020

TL;DR: A cache blocking technique is applied by inserting swap gates in quantum circuits to decrease data movements in order to solve the scalability issue in parallel quantum computing simulations.

Abstract: Classical computers require large memory resources and computational power to simulate quantum circuits with a large number of qubits. Even supercomputers that can store huge amounts of data face a scalability issue in regard to parallel quantum computing simulations because of the latency of data movements between distributed memory spaces. Here, we apply a cache blocking technique by inserting swap gates in quantum circuits to decrease data movements. We implemented this technique in the open source simulation framework Qiskit Aer. We evaluated our simulator on GPU clusters and observed good scalability.
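
Why swap gates can reduce data movement is easiest to see with a generic statevector layout. The sketch below assumes each process stores a contiguous chunk of 2**local_qubits amplitudes, so the low-order qubits are "local" and the high-order ones are "global". The function names and the layout are illustrative assumptions and are not taken from Qiskit Aer.

```python
def owner(index, local_qubits):
    """Rank of the process holding amplitude `index` when each process stores
    a contiguous chunk of 2**local_qubits amplitudes."""
    return index >> local_qubits


def gate_is_local(qubit, local_qubits, total_qubits):
    """True if every amplitude pair coupled by a single-qubit gate on `qubit`
    lives on one process, so the gate needs no inter-process exchange."""
    return all(
        owner(i, local_qubits) == owner(i ^ (1 << qubit), local_qubits)
        for i in range(1 << total_qubits)
    )


def swap_qubits(index, a, b):
    """Amplitude index after exchanging the roles of qubits a and b."""
    bit_a, bit_b = (index >> a) & 1, (index >> b) & 1
    if bit_a != bit_b:
        index ^= (1 << a) | (1 << b)
    return index


# 4 qubits spread over 4 processes: qubits 0-1 are local, qubits 2-3 are global.
locality = [gate_is_local(q, local_qubits=2, total_qubits=4) for q in range(4)]
```

Under this layout, a gate on a global qubit touches amplitudes on two different processes. Swapping that qubit with a local one costs a single exchange, after which a run of gates on it can execute inside each process's own chunk.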

Journal Article•10.1016/J.MICPRO.2019.102950•

General memory efficient packet matching FPGA architecture for future high-speed networks

Michal Kekely¹, Lukas Kekely², Jan Kořenek¹•Institutions (2)

Brno University of Technology¹, CESNET²

01 Mar 2020-Microprocessors and Microsystems

TL;DR: A unique parallel hardware architecture for hash-based exact match classification of multiple packets in each clock cycle that offers a reduction of memory replication requirements and is able to maintain a rather high throughput of matching multiple packets per clock cycle even without fully replicated memory resources in matching tables.

Journal Article•10.1371/JOURNAL.PONE.0239741•

Big Data in metagenomics: Apache Spark vs MPI

José Manuel Abuín¹, José Manuel Abuín², Nuno Lopes¹, Luís Carlos de Souza Ferreira¹, Tomás F. Pena², Bertil Schmidt³•Institutions (3)

Ipca Laboratories Ltd.¹, University of Santiago de Compostela², University of Mainz³

06 Oct 2020-PLOS ONE

TL;DR: The new MetaCache-MPI version is faster in both building and querying the database and uses less RAM memory, when compared with MetaCacheSpark, while keeping the accuracy of the original implementation.

Abstract: The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information in distributed memory clusters of commodity hardware. Several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions allow developers to focus on the problem while the need to deal with low level details, such as data distribution schemes or communication patterns among processing nodes, can be ignored. However, performance and scalability are also of high importance when dealing with increasing problem sizes, which makes High Performance Computing (HPC) technologies such as the message passing interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark based software for detection and quantification of species composition in food samples, has been proposed. This tool can be used to analyze high throughput sequencing data sets of metagenomic DNA and allows for dealing with large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory efficient solution for computing clusters which is based on MPI instead of Apache Spark. In order to evaluate its performance, a comparison is performed between the original single CPU version of MetaCache, the Spark version and the MPI version we are introducing. Results show that for 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM memory used by Spark for building a metagenomics database. For querying this database, also with 32 processes, the MPI version is 3.11× faster, while using 55.56% of the memory used by Spark. We conclude that the new MetaCache-MPI version is faster in both building and querying the database and uses less RAM memory, when compared with MetaCacheSpark, while keeping the accuracy of the original implementation.

Journal Article•10.1007/S11227-019-03097-W•

GraphMap: scalable iterative graph processing using NoSQL

Sayan Goswami¹, Ayam Pokhrel², Kisung Lee², Ling Liu³, Qi Zhang⁴, Yang Zhou⁵•Institutions (5)

Louisiana State University in Shreveport¹, Louisiana State University², Georgia Institute of Technology³, IBM⁴, Auburn University⁵

01 Sep 2020-The Journal of Supercomputing

TL;DR: This paper presents a new distributed iterative graph computation framework, called GraphMap, that utilizes a disk-based NoSQL database system for scalable graph processing while ensuring competitive performance.

Abstract: Despite having several distributed graph processing frameworks, scalable iterative processing of large graphs is a challenging problem since the graph and intermediate data need a global view of the graph topology in distributed memory. Although some systems support out-of-core iterative computations, they use a single machine and often require fast storage. In this paper, we present a new distributed iterative graph computation framework, called GraphMap, that utilizes a disk-based NoSQL database system for scalable graph processing while ensuring competitive performance. Extensive experiments on several real-world graphs show that GraphMap is more scalable and often faster than existing distributed memory-based systems for various graph processing workloads.

Proceedings Article•10.1145/3332466.3374537•

YewPar: skeletons for exact combinatorial search

Blair Archibald¹, Patrick Maier², Robert Stewart³, Phil Trinder¹•Institutions (3)

University of Glasgow¹, University of Stirling², Heriot-Watt University³

19 Feb 2020

TL;DR: This work aims to improve the reuse of intricate parallel search implementations by providing the first general purpose scalable parallel framework for exact combinatorial search, YewPar, and introduces Lazy Node Generators as a uniform API for search tree generation.

Abstract: Combinatorial search is central to many applications, yet the huge irregular search trees and the need to respect search heuristics make it hard to parallelise. We aim to improve the reuse of intricate parallel search implementations by providing the first general purpose scalable parallel framework for exact combinatorial search, YewPar. We make the following contributions. (1) We present a novel formal model of parallel backtracking search, covering enumeration, decision, and optimisation search. (2) We introduce Lazy Node Generators as a uniform API for search tree generation. (3) We present the design and implementation of 12 widely applicable algorithmic skeletons for tree search on shared and distributed memory architectures. (4) Uniquely in the field we demonstrate how a wide range of parallel search applications can easily be constructed by composing Lazy Node Generators and the search skeletons. (5) We report a systematic performance analysis of all 12 YewPar skeletons on standard instances of 7 search applications, investigating skeleton overheads and scalability up to 255 workers on 17 distributed locations.

Journal Article•10.1109/TCBB.2018.2858797•

Fast de Bruijn Graph Compaction in Distributed Memory Environments

Tony Pan¹, Rahul Nihalani¹, Srinivas Aluru¹•Institutions (1)

Georgia Institute of Technology¹

01 Jan 2020-IEEE/ACM Transactions on Computational Biology and Bioinformatics

TL;DR: The key advantages of the algorithm include bounding the chain compaction run-time to a logarithmic number of iterations in the length of the longest chain, and the ability to differentiate cycles from chains within a logarithmic number of iterations in the length of the longest cycle.

Abstract: De Bruijn graph based genome assembly has gained popularity as short read sequencers become ubiquitous. A core assembly operation is the generation of unitigs, which are sequences corresponding to chains in the graph. Unitigs are used as building blocks for generating longer sequences in many assemblers, and can facilitate graph compression. Chain compaction, by which unitigs are generated, remains a critical computational task. In this paper, we present a distributed memory parallel algorithm for simultaneous compaction of all chains in bi-directed de Bruijn graphs. The key advantages of our algorithm include bounding the chain compaction run-time to logarithmic number of iterations in the length of the longest chain, and ability to differentiate cycles from chains within logarithmic number of iterations in the length of the longest cycle. Our algorithm scales to thousands of computational cores, and can compact a whole genome de Bruijn graph from a human sequence read set in 7.3 seconds using 7680 distributed memory cores, and in 12.9 minutes using 64 shared memory cores. It is $3.7\times$ and $2.0\times$ faster than equivalent steps in the state-of-the-art tools for distributed and shared memory environments, respectively. An implementation of the algorithm is available at https://github.com/ParBLiSS/bruno .
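
The abstract does not say how the logarithmic iteration bound is obtained. One classic technique with that property is pointer jumping, sketched here for illustration only; it is not presented as the authors' algorithm. Every node repeatedly replaces its pointer with its successor's pointer, so the distance covered doubles each round. The input format (a dict in which each chain end points to itself) is an assumption, and cycles are deliberately excluded because this simple loop would not terminate on them.

```python
def pointer_jumping(successor):
    """Make every node point at the end of its chain.

    `successor` maps node -> next node; a chain end maps to itself.
    Assumes the input contains no cycles other than those self-loops.
    Returns the final pointers and the number of rounds that changed anything.
    """
    nxt = dict(successor)
    rounds = 0
    while True:
        # All nodes jump at once: point to the successor's successor.
        jumped = {node: nxt[nxt[node]] for node in nxt}
        if jumped == nxt:
            return nxt, rounds
        nxt = jumped
        rounds += 1


# A single chain 0 -> 1 -> ... -> 7, with node 7 as the chain end.
chain = {i: i + 1 for i in range(7)}
chain[7] = 7
ends, rounds = pointer_jumping(chain)
```

In a distributed memory setting each round would correspond to one bulk exchange of pointers between processes, which is why the number of rounds matters more than the total work.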

Proceedings Article•10.1109/EICONRUS49466.2020.9039470•

Comparative Performance Study of Shared and Distributed Memory Dynamic Programming Algorithms

Mikhail Posypkin¹, Si Thu Thant Sin¹•Institutions (1)

National Research University of Electronic Technology¹

1 Jan 2020

TL;DR: This paper studies the standard table-based dynamic programming algorithm for the Knapsack Problem on serial, shared memory parallel, and distributed memory multiprocessors, using the OpenMP and MPI parallel programming tools for experimental evaluation.

Abstract: This paper is devoted to the experimental and theoretical study of parallel versions of the dynamic programming algorithm for the Knapsack Problem. We considered the standard table-based dynamic programming algorithm of the Knapsack Problem for serial, shared memory parallel and distributed memory multiprocessors. Experimental comparison and the performance study of shared and distributed memory dynamic programming algorithms is carried out. The OpenMP and MPI parallel programming tools are used for experimental evaluation.
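
For reference, a minimal serial sketch of the standard table-based dynamic programming algorithm for the 0/1 Knapsack Problem follows. The function name and the test instance are illustrative; the paper's OpenMP and MPI versions are not described in the abstract and are not reproduced here.

```python
def knapsack_table(weights, values, capacity):
    """Best total value of a subset of items whose total weight fits capacity."""
    n = len(weights)
    # table[i][c] = best value using the first i items with capacity c.
    table = [[0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        weight, value = weights[i - 1], values[i - 1]
        prev, row = table[i - 1], table[i]
        # Each cell reads only the previous row, so the cells of one row
        # are independent of each other.
        for c in range(capacity + 1):
            row[c] = prev[c]
            if c >= weight and prev[c - weight] + value > row[c]:
                row[c] = prev[c - weight] + value
    return table[n][capacity]


best = knapsack_table([1, 3, 4, 5], [1, 4, 5, 7], capacity=7)
```

Because every cell in a row depends only on the previous row, the inner loop is the natural place to parallelize: a shared memory version could split the capacity range among threads, while a distributed memory version would also need to communicate the part of the previous row that falls in another process's range.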

Proceedings Article•10.1109/IPDPSW50202.2020.00092•

Parallel/distributed implementation of cellular training for generative adversarial neural networks.

Emiliano Perez, Sergio Nesmachnow¹, Jamal Toutouh², Erik Hemberg², Una-May O'Reilly²•Institutions (2)

University of the Republic¹, Massachusetts Institute of Technology²

1 Apr 2020

TL;DR: In this paper, a parallel/distributed implementation of a cellular competitive coevolutionary method to train two populations of GANs is proposed for high performance/supercomputing centers.

Abstract: Generative adversarial networks (GANs) are widely used to learn generative models. GANs consist of two networks, a generator and a discriminator, that apply adversarial learning to optimize their parameters. This article presents a parallel/distributed implementation of a cellular competitive coevolutionary method to train two populations of GANs. A distributed memory parallel implementation is proposed for execution in high performance/supercomputing centers. Efficient results are reported on addressing the generation of handwritten digits (MNIST dataset samples). Moreover, the proposed implementation is able to reduce the training times and scale properly when considering different grid sizes for training.

Proceedings Article•10.1109/SC41405.2020.00028•

Distributed-Memory DMRG via Sparse and Dense Parallel Tensor Contractions

Ryan Levy, Edgar Solomonik, Bryan K. Clark

1 Nov 2020

TL;DR: Two new parallel approaches that handle the block sparsity arising in density matrix renormalization group (DMRG) algorithms are implemented via Cyclops, a distributed memory tensor contraction library.

Abstract: The density matrix renormalization group (DMRG) algorithm is a powerful tool for solving eigenvalue problems to model quantum systems. DMRG relies on tensor contractions and dense linear algebra to compute properties of condensed matter physics systems. However, its efficient parallel implementation is challenging due to limited concurrency, large memory footprint, and tensor sparsity. We mitigate these problems by implementing two new parallel approaches that handle block sparsity arising in DMRG, via Cyclops, a distributed memory tensor contraction library. We benchmark their performance on two physical systems using the Blue Waters and Stampede2 supercomputers. Our DMRG performance is improved by up to 5.9X in runtime and 99X in processing rate over ITensor, at roughly comparable computational resource use. This enables higher accuracy calculations via larger tensors for quantum state approximation. We demonstrate that despite having limited concurrency, DMRG is weakly scalable with the use of efficient parallel tensor contraction mechanisms.

Journal Article•10.1007/S00500-019-04122-Z•

A novel parallel local search algorithm for the maximum vertex weight clique problem in large graphs


Ender Sevinc, Tansel Dokeroglu

1 Mar 2020

TL;DR: The Par-LS algorithm, developed for a distributed memory environment using message passing interface libraries, employs a different exploration strategy at each processor and is reported as one of the best performing algorithms in the literature for solving the MVWCP in large graphs.


Abstract: This study proposes a new parallel local search algorithm (Par-LS) for solving the maximum vertex weight clique problem (MVWCP) in large graphs. Solving the MVWCP in a large graph with millions of edges and vertices is an intractable problem. Parallel local search methods are powerful tools to deal with such problems with their high-performance computation capability. The Par-LS algorithm is developed on a distributed memory environment by using message passing interface libraries and employs a different exploration strategy at each processor. The Par-LS introduces new operators, parallel($\omega$,1)-swap and parallel(1,2)-swap, for searching the neighboring solutions while improving the current solution through iterations. During our experiments, 172 of 173 benchmark problem instances from the DIMACS, BHOSLIB and Network Data Repository graph libraries are solved optimally with respect to the best/optimal reported results. A new best solution for the largest problem instance of the BHOSLIB benchmark (frb100-40) is discovered. The Par-LS algorithm is reported as one of the best performing algorithms in the literature for the solution of the MVWCP in large graphs.


Journal Article•10.3390/APP10072539•

Locality-Sensitive Hashing for Information Retrieval System on Multiple GPGPU Devices


Toan Nguyen Mau, Yasushi Inoguchi

07 Apr 2020-Applied Sciences

TL;DR: This paper proposes an extension of DLSH for big data sets using multiple GPGPUs, in order to increase the capacity and performance of the information retrieval system, and shows that DLSH can be applied to real-life dynamic database systems.


Abstract: It is challenging to build a real-time information retrieval system, especially for systems with high-dimensional big data. To structure big data, many hashing algorithms that map similar data items to the same bucket to advance the search have been proposed. Locality-Sensitive Hashing (LSH) is a common approach for reducing the number of dimensions of a data set, by using a family of hash functions and a hash table. The LSH hash table is an additional component that supports the indexing of hash values (keys) for the corresponding data/items. We previously proposed the Dynamic Locality-Sensitive Hashing (DLSH) algorithm with a dynamically structured hash table, optimized for storage in the main memory and General-Purpose computation on Graphics Processing Units (GPGPU) memory. This supports the handling of constantly updated data sets, such as songs, images, or text databases. The DLSH algorithm works effectively with data sets that are updated with high frequency and is compatible with parallel processing. However, the use of a single GPGPU device for processing big data is inadequate, due to the small memory capacity of GPGPU devices. When using multiple GPGPU devices for searching, we need an effective search algorithm to balance the jobs. In this paper, we propose an extension of DLSH for big data sets using multiple GPGPUs, in order to increase the capacity and performance of the information retrieval system. Different search strategies on multiple DLSH clusters are also proposed to adapt our parallelized system. With significant results in terms of performance and accuracy, we show that DLSH can be applied to real-life dynamic database systems.
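The bucketing idea described in the abstract (similar items hashed to the same bucket, with a hash table indexing keys to items) can be illustrated with a minimal sketch. It assumes a random-hyperplane (sign of projection) hash family, which the abstract does not confirm is the family used by DLSH; all function names here are hypothetical and are not part of any DLSH implementation.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals in dim dimensions (assumed hash family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_key(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def build_table(items, hyperplanes):
    """Hash table mapping each key (bucket) to the ids of the items stored in it."""
    table = {}
    for item_id, vector in items.items():
        table.setdefault(lsh_key(vector, hyperplanes), []).append(item_id)
    return table


def candidates(query, table, hyperplanes):
    """Return the ids sharing the query's bucket; only these need an exact comparison."""
    return table.get(lsh_key(query, hyperplanes), [])


if __name__ == "__main__":
    planes = make_hyperplanes(num_bits=8, dim=3, seed=1)
    items = {"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0], "c": [-1.0, -2.0, -3.0]}
    table = build_table(items, planes)
    print(candidates([1.0, 2.0, 3.0], table, planes))
```

Because a dictionary supports insertion and deletion of individual entries, this layout also hints at why a dynamically structured table suits frequently updated data sets; the multi-GPGPU partitioning and load balancing proposed in the paper are not modeled here.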


Proceedings Article•10.1145/3392717.3392745•

Parallelizing pruned landmark labeling: dealing with dependencies in graph algorithms


Ruoming Jin¹, Zhen Peng², Wendell Wu¹, Feodor F. Dragan¹, Gagan Agrawal³, Bin Ren²•Institutions (3)

Kent State University¹, College of William & Mary², Georgia Regents University³

29 Jun 2020

TL;DR: This paper demonstrates the first scalable parallel implementation of the PLL algorithm that produces the same results as the sequential algorithm, resulting in the Vertex-Centric PLL (VC-PLL) algorithm, which can efficiently execute on graphs with more than a billion edges.


Abstract: To help compute shortest path distances over large graphs efficiently, 2-hop labeling has emerged as a major tool, with Pruned Landmark Labeling (PLL) as a popular algorithm. This paper demonstrates the first scalable parallel implementation of the PLL algorithm that produces the same results as the sequential algorithm. Based on theoretical analysis, we show how computations on each vertex can be performed in parallel while maintaining correctness, resulting in the Vertex-Centric PLL (VC-PLL) algorithm. We also show a formulation of this algorithm based on linear algebra and argue why the use of a library based on linear algebra operations will not produce an efficient implementation. Next, we introduce a batched VC-PLL (BVC-PLL) algorithm to reduce the computational inefficiency in VC-PLL. We have carried out a parallel implementation of this method for modern clusters, combining shared memory and distributed memory parallelism, that can efficiently execute on graphs with more than a billion edges. We also demonstrate how the BVC-PLL algorithm can be extended to handle directed graphs and weighted graphs and how the version for weighted graphs can benefit from SIMD parallelization.
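For readers unfamiliar with the sequential baseline that the paper parallelizes, the following is a minimal sketch of textbook pruned landmark labeling on an unweighted, undirected graph. It is not the VC-PLL or BVC-PLL algorithm from the paper; the function name, the adjacency-dictionary input, and the caller-supplied vertex order are assumptions made for illustration.

```python
from collections import deque


def pruned_landmark_labeling(adj, order):
    """Build 2-hop labels with one pruned BFS per vertex, taken in the given order.

    adj maps each vertex to an iterable of neighbours (undirected, unweighted).
    Returns (labels, query), where labels[v][landmark] is a distance and
    query(u, v) returns the shortest-path distance (inf if disconnected).
    """
    labels = {v: {} for v in adj}

    def query(u, v):
        lu, lv = labels[u], labels[v]
        if len(lu) > len(lv):
            lu, lv = lv, lu
        # Best distance through any landmark present in both labels.
        return min((d + lv[w] for w, d in lu.items() if w in lv), default=float("inf"))

    for root in order:
        dist = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            d = dist[u]
            # Prune: earlier landmarks already certify a distance <= d,
            # so u gets no new label entry and is not expanded further.
            if query(root, u) <= d:
                continue
            labels[u][root] = d
            for w in adj[u]:
                if w not in dist:
                    dist[w] = d + 1
                    queue.append(w)
    return labels, query


if __name__ == "__main__":
    adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: []}
    order = sorted(adj, key=lambda v: -len(adj[v]))  # high degree first (a common heuristic)
    labels, query = pruned_landmark_labeling(adj, order)
    print(query(0, 4), query(1, 2))
```

Each BFS reads the labels written by all earlier landmarks when deciding whether to prune. That ordering is the kind of dependency the paper's title refers to, and it is what makes a parallel version that reproduces the sequential labels non-trivial.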


Journal Article•10.1063/1.5129452•

Techniques for high-performance construction of Fock matrices.


Hua Huang¹, C. David Sherrill¹, Edmond Chow¹•Institutions (1)

Georgia Institute of Technology¹

13 Jan 2020-Journal of Chemical Physics

TL;DR: Techniques for Fock matrix construction that are designed for high performance on shared and distributed memory parallel computers when using Gaussian basis sets are presented and a globally accessible matrix class for accessing distributed Fock and density matrices is presented.


Abstract: This paper presents techniques for Fock matrix construction that are designed for high performance on shared and distributed memory parallel computers when using Gaussian basis sets. Four main techniques are considered. (1) To calculate electron repulsion integrals, we demonstrate batching together the calculation of multiple shell quartets of the same angular momentum class so that the calculation of large sets of primitive integrals can be efficiently vectorized. (2) For multithreaded summation of entries into the Fock matrix, we investigate using a combination of atomic operations and thread-local copies of the Fock matrix. (3) For distributed memory parallel computers, we present a globally accessible matrix class for accessing distributed Fock and density matrices. The new matrix class introduces a batched mode for remote memory access that can reduce the synchronization cost. (4) For density fitting, we exploit both symmetry (of the Coulomb and exchange matrices) and sparsity (of 3-index tensors) and give a performance comparison of density fitting and the conventional direct calculation approach. The techniques are implemented in an open-source software library called GTFock.


Proceedings Article•

Reactive Task Migration for Hybrid MPI+OpenMP Applications


Jannis Klinkenberg¹, Philipp Samfass², Michael Bader², Christian Terboven¹, Matthias S. Müller¹•Institutions (2)

RWTH Aachen University¹, Technische Universität München²

1 Jan 2020

TL;DR: This work presents a novel library for fine-granular task-based reactive load balancing in distributed memory based on MPI and OpenMP, and demonstrates its robustness for work-induced imbalances for a realistic application.


Abstract: Many applications in high performance computing are designed based on underlying performance and execution models. While these models could successfully be employed in the past for balancing load within and between compute nodes, modern software and hardware increasingly make performance predictability difficult if not impossible. Consequently, balancing computational load becomes much more difficult. Aiming to tackle these challenges in search for a general solution, we present a novel library for fine-granular task-based reactive load balancing in distributed memory based on MPI and OpenMP. With our approach, individual migratable tasks can be executed on any MPI rank. The actual executing rank is determined at run time based on online performance data. We evaluate our approach under an enforced power cap and under enforced clock frequency changes for a synthetic benchmark and show its robustness for work-induced imbalances for a realistic application. Our experiments demonstrate speedups of up to 1.31X.


Journal Article•10.1186/S12859-019-3338-8•

pSpatiocyte: a high-performance simulator for intracellular reaction-diffusion systems


Satya N. V. Arjunan, Atsushi Miyauchi, Kazunari Iwamoto, Koichi Takahashi

29 Jan 2020-BMC Bioinformatics

TL;DR: A parallelized Spatiocyte method that demonstrates good accuracies, fast runtimes and a significant performance advantage over well-known microscopic particle methods in large-scale simulations of intracellular reaction-diffusion systems.


Abstract: Studies using quantitative experimental methods have shown that intracellular spatial distribution of molecules plays a central role in many cellular systems. Spatially resolved computer simulations can integrate quantitative data from these experiments to construct physically accurate models of the systems. Although computationally expensive, microscopic resolution reaction-diffusion simulators, such as Spatiocyte, can directly capture intracellular effects comprising diffusion-limited reactions and volume exclusion from crowded molecules by explicitly representing individual diffusing molecules in space. To alleviate the steep computational cost typically associated with the simulation of large or crowded intracellular compartments, we present a parallelized Spatiocyte method called pSpatiocyte. The new high-performance method employs unique parallelization schemes on a hexagonal close-packed (HCP) lattice to efficiently exploit the resources of common workstations and large distributed memory parallel computers. We introduce a coordinate system for fast accesses to HCP lattice voxels, a parallelized event scheduler, a parallelized Gillespie’s direct method for unimolecular reactions, and a parallelized event for diffusion and bimolecular reaction processes. We verified the correctness of pSpatiocyte reaction and diffusion processes by comparison to theory. To evaluate the performance of pSpatiocyte, we performed a series of parallelized diffusion runs on the RIKEN K computer. In the case of fine lattice discretization with low voxel occupancy, pSpatiocyte exhibited 74% parallel efficiency and achieved a speedup of 7686 times with 663552 cores compared to the runtime with 64 cores. In the weak scaling performance, pSpatiocyte obtained efficiencies of at least 60% with up to 663552 cores. When executing the Michaelis-Menten benchmark model on an eight-core workstation, pSpatiocyte required 45- and 55-fold shorter runtimes than Smoldyn and the parallel version of ReaDDy, respectively. As a high-performance application example, we study the dual phosphorylation-dephosphorylation cycle of the MAPK system, a typical reaction network motif in cell signaling pathways. pSpatiocyte demonstrates good accuracies, fast runtimes and a significant performance advantage over well-known microscopic particle methods in large-scale simulations of intracellular reaction-diffusion systems. The source code of pSpatiocyte is available at https://spatiocyte.org.
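The strong scaling figures quoted in the abstract are mutually consistent, which a short check makes explicit. The sketch assumes the usual definition of parallel efficiency relative to a baseline run (measured speedup divided by the ideal speedup, that is, the ratio of core counts); the abstract does not state the exact definition used, and the function name is hypothetical.

```python
def strong_scaling_efficiency(speedup, cores, base_cores):
    """Efficiency relative to a baseline run: measured speedup / ideal speedup."""
    ideal_speedup = cores / base_cores
    return speedup / ideal_speedup


if __name__ == "__main__":
    # Figures from the abstract: 7686x speedup on 663552 cores versus 64 cores.
    efficiency = strong_scaling_efficiency(speedup=7686, cores=663552, base_cores=64)
    print(f"ideal speedup: {663552 // 64}")   # 10368
    print(f"efficiency: {efficiency:.1%}")    # 74.1%
```

With 663552 / 64 = 10368 as the ideal speedup, 7686 / 10368 is roughly 0.741, matching the reported 74% parallel efficiency.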


Proceedings Article•10.1109/ICDCS47774.2020.00021•

DeX: Scaling Applications Beyond Machine Boundaries


Sang-Hoon Kim¹, Ho-Ren Chuang², Robert Lyerly², Pierre Olivier³, Changwoo Min², Binoy Ravindran²•Institutions (3)

Ajou University¹, Virginia Tech², University of Manchester³

13 Mar 2020

TL;DR: DeX is an operating system-level approach that extends the execution boundary of existing applications over multiple machines by allowing threads in a process to be relocated and distributed dynamically through a simple function call.


Abstract: Increasing the computing performance within a single-machine form factor is becoming increasingly difficult due to the complexities in scaling processor interconnects and coherence protocols. On the other hand, converting existing applications to run on multiple nodes requires a significant effort to rewrite application logic in distributed programming models and adapt the code to the underlying network characteristics. This paper presents DeX, an operating system-level approach to extend the execution boundary of existing applications over multiple machines. DeX allows the threads in a process to be relocated and distributed dynamically through a simple function call. DeX makes it trivial for developers to convert any application to be distributed over multiple nodes and for applications to transparently utilize disaggregated resources in a rack-scale system with minimal effort. Evaluation results using a running prototype and eight real applications showed promising results: six out of the eight scaled beyond the single-machine performance on DeX.


Proceedings Article•10.1109/RTAS48715.2020.00-16•

Addressing Resource Contention and Timing Predictability for Multi-Core Architectures with Shared Memory Interconnects


Haitong Wang¹, Neil Audsley¹, Wanli Chang¹•Institutions (1)

University of York¹

21 Apr 2020

TL;DR: Evaluations on simulators and FPGA implementations with synthetic memory workload show that the latency variation is significantly reduced, contributing towards timing predictability of multi-core systems.


Abstract: Multi-core architectures are increasingly being used in real-time embedded systems. In general, such systems have more processors than shared memory modules, potentially causing severe interference over memory accesses. This resource contention could lead to substantial variation in memory access latencies, and thus wide fluctuation in the overall system performance, which is highly undesirable, especially for time-critical applications. In this paper, we address resource contention and timing predictability for multi-core architectures with distributed memory interconnects. We focus on the locally arbitrated interconnect constructed by pipelined multiplexing stages with local arbitration, while the globally arbitrated interconnect employing global scheduling on the same architecture potentially suffers from synchronisation issues and requires strict coordination. Our contributions are mainly threefold: (i) We analyse the resource contention across the memory access data path, and report an accurate calculation method to bound the worst-case behaviour. (ii) We compare the average-case behaviour of the locally arbitrated and the globally arbitrated architectures with experiments, demonstrating varying memory latencies caused by the resource sharing issue. (iii) We propose an architectural modification to smooth resource sharing. Evaluations on simulators and FPGA implementations with synthetic memory workloads show that the latency variation is significantly reduced, contributing towards timing predictability of multi-core systems.


Proceedings Article•10.1109/IPDPSW50202.2020.00048•

Considerations for a Distributed GraphBLAS API


Benjamin Brock¹, Aydin Buluc², Timothy G. Mattson³, Scott McMillan⁴, José E. Moreira⁵, Roger Pearce⁶, Oguz Selvitopi², Trevor Steil⁷•Institutions (7)

University of California, Berkeley¹, Lawrence Berkeley National Laboratory², Intel³, Software Engineering Institute⁴, IBM⁵, Lawrence Livermore National Laboratory⁶, University of Minnesota⁷

18 May 2020

TL;DR: This paper reviews various approaches for a GraphBLAS API for distributed computing and highlights the pros and cons of different approaches rather than advocating for one particular choice.


Abstract: The GraphBLAS emerged from an international effort to standardize linear-algebraic building blocks for computing on graphs and graph-structured data. The GraphBLAS is expressed as a C API and has paved the way for multiple implementations. The GraphBLAS C API, however, does not define how distributed-memory parallelism should be handled. This paper reviews various approaches for a GraphBLAS API for distributed computing. This work is guided by our experience with existing distributed memory libraries. Our goal for this paper is to highlight the pros and cons of different approaches rather than to advocate for one particular choice.
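To make "linear-algebraic building blocks for computing on graphs" concrete, the sketch below expresses breadth-first search as repeated vector-matrix products over the Boolean (OR, AND) semiring, masked by the set of unvisited vertices. It is plain Python and deliberately does not use or imitate the GraphBLAS C API; the sparse row-set representation and the function name are assumptions chosen for brevity.

```python
def bfs_levels(adj_rows, source):
    """BFS levels via frontier-times-matrix products over the (OR, AND) semiring.

    adj_rows[i] is the set of column indices j with A[i][j] = 1 (an edge i -> j),
    i.e. a sparse row of the adjacency matrix. Unreachable vertices get level -1.
    """
    n = len(adj_rows)
    level = [-1] * n
    frontier = {source}  # sparse Boolean vector holding the current frontier
    depth = 0
    while frontier:
        for v in frontier:
            level[v] = depth
        # next[j] = OR over i of (frontier[i] AND A[i][j])
        reached = set()
        for i in frontier:
            reached |= adj_rows[i]
        # Mask: keep only vertices that have not been assigned a level yet.
        frontier = {j for j in reached if level[j] == -1}
        depth += 1
    return level


if __name__ == "__main__":
    adj_rows = [{1, 2}, {3}, {3}, set(), {0}]
    print(bfs_levels(adj_rows, source=0))  # [0, 1, 1, 2, -1]
```

In a distributed memory setting, the rows of the matrix and the entries of the frontier vector would be partitioned across processes. How that partitioning should be exposed to the user is the kind of API question the paper reviews.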


Posted Content•

Data Engineering for HPC with Python.


Indiana University¹, University of Moratuwa²

13 Oct 2020-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: This paper presents a distributed Python API based on table abstraction for representing and processing data, which adopts high performance compute kernels in C++, with an in-memory table representation with Cython-based Python bindings.

