Showing papers on "Distributed memory" published in 2020
Journal Article•10.1016/J.SIMPA.2020.100037•
AMGCL – A C++ library for efficient solution of large sparse linear systems

Denis Demidov1•
Russian Academy of Sciences1
1 Nov 2020
TL;DR: AMGCL provides an efficient, flexible, and extensible implementation of several iterative solvers and preconditioners on top of different backends allowing the acceleration of the solution with the help of OpenMP, OpenCL, or CUDA technologies.
Abstract: AMGCL is a header-only C++ library for the solution of large sparse linear systems with algebraic multigrid. The method may be used as a black-box solver for computational problems in various fields, since it does not require any information about the underlying geometry. AMGCL provides an efficient, flexible, and extensible implementation of several iterative solvers and preconditioners on top of different backends allowing the acceleration of the solution with the help of OpenMP, OpenCL, or CUDA technologies. Most algorithms have both shared memory and distributed memory implementations. The library is published under a permissive MIT license.

41 citations

Journal Article•10.1063/1.5141358•
GronOR: Massively parallel and GPU-accelerated non-orthogonal configuration interaction for large molecular systems

T. P. Straatsma1, Ria Broer, Shirin Faraji, Remco W. A. Havenith, L. E. Aguilar Suarez, R. K. Kathir, M. Wibowo, C. de Graaf •
National Center for Computational Sciences1
14 Feb 2020-Journal of Chemical Physics
TL;DR: Algorithm and implementation details, parallel and accelerated performance benchmarks, and an analysis of the sensitivity of the accuracy of results and computational performance to thresholds used in the calculations are presented.
Abstract: GronOR is a program package for non-orthogonal configuration interaction calculations for an electronic wave function built in terms of anti-symmetrized products of multi-configuration molecular fragment wave functions. The two-electron integrals that have to be processed may be expressed in terms of atomic orbitals or in terms of an orbital basis determined from the molecular orbitals of the fragments. The code has been specifically designed for execution on distributed memory massively parallel and Graphics Processing Unit (GPU)-accelerated computer architectures, using an MPI+OpenACC/OpenMP programming approach. The task-based execution model used in the implementation allows for linear scaling with the number of nodes on the largest pre-exascale architectures available, provides hardware fault resiliency, and enables effective execution on systems with distinct central processing unit-only and GPU-accelerated partitions. The code interfaces with existing multi-configuration electronic structure codes that provide optimized molecular fragment orbitals, configuration interaction coefficients, and the required integrals. Algorithm and implementation details, parallel and accelerated performance benchmarks, and an analysis of the sensitivity of the accuracy of results and computational performance to thresholds used in the calculations are presented.

23 citations

Journal Article•10.1016/J.ADVWATRES.2020.103734•
Efficient extraction of pore networks from massive tomograms via geometric domain decomposition

Zohaib Atiq Khan1,2, Ali Elkamel2,3, Jeff T. Gostick2 •
University of Engineering and Technology, Lahore1, University of Waterloo2, Khalifa University3
01 Nov 2020-Advances in Water Resources
TL;DR: This study presents an efficient workflow for extracting pore networks from large porous domains, combining watershed segmentation with geometric domain decomposition and relying on optimized modules such as Dask and Numba for performance.

23 citations

Journal Article•10.3389/FDATA.2020.00030•
LocationSpark: In-memory Distributed Spatial Query Processing and Optimization.

Mingjie Tang1, Yongyang Yu2, Ahmed R. Mahmood3, Qutaibah M. Malluhi4, Mourad Ouzzani5, Walid G. Aref3 •
Chinese Academy of Sciences1, Facebook2, Purdue University3, Qatar University4, Qatar Computing Research Institute5
16 Oct 2020
TL;DR: LocationSpark uses a distributed query scheduler with a new cost model to minimize the cost of spatial query processing; each local node then optimizes and selects its best local query execution plan based on its indexes and the nature of the spatial queries it receives.
Abstract: Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this paper, we present new techniques for spatial query processing and optimization in an in-memory and distributed setup to address scalability. More specifically, we introduce new techniques for handling the query skew that commonly occurs in practice and for minimizing communication costs accordingly. We propose a distributed query scheduler that uses a new cost model to minimize the cost of spatial query processing. The scheduler generates query execution plans that minimize the effect of query skew. The query scheduler utilizes new spatial indexing techniques based on bitmap filters to forward queries to the appropriate local nodes. Each local computation node is responsible for optimizing and selecting its best local query execution plan based on the indexes and the nature of the spatial queries in that node. All the proposed spatial query processing and optimization techniques are prototyped inside Spark, a distributed memory-based computation system. Our prototype system is termed LocationSpark. The experimental study is based on real datasets and demonstrates that LocationSpark can enhance distributed spatial query processing by up to an order of magnitude over existing in-memory and distributed spatial systems.

21 citations

Journal Article•10.1109/TCI.2020.3008782•
Distributed Iterative CT Reconstruction Using Multi-Agent Consensus Equilibrium

Venkatesh Sridhar1, Xiao Wang2, Gregery T. Buzzard1, Charles A. Bouman1•
Purdue University1, Boston Children's Hospital2
17 Jul 2020-IEEE Transactions on Computational Imaging
TL;DR: In this article, a multi-agent consensus equilibrium (MACE) algorithm is proposed for distributed MBIR reconstruction across a large number of parallel nodes, where each node stores only a sparse subset of views and a small portion of the system matrix, and each parallel node performs a local sparse-view reconstruction.
Abstract: Model-Based Image Reconstruction (MBIR) methods significantly enhance the quality of computed tomographic (CT) reconstructions relative to analytical techniques, but are limited by high computational cost. In this article, we propose a multi-agent consensus equilibrium (MACE) algorithm for distributing both the computation and memory of MBIR reconstruction across a large number of parallel nodes. In MACE, each node stores only a sparse subset of views and a small portion of the system matrix, and each parallel node performs a local sparse-view reconstruction, which, based on repeated feedback from other nodes, converges to the global optimum. Our distributed approach can also incorporate advanced denoisers as priors to enhance reconstruction quality. In this case, we obtain a parallel solution to the serial framework of Plug-n-play (PnP) priors, which we call MACE-PnP. In order to make MACE practical, we introduce a partial update method that eliminates nested iterations and prove that it converges to the same global solution. Finally, we validate our approach on a distributed memory system with real CT data. We also demonstrate an implementation of our approach on a massive supercomputer that can perform large-scale reconstruction in real-time.

19 citations

Proceedings Article•10.1109/BIGDATA50022.2020.9378050•
HeAT – a Distributed and GPU-accelerated Tensor Framework for Data Analytics

Markus Götz1, Charlotte Debus2, Daniel Coquelin3, Kai Krajsek3, Claudia Comito3, Philipp Knechtges2, Björn Hagemeier3, Michael Tarnawa3, Simon Hanselmann1, Martin Siggel2, Achim Basermann2, Achim Streit1 •
Karlsruhe Institute of Technology1, German Aerospace Center2, Forschungszentrum Jülich3
10 Dec 2020
TL;DR: HeAT, an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API, is introduced, which achieves speedups of up to two orders of magnitude.
Abstract: To cope with the rapid growth in available data, the efficiency of data analysis and machine learning libraries has recently received increased attention. Although great advancements have been made in traditional array-based computations, most are limited by the resources available on a single computation node. Consequently, novel approaches must be made to exploit distributed resources, e.g. distributed memory architectures. To this end, we introduce HeAT, an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API. HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload on arbitrarily large high-performance computing systems via MPI. It provides both low-level array computations, as well as assorted higher-level algorithms. With HeAT, it is possible for a NumPy user to take full advantage of their available resources, significantly lowering the barrier to distributed data analysis. When compared to similar frameworks, HeAT achieves speedups of up to two orders of magnitude.

18 citations
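
The HeAT abstract above describes distributing array workloads across nodes via MPI. A minimal sketch of that idea follows, assuming a simple block split along one axis. The ranks are simulated with a loop rather than real MPI processes, and `split_bounds` and `distributed_mean` are hypothetical names chosen for illustration, not part of HeAT's API.

```python
def split_bounds(n, size):
    """Divide n items over `size` ranks as evenly as possible.

    Returns a list of (start, stop) index pairs, one per rank.
    """
    base, extra = divmod(n, size)
    bounds, start = [], 0
    for rank in range(size):
        # The first `extra` ranks each take one additional item.
        stop = start + base + (1 if rank < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def distributed_mean(data, size):
    """Mean of `data` computed from per-rank partial results."""
    chunks = [data[a:b] for a, b in split_bounds(len(data), size)]
    # Each simulated rank reduces only its own chunk.
    partials = [(sum(chunk), len(chunk)) for chunk in chunks]
    # Stand-in for a global reduction (an allreduce in a real MPI program).
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count
```

Calling `distributed_mean(list(range(10)), 3)` splits the ten values into chunks of 4, 3, and 3 items before combining the partial sums.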

Book Chapter•10.1007/978-3-030-68035-0_12•
Parallel/Distributed Generative Adversarial Neural Networks for Data Augmentation of COVID-19 Training Images

Jamal Toutouh1, Mathias Esteban2, Sergio Nesmachnow2•
Massachusetts Institute of Technology1, University of the Republic2
2 Sep 2020
TL;DR: An approach using parallel/distributed generative adversarial networks for image data augmentation is applied to generate COVID-19 training samples for computational intelligence methods; the results indicate that the proposed model is able to generate accurate images and that the $3\times 3$ version of the distributed GAN trains more robustly, generating better and more diverse images.
Abstract: This article presents an approach using parallel/distributed generative adversarial networks for image data augmentation, applied to generate COVID-19 training samples for computational intelligence methods. This is a relevant problem nowadays, considering the recent COVID-19 pandemic. Computational intelligence and learning methods are useful tools to assist physicians in the process of diagnosing diseases and acquiring valuable medical knowledge. A specific generative adversarial network approach trained using a co-evolutionary algorithm is implemented, including a three-level parallel approach combining distributed memory and fine-grained parallelization using CPU and GPU. The experimental evaluation of the proposed method was performed on the high performance computing infrastructure provided by National Supercomputing Center, Uruguay. The main experimental results indicate that the proposed model is able to generate accurate images and that the $3\times 3$ version of the distributed GAN has better robustness properties in its training process, allowing it to generate better and more diverse images.

17 citations

Posted Content•10.5555/3433701.3433732•
Distributed-Memory DMRG via Sparse and Dense Parallel Tensor Contractions

Ryan Levy, Edgar Solomonik, Bryan K. Clark1•
University of Illinois at Urbana–Champaign1
10 Jul 2020-arXiv: Distributed, Parallel, and Cluster Computing
TL;DR: It is demonstrated that despite having limited concurrency, DMRG is weakly scalable with the use of efficient parallel tensor contraction mechanisms, and enables higher accuracy calculations via larger tensors for quantum state approximation.
Abstract: The Density Matrix Renormalization Group (DMRG) algorithm is a powerful tool for solving eigenvalue problems to model quantum systems. DMRG relies on tensor contractions and dense linear algebra to compute properties of condensed matter physics systems. However, its efficient parallel implementation is challenging due to limited concurrency, large memory footprint, and tensor sparsity. We mitigate these problems by implementing two new parallel approaches that handle block sparsity arising in DMRG, via Cyclops, a distributed memory tensor contraction library. We benchmark their performance on two physical systems using the Blue Waters and Stampede2 supercomputers. Our DMRG performance is improved by up to 5.9X in runtime and 99X in processing rate over ITensor, at roughly comparable computational resource use. This enables higher accuracy calculations via larger tensors for quantum state approximation. We demonstrate that despite having limited concurrency, DMRG is weakly scalable with the use of efficient parallel tensor contraction mechanisms.

15 citations

Journal Article•10.1016/J.IJMULTIPHASEFLOW.2020.103293•
Development of a CPU/GPU portable software library for Lagrangian–Eulerian simulations of liquid sprays

Wenjun Ge1, Ramanan Sankaran1, Jacqueline H. Chen2•
Oak Ridge National Laboratory1, Sandia National Laboratories2
01 Jul 2020-International Journal of Multiphase Flow
TL;DR: A performance-portable library, Grit, is developed to track Lagrangian particles in parallel on central processing unit (CPU) and graphics processing unit (GPU) accelerated high performance computing (HPC) architectures; a conservative formulation is implemented in Grit for phase coupling with thermodynamic consistency.

14 citations

Proceedings Article•10.1109/SC41405.2020.00089•
Pencil: A Pipelined Algorithm for Distributed Stencils

Hengjie Wang1, Aparna Chandramowlishwaran1•
University of California, Irvine1
1 Nov 2020
TL;DR: In this article, a pipelined stencil algorithm called Pencil is proposed for distributed memory machines; it applies to practical CFD problems that span multiple iteration spaces.
Abstract: Stencil computations are at the core of various Computational Fluid Dynamics (CFD) applications and have been well-studied for several decades. They are typically highly memory-bound and, as a result, numerous tiling algorithms have been proposed to improve their performance. Although efficient, most of these algorithms are designed for single iteration spaces on shared-memory machines. However, in CFD, we are confronted with multi-block structured grids composed of multiple connected iteration spaces distributed across many nodes. In this paper, we propose a pipelined stencil algorithm called Pencil for distributed memory machines that applies to practical CFD problems that span multiple iteration spaces. Based on an in-depth analysis of cache tiling on a single node, we first identify both the optimal combination of MPI and OpenMP for temporal tiling and the best tiling approach, which outperforms the state-of-the-art automatic parallelization tool Pluto by up to $1.92\times$. Then, we adopt DeepHalo to decouple the multiple connected iteration spaces so that temporal tiling can be applied to each space. Finally, we achieve overlap by pipelining the computation and communication without sacrificing the advantage from temporal cache tiling. Pencil is evaluated using 4 stencils across 6 numerical schemes on two distributed memory machines with Omni-Path and InfiniBand networks. On the Omni-Path system, Pencil exhibits outstanding weak and strong scalability for up to 128 nodes and outperforms MPI+OpenMP Funneled with space tiling by $1.33\times$ to $3.41\times$ on a multi-block grid with 32 nodes.

14 citations
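
The Pencil abstract above mentions deep halos as a way to decouple connected iteration spaces so that several time steps can run locally. The sketch below illustrates only that general idea on a 1D three-point averaging stencil; it is not Pencil's algorithm and does no pipelining. It assumes fixed end values and a domain length divisible by the number of blocks, and all function names are hypothetical.

```python
def step(u):
    """One 3-point averaging step; the two end values are held fixed."""
    interior = [(u[i - 1] + u[i] + u[i + 1]) / 3.0 for i in range(1, len(u) - 1)]
    return [u[0]] + interior + [u[-1]]


def run_global(u, steps):
    """Reference: advance the whole domain `steps` times."""
    for _ in range(steps):
        u = step(u)
    return u


def run_deep_halo(u, blocks, steps):
    """Advance each block independently using a halo `steps` cells deep.

    Assumes len(u) is divisible by `blocks`. Stale values at a block's
    artificial ends creep inward one cell per step, so after `steps`
    steps only halo cells are contaminated and owned cells are exact.
    """
    n = len(u)
    size = n // blocks
    out = []
    for b in range(blocks):
        lo, hi = b * size, (b + 1) * size
        ext_lo, ext_hi = max(0, lo - steps), min(n, hi + steps)
        local = u[ext_lo:ext_hi]  # the single "communication" per block
        for _ in range(steps):
            local = step(local)
        out.extend(local[lo - ext_lo:hi - ext_lo])  # keep owned cells only
    return out
```

The trade-off is one larger exchange and some redundant halo computation in return for fewer synchronization points.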

Proceedings Article•10.1109/PYHPC51966.2020.00007•
Data Engineering for HPC with Python

Vibhatha Abeykoon, Niranda Perera, Chathura Widanage, Supun Kamburugamuve, Thejaka Amila Kanewala1, Hasara Maithree2, Pulasthi Wickramasinghe, Ahmet Uyar, Geoffrey C. Fox •
Indiana University1, University of Moratuwa2
1 Nov 2020
TL;DR: In this article, the authors present a distributed Python API based on table abstraction for representing and processing data, which adopts high performance compute kernels in C++, with an in-memory table representation with Cython-based Python bindings.
Abstract: Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine learning applications. There are many structures such as tables, graphs, and trees to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format to load and process data. In this paper, we present a distributed Python API based on table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high performance compute kernels in C++, with an in-memory table representation with Cython-based Python bindings. In the core system, we use MPI for distributed memory computations with a data-parallel approach for processing large datasets in HPC clusters.
Proceedings Article•10.1109/QCE49297.2020.00035•
Cache Blocking Technique to Large Scale Quantum Computing Simulation on Supercomputers

Jun Doi1, Hiroshi Horii1•
IBM1
1 Oct 2020
TL;DR: A cache blocking technique is applied by inserting swap gates in quantum circuits to decrease data movements in order to solve the scalability issue in parallel quantum computing simulations.
Abstract: Classical computers require large memory resources and computational power to simulate quantum circuits with a large number of qubits. Even supercomputers that can store huge amounts of data face a scalability issue in regard to parallel quantum computing simulations because of the latency of data movements between distributed memory spaces. Here, we apply a cache blocking technique by inserting swap gates in quantum circuits to decrease data movements. We implemented this technique in the open source simulation framework Qiskit Aer. We evaluated our simulator on GPU clusters and observed good scalability.
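
The abstract above describes inserting swap gates so that gates act on data that is already local. A minimal state-vector sketch of why this is valid follows. It assumes (this is not stated in the abstract) that low-numbered qubits index amplitudes inside one memory chunk while high-numbered qubits span chunks. The functions are hypothetical illustrations and do not use the Qiskit Aer API.

```python
import math

# Hadamard gate as a plain 2x2 matrix.
_h = 1.0 / math.sqrt(2.0)
H = [[_h, _h], [_h, -_h]]


def apply_gate(state, gate, q):
    """Apply a 2x2 gate to qubit q (qubit 0 is the least significant bit)."""
    out = list(state)
    bit = 1 << q
    for i in range(len(state)):
        if i & bit == 0:
            a, b = state[i], state[i | bit]
            out[i] = gate[0][0] * a + gate[0][1] * b
            out[i | bit] = gate[1][0] * a + gate[1][1] * b
    return out


def swap_qubits(state, q1, q2):
    """Exchange the roles of qubits q1 and q2 by permuting amplitudes."""
    out = list(state)
    mask = (1 << q1) | (1 << q2)
    for i in range(len(state)):
        if ((i >> q1) & 1) != ((i >> q2) & 1):
            out[i ^ mask] = state[i]
    return out


def apply_gate_blocked(state, gate, high_q, low_q=0):
    """Swap a high qubit into a low slot, apply the gate there, swap back."""
    moved = swap_qubits(state, high_q, low_q)
    moved = apply_gate(moved, gate, low_q)
    return swap_qubits(moved, high_q, low_q)
```

In a real simulator the benefit would come from applying several gates between the two swaps, so the data movement is paid once per block of gates rather than once per gate.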
Journal Article•10.1016/J.MICPRO.2019.102950•
General memory efficient packet matching FPGA architecture for future high-speed networks

Michal Kekely1, Lukas Kekely2, Jan Kořenek1•
Brno University of Technology1, CESNET2
01 Mar 2020-Microprocessors and Microsystems
TL;DR: A unique parallel hardware architecture for hash-based exact match classification of multiple packets in each clock cycle that offers a reduction of memory replication requirements and is able to maintain a rather high throughput of matching multiple packets per clock cycle even without fully replicated memory resources in matching tables.
Journal Article•10.1371/JOURNAL.PONE.0239741•
Big Data in metagenomics: Apache Spark vs MPI

José Manuel Abuín1,2, Nuno Lopes1, Luís Carlos de Souza Ferreira1, Tomás F. Pena2, Bertil Schmidt3 •
Ipca Laboratories Ltd.1, University of Santiago de Compostela2, University of Mainz3
06 Oct 2020-PLOS ONE
TL;DR: The new MetaCache-MPI version is faster in both building and querying the database and uses less RAM than MetaCacheSpark, while keeping the accuracy of the original implementation.
Abstract: The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information in distributed memory clusters of commodity hardware. Several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions allow developers to focus on the problem while the need to deal with low level details, such as data distribution schemes or communication patterns among processing nodes, can be ignored. However, performance and scalability are also of high importance when dealing with increasing problem sizes, which makes High Performance Computing (HPC) technologies such as the message passing interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark based software for detection and quantification of species composition in food samples, has been proposed. This tool can be used to analyze high throughput sequencing data sets of metagenomic DNA and allows for dealing with large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory efficient solution for computing clusters which is based on MPI instead of Apache Spark. In order to evaluate its performance, a comparison is performed between the original single CPU version of MetaCache, the Spark version and the MPI version we are introducing. Results show that for 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM used by Spark for building a metagenomics database. For querying this database, also with 32 processes, the MPI version is 3.11× faster, while using 55.56% of the memory used by Spark. We conclude that the new MetaCache-MPI version is faster in both building and querying the database and uses less RAM than MetaCacheSpark, while keeping the accuracy of the original implementation.
Journal Article•10.1007/S11227-019-03097-W•
GraphMap: scalable iterative graph processing using NoSQL

Sayan Goswami1, Ayam Pokhrel2, Kisung Lee2, Ling Liu3, Qi Zhang4, Yang Zhou5 •
Louisiana State University in Shreveport1, Louisiana State University2, Georgia Institute of Technology3, IBM4, Auburn University5
01 Sep 2020-The Journal of Supercomputing
TL;DR: This paper presents a new distributed iterative graph computation framework, called GraphMap, that utilizes a disk-based NoSQL database system for scalable graph processing while ensuring competitive performance.
Abstract: Despite having several distributed graph processing frameworks, scalable iterative processing of large graphs is a challenging problem since the graph and intermediate data need a global view of the graph topology in distributed memory. Although some systems support out-of-core iterative computations, they use a single machine and often require fast storage. In this paper, we present a new distributed iterative graph computation framework, called GraphMap, that utilizes a disk-based NoSQL database system for scalable graph processing while ensuring competitive performance. Extensive experiments on several real-world graphs show that GraphMap is more scalable and often faster than existing distributed memory-based systems for various graph processing workloads.
Proceedings Article•10.1145/3332466.3374537•
YewPar: skeletons for exact combinatorial search

Blair Archibald1, Patrick Maier2, Robert Stewart3, Phil Trinder1•
University of Glasgow1, University of Stirling2, Heriot-Watt University3
19 Feb 2020
TL;DR: This work aims to improve the reuse of intricate parallel search implementations by providing the first general purpose scalable parallel framework for exact combinatorial search, YewPar, and introduces Lazy Node Generators as a uniform API for search tree generation.
Abstract: Combinatorial search is central to many applications, yet the huge irregular search trees and the need to respect search heuristics make it hard to parallelise. We aim to improve the reuse of intricate parallel search implementations by providing the first general purpose scalable parallel framework for exact combinatorial search, YewPar. We make the following contributions. (1) We present a novel formal model of parallel backtracking search, covering enumeration, decision, and optimisation search. (2) We introduce Lazy Node Generators as a uniform API for search tree generation. (3) We present the design and implementation of 12 widely applicable algorithmic skeletons for tree search on shared and distributed memory architectures. (4) Uniquely in the field we demonstrate how a wide range of parallel search applications can easily be constructed by composing Lazy Node Generators and the search skeletons. (5) We report a systematic performance analysis of all 12 YewPar skeletons on standard instances of 7 search applications, investigating skeleton overheads and scalability up to 255 workers on 17 distributed locations.
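
The abstract above introduces Lazy Node Generators as a uniform way to produce search tree nodes. The sketch below shows the general pattern in Python using a generator for N-queens enumeration. It is a sequential illustration with hypothetical names and is not YewPar's API; the counting loop stands in for the kind of traversal a search skeleton would supply.

```python
def children(node, n):
    """Lazy node generator: yield child nodes one at a time.

    A node is a tuple of queen columns, one per already-filled row.
    Children are created only when the consumer asks for them.
    """
    row = len(node)
    for col in range(n):
        safe = all(
            col != c and abs(col - c) != row - r
            for r, c in enumerate(node)
        )
        if safe:
            yield node + (col,)


def count_solutions(node, n):
    """Depth-first enumeration skeleton over the lazily generated tree."""
    if len(node) == n:
        return 1
    return sum(count_solutions(child, n) for child in children(node, n))
```

Keeping node generation separate from traversal is what lets one generator be reused with different search strategies, which in a parallel setting could hand unexplored children to other workers.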
Journal Article•10.1109/TCBB.2018.2858797•
Fast de Bruijn Graph Compaction in Distributed Memory Environments

Tony Pan1, Rahul Nihalani1, Srinivas Aluru1•
Georgia Institute of Technology1
01 Jan 2020-IEEE/ACM Transactions on Computational Biology and Bioinformatics
TL;DR: The key advantages of the algorithm include bounding the chain compaction run-time to a logarithmic number of iterations in the length of the longest chain, and the ability to differentiate cycles from chains within a logarithmic number of iterations in the length of the longest cycle.
Abstract: De Bruijn graph based genome assembly has gained popularity as short read sequencers become ubiquitous. A core assembly operation is the generation of unitigs, which are sequences corresponding to chains in the graph. Unitigs are used as building blocks for generating longer sequences in many assemblers, and can facilitate graph compression. Chain compaction, by which unitigs are generated, remains a critical computational task. In this paper, we present a distributed memory parallel algorithm for simultaneous compaction of all chains in bi-directed de Bruijn graphs. The key advantages of our algorithm include bounding the chain compaction run-time to a logarithmic number of iterations in the length of the longest chain, and the ability to differentiate cycles from chains within a logarithmic number of iterations in the length of the longest cycle. Our algorithm scales to thousands of computational cores, and can compact a whole genome de Bruijn graph from a human sequence read set in 7.3 seconds using 7680 distributed memory cores, and in 12.9 minutes using 64 shared memory cores. It is $3.7\times$ and $2.0\times$ faster than equivalent steps in the state-of-the-art tools for distributed and shared memory environments, respectively. An implementation of the algorithm is available at https://github.com/ParBLiSS/bruno .
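
The abstract above bounds chain compaction to a logarithmic number of iterations but does not spell out the mechanism. One classic technique with that property is pointer doubling, sketched below as an assumption about how such a bound can arise rather than as the paper's algorithm. The sketch handles simple directed chains only: it does not model bi-directed edges, and it would not terminate on some cycles.

```python
def compact_chains(succ):
    """Find each node's chain end by pointer doubling.

    `succ` maps every node to its successor; a chain end maps to itself.
    Input must consist of acyclic chains (cycles are NOT handled here).
    Returns (end_of, distance_to_end, rounds).
    """
    ptr = dict(succ)
    dist = {v: (0 if succ[v] == v else 1) for v in succ}
    rounds = 0
    # Stop once every pointer already targets a chain end.
    while any(ptr[ptr[v]] != ptr[v] for v in ptr):
        # Every node jumps to its pointer's pointer, doubling its reach.
        new_dist = {v: dist[v] + dist[ptr[v]] for v in ptr}
        new_ptr = {v: ptr[ptr[v]] for v in ptr}
        ptr, dist = new_ptr, new_dist
        rounds += 1
    return ptr, dist, rounds
```

Because all nodes update from the previous round's values, each round is a data-parallel step, which is what makes the approach attractive on distributed memory.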
Proceedings Article•10.1109/EICONRUS49466.2020.9039470•
Comparative Performance Study of Shared and Distributed Memory Dynamic Programming Algorithms

Mikhail Posypkin1, Si Thu Thant Sin1•
National Research University of Electronic Technology1
1 Jan 2020
TL;DR: This paper studies the standard table-based dynamic programming algorithm for the Knapsack Problem in serial, shared memory parallel, and distributed memory versions, using the OpenMP and MPI parallel programming tools for experimental evaluation.
Abstract: This paper is devoted to the experimental and theoretical study of parallel versions of the dynamic programming algorithm for the Knapsack Problem. We considered the standard table-based dynamic programming algorithm of the Knapsack Problem for serial, shared memory parallel and distributed memory multiprocessors. An experimental comparison and performance study of shared and distributed memory dynamic programming algorithms are carried out. The OpenMP and MPI parallel programming tools are used for experimental evaluation.
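
For reference, the standard table-based dynamic program for the 0/1 Knapsack Problem can be sketched as below. This is the textbook serial formulation, not the paper's parallel code. The comment about parallelism follows from the recurrence itself, since each row reads only the previous row.

```python
def knapsack(weights, values, capacity):
    """Best total value for the 0/1 knapsack, built row by row.

    prev[c] holds the best value using the items seen so far with
    total weight at most c.
    """
    prev = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Every cell of the new row depends only on `prev`, so the
        # cells of one row could be computed in parallel.
        cur = [
            prev[c] if c < w else max(prev[c], prev[c - w] + v)
            for c in range(capacity + 1)
        ]
        prev = cur
    return prev[capacity]
```

For example, `knapsack([10, 20, 30], [60, 100, 120], 50)` selects the second and third items.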
Proceedings Article•10.1109/IPDPSW50202.2020.00092•
Parallel/distributed implementation of cellular training for generative adversarial neural networks.

Emiliano Perez, Sergio Nesmachnow1, Jamal Toutouh2, Erik Hemberg2, Una-May O'Reilly2 •
University of the Republic1, Massachusetts Institute of Technology2
1 Apr 2020
TL;DR: In this paper, a parallel/distributed implementation of a cellular competitive coevolutionary method to train two populations of GANs is proposed for high performance/supercomputing centers.
Abstract: Generative adversarial networks (GANs) are widely used to learn generative models. GANs consist of two networks, a generator and a discriminator, that apply adversarial learning to optimize their parameters. This article presents a parallel/distributed implementation of a cellular competitive coevolutionary method to train two populations of GANs. A distributed memory parallel implementation is proposed for execution in high performance/supercomputing centers. Efficient results are reported on addressing the generation of handwritten digits (MNIST dataset samples). Moreover, the proposed implementation is able to reduce the training times and scale properly when considering different grid sizes for training.
Proceedings Article•10.1109/SC41405.2020.00028•
Distributed-Memory DMRG via Sparse and Dense Parallel Tensor Contractions

Ryan Levy, Edgar Solomonik, Bryan K. Clark
1 Nov 2020
TL;DR: Two new parallel approaches that handle the block sparsity arising in density matrix renormalization group (DMRG) algorithms are implemented via Cyclops, a distributed memory tensor contraction library.
Abstract: The density matrix renormalization group (DMRG) algorithm is a powerful tool for solving eigenvalue problems to model quantum systems. DMRG relies on tensor contractions and dense linear algebra to compute properties of condensed matter physics systems. However, its efficient parallel implementation is challenging due to limited concurrency, large memory footprint, and tensor sparsity. We mitigate these problems by implementing two new parallel approaches that handle block sparsity arising in DMRG, via Cyclops, a distributed memory tensor contraction library. We benchmark their performance on two physical systems using the Blue Waters and Stampede2 supercomputers. Our DMRG performance is improved by up to 5.9X in runtime and 99X in processing rate over ITensor, at roughly comparable computational resource use. This enables higher accuracy calculations via larger tensors for quantum state approximation. We demonstrate that despite having limited concurrency, DMRG is weakly scalable with the use of efficient parallel tensor contraction mechanisms.
Journal Article•10.1007/S00500-019-04122-Z•
A novel parallel local search algorithm for the maximum vertex weight clique problem in large graphs

Ender Sevinc, Tansel Dokeroglu
1 Mar 2020
TL;DR: The Par-LS algorithm, developed for a distributed memory environment using message passing interface libraries, employs a different exploration strategy at each processor and is reported as one of the best performing algorithms in the literature for solving the MVWCP in large graphs.
Abstract: This study proposes a new parallel local search algorithm (Par-LS) for solving the maximum vertex weight clique problem (MVWCP) in large graphs. Solving the MVWCP in a large graph with millions of edges and vertices is an intractable problem. Parallel local search methods are powerful tools to deal with such problems with their high-performance computation capability. The Par-LS algorithm is developed on a distributed memory environment by using message passing interface libraries and employs a different exploration strategy at each processor. The Par-LS introduces new operators, parallel($\omega$,1)-swap and parallel(1,2)-swap, for searching the neighboring solutions while improving the current solution through iterations. During our experiments, 172 of 173 benchmark problem instances from the DIMACS, BHOSLIB and Network Data Repository graph libraries are solved optimally with respect to the best/optimal reported results. A new best solution for the largest problem instance of the BHOSLIB benchmark (frb100-40) is discovered. The Par-LS algorithm is reported as one of the best performing algorithms in the literature for the solution of the MVWCP in large graphs.
Journal Article•10.3390/APP10072539•
Locality-Sensitive Hashing for Information Retrieval System on Multiple GPGPU Devices
Toan Nguyen Mau, Yasushi Inoguchi
07 Apr 2020-Applied Sciences
TL;DR: This paper proposes an extension of DLSH for big data sets using multiple GPGPUs, in order to increase the capacity and performance of the information retrieval system, and shows that DLSH can be applied to real-life dynamic database systems.
Abstract: It is challenging to build a real-time information retrieval system, especially for systems with high-dimensional big data. To structure big data, many hashing algorithms that map similar data items to the same bucket to advance the search have been proposed. Locality-Sensitive Hashing (LSH) is a common approach for reducing the number of dimensions of a data set, by using a family of hash functions and a hash table. The LSH hash table is an additional component that supports the indexing of hash values (keys) for the corresponding data/items. We previously proposed the Dynamic Locality-Sensitive Hashing (DLSH) algorithm with a dynamically structured hash table, optimized for storage in the main memory and General-Purpose computation on Graphics Processing Units (GPGPU) memory. This supports the handling of constantly updated data sets, such as songs, images, or text databases. The DLSH algorithm works effectively with data sets that are updated with high frequency and is compatible with parallel processing. However, the use of a single GPGPU device for processing big data is inadequate, due to the small memory capacity of GPGPU devices. When using multiple GPGPU devices for searching, we need an effective search algorithm to balance the jobs. In this paper, we propose an extension of DLSH for big data sets using multiple GPGPUs, in order to increase the capacity and performance of the information retrieval system. Different search strategies on multiple DLSH clusters are also proposed to adapt our parallelized system. With significant results in terms of performance and accuracy, we show that DLSH can be applied to real-life dynamic database systems.
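As a rough illustration of the bucket idea the abstract describes (similar items mapped to the same hash-table key), here is a minimal sketch of a generic LSH table. It is not the DLSH algorithm from the paper: the random-hyperplane (sign of projection) hash family, the function names, and the parameters `num_bits` and `seed` are all assumptions chosen for illustration, and the sketch runs on a single CPU with no GPGPU component.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (assumed hash family, not from the paper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_key(vector, hyperplanes):
    """Hash a vector to a bit string: one bit per hyperplane, set by the sign of the dot product."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * v for p, v in zip(plane, vector))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)


def build_table(items, hyperplanes):
    """Index item ids by their LSH key; items sharing a key land in the same bucket."""
    table = defaultdict(list)
    for item_id, vector in items.items():
        table[lsh_key(vector, hyperplanes)].append(item_id)
    return table


def query(table, vector, hyperplanes):
    """Return the candidate item ids stored in the query vector's bucket."""
    return table.get(lsh_key(vector, hyperplanes), [])


planes = make_hyperplanes(num_bits=8, dim=4, seed=1)
items = {"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0]}
table = build_table(items, planes)
candidates = query(table, [1.0, 2.0, 3.0, 4.0], planes)
```

Because this hash family depends only on direction, `"a"` and its scaled copy `"b"` fall into the same bucket. A real system would use several tables and re-rank the candidates by exact distance.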
Proceedings Article•10.1145/3392717.3392745•
Parallelizing pruned landmark labeling: dealing with dependencies in graph algorithms
Ruoming Jin1, Zhen Peng2, Wendell Wu1, Feodor F. Dragan1, Gagan Agrawal3, Bin Ren2 •
Kent State University1, College of William & Mary2, Georgia Regents University3
29 Jun 2020
TL;DR: This paper demonstrates the first scalable parallel implementation of the PLL algorithm that produces the same results as the sequential algorithm, resulting in the Vertex-Centric PLL (VC-PLL) algorithm, which can efficiently execute on graphs with more than a billion edges.
Abstract: To help compute shortest path distances over large graphs efficiently, 2-hop labeling has emerged as a major tool, with Pruned Landmark Labeling (PLL) as a popular algorithm. This paper demonstrates the first scalable parallel implementation of the PLL algorithm that produces the same results as the sequential algorithm. Based on theoretical analysis, we show how computations on each vertex can be performed in parallel while maintaining correctness, resulting in the Vertex-Centric PLL (VC-PLL) algorithm. We also show a formulation of this algorithm based on linear algebra and argue why the use of a library based on linear algebra operations will not produce an efficient implementation. Next, we introduce a batched VC-PLL (BVC-PLL) algorithm to reduce the computational inefficiency in VC-PLL. We have carried out a parallel implementation of this method for modern clusters, combining shared memory and distributed memory parallelism, that can efficiently execute on graphs with more than a billion edges. We also demonstrate how the BVC-PLL algorithm can be extended to handle directed graphs and weighted graphs and how the version for weighted graphs can benefit from SIMD parallelization.
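For readers unfamiliar with 2-hop labeling, the following sketch shows the sequential baseline that the abstract refers to, for unweighted undirected graphs only. It is not the paper's VC-PLL or BVC-PLL: the function names, the dictionary-based label layout, and the example graph are assumptions made for illustration. Each vertex stores distances to a set of hub vertices, a query takes the minimum over shared hubs, and each breadth-first search is pruned wherever the labels built so far already give the correct distance.

```python
from collections import deque

INF = float("inf")


def query(labels, u, v):
    """2-hop distance query: minimum of d(u, hub) + d(hub, v) over hubs shared by u and v."""
    lu, lv = labels[u], labels[v]
    if len(lu) > len(lv):
        lu, lv = lv, lu  # iterate over the smaller label
    best = INF
    for hub, d in lu.items():
        if hub in lv:
            best = min(best, d + lv[hub])
    return best


def pruned_landmark_labeling(adj, order):
    """Build labels by running one pruned BFS per root, in the given vertex order."""
    labels = {v: {} for v in adj}
    for root in order:
        dist = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            d = dist[u]
            # Prune: earlier hubs already certify a distance no longer than d.
            if query(labels, root, u) <= d:
                continue
            labels[u][root] = d
            for w in adj[u]:
                if w not in dist:
                    dist[w] = d + 1
                    queue.append(w)
    return labels


# Hypothetical example graph: a 4-cycle 0-1-3-2-0 with a pendant vertex 4 attached to 3.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
order = sorted(adj, key=lambda v: -len(adj[v]))  # highest degree first, a common heuristic
labels = pruned_landmark_labeling(adj, order)
```

The dependency that makes parallelization hard is visible in the pruning test: the BFS from each root reads the labels written by every earlier root.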
Journal Article•10.1063/1.5129452•
Techniques for high-performance construction of Fock matrices.
Hua Huang1, C. David Sherrill1, Edmond Chow1•
Georgia Institute of Technology1
13 Jan 2020-Journal of Chemical Physics
TL;DR: Techniques for Fock matrix construction that are designed for high performance on shared and distributed memory parallel computers when using Gaussian basis sets are presented and a globally accessible matrix class for accessing distributed Fock and density matrices is presented.
Abstract: This paper presents techniques for Fock matrix construction that are designed for high performance on shared and distributed memory parallel computers when using Gaussian basis sets. Four main techniques are considered. (1) To calculate electron repulsion integrals, we demonstrate batching together the calculation of multiple shell quartets of the same angular momentum class so that the calculation of large sets of primitive integrals can be efficiently vectorized. (2) For multithreaded summation of entries into the Fock matrix, we investigate using a combination of atomic operations and thread-local copies of the Fock matrix. (3) For distributed memory parallel computers, we present a globally accessible matrix class for accessing distributed Fock and density matrices. The new matrix class introduces a batched mode for remote memory access that can reduce the synchronization cost. (4) For density fitting, we exploit both symmetry (of the Coulomb and exchange matrices) and sparsity (of 3-index tensors) and give a performance comparison of density fitting and the conventional direct calculation approach. The techniques are implemented in an open-source software library called GTFock.
Proceedings Article•
Reactive Task Migration for Hybrid MPI+OpenMP Applications
Jannis Klinkenberg1, Philipp Samfass2, Michael Bader2, Christian Terboven1, Matthias S. Müller1 •
RWTH Aachen University1, Technische Universität München2
1 Jan 2020
TL;DR: This work presents a novel library for fine-granular task-based reactive load balancing in distributed memory based on MPI and OpenMP, and demonstrates its robustness for work-induced imbalances for a realistic application.
Abstract: Many applications in high performance computing are designed based on underlying performance and execution models. While these models could successfully be employed in the past for balancing load within and between compute nodes, modern software and hardware increasingly make performance predictability difficult if not impossible. Consequently, balancing computational load becomes much more difficult. Aiming to tackle these challenges in search for a general solution, we present a novel library for fine-granular task-based reactive load balancing in distributed memory based on MPI and OpenMP. With our approach, individual migratable tasks can be executed on any MPI rank. The actual executing rank is determined at run time based on online performance data. We evaluate our approach under an enforced power cap and under enforced clock frequency changes for a synthetic benchmark and show its robustness for work-induced imbalances for a realistic application. Our experiments demonstrate speedups of up to 1.31X.
Journal Article•10.1186/S12859-019-3338-8•
pSpatiocyte: a high-performance simulator for intracellular reaction-diffusion systems
Satya N. V. Arjunan, Atsushi Miyauchi, Kazunari Iwamoto, Koichi Takahashi
29 Jan 2020-BMC Bioinformatics
TL;DR: A parallelized Spatiocyte method that demonstrates good accuracies, fast runtimes and a significant performance advantage over well-known microscopic particle methods in large-scale simulations of intracellular reaction-diffusion systems.
Abstract: Studies using quantitative experimental methods have shown that intracellular spatial distribution of molecules plays a central role in many cellular systems. Spatially resolved computer simulations can integrate quantitative data from these experiments to construct physically accurate models of the systems. Although computationally expensive, microscopic resolution reaction-diffusion simulators, such as Spatiocyte can directly capture intracellular effects comprising diffusion-limited reactions and volume exclusion from crowded molecules by explicitly representing individual diffusing molecules in space. To alleviate the steep computational cost typically associated with the simulation of large or crowded intracellular compartments, we present a parallelized Spatiocyte method called pSpatiocyte. The new high-performance method employs unique parallelization schemes on hexagonal close-packed (HCP) lattice to efficiently exploit the resources of common workstations and large distributed memory parallel computers. We introduce a coordinate system for fast accesses to HCP lattice voxels, a parallelized event scheduler, a parallelized Gillespie’s direct-method for unimolecular reactions, and a parallelized event for diffusion and bimolecular reaction processes. We verified the correctness of pSpatiocyte reaction and diffusion processes by comparison to theory. To evaluate the performance of pSpatiocyte, we performed a series of parallelized diffusion runs on the RIKEN K computer. In the case of fine lattice discretization with low voxel occupancy, pSpatiocyte exhibited 74% parallel efficiency and achieved a speedup of 7686 times with 663552 cores compared to the runtime with 64 cores. In the weak scaling performance, pSpatiocyte obtained efficiencies of at least 60% with up to 663552 cores. 
When executing the Michaelis-Menten benchmark model on an eight-core workstation, pSpatiocyte required 45- and 55-fold shorter runtimes than Smoldyn and the parallel version of ReaDDy, respectively. As a high-performance application example, we study the dual phosphorylation-dephosphorylation cycle of the MAPK system, a typical reaction network motif in cell signaling pathways. pSpatiocyte demonstrates good accuracies, fast runtimes and a significant performance advantage over well-known microscopic particle methods in large-scale simulations of intracellular reaction-diffusion systems. The source code of pSpatiocyte is available at https://spatiocyte.org.
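The 74% parallel efficiency quoted in the abstract can be checked from the other reported numbers, assuming efficiency is measured relative to the 64-core baseline run (the abstract implies this but does not state the formula). The function name below is hypothetical.

```python
def relative_parallel_efficiency(speedup, cores, baseline_cores):
    """Efficiency = measured speedup divided by the ideal speedup cores / baseline_cores."""
    ideal_speedup = cores / baseline_cores
    return speedup / ideal_speedup


ideal = 663552 / 64  # ideal speedup when scaling from 64 to 663552 cores
efficiency = relative_parallel_efficiency(7686, 663552, 64)
print(f"ideal speedup: {ideal:.0f}, efficiency: {efficiency:.1%}")
```

Under that assumption the ideal speedup is 10368, and 7686 / 10368 is about 0.741, which is consistent with the reported 74%.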
Proceedings Article•10.1109/ICDCS47774.2020.00021•
DeX: Scaling Applications Beyond Machine Boundaries
Sang-Hoon Kim1, Ho-Ren Chuang2, Robert Lyerly2, Pierre Olivier3, Changwoo Min2, Binoy Ravindran2 •
Ajou University1, Virginia Tech2, University of Manchester3
13 Mar 2020
TL;DR: DeX is an operating system-level approach that extends the execution boundary of existing applications over multiple machines by allowing the threads in a process to be relocated and distributed dynamically through a simple function call.
Abstract: Increasing the computing performance within a single-machine form factor is becoming increasingly difficult due to the complexities in scaling processor interconnects and coherence protocols. On the other hand, converting existing applications to run on multiple nodes requires a significant effort to rewrite application logic in distributed programming models and adapt the code to the underlying network characteristics. This paper presents DeX, an operating system-level approach to extend the execution boundary of existing applications over multiple machines. DeX allows the threads in a process to be relocated and distributed dynamically through a simple function call. DeX makes it trivial for developers to convert any application to be distributed over multiple nodes and for applications to transparently utilize disaggregated resources in a rack-scale system with minimal effort. Evaluation results using a running prototype and eight real applications showed promising results: six out of the eight scaled beyond the single-machine performance on DeX.
Proceedings Article•10.1109/RTAS48715.2020.00-16•
Addressing Resource Contention and Timing Predictability for Multi-Core Architectures with Shared Memory Interconnects
Haitong Wang1, Neil Audsley1, Wanli Chang1•
University of York1
21 Apr 2020
TL;DR: Evaluations on simulators and FPGA implementations with synthetic memory workload show that the latency variation is significantly reduced, contributing towards timing predictability of multi-core systems.
Abstract: Multi-core architectures are increasingly being used in real-time embedded systems. In general, such systems have more processors than the shared memory modules, potentially causing severe interference over memory accesses. This resource contention could lead to substantial variation on memory access latencies, and thus wide fluctuation in the overall system performance, which is highly undesirable especially for the time-critical applications. In this paper, we address resource contention and timing predictability for multi-core architectures with distributed memory interconnects. We focus on the locally arbitrated interconnect constructed by pipelined multiplexing stages with local arbitration, while the globally arbitrated interconnect employing global scheduling to the same architecture potentially suffers synchronisation issue and requires strict coordination. Our contributions are mainly threefold: (i) We analyse the resource contention across the memory access data path, and report the accurate calculational method to bound the worst-case behaviour. (ii) We compare the average-case behaviour of the locally arbitrated and the globally arbitrated architectures with experiments, demonstrating varying memory latencies caused by the resource sharing issue. (iii) We propose an architectural modification to smooth resource sharing. Evaluations on simulators and FPGA implementations with synthetic memory workload show that the latency variation is significantly reduced, contributing towards timing predictability of multi-core systems.
Proceedings Article•10.1109/IPDPSW50202.2020.00048•
Considerations for a Distributed GraphBLAS API
Benjamin Brock1, Aydin Buluc2, Timothy G. Mattson3, Scott McMillan4, José E. Moreira5, Roger Pearce6, Oguz Selvitopi2, Trevor Steil7 •
University of California, Berkeley1, Lawrence Berkeley National Laboratory2, Intel3, Software Engineering Institute4, IBM5, Lawrence Livermore National Laboratory6, University of Minnesota7
18 May 2020
TL;DR: This paper reviews various approaches for a GraphBLAS API for distributed computing and highlights the pros and cons of different approaches rather than advocating for one particular choice.
Abstract: The GraphBLAS emerged from an international effort to standardize linear-algebraic building blocks for computing on graphs and graph-structured data. The GraphBLAS is expressed as a C API and has paved the way for multiple implementations. The GraphBLAS C API, however, does not define how distributed-memory parallelism should be handled. This paper reviews various approaches for a GraphBLAS API for distributed computing. This work is guided by our experience with existing distributed memory libraries. Our goal for this paper is to highlight the pros and cons of different approaches rather than to advocate for one particular choice.
Posted Content•
Data Engineering for HPC with Python.
Vibhatha Abeykoon, Niranda Perera, Chathura Widanage, Supun Kamburugamuve, Thejaka Amila Kanewala1, Hasara Maithree2, Pulasthi Wickramasinghe, Ahmet Uyar, Geoffrey C. Fox •
Indiana University1, University of Moratuwa2
13 Oct 2020-arXiv: Distributed, Parallel, and Cluster Computing
TL;DR: This paper presents a distributed Python API based on table abstraction for representing and processing data, which adopts high performance compute kernels in C++, with an in-memory table representation with Cython-based Python bindings.
Abstract: Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine learning applications. There are many structures such as tables, graphs, and trees to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format to load and process data. In this paper, we present a distributed Python API based on table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high performance compute kernels in C++, with an in-memory table representation with Cython-based Python bindings. In the core system, we use MPI for distributed memory computations with a data-parallel approach for processing large datasets in HPC clusters.