Top 151 papers published in the topic of Distributed memory in 2019

Showing papers on "Distributed memory" published in 2019

EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences.

Pierre Barbera¹, Alexey M. Kozlov¹, Lucas Czech¹, Benoit Morel¹, Diego Darriba¹,², Tomas Flouri¹,³, Alexandros Stamatakis¹,⁴•Institutions (4)

Heidelberg Institute for Theoretical Studies¹, University of A Coruña², University College London³, Karlsruhe Institute of Technology⁴

01 Mar 2019-Systematic Biology

TL;DR: The performance assessment shows that EPA‐NG outperforms RAxML‐EPA and PPLACER by up to a factor of 30 in sequential execution mode, while attaining comparable parallel efficiency on shared memory systems.

Abstract: Next generation sequencing (NGS) technologies have led to a ubiquity of molecular sequence data. This data avalanche is particularly challenging in metagenetics, which focuses on taxonomic identification of sequences obtained from diverse microbial environments. Phylogenetic placement methods determine how these sequences fit into an evolutionary context. Previous implementations of phylogenetic placement algorithms, such as the evolutionary placement algorithm (EPA) included in RAxML, or PPLACER, are being increasingly used for this purpose. However, due to the steady progress in NGS technologies, the current implementations face substantial scalability limitations. Herein, we present EPA-NG, a complete reimplementation of the EPA that is substantially faster, offers a distributed memory parallelization, and integrates concepts from both RAxML-EPA and PPLACER. EPA-NG can be executed on standard shared memory, as well as on distributed memory systems (e.g., computing clusters). To demonstrate the scalability of EPA-NG, we placed 1 billion metagenetic reads from the Tara Oceans Project onto a reference tree with 3748 taxa in just under 7 h, using 2048 cores. Our performance assessment shows that EPA-NG outperforms RAxML-EPA and PPLACER by up to a factor of 30 in sequential execution mode, while attaining comparable parallel efficiency on shared memory systems. We further show that the distributed memory parallelization of EPA-NG scales well up to 2048 cores. EPA-NG is available under the AGPLv3 license: https://github.com/Pbdas/epa-ng.

561 citations

Journal Article•10.1134/S1995080219050056•

AMGCL: An Efficient, Flexible, and Extensible Algebraic Multigrid Implementation

Denis Demidov¹•Institutions (1)

Russian Academy of Sciences¹

24 Jun 2019-Lobachevskii Journal of Mathematics

TL;DR: The paper presents AMGCL, an open-source C++ library implementing the algebraic multigrid method (AMG) for solution of large sparse linear systems of equations, usually arising from discretization of partial differential equations on an unstructured grid.

Abstract: The paper presents AMGCL, an open-source C++ library implementing the algebraic multigrid method (AMG) for solution of large sparse linear systems of equations, usually arising from discretization of partial differential equations on an unstructured grid. The library supports both shared and distributed memory computation, can use modern massively parallel processors via OpenMP, OpenCL, or CUDA technologies, has minimal dependencies, and is easily extensible. The design principles behind AMGCL are discussed and it is shown that the code performance is on par with alternative implementations.

98 citations

Journal Article•10.1109/TPDS.2019.2904058•

Parallelizing Word2Vec in Shared and Distributed Memory

Shihao Ji¹, Nadathur Satish², Sheng Li², Pradeep Dubey²•Institutions (2)

Georgia State University¹, Intel²

01 Sep 2019-IEEE Transactions on Parallel and Distributed Systems

TL;DR: HogBatch improves reuse of various data structures in the word2vec algorithm through minibatching and negative sample sharing, which allows the problem to be expressed using matrix multiply operations.

Abstract: Word2vec is a widely used algorithm for extracting low-dimensional vector representations of words. State-of-the-art algorithms including those by Mikolov et al. [1] , [2] have been parallelized for multi-core CPU architectures, but are based on vector-vector operations with “Hogwild” updates that are memory-bandwidth intensive and do not efficiently use computational resources. In this paper, we propose “HogBatch” by improving reuse of various data structures in the algorithm through the use of minibatching and negative sample sharing, hence allowing us to express the problem using matrix multiply operations. We also explore different techniques to distribute word2vec computation across nodes in a computer cluster, and demonstrate good strong scalability up to 32 nodes. The new algorithm is particularly suitable for modern multi-core/many-core architectures, especially Intel's latest Knights Landing processors, and allows us to scale up the computation near linearly across cores and nodes, and process hundreds of millions of words per second, which is the fastest word2vec implementation to the best of our knowledge. We released the source code for reproducible research and general usage.

58 citations

Journal Article•10.1145/3365676•

Fast Parallel Algorithms for Counting and Listing Triangles in Big Graphs

Shaikh Arifuzzaman¹, Maleq Khan², Madhav V. Marathe³•Institutions (3)

University of New Orleans¹, Texas A&M University², University of Virginia³

13 Dec 2019-ACM Transactions on Knowledge Discovery From Data

TL;DR: This article presents two efficient MPI-based distributed memory parallel algorithms for counting triangles in big graphs, together with a novel approach that drastically reduces communication cost and makes the algorithm both space- and runtime-efficient.

Abstract: Big graphs (networks) arising in numerous application areas pose significant challenges for graph analysts as these graphs grow to billions of nodes and edges and are prohibitively large to fit in the main memory. Finding the number of triangles in a graph is an important problem in the mining and analysis of graphs. In this article, we present two efficient MPI-based distributed memory parallel algorithms for counting triangles in big graphs. The first algorithm employs overlapping partitioning and efficient load balancing schemes to provide a very fast parallel algorithm. The algorithm scales well to networks with billions of nodes and can compute the exact number of triangles in a network with 10 billion edges in 16 minutes. The second algorithm divides the network into non-overlapping partitions leading to a space-efficient algorithm. Our results on both artificial and real-world networks demonstrate a significant space saving with this algorithm. We also present a novel approach that drastically reduces communication cost, making the algorithm both space- and runtime-efficient. Further, we demonstrate how our algorithms can be used to list all triangles in a graph and compute clustering coefficients of nodes. Our algorithm can also be adapted to a parallel approximation algorithm using an edge sparsification method.
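
For readers unfamiliar with the counting problem this abstract describes, the following is a minimal sequential sketch, not the authors' MPI algorithms. It orients every edge from the lower-degree endpoint to the higher-degree one and intersects the resulting neighbor sets, so each triangle is counted exactly once. The function name `count_triangles` and the edge-list input format are assumptions made for illustration.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs.

    Illustrative sequential sketch only; node labels must be comparable.
    """
    # Build undirected adjacency sets, ignoring self-loops.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Total order on nodes: by degree, ties broken by label.
    def rank(x):
        return (len(adj[x]), x)

    # Keep only neighbors that come later in the order.
    forward = {u: {v for v in nbrs if rank(v) > rank(u)}
               for u, nbrs in adj.items()}

    # A triangle u < v < w is found once, when scanning edge (u, v).
    return sum(len(forward[u] & forward[v]) for u in forward for v in forward[u])
```

A distributed version would additionally have to decide which process owns which nodes and how neighbor sets are exchanged; the overlapping and non-overlapping partitioning schemes mentioned in the abstract are two answers to that question.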

36 citations

Journal Article•10.1016/J.CMA.2018.08.028•

Large-scale stochastic topology optimization using adaptive mesh refinement and coarsening through a two-level parallelization scheme

Joan Baiges¹, Jesús Martínez-Frutos², D. Herrero-Pérez², F. Otero³, A. Ferrer¹•Institutions (3)

Polytechnic University of Catalonia¹, University of Cartagena², National Institute of Statistics and Geography³

01 Jan 2019-Computer Methods in Applied Mechanics and Engineering

TL;DR: The topology optimization problem is addressed using the topological derivative concept in combination with a level-set method, and the performance and scalability are evaluated using several numerical benchmarks and real-world applications, showing good performance and scalability up to thousands of processors.

35 citations

Journal Article•10.1109/TCBB.2017.2760829•

Kmerind: A Flexible Parallel Library for K-mer Indexing of Biological Sequences on Distributed Memory Systems

Tony Pan¹, Patrick Flick¹, Chirag Jain¹, Yongchao Liu¹, Srinivas Aluru¹•Institutions (1)

Georgia Institute of Technology¹

01 Jul 2019-IEEE/ACM Transactions on Computational Biology and Bioinformatics

TL;DR: The Kmerind library provides a set of simple and consistent APIs with sequential semantics and parallel implementations that are designed to be flexible and extensible, and is the first $k$-mer indexing library for distributed memory environments, and the first extensible library for general indexing and counting.

Abstract: Counting and indexing fixed length substrings, or $k$-mers, in biological sequences is a key step in many bioinformatics tasks including genome alignment and mapping, genome assembly, and error correction. While advances in next generation sequencing technologies have dramatically reduced the cost and improved latency and throughput, few bioinformatics tools can efficiently process the datasets at the current generation rate of 1.8 terabases per 3-day experiment from a single sequencer. We present Kmerind, a high performance parallel $k$-mer indexing library for distributed memory environments. The Kmerind library provides a set of simple and consistent APIs with sequential semantics and parallel implementations that are designed to be flexible and extensible. Kmerind's $k$-mer counter performs similarly or better than the best existing $k$-mer counting tools even on shared memory systems. In a distributed memory environment, Kmerind counts $k$-mers in a 120 GB sequence read dataset in less than 13 seconds on 1024 Xeon CPU cores, and fully indexes their positions in approximately 17 seconds. Querying for 1 percent of the $k$-mers in these indices can be completed in 0.23 seconds and 28 seconds, respectively. Kmerind is the first $k$-mer indexing library for distributed memory environments, and the first extensible library for general $k$-mer indexing and counting. Kmerind is available at https://github.com/ParBLiSS/kmerind.
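
To make the counting task concrete, here is a small single-process sketch of $k$-mer counting in which every $k$-mer is assigned to an owning partition by hashing, which is one common way to spread such a table across distributed memory. It does not use or imitate the Kmerind API; the names `kmers` and `count_kmers_partitioned`, the `num_ranks` parameter, and the choice of CRC32 as the hash are all assumptions for illustration.

```python
import zlib
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def count_kmers_partitioned(reads, k, num_ranks):
    """Count k-mers, storing each one in the partition that 'owns' it.

    The list of Counters stands in for per-process hash tables; in a real
    distributed run each k-mer would be sent to its owner rank instead.
    """
    parts = [Counter() for _ in range(num_ranks)]
    for read in reads:
        for kmer in kmers(read, k):
            # Deterministic hash, so the same k-mer always has the same owner.
            owner = zlib.crc32(kmer.encode("ascii")) % num_ranks
            parts[owner][kmer] += 1
    return parts
```

Because ownership depends only on the $k$-mer itself, a query for a given $k$-mer needs to contact exactly one partition.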

33 citations

Journal Article•10.1002/NME.5970•

Parallel block‐preconditioned monolithic solvers for fluid‐structure interaction problems

D. Jodlbauer¹, Ulrich Langer², Thomas Wick³•Institutions (3)

Johannes Kepler University of Linz¹, Austrian Academy of Sciences², Leibniz University of Hanover³

10 Feb 2019-International Journal for Numerical Methods in Engineering

TL;DR: This article considers the solution of fluid-structure interaction problems using a monolithic approach for the coupling between fluid and solid subproblems, realized by means of the arbitrary Lagrangian-Eulerian framework and a nonlinear harmonic mesh motion model.

Abstract: In this work, we consider the solution of fluid-structure interaction problems using a monolithic approach for the coupling between fluid and solid subproblems. The coupling of both equations is realized by means of the arbitrary Lagrangian-Eulerian framework and a nonlinear harmonic mesh motion model. Monolithic approaches require the solution of large, ill-conditioned linear systems of algebraic equations at every Newton step. Direct solvers tend to use too much memory even for a relatively small number of degrees of freedom, and, in addition, exhibit superlinear growth in arithmetic complexity. Thus, iterative solvers are the only viable option. To ensure convergence of iterative methods within a reasonable number of iterations, good and, at the same time, cheap preconditioners have to be developed. We study physics-based block preconditioners, which are derived from the block $LDU$-factorization of the FSI Jacobian, and their performance on distributed memory parallel computers in terms of two- and three-dimensional test cases permitting large deformations.

29 citations

Journal Article•10.1016/J.SWEVO.2019.06.003•

Performance analysis of synchronous and asynchronous distributed genetic algorithms on multiprocessors

Amr Abdelhafez¹, Enrique Alba, Gabriel Luque•Institutions (1)

Assiut University¹

01 Sep 2019-Swarm and evolutionary computation

TL;DR: A performance study of three different PGAs that considers the master-slave model along with synchronous and asynchronous distributed GAs (dGAs), presents their different designs and expected similarities when running on one to 32 cores, and describes how the speed-up and parallel performance of the dGAs change as the number of cores grows.

Abstract: Because of their effectiveness and flexibility in finding useful solutions, Genetic Algorithms (GAs) are very popular search techniques for solving complex optimization problems in scientific and industrial fields. Parallel GAs (PGAs), and especially distributed ones, have usually been presented as the way to overcome the time-consuming shortcoming of sequential GAs. When applying PGAs, we can expect better performance because of the exchange of knowledge during the parallel search process. The resulting distributed search differs from what sequential panmictic GAs do and therefore deserves additional study. This article presents a performance study of three different PGAs. Moreover, we investigate the effect of synchronizing communications over modern shared-memory multiprocessors. We consider the master-slave model along with synchronous and asynchronous distributed GAs (dGAs), presenting their different designs and expected similarities when running on a number of cores ranging from one to 32. The master-slave model showed a competitive numerical effort versus the other dGAs and proved able to scale up well over multiprocessors. We describe how the speed-up and parallel performance of the dGAs change as the number of cores grows. Results of the island model show that synchronous and asynchronous dGAs have different numerical performances on a multiprocessor, the asynchronous algorithm having a faster execution and thus being more attractive for time-demanding applications. Our results and statistical analyses help in developing a novel body of knowledge on PGAs running in shared memory multiprocessors (versus overwhelming literature oriented to distributed memory clusters), something useful for researchers, beginners, and final users of these techniques.

28 citations

Posted Content•

FastSV: A Distributed-Memory Connected Component Algorithm with Fast Convergence

Yongzhe Zhang, Ariful Azad, Zhenjiang Hu

14 Oct 2019-arXiv: Data Structures and Algorithms

TL;DR: The algorithm simplifies the classic Shiloach-Vishkin algorithm, employs several novel and efficient hooking strategies for faster convergence, maps different steps of FastSV to linear algebraic operations, and implements them with the help of scalable graph libraries.

Abstract: This paper presents a new distributed-memory algorithm called FastSV for finding connected components in an undirected graph. Our algorithm simplifies the classic Shiloach-Vishkin algorithm and employs several novel and efficient hooking strategies for faster convergence. We map different steps of FastSV to linear algebraic operations and implement them with the help of scalable graph libraries. FastSV uses sparse operations to avoid redundant work and optimized MPI communication to avoid bottlenecks. The resultant algorithm shows high performance and scalability as it can find the connected components of a hyperlink graph with over 134B edges in 30 seconds using 262K cores on a Cray XC40 supercomputer. FastSV outperforms the state-of-the-art algorithm by an average speedup of 2.21x (max 4.27x) on a variety of real-world graphs.
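
The two building blocks this abstract refers to, hooking and shortcutting, can be illustrated with a simplified sequential sketch in the spirit of Shiloach-Vishkin. This is not FastSV itself and contains none of its hooking strategies or linear-algebraic formulation; the function name `connected_components` and the input format (node count plus edge list) are assumptions for illustration.

```python
def connected_components(n, edges):
    """Label each of n nodes with the smallest node id in its component.

    Simplified hooking + shortcutting loop (illustrative, sequential).
    parent[v] <= v holds throughout, so the parent pointers form a forest.
    """
    parent = list(range(n))
    changed = True
    while changed:
        changed = False
        # Hooking: for each edge, hang the larger parent under the smaller
        # one, but only if the larger parent is currently a tree root.
        for u, v in edges:
            pu, pv = parent[u], parent[v]
            if pu == pv:
                continue
            hi, lo = max(pu, pv), min(pu, pv)
            if parent[hi] == hi:
                parent[hi] = lo
                changed = True
        # Shortcutting (pointer jumping): point every node at its grandparent.
        for v in range(n):
            grandparent = parent[parent[v]]
            if parent[v] != grandparent:
                parent[v] = grandparent
                changed = True
    return parent
```

The loop stops once every tree is a star and no edge connects two different stars, at which point `parent[v]` is the component label of `v`.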

27 citations

Proceedings Article•10.1145/3295500.3356201•

Streaming message interface: high-performance distributed memory programming on reconfigurable hardware

Tiziano De Matteis¹, Johannes de Fine Licht¹, Jakub Beránek², Torsten Hoefler¹•Institutions (2)

ETH Zurich¹, Technical University of Ostrava²

17 Nov 2019

TL;DR: Streaming Message Interface (SMI) is a communication model and API that unifies explicit message passing with a hardware-oriented programming model, facilitating minimal-overhead, flexible, and productive inter-FPGA communication.

Abstract: Distributed memory programming is the established paradigm used in high-performance computing (HPC) systems, requiring explicit communication between nodes and devices. When FPGAs are deployed in distributed settings, communication is typically handled either by going through the host machine, sacrificing performance, or by streaming across fixed device-to-device connections, sacrificing flexibility. We present Streaming Message Interface (SMI), a communication model and API that unifies explicit message passing with a hardware-oriented programming model, facilitating minimal-overhead, flexible, and productive inter-FPGA communication. Instead of bulk transmission, messages are streamed across the network during computation, allowing communication to be seamlessly integrated into pipelined designs. We present a high-level synthesis implementation of SMI targeting a dedicated FPGA interconnect, exposing runtime-configurable routing with support for arbitrary network topologies, and implement a set of distributed memory benchmarks. Using SMI, programmers can implement distributed, scalable HPC programs on reconfigurable hardware, without deviating from best practices for hardware design.

27 citations

Proceedings Article•10.1109/SCALA49573.2019.00010•

Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC

Thomas Herault¹, Yves Robert¹, George Bosilca¹, Jack Dongarra¹•Institutions (1)

University of Tennessee¹

1 Nov 2019

TL;DR: This paper introduces a generic and flexible matrix-matrix multiplication algorithm C = A × B for state-of-the-art computing platforms, which relies on the classical tile-based outer-product algorithm, but enhances it with several control dependencies to increase data re-use and to optimize communication flow from/to the accelerators within each node.

Abstract: This paper introduces a generic and flexible matrix-matrix multiplication algorithm C = A × B for state-of-the-art computing platforms. Typically, these platforms are distributed-memory machines whose nodes are equipped with several accelerators. To the best of our knowledge, SLATE [9] is the only library that provides a publicly available implementation on such platforms, and it is currently limited to problem instances where the C matrix can entirely fit in the memory of the GPU accelerators. Our algorithm relies on the classical tile-based outer-product algorithm, but enhances it with several control dependencies to increase data re-use and to optimize communication flow from/to the accelerators within each node. The algorithm is written with the PARSEC runtime system, which allows for a fast and generic implementation, while achieving close-to-peak performance.
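
The classical tile-based outer-product scheme that the abstract builds on can be sketched in a few lines. The version below is a plain sequential illustration on nested lists, with the panel index `k0` as the outermost loop so that each step adds one block outer product to all of C. It includes none of the paper's control dependencies or PaRSEC machinery; `tiled_matmul` and the `tile` parameter are hypothetical names.

```python
def tiled_matmul(A, B, tile=2):
    """Compute C = A x B using tile-sized blocks and outer-product steps.

    A is n x kdim, B is kdim x m, both as lists of rows (illustrative sketch).
    """
    n, kdim, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    # Outer-product formulation: for each panel of k, update every tile of C
    # with (column panel of A) x (row panel of B).
    for k0 in range(0, kdim, tile):
        k1 = min(k0 + tile, kdim)
        for i0 in range(0, n, tile):
            for j0 in range(0, m, tile):
                # Update one tile of C with the product of two small tiles.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]
                        for k in range(k0, k1):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

In this formulation each tile of C is revisited once per panel, which is why keeping C tiles resident near the compute unit, and deciding what to do when C does not fit, matters so much on accelerator-based nodes.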

Posted Content•

LocationSpark: In-memory Distributed Spatial Query Processing and Optimization

Mingjie Tang, Yongyang Yu¹, Walid G. Aref, Ahmed R. Mahmood, Qutaibah M. Malluhi, Mourad Ouzzani•Institutions (1)

Purdue University¹

08 Jul 2019-arXiv: Databases

TL;DR: This paper introduces new techniques for handling query skew that commonly happens in practice, and minimizes communication costs accordingly, and proposes a distributed query scheduler that uses a new cost model to minimize the cost of spatial query processing.

Abstract: Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this paper, we present new techniques for spatial query processing and optimization in an in-memory and distributed setup to address scalability. More specifically, we introduce new techniques for handling query skew, which is common in practice, and optimize communication costs accordingly. We propose a distributed query scheduler that uses a new cost model to optimize the cost of spatial query processing. The scheduler generates query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on bitmap filters to forward queries to the appropriate local nodes. Each local computation node is responsible for optimizing and selecting its best local query execution plan based on the indexes and the nature of the spatial queries in that node. All the proposed spatial query processing and optimization techniques are prototyped inside Spark, a distributed memory-based computation system. The experimental study is based on real datasets and demonstrates that distributed spatial query processing can be enhanced by up to an order of magnitude over existing in-memory and distributed spatial systems.

Journal Article•10.1016/J.PARCO.2019.102582•

DuctTeip: An efficient programming model for distributed task-based parallel computing

Afshin Zafari¹, Elisabeth Larsson¹, Martin Tillenius¹•Institutions (1)

Uppsala University¹

1 Dec 2019

TL;DR: This paper extends task-based parallel programming, which has been successful both in simplifying programming and in exploiting the available hardware parallelism on shared memory systems, to distributed memory systems by using a hierarchical decomposition of tasks and data to accommodate the different levels of hardware.

Abstract: Current high-performance computer systems used for scientific computing typically combine shared memory computational nodes in a distributed memory environment. Extracting high performance from these complex systems requires tailored approaches. Task-based parallel programming has been successful both in simplifying the programming and in exploiting the available hardware parallelism for shared memory systems. In this paper we focus on how to extend task-parallel programming to distributed memory systems. We use a hierarchical decomposition of tasks and data in order to accommodate the different levels of hardware. We test the proposed programming model on two different applications, a Cholesky factorization, and a solver for the Shallow Water Equations. We also compare the performance of our implementation with that of other frameworks for distributed task-parallel programming, and show that it is competitive.

Journal Article•10.1007/S11036-019-01318-3•

ZAKI: A Smart Method and Tool for Automatic Performance Optimization of Parallel SpMV Computations on Distributed Memory Machines

Sardar Usman¹, Rashid Mehmood¹, Iyad Katib¹, Aiiad Albeshri¹, Saleh M. Altowaijri²•Institutions (2)

King Abdulaziz University¹, Northern Borders University²

31 Jul 2019-Mobile Networks and Applications

TL;DR: This is the first work where the sparsity structure of matrices has been exploited to predict the optimal number of processes for a given matrix in distributed memory environments by using different base and ensemble machine learning methods.

Abstract: SpMV is a vital computing operation of many scientific, engineering, economic and social applications, increasingly being used to develop timely intelligence for the design and management of smart societies. Several factors affect the performance of SpMV computations, such as matrix characteristics, storage formats, software and hardware platforms. The complexity of the computer systems is on the rise with the increasing number of cores per processor, different levels of caches, processors per node and high speed interconnect. There is an ever-growing need for new optimization techniques and efficient ways of exploiting parallelism. In this paper, we propose ZAKI, a data-driven, machine-learning approach and tool, to predict the optimal number of processes for SpMV computations of an arbitrary sparse matrix on a distributed memory machine. The aim herein is to allow application scientists to automatically obtain the best configuration, and hence the best performance, for the execution of SpMV computations. We train and test the tool using nearly 2000 real world matrices obtained from 45 application domains including computational fluid dynamics (CFD), computer vision, and robotics. The tool uses three machine learning methods (decision trees, random forest, and gradient boosting) and is evaluated in depth. A discussion on the applicability of our proposed tool to energy efficiency optimization of SpMV computations is given. This is the first work where the sparsity structure of matrices has been exploited to predict the optimal number of processes for a given matrix in distributed memory environments by using different base and ensemble machine learning methods.
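
For context on the operation being tuned, the sketch below shows sparse matrix-vector multiplication (SpMV) in compressed sparse row (CSR) form, with the rows split into contiguous blocks as a stand-in for distributing work over processes. It is only an illustration of the kernel and of what "number of processes" controls; it is unrelated to the ZAKI tool's code, and the function names and the even row-block split are assumptions.

```python
def spmv_csr_rows(indptr, indices, data, x, row_start, row_end):
    """Multiply rows [row_start, row_end) of a CSR matrix by the vector x."""
    y = []
    for i in range(row_start, row_end):
        acc = 0.0
        # Nonzeros of row i live in data[indptr[i]:indptr[i + 1]].
        for p in range(indptr[i], indptr[i + 1]):
            acc += data[p] * x[indices[p]]
        y.append(acc)
    return y


def spmv_partitioned(indptr, indices, data, x, num_procs):
    """Simulate a row-block distribution over num_procs 'processes'.

    Each block is computed independently and the pieces are concatenated;
    a real distributed run would place each block on a different process.
    """
    n = len(indptr) - 1
    bounds = [n * p // num_procs for p in range(num_procs + 1)]
    y = []
    for p in range(num_procs):
        y.extend(spmv_csr_rows(indptr, indices, data, x, bounds[p], bounds[p + 1]))
    return y
```

An even row split ignores how many nonzeros each block holds, which is one reason the sparsity structure of a matrix influences the best process count.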

Proceedings Article•10.1109/HPEC.2019.8916243•

One Quadrillion Triangles Queried on One Million Processors

Roger Pearce¹, Trevor Steil¹, Benjamin W. Priest¹, Geoffrey Sanders¹•Institutions (1)

Lawrence Livermore National Laboratory¹

1 Sep 2019

TL;DR: To the authors' knowledge, these results are the largest triangle count experiments on synthetic scale-free graphs to date.

Abstract: We update our prior 2017 Graph Challenge submission [7] on large scale triangle counting in distributed memory by demonstrating scaling and validation on trillion-edge scale-free graphs. We incorporate recent distributed communication optimizations developed for irregular communication workloads [1], and demonstrate scaling up to 1.5 million cores of IBM BG/Q Sequoia at LLNL. We validate our implementation using nonstochastic Kronecker graph generation where ground-truth local and global triangle counts are known, and model our Kronecker graph inputs after the Graph500 [5] R-MAT inputs. To our knowledge, our results are the largest triangle count experiments on synthetic scale-free graphs to date.

Proceedings Article•10.1109/IPDPS.2019.00012•

LACC: A Linear-Algebraic Algorithm for Finding Connected Components in Distributed Memory

Ariful Azad¹, Aydin Buluc²•Institutions (2)

Indiana University¹, Lawrence Berkeley National Laboratory²

20 May 2019

TL;DR: This paper presents LACC, a parallel connected-components algorithm for distributed-memory computers that uses linear algebraic primitives, is based on a PRAM algorithm by Awerbuch and Shiloach, and outperforms previous algorithms by a significant margin.

Abstract: Finding connected components is one of the most widely used operations on a graph. Optimal serial algorithms for the problem have been known for half a century, and many competing parallel algorithms have been proposed over the last several decades under various different models of parallel computation. This paper presents a parallel connected-components algorithm that can run on distributed-memory computers. Our algorithm uses linear algebraic primitives and is based on a PRAM algorithm by Awerbuch and Shiloach. We show that the resulting algorithm, named LACC for Linear Algebraic Connected Components, outperforms competitors by a factor of up to 12x for small to medium scale graphs. For large graphs with more than 50B edges, LACC scales to 4K nodes (262K cores) of a Cray XC40 supercomputer and outperforms previous algorithms by a significant margin. This remarkable performance is accomplished by (1) exploiting sparsity that was not present in the original PRAM algorithm formulation, (2) using high-performance primitives of Combinatorial BLAS, and (3) identifying hot spots and optimizing them away by exploiting algorithmic insights.

Proceedings Article•10.1145/3337821.3337919•

diBELLA: Distributed Long Read to Long Read Alignment

Marquita Ellis¹, Giulia Guidi¹, Aydin Buluc¹, Leonid Oliker², Katherine Yelick¹•Institutions (2)

University of California, Berkeley¹, Lawrence Berkeley National Laboratory²

5 Aug 2019

TL;DR: diBELLA is a parallel algorithm and scalable implementation for genome analysis, specifically the problem of finding overlaps and alignments for data from "third generation" long read sequencers.

Abstract: We present a parallel algorithm and scalable implementation for genome analysis, specifically the problem of finding overlaps and alignments for data from "third generation" long read sequencers [29]. While long sequences of DNA offer enormous advantages for biological analysis and insight, current long read sequencing instruments have high error rates and therefore require different approaches to analysis than their short read counterparts. Our work focuses on an efficient distributed-memory parallelization of an accurate single-node algorithm for overlapping and aligning long reads. We achieve scalability of this irregular algorithm by addressing the competing issues of increasing parallelism, minimizing communication, constraining the memory footprint, and ensuring good load balance. The resulting application, diBELLA, is the first distributed memory overlapper and aligner specifically designed for long reads and parallel scalability. We describe and present analyses for high level design trade-offs and conduct an extensive empirical analysis that compares performance characteristics across state-of-the-art HPC systems as well as a commercial cloud architecture, highlighting the advantages of state-of-the-art network technologies.

Journal Article•10.1109/ACCESS.2019.2923565•

ZAKI+: A Machine Learning Based Process Mapping Tool for SpMV Computations on Distributed Memory Architectures

Sardar Usman¹, Rashid Mehmood¹, Iyad Katib¹, Aiiad Albeshri¹•Institutions (1)

King Abdulaziz University¹

17 Jun 2019-IEEE Access

TL;DR: ZAKI+ provides optimal process mapping and outperforms the MPI default mapping policy by a factor of 4.24; it is the first work where the sparsity structure of matrices has been exploited to predict the optimal mapping of processes and data in distributed-memory environments.

Abstract: Smart cities and other cyber-physical systems (CPSs) rely on various scientific, engineering, business, and social applications that provide timely intelligence for their design, operations, and management. Many of these scientific and analytics applications require the solution of sparse linear equation systems, where sparse matrix-vector (SpMV) product is a key computing operation. Several factors determine the performance of parallel SpMV computations, including matrix characteristics, storage formats, and the rising complexity and heterogeneity of computer systems. There is a pressing need for new ways of exploiting parallelism, and mapping data and applications to the computing resources. We propose here ZAKI+, a data-driven machine-learning approach, allowing users to automatically, effortlessly, and speedily obtain the best configuration (the data distribution, the optimal number of processes, and mapping strategy) and performance for the execution of the parallel SpMV computations on distributed memory machines. We train and test the tool using three machine learning methods (decision trees, random forest, and Xtreme boosting) and nearly 2000 real-world matrices obtained from 45 application domains, including computer vision and robotics. ZAKI+ provides optimal process mapping and outperforms the MPI default mapping policy by a factor of 4.24. This is the first work where the sparsity structure of matrices has been exploited to predict the optimal mapping of processes and data in distributed-memory environments by using different base and ensemble machine learning methods. Various CPSs comprise compute-intensive machine learning applications, such as the SpMV, and hence, the process and data mapping contributions of this paper would be of paramount impact for the CPSs.

Journal Article•10.1016/J.JPDC.2019.02.002•

Modeling the asynchronous Jacobi method without communication delays

Jordi Wolfson-Pou¹, Edmond Chow¹•Institutions (1)

Georgia Institute of Technology¹

01 Jun 2019-Journal of Parallel and Distributed Computing

TL;DR: It is shown that simplified asynchronous Jacobi can continue to reduce the residual when some processes are slower than other processes, and can converge when synchronous Jacobi does not.
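
The behaviour described in this summary can be illustrated with a small sketch. It is not the authors' model: it only assumes that, at each step, a subset of rows is relaxed from the current iterate while the remaining rows keep their old values, with no communication delay. The helper names (`relax`, `run_with_slow_row`) and the 3×3 diagonally dominant system are hypothetical and chosen purely for illustration.

```python
def relax(A, b, x, active_rows):
    """Return a new iterate in which only `active_rows` are relaxed (Jacobi-style).

    Rows that are not active keep their previous value, which mimics a
    process that did not finish its update in this step.
    """
    new_x = list(x)
    for i in active_rows:
        off_diag = sum(A[i][j] * x[j] for j in range(len(x)) if j != i)
        new_x[i] = (b[i] - off_diag) / A[i][i]
    return new_x


def residual_norm(A, b, x):
    """Infinity norm of the residual b - A x."""
    n = len(x)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))


def run_with_slow_row(A, b, steps, slow_row, period):
    """Iterate from x = 0; `slow_row` is relaxed only every `period` steps."""
    x = [0.0] * len(b)
    for step in range(steps):
        active = [i for i in range(len(b)) if i != slow_row or step % period == 0]
        x = relax(A, b, x, active)
    return x


# Illustrative system (an assumption, not taken from the paper); exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
x = run_with_slow_row(A, b, steps=120, slow_row=2, period=3)
print(x, residual_norm(A, b, x))
```

In this toy setting the residual keeps shrinking even though one row is updated three times less often than the others; the cases in which synchronous Jacobi diverges are not reproduced here.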

Journal Article•10.1186/S13104-019-4256-6•

An OpenMP-based tool for finding longest common subsequence in bioinformatics

Rayhan Shikder¹, Parimala Thulasiraman¹, Pourang Irani¹, Pingzhao Hu¹•Institutions (1)

University of Manitoba¹

11 Apr 2019-BMC Research Notes

TL;DR: Three novel parallel versions of the LCS algorithm are developed that can provide significant advantages in terms of execution times, monetary cost, and pervasiveness in finding LCS of DNA sequences in an environment where Graphics Processing Units are not available.

Abstract: Finding the longest common subsequence (LCS) among sequences is NP-hard. This is an important problem in bioinformatics for DNA sequence alignment and pattern discovery. In this research, we propose new CPU-based parallel implementations that can provide significant advantages in terms of execution times, monetary cost, and pervasiveness in finding LCS of DNA sequences in an environment where Graphics Processing Units are not available. For general purpose use, we also make the OpenMP-based tool publicly available to end users. In this study, we develop three novel parallel versions of the LCS algorithm on: (i) a distributed memory machine using the message passing interface (MPI); (ii) a shared memory machine using OpenMP; and (iii) a hybrid platform that utilizes both distributed and shared memory using MPI-OpenMP. The experimental results with both simulated and real DNA sequence data show that the shared memory OpenMP implementation provides at least a two-fold absolute speedup over the best sequential version of the algorithm and a relative speedup of almost 7. We provide a detailed comparison of the execution times among the implementations on different platforms with different versions of the algorithm. We also show that removing branch conditions negatively affects the performance of the CPU-based parallel algorithm on the OpenMP platform.
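
For readers unfamiliar with the underlying recurrence, the following is a minimal sketch of the pairwise LCS-length dynamic program, written once row by row and once along anti-diagonals. It is not the authors' tool: the function names are hypothetical, and the anti-diagonal ordering is shown only because cells on the same anti-diagonal do not depend on each other, which is a common starting point for parallel versions.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_length_wavefront(a: str, b: str) -> int:
    """Same recurrence, filled along anti-diagonals (i + j = d).

    Every cell on one anti-diagonal reads only cells from the two previous
    anti-diagonals, so the inner loop could be split across workers.
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for d in range(2, n + m + 1):
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


print(lcs_length("AGGTAB", "GXTXAYB"))            # 4 (one LCS is "GTAB")
print(lcs_length_wavefront("AGGTAB", "GXTXAYB"))  # 4
```

The sketch runs sequentially; distributing the inner loop with MPI or OpenMP, as the abstract describes, is left out.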

Proceedings Article•10.1109/HPEC.2019.8916299•

Scaling and Quality of Modularity Optimization Methods for Graph Clustering

Sayan Ghosh¹, Mahantesh Halappanavar¹, Antonino Tumeo¹, Ananth Kalyanaraman²•Institutions (2)

Pacific Northwest National Laboratory¹, Washington State University²

1 Sep 2019

TL;DR: An edge-balanced graph distribution is introduced for the distributed-memory community detection implementation of the popular Louvain method, that significantly reduces communication and offers up to 80% improvement in the overall execution time.

Abstract: Real-world graphs exhibit structures known as “communities” or “clusters” consisting of a group of vertices with relatively high connectivity between them, as compared to the rest of the vertices in the network. Graph clustering or community detection is a fundamental graph operation used to analyze real-world graphs occurring in the areas of computational biology, cybersecurity, electrical grids, etc. Similar to other graph algorithms, owing to irregular memory accesses and inherently sequential nature, current algorithms for community detection are challenging to parallelize. However, in order to analyze large networks, it is important to develop scalable parallel implementations of graph clustering that are capable of exploiting the architectural features of modern supercomputers. In response to the 2019 Streaming Graph Challenge, we present quality and performance analysis of our distributed-memory community detection using Vite, which is our distributed memory implementation of the popular Louvain method, on the ALCF Theta supercomputer. Clustering methods such as Louvain that rely on modularity maximization are known to suffer from the resolution limit problem, preventing identification of clusters of certain sizes. Hence, we also include quality analysis of our shared-memory implementation of the Fast-tracking Resistance method, in comparison with Louvain on the challenge datasets. Furthermore, we introduce an edge-balanced graph distribution for our distributed memory implementation, that significantly reduces communication, offering up to 80% improvement in the overall execution time. In addition to performance/quality analysis, we also include details on the power/energy consumption, and memory traffic of the distributed-memory clustering implementation using real-world graphs with over a billion edges.
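
The abstract does not spell out how the edge-balanced distribution is constructed, so the following is only one plausible sketch of the idea, under the assumption that each rank owns a contiguous range of vertices and that ranges are cut so every rank holds roughly the same number of edges rather than the same number of vertices. The function name `edge_balanced_ranges` and the degree list are hypothetical and are not taken from Vite.

```python
from bisect import bisect_left
from itertools import accumulate


def edge_balanced_ranges(degrees, num_ranks):
    """Split vertices 0..n-1 into `num_ranks` contiguous half-open ranges.

    Cuts are placed where the running edge count first reaches
    r/num_ranks of the total, so ranks get similar edge loads even
    when vertex degrees are highly skewed.
    """
    prefix = list(accumulate(degrees))  # prefix[v] = edges owned by vertices 0..v
    total = prefix[-1] if prefix else 0
    bounds = [0]
    for r in range(1, num_ranks):
        target = total * r / num_ranks
        cut = bisect_left(prefix, target) + 1  # number of vertices needed to reach target
        cut = min(max(cut, bounds[-1]), len(degrees))  # keep cuts ordered and in range
        bounds.append(cut)
    bounds.append(len(degrees))
    return [(bounds[i], bounds[i + 1]) for i in range(num_ranks)]


# One high-degree vertex followed by many low-degree ones (made-up data).
degrees = [10, 1, 1, 1, 1, 1, 1, 1, 1, 2]
ranges = edge_balanced_ranges(degrees, 2)
loads = [sum(degrees[lo:hi]) for lo, hi in ranges]
print(ranges, loads)  # [(0, 1), (1, 10)] [10, 10]
```

An even split by vertex count, (0, 5) and (5, 10), would give edge loads of 14 and 6 on the same data, which is the imbalance an edge-based cut is meant to avoid.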

Proceedings Article•10.1109/EMPDP.2019.8671639•

Strategies for Parallel Execution of Cellular Automata in Distributed Memory Architectures

Andrea Giordano¹, Alessio De Rango², Donato D'Ambrosio², Rocco Rongo², W. Spataro¹•Institutions (2)

Indian Council of Agricultural Research¹, University of Calabria²

1 Feb 2019

TL;DR: A preliminary study on different parallelization techniques for structured grid models such as cellular automata on distributed memory architectures and the performance achieved when a real-life application, namely the SciddicaT cellular automata model for debris-flows simulation, is adopted.

Abstract: The Cellular Automata paradigm is an efficient tool to model and study complex systems such as traffic simulation, lava flows and swarm based behaviour. In addition, cellular automata can be profitably used in many mathematical problems such as differential equations and chaos theory. Due to their inherent parallel nature, cellular automata can be efficiently parallelized among a set of computing nodes in order to scale and speed up their execution. This paper presents a preliminary study on different parallelization techniques for structured grid models such as cellular automata on distributed memory architectures. In particular, three strategies are presented and compared in order to evaluate their efficiency in terms of speedup. An experimental section shows the performance achieved by the three strategies when a real-life application, namely the SciddicaT cellular automata model for debris-flows simulation, is adopted.
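
The abstract does not describe the three strategies, so the sketch below shows only the generic idea that such schemes usually build on: the grid is cut into row blocks, each block receives one "halo" row from each neighbouring block, and the blocks are then updated independently. Everything here is an assumption made for illustration: Conway's Game of Life stands in for the SciddicaT transition rule, the halo exchange is simulated inside a single process instead of using message passing, and the function names are hypothetical.

```python
def life_step(grid):
    """One synchronous Game of Life step on a 2D list; cells outside the grid are dead."""
    rows, cols = len(grid), len(grid[0])

    def alive(r, c):
        return grid[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    new = []
    for r in range(rows):
        row = []
        for c in range(cols):
            n = sum(alive(r + dr, c + dc)
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0))
            row.append(1 if n == 3 or (grid[r][c] and n == 2) else 0)
        new.append(row)
    return new


def decomposed_step(grid, num_blocks):
    """Same step, computed block by block with one halo row on each side.

    Each block stands for the sub-grid owned by one node; the halo rows are
    the data that would be received from neighbouring nodes before the update.
    """
    rows, cols = len(grid), len(grid[0])
    bounds = [rows * k // num_blocks for k in range(num_blocks + 1)]
    zero_row = [0] * cols
    result = []
    for k in range(num_blocks):
        lo, hi = bounds[k], bounds[k + 1]
        if lo == hi:
            continue  # more blocks than rows: this block owns nothing
        top_halo = grid[lo - 1] if lo > 0 else zero_row
        bottom_halo = grid[hi] if hi < rows else zero_row
        local = [top_halo] + grid[lo:hi] + [bottom_halo]
        updated = life_step(local)
        result.extend(updated[1:-1])  # drop the halo rows, keep the owned rows
    return result


glider = [
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(decomposed_step(glider, 3) == life_step(glider))  # True
```

With a neighbourhood of radius one, a single halo row per side is enough for the block-wise result to match the global update exactly; rules with a larger neighbourhood would need wider halos.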

Journal Article•10.1016/J.CMA.2019.04.032•

Efficient parallelization for volume-coupled multiphysics simulations on hierarchical Cartesian grids

Michael Schlottke-Lakemper¹, Ansgar Niemöller¹, Matthias Meinke¹, Wolfgang Schröder¹•Institutions (1)

RWTH Aachen University¹

01 Aug 2019-Computer Methods in Applied Mechanics and Engineering

TL;DR: The design of the coupling algorithm is shown to be more efficient than a standard hybrid flow-aeroacoustics scheme coupled via disk I/O and to predict the acoustic field of a turbulent jet in a direct-hybrid simulation.

Posted Content•

Distributed Iterative CT Reconstruction using Multi-Agent Consensus Equilibrium

Venkatesh Sridhar¹, Xiao Wang¹, Gregery T. Buzzard¹, Charles A. Bouman²•Institutions (2)

Purdue University¹, Harvard University²

21 Nov 2019-arXiv: Image and Video Processing

TL;DR: This article proposes a multi-agent consensus equilibrium (MACE) algorithm for distributing both the computation and memory of MBIR reconstruction across a large number of parallel nodes, and obtains a parallel solution to the serial framework of Plug-n-play (PnP) priors, which is called MACE-PnP.

Abstract: Model-Based Image Reconstruction (MBIR) methods significantly enhance the quality of computed tomographic (CT) reconstructions relative to analytical techniques, but are limited by high computational cost. In this paper, we propose a multi-agent consensus equilibrium (MACE) algorithm for distributing both the computation and memory of MBIR reconstruction across a large number of parallel nodes. In MACE, each node stores only a sparse subset of views and a small portion of the system matrix, and each parallel node performs a local sparse-view reconstruction, which, based on repeated feedback from other nodes, converges to the global optimum. Our distributed approach can also incorporate advanced denoisers as priors to enhance reconstruction quality. In this case, we obtain a parallel solution to the serial framework of Plug-n-play (PnP) priors, which we call MACE-PnP. In order to make MACE practical, we introduce a partial update method that eliminates nested iterations and prove that it converges to the same global solution. Finally, we validate our approach on a distributed memory system with real CT data. We also demonstrate an implementation of our approach on a massive supercomputer that can perform large-scale reconstruction in real-time.

Journal Article•10.1016/J.INS.2019.01.050•

Parallel feature selection for distributed-memory clusters

Jorge González-Domínguez, Verónica Bolón-Canedo, Borja Freire, Juan Touriño

01 Sep 2019-Information Sciences

TL;DR: This work presents fast-mRMR-MPI, a novel hybrid parallel implementation that uses MPI and OpenMP to accelerate feature selection on distributed-memory clusters to improve both scalability and efficiency.

Journal Article•10.1016/J.JPDC.2018.11.009•

Large-scale parallel similarity search with Product Quantization for online multimedia services

Guilherme Andrade¹, André Fernandes², Jeremias M. Gomes², Renato Ferreira¹, George Teodoro²•Institutions (2)

Universidade Federal de Minas Gerais¹, University of Brasília²

01 Mar 2019-Journal of Parallel and Distributed Computing

TL;DR: This paper addresses the challenges of similarity search in high-dimensional spaces with an efficient parallelization of the Product Quantization Approximate Nearest Neighbor Search (PQANNS) indexing and proposes mechanisms to minimize the query response times in online scenarios in which the query rates vary at run-time.

Proceedings Article•10.1109/ISPA-BDCLOUD-SUSTAINCOM-SOCIALCOM48970.2019.00074•

A Variable Batch Size Strategy for Large Scale Distributed DNN Training

Hu Zhongzhe¹, Junmin Xiao¹, Tian Zhongbo, Xiaoyang Zhang¹, Zhu Hongrui¹, Yao Chengji, Ninghui Sun¹, Guangming Tan¹•Institutions (1)

Chinese Academy of Sciences¹

1 Dec 2019

TL;DR: An in-depth analysis of the different stages in the recent multi-step schedule finds that the training process at the first stage is sensitive to the batch size, while different batch sizes do not significantly impact the later stages, and a novel variable batch size strategy is proposed.

Abstract: Large-batch distributed synchronous stochastic gradient descent (SGD) is widely used to train deep neural networks on distributed memory systems with multiple nodes, where it can leverage parallel resources to reduce the number of iterative steps and speed up the convergence of the training process. However, large-batch SGD leads to poor test accuracy, which counteracts the benefits of large-scale parallel SGD. Existing solutions for large-batch training either significantly degrade accuracy or require massive additional hyper-parameter tuning. To overcome this difficulty, we propose a novel variable batch size strategy. With an in-depth analysis of the different stages in the recent multi-step schedule, we find that the training process at the first stage is sensitive to the batch size, while different batch sizes do not significantly impact the later stages. Based on this discovery, we first claim that different stages of training should use different batch sizes. Hence, the variable batch size strategy is proposed for large-scale distributed training. Furthermore, in order to tune existing hyper-parameters automatically, an auto-tuning engine is designed for the variable batch size strategy to achieve higher testing accuracy in extremely large batch size cases. By using our strategy, we successfully scale the batch size to 120K in the later stages on ImageNet-1K with ResNet-50 without accuracy loss, and to 128K with slight accuracy loss. The experimental evaluation on 2048 GPUs shows that the variable batch size strategy with our auto-tuning engine can complete the training of ResNet-50 in 25 minutes. Furthermore, the new strategy decreases the number of parameter updates by a factor of about 1.7 compared with Facebook's multi-step schedule.

Journal Article•10.1145/3357596•

GraVF-M: Graph Processing System Generation for Multi-FPGA Platforms

Nina Engelhardt¹, Hayden K.-H. So¹•Institutions (1)

University of Hong Kong¹

14 Oct 2019-arXiv: Distributed, Parallel, and Cluster Computing

TL;DR: GraVF-M is presented, a framework designed to ease the implementation of FPGA-based graph processing accelerators for multi-FPGA platforms with distributed memory that compares favorably to state-of-the-art frameworks from the literature and reaches 94% of the projected performance limit of the system.

Abstract: Due to the irregular nature of connections in most graph datasets, partitioning graph analysis algorithms across multiple computational nodes that do not share a common memory inevitably leads to large amounts of interconnect traffic. Previous research has shown that FPGAs can outcompete software-based graph processing in shared memory contexts, but it remains an open question if this advantage can be maintained in distributed systems. In this work, we present GraVF-M, a framework designed to ease the implementation of FPGA-based graph processing accelerators for multi-FPGA platforms with distributed memory. Based on a lightweight description of the algorithm kernel, the framework automatically generates optimized RTL code for the whole multi-FPGA design. We exploit an aspect of the programming model to present a familiar message-passing paradigm to the user, while under the hood implementing a more efficient architecture that can reduce the necessary inter-FPGA network traffic by a factor equal to the average degree of the input graph. A performance model based on a theoretical analysis of the factors influencing performance serves to evaluate the efficiency of our implementation. With a throughput of up to 5.8 GTEPS (billions of traversed edges per second) on a 4-FPGA system, the designs generated by GraVF-M compare favorably to state-of-the-art frameworks from the literature and reach 94% of the projected performance limit of the system.

Posted Content•

In-memory Distributed Spatial Query Processing and Optimization

Mingjie Tang, Yongyang Yu, Walid G. Aref, Ahmed R. Mahmood, Qutaibah M. Malluhi, Mourad Ouzzani

8 Jul 2019

TL;DR: In this article, the authors propose a distributed spatial query scheduler that uses a new cost model to optimize the cost of spatial query processing in an in-memory and distributed setup to address scalability.

Patent•

Executing distributed memory operations using processing elements connected by distributed channels

Bushra Ahsan¹, Michael Adler¹, Neal Crago¹, Joel Emer¹, Aamer Jaleel¹, Angshuman Parashar¹, Michael Pellauer¹•Institutions (1)

Intel¹

17 Jun 2019

TL;DR: This patent discloses a method for implementing distributed memory operations that obtains distributed channel information for an algorithm to be executed by a plurality of spatially distributed processing elements and associates processing elements with each distributed channel based on the algorithm.

Abstract: A technology for implementing a method for distributed memory operations. A method of the disclosure includes obtaining distributed channel information for an algorithm to be executed by a plurality of spatially distributed processing elements. For each distributed channel in the distributed channel information, the method further associates one or more of the plurality of spatially distributed processing elements with the distributed channel based on the algorithm.
