Scispace (Formerly Typeset)
Showing papers on "Distributed memory" published in 2019
Journal Article•10.1093/SYSBIO/SYY054•
EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences.

Pierre Barbera1, Alexey M. Kozlov1, Lucas Czech1, Benoit Morel1, Diego Darriba1,2, Tomas Flouri1,3, Alexandros Stamatakis1,4 •
Heidelberg Institute for Theoretical Studies1, University of A Coruña2, University College London3, Karlsruhe Institute of Technology4
01 Mar 2019-Systematic Biology
TL;DR: The performance assessment shows that EPA‐NG outperforms RAxML‐EPA and PPLACER by up to a factor of 30 in sequential execution mode, while attaining comparable parallel efficiency on shared memory systems.
Abstract: Next generation sequencing (NGS) technologies have led to a ubiquity of molecular sequence data. This data avalanche is particularly challenging in metagenetics, which focuses on taxonomic identification of sequences obtained from diverse microbial environments. Phylogenetic placement methods determine how these sequences fit into an evolutionary context. Previous implementations of phylogenetic placement algorithms, such as the evolutionary placement algorithm (EPA) included in RAxML, or PPLACER, are being increasingly used for this purpose. However, due to the steady progress in NGS technologies, the current implementations face substantial scalability limitations. Herein, we present EPA-NG, a complete reimplementation of the EPA that is substantially faster, offers a distributed memory parallelization, and integrates concepts from both RAxML-EPA and PPLACER. EPA-NG can be executed on standard shared memory, as well as on distributed memory systems (e.g., computing clusters). To demonstrate the scalability of EPA-NG, we placed 1 billion metagenetic reads from the Tara Oceans Project onto a reference tree with 3748 taxa in just under 7 h, using 2048 cores. Our performance assessment shows that EPA-NG outperforms RAxML-EPA and PPLACER by up to a factor of 30 in sequential execution mode, while attaining comparable parallel efficiency on shared memory systems. We further show that the distributed memory parallelization of EPA-NG scales well up to 2048 cores. EPA-NG is available under the AGPLv3 license: https://github.com/Pbdas/epa-ng.

561 citations

Journal Article•10.1134/S1995080219050056•
AMGCL: An Efficient, Flexible, and Extensible Algebraic Multigrid Implementation

Denis Demidov1•
Russian Academy of Sciences1
24 Jun 2019-Lobachevskii Journal of Mathematics
TL;DR: The paper presents AMGCL, an open-source C++ library implementing the algebraic multigrid method (AMG) for the solution of large sparse linear systems of equations, usually arising from discretization of partial differential equations on an unstructured grid.
Abstract: The paper presents AMGCL, an open-source C++ library implementing the algebraic multigrid method (AMG) for the solution of large sparse linear systems of equations, usually arising from discretization of partial differential equations on an unstructured grid. The library supports both shared and distributed memory computation, can utilize modern massively parallel processors via OpenMP, OpenCL, or CUDA technologies, has minimal dependencies, and is easily extensible. The design principles behind AMGCL are discussed, and it is shown that the code performance is on par with alternative implementations.

98 citations

Journal Article•10.1109/TPDS.2019.2904058•
Parallelizing Word2Vec in Shared and Distributed Memory

Shihao Ji1, Nadathur Satish2, Sheng Li2, Pradeep Dubey2•
Georgia State University1, Intel2
01 Sep 2019-IEEE Transactions on Parallel and Distributed Systems
TL;DR: HogBatch improves reuse of various data structures in the algorithm through the use of minibatching and negative sample sharing, allowing the problem to be expressed using matrix multiply operations.
Abstract: Word2vec is a widely used algorithm for extracting low-dimensional vector representations of words. State-of-the-art algorithms including those by Mikolov et al. [1], [2] have been parallelized for multi-core CPU architectures, but are based on vector-vector operations with “Hogwild” updates that are memory-bandwidth intensive and do not efficiently use computational resources. In this paper, we propose “HogBatch” by improving reuse of various data structures in the algorithm through the use of minibatching and negative sample sharing, hence allowing us to express the problem using matrix multiply operations. We also explore different techniques to distribute word2vec computation across nodes in a computer cluster, and demonstrate good strong scalability up to 32 nodes. The new algorithm is particularly suitable for modern multi-core/many-core architectures, especially Intel's latest Knights Landing processors, and allows us to scale up the computation near linearly across cores and nodes, and process hundreds of millions of words per second, which is the fastest word2vec implementation to the best of our knowledge. We released the source code for reproducible research and general usage.

58 citations

Journal Article•10.1145/3365676•
Fast Parallel Algorithms for Counting and Listing Triangles in Big Graphs

Shaikh Arifuzzaman1, Maleq Khan2, Madhav V. Marathe3•
University of New Orleans1, Texas A&M University2, University of Virginia3
13 Dec 2019-ACM Transactions on Knowledge Discovery From Data
TL;DR: This article presents two efficient MPI-based distributed memory parallel algorithms for counting triangles in big graphs, along with a novel approach that drastically reduces communication cost, making the algorithm both space- and runtime-efficient.
Abstract: Big graphs (networks) arising in numerous application areas pose significant challenges for graph analysts as these graphs grow to billions of nodes and edges and are prohibitively large to fit in the main memory. Finding the number of triangles in a graph is an important problem in the mining and analysis of graphs. In this article, we present two efficient MPI-based distributed memory parallel algorithms for counting triangles in big graphs. The first algorithm employs overlapping partitioning and efficient load balancing schemes to provide a very fast parallel algorithm. The algorithm scales well to networks with billions of nodes and can compute the exact number of triangles in a network with 10 billion edges in 16 minutes. The second algorithm divides the network into non-overlapping partitions, leading to a space-efficient algorithm. Our results on both artificial and real-world networks demonstrate a significant space saving with this algorithm. We also present a novel approach that drastically reduces communication cost, making the algorithm both space- and runtime-efficient. Further, we demonstrate how our algorithms can be used to list all triangles in a graph and compute clustering coefficients of nodes. Our algorithm can also be adapted to a parallel approximation algorithm using an edge sparsification method.

36 citations
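
Illustrative sketch (not taken from the article): the abstract above concerns exact triangle counting. A minimal sequential version of the underlying problem is shown below, assuming the common trick of ordering vertices by degree and intersecting only the "higher-ranked" neighbor sets so that each triangle is counted once. The function name `count_triangles` and the edge-list input format are assumptions for illustration; the MPI partitioning and load balancing described in the abstract are not reproduced here.

```python
from collections import defaultdict


def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    # Rank vertices by (degree, id); ties are broken by the vertex id.
    rank = {v: (len(adj[v]), v) for v in adj}

    # For every vertex keep only the neighbors with a strictly higher rank.
    higher = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}

    total = 0
    for u in higher:
        for v in higher[u]:
            # A common higher-ranked neighbor closes a triangle whose
            # lowest-ranked vertex is u, so each triangle is counted once.
            total += len(higher[u] & higher[v])
    return total


if __name__ == "__main__":
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(count_triangles(k4))
```

In a distributed setting, the per-vertex loops are what would be split across processes; how the neighbor sets are partitioned (overlapping or non-overlapping) is exactly the design choice the article's two algorithms differ on.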

Journal Article•10.1016/J.CMA.2018.08.028•
Large-scale stochastic topology optimization using adaptive mesh refinement and coarsening through a two-level parallelization scheme

Joan Baiges1, Jesús Martínez-Frutos2, D. Herrero-Pérez2, F. Otero3, A. Ferrer1 •
Polytechnic University of Catalonia1, University of Cartagena2, National Institute of Statistics and Geography3
01 Jan 2019-Computer Methods in Applied Mechanics and Engineering
TL;DR: The topology optimization problem is addressed using the topological derivative concept in combination with a level-set method, and the performance and scalability are evaluated using several numerical benchmarks and real-world applications, showing good performance and scalability up to thousands of processors.

35 citations

Journal Article•10.1109/TCBB.2017.2760829•
Kmerind: A Flexible Parallel Library for K-mer Indexing of Biological Sequences on Distributed Memory Systems

Tony Pan1, Patrick Flick1, Chirag Jain1, Yongchao Liu1, Srinivas Aluru1 •
Georgia Institute of Technology1
01 Jul 2019-IEEE/ACM Transactions on Computational Biology and Bioinformatics
TL;DR: The Kmerind library provides a set of simple and consistent APIs with sequential semantics and parallel implementations that are designed to be flexible and extensible, and is the first k-mer indexing library for distributed memory environments, and the first extensible library for general indexing and counting.
Abstract: Counting and indexing fixed length substrings, or k-mers, in biological sequences is a key step in many bioinformatics tasks including genome alignment and mapping, genome assembly, and error correction. While advances in next generation sequencing technologies have dramatically reduced the cost and improved latency and throughput, few bioinformatics tools can efficiently process the datasets at the current generation rate of 1.8 terabases per 3-day experiment from a single sequencer. We present Kmerind, a high performance parallel k-mer indexing library for distributed memory environments. The Kmerind library provides a set of simple and consistent APIs with sequential semantics and parallel implementations that are designed to be flexible and extensible. Kmerind's k-mer counter performs similarly or better than the best existing k-mer counting tools even on shared memory systems. In a distributed memory environment, Kmerind counts k-mers in a 120 GB sequence read dataset in less than 13 seconds on 1024 Xeon CPU cores, and fully indexes their positions in approximately 17 seconds. Querying for 1 percent of the k-mers in these indices can be completed in 0.23 seconds and 28 seconds, respectively. Kmerind is the first k-mer indexing library for distributed memory environments, and the first extensible library for general k-mer indexing and counting. Kmerind is available at https://github.com/ParBLiSS/kmerind.

33 citations
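
Illustrative sketch (not Kmerind's API): the abstract above describes counting fixed-length substrings (k-mers) across distributed memory. The snippet below simulates one plausible scheme in a single process, assuming each k-mer is assigned to an owning rank by hashing so that every rank holds a disjoint slice of the count table. The names `kmers` and `partitioned_counts`, and the use of CRC32 as the hash, are assumptions made for illustration only.

```python
import zlib
from collections import Counter


def kmers(sequence, k):
    """Yield every length-k substring of the sequence, left to right."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


def partitioned_counts(reads, k, num_ranks):
    """Count k-mers into num_ranks disjoint tables, one per simulated rank."""
    tables = [Counter() for _ in range(num_ranks)]
    for read in reads:
        for kmer in kmers(read, k):
            # A deterministic hash decides which rank owns this k-mer.
            owner = zlib.crc32(kmer.encode("ascii")) % num_ranks
            tables[owner][kmer] += 1
    return tables


if __name__ == "__main__":
    tables = partitioned_counts(["ACGTACGT"], k=4, num_ranks=3)
    merged = sum(tables, Counter())
    print(merged["ACGT"])
```

Because ownership depends only on the k-mer itself, a query for a given k-mer would need to contact only its owning rank; in a real MPI program the inner loop would become an all-to-all exchange rather than a local table update.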

Journal Article•10.1002/NME.5970•
Parallel block‐preconditioned monolithic solvers for fluid‐structure interaction problems

D. Jodlbauer1, Ulrich Langer2, Thomas Wick3•
Johannes Kepler University of Linz1, Austrian Academy of Sciences2, Leibniz University of Hanover3
10 Feb 2019-International Journal for Numerical Methods in Engineering
TL;DR: This article considers the solution of fluid-structure interaction problems using a monolithic approach for the coupling between fluid and solid subproblems, realized by means of the arbitrary Lagrangian-Eulerian framework and a nonlinear harmonic mesh motion model.
Abstract: In this work, we consider the solution of fluid-structure interaction problems using a monolithic approach for the coupling between fluid and solid subproblems. The coupling of both equations is realized by means of the arbitrary Lagrangian-Eulerian framework and a nonlinear harmonic mesh motion model. Monolithic approaches require the solution of large, ill-conditioned linear systems of algebraic equations at every Newton step. Direct solvers tend to use too much memory even for a relatively small number of degrees of freedom, and, in addition, exhibit superlinear growth in arithmetic complexity. Thus, iterative solvers are the only viable option. To ensure convergence of iterative methods within a reasonable number of iterations, good and, at the same time, cheap preconditioners have to be developed. We study physics-based block preconditioners, which are derived from the block $LDU$-factorization of the FSI Jacobian, and their performance on distributed memory parallel computers in terms of two- and three-dimensional test cases permitting large deformations.

29 citations

Journal Article•10.1016/J.SWEVO.2019.06.003•
Performance analysis of synchronous and asynchronous distributed genetic algorithms on multiprocessors

Amr Abdelhafez1, Enrique Alba, Gabriel Luque•
Assiut University1
01 Sep 2019-Swarm and Evolutionary Computation
TL;DR: This article presents a performance study of three different PGAs: the master-slave model and synchronous and asynchronous distributed GAs (dGAs). It presents their different designs and expected similarities when running on one to 32 cores, and describes how the speed-up and parallel performance of the dGAs change as the number of cores grows.
Abstract: Because of their effectiveness and flexibility in finding useful solutions, Genetic Algorithms (GAs) are very popular search techniques for solving complex optimization problems in scientific and industrial fields. Parallel GAs (PGAs), and especially distributed ones, have usually been presented as the way to overcome the time-consuming shortcoming of sequential GAs. When applying PGAs, we can expect better performance because of the exchange of knowledge during the parallel search process. The resulting distributed search differs from what sequential panmictic GAs do and thus deserves additional study. This article presents a performance study of three different PGAs. Moreover, we investigate the effect of synchronizing communications over modern shared-memory multiprocessors. We consider the master-slave model along with synchronous and asynchronous distributed GAs (dGAs), presenting their different designs and expected similarities when running on one to 32 cores. The master-slave model showed a competitive numerical effort versus the other dGAs and proved able to scale up well on multiprocessors. We describe how the speed-up and parallel performance of the dGAs change as the number of cores grows. Results of the island model show that synchronous and asynchronous dGAs have different numerical performances on a multiprocessor, with the asynchronous algorithm executing faster and thus being more attractive for time-demanding applications. Our results and statistical analyses help in developing a novel body of knowledge on PGAs running in shared memory multiprocessors (versus overwhelming literature oriented to distributed memory clusters), something useful for researchers, beginners, and final users of these techniques.

28 citations

Posted Content•
FastSV: A Distributed-Memory Connected Component Algorithm with Fast Convergence

Yongzhe Zhang, Ariful Azad, Zhenjiang Hu
14 Oct 2019-arXiv: Data Structures and Algorithms
TL;DR: The algorithm simplifies the classic Shiloach-Vishkin algorithm and employs several novel and efficient hooking strategies for faster convergence; its steps are mapped to linear algebraic operations and implemented with the help of scalable graph libraries.
Abstract: This paper presents a new distributed-memory algorithm called FastSV for finding connected components in an undirected graph. Our algorithm simplifies the classic Shiloach-Vishkin algorithm and employs several novel and efficient hooking strategies for faster convergence. We map different steps of FastSV to linear algebraic operations and implement them with the help of scalable graph libraries. FastSV uses sparse operations to avoid redundant work and optimized MPI communication to avoid bottlenecks. The resultant algorithm shows high performance and scalability, as it can find the connected components of a hyperlink graph with over 134B edges in 30 seconds using 262K cores on a Cray XC40 supercomputer. FastSV outperforms the state-of-the-art algorithm by an average speedup of 2.21x (max 4.27x) on a variety of real-world graphs.

27 citations
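
Illustrative sketch (a simplification, not FastSV itself): the abstract above builds on the Shiloach-Vishkin idea of alternating "hooking" (attaching one tree to another across an edge) and "shortcutting" (pointer jumping toward the root). The sequential toy version below assumes vertices are numbered 0..n-1 and always hooks toward the smaller label; the function name `connected_components` is hypothetical, and none of FastSV's specific hooking strategies or linear-algebraic formulation are reproduced.

```python
def connected_components(num_vertices, edges):
    """Label each vertex with the smallest vertex id in its component."""
    parent = list(range(num_vertices))  # every vertex starts as its own root
    changed = True
    while changed:
        changed = False
        # Hooking: for each edge, try to attach the parent of one endpoint
        # to the (smaller) parent of the other endpoint.
        for u, v in edges:
            for a, b in ((u, v), (v, u)):
                pa, pb = parent[a], parent[b]
                if pa < parent[pb]:
                    parent[pb] = pa
                    changed = True
        # Shortcutting: replace each parent by the grandparent.
        for x in range(num_vertices):
            grandparent = parent[parent[x]]
            if grandparent != parent[x]:
                parent[x] = grandparent
                changed = True
    return parent


if __name__ == "__main__":
    print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
```

Labels only ever decrease and `parent[x] <= x` always holds, so the loop terminates with every tree flattened to a star rooted at the component's smallest vertex. A parallel variant would apply the hooking step to all edges simultaneously, which is where the choice of hooking strategy affects convergence.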

Proceedings Article•10.1145/3295500.3356201•
Streaming message interface: high-performance distributed memory programming on reconfigurable hardware

Tiziano De Matteis1, Johannes de Fine Licht1, Jakub Beránek2, Torsten Hoefler1•
ETH Zurich1, Technical University of Ostrava2
17 Nov 2019
TL;DR: Streaming Message Interface (SMI) is a communication model and API that unifies explicit message passing with a hardware-oriented programming model, facilitating minimal-overhead, flexible, and productive inter-FPGA communication.
Abstract: Distributed memory programming is the established paradigm used in high-performance computing (HPC) systems, requiring explicit communication between nodes and devices. When FPGAs are deployed in distributed settings, communication is typically handled either by going through the host machine, sacrificing performance, or by streaming across fixed device-to-device connections, sacrificing flexibility. We present Streaming Message Interface (SMI), a communication model and API that unifies explicit message passing with a hardware-oriented programming model, facilitating minimal-overhead, flexible, and productive inter-FPGA communication. Instead of bulk transmission, messages are streamed across the network during computation, allowing communication to be seamlessly integrated into pipelined designs. We present a high-level synthesis implementation of SMI targeting a dedicated FPGA interconnect, exposing runtime-configurable routing with support for arbitrary network topologies, and implement a set of distributed memory benchmarks. Using SMI, programmers can implement distributed, scalable HPC programs on reconfigurable hardware, without deviating from best practices for hardware design.

27 citations

Proceedings Article•10.1109/SCALA49573.2019.00010•
Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC

Thomas Herault1, Yves Robert1, George Bosilca1, Jack Dongarra1•
University of Tennessee1
1 Nov 2019
TL;DR: This paper introduces a generic and flexible matrix-matrix multiplication algorithm C = A × B for state-of-the-art computing platforms, which relies on the classical tile-based outer-product algorithm, but enhances it with several control dependencies to increase data re-use and to optimize communication flow from/to the accelerators within each node.
Abstract: This paper introduces a generic and flexible matrix-matrix multiplication algorithm C = A × B for state-of-the-art computing platforms. Typically, these platforms are distributed-memory machines whose nodes are equipped with several accelerators. To the best of our knowledge, SLATE [9] is the only library that provides a publicly available implementation on such platforms, and it is currently limited to problem instances where the C matrix can entirely fit in the memory of the GPU accelerators. Our algorithm relies on the classical tile-based outer-product algorithm, but enhances it with several control dependencies to increase data re-use and to optimize communication flow from/to the accelerators within each node. The algorithm is written with the PARSEC runtime system, which allows for a fast and generic implementation, while achieving close-to-peak performance.
Posted Content•
LocationSpark: In-memory Distributed Spatial Query Processing and Optimization

Mingjie Tang, Yongyang Yu1, Walid G. Aref, Ahmed R. Mahmood, Qutaibah M. Malluhi, Mourad Ouzzani •
Purdue University1
08 Jul 2019-arXiv: Databases
TL;DR: This paper introduces new techniques for handling query skew that commonly happens in practice, and minimizes communication costs accordingly, and proposes a distributed query scheduler that uses a new cost model to minimize the cost of spatial query processing.
Abstract: Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this paper, we present new techniques for spatial query processing and optimization in an in-memory and distributed setup to address scalability. More specifically, we introduce new techniques for handling query skew, which is common in practice, and optimize communication costs accordingly. We propose a distributed query scheduler that uses a new cost model to optimize the cost of spatial query processing. The scheduler generates query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on bitmap filters to forward queries to the appropriate local nodes. Each local computation node is responsible for optimizing and selecting its best local query execution plan based on the indexes and the nature of the spatial queries in that node. All the proposed spatial query processing and optimization techniques are prototyped inside Spark, a distributed memory-based computation system. The experimental study is based on real datasets and demonstrates that distributed spatial query processing can be enhanced by up to an order of magnitude over existing in-memory and distributed spatial systems.
Journal Article•10.1016/J.PARCO.2019.102582•
DuctTeip: An efficient programming model for distributed task-based parallel computing

Afshin Zafari1, Elisabeth Larsson1, Martin Tillenius1•
Uppsala University1
1 Dec 2019
TL;DR: This article extends task-based parallel programming, which has been successful both in simplifying the programming and in exploiting the available hardware parallelism for shared memory systems, to distributed memory systems, using a hierarchical decomposition of tasks and data to accommodate the different levels of hardware.
Abstract: Current high-performance computer systems used for scientific computing typically combine shared memory computational nodes in a distributed memory environment. Extracting high performance from these complex systems requires tailored approaches. Task-based parallel programming has been successful both in simplifying the programming and in exploiting the available hardware parallelism for shared memory systems. In this paper we focus on how to extend task-parallel programming to distributed memory systems. We use a hierarchical decomposition of tasks and data in order to accommodate the different levels of hardware. We test the proposed programming model on two different applications, a Cholesky factorization, and a solver for the Shallow Water Equations. We also compare the performance of our implementation with that of other frameworks for distributed task-parallel programming, and show that it is competitive.
Journal Article•10.1007/S11036-019-01318-3•
ZAKI: A Smart Method and Tool for Automatic Performance Optimization of Parallel SpMV Computations on Distributed Memory Machines

Sardar Usman1, Rashid Mehmood1, Iyad Katib1, Aiiad Albeshri1, Saleh M. Altowaijri2 •
King Abdulaziz University1, Northern Borders University2
31 Jul 2019-Mobile Networks and Applications
TL;DR: This is the first work where the sparsity structure of matrices has been exploited to predict the optimal number of processes for a given matrix in distributed memory environments by using different base and ensemble machine learning methods.
Abstract: SpMV is a vital computing operation of many scientific, engineering, economic and social applications, increasingly being used to develop timely intelligence for the design and management of smart societies. Several factors affect the performance of SpMV computations, such as matrix characteristics, storage formats, software and hardware platforms. The complexity of the computer systems is on the rise with the increasing number of cores per processor, different levels of caches, processors per node and high speed interconnect. There is an ever-growing need for new optimization techniques and efficient ways of exploiting parallelism. In this paper, we propose ZAKI, a data-driven, machine-learning approach and tool, to predict the optimal number of processes for SpMV computations of an arbitrary sparse matrix on a distributed memory machine. The aim herein is to allow application scientists to automatically obtain the best configuration, and hence the best performance, for the execution of SpMV computations. We train and test the tool using nearly 2000 real world matrices obtained from 45 application domains including computational fluid dynamics (CFD), computer vision, and robotics. The tool uses three machine learning methods (decision trees, random forest, and gradient boosting) and is evaluated in depth. A discussion on the applicability of our proposed tool to energy efficiency optimization of SpMV computations is given. This is the first work where the sparsity structure of matrices has been exploited to predict the optimal number of processes for a given matrix in distributed memory environments by using different base and ensemble machine learning methods.
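Illustrative sketch (not part of ZAKI): for readers unfamiliar with the operation being tuned above, the snippet below shows a sparse matrix-vector product in the widely used compressed sparse row (CSR) layout, plus a simple contiguous row-block split of the kind a distributed run might use to assign rows to processes. The function names `spmv_csr` and `row_blocks` are hypothetical, and the even row split is an assumption; the abstract does not say which data distribution the tool uses.

```python
def spmv_csr(indptr, indices, data, x, rows=None):
    """Compute y = A @ x for a CSR matrix, optionally for a row range only."""
    start, stop = rows if rows is not None else (0, len(indptr) - 1)
    y = []
    for row in range(start, stop):
        acc = 0.0
        # Nonzeros of this row live in data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y


def row_blocks(num_rows, num_procs):
    """Split rows into num_procs contiguous, nearly equal (start, stop) ranges."""
    base, extra = divmod(num_rows, num_procs)
    blocks, start = [], 0
    for p in range(num_procs):
        stop = start + base + (1 if p < extra else 0)
        blocks.append((start, stop))
        start = stop
    return blocks


if __name__ == "__main__":
    # A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]] stored in CSR form.
    indptr, indices, data = [0, 2, 3, 5], [0, 2, 1, 0, 2], [2.0, 1.0, 3.0, 4.0, 5.0]
    x = [1.0, 2.0, 3.0]
    pieces = [spmv_csr(indptr, indices, data, x, rows=b) for b in row_blocks(3, 2)]
    print([value for piece in pieces for value in piece])
```

With an even row split, processes owning dense rows do more work than those owning sparse ones, which is one reason the best process count depends on the matrix's sparsity structure.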
Proceedings Article•10.1109/HPEC.2019.8916243•
One Quadrillion Triangles Queried on One Million Processors

Roger Pearce1, Trevor Steil1, Benjamin W. Priest1, Geoffrey Sanders1•
Lawrence Livermore National Laboratory1
1 Sep 2019
TL;DR: To the authors' knowledge, these results are the largest triangle count experiments on synthetic scale-free graphs to date.
Abstract: We update our prior 2017 Graph Challenge submission [7] on large scale triangle counting in distributed memory by demonstrating scaling and validation on trillion-edge scale-free graphs. We incorporate recent distributed communication optimizations developed for irregular communication workloads [1], and demonstrate scaling up to 1.5 million cores of IBM BG/Q Sequoia at LLNL. We validate our implementation using nonstochastic Kronecker graph generation where ground-truth local and global triangle counts are known, and model our Kronecker graph inputs after the Graph500 [5] R-MAT inputs. To our knowledge, our results are the largest triangle count experiments on synthetic scale-free graphs to date.
Proceedings Article•10.1109/IPDPS.2019.00012•
LACC: A Linear-Algebraic Algorithm for Finding Connected Components in Distributed Memory

Ariful Azad1, Aydin Buluc2•
Indiana University1, Lawrence Berkeley National Laboratory2
20 May 2019
TL;DR: This paper presents LACC, a parallel connected-components algorithm for distributed-memory computers that uses linear algebraic primitives and is based on a PRAM algorithm by Awerbuch and Shiloach; it outperforms previous algorithms by a significant margin.
Abstract: Finding connected components is one of the most widely used operations on a graph. Optimal serial algorithms for the problem have been known for half a century, and many competing parallel algorithms have been proposed over the last several decades under various different models of parallel computation. This paper presents a parallel connected-components algorithm that can run on distributed-memory computers. Our algorithm uses linear algebraic primitives and is based on a PRAM algorithm by Awerbuch and Shiloach. We show that the resulting algorithm, named LACC for Linear Algebraic Connected Components, outperforms competitors by a factor of up to 12x for small to medium scale graphs. For large graphs with more than 50B edges, LACC scales to 4K nodes (262K cores) of a Cray XC40 supercomputer and outperforms previous algorithms by a significant margin. This remarkable performance is accomplished by (1) exploiting sparsity that was not present in the original PRAM algorithm formulation, (2) using high-performance primitives of Combinatorial BLAS, and (3) identifying hot spots and optimizing them away by exploiting algorithmic insights.
Proceedings Article•10.1145/3337821.3337919•
diBELLA: Distributed Long Read to Long Read Alignment

Marquita Ellis1, Giulia Guidi1, Aydin Buluc1, Leonid Oliker2, Katherine Yelick1 •
University of California, Berkeley1, Lawrence Berkeley National Laboratory2
5 Aug 2019
TL;DR: diBELLA is a parallel algorithm and scalable implementation for genome analysis, specifically for the problem of finding overlaps and alignments for data from "third generation" long read sequencers.
Abstract: We present a parallel algorithm and scalable implementation for genome analysis, specifically the problem of finding overlaps and alignments for data from "third generation" long read sequencers [29]. While long sequences of DNA offer enormous advantages for biological analysis and insight, current long read sequencing instruments have high error rates and therefore require different approaches to analysis than their short read counterparts. Our work focuses on an efficient distributed-memory parallelization of an accurate single-node algorithm for overlapping and aligning long reads. We achieve scalability of this irregular algorithm by addressing the competing issues of increasing parallelism, minimizing communication, constraining the memory footprint, and ensuring good load balance. The resulting application, diBELLA, is the first distributed memory overlapper and aligner specifically designed for long reads and parallel scalability. We describe and present analyses for high level design trade-offs and conduct an extensive empirical analysis that compares performance characteristics across state-of-the-art HPC systems as well as commercial cloud architectures, highlighting the advantages of state-of-the-art network technologies.
Journal Article•10.1109/ACCESS.2019.2923565•
ZAKI+: A Machine Learning Based Process Mapping Tool for SpMV Computations on Distributed Memory Architectures

Sardar Usman1, Rashid Mehmood1, Iyad Katib1, Aiiad Albeshri1•
King Abdulaziz University1
17 Jun 2019-IEEE Access
TL;DR: ZAKI+ provides optimal process mapping and outperforms the MPI default mapping policy by a factor of 4.24. It is the first work where the sparsity structure of matrices has been exploited to predict the optimal mapping of processes and data in distributed-memory environments.
Abstract: Smart cities and other cyber-physical systems (CPSs) rely on various scientific, engineering, business, and social applications that provide timely intelligence for their design, operations, and management. Many of these scientific and analytics applications require the solution of sparse linear equation systems, where sparse matrix-vector (SpMV) product is a key computing operation. Several factors determine the performance of parallel SpMV computations, including matrix characteristics, storage formats, and the rising complexity and heterogeneity of computer systems. There is a pressing need for new ways of exploiting parallelism, and mapping data and applications to the computing resources. We propose here ZAKI+, a data-driven machine-learning approach, allowing users to automatically, effortlessly, and speedily obtain the best configuration (the data distribution, the optimal number of processes, and mapping strategy) and performance for the execution of the parallel SpMV computations on distributed memory machines. We train and test the tool using three machine learning methods (decision trees, random forest, and Xtreme boosting) and nearly 2000 real-world matrices obtained from 45 application domains, including computer vision and robotics. ZAKI+ provides optimal process mapping and outperforms the MPI default mapping policy by a factor of 4.24. This is the first work where the sparsity structure of matrices has been exploited to predict the optimal mapping of processes and data in distributed-memory environments by using different base and ensemble machine learning methods. Various CPSs comprise compute-intensive machine learning applications, such as the SpMV, and hence, the process and data mapping contributions of this paper would be of paramount impact for the CPSs.
Journal Article•10.1016/J.JPDC.2019.02.002•
Modeling the asynchronous Jacobi method without communication delays

Jordi Wolfson-Pou1, Edmond Chow1•
Georgia Institute of Technology1
01 Jun 2019-Journal of Parallel and Distributed Computing
TL;DR: It is shown that simplified asynchronous Jacobi can continue to reduce the residual when some processes are slower than other processes, and can converge when synchronous Jacobi does not.
Journal Article•10.1186/S13104-019-4256-6•
An OpenMP-based tool for finding longest common subsequence in bioinformatics

Rayhan Shikder1, Parimala Thulasiraman1, Pourang Irani1, Pingzhao Hu1•
University of Manitoba1
11 Apr 2019-BMC Research Notes
TL;DR: Three novel parallel versions of the LCS algorithm are developed that can provide significant advantages in terms of execution times, monetary cost, and pervasiveness in finding LCS of DNA sequences in an environment where Graphics Processing Units are not available.
Abstract: Finding the longest common subsequence (LCS) among sequences is NP-hard. This is an important problem in bioinformatics for DNA sequence alignment and pattern discovery. In this research, we propose new CPU-based parallel implementations that can provide significant advantages in terms of execution times, monetary cost, and pervasiveness in finding LCS of DNA sequences in an environment where Graphics Processing Units are not available. For general purpose use, we also make the OpenMP-based tool publicly available to end users. In this study, we develop three novel parallel versions of the LCS algorithm on: (i) a distributed memory machine using message passing interface (MPI); (ii) a shared memory machine using OpenMP; and (iii) a hybrid platform that utilizes both distributed and shared memory using MPI-OpenMP. The experimental results with both simulated and real DNA sequence data show that the shared memory OpenMP implementation provides at least a twofold absolute speedup over the best sequential version of the algorithm and a relative speedup of almost 7. We provide a detailed comparison of the execution times among the implementations on different platforms with different versions of the algorithm. We also show that removing branch conditions negatively affects the performance of the CPU-based parallel algorithm on the OpenMP platform.
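As background, the sequential baseline for the two-sequence case is a dynamic program over a table of prefix pairs. The sketch below is the textbook row-by-row formulation, not the authors' tool or any of their three parallel versions; the function name `lcs_length` and the sample strings are illustrative assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                # Matching symbols extend the LCS of the two shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one symbol from either sequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


length = lcs_length("AGGTAB", "GXTXAYB")  # common subsequence "GTAB"
```

In this formulation each cell depends on its left neighbour in the same row, which is the dependency a parallel version has to work around, for example by processing anti-diagonals or by reformulating the recurrence.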
Proceedings Article•10.1109/HPEC.2019.8916299•
Scaling and Quality of Modularity Optimization Methods for Graph Clustering

Sayan Ghosh1, Mahantesh Halappanavar1, Antonino Tumeo1, Ananth Kalyanaraman2•
Pacific Northwest National Laboratory1, Washington State University2
1 Sep 2019
TL;DR: An edge-balanced graph distribution is introduced for the distributed-memory community detection implementation of the popular Louvain method, that significantly reduces communication and offers up to 80% improvement in the overall execution time.
Abstract: Real-world graphs exhibit structures known as “communities” or “clusters” consisting of a group of vertices with relatively high connectivity between them, as compared to the rest of the vertices in the network. Graph clustering or community detection is a fundamental graph operation used to analyze real-world graphs occurring in the areas of computational biology, cybersecurity, electrical grids, etc. Similar to other graph algorithms, owing to irregular memory accesses and inherently sequential nature, current algorithms for community detection are challenging to parallelize. However, in order to analyze large networks, it is important to develop scalable parallel implementations of graph clustering that are capable of exploiting the architectural features of modern supercomputers. In response to the 2019 Streaming Graph Challenge, we present quality and performance analysis of our distributed-memory community detection using Vite, which is our distributed memory implementation of the popular Louvain method, on the ALCF Theta supercomputer. Clustering methods such as Louvain that rely on modularity maximization are known to suffer from the resolution limit problem, preventing identification of clusters of certain sizes. Hence, we also include quality analysis of our shared-memory implementation of the Fast-tracking Resistance method, in comparison with Louvain on the challenge datasets. Furthermore, we introduce an edge-balanced graph distribution for our distributed memory implementation, that significantly reduces communication, offering up to 80% improvement in the overall execution time. In addition to performance/quality analysis, we also include details on the power/energy consumption, and memory traffic of the distributed-memory clustering implementation using real-world graphs with over a billion edges.
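Modularity, the quantity that Louvain-style methods try to maximize, can be computed directly from an edge list and a community assignment. The sketch below evaluates the standard definition for an unweighted, undirected graph without self-loops; it is not taken from Vite, and the function name and the two-triangle example graph are illustrative assumptions.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q = sum over communities of (L_c / m) - (d_c / (2 m))^2.

    edges: list of undirected edges (u, v) with no self-loops or duplicates.
    community: dict mapping each vertex to its community label.
    L_c is the number of edges inside community c, d_c the sum of its degrees.
    """
    m = len(edges)
    intra = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_cluster = {v: "all" for v in range(6)}

q_split = modularity(edges, two_clusters)
q_merged = modularity(edges, one_cluster)
```

Splitting the graph at the bridge scores higher than lumping all vertices together, which is the kind of comparison a modularity-maximizing method makes at each move.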
Proceedings Article•10.1109/EMPDP.2019.8671639•
Strategies for Parallel Execution of Cellular Automata in Distributed Memory Architectures

Andrea Giordano1, Alessio De Rango2, Donato D'Ambrosio2, Rocco Rongo2, W. Spataro1 •
Indian Council of Agricultural Research1, University of Calabria2
1 Feb 2019
TL;DR: A preliminary study on different parallelization techniques for structured grid models such as cellular automata on distributed memory architectures and the performance achieved when a real-life application, namely the SciddicaT cellular automata model for debris-flows simulation, is adopted.
Abstract: The Cellular Automata paradigm is an efficient tool to model and study complex systems such as traffic simulation, lava flows and swarm based behaviour. In addition, cellular automata can be profitably used in many mathematical problems such as differential equations and chaos theory. Due to their inherent parallel nature, cellular automata can be efficiently parallelized among a set of computing nodes in order to scale and speed up their execution. This paper presents a preliminary study on different parallelization techniques for structured grid models such as cellular automata on distributed memory architectures. In particular, three strategies are presented and compared in order to evaluate their efficiency in terms of speedup. An experimental section shows the performance achieved by the three strategies when a real-life application, namely the SciddicaT cellular automata model for debris-flows simulation, is adopted.
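A common way to distribute a structured grid is to give each node a contiguous chunk plus a border of "halo" cells copied from its neighbours. The sketch below simulates that idea in a single process for a one-dimensional toy automaton (rule 90 on a ring) and checks it against a serial update. It is not one of the paper's three strategies and not SciddicaT; the function names, the rule, and the grid size are illustrative assumptions.

```python
def step_serial(cells, rule):
    """One update of a 1D cellular automaton on a ring."""
    n = len(cells)
    return [rule(cells[(i - 1) % n], cells[i], cells[(i + 1) % n]) for i in range(n)]


def step_decomposed(cells, rule, n_parts):
    """Same update, computed chunk by chunk as separate nodes would."""
    n = len(cells)
    bounds = [(p * n // n_parts, (p + 1) * n // n_parts) for p in range(n_parts)]
    result = []
    for lo, hi in bounds:
        # Simulated halo exchange: one boundary cell from each neighbouring chunk.
        local = [cells[(lo - 1) % n]] + cells[lo:hi] + [cells[hi % n]]
        # Update only the cells this chunk owns; halo cells are read-only.
        result.extend(
            rule(local[i - 1], local[i], local[i + 1])
            for i in range(1, len(local) - 1)
        )
    return result


def rule90(left, centre, right):
    """Rule 90: the new state is the XOR of the two neighbours."""
    return left ^ right


cells = [0] * 16
cells[8] = 1
serial, split = cells, cells
for _ in range(5):
    serial = step_serial(serial, rule90)
    split = step_decomposed(split, rule90, n_parts=4)
```

On a real distributed memory machine the halo copy would be a message between neighbouring nodes at every step, so the ratio of boundary cells to interior cells is one factor in the achievable speedup.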
Journal Article•10.1016/J.CMA.2019.04.032•
Efficient parallelization for volume-coupled multiphysics simulations on hierarchical Cartesian grids

Michael Schlottke-Lakemper1, Ansgar Niemöller1, Matthias Meinke1, Wolfgang Schröder1•
RWTH Aachen University1
01 Aug 2019-Computer Methods in Applied Mechanics and Engineering
TL;DR: The design of the coupling algorithm is shown to be more efficient than a standard hybrid flow-aeroacoustics scheme coupled via disk I/O and to predict the acoustic field of a turbulent jet in a direct-hybrid simulation.
Posted Content•
Distributed Iterative CT Reconstruction using Multi-Agent Consensus Equilibrium

Venkatesh Sridhar1, Xiao Wang1, Gregery T. Buzzard1, Charles A. Bouman2•
Purdue University1, Harvard University2
21 Nov 2019-arXiv: Image and Video Processing
TL;DR: This article proposes a multi-agent consensus equilibrium (MACE) algorithm for distributing both the computation and memory of MBIR reconstruction across a large number of parallel nodes, and obtains a parallel solution to the serial framework of Plug-n-play (PnP) priors, which is called MACE-PnP.
Abstract: Model-Based Image Reconstruction (MBIR) methods significantly enhance the quality of computed tomographic (CT) reconstructions relative to analytical techniques, but are limited by high computational cost. In this paper, we propose a multi-agent consensus equilibrium (MACE) algorithm for distributing both the computation and memory of MBIR reconstruction across a large number of parallel nodes. In MACE, each node stores only a sparse subset of views and a small portion of the system matrix, and each parallel node performs a local sparse-view reconstruction, which, based on repeated feedback from other nodes, converges to the global optimum. Our distributed approach can also incorporate advanced denoisers as priors to enhance reconstruction quality. In this case, we obtain a parallel solution to the serial framework of Plug-n-play (PnP) priors, which we call MACE-PnP. In order to make MACE practical, we introduce a partial update method that eliminates nested iterations and prove that it converges to the same global solution. Finally, we validate our approach on a distributed memory system with real CT data. We also demonstrate an implementation of our approach on a massive supercomputer that can perform large-scale reconstruction in real-time.
Journal Article•10.1016/J.INS.2019.01.050•
Parallel feature selection for distributed-memory clusters

Jorge González-Domínguez, Verónica Bolón-Canedo, Borja Freire, Juan Touriño
01 Sep 2019-Information Sciences
TL;DR: This work presents fast-mRMR-MPI, a novel hybrid parallel implementation that uses MPI and OpenMP to accelerate feature selection on distributed-memory clusters to improve both scalability and efficiency.
Journal Article•10.1016/J.JPDC.2018.11.009•
Large-scale parallel similarity search with Product Quantization for online multimedia services

Guilherme Andrade1, André Fernandes2, Jeremias M. Gomes2, Renato Ferreira1, George Teodoro2 •
Universidade Federal de Minas Gerais1, University of Brasília2
01 Mar 2019-Journal of Parallel and Distributed Computing
TL;DR: This paper addresses the challenges of similarity search in high-dimensional spaces with an efficient parallelization of the Product Quantization Approximate Nearest Neighbor Search (PQANNS) indexing and proposes mechanisms to minimize the query response times in online scenarios in which the query rates vary at run-time.
Proceedings Article•10.1109/ISPA-BDCLOUD-SUSTAINCOM-SOCIALCOM48970.2019.00074•
A Variable Batch Size Strategy for Large Scale Distributed DNN Training

Hu Zhongzhe1, Junmin Xiao1, Tian Zhongbo, Xiaoyang Zhang1, Zhu Hongrui1, Yao Chengji, Ninghui Sun1, Guangming Tan1 •
Chinese Academy of Sciences1
1 Dec 2019
TL;DR: An in-depth analysis of the different stages in the recent multi-step schedule finds that the training process at the first stage is sensitive to the batchsize, while different batchsize do not significantly impact the latter stages, and a novel variable batchsize strategy is proposed.
Abstract: Large batch distributed synchronous stochastic gradient descent (SGD) has been widely used to train deep neural networks on a distributed memory system with multiple nodes, which can leverage parallel resources to reduce the number of iterative steps and speed up the convergence of training process. However, the large-batch SGD leads to poor test accuracy, which would counteract the benefits of large scale parallel SGD. Existing solutions for large-batch training either significantly degrade accuracy or require massive additional hyper-parameter tuning. To overcome the difficulty above, we propose a novel variable batchsize strategy. With an in-depth analysis of the different stages in the recent multi-step schedule, we find that the training process at the first stage is sensitive to the batchsize, while different batchsize do not significantly impact the latter stages. Based on this discovery, we first claim that different stages of training should use different batchsize. Hence, the variable batchsize strategy is proposed for the large scale distributed training. Furthermore, in order to tune existing hyper-parameters automatically, an auto-tuning engine is designed for the variable batchsize strategy to achieve higher testing accuracy in the extremely large batchsize cases. By using our strategy, we successfully scale the batchsize to 120K in latter stages on ImageNet-1K with ResNet50 without accuracy loss and 128K with slight accuracy loss. The experimental evaluation on 2048 GPUs shows that the variable batchsize strategy with our auto-tuning engine could complete the training of ResNet-50 in 25 minutes. Furthermore, the new strategy successfully decreases the number of parameter updates by about 1.7 times compared with Facebook's multi-step schedule.
Journal Article•10.1145/3357596•
GraVF-M: Graph Processing System Generation for Multi-FPGA Platforms

Nina Engelhardt1, Hayden K.-H. So1•
University of Hong Kong1
14 Oct 2019-arXiv: Distributed, Parallel, and Cluster Computing
TL;DR: GraVF-M is presented, a framework designed to ease the implementation of FPGA-based graph processing accelerators for multi-FPGA platforms with distributed memory that compares favorably to state-of-the-art frameworks from the literature and reaches 94% of the projected performance limit of the system.
Abstract: Due to the irregular nature of connections in most graph datasets, partitioning graph analysis algorithms across multiple computational nodes that do not share a common memory inevitably leads to large amounts of interconnect traffic. Previous research has shown that FPGAs can outcompete software-based graph processing in shared memory contexts, but it remains an open question if this advantage can be maintained in distributed systems. In this work, we present GraVF-M, a framework designed to ease the implementation of FPGA-based graph processing accelerators for multi-FPGA platforms with distributed memory. Based on a lightweight description of the algorithm kernel, the framework automatically generates optimized RTL code for the whole multi-FPGA design. We exploit an aspect of the programming model to present a familiar message-passing paradigm to the user, while under the hood implementing a more efficient architecture that can reduce the necessary inter-FPGA network traffic by a factor equal to the average degree of the input graph. A performance model based on a theoretical analysis of the factors influencing performance serves to evaluate the efficiency of our implementation. With a throughput of up to 5.8 GTEPS (billions of traversed edges per second) on a 4-FPGA system, the designs generated by GraVF-M compare favorably to state-of-the-art frameworks from the literature and reach 94% of the projected performance limit of the system.
Posted Content•
In-memory Distributed Spatial Query Processing and Optimization

Mingjie Tang, Yongyang Yu, Walid G. Aref, Ahmed R. Mahmood, Qutaibah M. Malluhi, Mourad Ouzzani 
8 Jul 2019
TL;DR: In this article, the authors propose a distributed spatial query scheduler that uses a new cost model to optimize the cost of spatial query processing in an in-memory and distributed setup to address scalability.
Abstract: Due to the ubiquity of spatial data applications and the large amounts of spatial data that these applications generate and process, there is a pressing need for scalable spatial query processing. In this paper, we present new techniques for spatial query processing and optimization in an in-memory and distributed setup to address scalability. More specifically, we introduce new techniques for handling query skew, which is common in practice, and optimize communication costs accordingly. We propose a distributed query scheduler that uses a new cost model to optimize the cost of spatial query processing. The scheduler generates query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on bitmap filters to forward queries to the appropriate local nodes. Each local computation node is responsible for optimizing and selecting its best local query execution plan based on the indexes and the nature of the spatial queries in that node. All the proposed spatial query processing and optimization techniques are prototyped inside Spark, a distributed memory-based computation system. The experimental study is based on real datasets and demonstrates that distributed spatial query processing can be enhanced by up to an order of magnitude over existing in-memory and distributed spatial systems.
Patent•
Executing distributed memory operations using processing elements connected by distributed channels

Bushra Ahsan1, Michael Adler1, Neal Crago1, Joel Emer1, Aamer Jaleel1, Angshuman Parashar1, Michael Pellauer1 •
Intel1
17 Jun 2019
TL;DR: A method for implementing distributed memory operations that obtains distributed channel information for an algorithm to be executed by a plurality of spatially distributed processing elements.
Abstract: A technology for implementing a method for distributed memory operations. A method of the disclosure includes obtaining distributed channel information for an algorithm to be executed by a plurality of spatially distributed processing elements. For each distributed channel in the distributed channel information, the method further associates one or more of the plurality of spatially distributed processing elements with the distributed channel based on the algorithm.