Scispace (Formerly Typeset)
Showing papers presented at "Parallel Computing in 2021"
Journal Article•10.1016/J.PARCO.2021.102751•
Parallel and scalable Dunn Index for the validation of big data clusters

Chiheb-Eddine Ben N'cir1, Abdallah Hamza1, Waad Bouaguel1•
Tunis University1
1 May 2021
TL;DR: Proposes S-DI (Scalable Dunn Index), a parallel and scalable model for computing the Dunn Index measure for internal validation of clustering results; experiments show good scalability and reliable validation compared to other existing measures when handling large-scale data.
Abstract: Parallelizing data clustering algorithms has attracted the interest of many researchers over the past few years. Many efficient parallel algorithms were proposed to build partitioning over a huge volume of data. The effectiveness of these algorithms is attributed to the distribution of data among a cluster of nodes and to the parallel computation models. Despite the effectiveness of parallel models in dealing with increasing volumes of data, little work has been done on the validation of big clusters. To deal with this issue, we propose a parallel and scalable model, referred to as S-DI (Scalable Dunn Index), to compute the Dunn Index measure for an internal validation of clustering results. Rather than computing the Dunn Index on a single machine in the clustering validation process, the newly proposed measure is computed by distributing the partitioning among a cluster of nodes using a customized parallel model under the Apache Spark framework. The proposed S-DI is also enhanced by a Sketch and Validate sampling technique which aims to approximate the Dunn Index value by using a small representative data sample. Different experiments on simulated and real datasets showed good scalability of our proposed measure and reliable validation compared to other existing measures when handling large-scale data.
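For readers unfamiliar with the measure, the following is a minimal single-machine sketch of the classical Dunn Index (smallest inter-cluster distance divided by largest intra-cluster diameter), plus a naive sampled variant. It is not the authors' S-DI or their Spark model; the function names `dunn_index` and `sampled_dunn_index`, the Euclidean metric, and the uniform per-cluster sampling are assumptions made here for illustration only.

```python
import random
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Classical Dunn Index for a list of clusters (each a list of point tuples)."""
    # Largest distance between two points of the same cluster (max diameter).
    max_diameter = max(
        (dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    if max_diameter == 0.0:
        raise ValueError("all clusters are single points; index is undefined")
    # Smallest distance between two points of different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    return min_separation / max_diameter


def sampled_dunn_index(clusters, fraction, seed=0):
    """Approximate the index on a uniform sample of each cluster (illustrative only)."""
    rng = random.Random(seed)
    sample = [
        rng.sample(c, min(len(c), max(2, int(len(c) * fraction))))
        for c in clusters
    ]
    return dunn_index(sample)


clusters = [[(0, 0), (0, 1)], [(3, 0), (3, 1)]]
print(dunn_index(clusters))
```

Because the numerator is a minimum and the denominator a maximum, partial minima and maxima computed on separate data partitions can be combined with a simple reduction, which is one reason the measure lends itself to distribution.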

38 citations

Journal Article•10.1016/J.PARCO.2021.102833•
Porting WarpX to GPU-accelerated platforms

Andrew Myers1, Ann S. Almgren1, Ligia Diana Amorim1, John B. Bell1, Luca Fedeli2, Lixin Ge1, Lixin Ge3, Kevin Gott1, D.P. Grote4, Mark Hogan3, Axel Huebl1, Revathi Jambunathan1, Remi Lehe1, C. Ng3, Michael Rowan1, Olga V. Shapoval1, Maxence Thévenet, Jean-Luc Vay1, Henri Vincenti2, E. Yang1, N. Zaïm2, Weiqun Zhang1, Yinjian Zhao1, Edoardo Zoni1 •
Lawrence Berkeley National Laboratory1, Université Paris-Saclay2, SLAC National Accelerator Laboratory3, Lawrence Livermore National Laboratory4
1 Dec 2021
TL;DR: WarpX is a general-purpose electromagnetic particle-in-cell code originally designed to run on many-core CPU architectures; this paper describes the AMReX-based strategy that allows it to use the GPU-accelerated nodes on the Summit supercomputer.
Abstract: WarpX is a general purpose electromagnetic particle-in-cell code that was originally designed to run on many-core CPU architectures. We describe the strategy, based on the AMReX library, followed to allow WarpX to use the GPU-accelerated nodes on OLCF's Summit supercomputer, a strategy we believe will extend to the upcoming machines Frontier and Aurora. We summarize the challenges encountered and lessons learned, and give current performance results on a series of relevant benchmark problems.

36 citations

Journal Article•10.1016/J.PARCO.2021.102828•
A novel hybrid heuristic-based list scheduling algorithm in heterogeneous cloud computing environment for makespan optimization

Mirsaeid Hosseini Shirvani1, Reza Noorian Talouki1•
Islamic Azad University1
1 Dec 2021
TL;DR: A novel hybrid heuristic-based list scheduling (HH-LiSch) algorithm is presented for solving the dependent task scheduling in HCC systems in a bounded number of the fully connected virtual machines (VMs).
Abstract: Efficient workflow scheduling can potentially exploit the heterogeneity of resources in a heterogeneous cloud computing (HCC) platform, commensurate with the variable requirements of dependent tasks in a given workflow. Minimizing the total scheduling length, the makespan, is essential for application performance in heterogeneous computing systems, especially in cloud computing environments. The problem of scheduling a set of different dependent tasks onto a set of heterogeneous computational resources is a well-known NP-Hard problem. Therefore, no polynomial-time scheduling algorithm for computing the optimal solution is known. Many algorithms have been proposed for approximating a solution to this problem, but the majority of them have low efficiency. In this paper, a novel hybrid heuristic-based list scheduling (HH-LiSch) algorithm is presented for solving the dependent task scheduling in HCC systems in a bounded number of the fully connected virtual machines (VMs). The novelty of the current paper is to present a new task priority strategy, find an appropriate VM slot time, and utilize a task duplication technique. Two novel task priority strategies are applied to prioritize tasks in an efficient ordered list. Then, during the scheduling process, an insertion-based procedure is called to find an appropriate potential slot time for performing the task duplication technique. If it works, the task duplication is added to the rudimentary scheduling scheme. In this way, the final schedule is gradually generated. To validate the work, the experiments are based on six real-world scientific workflows and a random task graph (RTG); the performance is then evaluated in terms of makespan, Schedule Length Ratio (SLR), speedup, and efficiency. The simulation results show a significant improvement over other counterparts in the literature.

36 citations

Journal Article•10.1016/J.PARCO.2021.102831•
Toward performance-portable PETSc for GPU-based exascale systems

Richard T. Mills1, Mark F. Adams2, Satish Balay1, Jed Brown3, Alp Dener1, Matthew G. Knepley4, Scott Kruger, Hannah Morgan5, Todd Munson1, Karl Rupp1, Karl Rupp6, Barry Smith, Stefano Zampini7, Hong Zhang1, Junchao Zhang1 •
Argonne National Laboratory1, Lawrence Berkeley National Laboratory2, University of Colorado Boulder3, University at Buffalo4, University of Chicago5, Vienna University of Technology6, King Abdullah University of Science and Technology7
1 Dec 2021
TL;DR: The Portable Extensible Toolkit for Scientific Computation (PETSc) provides scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization.
Abstract: The Portable Extensible Toolkit for Scientific computation (PETSc) library delivers scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization. The PETSc design for performance portability addresses fundamental GPU accelerator challenges and stresses flexibility and extensibility by separating the programming model used by the application from that used by the library, and it enables application developers to use their preferred programming model, such as Kokkos, RAJA, SYCL, HIP, CUDA, or OpenCL, on upcoming exascale systems. A blueprint for using GPUs from PETSc-based codes is provided, and case studies emphasize the flexibility and high performance achieved on current GPU-based systems.

30 citations

Journal Article•10.1016/J.PARCO.2021.102827•
Implementation and evaluation of MPI 4.0 partitioned communication libraries

Matthew G. F. Dosanjh1, Andrew Worley2, Derek Schafer3, Prema Soundararajan4, Sheikh K. Ghafoor2, Anthony Skjellum3, Purushotham V. Bangalore5, Ryan E. Grant6, Ryan E. Grant1 •
Sandia National Laboratories1, Tennessee Technological University2, University of Tennessee at Chattanooga3, University of Alabama at Birmingham4, University of Alabama5, Queen's University6
1 Dec 2021
TL;DR: Presents the design and implementation of a layered library for MPI 4.0 partitioned communication, which provides an opportunity to explore potential optimizations and identify further enhancements to the APIs, and compares it to recently available prototype libraries that have been updated to the standard-compliant interface.
Abstract: Partitioned point-to-point communication primitives provide a performance-oriented mechanism to support a hybrid parallel programming model and have been included in the upcoming MPI-4.0 standard. These primitives enable an MPI library to transfer parts of the data buffer while the application provides partial contributions using multiple threads or tasks or simply pipelines the buffers sequentially. The focus of this paper is the design and implementation of a layered library that provides the functionality of these newer APIs and supports application development using them. This library provides an opportunity to explore potential optimizations and identify further enhancements to the APIs. Initial experience in designing this library is presented along with preliminary performance results. In addition, the library is compared to recently available initial prototype libraries that have been updated to the standard-compliant interface. These prototype libraries were built on remote-memory-access (RMA) primitives, offering insight into different implementation strategies. In general, we observe an interesting trade-off space, with the RMA-based implementation proving more performant for send-side partitioning, with increases in perceived bandwidth of 8.9x on average over a single send, compared to the persistent-based implementation, which shows improvements of 4.0x on average. In comparing the two implementations, we find that the persistent-based implementation enables more overlap for receive-side partitioning, up to 5.37x the RMA library’s overlap, while the RMA-based implementation provides better send-side performance of up to 70%.

22 citations

Journal Article•10.1016/J.PARCO.2021.102841•
GPU algorithms for Efficient Exascale Discretizations

Ahmad Abdelfattah1, Valeria Barra2, Natalie Beams1, Ryan Bleile3, Jed Brown2, Jean-Sylvain Camier3, Robert Carson3, Noel Chalmers4, Veselin Dobrev3, Yohann Dudouit3, Paul Fischer5, Paul Fischer6, Ali Karakus7, Stefan Kerkemeier5, Tzanio V. Kolev3, Yu-Hsiang Lan5, Elia Merzari5, Elia Merzari8, Misun Min5, Malachi Phillips6, Thilina Rathnayake6, R. Rieben3, Thomas Stitt3, Ananias G. Tomboulides9, Ananias G. Tomboulides5, Stanimire Tomov1, Vladimir Tomov3, Arturo Vargas3, Tim Warburton10, Kenneth Weiss3 •
University of Tennessee1, University of Colorado Boulder2, Lawrence Livermore National Laboratory3, Advanced Micro Devices4, Argonne National Laboratory5, University of Illinois at Urbana–Champaign6, Middle East Technical University7, Pennsylvania State University8, Aristotle University of Thessaloniki9, Virginia Tech10
1 Dec 2021
TL;DR: In this article, the authors describe the research and development activities in the Center for Efficient Exascale Discretizations (CEED) within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms.
Abstract: In this paper we describe the research and development activities in the Center for Efficient Exascale Discretizations (CEED) within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss the GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek projects. We report performance and capability improvements in several CEED-enabled applications on both NVIDIA and AMD GPU systems.

22 citations

Journal Article•10.1016/J.PARCO.2021.102840•
Porting hypre to heterogeneous computer architectures: Strategies and experiences

Robert D. Falgout1, Ruipeng Li1, Björn Sjögreen1, Lu Wang, Ulrike Meier Yang1 •
Lawrence Livermore National Laboratory1
1 Dec 2021
TL;DR: This work discusses the experiences and strategies to port hypre to heterogeneous computers with accelerators, including the design of a new memory model, the use of abstractions, the BoxLoop macros in the structured and semi-structured interfaces, and the restructuring of algebraic multigrid (AMG) into modular components.
Abstract: Linear systems occur in many applications, and solving them can take a large share of the total simulation time. The high-performance library hypre provides a variety of interfaces and linear solvers, including various multigrid methods, that have achieved good scalability on a variety of homogeneous parallel computer architectures. Heterogeneous architectures with nodes that have both CPUs and accelerators pose new challenges, since they require more fine-grained parallelism and reduced data movement between different memories on a single node as well as across nodes. We discuss our experiences and strategies to port hypre to heterogeneous computers with accelerators, including the design of a new memory model, the use of abstractions, the BoxLoop macros in the structured and semi-structured interfaces, and the restructuring of algebraic multigrid (AMG) into modular components. We present numerical experiments comparing CPU and GPU performance for several test problems.

21 citations

Journal Article•10.1016/J.PARCO.2020.102698•
Multiscale modeling and cinematic visualization of photosynthetic energy conversion processes from electronic to cell scales.

Melih Sener1, Stuart Levy1, John E. Stone1, AJ Christensen1, Barry Isralewitz1, Robert Patterson1, Kalina Borkiewicz1, Jeff Carpenter1, C. Neil Hunter2, Zaida Luthey-Schulten1, Donna Cox1 •
University of Illinois at Urbana–Champaign1, University of Sheffield2
1 May 2021
TL;DR: This accessible visual narrative shows a lay audience how the energy of sunlight is captured, converted, and stored through a chain of proteins to power living cells, in a modern retelling of one of humanity's earliest stories: the interplay between light and life.
Abstract: Conversion of sunlight into chemical energy, namely photosynthesis, is the primary energy source of life on Earth. A visualization depicting this process, based on multiscale computational models from electronic to cell scales, is presented in the form of an excerpt from the fulldome show Birth of Planet Earth. This accessible visual narrative shows a lay audience, including children, how the energy of sunlight is captured, converted, and stored through a chain of proteins to power living cells. The visualization is the result of a multi-year collaboration among biophysicists, visualization scientists, and artists, which, in turn, is based on a decade-long experimental-computational collaboration on structural and functional modeling that produced an atomic detail description of a bacterial bioenergetic organelle, the chromatophore. Software advancements necessitated by this project have led to significant performance and feature advances, including hardware-accelerated cinematic ray tracing and instanced visualizations for efficient cell-scale modeling. The energy conversion steps depicted feature an integration of function from electronic to cell levels, spanning nearly 12 orders of magnitude in time scales. This atomic detail description uniquely enables a modern retelling of one of humanity’s earliest stories—the interplay between light and life.

20 citations

Journal Article•10.1016/J.PARCO.2021.102836•
Enabling GPU accelerated computing in the SUNDIALS time integration library

Cody J. Balos1, David J. Gardner1, Carol S. Woodward1, Daniel R. Reynolds2•
Lawrence Livermore National Laboratory1, Southern Methodist University2
1 Dec 2021
TL;DR: The authors discuss the design considerations, both internal and external, behind new GPU-enabled features in SUNDIALS, present the features themselves, and report performance results for several of them on the Summit supercomputer and early access hardware for the Frontier supercomputer.
Abstract: As part of the Exascale Computing Project (ECP), a recent focus of development efforts for the SUite of Nonlinear and DIfferential/ALgebraic equation Solvers (SUNDIALS) has been to enable GPU-accelerated time integration in scientific applications at extreme scales. This effort has resulted in several new GPU-enabled implementations of core SUNDIALS data structures, support for programming paradigms which are aware of the heterogeneous architectures, and the introduction of utilities to provide new points of flexibility. In this paper, we discuss our considerations, both internal and external, when designing these new features and present the features themselves. We also present performance results for several of the features on the Summit supercomputer and early access hardware for the Frontier supercomputer, which demonstrate negligible performance overhead resulting from the additional infrastructure and significant speedups when using both NVIDIA and AMD GPUs.

19 citations

Journal Article•10.1016/J.PARCO.2021.102858•
On revisiting energy and performance in microservices applications: A cloud elasticity-driven approach

Igor Fontana De Nardin1, Rodrigo da Rosa Righi1, Thiago Roberto Lima Lopes1, Cristiano André da Costa1, Heon Y. Yeom2, Harald Köstler3 •
Universidade do Vale do Rio dos Sinos1, Seoul National University2, University of Erlangen-Nuremberg3
1 Dec 2021

18 citations

Journal Article•10.1016/J.PARCO.2021.102793•
Callback-based completion notification using MPI Continuations

Joseph Schuchart1, Philipp Samfass2, Christoph Niethammer, José Gracia, George Bosilca1 •
University of Tennessee1, Technische Universität München2
1 Sep 2021
TL;DR: An extension to the MPI standard providing operation completion notifications using callbacks, so-called MPI Continuations, is presented and it is shown that the interface enables low-latency, high-throughput completion notifications that outperform solutions implemented in the application space.
Abstract: Asynchronous programming models (APM) are gaining more and more traction, allowing applications to expose the available concurrency to a runtime system tasked with coordinating the execution. While MPI has long provided support for multi-threaded communication and non-blocking operations, it falls short of adequately supporting APMs as correctly and efficiently handling MPI communication in different models is still a challenge. We have previously proposed an extension to the MPI standard providing operation completion notifications using callbacks, so-called MPI Continuations. This interface is flexible enough to accommodate a wide range of different APMs. In this paper, we present an extension to the previously described interface that allows for finer control of the behavior of the MPI Continuations interface. We then present some of our first experiences in using the interface in the context of different applications, including the NAS parallel benchmarks, the PaRSEC task-based runtime system, and a load-balancing scheme within an adaptive mesh refinement solver called ExaHyPE. We show that the interface, implemented inside Open MPI, enables low-latency, high-throughput completion notifications that outperform solutions implemented in the application space.
Journal Article•10.1145/3460872•
HashGraph—Scalable Hash Tables Using a Sparse Graph Data Structure

Oded Green1•
Georgia Institute of Technology1
15 Jul 2021
TL;DR: HashGraph is a scalable approach for building hash tables that uses concepts taken from sparse graph representations and can deal with a large number of hash values per entry without loss of performance.
Abstract: In this article, we introduce HashGraph, a new scalable approach for building hash tables that uses concepts taken from sparse graph representations, hence the name HashGraph. HashGraph introduces a new way to deal with hash collisions that does not use “open-addressing” or “separate-chaining,” yet it has the benefits of both these approaches. HashGraph currently works for static inputs. Recent progress with dynamic graph data structures suggests that HashGraph might be extendable to dynamic inputs as well. We show that HashGraph can deal with a large number of hash values per entry without loss of performance. Last, we show a new querying algorithm for value lookups. We experimentally compare HashGraph to several state-of-the-art implementations and find that it outperforms them by 2× on average when the inputs are unique and by as much as 40× when the input contains duplicates. The implementation of HashGraph in this article is for NVIDIA GPUs. HashGraph can build a hash table at a rate of 2.5 billion keys per second on an NVIDIA GV100 GPU and can query at nearly the same rate.
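The idea of borrowing a sparse graph layout for a hash table can be illustrated with a small sequential sketch: count keys per bucket, prefix-sum the counts into offsets, then scatter the keys into one contiguous array, much like building a compressed sparse row (CSR) structure. This is an assumption-laden illustration of that general idea, not the article's GPU implementation; `build_csr_table` and `count_matches` are hypothetical names, and Python's built-in `hash` stands in for whatever hash function a real implementation would use.

```python
def build_csr_table(keys, num_buckets):
    """Build a static CSR-style hash table: (offsets, table)."""
    # Pass 1: count how many keys land in each bucket.
    counts = [0] * num_buckets
    for key in keys:
        counts[hash(key) % num_buckets] += 1

    # Exclusive prefix sum turns counts into bucket start offsets.
    offsets = [0] * (num_buckets + 1)
    for b in range(num_buckets):
        offsets[b + 1] = offsets[b] + counts[b]

    # Pass 2: scatter each key into the next free slot of its bucket.
    cursor = offsets[:-1]  # slicing copies, so offsets stays intact
    table = [None] * len(keys)
    for key in keys:
        b = hash(key) % num_buckets
        table[cursor[b]] = key
        cursor[b] += 1
    return offsets, table


def count_matches(offsets, table, key):
    """Count occurrences of key by scanning only its bucket's slice."""
    b = hash(key) % (len(offsets) - 1)
    return sum(1 for k in table[offsets[b]:offsets[b + 1]] if k == key)


offsets, table = build_csr_table([5, 12, 5, 7, 20], num_buckets=4)
print(count_matches(offsets, table, 5))
```

Colliding and duplicate keys sit next to each other in one array, so no per-bucket linked lists or probing sequences are needed; the trade-off is that the table is built once for a static input.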
Journal Article•10.1016/J.PARCO.2020.102736•
A new scalable distributed k-means algorithm based on Cloud micro-services for High-performance computing

Fatéma Zahra Benchara1, Mohamed Youssfi1•
University of Hassan II Casablanca1
1 Apr 2021
TL;DR: Presents a new distributed k-means method that integrates a virtual parallel distributed computing model with a low communication cost mechanism and is expected to provide scalable HPC applications for big data clustering.
Abstract: The paper proposes a distributed clustering method for high-performance computing (HPC) models and its application to medical image processing. Communication cost is one of the great challenges that limit the scalability of parallel and distributed computing models. Indeed, it significantly reduces the performance of the HPC systems on which these models are implemented. In this paper, we present a new distributed k-means method which integrates a virtual parallel distributed computing model with a low communication cost mechanism. The k-means method is performed as a distributed service within a cooperative micro-services team which uses an asynchronous communication mechanism based on the AMQP protocol. We design and implement a parallel and distributed HPC application for MRI image segmentation intended to be deployed on the cloud. Experimental results show that the proposed method (DSCM) and its assigned model reach a high degree of scalability. We expect this clustering approach to provide scalable HPC applications for big data clustering.
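A common way to keep communication low in distributed k-means is for each worker to send only per-cluster sums and counts for its data chunk, rather than the points themselves. The sketch below illustrates that generic pattern and is not the paper's DSCM method: plain function calls stand in for the micro-services and AMQP messaging, and the names `partial_stats`, `merge_and_update`, and `distributed_kmeans` are hypothetical.

```python
from math import dist


def partial_stats(points, centroids):
    """Worker step: per-cluster coordinate sums and counts for one chunk."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        # Assign the point to its nearest centroid.
        j = min(range(k), key=lambda i: dist(p, centroids[i]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def merge_and_update(partials, centroids):
    """Coordinator step: combine the workers' statistics into new centroids."""
    k, dim = len(centroids), len(centroids[0])
    updated = []
    for j in range(k):
        total = sum(counts[j] for _, counts in partials)
        if total == 0:
            # An empty cluster keeps its previous centroid.
            updated.append(tuple(centroids[j]))
            continue
        updated.append(
            tuple(sum(sums[j][d] for sums, _ in partials) / total for d in range(dim))
        )
    return updated


def distributed_kmeans(chunks, centroids, iterations=10):
    """Run a fixed number of rounds; each chunk plays the role of one worker."""
    for _ in range(iterations):
        partials = [partial_stats(chunk, centroids) for chunk in chunks]
        centroids = merge_and_update(partials, centroids)
    return centroids


chunks = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]
print(distributed_kmeans(chunks, [(0.0, 0.0), (10.0, 10.0)]))
```

Each round exchanges only k sums and k counts per worker, independent of the chunk size, which is what makes the per-iteration message volume small.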
Journal Article•10.1016/J.PARCO.2021.102769•
Sphynx: A parallel multi-GPU graph partitioner for distributed-memory systems

Seher Acer1, Erik G. Boman1, Christian A. Glusa1, Sivasankaran Rajamanickam1•
Sandia National Laboratories1
1 Sep 2021
TL;DR: Sphynx provides a good and robust partitioning method across a wide range of graphs for applications looking for a GPU-based partitioner, and is faster and obtains better balance and better quality partitions.
Abstract: Graph partitioning has been an important tool to partition the work among several processors to minimize the communication cost and balance the workload. While accelerator-based supercomputers are emerging to be the standard, the use of graph partitioning becomes even more important as applications are rapidly moving to these architectures. However, there is no distributed-memory-parallel, multi-GPU graph partitioner available for applications. We developed a spectral graph partitioner, Sphynx, using the portable, accelerator-friendly stack of the Trilinos framework. In Sphynx, we allow using different preconditioners and exploit their unique advantages. We use Sphynx to systematically evaluate the various algorithmic choices in spectral partitioning with a focus on the GPU performance. We perform those evaluations on two distinct classes of graphs: regular (such as meshes, matrices from finite element methods) and irregular (such as social networks and web graphs), and show that different settings and preconditioners are needed for these graph classes. The experimental results on the Summit supercomputer show that Sphynx is the fastest alternative on irregular graphs in an application-friendly setting and obtains a partitioning quality close to ParMETIS on regular graphs. When compared to nvGRAPH on a single GPU, Sphynx is faster and obtains better balance and better quality partitions. Sphynx provides a good and robust partitioning method across a wide range of graphs for applications looking for a GPU-based partitioner.
Journal Article•10.1016/J.PARCO.2021.102837•
Measurement and analysis of GPU-accelerated applications with HPCToolkit

Keren Zhou1, Laksono Adhianto1, Jonathon M. Anderson1, Aaron Cherian1, Dejan Grubisic1, Mark W. Krentel1, Yumeng Liu1, Xiaozhu Meng1, John Mellor-Crummey1 •
Rice University1
1 Dec 2021
TL;DR: HPCToolkit uses PC sampling and instrumentation to measure and attribute GPU performance metrics to source lines, loops, and inlined code, and can collect call path traces to provide a view of how an execution evolves over time.
Abstract: To address the challenge of performance analysis on the US DOE’s forthcoming exascale supercomputers, Rice University has been extending its HPCToolkit performance tools to support measurement and analysis of GPU-accelerated applications. To help developers understand the performance of accelerated applications as a whole, HPCToolkit’s measurement and analysis tools attribute metrics to calling contexts that span both CPUs and GPUs. To measure GPU-accelerated applications efficiently, HPCToolkit employs a novel wait-free data structure to coordinate monitoring and attribution of GPU performance. To help developers understand the performance of complex GPU code generated from high-level programming models, HPCToolkit constructs sophisticated approximations of call path profiles for GPU computations. To support fine-grained analysis and tuning, HPCToolkit uses PC sampling and instrumentation to measure and attribute GPU performance metrics to source lines, loops, and inlined code. To supplement fine-grained measurements, HPCToolkit can measure GPU kernel executions using hardware performance counters. To provide a view of how an execution evolves over time, HPCToolkit can collect, analyze, and visualize call path traces within and across nodes. Finally, on NVIDIA GPUs, HPCToolkit can derive and attribute a collection of useful performance metrics based on measurements using GPU PC samples. We illustrate HPCToolkit’s new capabilities for analyzing GPU-accelerated applications with several codes developed as part of the Exascale Computing Project.
Journal Article•10.1016/J.PARCO.2021.102830•
Towards performance portability in the Spark astrophysical magnetohydrodynamics solver in the Flash-X simulation framework

Sean M. Couch1, Jared W. Carlson1, Michael A. Pajkos1, Brian W. O'Shea1, Anshu Dubey2, Tom Klosterman2 •
Michigan State University1, Argonne National Laboratory2
1 Dec 2021
TL;DR: Spark is a state-of-the-art magnetohydrodynamics solver for core-collapse supernova simulations that uses high-order spatial reconstruction, Runge-Kutta time integration, and an efficient cell-centered approach to satisfying the divergence-free condition for the magnetic fields.
Abstract: Simulations of core-collapse supernovae, and other astrophysical phenomena, are quintessential extreme-scale computing challenges. For core-collapse supernova simulations to be carried out by the ExaStar project under the Exascale Computing Project umbrella, a robust, efficient, and state-of-the-art magnetohydrodynamics solver is a critical requirement. In Flash-X, the primary software instrument for ExaStar, a new magnetohydrodynamics solver has been designed and implemented from the ground up to achieve accuracy and efficiency for simulations of complex astrophysical flows. This new solver, dubbed Spark, uses high-order spatial reconstruction, Runge–Kutta time integration, and an efficient cell-centered approach to satisfying the divergence-free condition for the magnetic fields. Spark was written to be optimized for data locality in the cache hierarchy of CPUs. Since data locality optimizations for cache hierarchy are not directly compatible with those of accelerators, we have taken the approach of using program synthesis to avoid the massive amounts of code replication that would be necessary if we were to maintain two different versions of the solver. Our program synthesis relies on a simple key-dictionary approach, implemented in Python, that enables us to assemble the version of the solver suitable for the target hardware from code fragments identified by specific keys. In this paper, we describe the data locality optimizations of the solver for CPUs and accelerators and the program synthesis tools that enable this portability. We also detail the parallel performance of Spark for both CPUs and accelerators.
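The key-dictionary idea described in the abstract can be pictured with a toy assembler: a template refers to code fragments by key, and a per-target dictionary supplies the text for each key. Everything in this sketch is hypothetical (the `@KEY` marker syntax, the fragment names, the Fortran-like fragment text, and the `assemble` function); it is not the Flash-X tooling, only a guess at the general shape of such an approach.

```python
import re

# Hypothetical fragment dictionaries, one per target; the keys are shared.
FRAGMENTS = {
    "cpu": {
        "loop_open": "do i = lo, hi",
        "loop_close": "end do",
    },
    "gpu": {
        "loop_open": "!$omp target teams distribute parallel do\ndo i = lo, hi",
        "loop_close": "end do",
    },
}

# Hypothetical template: lines of the form "@KEY name" are placeholders.
TEMPLATE = "@KEY loop_open\n   u(i) = u(i) + dt * rhs(i)\n@KEY loop_close\n"

KEY_LINE = re.compile(r"@KEY\s+(\w+)")


def assemble(template, target):
    """Replace every '@KEY name' line with the target's fragment for that key."""
    fragments = FRAGMENTS[target]
    out = []
    for line in template.splitlines():
        match = KEY_LINE.fullmatch(line.strip())
        # A missing key raises KeyError, which surfaces incomplete targets early.
        out.append(fragments[match.group(1)] if match else line)
    return "\n".join(out) + "\n"


print(assemble(TEMPLATE, "gpu"))
```

The numerical statement in the middle of the template is written once, and only the hardware-specific scaffolding around it differs between targets, which is the code-replication saving the abstract refers to.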
Journal Article•10.1016/J.PARCO.2021.102853•
An international survey on MPI users

Atsushi Hori1, Emmanuel Jeannot2, George Bosilca3, Takahiro Ogura4, Balazs Gerofi, Jie Yin1, Yutaka Ishikawa1 •
National Institute of Informatics1, LaBRI2, University of Tennessee3, Fujitsu4
1 Dec 2021
TL;DR: A questionnaire survey of Message Passing Interface (MPI) users, conducted using two online questionnaire frameworks, has gathered more than 850 answers from 42 countries since February 2019.
Abstract: The Message Passing Interface (MPI) plays a crucial part in the parallel computing ecosystem and is a driving force behind many of the high-performance computing (HPC) successes. To maintain its relevance to the user community, and in particular to the growing HPC community at large, the MPI standard needs to identify and understand the MPI users’ concerns and expectations, and adapt accordingly to continue to efficiently bridge the gap between users and hardware. This questionnaire survey was conducted using two online questionnaire frameworks and has gathered more than 850 answers from 42 countries since February 2019. Some preceding surveys of MPI use are questionnaire surveys like ours, while others were conducted either by analyzing MPI programs to reveal static behavior or by using profiling tools to analyze the dynamic runtime behavior of MPI jobs. Our survey differs from other questionnaire surveys in its larger number of participants and wide geographic spread. As a result, it is possible to illustrate the current status of MPI users more accurately and with a wider geographical distribution. In this report, we show some interesting findings, compare the results with preceding studies when possible, and provide some recommendations for the MPI Forum based on the findings.
Journal Article•10.1016/J.PARCO.2020.102734•
Parallelization of network motif discovery using star contraction

Esra Ruzgar Ateskan1, Kayhan Erciyes2, Mehmet Emin Dalkilic1•
Ege University1, Üsküdar University2
1 Apr 2021
TL;DR: This study uses the star contraction algorithm to partition complex networks efficiently for parallel discovery of network motifs, and proposes two new heuristics to make star contraction more suitable for partitioning complex networks.
Abstract: Network motifs are widely used to uncover structural design principles of complex networks. Current sequential network motif discovery algorithms become inefficient as motif size grows; thus, parallelization methods have been proposed in the literature. In this study, we use the star contraction algorithm to partition complex networks efficiently for parallel discovery of network motifs. We propose two new heuristics to make star contraction more suitable for partitioning complex networks. The effectiveness of our partitioning strategies is verified using the ESU algorithm for subgraph counting. We also propose a ghost vertex detection algorithm to ensure that all motifs spanning multiple parts are found exactly. We implemented our method using MPI libraries and tested it on real-life complex networks from different domains. We compared the speedups of the star contraction algorithm with those of other graph partitioning algorithms. Our algorithm obtained better speedups than other partitioning algorithms in most cases. It also provides significant speedups over the sequential ESU algorithm, allowing discovery of larger network motifs.
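As background, one round of the textbook randomized star contraction can be sketched as follows: every vertex flips a coin, "heads" vertices become star centers, and each "tails" vertex with at least one heads neighbor joins one of them. This sketch does not include the paper's two heuristics or its ghost vertex detection; the function name `star_contraction_round`, the seeded coin flips, and the smallest-id tie-break are assumptions made here for a reproducible illustration.

```python
import random


def star_contraction_round(vertices, edges, seed=0):
    """One round of randomized star contraction on an undirected graph.

    Returns (center, contracted_edges): center maps each vertex to the
    vertex representing its part; contracted_edges connect distinct parts.
    """
    rng = random.Random(seed)
    heads = {v: rng.random() < 0.5 for v in vertices}

    adjacency = {v: [] for v in vertices}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    center = {}
    for v in vertices:
        if heads[v]:
            center[v] = v  # heads vertices are star centers
        else:
            candidates = [u for u in adjacency[v] if heads[u]]
            # A tails vertex joins a heads neighbor if one exists (smallest id
            # here, for determinism); otherwise it stays a singleton part.
            center[v] = min(candidates) if candidates else v

    # Keep only edges between different parts, relabeled by part center.
    contracted_edges = {
        (min(center[u], center[v]), max(center[u], center[v]))
        for u, v in edges
        if center[u] != center[v]
    }
    return center, contracted_edges


path = [(i, i + 1) for i in range(5)]
center, contracted = star_contraction_round(range(6), path, seed=1)
print(center, contracted)
```

Repeating such rounds on the contracted graph shrinks it step by step, and the parts formed along the way can serve as a partition of the original network.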
Journal Article•10.1016/J.PARCO.2021.102792•
A computational-graph partitioning method for training memory-constrained DNNs

[...]

Fareed Qararyah1, Mohamed Wahib2, Doğa Dikbayır3, Mehmet E. Belviranli4, Didem Unat1 •
Koç University1, National Institute of Advanced Industrial Science and Technology2, Michigan State University3, Colorado School of Mines4
1 Jul 2021
TL;DR: ParDNN as discussed by the authors is an automatic, generic, and non-intrusive partitioning strategy for DNNs that are represented as computational graphs, which decides a placement of DNN's underlying computational graph operations across multiple devices so that the devices' memory constraints are met and the training time is minimized.
Abstract: Many state-of-the-art Deep Neural Networks (DNNs) have substantial memory requirements. Limited device memory becomes a bottleneck when training those models. We propose ParDNN, an automatic, generic, and non-intrusive partitioning strategy for DNNs that are represented as computational graphs. ParDNN decides a placement of DNN’s underlying computational graph operations across multiple devices so that the devices’ memory constraints are met and the training time is minimized. ParDNN is completely independent of the deep learning aspects of a DNN. It requires no modification either at the model level or at the systems-level implementation of its operation kernels. ParDNN partitions DNNs having billions of parameters and hundreds of thousands of operations in seconds to a few minutes. Our experiments with TensorFlow on 16 GPUs demonstrate efficient training of 5 very large models while achieving superlinear scaling for both the batch size and training throughput. ParDNN either outperforms or qualitatively improves upon the related work.
Journal Article•10.1016/J.PARCO.2021.102786•
Improving the I/O of large geophysical models using PnetCDF and BeeGFS

[...]

Jared Brzenski1, Christopher Paolini1, Jose E. Castillo1•
San Diego State University1
1 Jul 2021
TL;DR: This paper significantly decreased the amount of time spent saving data to disk, and analysis of the features used in relation to PnetCDF with BeeGFS I/O optimization is given.
Abstract: Large scale geophysical modeling uses high performance computing systems to expedite the solutions of very large, complex systems. High disk latencies, low IOPS, and low read/write data transfer rates are relegating many numerical simulations to I/O bound jobs, where the run time is bound not by CPU rate, but by I/O rate. In this paper we seek to improve the I/O of two geophysical modeling applications and take full advantage of the parallel nature of the programs, as well as the file management system for the large output files. Parallelizing output for these programs is achieved using PnetCDF, a parallel implementation of the netCDF format, and BeeGFS, an open-source parallel file system. Using these solutions, we have significantly decreased the amount of time spent saving data to disk, and give analysis of the features used in relation to PnetCDF with BeeGFS I/O optimization.
Journal Article•10.1016/J.PARCO.2020.102722•
Parallel branch and bound algorithm for solving integer linear programming models derived from behavioral synthesis

[...]

Mohammad K. Fallah1, Mahmood Fazlali1•
Shahid Beheshti University1
1 Apr 2021
TL;DR: Two exact parallel branch and bound algorithms capable of solving large-scale ILP models derived from behavioral synthesis are developed; both can successfully accelerate behavioral synthesis on multi-core platforms and outperform the IBM ILOG CPLEX (v12.60) MIP solver in solving large ILP models.
Abstract: Integer Linear Programming (ILP) formulation of behavioral synthesis allows hardware designers to implement efficient circuits under resource and timing constraints. However, finding the optimal solution of ILP models is an NP-hard problem and remains a computational challenge. In this paper, we address this challenge by developing two exact parallel branch and bound algorithms which are capable of solving large-scale ILP models derived from behavioral synthesis. The first algorithm enables sub-node parallelism as well as adaptive branching and memory-efficient techniques to accelerate solving ILP models on shared-memory multi-core systems. The second algorithm is based on a node parallelism strategy. We evaluated the proposed algorithms using large ILP models derived from Media Bench Data Flow Graphs. The experimental results indicate that both proposed methods can successfully accelerate behavioral synthesis on multi-core platforms and outperform the IBM ILOG CPLEX (v12.60) MIP solver in solving large ILP models.
Journal Article•10.1016/J.PARCO.2020.102721•
Asynchronous parallel stochastic Quasi-Newton methods

[...]

Qianqian Tong1, Guannan Liang1, Xingyu Cai2, Chun Jiang Zhu1, Jinbo Bi1 •
University of Connecticut1, Baidu2
1 Apr 2021
TL;DR: In this article, the authors proposed an asynchronous parallel algorithm for stochastic quasi-Newton (AsySQN) method, which is the first one that truly parallelizes L-BFGS with a convergence guarantee.
Abstract: Although first-order stochastic algorithms, such as stochastic gradient descent, have been the main force to scale up machine learning models, such as deep neural nets, the second-order quasi-Newton methods start to draw attention due to their effectiveness in dealing with ill-conditioned optimization problems. The L-BFGS method is one of the most widely used quasi-Newton methods. We propose an asynchronous parallel algorithm for stochastic quasi-Newton (AsySQN) method. Unlike prior attempts, which parallelize only the calculation for gradient or the two-loop recursion of L-BFGS, our algorithm is the first one that truly parallelizes L-BFGS with a convergence guarantee. Adopting the variance reduction technique, a prior stochastic L-BFGS, which has not been designed for parallel computing, reaches a linear convergence rate. We prove that our asynchronous parallel scheme maintains the same linear convergence rate but achieves significant speedup. Empirical evaluations in both simulations and benchmark datasets demonstrate the speedup in comparison with the non-parallel stochastic L-BFGS, as well as the better performance than first-order methods in solving ill-conditioned problems.
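For readers unfamiliar with the "two-loop recursion of L-BFGS" mentioned in the abstract above, the following is a minimal sequential sketch of the standard textbook recursion. It is not the AsySQN algorithm and contains no asynchrony or variance reduction; the function names `dot` and `two_loop_direction` and the list-based vectors are illustrative assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def two_loop_direction(grad, s_hist, y_hist):
    """Approximate H^{-1} * grad from stored curvature pairs.

    s_hist[i] is a parameter difference and y_hist[i] the matching
    gradient difference, ordered oldest to newest (assumed layout).
    """
    q = list(grad)
    saved = []  # (rho, alpha) per pair, newest first
    # First loop: walk the history from newest to oldest.
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
        saved.append((rho, alpha))
    # Scale by the usual initial guess gamma = s^T y / y^T y (newest pair).
    if s_hist:
        gamma = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    # Second loop: walk the history from oldest to newest.
    for (s, y), (rho, alpha) in zip(zip(s_hist, y_hist), reversed(saved)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return r
```

With an empty history the function returns the gradient unchanged, so a caller stepping along the negated result falls back to plain gradient descent.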
Journal Article•10.1016/J.PARCO.2021.102815•
Optimal task scheduling for partially heterogeneous systems

[...]

Michael Orr1, Oliver Sinnen1•
University of Auckland1
1 Oct 2021
TL;DR: In this paper, an extension to the Allocation-Ordering (AO) state-space model for task scheduling is presented, which allows a system with related heterogeneous processors to be modeled, and optimal schedules on such a system to be found.
Abstract: Task scheduling with communication delays is a strongly NP-hard problem. Previous attempts at finding optimal solutions to this problem have used branch-and-bound state–space search, with promising results. However, the scheduling model used assumes a target system with fully homogeneous processors, which is unrealistic for many real world systems for which task scheduling might be performed. This paper presents an extension to the Allocation-Ordering (AO) state–space model for task scheduling which allows a system with related heterogeneous processors to be modeled, and optimal schedules on such a system to be found. Of particular note, the distinct allocation phase allows this model to efficiently adapt to partially heterogeneous systems, in which subsets of the processors are identical to each other, which significantly helps to reduce the search space. An extensive experimental evaluation shows that the introduction of heterogeneity certainly increases the difficulty of the problem. However, many problem instances solvable using homogeneous processors remain solvable with a heterogeneous target system, made possible by the significant benefit of this model in considering partial heterogeneity.
Journal Article•10.1145/3470642•
A High-throughput Parallel Viterbi Algorithm via Bitslicing

[...]

Saleh Khalaj Monfared, Omid Hajihassani1, Vahid Mohsseni, Dara Rahmati2, Saeid Gorgin3 •
University of Alberta1, Shahid Beheshti University2, Iranian Research Organization for Science and Technology3
15 Oct 2021
TL;DR: A novel bitsliced high-performance Viterbi algorithm suitable for high-throughput and data-intensive communication and a new column-major data representation scheme coupled with a novel bit-major representation scheme are presented.
Abstract: In this work, we present a novel bitsliced high-performance Viterbi algorithm suitable for high-throughput and data-intensive communication. A new column-major data representation scheme coupled wi...
Journal Article•10.1016/J.PARCO.2021.102824•
GPU accelerated parallel reliability-guided digital volume correlation with automatic seed selection based on 3D SIFT

[...]

Linchao Cai1, Junrong Yang1, Shoubin Dong1, Zhenyu Jiang1•
South China University of Technology1
1 Dec 2021
TL;DR: A GPU accelerated parallel reliability-guided DVC algorithm (CuSIFT-RGDVC) on CUDA is proposed, which leverages the 3D scale-invariant feature transform (3D SIFT) to assist seed selection, achieving full automation, and improves performance using GPU computing.
Abstract: Digital volume correlation (DVC) is a powerful and widely used technique for measuring the internal 3D deformation field of a wide range of materials. One of the most popular DVC algorithms is the reliability-guided DVC (RG-DVC), which is good at dealing with large continuous deformation. However, RG-DVC requires a manually specified seed from which computation starts, and suffers from low efficiency due to a huge amount of computation and data dependency. This paper proposes a GPU accelerated parallel reliability-guided DVC algorithm (CuSIFT-RGDVC) on CUDA, which leverages the 3D scale-invariant feature transform (3D SIFT) to assist seed selection, achieving full automation, and improves performance using GPU computing. In CuSIFT-RGDVC, reliability-guided displacement tracking (RGDT) is rewritten using a sorted array-based batch processing mechanism, which is a globally sequential, locally parallel model, and multi-granularity parallelism is adopted to maximize GPU utilization. The empirical results show that the proposed CuSIFT-RGDVC provides up to 29.1x speedup compared with our multi-threaded implementation and achieves the same level of computation speed as the state-of-the-art path-independent DVC without sacrificing accuracy.
Book Chapter•10.3233/APC210004•
Performance analysis of k-nearest neighbor classification algorithms for bank loan sectors

[...]

K. Hemachandran1, P. M. George, Raul Villamarin Rodriguez1, R. M. Kulkarni, S. Roy •
Woxsen School of Business1
1 Jan 2021
TL;DR: An attempt has been made to develop an algorithm for banks to check the credibility of borrowers in order to avoid non-performing assets, and the effectiveness of the K nearest neighbor algorithm was analysed.
Abstract: An attempt has been made to develop an algorithm for banks to check the credibility of borrowers in order to avoid non-performing assets. People approach different banks for loans to fulfil their financial needs. The number of loan applications is increasing day by day, mainly for child marriage, education, agriculture, business, home loans etc. Some people take a loan and do not pay it back in time, and some move out of the country without any intimation, so that the bank incurs a loss. Even now, in the covid-19 pandemic, many industries are closed but still need to pay salaries to their employees as well as rent and electricity bills, and for that they approach banks for loans. In all these cases the bank first needs to analyse the applicant's Credit Information Bureau India Limited score and check whether past loan repayments were made on time. In the present work the effectiveness of the K nearest neighbor algorithm was analysed. This research was carried out using Python. The accuracy of this classifier is analysed using the following metrics: Jaccard index, F1-score and LogLoss. This helps to find the potential of the customer, which is much higher than with the data mining classification algorithm, and thus it helps in sanctioning loans. © 2021 The authors and IOS Press.
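The abstract above names three evaluation metrics without defining them. As a minimal sketch, assuming the usual binary-classification definitions (the paper's exact variants are not stated here), they can be computed as follows; the function names and the toy labels in the usage note are illustrative assumptions, not data from the paper.

```python
import math


def _counts(y_true, y_pred, positive=1):
    """Return (true positives, false positives, false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, fp, fn


def jaccard_index(y_true, y_pred, positive=1):
    """TP / (TP + FP + FN) for the positive class."""
    tp, fp, fn = _counts(y_true, y_pred, positive)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    tp, fp, fn = _counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def log_loss(y_true, prob_positive, eps=1e-15):
    """Mean negative log-likelihood of 0/1 labels under predicted P(y = 1)."""
    total = 0.0
    for t, p in zip(y_true, prob_positive):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)
```

For the toy labels `y_true = [1, 0, 1, 1]` and `y_pred = [1, 0, 0, 1]` there are two true positives, no false positives and one false negative, so the Jaccard index is 2/3 and the F1-score is 0.8.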
Journal Article•10.1016/J.PARCO.2021.102813•
Performance portability through machine learning guided kernel selection in SYCL libraries

[...]

John Lawson, Mehdi Goli
1 Oct 2021
TL;DR: In this paper, unsupervised clustering methods are used to select a subset of possible kernels that should be deployed and simple classification methods can be trained to select from these kernels at runtime to give good performance.
Abstract: Automatically tuning parallel compute kernels allows libraries and frameworks to achieve performance on a wide range of hardware, however these techniques are typically focused on finding optimal kernel parameters for particular input sizes and parameters. General purpose compute libraries must be able to cater to all inputs and parameters provided by a user, and so these techniques are of limited use. Additionally parallel programming frameworks such as SYCL require that the kernels be deployed in a binary format embedded within the library. As such it is impractical to deploy a large number of possible kernel configurations without inflating the library size. Machine learning methods can be used to mitigate against both of these problems and provide performance for general purpose routines with a limited number of kernel configurations. We show that unsupervised clustering methods can be used to select a subset of the possible kernels that should be deployed and that simple classification methods can be trained to select from these kernels at runtime to give good performance. As these techniques are fully automated, relying only on benchmark data, the tuning process for new hardware or problems does not require any developer effort or expertise. We demonstrate that this technique gives competitive performance to vendor specific libraries when used in inference of a large neural network.
Journal Article•10.1016/J.PARCO.2020.102708•
Improved probabilistic I/O scheduling for limited-size Burst-Buffers deployed HPC

[...]

Benbo Zha1, Hong Shen1•
Sun Yat-sen University1
1 Apr 2021
TL;DR: The modularization technique is proposed, as the first improvement, to reduce the repeated computation by isolating the heuristic application selection module from the original method and reusing the application ranking result to adjust the I/O scheduling.
Abstract: The I/O bottleneck is a critical problem in current High Performance Computing (HPC) systems that hinders the performance scalability of a system. Some techniques, such as I/O scheduling and Burst-Buffering, have been proposed to accelerate data exchange between the compute and storage components on HPC platforms. Probabilistic I/O scheduling, a Markov-chain-based hybrid method that combines the above-mentioned two techniques, controls the data transmission considering the whole load states of the Burst-Buffers system to mitigate the I/O congestion caused by unpredictable concurrent I/O bursts. However, this method requires a large amount of computation to make online scheduling decisions, resulting in significant wastage of computing resources and decreased scheduling efficiency. In this paper, we first introduce the architecture of the Burst-Buffers deployed HPC platform, the probabilistic execution model of applications, and the basic probabilistic I/O scheduling method with a proof of its efficiency based on the Markov-chain framework. Then, we propose the modularization technique, as the first improvement, to reduce the repeated computation by isolating the heuristic application selection module from the original method and reusing the application ranking result to adjust the I/O scheduling. Next, we propose the thresholding technique, as the second improvement, to reduce the number of data transfers on burst-buffers by considering the write amplification characteristic of the underlying storage devices. Finally, we conduct extensive simulation experiments to show that our proposed I/O scheduling methods outperform the existing I/O scheduling methods without introducing burst-buffers states and without considering the characteristics of storage devices.
Journal Article•10.1016/J.PARCO.2021.102855•
AIOC2: A deep Q-learning approach to autonomic I/O congestion control in Lustre

[...]

Cheng Wen1, Shijun Deng1, Lingfang Zeng, Yang Wang2, André Brinkmann3 •
Huazhong University of Science and Technology1, Chinese Academy of Sciences2, University of Mainz3
1 Dec 2021
TL;DR: The authors propose an adaptive I/O congestion control framework, named AIOC2, which can not only adaptively tune the congestion control parameters, but also exploit the deep Q-learning method to start the training parameters and optimize the tuning for different types of workloads from the server and the client at the same time.
Abstract: In high performance computing systems, I/O congestion is a common problem in large-scale distributed file systems. However, the current implementation mainly requires the administrator to manually design low-level implementation and optimization. We propose an adaptive I/O congestion control framework, named AIOC2, which can not only adaptively tune the I/O congestion control parameters, but also exploit the deep Q-learning method to start the training parameters and optimize the tuning for different types of workloads from the server and the client at the same time. AIOC2 combines feedback-based dynamic I/O congestion control and deep Q-learning parameter tuning technology to achieve autonomic I/O congestion control, improve system I/O throughput, and thus reduce I/O latency without human interference. Experimental results show that AIOC2 can greatly reduce the impact of I/O congestion on I/O throughput and I/O latency performance in Lustre clusters. Compared to existing Lustre cluster systems, AIOC2 can increase write I/O throughput by 34.82% and decrease I/O latency by 26.17% on average.
Journal Article•10.1016/J.PARCO.2021.102758•
Visualizing the world’s largest turbulence simulation

[...]

Salvatore Cielo1, Luigi Iapichino1, Johannes Günther2, Christoph Federrath3, Elisabeth Mayer1, Markus Wiedemann4 •
Bavarian Academy of Sciences and Humanities1, Intel2, Australian National University3, Ludwig Maximilian University of Munich4
1 May 2021
TL;DR: In this paper, the authors describe a scalable approach for scientific visualization in HPC environments, based on the ray tracing engine Intel® OSPRay associated with VisIt, which has been applied to the visualization of the largest simulations of interstellar turbulence ever performed.
Abstract: We describe a novel, scalable approach for scientific visualization in HPC environments, based on the ray tracing engine Intel® OSPRay associated with VisIt. Part of the software stack of the Leibniz Supercomputing Centre, this method has been applied to the visualization of the largest simulations of interstellar turbulence ever performed, produced on SuperMUC-NG. The hybrid (MPI + Threading Building Blocks) parallelization of OSPRay and VisIt allows efficient scaling up to about 150 thousand cores, making it possible to visualize the data at the full, unprecedented resolution of 10048³ grid elements (about 23 TB per snapshot). Besides presenting the method, its HPC context and future developments, we describe the implications of our visualization in the considered science case: our work brilliantly showcases the stretching-and-folding mechanisms through which astrophysical processes drive turbulence and amplify the magnetic field in the interstellar gas, and how the first structures, the seeds of newborn stars are shaped by this process. We finally observe the similarities between ray tracing and other HPC numerical techniques used in astrophysics, anticipating increasing convergences in the near future.
