Top 68 papers presented at Parallel Computing in 2021

Journal Article•10.1016/J.PARCO.2021.102751•

Parallel and scalable Dunn Index for the validation of big data clusters

Chiheb-Eddine Ben N'cir¹, Abdallah Hamza¹, Waad Bouaguel¹•Institutions (1)

1 May 2021

TL;DR: A parallel and scalable model, S-DI (Scalable Dunn Index), is proposed to compute the Dunn Index for internal validation of clustering results; experiments show good scalability and reliable validation compared with other existing measures on large-scale data.

Abstract: Parallelizing data clustering algorithms has attracted the interest of many researchers over the past few years. Many efficient parallel algorithms have been proposed to build partitionings over huge volumes of data. The effectiveness of these algorithms is attributed to the distribution of data among a cluster of nodes and to the parallel computation models. Despite the effectiveness of parallel models in dealing with increasing volumes of data, little work has been done on the validation of big clusters. To deal with this issue, we propose a parallel and scalable model, referred to as S-DI (Scalable Dunn Index), to compute the Dunn Index measure for an internal validation of clustering results. Rather than computing the Dunn Index on a single machine in the clustering validation process, the proposed measure is computed by distributing the partitioning among a cluster of nodes using a customized parallel model under the Apache Spark framework. The proposed S-DI is also enhanced by a Sketch and Validate sampling technique, which aims to approximate the Dunn Index value by using a small representative data sample. Experiments on simulated and real datasets showed good scalability of the proposed measure and reliable validation compared to other existing measures when handling large-scale data.
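
For orientation, the Dunn Index is the ratio of the smallest distance between points of different clusters to the largest distance between points of the same cluster. The following is a minimal single-machine sketch in plain Python, not the Spark-based S-DI model from the paper. The function name `dunn_index` and the choice of nearest-pair separation and farthest-pair diameter are assumptions made for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def dunn_index(clusters):
    """Return min inter-cluster distance / max intra-cluster diameter.

    `clusters` is a list of clusters, each a list of points (tuples).
    """
    # Largest distance between two points of the same cluster.
    max_diameter = max(
        (dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    # Smallest distance between two points of different clusters.
    min_separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    if max_diameter == 0.0:
        raise ValueError("all clusters are single points; index is undefined")
    return min_separation / max_diameter
```

Both terms require all pairwise distances, so the cost grows quadratically with the number of points, which is the kind of cost that motivates distributing or sampling the computation.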

38 citations

Journal Article•10.1016/J.PARCO.2021.102833•

Porting WarpX to GPU-accelerated platforms

Andrew Myers¹, Ann S. Almgren¹, Ligia Diana Amorim¹, John B. Bell¹, Luca Fedeli², Lixin Ge¹,³, Kevin Gott¹, D.P. Grote⁴, Mark Hogan³, Axel Huebl¹, Revathi Jambunathan¹, Remi Lehe¹, C. Ng³, Michael Rowan¹, Olga V. Shapoval¹, Maxence Thévenet, Jean-Luc Vay¹, Henri Vincenti², E. Yang¹, N. Zaïm², Weiqun Zhang¹, Yinjian Zhao¹, Edoardo Zoni¹•Institutions (4)

Lawrence Berkeley National Laboratory¹, Université Paris-Saclay², SLAC National Accelerator Laboratory³, Lawrence Livermore National Laboratory⁴

1 Dec 2021

TL;DR: WarpX is a general-purpose electromagnetic particle-in-cell code originally designed for many-core CPU architectures; this paper describes the AMReX-based strategy that lets it use the GPU-accelerated nodes of the Summit supercomputer.

Abstract: WarpX is a general-purpose electromagnetic particle-in-cell code that was originally designed to run on many-core CPU architectures. We describe the strategy, based on the AMReX library, followed to allow WarpX to use the GPU-accelerated nodes on OLCF's Summit supercomputer, a strategy we believe will extend to the upcoming machines Frontier and Aurora. We summarize the challenges encountered and lessons learned, and give current performance results on a series of relevant benchmark problems.

36 citations

Journal Article•10.1016/J.PARCO.2021.102828•

A novel hybrid heuristic-based list scheduling algorithm in heterogeneous cloud computing environment for makespan optimization

Mirsaeid Hosseini Shirvani¹, Reza Noorian Talouki¹•Institutions (1)

Islamic Azad University¹

1 Dec 2021

TL;DR: A novel hybrid heuristic-based list scheduling (HH-LiSch) algorithm is presented for dependent task scheduling in heterogeneous cloud computing (HCC) systems on a bounded number of fully connected virtual machines (VMs).

Abstract: Efficient workflow scheduling can exploit the heterogeneity of resources in a heterogeneous cloud computing (HCC) platform in line with the variable requirements of the dependent tasks in a given workflow. Minimizing the total schedule length, or makespan, is essential for application performance in heterogeneous computing systems, especially in cloud computing environments. The problem of scheduling a set of dependent tasks onto a set of heterogeneous computational resources is a well-known NP-hard problem; therefore, no polynomial-time algorithm for computing the optimal solution is known. Many algorithms have been proposed to approximate a solution to this problem, but the majority of them have low efficiency. In this paper, a novel hybrid heuristic-based list scheduling (HH-LiSch) algorithm is presented for solving dependent task scheduling in HCC systems on a bounded number of fully connected virtual machines (VMs). The novelty of the paper lies in a new task priority strategy, a search for an appropriate VM time slot, and the use of a task duplication technique. Two novel task priority strategies are applied to arrange tasks in an efficiently ordered list. Then, during the scheduling process, an insertion-based procedure is called to find an appropriate time slot for performing task duplication. If it succeeds, the duplicated task is added to the rudimentary schedule. In this way, the final schedule is generated gradually. To validate the work, experiments are based on six real-world scientific workflows and a random task graph (RTG); performance is evaluated in terms of makespan, Schedule Length Ratio (SLR), speedup, and efficiency. The simulation results show a significant improvement over counterparts in the literature.
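
The insertion-based step can be pictured as a search for the earliest idle gap on a VM that is long enough for a task. The sketch below is a generic insertion policy in plain Python, not the HH-LiSch procedure itself. The function `earliest_slot`, its parameters, and the interval representation are hypothetical.

```python
def earliest_slot(busy, ready_time, duration):
    """Return the earliest start time >= ready_time at which a task of
    length `duration` fits into an idle gap of one VM.

    `busy` is a list of (start, end) intervals already scheduled on the VM,
    sorted by start time and non-overlapping.
    """
    candidate = ready_time
    for start, end in busy:
        # The gap before this busy interval is large enough: use it.
        if candidate + duration <= start:
            return candidate
        # Otherwise the task cannot start before this interval ends.
        candidate = max(candidate, end)
    # No gap fits: append after the last scheduled interval.
    return candidate
```

With `busy = [(0, 2), (5, 8)]`, a task of length 3 that is ready at time 0 fits into the gap starting at 2, while a task of length 4 has to wait until 8.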

36 citations

Journal Article•10.1016/J.PARCO.2021.102831•

Toward performance-portable PETSc for GPU-based exascale systems

Richard T. Mills¹, Mark F. Adams², Satish Balay¹, Jed Brown³, Alp Dener¹, Matthew G. Knepley⁴, Scott Kruger, Hannah Morgan⁵, Todd Munson¹, Karl Rupp¹,⁶, Barry Smith, Stefano Zampini⁷, Hong Zhang¹, Junchao Zhang¹•Institutions (7)

Argonne National Laboratory¹, Lawrence Berkeley National Laboratory², University of Colorado Boulder³, University at Buffalo⁴, University of Chicago⁵, Vienna University of Technology⁶, King Abdullah University of Science and Technology⁷

1 Dec 2021

TL;DR: The Portable Extensible Toolkit for Scientific Computation (PETSc) provides scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization; this paper describes its design for performance portability on GPU-based systems.

Abstract: The Portable Extensible Toolkit for Scientific Computation (PETSc) library delivers scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization. The PETSc design for performance portability addresses fundamental GPU accelerator challenges and stresses flexibility and extensibility by separating the programming model used by the application from that used by the library, and it enables application developers to use their preferred programming model, such as Kokkos, RAJA, SYCL, HIP, CUDA, or OpenCL, on upcoming exascale systems. A blueprint for using GPUs from PETSc-based codes is provided, and case studies emphasize the flexibility and high performance achieved on current GPU-based systems.

30 citations

Journal Article•10.1016/J.PARCO.2021.102827•

Implementation and evaluation of MPI 4.0 partitioned communication libraries

Matthew G. F. Dosanjh¹, Andrew Worley², Derek Schafer³, Prema Soundararajan⁴, Sheikh K. Ghafoor², Anthony Skjellum³, Purushotham V. Bangalore⁵, Ryan E. Grant¹,⁶•Institutions (6)

Sandia National Laboratories¹, Tennessee Technological University², University of Tennessee at Chattanooga³, University of Alabama at Birmingham⁴, University of Alabama⁵, Queen's University⁶

1 Dec 2021

TL;DR: A layered library implementing the MPI 4.0 partitioned point-to-point communication APIs is presented and compared with recently available RMA-based prototype libraries that have been updated to the standard-compliant interface.

Abstract: Partitioned point-to-point communication primitives provide a performance-oriented mechanism to support a hybrid parallel programming model and have been included in the upcoming MPI-4.0 standard. These primitives enable an MPI library to transfer parts of the data buffer while the application provides partial contributions using multiple threads or tasks or simply pipelines the buffers sequentially. The focus of this paper is the design and implementation of a layered library that provides the functionality of these newer APIs and supports application development using them. This library provides an opportunity to explore potential optimizations and identify further enhancements to the APIs. Initial experience in designing this library is presented along with preliminary performance results. In addition, the library is compared to recently available initial prototype libraries that have been updated to the standard-compliant interface. These prototype libraries were built on remote-memory-access (RMA) primitives, offering insight into different implementation strategies. In general, we observe an interesting trade-off space: the RMA-based implementation proves more performant for send-side partitioning, increasing perceived bandwidth by 8.9x on average over a single send, whereas the persistent-based implementation shows improvements of 4.0x on average. Comparing the two implementations, we find that the persistent-based implementation enables more overlap for receive-side partitioning, up to 5.37x the RMA library's overlap, while the RMA-based implementation provides up to 70% better send-side performance.

22 citations

Journal Article•10.1016/J.PARCO.2021.102841•

GPU algorithms for Efficient Exascale Discretizations

Ahmad Abdelfattah¹, Valeria Barra², Natalie Beams¹, Ryan Bleile³, Jed Brown², Jean-Sylvain Camier³, Robert Carson³, Noel Chalmers⁴, Veselin Dobrev³, Yohann Dudouit³, Paul Fischer⁵,⁶, Ali Karakus⁷, Stefan Kerkemeier⁵, Tzanio V. Kolev³, Yu-Hsiang Lan⁵, Elia Merzari⁵,⁸, Misun Min⁵, Malachi Phillips⁶, Thilina Rathnayake⁶, R. Rieben³, Thomas Stitt³, Ananias G. Tomboulides⁵,⁹, Stanimire Tomov¹, Vladimir Tomov³, Arturo Vargas³, Tim Warburton¹⁰, Kenneth Weiss³•Institutions (10)

University of Tennessee¹, University of Colorado Boulder², Lawrence Livermore National Laboratory³, Advanced Micro Devices⁴, Argonne National Laboratory⁵, University of Illinois at Urbana–Champaign⁶, Middle East Technical University⁷, Pennsylvania State University⁸, Aristotle University of Thessaloniki⁹, Virginia Tech¹⁰

1 Dec 2021

TL;DR: The authors describe the research and development activities in the Center for Efficient Exascale Discretizations (CEED) within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms.

Abstract: In this paper we describe the research and development activities in the Center for Efficient Exascale Discretizations (CEED) within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss the GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek projects. We report performance and capability improvements in several CEED-enabled applications on both NVIDIA and AMD GPU systems.

22 citations

Journal Article•10.1016/J.PARCO.2021.102840•

Porting hypre to heterogeneous computer architectures: Strategies and experiences

Robert D. Falgout¹, Ruipeng Li¹, Björn Sjögreen¹, Lu Wang, Ulrike Meier Yang¹•Institutions (1)

Lawrence Livermore National Laboratory¹

1 Dec 2021

TL;DR: This work discusses the experiences and strategies to port hypre to heterogeneous computers with accelerators, including the design of a new memory model, the use of abstractions, the BoxLoop macros in the structured and semi-structured interfaces, and the restructuring of algebraic multigrid (AMG) into modular components.

Abstract: Linear systems occur in many applications, and solving them can take a large share of the total simulation time. The high-performance library hypre provides a variety of interfaces and linear solvers, including various multigrid methods, that have achieved good scalability on a variety of homogeneous parallel computer architectures. Heterogeneous architectures with nodes that have both CPUs and accelerators pose new challenges, since they require more fine-grained parallelism and reduced data movement between different memories on a single node as well as across nodes. We discuss our experiences and strategies to port hypre to heterogeneous computers with accelerators, including the design of a new memory model, the use of abstractions, the BoxLoop macros in the structured and semi-structured interfaces, and the restructuring of algebraic multigrid (AMG) into modular components. We present numerical experiments comparing CPU and GPU performance for several test problems.

21 citations

Journal Article•10.1016/J.PARCO.2020.102698•

Multiscale modeling and cinematic visualization of photosynthetic energy conversion processes from electronic to cell scales.

Melih Sener¹, Stuart Levy¹, John E. Stone¹, AJ Christensen¹, Barry Isralewitz¹, Robert Patterson¹, Kalina Borkiewicz¹, Jeff Carpenter¹, C. Neil Hunter², Zaida Luthey-Schulten¹, Donna Cox¹•Institutions (2)

University of Illinois at Urbana–Champaign¹, University of Sheffield²

1 May 2021

TL;DR: This accessible visual narrative shows a lay audience how the energy of sunlight is captured, converted, and stored through a chain of proteins to power living cells, in a modern retelling of one of humanity's earliest stories: the interplay between light and life.

Abstract: Conversion of sunlight into chemical energy, namely photosynthesis, is the primary energy source of life on Earth. A visualization depicting this process, based on multiscale computational models from electronic to cell scales, is presented in the form of an excerpt from the fulldome show Birth of Planet Earth. This accessible visual narrative shows a lay audience, including children, how the energy of sunlight is captured, converted, and stored through a chain of proteins to power living cells. The visualization is the result of a multi-year collaboration among biophysicists, visualization scientists, and artists, which, in turn, is based on a decade-long experimental-computational collaboration on structural and functional modeling that produced an atomic detail description of a bacterial bioenergetic organelle, the chromatophore. Software advancements necessitated by this project have led to significant performance and feature advances, including hardware-accelerated cinematic ray tracing and instanced visualizations for efficient cell-scale modeling. The energy conversion steps depicted feature an integration of function from electronic to cell levels, spanning nearly 12 orders of magnitude in time scales. This atomic detail description uniquely enables a modern retelling of one of humanity’s earliest stories—the interplay between light and life.

20 citations

Journal Article•10.1016/J.PARCO.2021.102836•

Enabling GPU accelerated computing in the SUNDIALS time integration library

Cody J. Balos¹, David J. Gardner¹, Carol S. Woodward¹, Daniel R. Reynolds²•Institutions (2)

Lawrence Livermore National Laboratory¹, Southern Methodist University²

1 Dec 2021

TL;DR: The authors discuss the design considerations behind new GPU-enabled features in the SUNDIALS time integration library, present the features themselves, and report performance results on the Summit supercomputer and on early-access hardware for the Frontier supercomputer.

Abstract: As part of the Exascale Computing Project (ECP), a recent focus of development efforts for the SUite of Nonlinear and DIfferential/ALgebraic equation Solvers (SUNDIALS) has been to enable GPU-accelerated time integration in scientific applications at extreme scales. This effort has resulted in several new GPU-enabled implementations of core SUNDIALS data structures, support for programming paradigms which are aware of the heterogeneous architectures, and the introduction of utilities to provide new points of flexibility. In this paper, we discuss our considerations, both internal and external, when designing these new features and present the features themselves. We also present performance results for several of the features on the Summit supercomputer and early access hardware for the Frontier supercomputer, which demonstrate negligible performance overhead resulting from the additional infrastructure and significant speedups when using both NVIDIA and AMD GPUs.

19 citations

Journal Article•10.1016/J.PARCO.2021.102858•

On revisiting energy and performance in microservices applications: A cloud elasticity-driven approach

Igor Fontana De Nardin¹, Rodrigo da Rosa Righi¹, Thiago Roberto Lima Lopes¹, Cristiano André da Costa¹, Heon Y. Yeom², Harald Köstler³•Institutions (3)

Universidade do Vale do Rio dos Sinos¹, Seoul National University², University of Erlangen-Nuremberg³

1 Dec 2021

18 citations

Journal Article•10.1016/J.PARCO.2021.102793•

Callback-based completion notification using MPI Continuations

Joseph Schuchart¹, Philipp Samfass², Christoph Niethammer, José Gracia, George Bosilca¹•Institutions (2)

University of Tennessee¹, Technische Universität München²

1 Sep 2021

TL;DR: An extension to the MPI standard providing operation completion notifications using callbacks, so-called MPI Continuations, is presented and it is shown that the interface enables low-latency, high-throughput completion notifications that outperform solutions implemented in the application space.

Abstract: Asynchronous programming models (APM) are gaining more and more traction, allowing applications to expose the available concurrency to a runtime system tasked with coordinating the execution. While MPI has long provided support for multi-threaded communication and non-blocking operations, it falls short of adequately supporting APMs as correctly and efficiently handling MPI communication in different models is still a challenge. We have previously proposed an extension to the MPI standard providing operation completion notifications using callbacks, so-called MPI Continuations. This interface is flexible enough to accommodate a wide range of different APMs. In this paper, we present an extension to the previously described interface that allows for finer control of the behavior of the MPI Continuations interface. We then present some of our first experiences in using the interface in the context of different applications, including the NAS parallel benchmarks, the PaRSEC task-based runtime system, and a load-balancing scheme within an adaptive mesh refinement solver called ExaHyPE. We show that the interface, implemented inside Open MPI, enables low-latency, high-throughput completion notifications that outperform solutions implemented in the application space.

Journal Article•10.1145/3460872•

HashGraph—Scalable Hash Tables Using a Sparse Graph Data Structure

Oded Green¹•Institutions (1)

Georgia Institute of Technology¹

15 Jul 2021

TL;DR: HashGraph is a scalable approach for building hash tables that uses concepts taken from sparse graph representations; it can deal with a large number of hash values per entry without loss of performance.

Abstract: In this article, we introduce HashGraph, a new scalable approach for building hash tables that uses concepts taken from sparse graph representations, hence the name HashGraph. HashGraph introduces a new way to deal with hash collisions that does not use “open-addressing” or “separate-chaining,” yet it has the benefits of both these approaches. HashGraph currently works for static inputs. Recent progress with dynamic graph data structures suggests that HashGraph might be extendable to dynamic inputs as well. We show that HashGraph can deal with a large number of hash values per entry without loss of performance. Last, we show a new querying algorithm for value lookups. We experimentally compare HashGraph to several state-of-the-art implementations and find that it outperforms them by 2× on average when the inputs are unique and by as much as 40× when the input contains duplicates. The implementation of HashGraph in this article is for NVIDIA GPUs. HashGraph can build a hash table at a rate of 2.5 billion keys per second on an NVIDIA GV100 GPU and can query at nearly the same rate.
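
To illustrate what building a hash table from a sparse graph representation can look like, the sketch below lays keys out in a compressed sparse row (CSR) style: count keys per bucket, prefix-sum the counts into offsets, then place each key in its bucket's range. This is a sequential plain-Python illustration under the assumption that the layout is CSR-like. It is not the article's GPU implementation, and `build_table`, `lookup`, and the use of Python's built-in `hash` are hypothetical choices.

```python
from itertools import accumulate


def build_table(keys, num_buckets):
    """Build a static CSR-style hash table: (offsets, slots)."""
    bucket_of = [hash(k) % num_buckets for k in keys]
    # Pass 1: count how many keys fall into each bucket.
    counts = [0] * num_buckets
    for b in bucket_of:
        counts[b] += 1
    # Prefix sum: offsets[b]..offsets[b+1] is the slot range of bucket b.
    offsets = [0] + list(accumulate(counts))
    # Pass 2: place every key into the next free slot of its bucket.
    cursor = offsets[:-1].copy()
    slots = [None] * len(keys)
    for k, b in zip(keys, bucket_of):
        slots[cursor[b]] = k
        cursor[b] += 1
    return offsets, slots


def lookup(table, key):
    """Count how many times `key` occurs by scanning only its bucket."""
    offsets, slots = table
    b = hash(key) % (len(offsets) - 1)
    return slots[offsets[b]:offsets[b + 1]].count(key)
```

Because every bucket is a contiguous range of one flat array, duplicates of a key sit next to each other and no per-bucket linked list or probing sequence is needed in this layout.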

Journal Article•10.1016/J.PARCO.2020.102736•

A new scalable distributed k-means algorithm based on Cloud micro-services for High-performance computing

Fatéma Zahra Benchara¹, Mohamed Youssfi¹•Institutions (1)

University of Hassan II Casablanca¹

1 Apr 2021

TL;DR: A new distributed k-means method is presented that integrates a virtual parallel distributed computing model with a low-communication-cost mechanism; it is expected to provide scalable HPC applications for big data clustering.

Abstract: This paper proposes a distributed clustering method for high-performance computing (HPC) models and its application to medical image processing. Communication cost is one of the great challenges limiting the scalability of parallel and distributed computing models; it significantly reduces the performance of the HPC systems on which these models are implemented. In this paper, we present a new distributed k-means method that integrates a virtual parallel distributed computing model with a low-communication-cost mechanism. The k-means method is performed as a distributed service within a cooperative micro-services team that uses an asynchronous communication mechanism based on the AMQP protocol. We design and implement a parallel and distributed HPC application for MRI image segmentation intended for deployment on the cloud. Experimental results show that the proposed method (DSCM) and its assigned model reach a high degree of scalability. We expect this clustering approach to provide scalable HPC applications for big data clustering.
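
One common way to keep communication low in distributed k-means is to have each worker send only per-cluster sums and counts rather than raw points. The sketch below shows one such iteration in plain Python. It is a generic illustration, not the DSCM method or its AMQP-based micro-services, and `partial_stats` and `merge_and_update` are hypothetical names.

```python
from math import dist


def partial_stats(points, centroids):
    """Worker side: per-cluster coordinate sums and counts for one partition."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p in points:
        # Assign the point to its nearest centroid.
        j = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def merge_and_update(stats, centroids):
    """Coordinator side: combine worker statistics into new centroids."""
    new_centroids = []
    for j, old in enumerate(centroids):
        total = sum(counts[j] for _, counts in stats)
        if total == 0:
            new_centroids.append(tuple(old))  # keep an empty cluster in place
            continue
        new_centroids.append(tuple(
            sum(sums[j][d] for sums, _ in stats) / total
            for d in range(len(old))
        ))
    return new_centroids
```

Each worker's message has a size proportional to the number of clusters times the dimension, independent of how many points the worker holds.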

Journal Article•10.1016/J.PARCO.2021.102769•

Sphynx: A parallel multi-GPU graph partitioner for distributed-memory systems

Seher Acer¹, Erik G. Boman¹, Christian A. Glusa¹, Sivasankaran Rajamanickam¹•Institutions (1)

Sandia National Laboratories¹

1 Sep 2021

TL;DR: Sphynx provides a good and robust partitioning method across a wide range of graphs for applications looking for a GPU-based partitioner; compared to nvGRAPH on a single GPU, it is faster and obtains better balance and better quality partitions.

Abstract: Graph partitioning has been an important tool to partition the work among several processors to minimize the communication cost and balance the workload. While accelerator-based supercomputers are emerging to be the standard, the use of graph partitioning becomes even more important as applications are rapidly moving to these architectures. However, there is no distributed-memory-parallel, multi-GPU graph partitioner available for applications. We developed a spectral graph partitioner, Sphynx, using the portable, accelerator-friendly stack of the Trilinos framework. In Sphynx, we allow using different preconditioners and exploit their unique advantages. We use Sphynx to systematically evaluate the various algorithmic choices in spectral partitioning with a focus on the GPU performance. We perform those evaluations on two distinct classes of graphs: regular (such as meshes, matrices from finite element methods) and irregular (such as social networks and web graphs), and show that different settings and preconditioners are needed for these graph classes. The experimental results on the Summit supercomputer show that Sphynx is the fastest alternative on irregular graphs in an application-friendly setting and obtains a partitioning quality close to ParMETIS on regular graphs. When compared to nvGRAPH on a single GPU, Sphynx is faster and obtains better balance and better quality partitions. Sphynx provides a good and robust partitioning method across a wide range of graphs for applications looking for a GPU-based partitioner.

Journal Article•10.1016/J.PARCO.2021.102837•

Measurement and analysis of GPU-accelerated applications with HPCToolkit

Keren Zhou¹, Laksono Adhianto¹, Jonathon M. Anderson¹, Aaron Cherian¹, Dejan Grubisic¹, Mark W. Krentel¹, Yumeng Liu¹, Xiaozhu Meng¹, John Mellor-Crummey¹•Institutions (1)

Rice University¹

1 Dec 2021

TL;DR: HPCToolkit uses PC sampling and instrumentation to measure and attribute GPU performance metrics to source lines, loops, and inlined code, and it can collect call path traces that show how an execution evolves over time.

Abstract: To address the challenge of performance analysis on the US DOE’s forthcoming exascale supercomputers, Rice University has been extending its HPCToolkit performance tools to support measurement and analysis of GPU-accelerated applications. To help developers understand the performance of accelerated applications as a whole, HPCToolkit’s measurement and analysis tools attribute metrics to calling contexts that span both CPUs and GPUs. To measure GPU-accelerated applications efficiently, HPCToolkit employs a novel wait-free data structure to coordinate monitoring and attribution of GPU performance. To help developers understand the performance of complex GPU code generated from high-level programming models, HPCToolkit constructs sophisticated approximations of call path profiles for GPU computations. To support fine-grained analysis and tuning, HPCToolkit uses PC sampling and instrumentation to measure and attribute GPU performance metrics to source lines, loops, and inlined code. To supplement fine-grained measurements, HPCToolkit can measure GPU kernel executions using hardware performance counters. To provide a view of how an execution evolves over time, HPCToolkit can collect, analyze, and visualize call path traces within and across nodes. Finally, on NVIDIA GPUs, HPCToolkit can derive and attribute a collection of useful performance metrics based on measurements using GPU PC samples. We illustrate HPCToolkit’s new capabilities for analyzing GPU-accelerated applications with several codes developed as part of the Exascale Computing Project.

Journal Article•10.1016/J.PARCO.2021.102830•

Towards performance portability in the Spark astrophysical magnetohydrodynamics solver in the Flash-X simulation framework

Sean M. Couch¹, Jared W. Carlson¹, Michael A. Pajkos¹, Brian W. O'Shea¹, Anshu Dubey², Tom Klosterman²•Institutions (2)

Michigan State University¹, Argonne National Laboratory²

1 Dec 2021

TL;DR: Spark is a state-of-the-art magnetohydrodynamics solver for core-collapse supernova simulations that uses high-order spatial reconstruction, Runge-Kutta time integration, and an efficient cell-centered approach to satisfying the divergence-free condition for the magnetic fields.

Abstract: Simulations of core-collapse supernovae, and other astrophysical phenomena, are quintessential extreme-scale computing challenges. For core-collapse supernova simulations to be carried out by the ExaStar project under the Exascale Computing Project umbrella, a robust, efficient, and state-of-the-art magnetohydrodynamics solver is a critical requirement. In Flash-X, the primary software instrument for ExaStar, a new magnetohydrodynamics solver has been designed and implemented from the ground up to achieve accuracy and efficiency for simulations of complex astrophysical flows. This new solver, dubbed Spark, uses high-order spatial reconstruction, Runge–Kutta time integration, and an efficient cell-centered approach to satisfying the divergence-free condition for the magnetic fields. Spark was written to be optimized for data locality in the cache hierarchy of CPUs. Since data locality optimizations for cache hierarchies are not directly compatible with those of accelerators, we have taken the approach of using program synthesis to avoid the massive amounts of code replication that would be necessary if we were to maintain two different versions of the solver. Our program synthesis relies on a simple key-dictionary approach, implemented in Python, that enables us to assemble the version of the solver suitable for the target hardware from code fragments identified by specific keys. In this paper, we describe the data locality optimizations of the solver for CPUs and accelerators and the program synthesis tools that enable this portability. We also detail the parallel performance of Spark for both CPUs and accelerators.
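
The key-dictionary idea can be pictured as follows: code fragments are stored under keys, one dictionary per target, and a template names the keys to splice in. This is a minimal sketch of that general pattern, not the Flash-X tooling. The `FRAGMENTS` table, the key names, the fragment text, and the `assemble` helper are all invented for illustration.

```python
from string import Template

# Hypothetical fragments: the same keys map to different code per target.
FRAGMENTS = {
    "cpu": {
        "loop_open": "do i = ilo, ihi",
        "loop_close": "end do",
    },
    "gpu": {
        "loop_open": "!$omp target teams distribute parallel do\ndo i = ilo, ihi",
        "loop_close": "end do",
    },
}

# Hypothetical solver template; $name marks where a fragment is spliced in.
SOLVER_TEMPLATE = Template("$loop_open\n  u(i) = u(i) + dt * flux(i)\n$loop_close\n")


def assemble(target):
    """Return the solver source for `target` by substituting its fragments."""
    return SOLVER_TEMPLATE.substitute(FRAGMENTS[target])
```

Calling `assemble("cpu")` and `assemble("gpu")` yields two variants of the same loop body from a single template, so only the fragments differ between targets.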

Journal Article•10.1016/J.PARCO.2021.102853•

An international survey on MPI users

Atsushi Hori¹, Emmanuel Jeannot², George Bosilca³, Takahiro Ogura⁴, Balazs Gerofi, Jie Yin¹, Yutaka Ishikawa¹•Institutions (4)

National Institute of Informatics¹, LaBRI², University of Tennessee³, Fujitsu⁴

1 Dec 2021

TL;DR: A questionnaire survey of Message Passing Interface (MPI) users, conducted with two online questionnaire frameworks, has gathered more than 850 answers from 42 countries since February 2019.

Abstract: The Message Passing Interface (MPI) plays a crucial part in the parallel computing ecosystem and is a driving force behind many high-performance computing (HPC) successes. To maintain its relevance to the user community, and in particular to the growing HPC community at large, the MPI standard needs to identify and understand MPI users' concerns and expectations, and adapt accordingly to continue to efficiently bridge the gap between users and hardware. This questionnaire survey was conducted using two online questionnaire frameworks and has gathered more than 850 answers from 42 countries since February 2019. Some preceding surveys of MPI use are questionnaire surveys like ours, while others were conducted either by analyzing MPI programs to reveal static behavior or by using profiling tools to analyze the dynamic runtime behavior of MPI jobs. Our survey differs from other questionnaire surveys in its larger number of participants and wide geographic spread. As a result, it is possible to illustrate the current status of MPI users more accurately and with a wider geographical distribution. In this report, we show some interesting findings, compare the results with preceding studies when possible, and provide some recommendations for the MPI Forum based on the findings.

Journal Article•10.1016/J.PARCO.2020.102734•

Parallelization of network motif discovery using star contraction

Esra Ruzgar Ateskan¹, Kayhan Erciyes², Mehmet Emin Dalkilic¹•Institutions (2)

Ege University¹, Üsküdar University²

1 Apr 2021

TL;DR: This study uses the star contraction algorithm to partition complex networks efficiently for parallel discovery of network motifs, and proposes two new heuristics to make star contraction more suitable for partitioning complex networks.

Abstract: Network motifs are widely used to uncover structural design principles of complex networks. Current sequential network motif discovery algorithms become inefficient as motif size grows; thus, parallelization methods have been proposed in the literature. In this study, we use the star contraction algorithm to partition complex networks efficiently for parallel discovery of network motifs. We propose two new heuristics to make star contraction more suitable for partitioning complex networks. The effectiveness of our partitioning strategies is verified using the ESU algorithm for subgraph counting. We also propose a ghost vertices detection algorithm to ensure that all the motifs located in multiple parts are found exactly. We implemented our method using MPI libraries and tested it on real-life complex networks of different domains. We compared the speedups of the star contraction algorithm with the speedups of other graph partitioning algorithms. Our algorithm obtained better speedups than the other partitioning algorithms in most cases. Our algorithm provides significant speedups compared to the sequential ESU algorithm, allowing discovery of larger network motifs.
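For readers unfamiliar with star contraction, the following is a minimal sketch of one round of the textbook randomized formulation: each vertex flips a coin to become a center or a satellite, and each satellite merges into one neighbouring center. It does not reproduce the two heuristics proposed in the paper, and the function names and the tie-breaking rule (smallest center id) are assumptions made for illustration.

```python
import random


def star_contraction_round(adj, rng):
    """One round of randomized star contraction.

    adj maps each vertex to the set of its neighbours (undirected graph).
    Returns a dict mapping every vertex to the vertex it is merged into.
    """
    is_center = {v: rng.random() < 0.5 for v in adj}
    owner = {}
    for v in adj:
        if is_center[v]:
            owner[v] = v  # centers keep their identity
            continue
        centers = sorted(u for u in adj[v] if is_center[u])
        # A satellite joins a neighbouring center (assumed rule: smallest id);
        # a satellite with no center neighbour stays on its own this round.
        owner[v] = centers[0] if centers else v
    return owner


def contract(adj, owner):
    """Build the contracted graph whose vertices are the surviving owners."""
    new_adj = {c: set() for c in set(owner.values())}
    for v, neighbours in adj.items():
        for u in neighbours:
            a, b = owner[v], owner[u]
            if a != b:  # drop edges that became self-loops
                new_adj[a].add(b)
                new_adj[b].add(a)
    return new_adj
```

Repeating `star_contraction_round` followed by `contract` shrinks the graph step by step; the groups of original vertices that end up sharing an owner can then serve as parts of a partition.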

Journal Article•10.1016/J.PARCO.2021.102792•

A computational-graph partitioning method for training memory-constrained DNNs

Fareed Qararyah¹, Mohamed Wahib², Doğa Dikbayır³, Mehmet E. Belviranli⁴, Didem Unat¹•Institutions (4)

Koç University¹, National Institute of Advanced Industrial Science and Technology², Michigan State University³, Colorado School of Mines⁴

1 Jul 2021

TL;DR: ParDNN as discussed by the authors is an automatic, generic, and non-intrusive partitioning strategy for DNNs that are represented as computational graphs, which decides a placement of DNN's underlying computational graph operations across multiple devices so that the devices' memory constraints are met and the training time is minimized.

Abstract: Many state-of-the-art Deep Neural Networks (DNNs) have substantial memory requirements. Limited device memory becomes a bottleneck when training those models. We propose ParDNN, an automatic, generic, and non-intrusive partitioning strategy for DNNs that are represented as computational graphs. ParDNN decides a placement of DNN’s underlying computational graph operations across multiple devices so that the devices’ memory constraints are met and the training time is minimized. ParDNN is completely independent of the deep learning aspects of a DNN. It requires no modification at either the model level or the systems-level implementation of its operation kernels. ParDNN partitions DNNs having billions of parameters and hundreds of thousands of operations in seconds to a few minutes. Our experiments with TensorFlow on 16 GPUs demonstrate efficient training of 5 very large models while achieving superlinear scaling for both the batch size and training throughput. ParDNN either outperforms or qualitatively improves upon the related work.

Journal Article•10.1016/J.PARCO.2021.102786•

Improving the I/O of large geophysical models using PnetCDF and BeeGFS

Jared Brzenski¹, Christopher Paolini¹, Jose E. Castillo¹•Institutions (1)

San Diego State University¹

1 Jul 2021

TL;DR: This paper significantly decreases the amount of time spent saving data to disk and gives an analysis of the features used in relation to PnetCDF with BeeGFS I/O optimization.

Abstract: Large scale geophysical modeling uses high performance computing systems to expedite the solutions of very large, complex systems. High disk latencies, low IOPS, and low read/write data transfer rates are relegating many numerical simulations to I/O bound jobs, where the run time is bound not by CPU rate, but by I/O rate. In this paper we seek to improve the I/O of two geophysical modeling applications and take full advantage of the parallel nature of the programs, as well as the file management system for the large output files. Parallelizing output for these programs is achieved using PnetCDF, a parallel implementation of the netCDF format, and BeeGFS, an open-source parallel file system. Using these solutions, we have significantly decreased the amount of time spent saving data to disk, and we give an analysis of the features used in relation to PnetCDF with BeeGFS I/O optimization.

Journal Article•10.1016/J.PARCO.2020.102722•

Parallel branch and bound algorithm for solving integer linear programming models derived from behavioral synthesis

Mohammad K. Fallah¹, Mahmood Fazlali¹•Institutions (1)

Shahid Beheshti University¹

1 Apr 2021

TL;DR: Two exact parallel branch and bound algorithms capable of solving large-scale ILP models derived from behavioral synthesis are developed; both can successfully accelerate behavioral synthesis on multi-core platforms and outperform the IBM ILOG CPLEX (v12.60) MIP solver in solving large ILP models.

Abstract: Integer Linear Programming (ILP) formulation of behavioral synthesis allows hardware designers to implement efficient circuits considering resource and timing constraints. However, finding the optimal answer of ILP models is an NP-Hard problem and remains a computational challenge. In this paper, we address this challenge by developing two exact parallel branch and bound algorithms which are capable of solving large-scale ILP models derived from behavioral synthesis. The first algorithm enables sub-node parallelism as well as adaptive branching and memory efficient techniques to accelerate solving ILP models on shared memory multi-core systems. The second algorithm is developed based on a node parallelism strategy. We evaluated the proposed algorithms using large ILP models derived from Media Bench Data Flow Graphs. The experimental results indicate that both proposed methods can successfully accelerate behavioral synthesis on multi-core platforms and outperform the IBM ILOG CPLEX (v12.60) MIP solver in solving large ILP models.
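As background on the branch and bound technique named in this abstract, here is a minimal sequential sketch on a small 0/1 knapsack problem, which is a special case of ILP. It is not the authors' algorithm: it has no parallelism or adaptive branching, and the function name, the depth-first node order, and the fractional (LP-relaxation) bound are choices made for illustration. Weights are assumed to be positive.

```python
def knapsack_branch_and_bound(values, weights, capacity):
    """Maximise total value subject to total weight <= capacity (0/1 items)."""
    # Sort by value density so the fractional relaxation is easy to evaluate.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    v = [values[i] for i in order]
    w = [weights[i] for i in order]
    n = len(v)

    def bound(k, value, room):
        """Upper bound: fill greedily from item k, allowing one fractional item."""
        for i in range(k, n):
            if w[i] <= room:
                room -= w[i]
                value += v[i]
            else:
                return value + v[i] * room / w[i]
        return value

    best = 0
    stack = [(0, 0, capacity)]  # node = (next item index, value, remaining room)
    while stack:
        k, value, room = stack.pop()
        best = max(best, value)  # every node is a feasible partial selection
        if k == n or bound(k, value, room) <= best:
            continue  # leaf reached, or subtree cannot beat the incumbent
        stack.append((k + 1, value, room))  # branch: leave item k out
        if w[k] <= room:
            stack.append((k + 1, value + v[k], room - w[k]))  # branch: take item k
    return best
```

In a node-parallel variant, the entries of `stack` would be distributed among worker threads that share the incumbent `best`; that coordination is deliberately left out of this sketch.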

Journal Article•10.1016/J.PARCO.2020.102721•

Asynchronous parallel stochastic Quasi-Newton methods

Qianqian Tong¹, Guannan Liang¹, Xingyu Cai², Chun Jiang Zhu¹, Jinbo Bi¹•Institutions (2)

University of Connecticut¹, Baidu²

1 Apr 2021

TL;DR: In this article, the authors proposed an asynchronous parallel algorithm for stochastic quasi-Newton (AsySQN) method, which is the first one that truly parallelizes L-BFGS with a convergence guarantee.

Abstract: Although first-order stochastic algorithms, such as stochastic gradient descent, have been the main force to scale up machine learning models, such as deep neural nets, the second-order quasi-Newton methods have started to draw attention due to their effectiveness in dealing with ill-conditioned optimization problems. The L-BFGS method is one of the most widely used quasi-Newton methods. We propose an asynchronous parallel stochastic quasi-Newton (AsySQN) method. Unlike prior attempts, which parallelize only the calculation of the gradient or the two-loop recursion of L-BFGS, our algorithm is the first one that truly parallelizes L-BFGS with a convergence guarantee. Adopting the variance reduction technique, a prior stochastic L-BFGS, which has not been designed for parallel computing, reaches a linear convergence rate. We prove that our asynchronous parallel scheme maintains the same linear convergence rate but achieves significant speedup. Empirical evaluations in both simulations and benchmark datasets demonstrate the speedup in comparison with the non-parallel stochastic L-BFGS, as well as better performance than first-order methods in solving ill-conditioned problems.
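The two-loop recursion of L-BFGS mentioned in this abstract can be sketched as follows. This is the standard sequential recursion, not the asynchronous AsySQN scheme; the function names are illustrative, vectors are plain Python lists, and every stored pair is assumed to satisfy the curvature condition y·s > 0.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_loop_direction(grad, s_list, y_list):
    """Approximate H @ grad, where H is the L-BFGS inverse-Hessian estimate.

    s_list holds past steps x_{k+1} - x_k and y_list the matching gradient
    differences, both ordered from oldest to newest.
    """
    q = list(grad)
    history = []  # (rho, alpha) pairs, newest first
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
        history.append((rho, alpha))

    # Initial scaling gamma * I built from the most recent pair.
    if s_list:
        gamma = dot(s_list[-1], y_list[-1]) / dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]

    # Second loop runs from the oldest pair to the newest.
    for (s, y), (rho, alpha) in zip(zip(s_list, y_list), reversed(history)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return r
```

The search direction is the negative of the returned vector. With an empty history the function returns the gradient unchanged, which reduces the step to plain gradient descent.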

Journal Article•10.1016/J.PARCO.2021.102815•

Optimal task scheduling for partially heterogeneous systems

Michael Orr¹, Oliver Sinnen¹•Institutions (1)

University of Auckland¹

1 Oct 2021

TL;DR: In this paper, an extension to the Allocation-Ordering (AO) state-space model for task scheduling is presented, which allows a system with related heterogeneous processors to be modeled, and optimal schedules on such a system to be found.

Abstract: Task scheduling with communication delays is a strongly NP-hard problem. Previous attempts at finding optimal solutions to this problem have used branch-and-bound state–space search, with promising results. However, the scheduling model used assumes a target system with fully homogeneous processors, which is unrealistic for many real world systems for which task scheduling might be performed. This paper presents an extension to the Allocation-Ordering (AO) state–space model for task scheduling which allows a system with related heterogeneous processors to be modeled, and optimal schedules on such a system to be found. Of particular note, the distinct allocation phase allows this model to efficiently adapt to partially heterogeneous systems, in which subsets of the processors are identical to each other, which significantly helps to reduce the search space. An extensive experimental evaluation shows that the introduction of heterogeneity certainly increases the difficulty of the problem. However, many problem instances solvable using homogeneous processors remain solvable with a heterogeneous target system, made possible by the significant benefit of this model in considering partial heterogeneity.

Journal Article•10.1145/3470642•

A High-throughput Parallel Viterbi Algorithm via Bitslicing

Saleh Khalaj Monfared, Omid Hajihassani¹, Vahid Mohsseni, Dara Rahmati², Saeid Gorgin³•Institutions (3)

University of Alberta¹, Shahid Beheshti University², Iranian Research Organization for Science and Technology³

15 Oct 2021

TL;DR: A novel bitsliced high-performance Viterbi algorithm suitable for high-throughput and data-intensive communication and a new column-major data representation scheme coupled with a novel bit-major representation scheme are presented.

Abstract: In this work, we present a novel bitsliced high-performance Viterbi algorithm suitable for high-throughput and data-intensive communication. A new column-major data representation scheme coupled wi...

Journal Article•10.1016/J.PARCO.2021.102824•

GPU accelerated parallel reliability-guided digital volume correlation with automatic seed selection based on 3D SIFT

Linchao Cai¹, Junrong Yang¹, Shoubin Dong¹, Zhenyu Jiang¹•Institutions (1)

South China University of Technology¹

1 Dec 2021

TL;DR: A GPU accelerated parallel reliability-guided DVC algorithm (CuSIFT-RGDVC) on CUDA is proposed, which leverages 3D scale-invariant feature transform (3D SIFT) to assist seed selection to realize full automation and improves performance utilizing GPU computing.

Abstract: Digital volume correlation (DVC) is a powerful and widely used technique for measuring the internal 3D deformation field of a wide range of materials. One of the most popular DVC algorithms is the reliability-guided DVC (RG-DVC), which is good at dealing with large continuous deformation. However, RG-DVC requires a manually specified seed from which computation starts, and suffers from low efficiency due to a huge amount of computation and data dependency. This paper proposes a GPU accelerated parallel reliability-guided DVC algorithm (CuSIFT-RGDVC) on CUDA, which leverages 3D scale-invariant feature transform (3D SIFT) to assist seed selection to realize full automation and improves performance utilizing GPU computing. In CuSIFT-RGDVC, reliability-guided displacement tracking (RGDT) is rewritten using a sorted array-based batch processing mechanism, which is a globally sequential, locally parallel model, and multi-granularity parallelism is adopted to maximize GPU utilization. The empirical result shows that the proposed CuSIFT-RGDVC provides up to 29.1x speedup compared with our multi-threaded implementation and achieves the same level of computation speed as the state-of-the-art path-independent DVC without sacrificing accuracy.

Book Chapter•10.3233/APC210004•

Performance analysis of k-nearest neighbor classification algorithms for bank loan sectors

K. Hemachandran¹, P. M. George, Raul Villamarin Rodriguez¹, R. M. Kulkarni, S. Roy•Institutions (1)

Woxsen School of Business¹

1 Jan 2021

TL;DR: An attempt has been made to develop an algorithm for banks to check the credibility of borrowers in order to avoid non-performing assets, and the effectiveness of the K-nearest neighbor algorithm was analysed.

Abstract: An attempt has been made to develop an algorithm for banks to check the credibility of borrowers in order to avoid non-performing assets. People approach different banks for loans to fulfil their financial needs. The number of loan applications is increasing day by day, mainly for children's marriages, education, agriculture, business, home loans, etc. Some borrowers do not repay the loan on time, and some leave the country without any notice, so the bank incurs a loss. Even during the COVID-19 pandemic, many industries were closed but still had to pay salaries, rent, and electricity bills, and for that they approach banks for loans. In all these cases the bank first needs to analyse the applicant's Credit Information Bureau India Limited score and check whether earlier loan repayments were made on time. In the present work, the effectiveness of the K-nearest neighbor algorithm was analysed. This research was carried out using Python. The accuracy of this classifier is analysed using the following metrics: Jaccard index, F1-score, and LogLoss. This helps to find the potential of the customer, which is much higher than the data mining classification algorithm, and thus it helps in sanctioning loans. © 2021 The authors and IOS Press.
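The classifier and the three metrics named in this abstract can be sketched in plain Python as follows. The abstract gives no implementation details, so the Euclidean distance, the binary 0/1 labels, the function names, and the clipping constant `eps` are all assumptions made for illustration.

```python
import math


def knn_predict_proba(train_X, train_y, x, k):
    """Fraction of the k nearest training points (Euclidean) labelled 1."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], x))[:k]
    return sum(label for _, label in nearest) / k


def _counts(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def jaccard_index(y_true, y_pred):
    """TP / (TP + FP + FN) for the positive class."""
    tp, fp, fn = _counts(y_true, y_pred)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    tp, fp, fn = _counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for t, p in zip(y_true, proba):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y_true)
```

A hard prediction can be obtained by thresholding the probability, for example `int(knn_predict_proba(X, y, x, k=3) >= 0.5)`. The Jaccard index and F1-score are computed on those hard predictions, while LogLoss uses the probabilities directly.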

Journal Article•10.1016/J.PARCO.2021.102813•

Performance portability through machine learning guided kernel selection in SYCL libraries

John Lawson, Mehdi Goli

1 Oct 2021

TL;DR: In this paper, unsupervised clustering methods are used to select a subset of possible kernels that should be deployed and simple classification methods can be trained to select from these kernels at runtime to give good performance.

Abstract: Automatically tuning parallel compute kernels allows libraries and frameworks to achieve performance on a wide range of hardware; however, these techniques are typically focused on finding optimal kernel parameters for particular input sizes and parameters. General purpose compute libraries must be able to cater to all inputs and parameters provided by a user, and so these techniques are of limited use. Additionally, parallel programming frameworks such as SYCL require that the kernels be deployed in a binary format embedded within the library. As such, it is impractical to deploy a large number of possible kernel configurations without inflating the library size. Machine learning methods can be used to mitigate both of these problems and provide performance for general purpose routines with a limited number of kernel configurations. We show that unsupervised clustering methods can be used to select a subset of the possible kernels that should be deployed and that simple classification methods can be trained to select from these kernels at runtime to give good performance. As these techniques are fully automated, relying only on benchmark data, the tuning process for new hardware or problems does not require any developer effort or expertise. We demonstrate that this technique gives competitive performance to vendor specific libraries when used in inference of a large neural network.
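The two-stage idea in this abstract (cluster benchmark results to choose which kernels to ship, then classify at runtime) can be sketched as follows. The abstract does not name specific methods, so k-means and a nearest-neighbour lookup are stand-ins chosen here, and the function names, the deterministic centroid initialisation, and the benchmark layout `perf[i][j]` are all hypothetical.

```python
def normalise(row):
    """Scale one benchmark's results so the best kernel scores 1.0."""
    best = max(row)
    return [x / best for x in row]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def select_kernels(perf, k):
    """perf[i][j] is the throughput of kernel j on benchmark i.

    Returns the indices of the kernels to deploy: the best one per cluster.
    """
    centroids = kmeans([normalise(row) for row in perf], k)
    return sorted({max(range(len(c)), key=c.__getitem__) for c in centroids})


def train_selector(sizes, perf, deployed):
    """Return a function that picks a deployed kernel for a new problem size."""
    labels = [max(deployed, key=lambda j: row[j]) for row in perf]

    def choose(size):
        nearest = min(range(len(sizes)), key=lambda i: sq_dist(sizes[i], size))
        return labels[nearest]

    return choose
```

With four benchmarks and three candidate kernels, `select_kernels(perf, 2)` keeps at most two kernels, and `train_selector` then maps an unseen problem size to one of them using the closest benchmarked size.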

Journal Article•10.1016/J.PARCO.2020.102708•

Improved probabilistic I/O scheduling for limited-size Burst-Buffers deployed HPC

Benbo Zha¹, Hong Shen¹•Institutions (1)

Sun Yat-sen University¹

1 Apr 2021

TL;DR: The modularization technique is proposed, as the first improvement, to reduce the repeated computation by isolating the heuristic application selection module from the original method and reusing the application ranking result to adjust the I/O scheduling.

Abstract: The I/O bottleneck is a critical problem in current High Performance Computing (HPC) systems that hinders the performance scalability of a system. Some techniques, such as I/O scheduling and Burst-Buffering, have been proposed to accelerate data exchange between the compute and storage components on HPC platforms. Probabilistic I/O scheduling, a Markov-chain-based hybrid method that combines these two techniques, controls the data transmission considering the whole load states of the Burst-Buffers system to mitigate the I/O congestion caused by unpredictable concurrent I/O bursts. However, this method requires a large amount of computation to make online scheduling decisions, resulting in significant wastage of computing resources and decreased efficiency in scheduling. In this paper, we first introduce the architecture of the Burst-Buffers deployed HPC platform, the probabilistic execution model of applications, and the basic probabilistic I/O scheduling method with a proof of its efficiency based on the Markov-chain framework. Then, we propose the modularization technique, as the first improvement, to reduce the repeated computation by isolating the heuristic application selection module from the original method and reusing the application ranking result to adjust the I/O scheduling. Next, we propose the thresholding technique, as the second improvement, to reduce the number of data transfers on burst-buffers by considering the write amplification characteristic of the underlying storage devices. Finally, we conduct extensive simulation experiments to show that our proposed I/O scheduling methods outperform the existing I/O scheduling methods without introducing burst-buffers states and without considering the characteristics of storage devices.

Journal Article•10.1016/J.PARCO.2021.102855•

AIOC2: A deep Q-learning approach to autonomic I/O congestion control in Lustre

Cheng Wen¹, Shijun Deng¹, Lingfang Zeng, Yang Wang², André Brinkmann³•Institutions (3)

Huazhong University of Science and Technology¹, Chinese Academy of Sciences², University of Mainz³

1 Dec 2021

TL;DR: Li et al. as discussed by the authors proposed an adaptive I/O congestion control framework, named AIOC2, which can not only adaptively tune the congestion control parameters, but also exploit the deep Q-learning method to start the training parameters and optimize the tuning for different types of workloads from the server and the client at the same time.

Abstract: In high performance computing systems, I/O congestion is a common problem in large-scale distributed file systems. However, the current implementation mainly requires administrators to manually design low-level implementation and optimization. We propose an adaptive I/O congestion control framework, named AIOC2, which can not only adaptively tune the I/O congestion control parameters, but also exploit the deep Q-learning method to start the training parameters and optimize the tuning for different types of workloads from the server and the client at the same time. AIOC2 combines feedback-based dynamic I/O congestion control and deep Q-learning parameter tuning technology to achieve autonomic I/O congestion control, improve system I/O throughput, and thus reduce I/O latency without human interference. Experimental results show that AIOC2 can greatly reduce the impact of I/O congestion on I/O throughput and I/O latency performance in Lustre clusters. Compared to existing Lustre cluster systems, AIOC2 can increase write I/O throughput by 34.82% and decrease I/O latency by 26.17% on average.

Journal Article•10.1016/J.PARCO.2021.102758•

Visualizing the world’s largest turbulence simulation

Salvatore Cielo¹, Luigi Iapichino¹, Johannes Günther², Christoph Federrath³, Elisabeth Mayer¹, Markus Wiedemann⁴•Institutions (4)

Bavarian Academy of Sciences and Humanities¹, Intel², Australian National University³, Ludwig Maximilian University of Munich⁴

1 May 2021

TL;DR: In this paper, the authors describe a scalable approach for scientific visualization in HPC environments, based on the ray tracing engine Intel® OSPRay associated with VisIt, which has been applied to the visualization of the largest simulations of interstellar turbulence ever performed.

Abstract: We describe a novel, scalable approach for scientific visualization in HPC environments, based on the ray tracing engine Intel® OSPRay associated with VisIt. Part of the software stack of the Leibniz Supercomputing Centre, this method has been applied to the visualization of the largest simulations of interstellar turbulence ever performed, produced on SuperMUC-NG. The hybrid (MPI + Threading Building Blocks) parallelization of OSPRay and VisIt allows efficient scaling up to about 150 thousand cores, making it possible to visualize the data at the full, unprecedented resolution of 10048³ grid elements (about 23 TB per snapshot). Besides presenting the method, its HPC context and future developments, we describe the implications of our visualization in the considered science case: our work brilliantly showcases the stretching-and-folding mechanisms through which astrophysical processes drive turbulence and amplify the magnetic field in the interstellar gas, and how the first structures, the seeds of newborn stars, are shaped by this process. We finally observe the similarities between ray tracing and other HPC numerical techniques used in astrophysics, anticipating increasing convergences in the near future.
