TL;DR: PPB-MCTS is a novel distributed-memory parallel MCTS algorithm designed to significantly reduce communication overhead and maintain or improve performance in combinatorial optimization problems.
Abstract: Monte-Carlo Tree Search (MCTS) is an adaptive and heuristic tree-search algorithm designed to uncover sub-optimal actions at each decision-making point. This method progressively constructs a search tree by gathering samples throughout its execution. Predominantly applied within the realm of gaming, MCTS has exhibited exceptional achievements. Additionally, it has displayed promising outcomes when employed to solve NP-hard combinatorial optimization problems. MCTS has been adapted for distributed-memory parallel platforms. The primary challenges associated with distributed-memory parallel MCTS are the substantial communication overhead and the necessity to balance the computational load among various processes. In this work, we introduce a novel distributed-memory parallel MCTS algorithm with partial backpropagations, referred to as Parallel Partial-Backpropagation MCTS (PPB-MCTS). Our design approach aims to significantly reduce the communication overhead while maintaining, or even slightly improving, the performance in the context of combinatorial optimization problems. To address the communication overhead challenge, we propose a strategy involving transmitting an additional backpropagation message. This strategy avoids attaching an information table to the communication messages exchanged by the processes, thus reducing the communication overhead. Furthermore, this approach contributes to enhancing the decision-making accuracy during the selection phase. The load balancing issue is also effectively addressed by implementing a shared transposition table among the parallel processes. Furthermore, we introduce two primary methods for managing duplicate states within distributed-memory parallel MCTS, drawing upon techniques utilized in addressing duplicate states within sequential MCTS. Duplicate states can transform the conventional search tree into a Directed Acyclic Graph (DAG). 
To evaluate the performance of our proposed parallel algorithm, we conduct an extensive series of experiments on solving instances of the Job-Shop Scheduling Problem (JSSP) and the Weighted Set-Cover Problem (WSCP). These problems are recognized for their complexity and classified as NP-hard combinatorial optimization problems with considerable relevance within industrial applications. The experiments are performed on a cluster of computers with many cores. The empirical results highlight the enhanced scalability of our algorithm compared to that of the existing distributed-memory parallel MCTS algorithms. As the number of processes increases, our algorithm demonstrates increased rollout efficiency while maintaining an improved load balance across processes.
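The role of a transposition table in handling duplicate states can be illustrated with a small single-process sketch. This is not the PPB-MCTS algorithm itself: the names `NodeStats`, `TranspositionTable`, `ucb1`, `select_child`, and `backpropagate` are hypothetical, the selection rule is plain UCB1, and rewards are assumed to be maximized. It only shows how keying statistics by state, rather than by tree path, lets two different action orders share one node, which turns the tree into a DAG.

```python
import math


class NodeStats:
    """Visit count and accumulated reward for one state (hypothetical layout)."""

    def __init__(self):
        self.visits = 0
        self.total_reward = 0.0


class TranspositionTable:
    """Maps a hashable state key to its statistics, so duplicate states share one entry."""

    def __init__(self):
        self._table = {}

    def lookup(self, state_key):
        # Create the entry on first access; later paths to the same state reuse it.
        return self._table.setdefault(state_key, NodeStats())

    def __len__(self):
        return len(self._table)


def ucb1(child, parent_visits, c=math.sqrt(2)):
    """Standard UCB1 score; unvisited children are tried first."""
    if child.visits == 0:
        return math.inf
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(table, parent_key, child_keys):
    """Pick the child state with the highest UCB1 score."""
    parent_visits = max(table.lookup(parent_key).visits, 1)
    return max(child_keys, key=lambda k: ucb1(table.lookup(k), parent_visits))


def backpropagate(table, path_keys, reward):
    """Update every state on the path from the root to the evaluated state."""
    for key in path_keys:
        stats = table.lookup(key)
        stats.visits += 1
        stats.total_reward += reward


# Toy example: a state is the set of jobs scheduled so far (an assumption for
# illustration). Scheduling a then b, or b then a, reaches the same state.
table = TranspositionTable()
root, a, b, ab = frozenset(), frozenset("a"), frozenset("b"), frozenset("ab")
backpropagate(table, [root, a, ab], reward=1.0)
backpropagate(table, [root, b, ab], reward=0.0)
best = select_child(table, root, [a, b])
```

After the two rollouts, the shared state `ab` holds statistics from both paths, and the table contains four entries rather than the five nodes a plain tree would need.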
TL;DR: Implementing one-sided broadcast collectives in MPI RMA significantly improves performance compared to traditional methods.
Abstract: Processing large and growing volumes of data efficiently and consistently is currently a significant challenge. Traditional distributed-memory high-performance computing (HPC) systems based on the message-passing model struggle with inherent synchronization difficulties, limiting their ability to keep pace. Remote Memory Access (RMA, also known as one-sided MPI communication) allows a process to directly read from or write to the memory of another process, bypassing the need for message exchange. RMA has the potential to reduce synchronization costs by enabling concurrent access to shared data structures distributed among the memories of MPI processes. Unfortunately, the current MPI RMA standard has no collective operation interface: existing one-sided MPI standards offer only a linear interface, which hampers parallelization and is far from efficient. To bridge this gap, we propose an algorithm design for efficient collective (parallelizable) operations in the RMA paradigm. Our study primarily examines the benefits of collective operations using the broadcast algorithm as an example. Our implementations outperform traditional methods in our performance tests, demonstrating the promise of this technique.
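The difference between a linear and a parallelizable broadcast can be sketched with a small simulation. The abstract does not say which schedule the authors use, so the binomial tree below is an assumption, chosen because it is a common pattern for broadcasts. No MPI library is involved: each "put" is modeled as a copy into a plain Python list standing in for the target's memory window, and the function names are hypothetical.

```python
def binomial_broadcast_rounds(num_procs, root=0):
    """Return the put schedule as a list of rounds of (source, target) pairs.

    In each round, every process that already holds the data writes it into
    the memory of one process that does not, doubling the number of holders.
    """
    rounds = []
    have = 1  # relative ranks 0..have-1 currently hold the data
    while have < num_procs:
        puts = []
        for rel_src in range(have):
            rel_dst = rel_src + have
            if rel_dst < num_procs:
                src = (rel_src + root) % num_procs
                dst = (rel_dst + root) % num_procs
                puts.append((src, dst))
        rounds.append(puts)
        have *= 2
    return rounds


def linear_broadcast_rounds(num_procs, root=0):
    """Baseline: the root alone writes to every other process, one put per round."""
    return [[(root, dst)] for dst in range(num_procs) if dst != root]


def simulate(num_procs, value, rounds, root=0):
    """Apply a schedule to simulated memory windows and return their contents."""
    windows = [None] * num_procs
    windows[root] = value
    for puts in rounds:
        for src, dst in puts:
            # A process may only put data it has already received.
            assert windows[src] is not None
            windows[dst] = windows[src]
    return windows


tree = binomial_broadcast_rounds(8)
line = linear_broadcast_rounds(8)
result = simulate(8, 42, tree)
```

For 8 processes the tree schedule finishes in 3 rounds, while the linear schedule needs 7, because puts within one tree round can proceed concurrently.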
Billy C. Brock, Robert Cohn, Suyash Bakshi, Tuomas Kärnä, Jeongnim Kim, Mateusz Nowak, Łukasz Ślusarczyk, Kacper Stefanski, Timothy G. Mattson
30 May 2024
TL;DR: Distributed ranges provide a model for distributed data structures, algorithms, and views, enabling high-level parallel programming with interoperability and performance.
Abstract: Data structures and algorithms are essential building blocks for programs, and distributed data structures, which automatically partition data across multiple memory locales, are essential to writing high-level parallel programs. While many projects have designed and implemented C++ distributed data structures and algorithms, there has not been widespread adoption of an interoperable model allowing algorithms and data structures from different libraries to work together. This paper introduces distributed ranges, which is a model for building generic data structures, views, and algorithms. A distributed range extends a C++ range, which is an iterable sequence of values, with a concept of segmentation, thus exposing how the distributed range is partitioned over multiple memory locales. Distributed data structures provide this distributed range interface, which allows them to be used with a collection of generic algorithms implemented using the distributed range interface. The modular nature of the model allows for the straightforward implementation of distributed views, which are lightweight objects that provide a lazily evaluated view of another range. Views can be composed together recursively and combined with algorithms to implement computational kernels using efficient, flexible, and high-level standard C++ primitives. We evaluate the distributed ranges model by implementing a set of standard concepts and views as well as two execution runtimes, a multi-node, MPI-based runtime and a single-process, multi-GPU runtime. We demonstrate that high-level algorithms implemented using generic, high-level distributed ranges can achieve performance competitive with highly-tuned, expert-written code.
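The idea of a range that exposes its segmentation can be sketched outside C++ as well. The following Python sketch is only an analogy under stated assumptions: `Segment`, `DistributedVector`, `TransformView`, and `dr_reduce` are hypothetical names, the "locales" are plain integers rather than real memory locales, and nothing here reflects the actual library interface. It shows a container that reports its segments, a generic reduction written only against that segment interface, and a lazily evaluated view that preserves segmentation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Segment:
    rank: int        # hypothetical id of the memory locale that owns the data
    data: Iterable   # values stored in (or lazily derived from) this segment


class DistributedVector:
    """Block-partitions a sequence of values over a number of locales."""

    def __init__(self, values, num_locales):
        values = list(values)
        block = max(1, -(-len(values) // num_locales))  # ceiling division
        self._segments = [
            Segment(rank, values[start:start + block])
            for rank, start in enumerate(range(0, len(values), block))
        ]

    def segments(self) -> Iterator[Segment]:
        return iter(self._segments)


class TransformView:
    """Lazy view: applies fn to each element only when a segment is consumed."""

    def __init__(self, base, fn: Callable):
        self.base = base
        self.fn = fn

    def segments(self) -> Iterator[Segment]:
        for seg in self.base.segments():
            yield Segment(seg.rank, map(self.fn, seg.data))


def dr_reduce(dist_range):
    """Generic sum: one partial result per segment, then a final combine.

    In a real runtime each partial would be computed where the segment lives.
    """
    partials = [sum(seg.data) for seg in dist_range.segments()]
    return sum(partials)


vec = DistributedVector(range(10), num_locales=3)
squares = TransformView(vec, lambda x: x * x)
composed = TransformView(TransformView(vec, lambda x: x + 1), lambda x: 2 * x)
total = dr_reduce(squares)
```

Because `dr_reduce` depends only on `segments()`, it works unchanged on the container, on a view, and on views composed over other views.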
Benjamin Brock, Robert Cohn, Tuomas Karna, Jeongnim Kim, Mateusz Nowak, Timothy G. Mattson
4 Jun 2024
Abstract: Data structures and algorithms are essential building blocks for programs, and distributed data structures, which automatically partition data across multiple memory locales, are essential to writing high-level parallel programs. While many projects have designed and implemented C++ distributed data structures and algorithms, there has not been widespread adoption of an interoperable model allowing algorithms and data structures from different libraries to work together. This paper introduces distributed ranges, which is a model for building generic data structures, views, and algorithms. A distributed range extends a C++ range, which is an iterable sequence of values, with a concept of segmentation, thus exposing how the distributed range is partitioned over multiple memory locales. Distributed data structures provide this distributed range interface, which allows them to be used with a collection of generic algorithms implemented using the distributed range interface. The modular nature of the model allows for the straightforward implementation of distributed views, which are lightweight objects that provide a lazily evaluated view of another range. Views can be composed together recursively and combined with algorithms to implement computational kernels using efficient, flexible, and high-level standard C++ primitives. We evaluate the distributed ranges model by implementing a set of standard concepts and views as well as two execution runtimes, a multi-node, MPI-based runtime and a single-process, multi-GPU runtime. We demonstrate that high-level algorithms implemented using generic, high-level distributed ranges can achieve performance competitive with highly-tuned, expert-written code.