TL;DR: The topology, routing and flow control are described to characterize the performance of the network that serves as the fabric for a large-scale parallel machine learning system with up to 10,440 TSPs and more than 2 TeraBytes of global memory accessible in less than 3 microseconds of end-to-end system latency.
Abstract: We describe our novel commercial software-defined approach for large-scale interconnection networks of tensor streaming processing (TSP) elements. The system architecture includes packaging, routing, and flow control of the interconnection network of TSPs. We describe the communication and synchronization primitives of a bandwidth-rich substrate for global communication. This scalable communication fabric provides the backbone for large-scale systems based on a software-defined Dragonfly topology, ultimately yielding a parallel machine learning system with elasticity to support a variety of workloads, both training and inference. We extend the TSP's producer-consumer stream programming model to include global memory which is implemented as logically shared, but physically distributed SRAM on-chip memory. Each TSP contributes 220 MiBytes to the global memory capacity, with the maximum capacity limited only by the network's scale --- the maximum number of endpoints in the system. The TSP acts as both a processing element (endpoint) and network switch for moving tensors across the communication links. We describe a novel software-controlled networking approach that avoids the latency variation introduced by dynamic contention for network links. We describe the topology, routing and flow control to characterize the performance of the network that serves as the fabric for a large-scale parallel machine learning system with up to 10,440 TSPs and more than 2 TeraBytes of global memory accessible in less than 3 microseconds of end-to-end system latency.
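The capacity figure quoted above can be checked with a short calculation. This is only a sketch of the arithmetic: the function name and the assumption that every TSP contributes exactly 220 MiB to global memory are ours.

```python
# Back-of-the-envelope check of the global memory capacity quoted in the abstract.
MIB = 2**20  # bytes per MiB


def global_memory_bytes(num_tsps: int, mib_per_tsp: int = 220) -> int:
    """Total SRAM (in bytes) contributed to global memory by num_tsps endpoints.

    Hypothetical helper: assumes each TSP contributes the full 220 MiB.
    """
    return num_tsps * mib_per_tsp * MIB


total = global_memory_bytes(10_440)
print(f"{total / 2**40:.2f} TiB")  # binary units, prints 2.19 TiB
print(f"{total / 10**12:.2f} TB")  # decimal units, prints 2.41 TB
```

Under either unit convention the maximum configuration exceeds 2 terabytes, consistent with the abstract.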
TL;DR: In this article, the authors make the case that memory disaggregation (MD) makes distributed shared-memory databases (DSM-DB) promising, since MD can bring many advantages, e.g., higher memory utilization, better independent scaling (of compute and memory), and lower cost of ownership.
Abstract: Memory disaggregation (MD) allows for scalable and elastic data center design by separating compute (CPU) from memory. With MD, compute and memory are no longer coupled into the same server box. Instead, they are connected to each other via ultra-fast networking such as RDMA. MD can bring many advantages, e.g., higher memory utilization, better independent scaling (of compute and memory), and lower cost of ownership. This paper makes the case that MD can fuel the next wave of innovation on database systems. We observe that MD revives the great debate of "shared what" in the database community. We envision that distributed shared-memory databases (DSM-DB, for short) - that have not received much attention before - can be promising in the future with MD. We present a list of challenges and opportunities that can inspire next steps in system design making the case for DSM-DB.
TL;DR: A systematic literature review with state-of-the-art research on the application of parallel processing and shared/distributed techniques to determine communities for social network analysis is presented in this article.
Abstract: Community detection in social networks is the process of identifying the cohesive groups of similar nodes. Detection of these groups can be helpful in many applications, such as finding networks of protein interaction in biological networks, finding like-minded users for ads and suggestions, finding a shared research field in collaborative networks, analyzing public health, future link prediction in social networks, analyzing criminology, and many more. However, with the increase in the number of profiles and content shared on social media platforms, the analysis is often time-consuming and exhaustive. In order to speed up and optimize the community detection process, parallel processing and shared/distributed memory techniques are widely used. Although community detection has widespread use in social networks, no attempt has been made to compile and systematically discuss research efforts on the emerging subject of parallel and distributed methods for community detection in social networks. Most surveys describe the serial algorithms used for community detection. Our survey work comes under the scope of new design techniques, exciting or novel applications, components or standards, and applications of an educational, transactional, and co-operational nature. This paper presents a systematic literature review of state-of-the-art research on the application of parallel processing and shared/distributed techniques to determine communities for social network analysis. An advanced search strategy was applied to several digital libraries to extract studies for the review. The systematic search found 3220 studies, among which 65 relevant studies were selected for further review after several screening phases. The application of parallel computing, shared memory, and distributed memory to the existing community detection methodologies is discussed thoroughly.
More specifically, the central significance of this paper is that a systematic literature review is conducted to gather the relevant studies from different digital libraries and gray literature. Then, different parametric values of each selected study are appropriately compared. Moreover, the need for further research to speed up the process of community formation in parallel and shared approaches has been pinpointed more suitably.
TL;DR: Ginkgo’s memory accessor is leveraged in order to integrate a communication-reduction strategy into the (Krylov) GMRES solver that decouples the storage format of the orthogonal basis from the arithmetic precision that is employed during the operations with that basis.
Abstract: Krylov methods provide a fast and highly parallel numerical tool for the iterative solution of many large-scale sparse linear systems. To a large extent, the performance of practical realizations of these methods is constrained by the communication bandwidth in current computer architectures, motivating the investigation of sophisticated techniques to avoid, reduce, and/or hide the message-passing costs (in distributed platforms) and the memory accesses (in all architectures). This article leverages Ginkgo’s memory accessor in order to integrate a communication-reduction strategy into the (Krylov) GMRES solver that decouples the storage format (i.e., the data representation in memory) of the orthogonal basis from the arithmetic precision that is employed during the operations with that basis. Given that the execution time of the GMRES solver is largely determined by the memory accesses, the cost of the datatype transforms can be mostly hidden, resulting in the acceleration of the iterative step via a decrease in the volume of bits being retrieved from memory. Together with the special properties of the orthonormal basis (whose elements are all bounded by 1), this paves the road toward the aggressive customization of the storage format, which includes some floating-point as well as fixed-point formats with mild impact on the convergence of the iterative process. We develop a high-performance implementation of the “compressed basis GMRES” solver in the Ginkgo sparse linear algebra library using a large set of test problems from the SuiteSparse Matrix Collection. We demonstrate robustness and performance advantages on a modern NVIDIA V100 graphics processing unit (GPU) of up to 50% over the standard GMRES solver that stores all data in IEEE double-precision.
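The idea of decoupling the storage format from the arithmetic precision can be illustrated with the standard library alone. The sketch below is not Ginkgo's memory accessor; `compress` and `dot_in_double` are hypothetical helpers. A unit-norm vector is stored in IEEE single precision, and each entry is widened to double precision when it is read for a dot product.

```python
from array import array
import math


def compress(vec):
    """Store a vector in IEEE single precision (typically 4 bytes per entry)."""
    return array("f", vec)


def dot_in_double(stored, w):
    """Read the compact storage and accumulate the dot product in double precision.

    Entries of array('f') are widened to Python floats (doubles) on access,
    so only the memory footprint is reduced, not the arithmetic precision.
    """
    return math.fsum(v * x for v, x in zip(stored, w))


# Build a unit-norm vector, so every entry is bounded by 1 in magnitude.
n = 1000
raw = [math.sin(i + 1) for i in range(n)]
norm = math.sqrt(math.fsum(x * x for x in raw))
v = [x / norm for x in raw]

v32 = compress(v)                      # compact storage format
w = [math.cos(i) for i in range(n)]    # operand kept in double precision

exact = math.fsum(a * b for a, b in zip(v, w))
approx = dot_in_double(v32, w)
error = abs(exact - approx)
print(v32.itemsize, array("d", v).itemsize)  # bytes per entry: compact vs. double
print(error)                                 # small, bounded by the storage rounding
```

Halving the bytes per entry halves the memory traffic for the basis, which is the effect the abstract exploits. The rounding error introduced by the storage format stays small because the entries are bounded.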
TL;DR: In this article, a distributed-memory implementation of the Louvain method for serial community detection on heterogeneous multi-GPU systems has been presented, which can be extended to many other iterative graph algorithms.
TL;DR: It is demonstrated how TTG can address issues without sacrificing scalability or programmability by providing higher-level abstractions than conventionally provided by task-centric programming systems, without impeding the ability of these runtimes to manage task creation and execution as well as data and resource management efficiently.
Abstract: We present and evaluate TTG, a novel programming model and its C++ implementation that by marrying the ideas of control and data flowgraph programming supports compact specification and efficient distributed execution of dynamic and irregular applications. Programming interfaces that support task-based execution often only support shared memory parallel environments; a few support distributed memory environments, either by discovering the entire DAG of tasks on all processes, or by introducing explicit communications. The first approach limits scalability, while the second increases the complexity of programming. We demonstrate how TTG can address these issues without sacrificing scalability or programmability by providing higher-level abstractions than conventionally provided by task-centric programming systems, without impeding the ability of these runtimes to manage task creation and execution as well as data and resource management efficiently. TTG supports distributed memory execution over 2 different task runtimes, PaRSEC and MADNESS. Performance of four paradigmatic applications (in graph analytics, dense and block-sparse linear algebra, and numerical integrodifferential calculus) with various degrees of irregularity implemented in TTG is illustrated on large distributed-memory platforms and compared to the state-of-the-art implementations.
TL;DR: In this paper, a parallel scheme for a multi-domain truly incompressible smoothed particle hydrodynamics (SPH) approach is presented, which is developed for distributed memory architectures using the Message Passing Interface (MPI) for communication between partitions.
TL;DR: PIPECG-OATI provides a large overlap of global communication with independent computations at higher core counts by combining iterations and by introducing new recurrence and non-recurrence computations.
TL;DR: TriC is a distributed-memory implementation of graph triangle counting using the Message Passing Interface (MPI), which was featured in the 2020 Graph Challenge competition; this work adopts a user-defined buffering strategy to overcome the startup problem for large graphs.
Abstract: Graphs are ubiquitous in modeling complex systems and representing interactions between entities to uncover structural information of the domain. Traditionally, graph analytics workloads are challenging to efficiently scale (both strong and weak cases) on distributed memory due to the irregular memory-access driven nature (with little or no computations) of the methods. The structure of graphs and their relative distribution over the processing elements poses another level of complexity, making it difficult to attain sustainable scalability across platforms. In this paper, we discuss enhancements to TriC, a distributed-memory implementation of graph triangle counting using the Message Passing Interface (MPI), which was featured in the 2020 Graph Challenge competition. We have made some incremental enhancements to TriC, primarily adopting a user-defined buffering strategy to overcome the startup problem for large graphs (by fixing the memory for intermediate data), and experimenting with probabilistic data structures such as Bloom filters to improve the query response time for assessing edge existence, at the expense of increasing the overall false positive rate. These adjustments have led to modest improvements in most cases, as compared to the previous version.
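The trade-off mentioned above, faster edge-existence queries at the cost of false positives, can be sketched with a small Bloom filter. This is an illustrative stand-alone example, not TriC's implementation; the class name, the bit-array size, and the number of hash functions are all assumptions.

```python
import hashlib


class EdgeBloomFilter:
    """Approximate set of undirected edges: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, u, v):
        a, b = (u, v) if u <= v else (v, u)  # normalize: (u, v) and (v, u) are the same edge
        key = f"{a},{b}".encode()
        for seed in range(self.num_hashes):
            # A different salt per seed gives independent-looking hash functions.
            digest = hashlib.blake2b(key, digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, u, v):
        for pos in self._positions(u, v):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, u, v):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(u, v))


edges = [(0, 1), (1, 2), (2, 0)]
bf = EdgeBloomFilter()
for u, v in edges:
    bf.add(u, v)

print(bf.might_contain(1, 0))  # an inserted edge is always reported as present
```

A query touches a fixed number of bits regardless of graph size. A `True` answer may be a false positive, so exact results still require a check against the real edge list.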
TL;DR: In this paper, a hybrid MPI/OpenMP implementation of an eigensolver written in Fortran 90 was used to accelerate the solution of large eigenvalue problems arising from many-body calculations in nuclear physics on distributed-memory parallel systems equipped with general-purpose Graphics Processing Units (GPUs).
TL;DR: In this article, the problem of finding the best combination of data and computation mapping that results in low communication overhead is formulated as a constrained optimization problem using Lagrange multipliers, and a heuristic is used to decide which constraints to leave unsatisfied (based on the penalty of increased communication incurred in doing so).
Abstract: Distributed memory parallel computers offer enormous computation power, scalability and flexibility. However, these machines are difficult to program and this limits their widespread use. An important characteristic of these machines is the difference in the access time for data in local versus non-local memory; non-local memory accesses are much slower than local memory accesses. This is also a characteristic of shared memory machines but to a lesser degree. Therefore it is essential that, as far as possible, the data that needs to be accessed by a processor during the execution of the computation assigned to it reside in its local memory rather than in some other processor's memory. Several research projects have concluded that proper mapping of data is key to realizing the performance potential of distributed memory machines. Current language design efforts such as Fortran D and High Performance Fortran (HPF) are based on this. It is our thesis that for many practical codes, it is possible to derive good mappings through a combination of algorithms and systematic procedures. We view mapping as consisting of two phases, alignment followed by distribution. For the alignment phase we present three constraint-based methods--one based on a linear programming formulation of the problem; the second formulates the alignment problem as a constrained optimization problem using Lagrange multipliers; the third method uses a heuristic to decide which constraints to leave unsatisfied (based on the penalty of increased communication incurred in doing so) in order to find a mapping. In addressing the distribution phase, we have developed two methods that integrate the placement of computation--loop nests in our case--with the mapping of data. For one distributed dimension, our approach finds the best combination of data and computation mapping that results in low communication overhead; this is done by choosing a loop order that allows message vectorization.
In the second method, we introduce the distribution preference graph and the operations on this graph allow us to integrate loop restructuring transformations and data mapping. These techniques produce mappings that have been used in efficient hand-coded implementations of several benchmark codes.
TL;DR: A hybrid semantic storage model called CS-SDM is presented, using a new CS-based SDM design as the cleaning memory for a Binary Sparse Distributed Representation (BSDR) of the holistic data.
Abstract: Sparse Distributed Memory (SDM) and Binary Distributed Representations (BDR), as two phenomenological approaches to biological memory modelling, have a lot of common features. The idea of their integration in a hybrid semantic storage model with SDM used as the low (brain cell) level cleaning memory for BDR used as the high-level symbolic information coder seems natural. The hybrid semantic storage must be able to memorize holistic data (like the structures of interconnected and serialized key-value pairs) in a neural network. Such a model has been proposed several times since the 1990s. However, the earlier proposed models are not practical because of insufficient scalability and/or low storage density. The gap between SDM and BDR can be filled using the results of a third theory dealing with sparse signals: Compressive Sensing or Sampling (CS). Such a hybrid semantic storage model is presented. We call it CS-SDM to reflect using a new CS-based SDM design as the cleaning memory for a Binary Sparse Distributed Representation (BSDR) of the holistic data. CS-SDM has been implemented on GPU and demonstrated much better capacity and denoising capabilities than classical SDM designs.
TL;DR: This work presents a communication-efficient distributed graph algorithm for finding connected components that scales to massively parallel machines and tackles the data irregularities introduced by high degree vertices by using an efficient procedure for distributing their incident edges.
Abstract: Finding the connected components of an undirected graph is one of the most fundamental graph problems. Connected components are used in a wide spectrum of applications including VLSI design, machine learning and image analysis. Sequentially, one can easily find all connected components in linear time using breadth-first traversal. However, in a massively distributed setting, finding connected components in a scalable way becomes much harder due to data irregularities and the overhead associated with the increased need for communication. In this work, we present a communication-efficient distributed graph algorithm for finding connected components that scales to massively parallel machines. Our algorithm is based on a recent linear-work shared-memory parallel algorithm by Blelloch et al. [1] and refines it for a distributed memory setting. This includes a communication-efficient graph contraction procedure, as well as a distributed variant of the low diameter decomposition by Miller et al. [2]. We tackle the data irregularities introduced by high degree vertices by using an efficient procedure for distributing their incident edges. Our experimental evaluation on up to 16384 cores indicates a good weak scaling behavior that outperforms current state-of-the-art algorithms.
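The sequential baseline mentioned above, linear-time connected components via breadth-first traversal, can be sketched as follows. This is the textbook single-machine algorithm, not the distributed algorithm the paper proposes; the function name and the edge-list input format are assumptions.

```python
from collections import deque


def connected_components(num_vertices, edges):
    """Label each vertex with a component id using BFS, in O(V + E) time."""
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:  # undirected graph: record both directions
        adj[u].append(v)
        adj[v].append(u)

    label = [-1] * num_vertices  # -1 marks an unvisited vertex
    count = 0
    for start in range(num_vertices):
        if label[start] != -1:
            continue
        label[start] = count
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if label[w] == -1:
                    label[w] = count
                    queue.append(w)
        count += 1
    return label


labels = connected_components(6, [(0, 1), (1, 2), (3, 4)])
print(labels)  # [0, 0, 0, 1, 1, 2]
```

Each vertex and edge is visited a constant number of times, which gives the linear bound. The traversal order is inherently sequential, which is part of why the distributed setting needs a different approach.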
TL;DR: Using software parameters, these researchers developed a new distributed shared memory design that outperforms conventional structural design in managing a distributed shared address space.
Abstract: In multiprocessor networks, shared distributed memory is a popular research subject. It has been studied by scientists and engineers from a variety of fields. To construct high-performance, large-scale multiprocessor systems, it is necessary to maximise distributed shared memory (SDM) performance. SDM algorithms, locking of shared space, thrashing, concurrent access, page faults, extension, transparency, huge database support and cost have all been the subject of extensive investigation in the past. Memory structure has a substantial impact on the system's power consumption and performance; hence a low-power system is needed. There are a number of problems with the design of shared and private memory systems for multiple processors that this dissertation seeks to address. Using software parameters, these researchers developed a new distributed shared memory design that outperforms conventional structural design in managing a distributed shared address space. The primary goal of this study is to use a software method to create a novel distributed shared memory architecture.
TL;DR: Two proofs-of-concept for distributed-memory parallel approaches based on the Futhark functional programming language are presented and most of the second-order array combinators of the language are implemented, and ways to go beyond this naive memory model are proposed.
Abstract: In this paper, we present two proofs-of-concept for distributed-memory parallel approaches based on the Futhark functional programming language. Futhark is an array-based language generating high-performance code for CPU and GPU back-ends, leveraging shared-memory parallelization techniques. While the code generated by Futhark is extremely efficient, it lacks the capability to be distributed among several computing nodes, which is necessary in many engineering applications (computational fluid mechanics, meteorology, etc.). To this aim, it is desirable to add an MPI back-end to the Futhark compiler. In order to test the feasibility of a new compiler back-end, we implemented a C library wrapping Futhark kernels and handling a multi-block decomposition and communications. This library showed very promising performance and speedup results in the case of stencil-based algorithms. It thus allowed the initiation of the second part of our project: the implementation of a complete compiler back-end for the Futhark language. In this first attempt, we are using a naive memory model that has the advantage of simplicity at the cost of low efficiency. We show that we implemented most of the second-order array combinators of the language, which are the abstractions responsible for the vast majority of its parallelization capabilities, and we propose ways to go beyond our naive memory model.
TL;DR: A distributed-memory parallel Gauss--Seidel (dmpGS) is proposed by implementing a parallel sparse triangular solver (stSpike) based on the Spike algorithm that significantly improves the scalability of dmpGS.
Abstract: Gauss--Seidel (GS) is a widely used iterative method for solving sparse linear systems of equations and also known to be effective as a smoother in algebraic multigrid methods. Parallelization of GS is a challenging task since solving the sparse lower triangular system in GS constitutes a sequential bottleneck at each iteration. We propose a distributed-memory parallel GS (dmpGS) by implementing a parallel sparse triangular solver (stSpike) based on the Spike algorithm. stSpike decouples the global triangular system into smaller systems that can be solved concurrently and requires the solution of a much smaller reduced sparse lower triangular system which constitutes a sequential bottleneck. In order to alleviate this bottleneck and to reduce the communication overhead of dmpGS, we propose a partitioning and reordering model consisting of two phases. The first phase is a novel hypergraph partitioning model whose partitioning objective simultaneously encodes minimizing the reduced system size and the communication volume. The second phase is an in-block row reordering method for decreasing the nonzero count of the reduced system. Extensive experiments on a dataset consisting of 359 sparse linear systems verify the effectiveness of the proposed partitioning and reordering model in terms of reducing the communication and the sequential computational overheads. Parallel experiments on 12 large systems using up to 320 cores demonstrate that the proposed model significantly improves the scalability of dmpGS.
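The sequential bottleneck described above comes from the lower triangular solve inside each GS iteration. The sketch below shows a plain, serial Gauss-Seidel sweep on a small dense matrix; it is not dmpGS or stSpike, and the dense list-of-lists storage and function names are assumptions chosen for brevity.

```python
def gauss_seidel_sweep(A, b, x):
    """One GS iteration: solve (D + L) x_new = b - U x, with A = L + D + U."""
    n = len(b)
    # Right-hand side uses the strictly upper part U and the old iterate.
    rhs = [b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n)) for i in range(n)]
    # Forward substitution on the lower triangle: x_new[i] depends on all
    # x_new[j] with j < i, which is the sequential dependency.
    x_new = [0.0] * n
    for i in range(n):
        s = sum(A[i][j] * x_new[j] for j in range(i))
        x_new[i] = (rhs[i] - s) / A[i][i]
    return x_new


def gauss_seidel(A, b, sweeps=50):
    """Run a fixed number of sweeps starting from the zero vector."""
    x = [0.0] * len(b)
    for _ in range(sweeps):
        x = gauss_seidel_sweep(A, b, x)
    return x


# Diagonally dominant test system whose exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
x = gauss_seidel(A, b)
print(x)
```

The loop over `i` in the forward substitution cannot be parallelized directly, since each unknown needs the ones computed before it. This is the dependency that a parallel triangular solver has to break up.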
TL;DR: In this chapter, the authors discuss general parallel design principles that largely apply to both shared-memory style and message-passing style programming, as well as to task-centric programs.
Abstract: Parallel programming is challenging. There are many parts interacting in a complex manner: algorithm-imposed dependency, scheduling on multiple execution units, synchronization, data communication capacity, network topology, memory bandwidth limit, cache performance in the presence of multiple independent threads accessing memory, program scalability, heterogeneity of hardware, and so on. It is useful to understand each of these aspects separately. We discuss general parallel design principles in this chapter. These ideas largely apply to both shared-memory style and message-passing style programming, as well as task-centric programs.
TL;DR: In this article, the authors propose two parallel implementation methods of partitioning operations on distributed memory systems (DMSs): distributed cluster tree construction (DCTC) and redundant cluster tree construction (RCTC).
Abstract: A hierarchical matrix (H-matrix) is an approximated form that represents N × N correlations of N objects. H-matrix construction is achieved by dividing a matrix into submatrices (partitioning), followed by calculating these submatrices' element values (filling). Matrix partitioning consists of two steps: cluster tree (CT) construction, where objects are divided into clusters hierarchically; and block cluster tree (BCT) construction, which involves observing all cluster pairs at the same CT level that satisfies the admissibility condition. This study proposes two parallel implementation methods of partitioning operations on distributed memory systems (DMSs): distributed cluster tree construction (DCTC) and redundant cluster tree construction (RCTC). In DCTC, both CT and BCT constructions are parallelized using workers in all computing nodes. In RCTC, CT is constructed in every computing node redundantly by employing only intra-node work stealing. The BCT is then constructed in parallel using workers in all computing nodes. RCTC cannot achieve speedup using multiple computing nodes, but can eliminate the data exchange cost incurred by DCTC. We used the task-parallel language Tascell, which employs both intra- and inter-node work stealing, to handle arbitrary unbalanced tree construction and traversal on DMSs. Our RCTC implementations achieved a 1.11-1.20-fold speedup using up to 8 nodes × 36 workers in numerical experiments with 3D electric field analyses and N ≃ 108.
TL;DR: In this article, a scalable algorithm for solving compact banded linear systems on distributed memory architectures is presented, which is particularly useful for solving the linear systems arising from the application of compact finite difference operators to a wide range of partial differential equation problems, such as but not limited to the numerical simulations of compressible turbulent flows, aeroacoustics, elastic plastic wave propagation, and electromagnetics.
TL;DR: The randUTV algorithm builds on the ideas that underpin randomized singular value decomposition (RSVD); optimized implementations are presented for both shared-memory and distributed-memory computing environments.
Abstract: Randomized singular value decomposition (RSVD) is by now a well-established technique for efficiently computing an approximate singular value decomposition of a matrix. Building on the ideas that underpin RSVD, the recently proposed algorithm “randUTV” computes a full factorization of a given matrix that provides low-rank approximations with near-optimal error. Because the bulk of randUTV is cast in terms of communication-efficient operations such as matrix-matrix multiplication and unpivoted QR factorizations, it is faster than competing rank-revealing factorization methods such as column-pivoted QR in most high-performance computational settings. In this article, optimized randUTV implementations are presented for both shared-memory and distributed-memory computing environments. For shared memory, randUTV is redesigned in terms of an algorithm-by-blocks that, together with a runtime task scheduler, eliminates bottlenecks from data synchronization points to achieve acceleration over the standard blocked algorithm based on a purely fork-join approach. The distributed-memory implementation is based on the ScaLAPACK library. The performance of our new codes compares favorably with competing factorizations available on both shared-memory and distributed-memory architectures.
TL;DR: In this paper, a source-to-source automatic parallelizing compiler that expresses parallelism with the DVMH directive-based programming model is presented for almost affine accesses in loop nests for distributed memory parallel architectures.
Abstract: We present new techniques for compilation of sequential programs for almost affine accesses in loop nests for distributed-memory parallel architectures. Our approach is implemented as a source-to-source automatic parallelizing compiler that expresses parallelism with the DVMH directive-based programming model. Compared to all previous approaches ours addresses all three main sub-problems of the problem of distributed memory parallelization: data and computation distribution and communication optimization. Parallelization of sequential programs with structured grid computations is considered. In this paper, we use the NAS Parallel Benchmarks to evaluate the performance of generated programs and provide experimental results on up to 9 nodes of a computational cluster with two 8-core processors in a node.
TL;DR: H2Opus is a performance-oriented package that supports a broad variety of hierarchical matrix operations on CPUs and GPUs, including matrix-vector multiplication and matrix recompression.
Abstract: Hierarchical ${\mathscr{H}}^{2}$ -matrices are asymptotically optimal representations for the discretizations of non-local operators such as those arising in integral equations or from kernel functions. Their O(N) complexity in both memory and operator application makes them particularly suited for large-scale problems. As a result, there is a need for software that provides support for distributed operations on these matrices to allow large-scale problems to be represented. In this paper, we present high-performance, distributed-memory GPU-accelerated algorithms and implementations for matrix-vector multiplication and matrix recompression of hierarchical matrices in the ${\mathscr{H}}^{2}$ format. The algorithms are a new module of H2Opus, a performance-oriented package that supports a broad variety of ${\mathscr{H}}^{2}$ matrix operations on CPUs and GPUs. Performance in the distributed GPU setting is achieved by marshaling the tree data of the hierarchical matrix representation to allow batched kernels to be executed on the individual GPUs. MPI is used for inter-process communication. We optimize the communication data volume and hide much of the communication cost with local compute phases of the algorithms. Results show near-ideal scalability up to 1024 NVIDIA V100 GPUs on Summit, with performance exceeding 2.3 Tflop/s/GPU for the matrix-vector multiplication, and 670 Gflop/s/GPU for matrix compression, which involves batched QR and SVD operations. We illustrate the flexibility and efficiency of the library by solving a 2D variable diffusivity integral fractional diffusion problem with an algebraic multigrid-preconditioned Krylov solver and demonstrate scalability up to 16M degrees of freedom problems on 64 GPUs.
TL;DR: In this article, the authors propose the concept of a distributed data persistency (DDP) model, which is the binding of the memory persistency model with the data consistency model in a distributed system.
Abstract: Distributed applications such as key-value stores and databases provide fault tolerance by replicating records in the memories of different nodes, and using data consistency protocols to ensure consistency across replicas. In this environment, nonvolatile memory (NVM) offers the ability to attain high-performance data durability. However, it is unclear how to tie NVM memory persistency models to the existing data consistency frameworks. In this article, we propose the concept of distributed data persistency (DDP) model, which is the binding of the memory persistency model with the data consistency model in a distributed system. We reason about the interaction between consistency and persistency by using the concepts of visibility point and durability point. We design low-latency distributed protocols for several DDP models, and investigate the tradeoffs between performance, durability, and intuition provided to the programmer.
TL;DR: Jiang et al. propose an automated parallel training system that combines the advantages of both data and model parallelism; it makes trade-offs between memory consumption and hardware utilization, automatically generating the distributed computation graph and maximizing the overall system throughput.
Abstract: Large-scale deep learning models contribute to significant performance improvements on varieties of downstream tasks. Current data and model parallelism approaches utilize model replication and partition techniques to support the distributed training of ultra-large models. However, directly deploying these systems often leads to sub-optimal training efficiency due to the complex model architectures and the strict device memory constraints. In this paper, we propose Optimal Sharded Data Parallel (OSDP), an automated parallel training system that combines the advantages from both data and model parallelism. Given the model description and the device information, OSDP makes trade-offs between the memory consumption and the hardware utilization, thus automatically generates the distributed computation graph and maximizes the overall system throughput. In addition, OSDP introduces operator splitting to further alleviate peak memory footprints during training with negligible overheads, which enables the trainability of larger models as well as the higher throughput. Extensive experimental results of OSDP on multiple different kinds of large-scale models demonstrate that the proposed strategy outperforms the state-of-the-art in multiple regards. Our code is available at https://github.com/Youhe-Jiang/OptimalShardedDataParallel.
TL;DR: In this paper, the development of a parallelizing compiler for computer systems with distributed memory is discussed; transforming sequential programs for distributed memory requires new compiler functions, a need that is becoming topical for future computer systems with hundreds of cores or more.
Abstract: This paper is concerned with the development of a parallelizing compiler for computer systems with distributed memory. Industrial parallelizing compilers create programs for shared memory systems. Transformation of sequential programs onto systems with distributed memory requires development of new functions. This is becoming topical for future computer systems with hundreds of cores or more.