Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level. Alpa implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Sparse tensors are rapidly becoming critical components of modern deep learning workloads. However, developing high-performance sparse operators can be difficult and tedious, and existing vendor libraries cannot satisfy the escalating demands from new operators. Sparse tensor compilers simplify the development of operators, but efficient sparse compilation for deep learning remains challenging because a single sparse format cannot maximize hardware efficiency, and single-shot compilers cannot keep up with latest hardware and system advances. In this paper, we observe that the key to addressing both these challenges is to leverage composable formats and composable transformations. We propose SparseTIR, a sparse tensor compilation abstraction that offers composable formats and composable transformations for deep learning workloads. SparseTIR constructs a search space over these composable components for performance tuning. With these improvements, SparseTIR obtains consistent performance speedups vs vendor libraries on GPUs for single operators: 1.20-2.34x for GNN operators, 1.05-2.98x for sparse attention operators, and 0.56-7.45x for sparse convolution operators. SparseTIR also accelerates end-to-end GNNs by 1.08-1.52x for GraphSAGE training, and 4.20-40.18x for RGCN inference.

/pdf/sparsetir-composable-abstractions-for-sparse-compilation-in-303h2ogz.pdf

SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning

Graph neural networks (GNNs) have emerged as a series of competent graph learning methods for diverse real-world scenarios, ranging from daily applications like recommendation systems and question answering to cutting-edge technologies such as drug discovery in life sciences and n-body simulation in astrophysics. However, task performance is not the only requirement for GNNs. Performance-oriented GNNs have exhibited potential adverse effects like vulnerability to adversarial attacks, unexplainable discrimination against disadvantaged groups, or excessive resource consumption in edge computing environments. To avoid these unintentional harms, it is necessary to build competent GNNs characterised by trustworthiness. To this end, we propose a comprehensive roadmap to build trustworthy GNNs from the view of the various computing technologies involved. In this survey, we introduce basic concepts and comprehensively summarise existing efforts for trustworthy GNNs from six aspects, including robustness, explainability, privacy, fairness, accountability, and environmental well-being. Additionally, we highlight the intricate cross-aspect relations between the above six aspects of trustworthy GNNs. Finally, we present a thorough overview of trending directions for facilitating the research and industrialisation of trustworthy GNNs.

Trustworthy Graph Neural Networks: Aspects, Methods and Trends

When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must configure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system parameters (e.g. the number of workers and their communication topology) impact training performance. In current systems, adapting such parameters during training is ill-supported. Users must set system parameters at deployment time, and provide fixed adaptation schedules for hyper-parameters in the training program. We describe KungFu, a distributed ML library for TensorFlow that is designed to enable adaptive training. KungFu allows users to express high-level Adaptation Policies (APs) that describe how to change hyperand system parameters during training. APs take real-time monitored metrics (e.g. signal-to-noise ratios and noise scale) as input and trigger control actions (e.g. cluster rescaling or synchronisation strategy updates). For execution, APs are translated into monitoring and control operators, which are embedded in the dataflow graph. APs exploit an efficient asynchronous collective communication layer, which ensures concurrency and consistency of monitoring and adaptation operations.

/pdf/kungfu-making-training-in-distributed-machine-learning-3homn7cmte.pdf

KungFu: Making Training in Distributed Machine Learning Adaptive

Graph neural networks (GNNs) have achieved breakthrough performance in graph analytics such as node classification, link prediction and graph clustering. Many GNN training frameworks have been developed, but they are usually designed as a set of manually written, GNN-specific operators plugged into existing deep learning systems, which incurs high memory consumption, poor data locality, and large semantic gap between algorithm design and implementation. This paper proposes the Seastar system, which presents a vertex-centric programming model for GNN training on GPU and provides idiomatic python constructs to enable easy development of novel homogeneous and heterogeneous GNN models. We also propose novel optimizations to produce highly efficient fused GPU kernels for forward and backward passes in GNN training. Compared with the state-of-the art GNN systems, DGL and PyG, Seastar achieves better usability, up to 2 and 8 times less memory consumption, and 14 and 3 times faster execution, respectively.

Seastar: vertex-centric programming for graph neural networks

A good parallelization strategy can significantly improve the efficiency or reduce the cost for the distributed training of deep neural networks (DNNs). Recently, several methods have been proposed to find efficient parallelization strategies but they all optimize a single objective (e.g., execution time, memory consumption) and produce only one strategy. We propose FT, an efficient algorithm that searches for an optimal set of parallelization strategies to allow the trade-off among different objectives. FT can adapt to different scenarios by minimizing the memory consumption when the number of devices is limited and fully utilize additional resources to reduce the execution time. For popular DNN models (e.g., vision, language), an in-depth analysis is conducted to understand the trade-offs among different objectives and their influence on the parallelization strategies. We also develop a user-friendly system, called TensorOpt, which allows users to run their distributed DNN training jobs without caring the details of parallelization strategies. Experimental results show that FT runs efficiently and provides accurate estimation of runtime costs, and TensorOpt is more flexible in adapting to resource availability compared with existing frameworks.

TensorOpt: Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism.

Multi-tenant GPU clusters are common, where users purchase GPU quota to run their neural network training jobs. However, strict quota-based scheduling often leads to cluster under-utilization, while allowing quota groups to use excess GPUs improves utilization but results in fairness problems. We propose PPS, a probabilistic prediction based scheduler, which uses job history statistics to predict future cluster status for making good scheduling decisions. Different from existing schedulers that rely on deep learning frameworks to adjust bad scheduling decisions and/or require detailed job information, PPS treats jobs as black boxes in that PPS runs a job to completion without adjustment once scheduled and requires only aggregate job statistics. The black-box feature is favorable due to its good generality, compatibility and security, and made possible by the predictability of aggregate resource utilization statistics of large clusters. Extensive experiments show that PPS achieves high cluster utilization and good fairness simultaneously. 

PPS: Fair and efficient black-box scheduling for multi-tenant GPU clusters

Embedding-based deep recommendation models (EDRMs), which contain small dense models and large embedding tables, are widely used in industry. Embedding communication constitutes the main cost for the distributed training of EDRMs, and thus we propose two strategies to improve its efficiency, i.e.,embedding tiering andpre-fetching. In particular, embedding tiering uses AllReduce to communicate popular embeddings that are accessed frequently. This is counter-intuitive as embeddings belong to the sparse embedding tables, but reasonable because the access pattern of popular embeddings resembles dense models. Pre-fetching starts communication early for embeddings that receive no updates such that they are removed from the critical path of training. We implement embedding tiering and pre-fetching in a system called FEC and compare it with the state-of-the-art systems on real datasets. The results show that FEC consistently outperforms the existing methods on all datasets, and its speed can be up to 6.65x and 2.42x in terms of embedding communication time and training throughput compared with the best performing baseline.

FEC: Efficient Deep Recommendation Model Training with Flexible Embedding Communication

We study how to support elasticity, i.e., the ability to dynamically adjust the parallelism (number of GPUs), for deep neural network (DNN) training. Elasticity can benefit multi-tenant GPU cluster management in many ways, e.g., achieving various scheduling objectives (e.g., job throughput, job completion time, GPU efficiency) according to cluster load variations, maximizing the use of transient idle resources, performance profiling, job migration, and straggler mitigation. However, existing parallelism adjustment strategies incur high overheads, which hinder many applications from making effective use of elasticity. We propose EDL to enable low-overhead elastic deep learning with a simple API. We present techniques that are necessary to reduce the overhead of parallelism adjustments, such as stop-free scaling and dynamic data pipeline. We also demonstrate that EDL can indeed bring significant benefits to the above-listed applications in GPU cluster management.

Elastic deep learning in multi-tenant GPU cluster

Handle everyday research tasks with reliable, citation-backed results

Your personal Research Agent to handle research tasks with citation-backed results

Popular Tasks used by Researchers

How can I help with your research?

Meet SciSpace

Get more enhanced response by uploading the PDFs you want me to reference.

No relevant PDFs in your library

SciSpace is the AI research assistant for academics. Run systematic literature reviews on 280M+ papers, and write papers with cited sources. Trusted by 1M+ students, PhDs & researchers.

SciSpace | AI for Research

Analyze PDFs

Code & Manuscripts

Funding & Grants

Literature & Patents

Medical & Clinical Data

Systematic Review

Visualize & Present

Web & Data

Build a Google Scholar-like website for your research.

Build a website

Create charts and images for your research

Create a Chart

Write a paper for submission to a journal

Draft a manuscript

Patent Search

Design eye-catching scientific posters in minutes.

Scientific Poster Generation

Systematic Literature Review

One task is running at the moment. Your messages will be shown right after.

Drag and drop or click here to browse

Loved by <highlight>1 million+</highlight> researchers

Extract a list of specific topics and their sources from unstructured text

Topics

Compare and analyze relevant papers that matches with your search

Papers

Get insights from PDFs and bookmarked papers from your library

My library

Recent searches

Try searching for:

Catch AI-generated content in scholarly and non-scholarly content

{ai} Detector

Ai Writer

Get PDF Summaries, highlighted text explanations 

Chat with PDF

Effortlessly create in-text citations and bibliographies in APA and 2,500 other formats

Citation generator

Get explanations, summaries, and answers on academic papers

Ease up your research workflow with {scispace}'s cohort of exciting AI tools

Elevate your academic writing skills and convey your ideas the way you want

Paraphraser

Explore our range of reading and writing tools

Your file is being prepared and should be ready in a few minutes. If it's a large file, it might take a bit longer. You can close this window, and we'll email you the file when it's done.

You have reached a maximum limit of <strong>{limit}</strong> columns in the table. Remove at least <strong>1</strong> column to add or create another one.

Kaihao Ma

Author Tools

Chat about Author