Efficient Distributed Sequence Parallelism for Transformer-Based Image Segmentation

doi:10.2352/ei.2024.36.12.hpci-199

Journal Article•10.2352/ei.2024.36.12.hpci-199

Isaac Lyngaas, +7 more

- 21 Jan 2024

- IS&T International Symposium on Electron...

- Vol. 36, Iss: 12, pp 199-7

1

TL;DR: A distributed sequence parallel approach for training transformer-based image segmentation models matches standard sequential training in convergence and supports models whose tokenized inputs are too large to fit into the memory of standard computing hardware.

Abstract: We introduce an efficient distributed sequence parallel approach for training transformer-based deep learning image segmentation models. The neural network models combine a Vision Transformer encoder with a convolutional decoder to provide image segmentation mappings. The distributed sequence parallel approach is especially useful in cases where the tokenized embedding representation of the image data is too large to fit into standard computing hardware memory. To demonstrate the performance and characteristics of our models trained in sequence parallel fashion compared to standard models, we evaluate our approach using a 3D MRI brain tumor segmentation dataset. We show that training with a sequence parallel approach can match standard sequential model training in terms of convergence. Furthermore, we show that our sequence parallel approach can support training of models that would not be possible on standard computing resources.

Figures

Figure 8. Train loss, validation loss, and dice accuracy scores for training long-sequence ViT models.
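The Dice score in these curves measures the overlap between a predicted mask and a reference mask. A minimal NumPy sketch for a single binary class follows. The function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; how the paper averages the score over tumor classes is not stated here.

```python
import numpy as np

def dice_score(pred, target):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    overlap = np.logical_and(pred, target).sum()
    return 2.0 * overlap / denom

# Toy masks: one voxel overlaps, two predicted, one in the reference.
pred = np.array([1, 1, 0, 0])
target = np.array([1, 0, 0, 0])
score = dice_score(pred, target)
```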

Figure 7. Train loss, validation loss, and dice accuracy scores when using various splitting methods for the No-Gather approach for handling ViT encoder output.

Figure 1. A generic ViT using a sequence distributed across 2 GPUs with L attention layers, where the Attention Layer is outlined by a gray box. Residual_1 and Residual_2 represent the inputs for the residual connections, i.e. the inputs before being transformed by the self-attention and feed-forward layers, respectively.
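One common way to run self-attention on a sequence that is sharded across devices is to keep the query rows local and gather the keys and values from all ranks. The sketch below simulates this on a single process with NumPy. It is an assumption for illustration, not the communication scheme of the paper, which this page does not describe; the names `attention`, `q_shards`, `k_all`, and `v_all` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seq_len, dim, seq_par = 8, 4, 2
q, k, v = (rng.standard_normal((seq_len, dim)) for _ in range(3))

# Reference: attention over the whole sequence on one device.
full = attention(q, k, v)

# Simulated sequence parallelism: each rank owns a contiguous token shard.
q_shards = np.split(q, seq_par)
k_shards = np.split(k, seq_par)
v_shards = np.split(v, seq_par)

# Concatenating the shards stands in for an all-gather of keys and values.
k_all = np.concatenate(k_shards)
v_all = np.concatenate(v_shards)

# Each rank computes the output rows for its own queries only.
sharded = np.concatenate([attention(qs, k_all, v_all) for qs in q_shards])
```

Because each output row depends only on its own query and on all keys and values, `sharded` agrees with `full` up to floating-point rounding.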

Figure 2. A ViT Encoder-Convolutional Decoder network for the case of SeqPar = 2, where an Image is tokenized into patches such that lx = W/16 × D/16 × H/16. The output of the ViT is projected back into image space and then fed into multiple 3D Convolution Layers (3×3×3 filters with Upsample Scaling of 2). The initial convolution layer maps from embedding length Em to an arbitrary feature space, chosen to be 128 for this example. This output is fed into a linear projection which transforms back into the original space based on the number of segmentation classes.
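To make the sequence-length arithmetic concrete, the sketch below counts the 16×16×16 patches in a volume and the share each sequence parallel rank would hold. The 256×256×256 volume and the helper names `sequence_length` and `tokens_per_rank` are hypothetical, and an even split across ranks is assumed.

```python
def sequence_length(shape, patch=16):
    """Number of patch tokens for a (W, D, H) volume."""
    if any(size % patch for size in shape):
        raise ValueError("each dimension must be divisible by the patch size")
    w, d, h = shape
    return (w // patch) * (d // patch) * (h // patch)

def tokens_per_rank(shape, seq_par, patch=16):
    """Tokens held by each rank, assuming an even split of the sequence."""
    total = sequence_length(shape, patch)
    if total % seq_par:
        raise ValueError("sequence length must be divisible by seq_par")
    return total // seq_par

shape = (256, 256, 256)                     # hypothetical volume size
total = sequence_length(shape)              # 16 * 16 * 16 tokens
local = tokens_per_rank(shape, seq_par=2)   # half of the sequence per rank
```

Standard self-attention forms a score matrix with `total * total` entries per head, which is why long token sequences quickly exceed the memory of a single device.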

Figure 4. The distribution of an image across different numbers of sequence parallel ranks using spatial splitting. After splitting, each of these sequence-distributed images is tokenized into patches, similar to the process shown in Figure 3.
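Spatial splitting can be sketched as cutting the volume into contiguous slabs, one per rank, before any tokenization happens. The example below assumes a split along a single axis; which axes the paper actually splits for each rank count is not stated in this caption, and `spatial_split` is a hypothetical helper.

```python
import numpy as np

def spatial_split(image, seq_par, axis=0):
    """Cut a (W, D, H) volume into seq_par equal slabs along one axis."""
    if image.shape[axis] % seq_par:
        raise ValueError("axis length must be divisible by seq_par")
    return np.split(image, seq_par, axis=axis)

image = np.arange(8 * 8 * 8).reshape(8, 8, 8)
parts = spatial_split(image, seq_par=2)

# Each simulated rank now holds one slab of the original volume.
local_shapes = [p.shape for p in parts]
```

Concatenating the slabs along the same axis recovers the original image, so no voxels are lost or duplicated by the split.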

Figure 3. A (W,D,H) image being tokenized into 27 (P,P,P) patches, where P = W/3 = D/3 = H/3. Tokenized patches are ordered so that the index runs first through the Height, then the Depth, and then the Width dimension.
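The patch ordering described in this caption can be sketched with a NumPy reshape and transpose. This is an illustrative sketch, not the authors' code: `tokenize_3d` is a hypothetical name, and each patch is simply flattened into a vector rather than passed through a learned embedding.

```python
import numpy as np

def tokenize_3d(image, patch):
    """Split a (W, D, H) volume into flattened (P, P, P) patches.

    Tokens are ordered with Height varying fastest, then Depth, then Width.
    """
    w, d, h = image.shape
    if w % patch or d % patch or h % patch:
        raise ValueError("each dimension must be divisible by the patch size")
    nw, nd, nh = w // patch, d // patch, h // patch
    # Separate each axis into (patch index, offset inside the patch).
    blocks = image.reshape(nw, patch, nd, patch, nh, patch)
    # Bring the three patch indices to the front, offsets to the back.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)
    # Row-major flattening makes the last patch index (Height) vary fastest.
    return blocks.reshape(nw * nd * nh, patch ** 3)

image = np.arange(6 * 6 * 6).reshape(6, 6, 6)
tokens = tokenize_3d(image, patch=2)  # 3 patches per axis, 27 tokens
```

With this layout, token 1 is the next patch along Height, token 3 the next along Depth, and token 9 the next along Width.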

Citations

Journal Article•10.48550/arxiv.2404.09707

Adaptive Patching for High-resolution Image Segmentation with Transformers

Enzhi Zhang, +7 more

- 15 Apr 2024

- arXiv.org

TL;DR: Adaptive patching for high-resolution image segmentation with transformers significantly reduces the number of patches, improving performance and reducing computational cost.

1

References

•Book Chapter•10.1007/978-3-319-24574-4_28

U-Net: Convolutional Networks for Biomedical Image Segmentation

Olaf Ronneberger, +2 more

- 05 Oct 2015

TL;DR: Ronneberger et al. propose a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks.

92K

•Posted Content

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, +11 more

- 22 Oct 2020

- arXiv: Computer Vision and Pattern Recog...

TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

36.9K

•Proceedings Article•10.1109/3DV.2016.79

V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation

Fausto Milletari, +2 more

- 15 Jun 2016

TL;DR: This article proposes a volumetric, fully convolutional neural network (FCN) that predicts the segmentation for the whole volume at once and can deal with situations where there is a strong imbalance between the number of foreground and background voxels.

7.7K

•Posted Content

V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation

Fausto Milletari, +2 more

- 15 Jun 2016

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This work proposes an approach to 3D image segmentation based on a volumetric, fully convolutional, neural network, trained end-to-end on MRI volumes depicting prostate, and learns to predict segmentation for the whole volume at once.

5.8K

•Posted Content

Longformer: The Long-Document Transformer

Iz Beltagy, +2 more

- 10 Apr 2020

- arXiv: Computation and Language

TL;DR: Following prior work on long-sequence transformers, the Longformer is evaluated on character-level language modeling and achieves state-of-the-art results on text8 and enwik8. The authors also pretrain Longformer and finetune it on a variety of downstream tasks.

3.9K