Journal Article, DOI: 10.2352/ei.2024.36.12.hpci-199
Efficient Distributed Sequence Parallelism for Transformer-Based Image Segmentation
Isaac Lyngaas, Murali Gopalakrishnan Meena, Evan Calabrese, Mohamed Wahib, Peng Chen, Jun Igarashi, Yuankai Huo, Xiao Wang
TL;DR: A distributed sequence parallel approach lets transformer-based image segmentation models train on tokenized inputs that are too large for standard computing hardware memory, while matching the convergence of standard sequential training.
Abstract: We introduce an efficient distributed sequence parallel approach for training transformer-based deep learning image segmentation models. The neural network models combine a Vision Transformer encoder with a convolutional decoder to produce image segmentation mappings. The approach is especially useful when the tokenized embedding representation of the image data is too large to fit into standard computing hardware memory. To compare the performance and characteristics of models trained in sequence parallel fashion against standard models, we evaluate our approach on a 3D MRI brain tumor segmentation dataset. We show that training with a sequence parallel approach can match standard sequential model training in terms of convergence. Furthermore, we show that our sequence parallel approach can support training of models that would not be possible on standard computing resources.
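The idea of splitting a token sequence across devices can be illustrated with a small single-process NumPy sketch. It is not the paper's implementation: the abstract does not describe the communication scheme, so the sketch assumes one common pattern in which every rank keeps only its shard of queries and the keys and values are all-gathered. The names `sequence_parallel_attention`, `seq_par`, `wq`, `wk`, and `wv` are hypothetical, the attention is single-head, and the "ranks" are simulated as list entries rather than real GPUs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention (single head).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def sequence_parallel_attention(x, wq, wk, wv, seq_par):
    # Assumption: the sequence length is divisible by seq_par.
    shards = np.split(x, seq_par, axis=0)      # one token shard per simulated rank
    k_local = [s @ wk for s in shards]         # each rank projects its own tokens
    v_local = [s @ wv for s in shards]
    # Stand-in for an all-gather: every rank receives all keys and values.
    k_full = np.concatenate(k_local, axis=0)
    v_full = np.concatenate(v_local, axis=0)
    # Each rank attends from its local queries to the full key/value set,
    # so its score matrix is (seq_len / seq_par) x seq_len.
    return [attention(s @ wq, k_full, v_full) for s in shards]

rng = np.random.default_rng(0)
seq_len, embed = 8, 4
x = rng.standard_normal((seq_len, embed))
wq, wk, wv = (rng.standard_normal((embed, embed)) for _ in range(3))

reference = attention(x @ wq, x @ wk, x @ wv)
sharded = np.concatenate(
    sequence_parallel_attention(x, wq, wk, wv, seq_par=2), axis=0
)
matches = np.allclose(reference, sharded)
```

Under these assumptions the sharded result agrees numerically with unsharded attention, which is consistent with the abstract's claim that sequence parallel training can match standard training in convergence. The memory benefit in this sketch comes from each rank holding only its slice of the activations and of the attention score matrix.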
Figures

Figure 1. A generic ViT using a sequence distributed across 2 GPUs with L attention layers, where the attention layer is outlined by a gray box. Residual_1 and Residual_2 represent the inputs for the residual connections, i.e. the inputs before being transformed by the self-attention and feed-forward layers, respectively.
Figure 2. A ViT encoder-convolutional decoder network for the case of SeqPar = 2, where an image is tokenized into patches such that l_x = (W/16) × (D/16) × (H/16). The output of the ViT is projected back into image space and then fed into multiple 3D convolution layers (3×3×3 filters with an upsample scaling of 2). The initial convolution layer maps from embedding length E_m to an arbitrary feature space, chosen to be 128 for this example. This output is fed into a linear projection that transforms back into the original space based on the number of segmentation classes.
Figure 3. A (W, D, H) image being tokenized into 27 (P, P, P) patches, where P = W/3 = D/3 = H/3. Tokenized patches are ordered so that the index runs first through the Height dimension, then Depth, and then Width.
Figure 4. The distribution of an image across different numbers of sequence parallel ranks using spatial splitting. After splitting, each of these sequence-distributed images is tokenized into patches, similar to the process shown in Figure 3.
Figure 7. Train loss, validation loss, and Dice accuracy scores when using various splitting methods for the No-Gather approach for handling ViT encoder output.
Figure 8. Train loss, validation loss, and Dice accuracy scores for training long-sequence ViT models.
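The patch tokenization described in the Figure 3 caption can be sketched in NumPy as follows. This is an illustrative reading of the caption, not the authors' code: it assumes the image is stored as an array of shape (W, D, H), that "first indexes through Height" means Height is the fastest-varying patch index, and that each patch is flattened into one token. The function name `tokenize_3d` and the toy 6×6×6 image are hypothetical.

```python
import numpy as np

def tokenize_3d(image, patch):
    # image has shape (W, D, H); each side must be a multiple of the patch size.
    w, d, h = image.shape
    assert w % patch == 0 and d % patch == 0 and h % patch == 0
    nw, nd, nh = w // patch, d // patch, h // patch
    # Split every axis into (patch index, offset inside the patch).
    blocks = image.reshape(nw, patch, nd, patch, nh, patch)
    # Bring the three patch indices to the front: (nw, nd, nh, P, P, P).
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5)
    # Row-major flattening makes the Height index vary fastest,
    # then Depth, then Width; each patch becomes one flat token.
    return blocks.reshape(nw * nd * nh, patch ** 3)

# Toy example: a 6x6x6 image with P = 2 gives 3 * 3 * 3 = 27 tokens of length 8.
image = np.arange(6 * 6 * 6).reshape(6, 6, 6)
tokens = tokenize_3d(image, patch=2)
```

With this ordering, token 1 is the patch adjacent to token 0 along Height, token 3 is the next patch along Depth, and token 9 is the next patch along Width. The same function with `patch=16` gives the sequence length l_x = (W/16) × (D/16) × (H/16) stated in the Figure 2 caption.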
Citations
Adaptive Patching for High-resolution Image Segmentation with Transformers
Enzhi Zhang, Isaac Lyngaas, Peng Chen, Xiao Wang, Jun Igarashi, Yuankai Huo, Mohamed Wahib, Masaharu Munetomo
TL;DR: Adaptive patching for high-resolution image segmentation with transformers significantly reduces the number of patches, improving performance and reducing computational cost.