Alexander M. Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan‐Yen Lo, Piotr Dollár, Ross Girshick
1 Oct 2023
Abstract: We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive – often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at segment-anything.com to foster research into foundation models for computer vision. We recommend reading the full paper at: arxiv.org/abs/2304.02643.
TL;DR: Hu et al. as discussed by the authors proposed Swin-Unet, which is an Unet-like pure Transformer for medical image segmentation, which uses a hierarchical Swin Transformer with shifted windows as the encoder to extract context features.
Abstract: In the past few years, convolutional neural networks (CNNs) have achieved milestones in medical image analysis. In particular, deep neural networks based on U-shaped architecture and skip-connections have been widely applied in various medical image tasks. However, although CNN has achieved excellent performance, it cannot learn global semantic information interaction well due to the locality of convolution operation. In this paper, we propose Swin-Unet, which is an Unet-like pure Transformer for medical image segmentation. The tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture with skip-connections for local-global semantic feature learning. Specifically, we use a hierarchical Swin Transformer with shifted windows as the encoder to extract context features. And a symmetric Swin Transformer-based decoder with a patch expanding layer is designed to perform the up-sampling operation to restore the spatial resolution of the feature maps. Under the direct down-sampling and up-sampling of the inputs and outputs by $$4{\times }$$ , experiments on multi-organ and cardiac segmentation tasks demonstrate that the pure Transformer-based U-shaped Encoder-Decoder network outperforms those methods with full-convolution or the combination of transformer and convolution. The codes have been publicly available at the link ( https://github.com/HuCaoFighting/Swin-Unet ).
TL;DR: A deep learning segmentation model that can automatically and robustly segment all major anatomic structures on body CT images is presented and enables robust and accurate segmentation of 104 anatomic structure relevant for use cases such as organ volumetry, disease characterization, and surgical or radiation therapy planning.
Abstract: We present a deep learning segmentation model that can automatically and robustly segment all major anatomical structures in body CT images. In this retrospective study, 1204 CT examinations (from the years 2012, 2016, and 2020) were used to segment 104 anatomical structures (27 organs, 59 bones, 10 muscles, 8 vessels) relevant for use cases such as organ volumetry, disease characterization, and surgical or radiotherapy planning. The CT images were randomly sampled from routine clinical studies and thus represent a real-world dataset (different ages, pathologies, scanners, body parts, sequences, and sites). The authors trained an nnU-Net segmentation algorithm on this dataset and calculated Dice similarity coefficients (Dice) to evaluate the model's performance. The trained algorithm was applied to a second dataset of 4004 whole-body CT examinations to investigate age dependent volume and attenuation changes. The proposed model showed a high Dice score (0.943) on the test set, which included a wide range of clinical data with major pathologies. The model significantly outperformed another publicly available segmentation model on a separate dataset (Dice score, 0.932 versus 0.871, respectively). The aging study demonstrated significant correlations between age and volume and mean attenuation for a variety of organ groups (e.g., age and aortic volume; age and mean attenuation of the autochthonous dorsal musculature). The developed model enables robust and accurate segmentation of 104 anatomical structures. The annotated dataset (https://doi.org/10.5281/zenodo.6802613) and toolkit (https://www.github.com/wasserth/TotalSegmentator) are publicly available.
TL;DR: Self-attention mechanism is effective in image processing but faces challenges due to its 1D sequence nature, high complexity, and lack of channel adaptability. Large kernel attention (LKA) enables self-adaptive and long-range correlations while addressing these challenges. VAN, a neural network based on LKA, achieves comparable results with similar-sized CNNs and ViTs in various tasks.
Abstract: Abstract While originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision: (1) treating images as 1D sequences neglects their 2D structures; (2) the quadratic complexity is too expensive for high-resolution images; (3) it only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel linear attention named large kernel attention (LKA) to enable self-adaptive and long-range correlations in self-attention while avoiding its shortcomings. Furthermore, we present a neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple, VAN achieves comparable results with similar size convolutional neural networks (CNNs) and vision transformers (ViTs) in various tasks, including image classification, object detection, semantic segmentation, panoptic segmentation, pose estimation, etc. For example, VAN-B6 achieves 87.8% accuracy on ImageNet benchmark, and sets new state-of-the-art performance (58.2 PQ) for panoptic segmentation. Besides, VAN-B2 surpasses Swin-T 4 mIoU (50.1 vs. 46.1) for semantic segmentation on ADE20K benchmark, 2.6 AP (48.8 vs. 46.2) for object detection on COCO dataset. It provides a novel method and a simple yet strong baseline for the community. The code is available at https://github.com/Visual-Attention-Network .
TL;DR: Interactive foreground extraction using iterative graph cuts is a powerful tool for image editing. It combines texture and edge information to produce high-quality results.
Abstract: The problem of efficient, interactive foreground/background segmentation in still images is of great practical importance in image editing. Classical image segmentation tools use either texture (colour) information, e.g. Magic Wand, or edge (contrast) information, e.g. Intelligent Scissors. Recently, an approach based on optimization by graph-cut has been developed which successfully combines both types of information. In this paper we extend the graph-cut approach in three respects. First, we have developed a more powerful, iterative version of the optimisation. Secondly, the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result. Thirdly, a robust algorithm for ''border matting'' has been developed to estimate simultaneously the alpha-matte around an object boundary and the colours of foreground pixels. We show that for moderately difficult examples the proposed method outperforms competitive tools.
TL;DR: In this paper , the authors proposed a multi-task multi-sensor fusion framework, which unifies multi-modal features in the shared bird's-eye view representation space, which nicely preserves both geometric and semantic information.
Abstract: Multi-sensor fusion is essential for an accurate and reliable autonomous driving system. Recent approaches are based on point-level fusion: augmenting the LiDAR point cloud with camera features. However, the camera-to-LiDAR projection throws away the semantic density of camera features, hindering the effectiveness of such methods, especially for semantic-oriented tasks (such as 3D scene segmentation). In this paper, we propose BEVFusion, an efficient and generic multi-task multi-sensor fusion framework. It unifies multi-modal features in the shared bird's-eye view (BEV) representation space, which nicely preserves both geometric and semantic information. To achieve this, we diagnose and lift the key efficiency bottlenecks in the view transformation with optimized BEV pooling, reducing latency by more than $\mathbf{40}\times$ . BEVFusion is fundamentally task-agnostic and seamlessly supports different 3D perception tasks with almost no architectural changes. It establishes the new state of the art on the nuScenes benchmark, achieving 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9× lower computation cost. Code to reproduce our results is available at https://github.com/mit-han-lab/bevfusion.
Zhu Li, Xinjiang Wang, Zhanghan Ke, Wayne Zhang, Rynson W. H. Lau
1 Jun 2023
TL;DR: BiFormer is a vision transformer that utilizes bi-level routing attention to alleviate the high computation burden and heavy memory footprint of traditional attention mechanisms.
Abstract: As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint as pairwise token interaction across all spatial locations is computed. A series of works attempt to alleviate this problem by introducing handcrafted and content-agnostic sparsity into attention, such as restricting the attention operation to be inside local windows, axial stripes, or dilated windows. In contrast to these approaches, we propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness. Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of remaining candidate regions (i.e., routed regions). We provide a simple yet effective implementation of the proposed bilevel routing attention, which utilizes the sparsity to save both computation and memory while involving only GPU-friendly dense matrix multiplications. Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented. As BiFormer attends to a small subset of relevant tokens in a query adaptive manner without distraction from other irrelevant ones, it enjoys both good performance and high computational efficiency, especially in dense prediction tasks. Empirical results across several computer vision tasks such as image classification, object detection, and semantic segmentation verify the effectiveness of our design. Code is available at https://github.com/rayleizhu/BiFormer.
TL;DR: InternImage is a new large-scale CNN-based foundation model that achieves comparable performance to ViTs on challenging benchmarks. It utilizes deformable convolutions to learn stronger and more robust patterns from massive data.
Abstract: Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has the adaptive spatial aggregation conditioned by input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, andADE20K. It is worth mentioning that InternImage-H achieved a new record 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs.
TL;DR: The Segment Anything (SA) dataset as mentioned in this paper is the largest dataset for image segmentation, with over 1 billion masks on 11M licensed and privacy-preserving images and is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks.
Abstract: We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision.
TL;DR: In this article , a Differentiable Binarization (DB) module that integrates the binarization process, one of the most important steps in the post-processing procedure, into a segmentation network is proposed.
Abstract: Recently, segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field, because of their superiority in detecting the text instances of arbitrary shapes and extreme aspect ratios, profiting from the pixel-level descriptions. However, the vast majority of the existing segmentation-based approaches are limited to their complex post-processing algorithms and the scale robustness of their segmentation models, where the post-processing algorithms are not only isolated to the model optimization but also time-consuming and the scale robustness is usually strengthened by fusing multi-scale feature maps directly. In this paper, we propose a Differentiable Binarization (DB) module that integrates the binarization process, one of the most important steps in the post-processing procedure, into a segmentation network. Optimized along with the proposed DB module, the segmentation network can produce more accurate results, which enhances the accuracy of text detection with a simple pipeline. Furthermore, an efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively. By incorporating the proposed DB and ASF with the segmentation network, our proposed scene text detector consistently achieves state-of-the-art results, in terms of both detection accuracy and speed, on five standard benchmarks.
TL;DR: Wang et al. as discussed by the authors proposed a multi-stage architecture for action segmentation, where the first stage generates an initial prediction that is refined by the next ones. But their architecture still suffers from a small receptive field, and they propose a dual dilated layer that combines both large and small receptive fields.
Abstract: With the success of deep learning in classifying short trimmed videos, more attention has been focused on temporally segmenting and classifying activities in long untrimmed videos. State-of-the-art approaches for action segmentation utilize several layers of temporal convolution and temporal pooling. Despite the capabilities of these approaches in capturing temporal dependencies, their predictions suffer from over-segmentation errors. In this paper, we propose a multi-stage architecture for the temporal action segmentation task that overcomes the limitations of the previous approaches. The first stage generates an initial prediction that is refined by the next ones. In each stage we stack several layers of dilated temporal convolutions covering a large receptive field with few parameters. While this architecture already performs well, lower layers still suffer from a small receptive field. To address this limitation, we propose a dual dilated layer that combines both large and small receptive fields. We further decouple the design of the first stage from the refining stages to address the different requirements of these stages. Extensive evaluation shows the effectiveness of the proposed model in capturing long-range dependencies and recognizing action segments. Our models achieve state-of-the-art results on three datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
TL;DR: SAM exhibits promising zero-shot segmentation performance for certain medical imaging datasets, but its performance varies significantly across datasets and tasks.
Abstract: Training segmentation models for medical images continues to be challenging due to the limited availability of data annotations. Segment Anything Model (SAM) is a foundation model trained on over 1 billion annotations, predominantly for natural images, that is intended to segment user-defined objects of interest in an interactive manner. While the model performance on natural images is impressive, medical image domains pose their own set of challenges. Here, we perform an extensive evaluation of SAM’s ability to segment medical images on a collection of 19 medical imaging datasets from various modalities and anatomies. In our experiments, we generated point and box prompts for SAM using a standard method that simulates interactive segmentation. We report the following findings: (1) SAM’s performance based on single prompts highly varies depending on the dataset and the task, from IoU=0.1135 for spine MRI to IoU=0.8650 for hip X-ray. (2) Segmentation performance appears to be better for well-circumscribed objects with prompts with less ambiguity such as the segmentation of organs in computed tomography and poorer in various other scenarios such as the segmentation of brain tumors. (3) SAM performs notably better with box prompts than with point prompts. (4) SAM outperforms similar methods RITM, SimpleClick, and FocalClick in almost all single-point prompt settings. (5) When multiple-point prompts are provided iteratively, SAM’s performance generally improves only slightly while other methods’ performance improves to the level that surpasses SAM’s point-based performance. We also provide several illustrations for SAM’s performance on all tested datasets, iterative segmentation, and SAM’s behavior given prompt ambiguity. We conclude that SAM shows impressive zero-shot segmentation performance for certain medical imaging datasets, but moderate to poor performance for others. SAM has the potential to make a significant impact in automated medical image segmentation in medical imaging, but appropriate care needs to be applied when using it. Code for evaluation SAM is made publicly available at https://github.com/mazurowski-lab/segment-anything-medical-evaluation.
TL;DR: PIDNet is a novel three-branch network architecture for real-time semantic segmentation that alleviates overshoot issues inherent to two-branch networks.
Abstract: Two-branch network architecture has shown its efficiency and effectiveness in real-time semantic segmentation tasks. However, direct fusion of high-resolution details and low-frequency context has the drawback of detailed features being easily overwhelmed by surrounding contextual information. This overshoot phenomenon limits the improvement of the segmentation accuracy of existing two-branch models. In this paper, we make a connection between Convolutional Neural Networks (CNN) and Proportional-Integral-Derivative (PID) controllers and reveal that a two-branch network is equivalent to a Proportional-Integral (PI) controller, which inherently suffers from similar overshoot issues. To alleviate this problem, we propose a novel three-branch network architecture: PIDNet, which contains three branches to parse detailed, context and boundary information, respectively, and employs boundary attention to guide the fusion of detailed and context branches. Our family of PIDNets achieve the best trade-off between inference speed and accuracy and their accuracy surpasses all the existing models with similar inference speed on the Cityscapes and CamVid datasets. Specifically, PIDNet-S achieves 78.6% mIOU with inference speed of 93.2 FPS on Cityscapes and 80.1% mIOU with speed of 153.7 FPS on CamVid.
TL;DR: In this article , the authors present a survey of different visual Transformers comprehensively according to three fundamental CV tasks and different data stream types, where taxonomy is proposed to organize the representative methods according to their motivations, structures, and application scenarios.
Abstract: Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by such significant achievements, some pioneering works have recently been done on employing Transformer-liked architectures in the computer vision (CV) field, which have demonstrated their effectiveness on three fundamental CV tasks (classification, detection, and segmentation) as well as multiple sensory data stream (images, point clouds, and vision-language data). Because of their competitive modeling capabilities, the visual Transformers have achieved impressive performance improvements over multiple benchmarks as compared with modern convolution neural networks (CNNs). In this survey, we have reviewed over 100 of different visual Transformers comprehensively according to three fundamental CV tasks and different data stream types, where taxonomy is proposed to organize the representative methods according to their motivations, structures, and application scenarios. Because of their differences on training settings and dedicated vision tasks, we have also evaluated and compared all these existing visual Transformers under different configurations. Furthermore, we have revealed a series of essential but unexploited aspects that may empower such visual Transformers to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between the visual Transformers and the sequential ones. Finally, two promising research directions are suggested for future investment. We will continue to update the latest articles and their released source codes at.
TL;DR: CMX is a unified framework for RGB-X semantic segmentation that generalizes well across different modalities and achieves state-of-the-art performance on various benchmarks.
Abstract: Scene understanding based on image segmentation is a crucial component of autonomous vehicles. Pixel-wise semantic segmentation of RGB images can be advanced by exploiting complementary features from the supplementary modality ( ${X}$ -modality). However, covering a wide variety of sensors with a modality-agnostic model remains an unresolved problem due to variations in sensor characteristics among different modalities. Unlike previous modality-specific methods, in this work, we propose a unified fusion framework, CMX, for RGB-X semantic segmentation. To generalize well across different modalities, that often include supplements as well as uncertainties, a unified cross-modal interaction is crucial for modality fusion. Specifically, we design a Cross-Modal Feature Rectification Module (CM-FRM) to calibrate bi-modal features by leveraging the features from one modality to rectify the features of the other modality. With rectified feature pairs, we deploy a Feature Fusion Module (FFM) to perform sufficient exchange of long-range contexts before mixing. To verify CMX, for the first time, we unify five modalities complementary to RGB, i.e., depth, thermal, polarization, event, and LiDAR. Extensive experiments show that CMX generalizes well to diverse multi-modal fusion, achieving state-of-the-art performances on five RGB-Depth benchmarks, as well as RGB-Thermal, RGB-Polarization, and RGB-LiDAR datasets. Besides, to investigate the generalizability to dense-sparse data fusion, we establish an RGB-Event semantic segmentation benchmark based on the EventScape dataset, on which CMX sets the new state-of-the-art. The source code of CMX is publicly available at https://github.com/huaaaliu/RGBX_Semantic_ Segmentation.
Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M. Ni, Heung‐Yeung Shum
1 Jun 2023
TL;DR: Mask DINO is a unified transformer-based framework for object detection and segmentation that significantly outperforms existing specialized methods.
Abstract: In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, and scalable, and it can benefit from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K) among models under one billion parameters. Code is available at https://github.com/IDEA-Research/MaskDINO.
TL;DR: EVA explores the limits of masked visual representation learning at scale, setting new records on various vision tasks and demonstrating its effectiveness as a vision-centric multi-modal pivot.
Abstract: We launch EVA, a vision-centric foundation model to Explore the limits of Visual representation at scAle using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and sets new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training. Moreover, we observe quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on LVIS dataset with over a thousand categories and COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training from scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models.
TL;DR: SynthSeg as discussed by the authors proposes a domain randomization strategy where they fully randomise the contrast and resolution of the synthetic training data, which enables straightforward analysis of huge amounts of heterogeneous clinical data.
TL;DR: In this paper , a fusion model is proposed with the integration of the U-Net and Convolution Neural Network model for skin disease classification, which has outperformed on Adadelta optimizer with an accuracy value of 97.96%.
Abstract: Skin is one of the most significant organs, which serves as a barrier to the outside surroundings of the human body. To improve mortality, skin disease detection at a prior stage is necessary else it may convert to skin cancer. But its diagnosis at the prior stage that may increase life expectancy is a great experiment since it has a similar look to skin diseases. To deal with biomedical images, a new innovative automated system is required that can quickly and precisely identify skin lesions. Deep Learning is attracting a lot of attention in the treatment of numerous disorders. A fusion model is proposed here with the integration of the U-Net and Convolution Neural Network model. For this, the U-Net model has been utilized to segment the diseases using skin images and the Convolution Neural Network model has been proposed for the multi-class classification of segmented images. The model is simulated and analyzed using the HAM10000 dataset having 10,015 dermoscopy images of seven different classifications of skin diseases. The proposed model has been analyzed using two optimizers i.e. Adam and Adadelta on 20 epochs and 32 batch sizes for the skin disease classification. The model has outperformed on Adadelta optimizer with an accuracy value of 97.96%.
Feng Li, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yue Zhao, Hang Zhang, Peizhao Zhang, P. Vajda, Diana Marculescu
1 Jun 2023
TL;DR: Open-vocabulary semantic segmentation with Mask-adapted CLIP achieves state-of-the-art performance by finetuning CLIP on a collection of masked image regions and their corresponding text descriptions.
Abstract: Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image regions to nouns in the image captions. Compared with the more precise and manually annotated segmentation labels with fixed classes (e.g., COCO-Stuff), we find our noisy but diverse dataset can better retain CLIP's generalization ability. Along with finetuning the entire model, we utilize the “blank” areas in masked images using a method we dub mask prompt tuning. Experiments demonstrate mask prompt tuning brings significant improvement without modifying any weights of CLIP, and it can further improve a fully finetuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art. For the first time, open-vocabulary generalist models match the performance of supervised specialist models in 2017 without dataset specific adaptations.
Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi
1 Jun 2023
TL;DR: OneFormer is a universal image segmentation framework that unifies segmentation with a multi-task train-once design. It outperforms specialized Mask2Former models across all three segmentation tasks.
Abstract: Universal Image Segmentation is not a new concept. Past attempts to unify image segmentation include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on the semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better intertask and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20k, Cityscapes, and COCO, despite the latter being trained on each task individually. We believe OneFormer is a significant step towards making image segmentation more universal and accessible.
TL;DR: Zhang et al. as discussed by the authors proposed a hierarchical encoder-decoder network with two appealing designs: 1) a feed-forward network in transformer block of U-shaped encoderdecoder structure is redesigned, ReMix-FFN, which explore global dependencies and local context for better feature discrimination by re-integrating the local context and global dependencies.
Abstract: Transformer-based methods are recently popular in vision tasks because of their capability to model global dependencies alone. However, it limits the performance of networks due to the lack of modeling local context and global-local correlations of multi-scale features. In this paper, we present MISSFormer, a Medical Image Segmentation tranSFormer. MISSFormer is a hierarchical encoder-decoder network with two appealing designs: 1) a feed-forward network in transformer block of U-shaped encoder-decoder structure is redesigned, ReMix-FFN, which explore global dependencies and local context for better feature discrimination by re-integrating the local context and global dependencies; 2) a ReMixed Transformer Context Bridge is proposed to extract the correlations of global dependencies and local context in multi-scale features generated by our hierarchical transformer encoder. The MISSFormer shows a solid capacity to capture more discriminative dependencies and context in medical image segmentation. The experiments on multi-organ, cardiac segmentation and retinal vessel segmentation tasks demonstrate the superiority, effectiveness and robustness of our MISSFormer. Specifically, the experimental results of MISSFormer trained from scratch even outperform state-of-the-art methods pre-trained on ImageNet, and the core designs can be generalized to other visual segmentation tasks. The code has been released on Github: https://github.com/ZhifangDeng/MISSFormer.
TL;DR: In this article , the authors propose Hiformer, a novel method that efficiently bridges a CNN and a transformer for medical image segmentation, which designs two multi-scale feature representations using the seminal Swin Transformer module and a CNN-based encoder.
Abstract: Convolutional neural networks (CNNs) have been the consensus for medical image segmentation tasks. However, they suffer from the limitation in modeling long-range dependencies and spatial correlations due to the nature of convolution operation. Although transformers were first developed to address this issue, they fail to capture low-level features. In contrast, it is demonstrated that both local and global features are crucial for dense prediction, such as segmenting in challenging contexts. In this paper, we propose HiFormer, a novel method that efficiently bridges a CNN and a transformer for medical image segmentation. Specifically, we design two multi-scale feature representations using the seminal Swin Transformer module and a CNN-based encoder. To secure a fine fusion of global and local features obtained from the two aforementioned representations, we propose a Double-Level Fusion (DLF) module in the skip connection of the encoder-decoder structure. Extensive experiments on various medical image segmentation datasets demonstrate the effectiveness of HiFormer over other CNN-based, transformer-based, and hybrid methods in terms of computational complexity, quantitative and qualitative results. Our code is publicly available at GitHub.
TL;DR: In this article , the authors summarized the transformer-based segmentation models in the abdominal organs, heart, brain, and lung based on the relevant studies in the last two years and found that the combination of U-Net and transformer is more suitable for segmentation.
TL;DR: In this article , the authors proposed a lightweight transformer+CNN model, in which the encoder sub-network is a two-path design that enables both the global dependence of image features and the low layer spatial details to be effectively captured.
Lihe Yang, Qi Liu, Litong Feng, Wayne Zhang, Yinghuan Shi
1 Jun 2023
TL;DR: Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation achieves competitive results against recent advanced works, but relies heavily on manual data augmentation design. To address this, the paper proposes Unified Dual-Stream Perturbations approach (UniMatch) that expands the perturbation space and probes original image-level augmentations.
Abstract: In this work, we revisit the weak-to-strong consistency framework, popularized by FixMatch from semi-supervised classification, where the prediction of a weakly perturbed image serves as supervision for its strongly perturbed version. Intriguingly, we observe that such a simple pipeline already achieves competitive results against recent advanced works, when transferred to our segmentation scenario. Its success heavily relies on the manual design of strong data augmentations, however, which may be limited and inadequate to explore a broader perturbation space. Motivated by this, we propose an auxiliary feature perturbation stream as a supplement, leading to an expanded perturbation space. On the other, to sufficiently probe original image-level augmentations, we present a dual-stream perturbation technique, enabling two strong views to be simultaneously guided by a common weak view. Consequently, our overall Unified Dual-Stream Perturbations approach (UniMatch) surpasses all existing methods significantly across all evaluation protocols on the Pascal, Cityscapes, and COCO benchmarks. Its superiority is also demonstrated in remote sensing interpretation and medical image analysis. We hope our reproduced FixMatch and our results can inspire more future works.
TL;DR: DDRNet as discussed by the authors proposes a family of deep dual-resolution networks (DDRNets) for real-time and accurate semantic segmentation, which consist of two deep branches and multiple bilateral fusions of backbones generate higher quality details compared to existing two-pathway methods.
Abstract: Using light-weight architectures or reasoning on low-resolution images, recent methods realize very fast scene parsing, even running at more than 100 FPS on a single GPU. However, there is still a significant gap in performance between these real-time methods and the models based on dilation backbones. To this end, we proposed a family of deep dual-resolution networks (DDRNets) for real-time and accurate semantic segmentation, which consist of deep dual-resolution backbones and enhanced low-resolution contextual information extractors. The two deep branches and multiple bilateral fusions of backbones generate higher quality details compared to existing two-pathway methods. The enhanced contextual information extractor named Deep Aggregation Pyramid Pooling Module (DAPPM) enlarges effective receptive fields and fuses multi-scale context based on low-resolution feature maps with little time cost. Our method achieves a new state-of-the-art trade-off between accuracy and speed on both Cityscapes and CamVid dataset. For the input of full resolution, on a single 2080Ti GPU without hardware acceleration, DDRNet-23-slim yields 77.4% mIoU at 102 FPS on Cityscapes test set and 74.7% mIoU at 230 FPS on CamVid test set. With widely used test augmentation, our method is superior to most state-of-the-art models and requires much less computation. Codes and trained models are available at https://github.com/ydhongHIT/DDRNet .
TL;DR: Efficient Pyramid Squeeze Attention (EPSA) as discussed by the authors replaces the 3 × 3 convolution with the PSA module in the bottleneck blocks of the ResNet, a novel representational block named EPSA is obtained.
Abstract: Recently, it has been demonstrated that the performance of a deep convolutional neural network can be effectively improved by embedding an attention module into it. In this work, a novel lightweight and effective attention method named Pyramid Squeeze Attention (PSA) module is proposed. By replacing the 3 $$\,\times \,$$ 3 convolution with the PSA module in the bottleneck blocks of the ResNet, a novel representational block named Efficient Pyramid Squeeze Attention (EPSA) is obtained. The EPSA block can be easily added as a plug-and-play component into a well-established backbone network, and significant improvements on model performance can be achieved. Hence, a simple and efficient backbone architecture named EPSANet is developed in this work by stacking these ResNet-style EPSA blocks. Correspondingly, a stronger multi-scale representation ability can be offered by the proposed EPSANet for various computer vision tasks including but not limited to, image classification, object detection, instance segmentation, etc. Without bells and whistles, the performance of the proposed EPSANet outperforms most of the state-of-the-art channel attention methods. As compared to the SENet-50, the Top-1 accuracy is improved by 1.93 $$\%$$ on ImageNet dataset, a larger margin of +2.7 box AP for object detection and an improvement of +1.7 mask AP for instance segmentation by using the Mask-RCNN on MS-COCO dataset are obtained.