TL;DR: ConvNeXt as discussed by the authors is a family of pure ConvNet models, which compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation.
Abstract: The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.
TL;DR: Transformer networks as mentioned in this paper enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM).
Abstract: Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works.
TL;DR: UNETR as discussed by the authors utilizes a transformer encoder to learn sequence representations of the input volume and effectively capture the global multi-scale information, while also following the successful U-shaped network design for the encoder and decoder.
Abstract: Fully Convolutional Neural Networks (FCNNs) with contracting and expanding paths have shown prominence for the majority of medical image segmentation applications since the past decade. In FCNNs, the encoder plays an integral role by learning both global and local features and contextual representations which can be utilized for semantic output prediction by the decoder. Despite their success, the locality of convolutional layers in FCNNs, limits the capability of learning long-range spatial dependencies. Inspired by the recent success of transformers for Natural Language Processing (NLP) in long-range sequence learning, we reformulate the task of volumetric (3D) medical image segmentation as a sequence-to-sequence prediction problem. We introduce a novel architecture, dubbed as UNEt TRansformers (UNETR), that utilizes a transformer as the encoder to learn sequence representations of the input volume and effectively capture the global multi-scale information, while also following the successful "U-shaped" network design for the encoder and decoder. The transformer encoder is directly connected to a decoder via skip connections at different resolutions to compute the final semantic segmentation output. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation and the Medical Segmentation Decathlon (MSD) dataset for brain tumor and spleen segmentation tasks. Our benchmarks demonstrate new state-of-the-art performance on the BTCV leaderboard.
TL;DR: Mask2former as discussed by the authors proposes Masked-Attention Mask Transformer (Mask2Transformer), which extracts localized features by constraining cross-attention within predicted mask regions. But it is not suitable for instance segmentation.
Abstract: Image segmentation groups pixels with different semantics, e.g., category or instance membership. Each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing spe-cialized architectures for each task. We present Masked- attention Mask Transformer (Mask2Former), a new archi-tecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components in-clude masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most no-tably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU onADE20K).
TL;DR: Wang et al. as mentioned in this paper proposed a novel segmentation model termed Swin UNEt TRansformers (Swin UNETR), which reformulated the task of 3D brain tumor semantic segmentation as a sequence to sequence prediction problem.
Abstract: Semantic segmentation of brain tumors is a fundamental medical image analysis task involving multiple MRI imaging modalities that can assist clinicians in diagnosing the patient and successively studying the progression of the malignant entity. In recent years, Fully Convolutional Neural Networks (FCNNs) approaches have become the de facto standard for 3D medical image segmentation. The popular “U-shaped” network architecture has achieved state-of-the-art performance benchmarks on different 2D and 3D semantic segmentation tasks and across various imaging modalities. However, due to the limited kernel size of convolution layers in FCNNs, their performance of modeling long-range information is sub-optimal, and this can lead to deficiencies in the segmentation of tumors with variable sizes. On the other hand, transformer models have demonstrated excellent capabilities in capturing such long-range information in multiple domains, including natural language processing and computer vision. Inspired by the success of vision transformers and their variants, we propose a novel segmentation model termed Swin UNEt TRansformers (Swin UNETR). Specifically, the task of 3D brain tumor semantic segmentation is reformulated as a sequence to sequence prediction problem wherein multi-modal input data is projected into a 1D sequence of embedding and used as an input to a hierarchical Swin transformer as the encoder. The swin transformer encoder extracts features at five different resolutions by utilizing shifted windows for computing self-attention and is connected to an FCNN-based decoder at each resolution via skip connections. We have participated in BraTS 2021 segmentation challenge, and our proposed model ranks among the top-performing approaches in the validation phase. Code: https://monai.io/research/swin-unetr .
TL;DR: Zhang et al. as discussed by the authors proposed complete IoU (CIoU) loss and cluster-NMS for enhancing geometric factors in both bounding-box regression and non-maximum suppression (NMS), leading to notable gains of average precision (AP) and average recall (AR), without the sacrifice of inference efficiency.
Abstract: Deep learning-based object detection and instance segmentation have achieved unprecedented progress. In this article, we propose complete-IoU (CIoU) loss and Cluster-NMS for enhancing geometric factors in both bounding-box regression and nonmaximum suppression (NMS), leading to notable gains of average precision (AP) and average recall (AR), without the sacrifice of inference efficiency. In particular, we consider three geometric factors, that is: 1) overlap area; 2) normalized central-point distance; and 3) aspect ratio, which are crucial for measuring bounding-box regression in object detection and instance segmentation. The three geometric factors are then incorporated into CIoU loss for better distinguishing difficult regression cases. The training of deep models using CIoU loss results in consistent AP and AR improvements in comparison to widely adopted $\ell _{n}$ -norm loss and IoU-based loss. Furthermore, we propose Cluster-NMS, where NMS during inference is done by implicitly clustering detected boxes and usually requires fewer iterations. Cluster-NMS is very efficient due to its pure GPU implementation, and geometric factors can be incorporated to improve both AP and AR. In the experiments, CIoU loss and Cluster-NMS have been applied to state-of-the-art instance segmentation (e.g., YOLACT and BlendMask-RT), and object detection (e.g., YOLO v3, SSD, and Faster R-CNN) models. Taking YOLACT on MS COCO as an example, our method achieves performance gains as +1.7 AP and +6.2 AR 100 for object detection, and +1.1 AP and +3.5 AR 100 for instance segmentation, with 27.1 FPS on one NVIDIA GTX 1080Ti GPU. All the source code and trained models are available at https://github.com/Zzh-tju/CIoU .
TL;DR: In this article , a human-in-the-loop pipeline for rapid prototyping of new custom segmentation models is proposed. But the pipeline does not allow users to adapt the segmentation style to their specific needs and can perform suboptimally for test images that are very different from the training images.
Abstract: Pretrained neural network models for biological segmentation can provide good out-of-the-box results for many image types. However, such models do not allow users to adapt the segmentation style to their specific needs and can perform suboptimally for test images that are very different from the training images. Here we introduce Cellpose 2.0, a new package that includes an ensemble of diverse pretrained models as well as a human-in-the-loop pipeline for rapid prototyping of new custom models. We show that models pretrained on the Cellpose dataset can be fine-tuned with only 500-1,000 user-annotated regions of interest (ROI) to perform nearly as well as models trained on entire datasets with up to 200,000 ROI. A human-in-the-loop approach further reduced the required user annotation to 100-200 ROI, while maintaining high-quality segmentations. We provide software tools such as an annotation graphical user interface, a model zoo and a human-in-the-loop pipeline to facilitate the adoption of Cellpose 2.0.
TL;DR: CSWin Transformer as discussed by the authors proposes a cross-shaved window self-attention mechanism for computing selfattention in the horizontal and vertical stripes in parallel, with each stripe obtained by splitting the input feature into stripes of equal width.
Abstract: We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism for computing self-attention in the horizontal and vertical stripes in parallel that form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width. We provide a mathematical analysis of the effect of the stripe width and vary the stripe width for different layers of the Transformer network which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles the local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, and is thus especially effective and friendly for downstream tasks. Incorporated with these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or label, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 52.2 mIOU on the ADE20K semantic segmentation task, surpassing previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under the similar FLOPs setting. By further pretraining on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and high segmentation performance on ADE20K with 55.7 mIoU. 11 Code and pretrain model is available at https://github.com/microsoft/CSWin-Transformer
TL;DR: In this article , a comprehensive review of single stage object detectors, regression formulation, their architecture advancements, and performance statistics is presented, among different versions of YOLO, applications based on two-stage detectors, and applications with different methods for detecting objects.
Abstract: Object detection is one of the predominant and challenging problems in computer vision. Over the decade, with the expeditious evolution of deep learning, researchers have extensively experimented and contributed in the performance enhancement of object detection and related tasks such as object classification, localization, and segmentation using underlying deep models. Broadly, object detectors are classified into two categories viz. two stage and single stage object detectors. Two stage detectors mainly focus on selective region proposals strategy via complex architecture; however, single stage detectors focus on all the spatial region proposals for the possible detection of objects via relatively simpler architecture in one shot. Performance of any object detector is evaluated through detection accuracy and inference time. Generally, the detection accuracy of two stage detectors outperforms single stage object detectors. However, the inference time of single stage detectors is better compared to its counterparts. Moreover, with the advent of YOLO (You Only Look Once) and its architectural successors, the detection accuracy is improving significantly and sometime it is better than two stage detectors. YOLOs are adopted in various applications majorly due to their faster inferences rather than considering detection accuracy. As an example, detection accuracies are 63.4 and 70 for YOLO and Fast-RCNN respectively, however, inference time is around 300 times faster in case of YOLO. In this paper, we present a comprehensive review of single stage object detectors specially YOLOs, regression formulation, their architecture advancements, and performance statistics. Moreover, we summarize the comparative illustration between two stage and single stage object detectors, among different versions of YOLOs, applications based on two stage detectors, and different versions of YOLOs along with the future research directions.
TL;DR: TrackMate as mentioned in this paper is an automated tracking software used to analyze bioimages and is distributed as a Fiji plugin, which is built to address the broad spectrum of modern challenges researchers face by integrating state-of-the-art segmentation algorithms into tracking pipelines.
Abstract: TrackMate is an automated tracking software used to analyze bioimages and is distributed as a Fiji plugin. Here, we introduce a new version of TrackMate. TrackMate 7 is built to address the broad spectrum of modern challenges researchers face by integrating state-of-the-art segmentation algorithms into tracking pipelines. We illustrate qualitatively and quantitatively that these new capabilities function effectively across a wide range of bio-imaging experiments.
TL;DR: Zhiqi et al. as discussed by the authors proposed a new framework called BEVformer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks.
Abstract: 3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention that each BEV query extracts the spatial features from the regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse the history BEV information. Our approach achieves the new state-of-the-art 56.9% in terms of NDS metric on the nuScenes test set, which is 9.0 points higher than previous best arts and on par with the performance of LiDAR-based baselines. The code is available at https://github.com/zhiqi-li/BEVFormer .
TL;DR: Wang et al. as mentioned in this paper proposed a Transformer-based decoder and constructed a UNet-like Transformer (UNetformer) for real-time urban scene segmentation.
Abstract: Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment.Driven by rapid developments in deep learning technologies, the convolutional neural network (CNN) has dominated semantic segmentation for many years. CNN adopts hierarchical feature representation, demonstrating strong capabilities for local information extraction. However, the local property of the convolution layer limits the network from capturing the global context. Recently, as a hot topic in the domain of computer vision, Transformer has demonstrated its great potential in global information modelling, boosting many vision-related tasks such as image classification, object detection, and particularly semantic segmentation. In this paper, we propose a Transformer-based decoder and construct a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation. For efficient segmentation, the UNetFormer selects the lightweight ResNet18 as the encoder and develops an efficient global-local attention mechanism to model both global and local information in the decoder. Extensive experiments reveal that our method not only runs faster but also produces higher accuracy compared with state-of-the-art lightweight models. Specifically, the proposed UNetFormer achieved 67.8% and 52.4% mIoU on the UAVid and LoveDA datasets, respectively, while the inference speed can achieve up to 322.4 FPS with a 512x512 input on a single NVIDIA GTX 3090 GPU. In further exploration, the proposed Transformer-based decoder combined with a Swin Transformer encoder also achieves the state-of-the-art result (91.3% F1 and 84.1% mIoU) on the Vaihingen dataset. The source code will be freely available at https://github.com/WangLibo1995/GeoSeg.
TL;DR: In this paper , Li et al. reviewed the state-of-the-art technologies of semantic segmentation based on deep learning and analyzed the key factors affecting the real-time performance of the segmentation model.
TL;DR: UNeXt as discussed by the authors is a Convolutional multilayer perceptron (MLP) based network for image segmentation, which has an early convolutional stage and a MLP stage in the latent stage.
Abstract: AbstractUNet and its latest extensions like TransUNet have been the leading medical image segmentation methods in recent years. However, these networks cannot be effectively adopted for rapid image segmentation in point-of-care applications as they are parameter-heavy, computationally complex and slow to use. To this end, we propose UNeXt which is a Convolutional multilayer perceptron (MLP) based network for image segmentation. We design UNeXt in an effective way with an early convolutional stage and a MLP stage in the latent stage. We propose a tokenized MLP block where we efficiently tokenize and project the convolutional features and use MLPs to model the representation. To further boost the performance, we propose shifting the channels of the inputs while feeding in to MLPs so as to focus on learning local dependencies. Using tokenized MLPs in latent space reduces the number of parameters and computational complexity while being able to result in a better representation to help segmentation. The network also consists of skip connections between various levels of encoder and decoder. We test UNeXt on multiple medical image segmentation datasets and show that we reduce the number of parameters by 72x, decrease the computational complexity by 68x, and improve the inference speed by 10x while also obtaining better segmentation performance over the state-of-the-art medical image segmentation architectures. Code is available at https://github.com/jeya-maria-jose/UNeXt-pytorch. KeywordsMedical image segmentationMLPPoint-of-care
TL;DR: In this article , a comprehensive thematic survey on medical image segmentation using deep learning techniques is presented, where the authors classify currently popular literatures according to a multi-level structure from coarse to fine.
Abstract: Deep learning has been widely used for medical image segmentation and a large number of papers has been presented recording the success of deep learning in the field. In this paper, we present a comprehensive thematic survey on medical image segmentation using deep learning techniques. This paper makes two original contributions. Firstly, compared to traditional surveys that directly divide literatures of deep learning on medical image segmentation into many groups and introduce literatures in detail for each group, we classify currently popular literatures according to a multi-level structure from coarse to fine. Secondly, this paper focuses on supervised and weakly supervised learning approaches, without including unsupervised approaches since they have been introduced in many old surveys and they are not popular currently. For supervised learning approaches, we analyze literatures in three aspects: the selection of backbone networks, the design of network blocks, and the improvement of loss functions. For weakly supervised learning approaches, we investigate literature according to data augmentation, transfer learning, and interactive segmentation, separately. Compared to existing surveys, this survey classifies the literatures very differently from before and is more convenient for readers to understand the relevant rationale and will guide them to think of appropriate improvements in medical image segmentation based on deep learning approaches.
TL;DR: Wang et al. as mentioned in this paper proposed a new 3D transformer-based model, dubbed Swin UNEt TRansformers (Swin UNETR), with a hierarchical encoder for self-supervised pretraining and tailored proxy tasks for learning the underlying pattern of human anatomy.
Abstract: Vision Transformers (ViT)s have shown great performance in self-supervised learning of global and local representations that can be transferred to downstream applications. Inspired by these results, we introduce a novel self-supervised learning framework with tailored proxy tasks for medical image analysis. Specifically, we propose: (i) a new 3D transformer-based model, dubbed Swin UNEt TRansformers (Swin UNETR), with a hierarchical encoder for self-supervised pretraining; (ii) tailored proxy tasks for learning the underlying pattern of human anatomy. We demonstrate successful pre-training of the proposed model on 5,050 publicly available computed tomography (CT) images from various body organs. The effectiveness of our approach is validated by fine-tuning the pre-trained models on the Beyond the Cranial Vault (BTCV) Segmentation Challenge with 13 abdominal organs and segmentation tasks from the Medical Segmentation Decathlon (MSD) dataset. Our model is currently the state-of-the-art on the public test leaderboards of both MSD 11 https://decathlon-10.grand-challenge.org/evaluation/challenge/leaderboard/ and BTCV 22 https://www.synapse.org/#!Synapse:syn3193805/wiki/217785/ datasets. Code: https://monai.io/research/swin-unetr.
TL;DR: Wang et al. as discussed by the authors proposed a semi-relevant contrastive learning (SRCL) strategy to align multiple positive instances with similar visual concepts, which increases the diversity of positives and then results in more informative representations.
TL;DR: In this article , the Unified Focal Loss (UFL) is proposed to handle class imbalance in medical image segmentation. But the proposed loss function is not robust to class imbalance and consistently outperforms the other loss functions.
TL;DR: Zhang et al. as discussed by the authors proposed a feature adaptive transformer network (FAT-Net) which integrates an extra transformer branch to capture long-range dependencies and global context information, and employed a memory-efficient decoder and a feature adaptation module to enhance the feature fusion between the adjacent-level features by activating the effective channels and restraining the irrelevant background noise.
TL;DR: Zhang et al. as mentioned in this paper proposed a brain tumor segmentation method based on the fusion of deep semantics and edge information in multimodal MRI, aiming to achieve a more sufficient utilization of multi-modal information for accurate segmentation.
TL;DR: In this paper , a hybrid model combined CNN and support vector machine (SVM) in terms of classification and with threshold-based segmentation for detection of brain tumor in MRI images.
Abstract: In this research paper, the brain MRI images are going to classify by considering the excellence of CNN on a public dataset to classify Benign and Malignant tumors. Deep learning (DL) methods due to good performance in the last few years have become more popular for Image classification. Convolution Neural Network (CNN), with several methods, can extract features without using handcrafted models, and eventually, show better accuracy of classification. The proposed hybrid model combined CNN and support vector machine (SVM) in terms of classification and with threshold-based segmentation in terms of detection. The findings of previous studies are based on different models with their accuracy as Rough Extreme Learning Machine (RELM)-94.233%, Deep CNN (DCNN)-95%, Deep Neural Network (DNN) and Discrete Wavelet Autoencoder (DWA)-96%, k-nearest neighbors (kNN)-96.6%, CNN-97.5%. The overall accuracy of the hybrid CNN-SVM is obtained as 98.4959%. In today's world, brain cancer is one of the most dangerous diseases with the highest death rate, detection and classification of brain tumors due to abnormal growth of cells, shapes, orientation, and the location is a challengeable task in medical imaging. Magnetic resonance imaging (MRI) is a typical method of medical imaging for brain tumor analysis. Conventional machine learning (ML) techniques categorize brain cancer based on some handicraft property with the radiologist specialist choice. That can lead to failure in the execution and also decrease the effectiveness of an Algorithm. With a brief look came to know that the proposed hybrid model provides more effective and improvement techniques for classification.
TL;DR: In this article , the authors provide an overview of the core concepts of the attention mechanism built into transformers and other basic components, and review various transformer architectures tailored for medical image applications and discuss their limitations.
Abstract: Transformers have dominated the field of natural language processing and have recently made an impact in the area of computer vision. In the field of medical image analysis, transformers have also been successfully used in to full-stack clinical applications, including image synthesis/reconstruction, registration, segmentation, detection, and diagnosis. This paper aims to promote awareness of the applications of transformers in medical image analysis. Specifically, we first provide an overview of the core concepts of the attention mechanism built into transformers and other basic components. Second, we review various transformer architectures tailored for medical image applications and discuss their limitations. Within this review, we investigate key challenges including the use of transformers in different learning paradigms, improving model efficiency, and coupling with other techniques. We hope this review will provide a comprehensive picture of transformers to readers with an interest in medical image analysis.
TL;DR: DenseCLIP as discussed by the authors proposes a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP and uses the pixel-text score maps to guide the learning of dense prediction models.
Abstract: Recent progress has shown that large-scale pre-training using contrastive image-text pairs can be a promising alternative for high-quality visual representation learning from natural language supervision. Benefiting from a broader source of supervision, this new paradigm exhibits impressive transferability to downstream classification tasks and datasets. However, the problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks has barely been visited. In this work, we present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using the contextual information from the image to prompt the language model, we are able to facilitate our model to better exploit the pretrained knowledge. Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones including both CLIP models and ImageNet pre-trained models. Extensive experiments demonstrate the superior performance of our methods on semantic segmentation, object detection, and instance segmentation tasks. Code is available at https://github.com/raoyongming/DenseCLIP.
TL;DR: Li et al. as discussed by the authors proposed an approach for automatic land segmentation based on the Feature Pyramid Network (FPN), which can build a feature pyramid with high-level semantics throughout, but intrinsic defects in feature extraction and fusion hinder FPN from further aggregating more discriminative features.
Abstract: Semantic segmentation using fine-resolution remotely sensed images plays a critical role in many practical applications, such as urban planning, environmental protection, natural and anthropogenic landscape monitoring, etc. However, the automation of semantic segmentation, i.e., automatic categorization/labeling and segmentation is still a challenging task, particularly for fine-resolution images with huge spatial and spectral complexity. Addressing such a problem represents an exciting research field, which paves the way for scene-level landscape pattern analysis and decision making. In this paper, we propose an approach for automatic land segmentation based on the Feature Pyramid Network (FPN). As a classic architecture, FPN can build a feature pyramid with high-level semantics throughout. However, intrinsic defects in feature extraction and fusion hinder FPN from further aggregating more discriminative features. Hence, we propose an Attention Aggregation Module (AAM) to enhance multi-scale feature learning through attention-guided feature aggregation. Based on FPN and AAM, a novel framework named Attention Aggregation Feature Pyramid Network (A2-FPN) is developed for semantic segmentation of fine-resolution remotely sensed images. Extensive experiments conducted on three datasets demonstrate the effectiveness of our A2 -FPN in segmentation accuracy. Code is available at https://github.com/lironui/A2-FPN.
TL;DR: Wang et al. as mentioned in this paper proposed a key sampling strategy for each query point, where nearby points densely and distant points sparsely as its keys in a stratified way, which enables the model to enlarge the effective receptive field and enjoy long-range contexts at a low computational cost.
Abstract: 3D point cloud segmentation has made tremendous progress in recent years. Most current methods focus on aggregating local features, but fail to directly model long-range dependencies. In this paper, we propose Stratified Transformer that is able to capture long-range contexts and demonstrates strong generalization ability and high performance. Specifically, we first put forward a novel key sampling strategy. For each query point, we sample nearby points densely and distant points sparsely as its keys in a stratified way, which enables the model to enlarge the effective receptive field and enjoy long-range contexts at a low computational cost. Also, to combat the challenges posed by irregular point arrangements, we propose first-layer point embedding to aggregate local information, which facilitates convergence and boosts performance. Besides, we adopt contextual relative position encoding to adaptively capture position information. Finally, a memory-efficient implementation is introduced to overcome the issue of varying point numbers in each window. Extensive experiments demonstrate the effectiveness and superiority of our method on S3DIS, ScanNetv2 and ShapeNetPart datasets. Code is available at https://github.com/dvlab-research/Stratified-Transformer.
TL;DR: Wang et al. as discussed by the authors proposed a CNN and Transformer complementary network (CTCNet) for medical image segmentation, which combines the advantages of Transformers and Convolutional Neural Networks (CNNs).
TL;DR: SegNeXt as mentioned in this paper proposes a simple convolutional network architecture for semantic segmentation and achieves state-of-the-art performance on a variety of tasks, including ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID.
Abstract: We present SegNeXt, a simple convolutional network architecture for semantic segmentation. Recent transformer-based models have dominated the field of semantic segmentation due to the efficiency of self-attention in encoding spatial information. In this paper, we show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers. By re-examining the characteristics owned by successful segmentation models, we discover several key components leading to the performance improvement of segmentation models. This motivates us to design a novel convolutional attention network that uses cheap convolutional operations. Without bells and whistles, our SegNeXt significantly improves the performance of previous state-of-the-art methods on popular benchmarks, including ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID. Notably, SegNeXt outperforms EfficientNet-L2 w/ NAS-FPN and achieves 90.6% mIoU on the Pascal VOC 2012 test leaderboard using only 1/10 parameters of it. On average, SegNeXt achieves about 2.0% mIoU improvements compared to the state-of-the-art methods on the ADE20K datasets with the same or fewer computations. Code is available at https://github.com/uyzhang/JSeg (Jittor) and https://github.com/Visual-Attention-Network/SegNeXt (Pytorch).
TL;DR: In this paper , the authors provide an overview and interpretation guide on the following metrics for medical image segmentation evaluation in binary as well as multi-class problems: Dice similarity coefficient, Jaccard, Sensitivity, Specificity, Rand index, ROC curves, Cohen's Kappa, and Hausdorff distance.
Abstract: In the last decade, research on artificial intelligence has seen rapid growth with deep learning models, especially in the field of medical image segmentation. Various studies demonstrated that these models have powerful prediction capabilities and achieved similar results as clinicians. However, recent studies revealed that the evaluation in image segmentation studies lacks reliable model performance assessment and showed statistical bias by incorrect metric implementation or usage. Thus, this work provides an overview and interpretation guide on the following metrics for medical image segmentation evaluation in binary as well as multi-class problems: Dice similarity coefficient, Jaccard, Sensitivity, Specificity, Rand index, ROC curves, Cohen's Kappa, and Hausdorff distance. Furthermore, common issues like class imbalance and statistical as well as interpretation biases in evaluation are discussed. As a summary, we propose a guideline for standardized medical image segmentation evaluation to improve evaluation quality, reproducibility, and comparability in the research field.
TL;DR: The proposed deep neural network model is trained to detect targeted areas of the disease with its severity level using citrus fruits that have been labeled with the help of a domain expert with four severity levels (high, medium, low and healthy) as ground truth.
Abstract: Citrus fruit diseases have an egregious impact on both the quality and quantity of the citrus fruit production and market. Automatic detection of severity is essential for the high-quality production of fruit. In the current work, a citrus fruit dataset is preprocessed by rescaling and establishing bounding boxes with labeled image software. Then, a selective search, which combines the capabilities of both an extensive search and graph-based segmentation, is applied. The proposed deep neural network (DNN) model is trained to detect targeted areas of the disease with its severity level using citrus fruits that have been labeled with the help of a domain expert with four severity levels (high, medium, low and healthy) as ground truth. Transfer learning using VGGNet is applied to implement a multi-classification framework for each class of severity. The model predicts the low severity level with 99% accuracy, and the high severity level with 98% accuracy. The model demonstrates 96% accuracy in detecting healthy conditions and 97% accuracy in detecting medium severity levels. The result of the work shows that the proposed approach is valid, and it is efficient for detecting citrus fruit disease at four levels of severity.
TL;DR: SoftGroup as discussed by the authors performs bottom-up soft grouping followed by top-down refinement to mitigate the problems stemming from semantic prediction errors and suppresses false positive instances by learning to categorize them as background.
Abstract: Existing state-of-the-art 3D instance segmentation methods perform semantic segmentation followed by grouping. The hard predictions are made when performing semantic segmentation such that each point is associated with a single class. However, the errors stemming from hard decision propagate into grouping that results in (1) low overlaps between the predicted instance with the ground truth and (2) substantial false positives. To address the aforementioned problems, this paper proposes a 3D instance segmentation method referred to as SoftGroup by performing bottom-up soft grouping followed by top-down refinement. SoftGroup allows each point to be associated with multiple classes to mitigate the problems stemming from semantic prediction errors and suppresses false positive instances by learning to categorize them as background. Experimental results on different datasets and multiple evaluation metrics demonstrate the efficacy of SoftGroup. Its performance surpasses the strongest prior method by a significant margin of $+6.2\%$ on the ScanNet v2 hidden test set and $+6.8\%$ on S3DIS Area 5 in terms of $AP_{50}$ . Soft-Group is also fast, running at 345ms per scan with a sin-gle Titan X on ScanNet v2 dataset. The source code and trained models for both datasets are available at https://github.com/thangvubk/SoftGroup.git.