Top 22912 papers published in the topic of Segmentation in 2022

Showing papers on "Segmentation" published in 2022

Proceedings Article•10.1109/cvpr52688.2022.01167•

A ConvNet for the 2020s

1 Jun 2022

TL;DR: ConvNeXt is a family of pure ConvNet models that competes favorably with Transformers in accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation.

Abstract: The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

3,192 citations

Journal Article•10.1145/3505244•

Transformers in Vision: A Survey

31 Jan 2022-ACM Computing Surveys

TL;DR: This survey gives a comprehensive overview of Transformer models in computer vision, which can model long-range dependencies between input sequence elements and, unlike recurrent networks such as long short-term memory (LSTM), support parallel processing of sequences.

Abstract: Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works.

1,718 citations

Proceedings Article•10.1109/wacv51458.2022.00181•

UNETR: Transformers for 3D Medical Image Segmentation

1 Jan 2022

TL;DR: UNETR uses a transformer encoder to learn sequence representations of the input volume and capture global multi-scale information, while following the successful U-shaped network design for the encoder and decoder.

Abstract: Fully Convolutional Neural Networks (FCNNs) with contracting and expanding paths have shown prominence for the majority of medical image segmentation applications since the past decade. In FCNNs, the encoder plays an integral role by learning both global and local features and contextual representations which can be utilized for semantic output prediction by the decoder. Despite their success, the locality of convolutional layers in FCNNs, limits the capability of learning long-range spatial dependencies. Inspired by the recent success of transformers for Natural Language Processing (NLP) in long-range sequence learning, we reformulate the task of volumetric (3D) medical image segmentation as a sequence-to-sequence prediction problem. We introduce a novel architecture, dubbed as UNEt TRansformers (UNETR), that utilizes a transformer as the encoder to learn sequence representations of the input volume and effectively capture the global multi-scale information, while also following the successful "U-shaped" network design for the encoder and decoder. The transformer encoder is directly connected to a decoder via skip connections at different resolutions to compute the final semantic segmentation output. We have validated the performance of our method on the Multi Atlas Labeling Beyond The Cranial Vault (BTCV) dataset for multi-organ segmentation and the Medical Segmentation Decathlon (MSD) dataset for brain tumor and spleen segmentation tasks. Our benchmarks demonstrate new state-of-the-art performance on the BTCV leaderboard.
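The sequence-to-sequence reformulation described in the abstract can be illustrated with a minimal sketch that turns a 3D volume into a 1D sequence of patch tokens. This is not the authors' implementation: the function name `volume_to_patch_sequence`, the nested-list volume, and the patch size of 2 are assumptions chosen only for illustration, and the learned linear projection and positional embeddings that a transformer encoder would apply afterwards are omitted.

```python
def volume_to_patch_sequence(volume, patch):
    """Flatten a D x H x W volume (nested lists) into a sequence of patch vectors.

    `patch` is a hypothetical cubic patch edge length; D, H and W must be
    divisible by it.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    if depth % patch or height % patch or width % patch:
        raise ValueError("volume dimensions must be divisible by the patch size")

    sequence = []
    for d0 in range(0, depth, patch):
        for h0 in range(0, height, patch):
            for w0 in range(0, width, patch):
                # One token = all voxels of one non-overlapping cube, flattened.
                token = [
                    volume[d][h][w]
                    for d in range(d0, d0 + patch)
                    for h in range(h0, h0 + patch)
                    for w in range(w0, w0 + patch)
                ]
                sequence.append(token)
    return sequence


# Toy 4 x 4 x 4 volume whose voxel value encodes its own position.
toy = [[[d * 16 + h * 4 + w for w in range(4)] for h in range(4)] for d in range(4)]
tokens = volume_to_patch_sequence(toy, patch=2)
print(len(tokens), len(tokens[0]))  # 8 tokens, each holding 2*2*2 = 8 voxels
```

Each token would then be embedded and fed to the encoder as one element of the input sequence; the decoder's job is to map the encoded sequence back to a dense volume.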

1,332 citations

Proceedings Article•10.1109/cvpr52688.2022.00135•

Masked-attention Mask Transformer for Universal Image Segmentation

1 Jun 2022

TL;DR: Masked-attention Mask Transformer (Mask2Former) extracts localized features by constraining cross-attention within predicted mask regions, and addresses panoptic, instance, and semantic segmentation with a single architecture.

Abstract: Image segmentation groups pixels with different semantics, e.g., category or instance membership. Each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
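The idea of constraining cross-attention to a predicted mask region can be sketched for a single query in plain Python. This is a simplified illustration rather than the Mask2Former code: the function name `masked_cross_attention`, the boolean `mask` input, and the fallback to unmasked attention when the mask is empty are assumptions of this sketch, and multi-head projections are left out.

```python
import math


def masked_cross_attention(query, keys, values, mask):
    """Attend from one query to pixel features, but only inside `mask`.

    query:  list of d floats
    keys:   list of N key vectors (each d floats)
    values: list of N value vectors
    mask:   list of N booleans, True where the predicted mask covers the pixel
    """
    scale = 1.0 / math.sqrt(len(query))
    # Assumption of this sketch: an empty mask falls back to attending everywhere.
    if not any(mask):
        mask = [True] * len(keys)

    # Pixels outside the mask get a logit of -inf, so their softmax weight is 0.
    logits = [
        scale * sum(q * k for q, k in zip(query, key)) if keep else -math.inf
        for key, keep in zip(keys, mask)
    ]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    weights = [w / total for w in weights]

    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
```

With a mask that excludes a pixel, that pixel contributes nothing to the output no matter how strongly its key matches the query, which is what keeps the extracted features localized.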

1,223 citations

Book Chapter•10.1007/978-3-031-08999-2_22•

Swin UNETR: Swin Transformers for Semantic Segmentation of Brain Tumors in MRI Images

Ali Hatamizadeh, Vishwesh Nath, Yucheng Tang, Dong Yang, Holger R. Roth, Daguang Xu

4 Jan 2022

TL;DR: Hatamizadeh et al. propose a segmentation model termed Swin UNEt TRansformers (Swin UNETR), which reformulates 3D brain tumor semantic segmentation as a sequence-to-sequence prediction problem.

Abstract: Semantic segmentation of brain tumors is a fundamental medical image analysis task involving multiple MRI imaging modalities that can assist clinicians in diagnosing the patient and successively studying the progression of the malignant entity. In recent years, Fully Convolutional Neural Networks (FCNNs) approaches have become the de facto standard for 3D medical image segmentation. The popular “U-shaped” network architecture has achieved state-of-the-art performance benchmarks on different 2D and 3D semantic segmentation tasks and across various imaging modalities. However, due to the limited kernel size of convolution layers in FCNNs, their performance of modeling long-range information is sub-optimal, and this can lead to deficiencies in the segmentation of tumors with variable sizes. On the other hand, transformer models have demonstrated excellent capabilities in capturing such long-range information in multiple domains, including natural language processing and computer vision. Inspired by the success of vision transformers and their variants, we propose a novel segmentation model termed Swin UNEt TRansformers (Swin UNETR). Specifically, the task of 3D brain tumor semantic segmentation is reformulated as a sequence to sequence prediction problem wherein multi-modal input data is projected into a 1D sequence of embedding and used as an input to a hierarchical Swin transformer as the encoder. The swin transformer encoder extracts features at five different resolutions by utilizing shifted windows for computing self-attention and is connected to an FCNN-based decoder at each resolution via skip connections. We have participated in BraTS 2021 segmentation challenge, and our proposed model ranks among the top-performing approaches in the validation phase. Code: https://monai.io/research/swin-unetr .

1,053 citations

Journal Article•10.1109/tcyb.2021.3095305•

Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation

01 Aug 2022-IEEE transactions on cybernetics

TL;DR: The authors propose complete-IoU (CIoU) loss and Cluster-NMS, which enhance geometric factors in both bounding-box regression and non-maximum suppression (NMS), leading to notable gains in average precision (AP) and average recall (AR) without sacrificing inference efficiency.

Abstract: Deep learning-based object detection and instance segmentation have achieved unprecedented progress. In this article, we propose complete-IoU (CIoU) loss and Cluster-NMS for enhancing geometric factors in both bounding-box regression and nonmaximum suppression (NMS), leading to notable gains of average precision (AP) and average recall (AR), without the sacrifice of inference efficiency. In particular, we consider three geometric factors, that is: 1) overlap area; 2) normalized central-point distance; and 3) aspect ratio, which are crucial for measuring bounding-box regression in object detection and instance segmentation. The three geometric factors are then incorporated into CIoU loss for better distinguishing difficult regression cases. The training of deep models using CIoU loss results in consistent AP and AR improvements in comparison to widely adopted $\ell_{n}$-norm loss and IoU-based loss. Furthermore, we propose Cluster-NMS, where NMS during inference is done by implicitly clustering detected boxes and usually requires fewer iterations. Cluster-NMS is very efficient due to its pure GPU implementation, and geometric factors can be incorporated to improve both AP and AR. In the experiments, CIoU loss and Cluster-NMS have been applied to state-of-the-art instance segmentation (e.g., YOLACT and BlendMask-RT), and object detection (e.g., YOLO v3, SSD, and Faster R-CNN) models. Taking YOLACT on MS COCO as an example, our method achieves performance gains as +1.7 AP and +6.2 AR₁₀₀ for object detection, and +1.1 AP and +3.5 AR₁₀₀ for instance segmentation, with 27.1 FPS on one NVIDIA GTX 1080Ti GPU. All the source code and trained models are available at https://github.com/Zzh-tju/CIoU .
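The three geometric factors named in the abstract can be combined into a single loss value as in the minimal sketch below. The abstract does not spell out the formula; the sketch assumes the commonly cited CIoU form, loss = 1 - IoU + rho^2 / c^2 + alpha * v, where rho is the distance between box centers, c is the diagonal of the smallest enclosing box, and v measures aspect-ratio mismatch. The function name `ciou_loss`, the `(x1, y1, x2, y2)` box layout, and the `eps` stabilizer are choices made here for illustration.

```python
import math


def ciou_loss(box_a, box_b, eps=1e-9):
    """CIoU-style loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Factor 1: overlap area, expressed as IoU.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = wa * ha + wb * hb - inter
    iou = inter / (union + eps)

    # Factor 2: squared center distance, normalized by the squared diagonal
    # of the smallest box enclosing both inputs.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps

    # Factor 3: aspect-ratio consistency term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```

Unlike a plain IoU loss, the value keeps changing for non-overlapping boxes because the center-distance term still depends on how far apart they are, so the loss can still tell a near miss from a distant one.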

712 citations

Journal Article•10.1038/s41592-022-01663-4•

Cellpose 2.0: how to train your own model

Carsen Stringer, Marius Pachitariu

05 Apr 2022-Nature Methods

TL;DR: Cellpose 2.0 is a package that includes an ensemble of diverse pretrained models and a human-in-the-loop pipeline for rapid prototyping of new custom segmentation models, addressing the fact that pretrained models otherwise cannot be adapted to a user's segmentation style and can perform suboptimally on test images that differ greatly from the training images.

Abstract: Pretrained neural network models for biological segmentation can provide good out-of-the-box results for many image types. However, such models do not allow users to adapt the segmentation style to their specific needs and can perform suboptimally for test images that are very different from the training images. Here we introduce Cellpose 2.0, a new package that includes an ensemble of diverse pretrained models as well as a human-in-the-loop pipeline for rapid prototyping of new custom models. We show that models pretrained on the Cellpose dataset can be fine-tuned with only 500-1,000 user-annotated regions of interest (ROI) to perform nearly as well as models trained on entire datasets with up to 200,000 ROI. A human-in-the-loop approach further reduced the required user annotation to 100-200 ROI, while maintaining high-quality segmentations. We provide software tools such as an annotation graphical user interface, a model zoo and a human-in-the-loop pipeline to facilitate the adoption of Cellpose 2.0.

706 citations

Proceedings Article•10.1109/cvpr52688.2022.01181•

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

1 Jun 2022

TL;DR: CSWin Transformer introduces a Cross-Shaped Window self-attention mechanism that computes self-attention in horizontal and vertical stripes in parallel, with each stripe obtained by splitting the input feature into stripes of equal width.

Abstract: We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism for computing self-attention in the horizontal and vertical stripes in parallel that form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width. We provide a mathematical analysis of the effect of the stripe width and vary the stripe width for different layers of the Transformer network which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles the local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, and is thus especially effective and friendly for downstream tasks. Incorporated with these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or label, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 52.2 mIoU on the ADE20K semantic segmentation task, surpassing previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under the similar FLOPs setting. By further pretraining on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and high segmentation performance on ADE20K with 55.7 mIoU. Code and pretrained models are available at https://github.com/microsoft/CSWin-Transformer
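The stripe splitting described in the abstract can be illustrated with a small sketch that works on token coordinates only. It is not taken from the CSWin code base: the helper names `stripe_partition` and `cross_shaped_window`, the `(row, col)` coordinate convention, and the parameter name `sw` for the stripe width are assumptions made for this example, and the attention computation itself is omitted.

```python
def stripe_partition(height, width, sw, horizontal=True):
    """Group token coordinates (row, col) into stripes of equal width `sw`."""
    extent = height if horizontal else width
    if extent % sw:
        raise ValueError("feature size must be divisible by the stripe width")
    stripes = []
    for start in range(0, extent, sw):
        if horizontal:
            # A horizontal stripe spans `sw` full rows.
            stripe = [(r, c) for r in range(start, start + sw) for c in range(width)]
        else:
            # A vertical stripe spans `sw` full columns.
            stripe = [(r, c) for r in range(height) for c in range(start, start + sw)]
        stripes.append(stripe)
    return stripes


def cross_shaped_window(row, col, height, width, sw):
    """Tokens reachable from (row, col): its horizontal stripe plus its vertical stripe."""
    row_start = (row // sw) * sw
    col_start = (col // sw) * sw
    horizontal = {(r, c) for r in range(row_start, row_start + sw) for c in range(width)}
    vertical = {(r, c) for r in range(height) for c in range(col_start, col_start + sw)}
    return horizontal | vertical


# On a 4 x 4 feature map with stripe width 1, token (1, 2) sees row 1 and column 2.
print(len(cross_shaped_window(1, 2, 4, 4, sw=1)))  # 7
```

Widening the stripes enlarges the region each token can reach at a higher attention cost, which is the trade-off the abstract refers to when it mentions varying the stripe width across layers.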

701 citations

Journal Article•10.1007/s11042-022-13644-y•

Object detection using YOLO: challenges, architectural successors, datasets and applications

Tausif Diwan, Grandhi Sai Anirudh, Jitendra V. Tembhurne

08 Aug 2022-Multimedia Tools and Applications

TL;DR: This paper presents a comprehensive review of single-stage object detectors, especially YOLO: their regression formulation, architectural advancements, and performance statistics, along with comparisons between two-stage and single-stage detectors and among different versions of YOLO.

Abstract: Object detection is one of the predominant and challenging problems in computer vision. Over the decade, with the expeditious evolution of deep learning, researchers have extensively experimented and contributed in the performance enhancement of object detection and related tasks such as object classification, localization, and segmentation using underlying deep models. Broadly, object detectors are classified into two categories viz. two stage and single stage object detectors. Two stage detectors mainly focus on selective region proposals strategy via complex architecture; however, single stage detectors focus on all the spatial region proposals for the possible detection of objects via relatively simpler architecture in one shot. Performance of any object detector is evaluated through detection accuracy and inference time. Generally, the detection accuracy of two stage detectors outperforms single stage object detectors. However, the inference time of single stage detectors is better compared to its counterparts. Moreover, with the advent of YOLO (You Only Look Once) and its architectural successors, the detection accuracy is improving significantly and sometime it is better than two stage detectors. YOLOs are adopted in various applications majorly due to their faster inferences rather than considering detection accuracy. As an example, detection accuracies are 63.4 and 70 for YOLO and Fast-RCNN respectively, however, inference time is around 300 times faster in case of YOLO. In this paper, we present a comprehensive review of single stage object detectors specially YOLOs, regression formulation, their architecture advancements, and performance statistics. Moreover, we summarize the comparative illustration between two stage and single stage object detectors, among different versions of YOLOs, applications based on two stage detectors, and different versions of YOLOs along with the future research directions.

691 citations

Journal Article•10.1038/s41592-022-01507-1•

TrackMate 7: integrating state-of-the-art segmentation algorithms into tracking pipelines

Dmitry Ershov, Minh Phan, Joanna W Pylvänäinen, Stéphane U Rigaud, Laure Le Blanc, Arthur Charles-Orszag, James R. W. Conway, Romain F. Laine, Nathan H. Roy, Daria Bonazzi, Guillaume Duménil, Guillaume Jacquemet, Jean-Yves Tinevez

02 Jun 2022-Nature Methods

TL;DR: TrackMate is an automated tracking software for analyzing bioimages, distributed as a Fiji plugin; TrackMate 7 addresses the broad spectrum of modern challenges researchers face by integrating state-of-the-art segmentation algorithms into tracking pipelines.

Abstract: TrackMate is an automated tracking software used to analyze bioimages and is distributed as a Fiji plugin. Here, we introduce a new version of TrackMate. TrackMate 7 is built to address the broad spectrum of modern challenges researchers face by integrating state-of-the-art segmentation algorithms into tracking pipelines. We illustrate qualitatively and quantitatively that these new capabilities function effectively across a wide range of bio-imaging experiments.

616 citations

Journal Article•10.1007/978-3-031-20077-9_1•

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-camera Images via Spatiotemporal Transformers

Dong Xu

01 Jan 2022-Lecture Notes in Computer Science

TL;DR: BEVFormer is a framework that learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks.

Abstract: 3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention that each BEV query extracts the spatial features from the regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse the history BEV information. Our approach achieves the new state-of-the-art 56.9% in terms of NDS metric on the nuScenes test set, which is 9.0 points higher than previous best arts and on par with the performance of LiDAR-based baselines. The code is available at https://github.com/zhiqi-li/BEVFormer .

Journal Article•10.1016/j.isprsjprs.2022.06.008•

UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery

Ismailov Isamiddin (Wuhan University)

1 Aug 2022

TL;DR: Wang et al. propose a Transformer-based decoder and construct a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation.

Abstract: Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment. Driven by rapid developments in deep learning technologies, the convolutional neural network (CNN) has dominated semantic segmentation for many years. CNN adopts hierarchical feature representation, demonstrating strong capabilities for local information extraction. However, the local property of the convolution layer limits the network from capturing the global context. Recently, as a hot topic in the domain of computer vision, Transformer has demonstrated its great potential in global information modelling, boosting many vision-related tasks such as image classification, object detection, and particularly semantic segmentation. In this paper, we propose a Transformer-based decoder and construct a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation. For efficient segmentation, the UNetFormer selects the lightweight ResNet18 as the encoder and develops an efficient global-local attention mechanism to model both global and local information in the decoder. Extensive experiments reveal that our method not only runs faster but also produces higher accuracy compared with state-of-the-art lightweight models. Specifically, the proposed UNetFormer achieved 67.8% and 52.4% mIoU on the UAVid and LoveDA datasets, respectively, while the inference speed can achieve up to 322.4 FPS with a 512x512 input on a single NVIDIA GTX 3090 GPU. In further exploration, the proposed Transformer-based decoder combined with a Swin Transformer encoder also achieves the state-of-the-art result (91.3% F1 and 84.1% mIoU) on the Vaihingen dataset. The source code will be freely available at https://github.com/WangLibo1995/GeoSeg.

Journal Article•10.1016/j.neucom.2022.01.005•

Review the state-of-the-art technologies of semantic segmentation based on deep learning

Yuji Mo, Yangjie Wu, Xinneng Yang, Feilin Liu, Yujun Liao

01 Jan 2022-Neurocomputing

TL;DR: Mo et al. review the state-of-the-art technologies of semantic segmentation based on deep learning and analyze the key factors affecting the real-time performance of segmentation models.

Journal Article•10.1007/978-3-031-16443-9_3•

UNeXt: MLP-Based Rapid Medical Image Segmentation Network

01 Jan 2022-Lecture Notes in Computer Science

TL;DR: UNeXt is a convolutional multilayer perceptron (MLP) based network for image segmentation, with an early convolutional stage followed by an MLP stage in the latent space.

Abstract: UNet and its latest extensions like TransUNet have been the leading medical image segmentation methods in recent years. However, these networks cannot be effectively adopted for rapid image segmentation in point-of-care applications as they are parameter-heavy, computationally complex and slow to use. To this end, we propose UNeXt which is a Convolutional multilayer perceptron (MLP) based network for image segmentation. We design UNeXt in an effective way with an early convolutional stage and a MLP stage in the latent stage. We propose a tokenized MLP block where we efficiently tokenize and project the convolutional features and use MLPs to model the representation. To further boost the performance, we propose shifting the channels of the inputs while feeding in to MLPs so as to focus on learning local dependencies. Using tokenized MLPs in latent space reduces the number of parameters and computational complexity while being able to result in a better representation to help segmentation. The network also consists of skip connections between various levels of encoder and decoder. We test UNeXt on multiple medical image segmentation datasets and show that we reduce the number of parameters by 72x, decrease the computational complexity by 68x, and improve the inference speed by 10x while also obtaining better segmentation performance over the state-of-the-art medical image segmentation architectures. Code is available at https://github.com/jeya-maria-jose/UNeXt-pytorch. Keywords: Medical image segmentation, MLP, Point-of-care

Journal Article•10.1049/ipr2.12419•

Medical image segmentation using deep learning: A survey

Jorge Natalino da Silva (Shaanxi University of Science and Technology)

17 Jan 2022-Iet Image Processing

TL;DR: This article presents a comprehensive thematic survey on medical image segmentation using deep learning techniques, classifying the currently popular literature according to a multi-level structure from coarse to fine.

Abstract: Deep learning has been widely used for medical image segmentation and a large number of papers has been presented recording the success of deep learning in the field. In this paper, we present a comprehensive thematic survey on medical image segmentation using deep learning techniques. This paper makes two original contributions. Firstly, compared to traditional surveys that directly divide literatures of deep learning on medical image segmentation into many groups and introduce literatures in detail for each group, we classify currently popular literatures according to a multi-level structure from coarse to fine. Secondly, this paper focuses on supervised and weakly supervised learning approaches, without including unsupervised approaches since they have been introduced in many old surveys and they are not popular currently. For supervised learning approaches, we analyze literatures in three aspects: the selection of backbone networks, the design of network blocks, and the improvement of loss functions. For weakly supervised learning approaches, we investigate literature according to data augmentation, transfer learning, and interactive segmentation, separately. Compared to existing surveys, this survey classifies the literatures very differently from before and is more convenient for readers to understand the relevant rationale and will guide them to think of appropriate improvements in medical image segmentation based on deep learning approaches.

Proceedings Article•10.1109/cvpr52688.2022.02007•

Self-Supervised Pre-Training of Swin Transformers for 3D Medical Image Analysis

1 Jun 2022

TL;DR: The authors propose a new 3D transformer-based model, dubbed Swin UNEt TRansformers (Swin UNETR), with a hierarchical encoder for self-supervised pretraining, together with tailored proxy tasks for learning the underlying pattern of human anatomy.

Abstract: Vision Transformers (ViTs) have shown great performance in self-supervised learning of global and local representations that can be transferred to downstream applications. Inspired by these results, we introduce a novel self-supervised learning framework with tailored proxy tasks for medical image analysis. Specifically, we propose: (i) a new 3D transformer-based model, dubbed Swin UNEt TRansformers (Swin UNETR), with a hierarchical encoder for self-supervised pretraining; (ii) tailored proxy tasks for learning the underlying pattern of human anatomy. We demonstrate successful pre-training of the proposed model on 5,050 publicly available computed tomography (CT) images from various body organs. The effectiveness of our approach is validated by fine-tuning the pre-trained models on the Beyond the Cranial Vault (BTCV) Segmentation Challenge with 13 abdominal organs and segmentation tasks from the Medical Segmentation Decathlon (MSD) dataset. Our model is currently the state-of-the-art on the public test leaderboards of both MSD (https://decathlon-10.grand-challenge.org/evaluation/challenge/leaderboard/) and BTCV (https://www.synapse.org/#!Synapse:syn3193805/wiki/217785/) datasets. Code: https://monai.io/research/swin-unetr.

Journal Article•10.1016/j.media.2022.102559•

Transformer-based unsupervised contrastive learning for histopathological image classification

Xiyue Wang, Sen Yang, Jun Zhang, Minghui Wang, Qing Zhang, Wei Yang, Junzhou Huang, Xiao Han

01 Jul 2022-Medical Image Analysis

TL;DR: Wang et al. propose a semantically-relevant contrastive learning (SRCL) strategy that aligns multiple positive instances with similar visual concepts, which increases the diversity of positives and results in more informative representations.

Journal Article•10.1016/j.compmedimag.2021.102026•

Unified Focal loss: Generalising Dice and cross entropy-based losses to handle class imbalanced medical image segmentation

Konrad Franz Seibel (University of North Carolina at Chapel Hill)

01 Jan 2022-Computerized Medical Imaging and Graphics

TL;DR: This article proposes the Unified Focal loss, which generalises Dice and cross entropy-based losses to handle class imbalance in medical image segmentation; the proposed loss function is robust to class imbalance and consistently outperforms the other loss functions compared.

Journal Article•10.1016/j.media.2021.102327•

FAT-Net: Feature adaptive transformers for automated skin lesion segmentation

Zhou Deng (Shenzhen University)

01 Feb 2022-Medical Image Analysis

TL;DR: Zhang et al. as discussed by the authors proposed a feature adaptive transformer network (FAT-Net) which integrates an extra transformer branch to capture long-range dependencies and global context information, and employed a memory-efficient decoder and a feature adaptation module to enhance the feature fusion between the adjacent-level features by activating the effective channels and restraining the irrelevant background noise.

Journal Article•10.1016/j.inffus.2022.10.022•

Brain tumor segmentation based on the fusion of deep semantics and edge information in multimodal MRI

Zhi-lin Zhu, Xianyu He, Guanqiu Qi, Yuanyuan Li, Baisen Cong, Yu Liu

01 Oct 2022-Information Fusion

TL;DR: Zhu et al. propose a brain tumor segmentation method based on the fusion of deep semantics and edge information in multimodal MRI, aiming to make fuller use of multimodal information for accurate segmentation.

Journal Article•10.1016/j.irbm.2021.06.003•

A Hybrid CNN-SVM Threshold Segmentation Approach for Tumor Detection and Classification of MRI Brain Images

1 Aug 2022

TL;DR: This paper proposes a hybrid model that combines a CNN with a support vector machine (SVM) for classification and uses threshold-based segmentation for detection of brain tumors in MRI images.

Abstract: In this research paper, brain MRI images from a public dataset are classified as benign or malignant tumors using a CNN. Deep learning (DL) methods have become more popular for image classification in the last few years because of their good performance. A convolutional neural network (CNN), with several methods, can extract features without using handcrafted models and eventually shows better classification accuracy. The proposed hybrid model combines a CNN and a support vector machine (SVM) for classification with threshold-based segmentation for detection. Previous studies report the following accuracies for different models: Rough Extreme Learning Machine (RELM), 94.233%; Deep CNN (DCNN), 95%; Deep Neural Network (DNN) and Discrete Wavelet Autoencoder (DWA), 96%; k-nearest neighbors (kNN), 96.6%; CNN, 97.5%. The overall accuracy of the hybrid CNN-SVM is 98.4959%. Brain cancer is one of the most dangerous diseases, with the highest death rate, and detecting and classifying brain tumors is a challenging task in medical imaging because of the abnormal growth of cells and their varying shapes, orientations, and locations. Magnetic resonance imaging (MRI) is a typical medical imaging method for brain tumor analysis. Conventional machine learning (ML) techniques categorize brain cancer based on handcrafted properties chosen by the radiologist, which can lead to failures in execution and reduce the effectiveness of an algorithm. Overall, the proposed hybrid model provides a more effective and improved technique for classification.

Journal Article•10.1016/j.imed.2022.07.002•

Transformers in Medical Image Analysis: A Review

Kelei He, Chen Gan, Zhuoyuan Li, Islem Rekik, Zihao Yin, Wen Ji, Yang Gao, Junfeng Zhang, Dinggang Shen

24 Feb 2022-Intelligent medicine

TL;DR: This review provides an overview of the core concepts of the attention mechanism built into transformers and of other basic components, surveys transformer architectures tailored for medical image applications, and discusses their limitations.

Abstract: Transformers have dominated the field of natural language processing and have recently made an impact in the area of computer vision. In the field of medical image analysis, transformers have also been successfully used in full-stack clinical applications, including image synthesis/reconstruction, registration, segmentation, detection, and diagnosis. This paper aims to promote awareness of the applications of transformers in medical image analysis. Specifically, we first provide an overview of the core concepts of the attention mechanism built into transformers and other basic components. Second, we review various transformer architectures tailored for medical image applications and discuss their limitations. Within this review, we investigate key challenges including the use of transformers in different learning paradigms, improving model efficiency, and coupling with other techniques. We hope this review will provide a comprehensive picture of transformers to readers with an interest in medical image analysis.

Proceedings Article•10.1109/cvpr52688.2022.01755•

DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

1 Jun 2022

TL;DR: DenseCLIP is a framework for dense prediction that implicitly and explicitly leverages the pre-trained knowledge from CLIP and uses pixel-text score maps to guide the learning of dense prediction models.

Abstract: Recent progress has shown that large-scale pre-training using contrastive image-text pairs can be a promising alternative for high-quality visual representation learning from natural language supervision. Benefiting from a broader source of supervision, this new paradigm exhibits impressive transferability to downstream classification tasks and datasets. However, the problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks has barely been visited. In this work, we present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using the contextual information from the image to prompt the language model, we are able to facilitate our model to better exploit the pretrained knowledge. Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones including both CLIP models and ImageNet pre-trained models. Extensive experiments demonstrate the superior performance of our methods on semantic segmentation, object detection, and instance segmentation tasks. Code is available at https://github.com/raoyongming/DenseCLIP.
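The pixel-text matching step described in the abstract above can be illustrated with a minimal sketch. It assumes per-pixel image embeddings and per-class text embeddings are already available as plain lists; the function names, the toy values, and the use of cosine similarity are illustrative assumptions, not DenseCLIP's actual code. The final argmax is only there to make the scores easy to inspect, whereas the abstract describes the score maps as guidance for a dense prediction model.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)

def pixel_text_score_map(pixel_feats, text_feats):
    """Cosine similarity between every pixel embedding and every text embedding.

    pixel_feats: H x W x C nested lists (hypothetical image features).
    text_feats:  K x C nested lists (hypothetical class-prompt features).
    Returns an H x W x K nested list of scores.
    """
    text_n = [l2_normalize(t) for t in text_feats]
    scores = []
    for row in pixel_feats:
        out_row = []
        for px in row:
            px_n = l2_normalize(px)
            out_row.append([sum(a * b for a, b in zip(px_n, t)) for t in text_n])
        scores.append(out_row)
    return scores

# Toy 2 x 2 image with 2-dimensional features and 2 text classes.
pixel_feats = [[[1.0, 0.0], [0.0, 2.0]],
               [[3.0, 3.0], [0.5, 0.0]]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]

score_map = pixel_text_score_map(pixel_feats, text_feats)
# Illustration only: pick the best-matching class per pixel.
labels = [[max(range(len(s)), key=s.__getitem__) for s in row] for row in score_map]
```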

Journal Article•10.1109/tgrs.2021.3093977•

Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images

01 Jan 2022-IEEE Transactions on Geoscience and Remote Sensing

TL;DR: This paper proposes the Attention Aggregation Feature Pyramid Network (A2-FPN), which adds an Attention Aggregation Module (AAM) to the Feature Pyramid Network (FPN) to enhance multi-scale feature learning for semantic segmentation of fine-resolution remotely sensed images.

Abstract: Semantic segmentation using fine-resolution remotely sensed images plays a critical role in many practical applications, such as urban planning, environmental protection, natural and anthropogenic landscape monitoring, etc. However, the automation of semantic segmentation, i.e., automatic categorization/labeling and segmentation is still a challenging task, particularly for fine-resolution images with huge spatial and spectral complexity. Addressing such a problem represents an exciting research field, which paves the way for scene-level landscape pattern analysis and decision making. In this paper, we propose an approach for automatic land segmentation based on the Feature Pyramid Network (FPN). As a classic architecture, FPN can build a feature pyramid with high-level semantics throughout. However, intrinsic defects in feature extraction and fusion hinder FPN from further aggregating more discriminative features. Hence, we propose an Attention Aggregation Module (AAM) to enhance multi-scale feature learning through attention-guided feature aggregation. Based on FPN and AAM, a novel framework named Attention Aggregation Feature Pyramid Network (A2-FPN) is developed for semantic segmentation of fine-resolution remotely sensed images. Extensive experiments conducted on three datasets demonstrate the effectiveness of our A2-FPN in segmentation accuracy. Code is available at https://github.com/lironui/A2-FPN.

Proceedings Article•10.1109/cvpr52688.2022.00831•

Stratified Transformer for 3D Point Cloud Segmentation

1 Jun 2022

TL;DR: Stratified Transformer uses a key sampling strategy in which, for each query point, nearby points are sampled densely and distant points sparsely as its keys in a stratified way, which enlarges the effective receptive field and captures long-range contexts at a low computational cost.

Abstract: 3D point cloud segmentation has made tremendous progress in recent years. Most current methods focus on aggregating local features, but fail to directly model long-range dependencies. In this paper, we propose Stratified Transformer that is able to capture long-range contexts and demonstrates strong generalization ability and high performance. Specifically, we first put forward a novel key sampling strategy. For each query point, we sample nearby points densely and distant points sparsely as its keys in a stratified way, which enables the model to enlarge the effective receptive field and enjoy long-range contexts at a low computational cost. Also, to combat the challenges posed by irregular point arrangements, we propose first-layer point embedding to aggregate local information, which facilitates convergence and boosts performance. Besides, we adopt contextual relative position encoding to adaptively capture position information. Finally, a memory-efficient implementation is introduced to overcome the issue of varying point numbers in each window. Extensive experiments demonstrate the effectiveness and superiority of our method on S3DIS, ScanNetv2 and ShapeNetPart datasets. Code is available at https://github.com/dvlab-research/Stratified-Transformer.
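The idea of sampling nearby points densely and distant points sparsely, as described in the abstract above, can be sketched roughly as follows. This is a simplified illustration and not the paper's implementation: the radius-based neighbourhoods, the strided subsampling of far points, and every parameter name (`near_radius`, `far_radius`, `far_stride`) are assumptions made for the example.

```python
import math

def stratified_keys(points, query_idx, near_radius, far_radius, far_stride):
    """Select key indices for one query point in a stratified way.

    Dense keys:  every point within near_radius of the query.
    Sparse keys: every far_stride-th point whose distance lies in
                 (near_radius, far_radius].
    All parameters are hypothetical knobs for this sketch.
    """
    q = points[query_idx]
    dense, far_candidates = [], []
    for i, p in enumerate(points):
        d = math.dist(q, p)
        if d <= near_radius:
            dense.append(i)
        elif d <= far_radius:
            far_candidates.append(i)
    # Keep only a strided subset of the distant points.
    sparse = far_candidates[::far_stride]
    return dense, sparse

# Ten points on a line, one unit apart.
points = [(float(x), 0.0, 0.0) for x in range(10)]
dense, sparse = stratified_keys(points, query_idx=0,
                                near_radius=2.0, far_radius=8.0, far_stride=3)
```

With these toy settings the query attends to three nearby keys and two distant ones, so the receptive field reaches eight units while only five keys are used.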

Journal Article•10.1016/j.patcog.2022.109228•

An effective CNN and Transformer complementary network for medical image segmentation

Feiniu Yuan, Zhengxiao Zhang, Zhijun Fang

01 Nov 2022-Pattern Recognition

TL;DR: Yuan et al. propose a CNN and Transformer complementary network (CTCNet) for medical image segmentation, which combines the advantages of Transformers and Convolutional Neural Networks (CNNs).

Posted Content•10.48550/arxiv.2209.08575•

SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation

18 Sep 2022

TL;DR: SegNeXt is a simple convolutional network architecture for semantic segmentation that improves on previous state-of-the-art methods on popular benchmarks, including ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID.

Abstract: We present SegNeXt, a simple convolutional network architecture for semantic segmentation. Recent transformer-based models have dominated the field of semantic segmentation due to the efficiency of self-attention in encoding spatial information. In this paper, we show that convolutional attention is a more efficient and effective way to encode contextual information than the self-attention mechanism in transformers. By re-examining the characteristics owned by successful segmentation models, we discover several key components leading to the performance improvement of segmentation models. This motivates us to design a novel convolutional attention network that uses cheap convolutional operations. Without bells and whistles, our SegNeXt significantly improves the performance of previous state-of-the-art methods on popular benchmarks, including ADE20K, Cityscapes, COCO-Stuff, Pascal VOC, Pascal Context, and iSAID. Notably, SegNeXt outperforms EfficientNet-L2 w/ NAS-FPN and achieves 90.6% mIoU on the Pascal VOC 2012 test leaderboard using only 1/10 parameters of it. On average, SegNeXt achieves about 2.0% mIoU improvements compared to the state-of-the-art methods on the ADE20K datasets with the same or fewer computations. Code is available at https://github.com/uyzhang/JSeg (Jittor) and https://github.com/Visual-Attention-Network/SegNeXt (Pytorch).

Journal Article•10.1186/s13104-022-06096-y•

Towards a guideline for evaluation metrics in medical image segmentation

Dominik Müller, Inaki Soto Rey, Frank Kramer

10 Feb 2022-BMC Research Notes

TL;DR: This paper provides an overview and interpretation guide for the following metrics for medical image segmentation evaluation in binary as well as multi-class problems: Dice similarity coefficient, Jaccard, Sensitivity, Specificity, Rand index, ROC curves, Cohen's Kappa, and Hausdorff distance.

Abstract: In the last decade, research on artificial intelligence has seen rapid growth with deep learning models, especially in the field of medical image segmentation. Various studies demonstrated that these models have powerful prediction capabilities and achieved similar results as clinicians. However, recent studies revealed that the evaluation in image segmentation studies lacks reliable model performance assessment and showed statistical bias by incorrect metric implementation or usage. Thus, this work provides an overview and interpretation guide on the following metrics for medical image segmentation evaluation in binary as well as multi-class problems: Dice similarity coefficient, Jaccard, Sensitivity, Specificity, Rand index, ROC curves, Cohen's Kappa, and Hausdorff distance. Furthermore, common issues like class imbalance and statistical as well as interpretation biases in evaluation are discussed. As a summary, we propose a guideline for standardized medical image segmentation evaluation to improve evaluation quality, reproducibility, and comparability in the research field.
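Four of the metrics listed above follow directly from the confusion counts of a binary mask, which the minimal sketch below illustrates. The function names and the toy masks are hypothetical, masks are assumed to be flattened lists of 0/1 values, and undefined ratios are returned as NaN by choice rather than by any convention stated in the paper.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN for two equal-length flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    return tp, fp, fn, tn

def _ratio(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")

def segmentation_metrics(pred, truth):
    """Dice, Jaccard, sensitivity and specificity for a binary segmentation."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    return {
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "jaccard": _ratio(tp, tp + fp + fn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }

# Toy masks: 2 true positives, 1 false positive, 1 false negative, 4 true negatives.
pred = [1, 1, 0, 0, 1, 0, 0, 0]
truth = [1, 0, 0, 0, 1, 1, 0, 0]
metrics = segmentation_metrics(pred, truth)
```

In this toy case specificity (0.8) is higher than Dice (about 0.67) because true negatives count toward specificity but not toward Dice, which relates to the class imbalance issue the abstract raises.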

Journal Article•10.3390/electronics11030495•

A Novel Deep Learning Model for Detection of Severity Level of the Disease in Citrus Fruits

Poonam Dhiman, V. K. Kukreja, Poongodi Manoharan, Amandeep Kaur, M. Kamruzzaman, Imed Ben Dhaou, Celestine Iwendi

08 Feb 2022-Electronics

TL;DR: The proposed deep neural network model is trained to detect targeted areas of the disease with its severity level using citrus fruits that have been labeled with the help of a domain expert with four severity levels (high, medium, low and healthy) as ground truth.

Abstract: Citrus fruit diseases have an egregious impact on both the quality and quantity of the citrus fruit production and market. Automatic detection of severity is essential for the high-quality production of fruit. In the current work, a citrus fruit dataset is preprocessed by rescaling and establishing bounding boxes with labeled image software. Then, a selective search, which combines the capabilities of both an extensive search and graph-based segmentation, is applied. The proposed deep neural network (DNN) model is trained to detect targeted areas of the disease with its severity level using citrus fruits that have been labeled with the help of a domain expert with four severity levels (high, medium, low and healthy) as ground truth. Transfer learning using VGGNet is applied to implement a multi-classification framework for each class of severity. The model predicts the low severity level with 99% accuracy, and the high severity level with 98% accuracy. The model demonstrates 96% accuracy in detecting healthy conditions and 97% accuracy in detecting medium severity levels. The result of the work shows that the proposed approach is valid, and it is efficient for detecting citrus fruit disease at four levels of severity.

Proceedings Article•10.1109/cvpr52688.2022.00273•

SoftGroup for 3D Instance Segmentation on Point Clouds

1 Jun 2022

TL;DR: SoftGroup performs bottom-up soft grouping followed by top-down refinement. It mitigates the problems stemming from semantic prediction errors and suppresses false positive instances by learning to categorize them as background.

Abstract: Existing state-of-the-art 3D instance segmentation methods perform semantic segmentation followed by grouping. Hard predictions are made when performing semantic segmentation, such that each point is associated with a single class. However, the errors stemming from hard decisions propagate into grouping, which results in (1) low overlaps between the predicted instance and the ground truth and (2) substantial false positives. To address the aforementioned problems, this paper proposes a 3D instance segmentation method referred to as SoftGroup by performing bottom-up soft grouping followed by top-down refinement. SoftGroup allows each point to be associated with multiple classes to mitigate the problems stemming from semantic prediction errors and suppresses false positive instances by learning to categorize them as background. Experimental results on different datasets and multiple evaluation metrics demonstrate the efficacy of SoftGroup. Its performance surpasses the strongest prior method by a significant margin of $+6.2\%$ on the ScanNet v2 hidden test set and $+6.8\%$ on S3DIS Area 5 in terms of $AP_{50}$. SoftGroup is also fast, running at 345ms per scan with a single Titan X on ScanNet v2 dataset. The source code and trained models for both datasets are available at https://github.com/thangvubk/SoftGroup.git.
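The difference between a hard semantic decision and the soft association described in the abstract above can be sketched as follows. This is an illustrative assumption about how such an association might look, not the SoftGroup code: here a point is tied to every class whose softmax score exceeds a threshold, and both the threshold value and the toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_assign(logits):
    """Hard decision: the single class with the highest score."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

def soft_assign(logits, threshold):
    """Soft decision: every class whose score exceeds the threshold."""
    probs = softmax(logits)
    return [c for c, p in enumerate(probs) if p > threshold]

# Per-point semantic logits for three points and three classes (toy values).
point_logits = [
    [4.0, 0.0, 0.0],   # confident prediction
    [1.0, 0.9, -2.0],  # ambiguous between classes 0 and 1
    [0.0, 0.2, 0.1],   # ambiguous between all classes
]
hard = [hard_assign(l) for l in point_logits]
soft = [soft_assign(l, threshold=0.25) for l in point_logits]
```

For the second point the hard decision commits to class 0, while the soft association keeps class 1 as a candidate, so a later grouping stage could still recover the point if class 0 turns out to be wrong.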
