Top 3824 papers published in the topic of Object (computer science) in 2021

Showing papers on "Object (computer science)" published in 2021

Proceedings Article•

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford¹, Jong Wook Kim¹, Chris Hallacy¹, Aditya Ramesh¹, Gabriel Goh¹, Sandhini Agarwal¹, Girish Sastry¹, Amanda Askell, Pamela Mishkin¹, Jack Clark¹, Gretchen Krueger¹, Ilya Sutskever¹•Institutions (1)

OpenAI¹

18 Jul 2021

TL;DR: In this paper, a pre-training task of predicting which caption goes with which image is used to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.

Abstract: State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.
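
The pre-training task described in this abstract (predicting which caption goes with which image) can be sketched as a symmetric cross-entropy over an image-text similarity matrix. This is a minimal pure-Python illustration and not the released code: the function names, the toy two-dimensional embeddings, and the temperature value of 0.07 are assumptions made for the example.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy of a softmax over `logits` against the index `target`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def contrastive_caption_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss for a batch where image k is paired with caption k.

    The temperature is a hypothetical fixed value chosen for this sketch.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and caption j
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Each image should pick its own caption (rows) ...
    loss_images = sum(softmax_cross_entropy(logits[k], k) for k in range(n)) / n
    # ... and each caption should pick its own image (columns).
    loss_texts = sum(
        softmax_cross_entropy([logits[r][k] for r in range(n)], k)
        for k in range(n)
    ) / n
    return (loss_images + loss_texts) / 2


matched = contrastive_caption_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_caption_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[0.0, 1.0], [1.0, 0.0]])
print(matched < swapped)
```

The loss is small when each image embedding is closest to its own caption and large when the pairing is swapped, which is the signal the pre-training task relies on.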

3,738 citations

Proceedings Article•10.1109/CVPR46437.2021.01422•

Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

Peize Sun¹, Rufeng Zhang², Yi Jiang, Tao Kong, Chenfeng Xu³, Wei Zhan³, Masayoshi Tomizuka³, Lei Li, Zehuan Yuan, Changhu Wang, Ping Luo¹•Institutions (3)

University of Hong Kong¹, Tongji University², University of California, Berkeley³

1 Jun 2021

TL;DR: Sun et al. propose Sparse R-CNN, a purely sparse method for object detection in images that avoids all effort related to object candidate design and many-to-one label assignment.

Abstract: We present Sparse R-CNN, a purely sparse method for object detection in images. Existing works on object detection heavily rely on dense object candidates, such as k anchor boxes pre-defined on all grids of image feature map of size H × W. In our method, however, a fixed sparse set of learned object proposals, total length of N, are provided to object recognition head to perform classification and location. By eliminating HWk (up to hundreds of thousands) hand-designed object candidates to N (e.g. 100) learnable proposals, Sparse R-CNN completely avoids all efforts related to object candidates design and many-to-one label assignment. More importantly, final predictions are directly output without non-maximum suppression post-procedure. Sparse R-CNN demonstrates accuracy, run-time and training convergence performance on par with the well-established detector baselines on the challenging COCO dataset, e.g., achieving 45.0 AP in standard 3× training schedule and running at 22 fps using ResNet-50 FPN model. We hope our work could inspire re-thinking the convention of dense prior in object detectors. The code is available at: https://github.com/PeizeSun/SparseR-CNN.
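
To make the HWk versus N comparison in this abstract concrete, the following sketch counts dense candidates for a single feature map. The feature-map size (100 × 152) and k = 9 are hypothetical values picked for illustration; only N = 100 comes from the abstract.

```python
def dense_candidate_count(height, width, anchors_per_cell):
    """Dense candidates: k anchor boxes on every cell of an H x W feature map."""
    return height * width * anchors_per_cell


# Hypothetical single-level feature map; real detectors often use several levels.
H, W, k = 100, 152, 9
N = 100  # example number of learnable proposals given in the abstract

dense = dense_candidate_count(H, W, k)
print(dense, N, dense / N)
```

Under these assumed sizes the dense scheme produces over a hundred thousand candidates, consistent with the abstract's "up to hundreds of thousands", against 100 learned proposals.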

901 citations

Proceedings Article•10.1109/CVPR46437.2021.00729•

Dynamic Head: Unifying Object Detection Heads with Attentions

Xiyang Dai¹, Yinpeng Chen¹, Bin Xiao¹, Dongdong Chen¹, Mengchen Liu¹, Lu Yuan¹, Lei Zhang¹•Institutions (1)

Microsoft¹

15 Jun 2021

TL;DR: In this article, a dynamic head framework is proposed to unify object detection heads with attentions, by coherently combining multiple self-attention mechanisms between feature levels for scale-awareness, among spatial locations for spatial-awareness and within output channels for task-awareness.

Abstract: The complex nature of combining localization and classification in object detection has led to a flourishing development of methods. Previous works tried to improve the performance in various object detection heads but failed to present a unified view. In this paper, we present a novel dynamic head framework to unify object detection heads with attentions. By coherently combining multiple self-attention mechanisms between feature levels for scale-awareness, among spatial locations for spatial-awareness, and within output channels for task-awareness, the proposed approach significantly improves the representation ability of object detection heads without any computational overhead. Further experiments demonstrate the effectiveness and efficiency of the proposed dynamic head on the COCO benchmark. With a standard ResNeXt-101-DCN backbone, we largely improve the performance over popular object detectors and achieve a new state-of-the-art at 54.0 AP. The code will be released at https://github.com/microsoft/DynamicHead.

699 citations

Journal Article•10.1109/TIP.2021.3089943•

LayerCAM: Exploring Hierarchical Class Activation Maps for Localization

Peng-Tao Jiang¹, Chang-Bin Zhang¹, Qibin Hou, Ming-Ming Cheng¹, Yunchao Wei²•Institutions (2)

Nankai University¹, Beijing Jiaotong University²

22 Jun 2021-IEEE Transactions on Image Processing

TL;DR: Jiang et al. propose LayerCAM, a simple yet effective method that generates more fine-grained object localization information from class activation maps in order to locate target objects more accurately.

Abstract: The class activation maps are generated from the final convolutional layer of CNN. They can highlight discriminative object regions for the class of interest. These discovered object regions have been widely used for weakly-supervised tasks. However, due to the small spatial resolution of the final convolutional layer, such class activation maps often locate coarse regions of the target objects, limiting the performance of weakly-supervised tasks that need pixel-accurate object locations. Thus, we aim to generate more fine-grained object localization information from the class activation maps to locate the target objects more accurately. In this paper, by rethinking the relationships between the feature maps and their corresponding gradients, we propose a simple yet effective method, called LayerCAM. It can produce reliable class activation maps for different layers of CNN. This property enables us to collect object localization information from coarse (rough spatial localization) to fine (precise fine-grained details) levels. We further integrate them into a high-quality class activation map, where the object-related pixels can be better highlighted. To evaluate the quality of the class activation maps produced by LayerCAM, we apply them to weakly-supervised object localization and semantic segmentation. Experiments demonstrate that the class activation maps generated by our method are more effective and reliable than those by the existing attention methods. The code will be made publicly available.
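
The abstract does not state the formula that relates feature maps to their gradients. The sketch below shows one plausible reading, assumed here for illustration only: each activation is weighted by the positive part of its own gradient, the weighted maps are summed over channels, and the result is rectified. The function name and the nested-list layout are hypothetical.

```python
def relu(value):
    """Keep positive values, zero out the rest."""
    return value if value > 0.0 else 0.0


def layer_cam(activations, gradients):
    """Class activation map for one layer.

    Both arguments are nested lists shaped [channels][rows][cols]; `gradients`
    holds the gradient of the class score with respect to each activation.
    The element-wise positive-gradient weighting is an assumption of this sketch.
    """
    channels = len(activations)
    rows = len(activations[0])
    cols = len(activations[0][0])
    cam = [[0.0] * cols for _ in range(rows)]
    for c in range(channels):
        for i in range(rows):
            for j in range(cols):
                cam[i][j] += relu(gradients[c][i][j]) * activations[c][i][j]
    return [[relu(v) for v in row] for row in cam]


# Two channels on a 1 x 2 feature map.
acts = [[[1.0, 2.0]], [[3.0, 4.0]]]
grads = [[[1.0, -1.0]], [[0.5, 0.0]]]
print(layer_cam(acts, grads))
```

Because the weighting is applied per location, the same function can be run on shallow, high-resolution layers as well as the final layer, which matches the coarse-to-fine use described in the abstract.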

581 citations

Proceedings Article•10.1109/CVPR46437.2021.00577•

Towards Open World Object Detection

K J Joseph¹, Salman Khan², Fahad Shahbaz Khan², Vineeth N Balasubramanian¹•Institutions (2)

Indian Institute of Technology, Hyderabad¹, Zayed University²

3 Mar 2021

TL;DR: In this paper, the authors propose a novel computer vision problem called "Open World Object Detection", where a model is tasked to identify objects that have not been introduced to it as "unknown" and incrementally learn these identified unknown categories without forgetting previously learned classes, when the corresponding labels are progressively received.

Abstract: Humans have a natural instinct to identify unknown object instances in their environments. The intrinsic curiosity about these unknown instances aids in learning about them, when the corresponding knowledge is eventually available. This motivates us to propose a novel computer vision problem called: ‘Open World Object Detection’, where a model is tasked to: 1) identify objects that have not been introduced to it as ‘unknown’, without explicit supervision to do so, and 2) incrementally learn these identified unknown categories without forgetting previously learned classes, when the corresponding labels are progressively received. We formulate the problem, introduce a strong evaluation protocol and provide a novel solution, which we call ORE: Open World Object Detector, based on contrastive clustering and energy based unknown identification. Our experimental evaluation and ablation studies analyse the efficacy of ORE in achieving Open World objectives. As an interesting by-product, we find that identifying and characterising unknown instances helps to reduce confusion in an incremental object detection setting, where we achieve state-of-the-art performance, with no extra methodological effort. We hope that our work will attract further research into this newly identified, yet crucial research direction.

550 citations

Journal Article•10.1109/TPAMI.2021.3085766•

Concealed Object Detection.

Deng-Ping Fan¹, Ge-Peng Ji², Ming-Ming Cheng¹, Ling Shao•Institutions (2)

Nankai University¹, Wuhan University²

01 Jun 2021-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: Fan et al. present the first systematic study on concealed object detection (COD), which aims to identify objects that are "perfectly" embedded in their background, and design a simple but strong baseline for COD, termed the Search Identification Network (SINet).

Abstract: We present the first systematic study on concealed object detection (COD), which aims to identify objects that are "perfectly" embedded in their background. The high intrinsic similarities between the concealed objects and their background make COD far more challenging than traditional object detection/segmentation. To better understand this task, we collect a large-scale dataset, called COD10K, which consists of 10,000 images covering concealed objects in diverse real-world scenarios from 78 object categories. Further, we provide rich annotations including object categories, object boundaries, challenging attributes, object-level labels, and instance-level annotations. Our COD10K enables comprehensive concealed object understanding and can even be used to help progress several other vision tasks, such as detection, segmentation, classification etc. We also design a simple but strong baseline for COD, termed the Search Identification Network (SINet). Without any bells and whistles, SINet outperforms 12 cutting-edge baselines on all datasets tested, making it a robust, general architecture that could serve as a catalyst for future research in COD. Finally, we provide some interesting findings, and highlight several potential applications and future directions. To spark research in this new field, our code, dataset, and online demo are available at our project page: http://mmcheng.net/cod.

505 citations

Proceedings Article•10.1109/CVPR46437.2021.00738•

3D Object Detection with Pointformer

Xuran Pan¹, Zhuofan Xia¹, Shiji Song¹, Li Erran Li², Gao Huang¹•Institutions (2)

Tsinghua University¹, Columbia University²

1 Jun 2021

TL;DR: This paper proposes Pointformer, a Transformer backbone for 3D point clouds that learns features effectively; its Local Transformer module models interactions among points in a local region to learn context-dependent region features at an object level.

Abstract: Feature learning for 3D object detection from point clouds is very challenging due to the irregularity of 3D point cloud data. In this paper, we propose Pointformer, a Transformer backbone designed for 3D point clouds to learn features effectively. Specifically, a Local Transformer module is employed to model interactions among points in a local region, which learns context-dependent region features at an object level. A Global Transformer is designed to learn context-aware representations at the scene level. To further capture the dependencies among multi-scale representations, we propose Local-Global Transformer to integrate local features with global features from higher resolution. In addition, we introduce an efficient coordinate refinement module to shift down-sampled points closer to object centroids, which improves object proposal generation. We use Pointformer as the backbone for state-of-the-art object detection models and demonstrate significant improvements over original models on both indoor and outdoor datasets.

480 citations

Proceedings Article•10.1109/CVPR46437.2021.00727•

FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding

Bo Sun¹, Banghuai Li, Shengcai Cai, Ye Yuan, Chi Zhang•Institutions (1)

University of Southern California¹

20 Jun 2021

TL;DR: In this article, a contrastive proposal encoding loss (CPE loss) was proposed to improve the performance of few-shot object detection by learning contrastive-aware object proposal encodings that facilitate the classification of detected objects.

Abstract: Emerging interests have been brought to recognize previously unseen objects given very few training examples, known as few-shot object detection (FSOD). Recent research demonstrates that good feature embedding is the key to reach favorable few-shot learning performance. We observe object proposals with different Intersection-over-Union (IoU) scores are analogous to the intra-image augmentation used in contrastive visual representation learning. And we exploit this analogy and incorporate supervised contrastive learning to achieve more robust object representations in FSOD. We present Few-Shot object detection via Contrastive proposals Encoding (FSCE), a simple yet effective approach to learning contrastive-aware object proposal encodings that facilitate the classification of detected objects. We notice the degradation of average precision (AP) for rare objects mainly comes from misclassifying novel instances as confusable classes. And we ease the misclassification issues by promoting instance level intra-class compactness and inter-class variance via our contrastive proposal encoding loss (CPE loss). Our design outperforms current state-of-the-art works in any shot and all data splits, with up to +8.8% on standard benchmark PASCAL VOC and +2.7% on challenging COCO benchmark. Code is available at: https://github.com/MegviiDetection/FSCE.
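
The IoU score that this abstract uses to compare object proposals is the area of overlap between two boxes divided by the area of their union. A minimal sketch follows; the function name and the (x1, y1, x2, y2) corner format are assumptions made for the example, not taken from the FSCE code.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0.0, 0.0, 2.0, 2.0)
proposal = (1.0, 1.0, 3.0, 3.0)
print(iou(ground_truth, proposal))
```

Proposals that overlap the same ground-truth box with different IoU values behave like differently cropped views of one object, which is the analogy to augmentation that the abstract draws.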

471 citations

Journal Article•10.1109/TPAMI.2019.2929520•

Deep Affinity Network for Multiple Object Tracking

Shijie Sun¹, Naveed Akhtar², Huansheng Song¹, Ajmal Mian², Mubarak Shah³•Institutions (3)

Chang'an University¹, University of Western Australia², University of Central Florida³

01 Jan 2021-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The proposed Deep Affinity Network (DAN) learns compact, yet comprehensive features of pre-detected objects at several levels of abstraction, and performs exhaustive pairing permutations of those features in any two frames to infer object affinities.

Abstract: Multiple Object Tracking (MOT) plays an important role in solving many fundamental problems in video analysis and computer vision. Most MOT methods employ two steps: Object Detection and Data Association. The first step detects objects of interest in every frame of a video, and the second establishes correspondence between the detected objects in different frames to obtain their tracks. Object detection has made tremendous progress in the last few years due to deep learning. However, data association for tracking still relies on hand crafted constraints such as appearance, motion, spatial proximity, grouping etc. to compute affinities between the objects in different frames. In this paper, we harness the power of deep learning for data association in tracking by jointly modeling object appearances and their affinities between different frames in an end-to-end fashion. The proposed Deep Affinity Network (DAN) learns compact, yet comprehensive features of pre-detected objects at several levels of abstraction, and performs exhaustive pairing permutations of those features in any two frames to infer object affinities. DAN also accounts for multiple objects appearing and disappearing between video frames. We exploit the resulting efficient affinity computations to associate objects in the current frame deep into the previous frames for reliable on-line tracking. Our technique is evaluated on popular multiple object tracking challenges MOT15, MOT17 and UA-DETRAC. Comprehensive benchmarking under twelve evaluation metrics demonstrates that our approach is among the best performing techniques on the leader board for these challenges. The open source implementation of our work is available at https://github.com/shijieS/SST.git .

433 citations

Posted Content•

Learning Transferable Visual Models From Natural Language Supervision

OpenAI¹

26 Feb 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, a pre-training task of predicting which caption goes with which image is used to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.

426 citations

Posted Content•

MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding.

Aishwarya Kamath¹, Mannat Singh², Yann LeCun², Ishan Misra², Gabriel Synnaeve², Nicolas Carion³•Institutions (3)

New York University¹, Facebook², Paris Dauphine University³

26 Apr 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: In this article, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question, is proposed.

Abstract: Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We pre-train the network on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. We then fine-tune on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation, achieving state-of-the-art results on popular benchmarks. We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are available at this https URL.

Proceedings Article•10.1109/CVPR46437.2021.00014•

HOTR: End-to-End Human-Object Interaction Detection with Transformers

Bumsoo Kim, Junhyun Lee¹, Jaewoo Kang¹, Eun-Sol Kim, Hyunwoo Kim¹•Institutions (1)

Korea University¹

20 Jun 2021

TL;DR: Kim et al. present HOTR, a novel framework that directly predicts a set of 〈human, object, interaction〉 triplets from an image using a transformer encoder-decoder architecture.

Abstract: Human-Object Interaction (HOI) detection is the task of identifying "a set of interactions" in an image, which involves i) the localization of the subject (i.e., humans) and target (i.e., objects) of interaction, and ii) the classification of the interaction labels. Most existing methods have indirectly addressed this task by detecting human and object instances and individually inferring every pair of the detected instances. In this paper, we present a novel framework, referred to as HOTR, which directly predicts a set of 〈human, object, interaction〉 triplets from an image based on a transformer encoder-decoder architecture. Through the set prediction, our method effectively exploits the inherent semantic relationships in an image and does not require time-consuming post-processing, which is the main bottleneck of existing methods. Our proposed algorithm achieves state-of-the-art performance in two HOI detection benchmarks with an inference time under 1 ms after object detection.

Proceedings Article•10.1109/CVPR46437.2021.00330•

Objects are Different: Flexible Monocular 3D Object Detection

Yunpeng Zhang¹, Jiwen Lu¹, Jie Zhou¹•Institutions (1)

Tsinghua University¹

6 Apr 2021

TL;DR: Zhang et al. propose a flexible framework for monocular 3D object detection which explicitly decouples the truncated objects and adaptively combines multiple approaches for object depth estimation.

Abstract: The precise localization of 3D objects from a single image without depth information is a highly challenging problem. Most existing methods adopt the same approach for all objects regardless of their diverse distributions, leading to limited performance for truncated objects. In this paper, we propose a flexible framework for monocular 3D object detection which explicitly decouples the truncated objects and adaptively combines multiple approaches for object depth estimation. Specifically, we decouple the edge of the feature map for predicting long-tail truncated objects so that the optimization of normal objects is not influenced. Furthermore, we formulate the object depth estimation as an uncertainty-guided ensemble of directly regressed object depth and solved depths from different groups of keypoints. Experiments demonstrate that our method outperforms the state-of-the-art method by relatively 27% for the moderate level and 30% for the hard level in the test set of KITTI benchmark while maintaining real-time efficiency. Code will be available at https://github.com/zhangyp15/MonoFlex.
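
The abstract describes the final depth as an uncertainty-guided ensemble of several depth estimates but does not say how the estimates are weighted. The sketch below assumes inverse-uncertainty weighting, so that estimates with lower predicted uncertainty count more; the function name and the sample values are hypothetical.

```python
def combine_depths(depths, uncertainties):
    """Weighted average of depth estimates, each weighted by 1 / uncertainty.

    Inverse-uncertainty weighting is an assumption of this sketch; the
    abstract only states that the ensemble is guided by uncertainty.
    """
    weights = [1.0 / sigma for sigma in uncertainties]
    weighted_sum = sum(w * z for w, z in zip(weights, depths))
    return weighted_sum / sum(weights)


# One directly regressed depth and one keypoint-based depth (in metres).
print(combine_depths([10.0, 20.0], [1.0, 1.0]))  # equal trust in both
print(combine_depths([10.0, 20.0], [1.0, 3.0]))  # second estimate trusted less
```

With equal uncertainties the result is the plain mean; raising the uncertainty of one estimate pulls the combined depth toward the other.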

Journal Article•10.1016/J.RSE.2021.112636•

Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to man-made disasters

Zhuo Zheng¹, Yanfei Zhong¹, Junjue Wang¹, Ailong Ma¹, Liangpei Zhang¹•Institutions (1)

Wuhan University¹

01 Nov 2021-Remote Sensing of Environment

TL;DR: A deep object-based semantic change detection framework, called ChangeOS, is proposed for building damage assessment that is superior to the currently published methods in speed and accuracy, and has a superior generalization ability for man-made disasters.

Proceedings Article•10.1109/CVPR46437.2021.00893•

DexYCB: A Benchmark for Capturing Hand Grasping of Objects

Yu-Wei Chao¹, Wei Yang¹, Yu Xiang¹, Pavlo Molchanov¹, Ankur Handa¹, Jonathan Tremblay¹, Yashraj S. Narang¹, Karl Van Wyk¹, Umar Iqbal¹, Stan Birchfield¹, Jan Kautz¹, Dieter Fox¹•Institutions (1)

Nvidia¹

1 Jun 2021

TL;DR: This paper introduces DexYCB, a new dataset for capturing hand grasping of objects, and benchmarks state-of-the-art approaches on three tasks: 2D object and keypoint detection, 6D object pose estimation, and 3D hand pose estimation.

Abstract: We introduce DexYCB, a new dataset for capturing hand grasping of objects. We first compare DexYCB with a related one through cross-dataset evaluation. We then present a thorough benchmark of state-of-the-art approaches on three relevant tasks: 2D object and keypoint detection, 6D object pose estimation, and 3D hand pose estimation. Finally, we evaluate a new robotics-relevant task: generating safe robot grasps in human-to-robot object handover.

Journal Article•10.1007/S10462-020-09888-5•

Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review

Guoguang Du, Kai Wang, Shiguo Lian, Kaiyong Zhao

01 Mar 2021-Artificial Intelligence Review

TL;DR: This review identifies three key tasks in vision-based robotic grasping: object localization, object pose estimation, and grasp estimation, the last of which covers 2D planar grasp methods and 6DoF grasp methods.

Abstract: This paper presents a comprehensive survey on vision-based robotic grasping. We identify three key tasks in vision-based robotic grasping: object localization, object pose estimation and grasp estimation. In detail, the object localization task covers object localization without classification, object detection and object instance segmentation. This task provides the regions of the target object in the input data. The object pose estimation task mainly refers to estimating the 6D object pose and includes correspondence-based methods, template-based methods and voting-based methods, which enable the generation of grasp poses for known objects. The grasp estimation task includes 2D planar grasp methods and 6DoF grasp methods, where the former is constrained to grasp from one direction. These three tasks can accomplish robotic grasping in different combinations. Many object pose estimation methods do not need a separate object localization step and instead perform object localization and object pose estimation jointly. Many grasp estimation methods need neither object localization nor object pose estimation and perform grasp estimation in an end-to-end manner. Both traditional methods and the latest deep learning-based methods based on RGB-D image inputs are reviewed in detail in this survey. Related datasets and comparisons between state-of-the-art methods are summarized as well. In addition, challenges in vision-based robotic grasping and future directions for addressing these challenges are pointed out.

Journal Article•10.1109/TITS.2021.3096854•

A Review and Comparative Study on Probabilistic Object Detection in Autonomous Driving

Di Feng¹, Ali Harakeh, Steven L. Waslander, Klaus Dietmayer•Institutions (1)

University of Ulm¹

12 Jul 2021-IEEE Transactions on Intelligent Transportation Systems

TL;DR: An overview of generic uncertainty estimation in deep learning is provided, and a strict comparative study on existing probabilistic object detection methods for autonomous driving applications is presented.

Abstract: Capturing uncertainty in object detection is indispensable for safe autonomous driving. In recent years, deep learning has become the de-facto approach for object detection, and many probabilistic object detectors have been proposed. However, there is no summary on uncertainty estimation in deep object detection, and existing methods are not only built with different network architectures and uncertainty estimation methods, but also evaluated on different datasets with a wide range of evaluation metrics. As a result, a comparison among methods remains challenging, as does the selection of a model that best suits a particular application. This paper aims to alleviate this problem by providing a review and comparative study on existing probabilistic object detection methods for autonomous driving applications. First, we provide an overview of generic uncertainty estimation in deep learning, and then systematically survey existing methods and evaluation metrics for probabilistic object detection. Next, we present a strict comparative study for probabilistic object detection based on an image detector and three public autonomous driving datasets. Finally, we present a discussion of the remaining challenges and future works. Code has been made available at this https URL

Journal Article•10.1007/S11263-021-01465-9•

OCNet: Object Context for Semantic Segmentation

Yuhui Yuan¹, Yuhui Yuan², Lang Huang³, Jianyuan Guo³, Chao Zhang³, Xilin Chen¹, Jingdong Wang²•Institutions (3)

Chinese Academy of Sciences¹, Microsoft², Peking University³

24 May 2021-International Journal of Computer Vision

TL;DR: This paper proposes an efficient interlaced sparse self-attention scheme to model the dense relations between any two of all pixels via the combination of two sparse relation matrices and empirically shows the advantages of this approach with competitive performances on five challenging benchmarks.

Abstract: In this paper, we address the semantic segmentation task with a new context aggregation scheme named object context, which focuses on enhancing the role of object information. Motivated by the fact that the category of each pixel is inherited from the object it belongs to, we define the object context for each pixel as the set of pixels that belong to the same category as the given pixel in the image. We use a binary relation matrix to represent the relationship between all pixels, where the value one indicates the two selected pixels belong to the same category and zero otherwise. We propose to use a dense relation matrix to serve as a surrogate for the binary relation matrix. The dense relation matrix is capable to emphasize the contribution of object information as the relation scores tend to be larger on the object pixels than the other pixels. Considering that the dense relation matrix estimation requires quadratic computation overhead and memory consumption w.r.t. the input size, we propose an efficient interlaced sparse self-attention scheme to model the dense relations between any two of all pixels via the combination of two sparse relation matrices. To capture richer context information, we further combine our interlaced sparse self-attention scheme with the conventional multi-scale context schemes including pyramid pooling (Zhao et al. 2017) and atrous spatial pyramid pooling (Chen et al. 2018). We empirically show the advantages of our approach with competitive performances on five challenging benchmarks including: Cityscapes, ADE20K, LIP, PASCAL-Context and COCO-Stuff.
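
The binary relation matrix defined in this abstract can be built directly from a label map: entry (i, j) is one when pixels i and j share a category and zero otherwise. The sketch below uses a flattened list of per-pixel labels; the function names and the toy labels are hypothetical, and ground-truth labels are assumed to be available, which is not the case at inference time (hence the learned dense surrogate the abstract describes).

```python
def binary_relation_matrix(labels):
    """N x N matrix with 1 where two pixels share a category, else 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


def object_context(labels, pixel):
    """Indices of all pixels in the same category as `pixel`."""
    return [j for j, label in enumerate(labels) if label == labels[pixel]]


# A flattened 2 x 2 image: three pixels of category 0, one of category 1.
labels = [0, 0, 1, 0]
for row in binary_relation_matrix(labels):
    print(row)
print(object_context(labels, 0))
```

The matrix has one entry per pixel pair, so its size grows quadratically with the number of pixels, which is the cost that motivates the interlaced sparse self-attention scheme in the abstract.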

Proceedings Article•10.1109/CVPR46437.2021.00994•

Uncertainty-aware Joint Salient Object and Camouflaged Object Detection

Aixuan Li¹, Jing Zhang², Yunqiu Lv¹, Bowen Liu¹, Tong Zhang³, Yuchao Dai¹•Institutions (3)

Northwestern Polytechnical University¹, Australian National University², École Polytechnique Fédérale de Lausanne³

1 Jun 2021

TL;DR: This paper leverages the contradictory information between salient object detection and camouflaged object detection to enhance the detection ability of both tasks, and proposes an adversarial learning network to achieve both a higher-order similarity measure and network confidence estimation.

Abstract: Visual salient object detection (SOD) aims at finding the salient object(s) that attract human attention, while camouflaged object detection (COD), on the contrary, intends to discover the camouflaged object(s) that are hidden in their surroundings. In this paper, we propose a paradigm of leveraging the contradictory information to enhance the detection ability of both salient object detection and camouflaged object detection. We start by exploiting the easy positive samples in the COD dataset to serve as hard positive samples in the SOD task to improve the robustness of the SOD model. Then, we introduce a "similarity measure" module to explicitly model the contradicting attributes of these two tasks. Furthermore, considering the uncertainty of labeling in both tasks’ datasets, we propose an adversarial learning network to achieve both higher-order similarity measure and network confidence estimation. Experimental results on benchmark datasets demonstrate that our solution leads to state-of-the-art (SOTA) performance for both tasks.

Journal Article•10.1007/S41095-020-0199-Z•

RGB-D salient object detection: A survey.

Tao Zhou, Deng-Ping Fan, Ming-Ming Cheng¹, Jianbing Shen, Ling Shao•Institutions (1)

Nankai University¹

07 Jan 2021-Computational Visual Media

TL;DR: This paper provides a comprehensive survey of RGB-D based salient object detection models from various perspectives, reviews related benchmark datasets in detail, and carries out a comprehensive attribute-based evaluation of several representative RGB-D based salient object detection models.

Abstract: Salient object detection, which simulates human visual perception in locating the most significant object(s) in a scene, has been widely applied to various computer vision tasks. Now, the advent of depth sensors means that depth maps can easily be captured; this additional spatial information can boost the performance of salient object detection. Although various RGB-D based salient object detection models with promising performance have been proposed over the past several years, an in-depth understanding of these models and the challenges in this field remains lacking. In this paper, we provide a comprehensive survey of RGB-D based salient object detection models from various perspectives, and review related benchmark datasets in detail. Further, as light fields can also provide depth maps, we review salient object detection models and popular benchmark datasets from this domain too. Moreover, to investigate the ability of existing models to detect salient objects, we have carried out a comprehensive attribute-based evaluation of several representative RGB-D based salient object detection models. Finally, we discuss several challenges and open directions of RGB-D based salient object detection for future research. All collected models, benchmark datasets, datasets constructed for attribute-based evaluation, and related code are publicly available at https://github.com/taozh2017/RGBD-SODsurvey.

Journal Article•10.1007/S10489-021-02293-7•

Deep learning in multi-object detection and tracking: state of the art

Sankar K. Pal¹, Anima Pramanik², Jhareswar Maiti², Pabitra Mitra²•Institutions (2)

Indian Statistical Institute¹, Indian Institute of Technology Kharagpur²

09 Apr 2021-Applied Intelligence

TL;DR: In this article, the authors provide a comprehensive overview of object detection and tracking using deep learning (DL) networks and compare the performance of different object detectors and trackers, including the recent development in granulated DL models.

Abstract: Object detection and tracking is one of the most important and challenging branches of computer vision, and has been widely applied in various fields, such as health-care monitoring, autonomous driving, anomaly detection, and so on. With the rapid development of deep learning (DL) networks and GPU computing power, the performance of object detectors and trackers has been greatly improved. To understand the main development status of the object detection and tracking pipeline thoroughly, in this survey, we have critically analyzed the existing DL network-based methods of object detection and tracking and described various benchmark datasets. This includes the recent development in granulated DL models. Primarily, we have provided a comprehensive overview of a variety of both generic object detection and specific object detection models. We have listed various comparative results for obtaining the best detector, tracker, and their combination. Moreover, we have listed the traditional and new applications of object detection and tracking, showing its developmental trends. Finally, challenging issues in the said domain, including the relevance of granular computing, are elaborated as a future scope of research, together with some concerns. An extensive bibliography is also provided.

Proceedings Article•10.1109/CVPR46437.2021.01023•

ST3D: Self-training for Unsupervised Domain Adaptation on 3D Object Detection

Jihan Yang¹, Shaoshuai Shi², Zhe Wang³, Hongsheng Li², Xiaojuan Qi¹•Institutions (3)

University of Hong Kong¹, The Chinese University of Hong Kong², SenseTime³

1 Jun 2021

TL;DR: This paper presents ST3D, a domain adaptive self-training pipeline for unsupervised domain adaptation on 3D object detection from point clouds, which first pre-trains the 3D detector on the source domain with a random object scaling strategy to mitigate the negative effects of source domain bias.

Abstract: We present a new domain adaptive self-training pipeline, named ST3D, for unsupervised domain adaptation on 3D object detection from point clouds. First, we pre-train the 3D detector on the source domain with our proposed random object scaling strategy for mitigating the negative effects of source domain bias. Then, the detector is iteratively improved on the target domain by alternately conducting two steps: pseudo label updating with the developed quality-aware triplet memory bank, and model training with curriculum data augmentation. These specific designs for 3D object detection enable the detector to be trained with consistent and high-quality pseudo labels and to avoid overfitting to the large number of easy examples in pseudo labeled data. Our ST3D achieves state-of-the-art performance on all evaluated datasets and even surpasses fully supervised results on the KITTI 3D object detection benchmark. Code will be available at https://github.com/CVMI-Lab/ST3D.

Posted Content•

Anchor DETR: Query Design for Transformer-Based Detector.

Yingming Wang, Xiangyu Zhang, Tong Yang, Jian Sun

15 Sep 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes a novel query design for transformer-based detectors in which each object query is based on an anchor point and focuses on the objects near it, and which can predict multiple objects at one position to handle the "one region, multiple objects" difficulty.

Abstract: In this paper, we propose a novel query design for transformer-based detectors. In previous transformer-based detectors, the object queries are a set of learned embeddings. However, each learned embedding does not have an explicit physical meaning, and we cannot explain where it will focus. It is also difficult to optimize, as the prediction slot of each object query does not have a specific mode. In other words, each object query will not focus on a specific region. To solve these problems, in our query design, object queries are based on anchor points, which are widely used in CNN-based detectors, so each object query focuses on the objects near its anchor point. Moreover, our query design can predict multiple objects at one position to solve the difficulty of "one region, multiple objects". In addition, we design an attention variant, which can reduce the memory cost while achieving similar or better performance than the standard attention in DETR. Thanks to the query design and the attention variant, the proposed detector, which we call Anchor DETR, can achieve better performance and run faster than DETR with 10$\times$ fewer training epochs. For example, it achieves 44.2 AP with 16 FPS on the MSCOCO dataset when using the ResNet50-DC5 feature and training for 50 epochs. Extensive experiments on the MSCOCO benchmark prove the effectiveness of the proposed methods. Code is available at this https URL.

Posted Content•

TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization.

Wei Gao¹, Fang Wan¹, Xingjia Pan¹, Zhiliang Peng, Qi Tian², Zhenjun Han¹, Bolei Zhou³, Qixiang Ye¹•Institutions (3)

Chinese Academy of Sciences¹, Huawei², The Chinese University of Hong Kong³

27 Mar 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper introduces the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in visual transformer for long-range dependency extraction and achieves state-of-the-art performance.

Abstract: Weakly supervised object localization (WSOL) is a challenging problem in which only image category labels are given but object localization models must be learned. Optimizing a convolutional neural network (CNN) for classification tends to activate local discriminative regions while ignoring the complete object extent, causing the partial activation issue. In this paper, we argue that partial activation is caused by the intrinsic characteristics of CNNs, where the convolution operations produce local receptive fields and have difficulty capturing long-range feature dependency among pixels. We introduce the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in the visual transformer for long-range dependency extraction. TS-CAM first splits an image into a sequence of patch tokens for spatial embedding, which produce attention maps of long-range visual dependency to avoid partial activation. TS-CAM then re-allocates category-related semantics for patch tokens, enabling each of them to be aware of object categories. TS-CAM finally couples the patch tokens with the semantic-agnostic attention map to achieve semantic-aware localization. Experiments on the ILSVRC/CUB-200-2011 datasets show that TS-CAM outperforms its CNN-CAM counterparts by 7.1%/27.1% for WSOL, achieving state-of-the-art performance.
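
The final coupling step can be pictured with a rough sketch. It assumes that coupling means an element-wise product of a per-patch semantic map and a semantic-agnostic attention map, followed by thresholding to get a box; the abstract does not spell out the operation, so this is an illustrative guess rather than the authors' implementation. All names, the threshold, and the toy 3 x 3 maps are hypothetical.

```python
def couple_maps(semantic_map, attention_map):
    """Element-wise product of two equally sized 2D maps, min-max scaled to [0, 1]."""
    coupled = [[s * a for s, a in zip(s_row, a_row)]
               for s_row, a_row in zip(semantic_map, attention_map)]
    flat = [v for row in coupled for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant map: nothing stands out
        return [[0.0 for _ in row] for row in coupled]
    return [[(v - lo) / (hi - lo) for v in row] for row in coupled]


def bounding_box(score_map, threshold=0.5):
    """Tightest (row_min, col_min, row_max, col_max) box around cells scoring >= threshold."""
    cells = [(r, c) for r, row in enumerate(score_map)
             for c, v in enumerate(row) if v >= threshold]
    if not cells:
        return None
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return (min(rows), min(cols), max(rows), max(cols))


# Toy 3 x 3 patch grids; the values are made up for illustration.
semantic_map = [[0.1, 0.2, 0.1],
                [0.2, 0.9, 0.8],
                [0.1, 0.7, 0.9]]
attention_map = [[0.2, 0.1, 0.1],
                 [0.1, 0.8, 0.9],
                 [0.1, 0.9, 1.0]]
localization_map = couple_maps(semantic_map, attention_map)
box = bounding_box(localization_map)
```

In this toy case only the lower-right 2 x 2 patches score highly in both maps, so the box covers rows 1 to 2 and columns 1 to 2 of the patch grid.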

Proceedings Article•10.1109/CVPR46437.2021.00267•

BBAM: Bounding Box Attribution Map for Weakly Supervised Semantic and Instance Segmentation

Jungbeom Lee¹, Jihun Yi¹, Chaehun Shin¹, Sungroh Yoon¹•Institutions (1)

Seoul National University¹

1 Jun 2021

TL;DR: In this paper, a bounding-box attribution map (BBAM) was proposed to identify the target object in its bounding box and thus serve as pseudo ground truth for weakly supervised semantic and instance segmentation.

Abstract: Weakly supervised segmentation methods using bounding box annotations focus on obtaining a pixel-level mask from each box containing an object. Existing methods typically depend on a class-agnostic mask generator, which operates on the low-level information intrinsic to an image. In this work, we utilize higher-level information from the behavior of a trained object detector, by seeking the smallest areas of the image from which the object detector produces almost the same result as it does from the whole image. These areas constitute a bounding-box attribution map (BBAM), which identifies the target object in its bounding box and thus serves as pseudo ground-truth for weakly supervised semantic and instance segmentation. This approach significantly outperforms recent comparable techniques on both the PASCAL VOC and MS COCO benchmarks in weakly supervised semantic and instance segmentation. In addition, we provide a detailed analysis of our method, offering deeper insight into the behavior of the BBAM.

Proceedings Article•10.1109/CVPR46437.2021.00134•

Efficient Regional Memory Network for Video Object Segmentation

Haozhe Xie¹, Hongxun Yao¹, Shangchen Zhou², Shengping Zhang¹, Wenxiu Sun³•Institutions (3)

Harbin Institute of Technology¹, Nanyang Technological University², SenseTime³

1 Jun 2021

TL;DR: In this paper, the precise regional memory is constructed by memorizing local regions where the target objects appear in the past frames, and the query regions are tracked and predicted based on the optical flow estimated from the previous frame.

Abstract: Recently, several Space-Time Memory based networks have shown that the object cues (e.g. video frames as well as the segmented object masks) from the past frames are useful for segmenting objects in the current frame. However, these methods exploit the information from the memory by global-to-global matching between the current and past frames, which leads to mismatching to similar objects and high computational complexity. To address these problems, we propose a novel local-to-local matching solution for semi-supervised VOS, namely Regional Memory Network (RMNet). In RMNet, the precise regional memory is constructed by memorizing local regions where the target objects appear in the past frames. For the current query frame, the query regions are tracked and predicted based on the optical flow estimated from the previous frame. The proposed local-to-local matching effectively alleviates the ambiguity of similar objects in both memory and query frames, which allows the information to be passed from the regional memory to the query region efficiently and effectively. Experimental results indicate that the proposed RMNet performs favorably against state-of-the-art methods on the DAVIS and YouTube-VOS datasets.

Journal Article•10.1007/S11263-020-01393-0•

MOTChallenge: A Benchmark for Single-Camera Multiple Target Tracking

Patrick Dendorfer¹, Aljosa Osep¹, Anton Milan², Konrad Schindler³, Daniel Cremers¹, Ian Reid⁴, Stefan Roth⁵, Laura Leal-Taixé¹•Institutions (5)

Technische Universität München¹, Amazon.com², ETH Zurich³, University of Adelaide⁴, Technische Universität Darmstadt⁵

01 Apr 2021-International Journal of Computer Vision

TL;DR: This paper presents MOTChallenge, a benchmark for single-camera multiple object tracking (MOT) launched in late 2014 to collect existing and new data and to create a framework for the standardized evaluation of multiple object tracking methods.

Abstract: Standardized benchmarks have been crucial in pushing the performance of computer vision algorithms, especially since the advent of deep learning. Although leaderboards should not be over-claimed, they often provide the most objective measure of performance and are therefore important guides for research. We present MOTChallenge, a benchmark for single-camera Multiple Object Tracking (MOT) launched in late 2014, to collect existing and new data and create a framework for the standardized evaluation of multiple object tracking methods. The benchmark is focused on multiple people tracking, since pedestrians are by far the most studied object in the tracking community, with applications ranging from robot navigation to self-driving cars. This paper collects the first three releases of the benchmark: (i) MOT15, along with numerous state-of-the-art results that were submitted in the last years, (ii) MOT16, which contains new challenging videos, and (iii) MOT17, which extends MOT16 sequences with more precise labels and evaluates tracking performance on three different object detectors. The second and third releases not only offer a significant increase in the number of labeled boxes, but also provide labels for multiple object classes besides pedestrians, as well as the level of visibility for every single object of interest. We finally provide a categorization of state-of-the-art trackers and a broad error analysis. This will help newcomers understand the related work and research trends in the MOT community, and hopefully shed some light on potential future research directions.

Proceedings Article•10.1109/CVPR46437.2021.01005•

Dense Relation Distillation with Context-aware Aggregation for Few-Shot Object Detection

Hanzhe Hu¹, Shuai Bai², Aoxue Li¹, Jinshi Cui¹, Liwei Wang¹•Institutions (2)

Peking University¹, Beijing University of Posts and Telecommunications²

20 Jun 2021

TL;DR: This paper proposes Dense Relation Distillation with Context-aware Aggregation (DCNet) to tackle few-shot object detection, in which a detector must adapt to novel classes with only a few annotated examples.

Abstract: Conventional deep learning based methods for object detection require a large amount of bounding box annotations for training, and such high-quality annotated data is expensive to obtain. Few-shot object detection, which learns to adapt to novel classes with only a few annotated examples, is very challenging since the fine-grained features of novel objects can be easily overlooked with only a few data available. In this work, aiming to fully exploit the features of annotated novel objects and capture fine-grained features of the query object, we propose Dense Relation Distillation with Context-aware Aggregation (DCNet) to tackle the few-shot detection problem. Built on the meta-learning based framework, the Dense Relation Distillation module aims to fully exploit support features, where support features and query features are densely matched, covering all spatial locations in a feed-forward fashion. The abundant usage of the guidance information endows the model with the capability to handle common challenges such as appearance changes and occlusions. Moreover, to better capture scale-aware features, the Context-aware Aggregation module adaptively harnesses features from different scales for a more comprehensive feature representation. Extensive experiments illustrate that our proposed approach achieves state-of-the-art results on the PASCAL VOC and MS COCO datasets. Code will be made available at https://github.com/hzhupku/DCNet.

Proceedings Article•10.1109/CVPR46437.2021.01416•

Open-Vocabulary Object Detection Using Captions

Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, Shih-Fu Chang¹•Institutions (1)

Columbia University¹

1 Jun 2021

TL;DR: This paper formulates open-vocabulary object detection and proposes a method that trains object detectors using bounding box annotations for a limited set of object categories, as well as image-caption pairs that cover a larger variety of objects at a significantly lower cost.

Abstract: Despite the remarkable accuracy of deep neural networks in object detection, they are costly to train and scale due to supervision requirements. Particularly, learning more object categories typically requires proportionally more bounding box annotations. Weakly supervised and zero-shot learning techniques have been explored to scale object detectors to more categories with less supervision, but they have not been as successful and widely adopted as supervised models. In this paper, we put forth a novel formulation of the object detection problem, namely open-vocabulary object detection, which is more general, more practical, and more effective than weakly supervised and zero-shot approaches. We propose a new method to train object detectors using bounding box annotations for a limited set of object categories, as well as image-caption pairs that cover a larger variety of objects at a significantly lower cost. We show that the proposed method can detect and localize objects for which no bounding box annotation is provided during training, at a significantly higher accuracy than zero-shot approaches. Meanwhile, objects with bounding box annotation can be detected almost as accurately as supervised methods, which is significantly better than weakly supervised baselines. Accordingly, we establish a new state of the art for scalable object detection.

Posted Content•

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui¹•Institutions (1)

Google¹

28 Apr 2021-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes ViLD, which distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student) whose region embeddings are aligned with the text and image embeddings inferred by the teacher.

Abstract: We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. Existing object detection datasets only contain hundreds of categories, and it is costly to scale further. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories not seen during training. ViLD obtains 16.1 mask AP$_r$, even outperforming the supervised counterpart by 3.8 with a ResNet-50 backbone. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP$_{50}$, 36.6 AP and 11.8 AP on PASCAL VOC, COCO and Objects365, respectively. On COCO, ViLD outperforms previous SOTA by 4.8 on novel AP and 11.4 on overall AP.
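
How a region embedding aligned with text embeddings can be scored against arbitrary category names is sketched below. The abstract does not state the scoring rule; cosine similarity followed by a temperature-scaled softmax is an assumption here, a common choice for aligned image-text embeddings. The function names, the temperature value, and the 3-dimensional toy embeddings are hypothetical, and in practice the text embeddings would come from the teacher's text encoder.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def classify_region(region_embedding, text_embeddings, temperature=0.01):
    """Softmax over cosine similarities between one region and each category text.

    text_embeddings maps a category name to its embedding. The temperature
    value is a made-up illustration, not a setting taken from the paper.
    """
    region = normalize(region_embedding)
    names = list(text_embeddings)
    logits = [
        sum(r * t for r, t in zip(region, normalize(text_embeddings[name]))) / temperature
        for name in names
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


# Toy embeddings standing in for teacher text embeddings.
text_embeddings = {"cat": [1.0, 0.0, 0.0],
                   "dog": [0.0, 1.0, 0.0],
                   "zebra": [0.0, 0.0, 1.0]}
probabilities = classify_region([0.1, 0.2, 0.9], text_embeddings)
predicted = max(probabilities, key=probabilities.get)
```

Adding a new category under this scheme only requires adding another text embedding to the dictionary, with no retraining of the classifier head.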
