Scispace (Formerly Typeset)
Showing papers on "Object (computer science)" published in 2021
Proceedings Article•
Learning Transferable Visual Models From Natural Language Supervision

Alec Radford1, Jong Wook Kim1, Chris Hallacy1, Aditya Ramesh1, Gabriel Goh1, Sandhini Agarwal1, Girish Sastry1, Amanda Askell, Pamela Mishkin1, Jack Clark1, Gretchen Krueger1, Ilya Sutskever1 •
OpenAI1
18 Jul 2021
TL;DR: In this paper, a pre-training task of predicting which caption goes with which image is used to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
Abstract: State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.

3,738 citations
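
The pre-training task in the abstract above (predicting which caption goes with which image) can be illustrated with a symmetric contrastive loss over a batch of paired embeddings. The following is a minimal sketch under stated assumptions, not the released implementation: the function names, the plain-list embeddings, and the fixed `temperature=0.07` are all illustrative choices that do not appear in this listing. The second helper shows zero-shot transfer as nearest-caption lookup.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Row k of image_emb is assumed to be paired with row k of text_emb.
    The temperature value is a hypothetical fixed constant for this sketch.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    # Image-to-text direction: each image should pick its own caption.
    loss_images = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: each caption should pick its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_texts = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (loss_images + loss_texts) / 2


def zero_shot_predict(image_vec, class_text_emb):
    """Return the index of the class prompt embedding closest to the image."""
    img = l2_normalize(image_vec)
    scores = [
        sum(a * b for a, b in zip(img, l2_normalize(t))) for t in class_text_emb
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

With correctly paired embeddings the loss is close to zero, and it grows when the captions are shuffled, which is the signal the pre-training task relies on.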

Proceedings Article•10.1109/CVPR46437.2021.01422•
Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

Peize Sun1, Rufeng Zhang2, Yi Jiang, Tao Kong, Chenfeng Xu3, Wei Zhan3, Masayoshi Tomizuka3, Lei Li, Zehuan Yuan, Changhu Wang, Ping Luo1 •
University of Hong Kong1, Tongji University2, University of California, Berkeley3
1 Jun 2021
TL;DR: Sun et al. propose Sparse R-CNN, a purely sparse method for object detection in images that avoids all effort related to object candidate design and many-to-one label assignment.
Abstract: We present Sparse R-CNN, a purely sparse method for object detection in images. Existing works on object detection heavily rely on dense object candidates, such as k anchor boxes pre-defined on all grids of image feature map of size H × W. In our method, however, a fixed sparse set of learned object proposals, total length of N, are provided to object recognition head to perform classification and location. By eliminating HWk (up to hundreds of thousands) hand-designed object candidates to N (e.g. 100) learnable proposals, Sparse R-CNN completely avoids all efforts related to object candidates design and many-to-one label assignment. More importantly, final predictions are directly output without non-maximum suppression post-procedure. Sparse R-CNN demonstrates accuracy, run-time and training convergence performance on par with the well-established detector baselines on the challenging COCO dataset, e.g., achieving 45.0 AP in standard 3× training schedule and running at 22 fps using ResNet-50 FPN model. We hope our work could inspire re-thinking the convention of dense prior in object detectors. The code is available at: https://github.com/PeizeSun/SparseR-CNN.

901 citations
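
The Sparse R-CNN abstract above contrasts HWk dense candidates with N learnable proposals. A small arithmetic sketch makes the scale of that reduction concrete. The feature-map size and anchor count below are hypothetical values chosen for illustration; only N = 100 is taken from the abstract.

```python
def dense_candidate_count(height, width, anchors_per_cell):
    """Number of hand-designed candidates on one H x W feature map with k anchors per cell."""
    return height * width * anchors_per_cell


# Hypothetical single feature map: 100 x 150 cells, 9 anchors per cell.
dense = dense_candidate_count(100, 150, 9)

# N learnable proposals, using the example value from the abstract.
sparse = 100

# How many dense candidates each learned proposal stands in for.
reduction_factor = dense // sparse
```

Under these assumed sizes the dense detector scores 135,000 candidates per image, while the sparse one refines 100 proposals.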

Proceedings Article•10.1109/CVPR46437.2021.00729•
Dynamic Head: Unifying Object Detection Heads with Attentions

Xiyang Dai1, Yinpeng Chen1, Bin Xiao1, Dongdong Chen1, Mengchen Liu1, Lu Yuan1, Lei Zhang1 •
Microsoft1
15 Jun 2021
TL;DR: In this article, a dynamic head framework is proposed to unify object detection heads with attentions, by coherently combining multiple self-attention mechanisms between feature levels for scale-awareness, among spatial locations for spatial-awareness and within output channels for task-awareness.
Abstract: The complex nature of combining localization and classification in object detection has resulted in the flourishing development of methods. Previous works tried to improve the performance in various object detection heads but failed to present a unified view. In this paper, we present a novel dynamic head framework to unify object detection heads with attentions. By coherently combining multiple self-attention mechanisms between feature levels for scale-awareness, among spatial locations for spatial-awareness, and within output channels for task-awareness, the proposed approach significantly improves the representation ability of object detection heads without any computational overhead. Further experiments demonstrate the effectiveness and efficiency of the proposed dynamic head on the COCO benchmark. With a standard ResNeXt-101-DCN backbone, we largely improve the performance over popular object detectors and achieve a new state-of-the-art at 54.0 AP. The code will be released at https://github.com/microsoft/DynamicHead.

699 citations

Journal Article•10.1109/TIP.2021.3089943•
LayerCAM: Exploring Hierarchical Class Activation Maps for Localization

Peng-Tao Jiang1, Chang-Bin Zhang1, Qibin Hou, Ming-Ming Cheng1, Yunchao Wei2 •
Nankai University1, Beijing Jiaotong University2
22 Jun 2021-IEEE Transactions on Image Processing
TL;DR: Jiang et al. propose LayerCAM, a simple yet effective method that generates more fine-grained object localization information from class activation maps to locate target objects more accurately.
Abstract: The class activation maps are generated from the final convolutional layer of CNN. They can highlight discriminative object regions for the class of interest. These discovered object regions have been widely used for weakly-supervised tasks. However, due to the small spatial resolution of the final convolutional layer, such class activation maps often locate coarse regions of the target objects, limiting the performance of weakly-supervised tasks that need pixel-accurate object locations. Thus, we aim to generate more fine-grained object localization information from the class activation maps to locate the target objects more accurately. In this paper, by rethinking the relationships between the feature maps and their corresponding gradients, we propose a simple yet effective method, called LayerCAM. It can produce reliable class activation maps for different layers of CNN. This property enables us to collect object localization information from coarse (rough spatial localization) to fine (precise fine-grained details) levels. We further integrate them into a high-quality class activation map, where the object-related pixels can be better highlighted. To evaluate the quality of the class activation maps produced by LayerCAM, we apply them to weakly-supervised object localization and semantic segmentation. Experiments demonstrate that the class activation maps generated by our method are more effective and reliable than those by the existing attention methods. The code will be made publicly available.

581 citations
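
The LayerCAM abstract above refers to the relationship between feature maps and their gradients without giving a formula. One common reading is that each activation is weighted by the positive part of its own gradient, the weighted maps are summed over channels, and the result is rectified. The sketch below follows that reading; the formula, the function name, and the nested-list layout are assumptions, not details taken from this listing.

```python
def layer_cam(activations, gradients):
    """Element-wise gradient-weighted class activation map for one layer.

    activations, gradients: nested lists indexed as [channel][row][col],
    where gradients holds d(class score)/d(activation) for the same layer.
    """
    channels = len(activations)
    rows = len(activations[0])
    cols = len(activations[0][0])
    cam = [[0.0] * cols for _ in range(rows)]
    for k in range(channels):
        for i in range(rows):
            for j in range(cols):
                # Only positive gradients contribute; each location has its own weight.
                weight = max(gradients[k][i][j], 0.0)
                cam[i][j] += weight * activations[k][i][j]
    # Final rectification keeps only evidence in favour of the class.
    return [[max(value, 0.0) for value in row] for row in cam]
```

Because the weights are per location rather than per channel, the same routine can be applied to shallow layers, whose maps have finer spatial resolution.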

Proceedings Article•10.1109/CVPR46437.2021.00577•
Towards Open World Object Detection

K J Joseph1, Salman Khan2, Fahad Shahbaz Khan2, Vineeth N Balasubramanian1•
Indian Institute of Technology, Hyderabad1, Zayed University2
3 Mar 2021
TL;DR: In this paper, the authors propose a novel computer vision problem called "Open World Object Detection", where a model is tasked to identify objects that have not been introduced to it as "unknown" and incrementally learn these identified unknown categories without forgetting previously learned classes, when the corresponding labels are progressively received.
Abstract: Humans have a natural instinct to identify unknown object instances in their environments. The intrinsic curiosity about these unknown instances aids in learning about them, when the corresponding knowledge is eventually available. This motivates us to propose a novel computer vision problem called: ‘Open World Object Detection’, where a model is tasked to: 1) identify objects that have not been introduced to it as ‘unknown’, without explicit supervision to do so, and 2) incrementally learn these identified unknown categories without forgetting previously learned classes, when the corresponding labels are progressively received. We formulate the problem, introduce a strong evaluation protocol and provide a novel solution, which we call ORE: Open World Object Detector, based on contrastive clustering and energy based unknown identification. Our experimental evaluation and ablation studies analyse the efficacy of ORE in achieving Open World objectives. As an interesting by-product, we find that identifying and characterising unknown instances helps to reduce confusion in an incremental object detection setting, where we achieve state-of-the-art performance, with no extra methodological effort. We hope that our work will attract further research into this newly identified, yet crucial research direction.

550 citations

Journal Article•10.1109/TPAMI.2021.3085766•
Concealed Object Detection.

Deng-Ping Fan1, Ge-Peng Ji2, Ming-Ming Cheng1, Ling Shao•
Nankai University1, Wuhan University2
01 Jun 2021-IEEE Transactions on Pattern Analysis and Machine Intelligence
TL;DR: Fan et al. present the first systematic study on concealed object detection (COD), which aims to identify objects that are "perfectly" embedded in their background, and design a simple but strong baseline for COD, termed the Search Identification Network (SINet).
Abstract: We present the first systematic study on concealed object detection (COD), which aims to identify objects that are "perfectly" embedded in their background. The high intrinsic similarities between the concealed objects and their background make COD far more challenging than traditional object detection/segmentation. To better understand this task, we collect a large-scale dataset, called COD10K, which consists of 10,000 images covering concealed objects in diverse real-world scenarios from 78 object categories. Further, we provide rich annotations including object categories, object boundaries, challenging attributes, object-level labels, and instance-level annotations. Our COD10K enables comprehensive concealed object understanding and can even be used to help progress several other vision tasks, such as detection, segmentation, classification etc. We also design a simple but strong baseline for COD, termed the Search Identification Network (SINet). Without any bells and whistles, SINet outperforms 12 cutting-edge baselines on all datasets tested, making it a robust, general architecture that could serve as a catalyst for future research in COD. Finally, we provide some interesting findings, and highlight several potential applications and future directions. To spark research in this new field, our code, dataset, and online demo are available at our project page: http://mmcheng.net/cod.

505 citations

Proceedings Article•10.1109/CVPR46437.2021.00738•
3D Object Detection with Pointformer

Xuran Pan1, Zhuofan Xia1, Shiji Song1, Li Erran Li2, Gao Huang1 •
Tsinghua University1, Columbia University2
1 Jun 2021
TL;DR: Pointformer is a Transformer backbone for 3D point clouds in which a Local Transformer module models interactions among points in a local region and learns context-dependent region features at an object level.
Abstract: Feature learning for 3D object detection from point clouds is very challenging due to the irregularity of 3D point cloud data. In this paper, we propose Pointformer, a Transformer backbone designed for 3D point clouds to learn features effectively. Specifically, a Local Transformer module is employed to model interactions among points in a local region, which learns context-dependent region features at an object level. A Global Transformer is designed to learn context-aware representations at the scene level. To further capture the dependencies among multi-scale representations, we propose Local-Global Transformer to integrate local features with global features from higher resolution. In addition, we introduce an efficient coordinate refinement module to shift down-sampled points closer to object centroids, which improves object proposal generation. We use Pointformer as the backbone for state-of-the-art object detection models and demonstrate significant improvements over original models on both indoor and outdoor datasets.

480 citations

Proceedings Article•10.1109/CVPR46437.2021.00727•
FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding

Bo Sun1, Banghuai Li, Shengcai Cai, Ye Yuan, Chi Zhang •
University of Southern California1
20 Jun 2021
TL;DR: In this article, a contrastive proposal encoding loss (CPE loss) was proposed to improve the performance of few-shot object detection by learning contrastive-aware object proposal encodings that facilitate the classification of detected objects.
Abstract: Interest has emerged in recognizing previously unseen objects given very few training examples, known as few-shot object detection (FSOD). Recent research demonstrates that good feature embedding is the key to reaching favorable few-shot learning performance. We observe that object proposals with different Intersection-over-Union (IoU) scores are analogous to the intra-image augmentation used in contrastive visual representation learning. We exploit this analogy and incorporate supervised contrastive learning to achieve more robust object representations in FSOD. We present Few-Shot object detection via Contrastive proposals Encoding (FSCE), a simple yet effective approach to learning contrastive-aware object proposal encodings that facilitate the classification of detected objects. We notice that the degradation of average precision (AP) for rare objects mainly comes from misclassifying novel instances as confusable classes. We ease the misclassification issues by promoting instance-level intra-class compactness and inter-class variance via our contrastive proposal encoding loss (CPE loss). Our design outperforms current state-of-the-art works in any shot and all data splits, with up to +8.8% on standard benchmark PASCAL VOC and +2.7% on challenging COCO benchmark. Code is available at: https://github.com/MegviiDetection/FSCE.

471 citations
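
The FSCE abstract above treats proposals with different IoU scores against the same object as analogous to augmented views. For readers unfamiliar with the measure, here is a minimal sketch of IoU for axis-aligned boxes. The `(x1, y1, x2, y2)` box convention and the function name are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    # Degenerate boxes with zero area have no meaningful overlap.
    return intersection / union if union > 0 else 0.0
```

Several proposals that overlap one ground-truth box at, say, IoU 0.6, 0.75 and 0.9 show the same object under slightly different crops, which is the analogy to augmentation that the abstract draws.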

Journal Article•10.1109/TPAMI.2019.2929520•
Deep Affinity Network for Multiple Object Tracking

Shijie Sun1, Naveed Akhtar2, Huansheng Song1, Ajmal Mian2, Mubarak Shah3 •
Chang'an University1, University of Western Australia2, University of Central Florida3
01 Jan 2021-IEEE Transactions on Pattern Analysis and Machine Intelligence
TL;DR: The proposed Deep Affinity Network (DAN) learns compact, yet comprehensive features of pre-detected objects at several levels of abstraction, and performs exhaustive pairing permutations of those features in any two frames to infer object affinities.
Abstract: Multiple Object Tracking (MOT) plays an important role in solving many fundamental problems in video analysis and computer vision. Most MOT methods employ two steps: Object Detection and Data Association. The first step detects objects of interest in every frame of a video, and the second establishes correspondence between the detected objects in different frames to obtain their tracks. Object detection has made tremendous progress in the last few years due to deep learning. However, data association for tracking still relies on hand crafted constraints such as appearance, motion, spatial proximity, grouping etc. to compute affinities between the objects in different frames. In this paper, we harness the power of deep learning for data association in tracking by jointly modeling object appearances and their affinities between different frames in an end-to-end fashion. The proposed Deep Affinity Network (DAN) learns compact, yet comprehensive features of pre-detected objects at several levels of abstraction, and performs exhaustive pairing permutations of those features in any two frames to infer object affinities. DAN also accounts for multiple objects appearing and disappearing between video frames. We exploit the resulting efficient affinity computations to associate objects in the current frame deep into the previous frames for reliable on-line tracking. Our technique is evaluated on popular multiple object tracking challenges MOT15, MOT17 and UA-DETRAC. Comprehensive benchmarking under twelve evaluation metrics demonstrates that our approach is among the best performing techniques on the leader board for these challenges. The open source implementation of our work is available at https://github.com/shijieS/SST.git .

433 citations
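
The Deep Affinity Network abstract above describes exhaustive pairing of object features across two frames, with allowance for objects that appear or disappear. The sketch below is a deliberately simplified stand-in: it uses cosine similarity instead of a learned affinity, and a hypothetical constant `unmatched_score` column to represent "no match in the other frame". None of these names or choices come from the listing.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; zero vectors are treated as dissimilar."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def affinity_matrix(frame_a, frame_b, unmatched_score=0.0):
    """Row-normalized affinities from every object in frame_a to every object in frame_b.

    Each row has len(frame_b) + 1 entries; the last entry is the probability
    that the object has no counterpart in frame_b (it left the scene).
    """
    rows = []
    for feat_a in frame_a:
        scores = [cosine(feat_a, feat_b) for feat_b in frame_b]
        scores.append(unmatched_score)
        # Softmax over candidate matches plus the "unmatched" slot.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows
```

Taking the largest entry in each row gives a greedy association; when the extra column wins, the object is treated as having left the frame.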

Posted Content•
Learning Transferable Visual Models From Natural Language Supervision

Alec Radford1, Jong Wook Kim1, Chris Hallacy1, Aditya Ramesh1, Gabriel Goh1, Sandhini Agarwal1, Girish Sastry1, Amanda Askell, Pamela Mishkin1, Jack Clark1, Gretchen Krueger1, Ilya Sutskever1 •
OpenAI1
26 Feb 2021-arXiv: Computer Vision and Pattern Recognition
TL;DR: In this article, a pre-training task of predicting which caption goes with which image is used to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
Abstract: State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at this https URL.

426 citations

Posted Content•
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding.

Aishwarya Kamath1, Mannat Singh2, Yann LeCun2, Ishan Misra2, Gabriel Synnaeve2, Nicolas Carion3 •
New York University1, Facebook2, Paris Dauphine University3
26 Apr 2021-arXiv: Computer Vision and Pattern Recognition
TL;DR: In this article, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question, is proposed.
Abstract: Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this paper we propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query, like a caption or a question. We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model. We pre-train the network on 1.3M text-image pairs, mined from pre-existing multi-modal datasets having explicit alignment between phrases in text and objects in the image. We then fine-tune on several downstream tasks such as phrase grounding, referring expression comprehension and segmentation, achieving state-of-the-art results on popular benchmarks. We also investigate the utility of our model as an object detector on a given label set when fine-tuned in a few-shot setting. We show that our pre-training approach provides a way to handle the long tail of object categories which have very few labelled instances. Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR. The code and models are available at this https URL.
Proceedings Article•10.1109/CVPR46437.2021.00014•
HOTR: End-to-End Human-Object Interaction Detection with Transformers

Bumsoo Kim, Junhyun Lee1, Jaewoo Kang1, Eun-Sol Kim, Hyunwoo Kim1 •
Korea University1
20 Jun 2021
TL;DR: Kim et al. present HOTR, a framework that directly predicts a set of 〈human, object, interaction〉 triplets from an image based on a transformer encoder-decoder architecture.
Abstract: Human-Object Interaction (HOI) detection is a task of identifying "a set of interactions" in an image, which involves i) the localization of the subject (i.e., humans) and target (i.e., objects) of interaction, and ii) the classification of the interaction labels. Most existing methods have indirectly addressed this task by detecting human and object instances and individually inferring every pair of the detected instances. In this paper, we present a novel framework, referred to as HOTR, which directly predicts a set of 〈human, object, interaction〉 triplets from an image based on a transformer encoder-decoder architecture. Through the set prediction, our method effectively exploits the inherent semantic relationships in an image and does not require time-consuming post-processing, which is the main bottleneck of existing methods. Our proposed algorithm achieves state-of-the-art performance in two HOI detection benchmarks with an inference time under 1 ms after object detection.
Proceedings Article•10.1109/CVPR46437.2021.00330•
Objects are Different: Flexible Monocular 3D Object Detection

Yunpeng Zhang1, Jiwen Lu1, Jie Zhou1•
Tsinghua University1
6 Apr 2021
TL;DR: Zhang et al. propose a flexible framework for monocular 3D object detection which explicitly decouples the truncated objects and adaptively combines multiple approaches for object depth estimation.
Abstract: The precise localization of 3D objects from a single image without depth information is a highly challenging problem. Most existing methods adopt the same approach for all objects regardless of their diverse distributions, leading to limited performance for truncated objects. In this paper, we propose a flexible framework for monocular 3D object detection which explicitly decouples the truncated objects and adaptively combines multiple approaches for object depth estimation. Specifically, we decouple the edge of the feature map for predicting long-tail truncated objects so that the optimization of normal objects is not influenced. Furthermore, we formulate the object depth estimation as an uncertainty-guided ensemble of directly regressed object depth and solved depths from different groups of keypoints. Experiments demonstrate that our method outperforms the state-of-the-art method by relatively 27% for the moderate level and 30% for the hard level in the test set of KITTI benchmark while maintaining real-time efficiency. Code will be available at https://github.com/zhangyp15/MonoFlex.
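
The abstract above formulates object depth as an uncertainty-guided ensemble of several depth estimates. One plausible reading, sketched here, is a weighted average in which each estimate is weighted by the inverse of its predicted uncertainty. The exact weighting rule, the function name, and the input format are assumptions and are not stated in this listing.

```python
def combine_depths(depths, uncertainties):
    """Average depth estimates, trusting low-uncertainty estimates more.

    depths: candidate depths for one object (e.g. direct regression and
            keypoint-based solutions), in the same unit.
    uncertainties: one positive uncertainty per estimate.
    """
    if len(depths) != len(uncertainties) or not depths:
        raise ValueError("need one uncertainty per depth estimate")
    if any(u <= 0 for u in uncertainties):
        raise ValueError("uncertainties must be positive")
    weights = [1.0 / u for u in uncertainties]
    return sum(w * d for w, d in zip(weights, depths)) / sum(weights)
```

With equal uncertainties this reduces to a plain mean; as one estimate becomes less certain, the result moves toward the others.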
Journal Article•10.1016/J.RSE.2021.112636•
Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to man-made disasters

Zhuo Zheng1, Yanfei Zhong1, Junjue Wang1, Ailong Ma1, Liangpei Zhang1 •
Wuhan University1
01 Nov 2021-Remote Sensing of Environment
TL;DR: A deep object-based semantic change detection framework, called ChangeOS, is proposed for building damage assessment that is superior to the currently published methods in speed and accuracy, and has a superior generalization ability for man-made disasters.
Proceedings Article•10.1109/CVPR46437.2021.00893•
DexYCB: A Benchmark for Capturing Hand Grasping of Objects

Yu-Wei Chao1, Wei Yang1, Yu Xiang1, Pavlo Molchanov1, Ankur Handa1, Jonathan Tremblay1, Yashraj S. Narang1, Karl Van Wyk1, Umar Iqbal1, Stan Birchfield1, Jan Kautz1, Dieter Fox1 •
Nvidia1
1 Jun 2021
TL;DR: DexYCB is a dataset for capturing hand grasping of objects, benchmarked on three tasks: 2D object and keypoint detection, 6D object pose estimation, and 3D hand pose estimation.
Abstract: We introduce DexYCB, a new dataset for capturing hand grasping of objects. We first compare DexYCB with a related one through cross-dataset evaluation. We then present a thorough benchmark of state-of-the-art approaches on three relevant tasks: 2D object and keypoint detection, 6D object pose estimation, and 3D hand pose estimation. Finally, we evaluate a new robotics-relevant task: generating safe robot grasps in human-to-robot object handover.
Journal Article•10.1007/S10462-020-09888-5•
Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review

Guoguang Du, Kai Wang, Shiguo Lian, Kaiyong Zhao
01 Mar 2021-Artificial Intelligence Review
TL;DR: This survey identifies three key tasks in vision-based robotic grasping: object localization, object pose estimation, and grasp estimation, the last of which includes 2D planar grasp methods and 6DoF grasp methods.
Abstract: This paper presents a comprehensive survey on vision-based robotic grasping. We identify three key tasks in vision-based robotic grasping: object localization, object pose estimation and grasp estimation. In detail, the object localization task contains object localization without classification, object detection and object instance segmentation. This task provides the regions of the target object in the input data. The object pose estimation task mainly refers to estimating the 6D object pose and includes correspondence-based methods, template-based methods and voting-based methods, which affords the generation of grasp poses for known objects. The grasp estimation task includes 2D planar grasp methods and 6DoF grasp methods, where the former is constrained to grasp from one direction. These three tasks can accomplish robotic grasping in different combinations. Many object pose estimation methods do not require separate object localization, and they conduct object localization and object pose estimation jointly. Many grasp estimation methods require neither object localization nor object pose estimation, and they conduct grasp estimation in an end-to-end manner. Both traditional methods and the latest deep learning-based methods based on RGB-D image inputs are reviewed in detail in this survey. Related datasets and comparisons between state-of-the-art methods are summarized as well. In addition, challenges in vision-based robotic grasping and future directions for addressing these challenges are also pointed out.
Journal Article•10.1109/TITS.2021.3096854•
A Review and Comparative Study on Probabilistic Object Detection in Autonomous Driving

Di Feng1, Ali Harakeh, Steven L. Waslander, Klaus Dietmayer•
University of Ulm1
12 Jul 2021-IEEE Transactions on Intelligent Transportation Systems
TL;DR: An overview of generic uncertainty estimation in deep learning is provided, and a strict comparative study on existing probabilistic object detection methods for autonomous driving applications is presented.
Abstract: Capturing uncertainty in object detection is indispensable for safe autonomous driving. In recent years, deep learning has become the de-facto approach for object detection, and many probabilistic object detectors have been proposed. However, there is no summary on uncertainty estimation in deep object detection, and existing methods are not only built with different network architectures and uncertainty estimation methods, but also evaluated on different datasets with a wide range of evaluation metrics. As a result, a comparison among methods remains challenging, as does the selection of a model that best suits a particular application. This paper aims to alleviate this problem by providing a review and comparative study on existing probabilistic object detection methods for autonomous driving applications. First, we provide an overview of generic uncertainty estimation in deep learning, and then systematically survey existing methods and evaluation metrics for probabilistic object detection. Next, we present a strict comparative study for probabilistic object detection based on an image detector and three public autonomous driving datasets. Finally, we present a discussion of the remaining challenges and future works. Code has been made available at this https URL
Journal Article•10.1007/S11263-021-01465-9•
OCNet: Object Context for Semantic Segmentation

Yuhui Yuan1,2, Lang Huang3, Jianyuan Guo3, Chao Zhang3, Xilin Chen1, Jingdong Wang2 •
Chinese Academy of Sciences1, Microsoft2, Peking University3
24 May 2021-International Journal of Computer Vision
TL;DR: This paper proposes an efficient interlaced sparse self-attention scheme to model the dense relations between any two of all pixels via the combination of two sparse relation matrices and empirically shows the advantages of this approach with competitive performances on five challenging benchmarks.
Abstract: In this paper, we address the semantic segmentation task with a new context aggregation scheme named object context, which focuses on enhancing the role of object information. Motivated by the fact that the category of each pixel is inherited from the object it belongs to, we define the object context for each pixel as the set of pixels that belong to the same category as the given pixel in the image. We use a binary relation matrix to represent the relationship between all pixels, where the value one indicates the two selected pixels belong to the same category and zero otherwise. We propose to use a dense relation matrix to serve as a surrogate for the binary relation matrix. The dense relation matrix is capable to emphasize the contribution of object information as the relation scores tend to be larger on the object pixels than the other pixels. Considering that the dense relation matrix estimation requires quadratic computation overhead and memory consumption w.r.t. the input size, we propose an efficient interlaced sparse self-attention scheme to model the dense relations between any two of all pixels via the combination of two sparse relation matrices. To capture richer context information, we further combine our interlaced sparse self-attention scheme with the conventional multi-scale context schemes including pyramid pooling (Zhao et al. 2017) and atrous spatial pyramid pooling (Chen et al. 2018). We empirically show the advantages of our approach with competitive performances on five challenging benchmarks including: Cityscapes, ADE20K, LIP, PASCAL-Context and COCO-Stuff.
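
The binary relation matrix defined in the abstract above can be written down directly: entry (i, j) is one when pixels i and j share a category and zero otherwise, and the object context of a pixel is the set of indices where its row is one. The sketch below uses a flattened list of per-pixel labels; the function names and the toy labels are illustrative assumptions.

```python
def binary_relation_matrix(labels):
    """Return an n x n 0/1 matrix; entry (i, j) is 1 when pixels i and j share a category."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


def object_context(labels, index):
    """Indices of all pixels in the same category as the pixel at `index` (itself included)."""
    return [j for j, label in enumerate(labels) if label == labels[index]]
```

The matrix has n² entries for n pixels, which is the quadratic cost that the interlaced sparse self-attention scheme in the abstract is designed to avoid.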
Proceedings Article•10.1109/CVPR46437.2021.00994•
Uncertainty-aware Joint Salient Object and Camouflaged Object Detection

Aixuan Li1, Jing Zhang2, Yunqiu Lv1, Bowen Liu1, Tong Zhang3, Yuchao Dai1 •
Northwestern Polytechnical University1, Australian National University2, École Polytechnique Fédérale de Lausanne3
1 Jun 2021
TL;DR: This paper leverages contradictory information to enhance the detection ability of both salient object detection and camouflaged object detection, and proposes an adversarial learning network to achieve both higher-order similarity measure and network confidence estimation.
Abstract: Visual salient object detection (SOD) aims at finding the salient object(s) that attract human attention, while camouflaged object detection (COD), on the contrary, intends to discover the camouflaged object(s) that are hidden in their surroundings. In this paper, we propose a paradigm of leveraging the contradictory information to enhance the detection ability of both salient object detection and camouflaged object detection. We start by exploiting the easy positive samples in the COD dataset to serve as hard positive samples in the SOD task to improve the robustness of the SOD model. Then, we introduce a "similarity measure" module to explicitly model the contradicting attributes of these two tasks. Furthermore, considering the uncertainty of labeling in both tasks’ datasets, we propose an adversarial learning network to achieve both higher-order similarity measure and network confidence estimation. Experimental results on benchmark datasets demonstrate that our solution leads to state-of-the-art (SOTA) performance for both tasks.
Journal Article•10.1007/S41095-020-0199-Z•
RGB-D salient object detection: A survey.

Tao Zhou, Deng-Ping Fan, Ming-Ming Cheng1, Jianbing Shen, Ling Shao •
Nankai University1
07 Jan 2021-Computational Visual Media
TL;DR: This paper provides a comprehensive survey of RGB-D based salient object detection models from various perspectives, reviews related benchmark datasets in detail, and carries out a comprehensive attribute-based evaluation of several representative RGB-D based salient object detection models.
Abstract: Salient object detection, which simulates human visual perception in locating the most significant object(s) in a scene, has been widely applied to various computer vision tasks. Now, the advent of depth sensors means that depth maps can easily be captured; this additional spatial information can boost the performance of salient object detection. Although various RGB-D based salient object detection models with promising performance have been proposed over the past several years, an in-depth understanding of these models and the challenges in this field remains lacking. In this paper, we provide a comprehensive survey of RGB-D based salient object detection models from various perspectives, and review related benchmark datasets in detail. Further, as light fields can also provide depth maps, we review salient object detection models and popular benchmark datasets from this domain too. Moreover, to investigate the ability of existing models to detect salient objects, we have carried out a comprehensive attribute-based evaluation of several representative RGB-D based salient object detection models. Finally, we discuss several challenges and open directions of RGB-D based salient object detection for future research. All collected models, benchmark datasets, datasets constructed for attribute-based evaluation, and related code are publicly available at https://github.com/taozh2017/RGBD-SODsurvey.
Journal Article•10.1007/S10489-021-02293-7•
Deep learning in multi-object detection and tracking: state of the art

Sankar K. Pal1, Anima Pramanik2, Jhareswar Maiti2, Pabitra Mitra2•
Indian Statistical Institute1, Indian Institute of Technology Kharagpur2
09 Apr 2021-Applied Intelligence
TL;DR: In this article, the authors provide a comprehensive overview of object detection and tracking using deep learning (DL) networks and compare the performance of different object detectors and trackers, including the recent development in granulated DL models.
Abstract: Object detection and tracking is one of the most important and challenging branches in computer vision, and has been widely applied in various fields, such as health-care monitoring, autonomous driving, anomaly detection, and so on. With the rapid development of deep learning (DL) networks and GPU computing power, the performance of object detectors and trackers has been greatly improved. To understand the main development status of the object detection and tracking pipeline thoroughly, in this survey, we have critically analyzed the existing DL network-based methods of object detection and tracking and described various benchmark datasets. This includes the recent development in granulated DL models. Primarily, we have provided a comprehensive overview of a variety of both generic object detection and specific object detection models. We have listed various comparative results for obtaining the best detector, tracker, and their combination. Moreover, we have listed the traditional and new applications of object detection and tracking, showing its developmental trends. Finally, challenging issues, including the relevance of granular computing, in the said domain are elaborated as a future scope of research, together with some concerns. An extensive bibliography is also provided.
Proceedings Article•10.1109/CVPR46437.2021.01023•
ST3D: Self-training for Unsupervised Domain Adaptation on 3D Object Detection

Jihan Yang1, Shaoshuai Shi2, Zhe Wang3, Hongsheng Li2, Xiaojuan Qi1 •
University of Hong Kong1, The Chinese University of Hong Kong2, SenseTime3
1 Jun 2021
TL;DR: This paper presents ST3D, a domain adaptive self-training pipeline for unsupervised domain adaptation on 3D object detection from point clouds, which pre-trains the 3D detector on the source domain with a random object scaling strategy to mitigate the negative effects of source domain bias.
Abstract: We present a new domain adaptive self-training pipeline, named ST3D, for unsupervised domain adaptation on 3D object detection from point clouds. First, we pre-train the 3D detector on the source domain with our proposed random object scaling strategy for mitigating the negative effects of source domain bias. Then, the detector is iteratively improved on the target domain by alternately conducting two steps: pseudo label updating with the developed quality-aware triplet memory bank, and model training with curriculum data augmentation. These specific designs for 3D object detection enable the detector to be trained with consistent and high-quality pseudo labels and to avoid overfitting to the large number of easy examples in pseudo labeled data. Our ST3D achieves state-of-the-art performance on all evaluated datasets and even surpasses fully supervised results on the KITTI 3D object detection benchmark. Code will be available at https://github.com/CVMI-Lab/ST3D.
Posted Content•
Anchor DETR: Query Design for Transformer-Based Detector.

Yingming Wang, Xiangyu Zhang, Tong Yang, Jian Sun
15 Sep 2021-arXiv: Computer Vision and Pattern Recognition
TL;DR: This paper proposes a novel query design for transformer-based detectors in which each object query focuses on the objects near an anchor point and can predict multiple objects at one position, addressing the "one region, multiple objects" difficulty.
Abstract: In this paper, we propose a novel query design for transformer-based detectors. In previous transformer-based detectors, the object queries are a set of learned embeddings. However, each learned embedding does not have an explicit physical meaning, and we cannot explain where it will focus. It is difficult to optimize, as the prediction slot of each object query does not have a specific mode. In other words, each object query will not focus on a specific region. To solve these problems, in our query design, object queries are based on anchor points, which are widely used in CNN-based detectors, so each object query focuses on the objects near the anchor point. Moreover, our query design can predict multiple objects at one position to solve the difficulty: "one region, multiple objects". In addition, we design an attention variant, which can reduce the memory cost while achieving similar or better performance than the standard attention in DETR. Thanks to the query design and the attention variant, the proposed detector, which we call Anchor DETR, can achieve better performance and run faster than DETR with 10$\times$ fewer training epochs. For example, it achieves 44.2 AP with 16 FPS on the MSCOCO dataset when using the ResNet50-DC5 feature for training 50 epochs. Extensive experiments on the MSCOCO benchmark prove the effectiveness of the proposed methods. Code is available at this https URL.
Posted Content•
TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization.

Wei Gao1, Fang Wan1, Xingjia Pan1, Zhiliang Peng, Qi Tian2, Zhenjun Han1, Bolei Zhou3, Qixiang Ye1 •
Chinese Academy of Sciences1, Huawei2, The Chinese University of Hong Kong3
27 Mar 2021-arXiv: Computer Vision and Pattern Recognition
TL;DR: This paper introduces the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in visual transformer for long-range dependency extraction and achieves state-of-the-art performance.
Abstract: Weakly supervised object localization (WSOL) is a challenging problem: only image category labels are given, yet object localization models must be learned. Optimizing a convolutional neural network (CNN) for classification tends to activate local discriminative regions while ignoring complete object extent, causing the partial activation issue. In this paper, we argue that partial activation is caused by the intrinsic characteristics of CNNs, where the convolution operations produce local receptive fields and have difficulty capturing long-range feature dependency among pixels. We introduce the token semantic coupled attention map (TS-CAM) to take full advantage of the self-attention mechanism in the visual transformer for long-range dependency extraction. TS-CAM first splits an image into a sequence of patch tokens for spatial embedding, which produce attention maps of long-range visual dependency to avoid partial activation. TS-CAM then re-allocates category-related semantics for patch tokens, enabling each of them to be aware of object categories. TS-CAM finally couples the patch tokens with the semantic-agnostic attention map to achieve semantic-aware localization. Experiments on the ILSVRC/CUB-200-2011 datasets show that TS-CAM outperforms its CNN-CAM counterparts by 7.1%/27.1% for WSOL, achieving state-of-the-art performance.
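The final coupling step described in the abstract above could look roughly like the sketch below. It assumes the coupling is an element-wise product of a per-patch semantic score and a per-patch attention score, followed by min-max normalization; both choices are assumptions made for illustration, and the function names and toy values are hypothetical rather than taken from the authors' code.

```python
from typing import List


def min_max(values: List[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant map becomes all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def couple(semantic: List[float], attention: List[float]) -> List[float]:
    """Element-wise product of semantic and attention scores (assumed coupling)."""
    if len(semantic) != len(attention):
        raise ValueError("maps must cover the same patches")
    return min_max([s * a for s, a in zip(semantic, attention)])


# Hypothetical scores for four patch tokens.
semantic = [0.2, 0.9, 0.8, 0.1]    # how strongly each patch matches the category
attention = [0.5, 1.0, 0.25, 0.5]  # category-agnostic attention per patch
coupled = couple(semantic, attention)
best_patch = max(range(len(coupled)), key=coupled.__getitem__)
```

In this toy case, patch 2 has a high semantic score but low attention, so it is suppressed in the coupled map, while patch 1 scores high on both and dominates.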
Proceedings Article•10.1109/CVPR46437.2021.00267•
BBAM: Bounding Box Attribution Map for Weakly Supervised Semantic and Instance Segmentation

Jungbeom Lee1, Jihun Yi1, Chaehun Shin1, Sungroh Yoon1•
Seoul National University1
1 Jun 2021
TL;DR: In this paper, a bounding-box attribution map (BBAM) was proposed to identify the target object in its bounding box and thus serve as pseudo ground truth for weakly supervised semantic and instance segmentation.
Abstract: Weakly supervised segmentation methods using bounding box annotations focus on obtaining a pixel-level mask from each box containing an object. Existing methods typically depend on a class-agnostic mask generator, which operates on the low-level information intrinsic to an image. In this work, we utilize higher-level information from the behavior of a trained object detector, by seeking the smallest areas of the image from which the object detector produces almost the same result as it does from the whole image. These areas constitute a bounding-box attribution map (BBAM), which identifies the target object in its bounding box and thus serves as pseudo ground-truth for weakly supervised semantic and instance segmentation. This approach significantly outperforms recent comparable techniques on both the PASCAL VOC and MS COCO benchmarks in weakly supervised semantic and instance segmentation. In addition, we provide a detailed analysis of our method, offering deeper insight into the behavior of the BBAM.
Proceedings Article•10.1109/CVPR46437.2021.00134•
Efficient Regional Memory Network for Video Object Segmentation

Haozhe Xie1, Hongxun Yao1, Shangchen Zhou2, Shengping Zhang1, Wenxiu Sun3 •
Harbin Institute of Technology1, Nanyang Technological University2, SenseTime3
1 Jun 2021
TL;DR: In this paper, the precise regional memory is constructed by memorizing local regions where the target objects appear in the past frames, and the query regions are tracked and predicted based on the optical flow estimated from the previous frame.
Abstract: Recently, several Space-Time Memory based networks have shown that the object cues (e.g. video frames as well as the segmented object masks) from the past frames are useful for segmenting objects in the current frame. However, these methods exploit the information from the memory by global-to-global matching between the current and past frames, which leads to mismatching to similar objects and high computational complexity. To address these problems, we propose a novel local-to-local matching solution for semi-supervised VOS, namely Regional Memory Network (RMNet). In RMNet, the precise regional memory is constructed by memorizing local regions where the target objects appear in the past frames. For the current query frame, the query regions are tracked and predicted based on the optical flow estimated from the previous frame. The proposed local-to-local matching effectively alleviates the ambiguity of similar objects in both memory and query frames, which allows the information to be passed from the regional memory to the query region efficiently and effectively. Experimental results indicate that the proposed RMNet performs favorably against state-of-the-art methods on the DAVIS and YouTube-VOS datasets.
Journal Article•10.1007/S11263-020-01393-0•
MOTChallenge: A Benchmark for Single-Camera Multiple Target Tracking

Patrick Dendorfer1, Aljosa Osep1, Anton Milan2, Konrad Schindler3, Daniel Cremers1, Ian Reid4, Stefan Roth5, Laura Leal-Taixé1 •
Technische Universität München1, Amazon.com2, ETH Zurich3, University of Adelaide4, Technische Universität Darmstadt5
01 Apr 2021-International Journal of Computer Vision
TL;DR: This paper presents MOTChallenge, a benchmark for single-camera multiple object tracking (MOT) launched in late 2014 that provides a framework for standardized evaluation of tracking methods, and covers its first three releases: MOT15, MOT16, and MOT17.
Abstract: Standardized benchmarks have been crucial in pushing the performance of computer vision algorithms, especially since the advent of deep learning. Although leaderboards should not be over-claimed, they often provide the most objective measure of performance and are therefore important guides for research. We present MOTChallenge, a benchmark for single-camera Multiple Object Tracking (MOT) launched in late 2014, to collect existing and new data and create a framework for the standardized evaluation of multiple object tracking methods. The benchmark is focused on multiple people tracking, since pedestrians are by far the most studied object in the tracking community, with applications ranging from robot navigation to self-driving cars. This paper collects the first three releases of the benchmark: (i) MOT15, along with numerous state-of-the-art results that were submitted in the last years, (ii) MOT16, which contains new challenging videos, and (iii) MOT17, which extends MOT16 sequences with more precise labels and evaluates tracking performance on three different object detectors. The second and third releases not only offer a significant increase in the number of labeled boxes, but also provide labels for multiple object classes besides pedestrians, as well as the level of visibility for every single object of interest. We finally provide a categorization of state-of-the-art trackers and a broad error analysis. This will help newcomers understand the related work and research trends in the MOT community, and hopefully shed some light on potential future research directions.
Proceedings Article•10.1109/CVPR46437.2021.01005•
Dense Relation Distillation with Context-aware Aggregation for Few-Shot Object Detection

Hanzhe Hu1, Shuai Bai2, Aoxue Li1, Jinshi Cui1, Liwei Wang1 •
Peking University1, Beijing University of Posts and Telecommunications2
20 Jun 2021
TL;DR: This paper proposes Dense Relation Distillation with Context-aware Aggregation (DCNet) to tackle few-shot object detection, in which a detector must adapt to novel classes with only a few annotated examples.
Abstract: Conventional deep learning based methods for object detection require a large amount of bounding box annotations for training, and such high-quality annotated data is expensive to obtain. Few-shot object detection, which learns to adapt to novel classes with only a few annotated examples, is very challenging since the fine-grained features of novel objects can be easily overlooked with only a few data available. In this work, aiming to fully exploit features of annotated novel objects and capture fine-grained features of the query object, we propose Dense Relation Distillation with Context-aware Aggregation (DCNet) to tackle the few-shot detection problem. Built on the meta-learning based framework, the Dense Relation Distillation module aims at fully exploiting support features, where support features and query features are densely matched, covering all spatial locations in a feed-forward fashion. The abundant usage of the guidance information endows the model with the capability to handle common challenges such as appearance changes and occlusions. Moreover, to better capture scale-aware features, the Context-aware Aggregation module adaptively harnesses features from different scales for a more comprehensive feature representation. Extensive experiments illustrate that our proposed approach achieves state-of-the-art results on the PASCAL VOC and MS COCO datasets. Code will be made available at https://github.com/hzhupku/DCNet.
Proceedings Article•10.1109/CVPR46437.2021.01416•
Open-Vocabulary Object Detection Using Captions

Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, Shih-Fu Chang1•
Columbia University1
1 Jun 2021
TL;DR: This paper formulates open-vocabulary object detection and proposes a method that trains object detectors using bounding box annotations for a limited set of object categories, as well as image-caption pairs that cover a larger variety of objects at a significantly lower cost.
Abstract: Despite the remarkable accuracy of deep neural networks in object detection, they are costly to train and scale due to supervision requirements. Particularly, learning more object categories typically requires proportionally more bounding box annotations. Weakly supervised and zero-shot learning techniques have been explored to scale object detectors to more categories with less supervision, but they have not been as successful and widely adopted as supervised models. In this paper, we put forth a novel formulation of the object detection problem, namely open-vocabulary object detection, which is more general, more practical, and more effective than weakly supervised and zero-shot approaches. We propose a new method to train object detectors using bounding box annotations for a limited set of object categories, as well as image-caption pairs that cover a larger variety of objects at a significantly lower cost. We show that the proposed method can detect and localize objects for which no bounding box annotation is provided during training, at a significantly higher accuracy than zero-shot approaches. Meanwhile, objects with bounding box annotation can be detected almost as accurately as supervised methods, which is significantly better than weakly supervised baselines. Accordingly, we establish a new state of the art for scalable object detection.
Posted Content•
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui1•
Google1
28 Apr 2021-arXiv: Computer Vision and Pattern Recognition
TL;DR: This paper proposes ViLD, which distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student), using the teacher to encode category texts and image regions of object proposals.
Abstract: We aim at advancing open-vocabulary object detection, which detects objects described by arbitrary text inputs. The fundamental challenge is the availability of training data. Existing object detection datasets only contain hundreds of categories, and it is costly to scale further. To overcome this challenge, we propose ViLD, a training method via Vision and Language knowledge Distillation. Our method distills the knowledge from a pretrained open-vocabulary image classification model (teacher) into a two-stage detector (student). Specifically, we use the teacher model to encode category texts and image regions of object proposals. Then we train a student detector, whose region embeddings of detected boxes are aligned with the text and image embeddings inferred by the teacher. We benchmark on LVIS by holding out all rare categories as novel categories not seen during training. ViLD obtains 16.1 mask AP$_r$, even outperforming the supervised counterpart by 3.8 with a ResNet-50 backbone. The model can directly transfer to other datasets without finetuning, achieving 72.2 AP$_{50}$, 36.6 AP and 11.8 AP on PASCAL VOC, COCO and Objects365, respectively. On COCO, ViLD outperforms previous SOTA by 4.8 on novel AP and 11.4 on overall AP.
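Once region embeddings are aligned with text embeddings, as the abstract above describes, an open-vocabulary classifier can be sketched as a nearest-text-embedding lookup. The sketch below uses cosine similarity over made-up 3-dimensional vectors; the similarity choice, the function names, and all embedding values are assumptions for illustration and do not reproduce ViLD's actual inference procedure or any real encoder output.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_region(
    region_embedding: List[float],
    text_embeddings: Dict[str, List[float]],
) -> Tuple[str, Dict[str, float]]:
    """Pick the category whose text embedding is most similar to the region."""
    scores = {name: cosine(region_embedding, emb)
              for name, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical embeddings; a real system would obtain them from a text encoder.
text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "zebra": [0.0, 0.0, 1.0],  # could be a category with no box annotations
}
region = [0.1, 0.2, 0.9]
label, scores = classify_region(region, text_embeddings)
```

Because categories enter only as text embeddings, adding a new category means adding one more dictionary entry rather than retraining a fixed classification layer.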
...
