Journal Article, DOI: 10.1109/tcsvt.2022.3222305
Motion Stimulation for Compositional Action Recognition
TL;DR: This paper proposes a Motion Stimulation (MS) block that mines dynamic clues of local regions autonomously from adjacent frames and can be integrated directly into existing video backbones to improve the compositional generalization of action recognition algorithms.
Abstract: Recognizing unseen combinations of actions and objects, namely (zero-shot) compositional action recognition, is extremely challenging for conventional action recognition algorithms in real-world applications. Previous methods focus on enhancing the dynamic clues of objects that appear in the scene by building region features or tracklet embeddings from ground-truth or detected bounding boxes. These methods rely heavily on manual annotation or on detector quality, which makes them inflexible for practical applications. In this work, we aim to mine temporal clues from moving objects or hands without explicit supervision. To this end, we propose a novel Motion Stimulation (MS) block, which is specifically designed to mine dynamic clues of local regions autonomously from adjacent frames. MS consists of three steps: motion feature extraction, motion feature recalibration, and action-centric excitation. The proposed MS block can be integrated directly into existing video backbones to enhance the compositional generalization of action recognition algorithms. Extensive experimental results on three action recognition datasets (Something-Else, IKEA-Assembly, and EPIC-KITCHENS) indicate the effectiveness and interpretability of our MS block.
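The abstract names the three MS steps but does not specify how each is computed. The following is a minimal illustrative sketch, not the authors' implementation. It assumes that motion features are differences between adjacent frames, that recalibration is a squeeze-and-excitation style channel gate, and that excitation is a normalized spatial map applied residually. The function name `motion_stimulation`, the weight matrix `w_recal`, and the `eps` constant are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def motion_stimulation(x, w_recal, eps=1e-6):
    """Illustrative MS-style block (assumed design, not the paper's).

    x:       frame features with shape (T, C, H, W)
    w_recal: hypothetical channel-mixing weights with shape (C, C)
    Returns features with the same shape as x.
    """
    # Step 1 (assumed): motion feature extraction as the difference
    # between adjacent frames; the last frame is zero-padded to keep T.
    diff = x[1:] - x[:-1]
    diff = np.concatenate([diff, np.zeros_like(x[:1])], axis=0)

    # Step 2 (assumed): recalibration with an SE-style channel gate
    # computed from the spatially pooled motion features.
    pooled = diff.mean(axis=(2, 3))            # (T, C)
    gate = sigmoid(pooled @ w_recal)           # (T, C)
    recal = diff * gate[:, :, None, None]      # (T, C, H, W)

    # Step 3 (assumed): excitation map from motion magnitude,
    # normalized per frame to the range [0, 1).
    magnitude = np.abs(recal).mean(axis=1, keepdims=True)   # (T, 1, H, W)
    peak = magnitude.max(axis=(2, 3), keepdims=True)        # (T, 1, 1, 1)
    attn = magnitude / (peak + eps)

    # Residual form: regions without motion pass through unchanged.
    return x * (1.0 + attn)
```

Under these assumptions, a static clip leaves the features untouched, while locations that change between adjacent frames are amplified. Because the output shape matches the input shape, a block of this kind could sit between existing backbone stages.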