Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization
Yuanhao Zhai, Le Wang, Wei Tang, Qilin Zhang, Junsong Yuan, Gang Hua
- 23 Aug 2020
- pp 37-54
TL;DR: This paper proposes a Two-Stream Consensus Network (TSCN) that addresses two challenges in weakly-supervised temporal action localization: eliminating false positive action proposals and generating proposals with precise temporal boundaries.
Abstract: Weakly-supervised Temporal Action Localization (W-TAL) aims to classify and localize all action instances in an untrimmed video under only video-level supervision. However, without frame-level annotations, it is challenging for W-TAL methods to identify false positive action proposals and generate action proposals with precise temporal boundaries. In this paper, we present a Two-Stream Consensus Network (TSCN) to simultaneously address these challenges. The proposed TSCN features an iterative refinement training method, where a frame-level pseudo ground truth is iteratively updated, and used to provide frame-level supervision for improved model training and false positive action proposal elimination. Furthermore, we propose a new attention normalization loss to encourage the predicted attention to act like a binary selection, and promote the precise localization of action instance boundaries. Experiments conducted on the THUMOS14 and ActivityNet datasets show that the proposed TSCN outperforms current state-of-the-art methods, and even achieves comparable results with some recent fully-supervised methods.
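The abstract names two mechanisms without giving formulas: an attention normalization loss that pushes attention toward binary values, and a frame-level pseudo ground truth derived from the two streams. The sketch below shows one plausible reading of each, not the authors' implementation. The function names, the `ratio`, `beta`, and `threshold` parameters, and the top-k/bottom-k form of the loss are all assumptions made for illustration.

```python
from typing import List


def attention_normalization_loss(attention: List[float], ratio: int = 8) -> float:
    """Mean of the k lowest attention values minus the mean of the k highest.

    Assumes attention values lie in [0, 1]. Minimizing this quantity pushes
    the highest values toward 1 and the lowest toward 0, i.e. toward a
    binary selection. `ratio` is a hypothetical hyperparameter that sets
    k = max(1, T // ratio) for a sequence of length T.
    """
    if not attention:
        raise ValueError("attention must be non-empty")
    k = max(1, len(attention) // ratio)
    ordered = sorted(attention)
    bottom_mean = sum(ordered[:k]) / k
    top_mean = sum(ordered[-k:]) / k
    return bottom_mean - top_mean


def fuse_pseudo_ground_truth(
    rgb_attention: List[float],
    flow_attention: List[float],
    beta: float = 0.5,
    threshold: float = 0.5,
) -> List[int]:
    """Fuse per-frame attention from two streams and binarize it.

    `beta` weights the RGB stream against the optical-flow stream, and
    `threshold` turns the fused attention into 0/1 frame labels. Both
    are assumed values, not taken from the paper.
    """
    if len(rgb_attention) != len(flow_attention):
        raise ValueError("streams must have the same number of frames")
    fused = [beta * r + (1.0 - beta) * f
             for r, f in zip(rgb_attention, flow_attention)]
    return [1 if a > threshold else 0 for a in fused]
```

In an iterative refinement scheme of the kind the abstract describes, labels such as those returned by `fuse_pseudo_ground_truth` would supervise each stream in the next training round, after which the labels are recomputed from the updated attention.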
Citations
CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning
Can Zhang, Meng Cao, Dongming Yang, Jie Chen, Yuexian Zou
- 01 Jun 2021
TL;DR: Zhang et al. propose to refine the hard snippet representation in feature space, which guides the network to perceive precise temporal boundaries and avoid temporal interval interruption, and introduce a Hard Snippet Mining algorithm to locate potential hard snippets.
•Posted Content
End-to-end Temporal Action Detection with Transformer.
TL;DR: TadTR is an end-to-end framework for temporal action detection that maps a set of learnable embeddings to action instances in parallel by selectively attending to a sparse set of snippets in a video.
Uncertainty Guided Collaborative Training for Weakly Supervised Temporal Action Detection
Wenfei Yang, Tianzhu Zhang, Xiaoyuan Yu, Tian Qi, Yongdong Zhang, Feng Wu
- 20 Jun 2021
TL;DR: In this paper, uncertainty guided collaborative training (UGCT) is proposed to mitigate the noise in the generated pseudo labels, which can improve the performance of weakly supervised temporal action detection.
•Proceedings Article
Weakly-supervised Temporal Action Localization by Uncertainty Modeling.
Pilhyeon Lee, Jinglu Wang, Yan Lu, Hyeran Byun
- 18 May 2021
TL;DR: This work presents a new perspective on background frames, modeling them as out-of-distribution samples because of their inconsistency. Background frames can then be detected by estimating the probability of each frame being out-of-distribution, known as uncertainty.
Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization
Fa-Ting Hong, Jia-Chang Feng, Dan Xu, Ying Shan, Wei-Shi Zheng
- 17 Oct 2021
TL;DR: In this article, a cross-modal consensus network (CO2-Net) is proposed to reduce the task-irrelevant information redundancy in weakly-supervised temporal action localization.
References
ImageNet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei
- 20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Histograms of oriented gradients for human detection
Navneet Dalal, Bill Triggs
- 20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.
•Proceedings Article
Faster R-CNN: towards real-time object detection with region proposal networks
Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
- 07 Dec 2015
TL;DR: Ren et al. propose a region proposal network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals.
•Proceedings Article
PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala
- 01 Jan 2019
TL;DR: This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.
Learning Deep Features for Discriminative Localization
Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba
- 27 Jun 2016
TL;DR: This work revisits the global average pooling layer proposed in [13], and sheds light on how it explicitly enables the convolutional neural network (CNN) to have remarkable localization ability despite being trained on image-level labels.