Open Access • Posted Content
Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization
TL;DR: This paper presents a Two-Stream Consensus Network (TSCN) that simultaneously addresses two challenges of weakly-supervised temporal action localization, together with a new attention normalization loss that encourages the predicted attention to act like a binary selection and promotes precise localization of action instance boundaries.
Abstract: Weakly-supervised Temporal Action Localization (W-TAL) aims to classify and localize all action instances in an untrimmed video under only video-level supervision. However, without frame-level annotations, it is challenging for W-TAL methods to identify false positive action proposals and generate action proposals with precise temporal boundaries. In this paper, we present a Two-Stream Consensus Network (TSCN) to simultaneously address these challenges. The proposed TSCN features an iterative refinement training method, where a frame-level pseudo ground truth is iteratively updated, and used to provide frame-level supervision for improved model training and false positive action proposal elimination. Furthermore, we propose a new attention normalization loss to encourage the predicted attention to act like a binary selection, and promote the precise localization of action instance boundaries. Experiments conducted on the THUMOS14 and ActivityNet datasets show that the proposed TSCN outperforms current state-of-the-art methods, and even achieves comparable results with some recent fully-supervised methods.
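The abstract says the attention normalization loss pushes the predicted attention toward a binary selection, but it does not state the formula (the paper defines it in Equation (5)). The sketch below is therefore only an illustration of the idea, under an assumed form: the mean of the `k` smallest attention values minus the mean of the `k` largest. The function name `attention_polarization_loss` and the parameter `k` are hypothetical and are not taken from the paper.

```python
def attention_polarization_loss(attention, k):
    """Mean of the k smallest values minus the mean of the k largest values.

    Hypothetical stand-in, NOT the paper's Equation (5). For attention values
    in [0, 1] the result lies in [-1, 0]; it reaches -1 only when at least k
    entries are 0 and at least k entries are 1, i.e. a binary-like sequence.
    """
    if not 1 <= k <= len(attention):
        raise ValueError("k must be between 1 and len(attention)")
    ordered = sorted(attention)
    bottom_mean = sum(ordered[:k]) / k   # average of the k lowest values
    top_mean = sum(ordered[-k:]) / k     # average of the k highest values
    return bottom_mean - top_mean


flat = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
binary_like = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0]
print(attention_polarization_loss(flat, k=2))         # 0.0
print(attention_polarization_loss(binary_like, k=2))  # -1.0
```

Minimizing a quantity of this kind rewards sequences whose high and low values are well separated, which is the behavior the abstract attributes to the loss.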
Figures

Fig. 1: Visualization of two-stream outputs and their late fusion result. The first two rows are an input video and the ground truth action instances, respectively. The last three rows are attention sequences (scaled from 0 to 1) predicted by the RGB stream, the flow stream and their weighted sum (i.e., the fusion result), respectively, and the horizontal and vertical axes denote the time and the intensity of attention values, respectively. The green boxes denote the localization results generated by thresholding the attention at the value of 0.5. By properly combining the two different attention distributions predicted by the RGB and flow streams, the late fusion result achieves a higher true positive rate and a lower false positive rate, and thus has better localization performance 
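The late fusion and thresholding described in the Fig. 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the fusion weight `rgb_weight`, the strict `>` comparison, and the function names are assumptions; only the weighted sum and the 0.5 threshold come from the caption.

```python
def fuse_attention(rgb, flow, rgb_weight=0.5):
    """Weighted sum of two attention sequences (rgb_weight is an assumed value)."""
    if len(rgb) != len(flow):
        raise ValueError("attention sequences must have the same length")
    return [rgb_weight * r + (1.0 - rgb_weight) * f for r, f in zip(rgb, flow)]


def threshold_to_segments(attention, threshold=0.5):
    """Return (start, end) index pairs, end exclusive, for runs above threshold."""
    segments = []
    start = None
    for i, value in enumerate(attention):
        if value > threshold and start is None:
            start = i                      # a new run begins
        elif value <= threshold and start is not None:
            segments.append((start, i))    # the current run ends
            start = None
    if start is not None:                  # run extends to the last snippet
        segments.append((start, len(attention)))
    return segments


rgb = [0.1, 0.8, 0.9, 0.2, 0.1, 0.7, 0.2]
flow = [0.0, 0.6, 0.9, 0.9, 0.1, 0.1, 0.1]
fused = fuse_attention(rgb, flow)
print(threshold_to_segments(rgb))    # [(1, 3), (5, 6)]
print(threshold_to_segments(fused))  # [(1, 4)]
```

In this toy input the RGB stream alone yields a spurious second segment and cuts the first one short, while the fused sequence yields a single, longer segment, mirroring the effect the caption describes.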
Table 1: Comparison of our method with state-of-the-art TAL methods on the THUMOS14 testing set. UNT and I3D are abbreviations for UntrimmedNet feature and I3D feature, respectively 
Fig. 4: Qualitative results on the THUMOS14 testing set. The eight rows in each example are the input video, the ground truth action instances, and the RGB stream, flow stream, and fusion attention sequences from the model trained with only video-level labels and from the model trained with frame-level pseudo ground truth, respectively. Action proposals are represented by green boxes. The horizontal and vertical axes are time and intensity of attention, respectively 
Table 5: Comparison between the model trained with only video-level labels and the model trained with hard pseudo ground truth on the THUMOS14 testing set. The label column denotes the supervision used in training, where “video” indicates only video-level labels are leveraged, and “frame” indicates the hard pseudo ground truth is also leveraged during training. Precision, recall and F-measure are calculated under IoU threshold 0.5 
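As a rough illustration of how precision, recall and F-measure at an IoU threshold of 0.5 can be computed for temporal segments, consider the sketch below. The greedy one-to-one matching and the function names are assumptions made for this example; the caption does not specify the matching protocol used in the paper.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f(proposals, ground_truth, iou_threshold=0.5):
    """Greedy matching: each ground truth segment is matched at most once."""
    matched = set()
    true_positives = 0
    for p in proposals:
        best_j, best_iou = None, 0.0
        for j, g in enumerate(ground_truth):
            if j in matched:
                continue
            iou = temporal_iou(p, g)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou >= iou_threshold:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(proposals) if proposals else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


proposals = [(0, 10), (20, 30), (50, 55)]
ground_truth = [(1, 11), (22, 40)]
p, r, f = precision_recall_f(proposals, ground_truth)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.333 0.5 0.4
```

Here only the first proposal overlaps a ground truth segment with IoU of at least 0.5 (9/11), so one of three proposals is correct and one of two ground truth instances is recovered.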
Table 4: Comparison of our method with different attention normalization functions on the THUMOS14 testing set. Lbg is the background classification loss introduced in [28], and Latt is defined in Equation (5). The var column denotes the average attention variance over the whole testing set 
Fig. 3: Comparison between models trained with different pseudo ground truth on the THUMOS14 testing set. The upper bounds denote models trained with ground truth actionness sequence
Citations
CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning
Can Zhang, Meng Cao, Dongming Yang, Jie Chen, Yuexian Zou
- 01 Jun 2021
TL;DR: This paper proposes to refine the hard snippet representation in feature space, which guides the network to perceive precise temporal boundaries and avoid temporal interval interruption, and introduces a Hard Snippet Mining algorithm to locate potential hard snippets.
Uncertainty Guided Collaborative Training for Weakly Supervised Temporal Action Detection
Wenfei Yang, Tianzhu Zhang, Xiaoyuan Yu, Tian Qi, Yongdong Zhang, Feng Wu
- 20 Jun 2021
TL;DR: In this paper, uncertainty guided collaborative training (UGCT) is proposed to mitigate the noise in the generated pseudo labels, which can improve the performance of weakly supervised temporal action detection.
Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization
Fa-Ting Hong, Jia-Chang Feng, Dan Xu, Ying Shan, Wei-Shi Zheng
- 17 Oct 2021
TL;DR: In this article, a cross-modal consensus network (CO2-Net) is proposed to reduce the task-irrelevant information redundancy in weakly-supervised temporal action localization.
93
•Posted Content
Weakly-supervised Temporal Action Localization by Uncertainty Modeling
TL;DR: A new perspective on background frames is presented where they are modeled as out-of-distribution samples regarding their inconsistency, and a background entropy loss is introduced to better discriminate background frames by encouraging their in-distribution (action) probabilities to be uniformly distributed over all action classes.
87
•Posted Content
D2-Net: Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations.
TL;DR: A novel loss formulation is introduced, which jointly enhances the discriminability of latent embeddings and robustness of the output temporal class activations with respect to foreground-background noise caused by weak supervision, in a weakly-supervised temporal action localization framework, called D2-Net.
77
References
•Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
- 01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
138.5K
•Posted Content
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.
82.5K
ImageNet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei
- 20 Jun 2009
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.
Histograms of oriented gradients for human detection
Navneet Dalal, Bill Triggs
- 20 Jun 2005
TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.