Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization

Open Access • Posted Content

- 22 Oct 2020

- arXiv: Computer Vision and Pattern Recog...

88

TL;DR: This paper proposes a Two-Stream Consensus Network (TSCN) to address the challenges of weakly-supervised Temporal Action Localization, together with a new attention normalization loss that encourages the predicted attention to act like a binary selection and promotes precise localization of action instance boundaries.

Figures

Fig. 1: Visualization of two-stream outputs and their late fusion result. The first two rows are an input video and the ground truth action instances, respectively. The last three rows are attention sequences (scaled from 0 to 1) predicted by the RGB stream, the flow stream and their weighted sum (i.e., the fusion result), respectively, and the horizontal and vertical axes denote the time and the intensity of attention values, respectively. The green boxes denote the localization results generated by thresholding the attention at the value of 0.5. By properly combining the two different attention distributions predicted by the RGB and flow streams, the late fusion result achieves a higher true positive rate and a lower false positive rate, and thus has better localization performance
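The late fusion and thresholding described in the Fig. 1 caption can be illustrated with a minimal Python sketch. The function names, the equal fusion weight, and the toy attention values are assumptions made for illustration and are not taken from the paper; only the threshold of 0.5 comes from the caption.

```python
from typing import List, Sequence, Tuple


def late_fusion(rgb: Sequence[float], flow: Sequence[float],
                rgb_weight: float = 0.5) -> List[float]:
    """Weighted sum of two equally long attention sequences.

    rgb_weight is a hypothetical parameter; the flow stream gets 1 - rgb_weight.
    """
    if len(rgb) != len(flow):
        raise ValueError("attention sequences must have the same length")
    return [rgb_weight * r + (1.0 - rgb_weight) * f for r, f in zip(rgb, flow)]


def threshold_segments(attention: Sequence[float],
                       threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return maximal runs [start, end) of indices whose attention exceeds the threshold."""
    segments = []
    start = None
    for i, value in enumerate(attention):
        if value > threshold and start is None:
            start = i                      # a new segment opens here
        elif value <= threshold and start is not None:
            segments.append((start, i))    # the open segment closes before i
            start = None
    if start is not None:                  # segment runs to the end of the sequence
        segments.append((start, len(attention)))
    return segments


# Toy attention values in [0, 1], one per temporal position (made up for this example).
rgb_attention = [0.9, 0.8, 0.2, 0.1, 0.3, 0.9]
flow_attention = [0.7, 0.6, 0.1, 0.8, 0.9, 0.2]
fused_attention = late_fusion(rgb_attention, flow_attention)

rgb_segments = threshold_segments(rgb_attention)
flow_segments = threshold_segments(flow_attention)
fused_segments = threshold_segments(fused_attention)
```

In this toy case each stream fires on a different part of the second half of the sequence, and the fused attention keeps only the positions where the weighted evidence stays above 0.5.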

Table 1: Comparison of our method with state-of-the-art TAL methods on the THUMOS14 testing set. UNT and I3D are abbreviations for UntrimmedNet feature and I3D feature, respectively

Fig. 4: Qualitative results on the THUMOS14 testing set. The eight rows in each example are the input video, the ground truth action instances, the RGB stream, flow stream, and fusion attention sequences from the model trained with only video-level labels, and the same three attention sequences from the model trained with frame-level pseudo ground truth, respectively. Action proposals are represented by green boxes. The horizontal and vertical axes are time and intensity of attention, respectively

Table 5: Comparison between the model trained with only video-level labels and the model trained with hard pseudo ground truth on the THUMOS14 testing set. The label column denotes the supervision used in training, where “video” indicates only video-level labels are leveraged, and “frame” indicates the hard pseudo ground truth is also leveraged during training. Precision, recall and F-measure are calculated under IoU threshold 0.5
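As a rough illustration of how precision, recall and F-measure can be computed under an IoU threshold of 0.5, the sketch below matches predicted segments to ground truth segments by temporal IoU. The greedy one-to-one matching, the function names, and the example segments are assumptions; the caption does not specify the exact matching protocol used in the paper.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds or frames


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two temporal segments."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def precision_recall_f(predictions: List[Segment], ground_truth: List[Segment],
                       iou_threshold: float = 0.5) -> Tuple[float, float, float]:
    """Greedy matching (an assumption): each ground truth segment matches at most one prediction."""
    matched = set()
    true_positives = 0
    for pred in predictions:
        best_index, best_iou = None, 0.0
        for j, gt in enumerate(ground_truth):
            if j in matched:
                continue
            iou = temporal_iou(pred, gt)
            if iou > best_iou:
                best_index, best_iou = j, iou
        if best_index is not None and best_iou >= iou_threshold:
            matched.add(best_index)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


# Made-up segments: only the first prediction overlaps a ground truth with IoU >= 0.5.
example_predictions = [(0, 10), (20, 30), (50, 60)]
example_ground_truth = [(1, 9), (22, 40)]
precision, recall, f_measure = precision_recall_f(example_predictions, example_ground_truth)
```

Here the second prediction overlaps its ground truth segment with an IoU of 0.4, so it counts as a false positive at the 0.5 threshold.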

Table 4: Comparison of our method with different attention normalization functions on the THUMOS14 testing set. L_bg is the background classification loss introduced in [28], and L_att is defined in Equation (5). The var column denotes the average attention variance over the whole testing set

Fig. 3: Comparison between models trained with different pseudo ground truth on the THUMOS14 testing set. The upper bounds denote models trained with ground truth actionness sequence

Citations

•Proceedings Article•10.1109/CVPR46437.2021.01575

CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning

Can Zhang, +4 more

- 01 Jun 2021

TL;DR: This paper proposes to refine the hard snippet representation in feature space, which guides the network to perceive precise temporal boundaries and avoid temporal interval interruption, and introduces a Hard Snippet Mining algorithm to locate the potential hard snippets.

166

•Proceedings Article•10.1109/CVPR46437.2021.00012

Uncertainty Guided Collaborative Training for Weakly Supervised Temporal Action Detection

Wenfei Yang, +5 more

- 20 Jun 2021

TL;DR: In this paper, uncertainty guided collaborative training (UGCT) is proposed to mitigate the noise in the generated pseudo labels, which can improve the performance of weakly supervised temporal action detection.

116

•Proceedings Article•10.1145/3474085.3475298

Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

Fa-Ting Hong, +4 more

- 17 Oct 2021

TL;DR: In this article, a cross-modal consensus network (CO2-Net) is proposed to reduce the task-irrelevant information redundancy in weakly-supervised temporal action localization.

93

•Posted Content

Weakly-supervised Temporal Action Localization by Uncertainty Modeling

Pilhyeon Lee, +3 more

- 12 Jun 2020

- arXiv: Computer Vision and Pattern Recog...

TL;DR: A new perspective on background frames is presented, in which they are modeled as out-of-distribution samples because of their inconsistency, and a background entropy loss is introduced to better discriminate background frames by encouraging their in-distribution (action) probabilities to be uniformly distributed over all action classes.

87

•Posted Content

D2-Net: Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations

Sanath Narayan, +5 more

- 11 Dec 2020

- arXiv: Computer Vision and Pattern Recog...

TL;DR: A novel loss formulation is introduced, which jointly enhances the discriminability of latent embeddings and robustness of the output temporal class activations with respect to foreground-background noise caused by weak supervision, in a weakly-supervised temporal action localization framework, called D2-Net.

77

References

•Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

- 01 Jan 2015

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

138.5K

•Posted Content

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

- 22 Dec 2014

- arXiv: Learning

TL;DR: This article introduces an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.

82.5K

•Proceedings Article•10.1109/CVPR.2009.5206848

ImageNet: A large-scale hierarchical image database

Jia Deng, +5 more

- 20 Jun 2009

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

75.9K

•Journal Article•10.1109/TPAMI.2016.2577031

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, +3 more

- 01 Jun 2017

- IEEE Transactions on Pattern Analysis an...

TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.

64.4K

•Proceedings Article•10.1109/CVPR.2005.177

Histograms of oriented gradients for human detection

Navneet Dalal, +1 more

- 20 Jun 2005

TL;DR: It is shown experimentally that grids of histograms of oriented gradient (HOG) descriptors significantly outperform existing feature sets for human detection, and the influence of each stage of the computation on performance is studied.

36.7K