End-to-End Learning of Visual Representations From Uncurated Instructional Videos
Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman
- 14 Jun 2020
- pp 9879-9889
TL;DR: This work proposes a new learning approach, MIL-NCE, that addresses the misalignments inherent in narrated videos and outperforms all published self-supervised approaches on the evaluated downstream tasks, as well as several fully supervised baselines.
Abstract: Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
Citations
M³3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding
Muhammad Abdullah Jamal, Omid Mohareri
- 03 Jan 2024
RAP: Retrieval-Augmented Planner for Adaptive Procedure Planning in Instructional Videos
A Ghamari Zare, Yulei Niu, Hammad Ayyubi, Shih‐Fu Chang
Contrastive Learning of Person-Independent Representations for Facial Action Unit Detection
Yong Li, Shiguang Shan
TL;DR: This paper adopts a margin-based temporal contrastive learning paradigm to perceive the temporal AU coherence and evolution characteristics within a clip of consecutive input facial frames.
Achieving Procedure-Aware Instructional Video Correlation Learning Under Weak Supervision from a Collaborative Perspective
Tianyao He, Huabin Liu, Zelin Ni, Yuxi Li, Xiao Ma, Cheng Zhong, Shuicheng Yan, Yingxue Wang, Weiyao Lin
Exploring Temporal Concurrency for Video-Language Representation Learning
Heng Zhang, Daqing Liu, Zezhong Lv, Bing Su, Dacheng Tao
- 01 Oct 2023
TL;DR: This paper proposes to learn video-language representations by modeling video-language pairs as Temporal Concurrent Processes (TCP) via a process-wise distance metric learning framework, and introduces a regularization term that enforces that the embeddings of each modality approximate a stochastic process, to guarantee the inherent dynamics.
References
•Proceedings Article
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
- 01 Jan 2015
TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
138.5K
•Posted Content
Deep Residual Learning for Image Recognition
TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
117.9K
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
99K
•Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- 12 Jun 2017
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
•Posted Content
Adam: A Method for Stochastic Optimization
Diederik P. Kingma, Jimmy Ba
TL;DR: This work uses adaptive estimates of lower-order moments for first-order gradient-based optimization of stochastic objective functions.
82.5K
Related Papers (5)
Joao Carreira, Andrew Zisserman
- 21 Jul 2017
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
- 27 Jun 2016