MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

doi:10.1109/CVPR.2016.571

•Proceedings Article•10.1109/CVPR.2016.571

Jun Xu, +3 more

- 01 Jun 2016

- pp 5288-5296

1.5K

TL;DR: A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.

Abstract: While there has been increasing interest in the task of describing video with natural language, current computer vision algorithms are still severely limited in terms of the variability and complexity of the videos and their associated language that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on specific fine-grained domains with limited videos and simple descriptions. While researchers have provided several benchmark datasets for image captioning, we are not aware of any large-scale video description dataset with comprehensive categories yet diverse video content. In this paper we present MSR-VTT (standing for "MSR-Video to Text"), which is a new large-scale video benchmark for video understanding, especially the emerging task of translating video to text. This is achieved by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips with 41.2 hours and 200K clip-sentence pairs in total, covering the most comprehensive categories and diverse visual content, and representing the largest dataset in terms of sentence and vocabulary. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers. We present a detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches. We also provide an extensive evaluation of these approaches on this dataset, showing that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on MSR-VTT.
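
The dataset figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is only an illustration: the constant and function names are hypothetical, and the inputs are the rounded values from the abstract ("10K", "about 20"), so the derived numbers are approximate rather than official statistics.

```python
# Rounded figures taken from the abstract; names here are illustrative only.
NUM_QUERIES = 257          # popular queries from a commercial video search engine
VIDEOS_PER_QUERY = 118     # videos collected for each query
NUM_CLIPS = 10_000         # "10K web video clips"
TOTAL_HOURS = 41.2         # total duration of all clips
SENTENCES_PER_CLIP = 20    # "about 20 natural sentences" per clip


def dataset_summary():
    """Derive approximate aggregate statistics from the quoted figures."""
    return {
        # Videos gathered before being cut into clips.
        "videos_collected": NUM_QUERIES * VIDEOS_PER_QUERY,
        # Should agree with the "200K clip-sentence pairs" in the abstract.
        "clip_sentence_pairs": NUM_CLIPS * SENTENCES_PER_CLIP,
        # Average clip duration in seconds.
        "mean_clip_seconds": TOTAL_HOURS * 3600 / NUM_CLIPS,
    }


if __name__ == "__main__":
    for name, value in dataset_summary().items():
        print(f"{name}: {value}")
```

Under these assumptions the figures are mutually consistent: roughly 30K videos were collected, 10K clips at about 20 sentences each give the stated 200K clip-sentence pairs, and the average clip runs a little under 15 seconds.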

Citations

•Proceedings Article•10.1109/ICCV.2017.622

The “Something Something” Video Database for Learning and Evaluating Visual Common Sense

Raghav Goyal, +13 more

- 13 Jun 2017

TL;DR: This work describes the ongoing collection of the “something-something” database of video prediction tasks whose solutions require a common sense understanding of the depicted situation, and describes the challenges in crowd-sourcing this data at scale.

1.8K

MSR-VTT: A Large Video Description Dataset for Bridging Video and Language [Supplementary Material]

Jun Xu, +3 more

- 06 Oct 2016

TL;DR: The MSR-VTT dataset was used in the Microsoft Research Video To Language challenge (http://ms-multimedia-challenge.com/); simple and duplicated sentences were removed and replaced with refined ones to control the quality of data and annotations.

1.2K

•Proceedings Article•10.1109/ICCV.2019.00473

Attention on Attention for Image Captioning

Lun Huang, +3 more

- 01 Oct 2019

TL;DR: This paper proposes an Attention on Attention (AoA) module, which extends conventional attention mechanisms to determine the relevance between attention results and queries; the resulting model, AoANet, achieves state-of-the-art performance.

1.1K

•Proceedings Article•10.1109/CVPR42600.2020.00990

End-to-End Learning of Visual Representations From Uncurated Instructional Videos

Antoine Miech, +5 more

- 14 Jun 2020

TL;DR: This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos and outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.

943

•Posted Content

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Antoine Miech, +5 more

- 07 Jun 2019

- arXiv: Computer Vision and Pattern Recog...

TL;DR: It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.

820

References

•Book Chapter•10.1007/978-3-319-10602-1_48

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

- 06 Sep 2014

TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.

51.7K

•Proceedings Article•10.3115/1073083.1073135

Bleu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, +3 more

- 06 Jul 2002

TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.

28.9K

•Proceedings Article

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Kelvin Xu, +10 more

- 06 Jul 2015

TL;DR: An attention-based model that automatically learns to describe the content of images is introduced; it can be trained in a deterministic manner using standard backpropagation techniques, or stochastically by maximizing a variational lower bound.

10.1K

•Proceedings Article•10.1109/CVPR.2015.7298935

Show and tell: A neural image caption generator

Oriol Vinyals, +3 more

- 07 Jun 2015

TL;DR: In this paper, a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation is proposed to generate natural sentences describing the content of an image.

7.5K

•Proceedings Article•10.1109/CVPR.2014.223

Large-Scale Video Classification with Convolutional Neural Networks

Andrej Karpathy, +5 more

- 23 Jun 2014

TL;DR: This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.

6.6K

Related Papers (5)

Deep Residual Learning for Image Recognition

Bleu: a Method for Automatic Evaluation of Machine Translation

Long short-term memory

Learning Spatiotemporal Features with 3D Convolutional Networks

ROUGE: A Package for Automatic Evaluation of Summaries