MSR-VTT: A Large Video Description Dataset for Bridging Video and Language [Supplementary Material]

Open Access

- 06 Oct 2016

TL;DR: For the Microsoft Research Video To Language challenge (http://ms-multimedia-challenge.com/), the authors removed simple and duplicated sentences from the MSR-VTT dataset and replaced them with refined ones to control the quality of the data and annotations; this supplement reports the updated baseline performance.

Abstract: While organizing the Microsoft Research Video To Language challenge (http://ms-multimedia-challenge.com/), we found that in our previously released dataset (CVPR 2016 paper), some sentences annotated by AMT workers are identical within a single video clip or very similar within a single category. To control the quality of the data and annotations, as well as of the competition, we removed these simple and duplicated sentences and replaced them with refined ones, and we released the corrected dataset on our challenge website. Because of these modifications, performance on the released dataset does not match what we reported in our CVPR paper. This supplementary material therefore reports the updated performance in tables that correspond to Tables 1-7 of the CVPR paper. If you are trying to reproduce or compare against the baselines on our MSR-VTT dataset, please refer to the updated performance reported here. If you use MSR-VTT as your dataset, however, please cite our CVPR paper: J. Xu, T. Mei, T. Yao, Y. Rui. “MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, In Proceedings of CVPR, 2016. MSR Video to Language challenge: http://ms-multimedia-challenge.com/.

Citations

Journal Article•10.1109/tcsvt.2023.3303945

SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks

Xingning Dong, +7 more

- 31 Jan 2024

TL;DR: The Significant Semantic Strengthening (S3) strategy, which includes a novel masking and matching proxy task to improve pre-training performance, is proposed and establishes a new state of the art in pixel-level video-text pre-training.

Journal Article•10.48550/arxiv.2404.00234

Grid Diffusion Models for Text-to-Video Generation

Taegyeong Lee, +2 more

- 30 Mar 2024

- arXiv.org

TL;DR: Grid diffusion models are effective for text-to-video generation, reducing the need for large datasets and high computational costs.

Proceedings Article•10.48550/arXiv.2210.12617

Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval

Minjoon Jung, +4 more

- 23 Oct 2022

TL;DR: It is shown that MPGN successfully learns to localize the video corpus moment without any explicit annotation, and the effectiveness of MPGN on the TVR dataset is validated, showing competitive results compared with both supervised models and unsupervised setting models.

Journal Article•10.48550/arXiv.2304.11431

A Review of Deep Learning for Video Captioning

Moloud Abdar, +10 more

- 22 Apr 2023

- arXiv.org

TL;DR: Video captioning is a fast-moving, cross-disciplinary area of research that bridges work in computer vision, natural language processing (NLP), linguistics, and human-computer interaction.

Generative Prompt Model for Weakly Supervised Object Localization

Yuzhong Zhao, +4 more

- 19 Jul 2023

TL;DR: GenPromp is a generative pipeline that localizes less discriminative object parts by formulating weakly supervised object localization (WSOL) as a conditional image denoising procedure.

Related Papers (5)

MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

Bleu: a Method for Automatic Evaluation of Machine Translation

CIDEr: Consensus-based image description evaluation

ADADELTA: An Adaptive Learning Rate Method

METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments