MSR-VTT: A Large Video Description Dataset for Bridging Video and Language [Supplementary Material]
TL;DR: The MSR-VTT dataset was used in the Microsoft Research Video To Language challenge (http://ms-multimedia-challenge.com/). For that challenge, the authors removed simple and duplicated sentences and replaced them with refined ones to control the quality of the data and annotations.
Abstract: When organizing the Microsoft Research Video To Language challenge (http://ms-multimedia-challenge.com/), we found that, in our previously released dataset (CVPR 2016 paper), some sentences annotated by AMT workers were identical within one video clip or very similar within one category. Therefore, to control the quality of the data and annotations, as well as of the competition, we removed those simple and duplicated sentences and replaced them with refined ones. We then released the fixed dataset on our challenge website. Because of these modifications, performance on the fixed dataset does not match what we reported in our CVPR paper. Here we report the updated performance in the following tables, which correspond to Tables 1 to 7 of our CVPR paper. If you are trying to reproduce or compare against the baselines on our MSR-VTT dataset, please refer to this supplementary material and the updated performance reported here. However, please cite our CVPR paper if you use MSR-VTT as your dataset: J. Xu, T. Mei, T. Yao, Y. Rui. "MSR-VTT: A Large Video Description Dataset for Bridging Video and Language", in Proceedings of CVPR, 2016. MSR Video to Language challenge (http://ms-multimedia-challenge.com/).
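The abstract mentions sentences that are identical within one video clip. As a rough illustration only, and not the authors' actual cleaning procedure, the sketch below flags captions that repeat within the same clip after lowercasing and whitespace normalization. The `(video_id, caption)` pair format, the function name `find_duplicate_captions`, and the sample data are all assumptions made for this example; the released dataset's real file layout is not described here.

```python
from collections import defaultdict


def find_duplicate_captions(annotations):
    """Return {video_id: [caption, ...]} for captions repeated within a clip.

    `annotations` is a hypothetical iterable of (video_id, caption) pairs.
    Two captions count as identical if they match after lowercasing and
    collapsing runs of whitespace.
    """
    seen = defaultdict(set)         # normalized captions already seen per clip
    duplicates = defaultdict(list)  # repeated captions, in original form
    for video_id, caption in annotations:
        key = " ".join(caption.lower().split())
        if key in seen[video_id]:
            duplicates[video_id].append(caption)
        else:
            seen[video_id].add(key)
    return dict(duplicates)


# Hypothetical sample annotations, invented for illustration.
sample = [
    ("video0", "a man is singing"),
    ("video0", "A man is  singing"),
    ("video0", "a woman plays guitar"),
    ("video1", "a man is singing"),
]
duplicates = find_duplicate_captions(sample)
```

This only catches exact repeats within a clip. The "very similar in one category" case the abstract describes would need some similarity measure across clips, which the source does not specify.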