Jointly Modeling Embedding and Translation to Bridge Video and Language

doi:10.1109/CVPR.2016.497

Open Access•Proceedings Article•10.1109/CVPR.2016.497

Yingwei Pan, +4 more

- 27 Jun 2016

- pp 4594-4602

700

TL;DR: This paper presents a unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously explores the learning of LSTM and visual-semantic embedding.

Abstract: Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate each word locally, given the previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct while their semantics (e.g., subjects, verbs or objects) are wrong. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously explores the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given the previous words and the visual content, while the latter creates a visual-semantic embedding space that enforces the relationship between the semantics of the entire sentence and the visual content. Experiments on the YouTube2Text dataset show that our proposed LSTM-E achieves the best published performance to date in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. Superior performance is also reported on two movie description datasets (M-VAD and MPII-MD). In addition, we demonstrate that LSTM-E outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
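The two objectives described in the abstract can be illustrated with a minimal sketch. It assumes the local word-generation term is a negative log-likelihood over next words and the embedding term is a squared distance between video and sentence vectors in the shared space, combined as a weighted sum. The function names, the squared-distance choice, and the trade-off weight `lam` are illustrative assumptions; none of them is stated in the abstract.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def coherence_loss(step_logits, target_ids):
    """Local term: negative log-likelihood of each next word.

    step_logits[t] holds the (hypothetical) LSTM vocabulary scores at step t,
    and target_ids[t] is the index of the ground-truth word at that step.
    """
    nll = 0.0
    for logits, target in zip(step_logits, target_ids):
        nll -= math.log(softmax(logits)[target])
    return nll


def relevance_loss(video_emb, sentence_emb):
    """Holistic term: squared distance between the video and the whole
    sentence in the shared embedding space (assumed distance measure)."""
    return sum((v - s) ** 2 for v, s in zip(video_emb, sentence_emb))


def joint_loss(step_logits, target_ids, video_emb, sentence_emb, lam=0.5):
    """Weighted sum of both terms; lam is a hypothetical trade-off weight."""
    rel = relevance_loss(video_emb, sentence_emb)
    coh = coherence_loss(step_logits, target_ids)
    return (1.0 - lam) * rel + lam * coh


if __name__ == "__main__":
    # Toy values: a 2-word sentence over a 3-word vocabulary, 2-d embeddings.
    logits = [[2.0, 0.5, 0.1], [0.2, 0.3, 1.5]]
    targets = [0, 2]
    print(joint_loss(logits, targets, [0.4, 0.1], [0.5, 0.0], lam=0.7))
```

Setting `lam` to 1.0 leaves only the word-by-word term, which corresponds to the purely local generation that the abstract criticizes; lower values give more weight to sentence-level agreement with the video.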

Citations

•Posted Content

Beyond Caption To Narrative: Video Captioning With Multiple Sentences

Andrew Shin, +2 more

- 18 May 2016

- arXiv: Computer Vision and Pattern Recog...

TL;DR: In this paper, the authors attempt to generate video captions that convey richer contents by temporally segmenting the video with action localization, generating multiple captions from multiple frames, and connecting them with natural language processing techniques, in order to generate a story-like caption.

22

•Proceedings Article

Cross-Attentional Audio-Visual Fusion for Weakly-Supervised Action Localization

Jun-Tae Lee, +3 more

- 03 May 2021

TL;DR: In this article, a multi-stage cross-attention mechanism is proposed to fuse audio and visual features for weakly-supervised action localization, which preserves the intra-modal characteristics.

21

•Journal Article•10.1609/AAAI.V33I01.33018271

Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention

Fang Kuncheng, +6 more

- 17 Jul 2019

TL;DR: A novel architecture to generate the optimal descriptions for videos is proposed, which focuses on constructing a new network structure that can generate sentences superior to the basic model with LSTM, and establishing special attention mechanisms that can provide more useful visual information for caption generation.

21

•Journal Article•10.1109/ACCESS.2021.3058248

Video Question-Answering Techniques, Benchmark Datasets and Evaluation Metrics Leveraging Video Captioning: A Comprehensive Survey

Khushboo Khurana, +1 more

- 09 Feb 2021

- IEEE Access

TL;DR: This paper gives a brief survey of video captioning techniques and a comprehensive review of existing techniques, datasets, and evaluation metrics for the task of video-QA.

21

•Proceedings Article•10.1109/CVPR.2017.225

Unified Embedding and Metric Learning for Zero-Exemplar Event Detection

Noureldien Hussein, +2 more

- 21 Jun 2017

TL;DR: In this paper, a joint space in which the visual and textual representations are embedded is learned, which casts a novel event as a probability of pre-defined events and measures the distance between an event and its related videos.

21

References

•Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

- 04 Sep 2014

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

102.6K

•Journal Article•10.1162/NECO.1997.9.8.1735

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997

- Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

99K

•Journal Article•10.1145/3065386

ImageNet classification with deep convolutional neural networks

Alex Krizhevsky, +2 more

- 24 May 2017

- Communications of The ACM

TL;DR: A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

98.2K

•Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

- 03 Dec 2012

TL;DR: The authors train a deep convolutional neural network that consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

88.4K

•Proceedings Article•10.1109/CVPR.2015.7298594

Going deeper with convolutions

Christian Szegedy, +8 more

- 07 Jun 2015

TL;DR: Inception is a deep convolutional neural network architecture that achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

56.6K

Related Papers (5)

Bleu: a Method for Automatic Evaluation of Machine Translation

Deep Residual Learning for Image Recognition

CIDEr: Consensus-based image description evaluation

Long short-term memory

Learning Spatiotemporal Features with 3D Convolutional Networks