Jointly Modeling Embedding and Translation to Bridge Video and Language

doi:10.1109/CVPR.2016.497

Open Access•Proceedings Article•10.1109/CVPR.2016.497

Yingwei Pan, +4 more

- 27 Jun 2016

- pp 4594-4602

700

TL;DR: This paper presents a unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously explores the learning of LSTM and visual-semantic embedding.

Abstract: Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate a word locally, given the previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct but semantically wrong (e.g., in their subjects, verbs, or objects). This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously explores the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given the previous words and the visual content, while the latter creates a visual-semantic embedding space that enforces the relationship between the semantics of the entire sentence and the visual content. Experiments on the YouTube2Text dataset show that our proposed LSTM-E achieves the best published performance to date in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. Superior performance is also reported on two movie description datasets (M-VAD and MPII-MD). In addition, we demonstrate that LSTM-E outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
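The two objectives described in the abstract can be illustrated with a minimal sketch. It assumes the joint objective is a weighted sum of a relevance term (squared distance between video and sentence in a shared embedding space) and a coherence term (negative log-likelihood of the words). The abstract does not spell out this form, so the function names, the projection matrices `T_v` and `T_s`, and the tradeoff weight `lam` are all hypothetical, and the per-word probabilities are taken as given rather than produced by an actual LSTM.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def relevance_loss(video_feat, sent_feat, T_v, T_s):
    """Squared Euclidean distance between the video and the sentence
    after projecting both into a shared embedding space.
    T_v and T_s are hypothetical projection matrices."""
    ev = matvec(T_v, video_feat)
    es = matvec(T_s, sent_feat)
    return sum((a - b) ** 2 for a, b in zip(ev, es))


def coherence_loss(word_probs):
    """Negative log-likelihood of a sentence, where word_probs[t] is the
    probability a decoder assigned to the t-th ground-truth word given
    the previous words and the visual content."""
    return -sum(math.log(p) for p in word_probs)


def joint_loss(video_feat, sent_feat, T_v, T_s, word_probs, lam=0.5):
    """Weighted sum of both terms; lam is an assumed tradeoff weight."""
    rel = relevance_loss(video_feat, sent_feat, T_v, T_s)
    coh = coherence_loss(word_probs)
    return (1.0 - lam) * rel + lam * coh


# Toy usage: 2-d video feature, 3-d sentence feature, 2-d shared space.
T_v = [[1.0, 0.0], [0.0, 1.0]]
T_s = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss = joint_loss([1.0, 0.0], [0.0, 1.0, 0.0], T_v, T_s, [0.5, 0.5])
print(loss)
```

Under this assumed form, setting `lam` to 1 leaves only the word-by-word likelihood term, which corresponds to the purely local generation that the abstract criticizes, while lower values put more weight on sentence-level agreement with the video.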

Citations

•Journal Article•10.1145/3579825

Fine-Grained Text-to-Video Temporal Grounding from Coarse Boundary

Jiachang Hao, +6 more

- 12 Jan 2023

- ACM Transactions on Multimedia Computing...

TL;DR: In this paper, a fine-grained text-to-video temporal grounding approach is proposed to locate a target video moment that semantically corresponds to the given sentence query in an untrimmed video.

2

•Book Chapter•10.1007/978-3-030-58595-2_26

Graph Wasserstein Correlation Analysis for Movie Retrieval.

Xueya Zhang, +4 more

- 23 Aug 2020

TL;DR: This paper proposes Graph Wasserstein Correlation Analysis (GWCA) to address the core issue in movie retrieval, i.e., cross heterogeneous graph comparison. The solution of the graph comparison model is derived as a classic generalized eigenvalue decomposition problem, which has an exact closed-form solution.

2

•Patent

Cognitive print speaker modeler

Amsterdam Jeff, +3 more

- 31 Oct 2019

TL;DR: A hierarchical long short-term memory (LSTM) model is used to identify a speaker in a streaming video with audio, according to words spoken by the speaker that are matched to a cognitive print.

2

•Posted Content

Embracing Uncertainty: Decoupling and De-bias for Robust Temporal Grounding

Hao Zhou, +4 more

- 31 Mar 2021

- arXiv: Computer Vision and Pattern Recog...

TL;DR: The authors disentangle each query into a relation feature and a modified feature; the relation feature, based mainly on skeleton-like words (including nouns and verbs), extracts basic and consistent information in the presence of query uncertainty.

2

•Posted Content

Learning Video-Story Composition via Recurrent Neural Network

Guangyu Zhong, +5 more

- 31 Jan 2018

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This article proposes a learning-based method to compose a video-story from a group of video clips that describe an activity or experience. The coherence between video clips is learned from real videos via a Recurrent Neural Network (RNN) that jointly incorporates spatial-temporal semantics and motion dynamics to generate smooth and relevant compositions.

2

References

•Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

- 04 Sep 2014

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

102.6K

•Journal Article•10.1162/NECO.1997.9.8.1735

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997

- Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

99K

•Journal Article•10.1145/3065386

ImageNet classification with deep convolutional neural networks

Alex Krizhevsky, +2 more

- 24 May 2017

- Communications of The ACM

TL;DR: A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

98.2K

•Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

- 03 Dec 2012

TL;DR: A deep convolutional neural network, consisting of five convolutional layers (some of which are followed by max-pooling layers) and three fully-connected layers with a final 1000-way softmax, achieved state-of-the-art performance on ImageNet classification.

88.4K

•Proceedings Article•10.1109/CVPR.2015.7298594

Going deeper with convolutions

Christian Szegedy, +8 more

- 07 Jun 2015

TL;DR: Inception is a deep convolutional neural network architecture that achieved a new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

56.6K

Related Papers (5)

Bleu: a Method for Automatic Evaluation of Machine Translation

Deep Residual Learning for Image Recognition

CIDEr: Consensus-based image description evaluation

Long short-term memory

Learning Spatiotemporal Features with 3D Convolutional Networks