Jointly Modeling Embedding and Translation to Bridge Video and Language
Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, Yong Rui
- 27 Jun 2016
- pp 4594-4602
TL;DR: The authors present a unified framework, Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously explores the learning of LSTM and visual-semantic embedding.
Abstract: Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate a word locally, given the previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct while their semantics (e.g., subjects, verbs or objects) are wrong. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously explores the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given the previous words and the visual content, while the latter creates a visual-semantic embedding space that enforces the relationship between the semantics of the entire sentence and the visual content. Experiments on the YouTube2Text dataset show that the proposed LSTM-E achieves the best published performance to date in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. Superior performance is also reported on two movie description datasets (M-VAD and MPII-MD). In addition, we demonstrate that LSTM-E outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
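The two objectives described in the abstract can be pictured with a small sketch. It is illustrative only: the abstract states that the LSTM objective and the visual-semantic embedding objective are learned simultaneously, but it does not give their exact form. The squared-distance relevance term, the negative log-likelihood coherence term, the weighted sum, and the names `relevance_loss`, `coherence_loss`, `joint_loss`, and `tradeoff` are all assumptions made for this example.

```python
import math


def relevance_loss(video_emb, sentence_emb):
    """Assumed relevance term: squared Euclidean distance between the video
    and sentence vectors after both are mapped into a shared embedding space."""
    return sum((v - s) ** 2 for v, s in zip(video_emb, sentence_emb))


def coherence_loss(next_word_probs):
    """Assumed coherence term: negative log-likelihood of the reference words,
    given the probability the model assigned to each one at its time step."""
    return -sum(math.log(p) for p in next_word_probs)


def joint_loss(video_emb, sentence_emb, next_word_probs, tradeoff=0.5):
    """Hypothetical weighted combination of the two terms.

    tradeoff = 0 keeps only the embedding (relevance) term,
    tradeoff = 1 keeps only the word-generation (coherence) term.
    """
    r = relevance_loss(video_emb, sentence_emb)
    c = coherence_loss(next_word_probs)
    return (1.0 - tradeoff) * r + tradeoff * c


if __name__ == "__main__":
    # Toy values, not taken from the paper.
    video = [0.2, 0.9, 0.1]
    sentence = [0.1, 0.8, 0.3]
    word_probs = [0.6, 0.4, 0.7]
    print(joint_loss(video, sentence, word_probs, tradeoff=0.7))
```

In this sketch a single scalar balances sentence-level agreement with the video against word-by-word fluency, which is one simple way to train both parts together rather than one after the other.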