Proceedings Article, DOI: 10.1145/3240508.3240667
SibNet: Sibling Convolutional Encoder for Video Captioning
Sheng Liu, Zhou Ren, Junsong Yuan
- 15 Oct 2018
- pp 1425-1434
TL;DR: This work introduces a novel Sibling Convolutional Encoder (SibNet) for video captioning, which utilizes a two-branch architecture to collaboratively encode videos.
Abstract: Video captioning is a challenging task owing to the complexity of understanding the copious visual information in videos and describing it in natural language. Unlike previous work that encodes video information using a single flow, we introduce a novel Sibling Convolutional Encoder (SibNet) for video captioning, which uses a two-branch architecture to collaboratively encode videos. The first branch, the content branch, encodes the visual content information of the video via an autoencoder; the second branch, the semantic branch, encodes the semantic information via visual-semantic joint embedding. Both branches are then combined with a soft-attention mechanism and fed into an RNN decoder to generate captions. Because SibNet explicitly captures both content and semantic information, the proposed method can better represent the rich information in videos. Extensive experiments on the YouTube2Text and MSR-VTT datasets validate that the proposed architecture outperforms existing methods by a large margin across different evaluation metrics.
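The fusion step described in the abstract (two branch encodings combined with soft attention before decoding) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the scoring function, so dot-product scoring against a decoder state is an assumption, and the names `soft_attention`, `fuse_branches`, `content_feats`, `semantic_feats`, and `decoder_state` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, query):
    """Attend over per-timestep feature vectors.

    Assumption: scores are plain dot products between each feature
    vector and the query; the paper may use a different scoring function.
    Returns the attention-weighted sum (context) and the weights.
    """
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return context, weights

def fuse_branches(content_feats, semantic_feats, decoder_state):
    """Attend over each branch separately, then concatenate the two contexts.

    Concatenation is an illustrative choice for combining the branches.
    """
    content_ctx, content_w = soft_attention(content_feats, decoder_state)
    semantic_ctx, semantic_w = soft_attention(semantic_feats, decoder_state)
    return content_ctx + semantic_ctx, content_w, semantic_w

# Toy inputs: three timesteps of 2-dimensional features per branch.
content_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic_feats = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]
decoder_state = [1.0, 0.0]

fused, content_w, semantic_w = fuse_branches(content_feats, semantic_feats, decoder_state)
print(fused)
print(content_w, semantic_w)
```

In a full captioning model, the fused vector would be recomputed at every decoding step from the current RNN hidden state and passed to the decoder together with the previous word embedding.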
Citations
Object Relational Graph With Teacher-Recommended Learning for Video Captioning
Ziqi Zhang, Yaya Shi, Chunfeng Yuan, Bing Li, Peijin Wang, Weiming Hu, Zheng-Jun Zha
- 14 Jun 2020
TL;DR: This paper proposes an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich the visual representation, and designs a teacher-recommended learning (TRL) method that uses a successful external language model (ELM) to integrate abundant linguistic knowledge into the caption model.
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, Lijuan Wang
- 27 May 2022
TL;DR: This paper designs and trains GIT, a generative image-to-text transformer, to unify vision-language tasks such as image/video captioning and question answering, and presents a new scheme of generation-based image classification and scene text recognition that achieves decent performance on standard benchmarks.
Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network
Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Jingwen Wang, Wei Liu
- 01 Oct 2019
TL;DR: A gating strategy is proposed to dynamically and adaptively incorporate the global syntactic POS information into the decoder for generating each word, which not only boosts the video captioning performance but also improves the diversity of the generated captions.
Heuristic Black-Box Adversarial Attacks on Video Recognition Models
Zhipeng Wei, Jingjing Chen, Xingxing Wei, Linxi Jiang, Tat-Seng Chua, Fengfeng Zhou, Yu-Gang Jiang
- 03 Apr 2020
TL;DR: This paper proposes a heuristic black-box adversarial attack model that generates adversarial perturbations only on selected frames and regions, which significantly reduces the computation cost and cuts the number of queries by more than 28% for the untargeted attack on both datasets.
Black-box Adversarial Attacks on Video Recognition Models
Linxi Jiang, Xingjun Ma, Shaoxiang Chen, James Bailey, Yu-Gang Jiang
- 15 Oct 2019
TL;DR: This paper proposes the first black-box video attack framework, V-BAD, whose procedure is equivalent to estimating the projection of the adversarial gradient on a selected subspace.