End-to-End Dense Video Captioning with Masked Transformer
Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, Caiming Xiong
- 03 Apr 2018
- pp 8739-8748
TL;DR: This paper proposes an end-to-end transformer model for dense video captioning; its self-attention mechanism enables an efficient non-recurrent structure during encoding and leads to performance improvements.
Abstract: Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building two models, i.e. an event proposal and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevents direct influence of the language description to the event proposal, which is important for generating accurate descriptions. To address this problem, we propose an end-to-end transformer model for dense video captioning. The encoder encodes the video into appropriate representations. The proposal decoder decodes from the encoding with different anchors to form video event proposals. The captioning decoder employs a masking network to restrict its attention to the proposal event over the encoding feature. This masking network converts the event proposal to a differentiable mask, which ensures the consistency between the proposal and captioning during training. In addition, our model employs a self-attention mechanism, which enables the use of efficient non-recurrent structure during encoding and leads to performance improvements. We demonstrate the effectiveness of this end-to-end model on ActivityNet Captions and YouCookII datasets, where we achieved 10.12 and 6.58 METEOR score, respectively.
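The idea of turning an event proposal into a differentiable mask can be illustrated with a minimal sketch. This is not the paper's masking network: it is a simplified stand-in that assumes a proposal is given as a (center, length) pair and builds a soft window from the product of two sigmoids, so the mask stays differentiable with respect to the proposal boundaries. The names `soft_proposal_mask`, `masked_features`, and the `sharpness` parameter are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_proposal_mask(center, length, num_steps, sharpness=4.0):
    """Soft (differentiable) window over time steps 0..num_steps-1.

    Assumption: the proposal is (center, length) in units of time steps.
    The mask is close to 1 inside [start, end] and close to 0 outside;
    `sharpness` controls how steep the edges are.
    """
    start = center - length / 2.0
    end = center + length / 2.0
    return [
        sigmoid(sharpness * (t - start)) * sigmoid(sharpness * (end - t))
        for t in range(num_steps)
    ]


def masked_features(features, mask):
    """Scale each per-step feature vector by its mask value."""
    return [[m * x for x in row] for row, m in zip(features, mask)]


# Usage: a proposal centered at step 5 spanning 4 steps, over 10 steps.
mask = soft_proposal_mask(center=5.0, length=4.0, num_steps=10)
features = [[1.0, 2.0] for _ in range(10)]
restricted = masked_features(features, mask)
```

Because the window is a smooth function of `center` and `length`, a captioning loss computed on the masked features could in principle send gradients back to the proposal parameters, which is the kind of coupling the abstract describes.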
Citations
Convformer-NSE: A Novel End-to-End Gearbox Fault Diagnosis Framework Under Heavy Noise Using Joint Global and Local Information
TL;DR: In this article, a novel framework named Convformer-NSE is developed to extract robust features that integrate both global and local information, aiming to improve end-to-end fault diagnosis of gearboxes under heavy noise.
63
TubeDETR: Spatio-Temporal Video Grounding with Transformers
01 Jun 2022
TL;DR: This paper proposed TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection, which includes an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and a space-time decoder that jointly performs spatio-temporal localization.
Multimodal research in vision and language: A review of current and emerging trends
01 Jan 2022
TL;DR: This paper presents a detailed overview of the latest trends in research on the visual and language modalities, looking at their applications, their task formulations, and how various problems related to semantic perception and content generation are solved.
•Posted Content
Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning.
TL;DR: This article shows that audio signals can carry a surprising amount of information for high-level visual-lingual tasks, such as weakly supervised dense event captioning in videos.
61
Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks
TL;DR: In this paper, a Progressive Transformer Network (PTN) is proposed to translate from spoken language sentences to continuous 3D multi-channel sign pose sequences in an end-to-end manner.
References
Deep Residual Learning for Image Recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
- 27 Jun 2016
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
•Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
- 04 Sep 2014
TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
102.6K
Long short-term memory
TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
99K
•Proceedings Article
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
- 12 Jun 2017
TL;DR: This paper proposes a simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
- 01 Jan 2017
Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
51.8K
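The attention mechanism that the Transformer abstract above refers to is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The following is a minimal single-head sketch in plain Python, without learned projections, multiple heads, or masking; the function names are illustrative and do not come from any library.

```python
import math


def softmax(scores):
    """Softmax over a list of floats, shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors.

    queries: list of vectors of size d_k
    keys:    list of vectors of size d_k (one per value)
    values:  list of vectors of size d_v
    Returns one output vector of size d_v per query.
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs


# Usage: a zero query attends uniformly, so the output is the mean of the values.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
uniform = scaled_dot_product_attention([[0.0, 0.0]], keys, values)
focused = scaled_dot_product_attention([[10.0, 0.0]], keys, values)
```

Each query is processed independently of the others, which is why this computation parallelizes across positions, in contrast to a recurrent encoder.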