End-to-End Dense Video Captioning with Masked Transformer

doi:10.1109/CVPR.2018.00911

Open Access•Proceedings Article•10.1109/CVPR.2018.00911

Luowei Zhou, +4 more

- 03 Apr 2018

- pp 8739-8748

490

TL;DR: This paper proposes an end-to-end transformer model for dense video captioning; its self-attention mechanism enables an efficient non-recurrent structure during encoding and leads to performance improvements.

Citations

Journal Article•10.1109/tmech.2022.3199985

Convformer-NSE: A Novel End-to-End Gearbox Fault Diagnosis Framework Under Heavy Noise Using Joint Global and Local Information

01 Feb 2023

- IEEE-ASME Transactions on Mechatronics

TL;DR: In this article, a novel framework named Convformer-NSE is developed to extract robust features that integrate both global and local information, aiming to improve end-to-end fault diagnosis of gearboxes under heavy noise.

63

•Proceedings Article•10.1109/cvpr52688.2022.01595

TubeDETR: Spatio-Temporal Video Grounding with Transformers

01 Jun 2022

TL;DR: This paper proposes TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. It includes an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames, and a space-time decoder that jointly performs spatio-temporal localization.

62

•Journal Article•10.1016/j.inffus.2021.07.009

Multimodal research in vision and language: A review of current and emerging trends

01 Jan 2022

- Information Fusion

TL;DR: This paper presents a detailed overview of the latest trends in research on visual and language modalities, examining their applications in terms of task formulations and how various problems related to semantic perception and content generation are solved.

61

•Posted Content

Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning.

Tanzila Rahman, +2 more

- 22 Sep 2019

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This article shows that audio signals can carry a surprising amount of information for high-level visual-lingual tasks, such as weakly supervised dense event captioning in videos.

61

•Journal Article•10.1007/S11263-021-01457-9

Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks

Ben Saunders, +2 more

- 11 Mar 2021

- International Journal of Computer Vision

TL;DR: In this paper, a Progressive Transformer Network (PTN) is proposed to translate from spoken language sentences to continuous 3D multi-channel sign pose sequences in an end-to-end manner.

54

References

•Proceedings Article•10.1109/CVPR.2016.90

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 27 Jun 2016

TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the approach won 1st place on the ILSVRC 2015 classification task.

198.7K

•Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

- 04 Sep 2014

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

102.6K

Journal Article•10.1162/NECO.1997.9.8.1735

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997

- Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

99K

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposes a simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.

94.2K

Preprint•10.48550/arxiv.1706.03762

Attention Is All You Need

Ashish Vaswani, +7 more

- 01 Jan 2017

Abstract: The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

51.8K

Related Papers (5)

Attention is All you Need

Bleu: a Method for Automatic Evaluation of Machine Translation

Deep Residual Learning for Image Recognition

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Adam: A Method for Stochastic Optimization