Multi30K: Multilingual English-German Image Descriptions

doi:10.18653/V1/W16-3210

Open Access•Proceedings Article•10.18653/V1/W16-3210

Desmond Elliott, +3 more

- 12 Aug 2016

- pp 70-74

622

TL;DR: This paper introduces the Multi30K dataset to stimulate multilingual multimodal research. The dataset extends Flickr30K with German translations created by professional translators over a subset of the English descriptions, and with German descriptions crowdsourced independently of the original English descriptions.

Citations

•Posted Content

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

Hao Tan, +1 more

- 14 Oct 2020

- arXiv: Computation and Language

TL;DR: A technique named "vokenization" is developed that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which the authors call "vokens").

84

•Journal Article•10.1007/S10590-017-9197-Z

Zero-resource machine translation by multimodal encoder-decoder network with multimedia pivot

Hideki Nakayama, +1 more

- 01 Jun 2017

- Machine Translation

TL;DR: This work proposes an approach to building a neural machine translation system with no supervised resources (i.e., no parallel corpora), using multimedia as the "pivot" through a multimodal embedded representation over texts and images. It finds that an end-to-end model that simultaneously optimizes both rank loss in the multimodal encoders and cross-entropy loss in the decoders performs best.

83

•Journal Article•10.1109/tpami.2022.3181116

Video Pivoting Unsupervised Multi-Modal Machine Translation

- 01 Jan 2022

- IEEE Transactions on Pattern Analysis an...

TL;DR: The authors employ a spatial-temporal graph obtained from videos to exploit object interactions in space and time for disambiguation purposes and to promote latent space alignment in unsupervised machine translation.

81

•Proceedings Article•10.18653/V1/D19-1447

Countering language drift via visual grounding

Jason Lee, +2 more

- 01 Sep 2019

TL;DR: It is shown that a combination of syntactic (language model likelihood) and semantic (visual grounding) constraints gives the best communication performance, allowing pre-trained agents to retain English syntax while learning to accurately convey the intended meaning.

81

•Proceedings Article•10.18653/V1/2021.NAACL-MAIN.195

Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models.

Po-Yao Huang, +5 more

- 01 Jun 2021

TL;DR: This paper focuses on multilingual text-to-video search and proposes a Transformer-based model that learns contextual multilingual multimodal embeddings and significantly improves video search in non-English languages without additional annotations.

77

References

•Proceedings Article

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, +2 more

- 01 Jan 2015

TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture, and it is proposed to extend this architecture by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

25.7K

•Posted Content

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, +2 more

- 01 Sep 2014

- arXiv: Computation and Language

TL;DR: In this paper, the authors propose to use a soft-searching model to find the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.

20.9K

•Proceedings Article

Sequence to Sequence Learning with Neural Networks

Ilya Sutskever, +2 more

- 08 Dec 2014

TL;DR: The authors used a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.

20.1K

•Proceedings Article

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Kelvin Xu, +10 more

- 06 Jul 2015

TL;DR: An attention-based model that automatically learns to describe the content of images is introduced. It can be trained in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound.

10.1K

•Proceedings Article•10.1109/CVPR.2015.7298935

Show and tell: A neural image caption generator

Oriol Vinyals, +3 more

- 07 Jun 2015

TL;DR: This paper proposes a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation to generate natural sentences describing an image.

7.5K

Related Papers (5)

Deep Residual Learning for Image Recognition

Bleu: a Method for Automatic Evaluation of Machine Translation

Neural Machine Translation by Jointly Learning to Align and Translate

Adam: A Method for Stochastic Optimization

Microsoft COCO: Common Objects in Context