Multi30K: Multilingual English-German Image Descriptions
Desmond Elliott, Stella Frank, Khalil Sima'an, Lucia Specia
- 12 Aug 2016
- pp 70-74
TL;DR: This paper introduces the Multi30K dataset to stimulate multilingual multimodal research. The dataset includes German translations created by professional translators over a subset of the English descriptions, and German descriptions crowdsourced independently of the original English descriptions.
Abstract: We introduce the Multi30K dataset to stimulate multilingual multimodal research. Recent advances in image description have been demonstrated on English-language datasets almost exclusively, but image description should not be limited to English. This dataset extends the Flickr30K dataset with i) German translations created by professional translators over a subset of the English descriptions, and ii) German descriptions crowdsourced independently of the original English descriptions. We describe the data and outline how it can be used for multilingual image description and multimodal machine translation, but we anticipate the data will be useful for a broader range of tasks.
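The two kinds of German text named in the abstract can be pictured as fields on a per-image record. The following is a minimal sketch under stated assumptions, not the dataset's actual file format: the class name `ImageEntry`, its field names, the helper functions, and the example sentences are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class ImageEntry:
    """One image with its English and German text (hypothetical layout)."""
    image_id: str
    english_descriptions: List[str]
    # (i) Professional German translation of one English description, if any.
    german_translation: Optional[str] = None
    # Index of the English description that was translated.
    translated_index: Optional[int] = None
    # (ii) German descriptions written independently of the English ones.
    german_descriptions: List[str] = field(default_factory=list)


def translation_pairs(entries: List[ImageEntry]) -> Iterator[Tuple[str, str, str]]:
    """Yield (image_id, english, german) triples, e.g. for multimodal translation."""
    for e in entries:
        if e.german_translation is not None and e.translated_index is not None:
            yield e.image_id, e.english_descriptions[e.translated_index], e.german_translation


def description_sets(entries: List[ImageEntry]) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (image_id, english_list, german_list), e.g. for multilingual description."""
    for e in entries:
        if e.german_descriptions:
            yield e.image_id, e.english_descriptions, e.german_descriptions


# Made-up example records; these sentences are not taken from the dataset.
entries = [
    ImageEntry(
        image_id="img_001",
        english_descriptions=["A dog runs across a field.", "A brown dog is running outside."],
        german_translation="Ein Hund rennt über ein Feld.",
        translated_index=0,
        german_descriptions=["Ein brauner Hund läuft über die Wiese."],
    ),
    ImageEntry(
        image_id="img_002",
        english_descriptions=["Two children play on a beach."],
        german_descriptions=["Zwei Kinder spielen am Strand."],
    ),
]
```

Keeping the translation and the independent descriptions in separate fields reflects the distinction the abstract draws: a translation is tied to one specific English sentence, while an independent description is tied only to the image.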
Citations
•Posted Content
Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
Hao Tan, Mohit Bansal
TL;DR: A technique named "vokenization" is developed that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which the authors call "vokens").
84
Zero-resource machine translation by multimodal encoder-decoder network with multimedia pivot
Hideki Nakayama, Noriki Nishida
TL;DR: This work proposes building a neural machine translation system with no supervised resources (i.e., no parallel corpora) by using a multimodal embedded representation over texts and images, with multimedia as the “pivot”. It finds that an end-to-end model that simultaneously optimizes both the rank loss in the multimodal encoders and the cross-entropy loss in the decoders performs best.
Video Pivoting Unsupervised Multi-Modal Machine Translation
TL;DR: The authors employ a spatial-temporal graph obtained from videos to exploit object interactions in space and time for disambiguation purposes and to promote latent space alignment in unsupervised machine translation.
81
Countering language drift via visual grounding
Jason Lee, Kyunghyun Cho, Douwe Kiela
- 01 Sep 2019
TL;DR: It is shown that a combination of syntactic (language model likelihood) and semantic (visual grounding) constraints gives the best communication performance, allowing pre-trained agents to retain English syntax while learning to accurately convey the intended meaning.
Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models.
Po-Yao Huang, Mandela Patrick, Junjie Hu, Graham Neubig, Florian Metze, Alexander G. Hauptmann
- 01 Jun 2021
TL;DR: This paper focuses on multilingual text-to-video search and proposes a Transformer-based model that learns contextual multilingual multimodal embeddings and significantly improves video search in non-English languages without additional annotations.
References
•Proceedings Article
Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio
- 01 Jan 2015
TL;DR: It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture, and it is proposed to extend this architecture by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
25.7K
•Posted Content
Neural Machine Translation by Jointly Learning to Align and Translate
TL;DR: In this paper, the authors propose to use a soft-searching model to find the parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
20.9K
•Proceedings Article
Sequence to Sequence Learning with Neural Networks
Ilya Sutskever, Oriol Vinyals, Quoc V. Le
- 08 Dec 2014
TL;DR: The authors used a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector.
•Proceedings Article
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, Yoshua Bengio
- 06 Jul 2015
TL;DR: This paper introduces an attention-based model that automatically learns to describe the content of images. The model can be trained deterministically using standard backpropagation techniques or stochastically by maximizing a variational lower bound.
Show and tell: A neural image caption generator
Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan
- 07 Jun 2015
TL;DR: This paper proposes a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and can be used to generate natural sentences describing an image.