Multimodal learning

Papers published on a yearly basis

Papers

Proceedings Article

Multimodal Deep Learning

Jiquan Ngiam¹, Aditya Khosla¹, Mingyu Kim¹, Juhan Nam¹, Honglak Lee², Andrew Y. Ng¹

Stanford University¹, University of Michigan²

28 Jun 2011

TL;DR: This work presents a series of tasks for multimodal learning, shows how to train deep networks that learn features to address these tasks, and demonstrates cross-modality feature learning, where better features for one modality can be learned if multiple modalities are present at feature learning time.

Abstract: Deep networks have been successfully applied to unsupervised feature learning for single modalities (e.g., text, images or audio). In this work, we propose a novel application of deep networks to learn features over multiple modalities. We present a series of tasks for multimodal learning and show how to train deep networks that learn features to address these tasks. In particular, we demonstrate cross modality feature learning, where better features for one modality (e.g., video) can be learned if multiple modalities (e.g., audio and video) are present at feature learning time. Furthermore, we show how to learn a shared representation between modalities and evaluate it on a unique task, where the classifier is trained with audio-only data but tested with video-only data and vice-versa. Our models are validated on the CUAVE and AVLetters datasets on audio-visual speech classification, demonstrating best published visual speech classification on AVLetters and effective shared representation learning.

3,393 citations

Posted Content

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel¹

University of Toronto¹

10 Nov 2014-arXiv: Learning

TL;DR: This work introduces the structure-content neural language model, which disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic.

Abstract: Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a) a multimodal joint embedding space with images and text and (b) a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduce the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder. The encoder allows one to rank images and sentences while the decoder can generate novel descriptions from scratch. Using LSTM to encode sentences, we match the state-of-the-art performance on Flickr8K and Flickr30K without using object detections. We also set new best results when using the 19-layer Oxford convolutional network. Furthermore, we show that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic, e.g., *image of a blue car* - "blue" + "red" is near images of red cars. Sample captions generated for 800 images are made available for comparison.

1,734 citations
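
The vector space arithmetic mentioned in the abstract above (*image of a blue car* - "blue" + "red" landing near images of red cars) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 3-dimensional vectors are hand-made rather than learned, the file names and the `analogy_query` helper are hypothetical, and cosine similarity is used as one plausible ranking score. It is not the authors' model or code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made toy embeddings in a shared image-text space (assumption).
# The axes loosely mean: "car-ness", "blue-ness", "red-ness".
word_vecs = {
    "blue": [0.0, 1.0, 0.0],
    "red": [0.0, 0.0, 1.0],
}
image_vecs = {
    "blue_car.jpg": [1.0, 1.0, 0.0],
    "red_car.jpg": [1.0, 0.0, 1.0],
    "red_apple.jpg": [0.0, 0.1, 1.0],
    "blue_sky.jpg": [0.0, 1.0, 0.0],
}


def analogy_query(image_name, minus_word, plus_word):
    """Return the image closest to image - minus_word + plus_word."""
    query = [
        i - m + p
        for i, m, p in zip(
            image_vecs[image_name], word_vecs[minus_word], word_vecs[plus_word]
        )
    ]
    # Exclude the query image itself from the candidates.
    candidates = [name for name in image_vecs if name != image_name]
    return max(candidates, key=lambda name: cosine(query, image_vecs[name]))


nearest = analogy_query("blue_car.jpg", "blue", "red")
print(nearest)
```

In a trained model the image and word vectors would come from the learned encoders; the toy vectors only make the arithmetic easy to follow by hand.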

Journal Article

Multimodal learning with deep Boltzmann machines

Nitish Srivastava¹, Ruslan Salakhutdinov¹

University of Toronto¹

01 Jan 2014-Journal of Machine Learning Research

TL;DR: A Deep Boltzmann Machine is proposed for learning a generative model of multimodal data and it is shown that the model can be used to create fused representations by combining features across modalities, which are useful for classification and information retrieval.

Abstract: Data often consists of multiple diverse modalities. For example, images are tagged with textual information and videos are accompanied by audio. Each modality is characterized by having distinct statistical properties. We propose a Deep Boltzmann Machine for learning a generative model of such multimodal data. We show that the model can be used to create fused representations by combining features across modalities. These learned representations are useful for classification and information retrieval. By sampling from the conditional distributions over each data modality, it is possible to create these representations even when some data modalities are missing. We conduct experiments on bimodal image-text and audio-video data. The fused representation achieves good classification results on the MIR-Flickr data set, matching or outperforming other deep models as well as SVM-based models that use Multiple Kernel Learning. We further demonstrate that this multimodal model helps classification and retrieval even when only unimodal data is available at test time.

1,667 citations

Journal Article • DOI: 10.1007/S10648-007-9047-2

Interactive Multimodal Learning Environments: Special Issue on Interactive Learning Environments: Contemporary Issues and Trends

Roxana Moreno¹, Richard E. Mayer²

University of New Mexico¹, University of California, Santa Barbara²

22 Jun 2007-Educational Psychology Review

TL;DR: This paper presents a cognitive-affective theory of learning with media from which instructional design principles are derived, and reviews a set of experimental studies that found empirical support for five design principles: guided activity, reflection, feedback, control, and pretraining.

Abstract: What are interactive multimodal learning environments and how should they be designed to promote students’ learning? In this paper, we offer a cognitive–affective theory of learning with media from which instructional design principles are derived. Then, we review a set of experimental studies in which we found empirical support for five design principles: guided activity, reflection, feedback, control, and pretraining. Finally, we offer directions for future instructional technology research.

1,461 citations

Journal Article • DOI: 10.1016/J.NEUNET.2014.09.005

Challenges in representation learning

Ian Goodfellow¹, Dumitru Erhan¹, Pierre Luc Carrier¹, Aaron Courville¹, Mehdi Mirza¹, Ben Hamner¹, William Cukierski¹, Yichuan Tang¹, David Thaler¹, Dong-Hyun Lee¹, Yingbo Zhou¹, Chetan Ramaiah¹, Fangxiang Feng¹, Ruifan Li¹, Xiaojie Wang¹, Dimitris Athanasakis¹, John Shawe-Taylor¹, Maxim Milakov¹, John Park¹, Radu Ionescu¹, Marius Popescu¹, Cristian Grozea¹, James Bergstra¹, Jingjing Xie¹, Lukasz Romaszko¹, Bing Xu¹, Zhang Chuang¹, Yoshua Bengio¹

Université de Montréal¹

01 Apr 2015-Neural Networks

TL;DR: The datasets created for these challenges are described, the results of the competitions are summarized, and some comments are provided on what kind of knowledge can be gained from machine learning competitions.

1,393 citations

Year	Papers
2025	19
2024	12
2023	78
2022	88
2021	132
2020	116
