Modality (human–computer interaction)

Papers published on a yearly basis

Papers

Journal Article • 10.1109/TPAMI.2018.2798607

Multimodal Machine Learning: A Survey and Taxonomy

Tadas Baltrusaitis¹, Chaitanya Ahuja², Louis-Philippe Morency²

Microsoft¹, Carnegie Mellon University²

01 Feb 2019-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy to enable researchers to better understand the state of the field and identify directions for future research.

Abstract: Our experience of the world is multimodal - we see objects, hear sounds, feel texture, smell odors, and taste flavors. Modality refers to the way in which something happens or is experienced, and a research problem is characterized as multimodal when it includes multiple such modalities. In order for Artificial Intelligence to make progress in understanding the world around us, it needs to be able to interpret such multimodal signals together. Multimodal machine learning aims to build models that can process and relate information from multiple modalities. It is a vibrant multi-disciplinary field of increasing importance and with extraordinary potential. Instead of focusing on specific multimodal applications, this paper surveys the recent advances in multimodal machine learning itself and presents them in a common taxonomy. We go beyond the typical early and late fusion categorization and identify broader challenges that are faced by multimodal machine learning, namely: representation, translation, alignment, fusion, and co-learning. This new taxonomy will enable researchers to better understand the state of the field and identify directions for future research.

3,404 citations

Journal Article

Multimodal learning with deep Boltzmann machines

Nitish Srivastava¹, Ruslan Salakhutdinov¹

University of Toronto¹

01 Jan 2014-Journal of Machine Learning Research

TL;DR: A Deep Boltzmann Machine is proposed for learning a generative model of multimodal data and it is shown that the model can be used to create fused representations by combining features across modalities, which are useful for classification and information retrieval.

Abstract: Data often consists of multiple diverse modalities. For example, images are tagged with textual information and videos are accompanied by audio. Each modality is characterized by having distinct statistical properties. We propose a Deep Boltzmann Machine for learning a generative model of such multimodal data. We show that the model can be used to create fused representations by combining features across modalities. These learned representations are useful for classification and information retrieval. By sampling from the conditional distributions over each data modality, it is possible to create these representations even when some data modalities are missing. We conduct experiments on bimodal image-text and audio-video data. The fused representation achieves good classification results on the MIR-Flickr data set, matching or outperforming other deep models as well as SVM-based models that use Multiple Kernel Learning. We further demonstrate that this multimodal model helps classification and retrieval even when only unimodal data is available at test time.

1,667 citations

Journal Article • 10.1109/T-AFFC.2011.25

A Multimodal Database for Affect Recognition and Implicit Tagging

Mohammad Soleymani¹, Jeroen Lichtenauer², Thierry Pun¹, Maja Pantic²

University of Geneva¹, Imperial College London²

01 Jan 2012-IEEE Transactions on Affective Computing

TL;DR: Single-modality and modality-fusion results are reported for both emotion recognition and implicit tagging experiments; they show the potential uses of the recorded modalities and the significance of the emotion elicitation protocol.

Abstract: MAHNOB-HCI is a multimodal database recorded in response to affective stimuli with the goal of emotion recognition and implicit tagging research. A multimodal setup was arranged for synchronized recording of face videos, audio signals, eye gaze data, and peripheral/central nervous system physiological signals. Twenty-seven participants from both genders and different cultural backgrounds participated in two experiments. In the first experiment, they watched 20 emotional videos and self-reported their felt emotions using arousal, valence, dominance, and predictability as well as emotional keywords. In the second experiment, short videos and images were shown once without any tag and then with correct or incorrect tags. Agreement or disagreement with the displayed tags was assessed by the participants. The recorded videos and bodily responses were segmented and stored in a database. The database is made available to the academic community via a web-based system. The collected data were analyzed and single modality and modality fusion results for both emotion recognition and implicit tagging experiments are reported. These results show the potential uses of the recorded modalities and the significance of the emotion elicitation protocol.

1,605 citations

Journal Article • 10.1016/J.INFFUS.2017.02.003

A review of affective computing

Soujanya Poria¹, Erik Cambria², Rajiv Bajpai², Amir Hussain¹

University of Stirling¹, Nanyang Technological University²

01 Sep 2017-Information Fusion

TL;DR: This first of its kind, comprehensive literature review of the diverse field of affective computing focuses mainly on the use of audio, visual and text information for multimodal affect analysis, and outlines existing methods for fusing information from different modalities.

1,451 citations

Journal Article • 10.1007/S00530-010-0182-0

Multimodal fusion for multimedia analysis: a survey

Pradeep K. Atrey¹, M. Anwar Hossain², Abdulmotaleb El Saddik², Mohan S. Kankanhalli³

University of Winnipeg¹, University of Ottawa², National University of Singapore³

01 Nov 2010-Multimedia Systems

TL;DR: This survey aims at providing multimedia researchers with a state-of-the-art overview of fusion strategies, which are used for combining multiple modalities in order to accomplish various multimedia analysis tasks.

Abstract: This survey aims at providing multimedia researchers with a state-of-the-art overview of fusion strategies, which are used for combining multiple modalities in order to accomplish various multimedia analysis tasks. The existing literature on multimodal fusion research is presented through several classifications based on the fusion methodology and the level of fusion (feature, decision, and hybrid). The fusion methods are described from the perspective of the basic concept, advantages, weaknesses, and their usage in various analysis tasks as reported in the literature. Moreover, several distinctive issues that influence a multimodal fusion process, such as the use of correlation and independence, confidence level, contextual information, synchronization between different modalities, and optimal modality selection, are also highlighted. Finally, we present the open issues for further research in the area of multimodal fusion.
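
Illustration (not from the paper): the feature and decision levels of fusion named in the abstract above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the survey's method. The function names, the toy audio/visual arrays, and the `audio_weight` parameter are all hypothetical, and a weighted average is only one of many possible decision-level rules. A hybrid scheme would combine both levels.

```python
import numpy as np

def feature_level_fusion(audio_feats, visual_feats):
    """Feature-level (early) fusion: concatenate per-sample feature vectors."""
    return np.concatenate([audio_feats, visual_feats], axis=1)

def decision_level_fusion(audio_probs, visual_probs, audio_weight=0.5):
    """Decision-level (late) fusion: weighted average of class probabilities.

    audio_weight is a hypothetical knob for how much the audio model is trusted.
    """
    fused = audio_weight * audio_probs + (1.0 - audio_weight) * visual_probs
    # Renormalise so each row is still a probability distribution.
    return fused / fused.sum(axis=1, keepdims=True)

# Toy data (made up): 2 samples, 3 audio features and 2 visual features each.
audio_feats = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
visual_feats = np.array([[1.0, 2.0], [3.0, 4.0]])
joint = feature_level_fusion(audio_feats, visual_feats)  # shape (2, 5)

# Toy per-modality classifier outputs (made up): 2 samples, 2 classes.
audio_probs = np.array([[0.9, 0.1], [0.4, 0.6]])
visual_probs = np.array([[0.6, 0.4], [0.2, 0.8]])
fused_probs = decision_level_fusion(audio_probs, visual_probs)
predictions = fused_probs.argmax(axis=1)
```

With feature-level fusion a single downstream model sees the joint vector, so it can exploit cross-modal correlations; with decision-level fusion each modality keeps its own model, which makes it easier to cope with a modality that is missing or unreliable.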

1,300 citations

Year	Papers
2026	16
2025	1,075
2024	1,738
2023	2,260
2022	3,196
2021	815
