Unsupervised Multimodal Language Representations using Convolutional Autoencoders.

Open Access•Posted Content

- 06 Oct 2021

TL;DR: In this paper, word-level aligned multimodal sequences are mapped to 2-D matrices, and convolutional autoencoders trained on a combination of multiple datasets are used to learn embeddings from them.

Abstract: Multimodal Language Analysis is a demanding area of research, since it is associated with two requirements: combining different modalities and capturing temporal information. In recent years, several works have been proposed in the area, mostly centered around supervised learning in downstream tasks. In this paper we propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks. To this end, we map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets. Extensive experimentation on Sentiment Analysis (MOSEI) and Emotion Recognition (IEMOCAP) indicates that the learned representations can achieve near-state-of-the-art performance using only a Logistic Regression algorithm for downstream classification. It is also shown that our method is extremely lightweight and can be easily generalized to other tasks and unseen data with a small performance drop and almost the same number of parameters. The proposed multimodal representation models are open-sourced and will help grow the applicability of Multimodal Language.
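
The mapping step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the matrix layout, so everything here is an assumption made for illustration: one row per word, the modality feature vectors concatenated along the columns, and zero-padding or truncation to a fixed number of words. The function name `to_matrix`, the parameter `max_words`, and the feature sizes are hypothetical and are not taken from the paper.

```python
import numpy as np


def to_matrix(text, audio, visual, max_words=5):
    """Stack word-aligned modality features into one fixed-size 2-D matrix.

    Each input has shape (n_words, n_features_for_that_modality); row i of
    every input is assumed to describe the same word (word-level alignment).
    The layout (rows = words, columns = concatenated features) is an
    illustrative assumption, not the layout confirmed by the paper.
    """
    # Concatenate the modalities column-wise: one row per word.
    fused = np.concatenate([text, audio, visual], axis=1)
    n_words, n_feats = fused.shape

    # Zero-pad short sequences and truncate long ones to max_words rows,
    # so that every sequence yields a matrix of identical shape.
    out = np.zeros((max_words, n_feats), dtype=np.float32)
    keep = min(n_words, max_words)
    out[:keep] = fused[:keep]
    return out


# Toy example with made-up feature sizes (4 text, 3 audio, 2 visual).
rng = np.random.default_rng(0)
n_words = 3
matrix = to_matrix(
    rng.normal(size=(n_words, 4)),
    rng.normal(size=(n_words, 3)),
    rng.normal(size=(n_words, 2)),
)
print(matrix.shape)
```

A fixed-size matrix of this kind can be treated like a single-channel image, which is what makes a convolutional autoencoder applicable; the encoder output would then serve as the embedding passed to a simple downstream classifier such as logistic regression.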

References

Proceedings Article•10.3115/V1/D14-1162

Glove: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

- 01 Oct 2014

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

41.6K

Journal Article•10.1007/S10579-008-9076-6

IEMOCAP: interactive emotional dyadic motion capture database

Carlos Busso, +8 more

- 05 Nov 2008

TL;DR: A new corpus named the “interactive emotional dyadic motion capture database” (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory at the University of Southern California (USC), which provides detailed information about the actors' facial expressions and hand movements during scripted and spontaneous spoken communication scenarios.

3.8K

Proceedings Article•10.18653/V1/D17-1115

Tensor Fusion Network for Multimodal Sentiment Analysis

Amir Zadeh, +4 more

- 01 Sep 2017

TL;DR: In this article, a Tensor Fusion Network (TFN) is proposed to model intra-modality and inter-modality dynamics for multimodal sentiment analysis.

1.2K

Proceedings Article•10.18653/V1/P18-1208

Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph

AmirAli Bagher Zadeh, +4 more

- 01 Jul 2018

TL;DR: This paper introduces CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), the largest dataset for sentiment analysis and emotion recognition to date, and uses a novel multimodal fusion technique called the Dynamic Fusion Graph (DFG), which is highly interpretable and achieves competitive performance compared to the previous state of the art.

1.1K

Proceedings Article•10.18653/V1/P19-1656

Multimodal Transformer for Unaligned Multimodal Language Sequences

Yao-Hung Hubert Tsai, +6 more

- 01 Jun 2019

TL;DR: In this paper, a directional pairwise cross-modal attention mechanism is proposed to attend to interactions between multimodal sequences across distinct time steps and latently adapt streams from one modality to another.

1K