Open Access · Posted Content
Unsupervised Multimodal Language Representations using Convolutional Autoencoders.
TL;DR: In this paper, word-level aligned multimodal sequences are mapped to 2-D matrices and then CNNs are used to learn embeddings by combining multiple datasets.
Abstract: Multimodal Language Analysis is a demanding area of research because it poses two requirements: combining different modalities and capturing temporal information. In recent years, several works have been proposed in the area, mostly centered on supervised learning for downstream tasks. In this paper we propose extracting unsupervised Multimodal Language representations that are universal and can be applied to different tasks. To this end, we map the word-level aligned multimodal sequences to 2-D matrices and then use Convolutional Autoencoders to learn embeddings by combining multiple datasets. Extensive experimentation on Sentiment Analysis (MOSEI) and Emotion Recognition (IEMOCAP) indicates that the learned representations can achieve near-state-of-the-art performance using only a Logistic Regression algorithm for downstream classification. We also show that our method is extremely lightweight and generalizes easily to other tasks and unseen data, with a small performance drop and almost the same number of parameters. The proposed multimodal representation models are open-sourced and will help grow the applicability of Multimodal Language.
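The first step described in the abstract, mapping word-level aligned multimodal sequences to 2-D matrices, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `to_matrix`, the fixed length `max_words=20`, the zero-padding scheme, and the per-modality feature sizes (300, 74, 35) are all assumptions chosen for illustration. The Convolutional Autoencoder and the Logistic Regression classifier are not shown.

```python
import numpy as np

def to_matrix(text, audio, visual, max_words=20):
    """Stack word-aligned modality features into one (max_words, D) matrix.

    Hypothetical helper: each input has shape (n_words, d_modality), and row i
    of every input is assumed to describe the same word.
    """
    if not (len(text) == len(audio) == len(visual)):
        raise ValueError("modalities must be aligned to the same number of words")
    # Concatenate the per-word feature vectors side by side: (n_words, D).
    fused = np.concatenate([text, audio, visual], axis=1)
    # Truncate long utterances, then zero-pad short ones to a fixed height
    # (an assumed scheme, so that every utterance yields the same 2-D shape).
    fused = fused[:max_words]
    pad = max_words - fused.shape[0]
    fused = np.pad(fused, ((0, pad), (0, 0)))
    return fused.astype(np.float32)

# Random stand-ins for real features; the sizes are illustrative only.
rng = np.random.default_rng(0)
n_words = 7
text = rng.normal(size=(n_words, 300))
audio = rng.normal(size=(n_words, 74))
visual = rng.normal(size=(n_words, 35))

m = to_matrix(text, audio, visual)
print(m.shape)  # (20, 409)
```

A matrix of this kind can be treated as a single-channel image, which is what makes a convolutional encoder applicable to it.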
References
Glove: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning
- 01 Oct 2014
TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
IEMOCAP: interactive emotional dyadic motion capture database
Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, Shrikanth S. Narayanan
- 05 Nov 2008
TL;DR: A new corpus named the “interactive emotional dyadic motion capture database” (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory at the University of Southern California (USC), which provides detailed information about the actors' facial expressions and hand movements during scripted and spontaneous spoken communication scenarios.
Tensor Fusion Network for Multimodal Sentiment Analysis
Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, Louis-Philippe Morency
- 01 Sep 2017
TL;DR: This paper proposes the Tensor Fusion Network, which models intra-modality and inter-modality dynamics for multimodal sentiment analysis.
Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph
AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, Louis-Philippe Morency
- 01 Jul 2018
TL;DR: This paper introduces CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI), the largest dataset for sentiment analysis and emotion recognition to date, and uses a novel multimodal fusion technique called the Dynamic Fusion Graph (DFG), which is highly interpretable and achieves competitive performance compared to the previous state of the art.
Multimodal Transformer for Unaligned Multimodal Language Sequences
Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov
- 01 Jun 2019
TL;DR: In this paper, a directional pairwise cross-modal attention mechanism is proposed to attend to interactions between multimodal sequences across distinct time steps and latently adapt streams from one modality to another.