Journal Article, DOI: 10.48550/arXiv.2211.10885
Contrastive Regularization for Multimodal Emotion Recognition Using Audio and Text
Fan Qian, Jiqing Han +1 more
TL;DR: This article introduces a discriminator that distinguishes between same-emotion and different-emotion pairs, restricting the latent code of each modality to contain the same emotional information.
Abstract: Speech emotion recognition is a challenge and an important step towards more natural human-computer interaction (HCI). The popular approach is multimodal emotion recognition based on model-level fusion, which means that the multimodal signals are encoded to acquire embeddings, and the embeddings are then concatenated for the final classification. However, due to the influence of noise or other factors, the modalities do not always point to the same emotional category, which affects the generalization of a model. In this paper, we propose a novel regularization method via contrastive learning for multimodal emotion recognition using audio and text. By introducing a discriminator to distinguish between same-emotion and different-emotion pairs, we explicitly restrict the latent code of each modality to contain the same emotional information, so as to reduce noise interference and obtain a more discriminative representation. Experiments are performed on the standard IEMOCAP dataset for 4-class emotion recognition. The results show a significant improvement of 1.44% and 1.53% in terms of weighted accuracy (WA) and unweighted accuracy (UA), respectively, compared to the baseline system.
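The regularization idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the discriminator architecture or the exact loss, so everything specific here is an assumption made for illustration: the bilinear discriminator `W`, the binary cross-entropy objective, and the helper names `pair_discriminator_loss` and `fuse` are hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_discriminator_loss(audio_z, text_z, labels, W):
    """Binary cross-entropy of an assumed bilinear discriminator over all audio/text pairs.

    audio_z: (N, d_a) audio latent codes
    text_z:  (N, d_t) text latent codes
    labels:  (N,) integer emotion labels
    W:       (d_a, d_t) discriminator weights (hypothetical parameterization)
    """
    # Entry [i, j] scores the pair (audio of sample i, text of sample j).
    logits = audio_z @ W @ text_z.T
    # Target is 1 when both samples carry the same emotion label, else 0.
    same = (labels[:, None] == labels[None, :]).astype(float)
    probs = sigmoid(logits)
    eps = 1e-12  # avoids log(0)
    bce = -(same * np.log(probs + eps) + (1.0 - same) * np.log(1.0 - probs + eps))
    return float(bce.mean())

def fuse(audio_z, text_z):
    """Model-level fusion as described in the abstract: concatenate the embeddings."""
    return np.concatenate([audio_z, text_z], axis=1)

# Toy batch with 4 emotion classes and random latent codes.
rng = np.random.default_rng(0)
labels = np.array([0, 1, 2, 3, 0, 1])
audio_z = rng.normal(size=(6, 8))
text_z = rng.normal(size=(6, 5))
W = rng.normal(scale=0.1, size=(8, 5))

reg = pair_discriminator_loss(audio_z, text_z, labels, W)
fused = fuse(audio_z, text_z)  # shape (6, 13), input to the final classifier
```

In a training loop, a term like `reg` would typically be added to the classification loss with a weighting coefficient; how the paper combines the two is not stated in the abstract.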
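The WA and UA figures quoted in the abstract can be made concrete with a short sketch. The abstract does not define the two metrics, so this assumes the convention common in the speech emotion recognition literature: WA is the overall accuracy across all utterances, and UA is the mean of the per-class recalls. The function names and the toy labels are illustrative only.

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all utterances classified correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy 4-class example with an imbalanced class 0.
y_true = [0, 0, 0, 0, 1, 1, 2, 3]
y_pred = [0, 0, 0, 1, 1, 0, 2, 3]

wa = weighted_accuracy(y_true, y_pred)    # 6 of 8 correct -> 0.75
ua = unweighted_accuracy(y_true, y_pred)  # mean of (0.75, 0.5, 1.0, 1.0) -> 0.8125
```

The two numbers differ whenever the classes are imbalanced, which is why results on datasets such as IEMOCAP are usually reported with both.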
Citations
A Review of Key Technologies for Emotion Analysis Using Multimodal Information
Xianxun Zhu, Chaopeng Guo, Heyang Feng, Hongxun Yao, Yichen Feng, Xiangyang Wang, Rui Wang +6 more
Enhancing Resilience to Missing Data in Audio-Text Emotion Recognition with Multi-Scale Chunk Regularization
Wei-Cheng Lin, Lucas Goncalves, Carlos Busso +2 more
- 09 Oct 2023
TL;DR: This study presents a transformer-based model designed with a word-chunk concept, which offers an ideal framework to explore different strategies to align text and speech, and proposes a multi-scale chunk alignment strategy, which generates random alignments to create the chunks without considering lexical boundaries.
GCF2-Net: global-aware cross-modal feature fusion network for speech emotion recognition
TL;DR: This paper proposes a global-aware cross-modal feature fusion network (GCF2-Net) for emotion recognition, which constructs a residual cross-modal fusion attention module (ResCMFA) to fuse information from multiple modalities.
References
Glove: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning +2 more
- 01 Oct 2014
TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
Matplotlib: A 2D Graphics Environment
TL;DR: Matplotlib is a 2D graphics package used for Python for application development, interactive scripting, and publication-quality image generation across user interfaces and operating systems.
Posted Content
Representation Learning with Contrastive Predictive Coding
TL;DR: This work proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which it calls Contrastive Predictive Coding, and demonstrates that the approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
IEMOCAP: interactive emotional dyadic motion capture database
Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, Shrikanth S. Narayanan +8 more
- 05 Nov 2008
TL;DR: A new corpus named the "interactive emotional dyadic motion capture database" (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory at the University of Southern California (USC), which provides detailed information about the actors' facial expressions and hand movements during scripted and spontaneous spoken communication scenarios.
Proceedings Article
Linguistic Regularities in Continuous Space Word Representations
Tomas Mikolov, Wen-tau Yih, Geoffrey Zweig +2 more
- 27 May 2013
TL;DR: The vector-space word representations that are implicitly learned by the input-layer weights are found to be surprisingly good at capturing syntactic and semantic regularities in language, and each relationship is characterized by a relation-specific vector offset.
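
The relation-specific vector offset mentioned in the last summary can be illustrated with a toy sketch. The three-dimensional vectors below are hand-made for the example and are not trained embeddings; the `analogy` helper is a hypothetical name, and cosine similarity is assumed as the nearest-neighbor measure.

```python
import numpy as np

# Hand-made toy vectors (hypothetical, not trained word embeddings).
emb = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by applying the offset (b - a) to c."""
    target = vectors[b] - vectors[a] + vectors[c]
    # Exclude the query words themselves from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("man", "king", "woman", emb)  # "queen" for these toy vectors
```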