Contrastive Regularization for Multimodal Emotion Recognition Using Audio and Text

doi:10.48550/arXiv.2211.10885

Journal Article•10.48550/arXiv.2211.10885

Fan Qian, +1 more

- 20 Nov 2022

- arXiv.org

- Vol. abs/2211.10885

3

TL;DR: This article introduces a discriminator that distinguishes same-emotion pairs from different-emotion pairs, which constrains the latent code of each modality to carry the same emotional information.
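
The pairing idea in the TL;DR can be illustrated with a minimal sketch. It is not the authors' implementation: the discriminator here is a hypothetical stand-in (a sigmoid over the dot product of an audio code and a text code), and the loss is a plain binary cross-entropy over all cross-modal pairs. The function names, the toy two-dimensional latent codes, and the emotion labels are all assumptions made for illustration.

```python
import math
from itertools import product


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pair_score(audio_code, text_code):
    """Hypothetical discriminator: sigmoid of the dot product of two latent codes."""
    return sigmoid(sum(a * t for a, t in zip(audio_code, text_code)))


def contrastive_regularizer(audio_codes, text_codes, labels):
    """Mean binary cross-entropy over all cross-modal (audio_i, text_j) pairs.

    A pair is positive (target 1) when both utterances share an emotion label
    and negative (target 0) otherwise.
    """
    eps = 1e-12  # guards against log(0)
    losses = []
    for i, j in product(range(len(labels)), repeat=2):
        target = 1.0 if labels[i] == labels[j] else 0.0
        p = pair_score(audio_codes[i], text_codes[j])
        losses.append(-(target * math.log(p + eps)
                        + (1.0 - target) * math.log(1.0 - p + eps)))
    return sum(losses) / len(losses)


# Toy latent codes (assumed values): two utterances, two emotions.
labels = ["happy", "sad"]
audio = [[3.0, 0.0], [-3.0, 0.0]]
text_agree = [[3.0, 0.0], [-3.0, 0.0]]     # text codes agree with the audio codes
text_conflict = [[-3.0, 0.0], [3.0, 0.0]]  # text codes point the opposite way

aligned_loss = contrastive_regularizer(audio, text_agree, labels)
conflict_loss = contrastive_regularizer(audio, text_conflict, labels)
print(aligned_loss < conflict_loss)  # True
```

In this toy setup the penalty is small when the two modalities encode the same emotion in a similar way and large when they disagree, which is the behaviour such a regularizer is meant to encourage. In practice the term would be added to the classification loss with some weighting, a detail the text above does not specify.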

Abstract: Speech emotion recognition is a challenge and an important step towards more natural human-computer interaction (HCI). The popular approach is multimodal emotion recognition based on model-level fusion, which means that the multimodal signals can be encoded to acquire embeddings, and then the embeddings are concatenated together for the final classification. However, due to the influence of noise or other factors, each modality does not always tend to the same emotional category, which affects the generalization of a model. In this paper, we propose a novel regularization method via contrastive learning for multimodal emotion recognition using audio and text. By introducing a discriminator to distinguish the difference between the same and different emotional pairs, we explicitly restrict the latent code of each modality to contain the same emotional information, so as to reduce the noise interference and get more discriminative representation. Experiments are performed on the standard IEMOCAP dataset for 4-class emotion recognition. The results show a significant improvement of 1.44% and 1.53% in terms of weighted accuracy (WA) and unweighted accuracy (UA) compared to the baseline system.
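
The abstract reports WA and UA without defining them. The sketch below assumes the convention commonly used in speech emotion recognition: WA is the overall accuracy across all utterances, and UA is the mean of the per-class recalls, so every class counts equally regardless of its size. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: each class contributes equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, imbalanced 4-class example (labels are illustrative only).
y_true = ["neu", "neu", "neu", "neu", "hap", "hap", "ang", "sad"]
y_pred = ["neu", "neu", "neu", "neu", "hap", "neu", "ang", "neu"]

print(weighted_accuracy(y_true, y_pred))    # 0.75
print(unweighted_accuracy(y_true, y_pred))  # 0.625
```

The gap between the two numbers shows why both are usually reported on imbalanced data: a model that favours the majority class can score well on WA while UA exposes the classes it misses.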

Citations

Review•10.1007/s12559-024-10287-z

A Review of Key Technologies for Emotion Analysis Using Multimodal Information

Xianxun Zhu, +6 more

- 01 Jun 2024

- Cognitive Computation

12

Journal Article•10.1145/3577190.3614110

Enhancing Resilience to Missing Data in Audio-Text Emotion Recognition with Multi-Scale Chunk Regularization

Wei-Cheng Lin, +2 more

- 09 Oct 2023

TL;DR: This study presents a transformer-based model designed with a word-chunk concept, which offers an ideal framework to explore different strategies to align text and speech, and proposes a multi-scale chunk alignment strategy, which generates random alignments to create the chunks without considering lexical boundaries.

6

Journal Article•10.3389/fnins.2023.1183132

GCF2-Net: global-aware cross-modal feature fusion network for speech emotion recognition

Feng Li, +3 more

- 04 May 2023

- Frontiers in neuroscience

TL;DR: The authors propose a global-aware cross-modal feature fusion network (GCF2-Net) for emotion recognition, which constructs a residual cross-modal fusion attention module (ResCMFA) to fuse information from multiple modalities.

References

Proceedings Article•10.3115/V1/D14-1162

GloVe: Global Vectors for Word Representation

Jeffrey Pennington, +2 more

- 01 Oct 2014

TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.

41.6K

Journal Article•10.1109/MCSE.2007.55

Matplotlib: A 2D Graphics Environment

J.D. Hunter

- 01 May 2007

- Computing in Science and Engineering

TL;DR: Matplotlib is a 2D graphics package for Python, used for application development, interactive scripting, and publication-quality image generation across user interfaces and operating systems.

34.7K

Posted Content

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, +2 more

- 10 Jul 2018

- arXiv: Learning

TL;DR: This work proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which it calls Contrastive Predictive Coding, and demonstrates that the approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.

8.2K

Journal Article•10.1007/S10579-008-9076-6

IEMOCAP: interactive emotional dyadic motion capture database

Carlos Busso, +8 more

- 05 Nov 2008

TL;DR: A new corpus named the "interactive emotional dyadic motion capture database" (IEMOCAP), collected by the Speech Analysis and Interpretation Laboratory at the University of Southern California (USC), which provides detailed information about the actors' facial expressions and hand movements during scripted and spontaneous spoken communication scenarios.

3.8K

Proceedings Article

Linguistic Regularities in Continuous Space Word Representations

Tomas Mikolov, +2 more

- 27 May 2013

TL;DR: The vector-space word representations that are implicitly learned by the input-layer weights are found to be surprisingly good at capturing syntactic and semantic regularities in language, and each relationship is characterized by a relation-specific vector offset.

3.8K
