Transformer-based Self-supervised Multimodal Representation Learning for Wearable Emotion Recognition

doi:10.1109/taffc.2023.3263907

Journal Article•10.1109/taffc.2023.3263907

Yujin Wu, +2 more

- 29 Mar 2023

- IEEE Transactions on Affective Computing

- Vol. abs/2303.17611, pp 1-16

30

TL;DR: This paper proposes a self-supervised learning (SSL) framework for wearable emotion recognition in which efficient multimodal fusion is realized with temporal convolution-based modality-specific encoders and a transformer-based shared encoder, capturing both intra-modal and inter-modal correlations.
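
The inter-modal part of this fusion can be pictured as self-attention across per-modality feature vectors. What follows is a minimal sketch, not the authors' implementation: it assumes each modality-specific encoder has already produced one fixed-size embedding, it uses a single attention head with identity query/key/value projections (no learned weights), and the modality names and numbers are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention over modality tokens.

    Assumption: identity query/key/value projections, so the sketch only
    shows how each modality attends to the others.
    """
    dim = len(tokens[0])
    fused, weights = [], []
    for query in tokens:
        # Similarity of this modality to every modality (itself included).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        attn = softmax(scores)
        weights.append(attn)
        # Fused token: attention-weighted sum of all modality tokens.
        fused.append([
            sum(a * value[i] for a, value in zip(attn, tokens))
            for i in range(dim)
        ])
    return fused, weights


# Hypothetical per-modality embeddings (names and values are illustrative).
modality_embeddings = {
    "ecg": [1.0, 0.0, 0.5, 0.2],
    "eda": [0.9, 0.1, 0.4, 0.3],
    "temperature": [-0.2, 1.0, 0.0, 0.1],
}
fused, weights = self_attention(list(modality_embeddings.values()))
```

Each row of `weights` sums to 1 and indicates how strongly one modality draws on the others. A full transformer encoder layer would add learned projections, multiple heads, residual connections, and layer normalization around this step.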

Figures

Fig. 4: Shared encoder based on the multimodal transformer. (FC: fully-connected layer with 128 units, LN: layer normalization)

Fig. 5: Modality-specific classification head Cm for signal transformation recognition task. (GAP: 1D global average pooling, FC: fully-connected layer, BatchNorm: batch normalization, num class: number of signal transformations, i.e., 6 in our work.)

Fig. 3: Modality-specific backbone based on temporal convolutional network (TCN). Each backbone consists of two residual blocks for capturing low-level features for transformed unimodal signals x′m. (k: kernel size, f: number of filters, d: dilation factor, p: padding size, s: stride size, weightnorm: weight normalization for convolution filters)

TABLE 4: Performance comparison of different emotion recognition tasks with state-of-the-art methods on the K-EmoCon dataset. (SL: supervised learning methods, UL: unsupervised learning methods, SSL: self-supervised learning methods, S: supervised, F: frozen, T: fine-tuned.)

TABLE 3: Performance comparison of different emotion recognition tasks with state-of-the-art methods on the CASE dataset. (SL: supervised learning methods, UL: unsupervised learning methods, SSL: self-supervised learning methods, S: supervised, F: frozen, T: fine-tuned.)

Fig. 1: Overview of our self-supervised multimodal representation learning framework. The proposed SSL model is first pre-trained with signal transformation recognition as the pretext task to learn generalized multimodal representations. The encoder part of the resulting pre-trained model then serves as a feature extractor for downstream tasks, where it is either frozen or fine-tuned on the labeled samples to predict emotion classes.
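
The pretext task named in the captions above can be sketched as follows: each unlabeled signal window is altered by one randomly chosen transformation, and the index of that transformation becomes the training label, so no emotion annotation is needed during pre-training. The six transformations below are illustrative stand-ins (the captions state only that there are six); the function names and parameter values are assumptions, not taken from the paper.

```python
import random


def identity(x, rng):
    return list(x)


def add_noise(x, rng, sigma=0.05):
    # Additive Gaussian noise; sigma is an arbitrary illustrative value.
    return [v + rng.gauss(0.0, sigma) for v in x]


def scale(x, rng, factor=1.5):
    return [v * factor for v in x]


def negate(x, rng):
    return [-v for v in x]


def time_reverse(x, rng):
    return list(reversed(x))


def permute_segments(x, rng, n_segments=4):
    # Split the window into contiguous segments and shuffle their order.
    size = len(x) // n_segments
    segments = [x[i * size:(i + 1) * size] for i in range(n_segments - 1)]
    segments.append(x[(n_segments - 1) * size:])
    rng.shuffle(segments)
    return [v for seg in segments for v in seg]


# Hypothetical transformation set; the list index doubles as the class label.
TRANSFORMS = [identity, add_noise, scale, negate, time_reverse, permute_segments]


def make_pretext_sample(window, rng):
    """Return (transformed_window, label) for one unlabeled window."""
    label = rng.randrange(len(TRANSFORMS))
    return TRANSFORMS[label](window, rng), label


rng = random.Random(0)
window = [0.1 * i for i in range(16)]
dataset = [make_pretext_sample(window, rng) for _ in range(8)]
```

A classifier trained to predict `label` from the transformed window has to pick up on signal structure, which is the property the pre-trained encoder is later reused for.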

Citations

Journal Article•10.1016/j.engappai.2024.108339

Using transformers for multimodal emotion recognition: Taxonomies and state of the art review

Samira Hazmoune, +1 more

- 01 Jul 2024

- Engineering Applications of Artificial I...

TL;DR: This review paper surveys Transformers-based approaches for Multimodal Emotion Recognition, exploring architectures, scenarios, datasets, fusion mechanisms, and taxonomies, while addressing challenges and future directions to advance the field of affective computing and human-computer interaction.

27

Journal Article•10.1145/3648469

Intelligent Wearable Systems: Opportunities and Challenges in Health and Sports

Luyao Yang, +2 more

- 14 Feb 2024

- ACM Computing Surveys

TL;DR: An overview of various types of intelligent wearables and their applications in health and sports is provided, ML algorithms are categorized, and the wireless body area sensor network (WBASN) used for communication in wearable sensors is introduced.

19

Journal Article•10.3390/app14178071

Multimodal Emotion Recognition Using Visual, Vocal and Physiological Signals: A Review

Gustave Udahemuka, +2 more

- 09 Sep 2024

- Applied Sciences

TL;DR: This review examines multimodal emotion recognition methods integrating visual, vocal, and physiological signals, highlighting challenges and solutions for high-quality emotion recognition systems, and emphasizing the benefits of dynamic expression analysis and multimodal fusion for improving accuracy.

11

Journal Article•10.1016/j.eswa.2024.123723

Enhancing emotion recognition using multimodal fusion of physiological, environmental, personal data

Hakpyeong Kim, +1 more

- Expert Systems With Applications

TL;DR: This study proposes a novel emotion recognition model that fuses physiological, environmental, and personal data, achieving a 31.6% error reduction and demonstrating robustness to individual differences through multimodal fusion and metadata incorporation.

11

Journal Article•10.1109/tim.2024.3420349

Deep Learning-Based Automated Emotion Recognition Using Multimodal Physiological Signals and Time-Frequency Methods

Sriram Kumar P, +4 more

- 01 Jan 2024

- IEEE Transactions on Instrumentation and...

5
