Journal Article
DOI: 10.1109/taffc.2023.3263907
Transformer-based Self-supervised Multimodal Representation Learning for Wearable Emotion Recognition
Yujin Wu, Mohamed Daoudi, Ali Amad
TL;DR: This paper proposes a self-supervised learning (SSL) framework for wearable emotion recognition, in which efficient multimodal fusion is realized with temporal convolution-based modality-specific encoders and a transformer-based shared encoder, capturing both intra-modal and inter-modal correlations.
Abstract: Recently, wearable emotion recognition based on peripheral physiological signals has drawn considerable attention because it is less invasive and applicable in real-life scenarios. However, how to effectively fuse multimodal data remains a challenging problem. Moreover, traditional fully supervised approaches suffer from overfitting when labeled data is limited. To address these issues, we propose a novel self-supervised learning (SSL) framework for wearable emotion recognition, where efficient multimodal fusion is realized with temporal convolution-based modality-specific encoders and a transformer-based shared encoder, capturing both intra-modal and inter-modal correlations. Extensive unlabeled data is automatically assigned labels by five signal transforms, and the proposed SSL model is pre-trained with signal transformation recognition as a pretext task, allowing the extraction of generalized multimodal representations for emotion-related downstream tasks. For evaluation, the proposed SSL model was first pre-trained on a large-scale self-collected physiological dataset, and the resulting encoder was subsequently frozen or fine-tuned on three public supervised emotion recognition datasets. Ultimately, our SSL-based method achieved state-of-the-art results in various emotion classification tasks. The proposed model also proved more accurate and robust than fully supervised methods in low-data regimes.
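The pretext task described in the abstract (labels assigned automatically by signal transforms) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the five transforms, so the transforms below (noise, scaling, negation, time flip, segment permutation), their parameters, and the function name `make_pretext_dataset` are all assumptions chosen for illustration. Treating the untransformed signal as a sixth class is also an assumption, made only to match the six classes mentioned in the Fig. 5 caption.

```python
import numpy as np

# Hypothetical transforms: the abstract does not list the actual five,
# so these are illustrative stand-ins with assumed parameters.
def identity(x, rng):
    return x.copy()

def add_noise(x, rng, sigma=0.05):
    return x + rng.normal(0.0, sigma, size=x.shape)

def scale(x, rng, factor=1.5):
    return x * factor

def negate(x, rng):
    return -x

def time_flip(x, rng):
    return x[::-1].copy()

def permute_segments(x, rng, n_segments=4):
    # Cut the window into segments and shuffle their order.
    segments = np.array_split(x, n_segments)
    order = rng.permutation(n_segments)
    return np.concatenate([segments[i] for i in order])

# Class 0 is the untransformed signal (an assumption), classes 1-5 are transforms.
TRANSFORMS = [identity, add_noise, scale, negate, time_flip, permute_segments]

def make_pretext_dataset(windows, seed=0):
    """Turn unlabeled 1D windows into (signal, transform-index) pairs."""
    rng = np.random.default_rng(seed)
    signals, labels = [], []
    for window in windows:
        for label, transform in enumerate(TRANSFORMS):
            signals.append(transform(window, rng))
            labels.append(label)
    return np.stack(signals), np.array(labels)

# Usage: three unlabeled windows of 64 samples each yield 18 labeled examples.
windows = [np.sin(np.linspace(0.0, 6.28, 64)) for _ in range(3)]
X, y = make_pretext_dataset(windows)
```

A classifier trained to predict `y` from `X` needs no human annotation, which is what allows pre-training on large unlabeled recordings before the encoder is reused for emotion recognition.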
Figures

Fig. 1: Overview of our self-supervised multimodal representation learning framework. The proposed SSL model is first pre-trained with signal transform recognition as the pretext task to learn generalized multimodal representations. The encoder of the resulting pre-trained model then serves as a feature extractor for downstream tasks and is either frozen or fine-tuned on labeled samples to predict emotion classes.
Fig. 3: Modality-specific backbone based on temporal convolutional network (TCN). Each backbone consists of two residual blocks for capturing low-level features from transformed unimodal signals x′m. (k: kernel size, f: number of filters, d: dilation factor, p: padding size, s: stride size, weightnorm: weight normalization for convolution filters)
Fig. 4: Shared encoder based on the multimodal transformer. (FC: fully-connected layer with 128 units, LN: layer normalization)
Fig. 5: Modality-specific classification head Cm for signal transformation recognition task. (GAP: 1D global average pooling, FC: fully-connected layer, BatchNorm: batch normalization, num class: number of signal transformations, i.e., 6 in our work.)
TABLE 3: Performance comparison of different emotion recognition tasks with state-of-the-art methods on the CASE dataset. (SL: supervised learning methods, UL: unsupervised learning methods, SSL: self-supervised learning methods, S: supervised, F: frozen, T: fine-tuned.)
TABLE 4: Performance comparison of different emotion recognition tasks with state-of-the-art methods on the K-EmoCon dataset. (SL: supervised learning methods, UL: unsupervised learning methods, SSL: self-supervised learning methods, S: supervised, F: frozen, T: fine-tuned.)
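The Fig. 5 caption names the components of the classification head (1D global average pooling, fully-connected layers, batch normalization, six output classes). The sketch below shows only the pooling-then-classify idea in plain NumPy. It is a simplification under stated assumptions: the caption does not give the layer order or sizes, so the single fully-connected layer, the omission of batch normalization, the 128-channel feature width, and the names `global_average_pool_1d` and `classification_head` are all hypothetical.

```python
import numpy as np

def global_average_pool_1d(features):
    # features: (batch, time, channels) -> (batch, channels)
    return features.mean(axis=1)

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def classification_head(features, weights, bias):
    """Pool over time, apply one linear layer, return class probabilities."""
    pooled = global_average_pool_1d(features)
    return softmax(pooled @ weights + bias)

# Usage with random stand-in values (shapes are assumptions, not from the paper).
rng = np.random.default_rng(0)
features = rng.normal(size=(2, 50, 128))        # 2 windows, 50 time steps, 128 channels
weights = rng.normal(scale=0.1, size=(128, 6))  # 6 classes, as stated in the Fig. 5 caption
bias = np.zeros(6)
probs = classification_head(features, weights, bias)
```

Averaging over the time axis makes the head independent of window length, so the same weights can score windows of different durations.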
Citations
Using transformers for multimodal emotion recognition: Taxonomies and state of the art review
Samira Hazmoune, Fateh Bougamouza
TL;DR: This review paper surveys Transformers-based approaches for Multimodal Emotion Recognition, exploring architectures, scenarios, datasets, fusion mechanisms, and taxonomies, while addressing challenges and future directions to advance the field of affective computing and human-computer interaction.
27
Intelligent Wearable Systems: Opportunities and Challenges in Health and Sports
Luyao Yang, Osama Amin, Basem Shihada
TL;DR: An overview of various types of intelligent wearables and their applications in health and sports is provided, ML algorithms are categorized, and the wireless body area sensor network (WBASN) used for communication in wearable sensors is introduced.
19
Multimodal Emotion Recognition Using Visual, Vocal and Physiological Signals: A Review
Gustave Udahemuka, Karim Djouani, Anish Kurien
TL;DR: This review examines multimodal emotion recognition methods integrating visual, vocal, and physiological signals, highlighting challenges and solutions for high-quality emotion recognition systems, and emphasizing the benefits of dynamic expression analysis and multimodal fusion for improving accuracy.
11
Enhancing emotion recognition using multimodal fusion of physiological, environmental, personal data
TL;DR: This study proposes a novel emotion recognition model that fuses physiological, environmental, and personal data, achieving a 31.6% error reduction and demonstrating robustness to individual differences through multimodal fusion and metadata incorporation.
11
Deep Learning-Based Automated Emotion Recognition Using Multi modal Physiological Signals and Time-Frequency Methods
Sriram Kumar P, Praveen Kumar Govarthan, Abdul Aleem Shaik Gadda, Nagarajan Ganapathy, Jac Fredo Agastinose Ronickom
5