1. What contributions have the authors mentioned in the paper "A review of affective computing: from unimodal analysis to multimodal fusion" ?
This is the primary motivation behind the authors' first-of-its-kind, comprehensive literature review of the diverse field of affective computing. Furthermore, existing literature surveys lack a detailed discussion of the state of the art in multimodal affect analysis frameworks, which this review aims to address. In this paper, the authors focus mainly on the use of audio, visual and text information for multimodal affect analysis, since around 90% of the relevant literature appears to cover these three modalities. As part of this review, the authors carry out an extensive study of different categories of state-of-the-art fusion techniques, followed by a critical analysis of potential performance improvements with multimodal analysis compared to unimodal analysis. A comprehensive overview of these two complementary fields aims to form the building blocks for readers to better understand this challenging and exciting research field.
2. What future works have the authors mentioned in the paper "A review of affective computing: from unimodal analysis to multimodal fusion" ?
One important area of future research is to investigate novel approaches for advancing the understanding of the temporal dependency between utterances, i.e., the effect of the utterance at time t on the utterance at time t+1. Progress in text classification research can play a major role in the future of multimodal affect analysis research. Future research should focus on answering this question. The use of deep learning for multimodal fusion can also be important future work.
3. What is the primary advantage of analyzing videos over textual analysis?
The primary advantage of analyzing videos over textual analysis, for detecting emotions and sentiments from opinions, is the surplus of behavioral cues.
4. What was the acoustic feature used to generate the feature representation of the entire dataset?
Low-level acoustic features were extracted at frame level on each utterance and used to generate a feature representation of the entire dataset, using the OpenSMILE toolkit.
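To make the step from frame-level features to a dataset-level representation concrete, here is a minimal sketch. It does not call OpenSMILE; the frame values are toy numbers, the function names are hypothetical, and summarizing each feature by its mean and standard deviation is an assumption chosen for illustration, not the configuration reported by the authors.

```python
from statistics import mean, pstdev


def utterance_functionals(frames):
    """Collapse frame-level feature vectors into one utterance-level vector.

    frames: list of equal-length lists, one per frame (hypothetical
    low-level descriptors such as energy or pitch).
    Returns [mean_0, std_0, mean_1, std_1, ...].
    """
    if not frames:
        raise ValueError("utterance has no frames")
    vector = []
    # zip(*frames) iterates over features (columns) across all frames.
    for column in zip(*frames):
        vector.append(mean(column))
        vector.append(pstdev(column))  # population standard deviation
    return vector


def dataset_representation(utterances):
    """Build one fixed-length row per utterance, whatever its frame count."""
    return [utterance_functionals(frames) for frames in utterances]


# Toy data (assumption): two utterances with two features per frame.
toy_utterances = [
    [[1.0, 10.0], [3.0, 10.0]],             # 2 frames
    [[2.0, 0.0], [2.0, 4.0], [2.0, 8.0]],   # 3 frames
]
features = dataset_representation(toy_utterances)
```

Each utterance ends up with a vector of the same length regardless of how many frames it contains, which is what lets utterances of different durations be stacked into a single dataset matrix.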





