A Multimodal Saliency Model for Videos With High Audio-Visual Correspondence

doi:10.1109/TIP.2020.2966082

Journal Article•10.1109/TIP.2020.2966082

Xiongkuo Min, +5 more

- 17 Jan 2020

- IEEE Transactions on Image Processing

- Vol. 29, pp 3805-3819

203

TL;DR: The proposed MMS model captures the influence of audio, which the latest deep-learning-based saliency models do not consider, and combining it with deep saliency models yields an average performance gain of 5%.

Abstract: Most current visual attention prediction studies ignore audio information. However, sound can influence visual attention, and this influence has been widely investigated and demonstrated by many psychological studies. In this paper, we propose a novel multi-modal saliency (MMS) model for videos containing scenes with high audio-visual correspondence. In such scenes, humans tend to be attracted by the sound sources, and it is also possible to localize the sound sources via cross-modal analysis. Specifically, we first detect the spatial and temporal saliency maps from the visual modality using a novel free energy principle. We then detect the audio saliency map from both the audio and visual modalities by localizing the moving-sounding objects using cross-modal kernel canonical correlation analysis, which is the first of its kind in the literature. Finally, we propose a new two-stage adaptive audiovisual saliency fusion method that integrates the spatial, temporal, and audio saliency maps into our audio-visual saliency map. The proposed MMS model captures the influence of audio, which is not considered in the latest deep-learning-based saliency models. To take advantage of both deep saliency modeling and audio-visual saliency modeling, we combine deep saliency models and the MMS model via late fusion, and we find that this yields an average performance gain of 5%. Experimental results on audio-visual attention databases show that the introduced models, which incorporate audio cues, significantly outperform state-of-the-art image and video saliency models that use only the visual modality.
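The abstract names two steps that are easier to picture with a small example: localizing the sounding object from audio-visual co-variation, and fusing the MMS map with a deep saliency map. The Python sketch below is only an illustration under stated assumptions and is not the authors' implementation. It replaces the cross-modal kernel canonical correlation analysis with a plain per-pixel Pearson correlation between motion magnitude and audio energy, and it replaces the paper's fusion schemes with a fixed-weight average. All function names, inputs, and the `weight` parameter are hypothetical.

```python
import numpy as np


def normalize_map(saliency):
    """Rescale a saliency map to [0, 1]; a constant map becomes all zeros."""
    saliency = np.asarray(saliency, dtype=np.float64)
    lo, hi = saliency.min(), saliency.max()
    if hi - lo < 1e-12:
        return np.zeros_like(saliency)
    return (saliency - lo) / (hi - lo)


def audio_map_by_correlation(motion, audio_energy):
    """Simplified stand-in for the kernel CCA step (an assumption, not the paper's method).

    motion:       array of shape (T, H, W), per-pixel motion magnitude over T frames.
    audio_energy: array of shape (T,), audio energy for the same T frames.
    Returns an (H, W) map of the positive Pearson correlation between each
    pixel's motion and the audio energy, rescaled to [0, 1].
    """
    motion = np.asarray(motion, dtype=np.float64)
    audio = np.asarray(audio_energy, dtype=np.float64)
    m = motion - motion.mean(axis=0)          # center each pixel over time
    a = audio - audio.mean()                  # center the audio signal
    cov = np.tensordot(a, m, axes=(0, 0))     # sum over time -> (H, W)
    denom = np.sqrt((a ** 2).sum() * (m ** 2).sum(axis=0))
    corr = np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 1e-12)
    return normalize_map(np.clip(corr, 0.0, None))


def late_fusion(deep_map, mms_map, weight=0.5):
    """Fixed-weight average of two normalized maps (hypothetical fusion rule)."""
    fused = weight * normalize_map(deep_map) + (1.0 - weight) * normalize_map(mms_map)
    return normalize_map(fused)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.random(50)
    motion = 0.1 * rng.random((50, 4, 4))
    motion[:, 1, 2] += audio                  # one pixel moves in sync with the sound
    audio_map = audio_map_by_correlation(motion, audio)
    print(np.unravel_index(audio_map.argmax(), audio_map.shape))
```

In this toy setup the only pixel whose motion follows the audio energy receives the highest audio saliency. A real system would need proper audio and motion features and a learned or adaptive fusion weight, which the abstract mentions but does not specify.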

Citations

Journal Article•10.1007/S11432-019-2757-1

Perceptual image quality assessment: a survey

Guangtao Zhai, +1 more

- 26 Apr 2020

- Science in China Series F: Information S...

TL;DR: This survey provides a general overview of classical algorithms and recent progresses in the field of perceptual image quality assessment and describes the performances of the state-of-the-art quality measures for visual signals.

556

•Journal Article•10.1109/ACCESS.2021.3053956

Enhanced YOLO v3 Tiny Network for Real-Time Ship Detection From Visual Image

Li Hao, +4 more

- 25 Jan 2021

- IEEE Access

TL;DR: The authors propose an enhanced YOLO v3 tiny network for real-time ship detection, which can be used in video surveillance to accurately classify and locate six types of ships (ore carrier, bulk cargo carrier, general cargo ship, container ship, fishing boat, and passenger ship).

120

•Journal Article•10.1145/3457905

Perceptual Quality Assessment of Low-light Image Enhancement

Guangtao Zhai, +3 more

- 12 Nov 2021

- ACM Transactions on Multimedia Computing...

TL;DR: Low-light image enhancement algorithms (LIEA) can brighten images captured in dark or back-lit conditions, but they may introduce various distortions such as structural damage and color shift.

74

•Journal Article•10.1145/3508361

Multimodality in VR: A Survey

- 31 Jan 2022

- ACM Computing Surveys

TL;DR: This survey reviews the body of work addressing multimodality in VR and its role and benefits in user experience.

67

Journal Article•10.1109/TCYB.2020.3037208

RIHOOP: Robust Invisible Hyperlinks in Offline and Online Photographs.

Jun Jia, +6 more

- 01 Jan 2021

- IEEE Transactions on Systems, Man, and C...

TL;DR: The authors propose an end-to-end neural network with an encoder to hide messages and a decoder to extract them, which makes the hyperlinks invisible to human eyes but detectable by camera-equipped mobile devices.

67

References

•Journal Article•10.1167/9.12.15

Static and space-time visual saliency detection by self-resemblance.

Hae Jong Seo, +1 more

- 20 Nov 2009

- Journal of Vision

TL;DR: A novel unified framework for both static and space-time saliency detection, which results in a saliency map where each pixel indicates the statistical likelihood of saliency of a feature matrix given its surrounding feature matrices.

746

•Journal Article•10.1109/TIP.2018.2851672

Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model.

Marcella Cornia, +3 more

- 29 Jun 2018

- IEEE Transactions on Image Processing

TL;DR: The authors propose a convolutional long short-term memory (LSTM) network that iteratively refines the predicted saliency map by focusing on the most salient regions of the input image.

669

Journal Article•10.1109/TMM.2014.2373812

Using free energy principle for blind image quality assessment

Ke Gu, +3 more

- 01 Jan 2015

- IEEE Transactions on Multimedia

TL;DR: A new no-reference (NR) image quality assessment (IQA) metric is proposed using the recently revealed free-energy-based brain theory and classical human visual system (HVS)-inspired features to predict an image that the HVS perceives from a distorted image based on the free energy theory.

653

•Proceedings Article•10.1109/ICCV.2013.26

Saliency Detection: A Boolean Map Approach

Jianming Zhang, +1 more

- 01 Dec 2013

TL;DR: A novel Boolean Map based Saliency model, based on a Gestalt principle of figure-ground segregation, that consistently achieves state-of-the-art performance compared with ten leading methods on five eye tracking datasets.

631

•Journal Article•10.1109/TPAMI.2012.147

Visual Saliency Based on Scale-Space Analysis in the Frequency Domain

Jian Li, +4 more

- 01 Apr 2013

- IEEE Transactions on Pattern Analysis an...

TL;DR: A new bottom-up paradigm for detecting visual saliency is proposed, characterized by a scale-space analysis of the amplitude spectrum of natural images, and it is shown that the convolution of the image amplitude spectrum with a low-pass Gaussian kernel of an appropriate scale is equivalent to an image saliency detector.

625

Related Papers (5)

Deep Audio-Visual Saliency: Baseline Model and Data

Fixation prediction through multimodal analysis

DAVE: A Deep Audio-Visual Embedding for Dynamic Saliency Prediction.

Multimodal Saliency Models for Videos

Blind Quality Assessment Based on Pseudo-Reference Image