Deep Evidential Learning with Noisy Correspondence for Cross-modal Retrieval

doi:10.1145/3503161.3547922

Proceedings Article•10.1145/3503161.3547922

10 Oct 2022

43

TL;DR: This paper proposes a generalized Deep Evidential Cross-modal Learning framework (DECL), which integrates a novel Cross-modal Evidential Learning paradigm (CEL) and a Robust Dynamic Hinge loss (RDH) with positive and negative learning.

Abstract: Cross-modal retrieval has been a compelling topic in the multimodal community. Recently, to mitigate the high cost of data collection, co-occurring pairs (e.g., image and text) have been collected from the Internet to build large-scale cross-modal datasets, e.g., Conceptual Captions. However, this unavoidably introduces noise (i.e., mismatched pairs) into the training data, dubbed noisy correspondence. Such noise makes the supervision information unreliable and uncertain, and markedly degrades performance. Moreover, most existing methods focus training on hard negatives, which amplifies the unreliability caused by noise. To address these issues, we propose a generalized Deep Evidential Cross-modal Learning framework (DECL), which integrates a novel Cross-modal Evidential Learning paradigm (CEL) and a Robust Dynamic Hinge loss (RDH) with positive and negative learning. CEL captures and learns the uncertainty brought by noise to improve the robustness and reliability of cross-modal retrieval. Specifically, the bidirectional evidence based on cross-modal similarity is first modeled and parameterized into the Dirichlet distribution, which not only provides accurate uncertainty estimation but also imparts resilience to perturbations caused by noisy correspondence. To address the amplification problem, RDH smoothly increases the hardness of the negatives it focuses on, thus achieving higher robustness under high noise. Extensive experiments are conducted on three image-text benchmark datasets, i.e., Flickr30K, MS-COCO, and Conceptual Captions, to verify the effectiveness and efficiency of the proposed method. The code is available at https://github.com/QinYang79/DECL.
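
The step of turning cross-modal similarity into evidence and parameterizing a Dirichlet distribution can be illustrated with a minimal sketch. It follows the standard subjective-logic formulation used in evidential deep learning (alpha = evidence + 1, uncertainty = K / sum(alpha)) and is not the DECL implementation. The exponential evidence function, the temperature `tau`, the function names, and the similarity values are all assumptions made for illustration. Only one retrieval direction (image to text) is shown; the other direction would apply the same computation to a text query against candidate images.

```python
import math


def similarity_to_evidence(similarities, tau=0.1):
    """Map raw similarities to non-negative evidence.

    Assumption: evidence = exp(s / tau). The abstract does not specify
    the evidence function; any non-negative mapping would fit here.
    """
    return [math.exp(s / tau) for s in similarities]


def dirichlet_opinion(evidence):
    """Build a subjective-logic opinion from per-candidate evidence.

    alpha_k = e_k + 1 are the Dirichlet parameters, S = sum(alpha) is the
    Dirichlet strength, b_k = e_k / S is the belief mass for candidate k,
    and u = K / S is the overall uncertainty, so sum(b) + u == 1.
    """
    k = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    belief = [e / strength for e in evidence]
    uncertainty = k / strength
    return belief, uncertainty


# Hypothetical cosine similarities between one image and four candidate captions.
confident = [0.9, 0.1, 0.0, -0.1]    # one caption clearly matches
ambiguous = [0.05, 0.0, -0.05, 0.0]  # no caption stands out

for name, sims in [("confident", confident), ("ambiguous", ambiguous)]:
    belief, u = dirichlet_opinion(similarity_to_evidence(sims))
    print(name, [round(b, 3) for b in belief], round(u, 3))
```

In this sketch the ambiguous query collects little total evidence and therefore receives a much higher uncertainty than the confident one. This is the kind of signal an evidential model could use to treat a possibly mismatched pair with caution.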

Citations

Proceedings Article•10.1609/aaai.v38i2.27911

Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation

Zhuohang Dang, +5 more

- 24 Mar 2024

- Proceedings of the ... AAAI Conference o...

TL;DR: The Self-Reinforcing Errors Mitigation framework (SREM) for noisy correspondence learning alleviates noisy correspondences and improves cross-modal retrieval performance by refining sample filtration and leveraging negative matches.

Journal Article•10.1145/3700596

Bias Mitigation and Representation Optimization for Noise-Robust Cross-modal Retrieval

Yu Liu, +4 more

- 14 Oct 2024

- ACM Transactions on Multimedia Computing...

TL;DR: This paper proposes BMRO, a framework for bias mitigation and representation optimization in noise-robust cross-modal retrieval, utilizing a Bias Estimator and Adaptive Representation Optimizer to enhance accurate sample division and tailored optimization strategies for clean and noisy samples.

Proceedings Article•10.1145/3581783.3612296

ROAD: Robust Unsupervised Domain Adaptation with Noisy Labels

Yanglin Feng, +4 more

- 26 Oct 2023

TL;DR: This paper proposes a robust unsupervised domain adaptation framework (ROAD) that prevents the network from overfitting noisy labels so that it can capture accurate discrimination knowledge for domain adaptation. A Robust Adaptive Weighted Learning mechanism (RSWL) adaptively assigns each sample a weight based on its reliability, so that the model focuses more on reliable samples and less on unreliable ones, thereby mining robust discrimination knowledge against noisy labels in the source domain.

Proceedings Article•10.1109/isbi53787.2023.10230589

EVIL: Evidential Inference Learning for Trustworthy Semi-Supervised Medical Image Segmentation

Yingyu Chen, +5 more

- 18 Apr 2023

TL;DR: EVIL introduces a novel semi-supervised medical image segmentation framework that effectively utilizes uncertainty quantification and consistency regularization for accurate segmentation with few labeled data.

Proceedings Article•10.1109/icme55011.2023.00153

Semantic Embedding Uncertainty Learning for Image and Text Matching

Yan Wang, +6 more

- 01 Jul 2023

TL;DR: A novel Semantic Embedding Uncertainty Learning (SEUL) method is proposed, which represents the embedding uncertainty of image and text as Gaussian distributions and simultaneously learns the salient embedding and its uncertainty in the common space.

References

•Journal Article•10.1109/TPAMI.2016.2577031

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, +3 more

- 01 Jun 2017

- IEEE Transactions on Pattern Analysis an...

TL;DR: This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.

64.4K

•Book Chapter•10.1007/978-3-319-10602-1_48

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

- 06 Sep 2014

TL;DR: This work presents a new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding, gathering images of complex everyday scenes that contain common objects in their natural context.

51.7K

•Proceedings Article•10.1109/CVPR42600.2020.00975

Momentum Contrast for Unsupervised Visual Representation Learning

Kaiming He, +4 more

- 14 Jun 2020

TL;DR: This work proposes Momentum Contrast (MoCo) for unsupervised visual representation learning, which builds a large and consistent dictionary on the fly to facilitate contrastive learning.

9.7K

•Journal Article•10.1162/TACL_A_00166

From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

Peter Young, +3 more

- 28 Feb 2014

- Transactions of the Association for Comp...

TL;DR: This work proposes to use the visual denotations of linguistic expressions to define novel denotational similarity metrics, which are shown to be at least as beneficial as distributional similarities for two tasks that require semantic inference.

3K

•Proceedings Article•10.18653/V1/P18-1238

Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning

Piyush Sharma, +3 more

- 01 Jul 2018

TL;DR: The Conceptual Captions dataset contains an order of magnitude more images than the MS-COCO dataset and represents a wider variety of both images and image caption styles.

2.7K