Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing

doi:10.48550/arXiv.2301.04558

Journal Article

Shruthi Bannur, +14 more

- 11 Jan 2023

- arXiv.org

- Vol. abs/2301.04558

58

TL;DR: BioViL-T uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model, and achieves state-of-the-art performance on progression classification, phrase grounding, and report generation.

Abstract: Self-supervised learning in vision-language processing (VLP) exploits semantic alignment between imaging and text modalities. Prior work in biomedical VLP has mostly relied on the alignment of single image and report pairs, even though clinical notes commonly refer to prior images. This not only introduces poor alignment between the modalities but also misses an opportunity to exploit rich self-supervision through existing temporal content in the data. In this work, we explicitly account for prior images and reports, when available, during both training and fine-tuning. Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model. It is designed to handle challenges such as pose variations and missing input images across time. The resulting model excels on downstream tasks in both single- and multi-image setups, achieving state-of-the-art performance on (I) progression classification, (II) phrase grounding, and (III) report generation, whilst offering consistent improvements on disease classification and sentence-similarity tasks. We release a novel multi-modal temporal benchmark dataset, MS-CXR-T, to quantify the quality of vision-language representations in terms of temporal semantics. Our experimental results show the advantages of incorporating prior images and reports to make the most of the data.

Citations

Journal Article•10.1016/J.MEDIA.2020.101840

Models Genesis

Zongwei Zhou, +4 more

- 09 Apr 2020

- Medical Image Analysis

TL;DR: This work has built a set of models, called Generic Autodidactic Models, nicknamed Models Genesis, because they are created ex nihilo (with no manual labeling), self-taught (learnt by self-supervision), and generic (served as source models for generating application-specific target models).

322

Journal Article•10.48550/arxiv.2310.18689

Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision

Bobby Azad, +6 more

- 28 Oct 2023

- arXiv.org

TL;DR: A methodical taxonomy of foundation models within the medical domain is offered, proposing a classification system primarily structured around training strategies, while also incorporating additional facets such as application domains, imaging modalities, specific organs of interest, and the algorithms integral to these models.

34

Journal Article•10.1101/2023.11.03.23298067

A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging

Yingshu Li, +9 more

- 31 Oct 2023

- medRxiv

TL;DR: This paper presents a comprehensive evaluation of GPT-4V's capabilities across diverse medical imaging tasks, including Radiology Report Generation, Medical Visual Question Answering (VQA), and Visual Grounding, and identifies limitations of conventional evaluation metrics such as the BLEU score.

31

10.1145/3613904.3642013

Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology

Nur Yildirim, +20 more

- 22 Feb 2024

TL;DR: This work engaged in an iterative, multidisciplinary design process to envision clinically relevant VLM interactions, and co-designed four VLM use concepts: Draft Report Generation, Augmented Report Review, Visual Search and Querying, and Patient Imaging History Highlights.

26

Journal Article•10.48550/arxiv.2401.10815

RAD-DINO: Exploring Scalable Medical Image Encoders Beyond Text Supervision

Fernando Pérez-García, +14 more

- 19 Jan 2024

- arXiv.org

TL;DR: RAD-DINO is introduced, a biomedical image encoder pre-trained solely on unimodal biomedical imaging data that obtains similar or greater performance than state-of-the-art biomedical language supervised models on a diverse range of benchmarks.

17

References

Posted Content

Deep Residual Learning for Image Recognition

Kaiming He, +3 more

- 10 Dec 2015

- arXiv: Computer Vision and Pattern Recog...

TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

117.9K

Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.

94.2K

Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

81.7K

Posted Content

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, +11 more

- 22 Oct 2020

- arXiv: Computer Vision and Pattern Recog...

TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

36.9K

Proceedings Article•10.3115/1073083.1073135

Bleu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, +3 more

- 06 Jul 2002

TL;DR: This paper proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.

28.9K
