Journal Article, DOI: 10.48550/arXiv.2301.04558
Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing
Shruthi Bannur,Stephanie L. Hyland,Qianchu Liu,Fernando Perez-Garcia,Maximilian Ilse,Daniel C. Castro,Benedikt Böcking,Harshita Sharma,Kenza Bouzid,Anja Thieme,Anton Schwaighofer,Matthew P. Lungren,Aditya Nori,Javier Alvarez-Valle,Ozan Oktay
58
TL;DR: BioViL-T uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model, achieving state-of-the-art performance on progression classification, phrase grounding, and report generation.
Abstract: Self-supervised learning in vision-language processing (VLP) exploits semantic alignment between imaging and text modalities. Prior work in biomedical VLP has mostly relied on the alignment of single image and report pairs, even though clinical notes commonly refer to prior images. This not only introduces poor alignment between the modalities but also misses an opportunity to exploit rich self-supervision through existing temporal content in the data. In this work, we explicitly account for prior images and reports, when available, during both training and fine-tuning. Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model. It is designed to handle challenges such as pose variations and missing input images across time. The resulting model excels on downstream tasks in both single- and multi-image setups, achieving state-of-the-art performance on (I) progression classification, (II) phrase grounding, and (III) report generation, whilst offering consistent improvements on disease classification and sentence-similarity tasks. We release a novel multi-modal temporal benchmark dataset, MS-CXR-T, to quantify the quality of vision-language representations in terms of temporal semantics. Our experimental results show the advantages of incorporating prior images and reports to make the most of the data.
Citations
Models Genesis
TL;DR: This work builds a set of models, called Generic Autodidactic Models and nicknamed Models Genesis, because they are created ex nihilo (with no manual labeling), self-taught (learnt by self-supervision), and generic (serving as source models for generating application-specific target models).
Foundational Models in Medical Imaging: A Comprehensive Survey and Future Vision
TL;DR: A methodical taxonomy of foundation models within the medical domain is offered, proposing a classification system primarily structured around training strategies, while also incorporating additional facets such as application domains, imaging modalities, specific organs of interest, and the algorithms integral to these models.
34
A Comprehensive Study of GPT-4V's Multimodal Capabilities in Medical Imaging
Yingshu Li,Yunyi Liu,Zhan Wang,Xinyu Liang,Lingqiao Liu,Lei Wang,Leyang Cui,Zhaopeng Tu,Longyue Wang,Luping Zhou
TL;DR: This paper presents a comprehensive evaluation of GPT-4V's capabilities across diverse medical imaging tasks, including Radiology Report Generation, Medical Visual Question Answering (VQA), and Visual Grounding, and highlights the limitations of conventional evaluation metrics such as the BLEU score.
31
Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology
Nur Yildirim,Hannah Richardson,M. Wetscherek,Junaid Bajwa,Joseph Jacob,Mark A. Pinnock,Stephen Harris,Daniel C. Castro,Shruthi Bannur,Stephanie L. Hyland,Pratik Ghosh,Mercy Prasanna Ranjit,Kenza Bouzid,Anton Schwaighofer,Fernando Pérez-García,Harshita Sharma,Ozan Oktay,M. Lungren,Javier Alvarez-Valle,Aditya Nori,Anja Thieme
- 22 Feb 2024
TL;DR: This work engaged in an iterative, multidisciplinary design process to envision clinically relevant VLM interactions, and co-designed four VLM use concepts: Draft Report Generation, Augmented Report Review, Visual Search and Querying, and Patient Imaging History Highlights.
26
RAD-DINO: Exploring Scalable Medical Image Encoders Beyond Text Supervision
Fernando Pérez-García,Harshita Sharma,Sam Bond-Taylor,Kenza Bouzid,Valentina Salvatelli,Maximilian Ilse,Shruthi Bannur,Daniel C. Castro,Anton Schwaighofer,M. Lungren,M. Wetscherek,Noel Codella,Stephanie L. Hyland,Javier Alvarez-Valle,Ozan Oktay
TL;DR: RAD-DINO is introduced, a biomedical image encoder pre-trained solely on unimodal biomedical imaging data that obtains similar or greater performance than state-of-the-art biomedical language supervised models on a diverse range of benchmarks.
17
References
•Posted Content
Deep Residual Learning for Image Recognition
TL;DR: This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
117.9K
•Proceedings Article
Attention is All you Need
Ashish Vaswani,Noam Shazeer,Niki Parmar,Jakob Uszkoreit,Llion Jones,Aidan N. Gomez,Lukasz Kaiser,Illia Polosukhin
- 12 Jun 2017
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.
•Posted Content
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
81.7K
•Posted Content
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy,Lucas Beyer,Alexander Kolesnikov,Dirk Weissenborn,Xiaohua Zhai,Thomas Unterthiner,Mostafa Dehghani,Matthias Minderer,Georg Heigold,Sylvain Gelly,Jakob Uszkoreit,Neil Houlsby
TL;DR: Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
Bleu: a Method for Automatic Evaluation of Machine Translation
Kishore Papineni,Salim Roukos,Todd Ward,Wei-Jing Zhu
- 06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.