Question-controlled Text-aware Image Captioning

doi:10.1145/3474085.3475452

Open Access•Proceedings Article•10.1145/3474085.3475452

Anwen Hu, +2 more

- 17 Oct 2021

- pp 3097-3105

20

TL;DR: This paper proposes a Geometry and Question Aware Model (GQAM) that fuses region-level object features and scene text features while taking spatial relationships into account.

Abstract: For an image with multiple scene texts, different people may be interested in different text information. Current text-aware image captioning models are not able to generate distinctive captions according to various information needs. To explore how to generate personalized text-aware captions, we define a new challenging task, namely Question-controlled Text-aware Image Captioning (Qc-TextCap). With questions as control signals, this task requires models to understand questions, find related scene texts, and describe them together with objects fluently in human language. Based on two existing text-aware captioning datasets, we automatically construct two datasets, ControlTextCaps and ControlVizWiz, to support the task. We propose a novel Geometry and Question Aware Model (GQAM). GQAM first applies a Geometry-informed Visual Encoder to fuse region-level object features and region-level scene text features while considering spatial relationships. Then, we design a Question-guided Encoder to select the most relevant visual features for each question. Finally, GQAM generates a personalized text-aware caption with a Multimodal Decoder. Our model achieves better captioning performance and question answering ability than carefully designed baselines on both datasets. With questions as control signals, our model generates more informative and diverse captions than the state-of-the-art text-aware captioning model. Our code and datasets are publicly available at https://github.com/HAWLYQ/Qc-TextCap.
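
The first two stages described in the abstract (geometry-informed fusion of region features, then question-guided selection) can be illustrated with a small toy sketch. This is not the authors' implementation, which is available at the repository linked above. The function names (`geometry_informed_fuse`, `question_guided_select`), the distance-penalty form of the geometry bias, the `dist_weight` parameter, and the toy vectors are all assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def box_center(box):
    """Center of an (x1, y1, x2, y2) bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def geometry_informed_fuse(features, boxes, dist_weight=1.0):
    """Self-attention over regions with a spatial bias (hypothetical form).

    Each region attends to every region (object or scene text); the score is
    a scaled dot product minus a penalty proportional to the distance between
    box centers, so nearby regions contribute more to the fused feature.
    """
    dim = len(features[0])
    centers = [box_center(b) for b in boxes]
    fused = []
    for i, f_i in enumerate(features):
        scores = [
            dot(f_i, f_j) / math.sqrt(dim)
            - dist_weight * math.dist(centers[i], centers[j])
            for j, f_j in enumerate(features)
        ]
        weights = softmax(scores)
        fused.append(
            [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]
        )
    return fused


def question_guided_select(question_vec, fused):
    """Attend over fused region features using the question as the query."""
    dim = len(question_vec)
    scores = [dot(question_vec, f) / math.sqrt(dim) for f in fused]
    weights = softmax(scores)
    attended = [sum(w * f[k] for w, f in zip(weights, fused)) for k in range(dim)]
    return weights, attended


# Toy inputs: three regions with 2-d features and normalized boxes.
features = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
boxes = [(0.0, 0.0, 0.2, 0.2), (0.8, 0.8, 1.0, 1.0), (0.1, 0.0, 0.3, 0.2)]
fused = geometry_informed_fuse(features, boxes)
weights, attended = question_guided_select([1.0, 0.0], fused)
```

In this toy setup the question vector is aligned with the first region's feature, so that region receives a larger attention weight than the distant, dissimilar second region. A real model would learn projections for the queries, keys, and values and would feed the attended features to a decoder that generates the caption.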

Citations

Journal Article•10.48550/arXiv.2307.02499

mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding

Jiabo Ye, +11 more

- 04 Jul 2023

- arXiv.org

TL;DR: This paper proposes mPLUG-DocOwl, a model based on multimodal large language models (MLLMs) for OCR-free document understanding, which is trained jointly on language-only, general vision-and-language, and document instruction tuning datasets.

63

Proceedings Article•10.1145/3503161.3548237

Point to Rectangle Matching for Image Text Retrieval

Zhengxu Wang, +5 more

- 10 Oct 2022

TL;DR: This work proposes a Point to Rectangle Matching mechanism, a geometric representation learning method for image-text retrieval. The intuition is that representations of different modalities can be extended to rectangles, so that the set of points inside a rectangle embedding can be semantically related to many candidate correspondences.

22

Journal Article•10.18653/v1/2023.findings-emnlp.187

UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Jiabo Ye, +13 more

- 01 Jan 2023

TL;DR: The UReader model is a multimodal large language model for visually-situated language understanding without optical character recognition (OCR). It achieves state-of-the-art performance on various tasks, including text extraction, object detection, and scene understanding.

13

Journal Article•10.48550/arXiv.2305.11033

Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature

Valeska Uchôa, +1 more

- 18 May 2023

- arXiv.org

TL;DR: Visual Question Answering (VQA) is an emerging area of interest for researchers. It is a recent problem in natural language processing and image prediction in which an algorithm needs to answer questions about given images.

11

Preprint•10.48550/arxiv.2404.16635

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

Liang Zhang, +7 more

- 25 Apr 2024

TL;DR: TinyChart is an efficient MLLM for chart understanding with only 3B parameters that achieves SOTA performance on various benchmarks. It overcomes challenges in efficient chart understanding through visual token merging and PoT learning strategies.

7

References

•Proceedings Article

Attention is All you Need

Ashish Vaswani, +7 more

- 12 Jun 2017

TL;DR: This paper proposed the Transformer, a simple network architecture based solely on attention mechanisms that dispenses with recurrence and convolutions entirely, and achieved state-of-the-art performance on English-to-French translation.

94.2K

•Posted Content

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, +3 more

- 11 Oct 2018

- arXiv: Computation and Language

TL;DR: BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. It can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

81.7K

•Book Chapter•10.1007/978-3-319-10602-1_48

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

- 06 Sep 2014

TL;DR: A new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding. It was built by gathering images of complex everyday scenes containing common objects in their natural context.

51.7K

•Proceedings Article•10.3115/1073083.1073135

Bleu: a Method for Automatic Evaluation of Machine Translation

Kishore Papineni, +3 more

- 06 Jul 2002

TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.

28.9K

•Proceedings Article

ROUGE: A Package for Automatic Evaluation of Summaries

Chin-Yew Lin

- 25 Jul 2004

TL;DR: Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, which are included in the ROUGE summarization evaluation package, along with their evaluations.

14.8K