Question-controlled Text-aware Image Captioning
Anwen Hu,Shizhe Chen,Qin Jin +2 more
- 17 Oct 2021
- pp 3097-3105
TL;DR: This paper proposes a Geometry and Question Aware Model (GQAM) that fuses region-level object features and scene text features while considering their spatial relationships, and uses questions as control signals to generate personalized text-aware captions.
Abstract: For an image with multiple scene texts, different people may be interested in different text information. Current text-aware image captioning models are not able to generate distinctive captions according to various information needs. To explore how to generate personalized text-aware captions, we define a new challenging task, namely Question-controlled Text-aware Image Captioning (Qc-TextCap). With questions as control signals, this task requires models to understand questions, find related scene texts, and describe them together with objects fluently in human language. Based on two existing text-aware captioning datasets, we automatically construct two datasets, ControlTextCaps and ControlVizWiz, to support the task. We propose a novel Geometry and Question Aware Model (GQAM). GQAM first applies a Geometry-informed Visual Encoder to fuse region-level object features and region-level scene text features while considering their spatial relationships. Then, we design a Question-guided Encoder to select the most relevant visual features for each question. Finally, GQAM generates a personalized text-aware caption with a Multimodal Decoder. Our model achieves better captioning performance and question answering ability than carefully designed baselines on both datasets. With questions as control signals, our model generates more informative and diverse captions than the state-of-the-art text-aware captioning model. Our code and datasets are publicly available at https://github.com/HAWLYQ/Qc-TextCap.
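The idea of selecting the most relevant visual features for a question can be illustrated with a toy example. The sketch below is not the authors' GQAM implementation (see the linked repository for that); it assumes plain scaled dot-product attention, and the function names `softmax` and `question_guided_attention` as well as the toy vectors are hypothetical choices made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def question_guided_attention(question_vec, region_feats):
    """Weight region features by their relevance to a question vector.

    Assumption: relevance is a scaled dot product, which stands in for
    whatever scoring the actual Question-guided Encoder uses.
    """
    dim = len(question_vec)
    scores = [
        sum(q * r for q, r in zip(question_vec, feat)) / math.sqrt(dim)
        for feat in region_feats
    ]
    weights = softmax(scores)
    # Weighted sum of region features, one value per feature dimension.
    attended = [
        sum(w * feat[i] for w, feat in zip(weights, region_feats))
        for i in range(dim)
    ]
    return weights, attended


# Toy inputs: one question vector and three region features (object or scene text).
question = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
weights, attended = question_guided_attention(question, regions)
```

In this toy setup the first region is the one most aligned with the question, so it receives the largest attention weight. A real model would use learned projections and would also incorporate geometric relations between regions.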
Citations
mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding
Jiabo Ye,Anwen Hu,Haiyang Xu,Qinghao Ye,Mingshi Yan,Yuhao Dan,Guohai Xu,Chenliang Li,Junfeng Tian,Qiang Qi,Jie Zhang,Feiyan Huang +11 more
TL;DR: This paper proposes mPLUG-DocOwl, a modularized multimodal large language model (MLLM) for OCR-free document understanding, trained jointly on language-only, general vision-and-language, and document instruction tuning datasets.
Point to Rectangle Matching for Image Text Retrieval
Zhengxu Wang,Zhenwei Gao,Xin Xu,Ya Dan Luo,Yang Yang,Heng Tao Shen +5 more
- 10 Oct 2022
TL;DR: This work proposes a Point to Rectangle Matching mechanism, a geometric representation learning method for image-text retrieval. The intuition is that the representations of different modalities can be extended to rectangles, so that the set of points inside a rectangle embedding can be semantically related to many candidate correspondences.
UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model
Jiabo Ye,Anwen Hu,Hui Xu,Qinghao Ye,Mi Yan,Guohai Xu,Chenliang Li,Jun Tian,Quan Qian,Ji Zhang,Qin Jin,He Li,Xichuan Lin,Fei Huang +13 more
- 01 Jan 2023
TL;DR: The UReader model is a multimodal large language model for visually-situated language understanding without optical character recognition (OCR). It achieves state-of-the-art performance on various tasks, including text extraction, object detection, and scene understanding.
Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature
TL;DR: Visual Question Answering (VQA), in which an algorithm must answer questions about given images, is an emerging area of research interest and a recent problem spanning natural language processing and image prediction.
TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning
Liang Zhang,Anwen Hu,Haiyang Xu,Mi Yan,Yuehang Xu,Qin Jin,Ji Zhang,Fei Huang +7 more
- 25 Apr 2024
TL;DR: TinyChart is an efficient MLLM for chart understanding with only 3B parameters that achieves SOTA performance on various benchmarks. It overcomes challenges in efficient chart understanding through visual token merging and PoT learning strategies.
References
•Proceedings Article
Attention is All you Need
Ashish Vaswani,Noam Shazeer,Niki Parmar,Jakob Uszkoreit,Llion Jones,Aidan N. Gomez,Lukasz Kaiser,Illia Polosukhin +7 more
- 12 Jun 2017
TL;DR: This paper proposed a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely and achieved state-of-the-art performance on English-to-French translation.
•Posted Content
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin,Michael Maire,Serge Belongie,James Hays,Pietro Perona,Deva Ramanan,Piotr Dollár,C. Lawrence Zitnick +7 more
- 06 Sep 2014
TL;DR: A new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding by gathering images of complex everyday scenes containing common objects in their natural context.
Bleu: a Method for Automatic Evaluation of Machine Translation
Kishore Papineni,Salim Roukos,Todd Ward,Wei-Jing Zhu +3 more
- 06 Jul 2002
TL;DR: This paper proposed a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
•Proceedings Article
ROUGE: A Package for Automatic Evaluation of Summaries
Chin-Yew Lin
- 25 Jul 2004
TL;DR: Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, which are included in the ROUGE summarization evaluation package, along with their evaluations.