Grounded Situation Recognition
Sarah M Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi, Aniruddha Kembhavi
- 23 Aug 2020
- Vol. 12349, pp 314-332
TL;DR: In this article, the authors introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images describing: the primary activity, entities engaged in the activity with their roles, and bounding-box groundings of entities.
Abstract: We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images describing: the primary activity, entities engaged in the activity with their roles (e.g. agent, tool), and bounding-box groundings of entities. GSR presents important technical challenges: identifying semantic saliency, categorizing and localizing a large and diverse set of entities, overcoming semantic sparsity, and disambiguating roles. Moreover, unlike in captioning, GSR is straightforward to evaluate. To study this new task we create the Situations With Groundings (SWiG) dataset which adds 278,336 bounding-box groundings to the 11,538 entity classes in the imSitu dataset. We propose a Joint Situation Localizer and find that jointly predicting situations and groundings with end-to-end training handily outperforms independent training on the entire grounding metric suite with relative gains between 8% and 32%. Finally, we show initial findings on three exciting future directions enabled by our models: conditional querying, visual chaining, and grounded semantic aware image retrieval. Code and data available at https://prior.allenai.org/projects/gsr.
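The structured summary described in the abstract (a primary activity, entities with their roles, and bounding-box groundings) can be pictured as a small record type. The sketch below is illustrative only and is not taken from the released code or the SWiG file format: the class names, the `(x_min, y_min, x_max, y_max)` box convention, the use of `None` for an ungrounded entity, and the example values (including the `place` role) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedEntity:
    """Hypothetical container for one entity filling a role."""
    noun: str                   # entity label, e.g. "person"
    box: Optional[Box] = None   # None when no grounding is given (assumption)


@dataclass
class GroundedSituation:
    """Hypothetical container for one structured image summary."""
    verb: str                                               # the primary activity
    roles: Dict[str, GroundedEntity] = field(default_factory=dict)

    def grounded_roles(self) -> List[str]:
        """Return the names of roles whose entity has a bounding box."""
        return [name for name, ent in self.roles.items() if ent.box is not None]


# Made-up example values, for illustration only.
situation = GroundedSituation(
    verb="cutting",
    roles={
        "agent": GroundedEntity("person", (40.0, 30.0, 310.0, 420.0)),
        "tool": GroundedEntity("knife", (180.0, 240.0, 260.0, 300.0)),
        "place": GroundedEntity("kitchen"),  # left ungrounded in this example
    },
)
print(situation.grounded_roles())  # ['agent', 'tool']
```

Keeping roles in a mapping from role name to entity mirrors the "entities with their roles (e.g. agent, tool)" phrasing, and an optional box lets one record hold both grounded and ungrounded entities.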
Citations
Multi-Modal Knowledge Graph Construction and Application: A Survey
TL;DR: Multi-modal Knowledge Graphs (MMKGs), knowledge graphs constructed from text and images, are presented as a promising approach toward realizing human-level machine intelligence.
Human-like Controllable Image Captioning with Verb-specific Semantic Roles
Long Chen, Zhihong Jiang, Jun Xiao, Wei Liu
- 01 Jun 2021
TL;DR: The authors propose a new control signal for controllable image captioning (CIC), Verb-specific Semantic Roles (VSR). A VSR consists of a verb and some semantic roles, which represent a targeted activity and the roles of the entities involved in that activity.
Visual Semantic Role Labeling for Video Understanding
Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, Aniruddha Kembhavi
- 01 Jun 2021
TL;DR: VidSitu is a large-scale video understanding benchmark of 29k 10-second movie clips, each richly annotated with a verb and semantic roles every 2 seconds. Entities are co-referenced across events within a movie clip, and events are connected to each other via event-event relations.
CLIP-Event: Connecting Text and Images with Event Structures
- 01 Jun 2022
TL;DR: This paper proposes a contrastive learning framework that pushes vision-language pretraining models to comprehend events and their associated argument (participant) roles. It uses text information extraction technologies to obtain event structural knowledge, and uses multiple prompt functions to contrast difficult negative descriptions produced by manipulating event structures.
Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs
Ana Marasović, Chandra Bhagavatula, Jae Sung Park, Ronan Le Bras, Noah A. Smith, Yejin Choi
- 01 Nov 2020
TL;DR: This article proposes the Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs.