Grounded Situation Recognition

doi:10.1007/978-3-030-58548-8_19

•Open Access•Book Chapter•10.1007/978-3-030-58548-8_19

Sarah M Pratt, +5 more

- 23 Aug 2020

- Vol. 12349, pp 314-332

73

TL;DR: In this article, the authors introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images describing: the primary activity, entities engaged in the activity with their roles, and bounding-box groundings of entities.
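
The structured output described in this summary (a primary activity, entities with their roles, and a bounding box per entity) can be pictured as a small data structure. The sketch below is illustrative only and is not taken from the paper: the class names, the role names, the box convention `(x_min, y_min, x_max, y_max)`, and the use of `None` for an entity without a box are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedEntity:
    """An entity filling a role, optionally localized in the image."""
    noun: str
    box: Optional[Box] = None  # assumption: None means no box was given


@dataclass
class GroundedSituation:
    """Hypothetical container: one activity plus its role-to-entity map."""
    verb: str
    roles: Dict[str, GroundedEntity] = field(default_factory=dict)

    def grounded_roles(self) -> List[str]:
        """Return the names of roles whose entity has a bounding box."""
        return [role for role, ent in self.roles.items() if ent.box is not None]


# Made-up example values, for illustration only.
example = GroundedSituation(
    verb="feeding",
    roles={
        "agent": GroundedEntity("person", (10.0, 20.0, 110.0, 220.0)),
        "food": GroundedEntity("hay", (120.0, 150.0, 180.0, 200.0)),
        "place": GroundedEntity("farm"),
    },
)
```

Calling `example.grounded_roles()` lists the roles that carry a box, which here are the agent and the food but not the place.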

Citations

•Journal Article•10.1109/tkde.2022.3224228

Multi-Modal Knowledge Graph Construction and Application: A Survey

01 Jan 2022

- IEEE Transactions on Knowledge and Data ...

TL;DR: This survey covers multi-modal knowledge graphs (MMKGs), knowledge graphs constructed from text and images, which it presents as a promising approach toward human-level machine intelligence.

123

•Proceedings Article•10.1109/CVPR46437.2021.01657

Human-like Controllable Image Captioning with Verb-specific Semantic Roles

Long Chen, +3 more

- 01 Jun 2021

TL;DR: The authors propose a new control signal for controllable image captioning (CIC), Verb-specific Semantic Roles (VSR), which consists of a verb and a set of semantic roles representing a targeted activity and the roles of the entities involved in it.

86

•Proceedings Article•10.1109/CVPR46437.2021.00554

Visual Semantic Role Labeling for Video Understanding

Arka Sadhu, +4 more

- 01 Jun 2021

TL;DR: The VidSitu benchmark is a large-scale video understanding dataset of 29k 10-second movie clips, richly annotated with a verb and semantic roles every 2 seconds; entities are co-referenced across events within a clip, and events are connected to each other via event-event relations.

76

•Proceedings Article•10.1109/cvpr52688.2022.01593

CLIP-Event: Connecting Text and Images with Event Structures

01 Jun 2022

TL;DR: The authors propose a contrastive learning framework that pushes vision-language pretraining models to comprehend events and their associated argument (participant) roles. It uses text information extraction to obtain event-structure knowledge and multiple prompt functions to contrast difficult negative descriptions built by manipulating event structures.

74

•Proceedings Article•10.18653/V1/2020.FINDINGS-EMNLP.253

Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

Ana Marasović, +5 more

- 01 Nov 2020

TL;DR: This article proposes Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs.

51

References

•Proceedings Article

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, +1 more

- 01 Jan 2015

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

138.5K

•Journal Article•10.1162/NECO.1997.9.8.1735

Long short-term memory

Sepp Hochreiter, +1 more

- 01 Nov 1997

- Neural Computation

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.

99K

•Proceedings Article•10.1109/CVPR.2009.5206848

ImageNet: A large-scale hierarchical image database

Jia Deng, +5 more

- 20 Jun 2009

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

75.9K

•Book Chapter•10.1007/978-3-319-10602-1_48

Microsoft COCO: Common Objects in Context

Tsung-Yi Lin, +7 more

- 06 Sep 2014

TL;DR: A new dataset that aims to advance the state of the art in object recognition by placing it in the context of the broader question of scene understanding; it gathers images of complex everyday scenes containing common objects in their natural context.

51.7K

•Proceedings Article•10.1109/CVPR.2017.106

Feature Pyramid Networks for Object Detection

Tsung-Yi Lin, +5 more

- 21 Jul 2017

TL;DR: This paper exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost and achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles.

29.5K

Related Papers (5)

Situation Recognition: Visual Semantic Role Labeling for Image Understanding

Deep Residual Learning for Image Recognition

Attention is All you Need

Bleu: a Method for Automatic Evaluation of Machine Translation

Going Deeper With Semantics: Video Activity Interpretation Using Semantic Contextualization