Forecasting Hands and Objects in Future Frames
Chenyou Fan, Jangwon Lee, Michael S. Ryoo
- 08 Sep 2018
- pp 124-137
TL;DR: This paper proposes a two-stream fully convolutional neural network (CNN) architecture that predicts future object presence and location in a video from a given image frame. It relies on the intermediate representation of a CNN model abstracting the scene information in a frame, and regresses the representation of a future frame from that of the current frame.
Abstract: This paper presents an approach to forecast future presence and location of human hands and objects. Given an image frame, the goal is to predict what objects will appear in the future frame (e.g., 5 s later) and where they will be located, even when they are not visible in the current frame. The key idea is that (1) an intermediate representation of a convolutional object recognition model abstracts scene information in its frame and that (2) we can predict (i.e., regress) such representations corresponding to the future frames based on that of the current frame. We present a new two-stream fully convolutional neural network (CNN) architecture designed for forecasting future objects given a video. The experiments confirm that our approach allows reliable estimation of future objects in videos, obtaining much higher accuracy compared to the state-of-the-art future object presence forecast method on public datasets.
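The key idea in the abstract, regressing a future frame's representation from the current one, can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the random vectors stand in for CNN feature maps, the cyclic shift in `toy_future` stands in for real scene dynamics, and a single linear map trained with an L2 loss stands in for the paper's two-stream fully convolutional regression network.

```python
import random

def predict(W, f):
    """Apply the linear regressor W to a feature vector f."""
    return [sum(w * x for w, x in zip(row, f)) for row in W]

def l2_loss(pred, target):
    """Mean squared error between predicted and true future features."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def train_regressor(pairs, dim, lr=0.1, epochs=300):
    """Fit W so that W applied to f_now approximates f_future, via plain SGD."""
    W = [[0.0] * dim for _ in range(dim)]
    for _ in range(epochs):
        for f_now, f_future in pairs:
            pred = predict(W, f_now)
            for i in range(dim):
                # Gradient of the mean squared error w.r.t. row i of W.
                err = pred[i] - f_future[i]
                for j in range(dim):
                    W[i][j] -= lr * 2.0 * err * f_now[j] / dim
    return W

def toy_future(f):
    """Hypothetical scene dynamics: the future feature is a cyclic shift."""
    return f[1:] + f[:1]

# Toy stand-in for (current feature, future feature) training pairs.
random.seed(0)
DIM = 4
pairs = []
for _ in range(50):
    f = [random.uniform(-1, 1) for _ in range(DIM)]
    pairs.append((f, toy_future(f)))

W = train_regressor(pairs, DIM)
test_f = [0.5, -0.2, 0.1, 0.9]
final_loss = l2_loss(predict(W, test_f), toy_future(test_f))
```

In the approach described above, the predicted representation would then be passed to the remaining layers of the object recognition model to produce future object locations; that step is omitted from this sketch.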
Citations
Overcoming Limitations of Mixture Density Networks: A Sampling and Fitting Framework for Multimodal Future Prediction
Osama Makansi, Eddy Ilg, Özgün Çiçek, Thomas Brox
- 15 Jun 2019
TL;DR: This work predicts several samples of the future with a winner-takes-all loss and iteratively groups the samples into multiple modes. It shows on synthetic and real data that the approach yields good estimates of multimodal distributions and avoids mode collapse.
132
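As a rough illustration of the winner-takes-all idea named in this entry, the sketch below penalizes only the hypothesis closest to the ground truth. The squared Euclidean distance and the function name `wta_loss` are assumptions made for illustration; the cited paper's exact loss variant is not given on this page.

```python
def wta_loss(hypotheses, target):
    """Winner-takes-all: only the hypothesis closest to the target is penalized.

    Returns the winning hypothesis's squared distance and its index.
    """
    def sq_dist(h):
        return sum((a - b) ** 2 for a, b in zip(h, target))

    losses = [sq_dist(h) for h in hypotheses]
    winner = min(range(len(losses)), key=losses.__getitem__)
    return losses[winner], winner

# Three hypothetical future-location hypotheses for one tracked object.
hyps = [(0.0, 0.0), (4.0, 4.0), (10.0, 0.0)]
loss, idx = wta_loss(hyps, target=(3.0, 5.0))
```

Because the other hypotheses receive no gradient for this sample, they remain free to cover other modes of the future distribution.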
Predicting the future from first person (egocentric) vision: A survey
TL;DR: It is highlighted that methods for future prediction from egocentric vision can have a significant impact in a range of applications and that further research efforts should be devoted to the standardisation of tasks and the proposal of datasets considering real-world scenarios such as the ones with an industrial vocation.
56
Multimodal Future Localization and Emergence Prediction for Objects in Egocentric View With a Reachability Prior
Osama Makansi, Özgün Çiçek, Kevin Buchicchio, Thomas Brox
- 14 Jun 2020
TL;DR: Experiments show that the reachability prior combined with multi-hypotheses learning improves multimodal prediction of the future location of tracked objects and, for the first time, the emergence of new objects.
Early Pedestrian Intent Prediction via Features Estimation
16 Oct 2022
TL;DR: In this paper, a model for egocentric action anticipation (RU-LSTM) is proposed to predict pedestrians' crossing intentions using an attention-based fusion mechanism.
3
References
•Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
- 01 Jan 2015
TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.
51.9K
SSD: Single Shot MultiBox Detector
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg
- 08 Oct 2016
TL;DR: The approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location, and combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is easy to train and straightforward to integrate into systems that require a detection component.
14K
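The default-box discretization mentioned in this entry can be sketched as follows. This is a simplified illustration: the feature map sizes, scales, and aspect ratios are made-up values rather than SSD's published configuration, and the function name `default_boxes` is hypothetical.

```python
import math

def default_boxes(fmap_size, scale, aspect_ratios):
    """Generate (cx, cy, w, h) default boxes in normalized image coordinates.

    One box per aspect ratio is centered on every cell of a square
    fmap_size x fmap_size feature map.
    """
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                # Width grows and height shrinks with the aspect ratio,
                # keeping the box area equal to scale ** 2.
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes

# Two feature maps of different resolution: the finer map gets smaller boxes.
ratios = [1.0, 2.0, 0.5]
boxes = default_boxes(fmap_size=2, scale=0.4, aspect_ratios=ratios)
boxes += default_boxes(fmap_size=1, scale=0.8, aspect_ratios=ratios)
```

Pairing coarser feature maps with larger scales is what lets a detector of this kind cover objects of various sizes with a fixed set of boxes.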
The Cityscapes Dataset for Semantic Urban Scene Understanding
Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, Bernt Schiele
- 01 Jun 2016
TL;DR: This work introduces Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling, and exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity.
11.5K
•Proceedings Article
Two-Stream Convolutional Networks for Action Recognition in Videos
Karen Simonyan, Andrew Zisserman
- 08 Dec 2014
TL;DR: This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
8.3K