Open AccessJournal Article10.1109/LRA.2016.2529686

Action Recognition Based on Efficient Deep Feature Learning in the Spatio-Temporal Domain

- 12 Feb 2016

- Vol. 1, Iss: 2, pp 984-991

TL;DR: A simple, yet robust, 2-D convolutional neural network extended to a concatenated 3-D network that learns to extract features from the spatio-temporal domain of raw video data and is used for content-based recognition of videos.

Abstract: Hand-crafted feature functions are usually designed based on the domain knowledge of a presumably controlled environment and often fail to generalize, as the statistics of real-world data cannot always be modeled correctly. Data-driven feature learning methods, on the other hand, have emerged as an alternative that often generalize better in uncontrolled environments. We present a simple, yet robust, 2-D convolutional neural network extended to a concatenated 3-D network that learns to extract features from the spatio-temporal domain of raw video data. The resulting network model is used for content-based recognition of videos. Relying on a 2-D convolutional neural network allows us to exploit a pretrained network as a descriptor that yielded the best results on the largest and challenging ILSVRC-2014 dataset. Experimental results on commonly used benchmarking video datasets demonstrate that our results are state-of-the-art in terms of accuracy and computational time without requiring any preprocessing (e.g., optic flow) or a priori knowledge on data capture (e.g., camera motion estimation), which makes it more general and flexible than other approaches. Our implementation is made available.

AI Agents for this Paper

Find similar papers on Google Scholar, PubMed and Arxiv
Write a critical review of this paper
Analyze citations of this paper to find unaddressed research gaps

Most frequently asked questions

1. What are the contributions in "Action recognition based on efficient deep feature learning in the spatio-temporal domain" ?

The authors present a simple, yet robust, 2D convolutional neural network extended to a concatenated 3D network that learns to extract features from the spatio-temporal domain of raw video data.. Experimental results on commonly used benchmarking video datasets demonstrate that their results are state-of-the-art in terms of accuracy and computational time without requiring any preprocessing ( e. g., optic flow ) or a priori knowledge on data capture ( e. g., camera motion estimation ), which makes it more general and flexible than other approaches.

2. What are the future works in "Action recognition based on efficient deep feature learning in the spatio-temporal domain" ?

In the future, the authors plan to explore possible modifications in the network design to further exploit learning in the temporal domain.. One possibility would be to gradually increase the number of temporal connections along the sequence of layers.. The authors also plan to investigate the effect on performance of gradually clipping the top layers of the network and evaluation on the recently introduced Sports-1M dataset [ 31 ] which contains over 1 million labeled sample videos.

Fig. 3. Confusion matrix for the UCF-101 dataset accumulated for all three splits

Fig. 5. Confusion matrix for the HMDB dataset accumulated for all three splits

Fig. 6. Top-5 predictions using our approach for selected test sequences from the UCF-101 dataset [47] with 101 action categories. First row (green color) shows the ground-truth followed by predictions in decreasing level of confidence. Blue and red show correct and incorrect predictions, respectively.

Fig. 7. Top-5 predictions using our approach for selected test sequences from the HMDB dataset [48] with 51 action categories. First row (green color) shows the ground-truth followed by predictions in decreasing level of confidence. Blue and red show correct and incorrect predictions, respectively.

Fig. 1. Illustration of the network. We use the output from layer 16 of the VGG-Net (Table 1 in [45]), as a descriptor. The output is concatenated to form 512, 3D feature maps. The 3D feature maps are used as input for the network consisting of a volumetric convolutional layer followed by two fully-connected layers.

Fig. 2. Illustration of how the different network outputs are combined, where VGG-3D-fc2 refers to the fc2 layer in Fig. 1

Citations

Journal Article•10.1007/S11042-017-5438-7

Video scene analysis: an overview and challenges on deep learning algorithms

Qaisar Abbas, +3 more

- 01 Aug 2018

- Multimedia Tools and Applications

TL;DR: A review of the recent developments in deep learning and video sceneAnalysis problems is presented and a detailed overview of the particular challenges existed in real-time video scene analysis that has been contributed towards activity recognition, scene interpretation, and video description/captioning is provided.

...read moreread less

•Posted Content

A Survey on Deep Learning Methods for Robot Vision.

Javier Ruiz-del-Solar, +2 more

- 28 Mar 2018

- arXiv: Computer Vision and Pattern Recog...

TL;DR: A comprehensive overview of deep learning and its usage in computer vision is given, that includes a description of the most frequently used neural models and their main application areas, and a review of the principal work using deep learning in robot vision.

...read moreread less

•Proceedings Article•10.1145/3123266.3123299

3D CNNs on Distance Matrices for Human Action Recognition

Alejandro Hernandez Ruiz, +3 more

- 19 Oct 2017

TL;DR: A 3D Convolutional Neural Network with body representations based on Euclidean Distance Matrices is combined with a novel architecture that simultaneously, and in an end-to-end manner, learns an optimal transformation of the joints, while optimizing the rest of parameters of the convolutional network.

...read moreread less

Journal Article•10.1007/S11042-017-4614-0

Action recognition based on hierarchical dynamic Bayesian network

Qinkun Xiao, +1 more

- 01 Mar 2018

- Multimedia Tools and Applications

TL;DR: A novel action recognition method based on hierarchical dynamic Bayesian network (HDBN), divided into system learning stage and action recognition stage, which shows the proposed method is promising.

...read moreread less

•Journal Article•10.1016/J.IMAGE.2019.115731

Correlation Net: Spatiotemporal multimodal deep learning for action recognition

Novanto Yudistira, +2 more

- 01 Mar 2020

- Signal Processing-image Communication

TL;DR: A correlation network with a Shannon fusion for learning a pre-trained CNN that captures multimodal correlations over arbitrary timestamps and is validated in comparison experiments on the UCF-101 and HMDB-51 datasets.

...read moreread less

...

Expand

References

•Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

- 04 Sep 2014

TL;DR: This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

...read moreread less

102.6K

•Journal Article•10.1145/3065386

ImageNet classification with deep convolutional neural networks

Alex Krizhevsky, +2 more

- 24 May 2017

- Communications of The ACM

TL;DR: A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

...read moreread less

98.2K

•Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, +2 more

- 03 Dec 2012

TL;DR: The state-of-the-art performance of CNNs was achieved by Deep Convolutional Neural Networks (DCNNs) as discussed by the authors, which consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.

...read moreread less

88.4K

Journal Article•10.1109/5.726791

Gradient-based learning applied to document recognition

Yann LeCun, +6 more

- 01 Jan 1998

TL;DR: In this article, a graph transformer network (GTN) is proposed for handwritten character recognition, which can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters.

...read moreread less

53.5K

•Proceedings Article

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan, +1 more

- 01 Jan 2015

TL;DR: In this paper, the authors investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting and showed that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 layers.

...read moreread less

51.9K

...

Expand

Action Recognition Based on Efficient Deep Feature Learning in the Spatio-Temporal Domain

Chat with Paper

AI Agents for this Paper

Most frequently asked questions

1. What are the contributions in "Action recognition based on efficient deep feature learning in the spatio-temporal domain" ?

2. What are the future works in "Action recognition based on efficient deep feature learning in the spatio-temporal domain" ?

Figures

Citations

Video scene analysis: an overview and challenges on deep learning algorithms

A Survey on Deep Learning Methods for Robot Vision.

3D CNNs on Distance Matrices for Human Action Recognition

Action recognition based on hierarchical dynamic Bayesian network

Correlation Net: Spatiotemporal multimodal deep learning for action recognition

References

Very Deep Convolutional Networks for Large-Scale Image Recognition

ImageNet classification with deep convolutional neural networks

ImageNet Classification with Deep Convolutional Neural Networks

Gradient-based learning applied to document recognition

Very Deep Convolutional Networks for Large-Scale Image Recognition

Related Papers (5)

Large-Scale Video Classification with Convolutional Neural Networks

Two-Stream Convolutional Networks for Action Recognition in Videos

ImageNet Classification with Deep Convolutional Neural Networks

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

3D Convolutional Neural Networks for Human Action Recognition