1. What are the contributions in "Action recognition based on efficient deep feature learning in the spatio-temporal domain" ?
The authors present a simple yet robust 2D convolutional neural network, extended to a concatenated 3D network, that learns to extract features from the spatio-temporal domain of raw video data. Experimental results on commonly used benchmark video datasets demonstrate state-of-the-art accuracy and computational time without requiring any preprocessing (e.g., optic flow) or a priori knowledge of data capture (e.g., camera motion estimation), which makes the approach more general and flexible than others.
2. What are the future works in "Action recognition based on efficient deep feature learning in the spatio-temporal domain" ?
In the future, the authors plan to explore possible modifications to the network design to further exploit learning in the temporal domain. One possibility would be to gradually increase the number of temporal connections along the sequence of layers. The authors also plan to investigate the effect on performance of gradually clipping the top layers of the network, and to evaluate on the recently introduced Sports-1M dataset [31], which contains over 1 million labeled sample videos.


![Fig. 6. Top-5 predictions using our approach for selected test sequences from the UCF-101 dataset [47] with 101 action categories. First row (green color) shows the ground-truth followed by predictions in decreasing level of confidence. Blue and red show correct and incorrect predictions, respectively.](/figures/fig-6-top-5-predictions-using-our-approach-for-selected-test-3kbl6ta3.png)
![Fig. 7. Top-5 predictions using our approach for selected test sequences from the HMDB dataset [48] with 51 action categories. First row (green color) shows the ground-truth followed by predictions in decreasing level of confidence. Blue and red show correct and incorrect predictions, respectively.](/figures/fig-7-top-5-predictions-using-our-approach-for-selected-test-l65m3f6y.png)
![Fig. 1. Illustration of the network. We use the output from layer 16 of the VGG-Net (Table 1 in [45]), as a descriptor. The output is concatenated to form 512, 3D feature maps. The 3D feature maps are used as input for the network consisting of a volumetric convolutional layer followed by two fully-connected layers.](/figures/fig-1-illustration-of-the-network-we-use-the-output-from-37au7usq.png)
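
The concatenation step described in the Fig. 1 caption can be illustrated with a minimal sketch. It only shows how per-frame 2D descriptors might be stacked along a time axis to give one 3D volume per channel; it is not the authors' implementation. The helper names (`stack_frame_features`, `fake_descriptor`, `shape`), the 7x7 spatial size, and the 4-frame clip length are all assumptions made for illustration. Only the channel count of 512 comes from the caption.

```python
# Hypothetical sketch: stack per-frame 2D feature maps into 3D feature maps.
# Only the channel count (512) is taken from the Fig. 1 caption; the spatial
# size and the number of frames are placeholder assumptions.


def stack_frame_features(per_frame):
    """Stack T frame descriptors, each indexed [channel][row][col],
    into volumes indexed [channel][time][row][col]."""
    if not per_frame:
        raise ValueError("need at least one frame descriptor")
    channels = len(per_frame[0])
    for descriptor in per_frame:
        if len(descriptor) != channels:
            raise ValueError("all frames must have the same channel count")
    # For each channel, collect that channel's 2D map from every frame in order.
    return [[frame[c] for frame in per_frame] for c in range(channels)]


def shape(nested):
    """Return the dimensions of a regular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


def fake_descriptor(frame_index, channels=512, height=7, width=7):
    """Stand-in for the 2D CNN output of one frame (assumed shape).
    Every value equals the frame index so the time axis is easy to check."""
    return [
        [[float(frame_index)] * width for _ in range(height)]
        for _ in range(channels)
    ]


frames = [fake_descriptor(t) for t in range(4)]  # assumed 4-frame clip
volumes = stack_frame_features(frames)
print(shape(volumes))  # (512, 4, 7, 7)
```

In this layout each of the 512 channels becomes a small time-by-height-by-width volume, which is the kind of input a volumetric (3D) convolutional layer followed by fully-connected layers would consume.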
