How much time does it take to compute a dense flow field?

Once the flow field is computed, tracking feature points is relatively inexpensive; it takes only 6% of the total computation time.

How much time is used to compute dense optical flow fields?

The authors achieve 2.4 frames/second for W = 14 pixels to W = 20 pixels, in which case most of the time is used to compute dense optical flow fields.

How do the authors set the contrast threshold for the SIFT detector?

The authors set the contrast threshold for the SIFT detector to 0.004, which is one order lower than the default setting, and makes sure that there are enough SIFT interest points for matching.

How do the authors compute the descriptors for HMDB51?

For their descriptors, the authors take the positions of Harris3D interest points, and then compute descriptors in the 3D patches around these positions with the same parameters as for the dense trajectories, i.e., the size of the 3D patch is 32×32 pixels and 15 frames.

Why are the trajectories not robust to shot boundaries?

Unlike the KLT trajectories, SIFT trajectories are very robust to shot boundaries, i.e., most trajectories crossing the shot boundaries are removed.

How much performance gain is there between the default parameters and the parameters for which the feature number is comparable to the standard dense trajectories?

The performance gain between the default parameters and the parameters for which the feature number is comparable to the standard dense trajectories is around 1% for both KLT and SIFT trajectories.

How much time does the descriptor take to compute?

Note that the descriptor computation reuses the optical flow field, which is computed only once.

Open Access•Journal Article•10.1007/S11263-012-0594-8

Dense Trajectories and Motion Boundary Descriptors for Action Recognition

Heng Wang, +3 more

- 06 Mar 2013

- International Journal of Computer Vision

- Vol. 103, Iss: 1, pp 60-79

1.9K

TL;DR: The MBH descriptor shows to consistently outperform other state-of-the-art descriptors, in particular on real-world videos that contain a significant amount of camera motion.

Abstract: This paper introduces a video representation based on dense trajectories and motion boundary descriptors. Trajectories capture the local motion information of the video. A dense representation guarantees a good coverage of foreground motion as well as of the surrounding context. A state-of-the-art optical flow algorithm enables a robust and efficient extraction of dense trajectories. As descriptors we extract features aligned with the trajectories to characterize shape (point coordinates), appearance (histograms of oriented gradients) and motion (histograms of optical flow). Additionally, we introduce a descriptor based on motion boundary histograms (MBH) which rely on differential optical flow. The MBH descriptor shows to consistently outperform other state-of-the-art descriptors, in particular on real-world videos that contain a significant amount of camera motion. We evaluate our video representation in the context of action classification on nine datasets, namely KTH, YouTube, Hollywood2, UCF sports, IXMAS, UIUC, Olympic Sports, UCF50 and HMDB51. On all datasets our approach outperforms current state-of-the-art results.

Most frequently asked questions

1. What contributions have the authors mentioned in the paper "Dense trajectories and motion boundary descriptors for action recognition" ?

This paper introduces a video representation based on dense trajectories and motion boundary descriptors. As descriptors the authors extract features aligned with the trajectories to characterize shape (point coordinates), appearance (histograms of oriented gradients) and motion (histograms of optical flow). Additionally, the authors introduce a descriptor based on motion boundary histograms (MBH) which rely on differential optical flow. The authors evaluate their video representation in the context of action classification on nine datasets, namely KTH, YouTube, Hollywood2, UCF sports, IXMAS, UIUC, Olympic Sports, UCF50 and HMDB51. On all datasets their approach outperforms current state-of-the-art results.

2. What are the 11 action categories in the YouTube dataset?

The YouTube dataset (Liu et al., 2009) contains 11 action categories: basketball shooting, biking/cycling, diving, golf swinging, horse back riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog.

3. How does the algorithm perform on large video datasets?

Developing better optical flow algorithms suitable for large realistic video datasets is important to improve the performance of current action recognition systems.

4. What are the default parameters for their experiments?

The default parameters for their experiments are N = 32, nσ = 2, nτ = 3, which were shown to give the best performance when cross-validating on the training set of Hollywood2.
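To make the role of nσ and nτ concrete, the sketch below computes the descriptor lengths that a grid of nσ × nσ × nτ cells would produce. The grid sizes come from the answer above; the bin counts (8 orientations for HOG and MBH, 9 for HOF including a zero bin) and the trajectory length of 15 frames are assumptions based on the paper's usual configuration, and the function and variable names are illustrative only.

```python
# Sketch: descriptor sizes implied by a spatio-temporal grid of cells.
# N_SIGMA and N_TAU come from the text; the bin counts and the trajectory
# length are assumptions, not values stated on this page.

def grid_descriptor_dim(n_sigma: int, n_tau: int, n_bins: int) -> int:
    """Length of a histogram descriptor over n_sigma x n_sigma x n_tau cells."""
    return n_sigma * n_sigma * n_tau * n_bins


N_SIGMA, N_TAU = 2, 3   # spatial and temporal cell counts (from the text)
HOG_BINS = 8            # assumed: 8 orientation bins
HOF_BINS = 9            # assumed: 8 orientation bins plus one zero bin
TRACK_LENGTH = 15       # assumed: trajectory length in frames

dims = {
    # one (dx, dy) displacement per tracked frame
    "trajectory": 2 * TRACK_LENGTH,
    "HOG": grid_descriptor_dim(N_SIGMA, N_TAU, HOG_BINS),
    "HOF": grid_descriptor_dim(N_SIGMA, N_TAU, HOF_BINS),
    # MBH is computed separately on the x and y flow components
    "MBH": 2 * grid_descriptor_dim(N_SIGMA, N_TAU, HOG_BINS),
}

for name, dim in dims.items():
    print(f"{name}: {dim}")
```

Under these assumptions, increasing nσ or nτ grows every histogram descriptor proportionally, which is one reason the cell counts are tuned by cross-validation rather than set as large as possible.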

Fig. 5 Illustration of the information captured by HOG, HOF, and MBH descriptors. The camera is moving from right to left, and the person is walking away from the camera. Gradient/flow orientation is indicated by color (hue) and magnitude by saturation. The optical flow (top, middle) shows constant motion in the background, which is due to the camera movements. The motion boundaries (right) encode the relative motion between the person and the background.
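The effect described in the caption can be illustrated with a toy example: taking a spatial derivative of the flow removes any constant (camera-induced) component and leaves a response only where the motion changes. This is a minimal one-dimensional sketch, not the authors' implementation; the flow values and names below are hypothetical.

```python
# Sketch: a spatial derivative of optical flow cancels constant camera motion.
# All values are hypothetical and chosen only for illustration.

def horizontal_derivative(row):
    """Forward differences along one row of a flow component."""
    return [b - a for a, b in zip(row, row[1:])]


CAMERA_FLOW = -2.0   # constant background flow caused by the camera (pixels/frame)
PERSON_FLOW = 1.5    # additional motion of the foreground person

# Horizontal flow along one image row: background, person (3 pixels), background.
flow_x = [CAMERA_FLOW] * 4 + [CAMERA_FLOW + PERSON_FLOW] * 3 + [CAMERA_FLOW] * 4

boundary = horizontal_derivative(flow_x)
print(boundary)
```

The derivative is zero wherever the flow is constant, including the whole background, and nonzero only at the two edges of the person, which is the relative motion the caption refers to.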

Table 1 Comparison of different descriptors and methods for extracting trajectories on nine datasets. We report mean average precision over all classes (mAP) for Hollywood2 and Olympic Sports, average accuracy over all classes for the other seven datasets. The three best results for each dataset are in bold.

Table 6 Comparing the optical flow algorithms of Farnebäck (2003) and Brox and Malik (2011) for extracting our dense trajectories. Results are reported on the YouTube and Hollywood2 datasets.

Fig. 10 Examples from Hollywood2 (top) and YouTube (bottom). The left column shows the overlaid image of two consecutive frames. The middle and right columns are the visualization of two optical flow methods Farnebäck (2003) and Brox and Malik (2011).

Fig. 8 Sample frames from the nine action recognition datasets used in our experiments. From top to bottom: KTH, YouTube, Hollywood2, UCF sports, IXMAS, UIUC, Olympic Sports, UCF50 and HMDB51.

Fig. 9 Performance of dense, KLT and SIFT trajectories for a varying number of features per frame. On the left the results for the MBH descriptor and on the right for a combination of trajectory, HOG, HOF and MBH.

Citations

Proceedings Article•10.1109/IJCNN.2017.7966010

Exploring quantization error to improve human action classification

Raquel Almeida, +2 more

- 01 May 2017

TL;DR: A mid-level representation, in which information about quantization errors is embedded together with the aggregated data on low level features, and a survey of the most common protocols for human action classification methods when applied to three different datasets.

Book Chapter•10.1016/B978-0-12-805195-5.00015-6

Visual information-based activity recognition and fall detection for assisted living and ehealthcare

Yixiao Yun, +1 more

- 01 Jan 2017

TL;DR: This chapter mainly focuses on describing visual information-based daily activity recognition and anomaly detection through using low-resolution visual sensors, and provides further support to the robustness of manifold-based methods.

In search of video event semantics

Masoud Mazloom

- 01 Jan 2016

TL;DR: This thesis proposes a new semantic video representation that is based on freely available social tagged videos only, without the need for training any intermediate concept detectors, and proposes an algorithm that learns from examples what concepts in bank are most informative per event.

Improving gesture recognition through spatial focus of attention

Pradyumna Narayana

- 31 Dec 2017

TL;DR: An architecture (FOANet) is proposed that divides processing among four modalities (RGB, depth, RGB flow, and depth flow), and three spatial focus of attention regions (global, left hand, and right hand), and 12 channels are fused using sparse networks.

Journal Article•10.1016/J.CVIU.2017.02.001

Image and video mining through online learning

Andrew Gilbert, +1 more

- 01 May 2017

- Computer Vision and Image Understanding

TL;DR: The approach is based around the concept of an image signature which, unlike a standard bag of words model, can express co-occurrence statistics as well as symbol frequency and efficiently compute metric distances between signatures despite their inherent high dimensionality.
