Deep Learning Based Human Activity Recognition Using Spatio-Temporal Image Formation of Skeleton Joints

doi:10.3390/APP11062675

Open Access • Journal Article • 10.3390/APP11062675

Nusrat Tasnim, +2 more

- 17 Mar 2021

- Applied Sciences

- Vol. 11, Iss: 6, pp 2675

28

TL;DR: This paper proposes a spatio-temporal image formation (STIF) technique for 3D skeleton joints that captures spatial information and temporal changes for action discrimination, and investigates how three fusion methods (element-wise average, multiplication, and maximization) affect human action recognition performance.

Abstract: Human activity recognition has become a significant research trend in computer vision, image processing, and human–machine or human–object interaction, owing to its relevance to cost-effectiveness, time management, rehabilitation, and disease pandemics. Over the past years, several methods have been published for human action recognition using RGB (red, green, and blue), depth, and skeleton datasets. Most methods introduced for action classification on skeleton datasets are constrained in some respects, including feature representation, complexity, and performance. Providing an effective and efficient method for human action discrimination using a 3D skeleton dataset therefore remains a challenging problem. There is considerable room to map the 3D skeleton joint coordinates into spatio-temporal formats in order to reduce system complexity, recognize human behaviors more accurately, and improve overall performance. In this paper, we propose a spatio-temporal image formation (STIF) technique for 3D skeleton joints that captures spatial information and temporal changes for action discrimination. We apply transfer learning (the pretrained models MobileNetV2, DenseNet121, and ResNet18, trained on the ImageNet dataset) to extract discriminative features and evaluate the proposed method with several fusion techniques. We mainly investigate how three fusion methods (element-wise average, multiplication, and maximization) affect human action recognition performance. With the STIF representation, our deep learning-based method outperforms prior works on UTD-MHAD (University of Texas at Dallas multi-modal human action dataset) and MSR-Action3D (Microsoft action 3D), two publicly available benchmark 3D skeleton datasets. We attain accuracies of approximately 98.93%, 99.65%, and 98.80% on UTD-MHAD and 96.00%, 98.75%, and 97.08% on MSR-Action3D using MobileNetV2, DenseNet121, and ResNet18, respectively.
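
The three fusion operations named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say what is fused (feature maps, class scores, or something else), so the sketch assumes same-shaped arrays from several streams. The function name `fuse` and the example scores are hypothetical.

```python
import numpy as np


def fuse(streams, method="average"):
    """Fuse a list of same-shaped arrays element-wise.

    `streams` is assumed to hold one array per stream (for example,
    class scores from separate networks); this is an illustrative
    assumption, not a detail taken from the paper.
    """
    # Stack along a new leading axis so each reduction runs across streams.
    stacked = np.stack([np.asarray(s, dtype=np.float64) for s in streams], axis=0)
    if method == "average":
        return stacked.mean(axis=0)
    if method == "multiplication":
        return stacked.prod(axis=0)
    if method == "maximization":
        return stacked.max(axis=0)
    raise ValueError(f"unknown fusion method: {method!r}")


# Hypothetical class scores from two streams over three action classes.
scores_a = np.array([0.7, 0.2, 0.1])
scores_b = np.array([0.5, 0.4, 0.1])

for name in ("average", "multiplication", "maximization"):
    fused = fuse([scores_a, scores_b], name)
    print(name, fused, "predicted class:", int(np.argmax(fused)))
```

Averaging smooths disagreement between streams, multiplication rewards classes on which all streams agree, and maximization keeps the most confident stream per element, so the three can rank classes differently on the same inputs.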

Citations

•Journal Article•10.3390/S21217225

A CSI-Based Human Activity Recognition Using Deep Learning.

Parisa Fard Moshiri, +3 more

- 30 Oct 2021

- Sensors

TL;DR: In this article, a 2D Convolutional Neural Network (CNN) classifier was used to recognize seven different human daily activities using channel state information (CSI) data collected from a Raspberry Pi 4.

65

•Journal Article•10.1049/cvi2.12119

A robust and efficient method for skeleton-based human action recognition and its application for cross-dataset evaluation

Tien Nguyen, +3 more

- 06 Jul 2022

- IET Computer Vision

TL;DR: TD-Net improves the Double-Feature Double-motion Network (DD-Net) by adding a normalised coordinates of joints (NCJ) branch to enrich the spatial information.

17

•Journal Article•10.3390/s23020778

Dynamic Edge Convolutional Neural Network for Skeleton-Based Human Action Recognition

Nusrat Tasnim, +1 more

- 01 Jan 2023

- Sensors

TL;DR: The authors design a new deep learning model that integrates crisscross attention and edge convolution to extract discriminative features from the skeleton sequence for action recognition, achieving average accuracies of 99.53% and 95.64% on the UTD-MHAD and MSR-Action3D datasets, respectively.

16

•Journal Article•10.1016/j.cmpb.2023.107344

Multi-speed transformer network for neurodegenerative disease assessment and activity recognition

Mohamed Cheriet, +4 more

- 01 Jan 2023

- Computer Methods and Programs in Biomedi...

TL;DR: In this article, the authors present a dataset named "Beside gait" containing the coordinates of extracted body joints of people at various stages of neurodegenerative disease, as well as of control subjects.

16

•Journal Article•10.3390/S21134342

Skeleton Driven Action Recognition Using an Image-Based Spatial-Temporal Representation and Convolution Neural Network.

Vinicius Silva, +4 more

- 25 Jun 2021

- Sensors

TL;DR: In this paper, the authors propose a method to automatically detect, in real time, typical and stereotypical actions of children with ASD, using the Intel RealSense and the Nuitrack SDK to detect and extract the user's joint coordinates.

16

References

•Proceedings Article•10.1109/CVPR.2017.243

Densely Connected Convolutional Networks

Gao Huang, +3 more

- 21 Jul 2017

TL;DR: DenseNet connects each layer to every other layer in a feed-forward fashion, which alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.

46.1K

•Proceedings Article•10.1109/CVPR.2018.00474

MobileNetV2: Inverted Residuals and Linear Bottlenecks

Mark Sandler, +4 more

- 18 Jun 2018

TL;DR: MobileNetV2 is based on an inverted residual structure in which the shortcut connections are between the thin bottleneck layers, and the intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity.

19.4K

•Proceedings Article•10.1109/ICCV.1999.790410

Object recognition from local scale-invariant features

David G. Lowe

- 20 Sep 1999

TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.

19.3K

•Journal Article•10.1016/J.CVIU.2007.09.014

Speeded-Up Robust Features (SURF)

Herbert Bay, +3 more

- 01 Jun 2008

- Computer Vision and Image Understanding

TL;DR: A novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.

14.9K

•Posted Content

MobileNetV2: Inverted Residuals and Linear Bottlenecks

Mark Sandler, +4 more

- 13 Jan 2018

- arXiv: Computer Vision and Pattern Recog...

TL;DR: A new mobile architecture, MobileNetV2, is described that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes and allows decoupling of the input/output domains from the expressiveness of the transformation.

13.9K