Deep Learning Based Human Activity Recognition Using Spatio-Temporal Image Formation of Skeleton Joints
TL;DR: This paper proposes a spatio-temporal image formation (STIF) technique for 3D skeleton joints that captures spatial information and temporal changes for action discrimination, and investigates how three fusion methods (element-wise average, multiplication, and maximization) affect human action recognition performance.
Abstract: Human activity recognition has become a significant research trend in computer vision, image processing, and human–machine or human–object interaction, owing to its relevance to cost-effectiveness, time management, rehabilitation, and disease pandemics. Over the past years, several methods have been published for human action recognition using RGB (red, green, and blue), depth, and skeleton datasets. Most methods introduced for action classification on skeleton datasets are limited in feature representation, complexity, or performance. Providing an effective and efficient method for human action discrimination from a 3D skeleton dataset therefore remains a challenging problem. There is considerable room to map the 3D skeleton joint coordinates into spatio-temporal formats that reduce system complexity, recognize human behaviors more accurately, and improve overall performance. In this paper, we propose a spatio-temporal image formation (STIF) technique for 3D skeleton joints that captures spatial information and temporal changes for action discrimination. We apply transfer learning (pretrained models: MobileNetV2, DenseNet121, and ResNet18, trained on the ImageNet dataset) to extract discriminative features and evaluate the proposed method with several fusion techniques. We mainly investigate how three fusion methods (element-wise average, multiplication, and maximization) affect human action recognition performance. With the STIF representation, our deep learning-based method outperforms prior works on UTD-MHAD (University of Texas at Dallas multi-modal human action dataset) and MSR-Action3D (Microsoft action 3D), two publicly available benchmark 3D skeleton datasets.
We attain accuracies of approximately 98.93%, 99.65%, and 98.80% for UTD-MHAD and 96.00%, 98.75%, and 97.08% for MSR-Action3D skeleton datasets using MobileNetV2, DenseNet121, and ResNet18, respectively.
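The three fusion rules named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say which outputs are fused, so the sketch assumes that each input is a class-score vector produced by a separate model or stream. The function name `fuse` and the example scores are hypothetical.

```python
import numpy as np


def fuse(score_list, method="average"):
    """Fuse equally shaped score arrays element-wise.

    Assumption: each entry of score_list is a class-score vector
    from one model or stream; this is an illustration only.
    """
    # Stack the inputs along a new leading axis: (n_inputs, n_classes).
    stacked = np.stack([np.asarray(s, dtype=float) for s in score_list], axis=0)
    if method == "average":
        return stacked.mean(axis=0)
    if method == "multiplication":
        return stacked.prod(axis=0)
    if method == "maximization":
        return stacked.max(axis=0)
    raise ValueError(f"unknown fusion method: {method}")


# Hypothetical scores for three classes from three sources.
a = np.array([0.7, 0.2, 0.1])
b = np.array([0.4, 0.5, 0.1])
c = np.array([0.6, 0.3, 0.1])

for method in ("average", "multiplication", "maximization"):
    fused = fuse([a, b, c], method)
    # The predicted class is the index of the largest fused score.
    print(method, fused, int(np.argmax(fused)))
```

All three rules operate position by position, so they require inputs of identical shape. Multiplication penalizes any class that one source scores low, while maximization keeps the most confident source for each class.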
Citations
A CSI-Based Human Activity Recognition Using Deep Learning.
TL;DR: In this article, a 2D Convolutional Neural Network (CNN) classifier was used to recognize seven different human daily activities using channel state information (CSI) data collected from a Raspberry Pi 4.
65
A robust and efficient method for skeleton-based human action recognition and its application for cross-dataset evaluation
TL;DR: TD-Net improves the Double-Feature Double-Motion Network (DD-Net) by adding a normalised coordinates of joints (NCJ) branch that enriches the spatial information.
17
Dynamic Edge Convolutional Neural Network for Skeleton-Based Human Action Recognition
Nusrat Tasnim, Joong-Hwan Baek
TL;DR: The authors design a new deep learning model that integrates crisscross attention and edge convolution to extract discriminative features from the skeleton sequence for action recognition, achieving average accuracies of 99.53% and 95.64% on the UTD-MHAD and MSR-Action3D datasets, respectively.
Multi-speed transformer network for neurodegenerative disease assessment and activity recognition
TL;DR: In this article, the authors present a dataset named "Beside gait" containing the coordinates of extracted body joints of people with neurodegenerative diseases at various stages of the disease, as well as of control subjects.
16
Skeleton Driven Action Recognition Using an Image-Based Spatial-Temporal Representation and Convolution Neural Network.
TL;DR: In this paper, the authors proposed a method to automatically detect in real-time typical and stereotypical actions of children with ASD by using the Intel RealSense and the Nuitrack SDK to detect and extract the user joint coordinates.
16
References
Densely Connected Convolutional Networks
Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger
- 21 Jul 2017
TL;DR: DenseNet connects each layer to every other layer in a feed-forward fashion, which alleviates the vanishing gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.
MobileNetV2: Inverted Residuals and Linear Bottlenecks
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen
- 18 Jun 2018
TL;DR: MobileNetV2 is based on an inverted residual structure in which the shortcut connections are between the thin bottleneck layers, and the intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity.
Object recognition from local scale-invariant features
David G. Lowe
- 20 Sep 1999
TL;DR: Experimental results show that robust object recognition can be achieved in cluttered partially occluded images with a computation time of under 2 seconds.
Speeded-Up Robust Features (SURF)
TL;DR: A novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features), which approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster.
14.9K
MobileNetV2: Inverted Residuals and Linear Bottlenecks
TL;DR: A new mobile architecture, MobileNetV2, is described that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes and allows decoupling of the input/output domains from the expressiveness of the transformation.
13.9K