Two-Stream Modality-Based Deep Learning Approach for Enhanced Two-Person Human Interaction Recognition in Videos
Abstract
1. Introduction
- We propose a novel two-stream deep learning-based human interaction recognition (HIR) system that integrates RGB and skeleton-based hierarchical features to address the challenges of video-based HIR, such as action complexity, motion variations, different viewpoints, and the lack of effective feature extraction methods.
- Our system uses two distinct streams. The first stream employs YOLOv8-Pose for human pose extraction, enhanced by stacked LSTM modules and a dense layer. It processes frames extracted from the videos with the pre-trained YOLOv8-Pose model, which generates bounding boxes around humans and identifies key body points, and then evaluates proximity and collisions between these bounding boxes, focusing on instances where individuals or objects come into close contact or collide. This standardised procedure forms the initial phase of our model. The second stream uses the Segment Anything Model (SAM) for segmented mesh generation, followed by an integrated LSTM-GRU network for long-range dependency feature extraction. Spatial relationships are refined by a custom filter applied to the meshes and keypoints, and the filtered mesh is then processed by an ImageNet model to produce a comprehensive feature vector.
- We introduce a novel custom filter function to enhance computational efficiency by eliminating irrelevant keypoints and mesh components, thereby improving the overall performance of the HIR system.
- Through extensive experimentation, our proposed model demonstrates superior performance on the TPIk and HAR Video datasets, achieving 96.56% and 96.16% accuracy, respectively, thereby confirming its effectiveness and reliability in video-based HIR.
2. Related Work
- Stream 1 (YOLOv8-Pose): detects key body points, generating bounding boxes and evaluating interactions between individuals or objects.
- Stream 2 (SAM-based segmentation): creates meshes processed by an LSTM-GRU network to capture long-range dependencies, accounting for both motion and environmental context.
- Multimodal fusion: by integrating RGB and skeleton data, our system provides a more robust representation of human interactions, overcoming the limitations of single-modal approaches.
- Improved accuracy and efficiency: the combination of YOLOv8-Pose and SAM enhances feature extraction, achieving 96.56% accuracy on the TPIk dataset and 96.16% on the HAR Video dataset.
- Custom filtering: our custom filter improves real-time performance by removing irrelevant data and optimising computational efficiency.
3. Dataset
3.1. Two Person Interaction Kinect (TPIk) Dataset
3.2. HAR Video Dataset
4. Proposed Method
- Skeleton stream pose extraction: The first stream employs the YOLOv8-Pose model for human pose extraction. This module processes video frames and generates bounding boxes around detected humans, identifying key body points such as joints and limbs. These keypoints are used to assess proximity and collisions between individuals or objects, especially in scenarios involving close contact or interaction. To refine these pose features further, stacked long short-term memory (LSTM) modules and a dense layer are applied, capturing complex spatial–temporal relationships for the accurate recognition of human poses and interactions.
- RGB stream segmentation and mesh generation: The second stream uses the Segment Anything Model (SAM) to generate segmented meshes from input images. These segmented outputs are passed through an integrated LSTM-GRU network, which captures long-range dependencies and dynamic temporal patterns in human movements. This stream focuses on the structural relationships between body parts across frames.
- Custom filter function: A custom filter function is applied to both streams to enhance computational efficiency. This filter eliminates irrelevant keypoints and mesh components, reducing the amount of data processed while maintaining accuracy. The filtered mesh data are processed through an ImageNet model, producing a comprehensive feature vector that captures both spatial and temporal aspects of human activities.
- Integrate RGB and skeleton features: The outputs from the two streams (RGB and skeleton) are integrated to form a unified feature representation [32]. The RGB stream provides crucial visual context, such as object appearance and surrounding environment, while the skeleton stream focuses on the structural and dynamic movements of the body. By combining these two streams, the fused representation captures both the spatial relationships from the RGB data and the temporal dynamics from the skeleton features. This unified feature vector is then passed to a final classification module, where the system accurately recognises human activities. This multimodal fusion enables the system to overcome challenges posed by occlusion, noisy environments, and complex multi-agent interactions, significantly enhancing recognition accuracy and robustness in diverse real-world scenarios.
- Optimisation for efficiency: To further optimise system performance, the custom filter function plays a crucial role in reducing computational overhead. By eliminating irrelevant features, the system processes data more efficiently, ensuring real-time performance while maintaining high accuracy.
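The integration step described in the list above can be illustrated with a minimal late-fusion sketch. It assumes the two stream outputs are plain feature vectors that are concatenated and passed to a single softmax layer. The 32- and 8-dimensional sizes follow the layer summary table in this article; the function names and the random weights are purely illustrative and are not the authors' implementation.

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(skeleton_feat, rgb_feat, weights, bias):
    """Concatenate both stream features, then classify the fused vector."""
    fused = list(skeleton_feat) + list(rgb_feat)
    return softmax(dense(fused, weights, bias))


random.seed(0)
NUM_CLASSES = 8  # eight interaction classes, as in the TPIk results table

# Stand-ins for the real stream outputs (hypothetical values).
skeleton_feat = [random.random() for _ in range(32)]  # Stream-1 output size
rgb_feat = [random.random() for _ in range(8)]        # Stream-2 output size

# Untrained, illustrative classifier weights for the 40-dimensional fused vector.
weights = [[random.uniform(-0.1, 0.1) for _ in range(40)] for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = fuse_and_classify(skeleton_feat, rgb_feat, weights, bias)
predicted_class = max(range(NUM_CLASSES), key=probs.__getitem__)
```

Concatenation keeps each stream's features intact and lets the final layer learn how much weight to give each modality, which is one common reading of "unified feature representation".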
4.1. Preprocessing: Augmentation
4.2. YOLOv8
- Speed and accuracy: inherits YOLOv8’s renowned capabilities for fast and precise detection.
- Anchor-free detection: bypasses the limitations of predefined anchor boxes, facilitating more flexible and efficient object detection.
- Dual-branch architecture: separates bounding box and keypoint detection, effectively managing the complexities of each task.
- Robust training data: trained on the COCO keypoints dataset, which includes diverse annotations and visibility conditions, improving the model’s generalisation across various pose estimation tasks.
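How per-frame keypoints might become a fixed-length sequence input can be sketched as follows. The sketch assumes 17 COCO keypoints per person, each stored as (x, y, confidence), and two people per frame, which gives 2 × 17 × 3 = 102 values per frame and matches the 102-dimensional, 20-step input in the layer summary table. The source does not state this layout, so the layout, the zero-padding policy, and the helper names `frame_to_vector` and `clip_to_sequence` are all assumptions. Obtaining the keypoints from YOLOv8-Pose is outside the sketch.

```python
NUM_KEYPOINTS = 17       # COCO keypoint count
VALUES_PER_KEYPOINT = 3  # assumed layout: x, y, confidence
MAX_PEOPLE = 2           # two-person interactions
FRAME_DIM = MAX_PEOPLE * NUM_KEYPOINTS * VALUES_PER_KEYPOINT  # 102


def frame_to_vector(people):
    """Flatten up to two people's keypoints into one 102-value vector.

    `people` is a list of persons; each person is a list of 17 (x, y, conf) tuples.
    Frames with fewer than two people are zero-padded (an assumed policy).
    """
    vector = []
    for person in people[:MAX_PEOPLE]:
        if len(person) != NUM_KEYPOINTS:
            raise ValueError("expected 17 keypoints per person")
        for x, y, conf in person:
            vector.extend((float(x), float(y), float(conf)))
    vector.extend([0.0] * (FRAME_DIM - len(vector)))
    return vector


def clip_to_sequence(frames, num_steps=20):
    """Truncate or zero-pad a clip so it always has `num_steps` frame vectors."""
    vectors = [frame_to_vector(people) for people in frames[:num_steps]]
    vectors.extend([[0.0] * FRAME_DIM for _ in range(num_steps - len(vectors))])
    return vectors


# Hypothetical clip: five frames with two people, then three frames with one person.
person = [(float(i), float(i) + 0.5, 0.9) for i in range(NUM_KEYPOINTS)]
sequence = clip_to_sequence([[person, person]] * 5 + [[person]] * 3)
```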
Position of Activity and Skeleton Extraction
4.3. Stream-1: Skeleton-Based Stream
4.4. Stream-2: Pixels-Based Stream
4.5. SAM
4.5.1. Box Filter
- Collision in x-axis:
- Collision in y-axis:
- Collision in z-axis:
- Efficient filtering: By considering the depth information provided by keypoints, our method accurately discerns whether individuals are genuinely close or merely share similar two-dimensional coordinates.
- Reduced computational complexity: This innovative approach significantly reduces computational complexity, enhancing the model’s efficiency without compromising accuracy.
- Enhanced precision: Effectively filtering out individuals not in close proximity optimises the identification of potential interactions, contributing to the model’s overall speed and precision.
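Because the per-axis collision conditions are not reproduced in this text, the following is only a generic sketch of a three-axis overlap test, not the authors' filter. It assumes each person is summarised by an axis-aligned box with minimum and maximum extents along x, y, and z, with the z extent standing in for the depth information from the keypoints. The `Box` type, the `margin` parameter, and the function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned extent of one person (z taken from keypoint depth)."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float
    z_min: float
    z_max: float


def overlaps_1d(a_min, a_max, b_min, b_max, margin=0.0):
    """True when two closed intervals overlap or lie within `margin` of each other."""
    return a_min <= b_max + margin and b_min <= a_max + margin


def boxes_collide(a, b, margin=0.0):
    """Two boxes collide only if their extents overlap on all three axes."""
    return (
        overlaps_1d(a.x_min, a.x_max, b.x_min, b.x_max, margin)
        and overlaps_1d(a.y_min, a.y_max, b.y_min, b.y_max, margin)
        and overlaps_1d(a.z_min, a.z_max, b.z_min, b.z_max, margin)
    )


def interacting_pairs(boxes, margin=0.0):
    """Return index pairs (i, j), i < j, whose boxes collide on every axis."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(boxes), 2)
        if boxes_collide(a, b, margin)
    ]


# Two people who are genuinely close, and a third who shares the same
# image-plane region but stands much further from the camera.
near = Box(0.0, 2.0, 0.0, 4.0, 1.0, 2.0)
partner = Box(1.5, 3.0, 0.0, 4.0, 1.5, 2.5)
far_behind = Box(1.5, 3.0, 0.0, 4.0, 5.0, 6.0)
pairs = interacting_pairs([near, partner, far_behind])
```

In this example only the first two boxes are paired: the third overlaps them in x and y but not in z, which is the case the depth check is meant to reject.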
4.5.2. ImageNet
4.5.3. Long Short-Term Memory (LSTM)
4.5.4. Gated Recurrent Units (GRUs)
4.6. Final Feature
5. Experimental Evaluation and Performance
5.1. Experimental Setup
5.2. Ablation Study
5.3. Performance Accuracy with the TPIk Dataset
State of the Art Comparison for TPIk Dataset
5.4. Performance Accuracy with HAR Video Dataset
5.5. In-Depth Analysis and Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
LSTM | Long short-term memory |
BiLSTM | Bi-directional long short-term memory |
CNN | Convolutional neural networks |
GMM | Gaussian mixture model |
DD-Net | Double-feature double-motion network |
GAN | Generative adversarial network |
SOTA | State of the art |
MobileNetV2 | Mobile network variant 2 |
ReLU | Rectified linear unit |
DCNN | Dilated convolutional neural network |
GRNN | General regression neural network |
KFDI | Key frames dynamic image |
QST | Quaternion spatial-temporal |
RNNs | Recurrent neural networks |
BT-LSTM | Block-term long short-term memory |
KF | Kalman filter |
KM-Model | Keypoint mesh model |
SRGB-Model | Segmented RGB model |
References
- Ullah, A.; Ullah, A.; Ahmad, J.; Muhammad, K.; Sajjad, M.; Baik, S.W. Action recognition in video sequences using deep Bi-directional LSTM with CNN features. IEEE Access 2017, 6, 1155–1166.
- Hassan, N.; Miah, A.S.M.; Shin, J. Enhancing Human Action Recognition in Videos through Dense-Level Features Extraction and Optimized Long Short-Term Memory. In Proceedings of the 2024 7th International Conference on Electronics, Communications, and Control Engineering (ICECC), Kuala Lumpur, Malaysia, 22–24 March 2024; pp. 19–23.
- Hassan, N.; Miah, A.S.M.; Shin, J. A Deep Bidirectional LSTM Model Enhanced by Transfer-Learning-Based Feature Extraction for Dynamic Human Activity Recognition. Appl. Sci. 2024, 14, 603.
- Egawa, R.; Miah, A.S.M.; Hirooka, K.; Tomioka, Y.; Shin, J. Dynamic Fall Detection Using Graph-Based Spatial Temporal Convolution and Attention Network. Electronics 2023, 12, 3234.
- Ullah, A.; Muhammad, K.; Del Ser, J.; Baik, S.W.; de Albuquerque, V.H.C. Activity recognition using temporal optical flow convolutional features and multi-layer LSTM. IEEE Trans. Ind. Electron. 2018, 66, 9692–9702.
- Zhang, S.; Li, Y.; Zhang, S.; Shahabi, F.; Xia, S.; Deng, Y.; Alshurafa, N. Deep learning in human activity recognition with wearable sensors: A review on advances. Sensors 2022, 22, 1476.
- Mekruksavanich, S.; Jitpattanakul, A. LSTM networks using smartphone data for sensor-based human activity recognition in smart homes. Sensors 2021, 21, 1636.
- Khan, M.A.; Javed, K.; Khan, S.A.; Saba, T.; Habib, U.; Khan, J.A.; Abbasi, A.A. Human action recognition using fusion of multiview and deep features: An application to video surveillance. Multimed. Tools Appl. 2024, 83, 14885–14911.
- Liu, Y.; Cui, J.; Zhao, H.; Zha, H. Fusion of low-and high-dimensional approaches by trackers sampling for generic human motion tracking. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), Tsukuba, Japan, 11–15 November 2012; IEEE: Piscataway, NJ, USA, 2012.
- Yun, K.; Honorio, J.; Chattopadhyay, D.; Berg, T.L.; Samaras, D. Two-person interaction detection using body-pose features and multiple instance learning. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 28–35.
- Hu, T.; Zhu, X.; Guo, W.; Su, K. Efficient interaction recognition through positive action representation. Math. Probl. Eng. 2013, 2013, 795360.
- Saha, S.; Konar, A.; Janarthanan, R. Two person interaction detection using kinect sensor. In Proceedings of the Facets of Uncertainties and Applications: ICFUA, Kolkata, India, December 2013; Springer: New Delhi, India; pp. 167–176.
- Yang, F.; Wu, Y.; Sakti, S.; Nakamura, S. Make skeleton-based action recognition model smaller, faster and better. In Proceedings of the ACM Multimedia Asia, Beijing, China, 16–18 December 2019; pp. 1–6.
- Ray, A.; Kolekar, M.H.; Balasubramanian, R.; Hafiane, A. Transfer learning enhanced vision-based human activity recognition: A decade-long analysis. Int. J. Inf. Manag. Data Insights 2023, 3, 100142.
- Lalwani, P.; Ramasamy, G. Human activity recognition using a multi-branched CNN-BiLSTM-BiGRU model. Appl. Soft Comput. 2024, 154, 111344.
- Li, T.; Sawanagi, T.; Nakanishi, H. Interaction Recognition between Two Persons from Individual Features Using LSTM-CRF Based on 3D Skeleton Data. In Proceedings of the 63rd Joint Conference on Automatic Control. Joint Conference on Automatic Control, Online, 21–22 November 2020; pp. 220–224.
- Hsueh, Y.L.; Lie, W.N.; Guo, G.Y. Human behavior recognition from multiview videos. Inf. Sci. 2020, 517, 275–296.
- Miah, A.S.M.; Hasan, M.A.M.; Nishimura, S.; Shin, J. Sign Language Recognition using Graph and General Deep Neural Network Based on Large Scale Dataset. IEEE Access 2024, 12, 34553–34569.
- Qi, J.; Yang, P.; Newcombe, L.; Peng, X.; Yang, Y.; Zhao, Z. An overview of data fusion techniques for Internet of Things enabled physical activity recognition and measure. Inf. Fusion 2020, 55, 269–280.
- Franco, A.; Magnani, A.; Maio, D. A multimodal approach for human activity recognition based on skeleton and RGB data. Pattern Recognit. Lett. 2020, 131, 293–299.
- Miah, A.S.M.; Hasan, M.A.M.; Shin, J. Dynamic Hand Gesture Recognition using Multi-Branch Attention Based Graph and General Deep Learning Model. IEEE Access 2023, 11, 4703–4716.
- Miah, A.S.M.; Shin, J.; Hasan, M.A.M.; Okuyama, Y.; Nobuyoshi, A. Dynamic Hand Gesture Recognition Using Effective Feature Extraction and Attention Based Deep Neural Network. In Proceedings of the 2023 IEEE 16th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip (MCSoC), Singapore, 18–21 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 241–247.
- Miah, A.S.M.; Hasan, M.A.M.; Okuyama, Y.; Tomioka, Y.; Shin, J. Spatial–temporal attention with graph and general neural network-based sign language recognition. Pattern Anal. Appl. 2024, 27, 37.
- Rahim, M.A.; Miah, A.S.M.; Sayeed, A.; Shin, J. Hand gesture recognition based on optimal segmentation in human-computer interaction. In Proceedings of the 2020 3rd IEEE International Conference on Knowledge Innovation and Invention (ICKII), Kaohsiung, Taiwan, 21–23 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 163–166.
- Miah, A.S.M.; Shin, J.; Al Mehedi Hasan, M.; Rahim, M.A.; Okuyama, Y. Rotation, Translation And Scale Invariant Sign Word Recognition Using Deep Learning. Comput. Syst. Sci. Eng. 2023, 44, 2521–2536.
- Rahim, M.A.; Miah, A.S.M.; Akash, H.S.; Shin, J.; Hossain, M.I.; Hossain, M.N. An Advanced Deep Learning Based Three-Stream Hybrid Model for Dynamic Hand Gesture Recognition. arXiv 2024, arXiv:2408.08035.
- Miah, A.S.M.; Shin, J.; Hasan, M.A.M.; Fujimoto, Y.; Nobuyoshi, A. Skeleton-based hand gesture recognition using geometric features and spatio-temporal deep learning approach. In Proceedings of the 2023 11th European Workshop on Visual Information Processing (EUVIP), Gjøvik, Norway, 11–14 September 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6.
- Miah, A.S.M.; Hasan, M.A.M.; Shin, J.; Okuyama, Y.; Tomioka, Y. Multistage Spatial Attention-Based Neural Network for Hand Gesture Recognition. Computers 2023, 12, 13.
- Mallik, B.; Rahim, M.A.; Miah, A.S.M.; Yun, K.S.; Shin, J. Virtual Keyboard: A Real-Time Hand Gesture Recognition-Based Character Input System Using LSTM and Mediapipe Holistic. Comput. Syst. Sci. Eng. 2024, 48, 555–570.
- Shin, J.; Miah, A.S.M.; Kabir, M.H.; Rahim, M.A.; Al Shiam, A. A Methodological and Structural Review of Hand Gesture Recognition Across Diverse Data Modalities. IEEE Access 2024, 12, 142606–142639.
- Khaire, P.; Kumar, P. Deep learning and RGB-D based human action, human–human and human–object interaction recognition: A survey. J. Vis. Commun. Image Represent. 2022, 86, 103531.
- Pan, H.; Tong, S.; Wei, X.; Teng, B. Fatigue state recognition system for miners based on a multi-modal feature extraction and fusion framework. In IEEE Transactions on Cognitive and Developmental Systems; IEEE: Piscataway, NJ, USA, 2024; pp. 1–10.
- Saeed, S.M.; Akbar, H.; Nawaz, T.; Elahi, H.; Khan, U.S. Body-Pose-Guided Action Recognition with Convolutional Long Short-Term Memory (LSTM) in Aerial Videos. Appl. Sci. 2023, 13, 9384.
- Shin, J.; Miah, A.S.M.; Akiba, Y.; Hirooka, K.; Hassan, N.; Hwang, Y.S. Korean Sign Language Alphabet Recognition through the Integration of Handcrafted and Deep Learning-Based Two-Stream Feature Extraction Approach. IEEE Access 2024, 12, 68303–68318.
- Shin, J.; Miah, A.S.M.; Suzuki, K.; Hirooka, K.; Hasan, M.A.M. Dynamic Korean Sign Language Recognition Using Pose Estimation Based and Attention-Based Neural Network. IEEE Access 2023, 11, 143501–143513.
- Gammulle, H.; Denman, S.; Sridharan, S.; Fookes, C. Two stream lstm: A deep fusion framework for human action recognition. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017.
- Dua, N.; Singh, S.N.; Semwal, V.B. Multi-input CNN-GRU based human activity recognition using wearable sensors. Computing 2021, 103, 1461–1478.
- Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv 2016, arXiv:1603.04467.
- Jin, S.; Wang, X.; Meng, Q. Spatial memory-augmented visual navigation based on hierarchical deep reinforcement learning in unknown environments. Knowl.-Based Syst. 2024, 285, 111358.
Augmentation Technique | Range |
---|---|
Rotation range | 20 |
Width shift range | 0.2 |
Height shift range | 0.2 |
Shear range | 0.2 |
Zoom range | 0.2 |
Horizontal flip | True |
Fill mode | Nearest |
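To show how the ranges in the table translate into per-image transform parameters, the sketch below samples one random parameter set. The key names mirror the table rows. The interpretation (rotation and shear as symmetric ranges in degrees, shifts as fractions of the image size, zoom as a factor in [1 − r, 1 + r], flips applied with probability 0.5) follows the convention of the Keras `ImageDataGenerator` arguments of the same names; whether the authors used that class is an assumption, as is the 224 × 224 image size. Applying the transform to pixels is left out.

```python
import random

# Values copied from the augmentation table.
AUGMENTATION = {
    "rotation_range": 20,
    "width_shift_range": 0.2,
    "height_shift_range": 0.2,
    "shear_range": 0.2,
    "zoom_range": 0.2,
    "horizontal_flip": True,
    "fill_mode": "nearest",
}


def sample_transform(cfg, width, height, rng=random):
    """Draw one random set of transform parameters from the configured ranges."""
    zoom = cfg["zoom_range"]
    return {
        "rotation_deg": rng.uniform(-cfg["rotation_range"], cfg["rotation_range"]),
        "shift_x_px": rng.uniform(-cfg["width_shift_range"], cfg["width_shift_range"]) * width,
        "shift_y_px": rng.uniform(-cfg["height_shift_range"], cfg["height_shift_range"]) * height,
        "shear_deg": rng.uniform(-cfg["shear_range"], cfg["shear_range"]),
        "zoom": rng.uniform(1 - zoom, 1 + zoom),
        "flip_horizontal": cfg["horizontal_flip"] and rng.random() < 0.5,
        "fill_mode": cfg["fill_mode"],  # how pixels exposed by the transform are filled
    }


# A seeded generator keeps the draw reproducible; the image size is hypothetical.
params = sample_transform(AUGMENTATION, width=224, height=224, rng=random.Random(42))
```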
Layer (Type) | Output Shape | Param # | Connected to |
---|---|---|---|
lstm_input (InputLayer) | (None, 20, 102) | 0 | [] |
input_1 (InputLayer) | (None, 20, 2048) | 0 | [] |
lstm | (None, 20, 64) | 42,752 | ['lstm_input[0][0]'] |
lstm_3 | (None, 20, 64) | 540,928 | ['input_1[0][0]'] |
lstm_1 | (None, 20, 128) | 98,816 | ['lstm[0][0]'] |
gru | (None, 8) | 1776 | ['lstm_3[0][0]'] |
lstm_2 | (None, 64) | 49,408 | ['lstm_1[0][0]'] |
dropout | (None, 8) | 0 | ['gru[0][0]'] |
dense | (None, 64) | 4160 | ['lstm_2[0][0]'] |
dense_2 | (None, 8) | 72 | ['dropout[0][0]'] |
dense_1 | (None, 32) | 2080 | ['dense[0][0]'] |
concatenate | (None, 40) | 0 | ['dense_2[0][0]', 'dense_1[0][0]'] |
dense_3 | (None, 8) | 328 | ['concatenate[0][0]'] |
Total params: 740,320 | Trainable params: 740,320 | Non-trainable params: 0 |
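The parameter counts in the table can be checked by hand. The sketch below uses the standard formulas for LSTM, GRU, and dense layers with the input and output sizes read from the table. The GRU formula assumes the variant with separate input and recurrent bias vectors (the Keras default, `reset_after=True`), which is the one that reproduces 1776. The function names are illustrative.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and one bias."""
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim, units):
    """Three gates, each with input weights, recurrent weights, and two biases."""
    return 3 * (units * (input_dim + units) + 2 * units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units


layer_params = {
    "lstm": lstm_params(102, 64),      # skeleton stream, 102 features per step
    "lstm_1": lstm_params(64, 128),
    "lstm_2": lstm_params(128, 64),
    "dense": dense_params(64, 64),
    "dense_1": dense_params(64, 32),
    "lstm_3": lstm_params(2048, 64),   # RGB stream, 2048 features per step
    "gru": gru_params(64, 8),
    "dense_2": dense_params(8, 8),
    "dense_3": dense_params(40, 8),    # classifier on the 40-value fused vector
}
total_params = sum(layer_params.values())
```

Every per-layer value matches the table, and their sum is 740,320, the figure reported in the table's last row. The input, dropout, and concatenate layers contribute no parameters.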
Dataset Name | Modality | Modality Name | Stream Name | Validation Accuracy | Validation Loss |
---|---|---|---|---|---|
TPIk Dataset | Single-Modality | Only Skeleton | Stream-1 | 71.00 | 1.82 |
TPIk Dataset | Single-Modality | Only RGB | Stream-2 | 95.00 | 0.14 |
TPIk Dataset | Multi-Modality | Skeleton+RGB | Stream-1 + Stream-2 | 96.56 | 0.14 |
HAR Video Dataset | Single-Modality | Only Skeleton | Stream-1 | 89.42 | 0.43 |
HAR Video Dataset | Single-Modality | Only RGB | Stream-2 | 94.00 | 0.14 |
HAR Video Dataset | Multi-Modality | Skeleton+RGB | Stream-1 + Stream-2 | 96.16 | 0.10 |
Task | Precision (%) | Recall (%) | F1 Score (%) | Accuracy (%) |
---|---|---|---|---|
Close Up | 98.36 | 98.36 | 98.36 | 98.36 |
Get Away | 99.18 | 97.58 | 98.37 | 97.58 |
Kick | 99.19 | 96.09 | 97.62 | 96.09 |
Push | 99.07 | 99.07 | 99.07 | 99.07 |
Shake Hands | 96.15 | 98.04 | 97.09 | 98.04 |
Hug | 97.87 | 90.20 | 93.88 | 90.20 |
Give a Notebook | 91.74 | 97.37 | 94.47 | 97.37 |
Punch | 89.77 | 91.86 | 90.80 | 91.86 |
Average | 96.42 | 96.07 | 96.21 | 96.56 |
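The F1 column and the Average row can be reproduced from the precision and recall columns. The sketch assumes that F1 is the harmonic mean of precision and recall and that the Average row is an unweighted (macro) mean over the eight classes. Both assumptions are consistent with the reported numbers but are not stated in the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) per class, copied from the table.
ROWS = {
    "Close Up": (98.36, 98.36),
    "Get Away": (99.18, 97.58),
    "Kick": (99.19, 96.09),
    "Push": (99.07, 99.07),
    "Shake Hands": (96.15, 98.04),
    "Hug": (97.87, 90.20),
    "Give a Notebook": (91.74, 97.37),
    "Punch": (89.77, 91.86),
}

f1_by_class = {name: f1_score(p, r) for name, (p, r) in ROWS.items()}

# Unweighted means over classes (macro averages).
macro_precision = sum(p for p, _ in ROWS.values()) / len(ROWS)
macro_recall = sum(r for _, r in ROWS.values()) / len(ROWS)
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

for name, value in f1_by_class.items():
    print(f"{name}: F1 = {value:.2f}")
print(f"macro P/R/F1 = {macro_precision:.2f} / {macro_recall:.2f} / {macro_f1:.2f}")
```

Note that the per-class Accuracy column equals the Recall column in every row, which is what per-class accuracy reduces to when it is computed over the samples of that class only.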
Author | Dataset Name | Feature Extraction | Classifier | Accuracy (%) |
---|---|---|---|---|
Yun et al. [10] | Two Person Interaction Kinect Dataset | Joint feature | SVM | 80.03 |
Yun et al. [10] | Two Person Interaction Kinect Dataset | Plane feature | SVM | 73.80 |
Yun et al. [10] | Two Person Interaction Kinect Dataset | Velocity feature | SVM | 48.03 |
Yun et al. [10] | Two Person Interaction Kinect Dataset | Joint+plane feature | SVM | 79.00 |
Yun et al. [10] | Two Person Interaction Kinect Dataset | Joint+velocity | SVM | 80.02 |
Yun et al. [10] | Two Person Interaction Kinect Dataset | Velocity+plane | SVM | 74.44 |
Yun et al. [10] | Two Person Interaction Kinect Dataset | All feature | SVM | 79.00 |
Hu et al. [11] | Two Person Interaction Kinect Dataset | Joint, plane, velocity | SVM | 81.67 |
Hu et al. [11] | Two Person Interaction Kinect Dataset | Joint, plane, velocity | MIL | 83.33 |
Saha et al. [12] | Two Person Interaction Kinect Dataset | Rotation invariance Rotation variance phenomenon | SVM | 90.00 |
Li et al. [16] | Two Person Interaction Kinect Dataset | Torso (forward, back, still, turning), Hand (left right), Arm (right, left with free and raised), Leg(forward, back, kick, still, both) | LSTM | 90.40 |
Li et al. [16] | Two Person Interaction Kinect Dataset | Torso (forward, back, still, turning), Hand (left right), Arm (right, left with free and raised), Leg(forward, back, kick, still, both) | LSTM-CRF | 90.60 |
Proposed Method | Two Person Interaction Kinect Dataset | Two-stream DL | DL | 96.56 |
Task | Precision (%) | Recall (%) | F1 Score (%) | Accuracy (%) |
---|---|---|---|---|
Clapping | 100.00 | 100.00 | 100.00 | 100.00 |
Meet and Split | 100.00 | 100.00 | 100.00 | 100.00 |
Sitting | 100.00 | 100.00 | 100.00 | 100.00 |
Standing Still | 100.00 | 100.00 | 100.00 | 100.00 |
Walking | 91.67 | 94.29 | 92.96 | 94.29 |
Walking While Reading Book | 100.00 | 94.44 | 97.14 | 94.44 |
Walking While Using Phone | 86.67 | 89.66 | 88.14 | 89.66 |
Combined Features | 96.90 | 96.91 | 96.89 | 96.16 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Akash, H.S.; Rahim, M.A.; Miah, A.S.M.; Lee, H.-S.; Jang, S.-W.; Shin, J. Two-Stream Modality-Based Deep Learning Approach for Enhanced Two-Person Human Interaction Recognition in Videos. Sensors 2024, 24, 7077. https://doi.org/10.3390/s24217077