ARMNet: A Network for Image Dimensional Emotion Prediction Based on Affective Region Extraction and Multi-Channel Fusion
Abstract
1. Introduction
- (1) A method for extracting union affective regions is proposed that combines eye fixation detection and attention detection to expand the effective emotional area. This method extracts joint affective regions composed of objects and background, which contribute strongly to emotion prediction.
- (2) An improved channel attention mechanism is proposed that adds a gating mechanism and fuses multi-level features, accounting for their different contributions through attention-based adaptive weight adjustment.
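
As a rough illustration of contribution (1), the sketch below merges an eye fixation map and an attention map into one union region. It assumes both maps are 2-D arrays of the same size, min-max normalized to [0, 1], and merged with an element-wise maximum followed by a threshold. The function names, the merge rule, and the threshold of 0.5 are illustrative assumptions, not the procedure used in ARMNet.

```python
import numpy as np


def normalize_map(m: np.ndarray) -> np.ndarray:
    """Min-max scale a 2-D map to [0, 1]; a constant map becomes all zeros."""
    m = m.astype(np.float64)
    span = m.max() - m.min()
    if span == 0:
        return np.zeros_like(m)
    return (m - m.min()) / span


def union_region(fixation: np.ndarray, attention: np.ndarray,
                 threshold: float = 0.5) -> np.ndarray:
    """Hypothetical merge rule: keep a pixel if either map marks it as salient."""
    merged = np.maximum(normalize_map(fixation), normalize_map(attention))
    return merged >= threshold  # boolean mask of the union region


# Toy 3x3 maps: the fixation map highlights the object (center pixel),
# the attention map highlights part of the background (top row).
fixation = np.array([[0.0, 0.1, 0.0],
                     [0.1, 1.0, 0.1],
                     [0.0, 0.1, 0.0]])
attention = np.array([[0.9, 1.0, 0.9],
                      [0.1, 0.2, 0.1],
                      [0.0, 0.0, 0.0]])

mask = union_region(fixation, attention)
print(mask.astype(int))
```

In this toy case the mask covers both the center object pixel and the top background row, which is the sense in which a union region is larger than either map's region alone.
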
2. Related Work
2.1. Image Emotion Analysis Based on Specific Affective Regions
2.2. Image Emotion Analysis Based on Multi-Level Features Fusion
3. Method
3.1. The Union Affective Region Extraction Module
3.1.1. The Multi-Level Features Fusion Module
3.1.2. The Human Eye Fixation Detection Module
3.1.3. The Spatial Attention Module
3.2. The Improved Channel Attention Module
3.3. The VA Values Prediction Module
4. Experiment and Results Analysis
4.1. Implementation Details
4.2. Datasets
4.3. Performance Comparison
4.4. Ablation Experiments
- (1) According to rows 1 and 2 of Table 2, the model with the multi-level features fusion module outperforms the model without it: the MSE values for valence and arousal decrease by 4.45% and 1.58%, respectively.
- (2) According to rows 1, 3, 4, and 5 of Table 2, both the eye fixation detection module and the spatial attention mechanism improve performance, and their combination performs better than either module alone. This supports including both an eye fixation detection module and a spatial attention module in ARMNet. For example, the eye fixation detection module reduces the MSE values for valence and arousal by 1.33% and 1.56%, respectively (row 5 versus row 4). The spatial attention mechanism module reduces the MSE value by 3.26% in the valence domain (row 5 versus row 3), while the MSE value in the arousal domain is almost unchanged.
- (3) According to rows 6 and 7 of Table 2, CAM and SAM yield similar MSE values when each is added to R + M + S, but R + M + S + SAM slightly outperforms R + M + S + CAM in both the valence and arousal domains. This suggests that although CAM effectively captures channel-wise information, SAM is the more robust choice for the emotion prediction task.
- (4) According to rows 6 and 10 of Table 2, adding the channel attention mechanism module reduces the MSE values for valence and arousal by 2.64% and 1.84%, respectively, which supports the effectiveness of the channel attention mechanism module.
- (5) As rows 9 and 10 of Table 2 show, when CAM is introduced, R + M + S + SAM + CAM outperforms R + M + S + CBAM + CAM. This indicates that although CBAM has advantages in certain setups, SAM combined with CAM provides more stable and superior performance for emotion prediction.
- (6) Furthermore, the spatial-channel attention module, which comprises a spatial attention mechanism module and a channel attention mechanism module, is designed by adding a gating mechanism to the CBAM module. The results are shown in rows 6, 8, 9, and 10 of Table 2. They indicate that the CBAM module is effective but that the spatial-channel attention module performs better: compared with the network using the CBAM module, the network with the spatial-channel attention module reduces the MSE values for valence and arousal by 1.03% and 2.49%, respectively.
No. | Model | MSE_V ↓ | MSE_A ↓ | MAE_V ↓ | MAE_A ↓ | R2_V ↑ | R2_A ↑ |
---|---|---|---|---|---|---|---|
1 | R | 0.02701 | 0.02246 | 0.1289 | 0.1199 | 0.3644 | 0.2467 |
2 | R + M | 0.02586 | 0.02211 | 0.1259 | 0.1187 | 0.3912 | 0.2586 |
3 | R + S | 0.02569 | 0.02050 | 0.1259 | 0.1151 | 0.3956 | 0.3126 |
4 | R + SAM | 0.02521 | 0.02082 | 0.1247 | 0.1153 | 0.4067 | 0.3018 |
5 | R + S + SAM | 0.02488 | 0.02050 | 0.1237 | 0.1151 | 0.4145 | 0.3124 |
6 | R + M + S + SAM | 0.02488 | 0.02049 | 0.1235 | 0.1151 | 0.4145 | 0.3129 |
7 | R + M + S + CAM | 0.02497 | 0.02100 | 0.1239 | 0.1159 | 0.4123 | 0.2955 |
8 | R + M + S + CBAM | 0.02449 | 0.02062 | 0.1225 | 0.1150 | 0.4237 | 0.3084 |
9 | R + M + S + CBAM + CAM | 0.02797 | 0.02433 | 0.1304 | 0.1236 | 0.3417 | 0.1839 |
10 | R + M + S + SAM + CAM | 0.02424 | 0.02012 | 0.1217 | 0.1137 | 0.4294 | 0.3249 |
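
The percentage reductions quoted in the ablation discussion can be recomputed from the MSE columns of the table above. The sketch below does this for the row pairs that reproduce the quoted figures. One point is an inference rather than something stated in the text: the quoted percentages match when the difference is divided by the lower (improved) MSE, not by the baseline MSE. The helper names are illustrative.

```python
# MSE values (valence, arousal) copied from the ablation table, keyed by row number.
mse = {
    1: (0.02701, 0.02246),
    2: (0.02586, 0.02211),
    3: (0.02569, 0.02050),
    4: (0.02521, 0.02082),
    5: (0.02488, 0.02050),
    6: (0.02488, 0.02049),
    8: (0.02449, 0.02062),
    10: (0.02424, 0.02012),
}


def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction relative to the improved (lower) value.

    Assumption: this denominator is inferred from the quoted numbers;
    dividing by `before` instead gives slightly smaller percentages.
    """
    return (before - after) / after * 100.0


def compare(row_before: int, row_after: int) -> tuple[float, float]:
    """Return (valence, arousal) MSE reductions between two table rows."""
    (v0, a0), (v1, a1) = mse[row_before], mse[row_after]
    return round(reduction_pct(v0, v1), 2), round(reduction_pct(a0, a1), 2)


# Row pairs: fusion module, eye fixation, spatial attention,
# channel attention, and spatial-channel attention versus CBAM.
for pair in [(1, 2), (4, 5), (3, 5), (6, 10), (8, 10)]:
    print(pair, compare(*pair))
```

With the baseline as the denominator, the first comparison would give about 4.26% for valence rather than 4.45%, so the choice of convention matters when comparing against other papers.
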
4.5. Visualization
4.5.1. Visualization of the Spatial Attention Module
4.5.2. Visualization of the Improved Channel Attention
- (1) The different parameters , , and indicate that the network assigns different importance to different channel feature descriptors.
- (2) The specific values of , , and are different. The gating weights of the output features from the Conv3 and Conv4 branches are approximately zero, and those from the Conv2 branch are small. However, the gating weights of the output features from the Conv5 branch are larger and fluctuate more sharply, which indicates a greater influence on the final prediction.
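
To make the effect of the gating weights concrete, the toy sketch below scales the pooled channel descriptors of four branches by per-branch gates and measures how much each branch contributes to the fused vector. Everything here is an assumption for illustration: the 8-channel descriptors, the gate values (chosen only to mimic the qualitative pattern described above), and fusion by concatenation. It is not the gating formulation used in ARMNet.

```python
import numpy as np

# Hypothetical pooled channel descriptors: 8 channels per branch, all ones,
# so that any difference in contribution comes from the gates alone.
branches = {name: np.ones(8) for name in ("conv2", "conv3", "conv4", "conv5")}

# Illustrative gating weights mirroring the qualitative pattern in the text:
# Conv3 and Conv4 near zero, Conv2 small, Conv5 larger and more variable.
gates = {
    "conv2": np.full(8, 0.1),
    "conv3": np.full(8, 1e-3),
    "conv4": np.full(8, 1e-3),
    "conv5": np.linspace(0.5, 1.5, 8),
}


def gated_fusion(features: dict, weights: dict) -> np.ndarray:
    """Scale each branch channel-wise by its gate, then concatenate."""
    return np.concatenate([features[k] * weights[k] for k in features])


fused = gated_fusion(branches, gates)

# Share of the fused vector's squared magnitude contributed by each branch.
energy = {k: float(np.sum((branches[k] * gates[k]) ** 2)) for k in branches}
total = sum(energy.values())
for name, value in energy.items():
    print(f"{name}: {value / total:.4f}")
```

Under these assumed gates, almost all of the fused vector's magnitude comes from the Conv5 branch, which matches the intuition that branches with near-zero gates have little influence on the prediction.
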
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Chen, H.; Shao, F.; Mu, B.; Jiang, Q. Image Aesthetics Assessment With Emotion-Aware Multibranch Network. IEEE Trans. Instrum. Meas. 2024, 73, 1–15.
- Su, Z.; Feng, Y.; Liu, J.; Peng, J.; Jiang, W.; Liu, J. An Audiovisual Correlation Matching Method Based on Fine-Grained Emotion and Feature Fusion. Sensors 2024, 24, 5681.
- Kosti, M.V.; Georgakopoulou, N.; Diplaris, S.; Pistola, T.; Chatzistavros, K.; Xefteris, V.-R.; Tsanousa, A.; Vrochidis, S.; Kompatsiaris, I. Assessing Virtual Reality Spaces for Elders Using Image-Based Sentiment Analysis and Stress Level Detection. Sensors 2023, 23, 4130.
- Horvat, M.; Jović, A.; Burnik, K. Investigation of Relationships between Discrete and Dimensional Emotion Models in Affective Picture Databases Using Unsupervised Machine Learning. Appl. Sci. 2022, 12, 7864.
- Li, H.; Lu, Y.; Zhu, H. Multi-Modal Sentiment Analysis Based on Image and Text Fusion Based on Cross-Attention Mechanism. Electronics 2024, 13, 2069.
- Zhao, S.; Yao, X. An Overview of Image Affective Computing. Intell. Comput. Appl. 2017, 7, 1–5.
- Poria, S.; Cambria, E.; Bajpai, R.; Hussain, A. A review of affective computing: From unimodal analysis to multimodal fusion. Inf. Fusion 2017, 37, 98–125.
- Alarcão, M.; Ribeiro, C.; Garcia, N.; Maruta, C.; Fonseca, M.J. Unfolding hand-crafted features contribution in CNNs for valence and arousal estimation in images. J. Vis. Commun. Image Represent 2022, 13–26.
- Lang, P.J.; Bradley, M.M.; Cuthbert, B.N. International affective picture system (IAPS): Technical manual and affective ratings. NIMH Cent. Study Emot. Atten. 1997, 1, 3.
- Marchewka, A.; Żurawski, Ł.; Jednoróg, K.; Grabowska, A. The Nencki Affective Picture System (NAPS): Introduction to a novel, standardized, wide-range, high-quality, realistic picture database. Behav. Res. Methods 2014, 46, 596–610.
- Dan-Glauser, E.S.; Scherer, K.R. The Geneva affective picture database (GAPED): A new 730-picture database focusing on valence and normative significance. Behav. Res. Methods 2011, 43, 468–477.
- Kurdi, B.; Lozano, S.; Banaji, M.R. Introducing the open affective standardized image set (OASIS). Behav. Res. Methods 2017, 49, 457–470.
- Kim, H.-R.; Kim, Y.-S.; Kim, S.J.; Lee, I.-K. Building emotional machines: Recognizing image emotions through deep neural networks. IEEE Trans. Multimed. 2018, 20, 2980–2992.
- Yan, M.; Xiong, R.; Wang, Y.; Li, C. Edge Computing Task Offloading Optimization for a UAV-assisted Internet of Vehicles via Deep Reinforcement Learning. IEEE Trans. Veh. Technol. 2024, 73, 5647–5658.
- Yan, M.; Luo, M.; Chan, C.A.; Gygax, A.F.; Li, C.; I, C.-L. Energy-Efficient Content Fetching Strategies in Cache-Enabled D2D Networks via an Actor-Critic Reinforcement Learning Structure. IEEE Trans. Veh. Technol. 2024; early access.
- Zhao, S.; Jia, Z.; Chen, H.; Li, L.; Ding, G.; Keutzer, K. PDANet: Polarity-consistent deep attention network for fine-grained visual emotion regression. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 192–201.
- Li, B.; Ren, H.; Jiang, X.; Miao, F.; Feng, F.; Jin, L. SCEP—A new image dimensional emotion recognition model based on spatial and channel-wise attention mechanisms. IEEE Access 2021, 9, 25278–25290.
- Deng, Z.; Zhu, Q.; He, P.; Zhang, D.; Luo, Y. A Saliency Detection and Gram Matrix Transform-Based Convolutional Neural Network for Image Emotion Classification. Secur. Commun. Netw. 2021, 2021, 6854586.
- Sowmyayani, S.; Rani, P. Salient object-based visual sentiment analysis by combining deep features and handcrafted features. Multimed. Tools Appl. 2022, 81, 7941–7955.
- Rao, T.; Li, X.; Zhang, H.; Xu, M. Multi-level region-based convolutional neural network for image emotion classification. Neurocomputing 2019, 333, 429–439.
- Zhu, X.; Li, L.; Zhang, W.; Rao, T.; Xu, M.; Huang, Q.; Xu, D. Dependency Exploitation: A Unified CNN-RNN Approach for Visual Emotion Recognition. In Proceedings of the IJCAI, Melbourne, Australia, 19–25 August 2017; pp. 3595–3601.
- Rao, T.; Li, X.; Xu, M. Learning multi-level deep representations for image emotion classification. Neural Process. Lett. 2020, 51, 2043–2061.
- She, D.; Yang, J.; Cheng, M.-M.; Lai, Y.-K.; Rosin, P.L.; Wang, L. WSCNet: Weakly supervised coupled networks for visual sentiment classification and detection. IEEE Trans. Multimed. 2019, 22, 1358–1371.
- Xiong, H.; Liu, Q.; Song, S.; Cai, Y. Region-based convolutional neural network using group sparse regularization for image sentiment classification. EURASIP J. Image Video Process 2019, 2019, 30.
- Yang, J.; She, D.; Sun, M.; Cheng, M.-M.; Rosin, P.L.; Wang, L. Visual sentiment prediction based on automatic discovery of affective regions. IEEE Trans. Multimed. 2018, 20, 2513–2525.
- Yao, X.; She, D.; Zhao, S.; Liang, J.; Lai, Y.-K.; Yang, J. Attention-aware polarity sensitive embedding for affective image retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1140–1150.
- Wang, W.; Shen, J.; Dong, X.; Borji, A.; Yang, R. Inferring salient objects from human fixations. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 1913–1927.
- Huang, X.; Shen, C.; Boix, X.; Zhao, Q. SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 262–270.
- Zhang, H.; Xu, M. Weakly supervised emotion intensity prediction for recognition of emotions in images. IEEE Trans. Multimed. 2020, 23, 2033–2044.
- Nagappan, S.; Tan, J.Q.; Wong, L.K.; See, J. Context-Aware Multi-Stream Networks for Dimensional Emotion Prediction in Images. In Proceedings of the 2023 IEEE International Conference on Image Processing (ICIP), Kuala Lumpur, Malaysia, 8–11 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2480–2484.
- Rapolu, S.; Singh, A.; Dhingra, A. Convolutional Neural Networks for Image Emotion Recognition by Fusing Differential and Supplementary Information. In Proceedings of the 2023 International Conference on Bio Signals, Images, and Instrumentation (ICBSII), Chennai, India, 16–17 March 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
- Sermanet, P.; Chintala, S.; LeCun, Y. Convolutional neural networks applied to house numbers digit classification. In Proceedings of the 21st International Conference on Pattern Recognition (ICPR), Tsukuba, Japan, 11–15 November 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 3288–3291.
- Yang, Z.; Zhu, L.; Wu, Y.; Yang, Y. Gated channel transformation for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11794–11803.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Qin, X.; Zhang, Z.; Huang, C.; Gao, C.; Dehghan, M.; Jagersand, M. BASNet: Boundary-aware salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7479–7489.
Model | MSE_V ↓ | MSE_A ↓ | MAE_V ↓ | MAE_A ↓ | R2_V ↑ | R2_A ↑ |
---|---|---|---|---|---|---|
ResNet101 [32] | 0.02701 | 0.02246 | 0.1289 | 0.1199 | 0.3644 | 0.2467 |
PDANet [16] | 0.02589 | 0.02083 | 0.1263 | 0.1159 | 0.3909 | 0.3014 |
ViT [38] | 0.03462 | 0.02705 | 0.1455 | 0.1351 | 0.1852 | 0.0927 |
SCEP [17] | 0.02539 | 0.02117 | 0.1261 | 0.1160 | 0.4024 | 0.2900 |
ARMNet (ours) | 0.02424 | 0.02012 | 0.1217 | 0.1137 | 0.4294 | 0.3249 |
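
The column headers of the table above name three metrics: MSE, MAE, and R2, each reported for valence (V) and arousal (A). The sketch below shows how these could be computed, assuming the standard definitions (mean squared error, mean absolute error, and the coefficient of determination). The evaluation code is not given in this text, so the function names and the toy labels are hypothetical.

```python
import numpy as np


def mse(y_true, y_pred) -> float:
    """Mean squared error (lower is better)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))


def mae(y_true, y_pred) -> float:
    """Mean absolute error (lower is better)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))


def r2(y_true, y_pred) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot (higher is better)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Hypothetical valence labels and predictions, scaled to [0, 1].
valence_true = [0.2, 0.4, 0.6, 0.8]
valence_pred = [0.3, 0.4, 0.5, 0.9]

print(f"MSE_V = {mse(valence_true, valence_pred):.4f}")
print(f"MAE_V = {mae(valence_true, valence_pred):.4f}")
print(f"R2_V  = {r2(valence_true, valence_pred):.4f}")
```

The same three functions would be applied separately to the arousal labels to obtain the A columns.
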
| | MSE_V ↓ | MSE_A ↓ | MAE_V ↓ | MAE_A ↓ | R2_V ↑ | R2_A ↑ |
---|---|---|---|---|---|---|
(a) | 0.02528 | 0.02134 | 0.1258 | 0.1167 | 0.3456 | 0.2509 |
(b) | 0.01696 | 0.02167 | 0.1012 | 0.1172 | 0.5229 | 0.2858 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, J.; Sun, J.; Wang, C.; Tao, Z.; Zhang, F. ARMNet: A Network for Image Dimensional Emotion Prediction Based on Affective Region Extraction and Multi-Channel Fusion. Sensors 2024, 24, 7099. https://doi.org/10.3390/s24217099