CTAB-GAN+: enhancing tabular data synthesis
- PMID: 38260053
- PMCID: PMC10801038
- DOI: 10.3389/fdata.2023.1296508
Abstract
The use of synthetic data is gaining momentum, in part because original data are often unavailable owing to privacy and legal considerations and in part because synthetic data are useful for augmenting authentic data. Generative adversarial networks (GANs), a paragon of generative models, initially for images and subsequently for tabular data, have contributed many of the state-of-the-art synthesizers. As GANs improve, the synthesized data increasingly resemble the real data, raising the risk of privacy leakage. Differential privacy (DP) provides theoretical guarantees on privacy loss but degrades data utility. Striking the best trade-off remains a challenging research question. In this study, we propose CTAB-GAN+, a novel conditional tabular GAN. CTAB-GAN+ improves upon the state of the art by (i) adding downstream losses to the conditional GAN for higher-utility synthetic data in both classification and regression domains; (ii) using Wasserstein loss with gradient penalty for better training convergence; (iii) introducing novel encoders targeting mixed continuous-categorical variables and variables with unbalanced or skewed data; and (iv) training with DP stochastic gradient descent to impose strict privacy guarantees. We extensively evaluate CTAB-GAN+ on statistical similarity and machine learning utility against state-of-the-art tabular GANs. The results show that CTAB-GAN+ synthesizes privacy-preserving data with at least 21.9% higher machine learning utility (i.e., F1-Score) across multiple datasets and learning tasks under a given privacy budget.
Keywords: GAN; data synthesis; differential privacy; imbalanced distribution; tabular data.
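The DP stochastic gradient descent step named in item (iv) of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It shows only the generic clip-and-noise update on per-example gradients, written in NumPy. The function name `dp_sgd_step` and the parameters `clip_norm` and `noise_multiplier` are hypothetical names chosen for illustration, and the abstract does not say which network or hyperparameters the paper uses.

```python
import numpy as np


def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """Return a privatized average gradient for one mini-batch.

    per_example_grads: array of shape (batch_size, num_params).
    clip_norm, noise_multiplier: illustrative hyperparameters (assumed names).
    """
    grads = np.asarray(per_example_grads, dtype=float)
    # L2 norm of each example's gradient.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Shrink gradients whose norm exceeds clip_norm; leave the rest unchanged.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Gaussian noise with standard deviation noise_multiplier * clip_norm,
    # added once to the summed clipped gradients.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]


rng = np.random.default_rng(0)
batch = np.array([[3.0, 4.0], [0.3, 0.4]])
private_grad = dp_sgd_step(batch, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single record can influence the update, and the noise scale is tied to that bound. A larger `noise_multiplier` gives a tighter privacy budget at the cost of noisier gradients, which is the privacy-utility trade-off the abstract refers to.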
Copyright © 2024 Zhao, Kunar, Birke, Van der Scheer and Chen.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.