2024 Jan 8:6:1296508.
doi: 10.3389/fdata.2023.1296508. eCollection 2023.

CTAB-GAN+: enhancing tabular data synthesis


Zilong Zhao et al. Front Big Data. 2024.

Abstract

The use of synthetic data is gaining momentum, in part because original data are often unavailable for privacy and legal reasons, and in part because synthetic data are useful for augmenting authentic data. Generative adversarial networks (GANs), a paragon of generative models developed initially for images and subsequently for tabular data, have contributed many of the state-of-the-art synthesizers. As GANs improve, the synthesized data increasingly resemble the real data, raising the risk of privacy leakage. Differential privacy (DP) provides theoretical guarantees on privacy loss but degrades data utility, and striking the best trade-off remains a challenging research question. In this study, we propose CTAB-GAN+, a novel conditional tabular GAN. CTAB-GAN+ improves upon the state of the art by (i) adding downstream losses to the conditional GAN for higher-utility synthetic data in both classification and regression domains; (ii) using Wasserstein loss with gradient penalty for better training convergence; (iii) introducing novel encoders targeting mixed continuous-categorical variables and variables with imbalanced or skewed data; and (iv) training with DP stochastic gradient descent (DP-SGD) to impose strict privacy guarantees. We extensively evaluate CTAB-GAN+ on statistical similarity and machine learning utility against state-of-the-art tabular GANs. The results show that CTAB-GAN+ synthesizes privacy-preserving data with at least 21.9% higher machine learning utility (i.e., F1-score) across multiple datasets and learning tasks under a given privacy budget.
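The privacy mechanism in point (iv), DP-SGD, rests on a simple gradient-aggregation rule: clip each per-example gradient to a fixed L2 norm, sum the clipped gradients, and add calibrated Gaussian noise before averaging. The sketch below illustrates that rule with a hypothetical pure-Python helper; it is not CTAB-GAN+'s actual implementation, and the function name and parameters are illustrative assumptions.

```python
import math
import random

def dp_sgd_aggregate(per_example_grads, clip_norm, noise_multiplier, rng=None):
    """Illustrative DP-SGD aggregation step (hypothetical helper, not
    CTAB-GAN+'s code): clip each per-example gradient to L2 norm
    clip_norm, sum them, add Gaussian noise with standard deviation
    noise_multiplier * clip_norm, and average over the batch."""
    rng = rng or random.Random(0)
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / (norm + 1e-12))  # clip large gradients
        for i, x in enumerate(g):
            total[i] += x * scale
    sigma = noise_multiplier * clip_norm  # noise calibrated to clip bound
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [x / len(per_example_grads) for x in noisy]
```

Clipping bounds each example's influence on the update, which is what lets the added Gaussian noise translate into a formal (epsilon, delta) privacy guarantee; the privacy budget is then tracked across training steps by an accountant.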

Keywords: GAN; data synthesis; differential privacy; imbalanced distribution; tabular data.


Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1. Challenges of modeling industrial datasets using existing GAN-based table generators: (A) single Gaussian, (B) mixed type, (C) long-tail distribution, and (D) skewed data.

Figure 2. Synthetic tabular data generation via CTAB-GAN+.

Figure 3. Encoding for mixed data-type variables. (A) Mixed-type variable distribution with VGM. (B) Mode selection of a single value in a continuous variable.

Figure 4. Conditional vector: the example selects class 2 from the third variable.

Figure 5. Evaluation flow for ML utility of classification.

Figure 6. Modeling industrial datasets using CTAB-GAN+: (A) simple Gaussian, (B) mixed type, (C) long-tail distribution, and (D) skewed data.
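The conditional vector of Figure 4 can be sketched as a one-hot vector over the concatenated category encodings of all discrete variables, with a single 1 marking the chosen class of the chosen variable. The helper below is a hypothetical illustration (zero-indexed classes and example category sizes are assumptions, not taken from the paper):

```python
def conditional_vector(category_sizes, var_index, class_index):
    """Hypothetical sketch of a conditional vector: a one-hot over
    the concatenation of every discrete variable's categories, with
    a single 1 selecting class `class_index` (zero-indexed) of
    variable `var_index` (zero-indexed)."""
    vec = [0.0] * sum(category_sizes)
    offset = sum(category_sizes[:var_index])  # skip earlier variables
    vec[offset + class_index] = 1.0
    return vec

# Figure 4's example, selecting class 2 of the third variable: with
# assumed category sizes [3, 2, 4], the 1 lands at offset 3 + 2 = 5
# plus class index 2, i.e., position 7 of a length-9 vector.
```

During training, such a vector is fed to both generator and discriminator so that sampling can be steered toward rare classes, which is how conditional tabular GANs counter imbalanced distributions.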

Grants and funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.