Sci Rep. 2024 Oct 7;14(1):23312.
doi: 10.1038/s41598-024-73608-0.

A novel and fully automated platform for synthetic tabular data generation and validation

Hooman H Rashidi et al. Sci Rep. 2024.

Abstract

Healthcare data accessibility for machine learning (ML) is encumbered by a range of stringent regulations and limitations. Using synthetic data that mirrors the underlying properties in the real data is emerging as a promising solution to overcome these barriers. We propose a fully automated synthetic tabular neural generator (STNG), which comprises multiple synthetic data generators and integrates an Auto-ML module to validate and comprehensively compare the synthetic datasets generated from different approaches. An empirical study was conducted to demonstrate the performance of STNG using twelve different datasets. The results highlight STNG's robustness and its pivotal role in enhancing the accessibility of validated synthetic healthcare data, thereby offering a promising solution to a critical barrier in ML applications in healthcare.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
STNG synthetic data generators and Auto-ML infrastructure.
Fig. 2
STNG ML Scores of the synthetic datasets for the datasets with binary outputs.
Fig. 3
Areas under the curve (AUCrr, AUCss, and AUCsr) for evaluating synthetic heart disease datasets.
Fig. 4
Univariate and bivariate comparison of the real and STNG Gaussian copula synthetic datasets: (A) comparison of means and standard deviations from the real and synthetic heart disease datasets; (B) pairwise correlations of the real and synthetic data, and their difference.
Fig. 5
Areas under the curve (AUCrr, AUCss, and AUCsr) for evaluating synthetic stroke datasets.
Fig. 6
Areas under the curve (AUCrr, AUCss, and AUCsr) for evaluating synthetic NHANES diabetes datasets.
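
The kind of comparison described in the Fig. 4 caption (means, standard deviations, and pairwise correlations of real versus Gaussian copula synthetic data) can be illustrated with a minimal sketch. This is not the STNG implementation, which this page does not describe in detail. It is a generic Gaussian copula with empirical marginals, written with NumPy and the Python standard library. The function names (fit_sample_gaussian_copula, compare) and the toy three-column "real" table are hypothetical and stand in for an actual dataset.

```python
import numpy as np
from statistics import NormalDist

# Standard normal CDF and inverse CDF, vectorized for NumPy arrays.
_STD_NORMAL = NormalDist()
_phi = np.vectorize(_STD_NORMAL.cdf)
_phi_inv = np.vectorize(_STD_NORMAL.inv_cdf)


def fit_sample_gaussian_copula(real, n_samples, seed=0):
    """Sample synthetic rows from a Gaussian copula fitted to `real` (n x d, d >= 2)."""
    rng = np.random.default_rng(seed)
    n, d = real.shape
    # 1. Map each column to normal scores through its empirical ranks (1..n).
    ranks = real.argsort(axis=0).argsort(axis=0) + 1
    z = _phi_inv(ranks / (n + 1.0))
    # 2. Capture the dependence structure as the correlation of the normal scores.
    corr = np.corrcoef(z, rowvar=False)
    # 3. Draw correlated normals, convert them to uniforms, then invert the
    #    empirical marginal of each real column with a quantile lookup.
    z_new = rng.multivariate_normal(np.zeros(d), corr, size=n_samples)
    u_new = _phi(z_new)
    return np.column_stack(
        [np.quantile(real[:, j], u_new[:, j]) for j in range(d)]
    )


def compare(real, synth):
    """Univariate and bivariate differences between real and synthetic tables."""
    mean_diff = np.abs(real.mean(axis=0) - synth.mean(axis=0))
    std_diff = np.abs(real.std(axis=0, ddof=1) - synth.std(axis=0, ddof=1))
    corr_diff = np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False)
    return mean_diff, std_diff, corr_diff


# Toy stand-in for a real table (assumption: NOT patient data from the paper).
rng = np.random.default_rng(42)
x1 = rng.normal(size=2000)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=2000)   # correlated with x1
x3 = rng.exponential(scale=1.0, size=2000)    # skewed, independent column
real = np.column_stack([x1, x2, x3])

synth = fit_sample_gaussian_copula(real, n_samples=2000, seed=1)
mean_diff, std_diff, corr_diff = compare(real, synth)
print("abs. mean difference per column:", mean_diff)
print("abs. std difference per column: ", std_diff)
print("max abs. correlation difference:", np.abs(corr_diff).max())
```

Because the marginals are inverted through empirical quantiles, synthetic values never leave the observed range of each real column. That keeps the sketch simple but also means it cannot generate values more extreme than those already seen.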
