Generating synthetic personal health data using conditional generative adversarial networks combining with differential privacy
- PMID: 37268168
- DOI: 10.1016/j.jbi.2023.104404
Abstract
A large amount of personal health data that is highly valuable to the scientific community is still inaccessible, or requires a lengthy request process, due to privacy concerns and legal restrictions. Synthetic data has been studied and proposed as a promising alternative. However, generating realistic and privacy-preserving synthetic personal health data still poses challenges, such as simulating the characteristics of patients' data in the minority classes, capturing the relations among variables in imbalanced data and transferring them to the synthetic data, and preserving individual patients' privacy. In this paper, we propose a differentially private conditional Generative Adversarial Network model (DP-CGANS) consisting of data transformation, sampling, conditioning, and network training to generate realistic and privacy-preserving personal data. Our model distinguishes categorical and continuous variables and transforms them into latent space separately for better training performance. We tackle the unique challenges of generating synthetic patient data that arise from the special characteristics of personal health data. For example, patients with a certain disease are typically the minority in the dataset, and preserving the relations among variables is crucial. Our model is structured with a conditional vector as an additional input to represent the minority class in the imbalanced data and maximally capture the dependency between variables. Moreover, we inject statistical noise into the gradients in the network training process of DP-CGANS to provide a differential privacy guarantee. We extensively evaluate our model against state-of-the-art generative models on personal socio-economic datasets and real-world personal health datasets in terms of statistical similarity, machine learning performance, and privacy measurement. We demonstrate that our model outperforms other comparable models, especially in capturing the dependence between variables. Finally, we present the balance between data utility and privacy in synthetic data generation considering the different data structures and characteristics of real-world personal health data such as imbalanced classes, abnormal distributions, and data sparsity.
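The abstract states that noise is injected into the gradients during training but does not give the mechanism. As a rough illustration only, the sketch below assumes the common DP-SGD recipe: clip each per-sample gradient to a fixed L2 norm, sum, add Gaussian noise scaled to that norm, and average. The function name `privatize_gradients` and the parameters `clip_norm` and `noise_multiplier` are hypothetical and are not taken from the paper.

```python
import numpy as np


def privatize_gradients(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Return a noisy average gradient (illustrative DP-SGD-style step).

    Assumption: per-sample L2 clipping plus Gaussian noise; the paper's
    actual noise mechanism may differ.
    """
    grads = np.asarray(per_sample_grads, dtype=float)  # shape (batch, dim)
    # L2 norm of each sample's gradient, kept as a column for broadcasting.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Shrink gradients whose norm exceeds clip_norm; leave the others as-is.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Gaussian noise with standard deviation noise_multiplier * clip_norm.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]


rng = np.random.default_rng(0)
batch = [[3.0, 4.0], [0.3, 0.4]]  # norms 5.0 and 0.5
noisy = privatize_gradients(batch, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
exact = privatize_gradients(batch, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
```

With `noise_multiplier=0.0` the function reduces to the plain average of the clipped gradients, which makes the clipping step easy to check in isolation. The privacy budget that results from a given noise level depends on the accounting method, which this sketch does not cover.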
Keywords: Data privacy; Generative adversarial network; Health data sharing; Synthetic data; Synthetic health data.
Copyright © 2023 The Authors. Published by Elsevier Inc. All rights reserved.
Conflict of interest statement
Declaration of Competing Interest: The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Chang Sun reports that financial support was provided by the Dutch Open Data Infrastructure for Social Science and Economic Innovations.