BayesBoost: Identifying and Handling Bias Using Synthetic Data Generators

Barbara Draghi, Zhenchen Wang, Puja Myles, Allan Tucker
Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications, PMLR 154:49-62, 2021.

Abstract

Advanced synthetic data generators can model sensitive personal datasets by creating simulated samples with realistic correlation structures and distributions but a greatly reduced risk of identifying individuals. This has huge potential in medicine, where sensitive patient data can be simulated and shared, enabling the development and robust validation of new AI technologies for diagnosis and disease management. However, even when massive ground truth datasets are available (such as UK-NHS databases, which contain patient records in the order of millions), there is a high risk that biases remain and are carried over to the data generators. For example, certain cohorts of patients may be under-represented because of cultural sensitivities amongst some communities or because of institutionalised procedures in data collection. The under-representation of groups is one of the forms in which bias can manifest itself in machine learning, and it is the one we investigate in this work. These factors may also lead to structurally missing data or to incorrect correlations and distributions, which will be mirrored in synthetic data generated from biased ground truth datasets. In this paper, we explore methods to improve synthetic data generators by using probabilistic methods first to identify the difficult-to-predict data samples in the ground truth data and then to boost these types of data when generating synthetic samples. The paper explores whether this approach yields synthetic data with more realistic distributions and predictive models with better performance.
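The sketch below is only a rough, hedged illustration of the two-step idea outlined in the abstract (flag difficult-to-predict records with a probabilistic model, then over-represent that region when drawing synthetic samples); it is not the authors' BayesBoost implementation, which is described in the full paper. The use of scikit-learn's GaussianNB and GaussianMixture, the 20% difficulty quantile, and the 40% boost fraction are illustrative assumptions, and class labels are omitted from the generated data for brevity.

# Illustrative sketch only (assumptions noted in the paragraph above); not the
# authors' BayesBoost method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.mixture import GaussianMixture

# Stand-in "ground truth" data with an under-represented class (10% positives).
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.9, 0.1],
                           random_state=0)

# Step 1: a probabilistic model scores each record by the probability it assigns
# to that record's true label; low scores mark samples that are difficult to predict.
clf = GaussianNB().fit(X, y)
p_true = clf.predict_proba(X)[np.arange(len(y)), y]
difficult = p_true <= np.quantile(p_true, 0.2)   # assumed 20% difficulty cut-off

# Step 2: fit simple generators to the difficult and easy subsets separately,
# then draw a synthetic dataset in which the difficult region is boosted.
gen_hard = GaussianMixture(n_components=2, random_state=0).fit(X[difficult])
gen_easy = GaussianMixture(n_components=5, random_state=0).fit(X[~difficult])

n_synth, boost = 2000, 0.4   # assumed share of synthetic records drawn from the difficult region
n_hard = int(boost * n_synth)
X_synth = np.vstack([gen_hard.sample(n_hard)[0],
                     gen_easy.sample(n_synth - n_hard)[0]])
np.random.default_rng(0).shuffle(X_synth)

print(f"flagged {difficult.sum()} difficult ground-truth samples; "
      f"synthetic set shape: {X_synth.shape}")

In practice, the choice of generator, the difficulty criterion, and the boost fraction would be tuned against the ground truth data rather than fixed as in this sketch.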

Cite this Paper


BibTeX
@InProceedings{pmlr-v154-draghi21a,
  title = {BayesBoost: Identifying and Handling Bias Using Synthetic Data Generators},
  author = {Draghi, Barbara and Wang, Zhenchen and Myles, Puja and Tucker, Allan},
  booktitle = {Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications},
  pages = {49--62},
  year = {2021},
  editor = {Moniz, Nuno and Branco, Paula and Torgo, Luis and Japkowicz, Nathalie and Woźniak, Michał and Wang, Shuo},
  volume = {154},
  series = {Proceedings of Machine Learning Research},
  month = {17 Sep},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v154/draghi21a/draghi21a.pdf},
  url = {https://proceedings.mlr.press/v154/draghi21a.html},
  abstract = {Advanced synthetic data generators can model sensitive personal datasets by creating simulated samples of data with realistic correlation structures and distributions, but with a greatly reduced risk of identifying individuals. This has huge potential in medicine where sensitive patient data can be simulated and shared, enabling the development and robust validation of new AI technologies for diagnosis and disease management. However, even when massive ground truth datasets are available (such as UK-NHS databases which contain patient records in the order of millions) there is a high risk that biases still exist which are carried over to the data generators. For example, certain cohorts of patients may be under-represented due to cultural sensitivities amongst some communities, or due to institutionalised procedures in data collection. The under-representation of groups is one of the forms in which bias can manifest itself in machine learning, and it is the one we investigate in this work. These factors may also lead to structurally missing data or incorrect correlations and distributions which will be mirrored in the synthetic data generated from biased ground truth datasets. In this paper, we explore methods to improve synthetic data generators by using probabilistic methods to firstly identify the difficult to predict data samples in ground truth data, and then to boost these types of data when generating synthetic samples. The paper explores attempts to create synthetic data that contain more realistic distributions and that lead to predictive models with better performance.}
}
Endnote
%0 Conference Paper
%T BayesBoost: Identifying and Handling Bias Using Synthetic Data Generators
%A Barbara Draghi
%A Zhenchen Wang
%A Puja Myles
%A Allan Tucker
%B Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications
%C Proceedings of Machine Learning Research
%D 2021
%E Nuno Moniz
%E Paula Branco
%E Luis Torgo
%E Nathalie Japkowicz
%E Michał Woźniak
%E Shuo Wang
%F pmlr-v154-draghi21a
%I PMLR
%P 49--62
%U https://proceedings.mlr.press/v154/draghi21a.html
%V 154
%X Advanced synthetic data generators can model sensitive personal datasets by creating simulated samples of data with realistic correlation structures and distributions, but with a greatly reduced risk of identifying individuals. This has huge potential in medicine where sensitive patient data can be simulated and shared, enabling the development and robust validation of new AI technologies for diagnosis and disease management. However, even when massive ground truth datasets are available (such as UK-NHS databases which contain patient records in the order of millions) there is a high risk that biases still exist which are carried over to the data generators. For example, certain cohorts of patients may be under-represented due to cultural sensitivities amongst some communities, or due to institutionalised procedures in data collection. The under-representation of groups is one of the forms in which bias can manifest itself in machine learning, and it is the one we investigate in this work. These factors may also lead to structurally missing data or incorrect correlations and distributions which will be mirrored in the synthetic data generated from biased ground truth datasets. In this paper, we explore methods to improve synthetic data generators by using probabilistic methods to firstly identify the difficult to predict data samples in ground truth data, and then to boost these types of data when generating synthetic samples. The paper explores attempts to create synthetic data that contain more realistic distributions and that lead to predictive models with better performance.
APA
Draghi, B., Wang, Z., Myles, P. & Tucker, A. (2021). BayesBoost: Identifying and Handling Bias Using Synthetic Data Generators. Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications, in Proceedings of Machine Learning Research 154:49-62. Available from https://proceedings.mlr.press/v154/draghi21a.html.
