Synthetic data generation – What is it?

Garbage in, garbage out: When auditing semi-automated decision-making processes, one of the most immediate questions is the representativeness of the source data. However, privacy poses a hurdle to sharing data with external parties to assess the representativeness of the data. Absent access to source data means that stakeholders – such as people whose data is stored and independent experts – cannot scrutinize it for potential biases. Consequently, the evaluation of data used for semi-automated decision-making processes, and training of AI-systems, relies on a small group. If this evaluation process is not performed carefully, this can have severe downstream consequences for the decision-making processes that are using these data. This harms public trust in technology and in the organisations that deploy these digital methods.

Synthetic data generation (SDG) – the creation of artificial datasets mimicking the original dataset’s statistical characteristics – emerges as a potential solution. SDG has the potential to extend the audience involved in assessing the representativeness of data. It is considered a safe approach for the wider release of privately held data, as it contains no identifiable trace of the personal data it was generated from.

How can SDG be used for AI bias testing?

SDG holds potential for third parties to audit datasets in a privacy-preserving way. There is currently not yet sufficient knowledge how and when SDG serves as a suitable method for external bias testing. First, the complex process of SDG may not always be necessary for bias testing when simple approaches such as univariate or bivariate aggregate statistics of the source data suffice. Second, SDG can be performed using a plethora of methods, e.g., parametric, non-parametric and copula-based estimation and inference methods. The best SDG method for a given use case depends on the underlying structure of the data and is therefore context-specific. At Algorithm Audit, we are investigating these open-ended questions, and build public knowledge on what form of data-sharing practice (SDG or alternatives) is best suited for privacy-preserving AI bias testing in specific use cases. Through our technical and qualitative work in this project, we contribute to this collective learning process.

Learn more about our quantitative and qualitative Joint Fairness Assessment Method.

Has SDG been used in the past?

Although numerous commercial APIs for generating synthetic data exist, widespread adoption has historically been limited due to the risk data-sharing poses for privacy. Particularly in the public sector, where stringent privacy preservation is imperative, adoption has faced SDG hurdles. Yet, the couple of last years has seen landmark use cases of data sharing enabled through SDG.

Use cases

Notably, Lighthouse Reports shared inadvertently acquired data to the public through SDG, shedding light on biases in a massive data set that the Municipality of Rotterdam used for ML-driven risk profiling in the context of social welfare re-examination.

AI Act

Furthermore, the AI Act (Article 10) contains a specific provision regarding the utilization of synthetic data for bias detection and mitigation, mandating AI system providers to rectify biases using synthetic or anonymized data rather than relying solely on “appropriate safeguards.”

Synthetic data generation cohort

Ellen Bogaards

MSc Artifical Intelligence, Utrecht University

Emmanuel Menvouta PhD

Machine Learning Engineer, Dataroots

Godwin Acheampong

Data Scientist, Budget Thuis

Joel Persson PhD

Research Scientist, Spotify

Sonja Babac

PhD-candidate, Technical University Eindhoven – Philips MedTech

Newsletter

Stay up to date about our work by signing up for our newsletter

Newsletter

Stay up to date about our work by signing up for our newsletter