Introduction – Synthetic data generation tool
What is synthetic data?
Synthetic data is artificial data that mimics the statistical characteristics of an original dataset without containing personal data.
What data can be processed?
The tool processes any data in tabular format. The type of each variable (numerical, categorical, time, etc.) and missing values are detected automatically. The user has several options for handling missing values; more information on how missing values can be treated is provided in the tool.
What synthetic data generation methods are supported?
Users can currently choose two methods for generating synthetic data:
- Classification And Regression Trees (CART); and
- Gaussian Copula (GC).
By default, CART is used. CART generally produces higher-quality synthetic data but may not work well on datasets containing categorical variables with 20+ categories; GC is recommended in those cases. The tool includes a demo dataset for which output is generated. Use the ‘Try it out’ button.
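To give a flavor of how CART-based synthesis works, the sketch below follows the general synthpop-style technique: each column is modeled by a decision tree conditioned on the columns synthesized before it, and values are drawn from the matching leaf. This is an illustration of the method on a hypothetical toy dataset, not the tool’s actual implementation, and it uses scikit-learn rather than the tool’s own code.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
# Toy "original" dataset (hypothetical column names and values)
real = pd.DataFrame({
    "age": rng.integers(18, 90, 500),
    "income": rng.normal(35_000, 8_000, 500).round(),
    "employed": rng.choice(["yes", "no"], 500, p=[0.7, 0.3]),
})

synth = pd.DataFrame(index=range(len(real)))
# First column: bootstrap from its marginal distribution
synth["age"] = rng.choice(real["age"].to_numpy(), size=len(real))

# Numerical column: fit a tree on the real data, then for each synthetic
# row sample a real value from the leaf that row falls into
tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0)
tree.fit(real[["age"]], real["income"])
real_leaves = tree.apply(real[["age"]])
synth["income"] = [
    rng.choice(real["income"].to_numpy()[real_leaves == leaf])
    for leaf in tree.apply(synth[["age"]])
]

# Categorical column: sample from the tree's leaf class probabilities
clf = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
clf.fit(real[["age", "income"]], real["employed"])
synth["employed"] = [
    rng.choice(clf.classes_, p=p)
    for p in clf.predict_proba(synth[["age", "income"]])
]
print(synth.head())
```

Because synthetic values are drawn from leaves that pool several real records, no synthetic row corresponds one-to-one to a real person, while the joint structure captured by the trees is preserved.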
What does the tool return?
The tool generates synthetic data. An evaluation report of the generated data, including various evaluation metrics, is created automatically and can be downloaded as a PDF. The synthetic data can be downloaded in .csv and .json formats.
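To give a flavor of what such an evaluation can contain, the sketch below computes one common fidelity metric on toy data: the mean absolute difference between the pairwise correlation matrices of the original and synthetic tables. This is an illustrative check, not necessarily one of the metrics the report uses, and the data are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Toy stand-ins for the original and synthetic tables (hypothetical)
real = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
synth = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])

# Mean absolute difference between the two correlation matrices,
# taken over the off-diagonal entries only (diagonals are always 1)
diff = (real.corr() - synth.corr()).abs().to_numpy()
off_diag = ~np.eye(diff.shape[0], dtype=bool)
score = diff[off_diag].mean()
print(f"mean absolute correlation difference: {score:.3f}")
```

A score near 0 means the synthetic data reproduces the linear relationships between variables; correlations lie in [-1, 1], so the score is bounded by 2.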
How is my data processed?
The tool is privacy-friendly because the data are processed entirely within the browser: the data do not leave your computer or your organization’s environment. The tool uses the computing power of your own computer to analyze the data. This type of browser-based software is referred to as local-first. The tool does not upload data to third parties, such as cloud providers. Instructions for hosting the tool and its local-first architecture within your own organization can be found on GitHub.
Try the tool below ⬇️
Synthetic data generation tool
Source code
The source code of the synthetic data generation methods is available on GitHub and as a pip package:
pip install python-synthpop
The architecture to run web apps local-first is also available on GitHub.
How can SDG be used for AI auditing?
When auditing algorithm-driven decision-making processes, one of the most immediate questions is the representativeness of the source data. However, privacy poses a hurdle to sharing data with external parties to assess that representativeness. Without access to the source data, stakeholders – such as the people whose data is stored and independent experts – cannot scrutinize it for potential biases. Consequently, the evaluation of data used in decision-making processes, and for training AI systems, relies on a small group of experts. If this small group does not perform the evaluation carefully, severe downstream consequences can follow, such as bias and skewed outcomes. This harms public trust in technology and in the organisations that deploy these digital methods.
Synthetic data generation (SDG) offers a promising solution. By creating data that mimics the properties of the original dataset without containing any identifiable personal information, SDG allows for broader participation in assessing data representativeness while preserving privacy. It is considered a safe approach for the wider release of privately held data, as it contains no identifiable trace of the personal data it was generated from.
Has SDG been used in the past?
The use of synthetic data has long been hindered for two reasons:
- Privacy risks – Concerns, particularly among legal professionals, existed about the risks of personal data being exposed when sharing synthetic data. Research and practical examples have demonstrated that these risks can be mitigated. See the attached memo below for more background information on the legal aspects of synthetic data generation.
- Cloud dependencies risks – Many existing (commercial) APIs rely on cloud-based software, making them unsuitable for public organizations, as citizen data cannot simply be uploaded to cloud platforms. Local-first data processing offers a solution to this problem. With this tool, synthetic data can be generated directly in the browser. The data does not leave the user’s computer or the organization’s environment.
In sum, recent use cases have shown that synthetic data can be safely shared and generated without the involvement of a cloud provider. It is time to scale up so that stakeholders can gain more and better insights into the data managed by government organizations.
Applications
Lighthouse Reports was able to publicly share unintentionally obtained data using synthetic data, revealing bias in a dataset from the Municipality of Rotterdam. This dataset was used for machine learning-driven risk profiling in the context of social welfare re-examination.
AI Act
Additionally, Article 10(5) of the AI Act includes a specific provision regarding the use of synthetic data for bias detection and mitigation. It requires AI system providers to first investigate bias using synthetic or anonymized data, rather than directly processing “special categories of personal data.”
Local-first architecture
What is local-first computing?
Local-first computing is the opposite of cloud computing: the data are not uploaded to third parties, such as cloud providers, but are processed by your own computer. Data attached to the tool therefore do not leave your computer or your organization’s environment. The tool is privacy-friendly because the data can be processed within your organisation’s mandate and do not need to be shared with new parties. This synthetic data generation tool can also be hosted locally within your organization. Instructions for local hosting, including the source code of the web app, can be found on GitHub.
Overview of local-first architecture

Supported by
This local-first synthetic data generation tool was developed with the support of public and philanthropic organisations.

Innovation grant Dutch Ministry of the Interior
Description
In partnership with the Dutch Executive Agency for Education and the Dutch Ministry of the Interior, Algorithm Audit has been developing and testing this tool from July 2024 to July 2025, supported by an innovation grant from the annual competition hosted by the Dutch Ministry of the Interior. Project progress was shared at a community gathering on 13-02-2025. A first version of the tool was launched during a webinar on 10-06-2025.

SIDN Fund
Description
In 2024, the SIDN Fund supported Algorithm Audit to develop a first demo of the synthetic data generation tool.