Inspire PEACH

Synthetic Data

The research team at the Malawi University of Business and Applied Sciences has been leading the work on synthetic data for the project. We are working with the research community to understand how current trends in synthetic health data can be applied to the health data landscape in Malawi and Kenya.

Why is synthetic data potentially important?

Synthetic data has received considerable attention as a method of creating anonymized datasets, and thus of protecting patient privacy and augmenting clinical research. Synthetic data can be numerical, such as fake patient records, or image-based, such as fake medical images.

Synthetic datasets offer the potential to speed up access to healthcare datasets, a key step towards harmonising health datasets and enabling collaborative research that improves people’s lives.

Examples of synthetic patient datasets are:

The Synthetic Patient Data in OMOP dataset, a synthetic database based on the Medicare Claims Synthetic Public Use Files (SynPUF) released by the Centers for Medicare and Medicaid Services (CMS).
The Synthea Generated Synthetic Data in FHIR dataset, which hosts over 1 million synthetic patient records generated using Synthea in FHIR format.

Typically, synthetic datasets are created from real datasets, by extracting the characteristics of the real datasets, e.g. statistical measures of distribution and correlation, and maintaining these in the synthetic data.
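
As a rough illustration of this idea, the sketch below fits the means, variances and covariance of two numerical columns and then samples new records from a bivariate normal distribution with the same summary statistics. It is a minimal example under stated assumptions, not the method used in the project: the "real" data is a made-up stand-in (hypothetical age and blood pressure values), the function names are illustrative, and a normal distribution is assumed, which real health data often does not follow.

```python
import math
import random


def fit_summary(rows):
    """Return the means, variances and covariance of a list of (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    return mean_x, mean_y, var_x, var_y, cov_xy


def sample_synthetic(summary, n, seed=0):
    """Draw n synthetic (x, y) pairs from a bivariate normal with the fitted summary."""
    mean_x, mean_y, var_x, var_y, cov_xy = summary
    rng = random.Random(seed)
    # Cholesky factor of the 2x2 covariance matrix, written out by hand.
    l11 = math.sqrt(var_x)
    l21 = cov_xy / l11
    l22 = math.sqrt(max(var_y - l21 ** 2, 0.0))
    records = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        records.append((mean_x + l11 * z1, mean_y + l21 * z1 + l22 * z2))
    return records


def correlation(summary):
    """Pearson correlation computed from a fitted summary."""
    _, _, var_x, var_y, cov_xy = summary
    return cov_xy / math.sqrt(var_x * var_y)


# Stand-in for a real dataset: made-up (age, systolic blood pressure) values.
_rng = random.Random(42)
real = []
for _ in range(2000):
    age = _rng.gauss(45, 12)
    real.append((age, 100 + 0.6 * age + _rng.gauss(0, 8)))

real_summary = fit_summary(real)
synthetic = sample_synthetic(real_summary, 2000, seed=1)
synthetic_summary = fit_summary(synthetic)

print("real correlation:     ", round(correlation(real_summary), 3))
print("synthetic correlation:", round(correlation(synthetic_summary), 3))
```

No synthetic record here corresponds to an individual row of the input, yet the means and the correlation between the two columns stay close to those of the original. Matching summary statistics alone does not establish that a dataset is private, so a real release would still need a separate disclosure-risk assessment.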

AI techniques, such as generative adversarial networks (GANs), have also been used to create high-fidelity fake data. The artificial intelligence system is given a dataset of real data and learns to produce new, artificial data that retains the overall properties of the original dataset.

As we have seen in our project, accessing health data from Ministries of Health and other stakeholders can be a long process. Sometimes the data is not in electronic form; sometimes it is in electronic form, but considerable investment is needed to prepare it.

We generated a synthetic dataset for the IDSR Case Base reporting form for Covid-19

See a description of this form here, and a description of the dataset we created here.

We have used this dataset to test the ETL pipeline to OMOP, as you can see in DATA MESH > ETL.

We are working with the OHDSI network and other stakeholders to drive the research around the use of synthetic datasets and how they can support a variety of uses: academic, teaching and training, benchmarking of methods and commercial use.