Is it possible to increase participant privacy while retaining statistical control with highly synthetic data injection?
Yep.
Synthetic data is cool, and I like it [1]. It provides an interesting new way to increase participant privacy while still having an open dataset.
First things first: what is synthetic data? Synthetic data is data generated from an original data source which, if valid, has the same distributional and internal statistical properties as the original dataset. It is incredibly useful for further data exploration while retaining participant privacy.
Given this, we can take the original data, create a synthetic counterpart, and then produce a partially synthetic dataset. This isn’t all that new. However, the scenario below opens an area to explore: is it possible to significantly increase the sample size to increase privacy/noise?
First and foremost, it has to be stated that this approach is not turnkey ready for analysis to publication. That is, someone cannot take the dataset, analyze it, and write up the results for any form of publication/communication. It will have, at the minimum, a synthetically inflated sample size. There is also the obvious “these data are not real” aspect.
You’re a researcher in sexual health in an extremely socially conservative area. Recently, an undergraduate student came to you and proposed a study focusing on teenage, closeted transgender students and their sexual behaviors. You find the idea interesting and approve the study.
There are some immediate things to note with this scenario:
These five points are significant barriers to even wanting to do open data. I support open data and all open science tenets, but I’d be hesitant to share even this data by itself. If someone in that socially conservative area were to find the data and I hadn’t done a perfect job of anonymizing it, the result could be unthinkable harm to the participants. For the purposes of this post, however, let us say there is a reason open data is needed.
Let’s generate some data in R to act as data from the aforementioned scenario:
#required libraries
suppressMessages(library(tidyverse)) #data wrangling
suppressMessages(library(synthpop)) #synthetic data generation
suppressMessages(library(faux)) #generating correlated data
suppressMessages(library(knitr)) #tables
suppressMessages(library(rmarkdown)) #rmarkdown
#dataset generation
set.seed(0001) #allows for reproducible values
datagen <- rnorm_multi(n = 20, #number of participants
                       mu = c(50, 25, 40), #means
                       sd = c(10, 3, 5), #SDs
                       r = c(.5, .7, .2), #pairwise correlations
                       varnames = c("Var1", "Var2", "Var3"), #variable names
                       empirical = FALSE)
kable(datagen, align = "c") #reports the data
Var1 | Var2 | Var3 |
---|---|---|
42.78856 | 23.02010 | 40.51284 |
50.90882 | 24.63125 | 43.30836 |
42.08004 | 22.15918 | 36.40365 |
68.11645 | 29.22146 | 39.64735 |
52.23135 | 26.01547 | 43.74534 |
41.44597 | 25.34748 | 37.24683 |
55.22734 | 25.19673 | 41.10612 |
59.24389 | 26.70389 | 37.73019 |
56.10539 | 26.82382 | 40.86755 |
47.13634 | 22.04843 | 39.34886 |
63.96932 | 24.37960 | 49.66113 |
53.57427 | 27.03423 | 41.71214 |
43.67821 | 22.76090 | 38.48796 |
27.32191 | 24.14119 | 32.20889 |
63.42457 | 25.73888 | 38.95358 |
51.23558 | 21.37306 | 36.89340 |
49.97984 | 26.34671 | 39.04488 |
58.75513 | 28.71410 | 44.35432 |
57.46980 | 23.36980 | 46.05356 |
55.07865 | 25.02611 | 44.74000 |
Here are twenty hypothetical participants across three variables. For whatever reason, you choose to look at the relationship between them.
r2_og <- lm(Var1 ~ Var2 + Var3, data = datagen) #model
r2_og_summary <- summary(r2_og) #summarizes the model
And the results are…
R^2 = 0.56, F(2,17) = 10.65, p < 0.05
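For reference, here is a minimal sketch of how those numbers can be pulled out of the summary object (f_og is just a throwaway name here, not something from the original code):

#extracting the reported statistics from the model summary
r2_og_summary$r.squared #R^2 for the model
f_og <- r2_og_summary$fstatistic #F value plus its numerator/denominator dfs
pf(f_og[1], f_og[2], f_og[3], lower.tail = FALSE) #p-value for the overall model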
So, we have our outcome. Now, how do we increase the sample size? The synthpop package makes generating synthetic data incredibly easy. For this example, we will increase the sample size by 10 synthetic participants (total N = 30).
#getting a subset of the data
set.seed(0002) #allows for reproducible values
data_30n <- sample_n(datagen, 10) #grabs ten rows
data_30n_df <- data.frame(data_30n) #puts ten rows into a dataframe
#creating the synthetic data
syn_data_30n <- (syn(data_30n_df, seed = 0003))$syn #creating the synthetic values
syn_data_30n <- data.frame(syn_data_30n) #putting them into a dataframe
syn_data_30n_df <- rbind(datagen, syn_data_30n) #combining both dataframes
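kable(syn_data_30n_df, align = "c", digits = 0) #reports the combined data; digits = 0 is my assumption about how the rounded table below was produced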
Var1 | Var2 | Var3 |
---|---|---|
43 | 23 | 41 |
51 | 25 | 43 |
42 | 22 | 36 |
68 | 29 | 40 |
52 | 26 | 44 |
41 | 25 | 37 |
55 | 25 | 41 |
59 | 27 | 38 |
56 | 27 | 41 |
47 | 22 | 39 |
64 | 24 | 50 |
54 | 27 | 42 |
44 | 23 | 38 |
27 | 24 | 32 |
63 | 26 | 39 |
51 | 21 | 37 |
50 | 26 | 39 |
59 | 29 | 44 |
57 | 23 | 46 |
55 | 25 | 45 |
43 | 23 | 41 |
51 | 27 | 37 |
51 | 27 | 38 |
59 | 27 | 42 |
51 | 23 | 43 |
54 | 25 | 38 |
54 | 27 | 39 |
59 | 27 | 46 |
51 | 23 | 38 |
51 | 23 | 38 |
As you can see, two things have happened. First, all values have been rounded to the nearest integer (for display). Second, starting at row 21, the values are ones that were not present in the original dataframe above. Let’s see if the interpretations would be the same with this partially synthetic dataset.
r2_30n <- lm(Var1 ~ Var2 + Var3, data = syn_data_30n_df) #model
r2_30n_summary <- summary(r2_30n) #summarizes the model
R^2 = 0.54, F(2,27) = 15.58, p < 0.05
So, yes. The interpretations would be nearly identical. Of course, that’s only increasing the sample size by 10 participants (albeit 50% of the original sample). Now, let’s increase the sample size to 60 participants - an increase of 40 synthetic participants.
Here is some bodged [2] code that works:
#getting a subset of the data
set.seed(0004) #allows for reproducible values
data_60n1 <- sample_n(datagen, 10) #grabs ten rows
data_60n1_df <- data.frame(data_60n1) #puts ten rows into a dataframe
set.seed(0005) #allows for reproducible values
data_60n2 <- sample_n(datagen, 10) #grabs ten rows
data_60n2_df <- data.frame(data_60n2) #puts ten rows into a dataframe
set.seed(0006) #allows for reproducible values
data_60n3 <- sample_n(datagen, 10) #grabs ten rows
data_60n3_df <- data.frame(data_60n3) #puts ten rows into a dataframe
set.seed(0007) #allows for reproducible values
data_60n4 <- sample_n(datagen, 10) #grabs ten rows
data_60n4_df <- data.frame(data_60n4) #puts ten rows into a dataframe
#creating the synthetic data
syn_data_60n1 <- (syn(data_60n1_df, seed = 0008))$syn #creating the synthetic values
syn_data_60n2 <- (syn(data_60n2_df, seed = 0009))$syn
syn_data_60n3 <- (syn(data_60n3_df, seed = 0010))$syn
syn_data_60n4 <- (syn(data_60n4_df, seed = 0011))$syn
#putting them into a dataframe
syn_data_60n1 <- data.frame(syn_data_60n1) #putting them into a dataframe
syn_data_60n2 <- data.frame(syn_data_60n2)
syn_data_60n3 <- data.frame(syn_data_60n3)
syn_data_60n4 <- data.frame(syn_data_60n4)
#binding the synthetic datasets to the original dataset
syn_data_60n1_df <- rbind(datagen, syn_data_60n1) #combining both dataframes
syn_data_60n2_df <- rbind(syn_data_60n1_df, syn_data_60n2)
syn_data_60n3_df <- rbind(syn_data_60n2_df, syn_data_60n3)
syn_data_60n4_df <- rbind(syn_data_60n3_df, syn_data_60n4)
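As footnote [2] concedes, there is a less sloppy way. Here is a minimal sketch of the same sampling and synthesis in one loop; it uses the same seeds, so it should reproduce the same combined dataframe:

#tidier equivalent of the bodge above
syn_data_60n4_df <- datagen #start from the original data
for (i in 0:3) {
  set.seed(0004 + i) #sampling seeds 0004 through 0007, as above
  sampled_df <- data.frame(sample_n(datagen, 10)) #grabs ten rows
  synthed_df <- data.frame(syn(sampled_df, seed = 0008 + i)$syn) #synthesis seeds 0008 through 0011
  syn_data_60n4_df <- rbind(syn_data_60n4_df, synthed_df) #appends the synthetic rows
}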
Again, time for the model… I will spare you the code.
R^2 = 0.02, F(2,57) = 0.67, p > 0.05
For a quick recap, the original results were R^2 = 0.56, F(2,17) = 10.65, p < 0.05. It is clear that the interpretations one would make from the original model (i.e., the original dataset, N = 20) and the highly synthetic model (N = 60) are completely different. Therefore, the highly synthetic dataset is not valid. However, the partially synthetic dataset (N = 30) produced results similar to the original and is a valid dataset: R^2 = 0.54, F(2,27) = 15.58, p < 0.05.
One of the weird things about synthetic data is that, depending on the statistical properties within a dataset, one seed can produce a valid dataset while another does not. Therefore, I am curious whether it is computationally feasible to produce a valid, highly synthetic dataset.
So, let’s do some simulating [3].
#creating frames to place values
r2val_60n_frame <- data.frame(r2val = numeric(0)) #empty frame for the loop to place the R^2 values into
pval_60n_frame <- data.frame(pval = numeric(0)) #empty frame for the loop to place the p-values into
#setting counters and seeds
loop_60n <- 1
sampling_seed <- 2000
seed_60n <- 10000 #seed for the synthesis step; incremented once per loop
while(loop_60n < 2501){
#first ten
set.seed(sampling_seed) #seed for replicable sequences
data_60n1 <- sample_n(datagen, 10) #grabs ten from the original generated dataset
sampling_seed <- sampling_seed + 1 #adds 1 so the next sample uses a new seed
#second ten
set.seed(sampling_seed)
data_60n2 <- sample_n(datagen, 10)
sampling_seed <- sampling_seed + 1
#third ten
set.seed(sampling_seed)
data_60n3 <- sample_n(datagen, 10)
sampling_seed <- sampling_seed + 1
#fourth ten
set.seed(sampling_seed)
data_60n4 <- sample_n(datagen, 10)
sampling_seed <- sampling_seed + 1
#binding the data together
data_60n <- rbind(data_60n1, data_60n2) #binding the sets of sampled data together
data_60n <- rbind(data_60n, data_60n3)
data_60n <- rbind(data_60n, data_60n4)
data_60n_df <- data.frame(data_60n) #putting the data into a final dataframe
#synthesizing the data
syn_data_60n <- (syn(data_60n_df, seed = seed_60n))$syn #synthesizing
syn_data_60n <- data.frame(syn_data_60n) #dataframe
syn_data_60n_df <- rbind(datagen, syn_data_60n) #binding the original generated data with the synthetic data
#model
r2_60n <- lm(Var1 ~ Var2 + Var3, data = syn_data_60n_df) #model
r2_60n_summary <- summary(r2_60n)
r2val_60n_frame <- rbind(r2val_60n_frame, data.frame(r2val = r2_60n_summary$r.squared)) #binding the R^2 value to the dataframe
#getting the p-value for the overall model
p_60n <- pf(r2_60n_summary$fstatistic[1], #F value
            r2_60n_summary$fstatistic[2], #numerator df
            r2_60n_summary$fstatistic[3], #denominator df
            lower.tail = FALSE)
pval_60n_frame <- rbind(pval_60n_frame, data.frame(pval = p_60n)) #binding the p-value to the dataframe
#increment the synthesis seed and the loop counter
seed_60n <- seed_60n + 1 #the next loop synthesizes with a new seed
loop_60n <- loop_60n + 1
}
Now to filter out the invalid datasets. To do this, we need some criteria for what counts as valid and invalid. Here, I think the two most obvious are the p-value, since it is a dichotomous decision, and the R^2 value. Given the original R^2 value was 0.56, I will filter out any values which are not between 0.400 and 0.599.
#binding the two frames holding R^2 and p-values together
all_60n <- cbind(r2val_60n_frame, pval_60n_frame)
colnames(all_60n) <- c("r2_val_60n", "p_val_60n") #renames the columns
n60_filtered <- all_60n %>% filter(p_val_60n < 0.050 & r2_val_60n >= 0.400 & r2_val_60n <= 0.599) #keeps only datasets whose p-value and R^2 would support the same inference (i.e., effect size) as the original
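nrow(n60_filtered) #counts the remaining valid datasets (presumably how the total below was obtained)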
From a total of 2,500 ‘simulations’, the final count of valid datasets is 1,992.
So… it is possible to produce a valid dataset even when such a significant share of the data within it is synthetic [4]. It is just a computational cost, and one that isn’t that severe. It is also possible to know which seed produced the valid dataset [5], so wrapping something like this in a while() loop wouldn’t be a bad idea if you really needed a synthetic dataset for your project.
[1] An enormous thank you to Dan Quintana for introducing me to this topic and for an excellent primer on the topic.
[2] The Art of the Bodge. TL;DW: There is a less sloppy way to make this work, but it works.
[3] I’ve never quite felt comfortable calling this process ‘simulation’ but I do not know a better term for it. Don’t get upset please.
[4] I did not visualize the distribution of values for this project. Therefore, it is assumed that the distributions would look the same. To have a definitive claim that a synthetic dataset is valid, the distribution of values must look the same compared to the original dataset.
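For what it’s worth, if you keep the whole object that syn() returns rather than immediately taking its $syn element, synthpop’s compare() function will tabulate and plot the observed versus synthetic distributions. A minimal sketch, using the partially synthetic example from earlier:

#distributional check: observed vs synthetic values
syn_obj_30n <- syn(data_30n_df, seed = 0003) #keep the full synds object this time
compare(syn_obj_30n, data_30n_df) #tables/plots comparing the two distributions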
[5] Maybe for another post. Not today.