Is it possible to increase participant privacy while retaining statistical control with highly synthetic data injection?
Yep.
Synthetic data is cool, and I like it [1]. It provides an interesting new way to increase participant privacy while still having an open dataset.
First things first: what is synthetic data? Synthetic data is data generated from an original data source which, if valid, has the same distributional and internal statistical properties as the original dataset. It is incredibly useful for further data exploration while retaining participant privacy.
Given this, we can take the original data, create a synthetic counterpart, and then produce a partially synthetic dataset. This isn’t all that new. However, the scenario below opens an area to explore: is it possible to significantly increase the sample size to increase privacy/noise?
First and foremost, it has to be stated that this approach is not turnkey ready for analysis to publication. That is, someone cannot take the dataset, analyze it, and write up the results for any form of publication/communication. It will have, at the minimum, a synthetically inflated sample size. There is also the obvious “these data are not real” aspect.
You’re a researcher in sexual health in an extremely socially conservative area. Recently, an undergraduate student came to you and proposed a study focusing on teenage, closeted transgender students and their sexual behaviors. You find the idea interesting and approve the study.
There are some immediate things to note with this scenario:
These five points are significant barriers to even wanting to do open data. I support open data and all open science tenets, but I’d be hesitant to share even this data by itself. If someone in that socially conservative area were to find the data and I hadn’t done a perfect job of anonymizing it, the result could be unthinkable harm to the participants. For the purposes of this post, however, let us say there is a reason open data is needed.
Let’s generate some data in R to act as data from the aforementioned scenario:
#required libraries
suppressMessages(library(tidyverse)) #data wrangling
suppressMessages(library(synthpop)) #synthetic data generation
suppressMessages(library(faux)) #generating correlated data
suppressMessages(library(knitr)) #tables
suppressMessages(library(rmarkdown)) #rmarkdown
#dataset generation
set.seed(0001) #allows for reproducible values
datagen <- rnorm_multi(n = 20, #number of participants
                       mu = c(50, 25, 40), #means
                       sd = c(10, 3, 5), #SDs
                       r = c(.5, .7, .2), #pairwise correlations
                       varnames = c("Var1", "Var2", "Var3"), #variable names
                       empirical = FALSE)
kable(datagen, align = "c") #reports the data
Var1 | Var2 | Var3 |
---|---|---|
42.78856 | 23.02010 | 40.51284 |
50.90882 | 24.63125 | 43.30836 |
42.08004 | 22.15918 | 36.40365 |
68.11645 | 29.22146 | 39.64735 |
52.23135 | 26.01547 | 43.74534 |
41.44597 | 25.34748 | 37.24683 |
55.22734 | 25.19673 | 41.10612 |
59.24389 | 26.70389 | 37.73019 |
56.10539 | 26.82382 | 40.86755 |
47.13634 | 22.04843 | 39.34886 |
63.96932 | 24.37960 | 49.66113 |
53.57427 | 27.03423 | 41.71214 |
43.67821 | 22.76090 | 38.48796 |
27.32191 | 24.14119 | 32.20889 |
63.42457 | 25.73888 | 38.95358 |
51.23558 | 21.37306 | 36.89340 |
49.97984 | 26.34671 | 39.04488 |
58.75513 | 28.71410 | 44.35432 |
57.46980 | 23.36980 | 46.05356 |
55.07865 | 25.02611 | 44.74000 |
Here are twenty hypothetical participants across three variables. For whatever reason, you choose to look at the relationship between them.
r2_og <- lm(Var1 ~ Var2 + Var3, data = datagen) #model
r2_og_summary <- summary(r2_og) #summarizes the model
And the results are…
R^2 = 0.56, F(2,17) = 10.65, p < 0.05
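For reference, here is a minimal sketch of how those numbers can be pulled out of the summary object (f_og is just a throwaway name here, not something from the original code):

#extracting the reported statistics from the model summary
r2_og_summary$r.squared #R^2 for the model
f_og <- r2_og_summary$fstatistic #F value plus its numerator/denominator dfs
pf(f_og[1], f_og[2], f_og[3], lower.tail = FALSE) #p-value for the overall model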
So, we have our outcome. Now, how do we increase the sample size? The synthpop package makes generating synthetic data incredibly easy. For this example, we will increase the sample size by 10 synthetic participants (total N = 30).
#getting a subset of the data
set.seed(0002) #allows for reproducible values
data_30n <- sample_n(datagen, 10) #grabs ten rows
data_30n_df <- data.frame(data_30n) #puts ten rows into a dataframe
#creating the synthetic data
syn_data_30n <- (syn(data_30n_df, seed = 0003))$syn #creating the synthetic values
syn_data_30n <- data.frame(syn_data_30n) #putting them into a dataframe
syn_data_30n_df <- rbind(datagen, syn_data_30n) #combining both dataframes
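kable(syn_data_30n_df, align = "c", digits = 0) #reports the combined data; digits = 0 is my assumption about how the rounded table below was produced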
Var1 | Var2 | Var3 |
---|---|---|
43 | 23 | 41 |
51 | 25 | 43 |
42 | 22 | 36 |
68 | 29 | 40 |
52 | 26 | 44 |
41 | 25 | 37 |
55 | 25 | 41 |
59 | 27 | 38 |
56 | 27 | 41 |
47 | 22 | 39 |
64 | 24 | 50 |
54 | 27 | 42 |
44 | 23 | 38 |
27 | 24 | 32 |
63 | 26 | 39 |
51 | 21 | 37 |
50 | 26 | 39 |
59 | 29 | 44 |
57 | 23 | 46 |
55 | 25 | 45 |
43 | 23 | 41 |
51 | 27 | 37 |
51 | 27 | 38 |
59 | 27 | 42 |
51 | 23 | 43 |
54 | 25 | 38 |
54 | 27 | 39 |
59 | 27 | 46 |
51 | 23 | 38 |
51 | 23 | 38 |
As you can see, two things have happened. First, all values have been rounded to the nearest integer (for display). Second, starting at row 21, the values are ones that were not present in the original dataframe above. Let’s see if the interpretations would be the same with this partially synthetic dataset.
r2_30n <- lm(Var1 ~ Var2 + Var3, data = syn_data_30n_df) #model
r2_30n_summary <- summary(r2_30n) #summarizes the model
R^2 = 0.54, F(2,27) = 15.58, p < 0.05
So, yes. The interpretations would be nearly identical. Of course, that’s only increasing the sample size by 10 participants (albeit 50% of the original sample). Now, let’s increase the sample size to 60 participants - an increase of 40 synthetic participants.
Here is some bodged [2] code that works:
#getting a subset of the data
set.seed(0004) #allows for reproducible values
data_60n1 <- sample_n(datagen, 10) #grabs ten rows
data_60n1_df <- data.frame(data_60n1) #puts ten rows into a dataframe
set.seed(0005) #allows for reproducible values
data_60n2 <- sample_n(datagen, 10) #grabs ten rows
data_60n2_df <- data.frame(data_60n2) #puts ten rows into a dataframe
set.seed(0006) #allows for reproducible values
data_60n3 <- sample_n(datagen, 10) #grabs ten rows
data_60n3_df <- data.frame(data_60n3) #puts ten rows into a dataframe
set.seed(0007) #allows for reproducible values
data_60n4 <- sample_n(datagen, 10) #grabs ten rows
data_60n4_df <- data.frame(data_60n4) #puts ten rows into a dataframe
#creating the synthetic data
syn_data_60n1 <- (syn(data_60n1_df, seed = 0008))$syn #creating the synthetic values
syn_data_60n2 <- (syn(data_60n2_df, seed = 0009))$syn
syn_data_60n3 <- (syn(data_60n3_df, seed = 0010))$syn
syn_data_60n4 <- (syn(data_60n4_df, seed = 0011))$syn
#putting them into a dataframe
syn_data_60n1 <- data.frame(syn_data_60n1) #putting them into a dataframe
syn_data_60n2 <- data.frame(syn_data_60n2)
syn_data_60n3 <- data.frame(syn_data_60n3)
syn_data_60n4 <- data.frame(syn_data_60n4)
#binding the synthetic datasets to the original dataset
syn_data_60n1_df <- rbind(datagen, syn_data_60n1) #combining both dataframes
syn_data_60n2_df <- rbind(syn_data_60n1_df, syn_data_60n2)
syn_data_60n3_df <- rbind(syn_data_60n2_df, syn_data_60n3)
syn_data_60n4_df <- rbind(syn_data_60n3_df, syn_data_60n4)
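As footnote [2] concedes, there is a less sloppy way. Here is a minimal sketch of the same sampling and synthesis in one loop; it uses the same seeds, so it should reproduce the same combined dataframe:

#tidier equivalent of the bodge above
syn_data_60n4_df <- datagen #start from the original data
for (i in 0:3) {
  set.seed(0004 + i) #sampling seeds 0004 through 0007, as above
  sampled_df <- data.frame(sample_n(datagen, 10)) #grabs ten rows
  synthed_df <- data.frame(syn(sampled_df, seed = 0008 + i)$syn) #synthesis seeds 0008 through 0011
  syn_data_60n4_df <- rbind(syn_data_60n4_df, synthed_df) #appends the synthetic rows
}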
Again, time for the model… I will spare you the code.
R^2 = 0.02, F(2,57) = 0.67, p > 0.05
For a quick recap, the original results were R^2 = 0.56, F(2,17) = 10.65, p < 0.05. It is clear that the interpretations one would make from the original model (i.e., the original dataset, N = 20) and the highly synthetic model (N = 60) are completely different. Therefore, the highly synthetic dataset is not valid. However, the partially synthetic dataset (N = 30) produced results similar to the original and is a valid dataset: R^2 = 0.54, F(2,27) = 15.58, p < 0.05.
One of the weird things about synthetic data is that, depending on the statistical properties within a dataset, one seed can produce a valid dataset while another does not. Therefore, I am curious whether it is computationally feasible to produce a valid, highly synthetic dataset.
So, let’s do some simulating [3].
#creating frames to place values
r2val_60n_frame <- data.frame(r2val = numeric(0)) #empty frame for the loop to place the R^2 values into
pval_60n_frame <- data.frame(pval = numeric(0)) #empty frame for the loop to place the p-values into
#setting counters and seeds
loop_60n <- 1
sampling_seed <- 2000
seed_60n <- 10000 #seed for the synthesis step; incremented once per loop
while(loop_60n < 2501){
#first ten
set.seed(sampling_seed) #seed for replicable sequences
data_60n1 <- sample_n(datagen, 10) #grabs ten from the original generated dataset
sampling_seed <- sampling_seed + 1 #adds 1 so the next sample uses a new seed
#second ten
set.seed(sampling_seed)
data_60n2 <- sample_n(datagen, 10)
sampling_seed <- sampling_seed + 1
#third ten
set.seed(sampling_seed)
data_60n3 <- sample_n(datagen, 10)
sampling_seed <- sampling_seed + 1
#fourth ten
set.seed(sampling_seed)
data_60n4 <- sample_n(datagen, 10)
sampling_seed <- sampling_seed + 1
#binding the data together
data_60n <- rbind(data_60n1, data_60n2) #binding the sets of sampled data together
data_60n <- rbind(data_60n, data_60n3)
data_60n <- rbind(data_60n, data_60n4)
data_60n_df <- data.frame(data_60n) #putting the data into a final dataframe
#synthesizing the data
syn_data_60n <- (syn(data_60n_df, seed = seed_60n))$syn #synthesizing
syn_data_60n <- data.frame(syn_data_60n) #dataframe
syn_data_60n_df <- rbind(datagen, syn_data_60n) #binding the original generated data with the synthetic data
#model
r2_60n <- lm(Var1 ~ Var2 + Var3, data = syn_data_60n_df) #model
r2_60n_summary <- summary(r2_60n)
r2val_60n_frame <- rbind(r2val_60n_frame, data.frame(r2val = r2_60n_summary$r.squared)) #binding the R^2 value to the dataframe
#getting the p-value for the overall model
p_60n <- pf(r2_60n_summary$fstatistic[1], #F value
            r2_60n_summary$fstatistic[2], #numerator df
            r2_60n_summary$fstatistic[3], #denominator df
            lower.tail = FALSE)
pval_60n_frame <- rbind(pval_60n_frame, data.frame(pval = p_60n)) #binding the p-value to the dataframe
#increment the synthesis seed and the loop counter
seed_60n <- seed_60n + 1 #the next loop synthesizes with a new seed
loop_60n <- loop_60n + 1
}
Now to filter out the invalid datasets. To do this, we need some criteria for what counts as valid and invalid. Here, I think the two most obvious are the p-value, since it is a dichotomous decision, and the R^2 value. Given the original R^2 value was 0.56, I will filter out any values which are not between 0.400 and 0.599.
#binding the two frames holding R^2 and p-values together
all_60n <- cbind(r2val_60n_frame, pval_60n_frame)
colnames(all_60n) <- c("r2_val_60n", "p_val_60n") #renames the columns
n60_filtered <- all_60n %>% filter(p_val_60n < 0.050 & r2_val_60n >= 0.400 & r2_val_60n <= 0.599) #keeps only datasets whose p-value and R^2 would support the same inference (i.e., effect size) as the original
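nrow(n60_filtered) #counts the remaining valid datasets (presumably how the total below was obtained)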
From a total of 2,500 ‘simulations’, the final count of valid datasets is 1,992.
So… it is possible to produce a valid dataset even when such a significant share of the data within it is synthetic [4]. It is just a computational cost, and one that isn’t that severe. It is also possible to know which seed produced the valid dataset [5], so wrapping something like this in a while() loop wouldn’t be a bad idea if you really needed a synthetic dataset for your project.
[1] An enormous thank you to Dan Quintana for introducing me to this topic and for an excellent primer on the topic.
[2] The Art of the Bodge. TL;DW: There is a less sloppy way to make this work, but it works.
[3] I’ve never quite felt comfortable calling this process ‘simulation’ but I do not know a better term for it. Don’t get upset please.
[4] I did not visualize the distribution of values for this project. Therefore, it is assumed that the distributions would look the same. To have a definitive claim that a synthetic dataset is valid, the distribution of values must look the same compared to the original dataset.
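For what it’s worth, if you keep the whole object that syn() returns rather than immediately taking its $syn element, synthpop’s compare() function will tabulate and plot the observed versus synthetic distributions. A minimal sketch, using the partially synthetic example from earlier:

#distributional check: observed vs synthetic values
syn_obj_30n <- syn(data_30n_df, seed = 0003) #keep the full synds object this time
compare(syn_obj_30n, data_30n_df) #tables/plots comparing the two distributions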
[5] Maybe for another post. Not today.