TL;DR

Is it possible to increase participant privacy while retaining statistical control with highly synthetic data injection?

Yep.

Synthetic Data

Synthetic data is cool, and I like it [1]. It provides an interesting new way to increase participant privacy while still having an open dataset.

As always, what is synthetic data? Synthetic data is data generated from an original data source which, if valid, has the same distributional and internal statistical properties as the original dataset. It is incredibly useful for further data exploration while retaining participant privacy.

Given this, we can take the original data, create a synthetic counterpart, and then produce a partially synthetic dataset. This isn’t all that new. However, the scenario below opens up an area to explore: is it possible to significantly increase the sample size to increase privacy/noise?

First and foremost, it has to be stated that this approach is not turnkey ready for analysis and publication. That is, someone cannot take the dataset, analyze it, and write up the results for any form of publication or communication. At a minimum, it will have a synthetically inflated sample size. There is also the obvious “these data are not real” issue.

A Scenario Necessitating High Privacy

You’re a researcher in sexual health in an extremely socially conservative area. Recently, an undergraduate student came to you and proposed a study focusing on teenaged, closeted transgender students and their sexual behaviors. You find the idea interesting and approve the study.

There are some immediate things to note with this scenario:

These five points are significant barriers to even wanting to do open data. I support open data and all open-science tenets, but I would be hesitant to share even this data by itself. If someone in that socially conservative area were to find the data and I had not done a perfect job of anonymizing it, the result could be unthinkable harm to the participants. For the purposes of this post, however, let us say there is a reason open data is needed.

Generating the Data

Let’s generate some data in R to act as data from the aforementioned scenario:

#required libraries
suppressMessages(library(tidyverse)) #data wrangling
suppressMessages(library(synthpop)) #synthetic data generation
suppressMessages(library(faux)) #generating correlated data
suppressMessages(library(knitr)) #tables
suppressMessages(library(rmarkdown)) #rmarkdown

#dataset generation
set.seed(0001) #allows for reproducible values
datagen <- rnorm_multi(n = 20, #number of participants
                       mu = c(50, 25, 40), #averages
                       sd = c(10, 3, 5), #sds
                       r = c(.5, .7, .2), #correlations among the variables
                       varnames = c("Var1", "Var2", "Var3"), #var names
                       empirical = FALSE)

kable(datagen, align = "c") #reports the data
Var1 Var2 Var3
42.78856 23.02010 40.51284
50.90882 24.63125 43.30836
42.08004 22.15918 36.40365
68.11645 29.22146 39.64735
52.23135 26.01547 43.74534
41.44597 25.34748 37.24683
55.22734 25.19673 41.10612
59.24389 26.70389 37.73019
56.10539 26.82382 40.86755
47.13634 22.04843 39.34886
63.96932 24.37960 49.66113
53.57427 27.03423 41.71214
43.67821 22.76090 38.48796
27.32191 24.14119 32.20889
63.42457 25.73888 38.95358
51.23558 21.37306 36.89340
49.97984 26.34671 39.04488
58.75513 28.71410 44.35432
57.46980 23.36980 46.05356
55.07865 25.02611 44.74000

Here are twenty hypothetical participants across three variables. For whatever reason, you choose to look at the relationship between the three variables.

r2_og <- lm(Var1 ~ Var2 + Var3, data = datagen) #model
r2_og_summary <- summary(r2_og) #summarizes the model

And the results are…

R^2 = 0.56, F(2,17) = 10.65, p < 0.05
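
For reference, all of those numbers come straight out of the summary object; here is a minimal sketch of one way to pull them out (the same pf() approach the simulation loop below uses for the p-value):

r2_og_summary$r.squared #R^2 for the model
r2_og_summary$fstatistic #F value plus the numerator and denominator df
pf(r2_og_summary$fstatistic[[1]], #p-value for the overall model
   r2_og_summary$fstatistic[[2]],
   r2_og_summary$fstatistic[[3]],
   lower.tail = FALSE)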

Injecting Data - Partially Synthetic

So, we have our outcome. Now, how do we increase the sample size? The synthpop package makes generating synthetic data incredibly easy. For this example, we will increase the sample size by 10 participants (total N = 30).

#getting a subset of the data
set.seed(0002) #allows for reproducible values
data_30n <- sample_n(datagen, 10) #grabs ten rows
data_30n_df <- data.frame(data_30n) #puts ten rows into a dataframe

#creating the synthetic data
syn_data_30n <- (syn(data_30n_df, seed = 0003))$syn #creating the synthetic values
syn_data_30n <- data.frame(syn_data_30n) #putting them into a dataframe
syn_data_30n_df <- rbind(datagen, syn_data_30n) #combining both dataframes
Var1 Var2 Var3
43 23 41
51 25 43
42 22 36
68 29 40
52 26 44
41 25 37
55 25 41
59 27 38
56 27 41
47 22 39
64 24 50
54 27 42
44 23 38
27 24 32
63 26 39
51 21 37
50 26 39
59 29 44
57 23 46
55 25 45
43 23 41
51 27 37
51 27 38
59 27 42
51 23 43
54 25 38
54 27 39
59 27 46
51 23 38
51 23 38

As you can see, two things have happened. First, all values have been rounded to whole numbers. Second, the values from row 21 onward were not present in the original dataframe shown above. Let’s see if the interpretations would be the same with this partially synthetic dataset.

r2_30n <- lm(Var1 ~ Var2 + Var3, data = syn_data_30n_df) #model
r2_30n_summary <- summary(r2_30n) #summarizes the model

R^2 = 0.54, F(2,27) = 15.58, p < 0.05

So, yes. The interpretations would be nearly identical. Of course, that’s only increasing the sample size by 10 participants (albeit 50% of the original sample). Now, let’s increase the sample size to 60 participants - an increase of 40 synthetic participants.
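
As a quick aside that ties into note [4] below: synthpop also ships a compare() function for checking whether the synthetic values track the original distributions. A minimal sketch, assuming we keep the whole object returned by syn() rather than only its $syn element:

#distributional check on the partially synthetic values
syn_obj_30n <- syn(data_30n_df, seed = 0003) #same synthesis as above, but keeping the full synds object
compare(syn_obj_30n, data_30n_df) #compares synthetic vs. observed distributions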

Injecting Data - Highly Synthetic

Here is some bodged [2] code that works:

#getting a subset of the data
set.seed(0004) #allows for reproducible values
data_60n1 <- sample_n(datagen, 10) #grabs ten rows
data_60n1_df <- data.frame(data_60n1) #puts ten rows into a dataframe

set.seed(0005) #allows for reproducible values
data_60n2 <- sample_n(datagen, 10) #grabs ten rows
data_60n2_df <- data.frame(data_60n2) #puts ten rows into a dataframe

set.seed(0006) #allows for reproducible values
data_60n3 <- sample_n(datagen, 10) #grabs ten rows
data_60n3_df <- data.frame(data_60n3) #puts ten rows into a dataframe

set.seed(0007) #allows for reproducible values
data_60n4 <- sample_n(datagen, 10) #grabs ten rows
data_60n4_df <- data.frame(data_60n4) #puts ten rows into a dataframe

#creating the synthetic data
syn_data_60n1 <- (syn(data_60n1_df, seed = 0008))$syn #creating the synthetic values
syn_data_60n2 <- (syn(data_60n2_df, seed = 0009))$syn
syn_data_60n3 <- (syn(data_60n3_df, seed = 0010))$syn
syn_data_60n4 <- (syn(data_60n4_df, seed = 0011))$syn

#putting them into a dataframe
syn_data_60n1 <- data.frame(syn_data_60n1) #putting them into a dataframe
syn_data_60n2 <- data.frame(syn_data_60n2)
syn_data_60n3 <- data.frame(syn_data_60n3)
syn_data_60n4 <- data.frame(syn_data_60n4)

#binding the synthetic datasets to the original dataset
syn_data_60n1_df <- rbind(datagen, syn_data_60n1) #combining both dataframes
syn_data_60n2_df <- rbind(syn_data_60n1_df, syn_data_60n2)
syn_data_60n3_df <- rbind(syn_data_60n2_df, syn_data_60n3)
syn_data_60n4_df <- rbind(syn_data_60n3_df, syn_data_60n4)
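
For the curious, a less sloppy sketch of the same idea (each sampling and synthesis step still gets its own explicit seed, so it should land on the same 60 rows as syn_data_60n4_df above, but treat it as a sketch rather than the code actually used):

#sample ten rows, synthesize them, repeat four times, then stack everything
syn_chunks <- lapply(1:4, function(i) {
  set.seed(0003 + i) #sampling seeds 0004-0007, as above
  chunk_df <- data.frame(sample_n(datagen, 10)) #ten rows from the original data
  data.frame(syn(chunk_df, seed = 0007 + i)$syn) #synthetic counterpart (syn seeds 0008-0011)
})
syn_data_60n_tidy <- rbind(datagen, do.call(rbind, syn_chunks)) #original 20 rows + 40 synthetic rows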

Again, it is time for the model… I will spare you the code.

R^2 = 0.02, F(2,57) = 0.67, p > 0.05

As a quick recap, the original results were R^2 = 0.56, F(2,17) = 10.65, p < 0.05. It is clear that the interpretations one would make from the original model (i.e., the original dataset, N = 20) and from the highly synthetic model (N = 60) are completely different. Therefore, the highly synthetic dataset is not valid. However, the partially synthetic dataset (N = 30) produced results similar to the original and is a valid dataset: R^2 = 0.54, F(2,27) = 15.58, p < 0.05.

“Simulation” Time

One of the weird things about synthetic data is that, depending on the statistical properties within a dataset, one seed can produce a valid dataset and another will not. Therefore, I am curious whether it is computationally feasible to produce a valid, highly synthetic dataset.

So, let’s do some simulating [3].

#creating frames to place values
r2val_60n_frame <- data.frame(r2val = numeric(0)) #creates a frame for the loop to place the r-squared values into
pval_60n_frame <- data.frame(pval = numeric(0)) #creates a frame for the loop to place the p-values into

#setting seeds
loop_60n <- 1
sampling_seed <- 2000

while(loop_60n < 2501){
  
    #first ten
    set.seed(sampling_seed) #seed for replicable sequences
    data_60n1 <- sample_n(datagen, 10) #grabs ten from the original generated dataset
      sampling_seed <- sampling_seed + 1 #adds 1 to the sampling seed so the next sample draws different rows
    #second ten
    set.seed(sampling_seed) 
    data_60n2 <- sample_n(datagen, 10)
      sampling_seed <- sampling_seed + 1
    #third ten
    set.seed(sampling_seed)
    data_60n3 <- sample_n(datagen, 10)
      sampling_seed <- sampling_seed + 1
    #fourth ten
    set.seed(sampling_seed)
    data_60n4 <- sample_n(datagen, 10)
      sampling_seed <- sampling_seed + 1
      
    #binding the data together
    data_60n <- rbind(data_60n1, data_60n2) #binding the sets of sampled data together
    data_60n <- rbind(data_60n, data_60n3)
    data_60n <- rbind(data_60n, data_60n4)
    data_60n_df <- data.frame(data_60n) #putting the data into a final dataframe
    
    #setting the synthetic data seed (tied to the loop counter so each synthesis uses a fresh seed)
    seed_60n <- 10000 + (loop_60n - 1)
    
    #synthesizing the data
    syn_data_60n <- (syn(data_60n_df, seed = seed_60n))$syn #synthesizing
    syn_data_60n <- data.frame(syn_data_60n) #dataframe
    syn_data_60n_df <- rbind(datagen, syn_data_60n) #binding the original generated data with the synthetic data

    #model
    r2_60n <- lm(Var1 ~ Var2 + Var3, data = syn_data_60n_df) #model
    r2_60n_summary <- summary(r2_60n)
    r2val_60n_frame <- rbind(r2val_60n_frame, r2_60n_summary$r.squared) #binding the R^2 value to the dataframe
    
    #getting the p-value for the overall model
    p_60n <- pf(r2_60n_summary$fstatistic[[1]], 
                r2_60n_summary$fstatistic[[2]], 
                r2_60n_summary$fstatistic[[3]], 
                lower.tail = FALSE)
    
    pval_60n_frame <- rbind(pval_60n_frame, p_60n) #binding the p-values to the dataframe
    
    #incrementing the seed and the loop counter
    seed_60n <- seed_60n + 1 #adds one to the seed used to synthesize the data
    loop_60n <- loop_60n + 1
}
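
One practical note if you run this yourself: every call to syn() prints a short message about the synthesis, which gets noisy across 2,500 iterations. syn() has a print.flag argument that silences it:

syn_data_60n <- (syn(data_60n_df, seed = seed_60n, print.flag = FALSE))$syn #same synthesis as in the loop, minus the console output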

Now to filter out invalid datasets. To do this, we need to decide what counts as valid and invalid. Here, I think the two most obvious criteria are the p-value, as it is a dichotomous decision, and the R^2 value. Given the original R^2 value was 0.56, I will filter out any values which are not between 0.400 and 0.599.

#binding the two frames holding R^2 and p-values together
all_60n <- cbind(r2val_60n_frame, pval_60n_frame)

colnames(all_60n)[1] <- c("r2_val_60n") #renames the column
colnames(all_60n)[2] <- c("p_val_60n") #renames the column

n60_filtered <- all_60n %>% filter(p_val_60n < 0.050 & r2_val_60n >= 0.400 & r2_val_60n <= 0.599) #filters OUT r-squared values which would change the inference (i.e., effect size) one would make

From a total of 2,500 ‘simulations’, the final count of valid datasets is 1,992.
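
(That count is simply the number of rows that survive the filter:)

nrow(n60_filtered) #number of ‘simulated’ synthetic datasets retained as valid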

So… it is possible to produce a valid dataset with this high a proportion of synthetic data in it [4]. It just comes at a computational cost, and one that isn’t that severe. It is also possible to know which seed produced a valid dataset [5], so a while() loop like this wouldn’t be a bad idea if you really needed a highly synthetic dataset for your project.
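
For what it’s worth, a minimal sketch of how that seed tracking could look (the frame and column names here are mine, not from the loop above): initialize one results frame before the loop, then record the seeds next to each model result inside it, in place of the two rbind() calls that build r2val_60n_frame and pval_60n_frame.

#before the loop
results_60n <- data.frame()

#inside the loop, after p_60n is computed
results_60n <- rbind(results_60n,
                     data.frame(samp_seed = sampling_seed - 4, #first of the four sampling seeds used this iteration
                                syn_seed = seed_60n, #seed handed to syn()
                                r2_val_60n = r2_60n_summary$r.squared,
                                p_val_60n = p_60n))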

Notes

[1] An enormous thank you to Dan Quintana for introducing me to this topic and for an excellent primer on the topic.

[2] The Art of the Bodge. TL;DW: There is a less sloppy way to make this work, but it works.

[3] I’ve never quite felt comfortable calling this process ‘simulation’, but I do not know a better term for it. Please don’t get upset.

[4] I did not visualize the distribution of values for this project. Therefore, it is assumed that the distributions would look the same. To make a definitive claim that a synthetic dataset is valid, its distribution of values must look the same as the original dataset’s.

[5] Maybe for another post. Not today.