Releasing aggregated data is always a challenge, especially when multiple cross tabulations are provided. Analysis has shown that in some instances de-identified data can still be matched back to the original data owner. This is where differential privacy comes in: it provides a mathematical methodology for adding noise to aggregate figures in order to better protect privacy.
Epsilon is the privacy budget. The more sensitive the data, the smaller the value of epsilon should be; if the item is not sensitive, then the value can be larger.
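One consequence of treating epsilon as a budget (under basic sequential composition, a standard differential privacy result) is that the epsilons spent on repeated queries against the same data add up. A minimal sketch with hypothetical per-query budgets:

```r
# Sequential composition: epsilons spent on the same data add up
eps_per_query <- c(0.1, 0.1, 0.3) # hypothetical budgets for three releases
total_spent <- sum(eps_per_query)
total_spent
```

So if the total budget were 0.5, these three releases would exhaust it.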
Global sensitivity is how much a statistic can change, in L1 distance, when one person is removed from the data. For some common statistics:

Counts: \(gs = 1\)

Proportions: \(gs = \frac{1}{n}\)

Means: \(gs = \frac{b-a}{n}\)

Variances: \(gs = \frac{(b-a)^2}{n}\)

Where n is the total number of observations, a is the lower bound, and b is the upper bound of the data.
If global sensitivity is larger or the budget is lower, then more noise is added. If global sensitivity is smaller or the budget is higher, then less noise is added.
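A quick way to see this relationship: the scale of the Laplace noise is gs / eps, so the amount of noise grows with sensitivity and shrinks with budget. A small base R sketch:

```r
# Laplace noise scale = gs / eps
# Larger sensitivity or smaller budget -> larger scale -> more noise
gs  <- c(1, 1, 0.1, 0.1)
eps <- c(0.1, 1, 0.1, 1)
data.frame(gs, eps, noise_scale = gs / eps)
```

The first row (high sensitivity, tight budget) gets one hundred times the noise scale of the last row.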
Suppose that we work for the Imperial Census Bureau. While we may not agree with the Empire, it is important for what remains of the Galactic Senate to have proper representation. Thus we have our data:
library(tidyverse)
starwars
Let’s also say that we have some sensitive data (aka allegiance to the Rebel cause)
set.seed(5251977)
(starwars <- starwars %>%
  mutate(allegiance = ifelse(rbinom(nrow(.), 1, prob = .3) > 0, 1, 0)))
Now let’s say we want to look at the height of the different species by allegiance. Perhaps the Empire is setting up a height screening tool to try to detect and re-identify beings. We certainly don’t want to give away too much data, so let’s test the aggregate.
So the true value would be
(starwars %>%
  group_by(allegiance) %>%
  summarise(mu_height = mean(height, na.rm = TRUE)) -> true_height)
n <- nrow(starwars)
a <- min(starwars$height, na.rm = TRUE) # lower bound
b <- max(starwars$height, na.rm = TRUE) # upper bound
gs.mean <- (b - a) / n
eps <- 1
Thus the simulated value would be:
library(smoothmest) # provides rdoublex(), a double exponential (Laplace) sampler
(reported <- rdoublex(1, true_height$mu_height, gs.mean / eps))
## [1] 172.2229 163.8062
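For intuition, rdoublex() draws from a double exponential (Laplace) distribution centered at the true value. A minimal base R equivalent via inverse-CDF sampling might look like the sketch below (rlaplace is a hypothetical helper, not part of any package used here):

```r
# Minimal Laplace sampler: mu = location, lambda = scale (gs / eps)
rlaplace <- function(n, mu = 0, lambda = 1) {
  u <- runif(n, -0.5, 0.5)
  mu - lambda * sign(u) * log(1 - 2 * abs(u))
}

set.seed(42)
rlaplace(2, mu = 172, lambda = 0.02)
```

With a scale of gs.mean / eps the draws land very close to the true means, which is why the reported values above are barely perturbed.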
So we can look at the results side by side
tibble(truth = true_height$mu_height, reported = reported)
Now say we wanted to look at the proportion of characters who are Human and allied with the Rebellion. We could calculate the true value by looking at the following:
(starwars %>%
  count(species, allegiance) %>%
  mutate(perc = n / sum(n)) %>%
  filter(allegiance == 1, species == "Human") %>%
  pull(perc) -> perc_support)
## [1] 0.1609195
Ok, so only 16.1%. This might be identifiable…
# Number of Observations
n <- nrow(filter(starwars, species == "Human"))
Now we set our privacy budget:
# Set Value of Epsilon
eps <- 0.1
Now we can set the global sensitivity
# GS of Proportion
gs.prop <- 1 / n
Now the differential privacy number that we would report would be:
# Noisy Value
(reported <- rdoublex(1, perc_support, gs.prop / eps) %>% max(0))
## [1] 0.3131407
And looking at the two values side by side:
tibble(truth = perc_support, reported = reported)
Now we will look at those people with blue eyes:
starwars %>%
count(eye_color) %>%
filter(eye_color == "blue") %>%
pull(n)
## [1] 19
So the true answer is 19. Let’s add our differential privacy.
tibble(true = 19, eps = c(0.01, .1, 1, 10, 100), gs.count = 1) %>%
  rowwise() %>%
  mutate(noisy_value = round(rdoublex(1, 19, gs.count / eps)))
Thus you can see the effect that DP has on the reported noisy value.
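One way to quantify that effect: the standard deviation of Laplace noise with scale gs / eps is sqrt(2) * gs / eps, so the expected spread of the reported count can be tabulated directly for each budget:

```r
# Theoretical spread of the noisy count at each privacy budget (gs = 1)
eps <- c(0.01, 0.1, 1, 10, 100)
data.frame(eps, noise_sd = sqrt(2) * 1 / eps)
```

At eps = 0.01 the reported count can swing by more than a hundred, while at eps = 100 the noise is negligible.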
Copyright © 2018 Michael DeWitt. All rights reserved.