Using functional programming to efficiently compute and structure summary statistics from survey data
One of the things I often do in my work is compute the weighted mean and confidence intervals of indicators constructed from probability sampled survey data for the overall population and for sub-groups of the population defined by certain characteristics such as the gender, age or socio-economic status of the respondent. In this post I’ll show one of the workflows I use to generate sample statistics from survey data and have those results structured in a way that makes them easy to visualize later (ie. with ggplot).
The general approach is as follows:
- Create descriptive labels of the indicators (numeric variables) that you want to estimate the means for and store these in a named vector.
- Create descriptive labels of the population groups (categorical variables) for which you want estimates and store these in a named vector.
- Create a custom function that computes the weighted mean (and potentially any other summary statistic) of a single variable given a primary sampling unit, probability weight and stratum.
- Apply the custom function from (3) to each of the unique combination of indicators and grouping variables in your dataset defined in (1) and (2).
This post assumes you have some handle on the data manipulation and transformation techniques facilitated by the family of functions in the dplyr
package. If not, the R for Data Science book/ online resource is an excellent place to start.
I’ll begin using a simple example using the mtcars dataset and then show another example in cases where you have data from a probability sampling procedure (as is often the case with household surveys.)
1. Simple example
##-------- STEP 0: Data preparation
## Ensuring categorical grouping variables are factors with descriptive labels
data <- mtcars %>%
mutate(
fullsample = factor("All cars"),
cyl = factor(cyl, levels = c(4, 6, 8), labels = c("4 cylinders", "6 cylinders", "8 cylinders")),
am = factor(am, levels = c(0, 1), labels = c("Automatic transmission", "Manual transmission"))
)
##-------- STEP 1-2
## Named vector of all the numeric variables
## we want the sample means for:
indicators <- c("mpg" = "Miles per gallon",
"hp" = "Gross horsepower",
"disp" = "Displacement (cu. in.)",
"wt" = "Weight (1,000 lbs)")
# Named vector of all the categorical variables
# containing population sub-groups we want sample means for:
groups <- c("fullsample" = "All cars",
"cyl" = "Number of cylinders",
"am" = "Type of transmission")
The code section below creates a function called svy_summary
which requires 5 inputs: data
(your dataset), i
(a string - the column name of an indicator in data), g
(a string - the column name of a grouping variable in data), iref
(a lookup vector with the descriptive names of the indicator), gref
(a lookup vector with the descriptive names of the grouping variable). In the code for this function you will see some functions and operators that you may not have seen before, specifically sym
, as_string
and !!
. These are all part of the rlang
library which provides a set of tools for tidy evaluation - to help you write functions using tidyverse pipelines and grammars. The svy_summary
function returns a data frame with the estimates for the mean and the 95% confidence interval for the indicators and groups specified, as well as descriptive information. The function assumes that there is a variable called “fullsample” to accomodate the case where you want to estimate the mean and confidence interval of an indicator for the population as a whole.
#----------STEP 3:
# Creating function to compute mean & standard deviation (no sampling weights or complex survey design)
#given an indicator and grouping variable
# INPUTS:
# data --> dnenotes data frame you are using
# i --> indicator(s)
# g --> population sub-group(s)
# iref --> named vector which provides labels for all of your indicators (variables)
# gref --> named vector which provides labels for all of the variables that contain your sub-groups
svy_summary <- function(data, i, g, iref, gref) {
i <- sym(i)
g <- sym(g)
# Computing mean and 95% confidence interval,
# adding a name for the indicator,
# and group names for group variable
if (as_string(g) %in% c("fullsample")) {
results <- data %>%
summarize(
mean = mean(!!i, na.rm=TRUE),
sd = sd(!!i, na.rm = TRUE),
se = sd/sqrt(length(!!i))
) %>%
mutate(
group_cat_val = gref[as_string(g)]
)
}
else {
results <- data %>%
group_by(!!g) %>%
summarize(
mean = mean(!!i, na.rm=TRUE),
sd = sd(!!i, na.rm = TRUE),
se = sd/sqrt(length(!!i))
) %>%
mutate(
group_cat_val = !!g
)
}
# Computing lower & upper 95%, adding descriptive info
results <- results %>% mutate(
lci = mean - qnorm(0.975)*se,
uci = mean + qnorm(0.975)*se,
indicator = as_string(i),
indicator_name = iref[as_string(i)],
group = as_string(g),
group_name = gref[as_string(g)]
) %>%
dplyr::select(
indicator, indicator_name, group, group_name, group_cat_val, mean, lci, uci
)
return(results)
}
#---------- STEP 4:
# Apply the function from step 3 to each combination of indicator and grouping variable
combinations <- expand.grid(names(indicators), names(groups), stringsAsFactors = FALSE)
# Indicators
is <- combinations[[1]]
# Groups
gs <- combinations[[2]]
results <- dplyr::bind_rows(
map2(is, gs, svy_summary, data = data, iref = indicators, gref = groups)
)
results %>%
kable() %>%
kable_styling(bootstrap_options = "responsive", font_size = 10) %>%
scroll_box(height = "250px")
indicator | indicator_name | group | group_name | group_cat_val | mean | lci | uci |
---|---|---|---|---|---|---|---|
mpg | Miles per gallon | fullsample | All cars | All cars | 20.090625 | 18.002432 | 22.178818 |
hp | Gross horsepower | fullsample | All cars | All cars | 146.687500 | 122.932115 | 170.442885 |
disp | Displacement (cu. in.) | fullsample | All cars | All cars | 230.721875 | 187.780098 | 273.663652 |
wt | Weight (1,000 lbs) | fullsample | All cars | All cars | 3.217250 | 2.878238 | 3.556262 |
mpg | Miles per gallon | cyl | Number of cylinders | 4 cylinders | 26.663636 | 23.998548 | 29.328725 |
mpg | Miles per gallon | cyl | Number of cylinders | 6 cylinders | 19.742857 | 18.666059 | 20.819655 |
mpg | Miles per gallon | cyl | Number of cylinders | 8 cylinders | 15.100000 | 13.758990 | 16.441011 |
hp | Gross horsepower | cyl | Number of cylinders | 4 cylinders | 82.636364 | 70.265074 | 95.007653 |
hp | Gross horsepower | cyl | Number of cylinders | 6 cylinders | 122.285714 | 104.313621 | 140.257807 |
hp | Gross horsepower | cyl | Number of cylinders | 8 cylinders | 209.214286 | 182.511451 | 235.917121 |
disp | Displacement (cu. in.) | cyl | Number of cylinders | 4 cylinders | 105.136364 | 89.256558 | 121.016169 |
disp | Displacement (cu. in.) | cyl | Number of cylinders | 6 cylinders | 183.314286 | 152.524950 | 214.103621 |
disp | Displacement (cu. in.) | cyl | Number of cylinders | 8 cylinders | 353.100000 | 317.599862 | 388.600138 |
wt | Weight (1,000 lbs) | cyl | Number of cylinders | 4 cylinders | 2.285727 | 1.949143 | 2.622312 |
wt | Weight (1,000 lbs) | cyl | Number of cylinders | 6 cylinders | 3.117143 | 2.853163 | 3.381123 |
wt | Weight (1,000 lbs) | cyl | Number of cylinders | 8 cylinders | 3.999214 | 3.601421 | 4.397007 |
mpg | Miles per gallon | am | Type of transmission | Automatic transmission | 17.147368 | 15.423439 | 18.871298 |
mpg | Miles per gallon | am | Type of transmission | Manual transmission | 24.392308 | 21.040220 | 27.744396 |
hp | Gross horsepower | am | Type of transmission | Automatic transmission | 160.263158 | 136.023522 | 184.502793 |
hp | Gross horsepower | am | Type of transmission | Manual transmission | 126.846154 | 81.150194 | 172.542114 |
disp | Displacement (cu. in.) | am | Type of transmission | Automatic transmission | 290.378947 | 240.840644 | 339.917251 |
disp | Displacement (cu. in.) | am | Type of transmission | Manual transmission | 143.530769 | 96.127012 | 190.934526 |
wt | Weight (1,000 lbs) | am | Type of transmission | Automatic transmission | 3.768895 | 3.419339 | 4.118450 |
wt | Weight (1,000 lbs) | am | Type of transmission | Manual transmission | 2.411000 | 2.075611 | 2.746389 |
2. Example with sampling weights
If you have data from a more complex sampling procedure, you can use the set of functions in the srvyr package to first define the survey design using as_survey_design
.
#-------- STEP 0: Data preparation
# Adding two columns, one for survey weights, another for the primary sampling unit
data <- data %>%
mutate(
probweights = runif(n(), min = 0, max = 1),
psu = rep(1:3, length.out = n()),
strata = 1
)
#---------- STEP 3:
# Creating function to compute mean & standard deviation
# (for a data set with sampling weights or complex survey design)
# given an indicator and grouping variable
# INPUTS:
# same as before, except now you need:
# w = survey weights
# psu = primary sampling units
# strata = survey strata (if applicable)
svy_summary_weights <- function(data, i, g, psu, strata, w, iref, gref) {
i <- sym(i)
g <- sym(g)
w <- sym(w)
psu <- sym(psu)
strata <- sym(strata)
# Defining survey design
svydesign <- data %>%
as_survey_design(ids = !!psu, weight = !!w, strata = !!strata, variables = c(!!i, !!g))
# Computing mean and 95% confidence interval, adding a name fot the indicator, and group namees for group variable
if (as_string(g) %in% c("fullsample")) {
results <- svydesign %>%
summarise(
mean = survey_mean(!!i, vartype = "ci", na.rm=TRUE)
) %>%
mutate(
group_cat_val = gref[as_string(g)]
)
}
else {
results <- svydesign %>%
group_by(!!g) %>%
summarise(
mean = survey_mean(!!i, vartype = "ci", na.rm=TRUE)
) %>%
mutate(
group_cat_val = !!g
)
}
results <- results %>%
rename(lci = mean_low, uci = mean_upp) %>%
mutate(
indicator = as_string(i),
indicator_name = iref[as_string(i)],
group = as_string(g),
group_name = gref[as_string(g)]
) %>%
dplyr::select(
indicator, indicator_name, group, group_name, group_cat_val, mean, lci, uci
)
}
#---------- STEP 4:
# Apply the function from step 3 to each combination of indicator and grouping variable
combinations <- expand.grid(names(indicators), names(groups), stringsAsFactors = FALSE)
# Indicators
is <- combinations[[1]]
# Groups
gs <- combinations[[2]]
results <- dplyr::bind_rows(
map2(is,
gs,
svy_summary_weights,
data = data,
iref = indicators,
gref = groups,
psu = "psu",
strata = "strata",
w = "probweights")
)
results %>%
kable() %>%
kable_styling(bootstrap_options = "responsive", font_size = 10) %>%
scroll_box(height = "250px")
indicator | indicator_name | group | group_name | group_cat_val | mean | lci | uci |
---|---|---|---|---|---|---|---|
mpg | Miles per gallon | fullsample | All cars | All cars | 20.519877 | 19.194936 | 21.844819 |
hp | Gross horsepower | fullsample | All cars | All cars | 144.172653 | 106.588444 | 181.756863 |
disp | Displacement (cu. in.) | fullsample | All cars | All cars | 232.651629 | 161.369019 | 303.934239 |
wt | Weight (1,000 lbs) | fullsample | All cars | All cars | 3.171457 | 2.900692 | 3.442222 |
mpg | Miles per gallon | cyl | Number of cylinders | 4 cylinders | 27.407213 | 19.887306 | 34.927120 |
mpg | Miles per gallon | cyl | Number of cylinders | 6 cylinders | 19.893404 | 16.789258 | 22.997549 |
mpg | Miles per gallon | cyl | Number of cylinders | 8 cylinders | 15.707861 | 12.496165 | 18.919557 |
hp | Gross horsepower | cyl | Number of cylinders | 4 cylinders | 79.475175 | 52.443331 | 106.507018 |
hp | Gross horsepower | cyl | Number of cylinders | 6 cylinders | 115.139199 | 105.551340 | 124.727058 |
hp | Gross horsepower | cyl | Number of cylinders | 8 cylinders | 206.617943 | 197.385097 | 215.850789 |
disp | Displacement (cu. in.) | cyl | Number of cylinders | 4 cylinders | 102.081026 | 64.282096 | 139.879955 |
disp | Displacement (cu. in.) | cyl | Number of cylinders | 6 cylinders | 192.974918 | 107.161288 | 278.788549 |
disp | Displacement (cu. in.) | cyl | Number of cylinders | 8 cylinders | 349.335315 | 299.363656 | 399.306975 |
wt | Weight (1,000 lbs) | cyl | Number of cylinders | 4 cylinders | 2.256888 | 1.127330 | 3.386446 |
wt | Weight (1,000 lbs) | cyl | Number of cylinders | 6 cylinders | 3.152953 | 2.916478 | 3.389429 |
wt | Weight (1,000 lbs) | cyl | Number of cylinders | 8 cylinders | 3.860660 | 3.145283 | 4.576038 |
mpg | Miles per gallon | am | Type of transmission | Automatic transmission | 17.667195 | 17.534491 | 17.799899 |
mpg | Miles per gallon | am | Type of transmission | Manual transmission | 25.567401 | 21.238455 | 29.896346 |
hp | Gross horsepower | am | Type of transmission | Automatic transmission | 160.298744 | 127.109391 | 193.488097 |
hp | Gross horsepower | am | Type of transmission | Manual transmission | 115.639223 | 53.735393 | 177.543054 |
disp | Displacement (cu. in.) | am | Type of transmission | Automatic transmission | 285.307443 | 211.600040 | 359.014845 |
disp | Displacement (cu. in.) | am | Type of transmission | Manual transmission | 139.482676 | 38.591627 | 240.373726 |
wt | Weight (1,000 lbs) | am | Type of transmission | Automatic transmission | 3.647227 | 3.421634 | 3.872821 |
wt | Weight (1,000 lbs) | am | Type of transmission | Manual transmission | 2.329631 | 1.680587 | 2.978674 |
You can easily create a more publication friendly of this table (with a subset of the results):
makeci <- function(m, l, u) {
paste0(round(m, 1), " (", round(l, 1), " - ", round(u, 1), ")")
}
results %>%
mutate(meanci = makeci(mean, lci, uci)) %>%
filter(indicator_name %in% c("Miles per gallon", "Gross horsepower")) %>%
arrange(indicator_name, group_name, group_cat_val) %>%
select(indicator_name, group_name, group_cat_val, meanci) %>%
rename(`Performance Indicator` = indicator_name, Feature = group_name, Type = group_cat_val,`Mean (95% Confidence Interval)` = meanci) %>%
kable() %>%
kable_styling(bootstrap_options = "responsive", font_size = 10) %>%
collapse_rows(columns = 1:2, valign = "top")
Performance Indicator | Feature | Type | Mean (95% Confidence Interval) |
---|---|---|---|
Gross horsepower | All cars | All cars | 144.2 (106.6 - 181.8) |
Number of cylinders | 4 cylinders | 79.5 (52.4 - 106.5) | |
6 cylinders | 115.1 (105.6 - 124.7) | ||
8 cylinders | 206.6 (197.4 - 215.9) | ||
Type of transmission | Automatic transmission | 160.3 (127.1 - 193.5) | |
Manual transmission | 115.6 (53.7 - 177.5) | ||
Miles per gallon | All cars | All cars | 20.5 (19.2 - 21.8) |
Number of cylinders | 4 cylinders | 27.4 (19.9 - 34.9) | |
6 cylinders | 19.9 (16.8 - 23) | ||
8 cylinders | 15.7 (12.5 - 18.9) | ||
Type of transmission | Automatic transmission | 17.7 (17.5 - 17.8) | |
Manual transmission | 25.6 (21.2 - 29.9) |