Streamlining the analysis of survey data using functional programming in R

Using functional programming to efficiently compute and structure summary statistics from survey data

One of the things I often do in my work is compute the weighted mean and confidence intervals of indicators constructed from probability sampled survey data for the overall population and for sub-groups of the population defined by certain characteristics such as the gender, age or socio-economic status of the respondent. In this post I’ll show one of the workflows I use to generate sample statistics from survey data and have those results structured in a way that makes them easy to visualize later (ie. with ggplot).

The general approach is as follows:

  1. Create descriptive labels of the indicators (numeric variables) that you want to estimate the means for and store these in a named vector.
  2. Create descriptive labels of the population groups (categorical variables) for which you want estimates and store these in a named vector.
  3. Create a custom function that computes the weighted mean (and potentially any other summary statistic) of a single variable given a primary sampling unit, probability weight and stratum.
  4. Apply the custom function from (3) to each of the unique combination of indicators and grouping variables in your dataset defined in (1) and (2).

This post assumes you have some handle on the data manipulation and transformation techniques facilitated by the family of functions in the dplyr package. If not, the R for Data Science book/ online resource is an excellent place to start.

I’ll begin using a simple example using the mtcars dataset and then show another example in cases where you have data from a probability sampling procedure (as is often the case with household surveys.)

1. Simple example


##-------- STEP 0: Data preparation

## Ensuring categorical grouping variables are factors with descriptive labels

data <- mtcars %>% 
          mutate(
            fullsample = factor("All cars"), 
            cyl = factor(cyl, levels = c(4, 6, 8), labels = c("4 cylinders", "6 cylinders", "8 cylinders")), 
            am = factor(am, levels = c(0, 1), labels = c("Automatic transmission", "Manual transmission"))
          )

##-------- STEP 1-2

## Named vector of all the numeric variables
## we want the sample means for:
indicators <- c("mpg" = "Miles per gallon", 
               "hp" = "Gross horsepower", 
               "disp" = "Displacement (cu. in.)", 
               "wt" = "Weight (1,000 lbs)")

# Named vector of all the categorical variables
# containing population sub-groups we want sample means for:
groups <- c("fullsample" = "All cars", 
            "cyl" = "Number of cylinders", 
            "am" = "Type of transmission")

The code section below creates a function called svy_summary which requires 5 inputs: data (your dataset), i (a string - the column name of an indicator in data), g (a string - the column name of a grouping variable in data), iref (a lookup vector with the descriptive names of the indicator), gref (a lookup vector with the descriptive names of the grouping variable). In the code for this function you will see some functions and operators that you may not have seen before, specifically sym, as_string and !!. These are all part of the rlang library which provides a set of tools for tidy evaluation - to help you write functions using tidyverse pipelines and grammars. The svy_summary function returns a data frame with the estimates for the mean and the 95% confidence interval for the indicators and groups specified, as well as descriptive information. The function assumes that there is a variable called “fullsample” to accomodate the case where you want to estimate the mean and confidence interval of an indicator for the population as a whole.


#----------STEP 3:
# Creating function to compute mean & standard deviation (no sampling weights or complex survey design) 
#given an indicator and grouping variable

# INPUTS:
# data --> dnenotes data frame you are using
# i --> indicator(s)
# g --> population sub-group(s)
# iref --> named vector which provides labels for all of your indicators (variables)
# gref --> named vector which provides labels for all of the variables that contain your sub-groups

svy_summary <- function(data, i, g, iref, gref) { 
  
  i <- sym(i)
  g <- sym(g)

  # Computing mean and 95% confidence interval, 
  # adding a name for the indicator, 
  # and group names for group variable
  
  if (as_string(g) %in% c("fullsample")) { 
    results <- data %>%
      summarize(
        mean = mean(!!i, na.rm=TRUE), 
        sd = sd(!!i, na.rm = TRUE), 
        se = sd/sqrt(length(!!i))
      ) %>% 
      mutate(
        group_cat_val = gref[as_string(g)]
      )
  }
  else {
    results <- data %>%
      group_by(!!g) %>%
      summarize(
        mean = mean(!!i, na.rm=TRUE), 
        sd = sd(!!i, na.rm = TRUE), 
        se = sd/sqrt(length(!!i))
      ) %>% 
      mutate(
        group_cat_val = !!g
      )
  }
  
  # Computing lower & upper 95%, adding descriptive info
  results <- results %>%  mutate(
        lci = mean - qnorm(0.975)*se, 
        uci = mean + qnorm(0.975)*se, 
        indicator = as_string(i), 
        indicator_name = iref[as_string(i)], 
        group = as_string(g), 
        group_name = gref[as_string(g)]
      ) %>%
      dplyr::select(
        indicator, indicator_name, group, group_name, group_cat_val, mean, lci, uci
      )
  
  return(results)
  
}

#---------- STEP 4:
# Apply the function from step 3 to each combination of indicator and grouping variable

combinations <- expand.grid(names(indicators), names(groups), stringsAsFactors = FALSE)
# Indicators
is <- combinations[[1]]
# Groups
gs <- combinations[[2]]

results <- dplyr::bind_rows(
              map2(is, gs, svy_summary, data = data, iref = indicators, gref = groups)
            )
 

results %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "responsive", font_size = 10) %>% 
  scroll_box(height = "250px")
indicator indicator_name group group_name group_cat_val mean lci uci
mpg Miles per gallon fullsample All cars All cars 20.090625 18.002432 22.178818
hp Gross horsepower fullsample All cars All cars 146.687500 122.932115 170.442885
disp Displacement (cu. in.) fullsample All cars All cars 230.721875 187.780098 273.663652
wt Weight (1,000 lbs) fullsample All cars All cars 3.217250 2.878238 3.556262
mpg Miles per gallon cyl Number of cylinders 4 cylinders 26.663636 23.998548 29.328725
mpg Miles per gallon cyl Number of cylinders 6 cylinders 19.742857 18.666059 20.819655
mpg Miles per gallon cyl Number of cylinders 8 cylinders 15.100000 13.758990 16.441011
hp Gross horsepower cyl Number of cylinders 4 cylinders 82.636364 70.265074 95.007653
hp Gross horsepower cyl Number of cylinders 6 cylinders 122.285714 104.313621 140.257807
hp Gross horsepower cyl Number of cylinders 8 cylinders 209.214286 182.511451 235.917121
disp Displacement (cu. in.) cyl Number of cylinders 4 cylinders 105.136364 89.256558 121.016169
disp Displacement (cu. in.) cyl Number of cylinders 6 cylinders 183.314286 152.524950 214.103621
disp Displacement (cu. in.) cyl Number of cylinders 8 cylinders 353.100000 317.599862 388.600138
wt Weight (1,000 lbs) cyl Number of cylinders 4 cylinders 2.285727 1.949143 2.622312
wt Weight (1,000 lbs) cyl Number of cylinders 6 cylinders 3.117143 2.853163 3.381123
wt Weight (1,000 lbs) cyl Number of cylinders 8 cylinders 3.999214 3.601421 4.397007
mpg Miles per gallon am Type of transmission Automatic transmission 17.147368 15.423439 18.871298
mpg Miles per gallon am Type of transmission Manual transmission 24.392308 21.040220 27.744396
hp Gross horsepower am Type of transmission Automatic transmission 160.263158 136.023522 184.502793
hp Gross horsepower am Type of transmission Manual transmission 126.846154 81.150194 172.542114
disp Displacement (cu. in.) am Type of transmission Automatic transmission 290.378947 240.840644 339.917251
disp Displacement (cu. in.) am Type of transmission Manual transmission 143.530769 96.127012 190.934526
wt Weight (1,000 lbs) am Type of transmission Automatic transmission 3.768895 3.419339 4.118450
wt Weight (1,000 lbs) am Type of transmission Manual transmission 2.411000 2.075611 2.746389

2. Example with sampling weights

If you have data from a more complex sampling procedure, you can use the set of functions in the srvyr package to first define the survey design using as_survey_design.


#-------- STEP 0: Data preparation

# Adding two columns, one for survey weights, another for the primary sampling unit

data <- data %>% 
          mutate(
            probweights = runif(n(), min = 0, max = 1), 
            psu = rep(1:3, length.out = n()), 
            strata = 1
          )

#---------- STEP 3:
# Creating function to compute mean & standard deviation 
# (for a data set with sampling weights or complex survey design) 
# given an indicator and grouping variable

# INPUTS:
# same as before, except now you need: 
# w = survey weights
# psu = primary sampling units
# strata = survey strata (if applicable)

svy_summary_weights <- function(data, i, g, psu, strata, w, iref, gref) { 
  
  i <- sym(i)
  g <- sym(g)
  w <- sym(w)
  psu <- sym(psu)
  strata <- sym(strata)
  
  # Defining survey design
  svydesign <- data %>%
    as_survey_design(ids = !!psu, weight = !!w, strata = !!strata, variables = c(!!i, !!g))
  
  # Computing mean and 95% confidence interval, adding a name fot the indicator, and group namees for group variable
  
  if (as_string(g) %in% c("fullsample")) { 
    results <- svydesign %>%
      summarise(
        mean = survey_mean(!!i, vartype = "ci", na.rm=TRUE)
      ) %>% 
      mutate(
        group_cat_val = gref[as_string(g)]
      )
  }
  
  else {
    results <- svydesign %>%
      group_by(!!g) %>%
      summarise(
        mean = survey_mean(!!i, vartype = "ci", na.rm=TRUE)
      ) %>% 
      mutate(
        group_cat_val = !!g
      ) 
  }
  
  results <- results %>% 
    rename(lci = mean_low, uci = mean_upp) %>% 
      mutate(
        indicator = as_string(i), 
        indicator_name = iref[as_string(i)], 
        group = as_string(g), 
        group_name = gref[as_string(g)]
      ) %>%
      dplyr::select(
        indicator, indicator_name, group, group_name, group_cat_val, mean, lci, uci
      )
  
}

#---------- STEP 4:
# Apply the function from step 3 to each combination of indicator and grouping variable

combinations <- expand.grid(names(indicators), names(groups), stringsAsFactors = FALSE)
# Indicators
is <- combinations[[1]]
# Groups
gs <- combinations[[2]]

results <- dplyr::bind_rows(
              map2(is, 
                   gs, 
                   svy_summary_weights, 
                   data = data, 
                   iref = indicators, 
                   gref = groups,
                   psu = "psu", 
                   strata = "strata", 
                   w = "probweights")
            )
 

results %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "responsive", font_size = 10) %>% 
  scroll_box(height = "250px")
indicator indicator_name group group_name group_cat_val mean lci uci
mpg Miles per gallon fullsample All cars All cars 20.519877 19.194936 21.844819
hp Gross horsepower fullsample All cars All cars 144.172653 106.588444 181.756863
disp Displacement (cu. in.) fullsample All cars All cars 232.651629 161.369019 303.934239
wt Weight (1,000 lbs) fullsample All cars All cars 3.171457 2.900692 3.442222
mpg Miles per gallon cyl Number of cylinders 4 cylinders 27.407213 19.887306 34.927120
mpg Miles per gallon cyl Number of cylinders 6 cylinders 19.893404 16.789258 22.997549
mpg Miles per gallon cyl Number of cylinders 8 cylinders 15.707861 12.496165 18.919557
hp Gross horsepower cyl Number of cylinders 4 cylinders 79.475175 52.443331 106.507018
hp Gross horsepower cyl Number of cylinders 6 cylinders 115.139199 105.551340 124.727058
hp Gross horsepower cyl Number of cylinders 8 cylinders 206.617943 197.385097 215.850789
disp Displacement (cu. in.) cyl Number of cylinders 4 cylinders 102.081026 64.282096 139.879955
disp Displacement (cu. in.) cyl Number of cylinders 6 cylinders 192.974918 107.161288 278.788549
disp Displacement (cu. in.) cyl Number of cylinders 8 cylinders 349.335315 299.363656 399.306975
wt Weight (1,000 lbs) cyl Number of cylinders 4 cylinders 2.256888 1.127330 3.386446
wt Weight (1,000 lbs) cyl Number of cylinders 6 cylinders 3.152953 2.916478 3.389429
wt Weight (1,000 lbs) cyl Number of cylinders 8 cylinders 3.860660 3.145283 4.576038
mpg Miles per gallon am Type of transmission Automatic transmission 17.667195 17.534491 17.799899
mpg Miles per gallon am Type of transmission Manual transmission 25.567401 21.238455 29.896346
hp Gross horsepower am Type of transmission Automatic transmission 160.298744 127.109391 193.488097
hp Gross horsepower am Type of transmission Manual transmission 115.639223 53.735393 177.543054
disp Displacement (cu. in.) am Type of transmission Automatic transmission 285.307443 211.600040 359.014845
disp Displacement (cu. in.) am Type of transmission Manual transmission 139.482676 38.591627 240.373726
wt Weight (1,000 lbs) am Type of transmission Automatic transmission 3.647227 3.421634 3.872821
wt Weight (1,000 lbs) am Type of transmission Manual transmission 2.329631 1.680587 2.978674

You can easily create a more publication friendly of this table (with a subset of the results):


makeci <- function(m, l, u) { 
  paste0(round(m, 1), " (", round(l, 1), " - ", round(u, 1), ")")
}

results %>% 
  mutate(meanci = makeci(mean, lci, uci)) %>% 
  filter(indicator_name %in% c("Miles per gallon", "Gross horsepower")) %>% 
  arrange(indicator_name, group_name, group_cat_val) %>%
  select(indicator_name, group_name, group_cat_val, meanci) %>% 
  rename(`Performance Indicator` = indicator_name, Feature = group_name, Type = group_cat_val,`Mean (95% Confidence Interval)` = meanci) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "responsive", font_size = 10) %>% 
  collapse_rows(columns = 1:2, valign = "top")
Performance Indicator Feature Type Mean (95% Confidence Interval)
Gross horsepower All cars All cars 144.2 (106.6 - 181.8)
Number of cylinders 4 cylinders 79.5 (52.4 - 106.5)
6 cylinders 115.1 (105.6 - 124.7)
8 cylinders 206.6 (197.4 - 215.9)
Type of transmission Automatic transmission 160.3 (127.1 - 193.5)
Manual transmission 115.6 (53.7 - 177.5)
Miles per gallon All cars All cars 20.5 (19.2 - 21.8)
Number of cylinders 4 cylinders 27.4 (19.9 - 34.9)
6 cylinders 19.9 (16.8 - 23)
8 cylinders 15.7 (12.5 - 18.9)
Type of transmission Automatic transmission 17.7 (17.5 - 17.8)
Manual transmission 25.6 (21.2 - 29.9)

Paul Gubbins

I help governments, non-profits and companies get the most out of their data in order to advance public policies, programs and innovations that support people's well-being.