Streamlining the analysis of survey data using functional programming in R

Jan 15, 2020 10 min read R

Using functional programming to efficiently compute and structure summary statistics from survey data

One of the things I often do in my work is compute the weighted mean and confidence intervals of indicators constructed from probability sampled survey data for the overall population and for sub-groups of the population defined by certain characteristics such as the gender, age or socio-economic status of the respondent. In this post I’ll show one of the workflows I use to generate sample statistics from survey data and have those results structured in a way that makes them easy to visualize later (ie. with ggplot).

The general approach is as follows:

Create descriptive labels of the indicators (numeric variables) that you want to estimate the means for and store these in a named vector.
Create descriptive labels of the population groups (categorical variables) for which you want estimates and store these in a named vector.
Create a custom function that computes the weighted mean (and potentially any other summary statistic) of a single variable given a primary sampling unit, probability weight and stratum.
Apply the custom function from (3) to each of the unique combination of indicators and grouping variables in your dataset defined in (1) and (2).

This post assumes you have some handle on the data manipulation and transformation techniques facilitated by the family of functions in the dplyr package. If not, the R for Data Science book/ online resource is an excellent place to start.

I’ll begin using a simple example using the mtcars dataset and then show another example in cases where you have data from a probability sampling procedure (as is often the case with household surveys.)

1. Simple example


##-------- STEP 0: Data preparation

## Ensuring categorical grouping variables are factors with descriptive labels

data <- mtcars %>% 
          mutate(
            fullsample = factor("All cars"), 
            cyl = factor(cyl, levels = c(4, 6, 8), labels = c("4 cylinders", "6 cylinders", "8 cylinders")), 
            am = factor(am, levels = c(0, 1), labels = c("Automatic transmission", "Manual transmission"))
          )


##-------- STEP 1-2

## Named vector of all the numeric variables
## we want the sample means for:
indicators <- c("mpg" = "Miles per gallon", 
               "hp" = "Gross horsepower", 
               "disp" = "Displacement (cu. in.)", 
               "wt" = "Weight (1,000 lbs)")

# Named vector of all the categorical variables
# containing population sub-groups we want sample means for:
groups <- c("fullsample" = "All cars", 
            "cyl" = "Number of cylinders", 
            "am" = "Type of transmission")

The code section below creates a function called svy_summary which requires 5 inputs: data (your dataset), i (a string - the column name of an indicator in data), g (a string - the column name of a grouping variable in data), iref (a lookup vector with the descriptive names of the indicator), gref (a lookup vector with the descriptive names of the grouping variable). In the code for this function you will see some functions and operators that you may not have seen before, specifically sym, as_string and !!. These are all part of the rlang library which provides a set of tools for tidy evaluation - to help you write functions using tidyverse pipelines and grammars. The svy_summary function returns a data frame with the estimates for the mean and the 95% confidence interval for the indicators and groups specified, as well as descriptive information. The function assumes that there is a variable called “fullsample” to accomodate the case where you want to estimate the mean and confidence interval of an indicator for the population as a whole.


#----------STEP 3:
# Creating function to compute mean & standard deviation (no sampling weights or complex survey design) 
#given an indicator and grouping variable

# INPUTS:
# data --> dnenotes data frame you are using
# i --> indicator(s)
# g --> population sub-group(s)
# iref --> named vector which provides labels for all of your indicators (variables)
# gref --> named vector which provides labels for all of the variables that contain your sub-groups

svy_summary <- function(data, i, g, iref, gref) { 
  
  i <- sym(i)
  g <- sym(g)

  # Computing mean and 95% confidence interval, 
  # adding a name for the indicator, 
  # and group names for group variable
  
  if (as_string(g) %in% c("fullsample")) { 
    results <- data %>%
      summarize(
        mean = mean(!!i, na.rm=TRUE), 
        sd = sd(!!i, na.rm = TRUE), 
        se = sd/sqrt(length(!!i))
      ) %>% 
      mutate(
        group_cat_val = gref[as_string(g)]
      )
  }
  else {
    results <- data %>%
      group_by(!!g) %>%
      summarize(
        mean = mean(!!i, na.rm=TRUE), 
        sd = sd(!!i, na.rm = TRUE), 
        se = sd/sqrt(length(!!i))
      ) %>% 
      mutate(
        group_cat_val = !!g
      )
  }
  
  # Computing lower & upper 95%, adding descriptive info
  results <- results %>%  mutate(
        lci = mean - qnorm(0.975)*se, 
        uci = mean + qnorm(0.975)*se, 
        indicator = as_string(i), 
        indicator_name = iref[as_string(i)], 
        group = as_string(g), 
        group_name = gref[as_string(g)]
      ) %>%
      dplyr::select(
        indicator, indicator_name, group, group_name, group_cat_val, mean, lci, uci
      )
  
  return(results)
  
}


#---------- STEP 4:
# Apply the function from step 3 to each combination of indicator and grouping variable

combinations <- expand.grid(names(indicators), names(groups), stringsAsFactors = FALSE)
# Indicators
is <- combinations[[1]]
# Groups
gs <- combinations[[2]]

results <- dplyr::bind_rows(
              map2(is, gs, svy_summary, data = data, iref = indicators, gref = groups)
            )


results %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "responsive", font_size = 10) %>% 
  scroll_box(height = "250px")

indicator	indicator_name	group	group_name	group_cat_val	mean	lci	uci
mpg	Miles per gallon	fullsample	All cars	All cars	20.090625	18.002432	22.178818
hp	Gross horsepower	fullsample	All cars	All cars	146.687500	122.932115	170.442885
disp	Displacement (cu. in.)	fullsample	All cars	All cars	230.721875	187.780098	273.663652
wt	Weight (1,000 lbs)	fullsample	All cars	All cars	3.217250	2.878238	3.556262
mpg	Miles per gallon	cyl	Number of cylinders	4 cylinders	26.663636	23.998548	29.328725
mpg	Miles per gallon	cyl	Number of cylinders	6 cylinders	19.742857	18.666059	20.819655
mpg	Miles per gallon	cyl	Number of cylinders	8 cylinders	15.100000	13.758990	16.441011
hp	Gross horsepower	cyl	Number of cylinders	4 cylinders	82.636364	70.265074	95.007653
hp	Gross horsepower	cyl	Number of cylinders	6 cylinders	122.285714	104.313621	140.257807
hp	Gross horsepower	cyl	Number of cylinders	8 cylinders	209.214286	182.511451	235.917121
disp	Displacement (cu. in.)	cyl	Number of cylinders	4 cylinders	105.136364	89.256558	121.016169
disp	Displacement (cu. in.)	cyl	Number of cylinders	6 cylinders	183.314286	152.524950	214.103621
disp	Displacement (cu. in.)	cyl	Number of cylinders	8 cylinders	353.100000	317.599862	388.600138
wt	Weight (1,000 lbs)	cyl	Number of cylinders	4 cylinders	2.285727	1.949143	2.622312
wt	Weight (1,000 lbs)	cyl	Number of cylinders	6 cylinders	3.117143	2.853163	3.381123
wt	Weight (1,000 lbs)	cyl	Number of cylinders	8 cylinders	3.999214	3.601421	4.397007
mpg	Miles per gallon	am	Type of transmission	Automatic transmission	17.147368	15.423439	18.871298
mpg	Miles per gallon	am	Type of transmission	Manual transmission	24.392308	21.040220	27.744396
hp	Gross horsepower	am	Type of transmission	Automatic transmission	160.263158	136.023522	184.502793
hp	Gross horsepower	am	Type of transmission	Manual transmission	126.846154	81.150194	172.542114
disp	Displacement (cu. in.)	am	Type of transmission	Automatic transmission	290.378947	240.840644	339.917251
disp	Displacement (cu. in.)	am	Type of transmission	Manual transmission	143.530769	96.127012	190.934526
wt	Weight (1,000 lbs)	am	Type of transmission	Automatic transmission	3.768895	3.419339	4.118450
wt	Weight (1,000 lbs)	am	Type of transmission	Manual transmission	2.411000	2.075611	2.746389

2. Example with sampling weights

If you have data from a more complex sampling procedure, you can use the set of functions in the srvyr package to first define the survey design using as_survey_design.


#-------- STEP 0: Data preparation

# Adding two columns, one for survey weights, another for the primary sampling unit

data <- data %>% 
          mutate(
            probweights = runif(n(), min = 0, max = 1), 
            psu = rep(1:3, length.out = n()), 
            strata = 1
          )


#---------- STEP 3:
# Creating function to compute mean & standard deviation 
# (for a data set with sampling weights or complex survey design) 
# given an indicator and grouping variable

# INPUTS:
# same as before, except now you need: 
# w = survey weights
# psu = primary sampling units
# strata = survey strata (if applicable)

svy_summary_weights <- function(data, i, g, psu, strata, w, iref, gref) { 
  
  i <- sym(i)
  g <- sym(g)
  w <- sym(w)
  psu <- sym(psu)
  strata <- sym(strata)
  
  # Defining survey design
  svydesign <- data %>%
    as_survey_design(ids = !!psu, weight = !!w, strata = !!strata, variables = c(!!i, !!g))
  
  # Computing mean and 95% confidence interval, adding a name fot the indicator, and group namees for group variable
  
  if (as_string(g) %in% c("fullsample")) { 
    results <- svydesign %>%
      summarise(
        mean = survey_mean(!!i, vartype = "ci", na.rm=TRUE)
      ) %>% 
      mutate(
        group_cat_val = gref[as_string(g)]
      )
  }
  
  else {
    results <- svydesign %>%
      group_by(!!g) %>%
      summarise(
        mean = survey_mean(!!i, vartype = "ci", na.rm=TRUE)
      ) %>% 
      mutate(
        group_cat_val = !!g
      ) 
  }
  
  results <- results %>% 
    rename(lci = mean_low, uci = mean_upp) %>% 
      mutate(
        indicator = as_string(i), 
        indicator_name = iref[as_string(i)], 
        group = as_string(g), 
        group_name = gref[as_string(g)]
      ) %>%
      dplyr::select(
        indicator, indicator_name, group, group_name, group_cat_val, mean, lci, uci
      )
  
}


#---------- STEP 4:
# Apply the function from step 3 to each combination of indicator and grouping variable

combinations <- expand.grid(names(indicators), names(groups), stringsAsFactors = FALSE)
# Indicators
is <- combinations[[1]]
# Groups
gs <- combinations[[2]]

results <- dplyr::bind_rows(
              map2(is, 
                   gs, 
                   svy_summary_weights, 
                   data = data, 
                   iref = indicators, 
                   gref = groups,
                   psu = "psu", 
                   strata = "strata", 
                   w = "probweights")
            )


results %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "responsive", font_size = 10) %>% 
  scroll_box(height = "250px")

indicator	indicator_name	group	group_name	group_cat_val	mean	lci	uci
mpg	Miles per gallon	fullsample	All cars	All cars	20.519877	19.194936	21.844819
hp	Gross horsepower	fullsample	All cars	All cars	144.172653	106.588444	181.756863
disp	Displacement (cu. in.)	fullsample	All cars	All cars	232.651629	161.369019	303.934239
wt	Weight (1,000 lbs)	fullsample	All cars	All cars	3.171457	2.900692	3.442222
mpg	Miles per gallon	cyl	Number of cylinders	4 cylinders	27.407213	19.887306	34.927120
mpg	Miles per gallon	cyl	Number of cylinders	6 cylinders	19.893404	16.789258	22.997549
mpg	Miles per gallon	cyl	Number of cylinders	8 cylinders	15.707861	12.496165	18.919557
hp	Gross horsepower	cyl	Number of cylinders	4 cylinders	79.475175	52.443331	106.507018
hp	Gross horsepower	cyl	Number of cylinders	6 cylinders	115.139199	105.551340	124.727058
hp	Gross horsepower	cyl	Number of cylinders	8 cylinders	206.617943	197.385097	215.850789
disp	Displacement (cu. in.)	cyl	Number of cylinders	4 cylinders	102.081026	64.282096	139.879955
disp	Displacement (cu. in.)	cyl	Number of cylinders	6 cylinders	192.974918	107.161288	278.788549
disp	Displacement (cu. in.)	cyl	Number of cylinders	8 cylinders	349.335315	299.363656	399.306975
wt	Weight (1,000 lbs)	cyl	Number of cylinders	4 cylinders	2.256888	1.127330	3.386446
wt	Weight (1,000 lbs)	cyl	Number of cylinders	6 cylinders	3.152953	2.916478	3.389429
wt	Weight (1,000 lbs)	cyl	Number of cylinders	8 cylinders	3.860660	3.145283	4.576038
mpg	Miles per gallon	am	Type of transmission	Automatic transmission	17.667195	17.534491	17.799899
mpg	Miles per gallon	am	Type of transmission	Manual transmission	25.567401	21.238455	29.896346
hp	Gross horsepower	am	Type of transmission	Automatic transmission	160.298744	127.109391	193.488097
hp	Gross horsepower	am	Type of transmission	Manual transmission	115.639223	53.735393	177.543054
disp	Displacement (cu. in.)	am	Type of transmission	Automatic transmission	285.307443	211.600040	359.014845
disp	Displacement (cu. in.)	am	Type of transmission	Manual transmission	139.482676	38.591627	240.373726
wt	Weight (1,000 lbs)	am	Type of transmission	Automatic transmission	3.647227	3.421634	3.872821
wt	Weight (1,000 lbs)	am	Type of transmission	Manual transmission	2.329631	1.680587	2.978674

You can easily create a more publication friendly of this table (with a subset of the results):


makeci <- function(m, l, u) { 
  paste0(round(m, 1), " (", round(l, 1), " - ", round(u, 1), ")")
}

results %>% 
  mutate(meanci = makeci(mean, lci, uci)) %>% 
  filter(indicator_name %in% c("Miles per gallon", "Gross horsepower")) %>% 
  arrange(indicator_name, group_name, group_cat_val) %>%
  select(indicator_name, group_name, group_cat_val, meanci) %>% 
  rename(`Performance Indicator` = indicator_name, Feature = group_name, Type = group_cat_val,`Mean (95% Confidence Interval)` = meanci) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "responsive", font_size = 10) %>% 
  collapse_rows(columns = 1:2, valign = "top")

Performance Indicator	Feature	Type	Mean (95% Confidence Interval)
Gross horsepower	All cars	All cars	144.2 (106.6 - 181.8)
	Number of cylinders	4 cylinders	79.5 (52.4 - 106.5)
		6 cylinders	115.1 (105.6 - 124.7)
		8 cylinders	206.6 (197.4 - 215.9)
	Type of transmission	Automatic transmission	160.3 (127.1 - 193.5)
	Type of transmission	Manual transmission	115.6 (53.7 - 177.5)
Miles per gallon	All cars	All cars	20.5 (19.2 - 21.8)
	Number of cylinders	4 cylinders	27.4 (19.9 - 34.9)
		6 cylinders	19.9 (16.8 - 23)
		8 cylinders	15.7 (12.5 - 18.9)
	Type of transmission	Automatic transmission	17.7 (17.5 - 17.8)
	Type of transmission	Manual transmission	25.6 (21.2 - 29.9)

RStats purrr Household survey Summary statistics

Paul Gubbins

I help governments, non-profits and companies get the most out of their data in order to advance public policies, programs and innovations that support people's well-being.