library(rstanarm)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default())
# options(mc.cores = 4)
Inference about the population is one the main aims of statistical methodology. Multilevel regression and post-stratification (MRP) (Little 1993; Lax and Phillips 2009; Park, Gelman, and Bafumi 2004) has been shown to be an effective method of adjusting the sample to be more representative of the population for a set of key variables. Recent work has demonstrated the effectiveness of MRP when there are a number of suspected interactions between these variables (Ghitza and Gelman 2013), replicated by Lei, Gelman, and Ghitza (2017). While Ghitza and Gelman (2013) use approximate marginal maximum likelihood estimates; Lei, Gelman, and Ghitza (2017) implement a fully Bayesian approach through Stan.
The rstanarm package allows the user to conduct complicated regression analyses in Stan with the simplicity of standard formula notation in R. The purpose of this vignette is to demonstrate the utility of rstanarm when conducting MRP analyses. We will not delve into the details of conducting logistic regression with rstanarm as this is already covered in other vignettes.
Most of the code for data manipulation and plotting is not shown in the text but is available in the R markdown source code on GitHub.
Three data sets are simulated by the function
simulate_mrp_data()
, which is defined in the source
code for this R markdown document (and printed in the appendix). The
first, sample
, contains n observations from the individuals
that form our sample (i.e., n
rows). For each individual we have their age (recorded as membership
within a specific age bracket), ethnicity, income level (recorded as
membership within a specific bracket), and gender. Participants were
randomly sampled from a state.
MRP is often used for dichotomous fixed choice questions (e.g., McCain’s share of two party vote (Ghitza and Gelman 2013); support for George W Bush, (Park, Gelman, and Bafumi 2004); or support for the death penalty (Shirley and Gelman 2015)), so we will use a binary variable as the outcome in this vignette. However, MRP can also be used if there are more than two categories or if the outcome is continuous.
As this is a simple toy example, we will describe the proportion of
the population who would choose to adopt a cat over a dog, given the
opportunity. We will simulate data using a function that is included in
the appendix of this document. The simulate_mrp_data()
function simulates a sample from a much larger population. It returns a
list including the sample, population poststratification matrix and the
true population preference for cats.
The variables describing the individual (age, ethnicity, income level and gender) will be used to match the sample to the population of interest. To do this we will need to form a post-stratification table, which contains the number of people in each possible combination of the post-stratification variables. We have 4 variables with 2 (male), 7 (age), 3 (ethnicity) and 3 (income) levels, so there are 2x7x3x3 different levels. Participants are also selected from a state (50), increasing the number of possible levels to 6300.
To make inference about the population, we will also need the proportion of individuals in each post stratification cell at the population level. We will use this information to update the estimate of our outcome variable from the sample so that is more representative of the population. This is particularly helpful if there is a belief that the sample has some bias (e.g., a greater proportion of females responded than males), and that the bias impacts the outcome variable (e.g., maybe women are more likely to pick a cat than men). For each possible combination of factors, the post-stratification table shows the proportion/number of the population in that cell (rather than the proportion/number in the sample in the cell).
Below we read in the poststrat data our simulated data list.
One of the benefits of using a simulated data set for this example is that the actual population level probability of cat preference is known for each post-stratification cell. In real world data analysis, we don’t have this luxury, but we will use it later in this case study to check the predictions of the model. Details regarding the simulation of this data are available in the appendix.
Before we begin with the MRP analysis, we first explore the data set with some basic visualizations.
The aim of this analysis is to obtain a population estimation of cat preference given our sample of 4626. We can see in the following plot the difference in proportions between the sample and the population. Horizontal panels represent each variable. Bars represent the proportion of the sample (solid) and population (dashed) in each category (represented by colour and the x-axis). For ease of viewing, we ordered the states in terms of the proportion of the sample in that state that was observed. We will continue this formatting choice thoughout this vignette.
Secondly; we consider the evidence of different proportions across different levels of a post-stratification variable; which we should consider for each of the post-stratification variables. Here we break down the proportion of individuals who would prefer a cat (y-axis) by different levels (x-axis) of the post-stratification variable (horizontal panels). We can see from this figure that there appears to be differences in cat preference for the different levels of post-stratification variables. Given the previous figure, which suggested that the sample was different to the population in the share of different levels of theses variables, this should suggest that using the sample to estimate cat preference may not give accurate estimates of cat preference in the population.
Thirdly, we demonstrate visually that there is an interaction between age and gender and compare to a case where there is no interaction. Here a simulated interaction effect between age (x-axis) and gender (color), right panel, is contrasted with no interaction effect (left panel). While both panels demonstrate a difference between the genders on the outcome variable (y-axis), only the second panel shows this difference changing with the variable on the x-axis.
Lastly we look at the difference in cat preference between states, which will form the basis for the multi-level component of our analysis. Participants were randomly selected from particular states. Plotting the state (x-axis) against the overall proportion of participants who prefer cats (y-axis) demonstrates state differences. The downward slope is because we ordered the x-axis by the proportion of cat preference for ease of viewing. We also include second plot with a horizontal line to represent the overall preference for cats in the total population, according to the sample.
From visual inspection, it appears that different levels of post-stratification variable have different preferences for cats. Our survey also appears to have sampling bias; indicating that some groups were over/under sampled relative to the population. The net effect of this is that we could not make good population level estimates of cat preference straight from our sample. Our aim is to infer the preference for cats in the population using the post-stratification variables to account for systematic differences between the sample and population. Using rstanarm, this becomes a simple procedure.
The first step is to use a multi-level logistic regression model to predict preference for cats in the sample given the variables that we will use to post-stratify. Note that we actually have more rows in the post-stratification matrix than the we have observed units, so there are some cells in the poststrat matrix that we don’t observe. We can use a multi-level model to partially pool information across the different levels within each variable to assist with this. In the model described below, we use a fixed intercept for gender, and hierarchically modeled varying intercepts for each of the other factors.
Let θj denote the preference for cats in the jth poststratification cell. The non-hierarchical part of the model can be written as
θj = logit−1(Xjβ),
where here X only contains an indicator for male or female and an interaction term with age.
Adding the varying intercepts for the other variables the model becomes
$$ \theta_j = logit^{-1}( X_{j}\beta + \alpha_{\rm state[j]}^{\rm state} + \alpha_{\rm age[j]}^{\rm age} + \alpha_{\rm eth[j]}^{\rm eth} + \alpha_{\rm inc[j]}^{\rm inc} ) $$ with
$$
$$
Each of $\sigma^{\rm state}$, $\sigma^{\rm age}$, $\sigma^{\rm eth}$, and $\sigma^{\rm inc}$ are estimated from the data (in this case using rstanarm’s default priors), which is beneficial as it means we share information between the levels of each variable and we can prevent levels with with less data from being too sensitive to the few observed values. This also helps with the levels we don’t observe at all it will use information from the levels that we do observe. For more on the benefits of this type of model, see Gelman et al. (2005), and see Ghitza and Gelman (2013) and Si et al. (2017) for more complicated extensions that involve deep interactions and structured prior distributions.
Here is the model specified using the stan_glmer()
function in rstanarm, which uses the same formula syntax as the
glmer()
function from the lme4 package:
fit <- stan_glmer(
cat_pref ~ factor(male) + factor(male) * factor(age) +
(1 | state) + (1 | age) + (1 | eth) + (1 | income),
family = binomial(link = "logit"),
data = sample
)
As a first pass to check whether the model is performing well, note that there are no warnings about divergences, failure to converge or tree depth. If these errors do occur, more information on how to alleviate them is provided here.
From this we get a summary of the baseline log odds of cat preference
at the first element of each factor (i.e., male = 0, age = 1) for each
state, plus estimates on variability of the intercept for state,
ethnicity, age and income. While this is interesting, currently all we
have achieved is a model that predicts cat preference given a number of
factor-type predictors in a sample. What we would like to do is estimate
cat preference in the population by accounting for differences between
our sample and the population. We use the
posterior_linpred()
function to obtain posterior estimates
for cat preference given the proportion of people in the
population in each level of the factors included in the
model.
posterior_prob <- posterior_linpred(fit, transform = TRUE, newdata = poststrat)
poststrat_prob <- posterior_prob %*% poststrat$N / sum(poststrat$N)
model_popn_pref <- c(mean = mean(poststrat_prob), sd = sd(poststrat_prob))
round(model_popn_pref, 3)
We can compare this to the estimate we would have made if we had just used the sample:
We can also add it to the last figure to graphically represent the difference between the sample and population estimate.
compare2 <- compare2 +
geom_hline(yintercept = model_popn_pref[1], colour = '#2ca25f', size = 1) +
geom_text(aes(x = 5.2, y = model_popn_pref[1] + .025), label = "MRP", colour = '#2ca25f')
bayesplot_grid(compare, compare2,
grid_args = list(nrow = 1, widths = c(8, 1)))
As this is simulated data, we can look directly at the preference for cats that we simulated from to consider how good our estimate is.
Which we will also add to the figure.
Our MRP estimate is barely off, while our sample estimate is off by more than 10 percentage points. This indicates that using MRP helps to make estimates for the population from our sample that are more accurate.
One of the nice benefits of using MRP to make inference about the population is that we can change the population of interest. In the previous paragraph we inferred the preference for cats in the whole population. We can also infer the preference for cats in a single state. In the following code we post-stratify for each state in turn. Note that we can reuse the predictive model from the previous step and update for different population demographics. This is particularly useful for complicated cases or large data sets where the model takes some time to fit.
As before, first we use the proportion of the population in each combination of the post-stratification groups to estimate the proportion of people who preferred cats in the population, only in this case the population of interest is the state.
state_df <- data.frame(
State = 1:50,
model_state_sd = rep(-1, 50),
model_state_pref = rep(-1, 50),
sample_state_pref = rep(-1, 50),
true_state_pref = rep(-1, 50),
N = rep(-1, 50)
)
for(i in 1:length(levels(as.factor(poststrat$state)))) {
poststrat_state <- poststrat[poststrat$state == i, ]
posterior_prob_state <- posterior_linpred(
fit,
transform = TRUE,
draws = 1000,
newdata = as.data.frame(poststrat_state)
)
poststrat_prob_state <- (posterior_prob_state %*% poststrat_state$N) / sum(poststrat_state$N)
#This is the estimate for popn in state:
state_df$model_state_pref[i] <- round(mean(poststrat_prob_state), 4)
state_df$model_state_sd[i] <- round(sd(poststrat_prob_state), 4)
#This is the estimate for sample
state_df$sample_state_pref[i] <- round(mean(sample$cat_pref[sample$state == i]), 4)
#And what is the actual popn?
state_df$true_state_pref[i] <-
round(sum(true_popn$cat_pref[true_popn$state == i] * poststrat_state$N) /
sum(poststrat_state$N), digits = 4)
state_df$N[i] <- length(sample$cat_pref[sample$state == i])
}
state_df[c(1,3:6)]
state_df$State <- factor(state_df$State, levels = levels(sample$state))
Here we similar findings to when we considered the population as whole. While estimates for cat preference (in percent) using the sample are off by
round(100 * c(
mean = mean(abs(state_df$sample_state_pref-state_df$true_state_pref), na.rm = TRUE),
max = max(abs(state_df$sample_state_pref-state_df$true_state_pref), na.rm = TRUE)
))
the MRP based estimates are much closer to the actual percentage,
round(100 * c(
mean = mean(abs(state_df$model_state_pref-state_df$true_state_pref)),
max = max(abs(state_df$model_state_pref-state_df$true_state_pref))
))
and especially when the sample size for that population is relatively small. This is easier to see graphically, so we will continue to add additional layers to the previous figure. Here we add model estimates,represented by triangles, and the true population cat preference, represented as transparent circles.
Previously we used a binary outcome variable. An alternative form of
this model is to aggregate the data to the poststrat cell level and
model the number of successes (or endorsement of cat preference in this
case) out of the total number of people in that cell. To do this we need
to create two n x 1 outcome variables, N_cat_pref
(number
in cell who prefer cats) and N
(number in the poststrat
cell).
# not evaluated to avoid dependency on tidyverse
sample_alt <- sample %>%
group_by(male, age, income, state, eth) %>%
summarise(N_cat_pref = sum(cat_pref), N = n()) %>%
ungroup()
We then can use these two outcome variables to model the data using the binomial distribution.
fit2 <- stan_glmer(
cbind(N_cat_pref, N - N_cat_pref) ~ factor(male) + factor(male) * factor(age) +
(1 | state) + (1 | age) + (1 | eth) + (1 | income),
family = binomial("logit"),
data = sample_alt,
refresh = 0
)
Like before, we can use the posterior_linpred()
function
to obtain an estimate of the preference for cats in the population.
posterior_prob_alt <- posterior_linpred(fit2, transform = TRUE, newdata = poststrat)
poststrat_prob_alt <- posterior_prob_alt %*% poststrat$N / sum(poststrat$N)
model_popn_pref_alt <- c(mean = mean(poststrat_prob_alt), sd = sd(poststrat_prob_alt))
round(model_popn_pref_alt, 3)
As we should, we get the same answer as when we fit the model using the binary outcome. The two ways are equivalent, so we can use whichever form is most convenient for the data at hand. More details on these two forms of binomial models are available here.
The formulas for fitting so-called “mixed-effects” models in rstanarm are the same as those in the lme4 package. A table of examples can be found in Table 2 of the vignette for the lme4 package, available here.