NOTE: We recommend viewing the fully rendered version of this vignette online at https://mc-stan.org/loo/articles/
This vignette shows how to perform Bayesian leave-one-out cross-validation (LOO-CV) using the mixture estimators proposed in the paper Silva and Zanella (2022). These estimators have shown to be useful in presence of outliers but also, and especially, in high-dimensional settings where the model features many parameters. In these contexts it can happen that a large portion of observations lead to high values of Pareto-k diagnostics and potential instability of PSIS-LOO estimators.
For this illustration we consider a high-dimensional Bayesian Logistic regression model applied to the Voice dataset.
This is the Stan code for a logistic regression model with regularized horseshoe prior. The code includes an if statement to include a code line needed later for the MixIS approach.
# Note: some syntax used in this program requires RStan >= 2.26 (or CmdStanR)
# To use an older version of RStan change the line declaring `y` to:
# int<lower=0,upper=1> y[N];
stancode_horseshoe <- "
data {
int <lower=0> N;
int <lower=0> P;
array[N] int <lower=0, upper=1> y;
matrix [N,P] X;
real <lower=0> scale_global;
int <lower=0,upper=1> mixis;
}
transformed data {
real<lower=1> nu_global=1; // degrees of freedom for the half-t priors for tau
real<lower=1> nu_local=1; // degrees of freedom for the half-t priors for lambdas
// (nu_local = 1 corresponds to the horseshoe)
real<lower=0> slab_scale=2;// for the regularized horseshoe
real<lower=0> slab_df=100; // for the regularized horseshoe
}
parameters {
vector[P] z; // for non-centered parameterization
real <lower=0> tau; // global shrinkage parameter
vector <lower=0>[P] lambda; // local shrinkage parameter
real<lower=0> caux;
}
transformed parameters {
vector[P] beta;
{
vector[P] lambda_tilde; // 'truncated' local shrinkage parameter
real c = slab_scale * sqrt(caux); // slab scale
lambda_tilde = sqrt( c^2 * square(lambda) ./ (c^2 + tau^2*square(lambda)));
beta = z .* lambda_tilde*tau;
}
}
model {
vector[N] means=X*beta;
vector[N] log_lik;
target += std_normal_lpdf(z);
target += student_t_lpdf(lambda | nu_local, 0, 1);
target += student_t_lpdf(tau | nu_global, 0, scale_global);
target += inv_gamma_lpdf(caux | 0.5*slab_df, 0.5*slab_df);
for (n in 1:N) {
log_lik[n]= bernoulli_logit_lpmf(y[n] | means[n]);
}
target += sum(log_lik);
if (mixis) {
target += log_sum_exp(-log_lik);
}
}
generated quantities {
vector[N] means=X*beta;
vector[N] log_lik;
for (n in 1:N) {
log_lik[n] = bernoulli_logit_lpmf(y[n] | means[n]);
}
}
"
The LSVT Voice Rehabilitation Data Set (see link for details) has p = 312 covariates and n = 126 observations with binary response. We construct data list for Stan.
data(voice)
y <- voice$y
X <- voice[2:length(voice)]
n <- dim(X)[1]
p <- dim(X)[2]
p0 <- 10
scale_global <- 2*p0/(p-p0)/sqrt(n-1)
standata <- list(N = n, P = p, X = as.matrix(X), y = c(y), scale_global = scale_global, mixis = 0)
Note that in our prior specification we divide the prior variance by the number of covariates p. This is often done in high-dimensional contexts to have a prior variance for the linear predictors Xβ that remains bounded as p increases.
LOO-CV computations are challenging in this context due to high-dimensionality of the parameter space. To show that, we compute PSIS-LOO estimators, which require sampling from the posterior distribution, and inspect the associated Pareto-k diagnostics.
chains <- 4
n_iter <- 2000
warm_iter <- 1000
stanmodel <- stan_model(model_code = stancode_horseshoe)
fit_post <- sampling(stanmodel, data = standata, chains = chains, iter = n_iter, warmup = warm_iter, refresh = 0)
loo_post <-loo(fit_post)
As we can see the diagnostics signal either “bad” or “very bad” Pareto-k values for roughly 15 − 30% of the observations which is a significant portion of the dataset.
We now compute the mixture estimators proposed in Silva and Zanella (2022). These require to sample from the following mixture of leave-one-out posteriors The code to generate a Stan model for the above mixture distribution is the same to the one for the posterior, just enabling one line of code with a LogSumExp contribution to account for the last term in the equation above.
if (mixis) {
target += log_sum_exp(-log_lik);
}
We sample from the mixture and collect the log-likelihoods term.
standata$mixis <- 1
fit_mix <- sampling(stanmodel, data = standata, chains = chains, iter = n_iter, warmup = warm_iter, refresh = 0, pars = "log_lik")
log_lik_mix <- extract(fit_mix)$log_lik
We now compute the mixture estimators, following the numerically stable implementation in Appendix A.2 of Silva and Zanella (2022). The code below makes use of the package “matrixStats”.
To evaluate the performance of mixture estimators (MixIS) we also generate benchmark values, i.e. accurate approximations of the LOO predictives {p(yi|y−i)}i = 1, …, n, obtained by brute-force sampling from the leave-one-out posteriors directly, getting 90k samples from each and discarding the first 10k as warmup. This is computationally heavy, hence we have saved the results and we just load them in the current vignette.
We can then compute the root mean squared error (RMSE) of the PSIS and mixture estimators relative to such benchmark values.
elpd_psis <- loo_post$pointwise[,1]
print(paste("RMSE(PSIS) =",round( sqrt(mean((elpd_loo-elpd_psis)^2)) ,2)))
print(paste("RMSE(MixIS) =",round( sqrt(mean((elpd_loo-elpd_mixis)^2)) ,2)))
Here mixture estimator provides a reduction in RMSE. Note that this value would increase with the number of samples drawn from the posterior and mixture, since in this example the RMSE of MixIS will exhibit a CLT-type decay while the one of PSIS will converge at a slower rate (this can be verified by running the above code with a larger sample size; see also Figure 3 of Silva and Zanella (2022) for analogous results).
We then compare the overall ELPD estimates with the brute force one.
elpd_psis <- loo_post$pointwise[,1]
print(paste("ELPD (PSIS)=",round(sum(elpd_psis),2)))
print(paste("ELPD (MixIS)=",round(sum(elpd_mixis),2)))
print(paste("ELPD (brute force)=",round(sum(elpd_loo),2)))
In this example, MixIS provides a more accurate ELPD estimate closer to the brute force estimate, while PSIS severely overestimates the ELPD. Note that low accuracy of the PSIS ELPD estimate is expected in this example given the large number of large Pareto-k values. In this example, the accuracy of MixIS estimate will also improve with bigger MCMC sample size.
More generally, mixture estimators can be useful in situations where standard PSIS estimators struggle and return many large Pareto-k values. In these contexts MixIS often provides more accurate LOO-CV and ELPD estimates with a single sampling routine (i.e. with a cost comparable to sampling from the original posterior).
Silva L. and Zanella G. (2022). Robust leave-one-out cross-validation for high-dimensional Bayesian models. Preprint at arXiv:2209.09190
Vehtari A., Gelman A., and Gabry J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432. Preprint at arXiv:1507.04544
Vehtari, A., Simpson, D., Gelman, A., Yao, Y., and Gabry, J. (2024). Pareto smoothed importance sampling. Journal of Machine Learning Research, 25(72):1-58. PDF