Title:  Projection Predictive Feature Selection 

Description:  Performs projection predictive feature selection for generalized linear models (Piironen, Paasiniemi, and Vehtari, 2020, <doi:10.1214/20EJS1711>) with or without multilevel or additive terms (Catalina, Bürkner, and Vehtari, 2022, <https://proceedings.mlr.press/v151/catalina22a.html>), for some ordinal and nominal regression models (Weber, Glass, and Vehtari, 2023, <arXiv:2301.01660>), and for many other regression models (using the latent projection by Catalina, Bürkner, and Vehtari, 2021, <arXiv:2109.04702>, which can also be applied to most of the former models). The package is compatible with the 'rstanarm' and 'brms' packages, but other reference models can also be used. See the vignettes and the documentation for more information and examples. 
Authors:  Juho Piironen [aut], Markus Paasiniemi [aut], Alejandro Catalina [aut], Frank Weber [cre, aut], Aki Vehtari [aut], Jonah Gabry [ctb], Marco Colombo [ctb], PaulChristian Bürkner [ctb], Hamada S. Badr [ctb], Brian Sullivan [ctb], Sölvi Rögnvaldsson [ctb], The LME4 Authors [cph] (see file 'LICENSE' for details), Yann McLatchie [ctb], Juho Timonen [ctb] 
Maintainer:  Frank Weber <[email protected]> 
License:  GPL3  file LICENSE 
Version:  2.8.0.9000 
Built:  20240829 22:57:11 UTC 
Source:  https://github.com/standev/projpred 
The R package projpred performs the projection predictive variable (or
"feature") selection for various regression models. We recommend to read the
README
file (available with enhanced formatting
online) and the main vignette (topic = "projpred"
, but also available
online) before
continuing here.
Throughout the whole package documentation, we use the term "submodel" for
all kinds of candidate models onto which the reference model is projected.
For custom reference models, the candidate models don't need to be actual
submodels of the reference model, but in any case (even for custom
reference models), the candidate models are always actual submodels of the
full formula
used by the search procedure. In this regard, it is correct
to speak of submodels, even in case of a custom reference model.
The following model type abbreviations will be used at multiple places throughout the documentation: GLM (generalized linear model), GLMM (generalized linear multilevel—or "mixed"—model), GAM (generalized additive model), and GAMM (generalized additive multilevel—or "mixed"—model). Note that the term "generalized" includes the Gaussian family as well.
For the projection of the reference model onto a submodel, projpred
currently relies on the following functions as drawwise divergence
minimizers (in other words, these are the workhorse functions employed by
projpred's internal default div_minimizer
functions, see
init_refmodel()
):
Submodel without multilevel or additive terms:
For the traditional (or latent) projection (or the augmenteddata
projection in case of the binomial()
or brms::bernoulli()
family): An
internal C++ function which basically serves the same purpose as lm()
for the gaussian()
family and glm()
for all other families. The
returned object inherits from class subfit
. Possible tuning parameters
for this internal C++ function are: regul
(amount of ridge
regularization; default: 1e4
), thresh_conv
(convergence threshold;
default: 1e7
), qa_updates_max
(maximum number of quadratic
approximation updates; default: 100
, but fixed to 1
in case of the
Gaussian family with identity link), ls_iter_max
(maximum number of
line search iterations; default: 30
, but fixed to 1
in case of the
Gaussian family with identity link), normalize
(single logical value
indicating whether to scale the predictors internally with the returned
regression coefficient estimates being backadjusted appropriately;
default: TRUE
), beta0_init
(single numeric value giving the starting
value for the intercept at centered predictors; default: 0
), and
beta_init
(numeric vector giving the starting values for the regression
coefficients; default: vector of 0
s).
For the augmenteddata projection: MASS::polr()
(the returned object
inherits from class polr
) for the brms::cumulative()
family or
rstanarm::stan_polr()
fits, nnet::multinom()
(the returned object
inherits from class multinom
) for the brms::categorical()
family.
Submodel with multilevel but no additive terms:
For the traditional (or latent) projection (or the augmenteddata
projection in case of the binomial()
or brms::bernoulli()
family):
lme4::lmer()
(the returned object inherits from class lmerMod
) for
the gaussian()
family, lme4::glmer()
(the returned object inherits
from class glmerMod
) for all other families.
For the augmenteddata projection: ordinal::clmm()
(the returned
object inherits from class clmm
) for the brms::cumulative()
family,
mclogit::mblogit()
(the returned object inherits from class mmblogit
)
for the brms::categorical()
family.
Submodel without multilevel but additive terms: mgcv::gam()
(the returned
object inherits from class gam
).
Submodel with multilevel and additive terms: gamm4::gamm4()
(within
projpred, the returned object inherits from class gamm4
).
Setting global option projpred.extra_verbose
to TRUE
will print out which
submodel projpred is currently projecting onto as well as (if method = "forward"
and verbose = TRUE
in varsel()
or cv_varsel()
) which
submodel has been selected at those steps of the forward search for which a
percentage (of the maximum submodel size that the search is run up to) is
printed. In general, however, we cannot recommend setting this global option
to TRUE
for cv_varsel()
with validate_search = TRUE
(simply due to the
amount of information that will be printed, but also due to the progress bar
which will not work as intended anymore).
By default, projpred catches messages and warnings from the drawwise
divergence minimizers and throws their unique collection after performing all
drawwise divergence minimizations (i.e., drawwise projections). This can be
deactivated by setting global option projpred.warn_prj_drawwise
to FALSE
.
Furthermore, by default, projpred checks the convergence of the
drawwise divergence minimizers and throws a warning if any seem to have not
converged. This warning is thrown after the warning message from global
option projpred.warn_prj_drawwise
(see above) and can be deactivated by
setting global option projpred.check_conv
to FALSE
.
The projection of the reference model onto a submodel can be run in parallel
(across the projected draws). This is powered by the foreach package.
Thus, any parallel (or sequential) backend compatible with foreach can
be used, e.g., the backends from packages doParallel, doMPI, or
doFuture. Using the global option projpred.prll_prj_trigger
, the
number of projected draws below which no parallelization is applied (even if
a parallel backend is registered) can be modified. Such a "trigger" threshold
exists because of the computational overhead of a parallelization which makes
the projection parallelization only useful for a sufficiently large number of
projected draws. By default, the projection parallelization is turned off,
which can also be achieved by supplying Inf
(or NULL
) to option
projpred.prll_prj_trigger
. Note that we cannot recommend the projection
parallelization on Windows because in our experience, the parallelization
overhead is larger there, causing a parallel run to take longer than a
sequential run. Also note that the projection parallelization works well for
submodels which are GLMs (and hence also for the latent projection if the
submodel has no multilevel or additive predictor terms), but for all other
types of submodels, the fitted submodel objects are quite big, which—when
running in parallel—may lead to excessive memory usage which in turn may
crash the R session (on Unix systems, setting an appropriate memory limit via
unix::rlimit_as()
may avoid crashing the whole machine). Thus, we currently
cannot recommend parallelizing projections onto submodels which are GLMs (in
this context, the latent projection onto a submodel without multilevel and
without additive terms may be regarded as a projection onto a submodel which
is a GLM). However, for cv_varsel()
, there is also a CV parallelization
(i.e., a parallelization of projpred's crossvalidation) which can be
activated via argument parallel
.
In case of multilevel models, projpred offers two global options for
"integrating out" grouplevel effects: projpred.mlvl_pred_new
and
projpred.mlvl_proj_ref_new
. When setting projpred.mlvl_pred_new
to TRUE
(default is FALSE
), then at
prediction time, projpred will treat group levels existing in the
training data as new group levels, implying that their grouplevel effects
are drawn randomly from a (multivariate) Gaussian distribution. This concerns
both, the reference model and the (i.e., any) submodel. Furthermore, setting
projpred.mlvl_pred_new
to TRUE
causes as.matrix.projection()
and
as_draws_matrix.projection()
to omit the projected grouplevel effects (for
the group levels from the original dataset). When setting
projpred.mlvl_proj_ref_new
to TRUE
(default is FALSE
), then at
projection time, the reference model's fitted values (that the submodels
fit to) will be computed by treating the group levels from the original
dataset as new group levels, implying that their grouplevel effects will
be drawn randomly from a (multivariate) Gaussian distribution (as long as the
reference model is a multilevel model, which—for custom reference
models—does not need to be the case). This also affects the latent response
values for a latent projection correspondingly. Setting
projpred.mlvl_pred_new
to TRUE
makes sense, e.g., when the prediction
task is such that any group level will be treated as a new one. Typically,
setting projpred.mlvl_proj_ref_new
to TRUE
only makes sense when
projpred.mlvl_pred_new
is already set to TRUE
. In that case, the default
of FALSE
for projpred.mlvl_proj_ref_new
ensures that at projection time,
the submodels fit to the best possible fitted values from the reference
model, and setting projpred.mlvl_proj_ref_new
to TRUE
would make sense if
the grouplevel effects should be integrated out completely.
By setting the global option projpred.run_gc
to TRUE
, projpred will
call gc()
at some places (e.g., after each size that the forward search
passes through) to free up some memory. These gc()
calls are not always
necessary to reduce the peak memory usage, but they add runtime (hence the
default of FALSE
for that global option).
Most examples are not executed when called via example()
. To execute them,
their code has to be copied and pasted manually to the console.
init_refmodel()
, get_refmodel()
For setting up an object
containing information about the reference model, the submodels, and how
the projection should be carried out. Explicit calls to init_refmodel()
and get_refmodel()
are only rarely needed.
varsel()
, cv_varsel()
For running the search part and the evaluation part for a projection predictive variable selection, possibly with crossvalidation (CV).
summary.vsel()
, print.vsel()
, plot.vsel()
,
suggest_size.vsel()
, ranking()
, cv_proportions()
,
plot.cv_proportions()
, performances()
For postprocessing the results
from varsel()
and cv_varsel()
.
project()
For projecting the reference model onto submodel(s). Typically, this follows the variable selection, but it can also be applied directly (without a variable selection).
as.matrix.projection()
and as_draws_matrix.projection()
For extracting projected parameter draws.
proj_linpred()
, proj_predict()
For making predictions from a submodel (after projecting the reference model onto it).
Maintainer: Frank Weber [email protected]
Authors:
Juho Piironen [email protected]
Markus Paasiniemi
Alejandro Catalina [email protected]
Aki Vehtari
Other contributors:
Jonah Gabry [contributor]
Marco Colombo [contributor]
PaulChristian Bürkner [contributor]
Hamada S. Badr [contributor]
Brian Sullivan [contributor]
Sölvi Rögnvaldsson [contributor]
The LME4 Authors (see file 'LICENSE' for details) [copyright holder]
Yann McLatchie [contributor]
Juho Timonen [contributor]
Useful links:
Report bugs at https://github.com/standev/projpred/issues/
draws_matrix
(see package
posterior)These are the posterior::as_draws()
and posterior::as_draws_matrix()
methods for projection
objects (returned by project()
, possibly as
elements of a list
). They extract the projected parameter draws and return
them as a draws_matrix
. In case of different (i.e., nonconstant) weights
for the projected draws, a draws_matrix
allows for a safer handling of
these weights (safer in contrast to the matrix returned by
as.matrix.projection()
), in particular by providing the natural input for
posterior::resample_draws()
(see section "Examples" below).
## S3 method for class 'projection' as_draws_matrix(x, ...) ## S3 method for class 'projection' as_draws(x, ...)
## S3 method for class 'projection' as_draws_matrix(x, ...) ## S3 method for class 'projection' as_draws(x, ...)
x 
An object of class 
... 
Arguments passed to 
In case of the augmenteddata projection for a multilevel submodel
of a brms::categorical()
reference model, the multilevel parameters (and
therefore also their names) slightly differ from those in the brms
reference model fit (see section "Augmenteddata projection" in
extend_family()
's documentation).
An $S_{\mathrm{prj}} \times Q$
draws_matrix
(see
posterior::draws_matrix()
) of projected draws, with
$S_{\mathrm{prj}}$
denoting the number of projected draws and
$Q$
the number of parameters. If the projected draws have nonconstant
weights, posterior::weight_draws()
is applied internally.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `nclusters`, but only for illustrative purposes; this is not # recommended in general): prj < project(fit, predictor_terms = c("X1", "X3", "X5"), nclusters = 5, seed = 9182) # Applying the posterior::as_draws_matrix() generic to the output of # project() dispatches to the projpred::as_draws_matrix.projection() # method: prj_draws < posterior::as_draws_matrix(prj) # Resample the projected draws according to their weights: set.seed(3456) prj_draws_resampled < posterior::resample_draws(prj_draws, ndraws = 1000) # The values from the following two objects should be the same (in general, # this only holds approximately): print(proportions(table(rownames(prj_draws_resampled)))) print(weights(prj_draws)) # Treat the resampled draws like ordinary draws, e.g., summarize them: print(posterior::summarize_draws( prj_draws_resampled, "median", "mad", function(x) quantile(x, probs = c(0.025, 0.975)) )) # Or visualize them using the `bayesplot` package: if (requireNamespace("bayesplot", quietly = TRUE)) { print(bayesplot::mcmc_intervals(prj_draws_resampled)) }
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `nclusters`, but only for illustrative purposes; this is not # recommended in general): prj < project(fit, predictor_terms = c("X1", "X3", "X5"), nclusters = 5, seed = 9182) # Applying the posterior::as_draws_matrix() generic to the output of # project() dispatches to the projpred::as_draws_matrix.projection() # method: prj_draws < posterior::as_draws_matrix(prj) # Resample the projected draws according to their weights: set.seed(3456) prj_draws_resampled < posterior::resample_draws(prj_draws, ndraws = 1000) # The values from the following two objects should be the same (in general, # this only holds approximately): print(proportions(table(rownames(prj_draws_resampled)))) print(weights(prj_draws)) # Treat the resampled draws like ordinary draws, e.g., summarize them: print(posterior::summarize_draws( prj_draws_resampled, "median", "mad", function(x) quantile(x, probs = c(0.025, 0.975)) )) # Or visualize them using the `bayesplot` package: if (requireNamespace("bayesplot", quietly = TRUE)) { print(bayesplot::mcmc_intervals(prj_draws_resampled)) }
This is the as.matrix()
method for projection
objects (returned by
project()
, possibly as elements of a list
). It extracts the projected
parameter draws and returns them as a matrix. In case of different (i.e.,
nonconstant) weights for the projected draws, see
as_draws_matrix.projection()
for a better solution.
## S3 method for class 'projection' as.matrix(x, nm_scheme = NULL, allow_nonconst_wdraws_prj = FALSE, ...)
## S3 method for class 'projection' as.matrix(x, nm_scheme = NULL, allow_nonconst_wdraws_prj = FALSE, ...)
x 
An object of class 
nm_scheme 
The naming scheme for the columns of the output matrix.
Either 
allow_nonconst_wdraws_prj 
A single logical value indicating whether to
allow projected draws with different (i.e., nonconstant) weights ( 
... 
Currently ignored. 
In case of the augmenteddata projection for a multilevel submodel
of a brms::categorical()
reference model, the multilevel parameters (and
therefore also their names) slightly differ from those in the brms
reference model fit (see section "Augmenteddata projection" in
extend_family()
's documentation).
An $S_{\mathrm{prj}} \times Q$
matrix of projected
draws, with $S_{\mathrm{prj}}$
denoting the number of projected
draws and $Q$
the number of parameters. If allow_nonconst_wdraws_prj
is set to TRUE
, the weights of the projected draws are stored in an
attribute wdraws_prj
. (If allow_nonconst_wdraws_prj
is FALSE
,
projected draws with nonconstant weights cause an error.)
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `ndraws`, but only for the sake of speed in this example; this # is not recommended in general): prj < project(fit, predictor_terms = c("X1", "X3", "X5"), ndraws = 21, seed = 9182) # Applying the as.matrix() generic to the output of project() dispatches to # the projpred::as.matrix.projection() method: prj_mat < as.matrix(prj) # Since the draws have all the same weight here, we can treat them like # ordinary MCMC draws, e.g., we can summarize them using the `posterior` # package: if (requireNamespace("posterior", quietly = TRUE)) { print(posterior::summarize_draws( posterior::as_draws_matrix(prj_mat), "median", "mad", function(x) quantile(x, probs = c(0.025, 0.975)) )) } # Or visualize them using the `bayesplot` package: if (requireNamespace("bayesplot", quietly = TRUE)) { print(bayesplot::mcmc_intervals(prj_mat)) }
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `ndraws`, but only for the sake of speed in this example; this # is not recommended in general): prj < project(fit, predictor_terms = c("X1", "X3", "X5"), ndraws = 21, seed = 9182) # Applying the as.matrix() generic to the output of project() dispatches to # the projpred::as.matrix.projection() method: prj_mat < as.matrix(prj) # Since the draws have all the same weight here, we can treat them like # ordinary MCMC draws, e.g., we can summarize them using the `posterior` # package: if (requireNamespace("posterior", quietly = TRUE)) { print(posterior::summarize_draws( posterior::as_draws_matrix(prj_mat), "median", "mad", function(x) quantile(x, probs = c(0.025, 0.975)) )) } # Or visualize them using the `bayesplot` package: if (requireNamespace("bayesplot", quietly = TRUE)) { print(bayesplot::mcmc_intervals(prj_mat)) }
This is the function which has to be supplied to extend_family()
's argument
augdat_ilink
in case of the augmenteddata projection for the binomial()
family.
augdat_ilink_binom(eta_arr, link = "logit")
augdat_ilink_binom(eta_arr, link = "logit")
eta_arr 
An array as described in section "Augmenteddata projection"
of 
link 
The same as argument 
An array as described in section "Augmenteddata projection" of
extend_family()
's documentation.
This is the function which has to be supplied to extend_family()
's argument
augdat_link
in case of the augmenteddata projection for the binomial()
family.
augdat_link_binom(prb_arr, link = "logit")
augdat_link_binom(prb_arr, link = "logit")
prb_arr 
An array as described in section "Augmenteddata projection"
of 
link 
The same as argument 
An array as described in section "Augmenteddata projection" of
extend_family()
's documentation.
Sometimes there can be terms in a formula that refer to a matrix instead of a single predictor. This function breaks up the matrix term into individual predictors to handle separately, as that is probably the intention of the user.
break_up_matrix_term(formula, data)
break_up_matrix_term(formula, data)
formula 
A 
data 
The original 
A list
containing the expanded formula
and the expanded
data.frame
.
This function aggregates $S$
parameter draws that have been clustered
into $S_{\mathrm{cl}}$
clusters by averaging across the draws that
belong to the same cluster. This averaging can be done in a weighted fashion.
cl_agg( draws, cl = seq_len(nrow(draws)), wdraws = rep(1, nrow(draws)), eps_wdraws = 0 )
cl_agg( draws, cl = seq_len(nrow(draws)), wdraws = rep(1, nrow(draws)), eps_wdraws = 0 )
draws 
An 
cl 
A numeric vector of length 
wdraws 
A numeric vector of length 
eps_wdraws 
A positive numeric value (typically small) which will be
used to improve numerical stability: The weights of the draws within each
cluster are multiplied by 
An $S_{\mathrm{cl}} \times P$
matrix of aggregated
parameter draws.
set.seed(323) S < 100L P < 3L draws < matrix(rnorm(S * P), nrow = S, ncol = P) # Clustering example: S_cl < 10L cl_draws < sample.int(S_cl, size = S, replace = TRUE) draws_cl < cl_agg(draws, cl = cl_draws) # Clustering example with nonconstant `wdraws`: w_draws < rgamma(S, shape = 4) draws_cl < cl_agg(draws, cl = cl_draws, wdraws = w_draws) # Thinning example (implying constant `wdraws`): S_th < 50L idxs_thin < round(seq(1, S, length.out = S_th)) th_draws < rep(NA, S) th_draws[idxs_thin] < seq_len(S_th) draws_th < cl_agg(draws, cl = th_draws)
set.seed(323) S < 100L P < 3L draws < matrix(rnorm(S * P), nrow = S, ncol = P) # Clustering example: S_cl < 10L cl_draws < sample.int(S_cl, size = S, replace = TRUE) draws_cl < cl_agg(draws, cl = cl_draws) # Clustering example with nonconstant `wdraws`: w_draws < rgamma(S, shape = 4) draws_cl < cl_agg(draws, cl = cl_draws, wdraws = w_draws) # Thinning example (implying constant `wdraws`): S_th < 50L idxs_thin < round(seq(1, S, length.out = S_th)) th_draws < rep(NA, S) th_draws[idxs_thin] < seq_len(S_th) draws_th < cl_agg(draws, cl = th_draws)
Calculates the ranking proportions from the foldwise predictor rankings in
a crossvalidation (CV) with foldwise searches. For a given predictor
$x$
and a given submodel size $j$
, the ranking proportion is the
proportion of CV folds which have predictor $x$
at position $j$
of
their predictor ranking. While these ranking proportions are helpful for
investigating variability in the predictor ranking, they can also be
cumulated across submodel sizes. The cumulated ranking proportions are more
helpful when it comes to model selection.
cv_proportions(object, ...) ## S3 method for class 'ranking' cv_proportions(object, cumulate = FALSE, ...) ## S3 method for class 'vsel' cv_proportions(object, ...)
cv_proportions(object, ...) ## S3 method for class 'ranking' cv_proportions(object, cumulate = FALSE, ...) ## S3 method for class 'vsel' cv_proportions(object, ...)
object 
For 
... 
For 
cumulate 
A single logical value indicating whether the ranking
proportions should be cumulated across increasing submodel sizes ( 
A numeric matrix containing the ranking proportions. This matrix has
nterms_max
rows and nterms_max
columns, with nterms_max
as specified
in the (possibly implicit) ranking()
call. The rows correspond to the
submodel sizes and the columns to the predictor terms (sorted according to
the fulldata predictor ranking). If cumulate
is FALSE
, then the
returned matrix is of class cv_proportions
. If cumulate
is TRUE
, then
the returned matrix is of classes cv_proportions_cumul
and
cv_proportions
(in this order).
Note that if cumulate
is FALSE
, then the values in the returned matrix
only need to sum to 1 (columnwise and rowwise) if nterms_max
(see
above) is equal to the full model size. Likewise, if cumulate
is TRUE
,
then the value 1
only needs to occur in each column of the returned
matrix if nterms_max
is equal to the full model size.
The cv_proportions()
function is only applicable if the ranking
object
includes foldwise predictor rankings (i.e., if it is based on a vsel
object created by cv_varsel()
with validate_search = TRUE
). If the
ranking
object contains only a fulldata predictor ranking (i.e., if it
is based on a vsel
object created by varsel()
or by cv_varsel()
, but
the latter with validate_search = FALSE
), then an error is thrown because
in that case, there are no foldwise predictor rankings from which to
calculate ranking proportions.
# For an example, see `?plot.cv_proportions`.
# For an example, see `?plot.cv_proportions`.
Run the search part and the evaluation part for a projection predictive
variable selection. The search part determines the predictor ranking (also
known as solution path), i.e., the best submodel for each submodel size
(number of predictor terms). The evaluation part determines the predictive
performance of the submodels along the predictor ranking. In contrast to
varsel()
, cv_varsel()
performs a crossvalidation (CV) by running the
search part with the training data of each CV fold separately (an exception
is explained in section "Note" below) and by running the evaluation part on
the corresponding test set of each CV fold. A special method is
cv_varsel.vsel()
because it reuses the search results from an earlier
cv_varsel()
(or varsel()
) run, as illustrated in the main vignette.
cv_varsel(object, ...) ## Default S3 method: cv_varsel(object, ...) ## S3 method for class 'vsel' cv_varsel( object, cv_method = object$cv_method %% "LOO", nloo = object$nloo, K = object$K %% if (!inherits(object, "datafit")) 5 else 10, cvfits = object$cvfits, validate_search = object$validate_search %% TRUE, ... ) ## S3 method for class 'refmodel' cv_varsel( object, method = "forward", cv_method = if (!inherits(object, "datafit")) "LOO" else "kfold", ndraws = NULL, nclusters = 20, ndraws_pred = 400, nclusters_pred = NULL, refit_prj = !inherits(object, "datafit"), nterms_max = NULL, penalty = NULL, verbose = TRUE, nloo = object$nobs, K = if (!inherits(object, "datafit")) 5 else 10, cvfits = object$cvfits, search_control = NULL, lambda_min_ratio = 1e05, nlambda = 150, thresh = 1e06, validate_search = TRUE, seed = NA, search_terms = NULL, search_out = NULL, parallel = getOption("projpred.prll_cv", FALSE), ... )
cv_varsel(object, ...) ## Default S3 method: cv_varsel(object, ...) ## S3 method for class 'vsel' cv_varsel( object, cv_method = object$cv_method %% "LOO", nloo = object$nloo, K = object$K %% if (!inherits(object, "datafit")) 5 else 10, cvfits = object$cvfits, validate_search = object$validate_search %% TRUE, ... ) ## S3 method for class 'refmodel' cv_varsel( object, method = "forward", cv_method = if (!inherits(object, "datafit")) "LOO" else "kfold", ndraws = NULL, nclusters = 20, ndraws_pred = 400, nclusters_pred = NULL, refit_prj = !inherits(object, "datafit"), nterms_max = NULL, penalty = NULL, verbose = TRUE, nloo = object$nobs, K = if (!inherits(object, "datafit")) 5 else 10, cvfits = object$cvfits, search_control = NULL, lambda_min_ratio = 1e05, nlambda = 150, thresh = 1e06, validate_search = TRUE, seed = NA, search_terms = NULL, search_out = NULL, parallel = getOption("projpred.prll_cv", FALSE), ... )
object 
An object of class 
... 
For 
cv_method 
The CV method, either 
nloo 
Caution: Still experimental. Only relevant if 
K 
Only relevant if 
cvfits 
Only relevant if 
validate_search 
A single logical value indicating whether to
crossvalidate also the search part, i.e., whether to run the search
separately for each CVfold ( 
method 
The method for the search part. Possible options are

ndraws 
Number of posterior draws used in the search part. Ignored if

nclusters 
Number of clusters of posterior draws used in the search
part. Ignored in case of L1 search (because L1 search always uses a single
cluster). For the meaning of 
ndraws_pred 
Only relevant if 
nclusters_pred 
Only relevant if 
refit_prj 
For the evaluation part, should the projections onto the
submodels along the predictor ranking be performed again using 
nterms_max 
Maximum submodel size (number of predictor terms) up to
which the search is continued. If 
penalty 
Only relevant for L1 search. A numeric vector determining the
relative penalties or costs for the predictors. A value of 
verbose 
A single logical value indicating whether to print out additional information during the computations. 
search_control 
A

lambda_min_ratio 
Deprecated (please use 
nlambda 
Deprecated (please use 
thresh 
Deprecated (please use 
seed 
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument 
search_terms 
Only relevant for forward search. A custom character
vector of predictor term blocks to consider for the search. Section
"Details" below describes more precisely what "predictor term block" means.
The intercept ( 
search_out 
Intended for internal use. 
parallel 
A single logical value indicating whether to run costly parts
of the CV in parallel ( 
Arguments ndraws
, nclusters
, nclusters_pred
, and ndraws_pred
are automatically truncated at the number of posterior draws in the
reference model (which is 1
for datafit
s). Using less draws or clusters
in ndraws
, nclusters
, nclusters_pred
, or ndraws_pred
than posterior
draws in the reference model may result in slightly inaccurate projection
performance. Increasing these arguments affects the computation time
linearly.
For argument method
, there are some restrictions: For a reference model
with multilevel or additive formula terms or a reference model set up for
the augmenteddata projection, only the forward search is available.
Furthermore, argument search_terms
requires a forward search to take
effect.
L1 search is faster than forward search, but forward search may be more
accurate. Furthermore, forward search may find a sparser model with
comparable performance to that found by L1 search, but it may also
overfit when more predictors are added. This overfit can be detected
by running search validation (see cv_varsel()
).
An L1 search may select an interaction term before all involved lowerorder interaction terms (including maineffect terms) have been selected. In projpred versions > 2.6.0, the resulting predictor ranking is automatically modified so that the lowerorder interaction terms come before this interaction term, but if this is conceptually undesired, choose the forward search instead.
The elements of the search_terms
character vector don't need to be
individual predictor terms. Instead, they can be building blocks consisting
of several predictor terms connected by the +
symbol. To understand how
these building blocks work, it is important to know how projpred's
forward search works: It starts with an empty vector chosen
which will
later contain already selected predictor terms. Then, the search iterates
over model sizes $j \in \{0, ..., J\}$
(with $J$
denoting the maximum submodel size, not counting the intercept). The
candidate models at model size $j$
are constructed from those elements
from search_terms
which yield model size $j$
when combined with the
chosen
predictor terms. Note that sometimes, there may be no candidate
models for model size $j$
. Also note that internally, search_terms
is
expanded to include the intercept ("1"
), so the first step of the search
(model size 0) always consists of the interceptonly model as the only
candidate.
As a search_terms
example, consider a reference model with formula y ~ x1 + x2 + x3
. Then, to ensure that x1
is always included in the
candidate models, specify search_terms = c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3")
(or, in a simpler way that leads to the same results,
search_terms = c("x1", "x1 + x2", "x1 + x3")
, for which helper function
force_search_terms()
exists). This search would start with y ~ 1
as the
only candidate at model size 0. At model size 1, y ~ x1
would be the only
candidate. At model size 2, y ~ x1 + x2
and y ~ x1 + x3
would be the
two candidates. At the last model size of 3, y ~ x1 + x2 + x3
would be
the only candidate. As another example, to exclude x1
from the search,
specify search_terms = c("x2", "x3", "x2 + x3")
(or, in a simpler way
that leads to the same results, search_terms = c("x2", "x3")
).
An object of class vsel
. The elements of this object are not meant
to be accessed directly but instead via helper functions (see the main
vignette and projpredpackage).
If validate_search
is FALSE
, the search is not included in the CV
so that only a single fulldata search is run. If the number of
observations is big, the fast PSISLOOCV along the fulldata search path
is likely to be accurate. If the number of observations is small or
moderate, the fast PSISLOOCV along the fulldata search path is likely to
have optimistic bias in the middle of the search path. This result can be
used to guide further actions and the optimistic bias can be greatly
reduced by using validate_search = TRUE
.
PSIS uses Pareto$\hat{k}$
diagnostic to assess the reliability of
PSISLOOCV. Whether the Pareto$\hat{k}$
diagnostics are shown as
warnings, is controlled with a global option projpred.warn_psis
(default
is TRUE
). See loo::looglossary for how to interpret the
Pareto$\hat{k}$
values and the warning thresholds. projpred does
not support the usually recommended momentmatching (see
loo::loo_moment_match()
and brms::loo_moment_match()
), mixture
importance sampling (vignette("loo2mixis", package="loo")
), or
reloo
ing (brms::reloo()
). If the reference model PSISLOOCV
Pareto$\hat{k}$
values are good, but there are high
Pareto$\hat{k}$
values for the projected models, you can try
increasing the number of draws used for the PSISLOOCV (ndraws
in case
of refit_prj = FALSE
; ndraws_pred
in case of refit_prj = TRUE
). If
increasing the number of draws does not help and if the reference model
PSISLOOCV Pareto$\hat{k}$
values are high, and the reference model
PSISLOOCV results change substantially when using momentmatching,
mixture importance sampling, or reloo
ing, we recommend to use
$K$
foldCV within projpred
.
For PSISLOOCV, projpred calls loo::psis()
(or, exceptionally,
loo::sis()
, see below) with r_eff = NA
. This is only a problem if there
was extreme autocorrelation between the MCMC iterations when the reference
model was built. In those cases however, the reference model should not
have been used anyway, so we don't expect projpred's r_eff = NA
to
be a problem.
PSIS cannot be used if the number of draws or clusters is too small. In
such cases, projpred resorts to standard importance sampling (SIS)
and shows a message about this. Throughout the documentation, the term
"PSIS" is used even though in fact, projpred resorts to SIS in these
special cases. If SIS is used, check that the reference model PSISLOOCV
Pareto$\hat{k}$
values are good.
With parallel = TRUE
, costly parts of projpred's CV can be run in
parallel. Costly parts are the foldwise searches and performance
evaluations in case of validate_search = TRUE
. (Note that in case of
$K$
fold CV, the $K$
reference model refits are not affected by
argument parallel
; only projpred's CV is affected.) The
parallelization is powered by the foreach package. Thus, any parallel
(or sequential) backend compatible with foreach can be used, e.g.,
the backends from packages doParallel, doMPI, or
doFuture. For GLMs, this CV parallelization should work reliably, but
for other models (such as GLMMs), it may lead to excessive memory usage
which in turn may crash the R session (on Unix systems, setting an
appropriate memory limit via unix::rlimit_as()
may avoid crashing the
whole machine). However, the problem of excessive memory usage is less
pronounced for the CV parallelization than for the projection
parallelization described in projpredpackage. In that regard, the CV
parallelization is recommended over the projection parallelization.
Måns Magnusson, Michael Riis Andersen, Johan Jonasson, Aki Vehtari (2020). Leaveoneout crossvalidation for Bayesian model comparison in large data. Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 108:341351.
Aki Vehtari, Andrew Gelman, and Jonah Gabry (2017). Practical Bayesian Model Evaluation Using LeaveOneOut CrossValidation and WAIC. Statistics and Computing, 27(5):1413–32. doi:10.1007/s1122201696964.
Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry (2024). Pareto smoothed importance sampling. Journal of Machine Learning Research, 25(72):158.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 1000, refresh = 0, seed = 9876 ) # Run cv_varsel() (with L1 search and small values for `K`, `nterms_max`, and # `nclusters_pred`, but only for the sake of speed in this example; this is # not recommended in general): cvvs < cv_varsel(fit, method = "L1", cv_method = "kfold", K = 2, nterms_max = 3, nclusters_pred = 10, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible postprocessing functions.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 1000, refresh = 0, seed = 9876 ) # Run cv_varsel() (with L1 search and small values for `K`, `nterms_max`, and # `nclusters_pred`, but only for the sake of speed in this example; this is # not recommended in general): cvvs < cv_varsel(fit, method = "L1", cv_method = "kfold", K = 2, nterms_max = 3, nclusters_pred = 10, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible postprocessing functions.
These are helper functions to create crossvalidation (CV) folds, i.e., to
split up the indices from 1 to n
into K
subsets ("folds") for
$K$
fold CV. These functions are potentially useful when creating the
input for arguments cvfits
and cvfun
of init_refmodel()
(or argument
cvfits
of cv_varsel.refmodel()
). Function cvfolds()
is deprecated;
please use cv_folds()
instead (apart from the name, they are the same). The
return value of cv_folds()
and cv_ids()
is different, see below for
details.
cv_folds(n, K, seed = NA) cvfolds(n, K, seed = NA) cv_ids(n, K, out = c("foldwise", "indices"), seed = NA)
cv_folds(n, K, seed = NA) cvfolds(n, K, seed = NA) cv_ids(n, K, out = c("foldwise", "indices"), seed = NA)
n 
Number of observations. 
K 
Number of folds. Must be at least 2 and not exceed 
seed 
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument 
out 
Format of the output, either 
cv_folds()
returns a vector of length n
such that each element is
an integer between 1 and K
denoting which fold the corresponding data
point belongs to. The return value of cv_ids()
depends on the out
argument. If out = "foldwise"
, the return value is a list
with K
elements, each being a list
with elements tr
and ts
giving the
training and test indices, respectively, for the corresponding fold. If
out = "indices"
, the return value is a list
with elements tr
and ts
each being a list
with K
elements giving the training and test indices,
respectively, for each fold.
n < 100 set.seed(1234) y < rnorm(n) cv < cv_ids(n, K = 5) # Mean within the test set of each fold: cvmeans < sapply(cv, function(fold) mean(y[fold$ts]))
n < 100 set.seed(1234) y < rnorm(n) cv < cv_ids(n, K = 5) # Mean within the test set of each fold: cvmeans < sapply(cv, function(fold) mean(y[fold$ts]))
Binomial toy example
df_binom
df_binom
A simulated classification dataset containing 100 observations.
response, 0 or 1.
predictors, 30 in total.
https://web.stanford.edu/~hastie/glmnet/glmnetData/BNExample.RData
Gaussian toy example
df_gaussian
df_gaussian
A simulated regression dataset containing 100 observations.
response, realvalued.
predictors, 20 in total. Mean and SD are approximately 0 and 1, respectively.
https://web.stanford.edu/~hastie/glmnet/glmnetData/QSExample.RData
This function adds some internally required elements to an object of class
family
(see, e.g., family()
). It is called internally by
init_refmodel()
, so you will rarely need to call it yourself.
extend_family( family, latent = FALSE, latent_y_unqs = NULL, latent_ilink = NULL, latent_ll_oscale = NULL, latent_ppd_oscale = NULL, augdat_y_unqs = NULL, augdat_link = NULL, augdat_ilink = NULL, augdat_args_link = list(), augdat_args_ilink = list(), ... )
extend_family( family, latent = FALSE, latent_y_unqs = NULL, latent_ilink = NULL, latent_ll_oscale = NULL, latent_ppd_oscale = NULL, augdat_y_unqs = NULL, augdat_link = NULL, augdat_ilink = NULL, augdat_args_link = list(), augdat_args_ilink = list(), ... )
family 
An object of class 
latent 
A single logical value indicating whether to use the latent
projection ( 
latent_y_unqs 
Only relevant for a latent projection where the original
response space has finite support (i.e., the original response values may
be regarded as categories), in which case this needs to be the character
vector of unique response values (which will be assigned to 
latent_ilink 
Only relevant for the latent projection, in which case
this needs to be the inverselink function. If the original response family
was the 
latent_ll_oscale 
Only relevant for the latent projection, in which
case this needs to be the function computing responsescale (not
latentscale) loglikelihood values. If 
latent_ppd_oscale 
Only relevant for the latent projection, in which
case this needs to be the function sampling response values given latent
predictors that have been transformed to response scale using

augdat_y_unqs 
Only relevant for augmenteddata projection, in which
case this needs to be the character vector of unique response values (which
will be assigned to 
augdat_link 
Only relevant for augmenteddata projection, in which case
this needs to be the link function. Use 
augdat_ilink 
Only relevant for augmenteddata projection, in which
case this needs to be the inverselink function. Use 
augdat_args_link 
Only relevant for augmenteddata projection, in which
case this may be a named 
augdat_args_ilink 
Only relevant for augmenteddata projection, in
which case this may be a named 
... 
Ignored (exists only to swallow up further arguments which might be passed to this function). 
In the following, $N$
, $C_{\mathrm{cat}}$
,
$C_{\mathrm{lat}}$
, $S_{\mathrm{ref}}$
, and
$S_{\mathrm{prj}}$
from help topic refmodelinitget are used.
Note that $N$
does not necessarily denote the number of original
observations; it can also refer to new observations. Furthermore, let $S$
denote either $S_{\mathrm{ref}}$
or $S_{\mathrm{prj}}$
,
whichever is appropriate in the context where it is used.
The family
object extended in the way needed by projpred.
As their first input, the functions supplied to arguments augdat_link
and
augdat_ilink
have to accept:
For augdat_link
: an $S \times N \times C_{\mathrm{cat}}$
array containing the probabilities for the response categories. The
order of the response categories is the same as in family$cats
(see
argument augdat_y_unqs
).
For augdat_ilink
: an $S \times N \times C_{\mathrm{lat}}$
array containing the linear predictors.
The return value of these functions needs to be:
For augdat_link
: an $S \times N \times C_{\mathrm{lat}}$
array containing the linear predictors.
For augdat_ilink
: an $S \times N \times C_{\mathrm{cat}}$
array containing the probabilities for the response categories. The
order of the response categories has to be the same as in family$cats
(see
argument augdat_y_unqs
).
For the augmenteddata projection, the response vector resulting from
extract_model_data
(see init_refmodel()
) is coerced to a factor
(using
as.factor()
) at multiple places throughout this package. Inside of
init_refmodel()
, the levels of this factor
have to be identical to
family$cats
(after applying extend_family()
inside of
init_refmodel()
). Everywhere else, these levels have to be a subset of
<refmodel>$family$cats
(where <refmodel>
is an object resulting from
init_refmodel()
). See argument augdat_y_unqs
for how to control
family$cats
.
For ordinal brms families, be aware that the submodels (onto which the reference model is projected) currently have the following restrictions:
The discrimination parameter disc
is not supported (i.e., it is a
constant with value 1).
The thresholds are "flexible"
(see brms::brmsfamily()
).
The thresholds do not vary across the levels of a factor
like variable
(see argument gr
of brms::resp_thres()
).
The "probit_approx"
link is replaced by "probit"
.
For the brms::categorical()
family, be aware that:
For multilevel submodels, the grouplevel effects are allowed to be correlated between different response categories.
For multilevel submodels, mclogit versions < 0.9.4 may throw the
error 'a' (<number> x 1) must be square
. Updating mclogit to a
version >= 0.9.4 should fix this.
The function supplied to argument latent_ilink
needs to have the prototype
latent_ilink(lpreds, cl_ref, wdraws_ref = rep(1, length(cl_ref)))
where:
lpreds
accepts an $S \times N$
matrix containing the linear
predictors.
cl_ref
accepts a numeric vector of length $S_{\mathrm{ref}}$
,
containing projpred's internal cluster indices for these draws.
wdraws_ref
accepts a numeric vector of length
$S_{\mathrm{ref}}$
, containing weights for these draws. These
weights should be treated as not being normalized (i.e., they don't
necessarily sum to 1
).
The return value of latent_ilink
needs to contain the linear predictors
transformed to the original response space, with the following structure:
If is.null(family$cats)
(after taking latent_y_unqs
into account): an
$S \times N$
matrix.
If !is.null(family$cats)
(after taking latent_y_unqs
into account): an
$S \times N \times C_{\mathrm{cat}}$
array. In that case,
latent_ilink
needs to return probabilities (for the response categories
given in family$cats
, after taking latent_y_unqs
into account).
The function supplied to argument latent_ll_oscale
needs to have the
prototype
latent_ll_oscale(ilpreds, y_oscale, wobs = rep(1, length(y_oscale)), cl_ref, wdraws_ref = rep(1, length(cl_ref)))
where:
ilpreds
accepts the return value from latent_ilink
.
y_oscale
accepts a vector of length $N$
containing response values on
the original response scale.
wobs
accepts a numeric vector of length $N$
containing observation
weights.
cl_ref
accepts the same input as argument cl_ref
of latent_ilink
.
wdraws_ref
accepts the same input as argument wdraws_ref
of
latent_ilink
.
The return value of latent_ll_oscale
needs to be an $S \times N$
matrix containing the responsescale (not latentscale) loglikelihood values
for the $N$
observations from its inputs.
The function supplied to argument latent_ppd_oscale
needs to have the
prototype
latent_ppd_oscale(ilpreds_resamp, wobs, cl_ref, wdraws_ref = rep(1, length(cl_ref)), idxs_prjdraws)
where:
ilpreds_resamp
accepts the return value from latent_ilink
, but possibly
with resampled (clustered) draws (see argument nresample_clusters
of
proj_predict()
).
wobs
accepts a numeric vector of length $N$
containing observation
weights.
cl_ref
accepts the same input as argument cl_ref
of latent_ilink
.
wdraws_ref
accepts the same input as argument wdraws_ref
of
latent_ilink
.
idxs_prjdraws
accepts a numeric vector of length dim(ilpreds_resamp)[1]
containing the resampled indices of the projected draws (i.e., these indices
are values from the set $\{1, ..., \texttt{dim(ilpreds)[1]}\}$
where ilpreds
denotes the return value of
latent_ilink
).
The return value of latent_ppd_oscale
needs to be a
$\texttt{dim(ilpreds\_resamp)[1]} \times N$
matrix containing the responsescale (not latentscale) draws from the
posterior(projection) predictive distributions for the $N$
observations
from its inputs.
If the bodies of these three functions involve parameter draws from the
reference model which have not been projected (e.g., for latent_ilink
, the
thresholds in an ordinal model), cl_agg()
is provided as a helper function
for aggregating these reference model draws in the same way as the draws have
been aggregated for the first argument of these functions (e.g., lpreds
in
case of latent_ilink
).
In fact, the weights passed to argument wdraws_ref
are nonconstant only in
case of cv_varsel()
with cv_method = "LOO"
and validate_search = TRUE
.
In that case, the weights passed to this argument are the PSISLOO CV weights
for one observation. Note that although argument wdraws_ref
has the suffix
_ref
, wdraws_ref
does not necessarily obtain weights for the initial
reference model's posterior draws: In case of cv_varsel()
with cv_method = "kfold"
, these weights may refer to one of the $K$
reference model
refits (but in that case, they are constant anyway).
If family$cats
is not NULL
(after taking latent_y_unqs
into account),
then the response vector resulting from extract_model_data
(see
init_refmodel()
) is coerced to a factor
(using as.factor()
) at multiple
places throughout this package. Inside of init_refmodel()
, the levels of
this factor
have to be identical to family$cats
(after applying
extend_family()
inside of init_refmodel()
). Everywhere else, these levels
have to be a subset of <refmodel>$family$cats
(where <refmodel>
is an
object resulting from init_refmodel()
).
Family objects not in the set of default family
objects.
Student_t(link = "identity", nu = 3)
Student_t(link = "identity", nu = 3)
link 
Name of the link function. In contrast to the default 
nu 
Degrees of freedom for the Student 
A family object analogous to those described in family
.
Support for the Student_t()
family is still experimental.
A helper function to construct the input for argument search_terms
of
varsel()
or cv_varsel()
if certain predictor terms should be forced to be
selected first whereas other predictor terms are optional (i.e., they are
subject to the variable selection, but only after the inclusion of the
"forced" terms).
force_search_terms(forced_terms, optional_terms)
force_search_terms(forced_terms, optional_terms)
forced_terms 
A character vector of predictor terms that should be selected first. 
optional_terms 
A character vector of predictor terms that should be subject to the variable selection after the inclusion of the "forced" terms. 
A character vector that may be used as input for argument
search_terms
of varsel()
or cv_varsel()
.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # We will force X1 and X2 to be selected first: search_terms_forced < force_search_terms( forced_terms = paste0("X", 1:2), optional_terms = paste0("X", 3:5) ) # Run varsel() (here without crossvalidation and with small values for # `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs < varsel(fit, nclusters = 5, nclusters_pred = 10, search_terms = search_terms_forced, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible postprocessing functions.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # We will force X1 and X2 to be selected first: search_terms_forced < force_search_terms( forced_terms = paste0("X", 1:2), optional_terms = paste0("X", 3:5) ) # Run varsel() (here without crossvalidation and with small values for # `nterms_max`, `nclusters`, and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs < varsel(fit, nclusters = 5, nclusters_pred = 10, search_terms = search_terms_forced, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible postprocessing functions.
The mesquite bushes yields dataset from Gelman and Hill (2006) (http://www.stat.columbia.edu/~gelman/arm/).
mesquite
mesquite
The response variable is the total weight (in grams) of photosynthetic material as derived from actual harvesting of the bush. The predictor variables are:
diameter of the canopy (the leafy area of the bush) in meters, measured along the longer axis of the bush.
canopy diameter measured along the shorter axis.
height of the canopy.
total height of the bush.
plant unit density (# of primary stems per plant unit).
group of measurements (0 for the first group, 1 for the second group).
http://www.stat.columbia.edu/~gelman/arm/examples/mesquite/mesquite.dat
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge, UK: Cambridge University Press. doi:10.1017/CBO9780511790942.
Retrieves the predictive performance summaries after running varsel()
or
cv_varsel()
. These summaries are computed by summary.vsel()
, so the main
method of performances()
is performances.vselsummary()
(objects of class
vselsummary
are returned by summary.vsel()
). As a shortcut method,
performances.vsel()
is provided as well (objects of class vsel
are
returned by varsel()
and cv_varsel()
). For a graphical representation,
see plot.vsel()
.
performances(object, ...) ## S3 method for class 'vselsummary' performances(object, ...) ## S3 method for class 'vsel' performances(object, ...)
performances(object, ...) ## S3 method for class 'vselsummary' performances(object, ...) ## S3 method for class 'vsel' performances(object, ...)
object 
The object from which to retrieve the predictive performance results. Possible classes may be inferred from the names of the corresponding methods (see also the description). 
... 
For 
An object of class performances
which is a list
with the
following elements:
submodels
: The predictive performance results for the submodels, as a
data.frame
.
reference_model
: The predictive performance results for the reference
model, as a named vector.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without crossvalidation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs < varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(performances(vs))
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without crossvalidation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs < varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(performances(vs))
Plots the ranking proportions (see cv_proportions()
) from the foldwise
predictor rankings in a crossvalidation with foldwise searches. This is a
visualization of the transposed matrix returned by cv_proportions()
. The
proportions printed as text inside of the colored tiles are rounded to whole
percentage points (the plotted proportions themselves are not rounded).
## S3 method for class 'cv_proportions' plot(x, text_angle = NULL, ...) ## S3 method for class 'ranking' plot(x, ...)
## S3 method for class 'cv_proportions' plot(x, text_angle = NULL, ...) ## S3 method for class 'ranking' plot(x, ...)
x 
For 
text_angle 
Passed to argument 
... 
For 
A ggplot2 plotting object (of class gg
and ggplot
).
Idea and original code by Aki Vehtari. Slight modifications of the original code by Frank Weber, Yann McLatchie, and Sölvi Rögnvaldsson. Final implementation in projpred by Frank Weber.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 1000, refresh = 0, seed = 9876 ) # Run cv_varsel() (with L1 search and small values for `K`, `nterms_max`, and # `nclusters_pred`, but only for the sake of speed in this example; this is # not recommended in general): cvvs < cv_varsel(fit, method = "L1", cv_method = "kfold", K = 2, nterms_max = 3, nclusters_pred = 10, seed = 5555) # Extract predictor rankings: rk < ranking(cvvs) # Compute ranking proportions: pr_rk < cv_proportions(rk) # Visualize the ranking proportions: gg_pr_rk < plot(pr_rk) print(gg_pr_rk) # Since the object returned by plot.cv_proportions() is a standard ggplot2 # plotting object, you can modify the plot easily, e.g., to remove the # legend: print(gg_pr_rk + ggplot2::theme(legend.position = "none"))
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 1000, refresh = 0, seed = 9876 ) # Run cv_varsel() (with L1 search and small values for `K`, `nterms_max`, and # `nclusters_pred`, but only for the sake of speed in this example; this is # not recommended in general): cvvs < cv_varsel(fit, method = "L1", cv_method = "kfold", K = 2, nterms_max = 3, nclusters_pred = 10, seed = 5555) # Extract predictor rankings: rk < ranking(cvvs) # Compute ranking proportions: pr_rk < cv_proportions(rk) # Visualize the ranking proportions: gg_pr_rk < plot(pr_rk) print(gg_pr_rk) # Since the object returned by plot.cv_proportions() is a standard ggplot2 # plotting object, you can modify the plot easily, e.g., to remove the # legend: print(gg_pr_rk + ggplot2::theme(legend.position = "none"))
This is the plot()
method for vsel
objects (returned by varsel()
or
cv_varsel()
). It visualizes the predictive performance of the reference
model (possibly also that of some other "baseline" model) and that of the
submodels along the fulldata predictor ranking. Basic information about the
(CV) variability in the ranking of the predictors is included as well (if
available; inferred from cv_proportions()
). For a tabular representation,
see summary.vsel()
and performances()
.
## S3 method for class 'vsel' plot( x, nterms_max = NULL, stats = "elpd", deltas = FALSE, alpha = 2 * pnorm(1), baseline = if (!inherits(x$refmodel, "datafit")) "ref" else "best", thres_elpd = NA, resp_oscale = TRUE, point_size = 3, bar_thickness = 1, ranking_nterms_max = NULL, ranking_abbreviate = FALSE, ranking_abbreviate_args = list(), ranking_repel = NULL, ranking_repel_args = list(), ranking_colored = FALSE, show_cv_proportions = TRUE, cumulate = FALSE, text_angle = NULL, size_position = "primary_x_bottom", ... )
## S3 method for class 'vsel' plot( x, nterms_max = NULL, stats = "elpd", deltas = FALSE, alpha = 2 * pnorm(1), baseline = if (!inherits(x$refmodel, "datafit")) "ref" else "best", thres_elpd = NA, resp_oscale = TRUE, point_size = 3, bar_thickness = 1, ranking_nterms_max = NULL, ranking_abbreviate = FALSE, ranking_abbreviate_args = list(), ranking_repel = NULL, ranking_repel_args = list(), ranking_colored = FALSE, show_cv_proportions = TRUE, cumulate = FALSE, text_angle = NULL, size_position = "primary_x_bottom", ... )
x 
An object of class 
nterms_max 
Maximum submodel size (number of predictor terms) for which
the performance statistics are calculated. Using 
stats 
One or more character strings determining which performance
statistics (i.e., utilities or losses) to estimate based on the
observations in the evaluation (or "test") set (in case of
crossvalidation, these are all observations because they are partitioned
into multiple test sets; in case of

deltas 
If 
alpha 
A number determining the (nominal) coverage 
baseline 
For 
thres_elpd 
Only relevant if 
resp_oscale 
Only relevant for the latent projection. A single logical
value indicating whether to calculate the performance statistics on the
original response scale ( 
point_size 
Passed to argument 
bar_thickness 
Passed to argument 
ranking_nterms_max 
Maximum submodel size (number of predictor terms)
for which the predictor names and the corresponding ranking proportions are
added on the xaxis. Using 
ranking_abbreviate 
A single logical value indicating whether the
predictor names in the fulldata predictor ranking should be abbreviated by

ranking_abbreviate_args 
A 
ranking_repel 
Either 
ranking_repel_args 
A 
ranking_colored 
A single logical value indicating whether the points
and the uncertainty bars should be gradientcolored according to the CV
ranking proportions ( 
show_cv_proportions 
A single logical value indicating whether the CV
ranking proportions (see 
cumulate 
Passed to argument 
text_angle 
Passed to argument 
size_position 
A single character string specifying the position of the
submodel sizes. Either 
... 
Arguments passed to the internal function which is used for
bootstrapping (if applicable; see argument 
The stats
options "mse"
and "rmse"
are only available for:
the traditional projection,
the latent projection with resp_oscale = FALSE
,
the latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
.
The stats
option "acc"
(= "pctcorr"
) is only available for:
the binomial()
family in case of the traditional projection,
all families in case of the augmenteddata projection,
the binomial()
family (on the original response scale) in case of the
latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
,
all families (on the original response scale) in case of the latent
projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being not NULL
.
The stats
option "auc"
is only available for:
the binomial()
family in case of the traditional projection,
the binomial()
family (on the original response scale) in case of the
latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
.
A ggplot2 plotting object (of class gg
and ggplot
). If
ranking_abbreviate
is TRUE
, the output of abbreviate()
is stored in
an attribute called projpred_ranking_abbreviated
(to allow the
abbreviations to be easily mapped back to the original predictor names).
As long as the reference model's performance is computable, it is always
shown in the plot as a dashed red horizontal line. If baseline = "best"
,
the baseline model's performance is shown as a dotted black horizontal line.
If !is.na(thres_elpd)
and any(stats %in% c("elpd", "mlpd", "gmpd"))
, the
value supplied to thres_elpd
(which is automatically adapted internally in
case of the MLPD or the GMPD or deltas = FALSE
) is shown as a dotdashed
gray horizontal line for the reference model and, if baseline = "best"
, as
a longdashed green horizontal line for the baseline model.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without crossvalidation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs < varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(plot(vs))
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without crossvalidation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs < varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(plot(vs))
After the projection of the reference model onto a submodel, the linear
predictors (for the original or a new dataset) based on that submodel can be
calculated by proj_linpred()
. These linear predictors can also be
transformed to response scale and averaged across the projected parameter
draws. Furthermore, proj_linpred()
returns the corresponding log predictive
density values if the (original or new) dataset contains response values. The
proj_predict()
function draws from the predictive distributions (there is
one such distribution for each observation from the original or new dataset)
of the submodel that the reference model has been projected onto. If the
projection has not been performed yet, both functions call project()
internally to perform the projection. Both functions can also handle multiple
submodels at once (for object
s of class vsel
or object
s returned by a
project()
call to an object of class vsel
; see project()
).
proj_linpred( object, newdata = NULL, offsetnew = NULL, weightsnew = NULL, filter_nterms = NULL, transform = FALSE, integrated = FALSE, allow_nonconst_wdraws_prj = return_draws_matrix, return_draws_matrix = FALSE, .seed = NA, ... ) proj_predict( object, newdata = NULL, offsetnew = NULL, weightsnew = NULL, filter_nterms = NULL, nresample_clusters = 1000, return_draws_matrix = FALSE, .seed = NA, resp_oscale = TRUE, ... )
proj_linpred( object, newdata = NULL, offsetnew = NULL, weightsnew = NULL, filter_nterms = NULL, transform = FALSE, integrated = FALSE, allow_nonconst_wdraws_prj = return_draws_matrix, return_draws_matrix = FALSE, .seed = NA, ... ) proj_predict( object, newdata = NULL, offsetnew = NULL, weightsnew = NULL, filter_nterms = NULL, nresample_clusters = 1000, return_draws_matrix = FALSE, .seed = NA, resp_oscale = TRUE, ... )
object 
An object returned by 
newdata 
Passed to argument 
offsetnew 
Passed to argument 
weightsnew 
Passed to argument 
filter_nterms 
Only applies if 
transform 
For 
integrated 
For 
allow_nonconst_wdraws_prj 
Only relevant for 
return_draws_matrix 
A single logical value indicating whether to
return an object (in case of 
.seed 
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument 
... 
Arguments passed to 
nresample_clusters 
For 
resp_oscale 
Only relevant for the latent projection. A single logical
value indicating whether to draw from the posteriorprojection predictive
distributions on the original response scale ( 
Currently, proj_predict()
ignores observation weights that are not
equal to 1
. A corresponding warning is thrown if this is the case.
In case of the latent projection and transform = FALSE
:
Output element pred
contains the linear predictors without any
modifications that may be due to the original response distribution (e.g.,
for a brms::cumulative()
model, the ordered thresholds are not taken into
account).
Output element lpd
contains the latent log predictive density values,
i.e., those corresponding to the latent Gaussian distribution. If newdata
is not NULL
, this requires the latent response values to be supplied in a
column called .<response_name>
of newdata
where <response_name>
needs
to be replaced by the name of the original response variable (if
<response_name>
contained parentheses, these have been stripped off by
init_refmodel()
; see the lefthand side of formula(<refmodel>)
). For
technical reasons, the existence of column <response_name>
in newdata
is another requirement (even though .<response_name>
is actually used).
In the following, $S_{\mathrm{prj}}$
, $N$
,
$C_{\mathrm{cat}}$
, and $C_{\mathrm{lat}}$
from help
topic refmodelinitget are used. (For proj_linpred()
with integrated = TRUE
, we have $S_{\mathrm{prj}} = 1$
.) Furthermore, let
$C$
denote either $C_{\mathrm{cat}}$
(if transform = TRUE
)
or $C_{\mathrm{lat}}$
(if transform = FALSE
). Then, if the
prediction is done for one submodel only (i.e., length(nterms) == 1  !is.null(predictor_terms)
in the explicit or implicit call to project()
,
see argument object
):
proj_linpred()
returns a list
with the following elements:
Element pred
contains the actual predictions, i.e., the linear
predictors, possibly transformed to response scale (depending on
argument transform
).
Element lpd
is nonNULL
only if newdata
is NULL
or if
newdata
contains response values in the corresponding column. In that
case, it contains the log predictive density values (conditional on
each of the projected parameter draws if integrated = FALSE
and
averaged across the projected parameter draws if integrated = TRUE
).
In case of (i) the traditional projection, (ii) the latent projection
with transform = FALSE
, or (iii) the latent projection with
transform = TRUE
and <refmodel>$family$cats
(where <refmodel>
is
an object resulting from init_refmodel()
; see also
extend_family()
's argument latent_y_unqs
) being NULL
, both
elements are $S_{\mathrm{prj}} \times N$
matrices
(converted to a—possibly weighted—draws_matrix
if argument
return_draws_matrix
is TRUE
, see the description of this argument).
In case of (i) the augmenteddata projection or (ii) the latent
projection with transform = TRUE
and <refmodel>$family$cats
being
not NULL
, pred
is an $S_{\mathrm{prj}} \times N \times C$
array (if argument return_draws_matrix
is TRUE
, this array
is "compressed" to an $S_{\mathrm{prj}} \times (N \cdot C)$
matrix—with the columns consisting of $C$
blocks of
$N$
rows—and then converted to a—possibly
weighted—draws_matrix
) and lpd
is an $S_{\mathrm{prj}} \times
N$
matrix (converted to a—possibly
weighted—draws_matrix
if argument return_draws_matrix
is TRUE
).
If return_draws_matrix
is FALSE
and allow_nonconst_wdraws_prj
is
TRUE
and integrated
is FALSE
and the projected draws have
nonconstant weights, then both list
elements have the weights of
these draws stored in an attribute wdraws_prj
. (If
return_draws_matrix
, allow_nonconst_wdraws_prj
, and integrated
are all FALSE
, then projected draws with nonconstant weights cause an
error.)
proj_predict()
returns an $S_{\mathrm{prj}} \times N$
matrix of predictions where $S_{\mathrm{prj}}$
denotes
nresample_clusters
in case of clustered projection (or, more generally,
in case of projected draws with nonconstant weights). If argument
return_draws_matrix
is TRUE
, the returned matrix is converted to a
draws_matrix
(see posterior::draws_matrix()
). In case of (i) the
augmenteddata projection or (ii) the latent projection with resp_oscale = TRUE
and <refmodel>$family$cats
being not NULL
, the returned matrix
(or draws_matrix
) has an attribute called cats
(the character vector of
response categories) and the values of the matrix (or draws_matrix
) are
the predicted indices of the response categories (these indices refer to
the order of the response categories from attribute cats
).
If the prediction is done for more than one submodel, the output from above
is returned for each submodel, giving a named list
with one element for
each submodel (the names of this list
being the numbers of predictor
terms of the submodels when counting the intercept, too).
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `ndraws`, but only for the sake of speed in this example; this # is not recommended in general): prj < project(fit, predictor_terms = c("X1", "X3", "X5"), ndraws = 21, seed = 9182) # Predictions (at the training points) from the submodel onto which the # reference model was projected: prjl < proj_linpred(prj) prjp < proj_predict(prj, .seed = 7364)
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `ndraws`, but only for the sake of speed in this example; this # is not recommended in general): prj < project(fit, predictor_terms = c("X1", "X3", "X5"), ndraws = 21, seed = 9182) # Predictions (at the training points) from the submodel onto which the # reference model was projected: prjl < proj_linpred(prj) prjp < proj_predict(prj, .seed = 7364)
This is the predict()
method for refmodel
objects (returned by
get_refmodel()
or init_refmodel()
). It offers three types of output which
are all based on the reference model and new (or old) observations: Either
the linear predictor on link scale, the linear predictor transformed to
response scale, or the log posterior predictive density.
## S3 method for class 'refmodel' predict( object, newdata = NULL, ynew = NULL, offsetnew = NULL, weightsnew = NULL, type = "response", ... )
## S3 method for class 'refmodel' predict( object, newdata = NULL, ynew = NULL, offsetnew = NULL, weightsnew = NULL, type = "response", ... )
object 
An object of class 
newdata 
Passed to argument 
ynew 
If not 
offsetnew 
Passed to argument 
weightsnew 
Passed to argument 
type 
Usually only relevant if 
... 
Currently ignored. 
Argument weightsnew
is only relevant if !is.null(ynew)
.
In case of a multilevel reference model, grouplevel effects for new group
levels are drawn randomly from a (multivariate) Gaussian distribution. When
setting projpred.mlvl_pred_new
to TRUE
, all group levels from newdata
(even those that already exist in the original dataset) are treated as new
group levels (if is.null(newdata)
, all group levels from the original
dataset are considered as new group levels in that case).
In the following, $N$
, $C_{\mathrm{cat}}$
, and
$C_{\mathrm{lat}}$
from help topic refmodelinitget are used.
Furthermore, let $C$
denote either $C_{\mathrm{cat}}$
(if
type = "response"
) or $C_{\mathrm{lat}}$
(if type = "link"
).
Then, if is.null(ynew)
, the returned object contains the reference
model's predictions (with the scale depending on argument type
) as:
a length$N$
vector in case of (i) the traditional projection, (ii)
the latent projection with type = "link"
, or (iii) the latent projection
with type = "response"
and object$family$cats
being NULL
;
an $N \times C$
matrix in case of (i) the augmenteddata
projection or (ii) the latent projection with type = "response"
and
object$family$cats
being not NULL
.
If !is.null(ynew)
, the returned object is a length$N$
vector of log
posterior predictive densities evaluated at ynew
.
project()
runFor a projection
object (returned by project()
, possibly as elements of a
list
), this function extracts the combination of predictor terms onto which
the projection was performed.
predictor_terms(object, ...) ## S3 method for class 'projection' predictor_terms(object, ...)
predictor_terms(object, ...) ## S3 method for class 'projection' predictor_terms(object, ...)
object 
An object of class 
... 
Currently ignored. 
A character vector of predictor terms.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `nclusters`, but only for the sake of speed in this example; # this is not recommended in general): prj < project(fit, predictor_terms = c("X1", "X3", "X5"), nclusters = 10, seed = 9182) print(predictor_terms(prj)) # gives `c("X1", "X3", "X5")`
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Projection onto an arbitrary combination of predictor terms (with a small # value for `nclusters`, but only for the sake of speed in this example; # this is not recommended in general): prj < project(fit, predictor_terms = c("X1", "X3", "X5"), nclusters = 10, seed = 9182) print(predictor_terms(prj)) # gives `c("X1", "X3", "X5")`
project()
outputThis is the print()
method for objects of class projection
. This method
mainly exists to avoid cluttering the console when printing such objects
accidentally.
## S3 method for class 'projection' print(x, ...)
## S3 method for class 'projection' print(x, ...)
x 
An object of class 
... 
Currently ignored. 
The input object x
(invisible).
This is the print()
method for reference model objects (objects of class
refmodel
). This method mainly exists to avoid cluttering the console when
printing such objects accidentally.
## S3 method for class 'refmodel' print(x, ...)
## S3 method for class 'refmodel' print(x, ...)
x 
An object of class 
... 
Currently ignored. 
The input object x
(invisible).
varsel()
or cv_varsel()
runThis is the print()
method for vsel
objects (returned by varsel()
or
cv_varsel()
). It displays a summary of a varsel()
or cv_varsel()
run by
first calling summary.vsel()
and then print.vselsummary()
.
## S3 method for class 'vsel' print(x, digits = getOption("projpred.digits", 2), ...)
## S3 method for class 'vsel' print(x, digits = getOption("projpred.digits", 2), ...)
x 
An object of class 
digits 
Passed to argument 
... 
Arguments passed to 
The output of summary.vsel()
(invisible).
varsel()
or cv_varsel()
runThis is the print()
method for summary objects created by summary.vsel()
.
It displays a summary of the results from a varsel()
or cv_varsel()
run.
## S3 method for class 'vselsummary' print(x, digits = getOption("projpred.digits", 2), ...)
## S3 method for class 'vselsummary' print(x, digits = getOption("projpred.digits", 2), ...)
x 
An object of class 
digits 
Passed to 
... 
Arguments passed to 
In the submodel predictive performance table printed at (or towards)
the bottom, column ranking_fulldata
contains the fulldata predictor
ranking and column cv_proportions_diag
contains the main diagonal of the
matrix returned by cv_proportions()
(with cumulate
as set in the
summary.vsel()
call that created x
). To retrieve the foldwise
predictor rankings, use the ranking()
function, possibly followed by
cv_proportions()
for computing the ranking proportions (which can be
visualized by plot.cv_proportions()
).
The output of summary.vsel()
(invisible).
Project the posterior of the reference model onto the parameter space of a single submodel consisting of a specific combination of predictor terms or (after variable selection) onto the parameter space of a single or multiple submodels of specific sizes.
project( object, nterms = NULL, solution_terms = predictor_terms, predictor_terms = NULL, refit_prj = TRUE, ndraws = 400, nclusters = NULL, seed = NA, verbose = getOption("projpred.verbose_project", TRUE), ... )
project( object, nterms = NULL, solution_terms = predictor_terms, predictor_terms = NULL, refit_prj = TRUE, ndraws = 400, nclusters = NULL, seed = NA, verbose = getOption("projpred.verbose_project", TRUE), ... )
object 
An object which can be used as input to 
nterms 
Only relevant if 
solution_terms 
Deprecated. Please use argument 
predictor_terms 
If not 
refit_prj 
A single logical value indicating whether to fit the
submodels (again) ( 
ndraws 
Only relevant if 
nclusters 
Only relevant if 
seed 
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument 
verbose 
A single logical value indicating whether to print out
additional information during the computations. More precisely, this gets
passed as 
... 
Arguments passed to 
Arguments ndraws
and nclusters
are automatically truncated at
the number of posterior draws in the reference model (which is 1
for
datafit
s). Using less draws or clusters in ndraws
or nclusters
than
posterior draws in the reference model may result in slightly inaccurate
projection performance. Increasing these arguments affects the computation
time linearly.
If refit_prj = FALSE
(which is only possible if object
is of class
vsel
), project()
retrieves the submodel fits from the fulldata search
that was run when creating object
. Usually, the search relies on a rather
coarse clustering or thinning of the reference model's posterior draws (by
default, varsel()
and cv_varsel()
use nclusters = 20
). Consequently,
project()
with refit_prj = FALSE
then inherits this coarse clustering
or thinning.
If the projection is performed onto a single submodel (i.e.,
length(nterms) == 1  !is.null(predictor_terms)
), an object of class
projection
which is a list
containing the following elements:
dis
Projected draws for the dispersion parameter.
ce
The crossentropy part of the KullbackLeibler (KL) divergence from the reference model to the submodel. For some families, this is not the actual crossentropy, but a reduced one where terms which would cancel out when calculating the KL divergence have been dropped. In case of the Gaussian family, that reduced crossentropy is further modified, yielding merely a proxy.
wdraws_prj
Weights for the projected draws.
predictor_terms
A character vector of the submodel's predictor terms.
outdmin
A list
containing the submodel fits (one fit per
projected draw). This is the same as the return value of the
div_minimizer
function (see init_refmodel()
), except if project()
was used with an object
of class vsel
based on an L1 search as well
as with refit_prj = FALSE
, in which case this is the output from an
internal L1penalized divergence minimizer.
cl_ref
A numeric vector of length equal to the number of posterior draws in the reference model, containing the cluster indices of these draws.
wdraws_ref
A numeric vector of length equal to the number of
posterior draws in the reference model, giving the weights of these
draws. These weights should be treated as not being normalized (i.e.,
they don't necessarily sum to 1
).
const_wdraws_prj
A single logical value indicating whether the
projected draws have constant weights (TRUE
) or not (FALSE
).
refmodel
The reference model object.
If the projection is performed onto more than one submodel, the output from
above is returned for each submodel, giving a list
with one element for
each submodel.
The elements of an object of class projection
are not meant to be
accessed directly but instead via helper functions (see the main vignette
and projpredpackage; see also as_draws_matrix.projection()
, argument
return_draws_matrix
of proj_linpred()
, and argument
nresample_clusters
of proj_predict()
for the intended use of the
weights stored in element wdraws_prj
).
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without crossvalidation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs < varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) # Projection onto the best submodel with 2 predictor terms (with a small # value for `nclusters`, but only for the sake of speed in this example; # this is not recommended in general): prj_from_vs < project(vs, nterms = 2, nclusters = 10, seed = 9182) # Projection onto an arbitrary combination of predictor terms (with a small # value for `nclusters`, but only for the sake of speed in this example; # this is not recommended in general): prj < project(fit, predictor_terms = c("X1", "X3", "X5"), nclusters = 10, seed = 9182)
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without crossvalidation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs < varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) # Projection onto the best submodel with 2 predictor terms (with a small # value for `nclusters`, but only for the sake of speed in this example; # this is not recommended in general): prj_from_vs < project(vs, nterms = 2, nclusters = 10, seed = 9182) # Projection onto an arbitrary combination of predictor terms (with a small # value for `nclusters`, but only for the sake of speed in this example; # this is not recommended in general): prj < project(fit, predictor_terms = c("X1", "X3", "X5"), nclusters = 10, seed = 9182)
Extracts the predictor ranking(s) from an object of class vsel
(returned
by varsel()
or cv_varsel()
). A predictor ranking is simply a character
vector of predictor terms ranked by predictive relevance (with the most
relevant term first). In any case, objects of class vsel
contain the
predictor ranking based on the fulldata search. If an object of class
vsel
is based on a crossvalidation (CV) with foldwise searches (i.e., if
it was created by cv_varsel()
with validate_search = TRUE
), then it also
contains foldwise predictor rankings.
ranking(object, ...) ## S3 method for class 'vsel' ranking(object, nterms_max = NULL, ...)
ranking(object, ...) ## S3 method for class 'vsel' ranking(object, nterms_max = NULL, ...)
object 
The object from which to retrieve the predictor ranking(s). Possible classes may be inferred from the names of the corresponding methods (see also the description). 
... 
Currently ignored. 
nterms_max 
Maximum submodel size (number of predictor terms) for the
predictor ranking(s), i.e., the submodel size at which to cut off the
predictor ranking(s). Using 
An object of class ranking
which is a list
with the following
elements:
fulldata
: The predictor ranking from the fulldata search.
foldwise
: The predictor rankings from the foldwise
searches in the form of a character matrix (only available if object
is
based on a CV with foldwise searches, otherwise element foldwise
is
NULL
). The rows of this matrix correspond to the CV folds and the columns
to the submodel sizes. Each row contains the predictor ranking from the
search of that CV fold.
# For an example, see `?plot.cv_proportions`.
# For an example, see `?plot.cv_proportions`.
Function get_refmodel()
is a generic function whose methods usually call
init_refmodel()
which is the underlying workhorse (and may also be used
directly without a call to get_refmodel()
).
Both, get_refmodel()
and init_refmodel()
, create an object containing
information needed for the projection predictive variable selection, namely
about the reference model, the submodels, and how the projection should be
carried out. For the sake of simplicity, the documentation may refer to the
resulting object also as "reference model" or "reference model object", even
though it also contains information about the submodels and the projection.
A "typical" reference model object is created by get_refmodel.stanreg()
and
brms::get_refmodel.brmsfit()
, either implicitly by a call to a toplevel
function such as project()
, varsel()
, and cv_varsel()
or explicitly by
a call to get_refmodel()
. All non"typical" reference model objects will be
called "custom" reference model objects.
Some arguments are for $K$
fold crossvalidation ($K$
fold CV) only;
see cv_varsel()
for the use of $K$
fold CV in projpred.
get_refmodel(object, ...) ## S3 method for class 'refmodel' get_refmodel(object, ...) ## S3 method for class 'vsel' get_refmodel(object, ...) ## S3 method for class 'projection' get_refmodel(object, ...) ## Default S3 method: get_refmodel(object, family = NULL, ...) ## S3 method for class 'stanreg' get_refmodel(object, latent = FALSE, dis = NULL, ...) init_refmodel( object, data, formula, family, ref_predfun = NULL, div_minimizer = NULL, proj_predfun = NULL, extract_model_data = NULL, cvfun = NULL, cvfits = NULL, dis = NULL, cvrefbuilder = NULL, called_from_cvrefbuilder = FALSE, ... )
get_refmodel(object, ...) ## S3 method for class 'refmodel' get_refmodel(object, ...) ## S3 method for class 'vsel' get_refmodel(object, ...) ## S3 method for class 'projection' get_refmodel(object, ...) ## Default S3 method: get_refmodel(object, family = NULL, ...) ## S3 method for class 'stanreg' get_refmodel(object, latent = FALSE, dis = NULL, ...) init_refmodel( object, data, formula, family, ref_predfun = NULL, div_minimizer = NULL, proj_predfun = NULL, extract_model_data = NULL, cvfun = NULL, cvfits = NULL, dis = NULL, cvrefbuilder = NULL, called_from_cvrefbuilder = FALSE, ... )
object 
For 
... 
For 
family 
An object of class 
latent 
A single logical value indicating whether to use the latent
projection ( 
dis 
A vector of posterior draws for the reference model's dispersion
parameter or—more precisely—the posterior values for the reference
model's parameterconditional predictive variance (assuming that this
variance is the same for all observations). May be 
data 
A 
formula 
The full formula to use for the search procedure. For custom
reference models, this does not necessarily coincide with the reference
model's formula. For general information about formulas in R, see

ref_predfun 
Prediction function for the linear predictor of the
reference model, including offsets (if existing). See also section
"Arguments 
div_minimizer 
A function for minimizing the KullbackLeibler (KL)
divergence from the reference model to a submodel (i.e., for performing the
projection of the reference model onto a submodel). The output of

proj_predfun 
Prediction function for the linear predictor of a
submodel onto which the reference model is projected. See also section
"Arguments 
extract_model_data 
A function for fetching some variables (response,
observation weights, offsets) from the original dataset (supplied to
argument 
cvfun 
For 
cvfits 
For 
cvrefbuilder 
For 
called_from_cvrefbuilder 
A single logical value indicating whether

An object that can be passed to all the functions that take the
reference model fit as the first argument, such as varsel()
,
cv_varsel()
, project()
, proj_linpred()
, and proj_predict()
.
Usually, the returned object is of class refmodel
. However, if object
is NULL
, the returned object is of class datafit
as well as of class
refmodel
(with datafit
being first). Objects of class datafit
are
handled differently at several places throughout this package.
The elements of the returned object are not meant to be accessed directly
but instead via downstream functions (see the functions mentioned above as
well as predict.refmodel()
).
Although bad practice (in general), a reference model lacking an intercept can be used within projpred. However, it will always be projected onto submodels which include an intercept. The reason is that even if the true intercept in the reference model is zero, this does not need to hold for the submodels.
In multilevel (grouplevel) terms, function calls on the righthand side of
the 
character (e.g., (1  gr(group_variable))
, which is possible in
brms) are currently not allowed in projpred.
For additive models (still an experimental feature), only mgcv::s()
and
mgcv::t2()
are currently supported as smooth terms. Furthermore, these need
to be called without any arguments apart from the predictor names (symbols).
For example, for smoothing the effect of a predictor x
, only s(x)
or
t2(x)
are allowed. As another example, for smoothing the joint effect of
two predictors x
and z
, only s(x, z)
or t2(x, z)
are allowed (and
analogously for higherorder joint effects, e.g., of three predictors). Note
that all smooth terms need to be included in formula
(there is no random
argument as in rstanarm::stan_gamm4()
, for example).
ref_predfun
, proj_predfun
, and div_minimizer
Arguments ref_predfun
, proj_predfun
, and div_minimizer
may be NULL
for using an internal default (see projpredpackage for the functions used
by the default divergence minimizers). Otherwise, let $N$
denote the
number of observations (in case of CV, these may be reduced to each fold),
$S_{\mathrm{ref}}$
the number of posterior draws for the reference
model's parameters, and $S_{\mathrm{prj}}$
the number of draws for
the parameters of a submodel that the reference model has been projected onto
(short: the number of projected draws). For the augmenteddata projection,
let $C_{\mathrm{cat}}$
denote the number of response categories,
$C_{\mathrm{lat}}$
the number of latent response categories (which
typically equals $C_{\mathrm{cat}}  1$
), and define
$N_{\mathrm{augcat}} := N \cdot C_{\mathrm{cat}}$
as well as $N_{\mathrm{auglat}} := N \cdot C_{\mathrm{lat}}$
. Then the functions supplied to these arguments need to have the
following prototypes:
ref_predfun
: ref_predfun(fit, newdata = NULL)
where:
fit
accepts the reference model fit as given in argument object
(but possibly refitted to a subset of the observations, as done in
$K$
fold CV).
newdata
accepts either NULL
(for using the original dataset,
typically stored in fit
) or data for new observations (at least in the
form of a data.frame
).
proj_predfun
: proj_predfun(fits, newdata)
where:
fits
accepts a list
of length $S_{\mathrm{prj}}$
containing this number of submodel fits. This list
is the same as that
returned by project()
in its output element outdmin
(which in turn is
the same as the return value of div_minimizer
, except if project()
was used with an object
of class vsel
based on an L1 search as well
as with refit_prj = FALSE
).
newdata
accepts data for new observations (at least in the form of a
data.frame
).
div_minimizer
does not need to have a specific prototype, but it needs to
be able to be called with the following arguments:
formula
accepts either a standard formula
with a single response
(if $S_{\mathrm{prj}} = 1$
or in case of the
augmenteddata projection) or a formula
with $S_{\mathrm{prj}} >
1$
response variables cbind()
ed on the lefthand side in
which case the projection has to be performed for each of the response
variables separately.
data
accepts a data.frame
to be used for the projection. In case of
the traditional or the latent projection, this dataset has $N$
rows.
In case of the augmenteddata projection, this dataset has
$N_{\mathrm{augcat}}$
rows.
family
accepts an object of class family
.
weights
accepts either observation weights (at least in the form of a
numeric vector) or NULL
(for using a vector of ones as weights).
projpred_var
accepts an $N \times S_{\mathrm{prj}}$
matrix of predictive variances (necessary for projpred's internal
GLM fitter) in case of the traditional or the latent projection and an
$N_{\mathrm{augcat}} \times S_{\mathrm{prj}}$
matrix (containing only NA
s) in case of the augmenteddata projection.
projpred_ws_aug
accepts an $N \times S_{\mathrm{prj}}$
matrix of expected values for the response in case of the traditional or
the latent projection and an $N_{\mathrm{augcat}} \times
S_{\mathrm{prj}}$
matrix of probabilities for the
response categories in case of the augmenteddata projection.
...
accepts further arguments specified by the user (or by
projpred).
The return value of these functions needs to be:
ref_predfun
: for the traditional or the latent projection, an $N
\times S_{\mathrm{ref}}$
matrix; for the augmenteddata
projection, an $S_{\mathrm{ref}} \times N \times C_{\mathrm{lat}}$
array (the only exception is the augmenteddata projection for
the binomial()
family in which case ref_predfun
needs to return an $N
\times S_{\mathrm{ref}}$
matrix just like for the traditional
projection because the array is constructed by an internal wrapper function).
proj_predfun
: for the traditional or the latent projection, an $N
\times S_{\mathrm{prj}}$
matrix; for the augmenteddata
projection, an $N \times C_{\mathrm{lat}} \times S_{\mathrm{prj}}$
array.
div_minimizer
: a list
of length $S_{\mathrm{prj}}$
containing this number of submodel fits.
extract_model_data
The function supplied to argument extract_model_data
needs to have the
prototype
extract_model_data(object, newdata, wrhs = NULL, orhs = NULL, extract_y = TRUE)
where:
object
accepts the reference model fit as given in argument object
(but
possibly refitted to a subset of the observations, as done in $K$
fold
CV).
newdata
accepts data for new observations (at least in the form of a
data.frame
).
wrhs
accepts at least (i) a righthand side formula consisting only of
the variable in newdata
containing the observation weights or (ii) NULL
for using the observation weights corresponding to newdata
(typically, the
observation weights are stored in a column of newdata
; if the model was
fitted without observation weights, a vector of ones should be used).
orhs
accepts at least (i) a righthand side formula consisting only of
the variable in newdata
containing the offsets or (ii) NULL
for using the
offsets corresponding to newdata
(typically, the offsets are stored in a
column of newdata
; if the model was fitted without offsets, a vector of
zeros should be used).
extract_y
accepts a single logical value indicating whether output
element y
(see below) shall be NULL
(TRUE
) or not (FALSE
).
The return value of extract_model_data
needs to be a list
with elements
y
, weights
, and offset
, each being a numeric vector containing the data
for the response, the observation weights, and the offsets, respectively. An
exception is that y
may also be NULL
(depending on argument extract_y
),
a nonnumeric vector, or a factor
.
The weights and offsets returned by extract_model_data
will be assumed to
hold for the reference model as well as for the submodels.
Above, arguments wrhs
and orhs
were assumed to have defaults of NULL
.
It should be possible to use defaults other than NULL
, but we strongly
recommend to use NULL
. If defaults other than NULL
are used, they need to
imply the behaviors described at items "(ii)" (see the descriptions of wrhs
and orhs
).
If a custom reference model for an augmenteddata projection is needed, see
also extend_family()
.
For the augmenteddata projection, the response vector resulting from
extract_model_data
is internally coerced to a factor
(using
as.factor()
). The levels of this factor
have to be identical to
family$cats
(after applying extend_family()
internally; see
extend_family()
's argument augdat_y_unqs
).
Note that responsespecific offsets (i.e., one length$N$
offset vector
per response category) are not supported by projpred yet. So far, only
offsets which are the same across all response categories are supported. This
is why in case of the brms::categorical()
family, offsets are currently not
supported at all.
Currently, object = NULL
(i.e., a datafit
; see section "Value") is not
supported in case of the augmenteddata projection.
If a custom reference model for a latent projection is needed, see also
extend_family()
.
For the latent projection, family$cats
(after applying extend_family()
internally; see extend_family()
's argument latent_y_unqs
) currently must
not be NULL
if the original (i.e., nonlatent) response is a factor
.
Conversely, if family$cats
(after applying extend_family()
) is
nonNULL
, the response vector resulting from extract_model_data
is
internally coerced to a factor
(using as.factor()
). The levels of this
factor
have to be identical to that nonNULL
element family$cats
.
Currently, object = NULL
(i.e., a datafit
; see section "Value") is not
supported in case of the latent projection.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Define the reference model object explicitly: ref < get_refmodel(fit) print(class(ref)) # gives `"refmodel"` # Now see, for example, `?varsel`, `?cv_varsel`, and `?project` for # possible postprocessing functions. Most of the postprocessing functions # call get_refmodel() internally at the beginning, so you will rarely need # to call get_refmodel() yourself. # A custom reference model object which may be used in a variable selection # where the candidate predictors are not a subset of those used for the # reference model's predictions: ref_cust < init_refmodel( fit, data = dat_gauss, formula = y ~ X6 + X7, family = gaussian(), cvfun = function(folds) { kfold( fit, K = max(folds), save_fits = TRUE, folds = folds, cores = 1 )$fits[, "fit"] }, dis = as.matrix(fit)[, "sigma"], cvrefbuilder = function(cvfit) { init_refmodel(cvfit, data = dat_gauss[cvfit$omitted, , drop = FALSE], formula = y ~ X6 + X7, family = gaussian(), dis = as.matrix(cvfit)[, "sigma"], called_from_cvrefbuilder = TRUE) } ) # Now, the postprocessing functions mentioned above (for example, # varsel(), cv_varsel(), and project()) may be applied to `ref_cust`.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Define the reference model object explicitly: ref < get_refmodel(fit) print(class(ref)) # gives `"refmodel"` # Now see, for example, `?varsel`, `?cv_varsel`, and `?project` for # possible postprocessing functions. Most of the postprocessing functions # call get_refmodel() internally at the beginning, so you will rarely need # to call get_refmodel() yourself. # A custom reference model object which may be used in a variable selection # where the candidate predictors are not a subset of those used for the # reference model's predictions: ref_cust < init_refmodel( fit, data = dat_gauss, formula = y ~ X6 + X7, family = gaussian(), cvfun = function(folds) { kfold( fit, K = max(folds), save_fits = TRUE, folds = folds, cores = 1 )$fits[, "fit"] }, dis = as.matrix(fit)[, "sigma"], cvrefbuilder = function(cvfit) { init_refmodel(cvfit, data = dat_gauss[cvfit$omitted, , drop = FALSE], formula = y ~ X6 + X7, family = gaussian(), dis = as.matrix(cvfit)[, "sigma"], called_from_cvrefbuilder = TRUE) } ) # Now, the postprocessing functions mentioned above (for example, # varsel(), cv_varsel(), and project()) may be applied to `ref_cust`.
cvfits
from cvfun
A helper function that can be used to create input for
cv_varsel.refmodel()
's argument cvfits
by running first cv_folds()
and
then the reference model object's cvfun
(see init_refmodel()
). This is
helpful if $K$
fold CV is run multiple times based on the same $K$
reference model refits.
run_cvfun(object, ...) ## Default S3 method: run_cvfun(object, ...) ## S3 method for class 'refmodel' run_cvfun( object, K = if (!inherits(object, "datafit")) 5 else 10, folds = NULL, seed = NA, ... )
run_cvfun(object, ...) ## Default S3 method: run_cvfun(object, ...) ## S3 method for class 'refmodel' run_cvfun( object, K = if (!inherits(object, "datafit")) 5 else 10, folds = NULL, seed = NA, ... )
object 
An object of class 
... 
For 
K 
Number of folds. Must be at least 2 and not exceed the number of
observations. Ignored if 
folds 
Either 
seed 
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument 
An object that can be used as input for cv_varsel.refmodel()
's
argument cvfits
.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Define the reference model object explicitly (not really necessary here # because the get_refmodel() call is quite fast in this example, but in # general, this approach is faster than defining the reference model object # multiple times implicitly): ref < get_refmodel(fit) # Run the reference model object's `cvfun` (with a small value for `K`, but # only for the sake of speed in this example; this is not recommended in # general): cv_fits < run_cvfun(ref, K = 2, seed = 184) # Run cv_varsel() (with L1 search and small values for `nterms_max` and # `nclusters_pred`, but only for the sake of speed in this example; this is # not recommended in general) and use `cv_fits` there: cvvs_L1 < cv_varsel(ref, method = "L1", cv_method = "kfold", cvfits = cv_fits, nterms_max = 3, nclusters_pred = 10, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible postprocessing functions. # The purpose of run_cvfun() is to create an object that can be used in # multiple cv_varsel() calls, e.g., to check the sensitivity to the search # method (L1 or forward): cvvs_fw < cv_varsel(ref, method = "forward", cv_method = "kfold", cvfits = cv_fits, nterms_max = 3, nclusters = 5, nclusters_pred = 10, seed = 5555) # Stratified KfoldCV is straightforward: n_strat < 3L set.seed(692) # Some example strata: strat_fac < sample(paste0("lvl", seq_len(n_strat)), size = nrow(dat_gauss), replace = TRUE, prob = diff(c(0, pnorm(seq_len(n_strat  1L)  0.5), 1))) table(strat_fac) # Use loo::kfold_split_stratified() to create the folds vector: folds_strat < loo::kfold_split_stratified(K = 2, x = strat_fac) table(folds_strat, strat_fac) # Call run_cvfun(), but this time with argument `folds` instead of `K` (here, # specifying argument `seed` would not be necessary because of the set.seed() # call above, but we specify it nonetheless for the sake of generality): cv_fits_strat < run_cvfun(ref, folds = folds_strat, seed = 391) # Now use `cv_fits_strat` analogously to `cv_fits` from above.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Define the reference model object explicitly (not really necessary here # because the get_refmodel() call is quite fast in this example, but in # general, this approach is faster than defining the reference model object # multiple times implicitly): ref < get_refmodel(fit) # Run the reference model object's `cvfun` (with a small value for `K`, but # only for the sake of speed in this example; this is not recommended in # general): cv_fits < run_cvfun(ref, K = 2, seed = 184) # Run cv_varsel() (with L1 search and small values for `nterms_max` and # `nclusters_pred`, but only for the sake of speed in this example; this is # not recommended in general) and use `cv_fits` there: cvvs_L1 < cv_varsel(ref, method = "L1", cv_method = "kfold", cvfits = cv_fits, nterms_max = 3, nclusters_pred = 10, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible postprocessing functions. # The purpose of run_cvfun() is to create an object that can be used in # multiple cv_varsel() calls, e.g., to check the sensitivity to the search # method (L1 or forward): cvvs_fw < cv_varsel(ref, method = "forward", cv_method = "kfold", cvfits = cv_fits, nterms_max = 3, nclusters = 5, nclusters_pred = 10, seed = 5555) # Stratified KfoldCV is straightforward: n_strat < 3L set.seed(692) # Some example strata: strat_fac < sample(paste0("lvl", seq_len(n_strat)), size = nrow(dat_gauss), replace = TRUE, prob = diff(c(0, pnorm(seq_len(n_strat  1L)  0.5), 1))) table(strat_fac) # Use loo::kfold_split_stratified() to create the folds vector: folds_strat < loo::kfold_split_stratified(K = 2, x = strat_fac) table(folds_strat, strat_fac) # Call run_cvfun(), but this time with argument `folds` instead of `K` (here, # specifying argument `seed` would not be necessary because of the set.seed() # call above, but we specify it nonetheless for the sake of generality): cv_fits_strat < run_cvfun(ref, folds = folds_strat, seed = 391) # Now use `cv_fits_strat` analogously to `cv_fits` from above.
varsel()
or cv_varsel()
run
or the predictor combination from a project()
runThe solution_terms.vsel()
method retrieves the solution path from a
fulldata search (vsel
objects are returned by varsel()
or
cv_varsel()
). The solution_terms.projection()
method retrieves the
predictor combination onto which a projection was performed (projection
objects are returned by project()
, possibly as elements of a list
). Both
methods (and hence also the solution_terms()
generic) are deprecated and
will be removed in a future release. Please use ranking()
instead of
solution_terms.vsel()
(ranking()
's output element fulldata
contains the
fulldata predictor ranking that is extracted by solution_terms.vsel()
;
ranking()
's output element foldwise
contains the foldwise predictor
rankings—if available—which were previously not accessible via a builtin
function) and predictor_terms()
instead of solution_terms.projection()
.
solution_terms(object, ...) ## S3 method for class 'vsel' solution_terms(object, ...) ## S3 method for class 'projection' solution_terms(object, ...)
solution_terms(object, ...) ## S3 method for class 'vsel' solution_terms(object, ...) ## S3 method for class 'projection' solution_terms(object, ...)
object 
The object from which to retrieve the predictor terms. Possible classes may be inferred from the names of the corresponding methods (see also the description). 
... 
Currently ignored. 
A character vector of predictor terms.
This function can suggest an appropriate submodel size based on a decision
rule described in section "Details" below. Note that this decision is quite
heuristic and should be interpreted with caution. It is recommended to
examine the results via plot.vsel()
, cv_proportions()
,
plot.cv_proportions()
, and/or summary.vsel()
and to make the final
decision based on what is most appropriate for the problem at hand.
suggest_size(object, ...) ## S3 method for class 'vsel' suggest_size( object, stat = "elpd", pct = 0, type = "upper", thres_elpd = NA, warnings = TRUE, ... )
suggest_size(object, ...) ## S3 method for class 'vsel' suggest_size( object, stat = "elpd", pct = 0, type = "upper", thres_elpd = NA, warnings = TRUE, ... )
object 
An object of class 
... 
Arguments passed to 
stat 
Performance statistic (i.e., utility or loss) used for the
decision. See argument 
pct 
A number giving the proportion (not percents) of the relative null model utility one is willing to sacrifice. See section "Details" below for more information. 
type 
Either 
thres_elpd 
Only relevant if 
warnings 
Mainly for internal use. A single logical value indicating
whether to throw warnings if automatic suggestion fails. Usually there is
no reason to set this to 
In general (beware of special cases below), the suggested model
size is the smallest model size $j \in \{0, 1, ...,
\texttt{nterms\_max}\}$
for which either the
lower or upper bound (depending on argument type
) of the
normalapproximation (or bootstrap or exponentiated normalapproximation;
see argument stat
) confidence interval (with nominal coverage 1  alpha
; see argument alpha
of summary.vsel()
) for $U_j 
U_{\mathrm{base}}$
(with $U_j$
denoting the $j$
th
submodel's true utility and $U_{\mathrm{base}}$
denoting the
baseline model's true utility)
falls above (or is equal to)
$\texttt{pct} \cdot (u_0 
u_{\mathrm{base}})$
where $u_0$
denotes the null
model's estimated utility and $u_{\mathrm{base}}$
the baseline
model's estimated utility. The baseline model is either the reference model
or the best submodel found (see argument baseline
of summary.vsel()
).
In doing so, loss statistics like the root mean squared error (RMSE) and
the mean squared error (MSE) are converted to utilities by multiplying them
by 1
, so a call such as suggest_size(object, stat = "rmse", type = "upper")
finds the smallest model size whose upper confidence interval
bound for the negative RMSE or MSE exceeds (or is equal to) the cutoff
(or, equivalently, has the lower confidence interval bound for the RMSE or
MSE below—or equal to—the cutoff). This is done to make the
interpretation of argument type
the same regardless of argument stat
.
For the geometric mean predictive density (GMPD), the decision rule above
is applied on log()
scale. In other words, if the true GMPD is denoted by
$U^\ast_j$
for the $j$
th submodel and
$U^\ast_{\mathrm{base}}$
for the baseline model (so that
$U_j$
and $U_{\mathrm{base}}$
from above are given by
$U_j = \log(U^\ast_j)$
and
$U_{\mathrm{base}} = \log(U^\ast_{\mathrm{base}})$
), then suggest_size()
yields the smallest model size whose
lower or upper (depending on argument type
) confidence interval bound for
$\frac{U^\ast_j}{U^\ast_{\mathrm{base}}}$
exceeds (or
is equal to)
$(\frac{u^\ast_0}{u^\ast_{\mathrm{base}}})^{\texttt{pct}}$
where $u^\ast_0$
denotes the null
model's estimated GMPD and $u^\ast_{\mathrm{base}}$
the
baseline model's estimated GMPD.
If !is.na(thres_elpd)
and stat = "elpd"
, the decision rule above is
extended: The suggested model size is then the smallest model size $j$
fulfilling the rule above or $u_j  u_{\mathrm{base}} >
\texttt{thres\_elpd}$
. Correspondingly, in case
of stat = "mlpd"
(and !is.na(thres_elpd)
), the suggested model size is
the smallest model size $j$
fulfilling the rule above or $u_j 
u_{\mathrm{base}} > \frac{\texttt{thres\_elpd}}{N}$
with $N$
denoting the number of observations.
Correspondingly, in case of stat = "gmpd"
(and !is.na(thres_elpd)
), the
suggested model size is the smallest model size $j$
fulfilling the rule
above or $\frac{u^\ast_j}{u^\ast_{\mathrm{base}}} >
\exp(\frac{\texttt{thres\_elpd}}{N})$
.
For example (disregarding the special extensions in case of
!is.na(thres_elpd)
with stat %in% c("elpd", "mlpd", "gmpd")
), alpha = 2 * pnorm(1)
, pct = 0
, and type = "upper"
means that we select the
smallest model size for which the upper bound of the 1  2 * pnorm(1)
(approximately 68.3%) confidence interval for $U_j 
U_{\mathrm{base}}$
($\frac{U^\ast_j}{U^\ast_{\mathrm{base}}}$
in case of
the GMPD) exceeds (or is equal to) zero (one in case of the GMPD), that is
(if stat
is a performance statistic for which the normal approximation is
used, not the bootstrap and not the exponentiated normal approximation),
for which the submodel's utility estimate is at most one standard error
smaller than the baseline model's utility estimate (with that standard
error referring to the utility difference).
Apart from the two summary.vsel()
arguments mentioned above (alpha
and
baseline
), resp_oscale
is another important summary.vsel()
argument
that may be passed via ...
.
A single numeric value, giving the suggested submodel size (or NA
if the suggestion failed).
The intercept is not counted by suggest_size()
, so a suggested size of
zero stands for the interceptonly model.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without crossvalidation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs < varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(suggest_size(vs))
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without crossvalidation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs < varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(suggest_size(vs))
varsel()
or cv_varsel()
runThis is the summary()
method for vsel
objects (returned by varsel()
or
cv_varsel()
). Apart from some general information about the varsel()
or
cv_varsel()
run, it shows the fulldata predictor ranking, basic
information about the (CV) variability in the ranking of the predictors (if
available; inferred from cv_proportions()
), and estimates for
userspecified predictive performance statistics. For a graphical
representation, see plot.vsel()
. For extracting the predictive performance
results printed at the bottom of the output created by this summary()
method, see performances()
.
## S3 method for class 'vsel' summary( object, nterms_max = NULL, stats = "elpd", type = c("mean", "se", "diff", "diff.se"), deltas = FALSE, alpha = 2 * pnorm(1), baseline = if (!inherits(object$refmodel, "datafit")) "ref" else "best", resp_oscale = TRUE, cumulate = FALSE, ... )
## S3 method for class 'vsel' summary( object, nterms_max = NULL, stats = "elpd", type = c("mean", "se", "diff", "diff.se"), deltas = FALSE, alpha = 2 * pnorm(1), baseline = if (!inherits(object$refmodel, "datafit")) "ref" else "best", resp_oscale = TRUE, cumulate = FALSE, ... )
object 
An object of class 
nterms_max 
Maximum submodel size (number of predictor terms) for which
the performance statistics are calculated. Using 
stats 
One or more character strings determining which performance
statistics (i.e., utilities or losses) to estimate based on the
observations in the evaluation (or "test") set (in case of
crossvalidation, these are all observations because they are partitioned
into multiple test sets; in case of

type 
One or more items from 
deltas 
If 
alpha 
A number determining the (nominal) coverage 
baseline 
For 
resp_oscale 
Only relevant for the latent projection. A single logical
value indicating whether to calculate the performance statistics on the
original response scale ( 
cumulate 
Passed to argument 
... 
Arguments passed to the internal function which is used for
bootstrapping (if applicable; see argument 
The stats
options "mse"
and "rmse"
are only available for:
the traditional projection,
the latent projection with resp_oscale = FALSE
,
the latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
.
The stats
option "acc"
(= "pctcorr"
) is only available for:
the binomial()
family in case of the traditional projection,
all families in case of the augmenteddata projection,
the binomial()
family (on the original response scale) in case of the
latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
,
all families (on the original response scale) in case of the latent
projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being not NULL
.
The stats
option "auc"
is only available for:
the binomial()
family in case of the traditional projection,
the binomial()
family (on the original response scale) in case of the
latent projection with resp_oscale = TRUE
in combination with
<refmodel>$family$cats
being NULL
.
An object of class vselsummary
. The elements of this object are not
meant to be accessed directly but instead via helper functions
(print.vselsummary()
and performances.vselsummary()
).
print.vselsummary()
, performances.vselsummary()
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without crossvalidation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs < varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(summary(vs), digits = 1)
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without crossvalidation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs < varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) print(summary(vs), digits = 1)
Run the search part and the evaluation part for a projection predictive
variable selection. The search part determines the predictor ranking (also
known as solution path), i.e., the best submodel for each submodel size
(number of predictor terms). The evaluation part determines the predictive
performance of the submodels along the predictor ranking. A special method is
varsel.vsel()
which reuses the search results from an earlier
varsel()
(or cv_varsel()
) run, as illustrated in the main vignette.
varsel(object, ...) ## Default S3 method: varsel(object, ...) ## S3 method for class 'vsel' varsel(object, ...) ## S3 method for class 'refmodel' varsel( object, d_test = NULL, method = "forward", ndraws = NULL, nclusters = 20, ndraws_pred = 400, nclusters_pred = NULL, refit_prj = !inherits(object, "datafit"), nterms_max = NULL, verbose = TRUE, search_control = NULL, lambda_min_ratio = 1e05, nlambda = 150, thresh = 1e06, penalty = NULL, search_terms = NULL, search_out = NULL, seed = NA, ... )
varsel(object, ...) ## Default S3 method: varsel(object, ...) ## S3 method for class 'vsel' varsel(object, ...) ## S3 method for class 'refmodel' varsel( object, d_test = NULL, method = "forward", ndraws = NULL, nclusters = 20, ndraws_pred = 400, nclusters_pred = NULL, refit_prj = !inherits(object, "datafit"), nterms_max = NULL, verbose = TRUE, search_control = NULL, lambda_min_ratio = 1e05, nlambda = 150, thresh = 1e06, penalty = NULL, search_terms = NULL, search_out = NULL, seed = NA, ... )
object 
An object of class 
... 
For 
d_test 
A 
method 
The method for the search part. Possible options are

ndraws 
Number of posterior draws used in the search part. Ignored if

nclusters 
Number of clusters of posterior draws used in the search
part. Ignored in case of L1 search (because L1 search always uses a single
cluster). For the meaning of 
ndraws_pred 
Only relevant if 
nclusters_pred 
Only relevant if 
refit_prj 
For the evaluation part, should the projections onto the
submodels along the predictor ranking be performed again using 
nterms_max 
Maximum submodel size (number of predictor terms) up to
which the search is continued. If 
verbose 
A single logical value indicating whether to print out additional information during the computations. 
search_control 
A

lambda_min_ratio 
Deprecated (please use 
nlambda 
Deprecated (please use 
thresh 
Deprecated (please use 
penalty 
Only relevant for L1 search. A numeric vector determining the
relative penalties or costs for the predictors. A value of 
search_terms 
Only relevant for forward search. A custom character
vector of predictor term blocks to consider for the search. Section
"Details" below describes more precisely what "predictor term block" means.
The intercept ( 
search_out 
Intended for internal use. 
seed 
Pseudorandom number generation (PRNG) seed by which the same
results can be obtained again if needed. Passed to argument 
Arguments ndraws
, nclusters
, nclusters_pred
, and ndraws_pred
are automatically truncated at the number of posterior draws in the
reference model (which is 1
for datafit
s). Using less draws or clusters
in ndraws
, nclusters
, nclusters_pred
, or ndraws_pred
than posterior
draws in the reference model may result in slightly inaccurate projection
performance. Increasing these arguments affects the computation time
linearly.
For argument method
, there are some restrictions: For a reference model
with multilevel or additive formula terms or a reference model set up for
the augmenteddata projection, only the forward search is available.
Furthermore, argument search_terms
requires a forward search to take
effect.
L1 search is faster than forward search, but forward search may be more
accurate. Furthermore, forward search may find a sparser model with
comparable performance to that found by L1 search, but it may also
overfit when more predictors are added. This overfit can be detected
by running search validation (see cv_varsel()
).
An L1 search may select an interaction term before all involved lowerorder interaction terms (including maineffect terms) have been selected. In projpred versions > 2.6.0, the resulting predictor ranking is automatically modified so that the lowerorder interaction terms come before this interaction term, but if this is conceptually undesired, choose the forward search instead.
The elements of the search_terms
character vector don't need to be
individual predictor terms. Instead, they can be building blocks consisting
of several predictor terms connected by the +
symbol. To understand how
these building blocks work, it is important to know how projpred's
forward search works: It starts with an empty vector chosen
which will
later contain already selected predictor terms. Then, the search iterates
over model sizes $j \in \{0, ..., J\}$
(with $J$
denoting the maximum submodel size, not counting the intercept). The
candidate models at model size $j$
are constructed from those elements
from search_terms
which yield model size $j$
when combined with the
chosen
predictor terms. Note that sometimes, there may be no candidate
models for model size $j$
. Also note that internally, search_terms
is
expanded to include the intercept ("1"
), so the first step of the search
(model size 0) always consists of the interceptonly model as the only
candidate.
As a search_terms
example, consider a reference model with formula y ~ x1 + x2 + x3
. Then, to ensure that x1
is always included in the
candidate models, specify search_terms = c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3")
(or, in a simpler way that leads to the same results,
search_terms = c("x1", "x1 + x2", "x1 + x3")
, for which helper function
force_search_terms()
exists). This search would start with y ~ 1
as the
only candidate at model size 0. At model size 1, y ~ x1
would be the only
candidate. At model size 2, y ~ x1 + x2
and y ~ x1 + x3
would be the
two candidates. At the last model size of 3, y ~ x1 + x2 + x3
would be
the only candidate. As another example, to exclude x1
from the search,
specify search_terms = c("x2", "x3", "x2 + x3")
(or, in a simpler way
that leads to the same results, search_terms = c("x2", "x3")
).
An object of class vsel
. The elements of this object are not meant
to be accessed directly but instead via helper functions (see the main
vignette and projpredpackage).
d_test
If not NULL
, then d_test
needs to be a list
with the following
elements:
data
: a data.frame
containing the predictor variables for the test set.
offset
: a numeric vector containing the offset values for the test set
(if there is no offset, use a vector of zeros).
weights
: a numeric vector containing the observation weights for the test
set (if there are no observation weights, use a vector of ones).
y
: a vector or a factor
containing the response values for the test
set. In case of the latent projection, this has to be a vector containing the
latent response values, but it can also be a vector full of NA
s if
latentscale postprocessing is not needed.
y_oscale
: Only needs to be provided in case of the latent projection
where this needs to be a vector or a factor
containing the original
(i.e., nonlatent) response values for the test set.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without crossvalidation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs < varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible postprocessing functions.
# Data: dat_gauss < data.frame(y = df_gaussian$y, df_gaussian$x) # The `stanreg` fit which will be used as the reference model (with small # values for `chains` and `iter`, but only for technical reasons in this # example; this is not recommended in general): fit < rstanarm::stan_glm( y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss, QR = TRUE, chains = 2, iter = 500, refresh = 0, seed = 9876 ) # Run varsel() (here without crossvalidation, with L1 search, and with small # values for `nterms_max` and `nclusters_pred`, but only for the sake of # speed in this example; this is not recommended in general): vs < varsel(fit, method = "L1", nterms_max = 3, nclusters_pred = 10, seed = 5555) # Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`, # and `?ranking` for possible postprocessing functions.
A helper function for extracting response values, observation weights, and
offsets from a dataset. It is designed for use in the extract_model_data
function of custom reference model objects (see init_refmodel()
).
y_wobs_offs(newdata, wrhs = NULL, orhs = NULL, resp_form)
y_wobs_offs(newdata, wrhs = NULL, orhs = NULL, resp_form)
newdata 
The 
wrhs 
Either a righthand side formula consisting only of the variable
in 
orhs 
Either a righthand side formula consisting only of the variable
in 
resp_form 
If this is a formula, then the second element of this
formula (if the formula is a standard formula with both lefthand and
righthand side, then its second element is the lefthand side; if the
formula is a righthand side formula, then its second element is the
righthand side) will be extracted from 
A list
with elements y
, weights
, and offset
, each being a
numeric vector containing the data for the response, the observation
weights, and the offsets, respectively. An exception is that y
may also
be NULL
(depending on argument resp_form
), a nonnumeric vector, or a
factor
.
# For an example, see `?init_refmodel`.
# For an example, see `?init_refmodel`.