--- title: "rvar: The Random Variable Datatype" author: "Matthew Kay" date: "`r Sys.Date()`" output: html_vignette: toc: yes vignette: > %\VignetteIndexEntry{rvar: The Random Variable Datatype} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ## Introduction This vignette describes the `rvar()` datatype, a multidimensional, sample-based representation of random variables designed to act as much like base R arrays as possible (e.g., by supporting many math operators and functions). This format is also the basis of the `draws_rvars()` format. The `rvar()` datatype is inspired by the [rv](https://cran.r-project.org/package=rv) package and [Kerman and Gelman (2007)](https://doi.org/10.1007%2Fs11222-007-9020-4), though with a slightly different backing format (multidimensional arrays). It is also designed to interoperate with vectorized distributions in the [distributional](https://pkg.mitchelloharawild.com/distributional/) package, to be able to be used inside `data.frame()`s and `tibble()`s, and to be used with distribution visualizations in the [ggdist](https://mjskay.github.io/ggdist/) package. ```{r setup, include = FALSE} library(posterior) set.seed(1234) ``` ## The `rvars` datatype The `rvar()` datatype is a wrapper around a multidimensional array where the first dimension is the number of draws in the random variable. The most direct way to create a random variable is to pass such an array to the `rvar()` function. For example, to create a "scalar" `rvar`, one would pass a one-dimensional array or a vector whose length (here `4000`) is the desired number of draws: ```{r x_rvar_rnorm} x <- rvar(rnorm(4000, mean = 1, sd = 1)) x ``` The default display of an `rvar` shows the mean and standard deviation of each element of the array. We can create random vectors by adding an additional dimension beyond just the draws dimension to the input array: ```{r x_rvar_array} n <- 4 # length of output vector x <- rvar(array(rnorm(4000*n, mean = 1, sd = 1), dim = c(4000, n))) x ``` Or we can create a random matrix: ```{r x_matrix} rows <- 4 cols <- 3 x <- rvar(array(rnorm(4000 * rows * cols, mean = 1, sd = 1), dim = c(4000, rows, cols))) x ``` Or any array up to an arbitrary number of dimensions. The array backing an `rvar` can be accessed (and modified, with caution) via `draws_of()`: ```{r str_draws} str(draws_of(x)) ``` While the above examples assume all draws come from a single chain, `rvar`s can also contain samples from multiple chains. For example, if your array of draws has iterations as the first dimension and chains as the second dimension, you can use `with_chains = TRUE` to create an `rvar` that includes chain information: ```{r x_matrix_with_chains} iterations <- 1000 chains <- 4 rows <- 4 cols <- 3 x_array <- array( rnorm(iterations * chains * rows * cols, mean = 1, sd = 1), dim = c(iterations, chains, rows, cols) ) x <- rvar(x_array, with_chains = TRUE) x ``` Manual construction and modification of `rvar`s in this way is not always recommended unless you need it for performance reasons: several other higher-level interfaces to constructing and manipulating `rvar`s are described below. ### `rvar_factor` and `rvar_ordered` subtypes {#rvar_factor} You can also use `rvar`s to represent discrete distributions, using the `rvar_factor()` and `rvar_ordered()` subtypes. If you attempt to create an `rvar` using character values or a `factor`, it will automatically be treated as an `rvar_factor`: ```{r rvar_factor} x <- rvar(sample(c("a","b","c"), 4000, prob = c(0.7, 0.2, 0.1), replace = TRUE)) x ``` Numeric arrays with a `"levels"` attribute can also be passed to `rvar_factor()`. This (along with conversion of character values) means output from `rstanarm::posterior_predict()` and `brms::posterior_predict()` on categorical models can be passed directly to `rvar_factor()`. The default display shows the mode (as returned by `modal_category()`) and normalized entropy (`entropy()`), which is Shannon entropy scaled by the maximum possible entropy for a distribution with the same number of levels: thus 0 means all probability is concentrated in one category, and 1 means the distribution is uniform. You can construct an ordered factor using `rvar_ordered()` (or by passing an `ordered()` vector to `rvar()`): ```{r rvar_ordered} x <- rvar_ordered(sample(c("a","b","c"), 4000, prob = c(0.7, 0.2, 0.1), replace = TRUE)) x ``` For `rvar_ordered()`, the default display is mode and dissention (`dissent()`), which is 0 when all probability is concentrated in one category, and 1 when the distribution is bimodal at opposite ends of the scale. `rvar_factor`s attempt to mimic `factor()` and `rvar_ordered`s attempt to mimic `ordered()`s, by implementing factor-specific functions like `levels()`. Comparison operations are also implemented where valid. For example, in `x` as defined above, approximately 90% of draws should be less than `"b"` (which means the `"a"` and `"b"` levels): ```{r x_lte_b} x <= "b" ``` `rvar`s also supply an implementation of `match()` and `%in%`, which can be especially useful with `rvar_factor`s: ```{r x_in_ac} x %in% c("a", "c") ``` ## The `draws_rvars` datatype The `draws_rvars()` datatype, like all `draws` datatypes in posterior, contains multiple variables in a joint sample from some distribution (e.g. a posterior or prior distribution). You can construct `draws_rvars()` objects directly using the `draws_rvars()` function. The input `rvar`s must have the same number of chains and iterations, but can otherwise have different shapes: ```{r draws_rvars} d <- draws_rvars(x = x, y = rvar(rnorm(iterations * chains), nchains = 4)) d ``` Existing objects can also be converted to the `draws_rvars()` format using `as_draws_rvars()`. Below is the `example_draws("multi_normal")` dataset converted into the `draws_rvars()` format. This dataset has 100 iterations from 4 chains from the posterior of a a 3-dimensional multivariate normal model. The `mu` variable is a mean vector of length 3 and the `Sigma` variable is a $3 \times 3$ covariance matrix: ```{r post} post <- as_draws_rvars(example_draws("multi_normal")) post ``` The `draws_rvars()` datatype works much the same way that other `draws` formats do; see the main package vignette at `vignette("posterior")` for an introduction to `draws` objects. One difference is that `draws_rvars` counts variables differently, because it allows variables to be multidimensional. For example, the `post` object above contains two variables, `mu` and `Sigma`: ```{r variables_draws_rvars} variables(post) ``` But converted to a `draws_list()`, it contains one variable for each combination of the dimensions of its variables: ```{r variables_draws_list} variables(as_draws_list(post)) ``` ## Math with `rvar`s The `rvar()` datatype implements most math operations, including basic arithmetic, functions in the *Math* and *Summary* groups, like `log()` and `exp()` (see `help("groupGeneric")` for a list), and more. Binary operators can be performed between multiple `rvar`s or between `rvar`s and `numeric`s. A simple example: ```{r mu_plus_1} mu <- post$mu Sigma <- post$Sigma mu + 1 ``` Matrix multiplication is also implemented (using a tensor product under the hood). In R < 4.3, the normal matrix multiplication operator (`%*%`) cannot be properly implemented for S3 datatypes, so `rvar` uses `%**%` instead. In R ≥ 4.3, which does support matrix multiplication for S3 datatypes, you can use `%*%` to matrix-multiply `rvar`s. A trivial example: ```{r matrix_mult} Sigma %**% diag(1:3) ``` The set of mathematical functions and operators supported by `rvar`s includes: | Group | Functions and operators | |:----------------------------|:------------------------| | Arithmetic operators | `+`, `-`, `*`, `/`, `^`, `%%`, `%/%` | | Logical operators | `&`, `|`, `!` | | Comparison operators | `==`, `!=`, `<`, `<=`, `>=`, `>` | | Value matching | `match()`, `%in%` | | Matrix multiplication | `%**%`, `%*%` (R ≥ 4.3 only) | | Basic functions | `abs()`, `sign()`
`sqrt()`
`floor()`, `ceiling()`, `trunc()`, `round()`, `signif()` | | Logarithms and exponentials | `exp()`, `expm1()`
`log()`, `log10()`, `log2()`, `log1p()` | | Trigonometric functions | `cos()`, `sin()`, `tan()`
`cospi()`, `sinpi()`, `tanpi()`
`acos()`, `asin()`, `atan()`| | Hyperbolic functions | `cosh()`, `sinh()`, `tanh()`
`acosh()`, `asinh()`, `atanh()` | | Special functions | `lgamma()`, `gamma()`, `digamma()`, `trigamma()` | | Cumulative functions | `cumsum()`, `cumprod()`, `cummax()`, `cummin()` | | Array transposition | `t()`, `aperm()` | | Matrix decomposition | `chol()` | | Matrix diagonals | `diag()` | ## Expectations and summary functions The `E()` function is an alias of `mean()`, producing means within each cell of an `rvar`. For example, given `mu`: ```{r mu} mu ``` We can get the expectation of each cell of `mu`: ```{r E_mu} E(mu) ``` Expectations of logical expressions are probabilities, and can be computed either with `E()` / `mean()` or with `Pr()`. `Pr()` is provided as notational sugar, but also checks that the input is a logical variable before taking the mean: ```{r Pr} Pr(mu > 0) ``` More generally, the `rvar` data type provides two types of summary functions: 1. Summary functions that mimic base-R vector summary functions, except applied to `rvar` vectors. These apply their summaries **over** elements of the input vectors **within** each draw, generally returning an `rvar` of length 1. These functions are prefixed with `rvar_` as a reminder that they return `rvar`s. Here is an example of `rvar_mean()`: ```{r rvar_mean_mu} rvar_mean(mu) ``` 2. Summary functions that summarise **within** elements of input vectors and **over** draws. These summary functions generally return base arrays (`numeric` or `logical`) of the same shape as the input `rvar`, and are especially useful for diagnostic summaries. These summary functions are not prefixed with `rvar_` as they do not return `rvar`s. Here is an example of `mean()`: ```{r mean_mu} mean(mu) ``` You should expect the same values from these functions (though in a different shape) when you use them with `summarise_draws()`, for example: ```{r summarise_draws_mu_mean} summarise_draws(mu, mean) ``` Here is a table of both types of summary functions: | | 1. Summarise *within* draws,
*over* elements | 2. Summarise *over* draws,
*within* elements
| |--------------------------|:------------------------------------------------|:----------------------------------------------------| | **Output format**
of `res = f(x)` | `rvar` of length 1 | `array` of same shape as input `rvar` | | **Help page** | `help("rvar-summaries-within-draws")` | `help("rvar-summaries-over-draws")` | | Numeric summaries | `rvar_median()`
`rvar_sum()`, `rvar_prod()`
`rvar_min()`, `rvar_max()`| `median()`
`sum()`, `prod()`
`min()`, `max()`| | Mean | `rvar_mean()`
*N/A* | `mean()`, `E()`
`Pr()`: enforces that input is `logical` | | Spread | `rvar_sd()`
`rvar_var()`
`rvar_mad()` | `sd()`
`var()`, `variance()`
`mad()`| | Range | `rvar_range()`
**Note:** `length(res) == 2` | `range()`
**Note:** `dim(res) == c(2, dim(x))` | | Quantiles | `rvar_quantile()`
**Note:** `length(res) == length(probs)` | `quantile()`
**Note:** `dim(res) == c(length(probs), dim(x))` | | Logical summaries | `rvar_all()`, `rvar_any()` | `all()`, `any()` | | Special value predicates | `rvar_is_finite()`
`rvar_is_infinite()`
`rvar_is_nan()`
`rvar_is_na()`
**Note:** `dim(res) == dim(x)`. These functions act within draws but do not summarise over elements. | `is.finite()`
`is.infinite()`
`is.nan()`
`is.na()`
**Note:** `res[i] == TRUE` if `x[i]` has any draws matching predicate (except for `is.finite()`, where all draws in `x[i]` must match) | | Diagnostics | *N/A* | `ess_basic()`, `ess_bulk()`, `ess_quantile()`, `ess_sd()`, `ess_tail()`,
`mcse_mean()`, `mcse_quantile()`, `mcse_sd()`
`rhat()`, `rhat_basic()`| ## Constants Constant `rvar`s can be constructed by converting numeric vectors or arrays into `rvar`s using `as_rvar()`, which will return an `rvar` with one draw and the same dimensions as its input: ```{r const} const <- as_rvar(1:3) const ``` While normally `rvar`s must have the same number of draws to be used in the same expression, `rvar`s with one draw are treated like constants, and can be combined with other `rvar`s: ```{r mu_plus_const} mu + const ``` ## Using existing R functions and expressions with `rvar`s While `rvar`s attempt to emulate as much of the functionality of base R arrays as possible, there are situations in which an existing R function may not work directly with an `rvar`. There are several approaches to solving this problem. For example, say you wish to generate samples from the following expression for $\mu$, $\sigma$, and $x$: $$ \begin{align} \left[\begin{matrix}\mu_1 \\ \vdots \\ \mu_4 \end{matrix}\right] &\sim \textrm{Normal}\left(\left[\begin{matrix}1 \\ \vdots \\ 4 \end{matrix}\right],1\right)\\ \sigma &\sim \textrm{Gamma}(1,1)\\ \left[\begin{matrix}x_1 \\ \vdots \\ x_4 \end{matrix}\right] &\sim \textrm{Normal}\left(\left[\begin{matrix}\mu_1 \\ \vdots \\ \mu_4 \end{matrix}\right], \sigma\right) \end{align} $$ There are three different approaches you might take to doing this: converting existing R functions with `rfun()`, executing expressions of random variables with `rdo()`, or evaluating random number generator functions using `rvar_rng()`. ### Converting functions with `rfun()` The `rfun()` wrapper converts an existing R function into a new function that `rvar`s can be passed to it as arguments, and which will return `rvar`s. We can use `rfun()` to convert the base `rnorm()` and `rgamma()` random number generating functions into functions that accept and return `rvar`s: ```{r rfun_defs} rvar_norm <- rfun(rnorm) rvar_gamma <- rfun(rgamma) ``` Then we can translate the above example into code using those functions: ```{r rfun_ex} mu <- rvar_norm(4, mean = 1:4, sd = 1) sigma <- rvar_gamma(1, shape = 1, rate = 1) x <- rvar_norm(4, mu, sigma) x ``` While `rfun()`-converted functions work well for prototyping, they will generally speaking be slower than functions designed specifically for `rvar`s. Thus, you may find you need to adopt other strategies (like `rvar_rng()`, described below; or re-writing functions to support `rvar` directly using math operators and/or the `draws_of()` function). ### Evaluating expressions with `rdo()` An alternative to `rfun()` is to use `rdo()`, which can be passed nearly-arbitrary R expressions. The expression will be executed multiple times to construct an `rvar`. E.g., we can write an expression for `mu` like in the above example: ```{r mu_rdo} mu <- rdo(rnorm(4, mean = 1:4, sd = 1)) mu ``` We can also control the number of draws using the `ndraws` argument: ```{r mu_rdo_ndraws} mu <- rdo(rnorm(4, mean = 1:4, sd = 1), ndraws = 1000) mu ``` `rdo()` expressions can also contain other `rvar`s, so long as all `rvar`s in the expression have the same number of draws. Thus, we can re-write the example above that used `rfun()` as follows: ```{r rdo_ex} mu <- rdo(rnorm(4, mean = 1:4, sd = 1)) sigma <- rdo(rgamma(1, shape = 1, rate = 1)) x <- rdo(rnorm(4, mu, sigma)) x ``` Like `rfun()`, `rdo()` is not necessarily fast, so you may find it more useful for prototyping than production code. ### Evaluating random number generators with `rvar_rng()` `rvar_rng()` is an alternative to `rfun()`/`rdo()` designed specifically to work with random number generating functions that follow the typical API of such functions in base R. Such functions, like `rnorm()`, `rgamma()`, `rbinom()`, etc all following this interface: - They have a first argument, `n`, giving the number of draws to take from the distribution. - Their arguments for distribution parameters (`mean`, `sd`, `shape`, `rate`, etc.) are vectorized. - They return a single vector of length `n`, representing `n` draws from the distribution. You can use any function with this interface with `rvar_rng()`, and it will adapt it to be able to take `rvar` arguments and return an `rvar`, as follows: ```{r rvar_r_ex} mu <- rvar_rng(rnorm, 4, mean = 1:4, sd = 1) sigma <- rvar_rng(rgamma, 1, shape = 1, rate = 1) x <- rvar_rng(rnorm, 4, mu, sigma) x ``` In contrast to the `rfun()` and `rdo()` examples above, `rvar_rng()` takes advantage of the existing vectorization of the underlying random number generating function to execute quickly. ## Broadcasting Broadcasting for `rvar`s does not follow R's vector recycling rules. Instead, when two variables with different dimensions are being used with basic arithmetic functions, dimensions are added until both variables have the same number of dimensions. If two variables $x$ and $y$ differ on the length of dimension $d$, they can be broadcast to the same size so long as one of the variables has dimension $d$ of size 1. Then that variable will be broadcast up to the same size as the other variable along that dimension. If two variables disagree on the size of a dimension and neither has size 1, it is an error. For example, consider this random matrix: ```{r X_matrix} X <- rdo(rnorm(12, 1:12), dim = c(4,3)) X ``` And this vector of length 3: ```{r y_vector} y <- rdo(rnorm(3, 3:1)) y ``` If we attempt to add `X` and `y`, it will produce an error as vectors are by default treated as column vectors, and `y` has length 3 while columns of `X` have length 4: ```{r X_plus_y, error = TRUE} X + y ``` By contrast, R arrays of the same shape will simply recycle `y` until it is the same length as `X` (regardless of the dimensions). Thus will produce a result, though likely not the intended result: ```{r mean_X_plus_y} mean(X) + mean(y) ``` On the other hand, if y were a row vector... ```{r row_y} row_y = t(y) row_y ``` ...it would have the same number of columns as `X` and contain only one row, so it can be broadcast along rows of `X`: ```{r X_plus_row_y} X + row_y ``` ## Slicing and conditionals The `[[` and `[` operators implement all of the base array slicing operations, including numeric, character, and logical indices, as well as slicing arrays using a matrix of indices with `[`. The main difference between `rvar` slicing and base array slicing is that `rvar`s default to `drop = FALSE`; i.e. they retain all dimensions of the original `rvar` array. For a complete list of `rvar` slicing types, see `help("rvar-slice")`. In addition to the base slicing operations, `rvar` also implements three slicing/conditioning methods that allow you to use other `rvar`s to define a slice. To demonstrate these operations, consider an `rvar` vector of two components: ```{r} component = rvar_rng(rnorm, 2, mean = c(1, 5)) component ``` Perhaps we want to create a mixture of these two components, `mixture`, with a mixing proportion of 0.75. We could create an index, `i`, that is a random variable indicating which component (1 or 2) determines the value of `mixture` on each draw: ```{r} i = rvar_rng(rbinom, 1, size = 1, p = 0.75) + 1L i ``` We can use several different approaches to create the mixture distribution ### Subsetting `rvar`s by draw: `x[]` A slice `x[i]` where `i` is a scalar logical `rvar` returns a new rvar with the same shape as `x`, but containing only those draws where `i` is `TRUE`. Thus, we can use `i == 2` to select draws from the second component and overwrite them in the first component, creating the mixture distribution: ```{r} mixture = component[[1]] mixture[i == 2] = component[[2]][i == 2] mixture ``` The resulting mixture looks like this: ```{r mixture, eval = requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("ggdist", quietly = TRUE)} library(ggplot2) ggplot() + ggdist::stat_slab(aes(xdist = mixture)) ``` See `vignette("slabinterval", package = "ggdist")` for more examples of visualizing distribution-type objects, including `rvar`s. ### Conditionals using `rvar_ifelse()` You could create the same mixture using `rvar_ifelse(test, yes, no)`, which broadcasts `test`, `yes`, and `no` to the same shape, then returns a new `rvar` containing draws from `yes` where `test == TRUE` and draws from `no` where `test == FALSE`. Thus, we can create the mixture as follows: ```{r} x = rvar_ifelse(i == 1, component[[1]], component[[2]]) x ``` ### Selecting different elements in each draw: `x[[]]` The slice `x[[i]]`, where `i` is a scalar numeric rvar, generalizes indexing when `i` is a scalar numeric. Within each draw of `x`, it selects the element of `x` corresponding to the value of `i` within that same draw. Thus, since `i` in our example is a scalar numeric `rvar` whose values are either `1` or `2` within each draw, you can use it as an index directly on `component` to create the mixture: ```{r} x = component[[i]] x ``` This approach is also nice because it generalizes easily to more than two components. ## Applying functions over `rvar`s The `rvar` data type supplies an implementation of `as.list()`, which should give compatibility with the base R family of functions for applying functions over arrays: `apply()`, `lapply()`, `vapply()`, `sapply()`, etc. You can also manually use `as.list()` to convert an `rvar` into a list along its first dimension, which may be necessary for compatibility with some functions (like `purrr:map()`). For example, given this multidimensional `rvar`... ```{r multidim_array} set.seed(3456) x <- rvar_rng(rnorm, 24, mean = 1:24) dim(x) <- c(2,3,4) x ``` ... you can apply functions along the margins using `apply()` (here, a silly example): ```{r apply} apply(x, c(1,2), length) ``` One exception is that while `apply()` will work with an `rvar` input if your function returns base data types (like numerics), it will not give you simplified `rvar` arrays if your function returns an `rvar`. Thus, we supply the `rvar_apply()` function, which takes in either base arrays or `rvar` arrays and returns `rvar` arrays, and which also uses the `rvar` broadcasting rules to combine the results of the applied function. For example, you can use `rvar_apply()` with `rvar_mean()` to compute the distributions of means along one margin of an array: ```{r rvar_apply_one_dim} rvar_apply(x, 1, rvar_mean) ``` Or along multiple dimensions: ```{r rvar_apply_multi_dim} rvar_apply(x, c(2,3), rvar_mean) ``` ## Looping over `draws` and `rvar`s The `rvar` datatype is also used in `for_each_draw()`, which allows you to loop over draws in a `draws` object or an `rvar`. `for_each_draw(x, expr)` converts `x` into a `draws_rvars()` object, then loops over each draw of `x`, executing the provided expression, `expr`. The expression can use the variables in `x` as if they were regular R arrays. One application of `for_each_draw()` is in constructing base-R plots of individual draws (for `ggplot2`-based plotting of `rvar`s, see the next section and the [ggdist](https://mjskay.github.io/ggdist/) package). For example, it can be used to construct a parallel coordinates plot: ```{r eight_schools_parcoord, fig.width = 6, fig.height = 4} eight_schools <- as_draws_rvars(example_draws()) plot(1, type = "n", xlim = c(1, length(eight_schools$theta)), ylim = range(range(eight_schools$theta)), xlab = "school", ylab = "theta" ) # use for_each_draw() to make a parallel coordinates plot of all draws # of eight_schools$theta for_each_draw(eight_schools, { lines(seq_along(theta), theta, col = rgb(1, 0, 0, 0.05)) }) # add means and 90% intervals lines(seq_along(eight_schools$theta), mean(eight_schools$theta)) with(summarise_draws(eight_schools$theta), segments(seq_along(eight_schools$theta), y0 = q5, y1 = q95) ) ``` As `for_each_draw()` will be slower than most other ways of manipulating draws, this function should generally not be used unless needed. ## Using `rvar`s in data frames and in ggplot2 `rvar`s can be used as columns in `data.frame()` or `tibble()` objects: ```{r data_frame_with_y} df <- data.frame(group = c("a","b","c","d"), mu) df ``` This makes them convenient for adding predictions to a data frame alongside the data used to generate the predictions. `rvar`s can then be visualized with ggplot2 by passing them to the `xdist` and `ydist` aesthetics of the `stat_...` family of geometries in the [ggdist](https://mjskay.github.io/ggdist/) package, such as `stat_halfeye()`, `stat_lineribbon()`, and `stat_dotsinterval()`. For example: ```{r data_frame_plot, eval = requireNamespace("ggplot2", quietly = TRUE) && requireNamespace("ggdist", quietly = TRUE)} library(ggplot2) library(ggdist) ggplot(df) + stat_halfeye(aes(y = group, xdist = mu)) ``` See `vignette("slabinterval", package = "ggdist")` or `vignette("tidy-posterior", package = "tidybayes")` for more examples.