--- title: "stan_surv: Estimating Survival (Time-to-Event) Models with rstanarm" author: "Sam Brilleman" date: "`r Sys.Date()`" output: html_vignette: toc: true number_sections: false params: EVAL: !r identical(Sys.getenv("NOT_CRAN"), "true") --- ```{r, child="children/SETTINGS-knitr.txt"} ``` ```{r, child="children/SETTINGS-gg.txt"} ``` ```{r setup_jm, include=FALSE, message=FALSE} knitr::opts_chunk$set(fig.width=10, fig.height=4) library(rstanarm) set.seed(989898) CHAINS <- 1 CORES <- 1 SEED <- 12345 ITER <- 1000 ``` # Preamble This vignette provides an introduction to the `stan_surv` modelling function in the __rstanarm__ package. The `stan_surv` function allows the user to fit survival models (sometimes known as time-to-event models) under a Bayesian framework. # Introduction Survival (or time-to-event) analysis is concerned with the analysis of an outcome variable that corresponds to the time from some defined baseline until an event of interest occurs. The methodology is used in a range of disciplines where it is known by a variety of different names. These include survival analysis (medicine), duration analysis (economics), reliability analysis (engineering), and event history analysis (sociology). Survival analyses are particularly common in health and medical research, where a classic example of survival outcome data is the time from diagnosis of a disease until the occurrence of death. In standard survival analysis, one event time is measured for each observational unit. In practice however that event time may be unobserved due to left, right, or interval censoring, in which case the event time is only known to have occurred within the relevant censoring interval. The combined aspects of time and censoring make survival analysis methodology distinct from many other regression modelling approaches. There are two common approaches to modelling survival data. The first is to model the instantaneous rate of the event (known as the hazard) as a function of time. This includes the class of models known as proportional and non-proportional hazards regression models. The second is to model the event time itself. This includes the class of models known as accelerated failure time (AFT) models. Under both of these modelling frameworks a number of extensions have been proposed. For instance the handling of recurrent events, competing events, clustered survival data, cure models, and more. More recently, methods for modelling both longitudinal (e.g. a repeatedly measured biomarker) and survival data have become increasingly popular (as described in the `stan_jm` [vignette](priors.html)). This vignette is structured as follows. In the next sections we describe the modelling, estimation, and prediction frameworks underpinning survival models in __rstanarm__. Following that we describe the implementation and arguments for the `stan_surv` modelling function. Following that we demonstrate usage of the package through a series of examples. # Modelling framework ## Data and notation We assume that a true event time for individual $i$ ($i = 1,...,N$) exists and can be denoted $T_i^*$. However, in practice $T_i^*$ may not be observed due to left, right, or interval censoring. We therefore observe outcome data $\mathcal{D}_i = \{T_i, T_i^U, T_i^E, d_i\}$ for individual $i$ where: - $T_i$ denotes the observed event or censoring time - $T_i^U$ denotes the observed upper limit for interval censored individuals - $T_i^E$ denotes the observed entry time (i.e. 
### Hazard, cumulative hazard, and survival

There are three key quantities of interest in standard survival analysis: the hazard rate, the cumulative hazard, and the survival probability. It is these quantities that are used to form the likelihood function for the survival models described in later sections.

The hazard is the instantaneous rate of occurrence for the event at time $t$. Mathematically, it is defined as:
\
\begin{equation}
\begin{split}
h_i(t) = \lim_{\Delta t \to 0}
  \frac{P(t \leq T_i^* < t + \Delta t \mid T_i^* > t)}{\Delta t}
\end{split}
\end{equation}
\
where $\Delta t$ is the width of some small time interval. The numerator is the conditional probability of the individual experiencing the event during the time interval $[t, t + \Delta t)$, given that they were still at risk of the event at time $t$. The denominator converts the conditional probability to a rate per unit of time. As $\Delta t$ approaches zero, the width of the interval shrinks and the instantaneous event rate is obtained.

The cumulative hazard is defined as:
\
\begin{equation}
\begin{split}
H_i(t) = \int_{s=0}^t h_i(s) ds
\end{split}
\end{equation}
\
and the survival probability is defined as:
\
\begin{equation} \label{eq:survdef}
\begin{split}
S_i(t) = \exp \left[ -H_i(t) \right] = \exp \left[ -\int_{s=0}^t h_i(s) ds \right]
\end{split}
\end{equation}

It can be seen here that in the standard survival analysis setting -- where there is one event type of interest (i.e. no competing events) -- there is a one-to-one relationship between each of the hazard, the cumulative hazard, and the survival probability.
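This one-to-one relationship is easy to verify numerically. The following sketch assumes a Weibull hazard with made-up parameter values and checks that exponentiating the negative cumulative hazard recovers the closed form survival probability.

```r
# hazard, cumulative hazard, and survival for a Weibull example
lambda <- 0.1
gamma  <- 1.5
h <- function(t) gamma * t^(gamma - 1) * lambda             # hazard
H <- function(t) integrate(h, lower = 0, upper = t)$value   # cumulative hazard
S <- function(t) exp(-H(t))                                 # survival probability

S(2)                     # via numeric integration
exp(-lambda * 2^gamma)   # closed form; the two agree
```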
### Delayed entry

Delayed entry, also known as left truncation, occurs when an individual is not at risk of the event until some time $t > 0$. As previously described, we use $T_i^E$ to denote the entry time at which the individual becomes at risk. A common situation where delayed entry occurs is when age is used as the time scale. With age as the time scale it is likely that our study will only be concerned with the observation of individuals starting from some time (i.e. age) $t > 0$.

To allow for delayed entry we essentially want to work with a conditional survival probability:
\
\begin{equation}
\begin{split}
S_i \left(t \mid T_i^E > 0 \right) = \frac{S_i(t)}{S_i \left( T_i^E \right)}
\end{split}
\end{equation}

Here the survival probability is evaluated conditional on the individual having survived up to the entry time. This conditional survival probability is used to allow for delayed entry in the log likelihood of our survival model.

## Model formulations

Our modelling approaches are twofold. First, we define a class of models on the hazard scale. This includes both proportional and non-proportional hazard regression models. Second, we define a class of models on the scale of the survival time. These are often known as accelerated failure time (AFT) models and can include both time-fixed and time-varying acceleration factors. These two classes of models and their respective features are described in the following sections.

### Hazard scale models

Under a hazard scale formulation, we model the hazard of the event for individual $i$ at time $t$ using the regression model:
\
\begin{equation}
\begin{split}
h_i(t) = h_0(t) \exp \left( \eta_i(t) \right)
\end{split}
\end{equation}
\
where $h_0(t)$ is the baseline hazard (i.e. the hazard for an individual with all covariates set equal to zero) at time $t$, and $\eta_i(t)$ denotes the linear predictor evaluated for individual $i$ at time $t$.

For full generality we allow the linear predictor to be time-varying. That is, it may be a function of time-varying covariates and/or time-varying coefficients (e.g. a time-varying hazard ratio). However, if there are no time-varying covariates or time-varying coefficients in the model, then the linear predictor reduces to a time-fixed quantity and the definition of the hazard function reduces to:
\
\begin{equation}
\begin{split}
h_i(t) = h_0(t) \exp \left( \eta_i \right)
\end{split}
\end{equation}
\
where the linear predictor $\eta_i$ is no longer a function of time. We describe the linear predictor in detail in later sections.

Different distributional assumptions can be made for the baseline hazard $h_0(t)$; these determine how the baseline hazard changes as a function of time. The __rstanarm__ package currently accommodates several standard parametric distributions for the baseline hazard (exponential, Weibull, Gompertz) as well as more flexible approaches that directly model the baseline hazard as a piecewise or smooth function of time using splines. The following describes the baseline hazards that are currently implemented in the __rstanarm__ package.

#### M-splines model (the default):

Let $M_{l}(t; \boldsymbol{k}, \delta)$ denote the $l^{\text{th}}$ $(l = 1,...,L)$ basis term for a degree $\delta$ M-spline function evaluated at a vector of knot locations $\boldsymbol{k} = \{k_{1},...,k_{J}\}$, and $\gamma_{l}$ denote the $l^{\text{th}}$ M-spline coefficient. We then have:
\
\begin{equation}
h_i(t) = \sum_{l=1}^{L} \gamma_{l} M_{l}(t; \boldsymbol{k}, \delta) \exp ( \eta_i(t) )
\end{equation}

The M-spline basis is evaluated using the method described in Ramsay (1988) and implemented in the __splines2__ package (Wang and Yan, 2018). To ensure that the hazard function $h_i(t)$ is not constrained to zero at the origin (i.e. when $t$ approaches 0) the M-spline basis incorporates an intercept. To ensure identifiability of both the M-spline coefficients and the intercept in the linear predictor we constrain the M-spline coefficients to a simplex, that is, $\sum_{l=1}^L{\gamma_l} = 1$.

The default degree in __rstanarm__ is $\delta = 3$ (i.e. cubic M-splines) such that the baseline hazard can be modelled as a flexible and smooth function of time; however, this can be changed by the user. It is worth noting that setting $\delta = 0$ is treated as a special case that corresponds to a piecewise constant baseline hazard.
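To illustrate, the sketch below evaluates an M-spline basis directly with `splines2::mSpline()` (the same basis __rstanarm__ evaluates internally) and forms a baseline hazard using an arbitrary simplex of coefficients; the knot locations and coefficient values are chosen purely for illustration.

```r
library(splines2)

times <- seq(0.01, 5, length.out = 100)

# cubic M-spline basis with an intercept and two internal knots
basis <- mSpline(times, knots = c(1, 3), degree = 3, intercept = TRUE,
                 Boundary.knots = c(0, 5))

# simplex of M-spline coefficients (sums to 1)
coefs <- rep(1 / ncol(basis), ncol(basis))

# implied baseline hazard at each time point
h0 <- as.vector(basis %*% coefs)
plot(times, h0, type = "l", xlab = "t", ylab = "baseline hazard")
```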
#### Exponential model:

For scale parameter $\lambda_i(t) = \exp ( \eta_i(t) )$ we have:
\
\begin{equation}
h_i(t) = \lambda_i(t)
\end{equation}

In the case where the linear predictor is not time-varying, the exponential model leads to a hazard rate that is constant over time.

#### Weibull model:

For scale parameter $\lambda_i(t) = \exp ( \eta_i(t) )$ and shape parameter $\gamma > 0$ we have:
\
\begin{equation}
h_i(t) = \gamma t^{\gamma-1} \lambda_i(t)
\end{equation}

In the case where the linear predictor is not time-varying, the Weibull model leads to a hazard rate that is monotonically increasing or monotonically decreasing over time. In the special case where $\gamma = 1$ it reduces to the exponential model.

#### Gompertz model:

For scale parameter $\lambda_i(t) = \exp ( \eta_i(t) )$ and shape parameter $\gamma > 0$ we have:
\
\begin{equation}
h_i(t) = \exp(\gamma t) \lambda_i(t)
\end{equation}

In the case where the linear predictor is not time-varying, the Gompertz model leads to a hazard rate that increases exponentially over time.

#### B-splines model (for the log baseline hazard):

Let $B_{l}(t; \boldsymbol{k}, \delta)$ denote the $l^{\text{th}}$ $(l = 1,...,L)$ basis term for a degree $\delta$ B-spline function evaluated at a vector of knot locations $\boldsymbol{k} = \{k_{1},...,k_{J}\}$, and $\gamma_{l}$ denote the $l^{\text{th}}$ B-spline coefficient. We then have:
\
\begin{equation}
h_i(t) = \exp \left( \sum_{l=1}^{L} \gamma_{l} B_{l}(t; \boldsymbol{k}, \delta) + \eta_i(t) \right)
\end{equation}
\
The B-spline basis is calculated using the method implemented in the __splines2__ package (Wang and Yan, 2018). The B-spline basis does not require an intercept and therefore does not include one; any constant shift in the log hazard is fully captured via an intercept in the linear predictor. By default cubic B-splines are used (i.e. $\delta = 3$) and these allow the log baseline hazard to be modelled as a smooth function of time.

### Accelerated failure time (AFT) models

Under an AFT formulation we model the survival probability for individual $i$ at time $t$ using the regression model (Hougaard, 1999):
\
\begin{equation} \label{eq:aftform-surv}
\begin{split}
S_i(t) = S_0 \left( \int_{u=0}^t \exp \left( - \eta_i(u) \right) du \right)
\end{split}
\end{equation}
\
where $S_0(t)$ is the baseline survival probability at time $t$, and $\eta_i(t)$ denotes the linear predictor evaluated for individual $i$ at time $t$. For full generality we again allow the linear predictor to be time-varying. This also leads to a corresponding general expression for the hazard function (Hougaard, 1999):
\
\begin{align} \label{eq:aftform-haz}
\begin{split}
h_i(t) = \exp \left(-\eta_i(t) \right) h_0 \left( \int_{u=0}^t \exp \left( - \eta_i(u) \right) du \right)
\end{split}
\end{align}

If there are no time-varying covariates or time-varying coefficients in the model, then the definition of the survival probability reduces to:
\
\begin{equation}
\begin{split}
S_i(t) = S_0 \left( t \exp \left( - \eta_i \right) \right)
\end{split}
\end{equation}
\
and for the hazard:
\
\begin{align}
\begin{split}
h_i(t) = \exp \left( -\eta_i \right) h_0 \left( t \exp \left( - \eta_i \right) \right)
\end{split}
\end{align}

Different distributional assumptions can be made for how the baseline survival probability $S_0(t)$ changes as a function of time. The __rstanarm__ package currently accommodates two standard parametric distributions (exponential, Weibull) although others may be added in the future. The current distributions are implemented as follows.

#### Exponential model:

When the linear predictor is time-varying we have:
\
\begin{equation}
S_i(t) = \exp \left( - \int_{u=0}^t \exp ( -\eta_i(u) ) du \right)
\end{equation}
\
and when the linear predictor is time-fixed we have:
\
\begin{equation}
S_i(t) = \exp \left( - t \lambda_i \right)
\end{equation}
\
for scale parameter $\lambda_i = \exp ( -\eta_i )$.

#### Weibull model:

When the linear predictor is time-varying we have:
\
\begin{equation}
S_i(t) = \exp \left( - \left( \int_{u=0}^t \exp ( -\eta_i(u) ) du \right)^{\gamma} \right)
\end{equation}
\
for shape parameter $\gamma > 0$, and when the linear predictor is time-fixed we have:
\
\begin{equation}
S_i(t) = \exp \left( - t^{\gamma} \lambda_i \right)
\end{equation}
\
for scale parameter $\lambda_i = \exp ( -\gamma \eta_i )$ and shape parameter $\gamma > 0$.
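As a quick numeric check of the reduction from the general AFT expression to its time-fixed closed form, the sketch below uses the Weibull model with made-up values for the shape parameter and the (time-fixed) linear predictor.

```r
gamma <- 1.5   # Weibull shape (made-up value)
eta   <- 0.4   # time-fixed linear predictor (made-up value)
t1    <- 2     # time at which to evaluate survival

# cumulative acceleration factor: integral of exp(-eta) over (0, t1]
cum_af <- integrate(function(u) rep(exp(-eta), length(u)), 0, t1)$value

S_general <- exp(-cum_af^gamma)                  # S_0(u) = exp(-u^gamma)
S_closed  <- exp(-t1^gamma * exp(-gamma * eta))  # lambda_i = exp(-gamma * eta)
c(S_general, S_closed)                           # identical
```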
## Linear predictor

Under all of the previous model formulations our linear predictor can be defined as:
\
\begin{equation} \label{eq:eta}
\begin{split}
\eta_i(t) = \boldsymbol{\beta}^T(t) \boldsymbol{X}_i(t)
\end{split}
\end{equation}
\
where $\boldsymbol{X}_i(t) = [1, x_{i1}(t), ..., x_{iP}(t) ]$ denotes a vector of covariates, with $x_{ip}(t)$ denoting the observed value of the $p^{th}$ $(p = 1,...,P)$ covariate for the $i^{th}$ $(i=1,...,N)$ individual at time $t$, and $\boldsymbol{\beta}(t) = [ \beta_0, \beta_1(t), ... , \beta_P(t) ]$ denotes a vector of parameters, with $\beta_0$ denoting an intercept parameter and $\beta_p(t)$ denoting the possibly time-varying coefficient for the $p^{th}$ covariate.

### Hazard ratios

Under a hazard scale formulation the quantity $\exp \left( \beta_p(t) \right)$ is referred to as a *hazard ratio*. The hazard ratio quantifies the relative increase in the hazard that is associated with a unit-increase in the relevant covariate, $x_{ip}$, assuming that all other covariates in the model are held constant. For instance, a hazard ratio of 2 means that a unit-increase in the covariate leads to a doubling in the hazard (i.e. the instantaneous rate) of the event.

### Acceleration factors and survival time ratios

Under an AFT formulation the quantity $\exp \left( - \beta_p(t) \right)$ is referred to as an *acceleration factor* and the quantity $\exp \left( \beta_p(t) \right)$ is referred to as a *survival time ratio*.

The acceleration factor quantifies the acceleration (or deceleration) of the event process that is associated with a unit-increase in the relevant covariate, $x_{ip}$. For instance, an acceleration factor of 0.5 means that a unit-increase in the covariate corresponds to approaching the event at half the speed.

The survival time ratio is interpreted as the increase (or decrease) in the expected survival time that is associated with a unit-increase in the relevant covariate, $x_{ip}$. For instance, a survival time ratio of 2 (which is equivalent to an acceleration factor of 0.5) means that a unit-increase in the covariate leads to a doubling in the expected survival time.

Note that the survival time ratio is a simple reparameterisation of the acceleration factor: it is equal to the reciprocal of the acceleration factor. The survival time ratio and the acceleration factor therefore provide alternative interpretations for the same effect of the same covariate.
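The equivalence between an acceleration factor of 0.5 and a survival time ratio of 2 can be illustrated numerically. The sketch below assumes a Weibull AFT model with made-up parameter values and shows that the median survival time doubles.

```r
gamma <- 1.5                          # Weibull shape (made-up value)
S0 <- function(t) exp(-t^gamma)       # baseline survival
beta <- log(2)                        # log survival time ratio
S1 <- function(t) S0(t * exp(-beta))  # acceleration factor exp(-beta) = 0.5

# median survival times before and after a unit-increase in the covariate
med0 <- uniroot(function(t) S0(t) - 0.5, c(0.01, 50))$root
med1 <- uniroot(function(t) S1(t) - 0.5, c(0.01, 50))$root
med1 / med0   # = 2, the survival time ratio
```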
### Time-fixed vs time-varying effects

Under either a hazard scale or AFT formulation the coefficient $\beta_p(t)$ can be treated as a time-fixed or time-varying quantity.

When $\beta_p(t)$ is treated as a time-fixed quantity we have:
\
\begin{equation}
\begin{split}
\beta_p(t) = \theta_{p0}
\end{split}
\end{equation}
\
such that $\theta_{p0}$ is a time-fixed log hazard ratio (or log survival time ratio). On the hazard scale this is equivalent to assuming proportional hazards, whilst on the AFT scale it is equivalent to assuming a time-fixed acceleration factor.

When $\beta_p(t)$ is treated as a time-varying quantity we refer to it as a time-varying effect, because the effect of the covariate is allowed to change as a function of time. On the hazard scale this leads to non-proportional hazards, whilst on the AFT scale it leads to time-varying acceleration factors.

When $\beta_p(t)$ is time-varying we must determine how we wish to model it. In __rstanarm__ the default is to use B-splines such that:
\
\begin{equation}
\begin{split}
\beta_p(t) = \theta_{p0} + \sum_{l=1}^{L} \theta_{pl} B_{l}(t; \boldsymbol{k}, \delta)
\end{split}
\end{equation}
\
where $\theta_{p0}$ is a constant, $B_{l}(t; \boldsymbol{k}, \delta)$ is the $l^{\text{th}}$ $(l = 1,...,L)$ basis term for a degree $\delta$ B-spline function evaluated at a vector of knot locations $\boldsymbol{k} = \{k_{1},...,k_{J}\}$, and $\theta_{pl}$ is the $l^{\text{th}}$ B-spline coefficient. By default cubic B-splines are used (i.e. $\delta = 3$). These allow the log hazard ratio (or log survival time ratio) to be modelled as a smooth function of time.

However an alternative is to model $\beta_p(t)$ using a piecewise constant function:
\
\begin{equation}
\begin{split}
\beta_p(t) = \theta_{p0} + \sum_{l=1}^{L} \theta_{pl} I(k_{l+1} < t \leq k_{l+2})
\end{split}
\end{equation}
\
where $I(x)$ is an indicator function taking value 1 if $x$ is true and 0 otherwise, $\theta_{p0}$ is a constant corresponding to the log hazard ratio (or log survival time ratio for AFT models) in the first time interval, $\theta_{pl}$ is the deviation in the log hazard ratio (or log survival time ratio) between the first and the $(l+1)^\text{th}$ $(l = 1,...,L)$ time interval, and $\boldsymbol{k} = \{k_{1},...,k_{J}\}$ is a sequence of knot locations (i.e. break points) that includes the lower and upper boundary knots. This allows the log hazard ratio (or log survival time ratio) to be modelled as a piecewise constant function of time.

Note that we have dropped the subscript $p$ from the knot locations $\boldsymbol{k}$ and degree $\delta$ discussed above. This is just for simplicity of notation. In fact, if a model has a time-varying effect estimated for more than one covariate, then each of these can be modelled using different knot locations and/or degree if the user desires. These knot locations and/or degree can also differ from those used for modelling the baseline or log baseline hazard described previously in the section on model formulations.

### Relationship between proportional hazards and AFT models

As shown in the section on model formulations, some baseline distributions can be parameterised as either a proportional hazards or an AFT model. In __rstanarm__ this currently includes the exponential and Weibull models. One can therefore transform the estimates from an exponential or Weibull proportional hazards model to obtain the estimates that would be obtained under an exponential or Weibull AFT parameterisation. Specifically, the following relationship applies for the exponential model:
\
\begin{equation}
\begin{split}
\beta_0 & = - \beta_0^* \\
\beta_p & = - \beta_p^*
\end{split}
\end{equation}
\
and for the Weibull model:
\
\begin{equation}
\begin{split}
\beta_0 & = -\gamma \beta_0^* \\
\beta_p & = -\gamma \beta_p^*
\end{split}
\end{equation}
\
where the unstarred parameters are from the proportional hazards model and the starred ($*$) parameters are from the AFT model. Note however that these relationships only hold in the absence of time-varying effects. This is demonstrated using a real dataset in the AFT modelling example later in this vignette; a purely numeric illustration is sketched below.
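```r
gamma <- 1.3   # Weibull shape parameter (made-up value)
b_ph  <- 0.5   # log hazard ratio from the PH parameterisation (made-up value)

b_aft <- -b_ph / gamma   # implied AFT coefficient, since b_ph = -gamma * b_aft

exp(b_ph)    # hazard ratio (PH interpretation)
exp(b_aft)   # survival time ratio (AFT interpretation of the same effect)
```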
## Multilevel survival models

The definition of the linear predictor in Equation \ref{eq:eta} can be extended to allow for shared frailty or other clustering effects. Suppose that the individuals in our sample belong to a series of clusters. The clusters may represent, for instance, hospitals, families, or GP clinics. We denote the $i^{th}$ individual ($i = 1,...,N_j$) as a member of the $j^{th}$ cluster ($j = 1,...,J$). Moreover, to indicate the fact that individual $i$ is now a member of cluster $j$ we index the observed data (i.e. event times, event indicator, and covariates) with a subscript $j$, that is $T_{ij}^*$, $\mathcal{D}_{ij} = \{T_{ij}, T_{ij}^U, T_{ij}^E, d_{ij}\}$ and $\boldsymbol{X}_{ij}(t)$, as well as estimated quantities such as the hazard rate, cumulative hazard, survival probability, and linear predictor, that is $h_{ij}(t)$, $H_{ij}(t)$, $S_{ij}(t)$, and $\eta_{ij}(t)$.

To allow for intra-cluster correlation in the event times we include cluster-specific random effects in the linear predictor as follows:
\
\begin{equation} \label{eq:multileveleta}
\begin{split}
\eta_{ij}(t) = \boldsymbol{\beta}^T \boldsymbol{X}_{ij}(t) + \boldsymbol{b}_{j}^T \boldsymbol{Z}_{ij}
\end{split}
\end{equation}
\
where $\boldsymbol{Z}_{ij}$ denotes a vector of covariates for the $i^{th}$ individual in the $j^{th}$ cluster, with an associated vector of cluster-specific parameters $\boldsymbol{b}_{j}$. We assume that the cluster-specific parameters are normally distributed such that $\boldsymbol{b}_{j} \sim N(0, \boldsymbol{\Sigma}_{b})$ for some variance-covariance matrix $\boldsymbol{\Sigma}_{b}$. We assume that $\boldsymbol{\Sigma}_{b}$ is unstructured, that is, each variance and covariance term is allowed to be different. In most cases $\boldsymbol{b}_{j}$ will correspond to just a cluster-specific random intercept (often known as a "shared frailty" term) but more complex random effects structures are possible.

For simplicity of notation, Equation \ref{eq:multileveleta} also assumes just one clustering factor in the model (indexed by $j = 1,...,J$). However, it is possible to extend the model to multiple clustering factors (in __rstanarm__ there is no limit to the number of clustering factors that can be included). For example, suppose that the $i^{th}$ individual was clustered within the $j^{th}$ hospital, which was in turn clustered within the $k^{th}$ geographical region. Then we would have hospital-specific random effects $\boldsymbol{b}_j \sim N(0, \boldsymbol{\Sigma}_{b})$ and region-specific random effects $\boldsymbol{u}_k \sim N(0, \boldsymbol{\Sigma}_{u})$, and assume $\boldsymbol{b}_j$ and $\boldsymbol{u}_k$ are independent for all $(j,k)$.
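In terms of syntax, these models are requested with the usual __lme4__-style random effects notation in the model formula (as described in the implementation section below). A minimal sketch, assuming a hypothetical data frame `dat` with variables `time`, `status`, `trt`, `site`, and `region`:

```r
# random intercept ("shared frailty") for each site
fit1 <- stan_surv(Surv(time, status) ~ trt + (1 | site), data = dat)

# two independent clustering factors: site and region
fit2 <- stan_surv(Surv(time, status) ~ trt + (1 | site) + (1 | region),
                  data = dat)
```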
# Estimation framework

## Log posterior

The log posterior for the $i^{th}$ individual in the $j^{th}$ cluster can be specified as:
\
\begin{equation}
\begin{split}
\log p(\boldsymbol{\theta}, \boldsymbol{b}_{j} \mid \mathcal{D}_{ij})
  \propto
    \log p(\mathcal{D}_{ij} \mid \boldsymbol{\theta}, \boldsymbol{b}_{j})
  + \log p(\boldsymbol{b}_{j} \mid \boldsymbol{\theta})
  + \log p(\boldsymbol{\theta})
\end{split}
\end{equation}
\
where $\log p(\mathcal{D}_{ij} \mid \boldsymbol{\theta}, \boldsymbol{b}_{j})$ is the log likelihood for the outcome data, $\log p(\boldsymbol{b}_{j} \mid \boldsymbol{\theta})$ is the log likelihood for the distribution of any cluster-specific parameters (i.e. random effects) when relevant, and $\log p(\boldsymbol{\theta})$ represents the log likelihood for the joint prior distribution across all remaining unknown parameters.

## Log likelihood

Allowing for the three forms of censoring (left, right, and interval censoring) and potential delayed entry (i.e. left truncation), the log likelihood for the survival model takes the form:
\
\begin{equation} \label{eq:loglik}
\begin{split}
\log p(\mathcal{D}_{ij} \mid \boldsymbol{\theta}, \boldsymbol{b}_{j})
  & = {I(d_{ij} = 0)} \times \log \left[ S_{ij}(T_{ij}) \right] \\
  & \quad + {I(d_{ij} = 1)} \times \log \left[ h_{ij}(T_{ij}) \right] \\
  & \quad + {I(d_{ij} = 1)} \times \log \left[ S_{ij}(T_{ij}) \right] \\
  & \quad + {I(d_{ij} = 2)} \times \log \left[ 1 - S_{ij}(T_{ij}) \right] \\
  & \quad + {I(d_{ij} = 3)} \times \log \left[ S_{ij}(T_{ij}) - S_{ij}(T_{ij}^U) \right] \\
  & \quad - \log \left[ S_{ij} ( T_{ij}^E ) \right]
\end{split}
\end{equation}
\
where $I(x)$ is an indicator function taking value 1 if $x$ is true and 0 otherwise. That is, each individual's contribution to the likelihood depends on the type of censoring for their event time; for an uncensored individual ($d_{ij} = 1$) the contribution is the log density, i.e. the sum of the log hazard and the log survival probability.

The last term on the right hand side of Equation \ref{eq:loglik} accounts for delayed entry. When an individual is at risk from time zero (i.e. no delayed entry) then $T_{ij}^E = 0$ and $S_{ij}(0) = 1$, meaning that the last term disappears from the likelihood.

### Evaluating integrals in the log likelihood

When the linear predictor is time-fixed there is a closed form expression for both the hazard rate and survival probability in almost all cases (the single exception is when B-splines are used to model the log baseline hazard). When there is a closed form expression for both the hazard rate and survival probability then there is also a closed form expression for the (log) likelihood function. The details of these expressions are given in Appendix \ref{app:haz-parameterisations} (for hazard models) and Appendix \ref{app:aft-parameterisations} (for AFT models).

However, when the linear predictor is time-varying there is no closed form expression for the survival probability. Instead, Gauss-Kronrod quadrature with $Q$ nodes is used to approximate the necessary integrals.

For hazard scale models, Gauss-Kronrod quadrature is used to evaluate the cumulative hazard, which in turn is used to evaluate the survival probability. Expanding on Equation \ref{eq:survdef} we have:
\
\begin{equation}
\begin{split}
\int_{u=0}^{T_{ij}} h_{ij}(u) du
  \approx \frac{T_{ij}}{2} \sum_{q=1}^{Q} w_q h_{ij} \left( \frac{T_{ij}(1 + v_q)}{2} \right)
\end{split}
\end{equation}
\
where $w_q$ and $v_q$, respectively, are the standardised weights and locations ("abscissa") for quadrature node $q$ $(q = 1,...,Q)$ (Laurie, 1997).

For AFT models, Gauss-Kronrod quadrature is used to evaluate the cumulative acceleration factor, which in turn is used to evaluate both the survival probability and the hazard rate. Expanding on Equations \ref{eq:aftform-surv} and \ref{eq:aftform-haz} we have:
\
\begin{equation}
\begin{split}
\int_{u=0}^{T_{ij}} \exp \left( - \eta_{ij}(u) \right) du
  \approx \frac{T_{ij}}{2} \sum_{q=1}^{Q} w_q \exp \left( - \eta_{ij} \left( \frac{T_{ij}(1 + v_q)}{2} \right) \right)
\end{split}
\end{equation}

When quadrature is necessary, the default in __rstanarm__ is to use $Q = 15$ nodes, but the number of nodes can be changed by the user.
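The change-of-interval formula above is easy to verify numerically. The sketch below uses 15-point Gauss-Legendre nodes from the __statmod__ package purely for illustration (__rstanarm__ itself uses Gauss-Kronrod nodes, but the approximation takes the same form) and compares the result with adaptive quadrature.

```r
library(statmod)

gq <- gauss.quad(15, kind = "legendre")  # nodes v_q and weights w_q on (-1, 1)

h  <- function(t) 0.15 * sqrt(t)         # an arbitrary example hazard
T1 <- 4                                  # event time

# quadrature approximation of the cumulative hazard on (0, T1]
H_quad  <- (T1 / 2) * sum(gq$weights * h(T1 * (1 + gq$nodes) / 2))
H_exact <- integrate(h, lower = 0, upper = T1)$value

c(quadrature = H_quad, adaptive = H_exact)   # both approximately 0.8
```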
## Prior distributions

For each of the parameters a number of prior distributions are available. Default choices exist, but the user can explicitly specify the priors if they wish.

### Intercept

All models include an intercept parameter in the linear predictor ($\beta_0$) which effectively forms part of the baseline hazard. Choices of prior distribution for $\beta_0$ include the normal, t, or Cauchy distributions. The default is a normal distribution with mean 0 and standard deviation 20.

However it is worth noting that -- internally (but not in the reported parameter estimates) -- the prior is placed on the intercept after centering the predictors at their sample means and after applying a constant shift of $\log \left( \frac{E}{T} \right)$, where $E$ is the total number of events and $T$ is the total follow up time. That is, the default prior is not centred on an intercept of zero when all predictors are at their sample means; rather, it is centred on the log crude event rate when all predictors are at their sample means. This is intended to help with numerical stability and sampling, but does not impact the reported estimates (i.e. the intercept is back-transformed before being returned to the user).

### Regression coefficients

Choices of prior distribution for the time-fixed regression coefficients $\theta_{p0}$ ($p = 1,...,P$) include normal, t, and Cauchy distributions, as well as several shrinkage prior distributions.

Where relevant, the additional coefficients required for estimating a time-varying effect (i.e. the B-spline coefficients or the interval-specific deviations in the piecewise constant function) are given a random walk prior of the form $\theta_{p,1} \sim N(0,1)$ and $\theta_{p,m} \sim N(\theta_{p,m-1},\tau_p)$ for $m = 2,...,M$, where $M$ is the total number of cubic B-spline basis terms. The prior distribution for the hyperparameter $\tau_p$ can be specified by the user; choices include an exponential, half-normal, half-t, or half-Cauchy distribution. Note that lower values of $\tau_p$ lead to a less flexible (i.e. smoother) function for modelling the time-varying effect.

### Auxiliary parameters

There are several choices of prior distribution for the so-called "auxiliary" parameters related to the baseline hazard (i.e. the scalar $\gamma$ for the Weibull and Gompertz models, or the vector $\boldsymbol{\gamma}$ for the M-spline and B-spline models). These include:

- a Dirichlet prior distribution for the baseline hazard M-spline coefficients $\boldsymbol{\gamma}$;
- a half-normal, half-t, half-Cauchy, or exponential prior distribution for the Weibull shape parameter $\gamma$;
- a half-normal, half-t, half-Cauchy, or exponential prior distribution for the Gompertz shape parameter $\gamma$; and
- a normal, t, or Cauchy prior distribution for the log baseline hazard B-spline coefficients $\boldsymbol{\gamma}$.
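As a sketch of how these priors are specified in practice (the distributions chosen here are illustrative, not the defaults, and the variables `time`, `status`, `trt`, and data frame `dat` are hypothetical):

```r
fit <- stan_surv(Surv(time, status) ~ trt, data = dat,
                 basehaz         = "weibull",
                 prior           = normal(0, 2.5),  # regression coefficients
                 prior_intercept = normal(0, 10),   # intercept
                 prior_aux       = exponential(1))  # Weibull shape parameter
```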
### Covariance matrices

When a multilevel survival model is estimated, an unstructured covariance matrix is estimated for the random effects. Of course, in the situation where there is just one random effect in the model `formula` (e.g. a random intercept or "shared frailty" term) the covariance matrix reduces to a single element; i.e. it is a scalar equal to the variance of the single random effect in the model.

The prior distribution is based on a decomposition of the covariance matrix, which proceeds as follows. The covariance matrix $\boldsymbol{\Sigma}_b$ is decomposed into a correlation matrix $\boldsymbol{\Omega}$ and a vector of variances. The vector of variances is then further decomposed into a simplex $\boldsymbol{\pi}$ (i.e. a probability vector summing to 1) and a scalar equal to the sum of the variances. Lastly, the sum of the variances is set equal to the order of the covariance matrix multiplied by the square of a scale parameter (here we denote that scale parameter $\tau$).

The prior distribution for the correlation matrix $\boldsymbol{\Omega}$ is the LKJ distribution (Lewandowski et al., 2009). It is parameterised through a regularisation parameter $\zeta > 0$. The default is $\zeta = 1$, such that the LKJ prior distribution is jointly uniform over all possible correlation matrices. When $\zeta > 1$ the mode of the LKJ distribution is the identity matrix, and as $\zeta$ increases the distribution becomes more sharply peaked at the mode. When $0 < \zeta < 1$ the prior has a trough at the identity matrix.

The prior distribution for the simplex $\boldsymbol{\pi}$ is a symmetric Dirichlet distribution with a single concentration parameter $\phi > 0$. The default is $\phi = 1$, such that the prior is jointly uniform over all possible simplexes. If $\phi > 1$ then the prior mode corresponds to all entries of the simplex being equal (i.e. equal variances for the random effects), and the larger the value of $\phi$ the more pronounced the mode of the prior. If $0 < \phi < 1$ then the prior favours more polarised variances.

The prior distribution for the scale parameter $\tau$ is a Gamma distribution, with both the shape and scale parameters set equal to 1 by default; however, the user can change the value of the shape parameter. The behaviour is such that increasing the shape parameter helps enforce that the trace of $\boldsymbol{\Sigma}_b$ (i.e. the sum of the variances of the random effects) is non-zero. Further details on this implied prior for covariance matrices can be found in the __rstanarm__ documentation and vignettes.

## Estimation

Estimation in __rstanarm__ is based on either full Bayesian inference (Hamiltonian Monte Carlo) or approximate Bayesian inference (either mean-field or full-rank variational inference). The default is full Bayesian inference, but the user can change this if they wish. The approximate Bayesian inference algorithms are much faster, but they only provide approximations for the joint posterior distribution and are therefore not recommended for final inference.

Hamiltonian Monte Carlo is a form of Markov chain Monte Carlo (MCMC) in which information about the gradient of the log posterior is used to sample more efficiently from the posterior space. Stan uses a specific implementation of Hamiltonian Monte Carlo known as the No-U-Turn Sampler (NUTS) (Hoffman and Gelman, 2014). A benefit of NUTS is that the tuning parameters are handled automatically during a "warm-up" phase of the estimation. However the __rstanarm__ modelling functions provide arguments that allow the user to retain control over aspects such as the number of MCMC chains, the number of warm-up and sampling iterations, and the number of computing cores used, as sketched below.
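```r
# a sketch of the main sampler control arguments (hypothetical data frame `dat`)
fit <- stan_surv(Surv(time, status) ~ trt, data = dat,
                 chains      = 4,      # number of MCMC chains
                 iter        = 2000,   # warm-up plus sampling iterations per chain
                 cores       = 4,      # number of computing cores
                 seed        = 12345,  # for reproducibility
                 adapt_delta = 0.99)   # target acceptance probability
```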
# Prediction framework

## Survival predictions without clustering

If our survival model does not contain any clustering effects (i.e. it is not a multilevel survival model) then the prediction framework is relatively straightforward. Let $\mathcal{D} = \{ \mathcal{D}_{i}; i = 1,...,N \}$ denote the entire collection of outcome data in our sample, and let $T_{\max} = \max \{ T_{i}, T_{i}^U, T_{i}^E; i = 1,...,N \}$ denote the maximum event or censoring time across all individuals in our sample. Suppose that for some individual $i^*$ (who may or may not have been in our sample) we have covariate vector $\boldsymbol{x}_{i^*}$. Note that the covariate data must be time-fixed. The predicted probability of being event-free at time $0 < t \leq T_{\max}$, denoted $\hat{S}_{i^*}(t)$, can be generated from the posterior predictive distribution:
\
\begin{equation}
\begin{split}
p \Big( \hat{S}_{i^*}(t) \mid \boldsymbol{x}_{i^*}, \mathcal{D} \Big)
  = \int p \Big( \hat{S}_{i^*}(t) \mid \boldsymbol{x}_{i^*}, \boldsymbol{\theta} \Big)
    p \Big( \boldsymbol{\theta} \mid \mathcal{D} \Big) d \boldsymbol{\theta}
\end{split}
\end{equation}

We approximate this posterior predictive distribution by drawing from $p(\hat{S}_{i^*}(t) \mid \boldsymbol{x}_{i^*}, \boldsymbol{\theta}^{(l)})$ where $\boldsymbol{\theta}^{(l)}$ is the $l^{th}$ $(l = 1,...,L)$ MCMC draw from the posterior distribution $p(\boldsymbol{\theta} \mid \mathcal{D})$.

## Survival predictions with clustering

When there are clustering effects in the model (i.e. multilevel survival models) the prediction framework requires conditioning on the cluster-specific parameters. Let $\mathcal{D} = \{ \mathcal{D}_{ij}; i = 1,...,N_j, j = 1,...,J \}$ denote the entire collection of outcome data in our sample, and let $T_{\max} = \max \{ T_{ij}, T_{ij}^U, T_{ij}^E; i = 1,...,N_j, j = 1,...,J \}$ denote the maximum event or censoring time across all individuals in our sample. Suppose that for some individual $i^*$ (who may or may not have been in our sample) who is known to come from cluster $j^*$ (which may or may not have been in our sample) we have covariate vectors $\boldsymbol{x}_{i^*j^*}$ and $\boldsymbol{z}_{i^*j^*}$. Note again that the covariate data are assumed to be time-fixed.

If individual $i^*$ does in fact come from a cluster $j^* = j$ (for some $j \in \{1,...,J\}$) in our sample, then the predicted probability of being event-free at time $0 < t \leq T_{\max}$, denoted $\hat{S}_{i^*j}(t)$, can be generated from the posterior predictive distribution:
\
\begin{equation}
\begin{split}
p \Big( \hat{S}_{i^*j}(t) \mid \boldsymbol{x}_{i^*j}, \boldsymbol{z}_{i^*j}, \mathcal{D} \Big)
  = \int \int p \Big( \hat{S}_{i^*j}(t) \mid \boldsymbol{x}_{i^*j}, \boldsymbol{z}_{i^*j}, \boldsymbol{\theta}, \boldsymbol{b}_j \Big)
    p \Big( \boldsymbol{\theta}, \boldsymbol{b}_j \mid \mathcal{D} \Big)
    d \boldsymbol{b}_j \, d \boldsymbol{\theta}
\end{split}
\end{equation}

Since cluster $j$ was included in our sample data, it is easy to approximate this posterior predictive distribution by drawing from $p(\hat{S}_{i^*j}(t) \mid \boldsymbol{x}_{i^*j}, \boldsymbol{z}_{i^*j}, \boldsymbol{\theta}^{(l)}, \boldsymbol{b}_j^{(l)})$ where $\boldsymbol{\theta}^{(l)}$ and $\boldsymbol{b}_j^{(l)}$ are the $l^{th}$ $(l = 1,...,L)$ MCMC draws from the joint posterior distribution $p(\boldsymbol{\theta}, \boldsymbol{b}_j \mid \mathcal{D})$.
Alternatively, individual $i^*$ may come from a new cluster $j^* \neq j$ (for all $j \in \{1,...,J\}$) that was not in our sample. The predicted probability of being event-free at time $0 < t \leq T_{\max}$ is therefore denoted $\hat{S}_{i^*j^*}(t)$ and can be generated from the posterior predictive distribution:
\
\begin{equation}
\begin{aligned}
p \Big( \hat{S}_{i^*j^*}(t) \mid \boldsymbol{x}_{i^*j^*}, \boldsymbol{z}_{i^*j^*}, \mathcal{D} \Big)
  & = \int \int p \Big( \hat{S}_{i^*j^*}(t) \mid \boldsymbol{x}_{i^*j^*}, \boldsymbol{z}_{i^*j^*}, \boldsymbol{\theta}, \boldsymbol{\tilde{b}}_{j^*} \Big)
      p \Big( \boldsymbol{\theta}, \boldsymbol{\tilde{b}}_{j^*} \mid \mathcal{D} \Big)
      d \boldsymbol{\tilde{b}}_{j^*} \, d \boldsymbol{\theta} \\
  & = \int \int p \Big( \hat{S}_{i^*j^*}(t) \mid \boldsymbol{x}_{i^*j^*}, \boldsymbol{z}_{i^*j^*}, \boldsymbol{\theta}, \boldsymbol{\tilde{b}}_{j^*} \Big)
      p \Big( \boldsymbol{\tilde{b}}_{j^*} \mid \boldsymbol{\theta} \Big)
      p \Big( \boldsymbol{\theta} \mid \mathcal{D} \Big)
      d \boldsymbol{\tilde{b}}_{j^*} \, d \boldsymbol{\theta}
\end{aligned}
\end{equation}
\
where $\boldsymbol{\tilde{b}}_{j^*}$ denotes the cluster-specific parameters for the new cluster. We can obtain draws for $\boldsymbol{\tilde{b}}_{j^*}$ during estimation of the model (in a similar manner as for $\boldsymbol{b}_j$). At the $l^{th}$ iteration of the MCMC sampler we obtain $\boldsymbol{\tilde{b}}_{j^*}^{(l)}$ as a random draw from the posterior distribution of the cluster-specific parameters and store it for later use in predictions. The set of random draws $\boldsymbol{\tilde{b}}_{j^*}^{(l)}$ for $l = 1,...,L$ then allows us to essentially marginalise over the distribution of the cluster-specific parameters. This is the method used in __rstanarm__ when generating survival predictions for individuals in new clusters that were not part of the original sample.

## Conditional survival probabilities

In some instances we want to evaluate the predicted survival probability conditional on a last known survival time. This is known as a conditional survival probability. Suppose that individual $i^*$ is known to be event-free up until time $C_{i^*}$ and we wish to predict the survival probability at some time $t > C_{i^*}$. To do this we draw from the conditional posterior predictive distribution:
\
\begin{equation}
\begin{split}
p \Big( \hat{S}_{i^*}(t) \mid \boldsymbol{x}_{i^*}, \mathcal{D}, t > C_{i^*} \Big)
  = \frac{p \Big( \hat{S}_{i^*}(t) \mid \boldsymbol{x}_{i^*}, \mathcal{D} \Big)}
         {p \Big( \hat{S}_{i^*}(C_{i^*}) \mid \boldsymbol{x}_{i^*}, \mathcal{D} \Big)}
\end{split}
\end{equation}
\
or -- equivalently -- for multilevel survival models, where individual $i^*$ in cluster $j^*$ is known to be event-free up until time $C_{i^*j^*}$:
\
\begin{equation}
\begin{split}
p \Big( \hat{S}_{i^*j^*}(t) \mid \boldsymbol{x}_{i^*j^*}, \boldsymbol{z}_{i^*j^*}, \mathcal{D}, t > C_{i^*j^*} \Big)
  = \frac{p \Big( \hat{S}_{i^*j^*}(t) \mid \boldsymbol{x}_{i^*j^*}, \boldsymbol{z}_{i^*j^*}, \mathcal{D} \Big)}
         {p \Big( \hat{S}_{i^*j^*}(C_{i^*j^*}) \mid \boldsymbol{x}_{i^*j^*}, \boldsymbol{z}_{i^*j^*}, \mathcal{D} \Big)}
\end{split}
\end{equation}
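In practice, these conditional predictions are obtained via `posterior_survfit()` and its `condition` argument (described in the implementation section below). A minimal sketch, assuming a fitted model `fit` and prediction data `nd`:

```r
# survival probabilities at times beyond a last known event-free time of 2,
# conditional on survival up to time 2
ps_cond <- posterior_survfit(fit, newdata = nd, times = 2,
                             extrapolate = TRUE, condition = TRUE)
```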
## Standardised survival probabilities

All of the previously discussed predictions require conditioning on some covariate values $\boldsymbol{x}_{ij}$ and $\boldsymbol{z}_{ij}$. Even if we have a multilevel survival model and choose to marginalise over the distribution of the cluster-specific parameters, we are still obtaining predictions at some known unique values of the covariates. However sometimes we wish to generate an "average" survival probability.

One possible approach is to predict at the mean value of all covariates (Cupples et al., 1995). However this does not always make sense, especially in the presence of categorical covariates. For instance, suppose our covariates are gender and a treatment indicator. Then predicting at the mean of all covariates would correspond to an individual who is 50% male and 50% treated -- that does not make sense and is not what we wish to do.

A better alternative is to average over the individual survival probabilities. This essentially provides an approximation to marginalising over the joint distribution of the covariates. At any time $t$ it is possible to obtain a so-called standardised survival probability, denoted $\hat{S}^{*}(t)$, by averaging the individual-specific survival probabilities:
\
\begin{equation}
\begin{split}
p ( \hat{S}^{*}(t) \mid \mathcal{D} )
  = \frac{1}{N^{P}} \sum_{i=1}^{N^{P}} p ( \hat{S}_i(t) \mid \boldsymbol{x}_{i}, \mathcal{D} )
\end{split}
\end{equation}
\
where $\hat{S}_i(t)$ is the predicted survival probability for individual $i$ ($i = 1,...,N^{P}$) at time $t$, and $N^{P}$ is the number of individuals included in the predictions. For multilevel survival models the calculation is similar and follows quite naturally (details not shown).

Note however that if $N^{P}$ is not sufficiently large (for example, if we predict individual survival probabilities using covariate data for just $N^{P} = 2$ individuals) then averaging over their covariate distribution may not be meaningful. Similarly, if we estimated a multilevel survival model and then predicted standardised survival probabilities based on just $N^{P} = 2$ individuals from our sample, the joint distribution of their cluster-specific parameters would likely be a poor representation of the distribution of cluster-specific parameters for the entire sample and population. It is therefore better to calculate standardised survival probabilities by setting $N^{P}$ equal to the total number of individuals in the original sample (i.e. $N^{P} = N$).

This approach is also used for assessing the fit of the survival model in __rstanarm__ (see the `ps_check()` function described in the implementation section below). Posterior predictive draws of the standardised survival probability are evaluated at a series of time points between 0 and $T_{\max}$ using all individuals in the estimation sample, and the predicted standardised survival curve is overlaid with the observed Kaplan-Meier survival curve.

# Implementation

## Overview

The __rstanarm__ package is built on top of the __rstan__ R package (Stan Development Team, 2019), which is the R interface for Stan. Models in __rstanarm__ are written in the Stan programming language, translated into C++ code, and then compiled at the time the package is built. This means that for most users -- who install a binary version of __rstanarm__ from the Comprehensive R Archive Network (CRAN) -- the models in __rstanarm__ will be pre-compiled. This is beneficial for users because there is no compilation time, either during installation or when they estimate a model.
## Main modelling function

Survival models in __rstanarm__ are implemented around the `stan_surv()` modelling function, which has the following signature:

```r
stan_surv(formula, data, basehaz = "ms", basehaz_ops, qnodes = 15,
          prior = normal(), prior_intercept = normal(), prior_aux,
          prior_smooth = exponential(autoscale = FALSE),
          prior_covariance = decov(), prior_PD = FALSE,
          algorithm = c("sampling", "meanfield", "fullrank"),
          adapt_delta = 0.95, ...)
```

The following provides a brief description of the main features of each of these arguments:

- The `formula` argument accepts objects built around the standard R formula syntax (see `stats::formula()`). The left hand side of the formula should be an object returned by the `Surv()` function in the __survival__ package (Therneau, 2019). Any random effects structure (for multilevel survival models) can be specified on the right hand side of the formula using the same syntax as the __lme4__ package (Bates et al., 2015), as shown in the multilevel example later in this vignette. By default, any covariate effects specified in the fixed-effect part of the model `formula` are included under a proportional hazards assumption (for models estimated using a hazard scale formulation) or under the assumption of time-fixed acceleration factors (for models estimated using an AFT formulation). Time-varying effects are specified in the model `formula` by wrapping the covariate name in the `tve()` function. For example, if we wanted to estimate a time-varying effect for the covariate `sex` then we could specify `tve(sex)` in the model formula, e.g. `formula = Surv(time, status) ~ tve(sex) + age`. The `tve()` function is a special function that only has meaning when used in the `formula` of a model estimated using `stan_surv()`. Its functionality is demonstrated in the worked examples on time-varying effects later in this vignette.

- The `data` argument accepts an object inheriting the class `data.frame`, in other words the usual R data frame.

- The choice of parametric baseline hazard (or baseline survival distribution for AFT models) is specified via the `basehaz` argument. For the M-spline (`"ms"`) and B-spline (`"bs"`) models, additional options related to the spline degree $\delta$, knot locations $\boldsymbol{k}$, or degrees of freedom $L$ can be specified as a list and passed to the `basehaz_ops` argument. For example, specifying `basehaz = "ms"` and `basehaz_ops = list(degree = 2, knots = c(10,20))` would request a baseline hazard modelled using quadratic M-splines with two internal knots located at $t = 10$ and $t = 20$.

- The argument `qnodes` is a control argument that allows the user to specify the number of quadrature nodes when quadrature is required (as described in the section on evaluating integrals in the log likelihood).
- The `prior` family of arguments allows the user to specify the prior distributions for each of the parameters, as follows:
    - `prior` relates to the time-fixed regression coefficients;
    - `prior_intercept` relates to the intercept in the linear predictor;
    - `prior_aux` relates to the so-called "auxiliary" parameters in the baseline hazard ($\gamma$ for the Weibull and Gompertz models, or $\boldsymbol{\gamma}$ for the M-spline and B-spline models);
    - `prior_smooth` relates to the hyperparameter $\tau_p$ when the $p^{th}$ covariate has a time-varying effect; and
    - `prior_covariance` relates to the covariance matrix for the random effects when a multilevel survival model is being estimated.

- The remaining arguments (`prior_PD`, `algorithm`, and `adapt_delta`) are optional control arguments related to estimation in Stan:
    - Setting `prior_PD = TRUE` states that the user only wants to draw from the prior predictive distribution and not condition on the data.
    - The `algorithm` argument specifies the estimation routine to use. This includes either Hamiltonian Monte Carlo (`"sampling"`) or one of the variational Bayes algorithms (`"meanfield"` or `"fullrank"`). The model specification is agnostic to the chosen `algorithm`; that is, the user can choose from any of the available algorithms regardless of the specified model.
    - The `adapt_delta` argument controls the target average acceptance probability. It is only relevant when `algorithm = "sampling"`, in which case `adapt_delta` should be between 0 and 1, with higher values leading to smaller step sizes and therefore a more robust sampler, but longer estimation times.

The model returned by `stan_surv()` is an object of class `stansurv`, inheriting the `stanreg` class. It is effectively a list with a number of important attributes. There is a range of post-estimation functions that can be called on `stansurv` (and `stanreg`) objects -- some of the most important ones are described in the next section.

### Default knot locations

Default knot locations for the M-spline, B-spline, or piecewise constant functions are the same regardless of whether they are used for modelling the baseline hazard or time-varying effects. By default, the vector of knot locations $\boldsymbol{k} = \{k_{1},...,k_{J}\}$ includes a lower boundary knot $k_{1}$ at the earliest entry time (equal to zero if there is no delayed entry) and an upper boundary knot $k_{J}$ at the latest event or censoring time. The location of the boundary knots cannot be changed by the user.

Internal knot locations -- that is, $k_{2},...,k_{(J-1)}$ when $J \geq 3$ -- can be explicitly specified by the user or determined by default. The number of internal knots and/or their locations can be controlled via the `basehaz_ops` argument to `stan_surv()` (for modelling the baseline hazard) or via the arguments to the `tve()` function (for modelling a time-varying effect). If knot locations are not explicitly specified by the user, then the default is to place the internal knots at equally spaced percentiles of the distribution of uncensored event times. For instance, if there are three internal knots they would be placed at the $25^{\text{th}}$, $50^{\text{th}}$, and $75^{\text{th}}$ percentiles of the distribution of the uncensored event times.
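Putting these pieces together, the quadratic M-spline specification mentioned earlier corresponds to a call along the following lines (a sketch with hypothetical variables `time`, `status`, `trt`, and data frame `dat`):

```r
fit <- stan_surv(Surv(time, status) ~ trt, data = dat,
                 basehaz     = "ms",
                 basehaz_ops = list(degree = 2,          # quadratic M-splines
                                    knots  = c(10, 20))) # internal knot locations
```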
## Post-estimation functions

The __rstanarm__ package provides a range of post-estimation functions that can be used after fitting the survival model. These include functions for inference (e.g. reporting parameter estimates), diagnostics (e.g. assessing model fit), and generating predictions. We highlight the most important ones here:

- The `print()` and `summary()` functions provide reports of parameter estimates and some summary information on the data (e.g. number of observations, number of events, etc). They each provide varying levels of detail. For example, the `summary()` method provides diagnostic measures such as Gelman and Rubin's Rhat statistic (Gelman and Rubin, 1992) for assessing convergence of the MCMC chains, as well as the number of effective MCMC samples. The `print()` method, on the other hand, is more concise and does not provide this level of additional detail.

- The `fixef()` and `ranef()` functions report the fixed effect and random effect parameter estimates, respectively.

- The `posterior_survfit()` function is the primary function for generating survival predictions. The type of prediction is specified via the `type` argument and can currently be any of the following:
    - `"surv"`: the estimated survival probability;
    - `"cumhaz"`: the estimated cumulative hazard;
    - `"haz"`: the estimated hazard rate;
    - `"cdf"`: the estimated failure probability;
    - `"logsurv"`: the estimated log survival probability;
    - `"logcumhaz"`: the estimated log cumulative hazard;
    - `"loghaz"`: the estimated log hazard rate; or
    - `"logcdf"`: the estimated log failure probability.

  There are additional arguments to `posterior_survfit()` that control the time at which the predictions are generated (`times`), whether they are generated across a time range (referred to as extrapolation, see `extrapolate`), whether they are conditional on a last known survival time (`condition`), and whether they are averaged across individuals (referred to as standardised predictions, see `standardise`). The returned predictions are a data frame with a special class called `survfit.stansurv`. The `survfit.stansurv` class has both `print()` and `plot()` methods that can be called on it. These will be demonstrated as part of the examples below.

- The `loo()` and `waic()` functions report model fit statistics. The former is based on approximate leave-one-out cross-validation (Vehtari et al., 2017) and is recommended. The latter is a less preferable alternative that reports the Widely Applicable Information Criterion (WAIC) (Watanabe, 2010). Both of these functions are built on top of the __loo__ R package (Vehtari et al., 2019). The values (objects) returned by either `loo()` or `waic()` can also be passed to the `loo_compare()` function to compare different models estimated on the same dataset. This will be demonstrated as part of the examples below.

- The `log_lik()` function generates a pointwise log likelihood matrix. That is, it calculates the log likelihood for each observation (either in the original dataset or some new dataset) using each MCMC draw of the model parameters.

- The `plot()` function allows for a variety of plots depending on the input to the `plotfun` argument.
The default is to plot the estimated baseline hazard (`plotfun = "basehaz"`), but alternatives include a plot of the estimated time-varying hazard ratio for models with time-varying effects (`plotfun = "tve"`), plots summarising the parameter estimates (e.g. posterior densities or posterior intervals), and plots providing diagnostics (e.g. MCMC trace plots).

- The `ps_check()` function provides a quick diagnostic check for the fitted survival function. It is based on the estimation sample and compares the predicted standardised survival curve to the observed Kaplan-Meier survival curve.

# Usage examples

## Example: A flexible parametric proportional hazards model

We will use the German Breast Cancer Study Group dataset (see `?rstanarm-datasets` for details and references). In brief, the data consist of $N = 686$ patients with primary node positive breast cancer recruited between 1984 and 1989. The primary response is time to recurrence or death. Median follow-up time was 1084 days. Overall, there were 299 (44%) events and the remaining 387 (56%) individuals were right censored. Our analysis here concerns a 3-category baseline covariate for cancer prognosis (good/medium/poor).

First, let us load the data and fit the proportional hazards model

```{r, warning = FALSE, message = FALSE, results='hide'}
mod1 <- stan_surv(formula = Surv(recyrs, status) ~ group,
                  data    = bcancer,
                  chains  = CHAINS,
                  cores   = CORES,
                  seed    = SEED,
                  iter    = ITER)
```

The model here is estimated using the default cubic M-splines (with 5 degrees of freedom) for modelling the baseline hazard. Since there are no time-varying effects in the model (i.e. we did not wrap any covariates in the `tve()` function) there is a closed form expression for the cumulative hazard and survival function, and so the model is relatively fast to fit. Specifically, the model takes ~3.5 sec for each MCMC chain based on the default 2000 (1000 warm up, 1000 sampling) MCMC iterations.

We can easily obtain the estimated hazard ratios for the 3-category `group` covariate using the generic `print` method for `stansurv` objects, as follows

```{r}
print(mod1, digits = 3)
```

From this output we see that individuals in the groups with `Poor` or `Medium` prognosis have much higher rates of death relative to the group with `Good` prognosis (as we might expect!). The hazard of death in the `Poor` prognosis group is approximately 5.0-fold higher than the hazard of death in the `Good` prognosis group. Similarly, the hazard of death in the `Medium` prognosis group is approximately 2.3-fold higher than the hazard of death in the `Good` prognosis group.
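If we wanted the hazard ratios themselves (rather than the log hazard ratios) one option is to exponentiate the estimates directly; a sketch, assuming the usual `stanreg` methods apply to `stansurv` objects:

```r
exp(fixef(mod1))                            # hazard ratios (posterior medians)
exp(posterior_interval(mod1, prob = 0.95))  # 95% credible intervals, HR scale
```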
It may also be of interest to compare the different types of baseline hazard we could potentially use. Here, we will fit a series of models, each with a different baseline hazard specification

```{r, warning = FALSE, message = FALSE, results='hide'}
mod1_exp      <- update(mod1, basehaz = "exp")
mod1_weibull  <- update(mod1, basehaz = "weibull")
mod1_gompertz <- update(mod1, basehaz = "gompertz")
mod1_bspline  <- update(mod1, basehaz = "bs")
mod1_mspline1 <- update(mod1, basehaz = "ms")
mod1_mspline2 <- update(mod1, basehaz = "ms", basehaz_ops = list(df = 10))
```

and then plot the baseline hazards with 95% posterior uncertainty limits using the generic `plot` method for `stansurv` objects (note that the default `plot` for `stansurv` objects is the estimated baseline hazard). We will write a little helper function to adjust the y-axis limits, add a title, and centre the title on each plot, as follows

```{r, fig.height=5}
library(ggplot2)

plotfun <- function(model, title) {
  plot(model, plotfun = "basehaz") +              # plot baseline hazard
    coord_cartesian(ylim = c(0, 0.4)) +           # adjust y-axis limits
    labs(title = title) +                         # add plot title
    theme(plot.title = element_text(hjust = 0.5)) # centre plot title
}

p_exp      <- plotfun(mod1_exp,      title = "Exponential")
p_weibull  <- plotfun(mod1_weibull,  title = "Weibull")
p_gompertz <- plotfun(mod1_gompertz, title = "Gompertz")
p_bspline  <- plotfun(mod1_bspline,  title = "B-splines with df = 5")
p_mspline1 <- plotfun(mod1_mspline1, title = "M-splines with df = 5")
p_mspline2 <- plotfun(mod1_mspline2, title = "M-splines with df = 10")

bayesplot::bayesplot_grid(p_exp,
                          p_weibull,
                          p_gompertz,
                          p_bspline,
                          p_mspline1,
                          p_mspline2,
                          grid_args = list(ncol = 3))
```

We can also compare the fit of these models using the `loo` method for `stansurv` objects

```{r, message=FALSE}
loo_compare(loo(mod1_exp),
            loo(mod1_weibull),
            loo(mod1_gompertz),
            loo(mod1_bspline),
            loo(mod1_mspline1),
            loo(mod1_mspline2))
```

where we see that the models with a flexible parametric (spline-based) baseline hazard fit the data best, followed by the standard parametric (Weibull, Gompertz, exponential) models. Roughly speaking, the B-spline and M-spline models seem to fit the data equally well, since the differences in `elpd` or `looic` between these models are very small relative to their standard errors. Moreover, increasing the degrees of freedom for the M-splines from 5 to 10 doesn't seem to improve the fit (that is, the default `df = 5` seems to provide sufficient flexibility to model the baseline hazard).

After fitting the survival model, we often want to estimate the predicted survival function for individuals with different covariate patterns. Here, let us estimate the predicted survival function between 0 and 5 years for an individual in each of the prognostic groups. To do this, we can use the `posterior_survfit` method for `stansurv` objects, and its associated `plot` method. First let us construct the prediction (covariate) data

```{r preddata}
nd <- data.frame(group = c("Good", "Medium", "Poor"))
head(nd)
```

and then we will generate the posterior predictions

```{r predresults}
ps <- posterior_survfit(mod1, newdata = nd, times = 0,
                        extrapolate = TRUE, control = list(edist = 5))
head(ps)
```

Here we note that the `id` variable in the data frame of posterior predictions identifies which row of `newdata` the predictions correspond to. For demonstration purposes we have also shown a couple of other arguments in the `posterior_survfit` call, namely

- the `times = 0` argument says that we want to predict at time = 0 (i.e. baseline) for each individual in the `newdata` (this is the default anyway);
- the `extrapolate = TRUE` argument says that we want to extrapolate forward from time 0 (this is also the default); and
- the `control = list(edist = 5)` argument controls the extrapolation; it says to extrapolate the survival function forward from time 0 for a distance of 5 time units (the default would have been to extrapolate as far as the largest event or censoring time in the estimation dataset, which is 7.28 years in the `bcancer` data).
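As an aside, the `standardise` argument mentioned earlier can be used to average these predictions across individuals. A hedged sketch (not evaluated here) of a standardised survival curve, based on the individuals in the estimation data, might look like:

```{r, eval=FALSE}
# not run: a sketch of standardised predictions, i.e. the predicted
# survival probabilities averaged across the individuals in the
# estimation data
ps_std <- posterior_survfit(mod1, standardise = TRUE,
                            times = 0, extrapolate = TRUE)
head(ps_std)
```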
Let us now plot the survival predictions. We will relabel the `id` variable with meaningful labels identifying the covariate profile of each new individual in our prediction data

```{r predplot}
panel_labels <- c('1' = "Good", '2' = "Medium", '3' = "Poor")
plot(ps) +
  ggplot2::facet_wrap(~ id, labeller = ggplot2::labeller(id = panel_labels))
```

We can see from the plot that predicted survival is worst for patients with a `Poor` diagnosis, and best for patients with a `Good` diagnosis, as we would expect based on our previous model estimates.

Alternatively, if we wanted to obtain and plot the predicted *hazard* function for each individual in our new data (instead of their *survival* function), then we just need to specify `type = "haz"` in our `posterior_survfit` call (the default is `type = "surv"`), as follows

```{r predhaz}
ph <- posterior_survfit(mod1, newdata = nd, type = "haz")
plot(ph) +
  ggplot2::facet_wrap(~ id, labeller = ggplot2::labeller(id = panel_labels))
```

The proportional hazards assumption is clearly visible in the plot: the hazard curves for the three groups are scaled multiples of one another. We can also see that the hazard is highest in the `Poor` prognosis group (i.e. worst survival) and lowest in the `Good` prognosis group (i.e. best survival). This corresponds to what we saw in the plot of the survival functions previously.
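Before moving on, we can also run the quick diagnostic check described earlier, which compares the standardised predicted survival curve for the fitted model against the observed Kaplan-Meier curve (not evaluated here):

```{r, eval=FALSE}
# not run: compare the predicted standardised survival curve against
# the observed Kaplan-Meier curve for the estimation sample
ps_check(mod1)
```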
## Example: Non-proportional hazards modelled using B-splines

To demonstrate the implementation of time-varying effects in `stan_surv` we will use a simulated dataset, generated using the **simsurv** package (Brilleman 2018).

We will simulate a dataset with $N = 200$ individuals with event times generated under the following Weibull hazard function

\
\begin{align}
h_i(t) = \gamma t^{\gamma-1} \lambda \exp( \beta(t) x_i )
\end{align}
\

with scale parameter $\lambda = 0.1$, shape parameter $\gamma = 1.5$, binary baseline covariate $X_i \sim \text{Bern}(0.5)$, and time-varying hazard ratio $\beta(t) = -0.5 + 0.2 t$. We will enforce administrative censoring at 5 years if an individual's simulated event time is >5 years.

```{r simsurv-simdata}
# load package
library(simsurv)

# set seed for reproducibility
set.seed(999111)

# simulate covariate data
covs <- data.frame(id  = 1:200,
                   trt = rbinom(200, 1L, 0.5))

# simulate event times
dat <- simsurv(lambdas = 0.1,
               gammas  = 1.5,
               betas   = c(trt = -0.5),
               tde     = c(trt = 0.2),
               x       = covs,
               maxt    = 5)

# merge covariate data and event times
dat <- merge(dat, covs)

# examine first few rows of data
head(dat)
```

Now that we have our simulated dataset, let us fit a model with time-varying hazard ratio for `trt`

```{r tve_fit1, warning = FALSE, message = FALSE, results='hide'}
mod2 <- stan_surv(formula = Surv(eventtime, status) ~ tve(trt),
                  data    = dat,
                  chains  = CHAINS,
                  cores   = CORES,
                  seed    = SEED,
                  iter    = ITER)
```

The `tve` function is used in the model formula to state that we want a time-varying effect (i.e. a time-varying coefficient) to be estimated for the variable `trt`. By default, a cubic B-spline basis with 3 degrees of freedom (i.e. two boundary knots placed at the limits of the range of event times, but no internal knots) is used for modelling the time-varying log hazard ratio. If we wanted to change the degree, knot locations, or degrees of freedom for the B-spline function we can specify additional arguments to the `tve` function. For example, to model the time-varying log hazard ratio using quadratic B-splines with 4 degrees of freedom (i.e. two boundary knots placed at the limits of the range of event times, as well as two internal knots placed -- by default -- at the 33.3rd and 66.6th percentiles of the distribution of uncensored event times) we could specify the model formula as

```{r, warning = FALSE, message = FALSE, results='hide', eval=FALSE}
Surv(eventtime, status) ~ tve(trt, df = 4, degree = 2)
```

Let us now plot the estimated time-varying hazard ratio from the fitted model. We can do this using the generic `plot` method for `stansurv` objects, for which we can specify the `plotfun = "tve"` argument. (Note that in this case there is only one covariate in the model with a time-varying effect, but if there were others, we could specify which covariate(s) to plot the time-varying effect for via the `pars` argument to the `plot` call.)

```{r, fig.height=5}
plot(mod2, plotfun = "tve")
```

From the plot, we can see how the hazard ratio (i.e. the effect of treatment on the hazard of the event) changes as a function of time. The treatment appears to be protective during the first few years following baseline (i.e. HR < 1), and then the treatment appears to become harmful after about 4 years post-baseline. Thankfully, this is a reflection of the model we simulated under!

The plot shows a large amount of uncertainty around the estimated time-varying hazard ratio. This is to be expected, since we only simulated a dataset of 200 individuals, of which only around 70% experienced the event before being censored at 5 years. So, there is very little data (i.e. very few events) with which to reliably estimate the time-varying hazard ratio. We can also see this reflected in the differences between our data generating model and the estimates from our fitted model. In our data generating model, the time-varying hazard ratio equals 1 (i.e. the log hazard ratio equals 0) at 2.5 years, but in our fitted model the median estimate for the time-varying hazard ratio equals 1 at around 3 years. This is a reflection of the large amount of sampling error, due to our simulated dataset being so small.
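To make the comparison with the data generating model concrete, the sketch below plots the true hazard ratio $\exp(-0.5 + 0.2t)$ used in the simulation; `true_loghr` is a purely illustrative helper defined here, not part of any package.

```{r, eval=FALSE}
# not run: the true (simulated) hazard ratio as a function of time,
# for comparison with the estimate plotted above; true_loghr is a
# purely illustrative helper
true_loghr <- function(t) -0.5 + 0.2 * t
curve(exp(true_loghr(x)), from = 0, to = 5,
      xlab = "Time", ylab = "True hazard ratio")
abline(h = 1, lty = 2)  # the hazard ratio crosses 1 at t = 2.5 years
```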
## Example: Non-proportional hazards modelled using a piecewise constant function

In the previous example we showed how non-proportional hazards can be modelled by using a smooth B-spline function for the time-varying log hazard ratio. This is the default approach when the `tve` function is used to estimate a time-varying effect for a covariate in the model formula. However, another approach for modelling a time-varying log hazard ratio is to use a piecewise constant function. If we want to use a piecewise constant for the time-varying log hazard ratio (instead of the smooth B-spline function) then we just have to specify the `type` argument to the `tve` function.

We will again simulate some survival data using the **simsurv** package to show how a piecewise constant hazard ratio can be estimated using `stan_surv`. Similar to the previous example, we will simulate a dataset with $N = 500$ individuals with event times generated under a Weibull hazard function with scale parameter $\lambda = 0.1$, shape parameter $\gamma = 1.5$, and binary baseline covariate $X_i \sim \text{Bern}(0.5)$. However, in this example our time-varying hazard ratio will be defined as $\beta(t) = -0.5 + 0.7 \times I(t > 2.5)$ where $I(X)$ is the indicator function taking the value 1 if $X$ is true and 0 otherwise. This corresponds to a piecewise constant log hazard ratio with just two "pieces" or time intervals. The first time interval is $[0,2.5]$, during which the true hazard ratio is $\exp(-0.5) = 0.61$. The second time interval is $(2.5,\infty)$, during which the true hazard ratio is $\exp(-0.5 + 0.7) = 1.22$. Our example uses only two time intervals for simplicity, but in general we could easily have considered more (although it would have required a couple of additional lines of code to simulate the data). We will again enforce administrative censoring at 5 years if an individual's simulated event time is >5 years.

```{r simsurv-simdata2}
# load package
library(simsurv)

# set seed for reproducibility
set.seed(888222)

# simulate covariate data
covs <- data.frame(id  = 1:500,
                   trt = rbinom(500, 1L, 0.5))

# simulate event times
dat <- simsurv(lambdas     = 0.1,
               gammas      = 1.5,
               betas       = c(trt = -0.5),
               tde         = c(trt = 0.7),
               tdefunction = function(t) (t > 2.5),
               x           = covs,
               maxt        = 5)

# merge covariate data and event times
dat <- merge(dat, covs)

# examine first few rows of data
head(dat)
```

We now estimate a model with a piecewise constant time-varying effect for the covariate `trt` as

```{r tve-fit2, warning = FALSE, message = FALSE, results='hide'}
mod3 <- stan_surv(formula = Surv(eventtime, status) ~
                    tve(trt, degree = 0, knots = 2.5),
                  data    = dat,
                  chains  = CHAINS,
                  cores   = CORES,
                  seed    = SEED,
                  iter    = ITER)
```

This time we specify some additional arguments to the `tve` function, so that our time-varying effect corresponds to the true data generating model used to simulate our event times. Specifically, `degree = 0` says that we want the time-varying effect (i.e. the time-varying log hazard ratio) to be estimated using a piecewise constant function, and `knots = 2.5` says that we want just one internal knot, placed at time $t = 2.5$.

We can again use the generic `plot` function with argument `plotfun = "tve"` to examine our estimated hazard ratio for treatment

```{r, fig.height=5}
plot(mod3, plotfun = "tve")
```

Here we see that the estimated hazard ratio reasonably reflects our true data generating model (i.e. a hazard ratio of $\approx 0.6$ during the first time interval and a hazard ratio of $\approx 1.2$ during the second time interval), although there is a slight discrepancy due to the sampling variation in the simulated event times.
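As in the previous example, the sketch below computes the true step-function hazard ratio from the data generating model for comparison; `true_hr` is a purely illustrative helper.

```{r, eval=FALSE}
# not run: the true piecewise constant hazard ratio from the data
# generating model; true_hr is a purely illustrative helper
true_hr <- function(t) exp(-0.5 + 0.7 * (t > 2.5))
true_hr(c(1, 4))  # ~0.61 in the first interval, ~1.22 in the second
```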
## Example: Hierarchical survival models

To demonstrate the estimation of a hierarchical model for survival data in `stan_surv` we will use the `frail` dataset (see `help("rstanarm-datasets")` for a description). The `frail` dataset contains simulated event times for 200 patients clustered within 20 hospital sites (10 patients per hospital site). The event times are simulated from a parametric proportional hazards model under the following assumptions: (i) a constant (i.e. exponential) baseline hazard rate of 0.1; (ii) a fixed treatment effect with log hazard ratio of 0.3; and (iii) a site-specific random intercept (specified on the log hazard scale) drawn from a N(0,1) distribution.

Let's look at the first few rows of the data:

```{r frail-data-head}
head(frail)
```

To fit a hierarchical model for clustered survival data we use a formula syntax similar to what is used in the **lme4** R package (Bates et al. 2015). Let's consider the following model (which aligns with the model used to generate the simulated data):

```{r frail-fit-model, warning = FALSE, message = FALSE}
mod_randint <- stan_surv(
  formula = Surv(eventtime, status) ~ trt + (1 | site),
  data    = frail,
  basehaz = "exp",
  chains  = CHAINS,
  cores   = CORES,
  seed    = SEED,
  iter    = ITER)
```

The model contains a baseline covariate for treatment (0 or 1) as well as a site-specific intercept to allow for correlation in the event times for patients from the same site. We've called the model object `mod_randint` to denote the fact that it includes a site-specific (random) intercept. Let's examine the parameter estimates from the model:

```{r frail-estimates}
print(mod_randint, digits = 2)
```

We see that the estimated log hazard ratio for treatment ($\hat{\beta}_{\text{(trt)}} = 0.46$) is a bit larger than the "true" log hazard ratio used in the data generating model ($\beta_{\text{(trt)}} = 0.3$). The estimated baseline hazard rate is $\exp(-2.3716) = 0.093$, which is pretty close to the baseline hazard rate used in the data generating model ($0.1$). Of course, the differences between the estimated parameters and the true parameters from the data generating model are attributable to sampling noise.

If this were a real analysis, we might wonder whether the site-specific intercepts are necessary! Well, we can assess that by fitting an alternative model that does **not** include the site-specific intercepts and comparing it to the model we just estimated. We will compare the models using the `loo` function. First we need to fit the model without the site-specific intercept. To do this, we will just use the generic `update` method for `stansurv` objects, since all we are changing is the model formula:

```{r frail-fixed-model, warning = FALSE, message = FALSE}
mod_fixed <- update(mod_randint,
                    formula. = Surv(eventtime, status) ~ trt)
```

Let's calculate the `loo` for both these models and compare them:

```{r frail-compare-1, warning = FALSE, message = FALSE}
loo_fixed   <- loo(mod_fixed)
loo_randint <- loo(mod_randint)
loo_compare(loo_fixed, loo_randint)
```

We see strong evidence in favour of the model with the site-specific intercepts!

But let's not quite finish there. What if we wanted to generalise the random effects structure further? For instance, is the site-specific intercept enough? Perhaps we should consider estimating both a site-specific intercept and a site-specific treatment effect. We have minimal data with which to estimate such a model (recall that there are only 20 sites and 10 patients per site) but for the sake of demonstration we will forge on nonetheless. Let's fit a model with both a site-specific intercept and a site-specific coefficient for the covariate `trt` (i.e. treatment):

```{r frail-random-trt, warning = FALSE, message = FALSE}
mod_randtrt <- update(mod_randint,
                      formula. = Surv(eventtime, status) ~ trt + (trt | site))
print(mod_randtrt, digits = 2)
```

We see that we now have an estimated standard deviation for the site-specific intercepts and for the site-specific coefficients for `trt`, as well as the estimated correlation between those site-specific parameters.
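If we wanted to inspect the site-specific parameter estimates themselves, a sketch using the `ranef` method described earlier (not evaluated here) is:

```{r, eval=FALSE}
# not run: posterior median estimates of the site-specific (random)
# intercepts and treatment effects
ranef(mod_randtrt)
```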
Let's now compare all three of these models based on `loo`:

```{r frail-compare-2, warning = FALSE, message = FALSE}
loo_randtrt <- loo(mod_randtrt)
loo_compare(loo_fixed, loo_randint, loo_randtrt)
```

It appears that the model with just a site-specific intercept is the best fitting model. It is much better than the model without a site-specific intercept, and slightly better than the model with both a site-specific intercept and a site-specific treatment effect. In other words, including a site-specific intercept appears important, but including a site-specific treatment effect is not. This conclusion is reassuring, because it aligns with the data generating model we used to simulate the data!

# References

Bates D, Maechler M, Bolker B, Walker S. Fitting Linear Mixed-Effects Models Using lme4. *Journal of Statistical Software* 2015;67(1):1--48. <https://doi.org/10.18637/jss.v067.i01>

Brilleman S. (2018) *simsurv: Simulate Survival Data.* R package version 0.2.2. <https://CRAN.R-project.org/package=simsurv>

Gelman A, Rubin DB. Inference from Iterative Simulation Using Multiple Sequences. *Statistical Science* 1992;7(4):457--472.

Hougaard P. Fundamentals of Survival Data. *Biometrics* 1999;55:13--22.

Ramsay JO. Monotone Regression Splines in Action. *Statistical Science* 1988;3(4):425--461. <https://doi.org/10.1214/ss/1177012761>

Vehtari A, Gelman A, Gabry J. Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC. *Statistics and Computing* 2017;27(5):1413--1432.

Vehtari A, Gabry J, Magnusson M, Yao Y, Gelman A. (2019) *loo: Efficient Leave-One-Out Cross-Validation and WAIC for Bayesian Models.* R package. <https://mc-stan.org/loo/>

Wang W, Yan J. (2018) *splines2: Regression Spline Functions and Classes.* R package version 0.2.8. <https://CRAN.R-project.org/package=splines2>

Watanabe S. Asymptotic Equivalence of Bayes Cross Validation and Widely Applicable Information Criterion in Singular Learning Theory. *Journal of Machine Learning Research* 2010;11:3571--3594.

# Appendix A: Parameterisations on the hazard scale

When `basehaz` is set equal to `"exp"`, `"weibull"`, `"gompertz"`, `"ms"` (the default), or `"bs"` then the model is defined on the hazard scale using the following parameterisations.

### Exponential model

The exponential model is parameterised with scale parameter $\lambda_i = \exp(\eta_i)$ where $\eta_i = \beta_0 + \sum_{p=1}^P \beta_p x_{ip}$ denotes our linear predictor.

For individual $i$ we have:

\begin{align}
\begin{split}
h_i(T_i) & = \lambda_i \\
  & = \exp(\eta_i) \\
H_i(T_i) & = T_i \lambda_i \\
  & = T_i \exp(\eta_i) \\
S_i(T_i) & = \exp \left( - T_i \lambda_i \right) \\
  & = \exp \left( - T_i \exp(\eta_i) \right) \\
F_i(T_i) & = 1 - \exp \left( - T_i \lambda_i \right) \\
  & = 1 - \exp \left( - T_i \exp(\eta_i) \right) \\
S_i(T_i) - S_i(T_i^U) & = \exp \left( - T_i \lambda_i \right) - \exp \left( - T_i^U \lambda_i \right) \\
  & = \exp \left( - T_i \exp(\eta_i) \right) - \exp \left( - T_i^U \exp(\eta_i) \right)
\end{split}
\end{align}

or on the log scale:

\begin{align}
\begin{split}
\log h_i(T_i) & = \log \lambda_i \\
  & = \eta_i \\
\log H_i(T_i) & = \log(T_i) + \log \lambda_i \\
  & = \log(T_i) + \eta_i \\
\log S_i(T_i) & = - T_i \lambda_i \\
  & = - T_i \exp(\eta_i) \\
\log F_i(T_i) & = \log \left( 1 - \exp \left( - T_i \lambda_i \right) \right) \\
  & = \log \left( 1 - \exp \left( - T_i \exp(\eta_i) \right) \right) \\
\log (S_i(T_i) - S_i(T_i^U)) & = \log \left[ \exp \left( - T_i \lambda_i \right) - \exp \left( - T_i^U \lambda_i \right) \right] \\
  & = \log \left[ \exp \left( - T_i \exp(\eta_i) \right) - \exp \left( - T_i^U \exp(\eta_i) \right) \right]
\end{split}
\end{align}

The definition of $\lambda$ for the baseline is:

\begin{align}
\begin{split}
\lambda_0 = \exp(\beta_0) \Longleftrightarrow \beta_0 = \log(\lambda_0)
\end{split}
\end{align}
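As a small numeric illustration of these closed forms (not run; the values chosen for the linear predictor and the event time are arbitrary):

```{r, eval=FALSE}
# not run: numeric illustration of the exponential parameterisation,
# with arbitrary illustrative values for the linear predictor and time
eta <- 0.2
tt  <- 3
c(haz    = exp(eta),             # h(t), constant over time
  cumhaz = tt * exp(eta),        # H(t)
  surv   = exp(-tt * exp(eta)))  # S(t) = exp(-H(t))
```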
### Weibull model

The Weibull model is parameterised with scale parameter $\lambda_i = \exp(\eta_i)$ and shape parameter $\gamma > 0$ where $\eta_i = \beta_0 + \sum_{p=1}^P \beta_p x_{ip}$ denotes our linear predictor.

For individual $i$ we have:

\begin{align}
\begin{split}
h_i(T_i) & = \gamma T_i^{\gamma-1} \lambda_i \\
  & = \gamma T_i^{\gamma-1} \exp(\eta_i) \\
H_i(T_i) & = T_i^{\gamma} \lambda_i \\
  & = T_i^{\gamma} \exp(\eta_i) \\
S_i(T_i) & = \exp \left( - T_i^{\gamma} \lambda_i \right) \\
  & = \exp \left( - T_i^{\gamma} \exp(\eta_i) \right) \\
F_i(T_i) & = 1 - \exp \left( - T_i^{\gamma} \lambda_i \right) \\
  & = 1 - \exp \left( - T_i^{\gamma} \exp(\eta_i) \right) \\
S_i(T_i) - S_i(T_i^U) & = \exp \left( - T_i^{\gamma} \lambda_i \right) - \exp \left( - {(T_i^U)}^{\gamma} \lambda_i \right) \\
  & = \exp \left( - T_i^{\gamma} \exp(\eta_i) \right) - \exp \left( - {(T_i^U)}^{\gamma} \exp(\eta_i) \right)
\end{split}
\end{align}

or on the log scale:

\begin{align}
\begin{split}
\log h_i(T_i) & = \log(\gamma) + (\gamma-1) \log(T_i) + \log \lambda_i \\
  & = \log(\gamma) + (\gamma-1) \log(T_i) + \eta_i \\
\log H_i(T_i) & = \gamma \log(T_i) + \log \lambda_i \\
  & = \gamma \log(T_i) + \eta_i \\
\log S_i(T_i) & = - T_i^{\gamma} \lambda_i \\
  & = - T_i^{\gamma} \exp(\eta_i) \\
\log F_i(T_i) & = \log \left( 1 - \exp \left( - T_i^{\gamma} \lambda_i \right) \right) \\
  & = \log \left( 1 - \exp \left( - T_i^{\gamma} \exp(\eta_i) \right) \right) \\
\log (S_i(T_i) - S_i(T_i^U)) & = \log \left[ \exp \left( - T_i^{\gamma} \lambda_i \right) - \exp \left( - {(T_i^U)}^{\gamma} \lambda_i \right) \right] \\
  & = \log \left[ \exp \left( - T_i^{\gamma} \exp(\eta_i) \right) - \exp \left( - {(T_i^U)}^{\gamma} \exp(\eta_i) \right) \right]
\end{split}
\end{align}

The definition of $\lambda$ for the baseline is:

\begin{align}
\begin{split}
\lambda_0 = \exp(\beta_0) \Longleftrightarrow \beta_0 = \log(\lambda_0)
\end{split}
\end{align}
### Gompertz model

The Gompertz model is parameterised with shape parameter $\lambda_i = \exp(\eta_i)$ and scale parameter $\gamma > 0$ where $\eta_i = \beta_0 + \sum_{p=1}^P \beta_p x_{ip}$ denotes our linear predictor.

For individual $i$ we have:

\begin{align}
\begin{split}
h_i(T_i) & = \exp(\gamma T_i) \lambda_i \\
  & = \exp(\gamma T_i) \exp(\eta_i) \\
H_i(T_i) & = \frac{\exp(\gamma T_i) - 1}{\gamma} \lambda_i \\
  & = \frac{\exp(\gamma T_i) - 1}{\gamma} \exp(\eta_i) \\
S_i(T_i) & = \exp \left( \frac{-(\exp(\gamma T_i) - 1)}{\gamma} \lambda_i \right) \\
  & = \exp \left( \frac{-(\exp(\gamma T_i) - 1)}{\gamma} \exp(\eta_i) \right) \\
F_i(T_i) & = 1 - \exp \left( \frac{-(\exp(\gamma T_i) - 1)}{\gamma} \lambda_i \right) \\
  & = 1 - \exp \left( \frac{-(\exp(\gamma T_i) - 1)}{\gamma} \exp(\eta_i) \right) \\
S_i(T_i) - S_i(T_i^U) & = \exp \left( \frac{-(\exp(\gamma T_i) - 1)}{\gamma} \lambda_i \right) - \exp \left( \frac{-(\exp(\gamma T_i^U) - 1)}{\gamma} \lambda_i \right) \\
  & = \exp \left( \frac{-(\exp(\gamma T_i) - 1)}{\gamma} \exp(\eta_i) \right) - \exp \left( \frac{-(\exp(\gamma T_i^U) - 1)}{\gamma} \exp(\eta_i) \right)
\end{split}
\end{align}

or on the log scale:

\begin{align}
\begin{split}
\log h_i(T_i) & = \gamma T_i + \log \lambda_i \\
  & = \gamma T_i + \eta_i \\
\log H_i(T_i) & = \log(\exp(\gamma T_i) - 1) - \log(\gamma) + \log \lambda_i \\
  & = \log(\exp(\gamma T_i) - 1) - \log(\gamma) + \eta_i \\
\log S_i(T_i) & = \frac{-(\exp(\gamma T_i) - 1)}{\gamma} \lambda_i \\
  & = \frac{-(\exp(\gamma T_i) - 1)}{\gamma} \exp(\eta_i) \\
\log F_i(T_i) & = \log \left( 1 - \exp \left( \frac{-(\exp(\gamma T_i) - 1)}{\gamma} \lambda_i \right) \right) \\
  & = \log \left( 1 - \exp \left( \frac{-(\exp(\gamma T_i) - 1)}{\gamma} \exp(\eta_i) \right) \right) \\
\log (S_i(T_i) - S_i(T_i^U)) & = \log \left[ \exp \left( \frac{-(\exp(\gamma T_i) - 1)}{\gamma} \lambda_i \right) - \exp \left( \frac{-(\exp(\gamma T_i^U) - 1)}{\gamma} \lambda_i \right) \right] \\
  & = \log \left[ \exp \left( \frac{-(\exp(\gamma T_i) - 1)}{\gamma} \exp(\eta_i) \right) - \exp \left( \frac{-(\exp(\gamma T_i^U) - 1)}{\gamma} \exp(\eta_i) \right) \right]
\end{split}
\end{align}

The definition of $\lambda$ for the baseline is:

\begin{align}
\begin{split}
\lambda_0 = \exp(\beta_0) \Longleftrightarrow \beta_0 = \log(\lambda_0)
\end{split}
\end{align}

### M-spline model

The M-spline model is parameterised with a vector of regression coefficients $\boldsymbol{\theta} > 0$ for the baseline hazard and with covariate effects introduced through a linear predictor $\eta_i = \sum_{p=1}^P \beta_p x_{ip}$. Note that there is no intercept in the linear predictor since it is absorbed into the baseline hazard spline function.
For individual $i$ we have:

\begin{align}
\begin{split}
h_i(T_i) & = M(T_i; \boldsymbol{\theta}, \boldsymbol{k_0}) \exp(\eta_i) \\
H_i(T_i) & = I(T_i; \boldsymbol{\theta}, \boldsymbol{k_0}) \exp(\eta_i) \\
S_i(T_i) & = \exp \left( - I(T_i; \boldsymbol{\theta}, \boldsymbol{k_0}) \exp(\eta_i) \right) \\
F_i(T_i) & = 1 - \exp \left( - I(T_i; \boldsymbol{\theta}, \boldsymbol{k_0}) \exp(\eta_i) \right) \\
S_i(T_i) - S_i(T_i^U) & = \exp \left( - I(T_i; \boldsymbol{\theta}, \boldsymbol{k_0}) \exp(\eta_i) \right) - \exp \left( - I(T_i^U; \boldsymbol{\theta}, \boldsymbol{k_0}) \exp(\eta_i) \right)
\end{split}
\end{align}

or on the log scale:

\begin{align}
\begin{split}
\log h_i(T_i) & = \log(M(T_i; \boldsymbol{\theta}, \boldsymbol{k_0})) + \eta_i \\
\log H_i(T_i) & = \log(I(T_i; \boldsymbol{\theta}, \boldsymbol{k_0})) + \eta_i \\
\log S_i(T_i) & = - I(T_i; \boldsymbol{\theta}, \boldsymbol{k_0}) \exp(\eta_i) \\
\log F_i(T_i) & = \log \left[ 1 - \exp \left( - I(T_i; \boldsymbol{\theta}, \boldsymbol{k_0}) \exp(\eta_i) \right) \right] \\
\log (S_i(T_i) - S_i(T_i^U)) & = \log \left[ \exp \left( - I(T_i; \boldsymbol{\theta}, \boldsymbol{k_0}) \exp(\eta_i) \right) - \exp \left( - I(T_i^U; \boldsymbol{\theta}, \boldsymbol{k_0}) \exp(\eta_i) \right) \right]
\end{split}
\end{align}

where $M(t; \boldsymbol{\theta}, \boldsymbol{k_0})$ denotes a cubic M-spline function evaluated at time $t$ with regression coefficients $\boldsymbol{\theta}$ and basis evaluated using the vector of knot locations $\boldsymbol{k_0}$. Similarly, $I(t; \boldsymbol{\theta}, \boldsymbol{k_0})$ denotes a cubic I-spline function (i.e. the integral of an M-spline) evaluated at time $t$ with regression coefficients $\boldsymbol{\theta}$ and basis evaluated using the vector of knot locations $\boldsymbol{k_0}$.

### B-spline model

The B-spline model is parameterised with a vector of regression coefficients $\boldsymbol{\theta}$ for the log baseline hazard and a linear predictor $\eta_i = \sum_{p=1}^P \beta_p x_{ip}$. Note that there is no intercept in the linear predictor since it is absorbed into the spline function.

For individual $i$ we have:

\begin{align}
\begin{split}
h_i(T_i) & = \exp \left( B(T_i; \boldsymbol{\theta}, \boldsymbol{k_0}) + \eta_i \right)
\end{split}
\end{align}

or on the log scale:

\begin{align}
\begin{split}
\log h_i(T_i) & = B(T_i; \boldsymbol{\theta}, \boldsymbol{k_0}) + \eta_i
\end{split}
\end{align}

The cumulative hazard, survival function, and CDF for the B-spline model cannot be calculated analytically. Instead, the model is only defined analytically on the hazard scale and quadrature is used to evaluate the following:

\begin{align}
\begin{split}
H_i(T_i) & = \int_0^{T_i} h_i(u) du \\
S_i(T_i) & = \exp \left( - \int_0^{T_i} h_i(u) du \right) \\
F_i(T_i) & = 1 - \exp \left( - \int_0^{T_i} h_i(u) du \right) \\
S_i(T_i) - S_i(T_i^U) & = \exp \left( - \int_0^{T_i} h_i(u) du \right) - \exp \left( - \int_0^{T_i^U} h_i(u) du \right)
\end{split}
\end{align}
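As a rough illustration of the quadrature idea, the sketch below evaluates the cumulative hazard as a numerical integral of the hazard function. It uses base R's `integrate` purely for illustration, rather than whichever quadrature rule is used internally, and a Weibull hazard with arbitrary parameter values.

```{r, eval=FALSE}
# not run: the cumulative hazard as a numerical integral of the hazard,
# illustrated with a Weibull hazard (gamma = 1.5, lambda = 0.1); base
# R's integrate() stands in for the rule used internally
cumhaz <- function(Time, haz) integrate(haz, lower = 0, upper = Time)$value
cumhaz(2, function(t) 1.5 * t^0.5 * 0.1)  # analytic answer: 0.1 * 2^1.5
```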
### Extension to time-varying coefficients (i.e. non-proportional hazards)

We can extend the previous model formulations to allow for time-varying coefficients (i.e. non-proportional hazards). The time-varying linear predictor is introduced on the hazard scale; that is, $\eta_i$ in our previous model definitions is replaced by $\eta_i(t)$. This still leads to an analytical form for the hazard and log hazard. However, in general, there is no longer a closed form expression for the cumulative hazard, survival function, or CDF. Therefore, when the linear predictor includes time-varying coefficients, quadrature is used to evaluate the following:

\begin{align}
\begin{split}
H_i(T_i) & = \int_0^{T_i} h_i(u) du \\
S_i(T_i) & = \exp \left( - \int_0^{T_i} h_i(u) du \right) \\
F_i(T_i) & = 1 - \exp \left( - \int_0^{T_i} h_i(u) du \right) \\
S_i(T_i) - S_i(T_i^U) & = \exp \left( - \int_0^{T_i} h_i(u) du \right) - \exp \left( - \int_0^{T_i^U} h_i(u) du \right)
\end{split}
\end{align}

# Appendix B: Parameterisations under accelerated failure times

When `basehaz` is set equal to `"exp-aft"` or `"weibull-aft"` then the model is defined on the accelerated failure time scale using the following parameterisations.

### Exponential model

The exponential model is parameterised with scale parameter $\lambda_i = \exp(-\eta_i)$ where $\eta_i = \beta_0^* + \sum_{p=1}^P \beta_p^* x_{ip}$ denotes our linear predictor.

For individual $i$ we have:

\begin{align}
\begin{split}
h_i(T_i) & = \lambda_i \\
  & = \exp(-\eta_i) \\
H_i(T_i) & = T_i \lambda_i \\
  & = T_i \exp(-\eta_i) \\
S_i(T_i) & = \exp \left( - T_i \lambda_i \right) \\
  & = \exp \left( - T_i \exp(-\eta_i) \right) \\
F_i(T_i) & = 1 - \exp \left( - T_i \lambda_i \right) \\
  & = 1 - \exp \left( - T_i \exp(-\eta_i) \right) \\
S_i(T_i) - S_i(T_i^U) & = \exp \left( - T_i \lambda_i \right) - \exp \left( - T_i^U \lambda_i \right) \\
  & = \exp \left( - T_i \exp(-\eta_i) \right) - \exp \left( - T_i^U \exp(-\eta_i) \right)
\end{split}
\end{align}

or on the log scale:

\begin{align}
\begin{split}
\log h_i(T_i) & = \log \lambda_i \\
  & = -\eta_i \\
\log H_i(T_i) & = \log(T_i) + \log \lambda_i \\
  & = \log(T_i) - \eta_i \\
\log S_i(T_i) & = - T_i \lambda_i \\
  & = - T_i \exp(-\eta_i) \\
\log F_i(T_i) & = \log \left( 1 - \exp \left( - T_i \lambda_i \right) \right) \\
  & = \log \left( 1 - \exp \left( - T_i \exp(-\eta_i) \right) \right) \\
\log (S_i(T_i) - S_i(T_i^U)) & = \log \left[ \exp \left( - T_i \lambda_i \right) - \exp \left( - T_i^U \lambda_i \right) \right] \\
  & = \log \left[ \exp \left( - T_i \exp(-\eta_i) \right) - \exp \left( - T_i^U \exp(-\eta_i) \right) \right]
\end{split}
\end{align}

The definition of $\lambda$ for the baseline is:

\begin{align}
\begin{split}
\lambda_0 = \exp(-\beta_0^*) \Longleftrightarrow \beta_0^* = -\log(\lambda_0)
\end{split}
\end{align}

The relationship between the coefficients under the PH (unstarred) and AFT (starred) parameterisations is as follows:

\begin{align}
\begin{split}
\beta_0 & = -\beta_0^* \\
\beta_p & = -\beta_p^*
\end{split}
\end{align}

Lastly, the general form for the hazard function and survival function under an AFT model with acceleration factor $\exp(-\eta_i)$ can be used to derive the exponential AFT model defined here by setting $h_0(t) = 1$, $S_0(t) = \exp(-t)$, and $\lambda_i = \exp(-\eta_i)$:

\begin{align}
\begin{split}
h_i(T_i) & = \exp(-\eta_i) h_0(T_i \exp(-\eta_i)) \\
  & = \exp(-\eta_i) \\
  & = \lambda_i
\end{split}
\end{align}

\begin{align}
\begin{split}
S_i(T_i) & = S_0(T_i \exp(-\eta_i)) \\
  & = \exp(-T_i \exp(-\eta_i)) \\
  & = \exp(-T_i \lambda_i)
\end{split}
\end{align}

### Weibull model

The Weibull model is parameterised with scale parameter $\lambda_i = \exp(-\gamma \eta_i)$ and shape parameter $\gamma > 0$ where $\eta_i = \beta_0^* + \sum_{p=1}^P \beta_p^* x_{ip}$ denotes our linear predictor.
For individual $i$ we have:

\begin{align}
\begin{split}
h_i(T_i) & = \gamma T_i^{\gamma-1} \lambda_i \\
  & = \gamma T_i^{\gamma-1} \exp(-\gamma \eta_i) \\
H_i(T_i) & = T_i^{\gamma} \lambda_i \\
  & = T_i^{\gamma} \exp(-\gamma \eta_i) \\
S_i(T_i) & = \exp \left( - T_i^{\gamma} \lambda_i \right) \\
  & = \exp \left( - T_i^{\gamma} \exp(-\gamma \eta_i) \right) \\
F_i(T_i) & = 1 - \exp \left( - T_i^{\gamma} \lambda_i \right) \\
  & = 1 - \exp \left( - T_i^{\gamma} \exp(-\gamma \eta_i) \right) \\
S_i(T_i) - S_i(T_i^U) & = \exp \left( - T_i^{\gamma} \lambda_i \right) - \exp \left( - {(T_i^U)}^{\gamma} \lambda_i \right) \\
  & = \exp \left( - T_i^{\gamma} \exp(-\gamma \eta_i) \right) - \exp \left( - {(T_i^U)}^{\gamma} \exp(-\gamma \eta_i) \right)
\end{split}
\end{align}

or on the log scale:

\begin{align}
\begin{split}
\log h_i(T_i) & = \log(\gamma) + (\gamma-1) \log(T_i) + \log \lambda_i \\
  & = \log(\gamma) + (\gamma-1) \log(T_i) - \gamma \eta_i \\
\log H_i(T_i) & = \gamma \log(T_i) + \log \lambda_i \\
  & = \gamma \log(T_i) - \gamma \eta_i \\
\log S_i(T_i) & = - T_i^{\gamma} \lambda_i \\
  & = - T_i^{\gamma} \exp(-\gamma \eta_i) \\
\log F_i(T_i) & = \log \left( 1 - \exp \left( - T_i^{\gamma} \lambda_i \right) \right) \\
  & = \log \left( 1 - \exp \left( - T_i^{\gamma} \exp(-\gamma \eta_i) \right) \right) \\
\log (S_i(T_i) - S_i(T_i^U)) & = \log \left[ \exp \left( - T_i^{\gamma} \lambda_i \right) - \exp \left( - {(T_i^U)}^{\gamma} \lambda_i \right) \right] \\
  & = \log \left[ \exp \left( - T_i^{\gamma} \exp(-\gamma \eta_i) \right) - \exp \left( - {(T_i^U)}^{\gamma} \exp(-\gamma \eta_i) \right) \right]
\end{split}
\end{align}

The definition of $\lambda$ for the baseline is:

\begin{align}
\begin{split}
\lambda_0 = \exp(-\gamma \beta_0^*) \Longleftrightarrow \beta_0^* = \frac{-\log(\lambda_0)}{\gamma}
\end{split}
\end{align}

The relationship between the coefficients under the PH (unstarred) and AFT (starred) parameterisations is as follows:

\begin{align}
\begin{split}
\beta_0 & = -\gamma \beta_0^* \\
\beta_p & = -\gamma \beta_p^*
\end{split}
\end{align}

Lastly, the general form for the hazard function and survival function under an AFT model with acceleration factor $\exp(-\eta_i)$ can be used to derive the Weibull AFT model defined here by setting $h_0(t) = \gamma t^{\gamma - 1}$, $S_0(t) = \exp(-t^{\gamma})$, and $\lambda_i = \exp(-\gamma \eta_i)$:

\begin{align}
\begin{split}
h_i(T_i) & = \exp(-\eta_i) h_0(T_i \exp(-\eta_i)) \\
  & = \exp(-\eta_i) \gamma {(T_i \exp(-\eta_i))}^{\gamma - 1} \\
  & = \exp(-\gamma \eta_i) \gamma T_i^{\gamma - 1} \\
  & = \lambda_i \gamma T_i^{\gamma - 1}
\end{split}
\end{align}

\begin{align}
\begin{split}
S_i(T_i) & = S_0(T_i \exp(-\eta_i)) \\
  & = \exp(-(T_i \exp(-\eta_i))^{\gamma}) \\
  & = \exp(-T_i^{\gamma} [\exp(-\eta_i)]^{\gamma}) \\
  & = \exp(-T_i^{\gamma} \exp(-\gamma \eta_i)) \\
  & = \exp(-T_i^{\gamma} \lambda_i)
\end{split}
\end{align}
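A quick numeric check of this identity (not run; the parameter values are arbitrary and chosen purely for illustration):

```{r, eval=FALSE}
# not run: numeric check that S_0(t * exp(-eta)) equals
# exp(-t^gamma * exp(-gamma * eta)); parameter values are arbitrary
gamma <- 1.5; eta <- 0.4; tt <- 2
exp(-(tt * exp(-eta))^gamma)        # S_0 evaluated at accelerated time
exp(-tt^gamma * exp(-gamma * eta))  # the same survival probability
```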
### Extension to time-varying coefficients (i.e. time-varying acceleration factors)

We can extend the previous model formulations to allow for time-varying coefficients (i.e. time-varying acceleration factors). The so-called "unmoderated" survival probability for an individual at time $t$ is defined as the baseline survival probability at time $t$, i.e. $S_i(t) = S_0(t)$. With a time-fixed acceleration factor, the survival probability for a so-called "moderated" individual is defined as the baseline survival probability evaluated at time $t$ multiplied by the acceleration factor $\exp(-\eta_i)$. That is, the survival probability for the moderated individual is $S_i(t) = S_0(t \exp(-\eta_i))$.

However, with time-varying acceleration we cannot simply multiply time by a fixed (acceleration) constant. Instead, we must integrate the function for the time-varying acceleration factor over the interval $0$ to $t$. In other words, we must evaluate:

\
\begin{align}
\begin{split}
S_i(t) = S_0 \left( \int_0^t \exp(-\eta_i(u)) du \right)
\end{split}
\end{align}
\

as described by Hougaard (1999). Hougaard also gives a general expression for the hazard function under time-varying acceleration, as follows:

\
\begin{align}
\begin{split}
h_i(t) = \exp \left(-\eta_i(t) \right) h_0 \left( \int_0^t \exp(-\eta_i(u)) du \right)
\end{split}
\end{align}

**Note:** the *hazard* at time $t$ is in fact a function of the full history of covariates and parameters (i.e. the linear predictor) from time $0$ up until time $t$. This is different to the hazard scale formulation of time-varying effects (i.e. non-proportional hazards). Under the hazard scale formulation with time-varying effects, the *survival* probability is a function of the full history between times $0$ and $t$, but the *hazard* is **not**; instead, the hazard is only a function of covariates and parameters as defined at the current time. This is particularly important to consider when fitting accelerated failure time models with time-varying effects in the presence of delayed entry (i.e. left truncation).

For the exponential distribution (where $h_0(t) = 1$ and $S_0(t) = \exp(-t)$), this leads to:

\begin{align}
\begin{split}
S_i(T_i) & = S_0 \left( \int_0^{T_i} \exp(-\eta_i(u)) du \right) \\
  & = \exp \left( - \int_0^{T_i} \exp(-\eta_i(u)) du \right)
\end{split}
\end{align}

\begin{align}
\begin{split}
h_i(T_i) & = \exp \left( -\eta_i(T_i) \right) h_0 \left( \int_0^{T_i} \exp(-\eta_i(u)) du \right) \\
  & = \exp \left( -\eta_i(T_i) \right)
\end{split}
\end{align}

and for the Weibull distribution (where $h_0(t) = \gamma t^{\gamma-1}$ and $S_0(t) = \exp(-t^{\gamma})$), this leads to:

\begin{align}
\begin{split}
S_i(T_i) & = S_0 \left( \int_0^{T_i} \exp(-\eta_i(u)) du \right) \\
  & = \exp \left( - \left[ \int_0^{T_i} \exp(-\eta_i(u)) du \right]^{\gamma} \right)
\end{split}
\end{align}

\begin{align}
\begin{split}
h_i(T_i) & = \exp \left( -\eta_i(T_i) \right) h_0 \left( \int_0^{T_i} \exp(-\eta_i(u)) du \right) \\
  & = \exp \left( -\eta_i(T_i) \right) \gamma \left[ \int_0^{T_i} \exp(-\eta_i(u)) du \right]^{\gamma - 1}
\end{split}
\end{align}

These general expressions for the hazard and survival function under an AFT model with a time-varying linear predictor are used to evaluate the likelihood for the accelerated failure time model in `stan_surv` when time-varying effects are specified in the model formula. Specifically, quadrature is used to evaluate the cumulative acceleration factor $\int_0^t \exp(-\eta_i(u)) du$, which is then substituted into the relevant expressions for the hazard and survival.
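To illustrate this last point, the sketch below evaluates the cumulative acceleration factor by plain numerical integration for an arbitrary illustrative time-varying linear predictor; base R's `integrate` stands in for the quadrature rule used internally.

```{r, eval=FALSE}
# not run: the cumulative acceleration factor int_0^t exp(-eta(u)) du,
# evaluated by numerical integration for an arbitrary illustrative
# time-varying linear predictor eta(t)
eta    <- function(t) -0.5 + 0.2 * t
cum_af <- function(t) integrate(function(u) exp(-eta(u)), 0, t)$value
S0     <- function(x, gamma = 1.5) exp(-x^gamma)  # Weibull baseline survival
S0(cum_af(3))  # survival probability at t = 3 under the time-varying AFT
```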