Once Stan’s implementation of HMC has worked its magic, we finally have samples from the posterior distribution $p(\theta \mid y)$. But the job is not over: we still need to evaluate the model’s out-of-sample predictive performance.
The Leave-One-Out (LOO) log pointwise predictive density is the preferred Bayesian way to do this. In this blog post, I’ll explain how we can approximate this metric without refitting the model $n$ times.
All of this is based on this great paper by Vehtari, Gelman and Gabry.
What is our metric? Log pointwise predictive density
Given an observation $y_i$, its log pointwise predictive density under the model is $\log p(y_i \mid y) = \log \int p(y_i \mid \theta)\, p(\theta \mid y)\, d\theta$, which we can estimate with our $S$ posterior draws $\theta^s$ as $\log \left( \frac{1}{S} \sum_{s=1}^{S} p(y_i \mid \theta^s) \right)$. Summing over all $n$ observations gives the log pointwise predictive density (lpd) of the model.
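To make the metric concrete, here is a minimal sketch in Python for computing the within-sample lpd, assuming `log_lik` is an $(S, n)$ array of pointwise log-likelihoods $\log p(y_i \mid \theta^s)$ saved alongside the posterior draws (for example, in Stan’s generated quantities block); the names are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def lpd(log_lik):
    """Within-sample log pointwise predictive density:
    sum_i log( (1/S) * sum_s p(y_i | theta^s) ), computed stably in log space."""
    S = log_lik.shape[0]
    return np.sum(logsumexp(log_lik, axis=0) - np.log(S))
```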
The fundamental problem comes when we use the same data both to fit the model and to evaluate this density: the within-sample lpd will overestimate how well the model predicts new data. The Bayesian answer is to evaluate each observation under the LOO posterior $p(\theta \mid y_{-i})$, the posterior obtained without that observation, and to sum the resulting log predictive densities: $\text{elpd}_{\text{loo}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i})$, where $p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, d\theta$.
The problem is that evaluating the LOO posterior $p(\theta \mid y_{-i})$ is just as computationally expensive as fitting the model all over again. If we then want to use all our observations to perform the external validation check, this amounts to fitting the probability model $n$ times, once per held-out observation.
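To see why this is painful, here is what the brute-force version would look like; `fit_model` and `log_pred_density` are hypothetical stand-ins for a full Stan workflow, not functions of any real package.

```python
import numpy as np

def elpd_loo_brute_force(y, fit_model, log_pred_density):
    """sum_i log p(y_i | y_{-i}), each term requiring a complete refit."""
    elpd = 0.0
    for i in range(len(y)):
        y_minus_i = np.delete(y, i)
        draws = fit_model(y_minus_i)              # hypothetical: one full MCMC run
        elpd += log_pred_density(y[i], draws)     # hypothetical: log p(y_i | y_{-i})
    return elpd
```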
Approximating the LOO posterior
Not being able to sample directly from the distribution we care about is an awfully familiar problem in Bayesian statistics, which in this case comes in handy: we can use Importance Sampling to reuse the samples from the full posterior to approximate the LOO posterior. The importance weight for each sample $\theta^s$ is the ratio of target to proposal density, $r_i^s = \frac{p(\theta^s \mid y_{-i})}{p(\theta^s \mid y)} \propto \frac{1}{p(y_i \mid \theta^s)}$.
If we then reweight our original full-posterior samples by these weights, we effectively get samples from the LOO posterior. The self-normalized estimate is $p(y_i \mid y_{-i}) \approx \frac{\sum_{s=1}^{S} r_i^s\, p(y_i \mid \theta^s)}{\sum_{s=1}^{S} r_i^s}$, and summing its logarithm over observations gives a log pointwise predictive density that tracks the out-of-sample performance of our model.
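As a sketch of what this buys us: with the raw weights $r_i^s \propto 1/p(y_i \mid \theta^s)$, the self-normalized estimate above can be computed from the same `log_lik` matrix as before, with no refitting at all.

```python
import numpy as np
from scipy.special import logsumexp

def elpd_loo_is(log_lik):
    """IS-LOO with raw weights r_i^s = 1 / p(y_i | theta^s).
    log_lik: (S, n) array of log p(y_i | theta^s)."""
    log_r = -log_lik  # log importance weights, up to a constant
    # log( sum_s r_i^s p(y_i | theta^s) ) - log( sum_s r_i^s ), per observation
    pointwise = logsumexp(log_r + log_lik, axis=0) - logsumexp(log_r, axis=0)
    return np.sum(pointwise)
```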
The approximation is likely to fail
Sadly, this approximation to the LOO posterior using the full posterior is likely to fail. Importance Sampling only works well when all of the weights are roughly equal. When most weights are very small and a few are very, very large, Importance Sampling fails: our computations end up effectively using only the handful of samples with the largest weights, drastically reducing our effective number of samples from the LOO posterior. That is, Importance Sampling is likely to fail when the distribution of the weights is fat-tailed.
Sadly, this is very likely to happen with our approximation: the LOO posterior is likely to have a larger variance and fatter tails than the full posterior. Thus, samples from the tails of the full posterior will get large weights to compensate for this fact. Therefore, the distribution of importance weights is likely going to be fat-tailed.
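One way to quantify the damage is the standard importance-sampling effective sample size, $(\sum_s w_s)^2 / \sum_s (w_s)^2$ per observation; a sketch, assuming the same $(S, n)$ layout of log weights as before.

```python
import numpy as np
from scipy.special import logsumexp

def is_effective_sample_size(log_weights):
    """ESS = 1 / sum_s w_s^2 after normalizing each observation's weights
    to sum to one; a few dominant weights drive this far below S."""
    log_norm = log_weights - logsumexp(log_weights, axis=0)
    return np.exp(-logsumexp(2 * log_norm, axis=0))
```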
Correcting the approximation with PSIS
Vehtari, Gelman and Gabry correct the distribution of importance weights and thereby improve the approximation to the LOO posterior. First, they use Extreme Value Theory to fit the tail of the distribution of weights with a Generalized Pareto Distribution with tail shape parameter $k$, estimated from the largest weights.
Second, they replace the largest weights with smoothed versions of themselves: the expected order statistics of the fitted Generalized Pareto Distribution.
Therefore, we arrive at a new vector of importance weights $w_i^s$: the Pareto Smoothed Importance Sampling (PSIS) weights, which are additionally truncated so that no single weight can dominate the estimate.
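Here is a simplified sketch of that idea for a single observation’s vector of log weights. The paper’s actual procedure uses a specific empirical-Bayes fit, tail-size rule and truncation; in this sketch scipy’s generic maximum-likelihood fit and a crude 20% tail stand in for them.

```python
import numpy as np
from scipy import stats

def psis_smooth(log_weights, tail_fraction=0.2):
    """Replace the largest importance weights with quantiles of a Generalized
    Pareto Distribution fitted to them, then truncate at the raw maximum."""
    S = log_weights.size
    w = np.exp(log_weights - log_weights.max())        # stabilize before exponentiating
    n_tail = max(int(np.ceil(tail_fraction * S)), 5)   # crude tail size; the paper has its own rule
    order = np.argsort(w)
    tail_idx = order[-n_tail:]                         # indices of the largest weights
    cutoff = w[order[-n_tail - 1]]                     # largest weight outside the tail

    # Fit the GPD to the exceedances over the cutoff; k is the tail shape parameter.
    k, _, sigma = stats.genpareto.fit(w[tail_idx] - cutoff, floc=0.0)

    # Replace the tail weights with quantiles of the fitted GPD at the plotting
    # positions (an approximation to its expected order statistics), and
    # truncate at the raw maximum weight.
    probs = (np.arange(1, n_tail + 1) - 0.5) / n_tail
    smoothed = cutoff + stats.genpareto.ppf(probs, k, loc=0.0, scale=sigma)
    w_new = w.copy()
    w_new[tail_idx] = np.minimum(smoothed, w.max())
    return np.log(w_new) + log_weights.max(), k        # smoothed log weights and k-hat
```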
It’s about the diagnostics we made along the way
The great thing about PSIS, besides producing better importance weights, is the diagnostics it creates along the way. By fitting a Generalized Pareto Distribution to the tail of the weights, we get an estimate $\hat{k}$ of its tail shape parameter, which tells us exactly how fat-tailed the distribution of weights is and, therefore, how much we can trust the importance sampling approximation for that observation.
The smoothing and the truncating can only do so much. If $\hat{k} > 0.7$, the approximation for that observation is unreliable, and the honest alternative is to refit the model without that observation.
Therefore, by performing PSIS-LOO, we also arrive at a diagnostic for highly influential observations: the ones that drive our inference and that the model finds surprising.
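In practice none of this needs to be hand-rolled: the `loo` package in R and ArviZ in Python implement PSIS-LOO together with its $\hat{k}$ diagnostics. A minimal sketch, assuming `idata` is an ArviZ `InferenceData` object that carries a `log_likelihood` group from the Stan fit.

```python
import arviz as az

# PSIS-LOO estimate of the expected log pointwise predictive density,
# with per-observation results so the Pareto k diagnostics are reported.
loo_result = az.loo(idata, pointwise=True)
print(loo_result)                           # elpd_loo, its SE, and a k-hat summary
suspicious = loo_result.pareto_k > 0.7      # observations whose estimate we should not trust
```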