Notes from Simon Wood’s Generalized Additive Models

Author

David Salazar

Published

August 12, 2023

Simon Wood is a world expert on GAMs and the creator of the fantastic mgcv package in R. His book on GAMs is a great introduction to the subject. I've been re-reading it and taking notes so that I can teach the subject better to my co-workers. This post is a summary of the notes I've taken so far. They are questions I ask myself to check that I have a good mental model of the subject. I've tried to make them as general as possible, but they are still biased towards my own understanding of the subject.

Generalized Additive Model (GAM)

Q: What is a Generalized Additive Model?
A: A generalized linear model whose linear predictor involves a sum of smooth functions of covariates.

Q: What are the two problems of using GAMs?
A: 1. How should we represent the smooth functions? 2. How smooth should these functions be?

Q: What’s the traditional solution to the above problems?
A: 1. Represent each smooth via a basis expansion, with estimation via penalized regression methods. 2. Estimate the smoothness of each function via cross-validation or marginal likelihood maximization.

Q: What’s the restriction on how we represent a smooth to be estimated as a linear model?
A: f must be linear in its parameters. Choose a basis, i.e. a space of functions of which f (or a close approximation to it) is an element. Estimate a coefficient for each basis function; the resulting f is the linear combination \[ f(x) = \sum_{j=1}^{k} b_j(x) \beta_j. \]

Q: Why are polynomial basis expansions frowned upon when it comes to GAMs?
A: Polynomial basis expansions are one possible choice for these functions. However, they have certain limitations:

1. Runaway behavior: polynomials are unbounded and can take extreme values far from the data points, which can cause model instability and poor extrapolation. This is especially problematic at the boundaries of the data range.
2. Global influence: a change in the data at one point can affect the polynomial fit at all points. In contrast, local regression techniques (like splines or kernel smoothing) used in GAMs are less influenced by distant points.
3. Inflexibility: polynomials of low degree may not be flexible enough to capture complex non-linear relationships, while high-degree polynomials can overfit the data and create oscillations.
4. Curse of dimensionality: when dealing with multivariate data, the number of polynomial terms grows exponentially with the degree of the polynomial and the number of variables. This can make the model very complex and computationally expensive.

Q: How can we determine the resulting smoothness of the linear combination of the functions in the basis expansion?
A: By using penalized regression: add a penalty on the "wiggliness" of f (e.g., the integrated squared second derivative in the case of cubic splines). The problem of choosing how smooth f should be then becomes the problem of choosing the hyperparameter that controls this penalty.

Q: Given the penalized regression approach to estimating the wiggliness of the resulting linear combination of basis functions, what's Simon Wood's advice for choosing the dimension of the basis expansion?
A: Choose a k large enough that the basis is more flexible than we expect to need to represent f(x). Then neither the exact choice of k nor the precise location of the knots has a great deal of influence on the model fit; it is the penalty hyperparameter that controls the effective flexibility.

Q: How can we justify using Restricted Maximum Likelihood (REML) to estimate the wiggliness hyperparameter?
A: By positing a prior on the wiggliness of the linear combination of the basis functions, we can estimate the model as a mixed model. REML then estimates the smoothing hyperparameter as a variance component of the random effects.

Q: What alternative to REML can we use to estimate the wiggliness hyperparameters?
A: We can use cross-validation; in particular, Generalized Cross Validation (GCV), which avoids the computational cost of leave-one-out CV.

Q: Why does adding more than one smooth to a GAM introduce an identifiability problem?
A: The identifiability problem arises when a GAM has more than one smooth term: multiple different combinations of smooth functions can give the same fitted values. To see this, consider a simple GAM with two smooth terms: \[ y = f_1(x_1) + f_2(x_2) + \varepsilon \] Now imagine that we add a constant \(c\) to \(f_1(x_1)\) and subtract the same constant from \(f_2(x_2)\). The fitted values remain the same: \[ y = [f_1(x_1) + c] + [f_2(x_2) - c] + \varepsilon = f_1(x_1) + f_2(x_2) + \varepsilon \] However, the functions \(f_1(x_1)\) and \(f_2(x_2)\) have changed. Because multiple sets of functions give the same fitted values, the model is not identifiable.

Q: What is a common solution to the identifiability problem introduced by having multiple smooths in a GAM?
A: A common solution to this problem is to impose the constraint that each smooth function sums (equivalently, averages) to zero over the observed covariate values. This effectively removes the freedom to add or subtract constants from the functions. The result is that each function can only change the shape of the fitted values, not the overall level. This makes the model identifiable: there is now a unique set of functions \(f_1(x_1)\) and \(f_2(x_2)\) that gives the same fitted values.

Q: How does the interpretation of the smooths change once we add the zero-mean constraint to deal with identifiability?
A: This constraint has a nice interpretation: it means that each smooth function represents the deviation from the overall mean response. It does not change the overall level of the response, only its shape. This is consistent with the idea that the smooth functions in a GAM capture the non-linear effects of the predictors.

Q: How does the identifiability constraint limit the maximum dimension of the basis?
A: It reduces it from whatever k we choose to k-1, because the zero-centering constraint removes one degree of freedom from each smooth whenever the model contains more than one of them.

Q: What's a tricky point about setting the dimension k, compared with bases of smaller dimension?
A: A basis with k=20 contains a much larger set of functions with, say, 5 effective degrees of freedom than a basis with k=10 does. Reducing k is therefore not the same as simply limiting the effective degrees of freedom of the fit.

Q: If reducing the k in the smooth won't necessarily constrain the model to a simpler spline, what can we do to induce stronger regularization of the smooth when using Generalized Cross Validation?
A: We can use the gamma hyperparameter to increase the cost per effective degree of freedom in the GCV score; larger values of gamma produce increasingly smooth models.

Q: In mgcv, when using smooths over several variables, what type of basis expansion are we using?
A: Tensor products of the marginal bases.

Q: In mgcv, when specifying the k for a tensor product, what’s the resulting dimension?
A: The product of the dimensions of the marginal bases. You can define a marginal basis dimension per variable.

Q: Name three possible kinds of basis expansions (smoothers) to use in GAMs.
A:
- Local linear bases
- Polynomial bases
- Spline bases

Q: What’s the least squares method used in GAMs for the penalized likelihood maximization?
A: Penalized iteratively re-weighted least squares (PIRLS).

Q: What are cubic splines?
A: A cubic spline is a curve constructed from sections of cubic polynomial joined together so that the curve is continuous up to its second derivative.

Q: What's the difference between cubic spline interpolators and cubic smoothing splines?
A: An interpolating cubic spline passes exactly through the data at the control points. A cubic smoothing spline instead treats the values the function takes at the n control points as free parameters, and estimates them by minimizing the squared error plus the integral of the squared second derivative, weighted by a hyperparameter lambda.

Q: How do B splines improve upon natural polynomial splines?
A: They modify the basis functions so that their effects are strictly local: each basis function is nonzero only over the intervals between m + 3 adjacent knots.

Q: What are the improvements of P splines over B Splines?
A: P-splines extend B-splines by adding a penalty on the differences of the coefficients of the B-splines. This penalty typically targets the second or higher-order differences to ensure the resulting curve is smooth. The idea is to strike a balance between fitting the data closely (like B-splines) and keeping the curve smooth (through the penalty).

Q: What does it mean for a model smooth to be cyclic?
A: The resulting smooth has the same value and first few derivatives at its upper and lower boundaries.

Q: What are adaptive smoothers in the context of GAMs?
A: A spline where the level of smoothing depends on the values of the covariates.

Q: Why do most smoothing penalties in a GAM have trouble producing the zero function? What can we do about it?
A: Some non-zero functions lie in the penalty's null space (they are treated as completely smooth, e.g. straight lines under a second-derivative penalty), so no amount of penalization shrinks them away. If we want to allow a smooth to be shrunk to zero, we can add an extra penalty on the null space; the select argument in mgcv does exactly that.

Q: What are isotropic smooths in the context of GAMs?
A: Smooths that in the multiple covariate case produce identical predictions of the response variable under any rotation or reflection of the covariates.

Q: How do we arrive at the knot locations and basis functions for thin plate splines?
A: These emerge naturally from the mathematical statement of the smoothing problem.

Q: What type of basis functions arise out of the thin plate spline formulation?
A: Radial basis functions.

Q: What’s the drawback of thin plate splines?
A: Their computational cost.

Q: What’s the idea behind thin plate regression splines that reduce their computational cost?
A: Truncating the space of wiggly components of the thin plate spline while leaving the components of zero wiggliness unchanged.

Q: When is soap film smoothing useful in GAMs?
A: When it’s important to not smooth across boundaries.

Q: What's the trick to constructing orthogonal, non-isotropic smooths for interactions of variables?
A: Because of the identifiability constraints, the marginal bases cannot contain the constant function (the unit vector). The tensor product of such constrained marginal bases therefore cannot reproduce the main effects, so the resulting interaction smooth is orthogonal to them.