1 Quick start
Before properly discussing high-dimensional model selection methods, we show some examples to motivate their utility and to provide a quick start for readers who are already familiar with the basics. The examples also illustrate potential pitfalls that are not always sufficiently appreciated.
We focus on sparse inference methods, mainly Bayesian model selection (BMS) and model averaging (BMA), \(L_0\) criteria like the BIC or EBIC, the \(L_1\) penalty (LASSO) and the adaptive LASSO.
For the illustrations we use the R package modelSelection.
The main functions for regression-type models are modelSelection and bestBIC (along with companions like bestEBIC, bestAIC and bestIC).
modelSelection_eBayes provides an empirical Bayes counterpart to modelSelection.
1.1 Popular methods may fail
We consider a simple setting with \(p=3\) covariates and a moderately large sample size \(n=200\), where popular high-dimensional methods fail to recover the truly active covariates. It may feel underwhelming to start a book on high-dimensional model selection with a \(p=3\) example: the message is that if things can go wrong here, they are bound to get worse for larger \(p\). By understanding what can go wrong in such a manageable example, we hope to learn lessons that remain valuable in harder problems, and to better understand what different methods bring to the table.
Suppose that data are truly generated as follows: \[\begin{align*} x_{i1} & \sim N(0,1) \\ x_{i2} & \sim N(0,1) \\ x_{i3} & \sim N(x_{i1} - x_{i2}, \sigma=0.2) \\ y_i & \sim N(x_{i1} - x_{i2}, \sigma=1) \end{align*}\] with independence across \(i=1,\ldots,n\). That is, \(y_i\) truly depends on \(x_{i1}\) and \(x_{i2}\) but not on \(x_{i3}\). The complicating issue is that \(x_{i3}\) is a near-copy of the mean of \(y_i\), and hence the marginal correlation between \(x_{i3}\) and \(y_i\) is high.
Code
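A minimal sketch of the folded simulation chunk (the seed is an assumption; the same generating code is reused in the simulation study later in this section):

set.seed(1234)                     # assumed seed, for reproducibility
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 - x2 + rnorm(n, sd=0.2)   # near-copy of the mean of y
y <- x1 - x2 + rnorm(n)            # y depends on x1 and x2 only
df <- data.frame(y, x1, x2, x3)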
To show that this is actually an easy problem, we use least-squares to fit the model that includes all covariates. The results correctly suggest that the third covariate could conceivably be dropped, whereas covariates 1-2 are highly statistically significant.
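The fit can be reproduced as follows (a sketch; the call matches the one shown in the output below):

# Least-squares fit of the full model, excluding the intercept (-1)
summary(lm(y ~ -1 + ., data=df))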
##
## Call:
## lm(formula = y ~ -1 + ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.83454 -0.64344 -0.05624 0.60601 2.49699
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x1 1.13570 0.33100 3.431 0.000732 ***
## x2 -1.05736 0.33124 -3.192 0.001644 **
## x3 -0.02529 0.32816 -0.077 0.938637
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9501 on 197 degrees of freedom
## Multiple R-squared: 0.705, Adjusted R-squared: 0.7005
## F-statistic: 156.9 on 3 and 197 DF, p-value: < 2.2e-16
The issue is that when the number of covariates \(p\) is large, possibly larger than the sample size \(n\), fitting a model that includes all covariates is often less reliable (the parameter estimates are less precise). High-dimensional methods use other strategies to identify the truly active covariates. A popular strategy is the LASSO (or \(L_1\) regularization) of Tibshirani (1996), which can quickly obtain parameter estimates for all values of its regularization (or penalty) hyper-parameter. As one decreases the penalty, the LASSO solution moves from the model excluding all covariates to first including \(x_{i3}\) (the variable most correlated with \(y_i\)), and subsequently to adding \(x_{i1}\) and \(x_{i2}\), without ever dropping \(x_{i3}\). That is, the data-generating model is not recovered for any value of the penalization parameter.
Code
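The output below has the format of a glmnet fit (penalty steps labelled s0, s1, …). A sketch that produces such an inclusion matrix follows; the use of glmnet, the intercept=FALSE option and the post-processing are assumptions, not necessarily the book's actual chunk:

library(glmnet)
# LASSO path over a decreasing grid of penalty values; no intercept, as in the fit above
fit.lasso <- glmnet(x=as.matrix(df[,c('x1','x2','x3')]), y=df$y, intercept=FALSE)
# For each penalty value (rows s0, s1, ...), flag which coefficients are non-zero
incl <- t(as.matrix(coef(fit.lasso)) != 0)
# Keep only the steps where the set of selected covariates changes
incl[!duplicated(incl), ]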
## (Intercept) x1 x2 x3
## s0 FALSE FALSE FALSE FALSE
## s1 FALSE FALSE FALSE TRUE
## s22 FALSE TRUE FALSE TRUE
## s34 FALSE TRUE TRUE TRUE
In contrast, the Bayesian information criterion (BIC) of Schwarz (1978) (see Section 4.1) correctly identifies the data-generating model.
bestBIC enumerates all \(2^3=8\) models resulting from the inclusion/exclusion of the 3 covariates, and finds the model with the best (lowest) BIC.
We use the R formula y ~ -1 + . to consider all variables in the data (other than y) as covariates,
where -1 excludes the intercept, which is not needed in this example.
We set verbose=FALSE to avoid printing progress information and cluttering the book.
When \(p\) is large, bestBIC uses MCMC to explore the model space.
A computationally faster alternative is to use heuristics, for example finding the model with best BIC among those visited by the LASSO or adaptive LASSO path (Zou 2006), or by a coordinate descent algorithm or local combinatorial searches (Hazimeh, Mazumder, and Nonet 2023; R package L0Learn); see Section 4.
Function bestBIC_fast in modelSelection implements such heuristics.
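A sketch of the call producing the output below (the object name is ours; the same call is used in the simulation study later in this section):

# Enumerate all 2^3 = 8 models and keep the one with lowest BIC
fit.bic <- bestBIC(y ~ -1 + ., data=df, verbose=FALSE)
fit.bic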
## icfit object
##
## Model with best BIC : x1 x2
##
## Use summary(), coef() and predict() to get inference for the top model
## Use coef(object$msfit) and predict(object$msfit) to get BMA estimates and predictions
Fitting the selected model via least-squares shows that both covariates are highly statistically significant.
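As indicated in the printout above, this is done by applying summary to the object returned by bestBIC; a minimal sketch:

# Least-squares fit of the BIC-selected model (internally a Gaussian glm)
summary(fit.bic)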
##
## Call:
## glm(formula = f, family = family2glm(ms$family), data = data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x1 1.11070 0.06605 16.82 <2e-16 ***
## x2 -1.03235 0.06688 -15.44 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.8981151)
##
## Null deviance: 602.82 on 200 degrees of freedom
## Residual deviance: 177.83 on 198 degrees of freedom
## AIC: 550.07
##
## Number of Fisher Scoring iterations: 2
Above we used a single simulated dataset for illustration, but this phenomenon occurs more generally. The table below shows the proportion of 1,000 simulations in which each covariate was selected, for \(n=200\) and \(n=1000\), when using the BIC, the LASSO and the adaptive LASSO (penalization parameter chosen by BIC). It also shows results for the fast model search heuristic of Hazimeh, Mazumder, and Nonet (2023) (R package L0Learn). The code to reproduce these results can be unfolded by clicking below.
Click to show code
Code
#Function to simulate one dataset of size n
simdata <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  x3 <- x1 - x2 + rnorm(n, sd=.2)
  x <- cbind(x1, x2, x3)
  y <- x1 - x2 + rnorm(n)
  df <- data.frame(y, x1, x2, x3)
  return(df)
}
#Function to simulate nsims datasets of size n and obtain
#the proportion of times that each covariate is selected
proportionSelected <- function(nsims, n) {
  sel.L0Learn <- sel.L1 <- sel.adaL1 <- sel.bic <- matrix(NA, nrow=nsims, ncol=3)
  for (i in 1:nsims) {
    df <- simdata(n)
    fit.L0Learn <- bestBIC_fast(y ~ -1 + ., data=df, fastmethod='L0Learn', verbose=FALSE)
    fit.L1 <- bestBIC_fast(y ~ -1 + ., data=df, fastmethod='L1', verbose=FALSE)
    fit.adaL1 <- bestBIC_fast(y ~ -1 + ., data=df, fastmethod='adaptiveL1', verbose=FALSE)
    fit.bic <- bestBIC(y ~ -1 + ., data=df, verbose=FALSE)
    sel.L0Learn[i,] <- fit.L0Learn$varnames %in% fit.L0Learn$topmodel
    sel.L1[i,] <- fit.L1$varnames %in% fit.L1$topmodel
    sel.adaL1[i,] <- fit.adaL1$varnames %in% fit.adaL1$topmodel
    sel.bic[i,] <- fit.bic$varnames %in% fit.bic$topmodel
  }
  nn <- c('BIC', 'LASSO BIC', 'ALASSO BIC', 'L0Learn')
  propsel <- rbind(colMeans(sel.bic), colMeans(sel.L1), colMeans(sel.adaL1), colMeans(sel.L0Learn))
  propsel <- data.frame(Method=nn, propsel)
  return(propsel)
}
propsel.n200 <- proportionSelected(nsims=1000, n=200)
propsel.n1000 <- proportionSelected(nsims=1000, n=1000)
propsel.n200$n <- 200
propsel.n1000$n <- 1000
propsel <- rbind(propsel.n200, propsel.n1000)
write.table(propsel, "tables/lassofailure.txt", row.names=FALSE, col.names=TRUE, sep="\t")

| Method | X1 (n=200) | X2 (n=200) | X3 (n=200) | X1 (n=1000) | X2 (n=1000) | X3 (n=1000) |
|---|---|---|---|---|---|---|
| BIC | 0.721 | 0.722 | 0.291 | 0.994 | 0.994 | 0.006 |
| LASSO BIC | 0.491 | 0.489 | 0.547 | 0.993 | 0.993 | 0.637 |
| ALASSO BIC | 0.491 | 0.489 | 0.531 | 0.993 | 0.993 | 0.007 |
| L0Learn | 0.530 | 0.528 | 0.497 | 0.993 | 0.993 | 0.522 |
BIC is the best method, including the truly active covariates 1-2 and excluding the truly inactive covariate 3 more frequently than the other methods. LASSO-BIC and L0Learn wrongly include the third covariate in roughly half of the simulations or more; that is, they incur a very high false positive rate, and the situation does not improve as \(n\) grows. Interestingly, the adaptive LASSO performs similarly to the LASSO for \(n=200\) and similarly to BIC for \(n=1000\).
Remark: LASSO-BIC uses BIC to select the best model, so how can its results be so different from BIC? The reason is that LASSO-BIC only evaluates the BIC for models visited by the LASSO path, and the latter may not include the data-generating model. Similar issues are encountered when setting the LASSO penalty with cross-validation (results not shown).
These findings are not a coincidence. There is theory explaining that the LASSO can be expected to select a superset of the truly active variables (the so-called screening property), but that this superset may contain false positives when the covariates are correlated in challenging ways (more precisely, when the so-called irrepresentability condition is not met; see Section 4.2). The adaptive LASSO consistently identifies the truly active covariates under less stringent conditions than the LASSO, as illustrated by our findings for \(n=1000\). Alas, in this example, when \(n=200\) its performance leaves much to be desired.
In contrast, \(L_0\) criteria like the BIC have better theoretical properties in such settings, and the same holds for Bayesian model selection methods. Admittedly, computation is harder, but given the potential improvements in variable selection we argue that one should at least try. As we shall see, finding a (possibly approximate) solution is often feasible. Following Tukey’s motto, we argue that an approximate solution to the right problem is preferable to an exact solution to the wrong problem.
1.2 Model uncertainty
A limitation of \(L_0\) criteria is that they ignore the uncertainty in the selected model: we are never completely sure that it is the correct one.
We use Bayesian model selection (BMS) to quantify that uncertainty.
BMS requires setting a prior distribution on the models and on the parameters for each model.
For now, we run modelSelection with default priors (Beta-Binomial prior on the models, product MOM prior with default prior precision on the coefficients).
Code
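A sketch of the folded chunk; the formula and the use of postProb to extract posterior model probabilities are assumptions, chosen to be consistent with the output below:

# Default priors: Beta-Binomial on the model space, product MOM on the coefficients
ms <- modelSelection(y ~ ., data=df, verbose=FALSE)
# Posterior model probabilities, sorted from highest to lowest
head(postProb(ms), 5)

The coef and predict calls used further below operate on the same fitted object ms.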
BMS provides posterior model probabilities, a Bayesian measure of the certainty that each model is the correct one. Below we show the five models with highest posterior probability. Only two models have a non-negligible posterior probability. The top model includes covariates 1-2 (the data-generating truth) and has posterior probability 0.64, whereas the second model includes all 3 covariates and has posterior probability 0.34. Given that covariate 3 is a near-copy of the outcome, and that we have a limited amount of data (\(n=200\)), it seems reasonable that we are not completely sure about which of these two models is the correct one.
## modelid family pp
## 7 1,2 normal 0.6438458284
## 8 1,2,3 normal 0.3442356139
## 2 3 normal 0.0114396003
## 6 1,3 normal 0.0004294816
## 4 2,3 normal 0.0000494759
coef gives Bayesian model averaging (BMA) point estimates, 0.95 posterior intervals, and marginal posterior inclusion probabilities \(P(\theta_j \neq 0 \mid y)\).
Below, phi refers to the error variance \(\sigma^2\) (which is included with probability 1).
In our example, BMS selects the right covariates and assigns high posterior probability to that solution, as one would ideally wish.
## estimate 2.5% 97.5% margpp
## intercept -0.03478042 -0.0496490 -0.0186861 1.0000000
## x1 0.98915072 0.4862229 1.2980217 0.9885109
## x2 -0.91515357 -1.2267667 -0.4173619 0.9881309
## x3 0.11173227 -0.2092983 0.6156765 0.3561542
## phi 0.90923160 0.7483423 1.1059294 1.0000000
Finally, we can use predict to obtain point predictions and 0.95 posterior predictive intervals.
## mean 2.5% 97.5%
## 1 -1.8887400 -2.10166202 -1.6974587
## 2 -0.4435298 -0.53110644 -0.3563479
## 3 0.9606666 0.81105334 1.1089939
## 4 -3.3575365 -3.69164069 -3.0255310
## 5 0.1268887 0.04994996 0.2085443
## 6 -0.2545186 -0.36481107 -0.1450838
1.3 Are computations always hard?
A common objection to Bayesian model selection and \(L_0\) criteria is that they’re computationally infeasible as the number of covariates \(p\) grows. It is true that the problem is worst-case NP-hard even in canonical linear regression (see for example Natarajan 1995; Cover and Van Campenhout 2007; Foster, Karloff, and Thaler 2015), in the sense that one can always design a dataset such that finding the best model according to these criteria has a cost that is exponential in \(p\). However, this is a worst-case statement: in practice one may be able to solve the problem quickly, in particular as \(n\) grows and under so-called sparsity constraints. These issues are discussed in Chapters 4 and (background-bms); for now we provide an example where \(p=200\) and \(n=200\). The first six covariates have non-zero effects \(\beta_1=0.5\), \(\beta_2=0.6\), …, \(\beta_{6}=1\), whereas \(\beta_7=\ldots = \beta_p=0\). The covariates are generated from a multivariate Gaussian with unit variances and 0.7 pairwise correlations, which is a moderately strong correlation structure.
Code
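A sketch of the simulation; the seed and the error standard deviation (taken to be 1) are assumptions, and rmvnorm from package mvtnorm is one way to draw the correlated covariates:

library(mvtnorm)
set.seed(1234)                                     # assumed seed
n <- 200; p <- 200
S <- matrix(0.7, nrow=p, ncol=p); diag(S) <- 1     # unit variances, 0.7 pairwise correlations
x <- rmvnorm(n, sigma=S)
beta <- c(seq(0.5, 1, length.out=6), rep(0, p-6))  # beta_1=0.5, ..., beta_6=1, rest zero
y <- as.vector(x %*% beta + rnorm(n))              # error sd assumed to be 1
df2 <- data.frame(y, x)                            # columns named y, X1, ..., X200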
Given that \(p\) is large here, we use the EBIC (Chen and Chen 2008), an extension of the BIC that’s better suited for high-dimensional problems. As shown below, optimization heuristics return an answer in under 0.5 seconds on my Ubuntu laptop. The selected model includes all the truly active covariates 1-6 and also covariate 104, which is a false positive. This is a fairly good answer.
## elapsed
## 0.322
## icfit object
##
## Model with best : X1 X2 X3 X4 X5 X6 X104
##
## Use summary(), coef() and predict() to get inference for the top model
## Use coef(object$msfit) and predict(object$msfit) to get BMA estimates and predictions
For comparison, the model with best EBIC within the LASSO path misses truly active covariates 1-2, and includes two false positives (covariates 62 and 68).
## icfit object
##
## Model with best EBIC : X3 X4 X5 X6 X62 X68
##
## Use summary(), coef() and predict() to get inference for the top model
## Use coef(object$msfit) and predict(object$msfit) to get BMA estimates and predictions
Bayesian model selection via MCMC took longer to run, but still under 3 seconds. The highest posterior probability model matches that selected by the EBIC. We also see that its posterior probability is small; that is, there isn’t strong Bayesian evidence that this is the correct model. This seems reasonable, given that \(p\) is large and that the covariates are strongly correlated.
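A sketch of the corresponding call, timed with system.time; the formula and the use of postProb are again assumptions, and df2 comes from the simulation sketch above:

# BMS with default priors; for p=200 the model space is explored by MCMC
system.time(ms2 <- modelSelection(y ~ ., data=df2, verbose=FALSE))
head(postProb(ms2), 6)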
## elapsed
## 2.743
## modelid family pp
## 1 1,2,3,4,5,6,104 normal 0.05810229
## 2 2,3,4,5,6,104,199 normal 0.03788968
## 3 2,3,4,5,6,160,199 normal 0.03515450
## 4 2,3,4,5,6,104 normal 0.02808176
## 5 3,4,5,6,40,160,199 normal 0.02629312
## 6 3,4,5,6,40,199 normal 0.02219088
BMA returns point estimates and 0.95 posterior probability intervals, plotted below. The posterior interval of each \(\beta_j\) contains the data-generating \(\beta_j\) in this example, and the point estimates for the truly zero \(\beta_j\)’s are quite close to zero.
Click to show code
Code
library(dplyr)    # for rename()
library(ggplot2)
b <- coef(ms)
# Drop the intercept and error variance rows, keeping the p regression coefficients
sel <- !(rownames(b) %in% c('intercept','phi'))
# Add the data-generating coefficients (vector beta from the simulation) and rename the interval columns
b <- data.frame(b[sel,]) |>
  transform(Covariate = 1:p, beta=beta) |>
  rename(cilow=X2.5., ciup=X97.5.)
# BMA point estimates and 0.95 intervals vs. the true beta; the diagonal marks perfect estimation
ggplot(b, aes(x=beta, y=estimate)) +
  geom_point() +
  geom_segment(aes(x=beta, y=cilow, yend=ciup)) +
  labs(y='BMA estimate') +
  geom_abline(slope=1)
Before we get too excited, let’s illustrate the Achilles heel of standard high-dimensional model selection methods. Their philosophy is to be conservative in terms of adding parameters to the model, with the hope that most parameters are truly zero (a sparsity assumption) and that the truly non-zero parameters are large enough (a so-called betamin assumption), or alternatively that the sample size \(n\) is large enough. If these assumptions are met, one hopes to be protected against false positive findings. The trade-off is that this conservative nature may result in low statistical power to detect parameters that are truly non-zero but small. We discuss these issues and some empirical Bayes strategies to ameliorate them in Chapter 7.
For now, we repeat the previous example, but with the size of the non-zero \(\beta_j\)’s divided by two. The consequence is that covariates 1-2 are no longer selected by the EBIC, the LASSO-EBIC or BMA. For BMA this is particularly problematic: the posterior probability that both \(\beta_1 \neq 0\) and \(\beta_2 \neq 0\) is very close to zero, that is, there is strong Bayesian evidence against both covariates having an effect. This example illustrates that BMA results should be interpreted as a conservative analysis, where weak evidence for inclusion may simply be due to low statistical power.
Code
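A sketch of the modification, reusing x, beta and n from the sketch above (the seed is again an assumption); the same EBIC, LASSO-EBIC and BMS analyses are then rerun on df3:

set.seed(1234)                                 # assumed seed
y <- as.vector(x %*% (beta/2) + rnorm(n))      # non-zero coefficients halved
df3 <- data.frame(y, x)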
## icfit object
##
## Model with best : X2 X4 X5 X6
##
## Use summary(), coef() and predict() to get inference for the top model
## Use coef(object$msfit) and predict(object$msfit) to get BMA estimates and predictions
## icfit object
##
## Model with best EBIC : X4 X5 X6
##
## Use summary(), coef() and predict() to get inference for the top model
## Use coef(object$msfit) and predict(object$msfit) to get BMA estimates and predictions
## modelid family pp
## 1 2,4,5,6 normal 0.13290704
## 2 2,4,6 normal 0.08583127
## 3 4,5,6 normal 0.06940551
## 4 4,6,41,79 normal 0.06035544
## 5 4,5,6,79 normal 0.03890375
## 6 2,4,6,188 normal 0.03487626
## estimate 2.5% 97.5% margpp
## intercept -0.0375930630 -0.08706008 0.008621087 1.000000000
## X1 0.0101992975 0.00000000 0.261326126 0.031640715
## X2 0.2219554513 0.00000000 0.686080193 0.419708119
## X3 0.0001352324 0.00000000 0.000000000 0.003587509
## X4 0.5580736261 0.00000000 0.858255775 0.915832362
## X5 0.2362663946 0.00000000 0.767583782 0.429613908
1.4 Logistic regression
For binary outcomes, we simply specify family='binomial' (and for Poisson we specify family='poisson').
We first create a binary version of our simulated outcome.
Code
We next use bestBIC and modelSelection as before. Model selection is now harder: bestBIC selects only x3, and under BMS the posterior probability of (wrongly) including x3 is higher than with the linear regression data. This is intuitively expected: binary outcomes carry less information than Gaussian ones, so there is more uncertainty about the chosen model.
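A sketch of the two calls; ybin denotes the binary outcome created in the folded chunk above (its construction is not shown), and the formula is an assumption consistent with the output below:

# df is assumed to have been augmented with the binary outcome ybin
# Best model by BIC for a logistic regression
fit.bin <- bestBIC(ybin ~ x1 + x2 + x3, data=df, family='binomial', verbose=FALSE)
fit.bin
# Bayesian model selection with default priors, logistic likelihood
ms.bin <- modelSelection(ybin ~ x1 + x2 + x3, data=df, family='binomial', verbose=FALSE)
coef(ms.bin)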
## icfit object
##
## Model with best BIC : x3
##
## Use summary(), coef() and predict() to get inference for the top model
## Use coef(object$msfit) and predict(object$msfit) to get BMA estimates and predictions
Code
## Warning in hasPostSampling(object): Exact posterior sampling not implemented,
## using Normal approx instead
## estimate 2.5% 97.5% margpp
## (Intercept) 0.07790624 0.04125866 0.1220876 1.467939e-09
## x1 1.00831091 0.00000000 1.9425570 7.066122e-01
## x2 -0.90890993 -1.83924714 0.0000000 6.371661e-01
## x3 0.48760422 0.00000000 1.6393071 3.805434e-01
1.5 Non-linear effects via Generalized Additive Models (GAMs)
Non-linear effects can be modeled via cubic splines using the smooth argument (the default is 9 knots, producing 5 columns in the design matrix for each non-linear covariate).
When using the smooth argument we cannot use the ~ . notation to include all covariates; rather, we must list the covariates for which we wish to include a non-linear effect (see the example below).
The effect of each covariate is decomposed as a linear part plus a deviation from linearity (which is forced to be orthogonal to the linear term). This is useful to identify covariates for which a linear effect is sufficient, and covariates for which there are non-linearities.
modelSelection considers 3 possibilities for each covariate: excluding it entirely, including only the linear effect, and including both linear and non-linear terms. For further details on this decomposition, see D. Rossell and Rubio (2021).
The linear effect coefficients are displayed using the original variable names, and the non-linear coefficients with an .s appended. Here there are 5 columns coding for each non-linear effect, labelled .s1 through .s5.
In our example, there is strong evidence for (correctly) including the linear effects of x1 and x2 and for excluding all non-linear effects; the evidence for excluding x3 entirely is weaker, its marginal inclusion probability being close to 0.5.
Code
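A sketch of the call: the smooth argument is described in the text, but its exact format (here assumed to be a one-sided formula listing the covariates that receive spline terms) should be checked against the modelSelection documentation:

# Linear terms for x1, x2, x3 plus spline deviations from linearity (default 9 knots)
ms.gam <- modelSelection(y ~ x1 + x2 + x3, smooth = ~ x1 + x2 + x3,
                         data=df, verbose=FALSE)
coef(ms.gam)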
## Warning in hasPostSampling(object): Exact posterior sampling not implemented,
## using Normal approx instead
## estimate 2.5% 97.5% margpp
## (Intercept) 4.364227e-02 0.007302052 0.08272408 0.0044805633
## x1 8.418331e-01 0.000000000 1.42763119 0.8737221540
## x2 -7.344687e-01 -1.032236497 0.00000000 0.8689907661
## x3 2.414028e-01 0.000000000 1.03223704 0.5798951778
## x1.s1 -9.202478e-05 0.000000000 0.00000000 0.0001142201
## x1.s2 4.187202e-06 0.000000000 0.00000000 0.0001142201
## x1.s3 -1.367765e-04 0.000000000 0.00000000 0.0001142201
## x1.s4 4.712374e-05 0.000000000 0.00000000 0.0001142201
## x1.s5 -8.874937e-06 0.000000000 0.00000000 0.0001142201
## x2.s1 0.000000e+00 0.000000000 0.00000000 0.0001233372
## x2.s2 0.000000e+00 0.000000000 0.00000000 0.0001233372
## x2.s3 0.000000e+00 0.000000000 0.00000000 0.0001233372
## x2.s4 0.000000e+00 0.000000000 0.00000000 0.0001233372
## x2.s5 0.000000e+00 0.000000000 0.00000000 0.0001233372
## x3.s1 0.000000e+00 0.000000000 0.00000000 0.0001235762
## x3.s2 0.000000e+00 0.000000000 0.00000000 0.0001235762
## x3.s3 0.000000e+00 0.000000000 0.00000000 0.0001235762
## x3.s4 0.000000e+00 0.000000000 0.00000000 0.0001235762
## x3.s5 0.000000e+00 0.000000000 0.00000000 0.0001235762
## phi 9.228170e-01 0.756447557 1.12147546 1.0000000000