1 Quick start

Before discussing high-dimensional model selection methods in earnest, we show some examples to motivate their utility and to provide a quick start for readers who are already familiar with the basics. The examples also illustrate potential pitfalls that are not always sufficiently appreciated.

We focus on sparse inference methods, mainly Bayesian model selection (BMS) and averaging (BMA), \(L_0\) criteria like the BIC or EBIC, and \(L_1\) penalization (the LASSO and adaptive LASSO). For the illustrations we use the R package modelSelection. The main functions for regression-type models are modelSelection and bestBIC (along with companions like bestEBIC, bestAIC and bestIC). modelSelection_eBayes provides an empirical Bayes counterpart to modelSelection.

1.1 Popular methods may fail

We consider a simple setting with \(p=3\) covariates and a moderately large sample size \(n=200\), where popular high-dimensional methods fail to recover the truly active covariates. It may feel underwhelming to start a book on high-dimensional model selection with a \(p=3\) example: the message is that if things can go wrong here, they are bound to get worse for larger \(p\). By understanding what can go wrong in such a manageable example, we hope to learn lessons that remain valuable in harder ones, and to better understand what different methods bring to the table.

Suppose that data are truly generated as follows: \[\begin{align*} x_{i1} & \sim N(0,1) \\ x_{i2} & \sim N(0,1) \\ x_{i3} & \sim N(x_{i1} - x_{i2}, \sigma=0.2) \\ y_i & \sim N(x_{i1} - x_{i2}, \sigma=1) \end{align*}\] with independence across \(i=1,\ldots,n\). That is, \(y_i\) truly depends on \(x_{i1}\) and \(x_{i2}\) but not on \(x_{i3}\). The complicating issue is that \(x_{i3}\) is a near-copy of the mean of \(y_i\), and hence the marginal correlation between \(x_{i3}\) and \(y_i\) is high.

Code
library(modelSelection)
library(glmnet)
library(tidyverse)
library(mvtnorm)
set.seed(1234) 
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 - x2 + rnorm(n,sd=.2)
x <- cbind(x1, x2, x3)
y <- x1 - x2 + rnorm(n)
df <- data.frame(y, x1, x2, x3)

To show that this is actually an easy exercise, we fit the model including all covariates via least-squares. The results correctly suggest that the third covariate could conceivably be dropped, whereas covariates 1-2 are highly statistically significant.

Code
summary(lm(y ~ -1 + ., data=df))
## 
## Call:
## lm(formula = y ~ -1 + ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.83454 -0.64344 -0.05624  0.60601  2.49699 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x1  1.13570    0.33100   3.431 0.000732 ***
## x2 -1.05736    0.33124  -3.192 0.001644 ** 
## x3 -0.02529    0.32816  -0.077 0.938637    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9501 on 197 degrees of freedom
## Multiple R-squared:  0.705,  Adjusted R-squared:  0.7005 
## F-statistic: 156.9 on 3 and 197 DF,  p-value: < 2.2e-16

The issue is that when the number of covariates \(p\) is large, possibly larger than the sample size \(n\), fitting a model that includes all parameters is often less reliable (the parameter estimates are less precise). High-dimensional methods use other strategies to identify the truly active covariates. A popular strategy is the LASSO (or \(L_1\) regularization) of Tibshirani (1996), which can quickly obtain parameter estimates for all values of its regularization (or penalty) hyper-parameter. As one decreases the penalty, the LASSO solution moves from the model excluding all covariates to first including \(x_{i3}\) (the variable most correlated with \(y_i\)), and subsequently to adding \(x_{i1}\) and \(x_{i2}\), without ever dropping \(x_{i3}\). That is, the data-generating model is not recovered for any value of the penalization parameter.

Code
fit <- glmnet(x=x, y=y, family='gaussian', alpha=1, intercept=FALSE)
#Unique sets of non-zero coefficients visited along the LASSO path
models <- unique(t(as.matrix(coef(fit)) != 0))
models
##     (Intercept)    x1    x2    x3
## s0        FALSE FALSE FALSE FALSE
## s1        FALSE FALSE FALSE  TRUE
## s22       FALSE  TRUE FALSE  TRUE
## s34       FALSE  TRUE  TRUE  TRUE

In contrast, the Bayesian information criterion (BIC) of Schwarz (1978) (see Section 4.1) correctly identifies the data-generating model. bestBIC enumerates all \(2^3=8\) models resulting from the inclusion/exclusion of the 3 covariates, and finds the model with the best (lowest) BIC. We use the R formula y ~ -1 + . to consider all variables in the data (other than y) as covariates, with -1 excluding the intercept (not needed in this example). We set verbose=FALSE to avoid printing progress information and cluttering the book.

When \(p\) is large, bestBIC uses MCMC to explore the model space. A computationally faster alternative is to use heuristics, for example finding the model with the best BIC among those visited by the LASSO or adaptive LASSO path (Zou 2006), or by coordinate descent algorithms with local combinatorial searches (Hazimeh, Mazumder, and Nonet 2023; R package L0Learn); see Section 4. Function bestBIC_fast in modelSelection implements such heuristics.
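For instance, on the current data the heuristic searches can be invoked as follows (a minimal sketch; these are the same bestBIC_fast calls that appear in the simulation code later in this section, and we omit their output here).

Code
#Heuristic searches: best BIC among models visited by the LASSO path, the
#adaptive LASSO path, and the L0Learn local combinatorial search
fit.l1 <- bestBIC_fast(y ~ -1 + ., data=df, fastmethod='L1', verbose=FALSE)
fit.adal1 <- bestBIC_fast(y ~ -1 + ., data=df, fastmethod='adaptiveL1', verbose=FALSE)
fit.l0 <- bestBIC_fast(y ~ -1 + ., data=df, fastmethod='L0Learn', verbose=FALSE)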

Code
fit <- bestBIC(y ~ -1 + ., data=df, verbose=FALSE)
print(fit)
## icfit object
## 
## Model with best  BIC :  x1 x2 
## 
## Use summary(), coef() and predict() to get inference for the top model
## Use coef(object$msfit) and predict(object$msfit) to get BMA estimates and predictions

Fitting the selected model via least-squares shows that both covariates are highly statistically significant.

Code
summary(fit)
## 
## Call:
## glm(formula = f, family = family2glm(ms$family), data = data)
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x1  1.11070    0.06605   16.82   <2e-16 ***
## x2 -1.03235    0.06688  -15.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.8981151)
## 
##     Null deviance: 602.82  on 200  degrees of freedom
## Residual deviance: 177.83  on 198  degrees of freedom
## AIC: 550.07
## 
## Number of Fisher Scoring iterations: 2

Above we used a single simulated dataset for illustration, but this phenomenon occurs more generally. The table below shows the proportion of times that each covariate was selected across 1,000 simulations, for \(n=200\) and \(n=1000\), when using the BIC, the LASSO and the adaptive LASSO (penalization parameter chosen by BIC for the latter two). It also shows results for the fast model search heuristic of Hazimeh, Mazumder, and Nonet (2023) (R package L0Learn). The code to reproduce these results can be unfolded by clicking below.

Click to show code
Code
#Function to simulate one dataset of size n
simdata <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  x3 <- x1 - x2 + rnorm(n,sd=.2)
  x <- cbind(x1, x2, x3)
  y <- x1 - x2 + rnorm(n)
  df <- data.frame(y, x1, x2, x3)
  return(df)
}

#Function to simulate nsims datasets of size n and obtain
#the proportion of times that each covariate is selected
proportionSelected <- function(nsims, n) {
  sel.L0Learn <- sel.L1 <- sel.adaL1 <- sel.bic <- matrix(NA, nrow=nsims, ncol=3)
  for (i in 1:nsims) {
    df <- simdata(n)
    fit.L0Learn <- bestBIC_fast(y ~ -1 + ., data=df, fastmethod='L0Learn', verbose=FALSE)
    fit.L1 <- bestBIC_fast(y ~ -1 + ., data=df, fastmethod='L1', verbose=FALSE)
    fit.adaL1 <- bestBIC_fast(y ~ -1 + ., data=df, fastmethod='adaptiveL1', verbose=FALSE)
    fit.bic <- bestBIC(y ~ -1 + ., data=df, verbose=FALSE)
    sel.L0Learn[i,] <- fit.L0Learn$varnames %in% fit.L0Learn$topmodel  
    sel.L1[i,] <- fit.L1$varnames %in% fit.L1$topmodel  
    sel.adaL1[i,] <- fit.adaL1$varnames %in% fit.adaL1$topmodel  
    sel.bic[i,] <- fit.bic$varnames %in% fit.bic$topmodel  
  }
  nn <- c('BIC', 'LASSO BIC', 'ALASSO BIC', 'L0Learn')
  propsel <- rbind(colMeans(sel.bic), colMeans(sel.L1), colMeans(sel.adaL1), colMeans(sel.L0Learn))
  propsel <- data.frame(Method=nn, propsel)
  return(propsel)
}

propsel.n200 <- proportionSelected(nsims=1000, n=200)
propsel.n1000 <- proportionSelected(nsims=1000, n=1000)
propsel.n200$n <- 200
propsel.n1000$n <- 1000

propsel <- rbind(propsel.n200, propsel.n1000)
write.table(propsel, "tables/lassofailure.txt", row.names=FALSE, col.names=TRUE, sep="\t")
                    n = 200                 n = 1,000
Method           x1     x2     x3        x1     x2     x3
BIC            0.721  0.722  0.291     0.994  0.994  0.006
LASSO BIC      0.491  0.489  0.547     0.993  0.993  0.637
ALASSO BIC     0.491  0.489  0.531     0.993  0.993  0.007
L0Learn        0.530  0.528  0.497     0.993  0.993  0.522

BIC is the best method, including the truly active covariates 1-2 and excluding the truly inactive covariate 3 more frequently than the other methods. LASSO-BIC and L0Learn wrongly include the third covariate in roughly 50% of the simulations or more; that is, they incur a very high false positive rate, and the situation does not improve as \(n\) grows. Interestingly, the adaptive LASSO performs similarly to the LASSO for \(n=200\) and similarly to BIC for \(n=1,000\).

Remark: LASSO-BIC uses the BIC to select the best model, so how can its results differ so much from those of BIC? The reason is that LASSO-BIC only evaluates the BIC for the models visited by the LASSO path, and that path may not include the data-generating model. Similar issues arise when setting the LASSO penalty by cross-validation (results not shown).
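To make the remark concrete, one can recompute the BIC by hand for each model visited by the LASSO path and compare it with the BIC of the data-generating model containing x1 and x2 (a minimal sketch using base R's lm and BIC; the variables x, y, x1, x2 are those simulated above).

Code
#Recompute the BIC for the models visited by the LASSO path
fit.lasso <- glmnet(x=x, y=y, family='gaussian', alpha=1, intercept=FALSE)
visited <- unique(t(as.matrix(coef(fit.lasso)) != 0))[, -1]  #drop intercept column
bic.path <- apply(visited, 1, function(sel) {
  if (!any(sel)) return(BIC(lm(y ~ 0)))  #empty model
  BIC(lm(y ~ -1 + x[, sel, drop=FALSE]))
})
min(bic.path)              #best BIC attainable along the LASSO path
BIC(lm(y ~ -1 + x1 + x2))  #data-generating model, never visited by the path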

These findings are not a coincidence. Theory for the LASSO shows that it can be expected to select a super-set of the truly active variables (the so-called screening property), but that this super-set may contain false positives when the covariates are correlated in challenging ways (more precisely, when the so-called irrepresentability condition is not met; see Section 4.2). The adaptive LASSO consistently identifies the truly active covariates under less stringent conditions than the LASSO, as illustrated by our findings for \(n=1,000\). Alas, in this example its performance for \(n=200\) leaves much to be desired.
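As a rough empirical check (a sketch; the formal condition and notation are given in Section 4.2), one can compute the irrepresentability quantity for this design from the sample covariance of the covariates and the signs of the true coefficients.

Code
#Empirical check of the irrepresentability quantity. S = truly active set {x1, x2}
S <- c(1, 2); Sc <- 3
Sigma <- cov(x)   #sample covariance of the covariates
signS <- c(1, -1) #signs of the true coefficients of x1, x2
irr <- abs(Sigma[Sc, S, drop=FALSE] %*% solve(Sigma[S, S]) %*% signS)
irr               #values above 1 indicate the condition fails for this design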

In contrast, \(L_0\) criteria like the BIC have better theoretical properties in such settings, and the same holds for Bayesian model selection methods. Admittedly the computation is harder, but given the potential improvements in variable selection we argue that one should at least try. As we shall see, finding a (possibly approximate) solution is often feasible. Following Tukey's motto, an approximate solution to the right problem is preferable to an exact solution to the wrong problem.

1.2 Model uncertainty

A limitation of \(L_0\) criteria is that they ignore the uncertainty in the selected model: we are never completely sure that it is the correct one. Bayesian model selection (BMS) assesses that uncertainty, in a Bayesian sense. BMS requires setting a prior distribution on the models and on the parameters within each model. For now, we run modelSelection with its default priors (a Beta-Binomial prior on the models and a product MOM prior, with default prior precision, on the coefficients).

Code
priorCoef <- momprior()
priorModel <- modelbbprior()
fit.bms <- modelSelection(y ~ -1 + ., data=df,
                     priorCoef=priorCoef, priorModel=priorModel,
                     verbose=FALSE)

BMS provides posterior model probabilities, a Bayesian measure of the certainty that each model is the correct one. Below we show the five models with the highest posterior probability. Only two models have non-negligible posterior probability. The top model includes covariates 1-2 (the data-generating truth) and has posterior probability 0.64, whereas the second model includes all 3 covariates and has posterior probability 0.34. Given that covariate 3 is a near-copy of the outcome's mean, and that we have a limited amount of data (\(n=200\)), it seems reasonable not to be completely sure about which of these two models is the correct one.

Code
postProb(fit.bms)[1:5,]
##   modelid family           pp
## 7     1,2 normal 0.6438458284
## 8   1,2,3 normal 0.3442356139
## 2       3 normal 0.0114396003
## 6     1,3 normal 0.0004294816
## 4     2,3 normal 0.0000494759

coef gives Bayesian model averaging (BMA) point estimates, 0.95 posterior intervals, and marginal posterior inclusion probabilities \(P(\theta_j \neq 0 \mid y)\). Below, phi refers to the error variance \(\sigma^2\) (which is included with probability 1). In our example, BMS selects the right covariates and assigns high posterior probability to that solution, as one would ideally wish.

Code
coef(fit.bms)
##              estimate       2.5%      97.5%    margpp
## intercept -0.03478042 -0.0496490 -0.0186861 1.0000000
## x1         0.98915072  0.4862229  1.2980217 0.9885109
## x2        -0.91515357 -1.2267667 -0.4173619 0.9881309
## x3         0.11173227 -0.2092983  0.6156765 0.3561542
## phi        0.90923160  0.7483423  1.1059294 1.0000000

Finally, we can use predict to obtain point predictions and 0.95 posterior predictive intervals.

Code
ypred <- predict(fit.bms)
head(ypred)
##         mean        2.5%      97.5%
## 1 -1.8887400 -2.10166202 -1.6974587
## 2 -0.4435298 -0.53110644 -0.3563479
## 3  0.9606666  0.81105334  1.1089939
## 4 -3.3575365 -3.69164069 -3.0255310
## 5  0.1268887  0.04994996  0.2085443
## 6 -0.2545186 -0.36481107 -0.1450838
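As a quick sanity check (a sketch; it assumes that ypred is a matrix whose first column, mean, holds the point predictions, as the output above suggests), one can compare the BMA predictions with the true regression means \(x_{i1} - x_{i2}\).

Code
#Compare BMA point predictions to the true means x1 - x2
truemean <- x1 - x2
cor(ypred[,'mean'], truemean)
mean((ypred[,'mean'] - truemean)^2)  #mean squared error relative to the truth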

1.3 Are computations always hard?

A common objection to Bayesian model selection and \(L_0\) criteria is that they are computationally infeasible as the number of covariates \(p\) grows. It is true that the problem is worst-case NP-hard even in canonical linear regression (see for example Natarajan 1995; Cover and Van Campenhout 1977; Foster, Karloff, and Thaler 2015), in the sense that one can always design a dataset for which finding the best model according to these criteria has a cost that is exponential in \(p\). However, this is a worst-case scenario: in practice one may be able to solve the problem quickly, in particular as \(n\) grows and under so-called sparsity constraints. These issues are discussed in Chapters 4 and (background-bms); for now we provide an example where \(p=200\) and \(n=200\). The first six covariates have non-zero effects \(\beta_1=0.5\), \(\beta_2=0.6\), …, \(\beta_{6}=1\), whereas \(\beta_7=\ldots=\beta_p=0\). The covariates are generated from a multivariate Gaussian with unit variances and 0.7 pairwise correlations, a moderately strong correlation structure.

Code
n <- p <- 200
#Covariate covariance: unit variances, 0.7 pairwise correlations
sigma <- diag(p)
sigma[upper.tri(sigma)] <- sigma[lower.tri(sigma)] <- 0.7
x <- rmvnorm(n, sigma=sigma)
#First 6 coefficients are 0.5, 0.6, ..., 1; the rest are 0
beta <- matrix(c(seq(0.5,1,by=0.1), rep(0,p-6)), ncol=1)
y <- rnorm(n, x %*% beta)
df <- data.frame(y=y, x)

Given that \(p\) is large, here we use the EBIC (Chen and Chen 2008), an extension of the BIC that is better suited for high-dimensional problems. As shown below, the optimization heuristics return an answer in under 0.5 seconds on my Ubuntu laptop. The selected model includes all the truly active covariates 1-6 plus covariate 104, which is a false positive. This is a fairly good answer.

Code
system.time(fit.ebicfast <- bestEBIC_fast(y ~ ., data=df, verbose=FALSE))['elapsed']
## elapsed 
##   0.322
Code
fit.ebicfast
## icfit object
## 
## Model with best  :  X1 X2 X3 X4 X5 X6 X104 
## 
## Use summary(), coef() and predict() to get inference for the top model
## Use coef(object$msfit) and predict(object$msfit) to get BMA estimates and predictions

For comparison, the model with best EBIC within the LASSO path misses truly active covariates 1-2, and includes two false positives (covariates 62 and 68).

Code
bestEBIC_fast(y ~ ., data=df, fastmethod='L1', verbose=FALSE)
## icfit object
## 
## Model with best  EBIC :  X3 X4 X5 X6 X62 X68 
## 
## Use summary(), coef() and predict() to get inference for the top model
## Use coef(object$msfit) and predict(object$msfit) to get BMA estimates and predictions

Bayesian model selection via MCMC took longer to run, but still under 3 seconds. The highest posterior probability model matches the one selected by the EBIC. We also see that its posterior probability is small; that is, there is not strong Bayesian evidence that this is the correct model. This seems reasonable, given that \(p\) is large and the covariates are strongly correlated.

Code
system.time(ms <- modelSelection(y ~ -1 + ., data=df, verbose=FALSE))['elapsed']
## elapsed 
##   2.743
Code
head(postProb(ms))
##              modelid family         pp
## 1    1,2,3,4,5,6,104 normal 0.05810229
## 2  2,3,4,5,6,104,199 normal 0.03788968
## 3  2,3,4,5,6,160,199 normal 0.03515450
## 4      2,3,4,5,6,104 normal 0.02808176
## 5 3,4,5,6,40,160,199 normal 0.02629312
## 6     3,4,5,6,40,199 normal 0.02219088

BMA returns point estimates and 0.95 posterior probability intervals, plotted below. The posterior interval of each \(\beta_j\) contains the data-generating \(\beta_j\) in this example, and the point estimates for the truly zero \(\beta_j\)’s are quite close to zero.

Click to show code
Code
b <- coef(ms)
sel <- !(rownames(b) %in% c('intercept','phi'))
b <- data.frame(b[sel,]) |>
  transform(Covariate = 1:p, beta=beta) |>
  rename(cilow=X2.5., ciup=X97.5.)

ggplot(b, aes(x=beta, y=estimate)) +
  geom_point() +
  geom_segment(aes(x=beta, y=cilow, yend=ciup)) +
  labs(y='BMA estimate') +
  geom_abline(slope=1)

Before we get too excited, let's illustrate the Achilles heel of standard high-dimensional model selection methods. Their philosophy is to be conservative about adding parameters to the model, in the hope that most parameters are truly zero (a sparsity assumption) and that the truly non-zero parameters are large enough (a so-called beta-min assumption), or alternatively that the sample size \(n\) is large enough. If these assumptions are met, one hopes to be protected against false positive findings. The trade-off is that this conservative nature may result in low statistical power to detect parameters that are truly non-zero but small. We discuss these issues, and some empirical Bayes strategies to ameliorate them, in Chapter 7.

We now repeat the previous example, but with the non-zero \(\beta_j\)'s halved. The consequence is that some truly active covariates are no longer selected: the EBIC misses covariates 1 and 3, LASSO-EBIC additionally misses covariate 2, and the top BMA model also excludes covariates 1 and 3. For BMA this is particularly problematic, because the posterior probabilities assigned to \(\beta_1 \neq 0\) and \(\beta_3 \neq 0\) are very close to zero; that is, there is seemingly strong Bayesian evidence that neither covariate has an effect. This example illustrates that BMA results should be interpreted as a conservative analysis, where weak evidence for inclusion may simply be due to low statistical power.

Code
beta <- beta/2
y <- rnorm(n, x %*% beta)
df <- data.frame(y=y, x)
bestEBIC_fast(y ~ ., data=df, verbose=FALSE)
## icfit object
## 
## Model with best  :  X2 X4 X5 X6 
## 
## Use summary(), coef() and predict() to get inference for the top model
## Use coef(object$msfit) and predict(object$msfit) to get BMA estimates and predictions
Code
bestEBIC_fast(y ~ ., data=df, fastmethod='L1', verbose=FALSE)
## icfit object
## 
## Model with best  EBIC :  X4 X5 X6 
## 
## Use summary(), coef() and predict() to get inference for the top model
## Use coef(object$msfit) and predict(object$msfit) to get BMA estimates and predictions
Code
ms <- modelSelection(y ~ -1 + ., data=df, verbose=FALSE)
head(postProb(ms))
##     modelid family         pp
## 1   2,4,5,6 normal 0.13290704
## 2     2,4,6 normal 0.08583127
## 3     4,5,6 normal 0.06940551
## 4 4,6,41,79 normal 0.06035544
## 5  4,5,6,79 normal 0.03890375
## 6 2,4,6,188 normal 0.03487626
Code
b <- coef(ms)
head(b)
##                estimate        2.5%       97.5%      margpp
## intercept -0.0375930630 -0.08706008 0.008621087 1.000000000
## X1         0.0101992975  0.00000000 0.261326126 0.031640715
## X2         0.2219554513  0.00000000 0.686080193 0.419708119
## X3         0.0001352324  0.00000000 0.000000000 0.003587509
## X4         0.5580736261  0.00000000 0.858255775 0.915832362
## X5         0.2362663946  0.00000000 0.767583782 0.429613908
Click to show code
Code
sel <- !(rownames(b) %in% c('intercept','phi'))
b <- data.frame(b[sel,]) |>
  transform(Covariate = 1:p, beta=beta) |>
  rename(cilow=X2.5., ciup=X97.5.)

ggplot(b, aes(x=beta, y=estimate)) +
  geom_point() +
  geom_segment(aes(x=beta, y=cilow, yend=ciup)) +
  labs(y='BMA estimate') +
  geom_abline(slope=1)

1.4 Logistic regression

For binary outcomes, we simply specify family='binomial' (and for Poisson outcomes, family='poisson'). We first simulate data as before and create a binary version of the outcome.

Code
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 - x2 + rnorm(n,sd=.2)
x <- cbind(x1, x2, x3)
y <- x1 - x2 + rnorm(n)
df <- data.frame(y, x1, x2, x3)

dfbin <- transform(df, ybin = (y > 0)) |>
  dplyr::select(!y) #drop variable y

We next use bestBIC and modelSelection as before. The binary outcome makes the problem harder: bestBIC now selects the wrong model (covariate 3 alone), and under BMS the posterior probability of (wrongly) including x3 is higher than with the Gaussian outcome, while the inclusion probabilities of the truly active x1 and x2 are lower. This is intuitively expected: binary outcomes carry less information than Gaussian ones, so there is more uncertainty about the chosen model.

Code
fit2.bic <- bestBIC(ybin ~ ., family='binomial', data=dfbin, verbose=FALSE)
print(fit2.bic)
## icfit object
## 
## Model with best  BIC :  x3 
## 
## Use summary(), coef() and predict() to get inference for the top model
## Use coef(object$msfit) and predict(object$msfit) to get BMA estimates and predictions
Code
fit2.bms <- modelSelection(ybin ~ ., data=dfbin,
                     priorCoef=priorCoef, priorModel=priorModel,
                     family='binomial', verbose=FALSE)
coef(fit2.bms)
## Warning in hasPostSampling(object): Exact posterior sampling not implemented,
## using Normal approx instead
##                estimate        2.5%     97.5%       margpp
## (Intercept)  0.07790624  0.04125866 0.1220876 1.467939e-09
## x1           1.00831091  0.00000000 1.9425570 7.066122e-01
## x2          -0.90890993 -1.83924714 0.0000000 6.371661e-01
## x3           0.48760422  0.00000000 1.6393071 3.805434e-01
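The Poisson case mentioned at the start of this section works analogously: we simply set family='poisson'. A minimal sketch (the Poisson outcome ypois and its rates below are invented purely for illustration, and the output is not shown):

Code
#Sketch: Poisson-outcome analogue, setting family='poisson' in the same calls
dfpois <- transform(df, ypois = rpois(n, exp(0.5*x1 - 0.5*x2))) |>
  dplyr::select(!y)  #drop the Gaussian outcome
fitpois.bic <- bestBIC(ypois ~ ., family='poisson', data=dfpois, verbose=FALSE)
fitpois.bms <- modelSelection(ypois ~ ., data=dfpois, priorCoef=priorCoef,
                              priorModel=priorModel, family='poisson', verbose=FALSE)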

1.5 Non-linear effects via Generalized Additive Models (GAMs)

Non-linear effects can be modeled via cubic splines using the smooth argument (the default is 9 knots, producing 5 columns in the design matrix for each non-linear covariate). When using the smooth argument we cannot use the ~ . notation to include all covariates; rather, we must list those for which we wish to include a non-linear effect (see the example below). The effect of each covariate is decomposed into a linear part plus a deviation from linearity (which is forced to be orthogonal to the linear term). This is useful to identify covariates for which a linear effect suffices, and covariates for which there are non-linearities. modelSelection considers 3 possibilities for each covariate: excluding it entirely, including only the linear effect, and including both linear and non-linear terms. For further details on this decomposition, see D. Rossell and Rubio (2021).

The linear effect coefficients are displayed using the original variable names, and the non-linear coefficients with an .s appended. Here we have 5 columns coding for each non-linear effect, labelled .s1 through .s5. In our example, there is high posterior probability for (correctly) including the linear effects of x1 and x2 (around 0.87 each) and essentially no support for any non-linear terms; the evidence about x3 is more ambiguous, its linear effect having an inclusion probability of about 0.58.

Code
fit.gam <- modelSelection(y ~ ., data=df,
                     smooth = ~ x1 + x2 + x3,
                     priorCoef=priorCoef,
                     priorModel=priorModel, verbose=FALSE)
Code
coef(fit.gam)
## Warning in hasPostSampling(object): Exact posterior sampling not implemented,
## using Normal approx instead
##                  estimate         2.5%      97.5%       margpp
## (Intercept)  4.364227e-02  0.007302052 0.08272408 0.0044805633
## x1           8.418331e-01  0.000000000 1.42763119 0.8737221540
## x2          -7.344687e-01 -1.032236497 0.00000000 0.8689907661
## x3           2.414028e-01  0.000000000 1.03223704 0.5798951778
## x1.s1       -9.202478e-05  0.000000000 0.00000000 0.0001142201
## x1.s2        4.187202e-06  0.000000000 0.00000000 0.0001142201
## x1.s3       -1.367765e-04  0.000000000 0.00000000 0.0001142201
## x1.s4        4.712374e-05  0.000000000 0.00000000 0.0001142201
## x1.s5       -8.874937e-06  0.000000000 0.00000000 0.0001142201
## x2.s1        0.000000e+00  0.000000000 0.00000000 0.0001233372
## x2.s2        0.000000e+00  0.000000000 0.00000000 0.0001233372
## x2.s3        0.000000e+00  0.000000000 0.00000000 0.0001233372
## x2.s4        0.000000e+00  0.000000000 0.00000000 0.0001233372
## x2.s5        0.000000e+00  0.000000000 0.00000000 0.0001233372
## x3.s1        0.000000e+00  0.000000000 0.00000000 0.0001235762
## x3.s2        0.000000e+00  0.000000000 0.00000000 0.0001235762
## x3.s3        0.000000e+00  0.000000000 0.00000000 0.0001235762
## x3.s4        0.000000e+00  0.000000000 0.00000000 0.0001235762
## x3.s5        0.000000e+00  0.000000000 0.00000000 0.0001235762
## phi          9.228170e-01  0.756447557 1.12147546 1.0000000000
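If we only suspected non-linearity in x1, say, we would list only x1 in the smooth formula, so that x2 and x3 enter the main formula as candidate linear effects only. A minimal sketch of the same interface (output not shown):

Code
#Sketch: request a non-linear (spline) effect only for x1; x2 and x3 are
#considered as candidate linear effects only
fit.gam1 <- modelSelection(y ~ ., data=df, smooth = ~ x1,
                           priorCoef=priorCoef, priorModel=priorModel,
                           verbose=FALSE)
coef(fit.gam1)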

References

Chen, J., and Z. Chen. 2008. “Extended Bayesian Information Criteria for Model Selection with Large Model Spaces.” Biometrika 95 (3): 759–71.
Cover, T. M., and J. M. Van Campenhout. 1977. “On the Possible Orderings in the Measurement Selection Problem.” IEEE Transactions on Systems, Man, and Cybernetics 7 (9): 657–61.
Foster, D., H. Karloff, and J. Thaler. 2015. “Variable Selection Is Hard.” In Conference on Learning Theory, 696–709.
Hazimeh, H., R. Mazumder, and T. Nonet. 2023. “L0Learn: A Scalable Package for Sparse Learning Using L0 Regularization.” Journal of Machine Learning Research 24 (205): 1–8.
Natarajan, B. K. 1995. “Sparse Approximate Solutions to Linear Systems.” SIAM Journal on Computing 24 (2): 227–34.
Rossell, D., and F. J. Rubio. 2021. “Additive Bayesian Variable Selection Under Censoring and Misspecification.” Statistical Science 38 (1): 13–29.
Schwarz, G. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics 6: 461–64.
Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society, B 58: 267–88.
Zou, H. 2006. “The Adaptive LASSO and Its Oracle Properties.” Journal of the American Statistical Association 101 (476): 1418–29.