1 Quick start

The main functions for regression-type models are modelSelection and bestBIC (along with companions like bestEBIC, bestAIC and bestIC). Details are in subsequent sections. Below we illustrate how to obtain:

  • Information criteria for all models (including those from MCMC exploration)
  • Posterior model probabilities
  • Marginal posterior inclusion probabilities
  • BMA point estimates and posterior intervals

See modelSelectionGGM for Gaussian graphical models and bfnormmix for Gaussian mixture models.

1.1 Linear regression

We simulate linear regression data \[ y_i = \sum_{j=1}^p x_{ij} \theta_j + \epsilon_i, \] for \(p=3\) covariates and \(i=1,\ldots,100\) individuals. We set the regression coefficients to \(\theta_1= 1\), \(\theta_2= 1\), \(\theta_3=0\), take \(\epsilon_i \sim N(0,1)\), and fix the random number seed for reproducibility. It is good practice to store the outcome and covariates in a data frame, as we do below.

library(modelSelection)
set.seed(1234)
x <- matrix(rnorm(100*3), nrow=100, ncol=3)
theta <- matrix(c(1,1,0), ncol=1)
y <- x %*% theta + rnorm(100)
df <- data.frame(y, x)

1.1.1 L0 criteria

bestBIC obtains the BIC for all models. As usual in R we can use formulas like y ~ X1 + X2 + X3 (provided the variables are stored in a data frame), or simply y ~ . to consider all the variables in the data (other than y) as covariates. An intercept is added automatically, giving a total of 4 variables and \(2^4=16\) possible models (the intercept can be removed by adding -1 to the formula, as usual). bestBIC enumerates these 16 models and finds the model with the best (lowest) BIC (when there are many covariates, MCMC is used to explore the model space). A much faster alternative is the heuristic implemented in bestBIC_fast, based on coordinate descent and other tricks; see Section 3 (for Gaussian and binary outcomes it relies on package L0Learn (Hazimeh, Mazumder, and Nonet 2023)). We set verbose=FALSE to avoid printing progress information and cluttering the book. For our simulated data the selected model matches the data-generating truth, which features only the first two covariates.

fit.bic <- bestBIC(y ~ ., data=df, verbose=FALSE)
print(fit.bic)
## icfit object
## Model with best BIC: X1 X2
## Use summary(), coef() and predict() to get inference for the top model
## Use coef(object$msfit) and predict(object$msfit) to get BMA estimates and predictions
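
Before exploring fit.bic further, note that the faster heuristic mentioned above is called in the same way. A minimal sketch, assuming bestBIC_fast shares bestBIC's formula and data arguments:

# Heuristic BIC search via coordinate descent (see Section 3);
# the argument names are assumed to match bestBIC
fit.fast <- bestBIC_fast(y ~ ., data=df)
print(fit.fast)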

We list the BIC for the top 5 models (index 1 corresponds to the intercept, 2 to X1, and so on). We can also use standard functions like summary and coef to view the MLE for the best model, and predict to obtain predictions for new data (sketched after the output below).

fit.bic$models[1:5,]
## # A tibble: 5 × 2
##   modelid   bic
##   <chr>   <dbl>
## 1 2,3      302.
## 2 2,3,4    307.
## 3 1,2,3    307.
## 4 1,2,3,4  311.
## 5 3        381.
summary(fit.bic)
## 
## Call:
## glm(formula = f, family = family2glm(ms$family), data = data)
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## X1   1.1505     0.1022   11.26   <2e-16 ***
## X2   1.1509     0.1006   11.44   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.06776)
## 
##     Null deviance: 371.43  on 100  degrees of freedom
## Residual deviance: 104.64  on  98  degrees of freedom
## AIC: 294.32
## 
## Number of Fisher Scoring iterations: 2
coef(fit.bic)
##       X1       X2 
## 1.150549 1.150920
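
As mentioned above, predict can also return predictions for new data under the selected model. A minimal sketch, assuming new observations are passed via a newdata argument as a data frame with the same covariate names as df:

# Predictions at hypothetical new covariate values under the best-BIC model
# (the newdata argument name is an assumption)
xnew <- data.frame(X1=c(0,1), X2=c(0,1), X3=c(0,1))
predict(fit.bic, newdata=xnew)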

1.1.2 Bayesian model selection

A limitation of L0 criteria is that they ignore the uncertainty in the selected model: we are not completely sure that it is the correct one. We use Bayesian model selection (BMS) to quantify that uncertainty. BMS requires setting a prior distribution on the models and on the parameters of each model. For now, we run modelSelection with default priors (a Beta-Binomial prior on the models and a pMOM prior with default prior precision on the coefficients).

priorCoef <- momprior()
priorModel <- modelbbprior()
fit.bms <- modelSelection(y ~ ., data=df,
                     priorCoef=priorCoef, priorModel=priorModel,
                     verbose=FALSE)
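
Non-default priors are specified by passing different prior constructors. A sketch, under the assumption that momprior accepts a prior dispersion argument tau and that a Binomial model prior constructor modelbinomprior is available (neither is used elsewhere in this book):

# Assumed example: pMOM prior with explicit dispersion and a Binomial(0.1)
# prior on variable inclusion (constructor and argument names are assumptions)
priorCoef2 <- momprior(tau=0.348)
priorModel2 <- modelbinomprior(p=0.1)
fit.bms2 <- modelSelection(y ~ ., data=df, priorCoef=priorCoef2,
                     priorModel=priorModel2, verbose=FALSE)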

coef gives BMA point estimates, 0.95 posterior intervals, and marginal posterior inclusion probabilities \(P(\theta_j \neq 0 \mid y)\), and postProb shows posterior model probabilities (sorted in decreasing order); we apply both to fit.bms below. In the output, phi refers to the error variance \(\sigma^2\) (which is included with probability 1). In our example, BMS selects the right covariates and assigns high posterior probability to that solution, as one would ideally wish.

coef(fit.bms)
##                estimate        2.5%      97.5%      margpp
## (Intercept) 0.007082034 -0.02658464 0.04089499 0.007366249
## X1          1.133309621  0.93331088 1.33480178 1.000000000
## X2          1.134404673  0.93919629 1.33501531 1.000000000
## X3          0.000366013  0.00000000 0.00000000 0.008254065
## phi         1.103715115  0.84213596 1.44604848 1.000000000
postProb(fit.bms)[1:5,]
##    modelid family           pp
## 7      2,3 normal 9.845428e-01
## 8    2,3,4 normal 8.090989e-03
## 15   1,2,3 normal 7.203173e-03
## 16 1,2,3,4 normal 1.630761e-04
## 3        3 normal 3.424188e-17

Finally, we can use predict to obtain point predictions and 0.95 posterior predictive intervals.

ypred <- predict(fit.bms)
head(ypred)
##         mean       2.5%       97.5%
## 1 -0.8928148 -1.1160111 -0.66885457
## 2 -0.2161415 -0.3514485 -0.08236455
## 3  1.3134407  1.0653356  1.56205993
## 4 -3.2261301 -3.6793885 -2.77249364
## 5 -0.4427614 -0.6498843 -0.23853199
## 6  0.7716784  0.6332783  0.90914325
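
As a quick sanity check, we can compare these in-sample point predictions to the observed outcome (this uses only base R and the mean column shown above):

# Root mean squared error of the BMA point predictions on the training data
sqrt(mean((df$y - ypred[,"mean"])^2))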

1.2 Logistic regression

For binary outcomes we simply specify family='binomial', and for Poisson outcomes family='poisson' (a sketch appears at the end of this section). We first create a binary version of our simulated outcome.

dfbin <- transform(df, ybin = (y > 0)) |>
  dplyr::select(!y) #drop variable y

We next use bestBIC and modelSelection as before. The selected model is still the correct one, but the posterior probability of (wrongly) including X3 is higher than for the linear regression data. This is intuitively expected: binary outcomes carry less information than Gaussian ones, so there is more uncertainty about the chosen model.

fit2.bic <- bestBIC(ybin ~ ., family='binomial', data=dfbin, verbose=FALSE)
print(fit2.bic)
## icfit object
## Model with best BIC: X1 X2
## Use summary(), coef() and predict() to get inference for the top model
## Use coef(object$msfit) and predict(object$msfit) to get BMA estimates and predictions
fit2.bms <- modelSelection(ybin ~ ., data=dfbin,
                     priorCoef=priorCoef, priorModel=priorModel,
                     family='binomial', verbose=FALSE)
coef(fit2.bms)
## Warning in hasPostSampling(object): Exact posterior sampling not implemented,
## using Normal approx instead
##               estimate       2.5%     97.5%       margpp
## (Intercept) 0.16323081 0.03275118 0.2669387 2.730669e-10
## X1          1.38893900 0.78354882 2.0102542 1.000000e+00
## X2          1.05744788 0.53310589 1.6067301 9.999136e-01
## X3          0.07043283 0.00000000 0.7391405 1.695233e-01
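
As mentioned earlier, count outcomes are handled analogously by setting family='poisson'. A minimal sketch on simulated Poisson data (the object names are ours; output omitted):

# Simulate Poisson counts with the same linear predictor, then run
# BIC-based selection and BMS with family='poisson'
dfpois <- data.frame(ypois = rpois(100, lambda=exp(x %*% theta)), x)
fit3.bic <- bestBIC(ypois ~ ., family='poisson', data=dfpois, verbose=FALSE)
fit3.bms <- modelSelection(ypois ~ ., data=dfpois, priorCoef=priorCoef,
                     priorModel=priorModel, family='poisson', verbose=FALSE)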

1.3 Non-linear effects via generalized additive models (GAMs)

Non-linear effects can be modeled via cubic splines using the smooth argument (the default is 9 knots, producing 5 columns in the design matrix for each non-linear covariate). The smooth argument does not accept the ~ . shorthand; instead we must explicitly list the covariates for which we wish to include a non-linear effect (see the example below). The effect of each covariate is decomposed as a linear part plus a deviation from linearity (which is forced to be orthogonal to the linear term). This is useful for identifying covariates for which a linear effect is sufficient, and covariates for which there are non-linearities. modelSelection considers 3 possibilities for each covariate: excluding it entirely, including only the linear effect, and including both the linear and non-linear terms. For further details on this decomposition, see Rossell and Rubio (2021).

The linear effect coefficients are displayed under the original variable names, and the non-linear coefficients with an .s appended. Here we have 5 columns coding for each non-linear effect, labelled .s1 through .s5. In our example, there is strong evidence for (correctly) including the linear effects of X1 and X2, excluding their non-linear effects, and excluding X3 entirely.

fit.gam <- modelSelection(y ~ ., data=df,
                     smooth = ~ X1 + X2 + X3,
                     priorCoef=priorCoef,
                     priorModel=priorModel, verbose=FALSE)
coef(fit.gam)
## Warning in hasPostSampling(object): Exact posterior sampling not implemented,
## using Normal approx instead
##                  estimate        2.5%      97.5%       margpp
## (Intercept)  8.127021e-03 -0.01028342 0.02684467 7.499102e-03
## X1           1.140319e+00  1.02940753 1.25420735 1.000000e+00
## X2           1.139502e+00  1.03040279 1.24949508 1.000000e+00
## X3           1.579339e-04  0.00000000 0.00000000 8.545216e-03
## X1.s1       -1.088515e-04  0.00000000 0.00000000 4.026757e-04
## X1.s2        1.357836e-04  0.00000000 0.00000000 4.026757e-04
## X1.s3       -2.810730e-04  0.00000000 0.00000000 4.026757e-04
## X1.s4        2.847856e-04  0.00000000 0.00000000 4.026757e-04
## X1.s5        2.822071e-05  0.00000000 0.00000000 4.026757e-04
## X2.s1        4.295849e-03  0.00000000 0.00000000 4.296470e-03
## X2.s2        5.891295e-03  0.00000000 0.00000000 4.296470e-03
## X2.s3        2.200417e-03  0.00000000 0.00000000 4.296470e-03
## X2.s4        3.947671e-03  0.00000000 0.00000000 4.296470e-03
## X2.s5        1.376460e-03  0.00000000 0.00000000 4.296470e-03
## X3.s1       -1.142249e-04  0.00000000 0.00000000 1.855263e-05
## X3.s2        8.956766e-05  0.00000000 0.00000000 1.855263e-05
## X3.s3        7.400172e-05  0.00000000 0.00000000 1.855263e-05
## X3.s4        5.668310e-05  0.00000000 0.00000000 1.855263e-05
## X3.s5        1.327733e-04  0.00000000 0.00000000 1.855263e-05
## phi          1.093750e+00  0.82918219 1.43831078 1.000000e+00
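
The non-linear terms are correctly dropped here because the simulated effects are truly linear. As a quick check of the reverse situation, one can simulate an outcome with a genuinely non-linear effect and rerun the same call; one would then expect the X2.s coefficients to receive a high inclusion probability. A sketch (output omitted):

# Replace y by an outcome with a quadratic effect of X2, keeping X1 linear
df2 <- transform(df, y = X1 + X2^2 + rnorm(100))
fit.gam2 <- modelSelection(y ~ ., data=df2, smooth = ~ X1 + X2 + X3,
                     priorCoef=priorCoef, priorModel=priorModel, verbose=FALSE)
coef(fit.gam2)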

References

Hazimeh, Hussein, Rahul Mazumder, and Tim Nonet. 2023. “L0Learn: A Scalable Package for Sparse Learning Using L0 Regularization.” Journal of Machine Learning Research 24 (205): 1–8.
Rossell, David, and Francisco J. Rubio. 2021. “Additive Bayesian Variable Selection Under Censoring and Misspecification.” Statistical Science 38 (1): 13–29.