Preface

Model selection (which I will also refer to as structural learning) is a fundamental problem in data analysis. It is therefore surprising that it's plagued with misconceptions, and that the basic principles distinguishing it from estimation or forecasting problems are too often misunderstood. I have heard too many times to count that finding good solutions to this problem is too hard: that it's challenging to formulate in an exactly satisfying manner, that it requires prohibitive computations to solve exactly, and that one should give up and adopt alternative strategies that are computationally simpler, even if they give poorer solutions. Yet, in my applied experience I have found the problem to be much simpler. In practice the exact formulation usually has a moderate effect on the results, provided one follows simple principles, and simple computational strategies are surprisingly effective. This book's leitmotif is well captured by a quote from statistician John Tukey: "An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." Its goal is to introduce the reader to the main theoretical and methodological concepts, always keeping in mind and illustrating their impact in actual applications. It therefore offers a hands-on take that should hopefully be useful to applied data analysts, while also discussing nuances that should interest more methodologically-oriented readers.

A brief story of how I got interested in this problem follows. When doing my PhD at Rice University I asked my supervisor, Peter Mueller, how Bayesians test hypotheses. He answered that there's no universal agreement, and that one needs to be careful of certain pitfalls (such as the Jeffreys-Lindley paradox) to avoid getting nonsensical solutions. I was stunned by the answer: how was it possible that Bayesians didn't agree on a solution for such a simple, fundamental task? This took me down the rabbit hole. I started reading the large literature on Bayesian tests, objective Bayes, the associated mathematical statistics, etc., and went deeper into the topic during my time as a postdoc with Val Johnson at MD Anderson Cancer Center. I gradually got the feeling that the literature on Bayesian tests was over-complicated. The complications arose from seeking a formulation that adhered perfectly to certain principles (minimal informativeness, predictive matching, etc.), which is hard to do exactly except in simple settings, and from assuming that there is universal agreement on these being the only possible correct principles. From my applied experience, as long as one is aware of some key ideas, most sensible Bayesian tests give a fairly similar answer, and contrary to popular opinion the results are quite robust to the prior. I also fell in love with the fact that Bayesian tests can be seamlessly generalized to Bayesian model selection, providing an effective toolkit to encourage parsimonious solutions for challenging high-dimensional problems, particularly when it comes to structural learning. And I enjoyed learning about the close connections between Bayesian model selection and L0 criteria such as the BIC and later extensions like the EBIC. This was not just an academic affair: I saw these methods work well in many applications that I became involved in over the years, particularly when working on biomedical and social science problems where the goal was to understand a phenomenon (as opposed to pure forecasting, for which a black-box method could suffice).

A second rabbit hole had to do with computational aspects. For example, there's a large literature claiming that Bayesian model selection and L0 criteria are computationally infeasible. They certainly are in a worst-case scenario; more precisely, they have been shown to be NP-hard in the worst case. But why should one make decisions based on worst-case scenarios? If I leave my home, the worst-case scenario is that I get run over by a car; does this mean that I should just stay on my couch? A more practically relevant question is how often the problem is feasible for the data that one has at hand, i.e. whether this happens with high probability. There is theory showing that computations are indeed feasible with high probability, under certain (somewhat restrictive) conditions. More importantly from a hands-on perspective, in my experience simple algorithms do surprisingly well at finding good approximate solutions (remember John Tukey's mantra).

In summary, a main motivation for this book was to introduce the fundamental notions behind high-dimensional model selection / structural learning, trying to dispel common misconceptions along the way. While methodological details are discussed, the perspective is always hands-on. Another motivation was to provide supporting material for applied data analysts and for undergraduate and graduate students, including serving as documentation for some of the R packages that one can use for this task.

A final note on software. There is a blatant emphasis on my R package modelSelection (which supersedes a previous package called mombf), which I have been developing over the years, mainly because it's well-suited to illustrating the main notions that I intend to convey. Other excellent software is also used in this book, and of course there is yet further software that is not covered here. The package covers the following models:

A lot of work went into coding and maintaining the modelSelection package; if you use it, please cite this book or the corresponding papers. The package's C++ implementation is not optimal, but it is designed to remain scalable in sparse high-dimensional settings (large \(p\)). On the Bayesian side, modelSelection is the main package implementing non-local priors (Johnson and Rossell 2010, 2012; Rossell and Telesca 2017; Section 2.6.2), but other popular priors are also implemented, e.g. Zellner's and Normal shrinkage priors in regression, or Gaussian spike-and-slab priors in graphical models.
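To give a flavor of what this looks like in practice, below is a minimal sketch of Bayesian variable selection in linear regression with a non-local prior. It assumes an interface in the style of mombf's `modelSelection()` function, where `momprior()` sets a product MOM prior on the coefficients and `modelbbprior()` a Beta-Binomial prior on the model space; the exact function and argument names in the modelSelection package may differ slightly, so please check the package documentation.

```r
# Minimal sketch of Bayesian variable selection with a non-local (MOM) prior.
# Assumes an mombf-style interface; names may differ in modelSelection.
library(modelSelection)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), nrow = n)     # design matrix
y <- x[, 1] - 0.5 * x[, 2] + rnorm(n)   # only the first two covariates matter

fit <- modelSelection(y = y, x = x,
                      priorCoef = momprior(tau = 0.348),  # non-local MOM prior
                      priorDelta = modelbbprior(1, 1))    # Beta-Binomial(1,1) on models

postProb(fit)   # posterior probabilities of the top models
coef(fit)       # Bayesian model averaging estimates and intervals
```

Here `postProb()` lists the most probable models and `coef()` returns model-averaged coefficient estimates; the prior settings shown are illustrative defaults rather than recommendations.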

References

Fúquene, J., M. F. J. Steel, and D. Rossell. 2019. “On Choosing Mixture Components via Non-Local Priors.” Journal of the Royal Statistical Society B 81 (5): 809–37.
Johnson, V. E., and D. Rossell. 2010. “On the Use of Non-Local Prior Densities for Default Bayesian Hypothesis Tests.” Journal of the Royal Statistical Society B 72 (2): 143–70.
———. 2012. “Bayesian Model Selection in High-Dimensional Settings.” Journal of the American Statistical Association 107 (498): 649–60.
Rossell, D., O. Abril, and A. Bhattacharya. 2021. “Approximate Laplace Approximations for Scalable Model Selection.” Journal of the Royal Statistical Society B 83 (4): 853–79.
Rossell, D., and F. J. Rubio. 2018. “Tractable Bayesian Variable Selection: Beyond Normality.” Journal of the American Statistical Association 113 (524): 1742–58.
———. 2021. “Additive Bayesian Variable Selection Under Censoring and Misspecification.” Statistical Science 38 (1): 13–29.
Rossell, D., and D. Telesca. 2017. “Non-Local Priors for High-Dimensional Estimation.” Journal of the American Statistical Association 112 (517): 254–65.
Sulem, D., J. Jewson, and D. Rossell. 2025. “Bayesian Computation for High-Dimensional Gaussian Graphical Models with Spike-and-Slab Priors.” arXiv 2511.01875: 1–139.