High-dimensional model choice. A hands-on take
2025-11-13
Preface

Model selection (which I will also refer to as structural learning) is a fundamental problem in data analysis. It is therefore surprising that it’s plagued with misconceptions, and that the basic principles distinguishing it from estimation or forecasting problems are too often misunderstood. I have heard more times than I can count that finding good solutions to this problem is too hard. That it’s challenging to formulate in an exactly satisfying manner, that solving it exactly requires prohibitive computations, and that one should give up and adopt alternative strategies that are computationally simpler, even if they give poorer solutions. Yet, in my applied experience I found the problem to be much simpler. In practice the exact formulation usually has a moderate effect on the results, provided one follows simple principles, and simple computational strategies are surprisingly effective. This book’s leitmotif is well captured by a quote from statistician John Tukey: “An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem”. Its goal is to introduce the reader to the main theoretical and methodological concepts, always keeping in mind, and illustrating, their impact in actual applications. It therefore offers a hands-on take that should hopefully be useful to applied data analysts, while also discussing nuances that should be of interest to more methodologically-oriented readers.
A brief story of how I got interested in this problem follows. When doing my PhD at Rice University I asked my supervisor, Peter Mueller, how Bayesians test hypotheses. He answered that there was no universal agreement, and that one needs to be careful about certain pitfalls (such as the Jeffreys-Lindley paradox) to avoid getting nonsensical solutions. I was stunned by the answer: how was it possible that Bayesians didn’t agree on a solution for such a simple, fundamental task? This took me down the rabbit hole. I started reading the large literature on Bayesian tests, objective Bayes, the associated mathematical statistics, and so on, and went deeper into the topic during my time as a postdoc with Val Johnson at MD Anderson Cancer Center. I gradually got the feeling that the literature on Bayesian tests was over-complicated. The complications arose from coming up with a formulation that adhered perfectly to certain principles (minimal informativeness, predictive matching, etc.), which is hard to do exactly except in simple settings, and which assumes that there is universal agreement on these being the only possible correct principles. From my applied experience, as long as one is aware of some key ideas, most sensible Bayesian tests give fairly similar answers, and contrary to popular opinion the results are quite robust to the prior. I also fell in love with the fact that Bayesian tests can be seamlessly generalized to Bayesian model selection, providing an effective toolkit to encourage parsimonious solutions for challenging high-dimensional problems, particularly when it comes to structural learning. And I enjoyed learning about the close connections between Bayesian model selection and L0 criteria such as the BIC and later extensions like the EBIC. This was not just an academic affair: I saw these methods work well in many applications that I became involved in over the years, particularly when working on biomedical and social science problems where the goal was to understand a phenomenon (as opposed to pure forecasting, for which a black-box method could suffice).
A second rabbit hole had to do with computational aspects. For example, there’s a large literature claiming that Bayesian model selection and L0 criteria are computationally infeasible. They certainly are in a worst-case scenario: more precisely, they have been shown to be NP-hard in the worst case. But why should one make decisions based on worst-case scenarios? If I leave my home, the worst-case scenario is that I get run over by a car; does this mean that I should just stay on my couch? A more practically relevant question is how often the problem is feasible for the data that one has at hand, i.e. whether this happens with high probability. There is theory showing that computations are indeed feasible with high probability, under certain (somewhat restrictive) conditions. More importantly from a hands-on perspective, in my experience simple algorithms do surprisingly well at finding good approximate solutions (remember John Tukey’s mantra).
In summary, a main motivation for this book was to introduce the fundamental notions behind high-dimensional model selection / structural learning, trying to dispel common misconceptions along the way. While methodological details are discussed, the perspective is always hands-on. Another motivation was to provide supporting material for applied data analysts and for undergraduate and graduate students, including serving as documentation for some of the R packages that one can use for this task.
A final note on software. There is a blatant emphasis on my R package modelSelection (which supersedes a previous package called mombf), which I have been developing over the years, mainly because it is well-suited to illustrating the main notions that I intend to convey.
Other excellent software is also used in this book, and of course there is yet further software that is not covered here.
The package covers the following models:
Generalized linear models (GLMs, Chapter 5) and generalized additive models (GAMs, Chapter 6) (Johnson and Rossell 2012; D. Rossell and Telesca 2017; D. Rossell, Abril, and Bhattacharya 2021; D. Rossell and Rubio 2021).
Linear regression with non-normal residuals (D. Rossell and Rubio 2018), including asymmetric Normal, Laplace and asymmetric Laplace residuals.
Accelerated failure time models for right-censored survival data (D. Rossell and Rubio 2021).
Gaussian graphical models under spike-and-slab priors (Sulem, Jewson, and Rossell 2025).
Bayesian inference for Gaussian mixture models (Fúquene, Steel, and Rossell 2019).
A lot of work went into coding and maintaining the modelSelection package; if you use it, please cite this book or the corresponding papers.
The package’s C++ implementation is not optimal, but it is designed to remain scalable in sparse high-dimensional settings (large \(p\)).
On the Bayesian side, modelSelection is the main package implementing non-local priors (Johnson and Rossell 2010; Johnson and Rossell 2012; D. Rossell and Telesca 2017; see Section 2.6.2), but other popular priors are also implemented, e.g. Zellner’s and Normal shrinkage priors in regression, or Gaussian spike-and-slab priors in graphical models.
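To give a flavor of the package in action, below is a minimal sketch of variable selection in Gaussian linear regression under a MOM non-local prior. It assumes that modelSelection kept the interface of its predecessor mombf (the functions modelSelection(), momprior(), modelbbprior() and postProb()); consult the package documentation for the current arguments.

```r
## Minimal sketch: variable selection in Gaussian linear regression.
## Assumes the mombf-style interface (modelSelection, momprior,
## modelbbprior, postProb); see the package manual for current arguments.
library(modelSelection)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), nrow = n)
colnames(x) <- paste0("x", 1:p)
y <- x[, 1] - 0.5 * x[, 2] + rnorm(n)  # only x1 and x2 are truly active

## MOM non-local prior on coefficients, Beta-Binomial(1,1) prior on models
fit <- modelSelection(y = y, x = x,
                      priorCoef = momprior(tau = 0.348),
                      priorDelta = modelbbprior(1, 1))

head(postProb(fit))  # posterior probabilities of the top models
coef(fit)            # Bayesian model averaging estimates and intervals
```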