Projection predictive model selection for Gaussian processes
Juho Piironen, Aki Vehtari +1 more
- 01 Sep 2016
- pp 1-6
TL;DR: This article proposes a method for simplifying Gaussian process models by projecting the information contained in the full encompassing model and selecting a reduced number of variables based on their predictive relevance. The authors expect the method to improve the explainability of the models and to reduce both future measurement costs and the computation time for making new predictions.
Abstract: We propose a new method for simplification of Gaussian process (GP) models by projecting the information contained in the full encompassing model and selecting a reduced number of variables based on their predictive relevance. Our results on synthetic and real world datasets show that the proposed method improves the assessment of variable relevance compared to the automatic relevance determination (ARD) via the length-scale parameters. We expect the method to be useful for improving explainability of the models, reducing the future measurement costs and reducing the computation time for making new predictions.
Figures

Figure 2.2. Illustration of the imposed priors on the model complexity for different choices of sparsity hyperparameters when p = 50. Left graph shows the histograms of prior draws for m_eff (Eq. (2.13)) for the horseshoe prior with p 
Figure 3.2. Illustration of projective selection. The training data has n = 100 observations with 1000 features, of which 100 are relevant but correlated with each other and therefore carry similar information (the rest are completely irrelevant). The left plot shows the mean log predictive density (MLPD) and the right plot the predictive mean squared error (MSE) as a function of the number of features selected, both evaluated on an independent test set of 1000 observations (vertical lines denote one standard error bars). The reference model (dashed horizontal) is obtained from Bayesian linear regression using the first 5 principal components. The projection (black) is the single point projection with L1-search (Eq. (3.6)), but the predictions are computed without any penalization. Results for Lasso (gray) are shown for comparison.
Table 2.1. Example prior distributions for the regression coefficients β_j that can be expressed as scale mixtures of Gaussians. The middle column gives the conditional prior for β_j given the hyperparameters, and the last column gives the hyperprior. All hyperparameters for which a prior is not specified (τ, ν, π and c) are assumed to be given, although in practice these can be given hyperpriors as well. The symbol c is purposely used in both the regularized horseshoe and the spike-and-slab, as it serves the same purpose in both cases. For the inverse-gamma distribution, parameters a and b denote the shape and scale, respectively; for the exponential distribution, b likewise denotes the scale.
Figure 2.1. Prior densities imposed on the shrinkage factor (2.11) for different prior choices p(β_j) (see Table 2.1). For Gaussian and spike-and-slab, the prior contains mass only at some discrete values depicted by the thick vertical bars. For all priors except spike-and-slab, black denotes the density when √n σ⁻¹τ = 1 and grey denotes √n σ⁻¹τ = 0.3.
Figure 3.1. Illustration of a typical difficulty encountered with correlated features. The model is the simple linear regression (2.1) without intercept and assuming the noise variance σ² is known. Visualized are the likelihood, prior (horseshoe with τ = 1) and posterior densities for the regression coefficients β1 and β2 for a random data realization with n = 50 observations when the features x1 and x2 have a correlation of ρ = 0.8 (see the text for more details). The likelihood for both coefficients being zero is small, but the data provides little evidence whether both or only one of them is nonzero. A sparsifying prior such as the horseshoe results in a multimodal posterior but does not help in solving the feature selection problem.
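The projective selection described in the caption of Figure 3.2 can be illustrated with a minimal sketch, under several assumptions that are not taken from the paper. The model here is linear with Gaussian noise of fixed variance, in which case projecting the reference fit onto a submodel reduces to a least-squares fit of the reference predictions on the selected features. The reference model is a plain ridge regression used as a stand-in, and the search is a greedy forward search rather than the L1-search mentioned in the caption. The names project_submodel, forward_search and demo are hypothetical.

```python
import numpy as np


def project_submodel(X, mu_ref, idx):
    """Project the reference fit mu_ref onto the features in idx.

    Assumes a Gaussian linear model with fixed noise variance, so the
    projection is an ordinary least-squares fit to mu_ref (not to y).
    """
    Xs = X[:, idx]
    w, *_ = np.linalg.lstsq(Xs, mu_ref, rcond=None)
    return w, Xs @ w


def forward_search(X, mu_ref, max_size):
    """Greedy forward search: at each step add the feature whose
    projected submodel is closest to the reference fit."""
    selected, path = [], []
    remaining = list(range(X.shape[1]))
    for _ in range(max_size):
        best_j, best_err = None, np.inf
        for j in remaining:
            _, fit = project_submodel(X, mu_ref, selected + [j])
            err = np.mean((mu_ref - fit) ** 2)  # discrepancy to reference
            if err < best_err:
                best_j, best_err = j, err
        selected.append(best_j)
        remaining.remove(best_j)
        path.append((best_j, best_err))
    return path


def demo(seed=0):
    """Synthetic example: only features 0 and 1 affect the target."""
    rng = np.random.default_rng(seed)
    n, p = 100, 20
    X = rng.normal(size=(n, p))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * rng.normal(size=n)
    # Stand-in reference model: ridge regression on all features.
    lam = 1.0
    w_ref = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    mu_ref = X @ w_ref
    return forward_search(X, mu_ref, max_size=5)


if __name__ == "__main__":
    for j, err in demo():
        print(f"added feature {j}, discrepancy to reference {err:.4f}")
```

The key design choice in this sketch is that submodels are fitted to the reference predictions rather than to the observed targets, so the noise filtered out by the reference model does not re-enter the search.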
Citations
Using stacking to average Bayesian predictive distributions
TL;DR: This work takes the idea of stacking from the point estimation literature and generalizes it to the combination of predictive distributions, extending the utility function to any proper scoring rule, using Pareto smoothed importance sampling to efficiently compute the required leave-one-out posterior distributions, and using regularization to improve stability.
159
Projective Inference in High-dimensional Problems: Prediction and Feature Selection
TL;DR: In this paper, a two-stage approach is proposed to construct a possibly non-sparse model that predicts well, and then find a minimal subset of features that characterize the predictions.
67
•Proceedings Article
Variable selection for Gaussian processes via sensitivity analysis of the posterior predictive distribution
Topi Paananen, Juho Piironen, Michael Riis Andersen, Aki Vehtari +3 more
- 16 Apr 2019
TL;DR: This article proposed two variable selection methods for Gaussian process models that utilize the predictions of a full model in the vicinity of the training points and thereby rank the variables based on their predictive relevance.
Variable prioritization in nonlinear black box methods: a genetic association case study
TL;DR: Methodologically, the "RelATive cEntrality" (RATE) measure is developed to prioritize candidate genetic variants that are not just marginally important, but whose associations also stem from significant covarying relationships with other variants in the data.
•Posted Content
Projection Predictive Inference for Generalized Linear and Additive Multilevel Models
TL;DR: The simulated and real-world experiments demonstrate that the projection predictive inference method can drastically reduce the model complexity required to reach reference predictive performance and achieve good frequency properties.
References
Regression Shrinkage and Selection via the Lasso
TL;DR: A new method for estimation in linear models called the lasso, which minimizes the residual sum of squares subject to the sum of the absolute value of the coefficients being less than a constant, is proposed.
Gaussian Processes For Machine Learning
Tanja Hueber
- 01 Jan 2016
10K
Multivariate Adaptive Regression Splines
TL;DR: In this article, a new method is presented for flexible regression modeling of high dimensional data, which takes the form of an expansion in product spline basis functions, where the number of basis functions as well as the parameters associated with each one (product degree and knot locations) are automatically determined by the data.
•Book
Bayesian learning for neural networks
Geoffrey E. Hinton, Radford M. Neal +1 more
- 01 Jan 1995
TL;DR: Bayesian Learning for Neural Networks shows that Bayesian methods allow complex neural network models to be used without fear of the "overfitting" that can occur with traditional neural network learning methods.
4.8K
Variable selection via Gibbs sampling
TL;DR: In this paper, the Gibbs sampler is used to indirectly sample from the multinomial posterior distribution on the set of possible subset choices to identify the promising subsets by their more frequent appearance in the Gibbs sample.
3.1K