Prior Elicitation and Variable Selection in High Dimensional Regression Models

The development of modern scientific techniques, including large-scale genomic technologies, has led to the generation of enormous amounts of data, often characterized by high dimensionality and complex dependence structures. In many cases, the number of variables measured ($d$) exceeds the number of observations ($n$), leading to model non-identifiability and difficulties in parameter estimation. We develop a framework for variable and model selection in regression-type models for high dimensional problems. First, we develop a process for specifying a general class of prior distributions, called Information Matrix (IM) priors, focusing on linear and generalized linear models. The priors are then extended to the $d > n$ setting by introducing a ridge parameter in the prior construction, leading to the Information Matrix Ridge (IMR) prior. The IM and IMR priors are based on a broad generalization of Zellner's g-prior for Gaussian linear models. Theoretical and computational analyses of the IMR prior indicate a number of highly desirable properties, including existence of the prior and posterior moment generating functions, robustness arising from the tail behavior, and connections to Gaussian priors and Jeffreys' prior. Several simulation studies demonstrate advantages over a Gaussian prior in high dimensional settings. We demonstrate the superior performance of this prior in two applications: (i) discovering gene regulatory networks from genomic sequence and gene expression microarray data in a yeast cell-cycle experiment, and (ii) predicting nucleosome positions from genomic sequence data in yeast.
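To illustrate why a ridge parameter matters when $d > n$, the following is a minimal sketch for the Gaussian linear model only. It assumes a ridge-stabilized g-prior whose precision has the form $(X^\top X/\sigma^2 + \lambda I)/g$. This form, the function name `ridge_g_prior_precision`, and the parameters `g`, `sigma2`, and `lam` are illustrative assumptions, not the definition of the IMR prior given in the work itself.

```python
import numpy as np

def ridge_g_prior_precision(X, g=1.0, sigma2=1.0, lam=1.0):
    """Precision matrix of a hypothetical ridge-stabilized g-prior.

    Assumed form (for illustration only): (X'X / sigma2 + lam * I) / g.
    With lam = 0 this reduces to the precision of Zellner's g-prior,
    which is singular whenever d > n.
    """
    d = X.shape[1]
    return (X.T @ X / sigma2 + lam * np.eye(d)) / g

# Simulated design with more variables than observations (d > n).
rng = np.random.default_rng(0)
n, d = 10, 25
X = rng.standard_normal((n, d))

# The Gram matrix X'X has rank at most n, so it cannot be inverted when d > n.
gram = X.T @ X
rank_gram = np.linalg.matrix_rank(gram)

# Adding lam * I lifts every eigenvalue by lam, making the precision positive definite.
precision = ridge_g_prior_precision(X, g=float(n), sigma2=1.0, lam=0.5)
min_eig = np.linalg.eigvalsh(precision).min()

print("rank of X'X:", rank_gram, "out of", d)
print("smallest eigenvalue of ridge precision:", min_eig)
```

In this sketch the smallest eigenvalue of the precision is bounded below by `lam / g`, so the corresponding Gaussian prior is proper even though $X^\top X$ alone is rank deficient.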