
Confidentiality in high-dimensional data

Genome-wide association studies compute hundreds of thousands of estimates, each relating a single genetic variant to an outcome. Nearly all of the variants are irrelevant to any given outcome, and the out-of-sample prediction of outcomes from genetics can charitably be described as 'disappointing'. It therefore seems obvious that publishing the univariate regression summaries is harmless from the point of view of confidentiality and useful to other researchers, and publishing them was NIH policy for the studies it funded. However, although obvious, this is not true, as was first demonstrated in 2008. I will explain this in terms of the familiar statistical issue of good in-sample prediction from overfitting.
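
The overfitting argument can be illustrated with a small simulation. This is only a sketch under assumptions that are not from the abstract: the sample sizes, the allele frequency `MAF`, independent variants, and an outcome that is pure noise are all hypothetical choices, and the score below is a simple sum of published slopes times allele counts, not necessarily the method used in the 2008 demonstration. The point is that a score built from the published univariate slopes tracks the outcomes of study participants closely, even though no variant has any real effect, while it tells us essentially nothing about people outside the study.

```python
import random

random.seed(1)

# Hypothetical sizes: far more variants than participants.
N_IN, N_OUT, N_VARIANTS, MAF = 100, 100, 2000, 0.3

def genotype():
    # Allele counts (0, 1 or 2) at each variant for one person.
    return [(random.random() < MAF) + (random.random() < MAF)
            for _ in range(N_VARIANTS)]

def pearson(x, y):
    # Plain Pearson correlation, written out to avoid extra dependencies.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Outcomes are pure noise: no variant has any real effect.
g_in = [genotype() for _ in range(N_IN)]     # study participants
y_in = [random.gauss(0, 1) for _ in range(N_IN)]
g_out = [genotype() for _ in range(N_OUT)]   # people not in the study
y_out = [random.gauss(0, 1) for _ in range(N_OUT)]

# "Published" summaries: one least-squares slope per variant.
y_bar = sum(y_in) / N_IN
betas = []
for j in range(N_VARIANTS):
    col = [g[j] for g in g_in]
    g_bar = sum(col) / N_IN
    sxy = sum((c - g_bar) * (y - y_bar) for c, y in zip(col, y_in))
    sxx = sum((c - g_bar) ** 2 for c in col)
    betas.append(sxy / sxx if sxx > 0 else 0.0)

def score(g):
    # Sum of published slopes times allele counts centred at 2 * MAF.
    return sum(b * (x - 2 * MAF) for b, x in zip(betas, g))

r_in = pearson([score(g) for g in g_in], y_in)
r_out = pearson([score(g) for g in g_out], y_out)
print(f"participants: r = {r_in:.2f}; non-participants: r = {r_out:.2f}")
```

With many more variants than participants, the in-sample correlation should come out close to 1 and the out-of-sample correlation close to 0. That gap is what makes the summaries informative about who took part in the study.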