Learning an ensemble of models instead of a single one can be a remarkably effective way to reduce predictive error. Ensemble methods include bagging, boosting, stacking, error-correcting output codes, and others. But how can we explain the amazing success of these methods (and hopefully design better ones as a result)? Many different explanations have been proposed, using concepts like the bias-variance tradeoff, margins, and Bayesian model averaging. However, each of these explanations has significant shortcomings. In this talk I will survey the current state of knowledge and describe a number of experiments testing some of the proposed explanations. One of the most striking conclusions is that Bayesian model averaging completely fails to explain the success of bagging and related methods. I will conclude by describing what (according to the experimental evidence) is the best current explanation of why ensembles work, showing that this explanation is still incomplete, and pointing to where further research is needed.
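
To make the contrast behind that conclusion concrete, a rough sketch (the notation below is my own and not taken from the talk): Bayesian model averaging weights each hypothesis by its posterior probability given the training data, whereas bagging averages uniformly over models fit to bootstrap replicates of the data.

% Illustrative notation (assumed): H is the hypothesis space, D the training data,
% and h_1, ..., h_B are models trained on B bootstrap replicates of D.
\[
P_{\mathrm{BMA}}(y \mid x, D) = \sum_{h \in H} P(y \mid x, h)\, P(h \mid D)
\]
\[
P_{\mathrm{bag}}(y \mid x) = \frac{1}{B} \sum_{b=1}^{B} P(y \mid x, h_b)
\]

A standard observation is that the posterior P(h | D) tends to concentrate on one or a few hypotheses, so the posterior-weighted average often behaves more like model selection than like the uniform average that bagging computes.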