Despite the diverse pedigrees of Data Mining methods, the underlying algorithms fall into a handful of families, whose properties suggest their likely performance on a given dataset. One typically selects an algorithm by matching its strengths to the properties of one's data. Yet performance surprises, in which competing models rank differently than expected, are common; model inference, even when semi-automated, still seems to be as much art as science.
Recently, however, researchers in several fields (statistics, economics, computational learning theory, and engineering) have discovered that a simple technique, combining competing models, almost always improves classification accuracy. (Such "bundling" is a natural outgrowth of Data Mining, since much of the model search process is automated and candidate models abound.)
This talk will describe a collection of powerful model combination methods, including bagging, boosting, and a little Bayesian model averaging, and briefly demonstrate their positive effects on scientific, medical, and marketing case studies. I will show how these methods, used in combination, can readily provide many desirable properties, including prediction accuracy, robustness, variance reduction, model interpretability, and scalability to massive datasets.
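To make the idea of combining models concrete, here is a minimal sketch of bagging in Python. It is an illustration under stated assumptions, not the method or data used in the talk: the base learner is a one-threshold "stump", the dataset is synthetic, and every function name (`fit_stump`, `bagged_stumps`, `predict_bagged`) is hypothetical. Each stump is fit to a bootstrap resample of the training data, and the bundle classifies by majority vote.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a one-threshold classifier that predicts 1 when x >= threshold.

    points: list of (x, label) pairs with labels 0 or 1.
    Returns the candidate threshold with the fewest training errors.
    """
    candidates = sorted({x for x, _ in points})
    best_threshold, best_errors = candidates[0], len(points) + 1
    for t in candidates:
        errors = sum((x >= t) != bool(y) for x, y in points)
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold

def predict_stump(threshold, x):
    return 1 if x >= threshold else 0

def bagged_stumps(points, n_models, rng):
    """Fit n_models stumps, each on a bootstrap resample of points."""
    models = []
    for _ in range(n_models):
        # Bootstrap: draw len(points) examples with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models

def predict_bagged(models, x):
    """Majority vote over the stumps (use an odd count to avoid ties)."""
    votes = Counter(predict_stump(t, x) for t in models)
    return votes.most_common(1)[0][0]

# Synthetic toy data (an assumption for illustration only):
# the true label is 1 when x >= 0.5, and 10% of labels are flipped.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.random()
    label = 1 if x >= 0.5 else 0
    if rng.random() < 0.1:
        label = 1 - label
    data.append((x, label))

models = bagged_stumps(data, n_models=25, rng=rng)
agreement = sum(predict_bagged(models, x) == y for x, y in data) / len(data)
print(f"fraction of (noisy) training labels matched: {agreement:.2f}")
```

A stump is used here only because it keeps the sketch short. Bagging tends to help most with unstable base learners, so with a learner this simple the gain over a single model may be small.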