Scalable and Streaming Inference for Complex Bayesian Models

Massive data sets have become commonplace in many applied fields, including genomics, neuroscience, finance, information retrieval, and the social sciences. Practitioners in these fields often want to fit complex Bayesian models to these data sets in order to capture underlying dependencies in the data. Unfortunately, traditional algorithms for Bayesian inference in such models, such as Markov chain Monte Carlo and variational inference, typically do not scale to the large data sets encountered in practice. Additional complications arise when faced with an unbounded amount of data that arrives as a stream, since most existing inference algorithms are not applicable in that setting.

In this talk I will discuss our recent work on developing algorithms for these two settings. First, I will describe a stochastic variational inference algorithm for hidden Markov models with provable convergence guarantees. The algorithm makes it possible to model sequences of hundreds of millions of observations without breaking the chain into arbitrary pieces. We demonstrate the efficacy of the algorithm by using it to segment a human chromatin data set with 250 million observations, achieving performance comparable to a state-of-the-art model in a fraction of the time. Next, I will present a streaming variational inference algorithm for Bayesian nonparametric mixture models, which provides an efficient nonparametric clustering algorithm for massive streaming data. We apply the method to perform online clustering of a large corpus of New York Times documents.
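To give a feel for the scalability idea underlying stochastic variational inference, here is a minimal sketch, not the talk's actual HMM algorithm: global variational parameters are updated by noisy natural-gradient steps computed on minibatches, with the minibatch sufficient statistics rescaled to the full data size and averaged with Robbins–Monro step sizes. The toy model (a Gaussian mean with known variance and a conjugate Gaussian prior), the hyperparameters, and the step-size schedule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "large" data set: N draws from a Gaussian with unknown mean.
N, true_mu, sigma2 = 100_000, 3.0, 1.0
data = rng.normal(true_mu, np.sqrt(sigma2), size=N)

# Conjugate prior mu ~ N(0, tau2); track the variational posterior's
# natural parameters (eta1 = mean/var, eta2 = -1/(2*var)).
tau2 = 10.0
eta1, eta2 = 0.0, -0.5 / tau2

batch_size, kappa = 100, 0.7  # illustrative minibatch size and decay rate
for t, start in enumerate(range(0, N, batch_size), start=1):
    batch = data[start:start + batch_size]
    rho = t ** (-kappa)  # Robbins-Monro step size (sums diverge, squares converge)
    # Noisy target: prior natural parameters plus minibatch sufficient
    # statistics rescaled by N / |batch| to mimic a full-data update.
    hat1 = 0.0 / tau2 + (N / batch.size) * batch.sum() / sigma2
    hat2 = -0.5 / tau2 + (N / batch.size) * (-0.5 * batch.size / sigma2)
    eta1 = (1 - rho) * eta1 + rho * hat1
    eta2 = (1 - rho) * eta2 + rho * hat2

# Convert back to the posterior mean; it should land near true_mu.
post_mean = -eta1 / (2 * eta2)
print(round(post_mean, 2))
```

Each step touches only one minibatch, so memory and per-iteration cost are independent of N; the HMM algorithm described in the talk additionally has to handle the dependence between adjacent observations, which this independent-data toy ignores.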