And Information Theoretic Priors

**active-dates:** mid 2012 to present

In my pursuit of simple, elegant machine learning algorithms, I came across Boltzmann machines around 2010. The idea seemed mysterious at first, especially since I was used to deterministic, directed graph-based models (i.e., feedforward neural networks), whereas these were probabilistic, undirected models. But the more I studied them, the more I was intrigued, and the more connections I found to other areas of machine learning, statistics, and physics.

One useful generalization of Boltzmann machines is the set of binary "log-linear models," whose log probabilities are, up to a normalizing constant, linear functions of the parameters. Log-linear models are a flexible language for representing structure (or lack thereof) among variables, and they include all kinds of undirected graph-based models (e.g. Boltzmann machines, Markov networks, Ising models) as special cases.
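To make the definition concrete, here is a minimal sketch of a binary log-linear model, p(x) ∝ exp(Σᵢ θᵢ fᵢ(x)), evaluated by brute-force enumeration over all states. The function name `log_linear_probs`, the feature functions, and the parameter values are illustrative assumptions, not code from this project, and enumeration is only feasible for a handful of variables.

```python
import itertools
import math

def log_linear_probs(n_vars, features, theta):
    """Return {state: probability} for a binary log-linear model.

    Illustrative only: enumerates all 2**n_vars states, so it is
    practical only for very small models.
    """
    states = list(itertools.product([0, 1], repeat=n_vars))
    # Unnormalized log probability of each state: linear in theta.
    scores = [sum(t * f(x) for t, f in zip(theta, features)) for x in states]
    # Log partition function, computed with the max-shift trick for stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return {x: math.exp(s - log_z) for x, s in zip(states, scores)}

# Boltzmann-machine-style features on 3 variables (hypothetical choice):
# one bias feature per variable, plus pairwise products for two edges.
features = [
    lambda x: x[0],
    lambda x: x[1],
    lambda x: x[2],
    lambda x: x[0] * x[1],
    lambda x: x[1] * x[2],
]
theta = [0.5, -0.2, 0.1, 1.0, -1.5]  # arbitrary example parameters

probs = log_linear_probs(3, features, theta)
for state, p in sorted(probs.items()):
    print(state, round(p, 4))
```

With only bias and pairwise-product features this is a small fully visible Boltzmann machine; adding higher-order product features (such as `x[0] * x[1] * x[2]`) gives a more general log-linear model, and setting every parameter to zero gives the uniform distribution.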

In order to put this powerful model class to good use, I'm using Bayesian inference as a basis for learning from data. This of course requires a prior distribution. Choosing a prior can be difficult for high-dimensional models, where less intuition is available. Fortunately, information theory provides a useful framework for defining priors which are optimal for prediction. This approach, which is well-studied in the field of objective Bayesian inference, leads to a particular prior with an interesting mathematical form and a wide range of useful properties.

Deriving this prior for log-linear models, and finding a way to make it computationally practical, have been my main focus for several years now. I am close to finishing a few key derivations, at which point I will write up the resulting math and begin running software experiments. As this has been quite a long process (from initial inspiration to where I am now), I am excited to see the results. More to come...