active: Sep 2010 to Jun 2011

Compression AI

Intelligence from Lossless Data Compression

After finishing much of my work on brain-inspired AI, I began exploring alternative approaches. What might be another good strategy besides brain modeling? It seemed that any general AI would require data prediction as a core component, so that became my focus: finding a good general purpose prediction algorithm. And the obvious place to look was the field of lossless data compression, as any effective compression algorithm is built on prediction.

Gzip AI?

My idea was then to pair compression with reinforcement learning. Imagine e.g. a "gzip-AI" (or any other standard compressor) which continually compresses its sensory/motor/reward data stream, looking for patterns, and uses its past experience to predict the future and make informed decisions. The benefits of this approach are:

  1. We offload most of the complexity of AI agent design onto "external" prediction algorithms, and we get automatic upgrades whenever those algorithms improve.
  2. The core progress measurement is dead simple: just compute the size of the compressed data (i.e. the sum of -log probabilities assigned to each symbol in the data sequence).

Summary of Efforts

I studied several lossless compression algorithms and also sequence predictors designed for language modeling. This included the PAQ series, PPM methods, and the Sequence Memoizer (SM) (based on the "hierarchical Pitman-Yor process").

To compare methods, I kept track of compression performance on a variety of text files, images files, and other data types. The PAQ methods generally compressed the best. Standard PPM didn't perform as well as SM, but I did find a simple PPM improvement: starting with the full blended PPM method, where each valid context contributes a weighted prediction, I changed the count update scheme to use weighted updates. This produced results that rivaled or exceeded those of SM.

Besides compression performance, I also generated samples from the learned models (e.g. samples of Pride and Prejudice, samples images, etc.) The text samples were always interesting to read, even if they were never incredibly realistic. Generated image samples, however, were often terrible for all algorithms tested. I imagine this was due to the fact that most lossless algorithms are designed mainly for naturally sequential data like text, not data that has been serialized from some other format (like images). To handle vastly different sequence types like this would require new models that more gracefully handle long-range contexts (which is now a focus of much current work in e.g. recurrent neural networks).

After seeing the lackluster generated samples, I was not optimistic about achieving good RL performance, but I had to give it a try. Initial tests (on extremely tiny environments) showed some promise, but again, the difficulty with long-range contexts limited its effectiveness on anything moderately complex. From my notes at the time:

It seems that most existing compression algorithms don't generalize too well for this application... e.g. Lempel-Ziv only works here if there are lots of repeated substrings in the RL agent's observation byte sequence. But if only 1 byte changes, it must switch to (or create) an entirely new dictionary entry.

Final Status

Given the limited power of compression algorithms available at the time of this project, I decided not to pursue this path until things improved. It still seems to me like a good way to divide efforts in the field of general AI: first develop advanced prediction algorithms, then use them as the core of an intelligent decision-making agent. Indeed, that appears to be a common strategy now popularized by large language models like GPT. In particular, the paper Chen et al, 2021, Decision Transformer: Reinforcement Learning via Sequence Modeling applies the transformer architecture to reinforcement learning in an elegant way that simply treats all environment-agent interactions as a sequence modeling problem.

(Actually, soon after giving up on using an off-the-shelf prediction algorithm, I became motivated to tackle the prediction problem directly. This included focusing on probabilistic graph-based models which are able to model all kinds of complicated structure in data. See here for more.)