22 July 2008

Learning

In Blerpl, learning works like this: To start with, the world model is empty. The network contains only the input sensor nodes. These are initially assumed to be independent causes, because we have not yet discovered any dependency rules.


The network is fed a sequence of input data, and each node stores the binary sequence it received. Then the algorithm searches for correlation, or dependency, between the sequences at the nodes. If it finds some correlation between two nodes, this indicates that there is some higher-level cause on which both of these nodes depend. The network is augmented to represent the structure that it has discovered in the data.
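As a concrete sketch of what "searching for correlation" could look like (this is illustrative only, not Blerpl's actual criterion, which we'll get to later), one could compare the joint statistics of two stored sequences against what independence would predict, for example via mutual information:

    from math import log2

    def mutual_information(a, b):
        """Estimate the mutual information (in bits) between two equal-length
        binary sequences. A value well above zero suggests a dependency."""
        n = len(a)
        joint = {(x, y): 0 for x in (0, 1) for y in (0, 1)}
        for x, y in zip(a, b):
            joint[(x, y)] += 1
        px = [a.count(0) / n, a.count(1) / n]
        py = [b.count(0) / n, b.count(1) / n]
        mi = 0.0
        for (x, y), count in joint.items():
            if count and px[x] and py[y]:
                pxy = count / n
                mi += pxy * log2(pxy / (px[x] * py[y]))
        return mi

    # in2 mostly copies in1, so their mutual information is well above zero.
    in1 = [0, 1, 1, 0, 1, 0, 0, 1]
    in2 = [0, 1, 1, 0, 1, 0, 1, 1]
    print(mutual_information(in1, in2))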


Here an edge (shown as an arrow with an open head) represents the dependency of in2 on in1. This says “the probability of in2 depends on the state of in1”. The learning algorithm has discovered that the probability of in2 receiving a one vs a zero is correlated with the state of in1, at least in the training input.

Probability? You mean the state of in1 is not enough to always predict the state of in2? In that case we’ve got some new independent causes - in2.0 is the state of in2 when in1 is zero, and in2.1 is the state of in2 when in1 is one.


In effect the in2 node is now a two-input multiplexer, with in1 as the select input and in2.0 and in2.1 as the data inputs. If in1 and in2 are always the same, then in2.0 is always zero and in2.1 is always one. But if they sometimes differ (i.e. the correlation is not perfect) then the error signal has somewhere to go, into the new independent nodes, where the algorithm might yet find a way to predict it.
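To make the multiplexer picture concrete, here is a small illustrative sketch (not Blerpl's actual code) of routing in2's sequence into the two child sequences according to the state of in1:

    def split_by_select(select_seq, data_seq):
        """Demultiplex data_seq into two child sequences according to select_seq.
        Steps where the select is 0 go to child 0, steps where it is 1 go to
        child 1; each child only sees the steps on which it was selected."""
        child0 = [d for s, d in zip(select_seq, data_seq) if s == 0]
        child1 = [d for s, d in zip(select_seq, data_seq) if s == 1]
        return child0, child1

    in1 = [0, 1, 1, 0, 1, 0, 0, 1]
    in2 = [0, 1, 1, 0, 1, 0, 1, 1]   # mostly equal to in1, one disagreement
    in2_0, in2_1 = split_by_select(in1, in2)
    print(in2_0)  # [0, 0, 0, 1] -> mostly zeros; the stray 1 is the error signal
    print(in2_1)  # [1, 1, 1, 1] -> all ones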

So the algorithm has discovered some structure in the data, and inferred the existence of some new higher-level causes to explain it. The algorithm iterates in this way, looking for dependencies among all the nodes in the network, including the newly-created ones. There are a number of different kinds of correlation it can discover, representing spatial and temporal structure, which are represented as different edge annotations in the network. And so Blerpl’s model of the world grows.

This is just an introduction to give you the flavour of how Blerpl works, and we have a lot more detail to cover yet on the criteria that the algorithm uses to choose good network structures.

3 comments:

Unknown said...

So the Blerpl input nodes are bits and each node stores a history. This is similar to the Spike Lists at http://home.arcor.de/w.lorenz65/mlbench, except that you seem to use bit sequences and I use spikes, which would correspond to rising edges in your approach. It is faster to find correlations between spikes than between bits.

For example, computing all correlations of two lists A and B, which consist of 1000 spikes each, within a time window of 100 clocks takes about 1000 * 2 * 20 = 40000 CPU clocks. The result is a vector of 101 ints where the number at index i answers the question: If node B is delayed by i clocks, then how often does it fire together with node A?

The time window size of 101 has only a small influence on execution speed (lists can partially overlap, and the result vector only has to be cleared once at function entry).
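Roughly, what gets computed is this (a simplified sketch to show the result vector, not the optimised library code):

    def spike_correlation(spikes_a, spikes_b, window=100):
        """Count, for each delay d in 0..window, how often a spike in B delayed
        by d clocks coincides with a spike in A. spikes_a and spikes_b are
        sorted lists of spike times (clock indices of rising edges)."""
        result = [0] * (window + 1)          # cleared once at function entry
        set_a = set(spikes_a)
        for t_b in spikes_b:
            for d in range(window + 1):
                if t_b + d in set_a:         # B delayed by d fires together with A
                    result[d] += 1
        return result

    # Most B spikes lead A by 3 clocks, so the histogram peaks at index 3.
    a = [5, 13, 21, 29, 50]
    b = [2, 10, 18, 26, 47]
    print(spike_correlation(a, b, window=10))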

I suspect that finding correlations between bit histories is much slower.

Memory bandwidth can be ignored in both approaches because they are highly local.

Ben said...

Hi Wolfgang, thanks for your comments. You're correct, each node stores a history, a binary sequence.

If I understand correctly, the idea you describe is to exploit the sparsity of the data over time when comparing two signals, perhaps using something like run-length encoding.
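For example, a sparse history collapses to a handful of (value, run length) pairs (just a sketch of the idea; nothing like this is in Blerpl yet):

    def run_length_encode(bits):
        """Compress a binary sequence into (value, run_length) pairs.
        Sparse sequences (long runs of zeros) collapse to a few pairs."""
        runs = []
        for bit in bits:
            if runs and runs[-1][0] == bit:
                runs[-1][1] += 1
            else:
                runs.append([bit, 1])
        return [tuple(r) for r in runs]

    history = [0] * 50 + [1] + [0] * 30 + [1, 1] + [0] * 17
    print(run_length_encode(history))
    # [(0, 50), (1, 1), (0, 30), (1, 2), (0, 17)]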

I've also envisaged doing this in Blerpl, as the sequences rapidly become sparse and compressing them would save memory and CPU time. But I haven't implemented it yet, as it won't make any difference to the results achieved by the algorithm, only to runtime performance. And there is still a lot of algorithm experimentation to do.

By the way, because Blerpl splits sequences among multiple parent nodes, many "don't care" states arise - I've mentioned "temporal slowness" in reference to this. The resulting sequences would be quite amenable to very efficient entropy encoding.

Fun to think about, but that sort of optimisation is a bit of a distraction at the moment! :) The real task is understanding how to build useful world models that can predict very general sequences.
Cheers,

- Ben.

Unknown said...

Well, I did not describe an idea. Spike Lists are working right now, and they rank third compared to a few other machine learning algorithms that I have been testing. Anyone can download the library and the source code from http://home.arcor.de/w.lorenz65/mlbench and verify that it really works on different tasks.

Yes, spikes can be seen as a run-length encoding for bits. My current implementation uses only the rising edges (transitions from 0 to 1) and completely ignores the falling edges (from 1 to 0). I wasn't sure if that would work, but it did. The higher levels, i.e. those which are far away from the inputs, change more slowly and are therefore very sparse. This makes them good candidates for run-length encoding.

You said that runtime performance doesn't matter. Then why think about building models at all? Why not just use genetic algorithms, sit back, watch your computer work, and drink some beer?

You also said that there is a lot of algorithm experimentation to do. As you have not published any source code, I guess that you want to do those experiments by yourself. So have fun, and keep your readers informed if you find some interesting results. Maybe some day Blerpl will be good enough to compete with other algorithms on a real PC.