31 July 2008

Variability and Uncertainty

What is the probability of this coin landing heads up on the next throw?

This article introduces two kinds of probability: variability and uncertainty.

Variability comes from inherent randomness in a system. Flipping a coin or rolling dice involves variability from one throw to the next because the outcome is very hard to predict, even if you've collected good statistics from many previous throws.

Uncertainty comes from incomplete knowledge about the system. If it weren't for the unfortunate accident that bent this coin, you would expect it to have a 50% chance of landing heads. But now we have some uncertainty about the actual probability. It might be 40%, it might be 60%, it might even be 5%.

Let's say that we have flipped the coin seven times, and we got five heads and two tails.


[Interactive graph: the probability distribution of the coin's chance of landing heads, for a chosen count of observed heads and tails.]


This graph shows the probability distribution (uncertainty) of the coin's actual probability (variability) of coming up heads. You can change the number of heads and tails to see how that changes the graph.

Go on, try it. Put in zero heads and zero tails - the distribution is flat because there is no evidence, the probability could be anything. Put in three heads and zero tails - the most likely probability is 100%, but the distribution is quite spread out because there's not much evidence yet. With 50 heads and 50 tails the most likely probability is 50%, and it's quite a narrow peak because with that much evidence on both sides it is quite unlikely that the coin's probability is really 5% or 95%. (But not impossible - a higher resolution graph would show the curve only touches zero at 0% and 100%.)

We always have incomplete knowledge - and so there is always both variability and uncertainty. In order to make a rational prediction, you must take into account the amount of evidence you have for each outcome. Every prediction involves some uncertainty. Every time you throw the coin, you collect a little more evidence.

The Beta distribution, illustrated above, is what the Blerpl learning algorithm uses to model the binary independent causes it discovers as it builds its model of the world.
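For the curious, here is a minimal sketch of that distribution in code. The uniform Beta(1, 1) prior and the helper name beta_pdf are my choices for illustration; this is not Blerpl's code.

    from math import gamma

    def beta_pdf(p, heads, tails):
        """Posterior density of the coin's heads probability,
        assuming a uniform Beta(1, 1) prior, which makes the
        posterior Beta(heads + 1, tails + 1)."""
        a, b = heads + 1, tails + 1
        norm = gamma(a + b) / (gamma(a) * gamma(b))
        return norm * p ** (a - 1) * (1 - p) ** (b - 1)

    # Five heads and two tails: the density peaks at 5/7 (about 0.71)
    # but is still quite spread out - seven flips is not much evidence.
    for p in (0.1, 0.3, 0.5, 0.714, 0.9):
        print(f"p = {p:.3f}: density = {beta_pdf(p, 5, 2):.2f}")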

23 July 2008

Multiple Causes

Every part of our input is influenced by multiple causes all the time. No model of the world that is a single-parent pyramid hierarchy will ever capture that, even for simple worlds.

Because Blerpl learns multiple causes, it can do something NuPIC cannot - it can learn grammar from text. Here’s what I wrote in my notebook on the day I began to figure out how to do that.


Let’s say you are uttering the sentence “The pen is blue.” At the moment you are pronouncing the word “pen” there are several causes active:
  • One cause is the sentence structure “The noun is adjective” and not some other structure.
  • Another cause is that the noun is “pen”, not “book” or some other noun.
  • Yet another cause is that you’re up to the noun in the sentence structure.
If you restrict your view of the world to single-parent hierarchies, then you need a node that enumerates every possible sentence as a distinct cause. By factoring the different aspects of the input into multiple causes, you get an efficient representation - just three causes in the above example - each contributing to the sequence. And you can more easily generalise to novel input: if you know about blue pens and red books, the concept of a red pen is not so hard.
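To make the counting argument concrete, here is a toy sketch of the factoring; the word lists and names are entirely mine, not Blerpl's representation.

    # Three kinds of cause combine multiplicatively, so a handful of
    # causes covers many sentences - including ones never seen.
    templates = ["The {noun} is {adj}."]
    nouns = ["pen", "book"]
    adjs = ["blue", "red"]

    sentences = [t.format(noun=n, adj=a)
                 for t in templates for n in nouns for a in adjs]
    print(sentences)
    # Four sentences from five cause-values. A single-parent hierarchy
    # needs a distinct cause for each whole sentence, and could never
    # produce "The pen is red." without having seen it.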

Something else very valuable happens when the network learns multiple causes. At the instant when the word “is” is being pronounced, neither the noun nor adjective causes are relevant, and they can take on “don’t care” states. This leads to the natural emergence of temporal slowness, as causes can spend much of their time being irrelevant to what is happening right now. The don’t care states need not be stored - which is key to how Blerpl works as a lossless compression algorithm.
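As a sketch of why the don't-cares are free, consider recording a cause's state only at the moments it is relevant; the representation below is mine, purely for illustration.

    # The noun cause across the four words of "The pen is blue.":
    # it only matters while "pen" is being pronounced.
    noun_states = [None, "pen", None, None]
    stored = [(t, s) for t, s in enumerate(noun_states) if s is not None]
    print(stored)   # [(1, 'pen')] - one entry instead of four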

NuPIC Limitations

If you’ve spent time working on recognition / classification problems using Numenta’s NuPIC framework, the idea of multiple high-level causes might seem a bit strange. Aren’t we trying to distil the main invariant feature from the input, to discover a label that summarises a group of patterns or sequences? The problem with that approach is that in general there isn’t one main cause, there are several.

The NuPIC algorithms are biased towards grouping particular kinds of patterns together, due to the choice of distance metric, the measure of similarity. This makes NuPIC efficient on the tasks for which it is designed, at the expense of generality. NuPIC cannot learn language, because the algorithms’ simplifying assumptions do not apply.

Let me give an example. If you train a NuPIC network on movies of dogs and cats walking and jumping, NuPIC will learn to distinguish dogs from cats, and the label that emerges at the top of the network will reflect the kind of animal, not its movement. To alter your network so that it identifies movement (walking versus jumping) at the top-level node, you would need to change the learning algorithm to bias it in a different way, to pass information about movement up the network while discarding differences in visual structure. And to identify both movement and animal, you need nodes with multiple parents, in a network that is not a pyramid. Perhaps that can be added to NuPIC, but we haven't seen it yet.

This is not a criticism of Numenta. The folks there are friends, I worked with some of them at Palm, I was privileged to visit their offices last year, and I have heaps of respect for their work. Jeff’s book and the original NuPIC release inspired me to work on HTMs in the first place. I’d be thrilled to see Numenta incorporate ideas from Blerpl. And NuPIC does a good job at what it does. Blerpl cannot yet handle a 32x32 pixel image, because the implementation uses exhaustive greedy search and runs out of memory building the network. But if we are to develop algorithms that are capable of general learning and intelligence, we have to recognise the current limitations.

Blerpl is an HTM algorithm, although it is very different from NuPIC. Blerpl focusses on language (in a very general sense) while NuPIC focusses on vision. While I don’t yet have all the answers, I think Blerpl is a way to learn something about some aspects of HTM theory which are missing from other implementations. I hope you’ll take the time to keep reading as I do my best to explain what I’ve learned, and ask questions along the way.

22 July 2008

Learning

In Blerpl, learning works like this: To start with, the world model is empty. The network contains only the input sensor nodes. These are initially assumed to be independent causes, because we have not yet discovered any dependency rules.


The network is fed a sequence of input data, and each node stores the binary sequence it received. Then the algorithm searches for correlation, or dependency, between the sequences at the nodes. If it finds some correlation between two nodes, this indicates that there is some higher-level cause on which both of these nodes depend. The network is augmented to represent the structure that it has discovered in the data.
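Here is a minimal sketch of that first step; the agreement measure and the threshold are assumptions of mine, and Blerpl's actual criteria are covered later.

    def correlated(seq_a, seq_b, threshold=0.8):
        """True if seq_b agrees with seq_a often enough (or rarely
        enough, for an inverse relationship) to suggest a shared
        higher-level cause."""
        agree = sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)
        return agree >= threshold or agree <= 1 - threshold

    in1 = [0, 1, 1, 0, 1, 0, 0, 1]
    in2 = [0, 1, 1, 0, 1, 0, 1, 1]   # tracks in1 in 7 of 8 steps
    print(correlated(in1, in2))      # True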


Here an edge (shown as an arrow with an open head) represents the dependency of in2 on in1. This says “the probability of in2 depends on the state of in1”. The learning algorithm has discovered that the probability of in2 receiving a one vs a zero is correlated with the state of in1, at least in the training input.

Probability? You mean the state of in1 is not enough to always predict the state of in2? In that case we’ve got some new independent causes - in2.0 is the state of in2 when in1 is zero, and in2.1 is the state of in2 when in1 is one.


In effect the in2 node is now a two-input multiplexer, with in1 as the select input and in2.0 and in2.1 as the data inputs. If in1 and in2 are always the same, then in2.0 is always zero and in2.1 is always one. But if they sometimes differ (i.e. the correlation is not perfect) then the error signal has somewhere to go, into the new independent nodes, where the algorithm might yet find a way to predict it.
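In code, the factoring might look like this sketch; the formulation is mine, not Blerpl's implementation.

    def mux(in1_state, in2_0, in2_1):
        """in2 as a two-input multiplexer: in1 is the select line."""
        return in2_1 if in1_state else in2_0

    def split_causes(in1_seq, in2_seq):
        """Factor in2 into two new independent sequences: its values
        when in1 was 0, and its values when in1 was 1. If the
        correlation is perfect, each new sequence is constant; any
        disagreement lands here, where it might yet be predicted."""
        in2_0 = [b for a, b in zip(in1_seq, in2_seq) if a == 0]
        in2_1 = [b for a, b in zip(in1_seq, in2_seq) if a == 1]
        return in2_0, in2_1

    print(split_causes([0, 1, 1, 0], [0, 1, 1, 1]))   # ([0, 1], [1, 1])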

So the algorithm has discovered some structure in the data, and inferred the existence of some new higher-level causes to explain it. The algorithm iterates in this way, looking for dependencies among all the nodes in the network, including the newly-created ones. There are a number of different kinds of correlation it can discover, representing spatial and temporal structure, which are represented as different edge annotations in the network. And so Blerpl’s model of the world grows.

This is just an introduction to give you the flavour of how Blerpl works, and we have a lot more detail to cover yet on the criteria that the algorithm uses to choose good network structures.

World Model

Here is the Blerpl learning algorithm’s view of the world:


Nothing too controversial here - hopefully we can agree that the world contains some causes, and that we receive information about the world from our senses. The structure of the world is what determines how those causes influence our sensor inputs. Blerpl’s job is to build a model of the world and discover that structure and those high-level causes, using only the sequences of information arriving at the inputs.

Each circle is a node, representing a cause. In Blerpl, causes are binary. That means that at any given moment, each node has a state, either zero or one. You might protest that the world contains causes with more than two states - Blerpl just models those as groups of binary causes. The choice of binary keeps the algorithm simple and general.
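For example, a four-state cause can be carried by a group of two binary causes; the season example is mine, for illustration only.

    seasons = ["winter", "spring", "summer", "autumn"]

    def season_bits(season):
        i = seasons.index(season)
        return (i >> 1) & 1, i & 1   # two binary causes, four states

    print(season_bits("summer"))     # (1, 0)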

Nodes in0, in1, in2 and in3 are the sensor input nodes, like the input switches on the Predictatron, or the pixels in an image sensor. Of course there could be a lot more than four of them.

Nodes a, b, c, d and e are high-level independent causes in the world. They determine what is going on in the world. Because they are independent, the only basis we have for predicting them is the statistics of what they have done in the past. Independent nodes are drawn with a bold outline.

The box is a model of the structure of the world. It contains a network that describes how the independent causes influence our inputs. The network (not shown) is made up of nodes called dependent causes, because their states are determined by their parent nodes. The network may contain many layers of nodes, but ultimately every dependent cause’s state is determined by the independent causes.

You can think of the world model network as a box of and/or/not gates, flip-flops and so on. The assumption here is that the world can be modelled with boolean logic. And if signals turn up that can’t be predicted with logic, they are modelled as - guess what - independent causes.
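As a concrete miniature in that style (an entirely hypothetical world of mine, not one Blerpl has learned):

    # a, b and c are independent causes; d is a dependent cause;
    # the sensor inputs are fully determined by the network.
    def world(a, b, c):
        d = a and not b              # a gate over its parents
        in0 = d or c
        in1 = a != c                 # xor
        return in0, in1

    print(world(True, False, True))  # (True, False)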

15 July 2008

Confession

Ok ok, so I haven’t really built a Predictatron in a box. But I really have developed and implemented a very general learning algorithm, called Blerpl. More about that soon.

The point of the switches and lights story was to frame the prediction problem in a very specific way. I want you to imagine stepping through the process of reading out predictions and feeding in observations one at a time. I want the data format to be binary, and for the interface to be completely generic, with no preconceived notions about the world that the learning algorithm is exposed to. And I want you to believe that, reduced to such a simple description, this might be a problem we can solve with computers in a very general way. There need be no magic involved.

10 July 2008

Predictatron

The other day I built a computer that predicts its input. I call it the Predictatron. Here’s a picture:


As you can see, the front panel has a row of switches and a row of lights, labelled “Observation” and “Prediction” respectively. The switches are the inputs, and the lights display the computer’s predictions of those inputs. The extra switch on the bottom-left is the clock. It has two positions, marked “Predict” and “Observe”. It works like this:
  • First you switch the clock to Predict. Straight away, the lights display the computer’s prediction for the states of the inputs that you are about to set on the switches. You can write this prediction down if you want.
  • Next you flip the input switches to whatever you want, and switch the clock to Observe. The computer observes the states of the inputs at that moment, and stores them away. It remembers the entire sequence of inputs since you turned the power on. It also updates some internal data structures, ready to make better predictions in the future.
  • Then you flip the clock back to Predict and so on, repeating the cycle over and over as you step your way through the sequence of inputs. Predict, Observe. Predict, Observe.
To predict further into the future than one clock cycle, you loop the Prediction outputs back into the Observation inputs. At each step, you’re saying “Ok what if you’re right - what happens then?” This way you can predict as far into the future as you want.
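The cycle is easy to express in code. Here is a sketch with a trivial stand-in predictor that just repeats the last observation; the real machine, of course, uses Blerpl.

    class Predictatron:
        def __init__(self, n_inputs):
            self.history = []        # every observation since power-on
            self.n = n_inputs

        def predict(self):
            # Stand-in rule: predict a repeat of the last observation.
            return self.history[-1] if self.history else [0] * self.n

        def observe(self, inputs):
            self.history.append(list(inputs))

    p = Predictatron(4)
    for switches in ([0, 1, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1]):
        lights = p.predict()         # clock to Predict: read the lights
        p.observe(switches)          # set the switches, clock to Observe
        print(lights, switches)

    # Predicting further ahead: loop the prediction back in as an
    # observation, then predict again.
    p.observe(p.predict())
    print(p.predict())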

If you just set the input switches randomly, you never get very good predictions. But if the input sequences contain any kind of pattern, structure, correlation, order, language, even if noisy, the Predictatron will do a decent job of learning the rules and making better predictions. The learning algorithm it uses is called Blerpl, which I will describe in detail in a future article.

14 February 2008

Data Compression

[Images: dog.png - 125 bytes, few but likely; noise.png - 246 bytes, many but unlikely.]

As you walk around the world looking at all the dogs and things, the images that fall on your image sensor are much more likely to be selected from the relatively small set of interesting images with visual structure, than from the very large set of images that look like random noise.

This is because the world is sparse. Visual structure consists of repetition and redundancy, such as the large area of empty space around the dog, or the number of similar pixels lined up in a row to make the dog's leg.

Lossless data compression works by exploiting sparseness - the structure and redundancy of information - to produce a representation with fewer bits. A compression method expects a certain kind of structure in the input; it reduces the size of inputs that have that kind of structure, at the expense of increasing the size of inputs that do not. No compression method can reduce the size of all possible inputs.
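Run-length encoding is perhaps the simplest illustration of that trade-off; the example is mine, not a method Blerpl uses.

    def rle(bits):
        """Collapse a bit string into (value, run length) pairs."""
        runs, i = [], 0
        while i < len(bits):
            j = i
            while j < len(bits) and bits[j] == bits[i]:
                j += 1
            runs.append((bits[i], j - i))
            i = j
        return runs

    structured = "0000000001111111"   # like the space around the dog
    noisy      = "0110100110010110"
    print(len(rle(structured)), len(rle(noisy)))   # 2 runs vs 11 runs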

Compression and intelligence are related, because both have to do with finding patterns and structure in information. It takes some intelligence to recognise that an image contains the patterns of a dog. It is sparseness that makes compression and intelligence possible.

13 February 2008

Patterns and Causes

This is an image of a dog. You recognised it because you have seen dogs before, although likely none exactly like this one. In fact the image that falls on your retina will probably never be exactly the same twice, even if you spend the rest of your life looking only at dogs. And yet all these possible images of dogs have a certain recognisable dogness in common.

This particular dog image was formed on a 32x32 pixel image sensor, with 1 bit per pixel. There are 2^1024 distinct images that can be perceived by a sensor of those dimensions; in decimal that number has over 300 digits. But only a very small fraction of those possible images look like dogs, or anything else recognisable. Most of the possible images look like random noise.
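Those numbers are easy to check:

    images = 2 ** (32 * 32)      # 1024 bits per image
    print(len(str(images)))      # 309 decimal digits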


An image of a dog is recognisable because the pixels are influenced indirectly by the presence of the dog. I say indirectly because the dog causes a head, body, legs and a tail to fall within the field of view. These dog-parts influence the number of photons reaching the various pixels on the image sensor, which in turn record light or dark pixels in the image.

People find it easy to recognise dogs, because the human neocortex is very good at learning patterns and remembering them. You weren't born knowing about dogs - your neocortex learned that from seeing lots of different images of them. And now your neocortex contains a hierarchical structure that records your memory of how dogs are made of dog-parts, and some essential features of the patterns in the appearance of dogs.

This blog is about how intelligence works, and how we might build computer systems that can recognise dogs and do other intelligent things like learning language.