metadatta.

Neural Networks III

January 27, 2007

Last week I wrote the second post in my ongoing series on Computational Neuroscience, based on a class I’m taking on the subject. I take notes on my computer in class anyway, so what I post here is a modified version of those notes and hence a reasonably good chronicle of the course.

While a lot of the development can be very technical, I’m trying to make these posts accessible to most people with a general interest in biophysics, neural networks or computational neuroscience. This means that I have to gloss over a lot of interesting details, including a lot of fun derivations, but that’s ok – for one thing, I don’t know how to easily incorporate equations into these posts (nor do I have the time to). However, I do provide references to more detailed papers when I feel the need.

Last week’s subject was Mikhail Bongard’s problems of pattern recognition, Herrnstein’s discovery that pigeons were particularly adept at solving them, more perceptrons and the credit assignment problem, and how backpropagation provided a means of solving it. The week before, I introduced McCulloch-Pitts neurons and perceptrons, particularly the perceptron convergence theorem and linearly separable problems. This week builds on the past two weeks: we’ll be looking at NETtalk, the Ising Model and Hopfield Networks, and Attractor Dynamics.


Credit Assignment, Backpropagation, and NETtalk
One of the key points from last week was the Credit Assignment Problem: when training neural networks to perform a task, how does one apportion credit (i.e. raising a ‘good’ node’s weight versus lowering a ‘bad’ node’s weight)? It turns out that this is quite difficult to do with simple perceptron networks, with Minsky and Papert’s book on perceptrons in 1969 essentially putting the final nail in the coffin for perceptrons, and shifting attention to the sexy new field of artificial intelligence.

Because of this, it wasn’t until 1986 that the problem was solved (D. E. Rumelhart, G. E. Hinton and R. J. Williams, “Learning representations by back-propagating errors”, Nature 323, 533 (1986)). The algorithm presented is known as ‘backpropagation’, and the idea is relatively simple. Take a multilayered system in which the input and output layers are separated by ‘hidden’ layers, and apply the chain rule for derivatives to it. Introduce activity into the network and let it flow to the top layer; change the connections between the top and next-to-top layer based on the error at the top; then ‘backpropagate’ the error and repeat this process down the hierarchy. In this way, Rumelhart, Hinton and Williams were able to show that the credit assignment problem could be solved.

[Image: perceptron.jpg]

Clearly, this was a Big Deal, and was instrumental in reinvigorating the dying/dead field of artificial neural networks.
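To make the chain-rule idea a little more concrete, here is a minimal sketch of backpropagation for a network with a single hidden layer. It is not the code from the Rumelhart, Hinton and Williams paper: the sigmoid units, the squared-error loss, the absence of bias terms, the 2-3-1 layer sizes and the learning rate are all assumptions made to keep the example short, and every function name is made up for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(W1, W2, x):
    """Let activity flow from the input x up to the top (output) layer."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in W2]
    return h, y

def loss(W1, W2, x, t):
    """Squared error between the output and the target t."""
    _, y = forward(W1, W2, x)
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, t))

def backprop(W1, W2, x, t):
    """Gradients of the loss with respect to W1 and W2, via the chain rule."""
    h, y = forward(W1, W2, x)
    # Error at the top layer, times the sigmoid's slope y * (1 - y).
    d_out = [(yk - tk) * yk * (1 - yk) for yk, tk in zip(y, t)]
    # Send that error back down through W2 to assign blame to hidden units.
    d_hid = [hj * (1 - hj) * sum(d_out[k] * W2[k][j] for k in range(len(W2)))
             for j, hj in enumerate(h)]
    gW2 = [[dk * hj for hj in h] for dk in d_out]
    gW1 = [[dj * xi for xi in x] for dj in d_hid]
    return gW1, gW2

def train_step(W1, W2, x, t, lr=0.5):
    """One gradient-descent update of both weight layers, in place."""
    gW1, gW2 = backprop(W1, W2, x, t)
    for W, g in ((W1, gW1), (W2, gW2)):
        for row, grow in zip(W, g):
            for j in range(len(row)):
                row[j] -= lr * grow[j]

random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]  # 2 inputs -> 3 hidden
W2 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(1)]  # 3 hidden -> 1 output
x, t = [1.0, 0.0], [1.0]
before = loss(W1, W2, x, t)
for _ in range(100):
    train_step(W1, W2, x, t)
print("loss before:", before, "after:", loss(W1, W2, x, t))
```

The hidden-layer error `d_hid` is where the credit assignment happens: each hidden unit is blamed in proportion to how strongly it feeds the output units that were wrong.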

There was another Big Deal yet to come. When backpropagation first came out, people applied it to a number of problems, including a number of toy problems (the TC problem, family trees, the 8-3-8 encoder). The last one – the 8-3-8 encoder – is pretty interesting: using just 3 hidden layer nodes, the network can reproduce its 8-bit input as the output with no loss of information, for each of the 8 input patterns in which exactly one bit is on. The system acts a lot like a compression algorithm, encoding a pattern in some way that is more efficient – and this is possible because there are so many connections between all these nodes, and backpropagation allows the network to find a robust way of performing this compression and decompression. (On a side note, 3 is the minimum number of hidden units if each hidden unit is restricted to being on or off: two binary units can only distinguish 4 patterns, so an 8-2-8 encoder of that kind is impossible.)
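To see why 3 hidden units are enough, here is a small hand-wired 8-3-8 encoder. This is only a sketch under some loud assumptions: the weights are written down by hand rather than learned by backpropagation, the units are simple on/off threshold units rather than the smooth units backpropagation needs, and the inputs are the 8 patterns with exactly one bit on. All names are invented for the example.

```python
N = 8  # input and output units
H = 3  # hidden units

def step(x):
    """Threshold unit: on (1) if its net input is positive, else off (0)."""
    return 1 if x > 0 else 0

# Hidden unit j is wired to input i only if bit j of the number i is set,
# so the hidden layer ends up holding i written in binary.
W_in = [[(i >> j) & 1 for i in range(N)] for j in range(H)]

# Output unit k gets +1 from hidden bits its own code has, -1 from bits it lacks.
W_out = [[1 if (k >> j) & 1 else -1 for j in range(H)] for k in range(N)]

# With this threshold, output k fires only when the hidden code equals k exactly.
theta = [bin(k).count("1") - 0.5 for k in range(N)]

def encode(x):
    return [step(sum(w * xi for w, xi in zip(row, x)) - 0.5) for row in W_in]

def decode(h):
    return [step(sum(w * hj for w, hj in zip(row, h)) - th)
            for row, th in zip(W_out, theta)]

for i in range(N):
    x = [1 if k == i else 0 for k in range(N)]
    h = encode(x)
    print(x, "->", h, "->", decode(h))
```

The 8 input patterns use up all 2 × 2 × 2 = 8 possible on/off hidden codes, which is the counting argument for why two on/off hidden units cannot do the job. A trained network is free to find a different, less tidy code than the binary one used here.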

The big application of backpropagation, however – the one that got everyone talking – was Sejnowski and Rosenberg’s NETtalk, in 1986. The objective was to get a backpropagation network to learn how to read English text – to go from a string of English letters to phonemes. Think about it – how do you know which phoneme to use just by looking at a letter and the local context it’s in? This is the art of reading, and Sejnowski and Rosenberg’s goal was to train a network to do this, by feeding in text, having the network output phonemes, then feeding this output into a voice synthesizer. The tricky part is in getting the network to learn English’s weird phonological rules – there are so many arbitrary ways of pronouncing things (for example, consider the ‘a’ in the words ‘have’ and ‘gave’). And unlike children, who tend to understand words long before they can actually read them, the network doesn’t know a thing about etymology, semantics, or meaning. Sejnowski (whose doctorate was in physics!) and Rosenberg developed this network – NETtalk – and it was incredible. Here’s a graph from the original paper (which can be found here, along with an MP3 of the system actually reading – it starts out learning and speaking gibberish, and eventually is reading text with virtually no errors):

[Image: nettalk.jpg]
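For a feel of what "a letter and the local context it's in" might look like as network input, here is a rough sketch of a sliding-window encoding. The window width, the alphabet, the padding symbol and the function names are all assumptions for illustration, not details taken from this post, and the phoneme output side is left out entirely.

```python
# Hypothetical alphabet: 26 letters, a space, and "_" as padding at the edges.
ALPHABET = "abcdefghijklmnopqrstuvwxyz _"

def one_hot(ch):
    """One input unit per symbol; only the unit for ch is switched on."""
    vec = [0] * len(ALPHABET)
    vec[ALPHABET.index(ch)] = 1
    return vec

def windows(text, width=7):
    """Yield (center_letter, input_vector) for every position in text.

    Each input vector is the one-hot codes of the letters in the window,
    laid end to end, so the network sees the center letter plus its neighbors.
    """
    half = width // 2
    padded = "_" * half + text.lower() + "_" * half
    for i in range(len(text)):
        window = padded[i:i + width]
        vec = [bit for ch in window for bit in one_hot(ch)]
        yield window[half], vec

for center, vec in windows("gave"):
    print(center, len(vec), sum(vec))
```

The network's job would then be to map each such vector to the phoneme for the center letter, so the ‘a’ in ‘have’ and the ‘a’ in ‘gave’ arrive with different neighbors and can be pronounced differently.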

The professor teaching my computational neuroscience course was at the conference where Sejnowski and Rosenberg presented their work, and described it for us:

They first presented the work, and it was interesting, in an abstract academic sort of way. Then they played the tape of the thing actually learning, then speaking… it was incredible. Everyone was impressed, bewildered, humiliated. The response was dead silence – nobody asked anything, nobody applauded, nothing. This was incredible.

NETtalk’s success totally brought neural networks back from their decrepitude, and everyone started using backpropagation for everything (with varying success), including stock market prediction and protein folding.

Since then, a lot of work has been going on in the field of neural networks. There are approaches other than backpropagation: for example, the method of radial basis functions (if you are attempting to solve a classification problem in a given space by drawing a decision boundary, why restrict yourself to a linear boundary? Using techniques from functional analysis, it may be possible to define a suitable curved surface that works where no hyperplane does); or using support vector machines (a related idea, in which the decision boundary hyperplane is defined by maximizing the amount of ‘blank space’, or margin, between it and the nearest data points you’re attempting to classify).
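As a concrete, entirely illustrative case of the radial basis function idea, take four points at the corners of a square, where the task is to tell (0,0) and (1,1) apart from (0,1) and (1,0). No straight line does this. In the sketch below, the two Gaussian centers, the width and the threshold are hand-picked assumptions; a real method would fit them from data.

```python
import math

def rbf(x, center, width=1.0):
    """Gaussian bump: equals 1 at the center and decays with squared distance."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-d2 / width ** 2)

CENTERS = [(0, 0), (1, 1)]  # hypothetical, hand-picked centers

def features(x):
    """Re-describe the point x by how close it is to each center."""
    return [rbf(x, c) for c in CENTERS]

def classify(x, threshold=0.95):
    # A plain linear rule, but applied to the new features instead of to x.
    return 1 if sum(features(x)) > threshold else 0

for point in [(0, 0), (1, 1), (0, 1), (1, 0)]:
    print(point, [round(f, 3) for f in features(point)], classify(point))
```

The rule is a straight line in the space of the two features, but seen from the original square it is a curved boundary that wraps around the two corners near the centers.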

The Ising Model, Hopfield Networks, and Attractor Dynamics
Another important idea from neural networks in the context of modern computational neuroscience is that of the Hopfield network. Although people have connected things like support vector machines to the workings of the cerebellum, these connections are tenuous at best; a more concrete model that is applicable to biological systems is that of an attractor network in the style of Hopfield. Hopfield was initially trained as a physicist and worked on, among other things, spin glasses and ferromagnets – with enormous success. It was when he was exposed to the world of biology, however, that he had the insights which went on to make him famous across a number of fields.

Suppose we have N neurons, and every one of them can be either on or off – then we can represent the state of our network as a point in an N-dimensional space. The state of the network at the next time step can then be found by considering the interactions between the various neurons, summing up all the states of activity, and weighting things accordingly – kind of like the McCulloch and Pitts neurons I talked about in my first post. Then we iterate.

So while we start off with a given point in this N-dimensional space, carrying out this procedure and iterating every time step leads to a trajectory in this space of all possible states. Moving along this trajectory represents changing the state of this network. An attractor point, then, is a point in this state space where, once you’ve gotten there, you stay there – kind of like a black hole. This is the notion of an attractor state in dynamical systems. (I wonder if one can associate a Schwarzschild radius to an attractor point? This isn’t a deep question, just idle speculation from someone who doesn’t have much experience with dynamical systems.)
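Here is a minimal sketch of that iterate-until-you-stop-moving procedure for a tiny network of 8 neurons. Several things in it are assumptions rather than anything stated above: the states are written as +1 and -1 instead of on and off, the weights come from a simple Hebbian rule (each pair of neurons is connected according to whether they agree in the stored pattern), and the neurons are updated one at a time in a fixed order. The function names are made up for the example.

```python
def sign(h):
    return 1 if h >= 0 else -1

def hebbian_weights(patterns):
    """Connect neurons i and j positively if they agree in the stored patterns."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    W[i][j] += p[i] * p[j] / n
    return W

def sweep(W, state):
    """Update each neuron once, in order, from the weighted sum of the others."""
    s = list(state)
    for i in range(len(s)):
        s[i] = sign(sum(W[i][j] * s[j] for j in range(len(s))))
    return s

def settle(W, state, max_sweeps=100):
    """Iterate until the state stops changing, i.e. until an attractor is reached."""
    for _ in range(max_sweeps):
        new = sweep(W, state)
        if new == state:
            return new
        state = new
    return state

stored = [1, -1, 1, 1, -1, -1, 1, -1]
W = hebbian_weights([stored])

noisy = list(stored)
noisy[0] *= -1  # corrupt two of the eight neurons
noisy[3] *= -1

print("start:  ", noisy)
print("settled:", settle(W, noisy))
print("stored: ", stored)
```

Starting from the corrupted state, the trajectory falls into the stored pattern and stays there, which is what makes the stored pattern an attractor point.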

Hopfield saw an analogy between the mathematics of spin glasses/ferromagnets and these kinds of attractor networks. In the famous Ising model (in the context of ferromagnetism), one considers a crystal of 2-state magnetic moments (‘up’, or ‘down’) and nearest neighbor interactions between them. This can be thought of as being sort of a network, with ‘up’ representing a node being ‘on’, and ‘down’ representing a node being ‘off’. In the Ising model, just as in networks, the interactions between neighboring moments (or nodes) are important, since each moment determines its state by looking at its neighbors.

[Image: fmag.jpg]

The nice thing about a magnet is that if you apply a strong enough field, you can force the magnetic moments to point in a certain direction. This is the lowest energy state of the system, and it will tend to stay in this stable low energy state – hence, among other things, magnetic domains on a computer hard disk! (Image of ferromagnetic domains from G. Popov et al., PRB 65, 064426 (2002)). This is a very physical instantiation of a memory, and leads to the question – do these kinds of systems (describable, to a certain extent, by the Ising model and/or attractor networks) have any connection to real memories in neurological systems? Hopfield was one of the first to ask this question and draw connections between these seemingly different fields, and this was a very exciting thing back then. It still is.
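To put a number on "lowest energy state", here is a small sketch that scores a configuration of moments by its energy. It makes several simplifying assumptions: a one-dimensional chain of 6 moments stands in for the crystal, the coupling strength is set to 1, there is no applied field, and the energy is taken in the standard pairwise form, minus one half of the sum of W[i][j] * s[i] * s[j]. The function names are made up for the example.

```python
def energy(W, s):
    """Pairwise energy: -1/2 * sum over i, j of W[i][j] * s[i] * s[j].

    s[i] is +1 ('up') or -1 ('down'); the 1/2 is there because a symmetric W
    counts every pair twice.
    """
    n = len(s)
    return -0.5 * sum(W[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def ising_chain(n, J=1.0):
    """Couplings for an open chain: each moment talks only to its neighbors."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        W[i][i + 1] = W[i + 1][i] = J
    return W

W = ising_chain(6)
all_up = [1, 1, 1, 1, 1, 1]
one_wall = [1, 1, 1, -1, -1, -1]      # two domains meeting in the middle
alternating = [1, -1, 1, -1, 1, -1]   # every neighbor pair disagrees

for name, s in [("all up", all_up), ("one wall", one_wall), ("alternating", alternating)]:
    print(name, energy(W, s))
```

The fully aligned configuration has the lowest energy and every disagreement between neighbors costs energy. The same `energy` function, given a weight matrix between all pairs of neurons instead of nearest-neighbor couplings, is the quantity usually written down for a Hopfield network, which is one way to see the analogy.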

If I had the ability to include equations on this blog, and if I had an infinite amount of free time, I would work out the math behind the Ising model (I like the Kramers-Wannier matrix method presented in Chandler’s Modern Statistical Mechanics), the math behind Hopfield networks, and show how they are similar. But I don’t know how to easily include equations in these posts (hence trying to keep things as non-technical as possible), and classes are kind of taking over my life – so I won’t. So just take my word – the two are very similar, and it was Hopfield who drew this connection first in 1982. Since then, a number of physicists have made enormous progress in the field, and one could argue that it was with Hopfield’s work that the field of computational neuroscience really took off.

