Reinforcement Learning and AI

Reinforcement learning, the way it is now practiced in AI, is just old-fashioned Behaviorism on modern steroids, so here is my take on Behaviorism, for anyone who is into that whole not-repeating-the-mistakes-of-history thing.

The Behaviorist paradigm, as it originally appeared in the field of psychology, says that an intelligent system looks at the current state of the world, decides on a response (an action) to give, and then waits for some kind of reward to come back from the world. What happens next depends on which exact variant of Behaviorism you subscribe to, but the general idea is that if the reward comes, the system remembers to do that action more often in the future, in response to the same state of the world. If no reward comes, the system may just do something random (which could include repeating the previous behavior), or, if it is trying to be exploratory, it might make a point of trying some other response (to develop a good baseline against which to judge the efficacy of different responses). If the system gets the opposite of a reward (something it considers to be bad), then it will generally avoid that behavior in the future. What is supposed to happen is that by starting out with random responses, the system can home in on the ones that give rewards.
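For readers who think better in code, here is a minimal sketch of that cycle. It is a toy, not any particular variant of Behaviorism or any library's algorithm: the `world` function, the `pigeon_box` experimenter, the state and action names, and the `explore` rate are all assumptions made up for illustration.

```python
import random
from collections import defaultdict


def behaviorist_loop(world, states, actions, trials=2000, explore=0.1, seed=0):
    """Toy state -> action -> reward -> reinforcement cycle.

    `world(state, action)` is a hypothetical stand-in for the environment:
    it returns +1 (reward), 0 (nothing), or -1 (something bad).
    """
    rng = random.Random(seed)
    # Learned tendency to give each response in each state; starts neutral.
    strength = defaultdict(float)

    for _ in range(trials):
        state = rng.choice(states)
        if rng.random() < explore:
            # Exploratory: make a point of trying some response at random.
            action = rng.choice(actions)
        else:
            # Otherwise give the strongest response so far, ties broken randomly.
            best = max(strength[(state, a)] for a in actions)
            action = rng.choice([a for a in actions if strength[(state, a)] == best])
        # Reward strengthens the response, punishment weakens it, 0 changes nothing.
        strength[(state, action)] += world(state, action)
    return strength


def pigeon_box(state, action):
    # Hypothetical experimenter: pecking while the light is on earns food,
    # pecking while it is off earns something unpleasant, waiting earns nothing.
    if action == "peck":
        return 1 if state == "light_on" else -1
    return 0


strength = behaviorist_loop(pigeon_box, ["light_on", "light_off"], ["peck", "wait"])
print(strength[("light_on", "peck")], strength[("light_off", "peck")])
```

Notice how much was handed to the loop for free: the list of states, the list of actions, and a `world` function that delivers the right reward at exactly the right moment.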

But in this tidy little behaviorist scheme, there are many gaps and ever-so-slightly circular bits of argument.

For example, what exactly is the mechanism that decides what the reward signal is, and when it arrives? If we are talking about a pigeon in a box and an experimenter outside, it’s easy. But what happens with a real-world system? A child looks at the picture book and Mom says “Spot is happy” when they are both looking at a picture of a happy Spot. What is the reward to be had if the child’s brain decides to associate the words and image?  And why associate the “happy” word with the smile on Spot’s face? Why not “is happy” and the image of that black spot on Spot’s nose?  And could it be that the first time that a reward actually happens is a few months later, when the child first says “Esmerelda is happy!” and someone responds “You’re right: Esmerelda is happy!”?

Also, what is the mechanism that chooses the relevant “states” and “actions” when the system is trying to decide what actions to initiate in response to which states? The current state of the world has a lot of irrelevant junk in it, so why did the system choose a particular subset of all the junk and focus only on that? And in the case of a child, the actions chosen are clearly not random (random choice being what would allow the system to do the right Behaviorist thing and explore all the possible actions to figure out which ones get the best rewards), so what kind of machinery is at work, preselecting actions that the system could try?

The short answer to this barrage of questions is that the Reinforcement Learning mechanism at the center is surrounded by other mechanisms that (a) choose which rewards belong with which previous actions, (b) choose which aspects of the state of the world are relevant in a given situation, and (c) choose which actions are good candidates to try in a given situation.
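To make (a), (b) and (c) concrete, here is a rough sketch in which the RL core takes those three mechanisms as plug-in functions. Everything in it is hypothetical: the helper names, the picture-book observations, and the hand-written rules inside the helpers are assumptions for illustration, not a claim about how a child or any real system does it.

```python
from collections import defaultdict


def run_core(observations, relevant_state, candidate_actions, assign_credit):
    """The RL core. The interesting decisions are delegated to three helpers."""
    strength = defaultdict(float)
    history = []
    for obs in observations:
        state = relevant_state(obs)          # (b) which aspects of the world matter
        options = candidate_actions(state)   # (c) which actions are worth trying
        # Pick the strongest response so far (first option wins a tie).
        action = max(options, key=lambda a: strength[(state, a)])
        history.append((state, action))
        # (a) which earlier state-action pairs this outcome belongs to
        for pair, amount in assign_credit(history, obs).items():
            strength[pair] += amount
    return strength


# Hypothetical, hand-written helpers for the picture-book scene.
def relevant_state(obs):
    # Attends to the face and ignores obs["nose"], obs["page"], and the rest.
    return obs["face"]


def candidate_actions(state):
    # Preselects two sensible things to say instead of every possible utterance.
    return ["say happy", "say sad"]


def assign_credit(history, obs):
    # Assumes feedback is immediate and belongs to the most recent action.
    state, action = history[-1]
    return {(state, action): 1.0 if action == obs["adult_agrees_with"] else -1.0}


pages = [
    {"face": "smile", "nose": "black spot", "page": 1, "adult_agrees_with": "say happy"},
    {"face": "frown", "nose": "black spot", "page": 2, "adult_agrees_with": "say sad"},
] * 5

strength = run_core(pages, relevant_state, candidate_actions, assign_credit)
print(dict(strength))
```

The core is a handful of lines. What makes it work is that the three helpers were written by someone who already knew that faces matter, noses do not, and praise refers to the last thing said.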

Here’s the problem. If you find out what is doing all that work outside of the core behaviorist cycle of state-action-reward-reinforcement, then you have found where all the real intelligence is, in your intelligent system. And all of that surrounding machinery — the pre-processing and post-processing — will probably turn out to be so enormously complex, in comparison to the core reinforcement learning mechanism in the middle, that the core itself will be almost irrelevant.

But it gets worse. By the time you’ve discovered the preprocessing and post-processing mechanisms, you will probably have long ago thrown away the Reinforcement Learning core in the middle, because the need to get everything just right for the RL core to do its thing will have turned into a completely pointless exercise. It will be obvious that the other mechanisms are so clever in their own right that they won’t need the Behaviorist cycle in the middle. You will just want to cut out the pointless middleman and hook the preprocessing and post-processing mechanisms directly to each other.

In fact, if you want to get into the nitty-gritty, what you will do is distribute the idea of states, responses and rewards all over the insides of your other machinery. Yes, the basic idea will be there, but it will just be one more trivially small facet of the larger picture. Not worth making a fuss about, and not worth putting on some kind of throne, as if it were the centerpiece of the intelligent system’s design.


What is funny about this story is that everything I just wrote was discovered by cognitive psychologists when they emerged from under the smothering yoke of the behaviorist psychologists, in the late 1950s. Behaviorism was realized to be pointless 60 years ago. But for some reason the AI community has rediscovered it and has no interest in hearing from those who already know the reasons why it is a waste of time.