Alex Beal's 🆎 blog - How A Computer Learned to Play Pac-Man

Take a look at the video below. Don’t worry, the clip is only 30 seconds long.

What if I told you that not only is Pac-Man controlled by a computer, but that the computer is an AI agent that learned to play through trial and error? ¹ When this technique was first demonstrated to me, I was thrown for a loop. How is this sort of thing possible? Not only must it learn what’s good and what’s bad (at the start it doesn’t even realize that dying is bad), it must also learn to exploit these details and come up with a strategy. Think about all the different choices that are involved. If a ghost is charging me, should I run? If a ghost is scared, how do I decide between pursuing it, and going after the food pellets? Should I try to stay as far away from ghosts as possible, or should I risk approaching them in order to improve my score? At what threshold does this become too risky? And once all of this is decided and implemented, surely the resulting code applies only to this one game. A strategy with such specific cases couldn’t apply to other domains. Incredibly, though, it does. Take a look at this video, of a crawling machine learning to pull itself forward, using the same algorithm: ²

So how is this possible? The algorithm used is called Q-learning [wiki], and the basic premise is this: There is some learning agent and it is contained in a world with different possible states. For Pac-Man, his world is the game maze, and the states are the different configurations of pellets, where the ghosts are, if they’re scared or not, and so on. Given one of these states, there are possible actions that can be taken. For example, Pac-Man can move left, or right, or up, or down. These actions result in more possible states. So, if Pac-Man choses the “up” action, the new state is the same, except he is now one square up, and the ghosts have chosen new positions. Additionally, these moves can result in good things happening (getting a food pellet, or winning the game), or bad things happening (losing the game). The key to using this information is to find out which actions usually result in good outcomes, and which result in bad. So if in the current state there is a food pellet above me, and I move up to eat the food pellet, I want to remember that, in this state, moving up was a good decision because my score increased. This is essentially how training the AI agent works. It plays the game repeatedly (in this case, 100 times), and builds a table of memories. It remembers, for a given state, which actions resulted in which consequence. If it’s in a state it hasn’t encountered, it will try something random, and add the result to its table of memories. If it’s in a state it has encountered, it can either pick the best action, based on memory, or try something new, and revise its memories accordingly. Once training is finished, all moves are based on memories, and no experimentation occurs. At the limit, the optimum action is always the one with the highest expected reward, which can be calculated from the table of memories.

It’s interesting to compare this to how humans learn. Think back to when you learned to drive a stick shift. Your first attempt probably resulted in burned rubber and a stalled engine. For your next attempt, you tried something different by releasing the clutch more slowly, but the engine still stalled, just more gently this time. After an hour of failed attempts, and (embarrassing) memories, you eventually learned what works and what doesn’t. Now that you’ve learned the proper technique, you don’t experiment any more. You simply execute the optimum policy.

Bringing this analogy back to Q-learning, we begin to see some issues. If you were actually executing basic Q-learning, and building your table of memories, you wouldn’t just remember that dropping the clutch in is bad, you’d have to remember that, given the current state of the universe, dropping the clutch in is bad. The first issue is that this is impossible. The state of the universe can’t be held in memory. The second is that the entire state of the universe contains a lot of irrelevant information. The current position of the moon doesn’t make a difference as to whether or not releasing the clutch too quickly is bad. The last issue is that this prevents you from generalizing to other events. If you were really doing Q-learning, starting your car in your garage would seem different to you than starting it at, say, 30th and Broadway. This is a problem even for simplified environments like the one Pac-Man is in. As the game world grows, keeping a table of every single game state becomes intractable, and so does the time it takes to train the AI agent. Further, the AI agent can’t generalize between states. It doesn’t know that being killed by a ghost is bad unless it has a memory of being killed by a ghost with the game world in that exact state. So even though it knows that being killed in square (1,1) is bad, it won’t know that being killed in square (2,1) is bad, until it has experienced dying there.

This problem is such a show stopper, that using basic Q-learning on Pac-Man worlds of even moderate size has terrible results. This is why the Pac-Man agent in the video above uses a slightly modified version of Q-learning which solves this problem. Rather than recording the exact state of the world, it only records the parts (or features) of the world that are relevant. One relevant feature might be whether or not it is surrounded by ghosts. Once it learns in square (1,1) that being surrounded is bad, it will recognize this same fact in square (2,1) and in every other square. This is because it’s inspecting the world for features instead of states. The practical implication of this is that the table of memories now contains features rather than states, and now the programmer must come up with a list of features that are relevant. Unlike basic Q-learning, this part of the algorithm is game specific. This is still really neat, though, as the programmer only has to tell the agent what features are important, not what the features mean. In English, this means I can tell the Pac-Man agent, “Pay attention to how many ghosts are near you, but you’re going to have to figure it out for yourself whether being surrounded by ghosts is good or bad.” It’s then up to the AI agent to fill in these details during its training runs. To me, this is incredible.

So those are the basics of reinforcement learning, so called because previous memories are reinforced with punishments and rewards from new experiences. The power of these techniques is obvious. They are what enabled this robot to learn to crawl, and this one to flip pancakes.

If you want to learn more about these techniques, you can access Berkeley’s AI course here, and make your own crawler and Pac-Man agent. Udacity also offers a free AI course, but from its syllabus it doesn’t look like it addresses Q-learning specifically.

The framework, GUI, etc. were implemented by the folks running this course. I implemented the learning code.↩︎
This is actually using a simplified version of the algorithm above, but the point is still the same. The algorithm itself isn’t application specific.↩︎