Reinforcement Learning Tetris Example

In a previous AI life, I did some research into reinforcement learning: Q-learning, TD-learning, and so on. Initially we wanted to use these techniques to train a robot soccer team, but we soon learned that they were simply the wrong tool for the job. Other than Tesauro's work on backgammon, we found few worthwhile applications of reinforcement learning, and nothing similar to our soccer problem domain. Desperately stuck in a situation where we had to do something to pass the course, it suddenly hit me: "How about Tetris?"

Motivation

Consider the game of Tetris. The player is continually given pieces of varying shape that must be positioned and rotated, then dropped onto the pieces below. Since the pieces pile up, the player must try to stack them efficiently. Furthermore, if the player manages to complete a row, that row disappears, freeing up more space. The objective is to keep the height of the pile as low as possible. Note that the shape of each subsequent piece is random, which makes it difficult to plan ahead (the game is non-deterministic).

People who play many hours of this addictive game will likely develop strategies for playing. For example, they may try to fill up the corners and single-wide gaps first, while avoiding situations where free spaces get covered by pieces above and become inaccessible.

We tested whether an agent could learn to play Tetris without being given heuristics or directly programmed strategies. The guts of the agent is a utility function (or evaluation function) that assigns a value to each configuration of pieces (state). When the agent is given a random piece to add, it evaluates all the valid placements and chooses the action that results in the best configuration according to the utility function. This architecture is very similar to what Tesauro used for TD-Gammon. Unlike Tesauro's neural network, however, we used a table to represent our utility function. The table is initially set to 0 for all states, and it is up to the agent to learn the table entries as it places the Tetris pieces (i.e. as it moves from state to state).
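
To make this concrete, here is a minimal sketch (in Python; an illustration rather than our original code) of the greedy move selection just described. The helper valid_placements(), which yields the state reached by each legal position and rotation of the current piece, is hypothetical:

 
        def choose_placement(U, state, piece, valid_placements):
            # U is the utility table described above; every entry starts at 0.
            # The agent simply picks the successor state with the highest utility.
            best_state, best_value = None, float("-inf")
            for next_state in valid_placements(state, piece):
                if U[next_state] > best_value:
                    best_state, best_value = next_state, U[next_state]
            return best_state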

Smaller Version of Game

For our experiments we used a smaller implementation of the Tetris game. The pieces are at most 2x2 in size:

[figure: the set of pieces used in the smaller game]

The area where pieces are placed is 6 units wide, and the maximum working height is 2. Some examples of states are:

[figure: example board states]

As in the regular game of Tetris, rows are removed when they become full. The following picture (going left to right) shows how this happens when a piece is added:

[figure: a row being completed and removed]

When the working height of 2 is exceeded, the bottom row is removed and the total height (the agent's score for the game) is incremented:

[figure: the pile exceeding the working height]

Our games consisted of 10000 pieces each. Even though the game parameters are reduced, the number of states (and hence the size of our table) is 2^(6*2) = 4096, which is rather big but manageable.
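
For concreteness, one way to index such a table (a sketch; not necessarily how our implementation did it) is to treat each of the 12 cells as a single bit, so every board maps to an integer between 0 and 4095:

 
        def state_index(board):
            # board is a list of 2 rows, each a list of six 0/1 cells
            index = 0
            for row in board:
                for cell in row:
                    index = (index << 1) | cell
            return index

        U = [0.0]*4096    # the utility table, initialised to zero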

An example of a state that should have a low utility value is:

[figure: an example low-utility state]

If the next piece is a solid 2x2 block, the agent will be forced to increase the total height of the pile by 2 units. It is therefore desirable to avoid getting into states like this one.

Implementation and Results

Using reinforcement learning, the agent adapted its utility function and dramatically improved its performance as it played a number of games. The following table and plot show the improvement in the agent's score (total height) over a number of 10000-piece games:

[plot: total height versus number of games played]

 
  Game  Score 
  ------------
     1   1485 
     2   1166 
     4   1032 
     8    902 
    16    837 
    32    644 
    64    395 
   128    303 
   256    289 

Note that a lower score is better than a higher one. To keep the table short, only game numbers that are powers of 2 are reported.

The agent used the following incremental learning update rule:

 
        U(state) = U(state)*(1-alpha) + (reward+gamma*U(nextstate))*alpha

Where U() is the utility function table. The agent is penalized with a reward of -100 for each level it goes above the working height; otherwise the reward is 0. The intuition behind using a discount factor, gamma < 1.0, is that even though we may eventually be forced to place a piece that puts us above the working height, it is best to put that off for as long as possible. A discount factor also bounds the utility values, which could otherwise grow toward positive or negative infinity. Using incremental learning (alpha < 1.0) instead of just making the direct assignment:
 
        U(state) = reward + gamma*U(nextstate)

is necessary because of the non-deterministic element in the Tetris environment (the shape of the next piece is random).
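
In code, this update might look like the following sketch (an illustration, not our original implementation; the discount factor shown is an assumed placeholder since we do not report the exact value used):

 
        ALPHA = 0.02    # learning rate; the constant used in our experiments
        GAMMA = 0.8     # discount factor; gamma < 1.0 (exact value assumed)

        def update(U, state, reward, next_state):
            # reward is -100 per level the pile rises above the
            # working height, and 0 otherwise
            U[state] = U[state]*(1 - ALPHA) + (reward + GAMMA*U[next_state])*ALPHA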

Attempted Learning Enhancements

Ideally, the value of alpha should be decreased over time to help the utility values converge. Without a strategy for decaying alpha, we simply used a small constant value of 0.02. This turned out to be a fairly good choice, as we found when we later tried other values for comparison:

 
                value of Alpha
        Game  0.002  0.02   0.2
        -------------------------
           1   1451  1485  1404
           2   1204  1166  1043
           4   1043  1032   752
           8    971   902   525
          16    938   837   420
          32    912   644   370
          64    955   395   342
         128    848   303   339
         256    679   289   351

Learning is clearly slower with the lower alpha value of 0.002. The higher value (0.2) learns more quickly at first, but never reaches the performance achieved with alpha set to 0.02.
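
For reference, one standard way to decay alpha over time (which we did not actually do; the schedule and constants below are just an example) would be:

 
        def decayed_alpha(alpha0, n, tau=10000.0):
            # harmonic decay: starts at alpha0 and shrinks as the number
            # of updates n grows; tau sets how quickly it falls off
            return alpha0 / (1.0 + n/tau)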

Using only the adjacent-state update rule (instead of the extended TD learning update rule) was sufficient for the Tetris learner. This is because we were able to apply penalties frequently (every time the blocks stack too high), whereas a backgammon game, which consists of many moves, must be completed before the reward or penalty is applied. Furthermore, Tesauro used backpropagation neural networks, which are notoriously slow to train. We nonetheless experimented with using TD learning for Tetris. The following table compares our previous results with a TD-enhanced version of our agent:

 
        Game     Q    TD
        -------------------
           1   1485  1520
           2   1166  1182
           4   1032  1000
           8    902   903
          16    837   855
          32    644   635
          64    395   394
         128    303   289
         256    289   256

As the results show, we were unable to achieve any significant improvement in learning. This is understandable, since in our Tetris environment the effects of actions are realized sooner and actions do not have lasting consequences. Given the different types of pieces and all of their possible placements, a state can have up to 52 possible next states, and it is possible to get from any state to most others within 3 moves. TD learning is more likely to help in an environment with more states and a lower branching factor.
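
For completeness, here is a sketch of what a TD-enhanced update could look like, assuming the common TD(lambda) eligibility-trace formulation (the lambda and gamma values are assumptions, and this is not necessarily the exact variant we implemented):

 
        ALPHA, GAMMA, LAMBDA = 0.02, 0.8, 0.7    # GAMMA and LAMBDA assumed

        def td_update(U, trace, state, reward, next_state):
            # propagate the one-step error back to all recently visited
            # states, weighted by their exponentially decaying traces
            error = reward + GAMMA*U[next_state] - U[state]
            trace[state] = trace.get(state, 0.0) + 1.0
            for s in list(trace):
                U[s] += ALPHA * error * trace[s]
                trace[s] *= GAMMA * LAMBDA
                if trace[s] < 1e-4:
                    del trace[s]    # drop negligible traces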

Future Work

In most learning situations, an agent must decide whether to choose what it thinks is the best action, or to explore a different action in the hope of finding a better solution. It almost seems odd that our Tetris agent is able to continually improve its performance even though it only ever chooses the best action. One explanation is that this game is a "losing battle": utility values are initially set to 0 and tend to move downward. Nonetheless, we believe that adding an exploration strategy would provide some benefit.
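
A simple candidate would be epsilon-greedy selection, sketched below (the exploration rate is an arbitrary example value, and valid_placements() is the same hypothetical helper as before):

 
        import random

        EPSILON = 0.05    # exploration rate; an example value

        def choose_with_exploration(U, state, piece, valid_placements):
            successors = list(valid_placements(state, piece))
            if random.random() < EPSILON:
                return random.choice(successors)           # explore
            return max(successors, key=lambda s: U[s])     # exploit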

As mentioned earlier, the game we experimented with was smaller than the regular version of Tetris. The biggest problem with scaling up is that the utility table grows exponentially with the size of the working area (game board); a standard 10x20 board, for instance, would require 2^(10*20) = 2^200 table entries. For a full-sized game, an alternative representation of the utility function, such as a neural network, would have to be used.

Conclusion

In conclusion, reinforcement learning techniques worked very well for training an agent to play a Tetris-like game efficiently. We were able to tackle this problem in a way that is very similar to the approach taken in Tesauro's backgammon work, which is the most frequently mentioned (if not the only widely cited) application of reinforcement learning. An example that is often used to explain reinforcement learning is tic-tac-toe; one problem with this as a learning example is that it is much better solved with a simple minimax search. The small state-space search problem in the Russell/Norvig AI textbook, which was fabricated to explain learning algorithms, also fails to inspire practical applications of the technology. And in the famous cart-pole balancing problem (also known as the inverted pendulum), the task that is learned is a very simple reflex reaction. Our Tetris example shows a straightforward reinforcement learning solution to a well-known, reasonably sized problem for which there is no obviously better alternative.

Thus we have accomplished the goal of AI research: Finding a problem to fit the solution!


S Melax