Charles Landau

Posted on • Originally published at charlesdlandau.io

Drive a Tank With Python

Previously, I reviewed core concepts in Reinforcement Learning (RL) and introduced important parts of the OpenAI Gym API. You can review that introduction here:

In this installment we'll train a tank commander. I'm sharing a bunch of the code for this project here:

GitHub: CharlesDLandau / rlNotebooks -- Reinforcement learning notebooks for posting on the internet.

As these notebooks are instructional and experimental, I don't recommend running them locally. I test these notebooks on Google Colab or Kaggle before posting -- you can run them there too!




Side note: OpenAI Gym installation

...is somewhat painful. Many online notebook services like Colab and Kaggle don't allow you to install some of the OpenAI environments, so I'm going to stick to Atari for now. If you're interested in trying to set up OpenAI Gym with more flexibility, you might start with this interesting write-up.
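If you do want the Atari environments in a hosted notebook, the install is usually a single cell. The exact package extras have shifted between Gym releases, so treat this as a rough guide rather than the post's official setup:

```python
# In a Colab or Kaggle notebook cell -- the [atari] extra has moved around
# between Gym releases, so adjust as needed for your environment.
!pip install gym[atari]
```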

In order to write agents that actually take the game screen into account when making decisions, we'll need to update our run_job utility from last time:

```python
    action = model.decision_function(obs=observation, env=env)
```
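For context, here's a rough sketch of what such a run_job loop might look like. The real utility lives in the linked repo and also records video and the model's raw predictions, so treat the details below (including the contents of 'parameters') as assumptions:

```python
import gym

def run_job(env_name, model, n_steps, episodes_per_train=1):
    env = gym.make(env_name)
    observation = env.reset()
    history, episode = [], 0
    for step in range(n_steps):
        # The agent now sees the screen (observation) when deciding
        action = model.decision_function(obs=observation, env=env)
        observation, reward, done, info = env.step(action)
        history.append({'observation': observation, 'action': action,
                        'reward': reward, 'info': info, 'episode': episode})
        if done:
            episode += 1
            if episodes_per_train and episode % episodes_per_train == 0:
                model.train_on_history(history)
            observation = env.reset()
    return {'history': history, 'env': env, 'parameters': {'n_steps': n_steps}}
```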

And our RandomAgent will need to be modified accordingly:

```python
class RandomAgentContainer:
    """A model that takes random actions and doesn't learn"""

    def __init__(self):
        pass

    def decision_function(self, obs=None, env=None):
        if env:
            return env.action_space.sample()
        else:
            return 0

    def train_on_history(self, history):
        pass


model = RandomAgentContainer()
```

Now we can use our RandomAgent to explore all the information that our job creates.

```python
result = run_job("Robotank-v0", model,
                 10000, episodes_per_train=0)

result.keys()

# Output
dict_keys(['history', 'env', 'parameters'])
```

Our job produced video of the game being played, as well as a history of images, actions, predictions, and rewards. It also saved the environment object from OpenAI Gym.

Let's begin by trying to understand our observations in a video:

```python
render_video(0, result['env']);
```

[GIF: the RandomAgent playing Robotank]

If, like me, you have never played Robot Tank on an Atari... you can read the manual! You can learn about the heads-up display and more.

So anyway, it looks like the image has a bunch of noise in it. Let's see if we can extract the useful parts...

```python
import matplotlib
from matplotlib.image import imread
from matplotlib.pyplot import imshow

observation_sample = [im['observation'] for im in result.get('history')]

imshow(observation_sample[0])
```

[Image: the full Robotank game screen]

So, using the ticks on the axes and a little trial and error, we can find the bounding boxes for the radar panel and the periscope.

```python
imshow(observation_sample[0][139:172, 60:100, :])
```

[Image: the cropped radar panel]

So, we can certainly crop this image and worry less about the noise...

```python
radar_bounding_box = ((139, 172), (60, 100), (None))
```

From reading the manual we know that one of the four indicators bracketing this radar display is "R" for "radar." In other words, we can't rely on radar as the only input, because all of those indicators represent subsystems that can be disabled.

Let's also take a bounding box for the periscope:

```python
imshow(observation_sample[0][80:124, 12:160, :])
```

[Image: the cropped periscope view]

```python
peri_bounding_box = ((80, 124), (12, 160), (None))
```
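For convenience, we could wrap these in a small (hypothetical) helper that crops any observation with one of our bounding boxes:

```python
def crop(observation, bounding_box):
    # bounding_box = ((row_start, row_stop), (col_start, col_stop), channel_slice)
    (r0, r1), (c0, c1), _ = bounding_box
    return observation[r0:r1, c0:c1, :]

radar_view = crop(observation_sample[0], radar_bounding_box)
periscope_view = crop(observation_sample[0], peri_bounding_box)
```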

Let's also check the info field because it sometimes has observation data.

```python
result.get('history')[0].get('info')

# Output
{'ale.lives': 4}
```

I'm going to intentionally ignore the V, C, R, T boxes; we can always reintroduce them later if we think a performance gain is in the offing. You saw in the manual how they work, so I don't think it's a cause for concern...

Let's also try to understand the action space.

```python
env = result['env']

print(env.action_space)

# Output:
Discrete(18)
```

Hmm...not helpful at all. But that's what you'd naturally think to do... It turns out that extracting action meanings has its own method on the unwrapped environment in the Gym API.

```python
# https://ai.stackexchange.com/a/3557
env.unwrapped.get_action_meanings()

# Output
['NOOP',
 'FIRE',
 'UP',
 'RIGHT',
 'LEFT',
 'DOWN',
 'UPRIGHT',
 'UPLEFT',
 'DOWNRIGHT',
 'DOWNLEFT',
 'UPFIRE',
 'RIGHTFIRE',
 'LEFTFIRE',
 'DOWNFIRE',
 'UPRIGHTFIRE',
 'UPLEFTFIRE',
 'DOWNRIGHTFIRE',
 'DOWNLEFTFIRE']
```

So "NOOP" presumably means "no-op" i.e. "do nothing." The rest apparently constitute all the permutations of actions available to the client. This is what we would expect the action space to be. These are also discrete actions so we can code our model to take exactly one action per step.

Finally, let's visualize the rewards:

```python
set([r['reward'] for r in result['history']])  # Unique rewards across history

# Output
{0.0, 1.0}
```

```python
import matplotlib.pyplot as plt
plt.plot([r['reward'] for r in result['history']])
```

[Image: plot of reward at each step]

Looks like the reward function is simply "1 if you score a hit, else 0". We can confirm by visualizing the observations at reward time.

```python
# Observations when reward was given
reward_incidents = list(filter(lambda i: i['reward'], result['history']))

i = 0
imshow(reward_incidents[i]['observation'])
```

[Image: observation at the first reward]

```python
i = 1
imshow(reward_incidents[i]['observation'])
```

[Image: observation at the second reward]

```python
i = 2
imshow(reward_incidents[i]['observation'])
```

[Image: observation at the third reward]

They're all images that seem to be captured right after the tank scores a hit.

Writing the tank commander

The Brain

The TankCommander agent needs to learn how to decide which action to take. So, we first need to give it a mechanism for learning. In this case, we're going to use a special kind of graph. In this graph there are three kinds of nodes:

  1. Input nodes

     These take our inputs and send signals to the nodes they are connected to.

  2. Regular nodes

     These nodes can have many connections from other nodes (including input nodes). Some connections are strong and some are weak. Each node uses the signals and the signal strengths from all of its incoming connections to decide what signal to send along its own outgoing connections.

  3. Output nodes

     These nodes differ from regular nodes only in that we read their signal.

The nodes are organized into "layers" that can share many connections and have a function in common for how they decide to aggregate and send signals.

Now, I've simplified a lot, but the graph that we're talking about, if properly organized, is a deep learning neural net. To create one we can use TensorFlow like so:

```python
import tensorflow as tf
from tensorflow.keras import datasets, layers, models, preprocessing, callbacks
import numpy as np
import random

inputs = layers.Input(shape=(44, 148, 3))
x = layers.Conv2D(16, (3, 3), activation='relu')(inputs)
x = layers.BatchNormalization(axis=-1)(x)  # Channels @ -1
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)
x = layers.Dense(64)(x)
x = layers.Activation('relu')(x)
x = layers.Dense(64)(x)
x = layers.Activation('sigmoid')(x)
x = layers.BatchNormalization(axis=-1)(x)
x = layers.Dropout(0.02)(x)
x = layers.Dense(18, activation='linear')(x)
model = models.Model(inputs, x)
model.compile(optimizer='sgd',
              loss='mae',
              metrics=['accuracy'])
```

Let's break it down:


```python
inputs = layers.Input(shape=(44, 148, 3))
```

You'll recall that these are the dimensions of one of our periscope crops. They map to our input nodes.
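A quick sanity check that the periscope crop really has that shape:

```python
peri = observation_sample[0][80:124, 12:160, :]
print(peri.shape)  # (44, 148, 3) -- matches the Input layer
```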

```python
x = layers.Conv2D(16, (3, 3), activation='relu')(inputs)
x = layers.BatchNormalization(axis=-1)(x)  # Channels @ -1
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)
x = layers.Dense(64)(x)
x = layers.Activation('relu')(x)
x = layers.Dense(64)(x)
x = layers.Activation('sigmoid')(x)
x = layers.BatchNormalization(axis=-1)(x)
x = layers.Dropout(0.02)(x)
x = layers.Dense(18, activation='linear')(x)
model = models.Model(inputs, x)
```

This establishes layers of regular nodes in our graph and their interactions. I'm oversimplifying.

```python
model.compile(optimizer='sgd',
              loss='mae',
              metrics=['accuracy'])
```

Finally, we compile the model, giving it some special parameters. To teach our graph to drive a tank, we need to call the fit method, e.g. model.fit(data, correct_prediction). Each time this happens, our model goes back and updates the strength (a.k.a. the "weight") of its connections. The fancy name for this is "backpropagation." Anyway, the parameters of model.compile help determine how backpropagation is executed.
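As a shape illustration only (dummy arrays, not real training data), a single fit call looks like this:

```python
import numpy as np

dummy_image = np.zeros((1, 44, 148, 3))   # a batch of one periscope-sized image
dummy_target = np.zeros((1, 18))          # one "corrected" score per action
model.fit(dummy_image, dummy_target, verbose=0)
```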

The Experiences

Now that we have a brain, the brain needs experiences to train on. As we know from the manual and our little exploration of the data, each episode maps to one full game of Robot Tank. However, in each game the player gets several tanks, and if a tank is destroyed, the player simply "respawns" in a new tank. So the episode is actually not the smallest unit of experience; rather, the smallest unit is the in-game "life" of a single tank.

We can extract those lives like so:


```python
history = result['history']
n_episodes = max(h['episode'] for h in history) + 1

# Group the history by episode...
episodes = [list(filter(lambda h: h["episode"] == e, history))
            for e in range(n_episodes)]

# ...then, within each episode, by in-game life
for episode in episodes:
    game_lives = [list(filter(lambda h: h.get('info').get('ale.lives') == l, episode))
                  for l in range(5)]
```

For each of these lives we can get a cumulative reward (how many hits were scored before the life ended).


```python
    # (inside the loop over episodes, for each life)
    for game_life in game_lives:
        rewards = [obs.get('reward') for obs in game_life]
        cum_rewards = sum(rewards)
```

And using this number we can determine how strongly we want our brain to react to the experience:

```python
        # Positive experience
        if cum_rewards:
            nudge = cum_rewards * 0.03
        # Negative experience
        else:
            nudge = -0.03
```

Now, for a given step, we can:

  1. Take our original action, prediction, and periscope image as data
  2. Nudge our prediction only at the action index, since we can only learn from the actions we have taken.
  3. Call model.fit(image, prediction_with_nudge)
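
Put together, a single training step might look like this minimal sketch, assuming step is one history entry stored the way the epsilon snippet below unpacks it:

```python
import numpy as np

peri = step['observation'][80:124, 12:160, :]   # 1. the periscope image as data
action, prediction = step['action']             #    plus the original action and prediction
update = np.array(prediction)                   # copy before modifying
update[0][action] += nudge                      # 2. nudge only the taken action
model.fit(np.expand_dims(peri, 0), update,      # 3. fit on the nudged prediction
          verbose=0)
```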

There's a problem with this, though: if the commander only ever learns from the actions it already favors, it can get stuck. To visualize it, imagine you are tasked with riding a bike down a mountain blindfolded. As you miraculously ride down the mountain without killing yourself, you may reach a point where you seem to have reached the bottom: in any direction you try to go, you must pedal uphill. You take your blindfold off only to realize that you've barely gone a mile, and that you still have far to go before you reach the base of the mountain. The fancy term for this is a "local minimum."

https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/Extrema_example.svg/600px-Extrema_example.svg.png

To address this we can just force the commander to randomly take actions sometimes:

```python
for obs in game_life:

    action, prediction = obs.get('action')
    # Sometimes override the chosen action with a random one
    if self.epsilon and (random.uniform(0, 1.0) < self.epsilon):
        action = random.randrange(18)

    # Copy
    update = list(prediction)

    # Update only the target action
    update[0][action] = update[0][action] + nudge
```

With only 120k steps, our tank already seemed to have settled on a strategy, as shown in this long, long gif.

As you can see, turning left is powerful in Robotank -- a whole squadron killed!

[GIF: the tank commander turning left, over and over]

...but then it dies. I think this is pretty strong for a stumpy model trained only on the periscope viewport. Time permitting, I may continue to tinker with this one -- increasing the epsilon value, tweaking the graph parameters, and adding views could all help nudge the tank commander towards a more nuanced strategy.

Anyway, you've learned a bit about implementing DL with RL in Python. You learned:

  1. DL basic concepts
  2. Exploratory data analysis for RL
  3. Selectively applying rewards to specific actions, and the smallest divisible units of experience
  4. Introducing random actions to help explore the "action space"

Thanks for reading! Reach out with any questions.
