DEV Community



Comprehensive starter guide to AWS DeepRacer


What is AWS DeepRacer?

AWS DeepRacer is one of the fastest ways to get started with machine learning and the basics of coding. Your current career path or age doesn't matter; the only thing that matters is your will to learn. AWS DeepRacer is an autonomous racing league built around 1/18th-scale RC cars. The cars are 100% self-driving and use reinforcement learning to train themselves to get around the track. You are the developer, and you provide only minimal input.

Normally machine learning is complex and time-consuming to build, but AWS packages it into a user-friendly experience. We'll shortly delve into the three areas a developer has control over: hyperparameters, the action space, and the reward function with its parameters. But, as I said, this is a comprehensive guide, so there is ground to cover first.

I'll be referring to the AWS console and to DeepRacer for Cloud, which I'll shorten to DRfC. The AWS console refers to training directly on AWS-provided infrastructure. Since AWS DeepRacer is open source, a group of us took what AWS provided and packaged it into a system known as DRfC. This not only unlocks some additional features but also lets you train on your own platform.

Physical Racing vs Virtual Racing

Virtual and physical racing are two completely different beasts, and many racers struggle with the changes between the two. In virtual racing, the track we train on is usually the same one we submit the model to, so we can get away with high speeds in the action space and with overfitting.

Physical racing, however, is very sensitive to both high speed and overfitting. High speeds in the action space translate into high throttle values on the car, which can show up as the vehicle stalling on the track or simply refusing to move without some assistance. A well-generalized model performs better in the real world.

There are two common misconceptions about actual racing. First, the car is no longer learning; it is now exploiting the knowledge it has already collected in the model. Second, the reward function plays no role in the decision-making process at race time. The only input the car relies on is the camera, which it combines with the sum of its experience to make a probabilistic choice based on weights; a trained model is essentially matrices of weights embedded in a machine learning model.

DeepRacer League vs DeepRacer Student League

The Student League is a pared-down version of the AWS console that lets a younger generation learn machine learning at an early age. AWS usually pairs the season with a chance at an AI scholarship with Udacity, plus the usual prizes. You can still run a Student League model on a physical car and be competitive. The Student League is restricted to ages 16+. Since there are limitations within the Student League, I will note whenever a feature isn't available there.

The DeepRacer League is the adult version without the training wheels. You have full control of all the available features, and there are prizes for performing well during the season, which can end with a trip to AWS re:Invent. It is restricted to ages 18+.


Training Policy

I'm only going to talk about policies at a high level, but just like you don't talk about Fight Club, we don't talk about SAC. While SAC is an available policy option, it is very sensitive to hyperparameters, and as far as we know, no one in the DeepRacer League has been successful with it.

It is best to keep to PPO policy especially when starting your journey.


Convergence

Convergence is the ultimate goal for any ML engineer training a model. It is the point at which the car is no longer exploring its environment but instead exploiting its knowledge to get around it. On the training charts, convergence usually shows up as the training lap completion and evaluation lap completion meeting. If they aren't meeting, you might need to adjust your hyperparameters. Hyperparameters tend to pull towards one of two outcomes that are constantly fighting each other: learning stability versus convergence. If they aren't set right, the model might not learn enough to reach convergence, or might take a very long time to get there.

Another way to see whether a car is converging, which is easier within DRfC, is watching entropy. Entropy is simply how much uncertainty is left in the system. If the model is moving in the right direction, entropy will decrease over time; if it isn't, you'll see it increase. In the AWS console you can only see this after downloading the logs, but with DRfC we can watch it live. Entropy will be near 1 when the car starts doing laps, and can fall as low as 0.2-0.4 for a well-balanced reward function.
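To make the "watch entropy fall" idea concrete, here is a minimal sketch. It assumes you have already extracted a series of entropy values from your training logs (the extraction step depends on your log format, so it is not shown) and simply compares the early average against the recent average:

```python
# Sketch: judging convergence from an entropy series already extracted
# from training logs. How you extract the values depends on your setup.

def entropy_trend(entropies, window=10):
    """Difference between the mean of the last `window` values and the
    mean of the first `window` values. Negative means entropy is falling,
    which suggests the model is converging; flat or positive suggests
    the policy is still uncertain."""
    if len(entropies) < 2 * window:
        raise ValueError("need at least two full windows of data")
    start = sum(entropies[:window]) / window
    end = sum(entropies[-window:]) / window
    return end - start

# Example: a run drifting from ~1.0 down toward ~0.3
values = [1.0 - 0.007 * i for i in range(100)]
print(entropy_trend(values))  # negative, so entropy is falling
```

This is deliberately crude; a plot of the raw series tells you more, but a sign check like this is enough to automate a quick "is it still learning?" alert.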

Log Analysis

Before I get into the hyperparameters that affect convergence, let's talk about how to actually see convergence within the environment.

Log analysis is key to being successful: it shows how your model is performing, where you can make improvements, and whether the reward function is favoring the ultimate goal, which is to collect the most points possible. I'll go into more detail on this later.

The AWS DeepRacer community provides several tools for analyzing your model, and all you need are the training logs. I'm not going into depth on how to do this, but the repo for the main tool I use can be found here. There are several others available, including one made by JPMC known as Log Guru. I'll be looking at writing a more in-depth guide to deepracer-analysis in the coming months.

The Student League can't do log analysis because participants can't download the logs. They could, in theory, still analyze the reward in one of two ways. First, create a sample run and pass the values into the reward function to see the outcomes.

Second, set up DRfC to test the reward function before burning the 10 free hours. Due to speed differences in hardware, you'll need to estimate AWS console performance from DRfC by looking at the number of iterations and matching them closely to get the expected performance.
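The first option, passing sample values into the reward function, needs no infrastructure at all. Here is a hedged sketch: the parameter names come from the official input-parameter list, but the values are made up for illustration, and a real run would sweep many combinations rather than one:

```python
# Sketch: sanity-check a reward function offline by calling it with a
# hand-built params dict. Only two of the real input parameters are
# shown; the values are illustrative, not from a real run.

def reward_function(params):
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']
    # Reward falls off linearly from the center line to the track edge.
    reward = 1 - distance_from_center / (track_width / 2)
    return float(max(reward, 1e-3))

sample_params = {
    'track_width': 0.76,          # meters, a typical virtual track
    'distance_from_center': 0.1,  # meters from the center line
}

print(reward_function(sample_params))
```

Looping this over a grid of `distance_from_center` values shows you the reward curve shape before you spend any training hours.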


Hyperparameters

💡 Not Available in Student League

Hyperparameters are settings that affect the training performance of the model. Sometimes they need to be tweaked to turn what might seem like a bad reward function into a good one. When starting out, you should generally leave these at their defaults, except for the discount factor, which I suggest setting to 0.99.

| Hyperparameter | Description | Platform Availability |
| --- | --- | --- |
| Batch Size | The number of experiences sampled at random and used to update the model. Reducing it can promote a more stable policy. | Both |
| Beta Entropy | A value of uncertainty added to the policy; the lower the number, the less uncertainty. Increasing it promotes exploration over exploitation at the cost of convergence. | Both |
| Epochs | The number of passes through the training data when updating the model. The trade-off: a lower value gives higher stability at the cost of convergence. | Both |
| Learning Rate | Controls how much gradient descent updates the weights. Decreasing it can increase stability, but at the cost of time. | Both |
| Discount Factor | How far into the future the car looks when calculating the possible rewards it can collect. If the environment is noisy, decreasing it helps stability by focusing on more short-term reward. More on this later. | Both |
| Exploration Type | Two choices: categorical (the default) and e-greedy. With epsilon-greedy, you choose a value for ε (epsilon), typically between 0 and 1; the agent explores (chooses a random action) with probability ε and exploits (chooses the best-known action) with probability 1 − ε. In categorical exploration, exploration is modeled as a probability distribution over actions: each action has an associated probability, and the distribution can be learned and updated as the agent interacts with the environment. | Categorical: both; e-greedy: DRfC only |
| e-greedy value | Balances exploration versus exploitation when using the e-greedy exploration type. | DRfC only |
| Epsilon Steps | (Unconfirmed) I believe this is part of PPO clipping, controlling how much the policy can deviate once a certain number of steps is reached during training. Higher promotes more exploration. | DRfC only |
| Loss Type | The objective function used to update the network's weights. Values: Huber and mean squared error. | Both |
| Number of Episodes | The number of episodes used to collect data before each policy update. For more complex problems, increasing it can lead to a more stable model. | Both |
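In DRfC these settings are supplied as a JSON file rather than through a web form. The sketch below shows a plausible starting point consistent with the advice above (defaults plus a discount factor of 0.99); the key names are as I recall them from DRfC's `custom_files/hyperparameters.json`, so verify them against your own checkout before relying on this:

```json
{
  "batch_size": 64,
  "beta_entropy": 0.01,
  "discount_factor": 0.99,
  "e_greedy_value": 0.05,
  "epsilon_steps": 10000,
  "exploration_type": "categorical",
  "loss_type": "huber",
  "lr": 0.0003,
  "num_episodes_between_training": 20,
  "num_epochs": 10
}
```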

Action Space

💡 Not Available in Student League. The Student League is fixed to a continuous action space of 0.5 to 1 speed and -30 to 30 steering.

As a developer, you think about how you want your vehicle to behave and craft a reward function to achieve the desired results, similar to how you would train an animal to perform a trick by giving it treats each time it performs the desired action. Currently, there are two choices of action space available.

Discrete is a list of fixed speeds paired with steering angles, and DeepRacer selects from this list. How it selects is affected by the exploration type, which can only be changed in DRfC.

Continuous means you only set your minimum and maximum steering angle and speed. The car selects a random float between those values, so during training it may never pick the same value twice.
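A discrete action space is, in the end, just a list of steering/speed pairs. A quick sketch that builds one as a grid (the console stores this list in the model's metadata file; the exact file format is not reproduced here, and the angles and speeds are illustrative):

```python
# Sketch: build a discrete action space as a grid of steering/speed
# pairs. The values below are examples, not recommendations.

steering_angles = [-30, -15, 0, 15, 30]  # degrees
speeds = [1.0, 2.0]                      # throttle values

action_space = [
    {"steering_angle": angle, "speed": speed}
    for angle in steering_angles
    for speed in speeds
]

print(len(action_space))  # 10 actions: 5 angles x 2 speeds
```

Keeping the grid small matters: every extra pair is another option the car must learn the value of, which slows convergence.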

Reward Function Parameters

Parameters are all the different values passed into the reward function during training. This happens every 1/15th of a second: the action picked during that interval is passed to the reward function to calculate the reward. There is a wide range of parameters, and some are only used if you are doing object avoidance. You'll need to think carefully about which values suit your end goals before weaving them into your reward function. Incidentally, that 1/15th of a second is what the parameters call a step.

🏎️ Note: speed is a hard value taken from the action space. It is best to think of speed as throttle, not the actual velocity of the car's current state.

A comprehensive list can be found here:

Input parameters of the AWS DeepRacer reward function - AWS DeepRacer
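As a taste of what's in that list, here is a hedged sketch using three of the documented parameters (`all_wheels_on_track`, `progress`, and `steps`). It rewards progress made per step, which is one common way to favor speed without writing `reward = speed` (more on why that matters below); treat it as an illustration, not a tuned function:

```python
def reward_function(params):
    # Documented input parameters (see the AWS list linked above)
    all_wheels_on_track = params['all_wheels_on_track']
    progress = params['progress']  # 0-100, percent of track completed
    steps = params['steps']        # steps taken so far this episode

    if not all_wheels_on_track:
        return 1e-3  # near-zero reward for leaving the track

    # Progress per step: a faster lap covers the same progress in
    # fewer steps, so this ratio is naturally higher for fast laps.
    if steps > 0:
        return float(progress / steps)
    return 1e-3
```

A usage example: at 50% progress after 100 steps this returns 0.5; at 50% progress after 200 steps it returns 0.25, so the slower lap earns less per step.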

Reward Function

Crafting a reward function can get pretty overwhelming with all the parameters available. Typically, people take one of two approaches when they first get started: they either do reward = speed in the hope of generating a fast model, or they make the reward function overly complex. Complexity is fine, but a function built on only a few parameters can go a long way; some of my best models have used at most two parameters.

Use reward shaping where you can in your reward function. What is reward shaping? It means still giving a reward, even a small one, when the action is off from what you desire, rather than zeroing it out entirely. For example, below is the AWS-provided center line function.

def reward_function(params):
    '''
    Example of rewarding the agent to follow center line
    '''

    # Read input parameters
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    # Calculate 3 markers that are increasingly further away from the center line
    marker_1 = 0.1 * track_width
    marker_2 = 0.25 * track_width
    marker_3 = 0.5 * track_width

    # Give higher reward if the car is closer to center line and vice versa
    if distance_from_center <= marker_1:
        reward = 1
    elif distance_from_center <= marker_2:
        reward = 0.5
    elif distance_from_center <= marker_3:
        reward = 0.1
    else:
        reward = 1e-3  # likely crashed / close to off track

    return reward

With reward shaping in mind, the center line function can be simplified to the following:

def reward_function(params):
    '''
    Example of rewarding the agent to follow center line
    '''

    # Read input parameters
    distance_from_center = params['distance_from_center']

    reward = 1 - distance_from_center

    return reward

With this simplification, the car has some freedom to deviate from the center line and still receive a decent reward, but the further away it gets, the lower the reward.
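One caveat with the simplified version: `distance_from_center` is measured in meters, so the slope of the reward depends on the physical track width, and on a wide track the reward barely changes across the driveable surface. A sketch (my own variation, not the AWS example) that normalizes by half the track width and clamps so the reward never goes to zero or negative:

```python
def reward_function(params):
    # Normalize the distance by half the track width so the reward
    # runs from 1 at the center line down toward 0 at the track edge,
    # regardless of how wide the track is.
    track_width = params['track_width']
    distance_from_center = params['distance_from_center']

    reward = 1 - distance_from_center / (track_width / 2)
    return float(max(reward, 1e-3))  # keep a small positive floor
```

The floor preserves the reward-shaping idea: even a badly placed car gets a sliver of reward rather than a hard zero.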

Now that we've talked about simplifying and reward shaping, let's address the elephant in the room: using reward = speed, or even reward = speed + other parameters, is just plain bad for you.

Earlier I said 1 step is 1/15th of a second; this value is key to the concept I'm about to show you. First, you need to understand that the car's ultimate goal is to collect the highest total reward possible. That means if you don't do log analysis on your reward function, the model might learn bad behavior in pursuit of that goal. Let's take a car that has two speed values, 0.5 m/s and 1 m/s, on a 30-meter track.

💡 This is an oversimplification of the whole process, but it lets you understand why things can happen that you don't expect.

We’ll assume constant speed for these two scenarios

30 meters ÷ 1 m/s = 30 seconds

30 meters ÷ 0.5 m/s = 60 seconds

Now we know how long each speed takes to get around the track, but we still need to account for the number of steps in one second, which is 15.

30 seconds * 15 = 450 steps

60 seconds * 15 = 900 steps

Now we multiply the step count by speed, because that was the reward: reward = speed.

450 steps * 1 reward = 450 points

900 steps * 0.5 reward = 450 points

So now you see that if you just use the speed parameter, you end up with the same number of points either way. In a real scenario this might mean the car speeds up, slows down, or simply settles on a slow speed if you pair speed with another parameter. For example, with speed + distance from center line, if the distance-from-center reward is always 1, you have just made the formula steps * speed reward + steps * distance reward, and you've quickly ruined your speed.
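The arithmetic above can be checked in a few lines, assuming the article's idealized constant-speed lap and 15 steps per second:

```python
TRACK_LENGTH = 30.0   # meters
STEPS_PER_SECOND = 15

def total_reward(speed):
    # With reward = speed, every step pays out `speed` points.
    lap_time = TRACK_LENGTH / speed       # seconds to finish the lap
    steps = lap_time * STEPS_PER_SECOND   # reward events in that lap
    return steps * speed                  # total points for the lap

print(total_reward(1.0))   # 450.0
print(total_reward(0.5))   # 450.0 -- identical despite half the speed
```

Algebraically, the speed cancels out: (length / speed) * rate * speed = length * rate, so any constant speed earns the same lap total.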

There is another issue that can occur during training: snaking, or zig-zagging. This is typically because the car has learned it can get a higher reward by maximizing its time on track. If you aren't in the Student League, you can generally fix some of these issues by adjusting the discount factor; it can make a seemingly bad reward function a better one by limiting how many future steps are considered when making decisions.
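The discount factor's "how far ahead" effect can be put in numbers: with discount γ, a reward n steps in the future is weighted γⁿ. A small sketch of the effective horizon, which I'll define here (my own convention, not an AWS term) as the number of steps before that weight drops below 10%:

```python
import math

def effective_horizon(gamma, cutoff=0.1):
    # Smallest n such that gamma**n < cutoff, i.e. how many future
    # steps still meaningfully influence the current decision.
    return math.ceil(math.log(cutoff) / math.log(gamma))

for gamma in (0.999, 0.99, 0.9):
    print(gamma, effective_horizon(gamma))
```

This gives roughly 2302, 230, and 22 steps respectively; at 15 steps per second, that's about 153 s, 15 s, and 1.5 s of lookahead, which is why dropping the discount factor from 0.999 toward 0.99 makes the car focus on much more immediate rewards.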
