## DEV Community is a community of 852,088 amazing developers

We're a place where coders share, stay up-to-date and grow their careers. jemaloQiu

Posted on • Updated on

# Reinforcement Learning with TF2 and Gym (part I)

Today I start my exercise on Reinforcement Learning using Tensorflow 2.0. Same as many guys in this domain do, I use the famous Cartpole game (see the figure below) of OpenAI Gym for implementing and testing my RL algorithm. One can check the Official introduction of Cartpole on this page. This post will show a raw framework of my implementation. I will present the main loop of the training process and the basic concepts of TF2, RL and Cartpole game. Details of the algorithmic work will be presented in next post.

### package preparation

Make sure that tensorflow (>=2.0), numpy and gym have been installed.

Now import the important modules.

``````import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = "2"

import tensorflow as tf

if tf.__version__.startswith("1."):
raise RuntimeError("Error!! You are using tensorflow-v1")

import numpy as np
import gym
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
import tensorflow.keras.backend as K

``````

I will use the Policy Gradient method for solving this RL problem. Thus I define firstly a class for this method. For the moment, this class is almost empty. I will fill it with all required functionalities later.

This class consists of a Sequential neural network. Input of this network should be vector of state observation and output a unique action index. It should provide some important functions for calculating gradients, updating model gradients, calculating rewards and save/reload models.

``````# a class of Policy Gradient Neural Network

def __init__(self):

self.model = None
self.optimizer = None

def create_model(self):
self.model = Sequential()
## TODO: fulfill the neural netwrok model

pass

pass
pass

def get_variable(self):
return self.model.variables

def calc_rewards(self):
pass

def save_model(self):
pass

pass
``````

### Cartpole preparation

This part of code will call gym environment and prints its parameters for getting info.

``````
ENV_SEED = 1024  ## Reproducibility of the game
NP_SEED = 1024  ## Reproducibility of numpy random

env = gym.make('CartPole-v0')
env.seed(ENV_SEED)
np.random.seed(NP_SEED)

### The Discrete space allows a fixed range of non-negative numbers, so in this case valid actions are either 0 or 1.
### The Box space represents an n-dimensional box, so valid observations will be an array of 4 numbers.
print(env.action_space)
print(env.observation_space)
### We can also check the Box’s bounds:
print(env.observation_space.high)
print(env.observation_space.low)

``````

### Main loop

This is the main loop for training. Since the PolicyGradientNet class has not yet been finished, so here I leave TODO keywords for indicating the places where I need to implement RL algorithmic code.

In each episode, there are multiple interaction loops between env and agent. Given current state, my agent calculates the action (0 or 1) that is applied on the cart, then the environment will accordingly update thus a new state is observed, so my agent will calculate again the action to apply, and so on...

Below is a screenshot of the official presentation of the agent-env interaction: Current code of this part:

``````# define firstly an instance of PolicyGradientNet

update_step = 5   # number of episodes for updating the network's gradient
limit_train = 1000  # training episode limit for stopping
i = 0  # episode counter
max_step = 0  # record the maximum steps in all episodes

while i < limit_train:
step  = 0
state_current = env.reset()  ## reset the game so it will play from its initial state

while True:

env.render()  ## refreshing of visual result
step += 1

## calculate an action
## TODO: use agent.model to calculate the optimal action
a = np.random.choice([0, 1], p=[0.5, 0.5])

## env.step() shall return: observation(object), reward(float), done(boolean),info(dict)
state_obs, reward, done, info = env.step(a)
state_current = state_obs  ## update current state vector
x, x_prime, theta, theta_prime = state_obs  ## state_obs  consists of : current x position, velocity, current pole orientation (about vertical axis) and current angular velocity

if done:  # done being True indicates the episode has terminated.
## TODO: launch agent training
if max_step < step:
max_step = step
print("Step: ", step)

break
if i % 100 == 0:
print("Max step is {} until episode {}  ".format(max_step, i))

i += 1
``````

Now the simulation is something like this:

Full program will be presented in next post.

## Discussion (4) Otu Michael
``````if tf.__version__.startswith("1."):
print("Error!! You are using tensorflow-v1")
``````

At this point the code will still run. Does the version of tensorflow affect the functionality of the app? jemaloQiu

yes, I check only if the version requirement has been satisfied. If you are using tensorflow 1.x, this program will report errors and halt. In this case, this message can give a hint for debugging. Otu Michael

but this doesn't halt.. throw an error??