A Markov decision process (MDP) is a mathematical framework for modeling decision-making problems in which outcomes are partly random and partly under the control of a decision-maker. It consists of a set of states, a set of actions, transition probabilities, and rewards. In each state the decision-maker takes an action, the environment transitions to a new state according to a probability distribution conditioned on the current state and the action, and the decision-maker receives a reward based on the state and the action taken.

Here’s an example of an MDP:

Consider a robot that is trying to navigate through a maze. The robot's state is its current location in the maze, and its actions are to move in one of four directions: up, down, left, or right. Its goal is to reach the exit. The robot can only see a limited area around it, so it doesn't have access to the entire maze. The environment transitions to a new state based on the action taken and a probability distribution over next states. For example, if the robot chooses to move up, there might be a 70% chance that it actually moves up, a 10% chance that it moves left, a 10% chance that it moves right, and a 10% chance that it moves down. The robot receives a reward based on the state and action taken: for example, +10 for reaching the exit, -1 for bumping into a wall, and -0.1 for each step it takes.
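The maze example can be sketched as a sampled transition function. This is a hypothetical toy implementation, not code from the source: the grid layout, helper names, and the 70/10/10/10 split and reward values simply mirror the numbers in the example above.

```python
import random

ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

STEP_REWARD = -0.1   # small cost for each step taken
WALL_REWARD = -1.0   # penalty for bumping into a wall
EXIT_REWARD = 10.0   # reward for reaching the exit

def step(state, action, grid, exit_cell):
    """Sample one MDP transition: the intended move happens with
    probability 0.7; each of the other three directions with 0.1."""
    directions = [action] + [a for a in ACTIONS if a != action]
    actual = random.choices(directions, weights=[0.7, 0.1, 0.1, 0.1])[0]
    dr, dc = MOVES[actual]
    r, c = state[0] + dr, state[1] + dc
    # Off the grid or into a wall cell ("#"): stay put and pay the penalty.
    if not (0 <= r < len(grid) and 0 <= c < len(grid[0])) or grid[r][c] == "#":
        return state, WALL_REWARD
    if (r, c) == exit_cell:
        return (r, c), EXIT_REWARD
    return (r, c), STEP_REWARD
```

Calling `step` repeatedly from a start cell simulates one episode of the robot wandering the maze.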

The environment is typically formulated as a Markov Decision Process with the following components:

1.) set of states: S

In chess, a state is a given configuration of the board

2.) set of actions: A (for example: up, down, left, right)

Every possible move in chess or tic-tac-toe

3.) Conditional distribution of the next state given the current state and action

The next state depends only on the current state, not on the full history (the "Markov property")

P(s'|s,a): the transition probability matrix

4.) R(s,a,s'): the reward function, giving the reward for reaching the next state s' after the agent takes action a while in state s

5.) γ: the discount factor (0 ≤ γ ≤ 1), which weights immediate rewards against future ones
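The five components above can be wired together in a short sketch. The two-state MDP below is hypothetical (all numbers are made up for illustration); it runs value iteration, i.e. repeated application of the Bellman optimality update V(s) = max_a Σ_s' P(s'|s,a) [R(s,a,s') + γ V(s')].

```python
# Hypothetical two-state MDP: S, A, P(s'|s,a), R(s,a,s'), gamma.
S = ["s0", "s1"]
A = ["stay", "go"]
P = {  # P[(s, a)] -> {s': probability}
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}
R = {  # R[(s, a, s')] -> reward
    ("s0", "stay", "s0"): 0.0,
    ("s0", "go", "s0"): 0.0, ("s0", "go", "s1"): 1.0,
    ("s1", "stay", "s1"): 2.0,
    ("s1", "go", "s0"): 0.0, ("s1", "go", "s1"): 0.0,
}
gamma = 0.9  # discount factor

def value_iteration(tol=1e-8):
    """Iterate the Bellman optimality update until the values converge."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {
            s: max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
                for a in A
            )
            for s in S
        }
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            return V_new
        V = V_new
```

Here staying in s1 earns reward 2 forever, so its optimal value converges to 2 / (1 - γ) = 20.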

The Markov property is a fundamental assumption in reinforcement learning. It states that the current state of an agent contains all the information that is relevant for decision-making. In other words, the future state of the agent depends only on its current state and not on its past states. This assumption enables theoretical proofs for certain algorithms and is widely used in current research.

For example, in a Markov decision process (MDP), we assume that the next state is conditionally independent of the whole history given the current state and action. However, in real-life scenarios this assumption may not hold, which can cause an RL algorithm to fail in a specific environment.
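When a single observation is not Markov, a common workaround (not discussed in the source, but standard practice) is to stack the last few observations and treat the stack as the state. A minimal sketch, with a hypothetical `FrameStack` helper:

```python
from collections import deque

class FrameStack:
    """Keep the last k observations so the stacked tuple can serve as a
    (more nearly) Markov state when one observation alone is not enough."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, obs):
        # Pad the stack with the first observation of the episode.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append(obs)
        return tuple(self.frames)

    def step(self, obs):
        # Append the newest observation; the oldest one falls off.
        self.frames.append(obs)
        return tuple(self.frames)
```

The agent then learns over these stacked states instead of raw observations.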

Reference:

https://web.stanford.edu/class/cme241/lecture_slides/david_silver_slides/MDP.pdf
