1. intelligent agent ->
2. learns to make good sequential decisions
3. optimality
4. utility
5. an agent needs to be intelligent to make good decisions
Atari: learns the game directly from raw pixels
video game playing
robotics (e.g. grasping clothes)
educational games to amplify human intelligence
NLP and vision tasks can also be cast as the same kind of optimization process
key aspects:
- optimization: find a good decision, or at least a good strategy
- delayed consequences: a decision may yield nothing now but be helpful later, so it is hard to tell immediately whether it was good
- exploration: the agent learns about the world by trying things; the data is censored, we only see the reward for the decision that was actually made
- generalization: a policy is a mapping from past experiences to actions; pre-programming it is not feasible because the search space is too large
good question: why not pre-program a policy?
- the search space is too big
- it would need an enormous code base
- in Atari the agent must decide what to do next from the space of raw images (see the sketch after this list)
- so some sort of generalization is needed
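To see why a pre-programmed lookup-table policy is hopeless here, a rough count of the Atari observation space helps; the 210x160 frame size and 128-colour palette below are the usual Atari 2600 numbers, assumed just for this back-of-the-envelope estimate.

```python
import math

# Rough size of the Atari observation space: a lookup table mapping every
# possible frame to an action is astronomically large, so the policy has to
# generalise across frames instead of being pre-programmed per frame.

HEIGHT, WIDTH = 210, 160   # assumed Atari 2600 frame size (pixels)
COLOURS = 128              # assumed palette size per pixel

num_pixels = HEIGHT * WIDTH
# number of distinct frames = COLOURS ** num_pixels; report it as a power of 10
log10_frames = num_pixels * math.log10(COLOURS)
print(f"distinct frames ~ 10^{log10_frames:.0f}")   # roughly 10^70000
```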
How RL compares to other settings (o = optimization, g = generalization, e = exploration, d = delayed consequences):
- AI planning: o, g, d but no e (why doesn't the Go game need exploration? because the model/rules of the world are given, so the agent can plan without exploring)
- supervised learning: o, g only (the experience is already given as a labelled dataset)
- unsupervised learning: o, g only (data is given, but without labels)
- RL: o, g, e, d (all four)
- imitation learning: o, g, d but no e; it learns from someone else's experience,
assumes the inputs come from demonstrations of a good policy,
and thereby reduces RL to supervised learning (see the sketch below)
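A minimal sketch of that reduction (behaviour cloning): treat the demonstrated (state, action) pairs as a supervised dataset and fit a classifier to use as the policy. The toy states and the majority-vote "classifier" below are invented purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical demonstrations: (state, action) pairs recorded from a good policy.
demos = [("low_battery", "recharge"), ("low_battery", "recharge"),
         ("obstacle", "turn_left"), ("clear_path", "move_forward"),
         ("obstacle", "turn_left")]

# "Training" a trivially simple classifier: pick the majority action per state.
counts = defaultdict(Counter)
for state, action in demos:
    counts[state][action] += 1
policy = {s: c.most_common(1)[0][0] for s, c in counts.items()}

print(policy["obstacle"])  # -> "turn_left": the cloned policy imitates the demos
```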
Goal by the end of the class: build agents that explore the world and use that experience to guide future decisions.
sequential decision making under uncertainty
- an interactive closed-loop process: the agent takes an action, the world returns an observation and a reward, and the agent's goal is to maximize the expected future reward (see the interaction-loop sketch below)
- the process is stochastic, so the agent needs strategic behaviour to get high expected reward
- this means balancing immediate and long-term reward: the agent may have to take decisions that yield no reward for a long time
- if the agent finds an easy shortcut that maximizes the stated reward, it will take it, so designing the reward function is important (a sub-discipline related to machine teaching)
(sketch: states along a line, with rewards marked at two fixed points and 0 elsewhere)
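A minimal sketch of the closed loop described above: the agent acts, the world answers with an observation and a reward, and the quantity being maximised is the discounted return. `ToyEnv`, its two-state dynamics, and the random agent are all made up for illustration.

```python
import random

class ToyEnv:
    """A made-up 2-state environment, just to show the interaction loop."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        # action 1 taken in state 1 pays off; everything else pays nothing
        reward = 1.0 if (self.state == 1 and action == 1) else 0.0
        self.state = random.choice([0, 1])          # next observation
        return self.state, reward

env, gamma, ret = ToyEnv(), 0.9, 0.0
obs = env.state
for t in range(10):                                  # one episode of length 10
    action = random.choice([0, 1])                   # agent picks an action
    obs, reward = env.step(action)                   # world returns obs + reward
    ret += (gamma ** t) * reward                     # discounted return so far
print("discounted return:", ret)
```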
Important ingredients of sequential decision making: history, state space, world state, discrete time steps.
The agent's state is usually only a small subset of the real world state.
(Markov Assumption): the future is independent of the past given the current state; often the state is taken to be just the current observation s(t).
History: everything observed up to time t, $h_t = (s_0, s_1, \ldots, s_t)$.
Any process can be made Markov by using the whole history as the state.
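A tiny sketch of the "whole history can be Markov" remark: if the state at time t is defined as the full tuple (s_0, ..., s_t), then the next state depends only on the current state plus the newest observation. The observation stream below is arbitrary.

```python
# Using the full history as the state: h_{t+1} is fully determined by h_t plus
# the newest observation, so nothing older than the current "state" is needed.

observations = ["cloudy", "rain", "sun", "rain"]   # arbitrary example stream

history = ()
for s_t in observations:
    history = history + (s_t,)                     # the new Markov state
    print(history)
# Each printed tuple is the agent's state when the whole history is the state.
```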
POMDP (partially observable MDP): the agent's observation is not the full world state.
Bandits: actions have no influence on the next observation.
MDPs and POMDPs: actions influence future observations.
Types of sequential decision processes:
Deterministic: given the history and action, there is a single possible next observation and reward.
Stochastic: given the history and action, many different next observations and rewards are possible.
Components of an RL algorithm:
- model: the agent's representation of how the world changes in response to its actions
- policy: a mapping from states to actions; it can be stochastic (a distribution over actions) or deterministic (a single action per state)
- value function: the expected discounted sum of future rewards, with discount factor gamma
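A small sketch, with assumed state names and gamma = 0.9, of what a deterministic policy, a stochastic policy, and the discounted return behind the value function look like in code.

```python
import random

# Deterministic policy: a plain mapping state -> action.
det_policy = {"s1": "left", "s2": "right"}

# Stochastic policy: a mapping state -> distribution over actions.
stoch_policy = {"s1": {"left": 0.8, "right": 0.2}}

def sample_action(policy, state):
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]

# Value function idea: expected discounted sum of future rewards,
# V(s) = E[ sum_t gamma^t * r_t ].
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(sample_action(stoch_policy, "s1"))
print(discounted_return([0, 0, 1, 1]))   # 0.9**2 + 0.9**3 = 1.539
```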
Example reward model: the Mars Rover as a stochastic Markov model (sketched below).
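A sketch of a Mars-Rover-style stochastic Markov model: states on a line, dynamics that sometimes slip, and a reward attached to each state. The number of states, slip probability, reward values, and the "always move right" behaviour are all assumptions made up for illustration.

```python
import random

N_STATES = 7                       # assumed line of states s0..s6
REWARDS = [1] + [0] * 5 + [10]     # assumed: small reward at one end, big at the other
SLIP = 0.3                         # assumed probability the rover fails to move

def step_right(state):
    """Stochastic Markov dynamics: the next state depends only on the current state."""
    if random.random() < SLIP:
        return state                       # rover slips and stays put
    return min(state + 1, N_STATES - 1)    # otherwise it moves one cell right

def rollout(gamma=0.9, horizon=20):
    state, ret = 0, 0.0
    for t in range(horizon):
        ret += (gamma ** t) * REWARDS[state]
        state = step_right(state)
    return ret

# Monte Carlo estimate of the value of the start state under "always go right".
print(sum(rollout() for _ in range(1000)) / 1000)
```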
Types of RL agents:
Model-based: maintains an explicit model of the world (and may or may not keep an explicit policy or value function).
Model-free: maintains an explicit policy and/or value function, but no model.
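One way to see the model-based vs model-free split is in what the agent actually stores; the dictionaries below are a made-up miniature example, not any particular algorithm.

```python
# Model-based agent: stores an explicit model of the world, i.e. transition
# probabilities and rewards, and can plan with it.
transition_model = {("s1", "right"): {"s2": 0.9, "s1": 0.1}}
reward_model = {("s1", "right"): 0.0, ("s2", "right"): 1.0}

# Model-free agent: stores no model, only value estimates (and/or a policy)
# learned directly from experience, e.g. a Q-table.
q_table = {("s1", "right"): 0.72, ("s1", "left"): 0.10}

best_action_in_s1 = max(("right", "left"), key=lambda a: q_table[("s1", a)])
print(best_action_in_s1)   # the model-free agent acts greedily on its value estimates
```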
Key challenges:
Planning: a model of how the world works is given, and the agent must compute how to act.
Learning: the world is initially unknown, so the agent must learn how to act from interaction.
The finite-horizon setting refers to the time span of system operation over which you care about the defined performance measures. If you want to control the system so that it meets the performance measures over a finite time, say T, the problem is a finite-horizon problem; if you care about optimality over the whole time span, i.e. up to t = ∞, it is an infinite-horizon problem.

The problem of deriving a control u(t), t ∈ [0, T], for the system

$\dot{x}(t) = A x(t) + B u(t)$

such that the performance index

$PM = \int_0^T x(t)' Q x(t) + u(t)' R u(t)\, dt$

is minimised is a finite-horizon problem.

The problem of deriving a control u(t), t ∈ [0, ∞), for the same system such that the performance index

$PM = \int_0^\infty x(t)' Q x(t) + u(t)' R u(t)\, dt$

is minimised is an infinite-horizon problem.
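A numerical sketch of the finite-horizon problem above: simulate $\dot{x}(t) = Ax(t) + Bu(t)$ with Euler steps under some fixed feedback u = -Kx and accumulate the performance index PM over [0, T]. The matrices A, B, Q, R, the gain K, the horizon T, and the step size are arbitrary choices to make the integral concrete; this only evaluates a given controller, it does not derive the optimal one.

```python
import numpy as np

# Arbitrary 2-state system and weights, chosen only to illustrate the integral.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[0.1]])
K = np.array([[1.0, 1.0]])        # some fixed (not necessarily optimal) feedback gain

T, dt = 5.0, 0.001                # finite horizon [0, T] and Euler step size
x = np.array([[1.0], [0.0]])      # initial state
PM = 0.0

for _ in range(int(T / dt)):
    u = -K @ x                                         # control law u(t) = -K x(t)
    PM += (x.T @ Q @ x + u.T @ R @ u).item() * dt      # accumulate x'Qx + u'Ru dt
    x = x + (A @ x + B @ u) * dt                       # Euler step of x' = Ax + Bu

print("finite-horizon performance index PM ~", round(PM, 3))
```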
Evaluation and control: evaluation means estimating the expected rewards of a given policy; control means finding the best policy.
Links for further understanding:
infinite horizon problem in optimal control: math.stackexchange.com/questions/2...
a simple example of it: math.stackexchange.com/questions/2...