T&C LAB-AI
Dept. of Intelligent Robot Eng. MU
Robotics
Robot Learning: Reinforcement Learning
Lecture 10
양정연
2020/12/10
1. Reward and Return in RL
Past or Future Rewards
• 1. Viewpoint at the terminal
– Return is the sum of all PAST rewards.
• 2. Agent's viewpoint (RL uses this one)
– Return is the sum of all FUTURE rewards.
[Figure: a trajectory from the initial state to the terminal; the past-reward sum looks backward from the terminal, while the future-reward sum looks forward from the agent.]
Reward and Return
• Reward: the agent gets a reward at each state transition.
– Whenever the agent moves, it receives a reward from the environment.
– Ex) +1, +2 at the terminals and -0.1 at each step.
• State: the state changes as time flows ($s_t$).
[Figure: a chain of states $s_0 \dots s_7$ between the initial state and two terminals; each move costs -0.1 (damage) as time flows, $t = 0, 1, 2, \dots$]
Reward and Return
• Return: the summation of all rewards.
– Ex) Rewards are -0.1, -0.1, 1.
– Return = -0.1 - 0.1 + 1 = 0.8
• Question: what is the return at another position?
– Ex) Rewards are -0.1 and 1.
– Return = -0.1 + 1 = 0.9
$$R = \sum_{k=1} r_k$$
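A return is just the plain sum of all the rewards along one path. A minimal check in Python of the two example returns above (the reward lists come from the slide; everything else is mine):

```python
# Return = plain sum of all rewards collected along one path.
rewards_far  = [-0.1, -0.1, 1.0]   # two damage steps, then the +1 terminal
rewards_near = [-0.1, 1.0]         # starting one step closer to the terminal

print(round(sum(rewards_far), 3))   # 0.8
print(round(sum(rewards_near), 3))  # 0.9
```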
Return at Different Position
• The return is a function of the state position it is evaluated from.
[Figure: the same chain $s_0 \dots s_7$, with the return evaluated from two different starting positions.]
$$R(s_0 \to s_7) = \sum_{k} r_k = -0.1 \times 6 + 1 = 0.4$$
$$R(s_5 \to s_7) = \sum_{k} r_k = -0.1 + 1 = 0.9$$
Example of a Single Return
[Figure: one sampled trajectory over the chain $s_0 \dots s_7$ with -0.1 damage per move; the agent passes the same position at $t=0$ and at $t=4$.]
$$R_t = \sum_{k=t+1} r_k$$
A single return for one sampled trajectory:
$$R_{t=0}(s_0 \to s_7) = -0.1 \times 6 + 1 = 0.4$$
$$R_{t=4}(s_4 \to s_7) = -0.1 \times 4 + 1 = 0.6$$
Watch this: $s_0 = s_4$ (the same position)! However, because $s_4$ is closer to $s_7$, $R_{t=0}$ is smaller than $R_{t=4}$ ($0.4 < 0.6$).
However, There are Many Return Values
[Figure: two different trajectories between the initial state and the terminals, each giving a different return.]
• The many possible returns are averaged for learning:
$$E\{R_t\} = E\Big\{\sum_{k=t+1} r_k\Big\}$$
Summary of Reinforcement Learning
• Future reward
– If the agent moves in the future, how much reward will it obtain? (Not the past reward.)
• Return = the sum of all possible future rewards
• A bigger expectation of the return (the sum of all future rewards) is better for us → Reinforcement Learning!
$$R_t = \sum_{k=t+1} r_k, \qquad E\{R_t\} = E\Big\{\sum_{k=t+1} r_k\Big\}$$
Expectation is Hard Work
• The state value is based on an expectation.
• In other words, we collect many path data.
– How do we estimate the expectation? We need a brilliant idea!!
• The expectation can be estimated by an iterative method.
$$E_N(x) = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad E_{N-1}(x) = \frac{1}{N-1}\sum_{i=1}^{N-1} x_i$$
$$N\,E_N(x) = \sum_{i=1}^{N} x_i = x_N + (N-1)\,E_{N-1}(x)$$
$$E_N(x) = \frac{1}{N}\,x_N + \frac{N-1}{N}\,E_{N-1}(x) = \frac{1}{N}\,x_N + \Big(1-\frac{1}{N}\Big)E_{N-1}(x)$$
Treating $1/N$ as a fixed gain $\alpha$ gives the Infinite Impulse Response form $E_N(x) \approx \alpha\,x_N + (1-\alpha)\,E_{N-1}(x)$.
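Below is a minimal numerical check of this recursion (not part of the lecture's scripts; the sample count and seed are my choices). The iterative estimate reproduces the batch mean.

```python
import random

random.seed(0)
xs = [random.random() for _ in range(1000)]   # samples whose true mean is 0.5

# Iterative running mean: E_N = (1/N) * x_N + (1 - 1/N) * E_{N-1}
E = 0.0
for N, x in enumerate(xs, start=1):
    E = (1.0 / N) * x + (1.0 - 1.0 / N) * E

print(E)                  # iterative estimate
print(sum(xs) / len(xs))  # batch mean; identical up to floating-point rounding
```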
Estimated Expectation with IIR Filter
• From digital signal processing (DSP):
• Finite Impulse Response (FIR) vs. Infinite Impulse Response (IIR)
• Basic concept
– A set of impulse responses represents the system behavior.
– FIR uses a finite set of impulses, while IIR uses a recursive set of impulses.
$$\frac{Y(s)}{X(s)} = G(s); \quad \text{the Laplace transform of the impulse } \delta(t) \text{ is } 1, \text{ so } Y(s) = G(s)$$
$$\text{IIR:}\quad f_k = \alpha\,x_k + (1-\alpha)\,f_{k-1}$$
Average Filter
Ex) ex/ml/l10iir.py
$$f_k = \alpha\,x_k + (1-\alpha)\,f_{k-1}$$
[Plots: the filter output for 100 and for 1000 random samples $x \in [0,1]$.]
• IIR filter: the output converges to the averaged value, 0.5.
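A small sketch in the spirit of l10iir.py (the original script is not reproduced here; the alpha value, sample counts, and plotting are assumptions):

```python
import random
import matplotlib.pyplot as plt

random.seed(0)
alpha = 0.01          # assumed filter gain
n_samples = 1000      # try 100 vs. 1000 to reproduce the two plots

f = 0.0               # IIR filter state
history = []
for _ in range(n_samples):
    x = random.random()               # random input in [0, 1], mean 0.5
    f = alpha * x + (1 - alpha) * f   # f_k = a*x_k + (1-a)*f_{k-1}
    history.append(f)

plt.plot(history)
plt.axhline(0.5, linestyle="--")      # the value the filter settles toward
plt.xlabel("k"); plt.ylabel("f_k")
plt.show()
```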
Important Meaning of Return 1
• Think about the next two cases:
– Case 1) X3→X2→X1→X0
– Case 2) X3→X2→X3→X2→X3→X2→X3→X2→X1→X0
• With a negative reward (e.g., -0.1):
– Case 1) -0.1*2 + 1 = 0.8 (return)
– Case 2) -0.1*8 + 1 = 0.2 (return)
– 0.8 is better than 0.2.
• Without a negative reward:
– Case 1) 0*2 + 1 = 1
– Case 2) 0*8 + 1 = 1
– Question: are case 1) and case 2) equal?????
Important Meaning of Return 2
• We must remember that returns are averaged (expected).
– The returns of case 1) and case 2) will be averaged.
• After many cases are averaged, what happens?
[Figure: chain x3 → x2 → x1 → x0, with reward +1 at the terminal x0.]
$$E\{R\} \text{ at } x_1 \;>\; E\{R\} \text{ at } x_2$$
Why? Because x1 has a bigger probability of reaching x0 than x2 does. Why? x1 is closer to x0.
Expected Return Finds Optimality without a Negative Reward
• Remember that the -0.1 reward is helpful for finding the optimal path:
– A long-distance journey is NOT good for the agent.
– Case 1) X3→X2→X1→X0 (best): -0.1*2 + 1 = 0.8
– Case 2) X3→X2→X3→X2→X3→X2→X3→X2→X1→X0 (poor): -0.1*8 + 1 = 0.2
• But even without a negative reward, the expected return still tells us which direction is good and which is not.
• In any case, we can accelerate learning by using the discounted return.
Summary of RL
• Future reward
– If the agent moves in the future, how much reward will it obtain? (Not the past reward.)
• Return = the sum of all possible future rewards
• Discounted return:
– The farther a reward is from the current state, the more strongly it is discounted.
– This helps to find the optimal path without wasting repetitive state transitions like [3,2,3,2,3,2,3,2,1,0].
• Episode: one sequence from the initial state to the terminal state
$$R_t = \sum_{k=0} \gamma^k\, r_{t+k+1} = \gamma^0 r_{t+1} + \gamma^1 r_{t+2} + \dots, \qquad \gamma^0 = 1$$
2. Monte-Carlo (MC) Method
Monte Carlo (MC) Method
• If a state s is equal to a position x,
• then, from the state s, we can speak of a function of the position x.
[Figure: chain of states $s_0 \dots s_6$ between the initial state and the terminals, as time $t$ flows.]
$$x_i \in X, \qquad s_t \in S, \qquad R_t:\ \text{return}$$
$$\text{if } s_t = x_i, \quad V(s_t) = V(x_i)$$
Monte Carlo (MC) Method
• Expected return = state value function
• Monte Carlo: update V(s) with the return R along the saved state-transition history.
– MC does not use the discounted return; it uses the plain return.
$$E\{R_t\} = E\Big\{\sum_k r_k \,\Big|\, s\Big\} = E\{r_{k+1} + r_{k+2} + r_{k+3} + \dots \mid s\} \equiv V(s)$$
$$V(s') \leftarrow (1-\alpha)\,V(s') + \alpha\,R_t \quad \text{for every } s' \text{ along the history } h$$
$$h = \{x_5, x_6, x_5, x_4, \dots, x_{\text{terminal}}\}$$
Example of MC Method
[Figure: chain of states $s_0 \dots s_7$ between the initial state and the terminals.]
$$R_t = \sum_{k=t+1} r_k, \qquad h = \{5, 6, 5, 4, 5, 3, 2, 1, 0\}$$
Each visited state is updated with the return measured from that visit:
$$V(5) \leftarrow (1-\alpha)\,V(5) + \alpha\,R_t$$
$$V(6) \leftarrow (1-\alpha)\,V(6) + \alpha\,R_t$$
$$V(5) \leftarrow (1-\alpha)\,V(5) + \alpha\,R_t$$
$$V(4) \leftarrow (1-\alpha)\,V(4) + \alpha\,R_t$$
$$\dots$$
Example of MC Method, l10mc1.py
• +1 reward at the left terminal, +2 reward at the right terminal; otherwise r = 0.
• How it works:
[Figure: a 1-D grid s = 0 ... 10 with S0 as the initial state; the learned V(s) is plotted over the grid. Example history: h = {5, 6, 5, 4, 5, 3, 2, 1, 0}.]
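A minimal sketch of this kind of Monte-Carlo experiment (not the original l10mc1.py; the grid size, start state, random policy, and alpha are assumptions consistent with the slide):

```python
import random

N_STATES = 11                    # grid positions 0..10; terminals at both ends
R_LEFT, R_RIGHT = 1.0, 2.0       # +1 at the left terminal, +2 at the right one
ALPHA = 0.1                      # assumed learning rate
START = 5                        # assumed initial state

V = [0.0] * N_STATES

def run_episode():
    """Random walk from START to a terminal; return (visited states, return R)."""
    s, history, R = START, [], 0.0
    while s not in (0, N_STATES - 1):
        history.append(s)
        s += random.choice((-1, 1))                    # random policy
        if s == 0:
            R += R_LEFT
        elif s == N_STATES - 1:
            R += R_RIGHT
    return history, R

for _ in range(1000):
    history, R = run_episode()
    # MC update: every visited state gets the (undiscounted) return.
    # Only the terminal gives a reward here, so the return measured from
    # every visited state equals that terminal reward.
    for s in history:
        V[s] = (1 - ALPHA) * V[s] + ALPHA * R

print([round(v, 2) for v in V])  # values grow toward the +2 terminal on the right
```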
Example of Episode
$$R_t = \sum_{k=t+1} r_k$$
History, h:
$$V(s') \leftarrow (1-\alpha)\,V(s') + \alpha\,R_t \quad \text{along all of the history } h$$
Example results with 1000 Episodes
• V(s) says that the right direction is better.
[Plots: the learned V(s) with alpha = 0.01 and with alpha = 0.1.]
Example of More Complex Cases, l10mc2
[Figure: the 1-D grid s = 0 ... 10 with rewards +2 and +1 at the terminals and -0.1 per step; V(s) after 5000 episodes with alpha = 0.01.]
• From these results, it is not easy to say which one is better.
5000 Episodes with a Low Alpha Value (0.001)
• When the number of episodes increases, a low alpha value helps the convergence, although the effect is not dramatic.
• The results show that RL gives us a more finely resolved basis for decisions.
[Plots: V(s) curves; s0 marks the initial state in each plot.]
Summary of Monte-Carlo Method
• MC directly uses the return to update the state value.
– It is a very intuitive method.
– MC is often used for verifying system characteristics.
– Many casino games are analyzed by MC.. ^^
• MC does not use the discounted return:
– no gamma.
• Shortcomings:
– MC must store the whole history of state transitions.
– The longer the state transitions become, the bigger this handicap is.
3. Discounted Return
Discounted Return
• The discounted return uses weighted rewards.
• Far-future rewards are strongly reduced.
• Near-future rewards are only slightly reduced.
• e.g., S3→S2→S3→S2→S3→S2→...→S3→S2→S1→S0
– Far-future rewards become meaningless.
– The result of a long journey is neglected.
– The reduction ratio gamma is used.
Definition of Discounted Return
• Discounted return (defined below)
• Why the discounted return is effective even without -0.1 rewards:
– Best case: s = [3, 2, 1, 0], reward +1 at s = 0.
– A non-optimal case: s = [3, 2, 3, 2, 1, 0], reward +1 at s = 0.
– Which one gives the larger return?
$$R_t = \sum_{k=0} \gamma^k\, r_{t+k+1} = \gamma^0 r_{t+1} + \gamma^1 r_{t+2} + \dots \qquad (0 < \gamma \le 1,\ \gamma^0 = 1)$$
Best case, $s = [3,2,1,0]$:
$$R_{t=0}\big|_{s_0=3} = \gamma^0\cdot 0 + \gamma^1\cdot 0 + \gamma^2\cdot 1 = \gamma^2$$
Non-optimal case, $s = [3,2,3,2,1,0]$:
$$R_{t=0}\big|_{s_0=3} = \gamma^0\cdot 0 + \gamma^1\cdot 0 + \gamma^2\cdot 0 + \gamma^3\cdot 0 + \gamma^4\cdot 1 = \gamma^4$$
$$\gamma^2 > \gamma^4$$
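A quick numerical check of this comparison (gamma = 0.9 is my choice; the slide only requires 0 < gamma ≤ 1):

```python
GAMMA = 0.9  # assumed discount factor

def discounted_return(rewards, gamma=GAMMA):
    """R = sum_k gamma^k * r_{k+1} for one reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

best   = discounted_return([0, 0, 1])        # s = [3, 2, 1, 0]
wasted = discounted_return([0, 0, 0, 0, 1])  # s = [3, 2, 3, 2, 1, 0]
print(round(best, 4), round(wasted, 4))      # 0.81 vs 0.6561 -> gamma^2 > gamma^4
```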
Examples of a Single Discounted Return
[Figure: the chain $s_0 \dots s_7$ with no damage reward; +1 only at the terminal $s_7$.]
$$R_t = \sum_{k=0} \gamma^k\, r_{t+k+1} = \gamma^0 r_{t+1} + \gamma^1 r_{t+2} + \dots, \qquad \gamma^0 = 1$$
$$R_t(s_0 \to s_7) = \gamma^0\cdot 0 + \gamma^1\cdot 0 + \dots + \gamma^6\cdot 1 = \gamma^6$$
$$R_t(s_4 \to s_7) = \gamma^0\cdot 0 + \gamma^1\cdot 0 + \gamma^2\cdot 1 = \gamma^2$$
$$R_t(s_0 \to s_7) = \gamma^6 \;<\; R_t(s_4 \to s_7) = \gamma^2$$
State Value, V(s):
the Stochastic Version of the Discounted Return
• Expected discounted return (= state value)
– The average of all future rewards. Remember that there are many possible paths,
e.g., S = [3,2,3,2,1,0], S = [3,2,3,2,3,2,1,0], S = [3,4,3,2,1,0].
– We need to average over all possible cases → expectation.
• Definition of the state value, V(s):
$$E\{R_t \mid s_t = s\} = E\Big\{\sum_{j=0} \gamma^j\, r_{t+j+1} \,\Big|\, s\Big\} = E\{\gamma^0 r_{t+1} \mid s\} + E\{\gamma^1 r_{t+2} \mid s\} + E\{\gamma^2 r_{t+3} \mid s\} + \dots$$
$$V(s) \equiv E\{R_t \mid s_t = s\}$$
Meaning of Discounted Return
• The path information is absorbed into the state value.
$$E\{R_t \mid s_t = s\} = E\{r_{t+1} \mid s\} + \gamma\, E\{r_{t+2} \mid s\} + \gamma^2\, E\{r_{t+3} \mid s\} + \dots$$
[Figure: a 1-D chain with terminal values -1 on the left and +1 on the right; intermediate rewards are 0.]
One case (the +1 reward arrives after four steps):
$$E\{R_t \mid s_t = s\} = E\{0 + \gamma\cdot 0 + \gamma^2\cdot 0 + \gamma^3\cdot 0 + \gamma^4\cdot 1\}$$
Another case (the +1 reward arrives after three steps):
$$E\{R_t \mid s_t = s\} = E\{0 + \gamma\cdot 0 + \gamma^2\cdot 0 + \gamma^3\cdot 1\}$$
The expectation is computed iteratively, e.g. $V \leftarrow 0.9\,V + 0.1\,V_{\text{new}}$.
RL Summary
• Return:
– the sum of all possible rewards
• Discounted return:
– the sum of all discounted rewards, using gamma
• Expected return: the average of the (discounted) return = state value, V(s)
• Episode: one sequence from the initial state to the terminal state
• State value estimation with two different methods:
– 1. Monte-Carlo method
– 2. Temporal-Difference method
4. Temporal Difference
Temporal Difference in RL
• Back to the state-value definition.
• State value (below).
• Without history information → Temporal Difference.
$$E\{R_t \mid s_t = s\} = E\{r_{t+1} \mid s\} + \gamma\, E\{r_{t+2} \mid s\} + \gamma^2\, E\{r_{t+3} \mid s\} + \dots$$
$$V(s) = E\{R_t \mid s_t = s\}$$
$$\begin{aligned}
V(s) &= E\{r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \dots \mid s\} \\
     &= E\{r_{t+1} + \gamma\,(r_{t+2} + \gamma\, r_{t+3} + \gamma^2 r_{t+4} + \dots) \mid s\} \\
     &= E\{r_{t+1} + \gamma\, V(s_{t+1})\}
\end{aligned}$$
Temporal Difference:
The Crucial Idea in RL
• Observe the current state, s.
• State value: V(s).
• Make a random movement by an action, a.
• Sense and act.
• Update the state value, V.
• Think of the expectation through alpha (0.01 in general).
$$s \xrightarrow{\,a\,} s'$$
$$V(s) \leftarrow r(s) + \gamma\, V(s')$$
$$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\,\big(r(s) + \gamma\, V(s')\big)$$
Example of l10td1.py
$$V(s) \leftarrow \alpha\,\big(r(s) + \gamma\, V(s')\big) + (1-\alpha)\,V(s)$$
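A minimal TD(0) sketch in the spirit of l10td1.py (not the original script; the grid, rewards, alpha, and gamma are assumptions matching the MC sketch above):

```python
import random

N_STATES = 11                     # grid 0..10, terminals at both ends
R_LEFT, R_RIGHT = 1.0, 2.0
ALPHA, GAMMA = 0.1, 0.9           # assumed learning rate and discount factor
START = 5

V = [0.0] * N_STATES              # V stays 0 at the terminals

for _ in range(1000):             # episodes
    s = START
    while s not in (0, N_STATES - 1):
        s_next = s + random.choice((-1, 1))            # random action
        if s_next == 0:
            r = R_LEFT
        elif s_next == N_STATES - 1:
            r = R_RIGHT
        else:
            r = 0.0
        # TD update: V(s) <- (1-a)V(s) + a(r + gamma * V(s'))
        V[s] = (1 - ALPHA) * V[s] + ALPHA * (r + GAMMA * V[s_next])
        s = s_next

print([round(v, 2) for v in V])   # a curved profile, unlike MC's nearly straight line
```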
Result of l10td1
• MC shows a nearly STRAIGHT line.
• TD shows a curved result. Why?
– Think about gamma.
[Plots: 1000 episodes with alpha = 0.1, 2000 episodes with alpha = 0.1, and 2000 episodes with alpha = 0.01.]
Example of More Complex Cases, l10td2
[Figure: the 1-D grid s = 0 ... 10 with rewards +2, +1, and -0.1 as before; V(s) for 1000 episodes with alpha = 0.1, 1000 episodes with alpha = 0.01, and 2000 episodes with alpha = 0.01.]
• TD shows better performance than MC.
5. HW: MC and TD
Ex-1) Baskin Robbins Game
• Initial state: S = 0
• Terminal state: S = 31
• The RL agent says 1, 2, or 3 consecutive numbers.
• Then we say 1, 2, or 3 consecutive numbers.
• Finally, RL wins if you are the one who says the number 31 (or beyond).
• Reward
– If RL loses, it obtains -1.
– If RL wins, it obtains +1.
• How does it work?...
Baskin Robbins 31 Game
• Example
– Agent: 1,2 → 6,7,8 → ... → 23,24 → 28,29,30
– Opponent: 3,4,5 → 9,10,11 → ... → 22 → 26,27 → 31
– The opponent speaks 31 and loses the game.
• RL design
[Figure: from state S (22) (the opponent has just said ..., 22), the agent's action is to say 23, 24; the opponent then says ..., 26, 27 (an environmental change), which gives the next state S' (27). So the transition S (22) → S' (27) is the action (+2) plus the environment, and S' is not determined by the action alone.]
Hint for Every Problem
• In the Baskin Robbins game, the next state is NOT determined, because your turn is added.
– RL moves from 0 to 3; then your turn moves the count from 3 to 4~6.
– From RL's point of view, an action of 1, 2, or 3 can move the state forward by anywhere from 2 to 6.
– Thus, RL works in a stochastic way.
• As in the game you played yourself, the RL results say that RL obtains the best reward at 27.
How to Build the Baskin Robbins Game? MC Example
[Flowchart: the RL agent's turn, then your turn; if the count reaches 31 on RL's turn, RL loses the game; if it reaches 31 on your turn, RL wins the game. A sketch of this step logic is given below.]
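A hedged sketch of how such a turn could be coded (this is not the homework solution; the opponent playing uniformly at random is an assumption):

```python
import random

TERMINAL = 31

def step(count, action):
    """One RL turn plus one (random) opponent turn.

    count  : the last number spoken so far
    action : 1, 2, or 3 -- how many numbers RL speaks
    Returns (next_count, reward, done). The next state is stochastic because
    the opponent's move is added on top of RL's action.
    """
    count += action                   # RL speaks `action` numbers
    if count >= TERMINAL:             # RL had to say 31 -> RL loses
        return count, -1.0, True
    count += random.randint(1, 3)     # the opponent speaks 1-3 numbers
    if count >= TERMINAL:             # the opponent said 31 -> RL wins
        return count, +1.0, True
    return count, 0.0, False          # the game continues; no reward yet
```

An MC or TD learner can then be wrapped around step() in the same way as in the l10mc1.py / l10td1.py sketches above.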
Prob. 1. Complete YOUR Baskin Robbins Game with MC
• Example of an MC result
[Plot: example MC result; S = 27 is marked.]
Prob. 2. Complete YOUR Baskin Robbins Game with TD
• Example of a TD result
[Plot: example TD result; S = 27 is marked.]
Discussion
• Prob. 3. Explain why 27 is so important.
• Prob. 4.a. Why does MC have so many fluctuations?
• Prob. 4.b. How can we REDUCE the many fluctuations, as in the result below? Show your result.
[Plots: Prob. 4.a shows many fluctuations (dirty); Prob. 4.b should be smooth here!]
Prob. 5. Discussion of the TD Results
• Prob. 5.a. In the TD results, V(s) is slightly positive from s = 0 to s = 20. What is the meaning of this?
Prob. 5.b.
• Prob. 5.b. After 2000, 4000, and 6000 episodes, TD shows this tendency:
– 23 is better than 25, and 27 is better than 23.
– What is the meaning of this?
[Plot: V(s) around s = 23, 25, and 27.]
Ex-2) Q-Learning: l9q1.py
• Q-learning has two modes.
• 1. Exploration: random search to update the Q value.
• 2. Exploitation: following the maximum Q value.
– The agent follows the maximum Q value.
– argmax_a Q(s,a) = a*, the best policy (action).
$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\big(r(s,a) + \gamma \max_{a'} Q(s',a')\big)$$
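A generic tabular Q-learning sketch of this update rule (not the original l9q1.py; the epsilon-greedy exploration schedule and the hyper-parameters are assumptions):

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # assumed hyper-parameters
ACTIONS = (1, 2, 3)

Q = defaultdict(float)                   # Q[(state, action)] -> value, 0 by default

def choose_action(s):
    """Exploration with probability EPSILON, otherwise exploitation."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)                # exploration: random search
    return max(ACTIONS, key=lambda a: Q[(s, a)])     # exploitation: a* = argmax_a Q(s,a)

def q_update(s, a, r, s_next, done):
    """Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma * max_a' Q(s',a'))."""
    target = r if done else r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target
```

Combined with the step() sketch from the Baskin Robbins slides, choose_action() and q_update() form one complete learning loop.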