T&C LAB-AI
Dept. of Intelligent Robot Eng. MU
Robotics
Robot Learning: Reinforcement Learning
Lecture 10
양정연
2020/12/10
1. Reward and Return in RL
Past or Future Rewards
• 1. Viewpoint at the terminal
– Return is the sum of all PAST rewards.
• 2. Agent's viewpoint (RL uses this one)
– Return is the sum of all FUTURE rewards.
[Figure: a trajectory from the initial state to the terminal; the past-reward sum looks backward from the terminal, while the future-reward sum looks forward from the agent.]
Reward and Return
• Reward: the agent gets a reward at each state transition.
– Whenever the agent moves, it receives a reward from the environment.
– Ex) +1, +2 at the terminals and -0.1 at each step.
• State: the state changes as time flows ($s_t$).
[Figure: a chain of states $s_0 \dots s_7$ between the initial state and two terminals; each move costs -0.1 (damage) as time flows, $t = 0, 1, 2, \dots$]
Reward and Return
• Return: the summation of all rewards.
– Ex) Rewards are -0.1, -0.1, 1.
– Return = -0.1 - 0.1 + 1 = 0.8
• Question: what is the return at another position?
– Ex) Rewards are -0.1 and 1.
– Return = -0.1 + 1 = 0.9
$$R = \sum_{k=1} r_k$$
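A return is just the plain sum of all the rewards along one path. A minimal check in Python of the two example returns above (the reward lists come from the slide; everything else is mine):

```python
# Return = plain sum of all rewards collected along one path.
rewards_far  = [-0.1, -0.1, 1.0]   # two damage steps, then the +1 terminal
rewards_near = [-0.1, 1.0]         # starting one step closer to the terminal

print(round(sum(rewards_far), 3))   # 0.8
print(round(sum(rewards_near), 3))  # 0.9
```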
Return at Different Position
• The return is a function of the state position it is evaluated from.
[Figure: the same chain $s_0 \dots s_7$, with the return evaluated from two different starting positions.]
$$R(s_0 \to s_7) = \sum_{k} r_k = -0.1 \times 6 + 1 = 0.4$$
$$R(s_5 \to s_7) = \sum_{k} r_k = -0.1 + 1 = 0.9$$
Example of a Single Return
[Figure: one sampled trajectory over the chain $s_0 \dots s_7$ with -0.1 damage per move; the agent passes the same position at $t=0$ and at $t=4$.]
$$R_t = \sum_{k=t+1} r_k$$
A single return for one sampled trajectory:
$$R_{t=0}(s_0 \to s_7) = -0.1 \times 6 + 1 = 0.4$$
$$R_{t=4}(s_4 \to s_7) = -0.1 \times 4 + 1 = 0.6$$
Watch this: $s_0 = s_4$ (the same position)! However, because $s_4$ is closer to $s_7$, $R_{t=0}$ is smaller than $R_{t=4}$ ($0.4 < 0.6$).
However, There are Many Return Values
[Figure: two different trajectories between the initial state and the terminals, each giving a different return.]
• The many possible returns are averaged for learning:
$$E\{R_t\} = E\Big\{\sum_{k=t+1} r_k\Big\}$$
Summary of Reinforcement Learning
• Future reward
– If the agent moves in the future, how much reward will it obtain? (Not the past reward.)
• Return = the sum of all possible future rewards
• A bigger expectation of the return (the sum of all future rewards) is better for us → Reinforcement Learning!
$$R_t = \sum_{k=t+1} r_k, \qquad E\{R_t\} = E\Big\{\sum_{k=t+1} r_k\Big\}$$
Expectation is Hard Work
• The state value is based on an expectation.
• In other words, we collect many path data.
– How do we estimate the expectation? We need a brilliant idea!!
• The expectation can be estimated by an iterative method.
$$E_N(x) = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad E_{N-1}(x) = \frac{1}{N-1}\sum_{i=1}^{N-1} x_i$$
$$N\,E_N(x) = \sum_{i=1}^{N} x_i = x_N + (N-1)\,E_{N-1}(x)$$
$$E_N(x) = \frac{1}{N}\,x_N + \frac{N-1}{N}\,E_{N-1}(x) = \frac{1}{N}\,x_N + \Big(1-\frac{1}{N}\Big)E_{N-1}(x)$$
Treating $1/N$ as a fixed gain $\alpha$ gives the Infinite Impulse Response form $E_N(x) \approx \alpha\,x_N + (1-\alpha)\,E_{N-1}(x)$.
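Below is a minimal numerical check of this recursion (not part of the lecture's scripts; the sample count and seed are my choices). The iterative estimate reproduces the batch mean.

```python
import random

random.seed(0)
xs = [random.random() for _ in range(1000)]   # samples whose true mean is 0.5

# Iterative running mean: E_N = (1/N) * x_N + (1 - 1/N) * E_{N-1}
E = 0.0
for N, x in enumerate(xs, start=1):
    E = (1.0 / N) * x + (1.0 - 1.0 / N) * E

print(E)                  # iterative estimate
print(sum(xs) / len(xs))  # batch mean; identical up to floating-point rounding
```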
Estimated Expectation with IIR Filter
• From digital signal processing (DSP):
• Finite Impulse Response (FIR) vs. Infinite Impulse Response (IIR)
• Basic concept
– A set of impulse responses represents the system behavior.
– FIR uses a finite set of impulses, while IIR uses a recursive set of impulses.
$$\frac{Y(s)}{X(s)} = G(s); \quad \text{the Laplace transform of the impulse } \delta(t) \text{ is } 1, \text{ so } Y(s) = G(s)$$
$$\text{IIR:}\quad f_k = \alpha\,x_k + (1-\alpha)\,f_{k-1}$$
Average Filter
Ex) ex/ml/l10iir.py
$$f_k = \alpha\,x_k + (1-\alpha)\,f_{k-1}$$
[Plots: the filter output for 100 and for 1000 random samples $x \in [0,1]$.]
• IIR filter: the output converges to the averaged value, 0.5.
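A small sketch in the spirit of l10iir.py (the original script is not reproduced here; the alpha value, sample counts, and plotting are assumptions):

```python
import random
import matplotlib.pyplot as plt

random.seed(0)
alpha = 0.01          # assumed filter gain
n_samples = 1000      # try 100 vs. 1000 to reproduce the two plots

f = 0.0               # IIR filter state
history = []
for _ in range(n_samples):
    x = random.random()               # random input in [0, 1], mean 0.5
    f = alpha * x + (1 - alpha) * f   # f_k = a*x_k + (1-a)*f_{k-1}
    history.append(f)

plt.plot(history)
plt.axhline(0.5, linestyle="--")      # the value the filter settles toward
plt.xlabel("k"); plt.ylabel("f_k")
plt.show()
```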
Important Meaning of Return 1
• Think about the next two cases:
– Case 1) X3→X2→X1→X0
– Case 2) X3→X2→X3→X2→X3→X2→X3→X2→X1→X0
• With a negative reward (e.g., -0.1):
– Case 1) -0.1*2 + 1 = 0.8 (return)
– Case 2) -0.1*8 + 1 = 0.2 (return)
– 0.8 is better than 0.2.
• Without a negative reward:
– Case 1) 0*2 + 1 = 1
– Case 2) 0*8 + 1 = 1
– Question: are case 1) and case 2) equal?????
Important Meaning of Return 2
• We must remember that returns are averaged (expected).
– The returns of case 1) and case 2) will be averaged.
• After many cases are averaged, what happens?
[Figure: chain x3 → x2 → x1 → x0, with reward +1 at the terminal x0.]
$$E\{R\} \text{ at } x_1 \;>\; E\{R\} \text{ at } x_2$$
Why? Because x1 has a bigger probability of reaching x0 than x2 does. Why? x1 is closer to x0.
Expected Return Finds Optimality without a Negative Reward
• Remember that the -0.1 reward is helpful for finding the optimal path:
– A long-distance journey is NOT good for the agent.
– Case 1) X3→X2→X1→X0 (best): -0.1*2 + 1 = 0.8
– Case 2) X3→X2→X3→X2→X3→X2→X3→X2→X1→X0 (poor): -0.1*8 + 1 = 0.2
• But even without a negative reward, the expected return still tells us which direction is good and which is not.
• In any case, we can accelerate learning by using the discounted return.
Summary of RL
• Future reward
– If the agent moves in the future, how much reward will it obtain? (Not the past reward.)
• Return = the sum of all possible future rewards
• Discounted return:
– The farther a reward is from the current state, the more strongly it is discounted.
– This helps to find the optimal path without wasting repetitive state transitions like [3,2,3,2,3,2,3,2,1,0].
• Episode: one sequence from the initial state to the terminal state
$$R_t = \sum_{k=0} \gamma^k\, r_{t+k+1} = \gamma^0 r_{t+1} + \gamma^1 r_{t+2} + \dots, \qquad \gamma^0 = 1$$
2. Monte-Carlo (MC) Method
Monte Carlo (MC) Method
• If a state s is equal to a position x,
• then, from the state s, we can speak of a function of the position x.
[Figure: chain of states $s_0 \dots s_6$ between the initial state and the terminals, as time $t$ flows.]
$$x_i \in X, \qquad s_t \in S, \qquad R_t:\ \text{return}$$
$$\text{if } s_t = x_i, \quad V(s_t) = V(x_i)$$
Monte Carlo (MC) Method
• Expected return = state value function
• Monte Carlo: update V(s) with the return R along the saved state-transition history.
– MC does not use the discounted return; it uses the plain return.
$$E\{R_t\} = E\Big\{\sum_k r_k \,\Big|\, s\Big\} = E\{r_{k+1} + r_{k+2} + r_{k+3} + \dots \mid s\} \equiv V(s)$$
$$V(s') \leftarrow (1-\alpha)\,V(s') + \alpha\,R_t \quad \text{for every } s' \text{ along the history } h$$
$$h = \{x_5, x_6, x_5, x_4, \dots, x_{\text{terminal}}\}$$
Example of MC Method
[Figure: chain of states $s_0 \dots s_7$ between the initial state and the terminals.]
$$R_t = \sum_{k=t+1} r_k, \qquad h = \{5, 6, 5, 4, 5, 3, 2, 1, 0\}$$
Each visited state is updated with the return measured from that visit:
$$V(5) \leftarrow (1-\alpha)\,V(5) + \alpha\,R_t$$
$$V(6) \leftarrow (1-\alpha)\,V(6) + \alpha\,R_t$$
$$V(5) \leftarrow (1-\alpha)\,V(5) + \alpha\,R_t$$
$$V(4) \leftarrow (1-\alpha)\,V(4) + \alpha\,R_t$$
$$\dots$$
Example of MC Method, l10mc1.py
• +1 reward at the left terminal, +2 reward at the right terminal; otherwise r = 0.
• How it works:
[Figure: a 1-D grid s = 0 ... 10 with S0 as the initial state; the learned V(s) is plotted over the grid. Example history: h = {5, 6, 5, 4, 5, 3, 2, 1, 0}.]
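A minimal sketch of this kind of Monte-Carlo experiment (not the original l10mc1.py; the grid size, start state, random policy, and alpha are assumptions consistent with the slide):

```python
import random

N_STATES = 11                    # grid positions 0..10; terminals at both ends
R_LEFT, R_RIGHT = 1.0, 2.0       # +1 at the left terminal, +2 at the right one
ALPHA = 0.1                      # assumed learning rate
START = 5                        # assumed initial state

V = [0.0] * N_STATES

def run_episode():
    """Random walk from START to a terminal; return (visited states, return R)."""
    s, history, R = START, [], 0.0
    while s not in (0, N_STATES - 1):
        history.append(s)
        s += random.choice((-1, 1))                    # random policy
        if s == 0:
            R += R_LEFT
        elif s == N_STATES - 1:
            R += R_RIGHT
    return history, R

for _ in range(1000):
    history, R = run_episode()
    # MC update: every visited state gets the (undiscounted) return.
    # Only the terminal gives a reward here, so the return measured from
    # every visited state equals that terminal reward.
    for s in history:
        V[s] = (1 - ALPHA) * V[s] + ALPHA * R

print([round(v, 2) for v in V])  # values grow toward the +2 terminal on the right
```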
Example of Episode
$$R_t = \sum_{k=t+1} r_k$$
History, h:
$$V(s') \leftarrow (1-\alpha)\,V(s') + \alpha\,R_t \quad \text{along all of the history } h$$
Example results with 1000 Episodes
• V(s) says that the right direction is better.
[Plots: the learned V(s) with alpha = 0.01 and with alpha = 0.1.]
Example of More Complex Cases, l10mc2
[Figure: the 1-D grid s = 0 ... 10 with rewards +2 and +1 at the terminals and -0.1 per step; V(s) after 5000 episodes with alpha = 0.01.]
• From these results, it is not easy to say which one is better.
5000 Episodes with a Low Alpha Value (0.001)
• When the number of episodes increases, a low alpha value helps the convergence, although the effect is not dramatic.
• The results show that RL gives us a more finely resolved basis for decisions.
[Plots: V(s) curves; s0 marks the initial state in each plot.]
Summary of Monte-Carlo Method
• MC directly uses the return to update the state value.
– It is a very intuitive method.
– MC is often used for verifying system characteristics.
– Many casino games are analyzed by MC.. ^^
• MC does not use the discounted return:
– no gamma.
• Shortcomings:
– MC must store the whole history of state transitions.
– The longer the state transitions become, the bigger this handicap is.
3. Discounted Return
Discounted Return
• The discounted return uses weighted rewards.
• Far-future rewards are strongly reduced.
• Near-future rewards are only slightly reduced.
• e.g., S3→S2→S3→S2→S3→S2→...→S3→S2→S1→S0
– Far-future rewards become meaningless.
– The result of a long journey is neglected.
– The reduction ratio gamma is used.
Definition of Discounted Return
• Discounted return (defined below)
• Why the discounted return is effective even without -0.1 rewards:
– Best case: s = [3, 2, 1, 0], reward +1 at s = 0.
– A non-optimal case: s = [3, 2, 3, 2, 1, 0], reward +1 at s = 0.
– Which one gives the larger return?
$$R_t = \sum_{k=0} \gamma^k\, r_{t+k+1} = \gamma^0 r_{t+1} + \gamma^1 r_{t+2} + \dots \qquad (0 < \gamma \le 1,\ \gamma^0 = 1)$$
Best case, $s = [3,2,1,0]$:
$$R_{t=0}\big|_{s_0=3} = \gamma^0\cdot 0 + \gamma^1\cdot 0 + \gamma^2\cdot 1 = \gamma^2$$
Non-optimal case, $s = [3,2,3,2,1,0]$:
$$R_{t=0}\big|_{s_0=3} = \gamma^0\cdot 0 + \gamma^1\cdot 0 + \gamma^2\cdot 0 + \gamma^3\cdot 0 + \gamma^4\cdot 1 = \gamma^4$$
$$\gamma^2 > \gamma^4$$
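A quick numerical check of this comparison (gamma = 0.9 is my choice; the slide only requires 0 < gamma ≤ 1):

```python
GAMMA = 0.9  # assumed discount factor

def discounted_return(rewards, gamma=GAMMA):
    """R = sum_k gamma^k * r_{k+1} for one reward sequence."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

best   = discounted_return([0, 0, 1])        # s = [3, 2, 1, 0]
wasted = discounted_return([0, 0, 0, 0, 1])  # s = [3, 2, 3, 2, 1, 0]
print(round(best, 4), round(wasted, 4))      # 0.81 vs 0.6561 -> gamma^2 > gamma^4
```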
Examples of a Single Discounted Return
[Figure: the chain $s_0 \dots s_7$ with no damage reward; +1 only at the terminal $s_7$.]
$$R_t = \sum_{k=0} \gamma^k\, r_{t+k+1} = \gamma^0 r_{t+1} + \gamma^1 r_{t+2} + \dots, \qquad \gamma^0 = 1$$
$$R_t(s_0 \to s_7) = \gamma^0\cdot 0 + \gamma^1\cdot 0 + \dots + \gamma^6\cdot 1 = \gamma^6$$
$$R_t(s_4 \to s_7) = \gamma^0\cdot 0 + \gamma^1\cdot 0 + \gamma^2\cdot 1 = \gamma^2$$
$$R_t(s_0 \to s_7) = \gamma^6 \;<\; R_t(s_4 \to s_7) = \gamma^2$$
State Value, V(s):
the Stochastic Version of the Discounted Return
• Expected discounted return (= state value)
– The average of all future rewards. Remember that there are many possible paths,
e.g., S = [3,2,3,2,1,0], S = [3,2,3,2,3,2,1,0], S = [3,4,3,2,1,0].
– We need to average over all possible cases → expectation.
• Definition of the state value, V(s):
$$E\{R_t \mid s_t = s\} = E\Big\{\sum_{j=0} \gamma^j\, r_{t+j+1} \,\Big|\, s\Big\} = E\{\gamma^0 r_{t+1} \mid s\} + E\{\gamma^1 r_{t+2} \mid s\} + E\{\gamma^2 r_{t+3} \mid s\} + \dots$$
$$V(s) \equiv E\{R_t \mid s_t = s\}$$
Meaning of Discounted Return
• The path information is absorbed into the state value.
$$E\{R_t \mid s_t = s\} = E\{r_{t+1} \mid s\} + \gamma\, E\{r_{t+2} \mid s\} + \gamma^2\, E\{r_{t+3} \mid s\} + \dots$$
[Figure: a 1-D chain with terminal values -1 on the left and +1 on the right; intermediate rewards are 0.]
One case (the +1 reward arrives after four steps):
$$E\{R_t \mid s_t = s\} = E\{0 + \gamma\cdot 0 + \gamma^2\cdot 0 + \gamma^3\cdot 0 + \gamma^4\cdot 1\}$$
Another case (the +1 reward arrives after three steps):
$$E\{R_t \mid s_t = s\} = E\{0 + \gamma\cdot 0 + \gamma^2\cdot 0 + \gamma^3\cdot 1\}$$
The expectation is computed iteratively, e.g. $V \leftarrow 0.9\,V + 0.1\,V_{\text{new}}$.
RL Summary
• Return:
– the sum of all possible rewards
• Discounted return:
– the sum of all discounted rewards, using gamma
• Expected return: the average of the (discounted) return = state value, V(s)
• Episode: one sequence from the initial state to the terminal state
• State value estimation with two different methods:
– 1. Monte-Carlo method
– 2. Temporal-Difference method
4. Temporal Difference
Temporal Difference in RL
• Back to the state-value definition.
• State value (below).
• Without history information → Temporal Difference.
$$E\{R_t \mid s_t = s\} = E\{r_{t+1} \mid s\} + \gamma\, E\{r_{t+2} \mid s\} + \gamma^2\, E\{r_{t+3} \mid s\} + \dots$$
$$V(s) = E\{R_t \mid s_t = s\}$$
$$\begin{aligned}
V(s) &= E\{r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \dots \mid s\} \\
     &= E\{r_{t+1} + \gamma\,(r_{t+2} + \gamma\, r_{t+3} + \gamma^2 r_{t+4} + \dots) \mid s\} \\
     &= E\{r_{t+1} + \gamma\, V(s_{t+1})\}
\end{aligned}$$
Temporal Difference:
The Crucial Idea in RL
• Observe the current state, s.
• State value: V(s).
• Make a random movement by an action, a.
• Sense and act.
• Update the state value, V.
• Think of the expectation through alpha (0.01 in general).
$$s \xrightarrow{\,a\,} s'$$
$$V(s) \leftarrow r(s) + \gamma\, V(s')$$
$$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\,\big(r(s) + \gamma\, V(s')\big)$$
Example of l10td1.py
$$V(s) \leftarrow \alpha\,\big(r(s) + \gamma\, V(s')\big) + (1-\alpha)\,V(s)$$
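A minimal TD(0) sketch in the spirit of l10td1.py (not the original script; the grid, rewards, alpha, and gamma are assumptions matching the MC sketch above):

```python
import random

N_STATES = 11                     # grid 0..10, terminals at both ends
R_LEFT, R_RIGHT = 1.0, 2.0
ALPHA, GAMMA = 0.1, 0.9           # assumed learning rate and discount factor
START = 5

V = [0.0] * N_STATES              # V stays 0 at the terminals

for _ in range(1000):             # episodes
    s = START
    while s not in (0, N_STATES - 1):
        s_next = s + random.choice((-1, 1))            # random action
        if s_next == 0:
            r = R_LEFT
        elif s_next == N_STATES - 1:
            r = R_RIGHT
        else:
            r = 0.0
        # TD update: V(s) <- (1-a)V(s) + a(r + gamma * V(s'))
        V[s] = (1 - ALPHA) * V[s] + ALPHA * (r + GAMMA * V[s_next])
        s = s_next

print([round(v, 2) for v in V])   # a curved profile, unlike MC's nearly straight line
```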
Result of l10td1
• MC shows a nearly STRAIGHT line.
• TD shows a curved result. Why?
– Think about gamma.
[Plots: 1000 episodes with alpha = 0.1, 2000 episodes with alpha = 0.1, and 2000 episodes with alpha = 0.01.]
Example of More Complex Cases, l10td2
[Figure: the 1-D grid s = 0 ... 10 with rewards +2, +1, and -0.1 as before; V(s) for 1000 episodes with alpha = 0.1, 1000 episodes with alpha = 0.01, and 2000 episodes with alpha = 0.01.]
• TD shows better performance than MC.
5. HW: MC and TD
Ex-1) Baskin Robbins Game
• Initial state: S = 0
• Terminal state: S = 31
• The RL agent says 1, 2, or 3 consecutive numbers.
• Then we say 1, 2, or 3 consecutive numbers.
• Finally, RL wins if you are the one who says the number 31 (or beyond).
• Reward
– If RL loses, it obtains -1.
– If RL wins, it obtains +1.
• How does it work?...
Baskin Robbins 31 Game
• Example
– Agent: 1,2 → 6,7,8 → ... → 23,24 → 28,29,30
– Opponent: 3,4,5 → 9,10,11 → ... → 22 → 26,27 → 31
– The opponent speaks 31 and loses the game.
• RL design
[Figure: from state S (22) (the opponent has just said ..., 22), the agent's action is to say 23, 24; the opponent then says ..., 26, 27 (an environmental change), which gives the next state S' (27). So the transition S (22) → S' (27) is the action (+2) plus the environment, and S' is not determined by the action alone.]
Hint for Every Problem
• In the Baskin Robbins game, the next state is NOT determined, because your turn is added.
– RL moves from 0 to 3; then your turn moves the count from 3 to 4~6.
– From RL's point of view, an action of 1, 2, or 3 can move the state forward by anywhere from 2 to 6.
– Thus, RL works in a stochastic way.
• As in the game you played yourself, the RL results say that RL obtains the best reward at 27.
How to Build the Baskin Robbins Game? MC Example
[Flowchart: the RL agent's turn, then your turn; if the count reaches 31 on RL's turn, RL loses the game; if it reaches 31 on your turn, RL wins the game. A sketch of this step logic is given below.]
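A hedged sketch of how such a turn could be coded (this is not the homework solution; the opponent playing uniformly at random is an assumption):

```python
import random

TERMINAL = 31

def step(count, action):
    """One RL turn plus one (random) opponent turn.

    count  : the last number spoken so far
    action : 1, 2, or 3 -- how many numbers RL speaks
    Returns (next_count, reward, done). The next state is stochastic because
    the opponent's move is added on top of RL's action.
    """
    count += action                   # RL speaks `action` numbers
    if count >= TERMINAL:             # RL had to say 31 -> RL loses
        return count, -1.0, True
    count += random.randint(1, 3)     # the opponent speaks 1-3 numbers
    if count >= TERMINAL:             # the opponent said 31 -> RL wins
        return count, +1.0, True
    return count, 0.0, False          # the game continues; no reward yet
```

An MC or TD learner can then be wrapped around step() in the same way as in the l10mc1.py / l10td1.py sketches above.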
Prob. 1. Complete YOUR Baskin Robbins Game with MC
• Example of an MC result
[Plot: example MC result; S = 27 is marked.]
Prob. 2. Complete YOUR Baskin Robbins Game with TD
• Example of a TD result
[Plot: example TD result; S = 27 is marked.]
Discussion
• Prob. 3. Explain why 27 is so important.
• Prob. 4.a. Why does MC have so many fluctuations?
• Prob. 4.b. How can we REDUCE the many fluctuations, as in the result below? Show your result.
[Plots: Prob. 4.a shows many fluctuations (dirty); Prob. 4.b should be smooth here!]
Prob. 5. Discussion of the TD Results
• Prob. 5.a. In the TD results, V(s) is slightly positive from s = 0 to s = 20. What is the meaning of this?
Prob. 5.b.
• Prob. 5.b. After 2000, 4000, and 6000 episodes, TD shows this tendency:
– 23 is better than 25, and 27 is better than 23.
– What is the meaning of this?
[Plot: V(s) around s = 23, 25, and 27.]
Ex-2) Q-Learning: l9q1.py
• Q-learning has two modes.
• 1. Exploration: random search to update the Q value.
• 2. Exploitation: following the maximum Q value.
– The agent follows the maximum Q value.
– argmax_a Q(s,a) = a*, the best policy (action).
$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,\big(r(s,a) + \gamma \max_{a'} Q(s',a')\big)$$
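A generic tabular Q-learning sketch of this update rule (not the original l9q1.py; the epsilon-greedy exploration schedule and the hyper-parameters are assumptions):

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # assumed hyper-parameters
ACTIONS = (1, 2, 3)

Q = defaultdict(float)                   # Q[(state, action)] -> value, 0 by default

def choose_action(s):
    """Exploration with probability EPSILON, otherwise exploitation."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)                # exploration: random search
    return max(ACTIONS, key=lambda a: Q[(s, a)])     # exploitation: a* = argmax_a Q(s,a)

def q_update(s, a, r, s_next, done):
    """Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma * max_a' Q(s',a'))."""
    target = r if done else r + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target
```

Combined with the step() sketch from the Baskin Robbins slides, choose_action() and q_update() form one complete learning loop.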