For answering the question that we mentioned at the previous post, about the difference in the critic neural network between the AC and DDPG algorithms, I thought it will be better to review the Stanford CS234, Reinforcement Learning online course, the first lecture went well, but in the second lecture discussing the Markov Reward Process at twentieth minute, you can watch it here https://youtu.be/E3f2Camj0Is?t=1205

I found this number 1.53 and it stopped me a little bit as you can see in the figure below

Dr. Emma Brunskill also mentioned to get this number you have to average over a lots of those equations as you can see below

I tried to calculate this number ( 1.53 ) manually, so I found out all the possible combinations starting from the first cell:

States | Probaility | Return | Return |

S1, S2, S3, S4 | 0.4*0.4*0.4 = 0.064 | 1+0.5*0+0.25*0+0.125*0 = 1.0 | 0.064*1.0 = 0.064 |

S1, S2, S3, S3 | 0.4*0.4*0.2 = 0.032 | 1+0.5*0+0.25*0+0.125*0 = 1.0 | 0.032*1.0 = 0.032 |

S1, S2, S3, S2 | 0.4*0.4*0.4 = 0.064 | 1+0.5*0+0.25*0+0.125*0 = 1.0 | 0.064*1.0 = 0.064 |

S1, S2, S2, S3 | 0.4*0.2*0.4 = 0.032 | 1+0.5*0+0.25*0+0.125*0 = 1.0 | 0.032*1.0 = 0.032 |

S1, S2, S2, S2 | 0.4*0.2*0.2 = 0.016 | 1+0.5*0+0.25*0+0.125*0 = 1.0 | 0.016*1.0 = 0.016 |

S1, S2, S2, S1 | 0.4*0.2*0.4 = 0.032 | 1+0.5*0+0.25*0+0.125*1 = 1.125 | 0.032*1.125 = 0.036 |

S1, S2, S1, S2 | 0.4*0.4*0.4 = 0.064 | 1+0.5*0+0.25*1+0.125*0 = 1.25 | 0.064*1.25 = 0.08 |

S1, S2, S1, S1 | 0.4*0.4*0.6 = 0.096 | 1+0.5*0+0.25*1+0.125*1 = 1.375 | 0.096*1.375 = 0.132 |

S1, S1, S2, S3 | 0.6*0.4*0.4 = 0.096 | 1+0.5*1+0.25*0+0.125*0 = 1.5 | 0.096*1.5 = 0.144 |

S1, S1, S2, S2 | 0.6*0.4*0.2 = 0.048 | 1+0.5*1+0.25*0+0.125*0 = 1.5 | 0.048*1.5 = 0.072 |

S1, S1, S2, S1 | 0.6*0.4*0.4 = 0.096 | 1+0.5*1+0.25*0+0.125*1 = 1.625 | 0.096*1.625 = 0.156 |

S1, S1, S1, S2 | 0.6*0.6*0.4 = 0.144 | 1+0.5*1+0.25*1+0.125*0 = 1.75 | 0.144*1.75 = 0.252 |

S1, S1, S1, S1 | 0.6*0.6*0.6 = 0.216 | 1+0.5*1+0.25*1+0.125*1 = 1.875 | 0.216*1.875 = 0.405 |

V(S1) = 1.485 |

As you can see the V(S1) which is the expected return for the Mars Rover starting from S1 and making 3 moves equal 1.485 but in the lecture V(S1) = 1.53, so why I didn’t get it right ? I double checked the calculation but it gave me the same result, I thought that it might need more that 4 steps to get it right, so I wrote a program in Matlab to calculate the Value function, please check the MarsRover.m script file at our repository: https://github.com/hobby-robotics

When running the script at 4 steps it gave 1.485, which ensured that my manual calculations was right, but still I didn’t obtain the 1.53 that I am looking for, as you can see in the figure below

I decided to try with more steps and that is what I got

4 Steps -> 1.5099

5 Steps -> 1.5212

6 Steps -> 1.5271

10 Steps -> 1.5336 which is pretty close to the 1.53 mentioned in the lecture, but when trying more steps in Matlab it took considerably long time, actually I couldn’t wait for it to end, so I decided write to another program in C++ to achieve better performance and I obtained 1.53407 when running it at 15 steps per episode as you can see in the figure below

After moving forward in the lecture I found a closed form solution for the value function as you can see below

So I decided to try this equation, and voila math never stops to amaze me all the time, just look at the figure below

1.5343 this is the right number, and we did it in just 3 lines of code, no need for simulations, and calculating it infinite number of paths the robot could take, all we need is just one equation, to obtain the most accurate result you can get. But how this happened, as we saw earlier it follows from that simple equation:

V2 = R + Gamma * P * V1

V2 -> Current Expected Return

R -> Reward

Gamma -> Discount Factor

P -> Transition Matrix

V1 -> Previous Expected Return

It is obvious that after very long number of iterations V2 converges and also become very close to V1, but can we proof this equation ?

To check this let’s calculate expected returns V12 of mars rover starting from cell 1 and having 2 steps, and the expected returns V22 starting from cell 2 and having 2 steps

V(S12) = 1 * 0.4 * ( 1 + 0.5 * 0) +

1 * 0.6 * ( 1 + 0.5 * 1)

V(S22) = 1 * 0.4 * ( 0 + 0.5 * 0) +

1 * 0.2 * ( 0 + 0.5 * 0) +

1 * 0.4 * ( 0 + 0.5 * 1)

Now let’s consider a mars rover starting from cell 1 and moving 3 steps

V(S13) =

1 * 0.4 * 0.4 * ( 1 + 0.5 * 0 + 0.5 * 0.5 * 0 ) +

1 * 0.4 * 0.2 * ( 1 + 0.5 * 0 + 0.5 * 0.5 * 0 ) +

1 * 0.4 * 0.4 * ( 1 + 0.5 * 0 + 0.5 * 0.5 * 1 ) +

1 * 0.6 * 0.4 * ( 1 + 0.5 * 1 + 0.5 * 0.5 * 0 ) +

1 * 0.6 * 0.6 * ( 1 + 0.5 * 1 + 0.5 * 0.5 * 1)

V(S13) =

1 * 0.4 * 0.4 * ( 1 ) + 1 * 0.4 * 0.4 * ( 0.5 * 0 + 0.5 * 0.5 * 0 ) +

1 * 0.4 * 0.2 * ( 1 ) + 1 * 0.4 * 0.2 * ( 0.5 * 0 + 0.5 * 0.5 * 0 ) +

1 * 0.4 * 0.4 * ( 1 ) + 1 * 0.4 * 0.4 * ( 0.5 * 0 + 0.5 * 0.5 * 1 ) +

1 * 0.6 * 0.4 * ( 1 ) + 1 * 0.6 * 0.4 * ( 0.5 * 1 + 0.5 * 0.5 * 0 ) +

1 * 0.6 * 0.6 * ( 1 ) + 1 * 0.6 * 0.6 * ( 0.5 * 1 + 0.5 * 0.5 * 1 )

V(S13) =

1+

1 * 0.4 * 0.4 * ( 0.5 * 0 + 0.5 * 0.5 * 0 ) +

1 * 0.4 * 0.2 * ( 0.5 * 0 + 0.5 * 0.5 * 0 ) +

1 * 0.4 * 0.4 * ( 0.5 * 0 + 0.5 * 0.5 * 1 ) +

1 * 0.6 * 0.4 * ( 0.5 * 1 + 0.5 * 0.5 * 0 ) +

1 * 0.6 * 0.6 * ( 0.5 * 1 + 0.5 * 0.5 * 1 )

V(S13) =

1 + 0.5 * (

1 * 0.4 * 0.4 * ( 0 + 0.5 * 0 ) +

1 * 0.4 * 0.2 * ( 0 + 0.5 * 0 ) +

1 * 0.4 * 0.4 * ( 0 + 0.5 * 1 ) +

1 * 0.6 * 0.4 * ( 1 + 0.5 * 0 ) +

1 * 0.6 * 0.6 * ( 1 + 0.5 * 1 ))

V(S13) =

1 + 0.5 * (

1 * 0.4 * ( 0.4 * ( 0 + 0.5 * 0 ) +

0.2 * ( 0 + 0.5 * 0 ) +

0.4 * ( 0 + 0.5 * 1 ) ) +

1 * 0.6 * ( 0.4 * ( 1 + 0.5 * 0 ) +

0.6 * ( 1 + 0.5 * 1 ) ))

**V(S13) = 1 + 0.5 * ( 0.4 * V(S22) + 0.6 * V(S12) )**

Please check this link

https://en.wikipedia.org/wiki/Geometric_series

For more information about convergent series