## Why does the Mars Rover have an expected return of 1.53 when starting from the first cell?

To answer the question raised in the previous post, about the difference in the critic neural network between the AC and DDPG algorithms, I thought it would be better to review Stanford's CS234 Reinforcement Learning online course. The first lecture went well, but in the second lecture, which discusses the Markov Reward Process, around the twentieth minute (you can watch it here: https://youtu.be/E3f2Camj0Is?t=1205) I came across the number 1.53, and it stopped me for a bit, as you can see in the figure below

Dr. Emma Brunskill also mentioned that to get this number you have to average over a lot of equations like the ones shown below

I tried to calculate this number (1.53) manually, so I worked out all the possible combinations starting from the first cell:

As you can see, V(S1), which is the expected return for the Mars Rover starting from S1 and making 3 moves, equals 1.485, but in the lecture V(S1) = 1.53, so why didn't I get it right? I double-checked the calculation, but it gave me the same result. I thought it might need more than 4 steps to get it right, so I wrote a program in Matlab to calculate the value function; please check the MarsRover.m script file in our repository: https://github.com/hobby-robotics
When running the script at 4 steps it gave 1.485, which confirmed that my manual calculations were right, but I still didn't obtain the 1.53 that I was looking for, as you can see in the figure below

I decided to try with more steps, and this is what I got:
4 Steps -> 1.5099
5 Steps -> 1.5212
6 Steps -> 1.5271
10 Steps -> 1.5336, which is pretty close to the 1.53 mentioned in the lecture. But trying more steps in Matlab took considerably longer; actually, I couldn't wait for it to end, so I decided to write another program in C++ for better performance, and I obtained 1.53407 when running it at 15 steps per episode, as you can see in the figure below

Moving forward in the lecture, I found a closed-form solution for the value function, as you can see below

So I decided to try this equation, and voilà, math never ceases to amaze me; just look at the figure below

1.5343 is the right number, and we got it in just 3 lines of code: no simulations, no summing over the infinite number of paths the robot could take. All we need is one equation to obtain the most accurate result you can get. But how did this happen? As we saw earlier, it follows from this simple equation:

V2 = R + Gamma * P * V1

V2 -> Current Expected Return
R -> Reward
Gamma -> Discount Factor
P -> Transition Matrix
V1 -> Previous Expected Return

It is obvious that after a very large number of iterations V2 converges and becomes very close to V1, but can we prove this equation?

To check this, let's calculate the expected return V(S12) of the Mars Rover starting from cell 1 and taking 2 steps, and the expected return V(S22) starting from cell 2 and taking 2 steps

V(S12) = 1 * 0.4 * ( 1 + 0.5 * 0) +
1 * 0.6 * ( 1 + 0.5 * 1)

V(S22) = 1 * 0.4 * ( 0 + 0.5 * 0) +
1 * 0.2 * ( 0 + 0.5 * 0) +
1 * 0.4 * ( 0 + 0.5 * 1)

Now let’s consider a mars rover starting from cell 1 and moving 3 steps

V(S13) =
1 * 0.4 * 0.4 * ( 1 + 0.5 * 0 + 0.5 * 0.5 * 0 ) +
1 * 0.4 * 0.2 * ( 1 + 0.5 * 0 + 0.5 * 0.5 * 0 ) +
1 * 0.4 * 0.4 * ( 1 + 0.5 * 0 + 0.5 * 0.5 * 1 ) +
1 * 0.6 * 0.4 * ( 1 + 0.5 * 1 + 0.5 * 0.5 * 0 ) +
1 * 0.6 * 0.6 * ( 1 + 0.5 * 1 + 0.5 * 0.5 * 1)

V(S13) =
1 * 0.4 * 0.4 * ( 1 ) + 1 * 0.4 * 0.4 * ( 0.5 * 0 + 0.5 * 0.5 * 0 ) +
1 * 0.4 * 0.2 * ( 1 ) + 1 * 0.4 * 0.2 * ( 0.5 * 0 + 0.5 * 0.5 * 0 ) +
1 * 0.4 * 0.4 * ( 1 ) + 1 * 0.4 * 0.4 * ( 0.5 * 0 + 0.5 * 0.5 * 1 ) +
1 * 0.6 * 0.4 * ( 1 ) + 1 * 0.6 * 0.4 * ( 0.5 * 1 + 0.5 * 0.5 * 0 ) +
1 * 0.6 * 0.6 * ( 1 ) + 1 * 0.6 * 0.6 * ( 0.5 * 1 + 0.5 * 0.5 * 1 )

V(S13) =
1+
1 * 0.4 * 0.4 * ( 0.5 * 0 + 0.5 * 0.5 * 0 ) +
1 * 0.4 * 0.2 * ( 0.5 * 0 + 0.5 * 0.5 * 0 ) +
1 * 0.4 * 0.4 * ( 0.5 * 0 + 0.5 * 0.5 * 1 ) +
1 * 0.6 * 0.4 * ( 0.5 * 1 + 0.5 * 0.5 * 0 ) +
1 * 0.6 * 0.6 * ( 0.5 * 1 + 0.5 * 0.5 * 1 )

V(S13) =
1 + 0.5 * (
1 * 0.4 * 0.4 * ( 0 + 0.5 * 0 ) +
1 * 0.4 * 0.2 * ( 0 + 0.5 * 0 ) +
1 * 0.4 * 0.4 * ( 0 + 0.5 * 1 ) +
1 * 0.6 * 0.4 * ( 1 + 0.5 * 0 ) +
1 * 0.6 * 0.6 * ( 1 + 0.5 * 1 ))

V(S13)  =
1  + 0.5 * (
1 * 0.4 * ( 0.4 * ( 0 + 0.5 * 0 ) +
0.2 * ( 0 + 0.5 * 0 ) +
0.4 * ( 0 + 0.5 * 1 ) ) +
1 * 0.6 * ( 0.4 * ( 1 + 0.5 * 0 ) +
0.6 * ( 1 + 0.5 * 1 ) ))

V(S13) = 1 + 0.5 * ( 0.4 * V(S22) + 0.6 * V(S12) )

The convergence here is the same idea as a geometric series: https://en.wikipedia.org/wiki/Geometric_series

## Why doesn't the critic network in the AC algorithm have an action path like it does in the DDPG algorithm?

I am currently working on answering this question before moving forward and applying the AC algorithm. I noticed that difference when I was migrating from the DDPG algorithm to the AC algorithm. The reason I am migrating to AC is that I need discrete actions for controlling my stepper motor, instead of DDPG's continuous actions, which are suitable for a DC motor. The stepper motor has 3 actions: [move_right, move_left, no_move]. I will provide more details while trying to answer this question, stay tuned.

## Parts List (Not Final)

• Omron Encoder 2000 PPR
• 2 × 1 kΩ Resistor
• EasyDriver 4.4
• 12V Power Supply
• DC Barrel Female Jack Adapter