We decide to experiment with different values for the discount factor. Let's start with 0, which means we only care about the immediate reward:
>>> gamma = 0
>>> V = cal_value_matrix_inversion(gamma, trans_matrix, R)
>>> print("The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([[ 1.],
        [ 0.],
        [-1.]])
This is consistent with the reward function, since we only look at the reward received from the next move.
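The solver simply inverts the Bellman equation, V = R + γPV, to obtain V = (I - γP)^(-1)R. In case cal_value_matrix_inversion is not already defined from the preceding steps, a minimal sketch (assuming trans_matrix is the policy's square transition matrix and R is the per-state reward tensor) could look like this:

>>> import torch
>>> def cal_value_matrix_inversion(gamma, trans_matrix, rewards):
...     # Solve V = R + gamma * P @ V in closed form: V = (I - gamma * P)^(-1) @ R
...     n_state = trans_matrix.shape[0]
...     inv = torch.inverse(torch.eye(n_state) - gamma * trans_matrix)
...     return torch.mm(inv, rewards.reshape(-1, 1))

With gamma = 0, the matrix being inverted is just the identity, so V equals the reward vector, which is exactly the output above.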
As the discount factor increases toward 1, future rewards are considered. Let's take a look at gamma = 0.99:
>>> gamma = 0.99
>>> V = cal_value_matrix_inversion(gamma, trans_matrix, R)
>>> print("The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([[65.8293],
        [64.7194],
        [63.4876]])
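The values are much larger now because, with a discount factor of 0.99, rewards keep accumulating over an effective horizon of roughly 1 / (1 - gamma) = 100 steps. As a quick sanity check (a sketch assuming trans_matrix and R are the transition matrix and reward tensor used above), the computed V should satisfy the Bellman equation V = R + γPV up to floating-point tolerance:

>>> torch.allclose(V, R.reshape(-1, 1) + gamma * torch.mm(trans_matrix, V))
True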