
There's more...

Let's experiment with different values for the discount factor. We'll start with 0, which means we only care about the immediate reward:

>>> gamma = 0
>>> V = cal_value_matrix_inversion(gamma, trans_matrix, R)
>>> print("The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([[ 1.],
        [ 0.],
        [-1.]])

This is consistent with the reward function, since we only look at the reward received in the next move.
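To see why, recall that the matrix inversion approach solves the Bellman equation V = R + γ * trans_matrix * V in closed form as V = (I - γ * trans_matrix)⁻¹ R. With γ = 0, the matrix being inverted is just the identity, so V is exactly the reward vector. Here is a minimal sanity check of that special case, assuming trans_matrix and R are the transition matrix and reward tensor defined earlier in the recipe:

>>> import torch
>>> # With gamma = 0, (I - gamma * trans_matrix) is the identity matrix,
>>> # so the value function reduces to the immediate rewards R
>>> I = torch.eye(trans_matrix.shape[0])
>>> V_check = torch.mm(torch.inverse(I - 0 * trans_matrix), R.reshape(-1, 1))
>>> # V_check matches the V computed above (up to floating-point error)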

As the discount factor increases toward 1, future rewards are taken into account. Let's take a look at γ = 0.99:

>>> gamma = 0.99
>>> V = cal_value_matrix_inversion(gamma, trans_matrix, R)
>>> print("The value function under the optimal policy is:\n{}".format(V))
The value function under the optimal policy is:
tensor([[65.8293],
        [64.7194],
        [63.4876]])
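The values are now much larger, because rewards from many future steps are accumulated into each state's value. To see how the estimates evolve between these two extremes, we can sweep several discount factors and compute the value function for each. This is a quick sketch that reuses cal_value_matrix_inversion, trans_matrix, and R from the recipe:

>>> # Sweep a few discount factors and compare the resulting value functions
>>> for gamma in [0, 0.5, 0.9, 0.99]:
...     V = cal_value_matrix_inversion(gamma, trans_matrix, R)
...     print("gamma = {}: V = {}".format(gamma, V.flatten().tolist()))

The higher the discount factor, the more each state's value reflects rewards collected far into the future.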