Performing policy evaluation

We have just developed an MDP and computed the value function of the optimal policy using matrix inversion. We also mentioned the limitation of inverting an m * m matrix when m is large (say, 1,000, 10,000, or 100,000). In this recipe, we will talk about a simpler approach called policy evaluation.

Policy evaluation is an iterative algorithm. It starts with arbitrary value estimates for the policy and iteratively updates them based on the Bellman expectation equation until they converge. In each iteration, the value of a policy, π, for a state, s, is updated as follows:

$$V^{(k+1)}(s) = \sum_{a} \pi(s, a) \sum_{s'} T(s, a, s') \left[ R(s, a) + \gamma V^{(k)}(s') \right]$$

Here, π(s, a) denotes the probability of taking action a in state s under policy π, T(s, a, s') is the transition probability from state s to state s' by taking action a, R(s, a) is the reward received in state s by taking action a, and γ is the discount factor.
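To make the update concrete, here is a minimal sketch in Python with PyTorch of one sweep of this update over all states. The function name (bellman_update) and the tensor shapes are assumptions for illustration, not the book's actual code:

```python
import torch

def bellman_update(V, policy, T, R, gamma):
    """One sweep of the Bellman expectation update over all states.

    V:      [n_states] current value estimates
    policy: [n_states, n_actions], policy[s, a] = pi(s, a)
    T:      [n_states, n_actions, n_states], T[s, a, s'] = transition probability
    R:      [n_states, n_actions], R[s, a] = reward for taking action a in state s
    gamma:  discount factor
    """
    n_states, n_actions = policy.shape
    V_new = torch.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            # expected return of taking action a in state s, then following the current values
            expected_return = torch.sum(T[s, a, :] * (R[s, a] + gamma * V))
            V_new[s] += policy[s, a] * expected_return
    return V_new
```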

There are two ways to terminate the iterative updating process. One is to set a fixed number of iterations, such as 1,000 or 10,000, which can be difficult to tune. The other is to specify a threshold (usually 0.0001, 0.00001, or something similar) and terminate the process once the change in value for every state is smaller than the specified threshold.
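Continuing the sketch above, the threshold-based stopping rule could be wrapped around the single-sweep update like this (policy_evaluation and threshold are likewise assumed names, not the book's code):

```python
import torch

def policy_evaluation(policy, T, R, gamma, threshold=1e-4):
    """Repeat the Bellman expectation sweep until the largest change
    in any state's value falls below the threshold."""
    n_states = policy.shape[0]
    V = torch.zeros(n_states)                 # arbitrary initial values
    while True:
        V_new = bellman_update(V, policy, T, R, gamma)
        max_delta = torch.max(torch.abs(V_new - V))
        V = V_new
        if max_delta < threshold:             # every state changed by less than the threshold
            break
    return V
```

For the fixed-iteration alternative, the while loop would simply be replaced by a for loop over the chosen number of iterations.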

In the next section, we will perform policy evaluation on the study-sleep-game process under the optimal policy and a random policy.
