
How it works...

We have just seen how effective it is to compute the value of a policy using policy evaluation. It is a simple convergent iterative approach in the dynamic programming family, or, to be more specific, approximate dynamic programming. It starts with initial guesses for the values (all zeros in our case) and then iteratively updates them according to the Bellman expectation equation until they converge.
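For reference, the Bellman expectation equation behind each update can be written in standard notation as follows, where \pi(a \mid s) is the policy, P(s' \mid s, a) the transition probability, R the reward, and \gamma the discount factor:

    V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_{k}(s') \right]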

In Step 5, the policy evaluation function does the following tasks (a minimal code sketch follows the list):

  • Initializes the policy values as all zeros.
  • Updates the values based on the Bellman expectation equation.
  • Computes the maximal change of the values across all states.
  • If the maximal change is greater than the threshold, it keeps updating the values. Otherwise, it terminates the evaluation process and returns the latest values.
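The sketch below shows these four tasks in PyTorch. It is not the exact code from the recipe: the function name, the argument names, and the assumption that rewards are given per successor state are all illustrative.

    import torch

    def policy_evaluation(policy, trans_matrix, rewards, gamma, threshold):
        # policy:       [n_states, n_actions] tensor with pi(a|s)
        # trans_matrix: [n_states, n_actions, n_states] tensor with P(s'|s, a)
        # rewards:      [n_states] tensor, reward for landing in each state (an assumption)
        n_states = policy.shape[0]
        V = torch.zeros(n_states)                         # 1. initialize values to zeros
        while True:
            V_new = torch.zeros(n_states)
            for s in range(n_states):
                for a, prob_a in enumerate(policy[s]):
                    # 2. Bellman expectation update for state s
                    V_new[s] += prob_a * torch.sum(
                        trans_matrix[s, a] * (rewards + gamma * V))
            max_delta = torch.max(torch.abs(V_new - V))   # 3. maximal change across states
            V = V_new
            if max_delta <= threshold:                    # 4. stop once below the threshold
                return V

For a deterministic policy, policy[s] is one-hot, so the inner loop contributes a single term per state.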

Since policy evaluation uses iterative approximation, its result might differ slightly from that of the matrix inversion method, which computes the values exactly. In practice, we rarely need the value function to be that precise. Policy evaluation also copes far better with the curse of dimensionality: it scales to problems with thousands or even millions of states, where inverting the matrix becomes impractical. Therefore, we usually prefer policy evaluation over matrix inversion.
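For comparison, this is what the exact matrix inversion approach looks like on a hypothetical two-state MDP; the transition matrix P_pi, reward vector r_pi, and discount factor below are made-up numbers for illustration only:

    import torch

    gamma = 0.5
    P_pi = torch.tensor([[0.7, 0.3],      # transition probabilities under the policy
                         [0.4, 0.6]])
    r_pi = torch.tensor([1.0, 2.0])       # expected rewards under the policy
    # Closed-form solution of V = r_pi + gamma * P_pi @ V
    V_exact = torch.inverse(torch.eye(2) - gamma * P_pi) @ r_pi
    print(V_exact)

As the convergence threshold shrinks, the values returned by iterative policy evaluation approach this exact solution, but the inversion step itself becomes infeasible once the number of states grows large.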

One more thing to remember is that policy evaluation is used to predict how large a return we will get from a given policy; it is not used for control problems.
