
Value-based versus policy-based iteration

We'll be using value-based iteration for the projects in this book. The description of the Bellman equation given previously offers a very high-level understanding of how value-based iteration works. The main difference between the two approaches is that in value-based iteration, the agent learns the expected reward value of each state-action pair, while in policy-based iteration, the agent directly learns the function that maps states to actions.
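To make the value-based side of this distinction concrete, here is a minimal sketch of a tabular update in the style of Q-learning, nudging each state-action value toward a Bellman-style target. The state and action counts, the variable names, and the sample transition are all assumptions made for illustration, not part of any project in this book.

```python
import numpy as np

n_states, n_actions = 6, 4
alpha, gamma = 0.1, 0.99               # learning rate and discount factor (illustrative values)
Q = np.zeros((n_states, n_actions))    # expected reward value of each state-action pair

def value_based_update(state, action, reward, next_state):
    """Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

# Example: one hypothetical transition (s=0, a=2) yielding reward 1.0 and landing in state 3
value_based_update(0, 2, 1.0, 3)
```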

One simple way to describe this difference is that a value-based agent, even when it has mastered its environment, does not maintain an explicit representation of its behavior: it cannot give you an actual function that maps states to actions, only the learned values from which those actions are derived. A policy-based agent, on the other hand, learns exactly that function.
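A small sketch of that contrast, with illustrative names and a uniform placeholder policy: the value-based agent derives its action from the value table at decision time, whereas the policy-based agent's learned object is the state-to-action mapping itself.

```python
import numpy as np

n_states, n_actions = 6, 4
Q = np.zeros((n_states, n_actions))                        # value-based: learned action values
policy = np.full((n_states, n_actions), 1.0 / n_actions)   # policy-based: explicit mapping

def act_value_based(state):
    # No policy is stored; the action is derived from the values on demand.
    return int(np.argmax(Q[state]))

def act_policy_based(state):
    # The policy itself is the learned object; the action is sampled from it.
    return int(np.random.choice(n_actions, p=policy[state]))
```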

Note that Q-learning and SARSA are both value-based algorithms. Because we are working with Q-learning in this book, we will not study policy-based iteration in detail here. The main thing to bear in mind about policy-based iteration is that it lets us learn stochastic policies directly and is better suited to continuous action spaces.
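As a brief, purely illustrative sketch of why that is: a policy can output the parameters of a distribution (here a Gaussian mean) and sample a real-valued action from it, something a finite table of action values cannot express directly. The feature vector, weights, and standard deviation below are assumptions for the example.

```python
import numpy as np

def gaussian_policy(state_features, weights, std=0.5):
    # A stochastic policy over a continuous action space: the state determines
    # the mean of a Gaussian, and the action is sampled from that distribution.
    mean = float(np.dot(state_features, weights))
    return np.random.normal(mean, std)

state_features = np.array([0.2, -1.0, 0.5])
weights = np.array([0.1, 0.3, -0.2])
action = gaussian_policy(state_features, weights)   # a real-valued action
```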
