
The value function for optimality

Agents should be able to reason about both immediate and future rewards. Therefore, a value is assigned to each encountered state that reflects this future information as well. This is called the value function. This is where the concept of delayed rewards comes in: the actions taken now determine the potential rewards the agent will receive in the future.

V(s), the value of state s, is defined as the expected value of the rewards to be received in the future for all the actions taken from this state onward, until the agent reaches the goal state. Basically, the value function tells us how good it is to be in a given state: the higher the value, the better the state.

The reward assigned to each (s, a, s') triple is fixed. This is not the case with the value of a state; it is subject to change with every action in an episode, and across different episodes too.
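As a rough illustration of this distinction, here is a minimal sketch (the states, actions, and numbers are all made up for illustration, not taken from the text): the reward table stays fixed, while the estimate of V(s) drifts as more episodes are observed:

```python
# Sketch: rewards for (s, a, s') are fixed; value estimates are not.
# All state names, actions, and returns below are hypothetical.
fixed_rewards = {
    ("s0", "right", "s1"): 0.0,    # these entries never change
    ("s1", "right", "goal"): 1.0,
}

returns_seen = []                   # discounted returns observed from s0

for episode_return in [1.0, 0.81, 0.9, 1.0]:   # returns from four episodes
    returns_seen.append(episode_return)
    # the estimate of V(s0) moves with every new episode
    value_of_s0 = sum(returns_seen) / len(returns_seen)
    print(f"V(s0) estimate after {len(returns_seen)} episodes: {value_of_s0:.3f}")
```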

One solution comes to mind: instead of a value function, why don't we simply store the knowledge of every possible state?

The answer is simple: it's time-consuming and expensive, and the cost grows exponentially with the number of states. Therefore, it's better to store the knowledge of the current state, that is, V(s):

V(s) = E[R(t+1) + γR(t+2) + γ²R(t+3) + ... | S(t) = s]
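To make this expectation concrete, here is a minimal sketch (the chain environment, its rewards, the random policy, and the discount factor are all assumptions for illustration, not from the text) that estimates V(s) by averaging discounted returns over sampled episodes:

```python
import random

GAMMA = 0.9          # assumed discount factor
GOAL = 3             # states 0..3 on a chain; reaching state 3 ends the episode

def rollout(start_state):
    """Run one episode from start_state with a random policy and
    return the discounted sum of rewards received along the way."""
    state, discounted_return, discount = start_state, 0.0, 1.0
    while state != GOAL:
        step = random.choice([-1, 1])            # random policy
        next_state = max(0, state + step)        # the chain is bounded at 0
        reward = 1.0 if next_state == GOAL else 0.0
        discounted_return += discount * reward
        discount *= GAMMA
        state = next_state
    return discounted_return

def estimate_value(start_state, episodes=5000):
    """Monte Carlo estimate of V(s): the average discounted return."""
    return sum(rollout(start_state) for _ in range(episodes)) / episodes

for s in range(GOAL):
    print(f"V({s}) ≈ {estimate_value(s):.3f}")
```

States closer to the goal receive higher estimates, which matches the intuition above: the higher the value, the better the state.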

More details on the value function will be covered in Chapter 3, The Markov Decision Process and Partially Observable MDP.
