
Python Reinforcement Learning
Sudharsan Ravichandiran, Sean Saito, Rajalingappaa Shanmugamani, Yang Wenzhuo

The Bellman equation and optimality

The Bellman equation, named after the American mathematician Richard Bellman, helps us solve an MDP, and it is omnipresent in RL. Solving the MDP means finding the optimal policies and value functions. There can be many different value functions, one for each policy. The optimal value function V*(s) is the one that yields the maximum value compared to all other value functions:

$$V^*(s) = \max_{\pi} V^{\pi}(s)$$

Similarly, the optimal policy is the one that results in the optimal value function.

Since the optimal value function V*(s) is the one that has a higher value than all other value functions (that is, the maximum return), it is the maximum of the Q function over actions. So, the optimal value function can easily be computed by taking the maximum of the Q function as follows:

$$V^*(s) = \max_{a} Q^*(s, a)$$ --- (3)
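To make equation (3) concrete, here is a minimal Python sketch that recovers the optimal value function from a Q-table. The Q-table and its values are hypothetical, just a NumPy array with one row per state and one column per action:

```python
import numpy as np

# Hypothetical Q-table for a tiny MDP: rows are states, columns are actions.
# The numbers are illustrative, not from any particular environment.
Q = np.array([[0.5, 1.2, 0.3],
              [0.0, 0.8, 1.5]])

# Equation (3): the optimal value of a state is the maximum of the
# Q function over the actions available in that state.
V = Q.max(axis=1)   # V*(s) = max_a Q*(s, a)
print(V)            # [1.2 1.5]
```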

The Bellman equation for the value function can be represented as follows (we will see how this equation is derived in the next topic):

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma V^{\pi}(s') \right]$$

It denotes the recursive relation between the value of a state and the values of its successor states, averaged over all possibilities.
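As a sketch of how this recursion is applied in code, the function below performs one Bellman backup for a single state. The MDP layout here is an assumption: P[s][a] is a list of (prob, next_state, reward) tuples, and pi[s] maps each action to its probability under the policy:

```python
def bellman_backup_v(V, P, pi, s, gamma=0.9):
    """One Bellman backup of the value function at state s:
    sum over a of pi(a|s) * sum over s' of P(s'|s,a) * [R + gamma * V(s')].
    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples."""
    value = 0.0
    for a, action_prob in pi[s].items():
        for prob, s_next, reward in P[s][a]:
            value += action_prob * prob * (reward + gamma * V[s_next])
    return value
```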

Similarly, the Bellman equation for the Q function can be represented as follows:

$$Q^{\pi}(s, a) = \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma V^{\pi}(s') \right]$$ --- (4)
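Equation (4) translates directly into code. The sketch below reuses the assumed MDP layout from the previous snippet, with P[s][a] as a list of (prob, next_state, reward) tuples:

```python
def bellman_backup_q(V, P, s, a, gamma=0.9):
    """Equation (4): Q(s, a) computed from the values of the
    successor states reachable by taking action a in state s."""
    return sum(prob * (reward + gamma * V[s_next])
               for prob, s_next, reward in P[s][a])
```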

Substituting equation (4) in (3), we get:

$$V^*(s) = \max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma V^*(s') \right]$$

The preceding equation is called the Bellman optimality equation. In the upcoming sections, we will see how to find optimal policies by solving this equation.
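To preview how the Bellman optimality equation can be solved, here is a minimal value-iteration sketch that repeatedly applies the optimality backup until the values stop changing. It assumes the same hypothetical MDP layout as the snippets above; the discount gamma and threshold theta are illustrative choices:

```python
def bellman_optimality_backup(V, P, s, gamma=0.9):
    """V*(s) = max_a sum over s' of P(s'|s,a) * [R + gamma * V*(s')].
    P[s][a] is assumed to be a list of (prob, next_state, reward) tuples."""
    return max(
        sum(prob * (reward + gamma * V[s_next])
            for prob, s_next, reward in P[s][a])
        for a in P[s]
    )

def value_iteration(P, num_states, gamma=0.9, theta=1e-6):
    """Sweep over all states, applying the optimality backup,
    until the largest change in any state's value falls below theta."""
    V = [0.0] * num_states
    while True:
        delta = 0.0
        for s in range(num_states):
            v_new = bellman_optimality_backup(V, P, s, gamma)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```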
