
Deriving the Bellman equation for value and Q functions

Now let us see how to derive Bellman equations for value and Q functions.

You can skip this section if you are not interested in mathematics; however, the math will be super intriguing.

First, we define $P_{ss'}^{a}$ as the transition probability of moving from state $s$ to state $s'$ while performing an action $a$:

$$P_{ss'}^{a} = \Pr\left(s_{t+1} = s' \mid s_t = s, a_t = a\right)$$

We define $R_{ss'}^{a}$ as the reward probability, that is, the expected reward received by moving from state $s$ to state $s'$ while performing an action $a$:

$$R_{ss'}^{a} = \mathbb{E}\left[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\right] \quad \text{from (2)} \quad \text{---}(5)$$
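To make these two quantities concrete, here is a minimal sketch, not taken from the book, of how $P_{ss'}^{a}$ and $R_{ss'}^{a}$ could be stored as arrays for a hypothetical two-state, two-action MDP; all of the numbers are made up for illustration:

```python
import numpy as np

# Hypothetical MDP with 2 states and 2 actions (all numbers are illustrative only).
n_states, n_actions = 2, 2

# P[a, s, s2] = P_ss'^a: probability of reaching state s2 from state s under action a.
P = np.array([
    [[0.9, 0.1],    # action 0, from state 0
     [0.2, 0.8]],   # action 0, from state 1
    [[0.5, 0.5],    # action 1, from state 0
     [0.0, 1.0]],   # action 1, from state 1
])

# R[a, s, s2] = R_ss'^a: expected immediate reward for the transition (s, a) -> s2.
R = np.array([
    [[1.0, 0.0],
     [0.0, 2.0]],
    [[0.5, 0.5],
     [0.0, 1.0]],
])

# Each P[a, s, :] is a probability distribution over next states, so it must sum to 1.
assert np.allclose(P.sum(axis=2), 1.0)
```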

We know that the value function can be represented as:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[R_t \mid s_t = s\right] = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+1} \,\Big|\, s_t = s\right] \quad \text{from (1)}$$

We can rewrite our value function by taking the first reward out:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[r_{t+1} + \gamma\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_t = s\right] \quad \text{---}(6)$$
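This step only uses the identity $\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+1} = r_{t+1} + \gamma\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2}$. As a quick numerical sanity check, here is a tiny sketch with a made-up, finite reward sequence (the truncation stands in for the infinite sum):

```python
import numpy as np

gamma = 0.9
# Made-up reward sequence r_{t+1}, r_{t+2}, ... (finite, so it truncates the infinite sum).
rewards = [1.0, 0.0, 2.0, 1.0, 0.5]

# Return computed directly: sum_k gamma^k * r_{t+k+1}
direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Return computed by peeling off the first reward: r_{t+1} + gamma * (return from t+1 onward)
peeled = rewards[0] + gamma * sum(gamma**k * r for k, r in enumerate(rewards[1:]))

assert np.isclose(direct, peeled)
```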

The expectation in the value function specifies the expected return if we are in the state $s$ and choose our actions according to the policy $\pi$.

So, we can rewrite our expectation explicitly by summing over all possible actions $a$ and successor states $s'$ as follows:

$$\mathbb{E}_{\pi}\left[r_{t+1} + \gamma\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_t = s\right] = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\,\mathbb{E}\left[r_{t+1} + \gamma\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_t = s, a_t = a, s_{t+1} = s'\right]$$

In the RHS, the expected first reward $\mathbb{E}\left[r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\right]$ is exactly $R_{ss'}^{a}$, so we substitute it from equation (5) as follows:

$$\sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma\,\mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_t = s, a_t = a, s_{t+1} = s'\right]\right]$$

Similarly, by the Markov property, the rewards from time $t+2$ onward depend only on the successor state $s_{t+1} = s'$ (and on the actions that $\pi$ chooses from there), so the remaining expectation simplifies as follows:

$$\mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_t = s, a_t = a, s_{t+1} = s'\right] = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_{t+1} = s'\right]$$

So, our final expectation equation becomes:

$$\mathbb{E}_{\pi}\left[r_{t+1} + \gamma\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_t = s\right] = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma\,\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_{t+1} = s'\right]\right] \quad \text{---}(7)$$

Now we will substitute our expectation (7) into the value function (6) as follows:

$$V^{\pi}(s) = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma\,\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\Big|\, s_{t+1} = s'\right]\right]$$

Instead of the remaining expectation, $\mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty}\gamma^{k} r_{t+k+2} \,\big|\, s_{t+1} = s'\right]$, we can substitute the value of the next state, $V^{\pi}(s')$, using equation (6), and our final value function looks like the following:

$$V^{\pi}(s) = \sum_{a}\pi(s,a)\sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma\,V^{\pi}(s')\right]$$
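This final equation is exactly what iterative policy evaluation sweeps apply. As a sketch under the same assumptions as the earlier snippet (the hypothetical P and R arrays, plus a made-up stochastic policy), evaluating $V^{\pi}$ could look like this:

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-8):
    """Compute V^pi by repeatedly applying the Bellman equation
    V(s) = sum_a pi(s, a) sum_s' P_ss'^a [R_ss'^a + gamma * V(s')]."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                for s2 in range(n_states):
                    V_new[s] += policy[s, a] * P[a, s, s2] * (R[a, s, s2] + gamma * V[s2])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new

# policy[s, a] = pi(s, a); a made-up stochastic policy over the 2 hypothetical actions.
policy = np.array([[0.5, 0.5],
                   [0.5, 0.5]])

# V = policy_evaluation(P, R, policy)   # P and R as defined in the earlier sketch
```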

In a very similar fashion, we can derive a Bellman equation for the Q function; the final equation is as follows:

$$Q^{\pi}(s,a) = \sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma\sum_{a'}\pi(s',a')\,Q^{\pi}(s',a')\right]$$

Since $V^{\pi}(s') = \sum_{a'}\pi(s',a')\,Q^{\pi}(s',a')$, this is the same as $Q^{\pi}(s,a) = \sum_{s'}P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma\,V^{\pi}(s')\right]$.
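As a small follow-up sketch, again assuming the hypothetical P, R, policy, and the policy_evaluation helper from the previous snippets, $Q^{\pi}$ can be obtained from $V^{\pi}$ with a single application of this equation, and summing Q back through the policy recovers $V^{\pi}$:

```python
import numpy as np

def q_from_v(P, R, V, gamma=0.9):
    """Q^pi(s, a) = sum_s' P_ss'^a [R_ss'^a + gamma * V^pi(s')]."""
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            Q[s, a] = np.sum(P[a, s] * (R[a, s] + gamma * V))
    return Q

# V = policy_evaluation(P, R, policy)
# Q = q_from_v(P, R, V)
# Consistency check: V^pi(s) = sum_a pi(s, a) * Q^pi(s, a)
# assert np.allclose((policy * Q).sum(axis=1), V)
```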

Now that we have Bellman equations for both the value and Q functions, we will see how to find the optimal policies.
