官术网_书友最值得收藏!

  • Python Reinforcement Learning
  • Sudharsan Ravichandiran Sean Saito Rajalingappaa Shanmugamani Yang Wenzhuo
  • 425字
  • 2021-06-24 15:17:34

Policy iteration

Unlike value iteration, in policy iteration we start with the random policy, then we find the value function of that policy; if the value function is not optimal then we find the new improved policy. We repeat this process until we find the optimal policy.

There are two steps in policy iteration:

  1. Policy evaluation: Evaluating the value function of a randomly estimated policy.
  2. Policy improvement: Upon evaluating the value function, if it is not optimal, we find a new improved policy:

The steps involved in the policy iteration are as follows:

  1. First, we initialize some random policy
  2. Then we find the value function for that random policy and evaluate to check if it is optimal which is called policy evaluation
  3. If it is not optimal, we find a new improved policy, which is called policy improvement
  4. We repeat these steps until we find an optimal policy

Let us understand intuitively by performing policy iteration manually step by step.

Consider the same grid example we saw in the section Value iteration. Our goal is to find the optimal policy:

  1. Initialize a random policy function.

Let us initialize a random policy function by specifying random actions to each state:

say A -> 0

       B -> 1

       C -> 0

  1. Find the value function for the randomly initialized policy.

Now we have to find the value function using our randomly initialized policy. Let us say our value function after computation looks like the following:

 

Now that we have a new value function according to our randomly initialized policy, let us compute a new policy using our new value function. How do we do this? It is very similar to what we did in Value iteration. We calculate Q value for our new value function and then take actions for each state which has a maximum value as the new policy.

Let us say the new policy results in:

A - > 0

B - > 1

C -> 1

We check our old policy, that is, the randomly initialized policy, and the new policy. If they are same, then we have attained the convergence, that is, found the optimal policy. If not, we will update our old policy (random policy) as a new policy and repeat from step 2.

Sound confusing? Look at the pseudo code:

policy_iteration():
Initialize random policy
for i in no_of_iterations:
Q_value = value_function(random_policy)
new_policy = Maximum state action pair from Q value
if random_policy == new policy:
break
random_policy = new_policy
return policy
主站蜘蛛池模板: 沂南县| 阜康市| 洞口县| 宁德市| 霍州市| 汉川市| 太保市| 广饶县| 华容县| 顺义区| 曲周县| 宜春市| 扬中市| 新建县| 青铜峡市| 伊春市| 苗栗市| 佛坪县| 和顺县| 古浪县| 滁州市| 城市| 东辽县| 曲阜市| 溧阳市| 大新县| 革吉县| 秭归县| 高陵县| 江油市| 天气| 嘉峪关市| 同德县| 延川县| 靖边县| 崇左市| 涡阳县| 南昌市| 高淳县| 东乡族自治县| 红河县|