- Python Reinforcement Learning
- Sudharsan Ravichandiran Sean Saito Rajalingappaa Shanmugamani Yang Wenzhuo
- 425 words
- 2021-06-24 15:17:34
Policy iteration
Unlike value iteration, policy iteration starts with a random policy. We compute the value function of that policy; if it is not optimal, we derive a new, improved policy from it. We repeat this process until we find the optimal policy.
There are two steps in policy iteration:
- Policy evaluation: computing the value function of the current policy, which initially is a random one.
- Policy improvement: if the evaluated value function is not optimal, finding a new, improved policy.
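The two steps can be written compactly as follows. This is a sketch in standard MDP notation, which is an assumption here and may differ from the notation used elsewhere in the book: $P(s' \mid s, a)$ is the transition probability, $R(s, a, s')$ the reward, $\gamma$ the discount factor, and $\pi(s)$ the action the policy picks in state $s$.

```latex
% Policy evaluation: value of following policy \pi from state s
V^{\pi}(s) = \sum_{s'} P\bigl(s' \mid s, \pi(s)\bigr)
             \Bigl[ R\bigl(s, \pi(s), s'\bigr) + \gamma V^{\pi}(s') \Bigr]

% Policy improvement: act greedily with respect to V^{\pi}
\pi'(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)
          \Bigl[ R(s, a, s') + \gamma V^{\pi}(s') \Bigr]
```

The bracketed sum in the second equation is the Q value of taking action $a$ in state $s$, so improvement amounts to picking the action with the largest Q value in every state.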

The steps involved in the policy iteration are as follows:
- First, we initialize a random policy
- Then we find the value function for that policy and check whether it is optimal; this is called policy evaluation
- If it is not optimal, we find a new improved policy, which is called policy improvement
- We repeat these steps until we find an optimal policy
Let us build intuition by performing policy iteration manually, step by step.
Consider the same grid example we saw in the section Value iteration. Our goal is to find the optimal policy:
- Initialize a random policy function.
Let us initialize a random policy function by assigning a random action to each state, say:
A -> 0
B -> 1
C -> 0
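In code, such a policy can be held as a plain mapping from states to actions. The following is a minimal sketch; the names `STATES`, `ACTIONS`, and `random_policy` are hypothetical, and it assumes every state offers the same two actions, 0 and 1.

```python
import random

STATES = ["A", "B", "C"]
ACTIONS = [0, 1]  # assumption: two actions, 0 and 1, available in every state


def random_policy(states, actions, seed=None):
    """Return a dict that maps each state to a randomly chosen action."""
    rng = random.Random(seed)  # local generator, so results are repeatable
    return {state: rng.choice(actions) for state in states}


policy = random_policy(STATES, ACTIONS, seed=42)
for state, action in policy.items():
    print(f"{state} -> {action}")
```

A dict keeps the comparison between an old and a new policy trivial, since two dicts are equal exactly when they assign the same action to every state.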
- Find the value function for the randomly initialized policy.
Now we have to find the value function of our randomly initialized policy. Let us say we have computed it, so that we now have one value for each state.
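One common way to compute the value function of a fixed policy is to apply the Bellman backup repeatedly until the values stop changing. The sketch below assumes a hypothetical three-state MDP with deterministic transitions; it is not the grid from the Value iteration section, and `TRANSITIONS`, `GAMMA`, and `evaluate_policy` are illustrative names.

```python
# Hypothetical toy MDP, not the book's grid example.
# TRANSITIONS[state][action] = (next_state, reward), all deterministic.
TRANSITIONS = {
    "A": {0: ("A", 0.0), 1: ("B", 0.0)},
    "B": {0: ("A", 0.0), 1: ("C", 1.0)},
    "C": {0: ("C", 0.0), 1: ("C", 0.0)},  # absorbing state, no further reward
}
GAMMA = 0.9  # assumed discount factor


def evaluate_policy(policy, transitions, gamma=GAMMA, tol=1e-8, max_sweeps=10_000):
    """Estimate the value of each state when always following `policy`."""
    values = {state: 0.0 for state in transitions}
    for _ in range(max_sweeps):
        delta = 0.0  # largest change seen in this sweep
        for state in transitions:
            next_state, reward = transitions[state][policy[state]]
            new_value = reward + gamma * values[next_state]
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value
        if delta < tol:  # values have settled
            break
    return values


policy = {"A": 0, "B": 1, "C": 0}
values = evaluate_policy(policy, TRANSITIONS)
print(values)
```

Under this toy model, state A keeps returning to itself without reward, so its value stays at zero, while B collects the single reward on its way into C.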

Now that we have a value function for our randomly initialized policy, let us compute a new policy from it. How do we do this? It is very similar to what we did in Value iteration. We compute the Q value of every state-action pair using this value function, and then, for each state, we pick the action with the maximum Q value. These actions form the new policy.
Let us say the new policy results in:
A -> 0
B -> 1
C -> 1
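The improvement step can be sketched as follows. The MDP is again a hypothetical three-state toy model rather than the book's grid, so the policy it produces differs from the one listed above; `TRANSITIONS`, `GAMMA`, and `improve_policy` are illustrative names, and the input values are assumed to come from evaluating the policy A -> 0, B -> 1, C -> 0 on that toy model.

```python
# Hypothetical toy MDP, not the book's grid example.
# TRANSITIONS[state][action] = (next_state, reward), all deterministic.
TRANSITIONS = {
    "A": {0: ("A", 0.0), 1: ("B", 0.0)},
    "B": {0: ("A", 0.0), 1: ("C", 1.0)},
    "C": {0: ("C", 0.0), 1: ("C", 0.0)},  # absorbing state
}
GAMMA = 0.9  # assumed discount factor


def improve_policy(values, transitions, gamma=GAMMA):
    """Pick, for every state, the action with the highest Q value."""
    new_policy = {}
    for state, actions in transitions.items():
        q_values = {
            action: reward + gamma * values[next_state]
            for action, (next_state, reward) in actions.items()
        }
        # On ties, max() keeps the first action in insertion order.
        new_policy[state] = max(q_values, key=q_values.get)
    return new_policy


# Assumed values of the policy A -> 0, B -> 1, C -> 0 on this toy MDP.
values = {"A": 0.0, "B": 1.0, "C": 0.0}
new_policy = improve_policy(values, TRANSITIONS)
print(new_policy)
```

Here the action for A changes from 0 to 1, because moving toward B is worth more than staying put. The new policy therefore differs from the old one, and another round of evaluation is needed.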
We compare our old policy, that is, the randomly initialized policy, with the new policy. If they are the same, we have attained convergence, that is, we have found the optimal policy. If not, we replace the old policy with the new policy and repeat from step 2 (finding the value function).
Sounds confusing? Look at the pseudocode:
policy_iteration():
    Initialize random_policy
    for i in range(no_of_iterations):
        value = value_function(random_policy)
        Q_value = Q value of every state-action pair, computed from value
        new_policy = action with the maximum Q_value in each state
        if random_policy == new_policy:
            break
        random_policy = new_policy
    return random_policy
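The pseudocode can be turned into a small runnable program. What follows is a minimal sketch under stated assumptions, not the book's implementation: the three-state MDP is hypothetical (deterministic transitions, discount factor 0.9), and all function and variable names are chosen for illustration.

```python
import random

# Hypothetical toy MDP, not the book's grid example.
# TRANSITIONS[state][action] = (next_state, reward), all deterministic.
TRANSITIONS = {
    "A": {0: ("A", 0.0), 1: ("B", 0.0)},
    "B": {0: ("A", 0.0), 1: ("C", 1.0)},
    "C": {0: ("C", 0.0), 1: ("C", 0.0)},  # absorbing state
}
GAMMA = 0.9  # assumed discount factor


def evaluate_policy(policy, transitions, gamma, tol=1e-8, max_sweeps=10_000):
    """Policy evaluation: repeat Bellman backups until the values settle."""
    values = {state: 0.0 for state in transitions}
    for _ in range(max_sweeps):
        delta = 0.0
        for state in transitions:
            next_state, reward = transitions[state][policy[state]]
            new_value = reward + gamma * values[next_state]
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value
        if delta < tol:
            break
    return values


def improve_policy(values, transitions, gamma):
    """Policy improvement: choose the action with the highest Q value."""
    new_policy = {}
    for state, actions in transitions.items():
        q_values = {a: r + gamma * values[ns] for a, (ns, r) in actions.items()}
        new_policy[state] = max(q_values, key=q_values.get)
    return new_policy


def policy_iteration(transitions, gamma=GAMMA, max_iterations=100, seed=0):
    """Alternate evaluation and improvement until the policy stops changing."""
    rng = random.Random(seed)
    policy = {s: rng.choice(list(acts)) for s, acts in transitions.items()}
    for _ in range(max_iterations):
        values = evaluate_policy(policy, transitions, gamma)
        new_policy = improve_policy(values, transitions, gamma)
        if new_policy == policy:  # converged: the policy is greedy w.r.t. itself
            break
        policy = new_policy
    return policy, values


optimal_policy, optimal_values = policy_iteration(TRANSITIONS)
print(optimal_policy)
print(optimal_values)
```

The loop stops as soon as an improvement step returns the policy it started from, which mirrors the equality check in the pseudocode. The `max_iterations` cap is only a safeguard for this sketch.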