- Python Reinforcement Learning
- Sudharsan Ravichandiran Sean Saito Rajalingappaa Shanmugamani Yang Wenzhuo
- 637字
- 2021-06-24 15:17:34
Policy iteration
In policy iteration, first we initialize a random policy. Then we will evaluate the random policies we initialized: are they good or not? But how can we evaluate the policies? We will evaluate our randomly initialized policies by computing value functions for them. If they are not good, then we find a new policy. We repeat this process until we find a good policy.
Now let us see how to solve the frozen lake problem using policy iteration.
Before looking at policy iteration, we will see how to compute a value function, given a policy.
We initialize value_table as zero with the number of states:
value_table = np.zeros(env.nS)
Then, for each state, we get the action from the policy, and we compute the value function according to that action and state as follows:
updated_value_table = np.copy(value_table)
for state in range(env.nS):
action = policy[state]
value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
for trans_prob, next_state, reward_prob, _ in env.P[state][action]])
We break this when the difference between value_table and updated_value_table is less than our threshold:
threshold = 1e-10
if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):
break
Look at the following complete function:
def compute_value_function(policy, gamma=1.0):
value_table = np.zeros(env.nS)
threshold = 1e-10
while True:
updated_value_table = np.copy(value_table)
for state in range(env.nS):
action = policy[state]
value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
for trans_prob, next_state, reward_prob, _ in env.P[state][action]])
if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):
break
return value_table
Now we will see how to perform policy iteration, step by step.
First, we initialize random_policy as zero NumPy array with shape as number of states:
random_policy = np.zeros(env.observation_space.n)
Then, for each iteration, we calculate the new_value_function according to our random policy:
new_value_function = compute_value_function(random_policy, gamma)
We will extract the policy using the calculated new_value_function. The extract_policy function is the same as the one we used in value iteration:
new_policy = extract_policy(new_value_function, gamma)
Then we check whether we have reached convergence, that is, whether we found the optimal policy by comparing random_policy and the new policy. If they are the same, we will break the iteration; otherwise we update random_policy with new_policy:
if (np.all(random_policy == new_policy)):
print ('Policy-Iteration converged at step %d.' %(i+1))
break
random_policy = new_policy
Look at the complete policy_iteration function:
def policy_iteration(env,gamma = 1.0):
random_policy = np.zeros(env.observation_space.n)
no_of_iterations = 200000
gamma = 1.0
for i in range(no_of_iterations):
new_value_function = compute_value_function(random_policy, gamma)
new_policy = extract_policy(new_value_function, gamma)
if (np.all(random_policy == new_policy)):
print ('Policy-Iteration converged at step %d.' %(i+1))
break
random_policy = new_policy
return new_policy
Thus, we can get optimal_policy using policy_iteration:
optimal_policy = policy_iteration(env, gamma = 1.0)
We will get some output, which is the optimal_policy, the actions to be performed in each state:
array([0., 3., 3., 3., 0., 0., 0., 0., 3., 1., 0., 0., 0., 2., 1., 0.])
The complete program is given as follows:
import gym
import numpy as np
env = gym.make('FrozenLake-v0')
def compute_value_function(policy, gamma=1.0):
value_table = np.zeros(env.nS)
threshold = 1e-10
while True:
updated_value_table = np.copy(value_table)
for state in range(env.nS):
action = policy[state]
value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
for trans_prob, next_state, reward_prob, _ in env.P[state][action]])
if (np.sum((np.fabs(updated_value_table - value_table))) <= threshold):
break
return value_table
def extract_policy(value_table, gamma = 1.0):
policy = np.zeros(env.observation_space.n)
for state in range(env.observation_space.n):
Q_table = np.zeros(env.action_space.n)
for action in range(env.action_space.n):
for next_sr in env.P[state][action]:
trans_prob, next_state, reward_prob, _ = next_sr
Q_table[action] += (trans_prob * (reward_prob + gamma * value_table[next_state]))
policy[state] = np.argmax(Q_table)
return policy
def policy_iteration(env,gamma = 1.0):
random_policy = np.zeros(env.observation_space.n)
no_of_iterations = 200000
gamma = 1.0
for i in range(no_of_iterations):
new_value_function = compute_value_function(random_policy, gamma)
new_policy = extract_policy(new_value_function, gamma)
if (np.all(random_policy == new_policy)):
print ('Policy-Iteration converged at step %d.' %(i+1))
break
random_policy = new_policy
return new_policy
print (policy_iteration(env))
Thus, we can derive the optimal policy, which specifies what action to perform in each state, using value and policy iteration to solve the frozen lake problem.
- 同步:秩序如何從混沌中涌現
- 計算機組成原理與接口技術:基于MIPS架構實驗教程(第2版)
- Python絕技:運用Python成為頂級數據工程師
- Hands-On Machine Learning with Microsoft Excel 2019
- 信息系統與數據科學
- 使用GitOps實現Kubernetes的持續部署:模式、流程及工具
- Modern Programming: Object Oriented Programming and Best Practices
- 業務數據分析:五招破解業務難題
- 大數據精準挖掘
- INSTANT Apple iBooks How-to
- Construct 2 Game Development by Example
- Python數據分析與數據化運營
- 改進的群智能算法及其應用
- 數據中臺實戰:手把手教你搭建數據中臺
- Unity for Architectural Visualization