
Policy iteration

In policy iteration, we first initialize a random policy. Then we evaluate the policy we initialized: is it good or not? But how can we evaluate a policy? We evaluate our randomly initialized policy by computing its value function. If the policy is not good, we extract a new policy that is greedy with respect to that value function. We repeat this process until we find a good policy.

Now let us see how to solve the frozen lake problem using policy iteration.

Before looking at policy iteration, we will see how to compute a value function, given a policy. 
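The function we are about to build implements iterative policy evaluation: it repeatedly applies the Bellman expectation backup to every state until the values stop changing. In standard notation (our addition; the book states this only in code), the backup for a state $s$ under policy $\pi$ is:

$$V(s) \leftarrow \sum_{(p,\, s',\, r)} p \left( r + \gamma V(s') \right)$$

where the sum runs over the transitions $(p, s', r)$, that is, probability, next state, and reward, that are possible when taking the action $\pi(s)$ chosen by the policy, and $\gamma$ is the discount factor.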

We initialize value_table as zeros, with one entry per state:

value_table = np.zeros(env.nS)

Then, for each state, we get the action from the policy, and we compute the value function according to that action and state as follows:

updated_value_table = np.copy(value_table)
for state in range(env.nS):
    action = policy[state]
    value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
                              for trans_prob, next_state, reward_prob, _ in env.P[state][action]])
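Here, env.P[state][action] is the transition model that Gym's toy-text environments expose: a list of (transition probability, next state, reward, done) tuples, one tuple per possible outcome of taking that action in that state. For example, inspecting one entry looks something like this (a sketch; the exact tuples depend on the environment's slippery dynamics):

print(env.P[0][1])
# A list such as:
# [(0.333, 0, 0.0, False), (0.333, 4, 0.0, False), (0.333, 1, 0.0, False)]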

We stop iterating when the difference between value_table and updated_value_table is smaller than our threshold:

threshold = 1e-10
if (np.sum(np.fabs(updated_value_table - value_table)) <= threshold):
    break

Look at the following complete function:

def compute_value_function(policy, gamma=1.0):
    value_table = np.zeros(env.nS)
    threshold = 1e-10
    while True:
        updated_value_table = np.copy(value_table)
        # Bellman expectation backup for the action the policy picks in each state
        for state in range(env.nS):
            action = policy[state]
            value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
                                      for trans_prob, next_state, reward_prob, _ in env.P[state][action]])
        # Stop once the value function has converged
        if (np.sum(np.fabs(updated_value_table - value_table)) <= threshold):
            break
    return value_table
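As a quick sanity check (our example, assuming env has already been created as shown in the complete program), we can evaluate the all-zeros policy, which always moves left; its values are low because that policy rarely reaches the goal:

left_policy = np.zeros(env.nS)
print(compute_value_function(left_policy, gamma=1.0))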

Now we will see how to perform policy iteration, step by step. 

First, we initialize random_policy as a zero NumPy array whose length is the number of states:

random_policy = np.zeros(env.observation_space.n)

Then, for each iteration, we calculate the new_value_function according to our random policy:

new_value_function = compute_value_function(random_policy, gamma)

We will extract the policy using the calculated new_value_function. The extract_policy function is the same as the one we used in value iteration:

new_policy = extract_policy(new_value_function, gamma)

Then we check whether we have reached convergence, that is, whether we have found the optimal policy, by comparing random_policy and new_policy. If the improvement step leaves the policy unchanged, the policy is already greedy with respect to its own value function, so it cannot be improved further and we break out of the loop; otherwise, we update random_policy with new_policy:

if (np.all(random_policy == new_policy)):
    print('Policy-Iteration converged at step %d.' % (i+1))
    break
random_policy = new_policy

Look at the complete policy_iteration function:

def policy_iteration(env, gamma=1.0):
    random_policy = np.zeros(env.observation_space.n)
    no_of_iterations = 200000
    for i in range(no_of_iterations):
        # Policy evaluation: compute the value function of the current policy
        new_value_function = compute_value_function(random_policy, gamma)
        # Policy improvement: act greedily with respect to that value function
        new_policy = extract_policy(new_value_function, gamma)
        # Stop when the policy no longer changes
        if (np.all(random_policy == new_policy)):
            print('Policy-Iteration converged at step %d.' % (i+1))
            break
        random_policy = new_policy
    return new_policy

Thus, we can get optimal_policy using policy_iteration:

optimal_policy = policy_iteration(env, gamma = 1.0)

We will get the following output, which is optimal_policy, the action to be performed in each state:

array([0., 3., 3., 3., 0., 0., 0., 0., 3., 1., 0., 0., 0., 2., 1., 0.])
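In FrozenLake, the action indices encode directions: 0 is left, 1 is down, 2 is right, and 3 is up. As a small readability sketch (the action_names list is our addition, not part of the book's code), we can print the policy as direction names:

action_names = ['Left', 'Down', 'Right', 'Up']
print([action_names[int(a)] for a in optimal_policy])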

The complete program is given as follows:

import gym
import numpy as np

env = gym.make('FrozenLake-v0')

def compute_value_function(policy, gamma=1.0):
    value_table = np.zeros(env.nS)
    threshold = 1e-10
    while True:
        updated_value_table = np.copy(value_table)
        for state in range(env.nS):
            action = policy[state]
            value_table[state] = sum([trans_prob * (reward_prob + gamma * updated_value_table[next_state])
                                      for trans_prob, next_state, reward_prob, _ in env.P[state][action]])
        if (np.sum(np.fabs(updated_value_table - value_table)) <= threshold):
            break
    return value_table


def extract_policy(value_table, gamma=1.0):
    policy = np.zeros(env.observation_space.n)
    for state in range(env.observation_space.n):
        # Compute the Q value of every action in this state
        Q_table = np.zeros(env.action_space.n)
        for action in range(env.action_space.n):
            for next_sr in env.P[state][action]:
                trans_prob, next_state, reward_prob, _ = next_sr
                Q_table[action] += (trans_prob * (reward_prob + gamma * value_table[next_state]))
        # Pick the action with the highest Q value
        policy[state] = np.argmax(Q_table)
    return policy


def policy_iteration(env, gamma=1.0):
    random_policy = np.zeros(env.observation_space.n)
    no_of_iterations = 200000
    for i in range(no_of_iterations):
        new_value_function = compute_value_function(random_policy, gamma)
        new_policy = extract_policy(new_value_function, gamma)
        if (np.all(random_policy == new_policy)):
            print('Policy-Iteration converged at step %d.' % (i+1))
            break
        random_policy = new_policy
    return new_policy


print(policy_iteration(env))

Thus, by solving the frozen lake problem with both value iteration and policy iteration, we can derive the optimal policy, which specifies which action to perform in each state.
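As an optional sanity check (our addition, not part of the book's code), we can roll out the learned policy and estimate how often the agent actually reaches the goal. The evaluate_policy helper below is a hypothetical sketch that assumes the classic Gym step API used by FrozenLake-v0:

def evaluate_policy(env, policy, n_episodes=1000):
    # Follow the policy and count how many episodes end at the goal
    successes = 0
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            state, reward, done, _ = env.step(int(policy[state]))
        successes += reward  # reward is 1.0 only when the goal is reached
    return successes / n_episodes

print(evaluate_policy(env, optimal_policy))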
