
There's more...

We can observe that the reward reaches the maximum value within the first 100 episodes. Can we simply stop training once the reward reaches 200, as we did with the random search policy? That might not be a good idea. Remember that in hill climbing the agent keeps making incremental improvements: even after it finds a weight that yields the maximum reward, it can continue searching around that weight for a better one. Here, we define the optimal policy as one that solves the CartPole problem, which, according to the CartPole wiki page, means achieving an average reward of at least 195 over 100 consecutive episodes.

We refine the stopping criterion accordingly:

>>> noise_scale = 0.01
>>> best_total_reward = 0
>>> total_rewards = []
>>> for episode in range(n_episode):
...     weight = best_weight + noise_scale * torch.rand(n_state, n_action)
...     total_reward = run_episode(env, weight)
...     if total_reward >= best_total_reward:
...         best_total_reward = total_reward
...         best_weight = weight
...         noise_scale = max(noise_scale / 2, 1e-4)
...     else:
...         noise_scale = min(noise_scale * 2, 2)
...     print('Episode {}: {}'.format(episode + 1, total_reward))
...     total_rewards.append(total_reward)
...     if episode >= 99 and sum(total_rewards[-100:]) >= 19500:
...         break
...
Episode 1: 9.0
Episode 2: 9.0
Episode 3: 10.0
Episode 4: 10.0
Episode 5: 9.0
...
Episode 133: 200.0
Episode 134: 200.0
Episode 135: 200.0
Episode 136: 200.0
Episode 137: 200.0

At episode 137, the problem is considered solved.
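The adaptive noise-scaling mechanism is the key idea here: shrink the search radius after an improvement, widen it after a failure. The following is a minimal sketch of that mechanism on a hypothetical 1-D problem, where `reward_fn`, the scalar weight, and the peak at 3.0 are illustrative assumptions standing in for running a CartPole episode; only the halving/doubling logic mirrors the recipe above.

```python
import random

def reward_fn(w):
    # Hypothetical stand-in for run_episode: reward peaks at w = 3.0.
    return -(w - 3.0) ** 2

random.seed(0)
noise_scale = 0.01
best_weight = 0.0
best_reward = reward_fn(best_weight)

for episode in range(1000):
    # Perturb the best weight found so far by scaled noise.
    weight = best_weight + noise_scale * random.uniform(-1, 1)
    reward = reward_fn(weight)
    if reward >= best_reward:
        best_reward = reward
        best_weight = weight
        noise_scale = max(noise_scale / 2, 1e-4)  # improvement: narrow the search
    else:
        noise_scale = min(noise_scale * 2, 2)     # failure: widen the search

print('best weight: {:.3f}'.format(best_weight))
```

Because accepted moves halve the noise scale and rejected moves double it, the search radius automatically contracts as the weight converges toward the peak, just as it does around the best CartPole weight in the recipe.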
