- PyTorch 1.x Reinforcement Learning Cookbook
- Yuxi (Hayden) Liu
- 241 words
- 2021-06-24 12:34:42
There's more...
We can observe that the reward reaches the maximum value of 200 within the first 100 episodes. Can we just stop training as soon as the reward reaches 200, as we did with the random search policy? That might not be a good idea. Remember that in hill climbing the agent keeps making improvements: even if it finds a weight that generates the maximum reward, it can still search around this weight for a better point. Here, we define the optimal policy as one that solves the CartPole problem. According to the Gym wiki page for CartPole-v0, "solved" means that the average reward over 100 consecutive episodes is no less than 195.
We refine the stopping criterion accordingly:
>>> noise_scale = 0.01
>>> best_total_reward = 0
>>> total_rewards = []
>>> for episode in range(n_episode):
...     # Perturb the best weight found so far with scaled random noise
...     weight = best_weight + noise_scale * torch.rand(n_state, n_action)
...     total_reward = run_episode(env, weight)
...     if total_reward >= best_total_reward:
...         best_total_reward = total_reward
...         best_weight = weight
...         # Improvement: narrow the search, but not below 1e-4
...         noise_scale = max(noise_scale / 2, 1e-4)
...     else:
...         # No improvement: widen the search, capped at 2
...         noise_scale = min(noise_scale * 2, 2)
...     print('Episode {}: {}'.format(episode + 1, total_reward))
...     total_rewards.append(total_reward)
...     # Stop once the last 100 episodes average at least 195
...     if episode >= 99 and sum(total_rewards[-100:]) >= 19500:
...         break
...
Episode 1: 9.0
Episode 2: 9.0
Episode 3: 10.0
Episode 4: 10.0
Episode 5: 9.0
……
……
Episode 133: 200.0
Episode 134: 200.0
Episode 135: 200.0
Episode 136: 200.0
Episode 137: 200.0
At episode 137, the problem is considered solved.
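The two rules inside the training loop (adapting the noise scale and testing whether the problem is solved) can also be isolated as small, environment-free helpers. The following is only a minimal sketch, not the book's code: the names `update_noise_scale` and `SolvedTracker` are hypothetical, and the defaults (floor `1e-4`, cap `2`, window `100`, threshold `195`) are taken from the loop shown earlier.

```python
from collections import deque


def update_noise_scale(noise_scale, improved, floor=1e-4, cap=2.0):
    """Hypothetical helper mirroring the loop's noise rule.

    Halve the noise after an improvement (never below `floor`);
    double it otherwise (never above `cap`).
    """
    if improved:
        return max(noise_scale / 2, floor)
    return min(noise_scale * 2, cap)


class SolvedTracker:
    """Hypothetical helper: rolling check of the 'solved' criterion.

    The problem counts as solved once the mean reward over the last
    `window` episodes is at least `threshold`.
    """

    def __init__(self, window=100, threshold=195.0):
        self.window = window
        self.threshold = threshold
        # deque with maxlen drops the oldest reward automatically
        self.recent = deque(maxlen=window)

    def add(self, total_reward):
        """Record one episode's reward and report whether we are solved."""
        self.recent.append(total_reward)
        return self.is_solved()

    def is_solved(self):
        # Fewer than `window` episodes can never satisfy the criterion
        if len(self.recent) < self.window:
            return False
        return sum(self.recent) / self.window >= self.threshold


if __name__ == "__main__":
    tracker = SolvedTracker()
    # A single perfect episode is not enough to count as solved
    print(tracker.add(200.0))
```

Keeping the check in a bounded `deque` avoids re-slicing the full reward history every episode, and it makes the point of this section explicit: one episode with a reward of 200 does not satisfy the criterion, only a sustained run of 100 episodes does.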