Reinforcement Learning with TensorFlow
Sayon Dutta
Asynchronous advantage actor-critic
The A3C algorithm was published in June 2016 by a combined team from Google DeepMind and MILA. It is a simpler, lighter framework that uses asynchronous gradient descent to optimize the deep neural network. It is faster and shows good results on a multi-core CPU rather than requiring a GPU. One of A3C's big advantages is that it works on continuous as well as discrete action spaces. As a result, it has opened the gateway to many new challenging problems with complex state and action spaces.
We will discuss it at a high level here and dig deeper in Chapter 6, Asynchronous Methods. Let's start with the name, asynchronous advantage actor-critic (A3C), and unpack it to get a basic overview of the algorithm:
- Asynchronous: Recall that in DQN a single agent, represented by a neural network, interacts with a single environment. A3C instead creates multiple copies of the agent-environment pair so that the agent learns more efficiently. There is one global network and several worker agents; each worker has its own set of network parameters and interacts with its own copy of the environment simultaneously, without touching any other agent's environment. This works better than a single agent because each worker's experience is independent of the others', so the combined experience of all the workers is more diverse (a threaded sketch of this setup appears after this list).
- Actor-critic: Actor-critic combines the benefits of both value iteration and policy iteration. The network estimates both a value function, V(s), and a policy, π(s), for a given state, s: two separate fully-connected layers sit on top of the function approximator network and output the value and the policy of the state, respectively. The agent uses the value estimate, which acts as the critic, to update the policy, that is, the intelligent actor (see the first sketch after this list).
- Advantage: Policy gradient methods use discounted returns to tell the agent whether an action was good or bad. Replacing the return with the advantage not only quantifies how good or bad an action was, but also encourages and discourages actions more effectively (we will discuss this in Chapter 4, Policy Gradients).
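To make the actor-critic and advantage points concrete, here is a minimal sketch, assuming TensorFlow 2 with Keras; STATE_DIM, NUM_ACTIONS, and the layer sizes are illustrative placeholders rather than the book's own values. It shows a shared body with two fully-connected heads, one producing the policy π(s) and one producing the value V(s), and losses in which the advantage R - V(s) scales the policy gradient:

```python
import tensorflow as tf

# Illustrative placeholders, not values from the book's own example.
STATE_DIM, NUM_ACTIONS = 4, 2


class ActorCritic(tf.keras.Model):
    """Shared body with two fully-connected heads: policy logits and V(s)."""
    def __init__(self):
        super().__init__()
        self.shared = tf.keras.layers.Dense(128, activation='relu')
        self.policy_head = tf.keras.layers.Dense(NUM_ACTIONS)    # logits of pi(a|s)
        self.value_head = tf.keras.layers.Dense(1)                # V(s)

    def call(self, states):
        x = self.shared(states)
        return self.policy_head(x), self.value_head(x)


def a3c_losses(model, states, actions, returns):
    """Advantage = discounted return - V(s): it scales the policy gradient
    (actor), and its square is the regression loss for the value head (critic)."""
    logits, values = model(states)
    values = tf.squeeze(values, axis=-1)
    advantages = returns - values                                 # A ~ R - V(s)
    log_probs = tf.nn.log_softmax(logits)
    action_log_probs = tf.reduce_sum(
        tf.one_hot(actions, NUM_ACTIONS) * log_probs, axis=-1)
    # stop_gradient keeps the policy term from training the critic.
    policy_loss = -tf.reduce_mean(action_log_probs * tf.stop_gradient(advantages))
    value_loss = tf.reduce_mean(tf.square(advantages))
    return policy_loss, value_loss
```

The stop_gradient call is the usual way to prevent the advantage from back-propagating into the value head through the policy term, so the critic is trained only by the squared-error loss.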
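The asynchronous part can then be sketched with plain Python threads, reusing the ActorCritic model and a3c_losses helper defined above. Everything here (the ToyEnv stand-in, the rollout length, the loss weighting) is an illustrative assumption, not the book's implementation; Chapter 6 works through the real thing:

```python
import threading

import numpy as np
import tensorflow as tf

# Shared optimizer that applies every worker's gradients to the global network.
optimizer = tf.keras.optimizers.RMSprop(learning_rate=1e-4)


class ToyEnv:
    """Tiny stand-in environment so the sketch is self-contained:
    random states, two actions, occasional random termination."""
    def reset(self):
        return np.random.randn(STATE_DIM).astype(np.float32)

    def step(self, action):
        next_state = np.random.randn(STATE_DIM).astype(np.float32)
        reward = 1.0 if action == 0 else 0.0
        done = np.random.rand() < 0.05
        return next_state, reward, done


def worker(global_net, env, updates=200, gamma=0.99):
    local_net = ActorCritic()
    local_net(tf.zeros((1, STATE_DIM)))                    # build the local variables
    state = env.reset()
    for _ in range(updates):
        local_net.set_weights(global_net.get_weights())    # sync with the global net
        states, actions, rewards, done = [], [], [], False
        for _ in range(5):                                 # short n-step rollout
            logits, _ = local_net(state[None, :])
            action = int(tf.random.categorical(logits, 1)[0, 0])
            next_state, reward, done = env.step(action)
            states.append(state)
            actions.append(action)
            rewards.append(reward)
            state = env.reset() if done else next_state
            if done:
                break
        # n-step discounted returns, bootstrapped from V(s) unless the episode ended.
        if done:
            R = 0.0
        else:
            _, value = local_net(state[None, :])
            R = float(tf.squeeze(value))
        returns = []
        for r in reversed(rewards):
            R = r + gamma * R
            returns.append(R)
        returns.reverse()
        with tf.GradientTape() as tape:
            p_loss, v_loss = a3c_losses(
                local_net,
                np.array(states, dtype=np.float32),
                np.array(actions, dtype=np.int32),
                np.array(returns, dtype=np.float32))
            loss = p_loss + 0.5 * v_loss                   # illustrative weighting
        grads = tape.gradient(loss, local_net.trainable_variables)
        # Gradients computed on the local copy are applied to the *global* network.
        optimizer.apply_gradients(zip(grads, global_net.trainable_variables))


if __name__ == '__main__':
    global_net = ActorCritic()
    global_net(tf.zeros((1, STATE_DIM)))                   # build the global variables
    workers = [threading.Thread(target=worker, args=(global_net, ToyEnv()))
               for _ in range(4)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
```

In practice the workers update the shared parameters without locks, in the Hogwild! style, which is part of why A3C runs well on an ordinary multi-core CPU.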