OpenAI Baselines

So far, we have studied the two different frameworks that allow us to solve reinforcement learning problems (OpenAI Gym and OpenAI Universe). We also studied how to create the "brain" of the agent, known as the policy network, with TensorFlow.

The next step is to train the agent and make it learn how to act optimally, purely through experience. Learning how to train an RL agent is the ultimate goal of this book. We will see how the most advanced methods work and examine all of their internal elements and algorithms. But even before we dig into the details of how these approaches are implemented, we can rely on tools that make the task more straightforward.

OpenAI Baselines is a Python-based tool, built on TensorFlow, that provides a library of high-quality, state-of-the-art implementations of reinforcement learning algorithms. It can be used as an out-of-the-box module, but it can also be customized and expanded. We will be using it to solve a classic control problem and a classic Atari video game by training a custom policy network.

Note

Please make sure you have installed OpenAI Baselines by following the instructions in the preface before moving on.

Proximal Policy Optimization

It is worth providing a high-level idea of what Proximal Policy Optimization (PPO) is. We will keep the description of this state-of-the-art RL algorithm at a high level because, to understand in depth how it works, you will need to become familiar with the topics presented in the following chapters, which will prepare you to study and build other state-of-the-art RL methods by the end of this book.

PPO is a reinforcement learning method that belongs to the policy gradient family. Algorithms in this category aim to optimize the policy directly, instead of first building a value function and then deriving a policy from it. To do so, they instantiate a policy (in our case, in the form of a deep neural network) and compute a gradient that indicates how to move the parameters of the policy function approximator (the weights of our deep neural network, in our case) so as to improve the policy directly. The word "proximal" refers to a specific feature of these methods: in the policy update step, the adjustment of the policy parameters is constrained, preventing the new policy from moving "too far" from the starting one. All of these aspects are handled under the hood by the OpenAI Baselines tool, so they remain transparent to the user. You will learn about them in the upcoming chapters.
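
To make the "proximal" constraint more concrete, the following is a minimal NumPy sketch of PPO's clipped surrogate objective; the function name and arguments are illustrative and are not part of the Baselines API:

import numpy as np

def clipped_surrogate_objective(ratio, advantage, clip_range=0.2):
    # ratio: pi_new(a|s) / pi_old(a|s), the probability ratio between the
    # updated policy and the policy that collected the data
    # advantage: estimate of how much better the chosen action was than average
    # clip_range: epsilon, how far the ratio may move from 1 before clipping
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage
    # Taking the element-wise minimum removes any incentive to push the ratio
    # outside the clipping interval, which is what keeps the update "proximal"
    return np.minimum(unclipped, clipped).mean()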

Note

Please refer to the following paper to learn more about PPO: https://arxiv.org/pdf/1707.06347.pdf.

Command-Line Usage

As stated earlier, OpenAI Baselines allows us to train state-of-the-art RL algorithms easily on OpenAI Gym problems. The following command, for example, trains a PPO agent for 20 million steps on the Pong Gym environment:

python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 \
    --num_timesteps=2e7 --save_path=./models/pong_20M_ppo2 \
    --log_path=./logs/Pong/

It saves the model to the user-defined save path so that it is possible to reload the weights into the policy network and deploy the trained agent in the environment with the following command-line instruction:

python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 \
    --num_timesteps=0 --load_path=./models/pong_20M_ppo2 --play

You can train any of the available methods on any OpenAI Gym environment simply by changing the command-line arguments, without needing to know anything about how they work internally.

Methods in OpenAI Baselines

OpenAI Baselines gives us access to the following RL algorithm implementations:

  • A2C: Advantage Actor-Critic
  • ACER: Actor-Critic with Experience Replay
  • ACKTR: Actor-Critic using Kronecker-factored Trust Region
  • DDPG: Deep Deterministic Policy Gradient
  • DQN: Deep Q-Network
  • GAIL: Generative Adversarial Imitation Learning
  • HER: Hindsight Experience Replay
  • PPO2: Proximal Policy Optimization
  • TRPO: Trust Region Policy Optimization

For the upcoming exercise and activity, we will be using PPO.
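
Switching algorithms is just a matter of changing the --alg argument. The following command is an illustrative example only, assuming the algorithm identifiers match the lowercase names used by Baselines (for example, a2c) and that the CartPole-v0 environment is available in your Gym installation:

python -m baselines.run --alg=a2c --env=CartPole-v0 --num_timesteps=1e5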

Custom Policy Network Architecture

Despite its out-of-the-box usability, OpenAI Baselines can also be customized and expanded. In particular, it is possible to provide the module with a custom definition of the policy network architecture, something we will also make use of in the next two sections of this chapter.

One aspect that needs to be clear is that the custom network is used as an encoder of the environment state or observation. OpenAI Baselines then takes care of creating the final layer, which is in charge of linking the latent space (the space of embeddings) to the proper output layer. The latter is chosen depending on the action space of the selected environment (is it discrete or continuous? How many actions are available?).

First of all, the user needs to import the Baselines register, which allows them to define a custom network and register it under a user-defined name. They can then define the custom deep learning model as a function that builds the desired architecture. In this way, we are able to change the policy network architecture at will, testing different solutions to find the best one for a specific problem. A minimal sketch is shown below, and a practical example will be presented in the exercise in the following section.
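
As an illustration only, the following sketch registers a simple multilayer perceptron encoder under the name custom_mlp; it assumes the register decorator exposed by baselines.common.models and the TensorFlow 1.x layers API, and the layer sizes are arbitrary choices rather than values prescribed here:

import tensorflow as tf
from baselines.common.models import register

@register("custom_mlp")
def custom_mlp(num_layers=2, num_hidden=64, activation=tf.tanh):
    # Returns a function that maps the observation tensor X to a latent
    # vector; Baselines appends the output layer that matches the
    # environment's action space
    def network_fn(X):
        h = tf.layers.flatten(X)
        for i in range(num_layers):
            h = tf.layers.dense(h, num_hidden, activation=activation,
                                name="custom_mlp_fc{}".format(i))
        return h
    return network_fn

Once registered, the custom architecture can be selected from the command line, for example by passing --network=custom_mlp to baselines.run alongside the arguments shown earlier.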

Now, we are ready to train our first RL agent and solve a classic control problem.
