- Machine Learning Algorithms
- Giuseppe Bonaccorso
Reinforcement learning
Even though there is no actual supervisor, reinforcement learning is also based on feedback provided by the environment. However, in this case, the information is more qualitative and doesn't help the agent determine a precise measure of its error. In reinforcement learning, this feedback is usually called a reward (sometimes a negative one is called a penalty), and it's useful for understanding whether a certain action performed in a state is positive or not. The sequence of most useful actions is a policy that the agent has to learn, so that it can always make the best decision in terms of the highest immediate and cumulative reward. In other words, an action can be imperfect on its own, but in terms of the global policy it has to contribute to the highest total reward. This concept is based on the idea that a rational agent always pursues the objectives that can increase its wealth. The ability to look ahead over a distant horizon is a distinguishing mark of advanced agents, while short-sighted ones are often unable to correctly evaluate the consequences of their immediate actions, and so their strategies are always sub-optimal.
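As a minimal sketch of this idea (not taken from the book), the following snippet computes a discounted cumulative reward, where a discount factor gamma controls how far-sighted the agent is; the specific reward sequences are illustrative assumptions:

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards r_t weighted by gamma**t (gamma close to 1 ->
    far-sighted agent; gamma close to 0 -> short-sighted agent)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A sequence whose first action looks imperfect (negative reward)
# but leads to a high total reward...
patient = [-1.0, 0.0, 0.0, 10.0]
# ...versus a greedy sequence that only maximizes the immediate reward.
greedy = [1.0, 0.0, 0.0, 0.0]

print(discounted_return(patient))  # ~8.7 -> better in terms of global policy
print(discounted_return(greedy))   # 1.0
```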
Reinforcement learning is particularly effective when the environment is not completely deterministic, when it's very dynamic, and when it's impossible to have a precise error measure. Over the last few years, many classical algorithms have been applied to deep neural networks to learn the best policy for playing Atari video games and to teach an agent how to associate the right action with an input representing the state (usually a screenshot or a memory dump).
In the following figure, there's a schematic representation of a deep neural network trained to play a famous Atari game. The input consists of one or more subsequent screenshots (this is often enough to capture the temporal dynamics as well). They are processed using different layers (discussed briefly later) to produce an output that represents the policy for a specific state transition. After applying this policy, the game produces feedback (a reward or penalty), and this result is used to refine the output until it becomes stable (so the states are correctly recognized and the suggested action is always the best one) and the total reward exceeds a predefined threshold.

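A hedged sketch of the kind of network the figure describes follows: stacked game frames in, one value per action out. The layer sizes, the 84x84 input resolution, and the use of tf.keras are illustrative assumptions, not the author's exact model:

```python
import tensorflow as tf

n_actions = 4      # assumption: the number of valid joystick actions
frame_stack = 4    # subsequent screenshots, to capture temporal dynamics

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(84, 84, frame_stack)),
    # Convolutional layers extract spatial features from the screenshots
    tf.keras.layers.Conv2D(32, 8, strides=4, activation='relu'),
    tf.keras.layers.Conv2D(64, 4, strides=2, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    # One output per action; the suggested action is the arg-max
    tf.keras.layers.Dense(n_actions)
])

# The reward fed back by the game is used (through an algorithm such as
# Q-learning) to refine these outputs until they become stable.
```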
We're going to discuss some examples of reinforcement learning in the chapter dedicated to introducing deep learning and TensorFlow.