
Reinforcement learning

Even though there's no actual supervisor, reinforcement learning is also based on feedback, in this case provided by the environment. However, the information is more qualitative and doesn't help the agent determine a precise measure of its error. In reinforcement learning, this feedback is usually called a reward (a negative one is sometimes called a penalty), and it's useful for understanding whether a certain action performed in a given state is positive or not. The sequence of most useful actions is a policy that the agent has to learn, so that it can always make the best decision in terms of the highest immediate and cumulative reward. In other words, a single action can be imperfect, as long as, in terms of the global policy, it contributes to the highest total reward. This concept is based on the idea that a rational agent always pursues the objectives that can increase its wealth. The ability to see over a distant horizon is a distinguishing mark of advanced agents, while short-sighted ones are often unable to correctly evaluate the consequences of their immediate actions, so their strategies are almost always suboptimal.
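The trade-off between immediate and cumulative reward can be made concrete with a tiny numerical sketch. The following snippet compares two hypothetical policies by their discounted return (the reward values and the discount factor are illustrative, not taken from any specific environment):

gamma = 0.9  # discount factor: how much future rewards are worth today

# Step-by-step rewards collected by two hypothetical policies
greedy_policy_rewards = [10, 0, 0, 0]     # large immediate reward, then nothing
farsighted_policy_rewards = [1, 2, 4, 8]  # smaller now, larger later

def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma raised to its time step."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

print(discounted_return(greedy_policy_rewards, gamma))      # 10.0
print(discounted_return(farsighted_policy_rewards, gamma))  # ~11.87

Even though the first action of the far-sighted policy is worse, its total discounted reward is higher, which is exactly why a rational agent must evaluate whole trajectories rather than single actions.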

Reinforcement learning is particularly effective when the environment is not completely deterministic, when it's very dynamic, and when it's impossible to have a precise error measure. Over the last few years, many classical algorithms have been applied to deep neural networks to learn the best policy for playing Atari video games and to teach an agent how to associate the right action with an input representing the state (usually a screenshot or a memory dump).

In the following figure, there's a schematic representation of a deep neural network trained to play a famous Atari game. The input consists of one or more subsequent screenshots (this is often enough to capture the temporal dynamics as well). They are processed by different layers (discussed briefly later) to produce an output that represents the policy for a specific state transition. After applying this policy, the game produces feedback (a reward or a penalty), and this result is used to refine the output until it becomes stable (so the states are correctly recognized and the suggested action is always the best one) and the total reward exceeds a predefined threshold.
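As a rough sketch of such a network, assuming TensorFlow/Keras is available, the model below maps a stack of screenshots to one value per possible action. The frame size, layer widths, and the number of actions are all illustrative assumptions, not values prescribed by any particular game or paper:

import tensorflow as tf

n_frames, height, width = 4, 84, 84  # 4 stacked grayscale screenshots (assumed)
n_actions = 6                        # hypothetical size of the action set

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(height, width, n_frames)),
    tf.keras.layers.Conv2D(32, 8, strides=4, activation='relu'),
    tf.keras.layers.Conv2D(64, 4, strides=2, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    # One output per action: the estimated value of taking it in this state
    tf.keras.layers.Dense(n_actions, activation='linear'),
])

During play, the policy picks the action with the highest output for the current state, and the reward returned by the game is what drives the refinement of these estimates over many episodes.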

We're going to discuss some examples of reinforcement learning in the chapter dedicated to introducing deep learning and TensorFlow.
