
Temporal difference learning

TD learning algorithms are based on reducing the difference between estimates made by the agent at successive times. A TD algorithm tries to predict a quantity that depends on the future values of a given signal; the name derives from the differences between predictions at successive time steps, which are used to guide the learning process. The prediction at any time is updated to bring it closer to the prediction of the same quantity at the next time step. In reinforcement learning, TD algorithms are used to predict a measure of the total amount of reward expected in the future.

TD learning combines ideas from MC methods and DP.

MC methods solve reinforcement learning problems by averaging the returns obtained from sampled experience. DP denotes a set of algorithms that compute an optimal policy given a perfect model of the environment in the form of an MDP.

TD algorithms can learn directly from raw experience, without a model of the environment's dynamics (like MC methods). They update estimates based partly on previously learned estimates, without waiting for the final outcome (that is, they bootstrap, like DP). For a fixed policy, they converge if the step size is sufficiently small, or if it decreases over time.
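Concretely, for state-value prediction this takes the form of the standard TD(0) update (standard notation, not taken from this text), with step size $\alpha$ and discount factor $\gamma$:

$$ V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right] $$

The bracketed term is the TD error: the difference between the bootstrapped target $r_{t+1} + \gamma V(s_{t+1})$ and the current estimate $V(s_t)$. The convergence condition just mentioned is the usual stochastic-approximation requirement that the step sizes satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$.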

Consecutive predictions are often correlated with each other, and TD methods rest on this assumption: they try to minimize the error between forecasts at consecutive time steps. To do so, they compute the value function update using the Bellman equation. As already mentioned, the bootstrap technique is used to improve the prediction, thereby reducing the variance of the prediction at each update step.
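A minimal sketch of this update in tabular form might look as follows. The environment interface (a reset() returning an initial state, and a step(action) returning a (next_state, reward, done) triple) and the policy function are illustrative assumptions, not a fixed API:

```python
from collections import defaultdict

def td0_prediction(env, policy, episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate V(s) for a fixed policy with tabular TD(0).

    `env` and `policy` are hypothetical stand-ins: `env.reset()` is
    assumed to return an initial state, and `env.step(action)` a
    (next_state, reward, done) triple; `policy(state)` returns an action.
    """
    V = defaultdict(float)  # value table, every state starts at 0.0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # Bellman-style bootstrapped target: immediate reward plus
            # the discounted estimate of the next state's value
            # (zero when the episode has terminated).
            target = reward + gamma * V[next_state] * (not done)
            # Move V(state) a fraction alpha toward the target;
            # the difference is the TD error.
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```

Note that each estimate V[state] is updated as soon as the next transition is observed, without waiting for the episode to finish, which is exactly the bootstrapping behavior described above.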

The different types of TD algorithms can be distinguished by how actions are selected. In on-policy TD methods, the update is based on the results of actions determined by the current policy; in off-policy methods, different policies can be evaluated through hypothetical actions that are not actually taken. Unlike on-policy methods, off-policy methods can separate the problem of exploration from that of control, learning a target policy that is not necessarily the one followed during the learning phase.
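To make the distinction concrete, here is a minimal sketch of the two kinds of update targets for an action-value table Q; it anticipates the on-policy (SARSA) and off-policy (Q-learning) algorithms listed below, and the dictionary-style Q table and variable names are illustrative assumptions:

```python
def on_policy_target(Q, reward, next_state, next_action, gamma=0.99):
    # SARSA-style target: bootstrap on the action the behavior policy
    # actually selected in the next state.
    return reward + gamma * Q[(next_state, next_action)]

def off_policy_target(Q, reward, next_state, actions, gamma=0.99):
    # Q-learning-style target: bootstrap on the greedy (maximizing)
    # action, regardless of what the agent will actually do next.
    return reward + gamma * max(Q[(next_state, a)] for a in actions)
```

Both targets are then used in the same incremental update, Q[(state, action)] += alpha * (target - Q[(state, action)]); the only difference is which action's value the bootstrap relies on.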

The most widely used TD learning algorithms are the following:

  • SARSA
  • Q-learning
  • Deep Q-learning

In the following sections, we will analyze the main characteristics of these algorithms and the substantial differences between them.
