
Understanding Q-learning

Q-learning is an off-policy algorithm that was first proposed by Christopher Watkins in 1989, and is a widely used RL algorithm. Like SARSA, Q-learning maintains an estimate of the state-action value function for each state-action pair, and recursively updates it using the Bellman equation of dynamic programming as new experiences are collected. Note that it is an off-policy algorithm because the update uses the state-action value function evaluated at the action that maximizes the value at the next state, rather than the action actually taken by the behavior policy. Q-learning is used for problems where the actions are discrete – for example, if the available actions are move north, move south, move east, and move west, and we have to decide the optimal action in a given state, then Q-learning is applicable in such settings.

In the classical Q-learning approach, the update is given as follows, where the max is performed over actions, that is, we choose the action a corresponding to the maximum value of Q at state st+1:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]

Here, α is the learning rate, a hyper-parameter that the user can specify, and γ is the discount factor.
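To make the update concrete, here is a minimal tabular sketch in Python. The grid size, action names, and hyper-parameter values are illustrative assumptions, not taken from the text; the point is that the target uses the max over actions at the next state, independently of the action the behavior policy actually selects there:

```python
import numpy as np

# Illustrative problem setup (assumed values, not from the text)
n_states = 16                        # e.g., a small 4 x 4 grid world
actions = ["north", "south", "east", "west"]
n_actions = len(actions)

alpha = 0.1                          # learning rate (hyper-parameter)
gamma = 0.99                         # discount factor
epsilon = 0.1                        # exploration rate of the behavior policy

# State-action value table, initialized to zero
Q = np.zeros((n_states, n_actions))

def choose_action(state):
    """Epsilon-greedy behavior policy over the discrete actions."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(Q[state]))           # exploit

def q_update(state, action, reward, next_state):
    """Classical Q-learning update: the target takes the max over
    actions at next_state (off-policy), then moves Q(state, action)
    toward that target by a step of size alpha."""
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
```

In an episode loop, `choose_action` would pick the action to execute, the environment would return a reward and next state, and `q_update` would then be called on that transition.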

Before we code the algorithms in Python, let's find out what kind of problems will be considered.
