官术网_书友最值得收藏!

Q-learning

Q-learning is an off-policy temporal-difference reinforcement learning algorithm. What a mouthful! But fear not, let’s not worry about what all this means, and instead just see how the algorithm works. To do this, we’ll use the game of chess we introduced in the previous section. As a reminder, the board configuration (the locations of the pieces) is the current state of the environment. Here, the agents can take actions, a, by moving pieces, thus changing the state into a new one. We'll represent a game of chess as a graph where the different board configurations are the graph’s vertices, and the possible moves from each configuration are the edges. To make a move, the agent follows the edge from the current state, s, to a new state, s'. The basic Q-learning algorithm uses Q-table to help the agent decide which moves to make. The Q-table contains one row for each board configuration, while the columns of the table are all possible actions that the agent can take (the moves). A table cell, q(s, a), contains the cumulative expected reward, called Q-value. This is the potential total reward that the agent will receive for the remainder of the game if they take an action, a, from their current state, s. At the beginning, the Q-table is initialized with an arbitrary value. With that knowledge, let’s see how Q-learning works:

Initialize the Q table with some arbitrary value
for each episode:
Observe the initial state s
for each step of the episode:
Select new action a using a policy based on the Q-table
Observe reward r and go to the new state s’
Update q(s, a) in the Q table using the Bellman Equation
until we reach a terminal state for the episode

An episode starts with a random initial state and finishes when we reach the terminal state. In our case, one episode would be one full game of chess.

The question that arises is this: how does the agent's policy determine what will be the next action? To do so, the policy has to take into account the Q-values of all the possible actions from the current state. The higher the Q-value, the more attractive the action is. However, the policy will sometimes ignore the Q-table (exploitation of the existing knowledge) and choose another random action to find higher potential rewards (exploration). In the beginning, the agent will take random actions because the Q-table doesn’t contain much information. As time progresses and the Q-table is gradually filled, the agent will become more informed in interacting with the environment.

We update q(s, a) after each new action, by using Bellman equation. The Bellman equation is beyond the scope of this introduction, but we’ll discuss it in detail in the later chapters. For now, it's enough to know that the updated value, q(s, a), is based on the newly-received reward, r , as well as the maximum possible Q-value, q*(s’, a’)of the new state, s'.

This example was intended to help you understand the basic workings of Q-learning, but you might have noticed an issue with this. We store the combination of all possible board configurations and moves in the Q-table. This would make the table huge and impossible to fit in today’s computer memory. Fortunately, there is a solution for this: we can replace the Q-table with a neural network, which will tell the agent what the optimal action is in each state. In recent years, this development has allowed reinforcement learning algorithms to achieve superhuman performance on tasks such as the game of Go, Dota 2, and Doom. In this book, we’ll discuss how to apply Q-learning and other RL algorithms to some of these tasks.

主站蜘蛛池模板: 东至县| 抚松县| 横峰县| 秦安县| 德州市| 四子王旗| 富阳市| 霍邱县| 吉林市| 延津县| 浏阳市| 来安县| 汾阳市| 万宁市| 牟定县| 元谋县| 靖江市| 大厂| 沙坪坝区| 东兰县| 高尔夫| 普宁市| 苏尼特右旗| 新巴尔虎右旗| 延安市| 顺义区| 望江县| 湟源县| 嘉荫县| 保康县| 盐池县| 孝昌县| 宜兰县| 文登市| 若尔盖县| 陈巴尔虎旗| 萨迦县| 丰顺县| 越西县| 淮安市| 济源市|