
Model

In the previous sections, we discussed how the environment is not fully known to the agent. In other words, the agent usually does not have an idea of how the internal algorithm of the environment looks. The agent thus needs to interact with it to gain information and learn how to maximize its expected cumulative reward. However, it is possible for the agent to have an internal replica, or a model, of the environment. The agent can use the model to predict how the environment would react to some action in a given state. A model of the stock market, for example, is tasked with predicting what the prices will look like in the future. If the model is accurate, the agent can then use its value function to assess how desirable future states look. More formally, a model can be denoted as a function, P(s' | s, a), that predicts the probability of the next state given the current state and an action:

P(s' | s, a) = Pr(S_{t+1} = s' | S_t = s, A_t = a)

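To make this concrete, here is a minimal sketch of a tabular model that estimates P(s' | s, a) from observed transitions. The state and action names are illustrative assumptions, not part of the text:

```python
from collections import defaultdict

class TabularModel:
    """Estimates P(s' | s, a) by counting observed transitions."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, state, action, next_state):
        # Record one observed transition (s, a) -> s'.
        self.counts[(state, action)][next_state] += 1

    def predict(self, state, action):
        # Return the empirical distribution over next states.
        outcomes = self.counts[(state, action)]
        total = sum(outcomes.values())
        if total == 0:
            return {}
        return {s_next: n / total for s_next, n in outcomes.items()}

# Example: after observing a few transitions, query the model.
model = TabularModel()
model.update("s0", "up", "s1")
model.update("s0", "up", "s1")
model.update("s0", "up", "s2")
print(model.predict("s0", "up"))  # roughly {'s1': 0.67, 's2': 0.33}
```

In practice, the model can also be a learned function (for example, a neural network) rather than a lookup table, but the interface is the same: given a state and an action, it returns a distribution over next states.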
In other scenarios, the model of the environment can be used to enumerate possible future states. This is commonly used in turn-based games, such as chess and tic-tac-toe, where the rules and scope of possible actions are clearly defined. Trees are often used to illustrate the possible sequence of actions and states in turn-based games:

Figure 4: A model using its value function to assess possible moves

In the preceding example of the tic-tac-toe game, s' denotes the possible states that taking the action a (represented as the shaded circle) could yield in a given state, s. Moreover, we can calculate the value of each state using the agent's value function. The middle and bottom states would yield a high value since the agent would be one step away from victory, whereas the top state would yield a medium value since the agent needs to prevent the opponent from winning.
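The following sketch shows this idea in code: the model enumerates the successor states of a tic-tac-toe board, and a value function ranks them. The board encoding and the simple value heuristic here are illustrative assumptions, not the book's implementation:

```python
def legal_moves(board):
    # board is a tuple of 9 cells: 'X', 'O', or ' '.
    return [i for i, cell in enumerate(board) if cell == ' ']

def apply_move(board, move, player):
    # The model: given a state and an action, return the next state.
    next_board = list(board)
    next_board[move] = player
    return tuple(next_board)

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def value(board, player):
    # Toy value function: 1.0 for a winning board, 0.5 if the player
    # is one move away from winning, 0.0 otherwise.
    for a, b, c in LINES:
        line = (board[a], board[b], board[c])
        if line.count(player) == 3:
            return 1.0
        if line.count(player) == 2 and line.count(' ') == 1:
            return 0.5
    return 0.0

def best_move(board, player='X'):
    # Enumerate all successor states and pick the highest-valued one.
    scored = [(value(apply_move(board, m, player), player), m)
              for m in legal_moves(board)]
    return max(scored)[1]

board = ('X', 'O', ' ',
         ' ', 'X', 'O',
         ' ', ' ', ' ')
print(best_move(board))  # 8: completes the X diagonal
```

A real agent would use a learned value function and look more than one move ahead, but the loop is the same: use the model to generate candidate next states, then use the value function to decide which of them is most desirable.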

Let's review the terms we have covered so far:

Term | Description | What does it output?
Policy | The algorithm or function that outputs decisions the agent makes | A single decision (deterministic policy) or a vector of probabilities over possible actions (stochastic policy)
Value Function | The function that describes how good or bad a given state is | A scalar value representing the expected cumulative reward
Model | An agent's representation of the environment, which predicts how the environment will react to the agent's actions | The probability of the next state given the current state and an action, or an enumeration of possible states given the rules of the environment
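As a quick recap, here is a minimal sketch of how these three terms can be represented in code, assuming a toy one-dimensional grid world. All names and numbers are illustrative, not a library API:

```python
import random

def deterministic_policy(state):
    # Policy (deterministic): outputs a single action for a given state.
    return "right" if state < 5 else "left"

def stochastic_policy(state):
    # Policy (stochastic): outputs a probability distribution over actions.
    return {"up": 0.1, "down": 0.1, "left": 0.2, "right": 0.6}

def value_function(state):
    # Value function: outputs a scalar estimate of expected cumulative reward.
    return -abs(5 - state)

def model(state, action):
    # Model: outputs P(s' | s, a) as a distribution over next states,
    # here a crude toy dynamic with a 20% chance of staying put.
    intended = state + 1 if action == "right" else state - 1
    return {intended: 0.8, state: 0.2}

# Sampling an action from the stochastic policy and inspecting the rest:
probs = stochastic_policy(3)
action = random.choices(list(probs), weights=probs.values())[0]
print(action, value_function(3), model(3, action))
```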

In the following sections, we will use these concepts to learn about one of the most fundamental frameworks in reinforcement learning: the Markov decision process.
