
Creating an MDP

Building on the Markov chain, an MDP adds an agent and a decision-making process. Let's go ahead and develop an MDP and calculate the value function under the optimal policy.

Besides a set of possible states, S = {s0, s1, ... , sm}, an MDP is defined by a set of actions, A = {a0, a1, ... , an}; a transition model, T(s, a, s'); a reward function, R(s); and a discount factor, γ. The transition matrix, T(s, a, s'), contains the probabilities of taking action a from state s and then landing in state s'. The discount factor, γ, controls the tradeoff between future rewards and immediate ones.
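To make the role of the discount factor concrete, recall the standard definition of the state value as the expected discounted sum of rewards (this is the textbook formula, not a listing from this recipe):

V(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t) \;\middle|\; s_0 = s \right]

A γ close to 0 makes the agent care mostly about immediate rewards, while a γ close to 1 weights distant rewards almost as much as immediate ones.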

To make our MDP slightly more complicated, we extend the study-and-sleep process with one more state, s2 (play games). Let's say we have two actions, a0 (work) and a1 (slack). The 3 * 2 * 3 transition matrix T(s, a, s') is as follows:
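The matrix itself appears as a figure in the original; as a stand-in, the following sketch encodes a 3 * 2 * 3 transition tensor in Python with NumPy. Only the row for taking a1 (slack) from s0 (study), [0.1, 0.6, 0.3], is stated in the next paragraph; every other entry is an illustrative assumption, chosen only so that each row sums to 1.

import numpy as np

# T[s, a, s']: 3 states x 2 actions x 3 next states.
# Only T[s0, a1] = [0.1, 0.6, 0.3] comes from the text; the rest are assumed.
T = np.array([
    [[0.8, 0.1, 0.1],   # s0 study,      a0 work  (assumed)
     [0.1, 0.6, 0.3]],  # s0 study,      a1 slack (from the text)
    [[0.7, 0.2, 0.1],   # s1 sleep,      a0 work  (assumed)
     [0.1, 0.8, 0.1]],  # s1 sleep,      a1 slack (assumed)
    [[0.6, 0.2, 0.2],   # s2 play games, a0 work  (assumed)
     [0.1, 0.4, 0.5]],  # s2 play games, a1 slack (assumed)
])

# Each probability distribution over next states must sum to 1.
assert np.allclose(T.sum(axis=2), 1.0)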

This means, for example, that when taking the a1 (slack) action from state s0 (study), there is a 60% chance of moving to s1 (sleep), perhaps from getting tired, a 30% chance of moving to s2 (play games), perhaps from wanting to relax, and a 10% chance of keeping on studying (a true workaholic, maybe). We define the reward function as [+1, 0, -1] for the three states, to compensate for the hard work. Obviously, the optimal policy in this case is choosing a0 (work) at each step, that is, keeping on studying: no pain, no gain, right? We also choose 0.5 as the discount factor to begin with. In the next section, we will compute the state-value function (also called the value function, just the value for short, or expected utility) under the optimal policy.
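As a quick preview of that computation, and continuing with the assumed tensor T from the sketch above, the always-work policy turns the MDP into a Markov reward process whose values satisfy the Bellman equation V = R + γ T_π V. The closed-form solution below is a sketch under those assumptions, not the book's own listing:

gamma = 0.5                       # discount factor chosen in the text
R = np.array([1.0, 0.0, -1.0])    # rewards for s0 study, s1 sleep, s2 play games

# Under the optimal policy every state chooses a0 (work), so the policy-induced
# transition matrix is the a0 slice of the assumed tensor T defined earlier.
T_pi = T[:, 0, :]                 # shape (3, 3)

# Solve V = R + gamma * T_pi @ V, i.e. (I - gamma * T_pi) V = R
V = np.linalg.solve(np.eye(3) - gamma * T_pi, R)
print(V)                          # state values under the always-work policy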
