- Hands-On Q-Learning with Python
- Nazia Habib
- 385字
- 2021-06-24 15:13:15
MDPs and state-action diagrams
Note that in the Markov chain examples we discussed, only one event can happen in each state to move the system to the next state. There is no list of actions and no decision to make about which action to take. In a random walk, we flip the same fair coin at every step, and each flip gives us a new pair of states that we can potentially enter.
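To make that concrete, here is a minimal sketch of such a random walk in Python. The step size of ±1 and the starting state of 0 are assumptions made just for this example:

```python
import random

def random_walk(steps=10, start=0):
    """Simulate a simple random walk driven by a fair coin."""
    state = start
    history = [state]
    for _ in range(steps):
        # The next state depends only on the current state and this flip,
        # never on how we arrived here -- the Markov property.
        state += 1 if random.random() < 0.5 else -1
        history.append(state)
    return history

print(random_walk())
```

Notice that nothing in the loop looks at earlier states: the coin flip alone decides which of the two possible next states we enter.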
An MDP extends a Markov chain with a decision-making agent that chooses which action to take, and with rewards that act as feedback to the agent and shape its behavior. Recall that an MDP doesn't require any knowledge of previous states to decide which action to take from the current state; the current state alone is enough.
Let's go back to the state diagram that we discussed in the last chapter. Notice that a state diagram for an MDP is necessarily more complex than a diagram for a Markov chain. It needs to represent the available actions and rewards, as well as the different states that the system can be in:

[State diagram: an MDP with three states and two actions; the transition probabilities for each action are labeled on the edges]
As we discussed, we have three states and two actions in this environment. Either action can be taken from any state, and the probability of each outcome as a result of taking each action is labeled on the diagram. For example, when we are in state S0 and take action a0, we might end up in S2 or back in S0, each with a 50% probability.
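One way to see the difference from a Markov chain is to write the transition structure down as a mapping from (state, action) pairs to outcome probabilities. In the sketch below, only the (S0, a0) entry reflects the transition described above; every other probability is a placeholder chosen purely to make the example complete:

```python
import random

# transitions[state][action] -> list of (next_state, probability) pairs.
# Only the ('S0', 'a0') entry comes from the example in the text; the
# remaining values are illustrative placeholders.
transitions = {
    'S0': {'a0': [('S0', 0.5), ('S2', 0.5)],
           'a1': [('S2', 1.0)]},
    'S1': {'a0': [('S0', 0.7), ('S1', 0.3)],
           'a1': [('S2', 1.0)]},
    'S2': {'a0': [('S0', 0.4), ('S2', 0.6)],
           'a1': [('S0', 1.0)]},
}

def step(state, action):
    """Sample the next state given the current state and the chosen action."""
    r = random.random()
    cumulative = 0.0
    for next_state, prob in transitions[state][action]:
        cumulative += prob
        if r < cumulative:
            return next_state
    return next_state  # fall back to the last outcome on rounding error

print(step('S0', 'a0'))  # prints 'S0' or 'S2', each with 50% probability
```

With only one action per state (and no rewards), this structure collapses back into an ordinary Markov chain transition table, which is exactly the simplification described next.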
By extension, an MDP that only allows one action from each state and has the same reward for each action (that is, effectively, no rewards at all) will simplify to a Markov chain.
The main takeaway from this discussion of MDPs is that when we assume the role of an agent navigating a stochastic environment, we need to be able to learn lessons and make decisions that can be applied consistently amid the occurrence of random events. We don't always know what the individual outcome of an action will be, but we need to be able to make decisions that will maximize our overall outcome at each step of the process. This is what the algorithms that we develop will work toward achieving.
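As a final sketch of what that looks like in code, suppose we already had a table of learned action values (the numbers below are invented purely for illustration; later chapters build the Q-learning machinery that produces such a table). A greedy policy simply picks the highest-valued action in each state; individual outcomes remain random, but acting on these estimates maximizes the expected return:

```python
# Hypothetical learned action values Q[state][action]; the numbers are
# made up just to illustrate greedy action selection.
Q = {
    'S0': {'a0': 0.8, 'a1': 0.2},
    'S1': {'a0': 0.1, 'a1': 0.9},
    'S2': {'a0': 0.5, 'a1': 0.4},
}

def greedy_action(state):
    """Return the action with the highest estimated value in this state."""
    return max(Q[state], key=Q[state].get)

for s in Q:
    print(s, '->', greedy_action(s))  # S0 -> a0, S1 -> a1, S2 -> a0
```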