
Learning the Markov decision process 

The Markov property is widely used in RL, and it states that the environment's response at time t+1 depends only on the state and action at time t. In other words, the immediate future depends only on the present and not on the past. This property simplifies the math considerably and is central to many fields, including RL and robotics.

Consider a system that transitions from state s_0 to s_1 by taking action a_0 and receiving a reward r_1, then from s_1 to s_2 by taking action a_1, and so on until time t. If the probability of being in a state s' at time t+1 can be written as in the following equation, then the system is said to satisfy the Markov property:
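P(s_{t+1} = s' | s_t, a_t) = P(s_{t+1} = s' | s_0, a_0, s_1, a_1, ..., s_t, a_t)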

Note that the probability of being in state s_{t+1} depends only on s_t and a_t, and not on the earlier history. An environment whose state transition probabilities and reward function take the following form is said to be a Markov Decision Process (MDP):
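P_{ss'}^a = P(s_{t+1} = s' | s_t = s, a_t = a)

R_{ss'}^a = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s']

As a minimal sketch of these definitions (the states, actions, probabilities, and rewards below are invented purely for illustration), an MDP can be represented in Python as a table mapping each (state, action) pair to a distribution over (next state, reward) outcomes; sampling a step uses only the current state and action, which is exactly the Markov property at work:

import random

# Toy MDP: transitions[state][action] is a list of
# (probability, next_state, reward) triples.
transitions = {
    "s0": {"a0": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"a1": [(1.0, "s2", 2.0)]},
    "s2": {},  # terminal state: no actions available
}

def step(state, action):
    # Sample the next state and reward; the outcome depends only on
    # the current (state, action) pair, not on the earlier history.
    outcomes = transitions[state][action]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward

next_state, reward = step("s0", "a0")
print(next_state, reward)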

Let's now define the very foundation of RL: the Bellman equation. This equation provides the basis for iteratively computing value functions.
