
Learning the Markov decision process 

The Markov property is widely used in RL, and it states that the environment's response at time t+1 depends only on the state and action at time t. In other words, the immediate future depends only on the present and not on the past. This property simplifies the math considerably and is central to many fields, including RL and robotics.

Consider a system that transitions from state s_0 to s_1 by taking action a_0 and receiving a reward r_1, then from s_1 to s_2 by taking action a_1, and so on until time t. If the probability of being in a state s' at time t+1 can be written as in the following equation, then the system is said to satisfy the Markov property:
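P(s_{t+1} = s' | s_t, a_t) = P(s_{t+1} = s' | s_0, a_0, s_1, a_1, ..., s_t, a_t)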

Note that the probability of being in state s_{t+1} depends only on s_t and a_t, and not on the earlier history. An environment whose state transition probabilities and reward function take the following form is said to be a Markov Decision Process (MDP):
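P_{ss'}^a = P(s_{t+1} = s' | s_t = s, a_t = a)

R_{ss'}^a = E[r_{t+1} | s_t = s, a_t = a, s_{t+1} = s']

As a minimal sketch of these definitions (the states, actions, probabilities, and rewards below are invented purely for illustration), an MDP can be represented in Python as a table mapping each (state, action) pair to a distribution over (next state, reward) outcomes; sampling a step uses only the current state and action, which is exactly the Markov property at work:

import random

# Toy MDP: transitions[state][action] is a list of
# (probability, next_state, reward) triples.
transitions = {
    "s0": {"a0": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"a1": [(1.0, "s2", 2.0)]},
    "s2": {},  # terminal state: no actions available
}

def step(state, action):
    # Sample the next state and reward; the outcome depends only on
    # the current (state, action) pair, not on the earlier history.
    outcomes = transitions[state][action]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward

next_state, reward = step("s0", "a0")
print(next_state, reward)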

Let's now define the very foundation of RL: the Bellman equation. This equation provides the basis for iteratively computing value functions.
