
Value iteration

In value iteration, we start off with a random value function. Obviously, the random value function might not be optimal, so we look for a new, improved value function iteratively until we find the optimal one. Once we find the optimal value function, we can easily derive an optimal policy from it:

The steps involved in the value iteration are as follows:

  1. First, we initialize a random value function, that is, a random value for each state.
  2. Then, we compute the Q function, Q(s, a), for all state-action pairs.
  3. Then, we update our value function with the maximum value from Q(s, a).
  4. We repeat these steps until the change in the value function is very small (a code sketch of this loop follows the list).
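Before working through the example by hand, here is a minimal sketch of this loop in Python. It assumes the dynamics of the environment are available as P[state][action], a list of (transition probability, next state, reward) tuples; these names, and the convergence threshold, are illustrative choices rather than anything defined in this section.

    import numpy as np

    def value_iteration(P, num_states, num_actions, gamma=1.0, threshold=1e-10):
        # Step 1: initialize the value function; here we simply use zeros.
        value_table = np.zeros(num_states)

        while True:
            # Keep a copy of the previous values to measure the change later.
            updated_value_table = np.copy(value_table)
            for state in range(num_states):
                # Step 2: compute Q(s, a) for every action in this state.
                q_values = []
                for action in range(num_actions):
                    q = sum(prob * (reward + gamma * updated_value_table[next_state])
                            for prob, next_state, reward in P[state][action])
                    q_values.append(q)
                # Step 3: update the value of the state with the maximum Q value.
                value_table[state] = max(q_values)
            # Step 4: stop when the change in the value function is very small.
            if np.max(np.abs(value_table - updated_value_table)) < threshold:
                break
        return value_table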

Let us understand it intuitively by performing value iteration manually, step by step.

Consider the grid shown here. Let us say we are in state A and our goal is to reach state C without visiting state B, and we have two actions: 0 (left/right) and 1 (up/down):

Can you guess what the optimal policy will be here? The optimal policy is the one that tells us to perform action 1 in state A so that we can reach C without visiting B. How can we find this optimal policy? Let us see that now:

Initialize the random value function, that is, random values for all the states. Let us assign 0 to all the states:

Let's calculate the Q value for all state-action pairs.

The Q value tells us the value of an action in each state. First, let us compute the Q value for state A. Recall the equation of the Q function. To calculate it, we need the transition probabilities and rewards. Let us consider the transition probabilities and rewards for state A as follows:

The Q function for state A can be calculated as follows:

Q(s, a) = Σ transition probability * (reward + gamma * value_of_next_state)

where the sum runs over all the possible next states.

Here, gamma is the discount factor; we will consider it as 1.

Q value for state A and action 0:

Q(A,0) = (0.1 * (0 + 0)) + (0.4 * (-1.0 + 0)) + (0.3 * (1.0 + 0))

Q(A,0) = 0 - 0.4 + 0.3 = -0.1

Now we will compute the Q value for state A and action 1:

Q(A,1) = (0.3 * (0 + 0)) + (0.1 * (-2.0 + 0)) + (0.5 * (1.0 + 0))

Q(A,1) = 0 - 0.2 + 0.5 = 0.3
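As a quick sanity check, these two numbers can be reproduced with a few lines of Python. The probabilities and rewards are exactly the ones used in the calculations above, and the next-state values are all 0 because of our initialization:

    gamma = 1.0    # discount factor, as stated above
    v_next = 0.0   # every state value is 0 in the first iteration

    # Each term is: transition probability * (reward + gamma * value of next state)
    q_a0 = 0.1 * (0.0 + gamma * v_next) + 0.4 * (-1.0 + gamma * v_next) + 0.3 * (1.0 + gamma * v_next)
    q_a1 = 0.3 * (0.0 + gamma * v_next) + 0.1 * (-2.0 + gamma * v_next) + 0.5 * (1.0 + gamma * v_next)

    print(round(q_a0, 10))  # -0.1
    print(round(q_a1, 10))  # 0.3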

Now we will update this in the Q table as follows:

Update the value function with the maximum value from Q(s, a).

If you look at the preceding Q function, Q(A,1) has a higher value than Q(A,0), so we will update the value of state A to Q(A,1):
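In other words, with the numbers computed above, the update for state A in this first iteration is:

V(A) = max(Q(A, 0), Q(A, 1)) = max(-0.1, 0.3) = 0.3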

Similarly, we compute the Q values for all state-action pairs and update the value function of each state with the highest Q value among its actions. Our updated value function looks like the following. This is the result of the first iteration:

We repeat these steps for several iterations, that is, we repeat step 2 and step 3 (in each iteration, while calculating the Q values, we use the updated value function instead of the initial randomly initialized value function).

This is the result of the second iteration:

 

This is the result of the third iteration:

 

But when do we stop? We stop when the change in the value function between iterations is small; if you look at iterations two and three, there is not much change in the value function. Given this condition, we stop iterating and consider the result to be the optimal value function.
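In code, this stopping condition is usually written with a small tolerance. Here is a minimal illustration; the tolerance and the value arrays below are made-up placeholders, not numbers taken from the example:

    import numpy as np

    threshold = 1e-10  # an arbitrary small tolerance

    # Hypothetical value functions from two consecutive iterations
    previous_values = np.array([0.3, -0.2, 1.0])
    current_values = np.array([0.3, -0.2, 1.0])

    # Stop when the largest per-state change is below the tolerance
    if np.max(np.abs(current_values - previous_values)) < threshold:
        print("The value function has converged; treat it as optimal.")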

Okay, now that we have found the optimal value function, how can we derive the optimal policy? 

It is very simple. We compute the Q function with our final optimal value function. Let us say our computed Q function looks like the following:

 

From this Q function, we pick the action in each state that has the maximum value. In state A, the maximum value is for action 1, so our optimal policy is to perform action 1 in state A, which lets us reach C without visiting B.
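Here is a hedged sketch of this extraction step, reusing the P[state][action] representation assumed in the earlier sketch; extract_policy is an illustrative helper name, not something defined in this section:

    import numpy as np

    def extract_policy(P, value_table, num_states, num_actions, gamma=1.0):
        # For each state, compute Q(s, a) using the optimal value function
        # and pick the action with the highest Q value.
        policy = np.zeros(num_states, dtype=int)
        for state in range(num_states):
            q_values = [
                sum(prob * (reward + gamma * value_table[next_state])
                    for prob, next_state, reward in P[state][action])
                for action in range(num_actions)
            ]
            policy[state] = int(np.argmax(q_values))
        return policy

Applied to this grid, the entry for state A would come out as action 1, matching the optimal policy described above.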
