
How it works...

We have just seen how effective it is to compute the value of a policy using policy evaluation. It is a simple convergent iterative approach in the dynamic programming family, or, to be more specific, approximate dynamic programming. It starts with initial guesses for the values (all zeros in our case) and then iteratively updates them according to the Bellman expectation equation until they converge.
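For reference, the Bellman expectation equation behind each update can be written in standard notation as follows, where \pi(a \mid s) is the policy, P(s' \mid s, a) the transition probability, R the reward, and \gamma the discount factor:

    V_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_{k}(s') \right]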

In Step 5, the policy evaluation function does the following tasks (a minimal code sketch follows the list):

  • Initializes the policy values as all zeros.
  • Updates the values based on the Bellman expectation equation.
  • Computes the maximal change of the values across all states.
  • If the maximal change is greater than the threshold, it keeps updating the values. Otherwise, it terminates the evaluation process and returns the latest values.
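The sketch below shows these four tasks in PyTorch. It is not the exact code from the recipe: the function name, the argument names, and the assumption that rewards are given per successor state are all illustrative.

    import torch

    def policy_evaluation(policy, trans_matrix, rewards, gamma, threshold):
        # policy:       [n_states, n_actions] tensor with pi(a|s)
        # trans_matrix: [n_states, n_actions, n_states] tensor with P(s'|s, a)
        # rewards:      [n_states] tensor, reward for landing in each state (an assumption)
        n_states = policy.shape[0]
        V = torch.zeros(n_states)                         # 1. initialize values to zeros
        while True:
            V_new = torch.zeros(n_states)
            for s in range(n_states):
                for a, prob_a in enumerate(policy[s]):
                    # 2. Bellman expectation update for state s
                    V_new[s] += prob_a * torch.sum(
                        trans_matrix[s, a] * (rewards + gamma * V))
            max_delta = torch.max(torch.abs(V_new - V))   # 3. maximal change across states
            V = V_new
            if max_delta <= threshold:                    # 4. stop once below the threshold
                return V

For a deterministic policy, policy[s] is one-hot, so the inner loop contributes a single term per state.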

Since policy evaluation uses iterative approximation, its result might differ slightly from that of the matrix inversion method, which computes the values exactly. In practice, we rarely need the value function to be that precise. Policy evaluation also copes far better with the curse of dimensionality: it scales to problems with thousands or even millions of states, where inverting the matrix becomes impractical. Therefore, we usually prefer policy evaluation over matrix inversion.
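For comparison, this is what the exact matrix inversion approach looks like on a hypothetical two-state MDP; the transition matrix P_pi, reward vector r_pi, and discount factor below are made-up numbers for illustration only:

    import torch

    gamma = 0.5
    P_pi = torch.tensor([[0.7, 0.3],      # transition probabilities under the policy
                         [0.4, 0.6]])
    r_pi = torch.tensor([1.0, 2.0])       # expected rewards under the policy
    # Closed-form solution of V = r_pi + gamma * P_pi @ V
    V_exact = torch.inverse(torch.eye(2) - gamma * P_pi) @ r_pi
    print(V_exact)

As the convergence threshold shrinks, the values returned by iterative policy evaluation approach this exact solution, but the inversion step itself becomes infeasible once the number of states grows large.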

One more thing to remember is that policy evaluation is used to predict how large a return we will get from a given policy; it is not used for control problems.
