Theoretical background of the cross-entropy method

This section is optional and is included for readers who are interested in why the method works. If you wish, you can refer to the original paper on cross-entropy, which is referenced at the end of the section.

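The basis of the cross-entropy method lies in the importance sampling theorem, which states this:

$$\mathbb{E}_{x \sim p(x)}[H(x)] = \int_x p(x) H(x)\,dx = \int_x q(x) \frac{p(x)}{q(x)} H(x)\,dx = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)} H(x)\right]$$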
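
As a rough numerical illustration of importance sampling, the sketch below estimates the expectation of H(x) under p(x) using samples drawn from a different distribution q(x), with each sample reweighted by p(x)/q(x). The Gaussian choices for p and q, the function H(x) = x^2, and all function names are assumptions made only for this example.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling_estimate(h, n_samples=100_000, seed=0):
    """Estimate E_{x~p}[h(x)] from samples of q, weighted by p(x)/q(x).

    Assumed for illustration: p = N(0, 1) and q = N(1, 2^2).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(1.0, 2.0)  # x ~ q
        weight = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 1.0, 2.0)  # p(x)/q(x)
        total += weight * h(x)
    return total / n_samples


# For p = N(0, 1), the exact value of E[x^2] is 1.
estimate = importance_sampling_estimate(lambda x: x ** 2)
print(f"importance sampling estimate of E_p[x^2]: {estimate:.3f}")
```

The estimate is only as good as the overlap between q and p: q must put probability mass wherever p(x)H(x) is non-negligible, otherwise the weights become large and the estimate noisy.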

In our RL case, H(x) is a reward value obtained by some policy x, and p(x) is a distribution over all possible policies. We don't want to maximize our reward by searching all possible policies; instead, we want to find a way to approximate p(x)H(x) by q(x), iteratively minimizing the distance between them. The distance between two probability distributions is calculated by the Kullback-Leibler (KL) divergence, which is as follows:

$$KL(p_1(x) \,\|\, p_2(x)) = \mathbb{E}_{x \sim p_1(x)}\left[\log \frac{p_1(x)}{p_2(x)}\right] = \mathbb{E}_{x \sim p_1(x)}[\log p_1(x)] - \mathbb{E}_{x \sim p_1(x)}[\log p_2(x)]$$

The first term in KL is called entropy (here taken with a negative sign) and doesn't depend on p_2(x), so it can be omitted during the minimization. The second term, including its minus sign, is called cross-entropy and is a very common optimization objective in DL.
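
To make the decomposition concrete, the following sketch checks on small discrete distributions that the KL divergence equals the cross-entropy of the two distributions minus the entropy of the first one. It also checks that, because the entropy term is fixed, the candidate that minimizes the cross-entropy is the same one that minimizes KL. The distributions and function names are made up for illustration.

```python
import math


def entropy(p1):
    """Shannon entropy: -sum p1(x) log p1(x)."""
    return -sum(p * math.log(p) for p in p1 if p > 0)


def cross_entropy(p1, p2):
    """Cross-entropy: -sum p1(x) log p2(x)."""
    return -sum(p * math.log(q) for p, q in zip(p1, p2) if p > 0)


def kl_divergence(p1, p2):
    """KL(p1 || p2) = sum p1(x) log(p1(x) / p2(x))."""
    return sum(p * math.log(p / q) for p, q in zip(p1, p2) if p > 0)


# Hypothetical target distribution and candidate approximations.
target = [0.7, 0.2, 0.1]
candidates = [
    [1 / 3, 1 / 3, 1 / 3],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]

for cand in candidates:
    kl = kl_divergence(target, cand)
    ce = cross_entropy(target, cand)
    # KL differs from cross-entropy only by the constant entropy(target).
    print(f"KL={kl:.4f}  cross-entropy={ce:.4f}  difference={ce - kl:.4f}")

best_by_kl = min(candidates, key=lambda c: kl_divergence(target, c))
best_by_ce = min(candidates, key=lambda c: cross_entropy(target, c))
```

The printed difference is the same for every candidate, which is why dropping the entropy term does not change the result of the minimization.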

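Combining both formulas, we can get an iterative algorithm, which starts with q_0(x) = p(x) and improves on every step. Each step yields a better approximation of p(x)H(x), with the update:

$$q_{i+1}(x) = \underset{q_{i+1}(x)}{\arg\min}\; -\mathbb{E}_{x \sim q_i(x)}\left[\frac{p(x)}{q_i(x)} H(x) \log q_{i+1}(x)\right]$$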

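This is a generic cross-entropy method, which can be significantly simplified in our RL case. First, we replace our H(x) with an indicator function, which is 1 when the reward for the episode is above the threshold and 0 when the reward is below it. Our policy update will look like this:

$$\pi_{i+1}(a|s) = \underset{\pi_{i+1}}{\arg\min}\; -\mathbb{E}_{z \sim \pi_i(a|s)}\big[R(z) \geq \psi_i\big] \log \pi_{i+1}(a|s)$$

Here, z is an episode sampled with the current policy, R(z) is its total reward, and \psi_i is the reward threshold on iteration i.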

Strictly speaking, the preceding formula misses the normalization term, but it still works in practice without it. So, the method is quite clear: we sample episodes using our current policy (starting with some random initial policy) and minimize the negative log likelihood of the most successful samples under our policy.
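
The loop described above can be sketched on a toy problem. Everything in this example is an assumption chosen for brevity: a single-state environment with three actions and fixed rewards, a categorical policy, a 70th-percentile reward threshold, and add-one smoothing. For a categorical policy, the negative log likelihood of the elite actions is minimized by their empirical frequencies, so no gradient descent is needed here; with a neural network policy, the same step would be a cross-entropy loss minimized by an optimizer.

```python
import random

# Hypothetical toy environment: one state, reward depends only on the action.
ACTION_REWARDS = [0.0, 0.5, 1.0]
EPISODE_LEN = 5


def sample_episode(policy, rng):
    """Play one episode with a categorical policy; return (actions, total reward)."""
    actions = rng.choices(range(len(policy)), weights=policy, k=EPISODE_LEN)
    return actions, sum(ACTION_REWARDS[a] for a in actions)


def elite_threshold(rewards, percentile):
    """Reward value at the given percentile of the batch."""
    ordered = sorted(rewards)
    index = min(int(len(ordered) * percentile / 100), len(ordered) - 1)
    return ordered[index]


def fit_policy(elite_actions, n_actions, smoothing=1.0):
    """Maximum-likelihood categorical fit to the elite actions.

    Add-one smoothing (an assumption) keeps every probability nonzero.
    """
    counts = [smoothing] * n_actions
    for action in elite_actions:
        counts[action] += 1
    total = sum(counts)
    return [c / total for c in counts]


def cross_entropy_method(n_iterations=20, batch_size=100, percentile=70, seed=0):
    rng = random.Random(seed)
    policy = [1 / 3, 1 / 3, 1 / 3]  # uniform initial policy
    for _ in range(n_iterations):
        batch = [sample_episode(policy, rng) for _ in range(batch_size)]
        threshold = elite_threshold([reward for _, reward in batch], percentile)
        # Indicator function: keep only episodes with reward >= threshold.
        elite_actions = [
            a for actions, reward in batch if reward >= threshold for a in actions
        ]
        policy = fit_policy(elite_actions, len(policy))
    return policy


final_policy = cross_entropy_method()
print([round(p, 3) for p in final_policy])
```

In this toy setting, the policy should concentrate almost all of its probability on the highest-reward action after a few iterations.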

There is a whole book dedicated to this method, written by Dirk P. Kroese. A shorter description can be found in the Cross-Entropy Method paper by Dirk P. Kroese (https://people.smp.uq.edu.au/DirkKroese/ps/eormsCE.pdf).
