
Bandit optimization strategies

One strategy for deciding which arm to pull at this point is a simple greedy strategy; that is, pull the arm that currently looks best and don't worry about any of the other arms. This is, of course, a naive approach, because although arm 4 might currently look like the best arm to pull, we don't know yet whether any of the other arms will be better than 4.
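As a rough illustration (the five-arm setup and the function names here are assumptions for the sketch, not the chapter's own code), a purely greedy agent might track a running mean reward per arm and always pull the arm with the highest current estimate:

```python
import numpy as np

# A minimal sketch of the greedy strategy, assuming we keep a running mean
# reward (Q-value estimate) for each arm; the five-arm setup is illustrative.
q_estimates = np.zeros(5)   # estimated mean reward per arm
pull_counts = np.zeros(5)   # how many times each arm has been pulled

def choose_arm_greedy():
    # Always pull the arm with the highest current estimate,
    # ignoring every other arm entirely.
    return int(np.argmax(q_estimates))

def update_estimate(arm, reward):
    # Incrementally update the running mean reward for the pulled arm.
    pull_counts[arm] += 1
    q_estimates[arm] += (reward - q_estimates[arm]) / pull_counts[arm]
```

Because `choose_arm_greedy` never deviates from the current best estimate, an arm that looked good early on can keep being chosen even if another arm would pay out more.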

An even worse outcome might be that, since arm 2 was the first arm we pulled that gave us a reward, we decide to keep pulling it forever and never discover a potentially better arm such as 4. In optimization problems, we call this getting stuck on a local maximum, as illustrated in the following diagram.

When we are stuck on a local maximum, all of the reward values we see near us are less than our current reward value, so from our limited information state we think that we have reached an optimal path and should stay there and continue to collect rewards. From where we are, we can't see the global maximum that could give us more rewards if we were to seek it out and locate it:

Getting stuck on a local maximum happens when an agent continually takes stock of its available actions and chooses the same one over and over because it happens to have the current highest value. Because it cannot reach any states, from where it is, that would give it enough information to increase the value of the other actions, it can never discover any values higher than the current local maximum unless it branches out and takes other actions.

There are other, better strategies than the greedy strategy that we will use, starting with the epsilon-greedy strategy, which we will discuss in depth at the beginning of Chapter 7, Decoupling Exploration and Exploitation in Multi-Armed Bandits, on solving bandit problems. Let's cover it briefly here.

Remember that epsilon is the probability that we will take an action other than the current highest Q-valued action. Using the epsilon-greedy strategy with epsilon = 0.1, an agent will choose the current highest Q-valued machine with a probability of 0.9 (1 - epsilon) and one of the other machines with a probability of 0.1 (epsilon).
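A minimal sketch of this selection rule might look like the following (the function name is an assumption, and in this variant the exploratory pick is uniform over all arms, so it can occasionally land on the current best arm as well):

```python
import numpy as np

def choose_arm_epsilon_greedy(q_estimates, epsilon=0.1):
    # With probability epsilon, explore: pick an arm uniformly at random
    # (in this sketch the random pick can also land on the current best arm).
    if np.random.random() < epsilon:
        return int(np.random.randint(len(q_estimates)))
    # Otherwise exploit: pick the arm with the highest current Q-value estimate.
    return int(np.argmax(q_estimates))
```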

We can set epsilon to any value we choose, and we can decay it over time as we conduct repeated trials. As we'll see in our example implementations, epsilon-greedy strategies suffer when the agent gets stuck exploring a suboptimal path before it gets a chance to explore all paths and find the best one. Our goal will be to find a strategy that efficiently gets us to an optimal explore-exploit balance and brings us as quickly as possible to the highest-yielding actions (that is, exploitation paths).
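To make the idea of decaying epsilon concrete, here is a small self-contained sketch under assumed values (the payout rates, decay rate, and floor are made-up examples, not results from the chapter): the agent starts out fully exploratory and gradually shifts toward exploitation as its estimates improve.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.3, 0.9, 0.1])  # hidden payout rates (made up)
q_estimates = np.zeros(len(true_means))
pull_counts = np.zeros(len(true_means))

epsilon = 1.0        # start fully exploratory
epsilon_min = 0.01   # never stop exploring entirely
decay_rate = 0.995   # shrink epsilon slightly after every trial

for trial in range(2000):
    # Epsilon-greedy choice using the current (decayed) epsilon.
    if rng.random() < epsilon:
        arm = int(rng.integers(len(q_estimates)))
    else:
        arm = int(np.argmax(q_estimates))
    reward = float(rng.random() < true_means[arm])  # Bernoulli payout
    pull_counts[arm] += 1
    q_estimates[arm] += (reward - q_estimates[arm]) / pull_counts[arm]
    epsilon = max(epsilon_min, epsilon * decay_rate)

print(q_estimates)  # the estimates should concentrate around the best arm
```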
