Setting up a bandit problem
A straightforward multi-armed bandit problem (MABP) involves encountering a slot machine with n arms (or, equivalently, a row of n one-armed machines). We have a fixed amount of money to put into these machines, and we want to maximize our payout. Each time we play, we record which machines paid out, gradually building up an estimate of each machine's payout probability.
When we start playing, we don't know which arms pay out more than others; the only way to find out is to play each one and observe, over many rounds, how often it pays out. What strategy should we use to decide which arms to pull, when to pull them, and when to prioritize one arm over another?
For simplicity, let's assume that each time you pull an arm, you get a reward of either $1 or $0 (a bandit with a payout of either 1 or 0 is called a Bernoulli bandit). With a particular 4-armed bandit, we might get the following results:
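As a minimal sketch, a single Bernoulli arm can be modelled as a function that returns a reward of 1 with some fixed, hidden probability and 0 otherwise (the payout probabilities below are assumed purely for illustration):

```python
import random

def pull(p):
    """Pull a Bernoulli arm that pays out 1 with hidden probability p, else 0."""
    return 1 if random.random() < p else 0

# Hypothetical hidden payout probabilities for a 4-armed bandit --
# the player never observes these directly.
true_probs = [0.3, 0.4, 0.2, 0.8]

# One round of pulling each arm once.
rewards = [pull(p) for p in true_probs]
print(rewards)  # e.g. [0, 1, 0, 1]
```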

In this example trial, we start out in exploration mode (meaning that we have a high epsilon value), so we try each of the four arms once. We get no reward from arms 1 and 3, but we do get rewards from 2 and 4. We pull 2 again hoping for another reward, but we don't get one. We pull 4 again and get a reward; now it looks like 4 is a good arm, so we pull it again and get another reward.
By the end of the trial, we have the following results:
- Arm 1: 2 pulls, 1 win, and 50% success
- Arm 2: 3 pulls, 1 win, and 33% success
- Arm 3: 1 pull, 0 wins, and 0% success
- Arm 4: 3 pulls, 3 wins, and 100% success
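The bookkeeping behind these figures is just a pair of counters per arm; the numbers below are the pull and win counts from the list above:

```python
# Pull and win counts from the trial above (index 0 corresponds to arm 1).
pulls = [2, 3, 1, 3]
wins = [1, 1, 0, 3]

# Empirical success rate for each arm; an unpulled arm would need a default value.
success_rates = [w / n if n > 0 else 0.0 for w, n in zip(wins, pulls)]

for arm, rate in enumerate(success_rates, start=1):
    print(f"Arm {arm}: {pulls[arm - 1]} pulls, {wins[arm - 1]} wins, {rate:.0%} success")
```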
Based on these results, we now have some reason to believe that 4 is a good arm to pull and that 2 is not especially good. Arm 3 has an observed success rate of 0%; however, because we have only pulled it once, we should pull it more times to gather information before writing it off.
Similarly, arm 1 has an observed success rate of 50%, but since we have only pulled it twice, we probably don't have enough information yet to decide whether it is a good arm to pull. We will need to run the game for more rounds before we can make useful predictions about the arms we don't yet know enough about.
As we continue to play, we keep recording our results, and eventually we accumulate enough observations to form a reliable estimate of each arm's win probability. This becomes the exploitation side of our strategy: we want to play the arms we know are likely to win as often as possible, even while we continue to explore the other arms.
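One common way to balance the two sides is an epsilon-greedy rule: with probability epsilon we explore a random arm, and otherwise we exploit the arm with the best observed success rate. Here is a minimal sketch of that idea, assuming a fixed epsilon and the simple counter-based estimates used above (an illustration, not the chapter's own implementation):

```python
import random

def choose_arm(success_rates, epsilon=0.1):
    """Epsilon-greedy selection: explore a random arm with probability epsilon,
    otherwise exploit the arm with the highest observed success rate."""
    if random.random() < epsilon:
        return random.randrange(len(success_rates))  # explore
    return max(range(len(success_rates)), key=lambda a: success_rates[a])  # exploit

def record_result(pulls, wins, arm, reward):
    """Update the counters for the pulled arm and return fresh success-rate estimates."""
    pulls[arm] += 1
    wins[arm] += reward
    return [w / n if n > 0 else 0.0 for w, n in zip(wins, pulls)]
```

A high epsilon early in the game corresponds to the exploration mode described at the start of the trial; lowering epsilon as the estimates become reliable shifts the strategy toward exploitation.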