TensorFlow Reinforcement Learning Quick Start Guide
Kaushik Balakrishnan
Rewards
In the RL literature, the reward received at time instant t is typically denoted r_t. The total reward earned in an episode is then R = r_1 + r_2 + ... + r_T, where T is the length of the episode (which can be finite or infinite).
RL also uses the concept of discounting: a parameter called the discount factor, typically denoted γ with 0 ≤ γ ≤ 1, weights the reward received k steps into the future by γ^k. Setting γ = 0 makes the agent myopic, aiming only for immediate rewards, while γ = 1 makes the agent so far-sighted that it may procrastinate in accomplishing the final goal. Thus, a value of γ strictly between 0 and 1 is used to ensure that the agent is neither too myopic nor too far-sighted. The agent then prioritizes its actions to maximize the total discounted reward R_t from time instant t, which is given by the following:

R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... = Σ_{k=0}^{∞} γ^k r_{t+k}
Since γ < 1, rewards in the distant future are valued much less than rewards the agent can earn in the immediate future. This encourages the agent not to waste time and to prioritize its actions accordingly. In practice, γ = 0.9-0.99 is typically used in most RL problems.
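To make the discounting concrete, here is a minimal Python sketch (not from the book; the reward values and function names are illustrative) that computes the discounted return for a finite episode and shows how the choice of γ trades off immediate against delayed rewards:

```python
def discounted_return(rewards, gamma):
    """Discounted return from the start of an episode:
    R_0 = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def returns_per_step(rewards, gamma):
    """Discounted return R_t from every time step t, computed backwards
    via the recursion R_t = r_t + gamma * R_{t+1}."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# A short episode in which a large reward arrives only at the end.
rewards = [1.0, 1.0, 1.0, 1.0, 10.0]

print(discounted_return(rewards, gamma=0.0))   # 1.0    -> myopic: only the immediate reward counts
print(discounted_return(rewards, gamma=0.99))  # ~13.55 -> the delayed reward still dominates
print(discounted_return(rewards, gamma=1.0))   # 14.0   -> plain (undiscounted) total reward

print(returns_per_step(rewards, gamma=0.99))   # R_t for every t = 0..4
```

With γ = 0 the agent values only the first reward, whereas with γ = 0.99 the large delayed reward still dominates the return, which is why values in the 0.9-0.99 range are a common default.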