
The Epsilon-Greedy approach

The Epsilon-Greedy approach is a widely used solution to the explore-exploit dilemma. Exploration is about trying new options through experimentation in order to discover their values, while exploitation is about repeating the options already known to be good and refining their value estimates.

The Epsilon-Greedy approach is very simple to understand and easy to implement:

epsilon = 0.05 or 0.1 #any small value between 0 and 1
#epsilon is the probability of exploration

p = random number between 0 and 1

if p < epsilon :
    pull a random action
else:
    pull the current best action

Eventually, after many iterations, the agent discovers the best action at each state, because it gets the chance both to explore new random actions and to exploit, and thereby refine, the actions it already knows.
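Before moving to the full environment, here is a minimal standalone sketch of the selection rule itself, assuming only NumPy and a made-up row of Q-values for a single state with four actions:

import numpy as np

def epsilon_greedy_action(q_values, epsilon=0.1):
    #with probability epsilon, explore by picking a random action
    if np.random.uniform(0, 1) < epsilon:
        return np.random.randint(len(q_values))
    #otherwise exploit the action with the highest current estimate
    return int(np.argmax(q_values))

#illustrative (made-up) Q-values for one state with 4 actions
q_row = np.array([0.2, 0.8, 0.1, 0.5])
print(epsilon_greedy_action(q_row, epsilon=0.1))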

Let's try to implement a basic Q-learning algorithm to make an agent learn how to navigate across this frozen lake of 16 grid cells, from the start to the goal, without falling into a hole:
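The core of the listing is the standard Q-learning update, which the training loop applies at every step: Q(s, a) = Q(s, a) + lr * (r + y * max over a' of Q(s', a') - Q(s, a)), where lr is the learning rate, y is the discount factor, s' is the next state reached, and r is the (shaped) reward received for the transition.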

# importing dependency libraries
from __future__ import print_function
import gym
import numpy as np
import time

#Load the environment

env = gym.make('FrozenLake-v0')

s = env.reset()
print("initial state : ",s)
print()

env.render()
print()

print(env.action_space) #number of actions
print(env.observation_space) #number of states
print()

print("Number of actions : ",env.action_space.n)
print("Number of states : ",env.observation_space.n)
print()

#Epsilon-Greedy approach for Exploration and Exploitation of the state-action spaces
def epsilon_greedy(Q,s,na):
    epsilon = 0.3
    p = np.random.uniform(low=0,high=1)
    #print(p)
    if p > epsilon:
        return np.argmax(Q[s,:]) #exploit: for the current state, pick the action with the highest Q-value
    else:
        return env.action_space.sample() #explore: pick a random action

# Q-Learning Implementation

#Initializing Q-table with zeros
Q = np.zeros([env.observation_space.n,env.action_space.n])

#set hyperparameters
lr = 0.5 #learning rate
y = 0.9 #discount factor gamma
eps = 100000 #total number of training episodes


for i in range(eps):
    s = env.reset()
    t = False
    while(True):
        a = epsilon_greedy(Q,s,env.action_space.n)
        s_,r,t,_ = env.step(a)
        if (r==0):
            if t==True:
                r = -5 #negative reward for falling into a hole
                Q[s_] = np.ones(env.action_space.n)*r #in a terminal state the Q-value equals the reward
            else:
                r = -1 #small negative reward to discourage long routes
        if (r==1):
            r = 100 #large positive reward for reaching the goal
            Q[s_] = np.ones(env.action_space.n)*r #in a terminal state the Q-value equals the reward
        Q[s,a] = Q[s,a] + lr * (r + y*np.max(Q[s_,:]) - Q[s,a]) #Q-learning update: max over the actions of the next state
        s = s_
        if (t == True):
            break

print("Q-table")
print(Q)
print()

print("Output after learning")
print()
#learning ends with the completion of all the episodes in the loop above
#let's check how much our agent has learned
s = env.reset()
env.render()
while(True):
    a = np.argmax(Q[s]) #greedily follow the learned policy
    s_,r,t,_ = env.step(a)
    print("===============")
    env.render()
    s = s_
    if(t==True):
        break
----------------------------------------------------------------------------------------------
<<OUTPUT>>

initial state : 0

SFFF
FHFH
FFFH
HFFG

Discrete(4)
Discrete(16)

Number of actions : 4
Number of states : 16

Q-table
[[ -9.85448046  -7.4657981   -9.59584501 -10.        ]
 [ -9.53200011  -9.54250775  -9.10115662 -10.        ]
 [ -9.65308982  -9.51359977  -7.52052469 -10.        ]
 [ -9.69762313  -9.5540111   -9.56571455 -10.        ]
 [ -9.82319854  -4.83823005  -9.56441915  -9.74234959]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.6554905   -9.44717167  -7.35077759  -9.77885057]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.66012445  -4.28223592  -9.48312882  -9.76812285]
 [ -9.59664264   9.60799515  -4.48137699  -9.61956668]
 [ -9.71057124  -5.6863911   -2.04563412  -9.75341962]
 [ -5.          -5.          -5.          -5.        ]
 [ -5.          -5.          -5.          -5.        ]
 [ -9.54737964  22.84803205  18.17841481  -9.45516929]
 [ -9.69494035  34.16859049  72.04055782  40.62254838]
 [100.         100.         100.         100.        ]]

Output after learning

SFFF
FHFH
FFFH
HFFG
===============
(Down)
SFFF
FHFH
FFFH
HFFG
===============
(Down)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
===============
(Right)
SFFF
FHFH
FFFH
HFFG
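As an optional follow-up (not part of the listing above), a quick way to quantify how much the agent has learned is to run the greedy policy for a number of extra episodes and count how often it reaches the goal. The sketch below assumes the same env, Q, and old Gym step API used above; the 1,000-episode count is an arbitrary choice:

#optional: estimate the success rate of the greedy policy
successes = 0
episodes = 1000 #arbitrary number of evaluation episodes
for _ in range(episodes):
    s = env.reset()
    t = False
    while not t:
        s, r, t, _ = env.step(np.argmax(Q[s]))
    successes += int(r == 1) #the raw environment reward is 1 only when the goal is reached
print("Success rate : ", successes/episodes)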