Maintaining a table for a small number of states is feasible, but in real-world problems the number of states can grow without bound. We therefore need a solution that takes the state information as input and outputs the Q-values for the actions without relying on a Q-table. This is where a neural network acts as a function approximator: trained on data consisting of different states and their corresponding Q-values for all actions, it can predict the Q-values for any new state it is given. A neural network used to predict Q-values in place of a Q-table is called a Q-network.
Here, for the FrozenLake-v0 environment, let's use a single-layer neural network that takes the state information as input, where the state is represented as a one-hot encoded vector of shape 1 x number of states (here, 1 x 16), and outputs a vector of shape 1 x number of actions (here, 1 x 4). The output is the Q-values for all the actions:
# considering there are 16 states numbered from state 0 to state 15,
# state number 4 will be represented in a one-hot encoded vector as
input_state = [0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0]
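In the listing that follows, this one-hot vector is built with np.identity, slicing out a single 1 x 16 row for the current state. A minimal standalone sketch of that encoding step:

import numpy as np

n_states = 16                                # FrozenLake-v0 is a 4 x 4 grid, so 16 states
s = 4                                        # example state index
input_state = np.identity(n_states)[s:s+1]   # 1 x 16 one-hot row vector for state 4
print(input_state)
# [[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]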
With the option of adding more hidden layers and trying different activation functions, a Q-network definitely has many advantages over a Q-table. Unlike a Q-table, in a Q-network the Q-values are updated by minimizing a loss through backpropagation. The loss is the squared error between the target and the predicted Q-values:

Loss = Σ (Q_target − Q_pred)²

where, for the action taken, Q_target = r + γ max Q(s', a'), and Q_pred is the network's current prediction for that action.
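As a quick numeric illustration (the Q-values, reward, and action below are made up for the example), only the action actually taken contributes to the loss, because the target vector is copied from the prediction and only that action's entry is overwritten with r + y * max Q(s', a'):

import numpy as np

y = 0.5                                        # discount factor, as used later in the listing
q_pred = np.array([[0.10, 0.40, 0.25, 0.05]])  # illustrative predicted Q-values at state s
max_q_next = 0.35                              # illustrative max Q-value at the next state s_
r = -1                                         # illustrative reward for stepping onto a frozen block

target_q = q_pred.copy()
target_q[0, 1] = r + y * max_q_next            # overwrite only the entry of the action taken (action 1)
loss = np.sum(np.square(target_q - q_pred))    # squared error; only the taken action contributes
print(loss)                                    # (-0.825 - 0.40)**2 = 1.500625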
Let's implement this in Python and learn how a basic Q-network algorithm makes an agent learn to navigate across this frozen lake of 16 grids, from the start to the goal, without falling into a hole:
# importing dependency libraries
from __future__ import print_function
import gym
import numpy as np
import tensorflow as tf
import random
# Load the Environment
env = gym.make('FrozenLake-v0')
# Q - Network Implementation
## Creating Neural Network
tf.reset_default_graph()

# tensors for inputs, weights, biases, Qtarget
inputs = tf.placeholder(shape=[None,env.observation_space.n],dtype=tf.float32)
W = tf.get_variable(name="W",dtype=tf.float32,shape=[env.observation_space.n,env.action_space.n],initializer=tf.contrib.layers.xavier_initializer())
b = tf.Variable(tf.zeros(shape=[env.action_space.n]),dtype=tf.float32)
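The training loop below also uses qpred, apred, qtar, and minimizer, which do not appear in the snippet above. A minimal sketch of those missing tensors, assuming a single linear layer and the squared-error loss described earlier, minimized with plain gradient descent (the 0.001 learning rate is an illustrative choice):

# Q-value predictions and greedy action from the single linear layer
qpred = tf.matmul(inputs,W) + b            # 1 x number_of_actions vector of Q-values
apred = tf.argmax(qpred,1)                 # index of the action with the highest Q-value

# placeholder for the target Q-values and the squared-error loss
qtar = tf.placeholder(shape=[1,env.action_space.n],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(qtar - qpred))

# gradient descent step that minimizes the loss
train = tf.train.GradientDescentOptimizer(learning_rate=0.001)
minimizer = train.minimize(loss)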
init = tf.global_variables_initializer()   # initializing tensor variables

# initializing parameters
y = 0.5            # discount factor
e = 0.3            # epsilon value for the epsilon-greedy task
episodes = 10000   # total number of episodes
with tf.Session() as sess:
    sess.run(init)
    for i in range(episodes):
        s = env.reset()   # resetting the environment at the start of each episode
        r_total = 0       # to calculate the sum of rewards in the current episode
        while(True):
            # running the Q-network created above
            a_pred,q_pred = sess.run([apred,qpred],feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1]})
            # a_pred is the action prediction by the neural network
            # q_pred contains the Q-values of the actions at the current state 's'
            if np.random.uniform(low=0,high=1) < e:
                # performing epsilon-greedy here: explore by randomly assigning the next action
                a_pred[0] = env.action_space.sample()
            s_,r,t,_ = env.step(a_pred[0])   # action taken; new state 's_' is encountered with a feedback reward 'r'
            if r==0:
                if t==True:
                    r = -5   # if hole, make the reward more negative
                else:
                    r = -1   # if the block is fine/frozen, give a slight negative reward to optimize the path
            if r==1:
                r = 5        # good positive goal state reward
            q_pred_new = sess.run(qpred,feed_dict={inputs:np.identity(env.observation_space.n)[s_:s_+1]})
            # q_pred_new contains the Q-values of the actions at the new state
            # update the Q-target value for the action taken
            targetQ = q_pred
            max_qpredn = np.max(q_pred_new)
            targetQ[0,a_pred[0]] = r + y*max_qpredn   # this gives our targetQ
            # train the neural network to minimize the loss
            _ = sess.run(minimizer,feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1],qtar:targetQ})
            s = s_
            if t==True:
                break
    # learning ends with the end of the loop over the episodes above
    # let's check how much our agent has learned
    print("Output after learning")
    print()
    s = env.reset()
    env.render()
    while(True):
        a = sess.run(apred,feed_dict={inputs:np.identity(env.observation_space.n)[s:s+1]})
        s_,r,t,_ = env.step(a[0])
        print("===============")
        env.render()
        s = s_
        if t==True:
            break
There is a cost of stability associated with both Q-learning and Q-networks. With a given set of hyperparameters, there will be cases where the Q-values do not converge, yet with the same hyperparameters convergence is sometimes witnessed. This is because of the inherent instability of these learning approaches. To tackle this, a better initial policy (here, taking the maximum Q-value of a given state) should be defined if the state space is small. Moreover, hyperparameters, especially the learning rate, discount factor, and epsilon value, play an important role, so these values must be initialized properly.
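One common heuristic for improving stability, not used in the listing above, is to decay the epsilon value across episodes so that the agent explores heavily at first and relies more on its learned Q-values later. A minimal sketch of such a schedule:

episodes = 10000
e = 0.3              # initial exploration rate, as in the listing above
min_e = 0.01         # floor so the agent never stops exploring entirely
decay = 0.999        # multiplicative decay applied once per episode

for i in range(episodes):
    # ... one episode of epsilon-greedy interaction would run here ...
    e = max(min_e, e * decay)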
Q-networks provide more flexibility than Q-learning with a Q-table as the state space grows, and a deeper neural network in a Q-network can lead to better learning and performance. When it comes to playing Atari games with deep Q-networks, there are many additional tweaks, which we will discuss in the coming chapters.