On-policy methods use the same policy to evaluate as was used to make the decisions on actions. On-policy algorithms generally do not have a replay buffer; the experience encountered is used to train the model in situ. The same policy that was used to move the agent from state at time t to state at time t+1, is used to evaluate if the performance was good or bad. For example, if a robot is exploring the world at a given state, if it uses its current policy to ascertain whether the actions it took in the current state were good or bad, then it is an on-policy algorithm, as the current policy is also used to evaluate its actions. SARSA, A3C, TRPO, and PPO are on-policy algorithms that we will be covering in this book.