# Notes on Reinforcement Learning: Temporal Difference for control

In the previous post of this series, we discussed Temporal Difference (TD) for prediction, namely, how to approximate the state value function \(V(s)\) for a given policy. However, the ultimate goal of RL is to find the optimal policy, *i.e.*, to solve the control problem. In the realm of TD, there are a few algorithms that can achieve this goal.

## SARSA

SARSA stands for state-action-reward-state-action. More precisely, it is an acronym for the quintuple \(s_{t}, a_{t}, r_{t}, s_{t+1}, a_{t+1}\) that appears in its update rule. Instead of solving for the state value function \(V(s)\), we now solve for the state-action value function \(q(s, a)\), which is the expected return when starting from state \(s\), taking action \(a\), and then following the policy \(\pi\). The update rule is:

\[q(s_{t}, a_{t}) \leftarrow q(s_{t}, a_{t}) + \alpha(r_{t} + \gamma q(s_{t+1}, a_{t+1}) - q(s_{t}, a_{t}))\]

Once the state-action value function is learned, we can greedify the policy with respect to it to improve the policy, then start the next round of iteration.
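
As an illustration (not from the original post), here is a minimal tabular sketch of the SARSA update, assuming a NumPy Q-table `q` indexed as `q[state, action]` with integer states and actions, and an \(\epsilon\)-greedy behavior policy:

```python
import numpy as np

def epsilon_greedy(q, s, epsilon=0.1):
    """Sample an action epsilon-greedily from the Q-table row for state s."""
    n_actions = q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(q[s]))

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular SARSA update: bootstrap on the action actually taken next."""
    td_target = r + gamma * q[s_next, a_next]
    q[s, a] += alpha * (td_target - q[s, a])
```

Inside an episode loop, \(a_{t+1}\) would be chosen by `epsilon_greedy` in \(s_{t+1}\), used in the update, and then executed in the environment; updating with the action the behavior policy actually takes is what makes SARSA on-policy.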

## Expected SARSA

In SARSA, we have to wait until the next action \(a_{t+1}\) is actually taken in the next state \(s_{t+1}\) before we can update \(q(s_{t}, a_{t})\) based on \(q(s_{t+1}, a_{t+1})\). However, since we already know the policy we are following (the behavior policy), we can calculate the **expected** value of \(q(s_{t+1}, a')\) under that policy without waiting to sample the next action \(a_{t+1}\). Therefore, the update rule becomes:

\[q(s_{t}, a_{t}) \leftarrow q(s_{t}, a_{t}) + \alpha(r_{t} + \gamma \sum_{a'} \pi(a' \mid s_{t+1}) q(s_{t+1}, a') - q(s_{t}, a_{t}))\]

It may appear that expected SARSA should always be preferred over SARSA: since what we ultimately care about is expected long-term behavior, taking the expectation early (rather than relying on a single sampled action) reduces the variance introduced by the behavior policy. However, the expectation can be expensive to compute if the action space is large.
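
Under the same (assumed) tabular setup as the SARSA sketch above, with `pi[s]` holding the behavior policy's action probabilities in state `s`, the only change is the bootstrapping term:

```python
import numpy as np

def expected_sarsa_update(q, pi, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Expected SARSA update.

    pi is an array of per-state action probabilities, so pi[s_next] sums to 1
    (e.g. the epsilon-greedy probabilities of the behavior policy).
    """
    expected_q = np.dot(pi[s_next], q[s_next])  # E_{a' ~ pi}[ q(s_{t+1}, a') ]
    q[s, a] += alpha * (r + gamma * expected_q - q[s, a])
```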

## Q-learning

Q-learning is just a small deviation from SARSA: it applies the Bellman optimality equation instead, so the update rule becomes:

\[q(s_{t}, a_{t}) \leftarrow q(s_{t}, a_{t}) + \alpha(r_{t} + \gamma \max_{a'} q(s_{t+1}, a') - q(s_{t}, a_{t}))\]

The only difference between SARSA and Q-learning is the bootstrapping term: SARSA uses the next action \(a_{t+1}\) actually taken to update \(q(s_{t}, a_{t})\), while Q-learning uses the **best** action for the state \(s_{t+1}\). The rest of the algorithm is the same.
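
As another hedged sketch in the same assumed tabular setup, the Q-learning update replaces the sampled (or expected) next-action value with the max:

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update: bootstrap on the greedy action in s_next,
    regardless of which action the behavior policy actually takes next."""
    td_target = r + gamma * np.max(q[s_next])
    q[s, a] += alpha * (td_target - q[s, a])
```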

Q-learning gets us the optimal state-action values, not necessarily the optimal policy (although we can greedify with respect to them to obtain it). Put differently, Q-learning is off-policy: the value update bootstraps on the greedy target policy rather than on the current (behavior) policy that selects actions. In this manner, Q-learning can be considered an instance of Generalized Policy Iteration (GPI), and is more general than SARSA in this respect, since SARSA is on-policy.

## TD control and Bellman equations

Through the above three algorithms, we can see the fingerprints of the Bellman equations. In fact, the update rules of expected SARSA and Q-learning are just TD-control versions of the Bellman equation and the Bellman optimality equation, respectively. In essence, we bootstrap: we treat all other state-action values as if they were already known, and update the current state-action value based on the learning rate, the discount factor, and the observed reward.
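
For reference, these are the Bellman equation for \(q_{\pi}\) and the Bellman optimality equation for \(q_{*}\); the expected SARSA and Q-learning targets are sample-based versions of their right-hand sides:

\[q_{\pi}(s, a) = \mathbb{E}[r_{t} + \gamma \sum_{a'} \pi(a' \mid s_{t+1}) q_{\pi}(s_{t+1}, a') \mid s_{t} = s, a_{t} = a]\]

\[q_{*}(s, a) = \mathbb{E}[r_{t} + \gamma \max_{a'} q_{*}(s_{t+1}, a') \mid s_{t} = s, a_{t} = a]\]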
