3 minute read

In the previous post of this series, we discussed Temporal Difference (TD) for prediction, namely, how to approximate the state value function, \(V(s)\) for a given policy. However, the ultimate goal of RL is to find the optimal policy, i.e., solve the control problem. In the realm of TD, there are a few algorithms that can achieve this goal.


SARSA stands for state-action-reward-state-action. More precisely, it is an acronym for the sequence of the update rule of \(s_{t}, a_{t}, r_{t}, s_{t+1}, a_{t+1}\). The goal is, instead of solving for the state value function \(V(s)\), we solve for the state-action value function, \(q(s, a)\), which is the expected return when starting from state \(s\) and taking action \(a\), and then following the policy \(\pi\). The update rule is:

\[q(s_{t}, a_{t}) \leftarrow q(s_{t}, a_{t}) + \alpha(r_{t} + \gamma q(s_{t+1}, a_{t+1}) - q(s_{t}, a_{t}))\]

Once the state-action value function is learned, we can greedify the policy to improve the policy, then start the next round of iteration.

Expected SARSA

In SARSA, we have to wait first take to the next action \(a_{t+1}\), and wait for the next state \(s_{t+1}\), in order to make update to \(q(s_{t}, a_{t})\) based on \(q(s_{t+1}, a_{t+1})\). However, since we already know the policy we are following (the behavior policy), we can calculate the expected \(q(s_{t+1}, a_{t+1})\) without waiting for the next state \(s_{t+1}\). Therefore, the update rule becomes:

\[q(s_{t}, a_{t}) \leftarrow q(s_{t}, a_{t}) + \alpha(r_{t} + \gamma \sum_{a'} \pi(a' \mid s_{t+1}) q(s_{t+1}, a') - q(s_{t}, a_{t}))\]

It appears that expected SARSA should always be preferred over SARSA, since what we are interested in is the long-term, expected, behavior, then taking expectation early on is a good idea (as opposed to from discrete sampling). This mitigates the variance from the behavior policy. However, the expectation can be expensive to calculate if the action space is large.


Q-learning is just a little deviation from SARSA: it applies Bellman optimality equation on SARSA, so the update rule becomes:

\[q(s_{t}, a_{t}) \leftarrow q(s_{t}, a_{t}) + \alpha(r_{t} + \gamma \max_{a'} q(s_{t+1}, a') - q(s_{t}, a_{t}))\]

Note the difference between SARSA and Q-learning is that in SARSA, we use the next action \(a_{t+1}\) to update \(q(s_{t}, a_{t})\), while in Q-learning, we use the best action for the state \(s_{t+1}\) to update \(q(s_{t}, a_{t})\). This is the only difference between SARSA and Q-learning. The rest of the algorithm is the same.

Q-learning gets us the optimal state-action values, not necessarily the policy (although we can greedify to get the optimal policy). Put differently, Q-learning is off-policy, since the state-action value update does not follow the current policy (behavior policy). In this manner, Q-learning can be considered as doing General Policy Improvement (GPI), hence more general than SARSA, since SARSA is on-policy.

TD control and Bellman equations

Through the above three algorithms, we can see the fingerprints of Bellman equations. In fact, the update rule of expected SARSA and Q-learning are just the TD control version of Bellman equations and Bellman optimality equations, respectively. In essence, we bootstrap the state-action value function as if we know all other state-action values, and then update the current state-action value function based learning rate, discount factor and observed reward.

Leave a Comment