Notes on Reinforcement Learning: Temporal Difference for control
In the previous post of this series, we discussed Temporal Difference (TD) for prediction, namely, how to approximate the state value function $v_\pi(s)$ under a fixed policy. This post covers TD for control: learning the state-action value function $Q(s, a)$ and using it to improve the policy.
SARSA
SARSA stands for state-action-reward-state-action. More precisely, it is an acronym for the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ that appears in its update rule:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big]$$
Once the state-action value function is learned, we can greedify the policy to improve it, then start the next round of iteration.
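As a concrete illustration, here is a minimal sketch of one tabular SARSA step, assuming `Q` is a NumPy array of shape `(num_states, num_actions)` and an $\epsilon$-greedy behavior policy; the helper names (`epsilon_greedy`, `sarsa_update`) are illustrative, not from any particular library.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA step: move Q(s, a) toward r + gamma * Q(s', a')."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```

Note that the next action `a_next` must already have been chosen (by the same $\epsilon$-greedy policy) before the update can be applied, which is exactly the on-policy character of SARSA.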
Expected SARSA
In SARSA, we have to first take the next action $A_{t+1}$ before we can update. Expected SARSA instead takes the expectation over all possible next actions, weighted by the current policy:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t) \Big]$$
It may appear that expected SARSA should always be preferred over SARSA: since what we are interested in is the long-term expected behavior, taking the expectation early (rather than relying on a single sampled action) is a good idea, and it mitigates the variance introduced by the behavior policy. However, the expectation can be expensive to calculate if the action space is large.
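Under an $\epsilon$-greedy policy the expectation has a simple closed form, which is what the following sketch assumes; as above, the conventions and names are illustrative.

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, epsilon):
    """Move Q(s, a) toward r + gamma * E_pi[Q(s', A')] under an epsilon-greedy pi."""
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)   # exploration mass spread evenly
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon      # remaining mass on the greedy action
    expected_q = np.dot(probs, Q[s_next])
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
```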
Q-learning
Q-learning is just a small deviation from SARSA: it applies the Bellman optimality equation instead of the Bellman expectation equation, so the update rule becomes:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \Big]$$
Note that the difference between SARSA and Q-learning is that in SARSA we use the next action $A_{t+1}$ actually taken by the current policy, whereas in Q-learning we use the action that maximizes $Q(S_{t+1}, \cdot)$, regardless of which action the policy takes next.
Q-learning gets us the optimal state-action values, not necessarily the optimal policy (although we can greedify to obtain it). Put differently, Q-learning is off-policy, since the state-action value update does not follow the current (behavior) policy. In this manner, Q-learning can be seen as performing generalized policy iteration (GPI), and is more general than SARSA, which is on-policy.
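The corresponding tabular Q-learning step differs from the SARSA sketch above only in the target: it uses the maximizing next action rather than the action actually taken. A minimal sketch, under the same assumptions as before:

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Move Q(s, a) toward r + gamma * max_a' Q(s', a'), independent of the behavior policy."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```

Because the target never references the action the behavior policy will actually take, the experience can come from any sufficiently exploratory policy, which is what makes the method off-policy.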
TD control and Bellman equations
Through the above three algorithms, we can see the fingerprints of the Bellman equations. In fact, the update rules of expected SARSA and Q-learning are just TD-control versions of the Bellman expectation equation and the Bellman optimality equation, respectively. In essence, we bootstrap the state-action value function as if we already knew all other state-action values, and then update the current state-action value based on the learning rate, the discount factor, and the observed reward.
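To make the correspondence explicit, one way to line them up (using standard notation, with $p(s', r \mid s, a)$ for the environment dynamics) is:

$$q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\Big[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \Big] \quad\longleftrightarrow\quad R_{t+1} + \gamma \sum_{a'} \pi(a' \mid S_{t+1})\, Q(S_{t+1}, a')$$

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\Big[ r + \gamma \max_{a'} q_*(s', a') \Big] \quad\longleftrightarrow\quad R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$$

The left-hand sides are the Bellman expectation and optimality equations; the right-hand sides are the expected SARSA and Q-learning TD targets, which replace the expectation over the dynamics with a single sampled transition.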