# Notes on Reinforcement Learning: Bandit

About two years ago, I started taking courses in the Coursera specialization: Fundamentals of Reinforcement Learning. At that time, it was my first foray into this area, with great excitement, I rushed through the materials, and left with some concepts poorly understood. Recently I had the urge to pick up this topic again. In an attempt to do it better this time, I reckon it is better to take some notes.

## From 30,000 feet

Reinforcement learning (RL) is about decision making, *i.e.*, learning and applying the **best** policy. A policy is almost always evaluated by the rewards generated by following it.

The decision maker (the agent) generates the training data by interacting with the world (environment). This is in contrast to the supervised learning where the labels are already provided. In RL, the agent must learn the consequence (label) from its own actions through trial and error.

## Bandit *v.s.* RL

Arguably, the most famous example of RL is the multi-armed bandit problem. However, it is a bit dangerous to equate bandit to RL. In a bandit problem (*e.g.*, contextual bandit), the agent focuses on optimizing the **immediate reward** after applying the action: think the payout from pulling the slot machine’s arm. In an RL problem, the agent focuses on optimizing the **long-term reward**: think playing a chess game, the reward (winning) is only available at the end of the game, yet one still has to follow a policy early in the game, hoping to receive the ultimate reward. As such, bandit can be considered as a simple instantiation of RL, and usually is used as the introductory example.

## Nomenclature

Below we will introduce the common naming conventions in the RL context (mostly following Sutton and Barto). These should apply to both bandit and RL contexts. We will add more definitions as we go.

We commonly use the subscript \(t\) to denote the time step.

Symbol | Definition |
---|---|

\(a\) | An action, for example, turn left, turn right. |

\(\mathcal{A}\) | The set of all possible actions. |

\(A_t\) | The specific action taken at time step \(t\). |

\(R_t\) | The reward observed at time step \(t\). |

\(r\) | Reward of taking a certain action. |

\(s\) | A given state of the environment. |

\(\mathcal{S}\) | The set of all possible states. |

\(S_t\) | The state at time step \(t\). |

\(q(\cdot)\) | The action-value function of \(a\), or \(s,a\). |

\(Q_t(\cdot)\) | The estimated action-value function, at time step \(t\). |

There are a few terms that worth special mentioning.

**Expected action-values, \(q_*(a)\)**: it is the *expected* reward of taking an action. Here we treat it as state invariant. In the context of bandit, the reward is drawn from a (unknown) distribution, and observed immediately. Formally it can be written as:

**Policy, \(\pi(a \mid s)\)**: it is the mapping from a given state, \(s\), to the *probability* of taking action \(a\). The notion of \(\mid\) in the \(\pi\) is to emphasize it is a conditional probability distribution. It is possible that a policy is agnostic to the states, as in many bandit problems.

The goal of the agent is to choose the action \(a\) (*i.e.*, following a policy) that has the largest action-value. To do so, the agent wants to get \(q_*(a)\) as accurately as possible.

If the action-values are known, the bandit problem is solved: the policy is apply the action with the largest action-value.

## Estimating action-value

One approach is to estimate \(q_*(a)\) using the sample-average. With observations after applying different actions, one needs to update the estimated action-values. A general update rule is to update it incrementally. It works in the sample-average case, but also in non-stationary case (the distribution of reward changes over time).

In bandit, one observes the reward immediately as \(R_t\), then updates the action-value estimation as:

\[Q_{t+1}(a) = Q_t(a) + \alpha_t (R_t - Q_t(a)),\]where \(\alpha_t\) is the learning rate. There is a question about what the initial value, \(Q_0\), to use. The optimistic initial values approach assigns a large value of \(Q_0\), to encourage exploration early on. But it doesn’t allow for continuous exploration later, for example, to account for non-stationary rewards.

## Bandit policy

With a protocol to estimate and update the action-values, a greedy policy is to choose the action whose action-value is largest. Note that, as the agent follows this policy and applies actions continuously, the estimated action-values are also changing since the reward is drawn from a distribution.

Here comes the exploration-exploitation trade-off: an epsilon-greedy policy allows the agent to instead choose the action with the largest estimated action-values (exploitation), but to apply a random action (exploration). However, if the agent explores too much, it does not apply the learnings to guide its actions.

## Leave a Comment