Jekyll2023-12-27T02:20:54+00:00https://changyaochen.github.io/feed.xmlPain is inevitable. Suffering is optional.Personal website.Changyao ChenLarge Language Model (LLM) learning notes, part 22023-12-25T18:00:00+00:002023-12-25T18:00:00+00:00https://changyaochen.github.io/Coursera-LLM-part-2<p>LLM, at the simplest level, is a model to “finish the sentences”.
The model is trained as a supervised learning problem, where the training data
consists of inputs as the “unfinished” sentences, and labels as the next token
in the “finished” sentences. For example, a training sample can be:</p>
<blockquote>
<p>Input: “Sir Isaac Newton developed”<br />
Label: “calculus”<br /><br />
Input: “Sir Isaac Newton developed calculus and the law of”<br />
Label: “motion”</p>
</blockquote>
<p>Note that the start and end of the sentence are marked by special tokens. As such,
any prediction can be compared with the “ground truth” label to generate
some kind of error metric, which can be used to evaluate the model.</p>
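<p>As a minimal sketch (using whitespace tokenization for illustration; real LLMs operate on sub-word tokens plus special start/end tokens), such training pairs can be generated as follows:</p>

```python
def make_training_samples(sentence):
    """Split a sentence into (input, label) pairs for next-token prediction."""
    tokens = sentence.split()
    samples = []
    for i in range(1, len(tokens)):
        samples.append((" ".join(tokens[:i]), tokens[i]))
    return samples

pairs = make_training_samples("Sir Isaac Newton developed calculus")
# the last pair is ("Sir Isaac Newton developed", "calculus")
```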
<h2 id="evaluation-metrics">Evaluation metrics</h2>
<p>For LLMs, the evaluation metrics depend on the task at hand.</p>
<h3 id="rouge">ROUGE</h3>
<p>Recall-Oriented Understudy for Gisting Evaluation, or
<a href="https://en.wikipedia.org/wiki/ROUGE_(metric)">ROUGE</a>,
is the metric used in summarization tasks. It measures the token-level
“overlap” between the model prediction and the ground truth. Here we need to
pay attention to the “token-level”, as it can be uni-gram, bi-gram, or even n-gram.
The difference lies in the atomic unit of comparison. For example, for a uni-gram,
the atomic unit is a single token (<em>e.g.</em>, “this”), and for a bi-gram, the atomic unit
is two consecutive tokens (<em>e.g.</em>, “this is”).
Correspondingly, the ROUGE metric can be ROUGE-1, ROUGE-2, ROUGE-N, etc.</p>
<p>For example, if the ground truth is</p>
<blockquote>
<p><span style="color:green">“this is a test”</span></p>
</blockquote>
<p>and the model prediction is</p>
<blockquote>
<p><span style="color:blue">“that too is a test”</span></p>
</blockquote>
<p>then the ROUGE-1 recall is 3/4, the ROUGE-1 precision is 3/5,
and the ROUGE-1 F1 score is the 2/3 (the harmonic mean of recall and precision).
For ROUGE-2, the atomic unit will be bi-gram, and the ROUGE-2 recall is 2/3, as
the matching units are “is a” and “a test”. We can compute the ROUGE-2 precision and
ROUGE-2 F1 score similarly.</p>
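<p>As a sketch, the ROUGE-N numbers above can be reproduced with a few lines of Python (whitespace tokenization is assumed, for illustration only):</p>

```python
from collections import Counter

def rouge_n(reference, prediction, n):
    """ROUGE-N recall, precision, and F1 from (clipped) n-gram overlap counts."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, pred = ngrams(reference), ngrams(prediction)
    overlap = sum((ref & pred).values())  # n-grams present in both
    recall = overlap / sum(ref.values())
    precision = overlap / sum(pred.values())
    f1 = 2 * recall * precision / (recall + precision) if overlap else 0.0
    return recall, precision, f1

print(rouge_n("this is a test", "that too is a test", 1))  # recall 0.75, precision 0.6
```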
<p>There is, however, a special type of ROUGE metric called ROUGE-L. In this case,
we care about the longest common subsequence (LCS) between the model prediction
and the ground truth. For example, with the above example, the LCS is “is a test”,
therefore the ROUGE-L recall is 3/4, the ROUGE-L precision is 3/5, and the ROUGE-L
F1 score is 2/3.</p>
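<p>The LCS computation behind ROUGE-L can be sketched with the standard dynamic-programming recurrence (again assuming whitespace tokenization):</p>

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

ref, pred = "this is a test".split(), "that too is a test".split()
lcs = lcs_length(ref, pred)                          # 3, i.e. "is a test"
recall, precision = lcs / len(ref), lcs / len(pred)  # 0.75 and 0.6
```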
<p>Note that, oftentimes, the recall metric is used as the default ROUGE metric.</p>
<h3 id="bleu">BLEU</h3>
<p>Bilingual Evaluation Understudy, or <a href="https://en.wikipedia.org/wiki/BLEU">BLEU</a>,
is the metric used in machine translation tasks. It is closely related to ROUGE precision,
as it can be thought of as the <a href="https://en.wikipedia.org/wiki/Geometric_mean">geometric mean</a>
of ROUGE precisions computed at different n-gram lengths.</p>
<p>BLEU will be a value between 0 and 1, where 1 means the model prediction is identical
to the ground truth, and 0 means the model prediction is completely different from the
ground truth.</p>
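<p>A minimal sketch of this interpretation (note that the full BLEU score also applies a brevity penalty, which is omitted here):</p>

```python
import math
from collections import Counter

def ngram_precision(reference, prediction, n):
    """Fraction of the prediction's n-grams that also appear in the reference (clipped)."""
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    pred = Counter(tuple(prediction[i:i + n]) for i in range(len(prediction) - n + 1))
    overlap, total = sum((ref & pred).values()), sum(pred.values())
    return overlap / total if total else 0.0

def bleu(reference, prediction, max_n=2):
    """Geometric mean of n-gram precisions up to max_n (brevity penalty omitted)."""
    precisions = [ngram_precision(reference, prediction, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)

ref, pred = "this is a test".split(), "that too is a test".split()
score = bleu(ref, pred)   # sqrt(0.6 * 0.5), roughly 0.548
```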
<h2 id="parameter-efficient-fine-tuning-peft">Parameter Efficient Fine-Tuning (PEFT)</h2>
<p>Now that we have established the evaluation metrics, we can talk about model training.
As noted in the previous post, LLM training is a very expensive process; therefore,
one usually takes a pre-trained model and fine-tunes it on the task at hand. That
is, we take the parameter values from the pre-trained model, and use them as the initial
values to continue gradient descent on the task-specific data.</p>
<p>However, such fine-tuning still modifies <strong>all</strong> the model parameters, which for
large models can number in the tens or hundreds of billions. This can still be a
computationally expensive process. To address this issue, we can use a technique
called Parameter Efficient Fine-Tuning (PEFT).</p>
<h3 id="lora">LoRA</h3>
<p>One of the PEFT approaches is Low-Rank Adaptation, or <a href="https://arxiv.org/abs/2106.09685">LoRA</a>.
The idea is very simple: the pre-trained LLM contains multiple weight matrices, such as the query
matrix. We will not touch those matrices; instead, we will add a new matrix to each (or
only some) of them, and use the summed matrices at inference time. Each of these new
matrices is the product of two much smaller matrices.</p>
<p>For example, if the original matrix is \(W_0 \in \mathbb{R}^{d \times k}\), then the
injected matrix \(\Delta W_0\) will have the same shape, but it can be decomposed as
\(\Delta W_0 = B A\), where \(B \in \mathbb{R}^{d \times r}\) and
\(A \in \mathbb{R}^{r \times k}\). Here \(r\) is much smaller (order of magnitude)
than either \(d\) or \(k\). In this way, the number of trainable parameters is
reduced from \(d \times k\) to \(r \times(d + k)\).</p>
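<p>A small NumPy sketch of this decomposition (the layer sizes here are hypothetical; following the LoRA paper, \(B\) is initialized to zero so that training starts from the pre-trained behavior):</p>

```python
import numpy as np

d, k, r = 512, 512, 8                # hypothetical layer sizes, with r much smaller than d, k
W0 = np.random.randn(d, k)           # frozen pre-trained weight matrix
B = np.zeros((d, r))                 # trainable LoRA factor, initialized to zero
A = np.random.randn(r, k) * 0.01     # trainable LoRA factor, random initialization

W_effective = W0 + B @ A             # the summed matrix used at inference time

full_params = d * k                  # 262,144 parameters to fine-tune W0 directly
lora_params = r * (d + k)            # only 8,192 trainable parameters with LoRA
```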
<h3 id="soft-prompt-tuning">Soft prompt tuning</h3>
<p>In the previous post we talked about prompt engineering, which can be viewed as
a trial-and-error process, and the prompts are also in the form of human-readable text.
For soft prompt tuning, we move the “prompt” into the model, for example, as a 10-token
sequence that prefixes the input. During the fine-tuning process, we will only learn the
embeddings of the prompt tokens using the task-specific data. In this way, the total
number of trainable parameters is much smaller than that of the original model, as we
only train the embedding of the prompt tokens.</p>Changyao ChenLearning notes from the Coursera course on Large Language Models (LLMs)Large Language Model (LLM) learning notes, part 12023-12-08T18:00:00+00:002023-12-08T18:00:00+00:00https://changyaochen.github.io/Coursera-LLM-part-1<p>I have been putting off formally learning about Large Language Models (LLMs)
for a while, with the excuse that I don’t have time. Excuses are, at the end of
the day, excuses, so I try to do the easiest thing first: follow a Coursera
<a href="https://www.coursera.org/learn/generative-ai-with-llms">course</a>,
and try to take notes. This is the first part of my notes.</p>
<h2 id="rise-of-llms-transformer-architecture">Rise of LLMs: transformer architecture</h2>
<p>LLMs, at the simplest level, are models to “finish sentences”. To do so, we first
need to provide the “prompt” to the model, <em>i.e.</em>, ask a question, so that the
model, which is so knowledgeable (since it is trained on a massive amount of data),
can answer it. Here the emphasis is on the <strong>“massive amount of data”</strong> part.</p>
<p>Before circa 2017, the state-of-the-art models for text generation were Recurrent Neural
Networks (RNNs), with variants such as Long Short-Term Memory (LSTM) and Gated
Recurrent Units (GRUs). While successful, RNNs have a crucial problem: they can
not handle long-term dependencies very well. Namely, they can only practically remember
the last few words. If we want the model to remember earlier words, we need to
expand the size of the model rapidly, to the point where it can not be trained efficiently.</p>
<p>Then came the seminal paper, <a href="https://arxiv.org/abs/1706.03762">Attention Is All You Need</a>,
which proposed the transformer architecture that solves this problem.
We will probably have a separate post on the transformer architecture, but for now,
all we need to know is that this architecture breaks the sequential nature in the RNN
training routine, so that it can be scaled efficiently to use multi-core GPUs.
This allows transformer models to process training data in parallel, making use of
much larger datasets. More crucially, it’s able to learn to pay attention
to the meaning of the words it’s processing.</p>
<h2 id="prompt-engineering">Prompt engineering</h2>
<p>At inference time, we, as the end user, provide the prompt to the model, so that
it can complete the sentence(s) for us. As the saying goes, “to get a good answer,
you need to ask a good question”. The same applies to LLMs: there are techniques for
structuring the prompt (question), so that the model can give us the best answer.</p>
<p>In the setting of LLMs, this is called prompt engineering. One such technique
is “in-context learning”: we can provide our context to the model, and since LLMs
are good at remembering things, they can use the context to give us a better answer.</p>
<p>To make in-context learning more concrete, there is so-called <em>zero-shot learning</em>.
In this setting, we provide the <strong>exact</strong> structure of the desired conversation,
such as:</p>
<blockquote>
<p>Question: What is the meaning of life?<br />
Answer:</p>
</blockquote>
<p>Here we effectively leave the blank for the model to fill in, and hope that the
model will give us the answer: 42.</p>
<p>However, giving the model just the structure of the conversation is sometimes not enough
to produce the desired answer. In such cases, we need to provide one or more examples,
and this is called <em>one-/few-shot learning</em>. For example, we can provide the following
prompt:</p>
<blockquote>
<p>Question: what is the sentiment of this review: this movie is great!<br />
Answer: positive.<br />
Question: what is the sentiment of this review: this movie is terrible!<br />
Answer:</p>
</blockquote>
<p>Showing the model “how it should be done” increases the chance that you
will get the desired outcome.</p>
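<p>Assembling such a prompt programmatically is straightforward; the Question/Answer template below is just one common convention, not a requirement of any particular model:</p>

```python
def build_few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from (question, answer) example pairs."""
    lines = []
    for question, answer in examples:
        lines.append(f"Question: {question}")
        lines.append(f"Answer: {answer}")
    lines.append(f"Question: {query}")
    lines.append("Answer:")   # leave the blank for the model to fill in
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    [("what is the sentiment of this review: this movie is great!", "positive.")],
    "what is the sentiment of this review: this movie is terrible!",
)
```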
<h2 id="temperature-setting">Temperature setting</h2>
<p>A distinct feature of LLMs is that they are <em>probabilistic</em> models. This means
with the same prompt (input), the model can give you different answers (output).
The level of such randomness can be controlled by the so-called <em>temperature</em>
parameter.</p>
<p>As LLMs predict the next token (word) one-by-one (based on the previous
tokens), on each instance, it gives the probability of each token in the vocabulary.
To pick the next token, one can simply pick the token with the highest probability
(greedy), or sample the vocabulary based on the probability of each token. It is
the latter approach that gives us the randomness.</p>
<p>To be precise, the model produces an unnormalized score (a logit) \(z_i\) for each token
\(w_i\) in the vocabulary, and a softmax function converts these scores into a probability
distribution before we draw the next token. The temperature parameter \(T\) is inserted
into the softmax, so that the probability of drawing the token \(w_i\) is:</p>
\[p(w_i) = \frac{\exp(z_i / T)}{\sum_{j}\exp(z_j / T)}\]
<p>As you can see, if \(T=1\), then we have the standard softmax function. If \(T\) is
close to 0, then we are getting closer to the greedy scenario: the token \(w_i\) with the
highest score \(z_i\) will almost always be picked. On the other hand,
if \(T\) is very large, we are getting closer to the uniform distribution: all tokens have the same probability of being picked.</p>
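<p>A NumPy sketch of temperature sampling (a toy three-token vocabulary is assumed, for illustration):</p>

```python
import numpy as np

def sample_next_token(logits, temperature, rng):
    """Sample a token index from softmax(logits / temperature)."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                          # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.1]                  # toy scores for a 3-token vocabulary
cold = [sample_next_token(logits, 0.01, rng) for _ in range(100)]
# at a very low temperature, the highest-scoring token is picked (almost) every time
```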
<h2 id="mlops">MLOps</h2>
<p>Finally, let’s get practical: how many computational resources are required to deploy,
or even to train, an LLM? We can start with some back-of-the-envelope calculations.
Let’s say the LLM has 1 billion parameters. If we use 32-bit (4-byte)
floating point to represent each parameter, then we need 4GB of memory to
store/deploy the model. However, it is very common for LLMs to have more than 100
billion parameters, so we are talking about 400GB of memory. Probably don’t try this
at home.</p>
<p>For model training, not only do we need to store the model parameters in memory,
but we also need to store the gradients and a lot of other things (optimizer states,
etc.). As a rule of thumb, we might need <strong>20x</strong> the model size to train the model.</p>
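<p>The arithmetic above is easy to parameterize (the 20x training multiplier is the rule of thumb just mentioned, not an exact figure):</p>

```python
def model_memory_gb(n_params, bytes_per_param=4, multiplier=1):
    """Back-of-the-envelope memory estimate, in GB (32-bit floats by default)."""
    return n_params * bytes_per_param * multiplier / 1e9

inference_gb = model_memory_gb(1e9)                  # 4.0 GB to store a 1B-parameter model
training_gb = model_memory_gb(1e9, multiplier=20)    # 80 GB with the 20x rule of thumb
```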
<h2 id="scaling-laws">Scaling laws</h2>
<p>Do we really need 100 billion parameters to train a good LLM? How about using
a smaller model (fewer parameters), but with more training data? It is conceivable
that with a fixed computation budget, to achieve the same level of model performance,
there might be trade-off between model size and training data size. This is where
the <a href="https://en.wikipedia.org/wiki/Neural_scaling_law#Chinchilla_scaling_(Hoffmann,_et_al,_2022)">Chinchilla scaling law</a>
comes in. It states that:</p>
<blockquote>
<p>the optimal training data size (number of tokens)
<strong>20x</strong> of the model size (number of parameters), and vice versa.</p>
</blockquote>
<p>So spend your computational budget wisely!</p>Changyao ChenLearning notes from the Coursera course on Large Language Models (LLMs)Notes on Reinforcement Learning: Planning2023-05-07T18:00:00+00:002023-05-07T18:00:00+00:00https://changyaochen.github.io/RL-notes-7<p>In the previous posts of this series, we have discussed how the agent can learn what is the best action to take in a given state (<em>i.e.</em>, the policy), by interacting with the environment <strong>without</strong> a model, <em>i.e.</em>, in the model-free setting. These approaches include Monte Carlo method, and Temporal Difference (TD) method, <em>e.g.</em>, Q-learning. Since we do not have the complete description of the model, to apply techniques such as Dynamic Programming, we resort to the sampling from the environment, and used the past experience as the proxy to the model. Can we “squeeze more juice” from these past interactions, and treat them as if we have a model?</p>
<h2 id="combining-model-free-and-model-based-methods">Combining Model-Free and Model-Based Methods</h2>
<p>Yes, we can. In fact, this is a very common approach in practice. The idea is to use the model-free method to interact with the environment (sampling) in order to learn the underlying model, and use the model-based method to improve the policy (<em>e.g.</em>, calculate the state-action value functions). The latter is known as (state-space) planning.</p>
<h3 id="dyna-q">Dyna-Q</h3>
<p>It is not hard to see how one can intermix the sampling (or acting) and planning. For example, we can sample from the environment for one step, then update the state-action value function (via either SARSA or Q-learning). Since we just learned something from the environment, we should update our understanding of the environment, <em>i.e.</em>, the model. This can be done simply by updating tabular information of the next state and reward. With such a table that keeps track of all the \((S, A)\) we have visited and the outcome (next state and the reward), we can run some simulation, as if this tabular model <strong>is</strong> the environment, and let the agent interact with it to keep updating the state-action value function.</p>
<p>Below is the pseudo-code for Dyna-Q, which is a combination of Q-learning and planning.</p>
<figure>
<center>
<a href="/assets/images/DynaQ_pseudo.png"><img style="width:100%;" src="/assets/images/DynaQ_pseudo.png" /></a>
</center>
</figure>
<p>As you can see, it is largely identical to the Q-learning algorithm, except that we add a model update step (e), and a planning step (f) in which the agent interacts with the model to continue updating the state-action value function.</p>
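<p>As a sketch, the algorithm can be condensed into a few dozen lines of Python; the toy two-state chain environment below is made up purely for illustration:</p>

```python
import random
from collections import defaultdict

def dyna_q(env_step, start_state, actions, episodes=50, n_planning=10,
           alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Minimal tabular Dyna-Q: each real step is followed by n_planning model steps."""
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q[(state, action)], zero-initialized
    model = {}               # model[(state, action)] = (reward, next_state)

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    def target(r, s2):
        if s2 is None:       # None marks the terminal state
            return r
        return r + gamma * max(Q[(s2, a)] for a in actions)

    for _ in range(episodes):
        s = start_state
        while s is not None:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2 = env_step(s, a)
            Q[(s, a)] += alpha * (target(r, s2) - Q[(s, a)])   # (d) direct RL
            model[(s, a)] = (r, s2)                            # (e) model learning
            for _ in range(n_planning):                        # (f) planning
                ps, pa = rng.choice(list(model))               # previously seen pair
                pr, ps2 = model[(ps, pa)]
                Q[(ps, pa)] += alpha * (target(pr, ps2) - Q[(ps, pa)])
            s = s2
    return Q

# Toy deterministic chain: 0 -> 1 -> terminal, reward 1 for stepping off the end.
def env_step(state, action):
    if action == "right":
        return (1.0, None) if state == 1 else (0.0, state + 1)
    return 0.0, max(0, state - 1)

Q = dyna_q(env_step, start_state=0, actions=["left", "right"])
```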
<p>It is worth noting that the interaction with the environment (acting) serves two purposes: 1. To update the value estimation (<em>e.g.</em> \(Q(S, A)\)) incrementally; 2. To update the model, which can in turn update the same value estimation via planning. The former is called direct RL, and the latter is called model learning.</p>
<figure>
<center>
<a href="/assets/images/RL_planning_and_learning.png"><img style="width:80%;" src="/assets/images/RL_planning_and_learning.png" /></a>
</center>
</figure>
<h3 id="planning-smarter">Planning smarter</h3>
<p>In the Dyna-Q algorithm shown above, each planning step starts with a random state, and applies the same state-action value function update rule. There can be a few issues with that.</p>
<p>First, the uniformly random state may not be the most efficient way to plan: after all, we would like to see changes in \(Q(S, A)\); therefore, we should focus on states whose q-values have recently changed, and sample the state-action pairs that lead to them. To achieve this, we can use a priority queue to keep track of the states that we have visited, where the priority is the magnitude of the change in \(Q(S, A)\). We can then sample from this priority queue in our planning steps. This is known as the Prioritized Sweeping algorithm.</p>
<p>Second, there is only one (or a few) acting step (sampling from the real environment) between the (possibly many) planning steps. During each planning step, we run many simulations in which we are disconnected from the real environment and solely rely on the assumption the model is correct. However, the environment might have changed, hence, we need to encourage the agent to visit outdated states (states that we have not visited for a long time) more often (the Dyna-Q+ algorithm).</p>
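<p>The priority queue at the heart of Prioritized Sweeping can be sketched with the standard library (Python's <code class="language-plaintext highlighter-rouge">heapq</code> is a min-heap, so priorities are negated to pop the largest change first):</p>

```python
import heapq

class SweepQueue:
    """Max-priority queue of state-action pairs, keyed by the magnitude of the Q-value change."""
    def __init__(self):
        self._heap = []

    def push(self, priority, state_action):
        heapq.heappush(self._heap, (-priority, state_action))

    def pop(self):
        neg_priority, state_action = heapq.heappop(self._heap)
        return -neg_priority, state_action

    def __len__(self):
        return len(self._heap)

queue = SweepQueue()
queue.push(0.05, ("s1", "a1"))
queue.push(0.80, ("s2", "a2"))   # a larger change in Q(S, A) gets planned first
priority, pair = queue.pop()
```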
<h2 id="putting-things-together">Putting things together</h2>
<p>So far, as in the first part of Sutton and Barto, we have only discussed the tabular setting, where the state and action spaces are discrete. However, it lays the foundation of reinforcement learning, even for the later continuous setting. Before we proceed, let’s summarize the <strong>3</strong> key ideas.</p>
<h3 id="value-functions">Value functions</h3>
<p>One can argue that at the heart of RL, we want to estimate some value functions, be it state, or state-action.</p>
<h3 id="backup">Backup</h3>
<p>All the methods, <em>e.g.</em>, value function estimations, operate by backing up values along actual or possible state trajectories. This is the essence of Dynamic Programming, and in the RL setting, it manifests itself explicitly through the Bellman equation.</p>
<h3 id="generalized-policy-iteration-gpi">Generalized policy iteration (GPI)</h3>
<p>Estimating the value functions is only part of the story; we still need a policy to operationalize the value functions. However, with different policies, the value functions will change. We maintain an approximate value function and an approximate policy, and continually improve each on the basis of the other.</p>
<h2 id="on-to-next">On to next</h2>
<p>Here comes the world of infinite state and action spaces!</p>Changyao ChenA collections of notes of Reinforcement Learning, as I am going through the Coursera specialization: Fundamentals of Reinforcement Learning. Hopefully this will be useful for future self.Notes on Reinforcement Learning: Temporal Difference for control2023-04-15T18:00:00+00:002023-04-15T18:00:00+00:00https://changyaochen.github.io/RL-notes-6<p>In the previous post of this series, we discussed Temporal Difference (TD) for prediction, namely, how to approximate the state value function, \(V(s)\) for a given policy. However, the ultimate goal of RL is to find the optimal policy, <em>i.e.</em>, solve the control problem. In the realm of TD, there are a few algorithms that can achieve this goal.</p>
<h2 id="sarsa">SARSA</h2>
<p>SARSA stands for state-action-reward-state-action. More precisely, it is an acronym for the sequence \(s_{t}, a_{t}, r_{t}, s_{t+1}, a_{t+1}\) used in the update rule. The goal is, instead of solving for the state value function \(V(s)\), to solve for the state-action value function, \(q(s, a)\), which is the expected return when starting from state \(s\), taking action \(a\), and then following the policy \(\pi\). The update rule is:</p>
\[q(s_{t}, a_{t}) \leftarrow q(s_{t}, a_{t}) +
\alpha(r_{t} + \gamma q(s_{t+1}, a_{t+1}) - q(s_{t}, a_{t}))\]
<p>Once the state-action value function is learned, we can greedify the policy to improve the policy, then start the next round of iteration.</p>
<h2 id="expected-sarsa">Expected SARSA</h2>
<p>In SARSA, we have to wait for the next state \(s_{t+1}\) and then take the next action \(a_{t+1}\), in order to make an update to \(q(s_{t}, a_{t})\) based on \(q(s_{t+1}, a_{t+1})\). However, since we already know the policy we are following (the behavior policy), we can calculate the <strong>expected</strong> value of \(q(s_{t+1}, \cdot)\) without waiting for the next action \(a_{t+1}\). Therefore, the update rule becomes:</p>
\[q(s_{t}, a_{t}) \leftarrow q(s_{t}, a_{t}) +
\alpha(r_{t} + \gamma \sum_{a'} \pi(a' \mid s_{t+1}) q(s_{t+1}, a') - q(s_{t}, a_{t}))\]
<p>It may appear that Expected SARSA should always be preferred over SARSA: since what we are interested in is the long-term, expected behavior, taking the expectation early on (as opposed to relying on discrete samples) is a good idea, as it mitigates the variance from the behavior policy. However, the expectation can be expensive to calculate if the action space is large.</p>
<h2 id="q-learning">Q-learning</h2>
<p>Q-learning is just a small deviation from SARSA: it applies the Bellman optimality equation to the update rule, which becomes:</p>
\[q(s_{t}, a_{t}) \leftarrow q(s_{t}, a_{t}) +
\alpha(r_{t} + \gamma \max_{a'} q(s_{t+1}, a') - q(s_{t}, a_{t}))\]
<p>Note the difference between SARSA and Q-learning is that in SARSA, we use the next action \(a_{t+1}\) to update \(q(s_{t}, a_{t})\), while in Q-learning, we use the <strong>best</strong> action for the state \(s_{t+1}\) to update \(q(s_{t}, a_{t})\). This is the only difference between SARSA and Q-learning. The rest of the algorithm is the same.</p>
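<p>The two update rules differ in a single line, as a small sketch makes clear (the states, actions, and value estimates here are made up for illustration):</p>

```python
def sarsa_update(q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    """SARSA: bootstrap on the action a2 actually taken in the next state."""
    return q[(s, a)] + alpha * (r + gamma * q[(s2, a2)] - q[(s, a)])

def q_learning_update(q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """Q-learning: bootstrap on the best available action in the next state."""
    best = max(q[(s2, a2)] for a2 in actions)
    return q[(s, a)] + alpha * (r + gamma * best - q[(s, a)])

# Hypothetical current estimates for a two-state, two-action problem.
q = {("s0", "left"): 0.0, ("s0", "right"): 0.0,
     ("s1", "left"): 0.2, ("s1", "right"): 1.0}

# Same transition (s0, right) -> s1 with reward 0, but different targets:
sarsa = sarsa_update(q, "s0", "right", 0.0, "s1", "left")                   # follows the action taken
qlearn = q_learning_update(q, "s0", "right", 0.0, "s1", ["left", "right"])  # uses the best action
```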
<p>Q-learning gets us the optimal state-action values, not necessarily the policy (although we can greedify to get the optimal policy). Put differently, Q-learning is off-policy, since the state-action value update does not follow the current policy (the behavior policy). In this manner, Q-learning can be considered as doing Generalized Policy Iteration (GPI), hence it is more general than SARSA, which is on-policy.</p>
<h2 id="td-control-and-bellman-equations">TD control and Bellman equations</h2>
<p>Through the above three algorithms, we can see the fingerprints of the Bellman equations. In fact, the update rules of Expected SARSA and Q-learning are just the TD control versions of the Bellman equation and the Bellman optimality equation, respectively. In essence, we bootstrap the state-action value function as if we know all the other state-action values, and then update the current state-action value based on the learning rate, discount factor, and observed reward.</p>Changyao ChenA collections of notes of Reinforcement Learning, as I am going through the Coursera specialization: Fundamentals of Reinforcement Learning. Hopefully this will be useful for future self.How to use US Census Bureau data2023-02-20T18:00:00+00:002023-02-20T18:00:00+00:00https://changyaochen.github.io/us-census-data<p>Recently I had to learn about the US Census Bureau data, in order to explore demographic information at the ZIP code level.
In the past, I usually Google search for nicely parsed data, <em>e.g.</em>, total population at each ZIP code,
and chances are someone has already made the data available. After all:</p>
<ul>
<li>The data is free (from the US Census Bureau).</li>
<li>The size of the data is (usually) small enough to be downloaded (after all, there are only about
42,000 ZIP codes across the US).</li>
</ul>
<p>However this time around, I am interested in more than the common demographic information, therefore, I reckon it
is better to query directly from the data source, and I did learn quite a few things about the US Census data.</p>
<h2 id="us-census-bureau-does-not-know-zip-code">US Census Bureau does not know ZIP code</h2>
<p>The commonly known concept of <a href="https://en.wikipedia.org/wiki/ZIP_Code">ZIP code</a> is defined by US Postal Service (USPS),
in order to facilitate efficient mail delivery. However, the US Census Bureau operates and organizes data under its own
<a href="https://www.census.gov/programs-surveys/geography/guidance/hierarchy.html">geography entity hierarchy</a>, and it does
not honor the USPS ZIP code. As shown below, the smallest geographic entity is a Census Block (think of it as a city block),
rolling all the way up to the Nation level. As such, US Census Bureau data can not directly answer a question such as
“what is the population of ZIP code 10027”.</p>
<figure>
<center>
<a href="/assets/images/US_census_bureau_geography.png"><img style="width:100%;" src="/assets/images/US_census_bureau_geography.png" /></a>
</center>
</figure>
<p>To map the US Census Bureau data at the ZIP code level, the closest geographic entity is the Census Tract (there are about 70,000).
However, it is a many-to-many mapping, <em>i.e.</em>, a Census Tract can contain multiple ZIP codes, and a ZIP code can cover multiple
Census Tracts as well. I found some convenient <a href="https://www.huduser.gov/portal/datasets/usps_crosswalk.html">crosswalk files</a>
that provide such mappings.</p>
<p>But there is an easier way. Since 2000, the US Census Bureau introduced the geographic entity of
<strong>ZIP Code Tabulation Areas (ZCTAs)</strong>. Per <a href="https://en.wikipedia.org/wiki/ZIP_Code_Tabulation_Area">Wikipedia</a>:</p>
<blockquote>
<p>This new entity was developed to overcome the difficulties in precisely defining the land area covered by each ZIP code.</p>
</blockquote>
<p>The <a href="https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html">documentation</a> from US Census Bureau
provides the details of how ZCTAs are created. Probably the important aspect to a data consumer is:</p>
<blockquote>
<p>In most instances the ZCTA code is the same as the ZIP Code for an area.</p>
</blockquote>
<p>Good enough for me! It would be nice to quantify <em>most</em>, but I will give the benefit of the doubt and operate under the
mode of <code class="language-plaintext highlighter-rouge">ZIP code == ZCTA</code>.</p>
<h2 id="use-acs-not-the-census-itself">Use ACS, not the Census itself</h2>
<p>The colloquial “US Census” usually refers to the <a href="https://en.wikipedia.org/wiki/United_States_census">decennial census</a>,
which is conducted once every 10 years (the most recent
one was in 2020). It is a <em>complete count</em> of the entire U.S. population, and asks just a few questions about every person
and household. The public artifact of the Census is the Public Law (P.L.) 94-171
<a href="https://data.census.gov/all?q=Population+Total&y=2020&d=DEC+Redistricting+Data+(PL+94-171)">Redistricting Data</a>. From the name, it
is not hard to guess the purpose of the data: to re-draw the US congressional districts and reapportioning the
House of Representatives. Therefore, we can consider it only provides the population count, and nothing more.</p>
<p>The treasure trove of US Census Bureau data is the <a href="https://en.wikipedia.org/wiki/American_Community_Survey">American Community Survey</a> (ACS).
It has been conducted continuously since the early 2000s,
and is an ongoing survey of just a portion of the population. The ACS asks dozens of questions on a wide variety of topics
to gather information about the demographic, social, economic, and housing characteristics of the population. The main differences
between the decennial census and ACS are:</p>
<ul>
<li>The decennial census, well, is a census. It is a one-time, complete count of the population.</li>
<li>The ACS is an ongoing survey, designed to provide up-to-date information about communities throughout the United States,
enabling policymakers, businesses, and individuals to make informed decisions.</li>
</ul>
<p>As a data consumer, I’m almost always interested in ACS, given the rich information it provides. However,
being a survey, we need to keep in mind that we are dealing with point estimates. Fortunately, for each piece of information
(<em>e.g.</em>, family income), the US Census Bureau also provides the margin of error. Another factor to consider is how many
samples are used to derive the point estimate. In the ACS case, this is conveyed through how many years of survey results
are used. Currently there are only two options: 1-year estimates and 5-year estimates. The tradeoff is pretty intuitive:
the former reflects more recent trends, whereas the latter provides smaller margins of error.</p>
<h2 id="accessing-the-acs-data">Accessing the ACS data</h2>
<p>To be honest, it is not very intuitive how to access the US Census Bureau data, at least for someone who is new to it.
After a few days’ digging, hopefully the following steps would save you some time. Here I assume we are interested
in the bulk download, not API calls (which is <a href="https://www.census.gov/data/developers/data-sets.html">available</a>).</p>
<h3 id="search-for-the-table-of-your-interest">Search for the table of your interest</h3>
<p>US Census Bureau provides a handy <a href="https://data.census.gov/table">entrypoint</a> to search for all available tables.
Apply the appropriate filters (from the left panel),
<a href="https://data.census.gov/table?y=2021&d=ACS+5-Year+Estimates+Detailed+Tables">for example</a>,
with the 2021 ACS 5-year estimate data. This will result in more than 1000 tables. We can then further
filter by the topics, <a href="https://data.census.gov/table?t=Education&y=2021&d=ACS+5-Year+Estimates+Detailed+Tables">for example</a>, education.
In this case, there are fewer than 100 tables left.</p>
<p>Each table comes with an alphanumerical table identifier (<em>e.g.</em>, “B06009”), and a long description (<em>e.g.</em>,
“PLACE OF BIRTH BY EDUCATIONAL ATTAINMENT IN THE UNITED STATES”). The default view will be at the national level,
and one can apply “Geography” filter to obtain aggregation at different geographic entities (<em>e.g.</em>, ZCTA).
Since our goal is to make bulk downloads, knowing the table identifier is enough.
With the table identifier in hand, we can leverage the FTP service provided by the US Census Bureau, to directly
download the raw files.</p>
<h3 id="data-dictionary">Data dictionary</h3>
<p>The raw data will be in tabular format, and a piece of key information is the data dictionary, <em>i.e.</em>, what each column means.
To understand it, one should probably (highly recommended) first read the ACS
<a href="https://www.census.gov/content/dam/Census/library/publications/2020/acs/acs_general_handbook_2020.pdf">manual</a>.</p>
<p>For the 2021 ACS 5-year estimate,
<a href="https://www2.census.gov/programs-surveys/acs/summary_file/2021/table-based-SF/documentation/ACS20215YR_Table_Shells.txt">this</a>
is the data dictionary, <em>i.e.</em>, mapping between the table identifier, the name of each column in the tables, and the semantics.
For example, the first few lines read:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Table ID|Line|Indent|Unique ID|Label|Title|Universe|Type
B01001|1.0|0|B01001_001|Total:|SEX BY AGE|Total population|int
B01001|2.0|1|B01001_002|Male:|SEX BY AGE|Total population|int
...
</code></pre></div></div>
<p>It indicates that for the table with the id of “<a href="https://www2.census.gov/programs-surveys/acs/summary_file/2021/table-based-SF/data/5YRData/acsdt5y2021-b01001.dat">b01001</a>”,
the columns named “B01001_E002” and “B01001_M002” mean the <strong>E</strong>stimate and <strong>M</strong>argin of error
for “total population of male”, respectively. Note the injected letters <strong>E</strong> and <strong>M</strong> in the column names.</p>
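<p>Since the file is pipe-delimited, the standard library is enough to turn it into a lookup table (only the two sample lines shown above are used here):</p>

```python
import csv
import io

# The first few lines of the ACS table-shells file, as shown above.
raw = """Table ID|Line|Indent|Unique ID|Label|Title|Universe|Type
B01001|1.0|0|B01001_001|Total:|SEX BY AGE|Total population|int
B01001|2.0|1|B01001_002|Male:|SEX BY AGE|Total population|int"""

rows = list(csv.DictReader(io.StringIO(raw), delimiter="|"))
labels = {row["Unique ID"]: row["Label"] for row in rows}
# labels["B01001_002"] tells us what the columns B01001_E002 / B01001_M002 measure
```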
<h3 id="download-the-table-from-ftp">Download the table from FTP</h3>
<p>All the tables are accessible, even from your browser. A better way to access the data is to
download them programmatically. <a href="https://github.com/uscensusbureau/acs-summary-file/pull/14/files#diff-6e7bdc9416f82cbd97963c90ec0fa91c52a47ab3dec8db17dd5f81498097e62b">This</a>
is a simple Python snippet I coded up, to download a single table, and filter at the given geographic entity level.
Here I use the ZCTA (<code class="language-plaintext highlighter-rouge">summary_level=860</code>) as an example.</p>
<p>Note that, for the ACS 5-year estimate, each table contains aggregations at all geographic entity levels (aka, summary levels).
The geographic entity is identified by the column <code class="language-plaintext highlighter-rouge">GEO_ID</code>. While there doesn’t seem to be an
official definition of the <code class="language-plaintext highlighter-rouge">GEO_ID</code>, <a href="https://mcdc.missouri.edu/geography/sumlevs/">this</a> is a very good resource to use.</p>
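<p>The filtering step can be sketched with a few lines of Python. This is only an illustration: the pipe-delimited layout follows the data dictionary above, but the sample rows, values, and the exact <code class="language-plaintext highlighter-rouge">GEO_ID</code> prefixes are made up for the sketch, not actual Census data.</p>

```python
import csv
import io


def filter_by_summary_level(dat_text, summary_level="860"):
    """Keep only rows whose GEO_ID starts with the given summary level.

    The table-based summary files are pipe-delimited, and the summary
    level is encoded in the leading characters of the GEO_ID column.
    """
    reader = csv.DictReader(io.StringIO(dat_text), delimiter="|")
    return [row for row in reader if row["GEO_ID"].startswith(summary_level)]


# Made-up rows mimicking the .dat layout (values are not real data).
sample = (
    "GEO_ID|B01001_E001|B01001_M001\n"
    "0100000US|100|0\n"
    "8600000US10001|42|5\n"
)
zcta_rows = filter_by_summary_level(sample, "860")
```

<p>Here only the second row survives, since its <code class="language-plaintext highlighter-rouge">GEO_ID</code> starts with the ZCTA summary level of 860.</p>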
<h2 id="conclusion">Conclusion</h2>
<p>Here we describe how to understand and access US Census Bureau data. Hopefully it can save you
some time Googling around for nicely parsed, US-demographics-related information, as you can easily
download them from the source!</p>Changyao ChenA quick primer on how to leverage the rich data from the US Census Bureau.Notes on Reinforcement Learning: Temporal Difference learning for prediction2023-01-29T18:00:00+00:002023-01-29T18:00:00+00:00https://changyaochen.github.io/RL-notes-5<p>The Monte Carlo (MC) method for (state) value function estimation is an approach that <strong>does not</strong> rely on a model to predict
the system dynamics, namely, the probability of state transition and reward under a given action. This is more practical
because in the real world, it is unlikely that we have a full enough understanding of the system to write down
\(P(s', r \mid s, a)\).</p>
<p>Temporal Difference (TD) learning is another model-free framework, one that combines characteristics of MC and of the model-based
Dynamic Programming (DP) method. It also addresses the drawback of MC that it only applies to episodic tasks.</p>
<h2 id="td-for-prediction">TD for prediction</h2>
<p>So far, the central theme of reinforcement learning is to solve for, or estimate the value functions, be it state values or
state-action values (prediction problem). In DP, we leverage the concept of bootstrap: when investigating a single state,
we assume we already know the values for other state values, and construct the Bellman equations to solve for it.
In MC, we allow the agent to take actions and experience the environment (sampling). As the agent visits different states and collects
rewards, we do careful bookkeeping and then update the state values retrospectively, after an episode has terminated.</p>
<p>TD combines the concept of <strong>bootstrap</strong> from DP with that of <strong>sampling</strong> from MC. With the TD method, the agent also interacts with
the environment by taking actions. However, once the agent transitions from state \(s_t\) to \(s_{t + 1}\)
and collects the reward \(r_t\), it updates the state value \(V(s_t)\) as:</p>
\[V(s_t) \leftarrow V(s_t) + \alpha \overbrace{(\underbrace{[r_t + \gamma V(s_{t + 1})]}_{\text{TD target}} - V(s_t))}^{\text{TD error}},\]
<p>where \(\alpha\) is a newly introduced parameter called the learning rate.</p>
<p>Conceptually, we correct the current estimate of the state value with a “TD error” term multiplied by the learning rate.
Inside the construct of the TD error, the term “TD target” can be viewed as an estimate of the true \(V(s_t)\),
<strong>as if</strong> we know the true value of \(V(s_{t + 1})\) (based on the definition of the state value). This assumption is
the very nature of bootstrap.</p>
<p>Note that, like DP, TD bootstraps, but unlike DP, TD does not require a model. Like MC, TD samples the environment to update
the value functions, but unlike MC, TD does not need to wait for the end of an episode to update the state values: TD makes
updates immediately after every action. In a way, TD takes the best of the two worlds.</p>
<h2 id="a-simple-example">A simple example</h2>
<p>Let’s use a simple example to demonstrate how these 3 methods work in estimating the state values.</p>
<p>In this example (Barto and Sutton, Example 6.2), the agent starts from state 3, and always takes a 50/50 chance between moving left or right.
The state 0 and 6 are terminal states, and moving from state 5 to 6 gives the agent a reward of +1, all other transitions have no reward.
As such, this is an episodic task. We will use DP, MC and TD to solve for the state values.
(Jupyter notebook <a href="https://github.com/changyaochen/changyaochen.github.io/blob/master/assets/notebooks/assets/notebooks/RL_DP_MC_TD_example.ipynb">here</a>)</p>
<figure>
<center>
<a href="/assets/images/dp_mc_td_example.png"><img style="width:100%;" src="/assets/images/dp_mc_td_example.png" /></a>
</center>
</figure>
<p>For the sake of simplicity, the discount factor \(\gamma\) is set to 1.</p>
<h3 id="dynamic-programming-dp">Dynamic Programming (DP)</h3>
<p>To use DP, we need to write down the state transition probability \(P\), along with the rewards probability \(R\).
Using a matrix notation, they can be written as:</p>
\[\begin{eqnarray}
P &=& p(s' \mid s) = 0.5 \times
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0
\end{bmatrix} \\
R &=&
\begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}
\end{eqnarray}\]
<p>The indices in \(P\) and \(R\) correspond to the state indices, with \(R\) holding the reward received upon entering each state (+1 for entering state 6).
The state values \(V\), also written in a matrix notation, can then be solved via Bellman equation as:</p>
\[\begin{eqnarray}
V &=& P(R + \gamma V) \\
&\Downarrow& \\
V &=& (I - \gamma P)^{-1} P R
\end{eqnarray}\]
<p>The state values turn out to be:</p>
\[V = [0, \frac{1}{6}, \frac{1}{3}, \frac{1}{2}, \frac{2}{3}, \frac{5}{6}, 0]\]
<p>Note this will be used as the ground truth against which the MC and TD methods will compare.</p>
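<p>The closed-form solution can be checked numerically. The sketch below uses iterative policy evaluation (sweeping the Bellman equation until convergence) instead of the matrix inverse; the transition structure is hard-coded to the 50/50 random walk with a +1 reward for entering state 6.</p>

```python
def dp_random_walk(gamma=1.0, tol=1e-12):
    """Iterative policy evaluation for the 7-state random walk."""
    V = [0.0] * 7
    while True:
        delta = 0.0
        for s in range(1, 6):  # states 0 and 6 are terminal
            reward_right = 1.0 if s + 1 == 6 else 0.0
            # Bellman equation under the 50/50 left/right policy.
            new_v = 0.5 * gamma * V[s - 1] + 0.5 * (reward_right + gamma * V[s + 1])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V
```

<p>The result agrees with \(V = [0, 1/6, 1/3, 1/2, 2/3, 5/6, 0]\).</p>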
<h3 id="monte-carlo-mc">Monte Carlo (MC)</h3>
<p>To apply MC, we will run a series of simulations. In each episode, the agent starts from state 3, and terminates
at either state 0 or 6. Once an episode has terminated, we update the values of the visited states from the <strong>back</strong>.
Here we will use first-visit MC.</p>
<h3 id="temporal-difference-td">Temporal Difference (TD)</h3>
<p>In TD, we also run a series of simulations, starting from state 3. However, as soon as the agent transitions from an old state to a
new state, we apply the update rule outlined above, to arrive at a new estimate of the old state’s value.</p>
<h3 id="comparisons">Comparisons</h3>
<figure>
<center>
<a href="/assets/images/RL_DP_MC_TD_comparisons.png"><img style="width:100%;" src="/assets/images/RL_DP_MC_TD_comparisons.png" /></a>
</center>
</figure>
<p>With 100 episodes, both the MC and TD methods arrive at reasonably good estimates of the state values
(quantified by the RMSE). It is also clear that, with more episodes, both methods achieve
better results (smaller RMSE), albeit at different rates, and with seemingly different asymptotic limits. It is worth
noting that both the convergence rate and the limit are affected by hyperparameters such as the learning rate.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Here we discussed TD as the second model-free method to estimate the state value functions, within the larger context of
the policy evaluation (prediction) framework. More specifically, the method is TD(0) (we will see the more general TD(\(\lambda\)) algorithm later).
Compared to MC, TD <em>usually</em> converges faster, and it is truly incremental and online.</p>Changyao ChenA collections of notes of Reinforcement Learning, as I am going through the Coursera specialization: Fundamentals of Reinforcement Learning. Hopefully this will be useful for future self.Notes on Reinforcement Learning: Monte Carlo method2022-12-30T18:00:00+00:002022-12-30T18:00:00+00:00https://changyaochen.github.io/RL-notes-4<blockquote>
<p>Found another <a href="http://web.stanford.edu/class/cs234/index.html">good course</a> from Stanford (video <a href="https://www.youtube.com/playlist?list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u">recordings</a> from winter 2019). If the official site is not working, <a href="https://github.com/tallamjr/stanford-cs234">this</a> Github repo contains most of the materials.</p>
</blockquote>
<p>In the previous notes, we laid out the foundation of reinforcement learning (RL), and most importantly, the concepts of MDP, state, reward, policy, state value, (state-)action value, and how to find the optimal policy that leads to the largest state values. However, all the calculations are based on the assumption that there is a model that dictates the transition probability between different states, namely, \(p(s', r \mid s, a)\). It is not hard to see that this assumption (that we have complete knowledge of the environment) can be easily violated. The Monte Carlo (MC) method helps us to move from a model-based situation to a model-free situation.</p>
<h2 id="monte-carlo-method">Monte Carlo method</h2>
<p>In the model-based paradigm, we use the Dynamic Programming (DP) method to iteratively calculate the state values \(v(s)\). To find the optimal state value \(v_{*}(s)\), we can use either policy iteration, or value iteration. Once \(v_{*}(s)\) is found, we can use the \(\text{argmax}\) to greedily find the optimal policy \(\pi_{*}\).</p>
<p>The essence is to find a good estimate of \(v_{\pi}(s)\). If there isn’t a model for us to apply the DP method, then in the <strong>episodic case</strong> we can use the sample average as the estimate of \(v_{\pi}(s)\), computed over multiple episodes following the same policy \(\pi\). This is what the MC method entails.</p>
<p>In the context of episodic tasks (with terminal states), after an episode (1, …, \(t\), …, \(T\)) is finished (following the policy \(\pi\)) and the rewards are collected, we can count from the back (\(T\)) and calculate the discounted return \(G_t\) for each of the visited states \(s_t\). As such, we are sampling \(v_\pi(s)\) from the episode, and from the definition of the state value function (note the expectation notation), we can use the sample average as the estimate of the true \(v_\pi(s)\). Note that, within the same episode, the same state \(s\) can appear more than once.</p>
<p>We can run through many episodes, in the hope to visit each state multiple times, and the state values can be estimated as the simple average. The same procedure can be used to estimate the state-action values.</p>
<h2 id="first-visit-and-every-visit">First visit and every visit</h2>
<p>As mentioned above, the same state can be visited multiple times during the <em>same</em> episode, and it matters which visit (or all of them) should be used in the sample average. Being an estimator, we need to discuss its bias and variance. It turns out that if we limit ourselves to the <em>first</em> visit (of the multiple visits to the same state) in a single episode, we get an unbiased estimator. If we use <em>every</em> visit to the same state in a single episode, we get a biased estimator (the subsequent visits are correlated with the earlier ones). However, the every-visit estimator is more data efficient, as we don’t discard the subsequent visits.</p>
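<p>The first-visit bookkeeping can be made concrete with a short sketch: computing returns backwards and letting earlier visits overwrite later ones keeps exactly the first-visit return for each state. The (state, reward) encoding of an episode is an assumption made for the sketch.</p>

```python
def first_visit_returns(episode, gamma=1.0):
    """Return G_t for the *first* visit to each state in one episode.

    `episode` is a list of (state, reward) pairs, where the reward is
    the one received after leaving that state.
    """
    g = 0.0
    returns = {}
    for state, reward in reversed(episode):
        g = reward + gamma * g
        # Iterating backwards, earlier visits overwrite later ones,
        # so only the first visit's return survives.
        returns[state] = g
    return returns
```

<p>For the episode [("A", 0), ("B", 0), ("A", 1)] with \(\gamma = 0.5\), this yields a return of 0.25 for A (from its first visit) and 0.5 for B.</p>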
<figure>
<center>
<a href="/assets/images/MC_prediction_algorithm.png"><img style="width:100%;" src="/assets/images/MC_prediction_algorithm.png" /></a>
</center>
</figure>
<p>Another aspect to consider: how do we ensure exploration in MC? Exploring starts (similar to the optimistic initial values in the bandit setting) can help address this, by randomizing the initial state. In the following states, the agent follows the prescribed policy. Exploring starts has its limitations, though: it may not be feasible in reality. We can do better by taking a page from the \(\epsilon\)-greedy policy in the multi-armed bandit setting: when deciding which action to take in the episode, instead of deterministically following the current policy \(\pi\), there is a small probability, <em>e.g.</em>, \(\epsilon\), that the agent chooses a random action.</p>
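<p>The \(\epsilon\)-greedy action selection can be sketched as a small generic helper; the list encoding of the action values is an assumption for the sketch.</p>

```python
import random


def epsilon_greedy_action(q_s, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else the greedy one.

    q_s holds the action values q(s, a) for a fixed state s.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_s))
    return max(range(len(q_s)), key=lambda a: q_s[a])
```

<p>With \(\epsilon = 0\) this is purely greedy; with \(\epsilon = 1\) it is purely random.</p>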
<h2 id="incremental-update">Incremental update</h2>
<p>Since the estimation of state value is taken as the sample average, with more data coming in, the update rule can be written as:</p>
\[v_{k + 1}(s) = v_k(s) + \alpha (G_i(s) - v_k(s))\]
<p>Here, \(G_i(s)\) is the newly observed return from state \(s\) in episode \(i\) (we use the first-visit algorithm here), \(k\) is the index of the update iteration, and \(\alpha\) can be thought of as the learning rate: when it is set to \(1 / N(s)\), where \(N(s)\) is the total number of visits to state \(s\), the above formulation reduces to the simple sample average. However, when written as such, it is:</p>
<ul>
<li>More space efficient: we don’t need to maintain the full history of \(v(s)\) as an array, but only the most recent one, and \(\alpha\).</li>
<li>More flexible to accommodate other update rules, for example, a constant value, to place more weight on recent observations.</li>
</ul>
<p>Another benefit is that, such an incremental update formulation will lead us naturally to the next topic: temporal difference learning.</p>
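<p>That the choice \(\alpha = 1 / N(s)\) recovers the plain sample average can be checked with a tiny sketch:</p>

```python
def incremental_mean(observations):
    """Sample average via the incremental rule v <- v + (1/n)(g - v)."""
    v, n = 0.0, 0
    for g in observations:
        n += 1
        v += (g - v) / n  # alpha = 1 / n
    return v
```

<p>For the observations [1, 2, 3] this gives 2.0, the ordinary mean, without ever storing the full history.</p>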
<h2 id="comparison-to-dynamic-programming-dp">Comparison to Dynamic Programming (DP)</h2>
<p>As mentioned in the beginning, the biggest advantage of MC over DP is that in MC, we do not need to know the model that prescribes the transition dynamics. Instead, we directly use the experience from interacting with the environment to incrementally update the estimates of the state values. Even in situations where it is possible to obtain the transition dynamics, doing so can be tedious and error-prone, so MC can be preferred over DP. Moreover, when updating the state value \(v(s)\), we don’t rely on the estimates of other state values \(v(s')\), namely, we <strong>do not bootstrap</strong>.</p>Changyao ChenA collections of notes of Reinforcement Learning, as I am going through the Coursera specialization: Fundamentals of Reinforcement Learning. Hopefully this will be useful for future self.Notes on Reinforcement Learning: Policy and value iterations2022-11-27T18:00:00+00:002022-11-27T18:00:00+00:00https://changyaochen.github.io/RL-notes-3<p style="color:blue">We will limit the following discussion to the case of a deterministic policy.</p>
<h2 id="policy-evaluation-prediction">Policy evaluation (prediction)</h2>
<p>In the previous post, we described the Bellman equation as the foundation to solve for the (state- or action-) value function under a given policy. We further argue that, despite its closed-form nature, it is infeasible to solve it analytically. In practice, we use an iterative algorithm called dynamic programming (DP).</p>
<p>Simply put, for the policy \(\pi\) under evaluation, we first initialize the state value function with some random values (<em>e.g.</em>, all zeros). Next, we will start the main loop. In each iteration of the main loop, we will update each of the value functions \(v_{\pi}(s)\) using the Bellman equation:</p>
\[\begin{align*}
v_{\pi}(s)_{k+1} &\leftarrow \sum_{a} \pi(a | s) \sum_{s', r} p(s', r \mid s, a)
\big[r + \gamma ~ {\color{red} v_{\pi}(s')_{k}}\big] \\
&\text{for all } s \in \mathcal{S}
\end{align*}\]
<p>Note that, when updating the value function for the \((k+1)^\text{th}\) iteration, the value functions used in the right-hand side of the Bellman equation take the value from the previous (<em>i.e.</em>, \(k^\text{th}\)) iteration (so-called synchronous update). If the changes in value functions from consecutive updates are less than a pre-set threshold, we will consider it converged and exit the main loop.</p>
<p>Here we only consider the state-value functions, but the same iterative procedure can be applied to calculate the action-value functions, using the corresponding Bellman equation. The only difference is that, in each iteration, instead of \(\mid \mathcal{S} \mid\) state-value functions to update, now we have \(\mid \mathcal{S} \times \mathcal{A} \mid\) action-value functions to update.</p>
<p>This process, policy evaluation (a.k.a, prediction), is the task to determine the state-value function, \(v_\pi(s)\), for a given policy \(\pi\).</p>
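<p>The synchronous update loop can be sketched as follows. The nested-list encodings of \(\pi\) and of \(p(s', r \mid s, a)\) as (prob, next_state, reward) triples are assumptions made for the sketch.</p>

```python
def policy_evaluation(pi, P, gamma=0.9, tol=1e-10):
    """Synchronous iterative policy evaluation.

    pi[s][a]: probability of taking action a in state s.
    P[s][a]: list of (prob, next_state, reward) triples, standing in
    for p(s', r | s, a).
    """
    V = [0.0] * len(pi)
    while True:
        new_V = []
        for s in range(len(pi)):
            v = 0.0
            for a, prob_a in enumerate(pi[s]):
                v += prob_a * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            new_V.append(v)
        # Exit once consecutive sweeps agree within the threshold.
        if max(abs(x - y) for x, y in zip(new_V, V)) < tol:
            return new_V
        V = new_V
```

<p>Note the synchronous flavor: each sweep reads only the previous iteration's values and writes into a fresh list.</p>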
<h2 id="optimal-policy">Optimal policy</h2>
<p>Now that we can evaluate the (state- or action-) value function of any policy: if the state-value function under a given policy is higher than under <em>any other</em> policy, then this state-value function is called the optimal state-value function (subscripted with \(*\)), and the corresponding policy is called an optimal policy, \(\pi_*\). Put differently, under an optimal policy, the value function at each state is the largest among all possible value functions.
While \(v_*(s)\) is unique (the values are scalars), there can be multiple optimal policies.</p>
<p>If we already know the optimal state-value functions, \(v_*(s)\), it is quite trivial to find the optimal policy, as:</p>
\[\pi_*(a \mid s) = \text{argmax}_{a} \sum_{s', r}p(s', r \mid s, a)\big[r + \gamma v_*(s')\big]\]
<p>Basically, at each state \(s\), we choose the action that maximizes the expected total future rewards. Along this line of logic, one can reason the form of the <strong>Bellman optimality equation</strong> as (note that no policy is involved):</p>
\[\begin{align}
v_{*}(s) &= {\color{red}{\max_{a}}} \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma ~ {v_{*}(s')}\big] \\
q_{*}(s, a) &= \sum_{s', r} p(s', r | s, a)\big[r + \gamma ~ {\color{red}{\max_{a'}}}~{q_{*}(s', a')}\big]
\end{align}\]
<p>If the state-value function for a given policy – calculated by the generic Bellman equation – equals the state-value function derived from the Bellman optimality equation, then this policy is the optimal policy, \(\pi_*\), and the state-value function is accordingly called \(v_*\). Note that \(v_*\) is unique, but multiple \(\pi_*\) can lead to the same \(v_*\).</p>
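<p>Extracting an optimal policy from known state values, via the \(\text{argmax}\) above, can be sketched as below. The (prob, next_state, reward) list encoding of \(p(s', r \mid s, a)\) is an assumption made for the sketch.</p>

```python
def greedy_policy(V, P, gamma=0.9):
    """For each state, pick the action maximizing expected one-step return.

    P[s][a]: list of (prob, next_state, reward) triples, standing in
    for p(s', r | s, a).
    """
    def q(s, a):
        # One-step lookahead: expected reward plus discounted next value.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    return [max(range(len(P[s])), key=lambda a: q(s, a)) for s in range(len(P))]
```

<p>When \(V\) is \(v_*\), the returned policy is an optimal one.</p>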
<h2 id="policy-improvement-control">Policy improvement (control)</h2>
<p>Given the link between the optimal value and optimal policy, and the fact that the ultimate goal of reinforcement learning is to find the optimal policy, we need to devise an algorithm to improve the policy. This process, policy improvement (a.k.a, control), is the task to improve the policy towards the optimal policy.</p>
<p>There is a greedy way to improve a given policy. From policy evaluation, one obtains all the state-value functions. From there, we can achieve an equally good, or better, policy by taking the greedy action at each state. Simply put, if the current policy \(\pi\) prescribes an action at the given state \(s\), then the new policy \(\pi'\) prescribes that:</p>
\[\begin{align*}
\pi'(s) = \text{argmax}_{a}\sum_{s', r}p(s', r \mid s, a)[r + \gamma v_\pi(s')].
\end{align*}\]
<p>Note the state-values are still under the current policy \(\pi\).</p>
<p>Alternatively, if we use the action-value function, the greedy policy improvement becomes simpler: for a given state, the new policy will pick the action with the highest action-value function.</p>
<h2 id="policy-iteration">Policy iteration</h2>
<p>With policy evaluation and policy improvement in place, there is a clear link between the two, and it is not hard to see the iterative nature. This is the so-called policy iteration algorithm, used to find the <strong>optimal</strong> policy.</p>
<p>Starting from a random policy, we will iteratively:</p>
<ul>
<li>With the current policy, apply <strong>policy evaluation</strong>, to update the value function.</li>
<li>With the current value function, apply <strong>policy improvement</strong>, to update the policy.</li>
</ul>
<p>This process exits if the policy stops changing.</p>
<p>Note that a single policy evaluation step is itself an iterative process, which only terminates after an exit condition is met, <em>i.e.</em>, the value function stops changing between successive iterations.</p>
<h2 id="general-policy-iteration-and-value-iteration">General policy iteration and value iteration</h2>
<p>The policy iteration algorithm has a nested iterative nature: the policy update itself is an iterative process, whereas in each of the policy updates, we need to evaluate the state-value functions for the <em>given</em> policy, which itself is an iterative process. This can be computationally expensive, and yet with a slight change, we can avoid such a nested structure. This is where the value iteration algorithm comes in.</p>
<p>The <strong>value iteration</strong> algorithm side-steps the policy evaluation step altogether: instead of finding the state-values for a given policy (running the iterative policy evaluation step until convergence), it tries to find the <em>optimal</em> state-values directly. This is achieved by greedily selecting the action that maximizes the state-value, <em>i.e.</em>, without following any policy. This is done with the Bellman optimality equation:</p>
\[\begin{eqnarray}
v_{k+1} (s) \leftarrow \max_{a}
\sum_{s', r}p(s', r \mid s, a)[r + \gamma v_{k}(s')]
\end{eqnarray}\]
<p>Effectively, we use the Bellman optimality equation as the update rule, to iteratively solve for the <strong>optimal</strong> state-value functions. Once the optimal state-value functions are obtained, we simply greedily select the action, to arrive at the optimal policy.</p>
<p>Note that in this algorithm, there is no policy involved when calculating the state-value functions, hence the name value iteration.</p>
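<p>A sketch of value iteration on a toy deterministic chain follows; the environment is made up for illustration (actions move left or right, and entering the rightmost, terminal state yields +1).</p>

```python
def value_iteration(n_states=5, gamma=0.9, tol=1e-8):
    """Value iteration on a deterministic chain MDP (last state terminal)."""
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states - 1):  # skip the terminal state
            left, right = max(s - 1, 0), s + 1
            reward_right = 1.0 if right == n_states - 1 else 0.0
            # max over the two actions, per the Bellman optimality equation
            new_v = max(gamma * V[left], reward_right + gamma * V[right])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V
```

<p>The optimal values decay geometrically with distance from the goal: \([\gamma^3, \gamma^2, \gamma, 1, 0]\) for five states. No policy appears anywhere in the loop; only the final greedy extraction produces one.</p>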
<p>More generally, the dance between updating the value function and updating the policy constitutes the path to the optimal value function \(v_*\) and the optimal policy \(\pi_*\). The picture below illustrates this process: an arrow that goes toward the “value” line indicates a policy evaluation (prediction) step, and an arrow that goes toward the “policy” line indicates a policy improvement (control) step.</p>
<figure>
<center>
<a href="/assets/images/RL_GPI.png"><img style="width:100%;" src="/assets/images/RL_GPI.png" /></a>
</center>
</figure>
<p>For the policy iteration algorithm, the arrow touches each line, indicating a converged iteration; for a general policy iteration (<em>e.g.</em>, value iteration), we can loosen this requirement and still reach the ultimate convergence.</p>Changyao ChenA collections of notes of Reinforcement Learning, as I am going through the Coursera specialization: Fundamentals of Reinforcement Learning. Hopefully this will be useful for future self.Notes on Reinforcement Learning: MDP2022-08-15T18:00:00+00:002022-08-15T18:00:00+00:00https://changyaochen.github.io/RL-notes-2<p>In the previous note, we have highlighted the difference between Bandit and Reinforcement Learning (RL). In the former, the reward is immediate, and we want to identify the best action that maximizes this reward. In the latter, the reward is usually delayed, and the best action depends on the state the agent is in. Moreover, the action will impact future rewards, so any action has long-term consequences.</p>
<p>This process can be modeled as a <a href="https://en.wikipedia.org/wiki/Markov_decision_process">Markov Decision Process</a> (MDP). The central assumption of MDP is the memoryless <a href="https://en.wikipedia.org/wiki/Markov_property">Markov property</a>: given the current state \(s\) and action \(a\), both the state transition probability and the reward probability are independent of all previous states and actions. If both the state space and action space are finite, then it is called a finite MDP.</p>
<p>The diagram below illustrates the closed-loop nature of an RL system: the agent interacts with the environment by taking actions, receiving rewards, and moving to a different state.</p>
<figure>
<center>
<a href="/assets/images/reinforcement-learning-fig.jpeg"><img style="width:100%;" src="/assets/images/reinforcement-learning-fig.jpeg" /></a>
</center>
</figure>
<h2 id="nomenclature">Nomenclature</h2>
<p>We have introduced some nomenclature in the previous post, mostly in the context of bandit. Here we will add the common terminologies used in the RL context. Similarly, the subscript \(t\) denotes the time step.</p>
<table>
<thead>
<tr>
<th style="text-align: right">Symbol</th>
<th>Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">\(\pi(a\mid s)\)</td>
<td>A policy, which in general is a probability distribution. It prescribes the probability of taking action \(a\) in state \(s\).</td>
</tr>
<tr>
<td style="text-align: right">\(v_\pi (s)\)</td>
<td>The state-value function of state \(s\), under the policy \(\pi\).</td>
</tr>
<tr>
<td style="text-align: right">\(q_\pi (s, a)\)</td>
<td>The action-value function of state-action pair \((s, a)\), under the policy \(\pi\).</td>
</tr>
<tr>
<td style="text-align: right">\(p(s', r \mid s, a)\)</td>
<td>The joint probability distribution of transition and reward.</td>
</tr>
<tr>
<td style="text-align: right">\(G_t\)</td>
<td>Total rewards from time step \(t\), <em>i.e.</em>, including future rewards.</td>
</tr>
<tr>
<td style="text-align: right">\(\gamma\)</td>
<td>Discount rate of future rewards.</td>
</tr>
</tbody>
</table>
<h2 id="policy">Policy</h2>
<p>A policy \(\pi\) defines the action, \(a\), the agent will take when in a given state, \(s\). The policy can be either deterministic (for example: when seeing a 4-way roundabout, always turn left), or probabilistic (for example: 50% chance turn left, 15% chance go straight, 15% chance turn right, 20% chance turn back). After taking each action, the agent will receive a reward, \(r\), from the environment. This reward is usually stochastic as well.</p>
<p>The goal of RL is to find a policy that maximizes the total reward <strong>in the long term</strong>.</p>
<h2 id="value-functions">Value functions</h2>
<p>For a given policy, one can calculate the value function: it estimates the future return under this policy. There are two types of value functions: state-value function, and action-value function. They are defined as:</p>
\[\begin{align}
v_{\pi}(s) &= \mathbb{E}_{\pi}[G_t | S_t = s]\\
q_{\pi}(s, a) &= \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a]
\end{align}\]
<p>Note here \(G_t\) is the total reward from time step \(t\), that is, it is the sum of \(R_{t + 1}, R_{t + 2}, ...\), until the episode stops (<em>e.g.</em>, winning a chess game), or into infinity.</p>
<p>To make it tractable, in the case of non-episodic process, the concept of <em>discounting</em> is introduced. In this setting, the future rewards are discounted, and \(G_t\) is now defined as:</p>
\[G_t := R_{t + 1} + \gamma R_{t + 2} + \gamma^2 R_{t + 3} + ... = R_{t + 1} + \gamma G_{t + 1}\]
<p>Obviously the discount rate \(\gamma \in [0, 1)\). Specifically, if \(\gamma = 0\), then we only consider the immediate reward, reducing to the bandit case.</p>
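<p>The recursive definition suggests computing returns backwards over a finite reward sequence, as in this small sketch:</p>

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma * G_{t+1}, accumulated from the back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

<p>For rewards [1, 1, 1] and \(\gamma = 0.5\), the return is \(1 + 0.5 + 0.25 = 1.75\).</p>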
<p>Intuitively, the value function measures <em>how good</em> it is to be in a given state (or taking a given action in a given state). For example, in a 2-dimensional grid world, if the goal is to reach a certain cell, then cells (states) close to the goal cell would have higher state-values (assuming taking fewer moves is desired). Also, the value functions depend on the policy \(\pi\): in this grid world example, if the policy prescribes the agent to always move left, then the state to the left of the goal state will have a smaller state-value than it would under a policy of always moving right.</p>
<p>The goal of RL can then be boiled down to:</p>
<ol>
<li><strong>How to calculate the value functions for a given policy.</strong></li>
<li><strong>How to find the policy that has the highest value functions.</strong></li>
</ol>
<h2 id="dynamic-programming-and-bellman-equation">Dynamic programming and Bellman equation</h2>
<p>For a given policy \(\pi\), if the transition probability \(p(s', r \mid s, a)\) is known, then a single state-value function can be expressed in terms of the other state-value functions as:</p>
\[\begin{align*}
v_{\pi}(s) &= \sum_{a} \pi(a | s) \sum_{s', r} p(s', r \mid s, a)\big[r + \gamma ~ {\color{red} v_{\pi}(s')}\big] \\
&\text{for all } s \in \mathcal{S}
\end{align*}\]
<p>Intuitively, it states that the value of the given state, \(v_\pi(s)\), consists of the immediate reward, \(r\) and the discounted state-values of all possible successive states, \(v_\pi(s')\). It also depends on what action does the agent take – controlled by \(\pi\), and what is the future state – controlled by \(p(s', r \mid s, a)\).</p>
<p>Note that, the state-value function of \(v_\pi(s)\) is expressed recursively (its own value included on the right-hand side), as if we know all state-value functions \(v_\pi(s')\). This is the hallmark of <a href="https://en.wikipedia.org/wiki/Dynamic_programming">dynamic programming</a>. In this RL setting, this is known as the <a href="https://en.wikipedia.org/wiki/Bellman_equation">Bellman equation</a>.</p>
<p>For a given RL problem, there are \(\mid \mathcal{S} \mid\) state-value functions, one for each state. Therefore, it is a set of linear equations, and in principle, can be solved analytically. However, it is usually infeasible to do so in practice, since the number of states can be very large. Later we will introduce iterative algorithms to solve for the value functions.</p>
<blockquote>
<p>The Bellman equation answers the question of “how to calculate the value functions for a given policy”</p>
</blockquote>
<p>Similarly, one can derive the Bellman equation for the action-value functions as:</p>
\[\begin{align}
q_{\pi}(s, a) &= \sum_{s', r} p(s', r | s, a)\big[r + \gamma ~ \sum_{a'} \pi(a' | s') {\color{red} q_{\pi}(s', a')}\big]\\
&\text{for all }s \in \mathcal{S}, a \in \mathcal{A}
\end{align}\]
<p>There are \(\mid \mathcal{S} \mid \times \mid \mathcal{A} \mid\) action-value functions. Note that the second summation (over \(a'\)) amounts to \(v_{\pi}(s')\).</p>
<p>To answer the second question posed above, namely, how to find the policy that has the highest value functions, we will leave it to the next post.</p>
<h2 id="markov-reward-process">Markov Reward Process</h2>
<p>In the preceding discussion, there is a policy embedded in the process (hence a <strong>decision</strong> process). If we remove this component, such that the system will transition between different states following its own (stochastic) dynamics, but still allow the agent to collect rewards at each state, we reduce the MDP to the Markov Reward Process (MRP).</p>
<p>One can still write the Bellman equation for the state value functions in the case of MRP, in an even simpler form. With no actions to consider, and associating the reward with the target state \(s'\), the transition probability becomes \(p(s' \mid s)\), and the Bellman equation is:</p>
\[\begin{eqnarray}
v(s) = \sum_{s'}p(s' \mid s)[r(s') + \gamma {\color{red} v(s')}].
\end{eqnarray}\]
<p>This is still a linear system of equations, but it can be expressed in a compact matrix notation as:</p>
\[V = P (R + \gamma V),\]
<p>where:</p>
\[\begin{eqnarray}
V &=&
\begin{bmatrix}
v(s_1) \\
v(s_2) \\
... \\
v(s_N)
\end{bmatrix}, \\
P &=&
\begin{bmatrix}
p(s_1 \mid s_1) & p(s_2 \mid s_1) & ... & p(s_N \mid s_1) \\
p(s_1 \mid s_2) & p(s_2 \mid s_2) & ... & p(s_N \mid s_2) \\
\vdots & \vdots & \ddots & \vdots \\
p(s_1 \mid s_N) & p(s_2 \mid s_N) & ... & p(s_N \mid s_N)
\end{bmatrix}, \\
R &=&
\begin{bmatrix}
r(s_1) \\
r(s_2) \\
... \\
r(s_N)
\end{bmatrix}.
\end{eqnarray}\]
<p>Therefore, the solution for the value function is:</p>
\[V = (I - \gamma P)^{-1} PR\]
<p>assuming \((I - \gamma P)\) is invertible.</p>Changyao ChenA collections of notes of Reinforcement Learning, as I am going through the Coursera specialization: Fundamentals of Reinforcement Learning. Hopefully this will be useful for future self.Notes on Reinforcement Learning: Bandit2022-07-31T18:00:00+00:002022-07-31T18:00:00+00:00https://changyaochen.github.io/RL-notes-1<p>About two years ago, I started taking courses in the <a href="https://www.coursera.org/specializations/reinforcement-learning">Coursera specialization</a>:
Fundamentals of Reinforcement Learning. At that time, it was my first foray into this area; with great excitement, I rushed through the materials and was left with some concepts poorly understood. Recently I had the urge to pick up this topic again. In an attempt to do better this time, I reckon it is worth taking some notes.</p>
<h2 id="from-30000-feet">From 30,000 feet</h2>
<p>Reinforcement learning (RL) is about decision making, <em>i.e.</em>, learning and applying the <strong>best</strong> policy. A policy is almost always evaluated by the rewards generated by following it.</p>
<p>The decision maker (the agent) generates the training data by interacting with the world (the environment). This is in contrast to supervised learning, where the labels are already provided. In RL, the agent must learn the consequence (label) of its own actions through trial and error.</p>
<h2 id="bandit-vs-rl">Bandit <em>vs.</em> RL</h2>
<p>Arguably, the most famous example of RL is the <a href="https://en.wikipedia.org/wiki/Multi-armed_bandit">multi-armed bandit</a> problem. However, it is a bit dangerous to equate bandits with RL. In a bandit problem (<em>e.g.</em>, contextual bandit), the agent focuses on optimizing the <strong>immediate reward</strong> after applying an action: think of the payout from pulling a slot machine’s arm. In an RL problem, the agent focuses on optimizing the <strong>long-term reward</strong>: think of a chess game, where the reward (winning) is only available at the end, yet one still has to follow a policy early in the game, hoping to receive the ultimate reward. As such, a bandit can be considered a simple instantiation of RL, and is usually used as the introductory example.</p>
<h2 id="nomenclature">Nomenclature</h2>
<p>Below we will introduce the common naming conventions in the RL context (mostly following <a href="https://mitpress.mit.edu/books/reinforcement-learning-second-edition">Sutton and Barto</a>). These should apply to both bandit and RL contexts. We will add more definitions as we go.</p>
<p>We commonly use the subscript \(t\) to denote the time step.</p>
<table>
<thead>
<tr>
<th style="text-align: right">Symbol</th>
<th>Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right">\(a\)</td>
<td>An action, for example, turn left, turn right.</td>
</tr>
<tr>
<td style="text-align: right">\(\mathcal{A}\)</td>
<td>The set of all possible actions.</td>
</tr>
<tr>
<td style="text-align: right">\(A_t\)</td>
<td>The specific action taken at time step \(t\).</td>
</tr>
<tr>
<td style="text-align: right">\(R_t\)</td>
<td>The reward observed at time step \(t\).</td>
</tr>
<tr>
<td style="text-align: right">\(r\)</td>
<td>Reward of taking a certain action.</td>
</tr>
<tr>
<td style="text-align: right">\(s\)</td>
<td>A given state of the environment.</td>
</tr>
<tr>
<td style="text-align: right">\(\mathcal{S}\)</td>
<td>The set of all possible states.</td>
</tr>
<tr>
<td style="text-align: right">\(S_t\)</td>
<td>The state at time step \(t\).</td>
</tr>
<tr>
<td style="text-align: right">\(q(\cdot)\)</td>
<td>The action-value function of \(a\), or \(s,a\).</td>
</tr>
<tr>
<td style="text-align: right">\(Q_t(\cdot)\)</td>
<td>The estimated action-value function, at time step \(t\).</td>
</tr>
</tbody>
</table>
<p>There are a few terms that are worth special mention.</p>
<p><strong>Expected action-value, \(q_*(a)\)</strong>: it is the <em>expected</em> reward of taking an action. Here we treat it as state-invariant. In the bandit context, the reward is drawn from an (unknown) distribution, and observed immediately. Formally it can be written as:</p>
\[\begin{align}
q_*(a)
:=&~\mathbb{E}[R_t \mid A_t = a]\\
=&~\sum_r r \cdot p(r \mid a)
\end{align}\]
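As a toy illustration (the reward distribution below is made up), the expectation is just a probability-weighted sum over the possible rewards:

```python
# Hypothetical reward distribution p(r | a) for a single action:
# reward 1.0 with probability 0.3, reward 0.0 otherwise.
reward_dist = {1.0: 0.3, 0.0: 0.7}

# q_*(a) = sum_r r * p(r | a)
q_star = sum(r * p for r, p in reward_dist.items())
print(q_star)  # prints 0.3
```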
<p><strong>Policy, \(\pi(a \mid s)\)</strong>: it is the mapping from a given state \(s\) to the <em>probability</em> of taking action \(a\). The \(\mid\) notation in \(\pi\) emphasizes that it is a conditional probability distribution. It is possible for a policy to be agnostic to the states, as in many bandit problems.</p>
<p>The goal of the agent is to choose the action \(a\) (<em>i.e.</em>, to follow a policy) with the largest action-value. To do so, the agent wants to estimate \(q_*(a)\) as accurately as possible.</p>
<blockquote>
<p>If the action-values are known, the bandit problem is solved: the policy is to apply the action with the largest action-value.</p>
</blockquote>
<h2 id="estimating-action-value">Estimating action-value</h2>
<p>One approach is to estimate \(q_*(a)\) using the sample average. As observations arrive after applying different actions, one needs to update the estimated action-values.
A general rule is to update the estimate incrementally. This works in the sample-average case, but also in the non-stationary case (where the distribution of rewards changes over time).</p>
<p>In bandit, one observes the reward immediately as \(R_t\), then updates the action-value estimation as:</p>
\[Q_{t+1}(a) = Q_t(a) + \alpha_t (R_t - Q_t(a)),\]
<p>where \(\alpha_t\) is the learning rate. There is a question of what initial value, \(Q_0\), to use. The optimistic-initial-values approach assigns a large
value to \(Q_0\), to encourage exploration early on. But it doesn’t allow for continued exploration later, for example, to account for non-stationary rewards.</p>
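A minimal sketch of this incremental update, for a hypothetical single action with true value \(1.0\) and Gaussian reward noise (the constant step size \(\alpha = 0.1\) is an arbitrary choice for illustration):

```python
import random

random.seed(0)

q_true = 1.0   # hidden true action-value (assumed for this sketch)
Q = 0.0        # initial estimate Q_0
alpha = 0.1    # constant step size

for t in range(1000):
    R = random.gauss(q_true, 1.0)   # observe a noisy reward
    Q = Q + alpha * (R - Q)         # Q_{t+1} = Q_t + alpha_t * (R_t - Q_t)

print(round(Q, 2))  # hovers around the true value 1.0
```

A constant step size weighs recent rewards more heavily, which is why this form can track non-stationary rewards; setting \(\alpha_t = 1/t\) instead recovers the plain sample average.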
<h2 id="bandit-policy">Bandit policy</h2>
<p>With a protocol to estimate and update the action-values, a greedy policy chooses the action whose estimated action-value is the largest. Note that, as the agent follows this policy and keeps applying actions, the estimated action-values also keep changing, since the reward is drawn from a distribution.</p>
<p>Here comes the exploration-exploitation trade-off: an epsilon-greedy policy lets the agent usually choose the action with the largest estimated action-value (exploitation), but with a small probability \(\epsilon\) apply a random action instead (exploration). However, if the agent explores too much, it fails to apply its learnings to guide its actions.</p>Changyao ChenA collection of notes on Reinforcement Learning, as I am going through the Coursera specialization: Fundamentals of Reinforcement Learning. Hopefully this will be useful for my future self.
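The epsilon-greedy policy described above can be sketched as follows, for a hypothetical two-armed bandit with made-up true action-values and \(\epsilon = 0.1\):

```python
import random

random.seed(42)

q_true = [0.5, 1.0]   # hidden true action-values (made up)
Q = [0.0, 0.0]        # estimated action-values
N = [0, 0]            # pull counts, for sample-average updates
epsilon = 0.1

for t in range(5000):
    if random.random() < epsilon:
        a = random.randrange(len(Q))                 # explore: random action
    else:
        a = max(range(len(Q)), key=lambda i: Q[i])   # exploit: greedy action
    R = random.gauss(q_true[a], 1.0)                 # observe a noisy reward
    N[a] += 1
    Q[a] += (R - Q[a]) / N[a]                        # sample average (alpha_t = 1/N)

print([round(q, 2) for q in Q])  # estimates approach the true values
```

Because every action keeps a probability of at least \(\epsilon / |\mathcal{A}|\) of being selected, both estimates continue to improve, while most pulls go to the (estimated) best arm.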