The fundamental problem in statistical learning is the estimation of a function (perhaps a known function, or a neural network, or anything really) that captures some physical or social process, and then the use of that function to make a prediction. For example: is this image that of a cat or a dog (classification)? What is your credit score given your income (regression)?

In other words, you have a *training sample*, or *training set* of data \({\cal D}\) which looks like this:

We’ll represent the variable being predicted by the letter \(y\), and the features or co-variates we use as inputs to this prediction by the letter \(x\). This \(x\) could be multi-dimensional, but for simplicity of notation, we’ll keep it 1-dimensional. Here you have maybe 50-60 pairs of points \(\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}\) of data that have been collected.

But you do not know what function \(f(x)\), called a *generative process*, that this sample came from. For example, if \(x\) is the weight of a load on a bridge, \(y\) might be the stress on the bridge, and \(f(x)\) comes from some complex physics which might require many days to calculate on a supercomputer. If we did calculate it, we could *generate* the data.

But we are not calculating \(f(x)\), and in many other cases it is just not possible to calculate it. Instead, we take experimental data \(\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}\) and try to learn the dependence of \(y\) on \(x\).

So, you will likely try some functions to capture this dependence, but you have no guarantee that these include \(f(x)\), since you do not know what \(f(x)\) actually is.

It gets worse. In the data above, it looks like there is a preponderance of samples between \(x=0.6\) and \(x=0.8\); but you have no notion of whether this is true in the population at large: you do not know what the population density of samples \(p(x)\) is.

Finally, you do not know how much noise there is in the \(y\) values you are seeing. You will need to find some way to model this noise. For example, if \(y\) here was a credit score and \(x\) was income, the noise would come from things like marital status, college debt, etc.

But let us start by assuming that there is no noise at all. In other words, if you knew \(f(x)\), you could simply write:

\[y = f(x).\]

Let us then ask: what is the process of learning from data in the absence of noise? This never really happens, but it is a way for us to understand the *theory of approximation*, and it lets us build a base for understanding how to learn from data with noise.

## How might this data have arisen?

In the frequentist view of the world, this sample is but one sample that could have been drawn from a (possibly infinite) population. In other words, had we been given access to the population, we could have drawn say \(M\) different samples \({\cal D_m}\) from the population. Our current sample or training set \(\cal{D}\) is but one of these different samples.

How did the population arise? Suppose God came to you in the middle of the night and said:

I generated all this data from: \[ y = f(x) \] and it’s exact. The \(y\)s you see come exactly from the application of the function \(f\) to the \(x\)s.

The image on the left below is this God-given function \(f\).

Well, now suppose **the lord also hit you on the head with a stick and made you forget what \(f(x)\) was**. So you now have no idea what the *generating function* for the data is.

But God used it to generate the *population* represented by the red circles in the plot on the right.

But God being angry, you are also not given access to the red circles. Indeed you are only given ONE *training sample* from the population, shown by the blue plus signs on the right.

### The task of prediction

So at the end of it all, your situation is illustrated below:

which is the plot we started with.

Now, our task is to use this sample to come up with a function \(g(x)\) that in some sense does the best possible job of approximating \(f\). In other words, someone can take any new \(x\) value and use your *predictive model* **trained** on your sample to predict a new value \(y = g(x)\). That is, we wish to *estimate* \(f(x)\) using \(g(x)\).

Here is an illustration of using a particular \(g\) to estimate \(f(x)\). The blue regions roughly show the error incurred in doing so.

## The process of estimation

Such an *estimator* function, one that is estimated on the sample data, is called a **hypothesis**. The process of finding the estimator function is called *fitting* the data with the estimator or hypothesis function.

Your first idea might be to try every possible function in the universe to find the one that approximates \(f\) well. We’ll defer for a bit the question of how to evaluate a candidate hypothesis (function). But it should be clear to you that you can’t try out every function in the universe: you will be dead, the sun will have gone supernova, and other exciting situations will happen in the meanwhile.

### Hypothesis Spaces and Best Fits

So you must limit your choice of candidate hypotheses to those from some set, such that you can process members of this set quickly. Such a set is called a *hypothesis set* or *hypothesis space*.

Let us, for now, consider as the set of functions we use to fit the data, the set of all possible straight lines. Thus we are saying that our hypothesis function \(h\) is some straight line. We’ll put the subscript \(1\) on the \(h\) to indicate that we are fitting the data with a polynomial of order 1, or a straight line. (A polynomial is a function that is a sum of coefficients times increasing powers of \(x\). For example, a quadratic looks like \(h_2(x) = a_0 + a_1 x + a_2 x^2\), and so on and so forth for a cubic and a quartic.) This looks like:

\[ h_1(x) = a_0 + a_1 x \]

The set of all functions of a particular kind that we could have used to fit the data is called a **Hypothesis Space**. The words “particular kind” are deliberately vague: it’s our choice as to what we might want to put into a hypothesis space. A hypothesis space is denoted by the notation \(\cal{H}\).

So here we are considering the hypothesis space of all straight lines \(h_1(x)\). We’ll denote it as \(\cal{H}_1\), with the subscript again being used to mark the order of the polynomials allowed in the hypothesis space. Often, you will see people write the set of all straight lines as \(\{h_1(x) \in {\cal H_1}\}\).

We’ll call the **best fit** straight line the function \(g_1(x)\). The “best fit” idea is this: amongst the set of all lines (i.e., all possible choices of \(h_1(x)\)), what is the line \(g_1(x)\) that best estimates our unknown function \(f\) on the sample data we have?

We have not (yet) figured out how to find this best fit: let’s just assume for now that there is a way to do it.

Let’s rephrase: this is the best fit \(g_1\) to the data \(\cal{D}\) from the functions in the hypothesis space \(\cal{H}_1\). Remember that this is not the best fit from all possible functions, but rather the best fit from the set of all straight lines.

Another such hypothesis space might be \(\cal{H}_2\), the hypothesis space of all quadratic functions: \(\{h_2(x) \in {\cal H_2}\}\). A third such space might combine both of these together. We get to choose what we want to put into our hypothesis space.

Similarly, the best fit quadratic function could be denoted \(g_2(x)\).

### The choice of hypothesis: how statistics enters approximation

To think about how the choice of hypothesis space affects our analysis, and how this choice interacts with sampling, consider the simple example of 4 data points generated from the curve:

\[f(x) = x^2.\]

In the upper left plot, we show the data generating curve \(f(x)\) as well as the best fit quadratic (i.e., in \({\cal H}_2\)) hypothesis \(g_2(x)\), where the fit is done on all 4 points in the population. Then, on the upper right plot we once again use all 4 points in the population to make a fit, but this time we find the best fit line in \({\cal H}_1\), i.e. \(g_1(x)\). As you might have expected, we incur some error in doing this…

Let us turn our attention to the lower panels. Here we choose 2 different samples of size 3 from the population: `samp1` has the points with the lowest 3 \(x\) values, and `samp2` has the points with the highest 3. In the panel on the left, fitting 2 different quadratics to the samples just results in the original quadratic: since there is no noise in the dataset, the points are exactly generated from \(x^2\), and 3 points are enough to uniquely determine a quadratic (you get 3 simultaneous equations to solve for the coefficients).

But in the lower right panel, choosing different points in the sample has a real impact on what the best fit line on each sample is. Remember that in real life we will only see one of these samples, and so the best line we might come up with will in general not line up with either the best fit line on the population, or the actual generating curve (neither of which we would know in real life).

Indeed our fits look like there is noise in the data. And in a real model with actual noise, it will **not be possible to untangle** this *mis-specification* noise or *bias* coming from using hypotheses in \(\cal{H}_1\) to estimate a more complex \(f(x)\).
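These sample-dependent fits can be sketched numerically. A minimal example, assuming an illustrative noise-free population of 4 points on \(f(x) = x^2\) (the actual \(x\) values in the plots may differ):

```python
import numpy as np

# Hypothetical noise-free population: 4 points generated from f(x) = x^2
x_pop = np.array([0.0, 1.0, 2.0, 3.0])
y_pop = x_pop ** 2

samp1 = x_pop[:3]    # the 3 lowest x values
samp2 = x_pop[1:]    # the 3 highest x values

# Quadratic fits (in H_2): 3 exact points pin down a quadratic uniquely,
# so both samples recover f(x) = x^2 (coefficients: highest power first)
print(np.polyfit(samp1, samp1**2, 2))   # ≈ [1, 0, 0]
print(np.polyfit(samp2, samp2**2, 2))   # ≈ [1, 0, 0]

# Straight-line fits (in H_1): the best line depends on which sample we got
print(np.polyfit(samp1, samp1**2, 1))   # one line (slope, intercept)
print(np.polyfit(samp2, samp2**2, 1))   # a different line
```

The quadratic fits agree with each other exactly; the line fits do not, which is precisely the sample-dependence described above.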

## Evaluating Models

Before we come up with a criterion to choose a best fit hypothesis (a \(g\)), we must come up with an evaluation metric which can be used to compare different hypotheses (different \(h\)). We’ll then turn around and use this evaluation metric to find the best fit model.

But it needs to be stressed that fitting and evaluation are two separate processes. One could use different metrics for these two purposes. Often you will find that one metric is used to evaluate choices within a hypothesis space, but another one is used to pick the best hypothesis space by comparing the best fit functions that were chosen from each hypothesis space. In either case though, a comparison of candidate functions \(h\) needs to be done.

To develop our statistical learning formalism, let us in this section assume **we have access to the entire population**, and that there is data continuously at all the \(x\) in the range between some values \(a\) and \(b\).

If somehow you had guessed as \(g(x)\) the exact function \(f(x)\) that generated the data, then the subtraction \(f(x) - g(x) = f(x) - f(x)\) would be exactly 0 at every point in the allowed range of \(x\). This means that you can define some notion of distance by simply subtracting the estimator function from the actual function, and then use this distance as a metric.

Clearly any other \(h\) should do worse, so we will want to use a positive estimate of the distance as our measure. We could take either the absolute value of the distance, \(\vert f(x) - h(x) \vert\), or the square of the distance as our measure, \((f(x) - h(x))^2\). Both are reasonable choices, and we shall use the squared distance for now.

We’ll call this distance the *local loss* function: \[ \ell (x) = (f(x) - h(x))^2.\]

This gives us the metric at every value of \(x\).

Now we want the evaluation to include information from all possible values of \(x\), so we must add over all possible values of \(x\) in the world, and divide by the number of them.

You probably know how to do this. You grid the x-axis in bins of size \(dx\), compute the value of \(\ell(x)\) at the center of each bin, add all these values, and then divide by the total number of values (bins): \(\frac{b-a}{dx}\):

\[\frac{\sum_x \ell(x)}{\frac{b-a}{dx}}\]

where \(a\) and \(b\) are the endpoints of the range of \(x\). (The symbol \(\sum_x\) means to take the sum over the values of \(x\) in the range).

As \(dx \rightarrow 0\), the usual limit we want to take, we get the integral:

\[R(h(x)) = \frac{1}{b-a}\,\int_a^b \ell(x) dx = \frac{1}{b-a}\,\int_a^b (f(x) - h(x))^2 dx.\]

This is known as the **error functional** or **risk functional** or **loss functional** (also just called the **error**, **cost**, **loss**, or **risk**) of using the function \(h(x)\) to estimate \(f(x)\). Here we use the word **functional** to denote that the risk is a *function of the function* \(h(x)\).

We can now limit ourselves to a specific hypothesis space \({\cal H}\), say the space of all straight lines \(\{h_1(x) \in {\cal H}\}\).

Then, the \(g(x)\) we finally end up fitting is the function that gets this distance closest to 0 (which is what we would have for \(g=f\)). In other words, it is the function that *minimizes* this distance:

\[ g(x) = \arg\min_{h(x) \in \cal{H}} R(h(x)).\]

Unless \(f(x)\) happens to be in this hypothesis space \({\cal H}\) we’ll never actually get 0.
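To make this concrete, here is a numerical sketch of the risk, with illustrative choices: \(f(x) = x^2\) on \([a, b] = [0, 1]\), and the candidate line \(h(x) = x - 1/6\) (which happens to be the risk-minimizing straight line for this \(f\) under a uniform weighting):

```python
import numpy as np

# Illustrative f and a candidate straight line h
f = lambda x: x**2
h = lambda x: x - 1/6
a, b = 0.0, 1.0

# Grid the x axis finely and average the local loss l(x) = (f(x) - h(x))^2,
# approximating (1/(b-a)) * integral of l(x) dx over [a, b]
xs = np.linspace(a, b, 100_001)
risk = np.mean((f(xs) - h(xs))**2)
print(risk)   # ≈ 1/180 ≈ 0.00556: nonzero, since f is not in H_1
```

Even the best line incurs a nonzero risk here, exactly because \(x^2\) does not live in the space of straight lines.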

### Non-uniform sampling

The above formula assumed that, in the population, it was equally likely to find data at any point in the range of \(x\). This is usually not true; often there is more data at intermediate values of \(x\). Indeed, this is the case in our example:

The above plot shows histograms of the number of cases in the population around a particular \(x\). If we take all these values at different \(x\) and kinda-join them up, we get the probability density function (PDF) \(p(x)\) of a value \(x\) in the data, shown as the blue curve.

The interpretation of probability density is that when you multiply \(p(x)\) by the histogram width \(dx\) you get one of the histogram bars \(dP(x)\), a *sliver* of the probability that the feature \(X\) has value \(x\):

\[dP(x) = p(x)dx.\]

(To be precise you want these histogram bars to be as thin as possible in the infinite population limit.)

This is illustrated in the plot below. In the large population limit, the probability in the sliver can be thought of as the number of data points around a particular \(x\) divided by the total size of the data in the population…
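This histogram-to-density correspondence can be sketched numerically. Here we assume a hypothetical population drawn from a Beta(2, 2) density, \(p(x) = 6x(1-x)\) on \([0, 1]\), which similarly piles up data at intermediate \(x\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed population: 100,000 draws from p(x) = 6x(1-x), a Beta(2,2) density
x_pop = rng.beta(2, 2, size=100_000)

# A histogram bar's count, divided by (population size * bin width dx),
# approximates the density p(x) in that bin: counts ≈ N * p(x) * dx
counts, edges = np.histogram(x_pop, bins=50, range=(0.0, 1.0))
dx = edges[1] - edges[0]
density = counts / (len(x_pop) * dx)
x_mid = 0.5 * (edges[:-1] + edges[1:])

print(density[25], 6 * x_mid[25] * (1 - x_mid[25]))   # close to each other
```

The normalized bar heights track the true density, and summing all the slivers `density * dx` gives 1, as a probability must.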

Now when you add all of these probability slivers over the range from \(a\) to \(b\) you must get 1. You can also consider the area under the density curve up to some value \(x\): this function, \(P(x)\), is called the cumulative distribution function (CDF); sometimes you will see it just called the distribution function. And \(p(x)\) is called the density function.

When multiplied by the total number of data points in the population, \(dP(x)\), the change in the CDF at \(x\), gives us the total number of data points at that \(x\). So it allows us to have different amounts of data at different \(x\).

So the formula for the risk functional can now be re-written:

\[R(h) = \int_a^b \ell(x) dP(x) = \int_a^b (f(x) - h(x))^2 p(x) dx.\]

In statistics, an integral of the product of a function and a probability density is called an expectation (or mean) value. This is because such an integral computes a weighted mean with the weights coming from our probability slivers.

Thus our loss functional can be written as an expectation value over a PDF, denoted using the \(E_{pdf(x)}[function]\) notation:

\[R(h) = E_{p(x)}[(h(x) - f(x))^2] = \int dx p(x) (h(x) - f(x))^2 .\]

(Note that by comparing the previous case of uniform probability at all points \(x\) with this, we can see that in the uniform case, \(p(x) = \frac{1}{b-a}\), equal density and thus slivers over the entire range, as we might have expected.)
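Numerically, the only change from the uniform case is the weight \(p(x)\) inside the sum. A sketch, with illustrative choices \(f(x) = x^2\), \(h(x) = x - 1/6\), and an assumed density \(p(x) = 6x(1-x)\) on \([0, 1]\):

```python
import numpy as np

f = lambda x: x**2
h = lambda x: x - 1/6
p = lambda x: 6 * x * (1 - x)    # assumed Beta(2,2) density on [0, 1]

# Riemann sum for the integral of (f(x) - h(x))^2 p(x) dx over [0, 1]
xs = np.linspace(0.0, 1.0, 200_001)
dx = xs[1] - xs[0]
risk = np.sum((f(xs) - h(xs))**2 * p(xs)) * dx
print(risk)   # smaller than the uniform-weight risk for this h
```

For this particular \(h\), the squared distance to \(f\) is largest near the endpoints of the interval, exactly where this \(p(x)\) puts the least probability, so the weighted risk comes out smaller than the uniform one.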

Now for a fixed \(h(x)\), say perhaps a \(h(x) = g(x)\) found from some as-yet unknown fitting procedure, this quantity \(R(h)\) is uniquely known as long as \(p(x)\) is known exactly.

But unless you have been living in a parallel universe where the earth is free of Covid-19, you will realize that we do not have \(p(x)\) exactly. The histogram and PDF plot above showed us what it looks like on the population, and even the population has a finite number of points. Our training sample has far fewer points.

Thus \(R(h)\) is actually a stochastic quantity dependent on the points in our training sample, and how they are used to construct \(p(x)\).

Things seem to be bleak for learning. But we can make approximations…

## Fitting and Evaluation on the sample

So far we have avoided the question of how to find the best fit \(g(x)\) amongst some hypotheses \(h(x) \in \cal{H}\). Let’s get to it!

So we’ll need to find a way to turn the evaluation process we just described into a “fitting” process within a hypothesis space, one that finds the best hypothesis within it.

Our problem here though, is that as mentioned earlier, we do not know \(p(x)\). We merely have a finite training sample and somehow must use it to make an estimation of this density function.

This process is called Empirical Risk Minimization (ERM), as we must minimize the risk calculated using a \(p(x)\) estimated from an “empirically observed” training sample. As we shall see in the next blog in this series, ERM is just an application of the law of large numbers. Here we merely state the law of large numbers and move on.

What we really want to calculate is:

\[R(h) = E_{p(x)}[(h(x) - f(x))^2] = \int dx p(x) (h(x) - f(x))^2 .\]

The Law of large numbers tells us that this integral can be calculated if we have a large number of draws from the probability distribution \(p(x)\):

\[R(h) = \lim_{n \to \infty} \frac{1}{n} \sum_{x_i \sim p(x)} (h(x_i) - f(x_i))^2\]

The notation \(x_i \sim p(x)\) means: draw \(x_i\) from \(p(x)\).

In `numpy`, this usually entails doing something like `np.random.randn(10)`, which gives us 10 draws from a standard normal distribution:

```
array([ 0.03804868, 0.78438344, -1.60312937, 0.35508712, -1.13228912,
-0.04496937, -0.05488397, -0.18335321, 0.27067754, -0.53343472])
```

Or we might use rejection sampling or MCMC or other methods.
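We can sketch the law of large numbers at work with illustrative choices \(f(x) = x^2\) and candidate line \(h(x) = x - 1/6\), and an assumed density \(p(x) = 6x(1-x)\) (a Beta(2,2) density), which `numpy` can draw from directly:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x**2
h = lambda x: x - 1/6

# x_i ~ p(x): a million draws from the assumed Beta(2,2) density
draws = rng.beta(2, 2, size=1_000_000)

# Monte Carlo estimate of R(h) = E_p[(h(x) - f(x))^2]
mc_risk = np.mean((h(draws) - f(draws))**2)
print(mc_risk)   # converges to the exact integral as n grows
```

Re-running this with larger `size` drives the estimate ever closer to the exact expectation, which is the content of the limit above.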

But we do not have an infinitely large training population to determine \(p(x)\) exactly, so we can’t do this. Plus, obtaining draws from complex distributions is hard.

Indeed, the best we can do is to assume that the *empirical distribution* approximates the real \(p(x)\). Or, in other words, that the actual \(x_i\) in our training sample are draws from \(p(x)\). So we replace the \(x_i \sim p(x)\) in the formula above with \(x_i \in \cal{D}\) and get rid of the limit. (For those familiar with delta functions, this is equivalent to writing \(p(x) = \frac{1}{N}\sum_{x_i \in \cal{D}} \delta(x - x_i)\).) We get:

\[R(h) = \frac{1}{N} \sum_{x_i \in \cal{D}} (h(x_i) - f(x_i))^2\]

where \(N\) is the number of points in \(\cal{D}\).
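A sketch of this empirical risk, with a hypothetical 5-point training sample (the \(x_i\) values here are illustrative) from a noise-free \(f(x) = x^2\), evaluated for the candidate line \(h(x) = x - 1/6\):

```python
import numpy as np

f = lambda x: x**2            # the (secret) generating function
h = lambda x: x - 1/6         # a candidate hypothesis in H_1

# Hypothetical training sample D: noise-free, so y_i = f(x_i)
x_train = np.array([0.1, 0.3, 0.55, 0.7, 0.9])
y_train = f(x_train)

# Empirical risk: average squared distance over the N points in D
R_emp = np.mean((h(x_train) - y_train)**2)
print(R_emp)   # a different sample would give a somewhat different value
```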

This is in general a stochastic quantity, much in the way finite-sized Monte Carlo estimates are stochastic.

Why? For a given \(h(x)\) this depends on the points in the training set. But for our current case, we have only one training set \({\cal D}\), and for this training sample, the above quantity is fixed. We’ll come back to the stochasticity later in the series when we talk about Bayes Risk.

Now it’s time to compare functions within a hypothesis space and get the best fit! Let’s use lines \(h_1(x) = a_0 + a_1 x\) to fit our points \((x_i, y_i=f(x_i))\) in \(\cal{D}\):

\[ R_{\cal{D}}(h_1(x)) = \frac{1}{N} \sum_{(x_i, y_i) \in \cal{D}} (y_i - h_1(x_i))^2. \]

This can be pictured:

The red dots here are the \(y_i\) generated from some unknown \(f(x)\) at \(x=x_i\), while the red line represents one of the \(h_1(x)\) we are trying to use.

What this formula says then is: *the cost or risk is just the total squared distance to the line from the observation points*.

You probably had already intuited this, so this seems like a roundabout way to get here. But the method we have outlined generalizes, and can be used for other hypothesis spaces in regression and classification, so your effort in reading this far is not lost.

We also make explicit in our notation the in-sample data \(\cal{D}\) (we write \(R_{\cal{D}}(h_1(x))\)), because the value of the risk depends upon the points (our training sample) at which we made our observation (if we had made these observations \(y_i\) at a different set of \(x_i\), the value of the risk would be somewhat different).

Now, given these observations, and the hypothesis space \(\cal{H}_1\), we minimize the risk over all possible functions in the hypothesis space to find the **best fit** function \(g_1(x)\):

\[ g_1(x) = \arg\min_{h_1(x) \in \cal{H}_1} R_{\cal{D}}(h_1(x)).\]

Here the notation

\(g_1(x) = \arg\min_{h_1(x) \in {\cal H}_1}\)

means: look at all \(h_1(x)\) and give me the function \(g_1(x) = h_1\) at which the risk \(R_{\cal{D}}(h_1)\) is minimized; i.e. the *minimization is over functions* \(h_1\).

Generalizing to any hypothesis space we can write:

\[ g(x) = \arg\min_{h(x) \in \cal{H}} R_{\cal{D}}(h(x)),\]

where \(\cal{H}\) is a general hypothesis space of functions.

The technical details of this minimization are not currently interesting to us, but let us say that lines are nicely parametrized by their slopes and intercepts, and the fitting is then a matter of some simple linear algebra. The end result is a line drawn with the values of intercept and slope that give minimum risk amongst all lines. As we had sneak-peeked above, we have for our data:
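That simple linear algebra is what `np.polyfit` with `deg=1` does: it solves the least-squares problem over all slopes and intercepts. A sketch on a hypothetical noise-free sample from \(f(x) = x^2\) (illustrative \(x_i\) values):

```python
import numpy as np

# Hypothetical training sample from f(x) = x^2 (no noise)
x_train = np.array([0.1, 0.3, 0.55, 0.7, 0.9])
y_train = x_train**2

# Minimize the empirical squared risk over all lines: returns (slope, intercept)
a1, a0 = np.polyfit(x_train, y_train, deg=1)
g1 = lambda x: a0 + a1 * x

print(a1, a0)                                  # the best-fit line g_1
print(np.mean((g1(x_train) - y_train)**2))     # the minimized risk
```

The minimized risk here is nonzero: the bias from fitting a line to a quadratic does not go away even at the optimum.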

## The Structure of Learning

Let us summarize the concepts we have seen about noise-free learning.

We have a target function \(f(x)\) that we do not know. But we do have a sample of data points from it, \((x_1,y_1=f(x_1)), (x_2,y_2=f(x_2)), ..., (x_n,y_n=f(x_n))\). We call this the **training sample** or **training set** \(\cal{D}\). We are interested in using this sample to estimate a function \(g\) to approximate the function \(f\), and which can be used for prediction on the entire population, or any other sample from it, also called **out-of-sample prediction**.

There are two ways statistics comes into this approximation problem. First, we are trying to reconstruct the original function from a small-ish sample \(\cal{D}\) rather than a large-ish population. Second, we do not know \(f(x)\), and instead estimate, on our sample, a function \(g(x)\) by some fitting procedure from a set of functions \(h \in {\cal H}\), the hypothesis space. This means that there will always be some error incurred because of the combination of finite sampling and this *mis-specification* of the model or hypothesis.

To do this fit, we use an algorithm, called the **learner**, which chooses functions from the hypothesis set \(\cal{H}\) and computes a cost measure or risk functional \(R\) (like the sum of the squared distance over all points in the data set) for each of these functions. It then chooses the function \(g\) which **minimizes** this cost measure amongst all the functions in \(\cal{H}\), and thus gives us a final hypothesis \(g\) which we then use to approximate or estimate \(f\) **everywhere**, not just at the points in our data set. Now we can predict \(y\) outside of our sample.
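The learner loop can be sketched crudely: rather than closed-form linear algebra, just search a finite grid of candidate lines \((a_0, a_1)\) and keep the one with the smallest empirical risk. All sample values and grid ranges below are illustrative:

```python
import numpy as np

# Hypothetical noise-free training sample from f(x) = x^2
x_train = np.array([0.1, 0.3, 0.55, 0.7, 0.9])
y_train = x_train**2

def emp_risk(a0, a1):
    """Empirical risk of the line h(x) = a0 + a1*x on the sample."""
    return np.mean((a0 + a1 * x_train - y_train)**2)

# A crude finite hypothesis set: a grid of candidate intercepts and slopes
grid = np.linspace(-1.0, 1.5, 251)          # spacing 0.01
best_a0, best_a1 = min(
    ((a0, a1) for a0 in grid for a1 in grid),
    key=lambda params: emp_risk(*params),
)
print(best_a0, best_a1)            # near the exact least-squares line
print(emp_risk(best_a0, best_a1))
```

Real learners replace this brute-force search with something far more efficient (closed-form solutions, gradient descent), but the structure is the same: choose hypotheses from \(\cal{H}\), score each with the risk, keep the minimizer.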

## What next?

Something should give you a little pause about our procedure so far. We have just one sample. How do we make sure our fitting procedure is robust to this choice (or really lack of choice) of training set? We have taken the empirical risk, which is really an approximate and stochastic quantity because we do not have full information about \(p(x)\), and estimated it deterministically using only one sample. How do we ensure generalization? We shall come to these in the next installments.