<!-- Machines Learn: ML and AI at Univ.AI -->
<!-- Why Learn With Us? The Big Picture (2020-04-25) -->
<section id="why-should-you-deep-learn" class="level2">
<h2>Why should you deep learn?</h2>
<p>A lot of people still approach Deep Learning and AI as if these are subjects to be ensconced in the cathedral of Computer Science or Statistics education. We’ve seen this at the IITs and other colleges in India, and at computer science departments in the US. We think this is a bad idea: the ideas and techniques from AI are revolutionizing almost every field of endeavor today.</p>
<p>Physics? Check. Music? Check. Art? Check. Mining? Check. Law? Check. Mechanical Engineering? Check. Epidemiology? Check. Students and professionals in any field who are expert at AI can expect to be the most sought-after professionals in their fields in the coming years.</p>
<p>We have taught undergraduates and graduates at Harvard and UCLA. Our experience has shown us two things: first, all kinds of students, even those from the humanities, hunger for this learning; second, it is possible for people with all kinds of backgrounds to pick up AI. It is not that complicated.</p>
<p>At some point in one’s progression, though, background helps. Students who can code well can push the limits faster. Thus, one might think that coding is enough. The same students struggle, though, when it comes to applying techniques in a new domain. Here, students with a math-y background have an advantage: they have the confidence to set up the modeling, and the rigor to analyze whether their setup was any good.</p>
<p>We think that the key to persistent and “transportable” learning, then, is to lead with code, combining a lot of coding practice with the kind of math-y rigor you might expect at a top-class institution.</p>
<p>Well, this is us. Lots of work. Lots of practice. But just the right amount of rigor to empower you to model on your own, and strike out on your own path, either in the AI world or in your own subject.</p>
<p>In a sense, we’ve <strong>created an online institution that we hope reproduces the best qualities of our respective alma maters</strong> (Penn, MIT, Harvard, and UCLA), and improves upon them. This democratizes learning for passionate and driven folks everywhere.</p>
</section>
<section id="our-method" class="level2">
<h2>Our method</h2>
<p>What constitutes our learning platform? What is unique in it?</p>
<ul>
<li>5-6 week long, intense courses (we call them modules)</li>
<li>modules are combined to create programs. For example, our <strong>Accelerated Program</strong>, which might as well be called our Core Program, consists of the following modules, in order: (1) AI 1: basic ML and AI (2) Data Science 1: analytics (3) Data Science 2: modeling (4) AI 2: Convolutional and Recurrent Models.</li>
<li>live teaching, by top professors. Recorded videos for later follow up.</li>
<li>our basic instructional unit consists of a short lecture segment followed by a problem solving session. We do several such units during a lecture. This method of instruction leads to higher engagement in both in-class and online settings, and hence greatly increased learning.</li>
<li>a lab or two interspersed amongst these units gets you programming hands-on</li>
<li>an innovative online platform, with the ability to digest our extensive material in different ways, and interesting approaches, such as comic books, summary slides, etc.</li>
<li>homework every week</li>
<li>a module culminates in a week-long project. You are now ready to take on a larger applied problem.</li>
</ul>
<p><strong>Learning by doing is the secret of all learning</strong></p>
<p>If all of this sounds like a lot of work, it is. We won’t lie to you about that.</p>
<p>Our students “do” a lot. Starting right from the lectures, there are quizzes built in. Then labs. Then homework. Then project. Every step of the way in our pedagogical technique involves doing.</p>
</section>
<section id="mentorship" class="level2">
<h2>Mentorship</h2>
<p>We recognize that this work can be scary. And lonely. You will struggle at times, and that can be demoralizing.</p>
<p>Thus, our commitment to you is to <strong>provide ample and outstanding mentorship at every step.</strong></p>
<p>What does this mean, actually?</p>
<ul>
<li>You work in groups, with an expert mentor assigned to each group</li>
<li>Our labs are done in groups, with each group guided by its mentor</li>
<li>Your homework is done in groups as well, again guided by a mentor.</li>
<li>Your project is also done in groups, with extensive mentorship</li>
<li>Mentor office hours on live video, in addition to the asynchronous guidance provided for homework.</li>
<li>Extensive discussion forums, with answers and discussions with your peers, mentors, and teachers</li>
</ul>
<p><strong>Extraordinary and ample mentorship makes the difference between good and exceptional. </strong></p>
</section>
<section id="evaluation" class="level2">
<h2>Evaluation</h2>
<p>We’re big on evaluation, but a large part of our evaluation goal is to ensure participation. We’ve seen that engagement in AI MOOCs falls off pretty fast! In our system, the exercises and quizzes make sure you are paying attention and participating! Research has shown that active participation facilitates recall.</p>
<p>The second reason for a lot of evaluation is for you to know if you are keeping pace with the subject: do you need to slow down? Do you need more mentorship? All of our programs allow for breaks, and rejoining later. We want you to succeed and want to do all that we can do to make it possible.</p>
<p>The third reason is the usual one! Evaluation done right is a signal to employers and teachers about your quality! We break down the components of your evaluation as well: maybe you are better at free-form projects than at exercises and labs? Perhaps this points to a strong independent streak that employers and research labs might prefer?</p>
<section id="capstone" class="level3">
<h3>Capstone</h3>
<p>Our ‘Master’ and ‘Advanced’ programs feature capstones - somewhat difficult research or applied problems set in the real world, under our professors or with our industry partner. In a sense, this is the ultimate evaluation: a real world application of everything you have learnt. Students learn to prioritize, problem solve, make trade-offs between perfection and effectiveness, weigh time vs delivery considerations - in other words, become professionals.</p>
</section>
</section>
<section id="our-intangibles" class="level2">
<h2>Our intangibles</h2>
<p>At the best universities, there are intangibles that add to the richness of student experience. One of these is exposure to ideas that are not part of your curriculum. A mind needs constant stretching by new ideas and different thoughts to stay rich and creative. We make sure to feature ample “orthogonal” exposure in our programs, with interesting talks by people working on the frontiers of new ideas.</p>
<p>Another intangible element of great institutions is diversity. People from all over the world, teaching, and learning in a community. A heterogeneous pool of ideas and cultures that demands a curious and open mind to thrive. As an online institution, you will be exposed to students from all over India and, soon, from all over the world. This is in addition to your mentors, who are drawn from all around the world.</p>
</section>
<section id="your-career" class="level2">
<h2>Your career</h2>
<p>We teach for the fun of it, the joy of it (and, well, to earn some money).</p>
<p>You’ll likely start with us, wanting to superhero your career. And you will!</p>
<p>But we secretly hope that you’ll want to do this just for the fun of it, as well :-).</p>
<p>But, getting back to your career for a second, we’ve created an outstanding process to ensure exceptional beginnings to our students’ careers. We engage with you and learn your interests and domain early. Then, our career office works with a multitude of potential employers to match candidates with jobs, and arrange interviews.</p>
<p>We’ve got your back.</p>
</section>
<!-- Siddharth Das -->
<!-- Statistical Learning Part 1: Approximation (2020-04-14) -->
<!--p class="byline">(Image by rawpixel.com)</p-->
<p>The fundamental problem in statistical learning is the estimation of a function: perhaps a known function, or a neural network, or anything really, that captures some physical or social process, and then using that function to make a prediction. For example: is this image that of a cat or a dog (classification)? What is your credit score given your income (regression)?</p>
<p>In other words, you have a <em>training sample</em>, or <em>training set</em> of data <span class="math inline">\({\cal D}\)</span> which looks like this:</p>
<p><img src="/assets/whatwehave.png" /></p>
<p>We’ll represent the variable being predicted by the letter <span class="math inline">\(y\)</span>, and the features or co-variates we use as an input in this prediction by the letter <span class="math inline">\(x\)</span>. This <span class="math inline">\(x\)</span> could be multi-dimensional, but for simplicity of notation, we’ll keep it 1-dimensional. Here you have maybe 50-60 pairs of points <span class="math inline">\(\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}\)</span> of data that have been taken.</p>
<p>But you do not know what function <span class="math inline">\(f(x)\)</span>, called a <em>generative process</em>, that this sample came from. For example, if <span class="math inline">\(x\)</span> is the weight of a load on a bridge, <span class="math inline">\(y\)</span> might be the stress on the bridge, and <span class="math inline">\(f(x)\)</span> comes from some complex physics which might require many days to calculate on a supercomputer. If we did calculate it, we could <em>generate</em> the data.</p>
<p>But we are not calculating <span class="math inline">\(f(x)\)</span>, and in many other cases it is just not possible to calculate it. Instead, we take experimental data <span class="math inline">\(\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}\)</span> and try to learn the dependence of <span class="math inline">\(y\)</span> on <span class="math inline">\(x\)</span>.</p>
<p>So, you will likely try some functions to capture this dependence, but you have no guarantee that these include <span class="math inline">\(f(x)\)</span>, since you do not know what <span class="math inline">\(f(x)\)</span> actually is.</p>
<p>It gets worse. It looks like in the data above that there is a preponderance of samples between <span class="math inline">\(x=0.6\)</span> and <span class="math inline">\(x=0.8\)</span>; but you have no notion if this is true in the population at large: you do not know what the population density of samples <span class="math inline">\(p(x)\)</span> is.</p>
<p>Finally, you do not know how much noise there is in the <span class="math inline">\(y\)</span> values you are seeing. You will need to find some way to model this noise. For example, if <span class="math inline">\(y\)</span> here was a credit score and <span class="math inline">\(x\)</span> was income, the noise would come from things like marital status, college debt, etc.</p>
<p>But let us start by assuming that there is no noise at all. In other words, if you knew <span class="math inline">\(f(x)\)</span>, you could simply write:</p>
<p><span class="math display">\[y = f(x).\]</span></p>
<p>Let us then ask: what is the process of learning from data in the absence of noise? This never really happens, but it is a way for us to understand the <em>theory of approximation</em>, and lets us build a base for understanding how to learn from data with noise.</p>
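<p>To make the noise-free setup concrete, here is a small sketch, with a hypothetical <span class="math inline">\(f(x) = x^2\)</span> standing in for the unknown generative process:</p>

```python
import numpy as np

# Hypothetical generating function; in a real problem f is unknown to us.
def f(x):
    return x ** 2

# Draw some x values and generate y = f(x) exactly (no noise term).
rng = np.random.default_rng(0)
x_sample = rng.uniform(0.0, 1.0, size=20)
y_sample = f(x_sample)

# In the noiseless setting every y is exactly f(x).
print(np.allclose(y_sample, x_sample ** 2))  # True
```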
<section id="how-might-have-this-data-arisen" class="level2">
<h2>How might this data have arisen?</h2>
<p>In the frequentist view of the world, this sample is but one sample that could have been drawn from a (possibly infinite) population. In other words, had we been given access to the population, we could have drawn say <span class="math inline">\(M\)</span> different samples <span class="math inline">\({\cal D_m}\)</span> from the population. Our current sample or training set <span class="math inline">\({\cal D}\)</span> is but one of these different samples.</p>
<p>How did the population arise? Suppose God came to you in the middle of the night and said:</p>
<blockquote>
<blockquote>
<p><strong>I generated all this data from</strong>: <span class="math display">\[ y = f(x) \]</span> <strong>and its exact. The <span class="math inline">\(y\)</span>s you see come exactly from the application of the function <span class="math inline">\(f\)</span> to the <span class="math inline">\(x\)</span>s</strong>.</p>
</blockquote>
</blockquote>
<p>The image on the left below is this God-given function <span class="math inline">\(f\)</span>.</p>
<figure class="fullwidth">
<img src="/assets/fpopsample.png" />
</figure>
<p>Well, now suppose <strong>the lord also hit you on the head with a stick and made you forget what <span class="math inline">\(f(x)\)</span> was</strong>. So you now have no idea what the <em>generating function</em> for the data is.</p>
<p>But God used it to generate the <em>population</em> represented by the red circles in the plot on the right.</p>
<p>But God being angry, you are also not given access to the red circles. Indeed you are only given ONE <em>training sample</em> from the population, shown by the blue plus signs on the right.</p>
<section id="the-task-of-prediction" class="level3">
<h3>The task of prediction</h3>
<p>So at the end of it all, your situation is illustrated below:</p>
<p><img src="/assets/whatwehave.png" /></p>
<p>which is the plot we started with.</p>
<p>Now, our task is to use this sample and come up with a function <span class="math inline">\(g(x)\)</span> that in some sense does the best possible job in approximating <span class="math inline">\(f\)</span>. In other words, someone can take any new <span class="math inline">\(x\)</span> value and your <em>predictive model</em> <strong>trained</strong> on your sample and predict a new value <span class="math inline">\(y = g(x)\)</span>. That is, we wish to <em>estimate</em> <span class="math inline">\(f(x)\)</span> using <span class="math inline">\(g(x)\)</span>.</p>
<p>Here is an illustration of using a particular <span class="math inline">\(g\)</span> to estimate <span class="math inline">\(f(x)\)</span>. The blue regions roughly show the error incurred in doing so.</p>
<p><img src="/assets/bias.png" /></p>
</section>
</section>
<section id="the-process-of-estimation" class="level2">
<h2>The process of estimation</h2>
<p>Such an <em>estimator</em> function, one that is estimated on the sample data, is called a <strong>hypothesis</strong>. The process of finding the estimator function is called <em>fitting</em> the data with the estimator or hypothesis function.</p>
<p>Your first idea might be to try every possible function in the universe to find the one that approximates <span class="math inline">\(f\)</span> well. We’ll defer for a bit the question of how to evaluate a candidate hypothesis (function). But it should be clear to you that you can’t try out every function in the universe: you will be dead, the sun will have gone supernova, and other exciting situations will happen in the meanwhile.</p>
<section id="hypothesis-spaces-and-best-fits" class="level3">
<h3>Hypothesis Spaces and Best Fits</h3>
<p>So you must limit your choice of candidate hypotheses to those from some set, such that you can process members of this set quickly. Such a set is called a <em>hypothesis set</em> or <em>hypothesis space</em>.</p>
<p>Let us, for now, consider as the set of functions we used to fit the data, the set of all possible straight lines. Thus we are saying that our hypothesis function <span class="math inline">\(h\)</span> is some straight line. We’ll put the subscript <span class="math inline">\(1\)</span> on the <span class="math inline">\(h\)</span> to indicate that we are fitting the data with a polynomial of order 1, or a straight line<span><label for="sn-0" class="margin-toggle">⊕</label><input type="checkbox" id="sn-0" class="margin-toggle"/><span class="marginnote"> A polynomial is a function that is a sum of coefficients times increasing powers of <span class="math inline">\(x\)</span>. For example, a quadratic looks like <span class="math inline">\(h_2(x) = a_0 + a_1 x + a_2 x^2\)</span>. And so on and so forth for a cubic and a quartic…<br />
<br />
</span></span>. This looks like:</p>
<p><span class="math display">\[ h_1(x) = a_0 + a_1 x \]</span></p>
<p>The set of all functions of a particular kind that we could have used to fit the data is called a <strong>Hypothesis Space</strong>. The words “particular kind” are deliberately vague: it’s our choice as to what we might want to put into a hypothesis space. A hypothesis space is denoted by the notation <span class="math inline">\(\cal{H}\)</span>.</p>
<p>So here we are considering the hypothesis space of all straight lines <span class="math inline">\(h_1(x)\)</span>. We’ll denote it as <span class="math inline">\(\cal{H}_1\)</span>, with the subscript again being used to mark the order of the polynomials allowed in the hypothesis space. Often, you will see people write the set of all straight lines as <span class="math inline">\(\{h_1(x) \in {\cal H_1}\}\)</span>.</p>
<p>We’ll call the <strong>best fit</strong> straight line the function <span class="math inline">\(g_1(x)\)</span>. The “best fit” idea is this: amongst the set of all lines (i.e., all possible choices of <span class="math inline">\(h_1(x)\)</span>), what is the line <span class="math inline">\(g_1(x)\)</span> that best estimates our unknown function <span class="math inline">\(f\)</span> on the sample data we have?</p>
<p><img src="/assets/whatwehaveandfit.png" /></p>
<p>We have not (yet) figured out how to find this best fit: let’s just assume for now that there is a way to do it.</p>
<p>Let’s rephrase: this is the best fit <span class="math inline">\(g_1\)</span> to the data <span class="math inline">\(\cal{D}\)</span> from the functions in the hypothesis space <span class="math inline">\(\cal{H}_1\)</span>. Remember that this is not the best fit from all possible functions, but rather, the best fit from the set of all the straight lines.</p>
<p>Another such hypothesis space might be <span class="math inline">\(\cal{H}_2\)</span>, the hypothesis space of all quadratic functions: <span class="math inline">\(\{h_2(x) \in {\cal H_2}\}\)</span>. A third such space might combine both of these together. We get to choose what we want to put into our hypothesis space.</p>
<p>Similarly, the best fit quadratic function could be denoted <span class="math inline">\(g_2(x)\)</span>.</p>
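<p>As an illustration (deferring, as above, how the best fit is actually found), <code>numpy</code>’s least-squares polynomial fit can play the role of the fitting procedure. The data here are hypothetical noise-free points from <span class="math inline">\(f(x) = x^2\)</span>:</p>

```python
import numpy as np

# Hypothetical noise-free sample from f(x) = x**2.
x = np.array([0.1, 0.4, 0.6, 0.9])
y = x ** 2

# Best fit in H_1 (lines) and in H_2 (quadratics), via least squares.
g1 = np.poly1d(np.polyfit(x, y, deg=1))  # g_1(x) = a_0 + a_1 x
g2 = np.poly1d(np.polyfit(x, y, deg=2))  # g_2(x) = a_0 + a_1 x + a_2 x^2

# H_2 contains f itself, so g2 matches the data exactly; a line cannot.
print(np.allclose(g2(x), y))  # True
print(np.allclose(g1(x), y))  # False
```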
</section>
<section id="the-choice-of-hypothesis-how-statistics-enters-approximation" class="level3">
<h3>The choice of hypothesis: how statistics enters approximation</h3>
<p>To think about how the choice of hypothesis space affects our analysis, and how this choice interacts with sampling, consider the simple example of 4 data points generated from the curve:</p>
<p><span class="math display">\[f(x) = x^2.\]</span></p>
<p><img src="/assets/fquadglin.png" /></p>
<p>In the upper left plot, we show the data generating curve <span class="math inline">\(f(x)\)</span> as well as the best fit quadratic (i.e., in <span class="math inline">\({\cal H}_2\)</span> ) hypothesis <span class="math inline">\(g_2(x)\)</span>, where the fit is done on all 4 points in the population. Then, on the upper right plot we once again use all 4 points in the population to make a fit, but this time we find the best fit line in <span class="math inline">\({\cal H}_1\)</span>, i.e. <span class="math inline">\(g_1(x)\)</span>. As you might have expected we incur some error in doing this…</p>
<p>Let us turn our attention to the lower panels. Here we choose 2 different samples of size 3 from the population. <code>samp1</code> has the points with the lowest 3 <span class="math inline">\(x\)</span> values, and <code>samp2</code> has the points with the highest 3. In the panel on the left, fitting 2 different quadratics to the samples just results in the original quadratic: since there is no noise in the dataset, the points are exactly generated from <span class="math inline">\(x^2\)</span>, and 3 points are enough to uniquely determine a quadratic (you get 3 simultaneous equations to solve in the coefficients).</p>
<p>But in the lower right panel, choosing different points in the sample has a real impact on what the best fit line on each sample is. Remember that in real life we will only see one of these samples, and so the best line we might come up with will in general not line up with either the best fit line on the population, or the actual generating curve (neither of which we would know in real life).</p>
<p>Indeed our fits look like there is noise in the data. And in a real model with actual noise, it will <strong>not be possible to untangle</strong> this <em>mis-specification</em> noise or <em>bias</em>, which comes from using hypotheses in <span class="math inline">\(\cal{H}_1\)</span> to estimate a more complex <span class="math inline">\(f(x)\)</span>, from the actual noise in the data.</p>
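<p>The lower panels can be reproduced in a few lines. The population points below are assumptions for illustration; the behaviour (quadratic fits agree across samples, line fits do not) is the point:</p>

```python
import numpy as np

# A hypothetical 4-point population generated exactly from f(x) = x**2.
x_pop = np.array([0.2, 0.4, 0.6, 0.8])

samp1 = x_pop[:3]   # the 3 lowest x values
samp2 = x_pop[1:]   # the 3 highest x values

# Three exact points pin down a quadratic uniquely: both samples
# recover the generating curve, coefficients (1, 0, 0).
q1 = np.polyfit(samp1, samp1 ** 2, deg=2)
q2 = np.polyfit(samp2, samp2 ** 2, deg=2)
print(np.allclose(q1, q2))  # True

# The best fit line, by contrast, depends on which sample we happened to see.
l1 = np.polyfit(samp1, samp1 ** 2, deg=1)
l2 = np.polyfit(samp2, samp2 ** 2, deg=1)
print(np.allclose(l1, l2))  # False
```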
</section>
</section>
<section id="evaluating-models" class="level2">
<h2>Evaluating Models</h2>
<p>Before we come up with a criterion to choose a best fit hypothesis (a <span class="math inline">\(g\)</span>), we must come up with an evaluation metric which can be used to compare different hypotheses (different <span class="math inline">\(h\)</span>). We’ll then turn around and use this evaluation metric to find the best fit model.</p>
<p>But it needs to be stressed that fitting and evaluation are two separate processes. One could use different metrics for these two purposes. Often you will find that one metric is used to evaluate choices within a hypothesis space, but another one is used to pick the best hypothesis space by comparing the best fit functions that were chosen from each hypothesis space. In either case though, a comparison of candidate functions <span class="math inline">\(h\)</span> needs to be done.</p>
<p>To develop our statistical learning formalism, let us in this section assume <strong>we have access to the entire population</strong>, and that there is data continuously at all the <span class="math inline">\(x\)</span> in the range between some values <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span>.</p>
<p>If somehow you had guessed as <span class="math inline">\(g(x)\)</span> the exact function <span class="math inline">\(f(x)\)</span> that generated the data, then the subtraction <span class="math inline">\(f(x) - g(x) = f(x) - f(x)\)</span> would be exactly 0 at every point in the allowed range of <span class="math inline">\(x\)</span>. This means that you can define some notion of distance by simply subtracting the estimator function from the actual function, and then use this distance as a metric.</p>
<p>Clearly any other <span class="math inline">\(h\)</span> should do worse, so we will want to use a positive estimate of the distance as our measure. We could take either the absolute value of the distance, <span class="math inline">\(\vert f(x) - h(x) \vert\)</span>, or the square of the distance as our measure, <span class="math inline">\((f(x) - h(x))^2\)</span>. Both are reasonable choices, and we shall use the squared distance for now.</p>
<p>We’ll call this distance the <em>local loss</em> function: <span class="math display">\[ \ell (x) = (f(x) - h(x))^2.\]</span></p>
<p>This gives us the metric at every value of <span class="math inline">\(x\)</span>.</p>
<p>Now we want the evaluation to include information from all possible values of <span class="math inline">\(x\)</span>, so we must add over all possible values of <span class="math inline">\(x\)</span> in the world, and divide by the number of them.</p>
<p><img src="/assets/bias2.png" /></p>
<p>You probably know how to do this. You grid the x-axis in bins of size <span class="math inline">\(dx\)</span>, compute the value of <span class="math inline">\(\ell(x)\)</span> at the center of each bin, add all these values, and then divide by the total number of values (bins): <span class="math inline">\(\frac{b-a}{dx}\)</span>:</p>
<p><span class="math display">\[\frac{\sum_x \ell(x)}{\frac{b-a}{dx}}\]</span></p>
<p>where <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> are the endpoints of the range of <span class="math inline">\(x\)</span>. (The symbol <span class="math inline">\(\sum_x\)</span> means to take the sum over the values of <span class="math inline">\(x\)</span> in the range).</p>
<p>As <span class="math inline">\(dx \rightarrow 0\)</span>, the usual limit we want to take, we get the integral:</p>
<p><span class="math display">\[R(h(x)) = \frac{1}{b-a}\,\int_a^b \ell(x) dx = \frac{1}{b-a}\,\int_a^b (f(x) - h(x))^2 dx.\]</span></p>
<p>This is known as the <strong>error functional</strong> or <strong>risk functional</strong> or <strong>loss functional</strong>(also just called <strong>error</strong>, <strong>cost</strong>, or <strong>loss</strong> or <strong>risk</strong>) of using function <span class="math inline">\(h(x)\)</span> to estimate <span class="math inline">\(f(x)\)</span>. Here we use the word <strong>functional</strong> to denote that the risk is a <em>function of the function</em> <span class="math inline">\(h(x)\)</span>.</p>
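<p>The Riemann-sum picture above is easy to check numerically. With illustrative choices <span class="math inline">\(f(x) = x^2\)</span> and <span class="math inline">\(h(x) = x\)</span> on <span class="math inline">\([0, 1]\)</span>, the integral has the closed form <span class="math inline">\(\int_0^1 (x^2 - x)^2 dx = 1/30\)</span>:</p>

```python
import numpy as np

# Risk of using h(x) = x to estimate f(x) = x**2 on [a, b] = [0, 1].
a, b = 0.0, 1.0
dx = 1e-4
x = np.arange(a + dx / 2, b, dx)        # bin centers

local_loss = (x ** 2 - x) ** 2          # l(x) = (f(x) - h(x))**2
risk = local_loss.sum() * dx / (b - a)  # sum over bins / number of bins (b-a)/dx

print(abs(risk - 1 / 30) < 1e-6)        # True: matches the exact integral 1/30
```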
<p>We can now limit ourselves to a specific hypothesis space <span class="math inline">\({\cal H}\)</span>, say the space of all straight lines <span class="math inline">\(\{h_1(x) \in {\cal H}\}\)</span>.</p>
<p>Then, the <span class="math inline">\(g(x)\)</span> we finally land up fitting for is the function that gets this distance closest to 0 (which is what we would have for <span class="math inline">\(g=f\)</span>). In other words, it is the function that <em>minimizes</em> this distance.</p>
<p><span class="math display">\[ g(x) = \arg\min_{h(x) \in \cal{H}} R(h(x)).\]</span></p>
<p>Unless <span class="math inline">\(f(x)\)</span> happens to be in this hypothesis space <span class="math inline">\({\cal H}\)</span> we’ll never actually get 0.</p>
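<p>A brute-force sketch of this <span class="math inline">\(\arg\min\)</span> over <span class="math inline">\({\cal H}_1\)</span>: grid-search the line coefficients <span class="math inline">\((a_0, a_1)\)</span> and keep the pair with the smallest risk. The grid and the choice <span class="math inline">\(f(x) = x^2\)</span> are assumptions for illustration; for this <span class="math inline">\(f\)</span> on <span class="math inline">\([0, 1]\)</span> the minimizer works out to <span class="math inline">\(g_1(x) = x - 1/6\)</span>:</p>

```python
import numpy as np
from itertools import product

x = np.linspace(0.0, 1.0, 2001)  # dense grid on [0, 1], uniform density

def risk(a0, a1):
    # Approximate (1/(b-a)) * integral of (f(x) - h(x))**2 dx by an average.
    return np.mean((x ** 2 - (a0 + a1 * x)) ** 2)

# Search a coarse grid of candidate lines h_1(x) = a0 + a1*x.
grid = np.arange(-1.0, 1.0001, 1 / 6)
a0_best, a1_best = min(product(grid, grid), key=lambda c: risk(*c))

print(round(float(a0_best), 4), round(float(a1_best), 4))  # -0.1667 1.0
```

<p>In practice one would solve the least-squares problem directly rather than grid-search, but the <span class="math inline">\(\arg\min\)</span> structure is the same.</p>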
<section id="non-uniform-sampling" class="level3">
<h3>Non-uniform sampling</h3>
<p>The above formula assumed that, in the population, it was equally likely to find data at any point in the range of <span class="math inline">\(x\)</span>. This is usually not true; often there is more data at intermediate values of <span class="math inline">\(x\)</span>. Indeed, this is the case in our example:</p>
<p><img src="/assets/pofx.png" /></p>
<p>The above plot shows the histograms of the number of cases in the population around a particular x. If we take all these values at different <span class="math inline">\(x\)</span> and kinda-join them up, we get the probability density function <span class="math inline">\(p(x)\)</span> (PDF) of a value <span class="math inline">\(x\)</span> in the data as a blue curve.</p>
<p>The interpretation of probability density is that when you multiply <span class="math inline">\(p(x)\)</span> by the histogram width <span class="math inline">\(dx\)</span> you get one of the histogram bars <span class="math inline">\(dP(x)\)</span>, a <em>sliver</em> of the probability that the feature <span class="math inline">\(X\)</span> has value <span class="math inline">\(x\)</span>:</p>
<p><span class="math display">\[dP(x) = p(x)dx.\]</span></p>
<p>(To be precise you want these histogram bars to be as thin as possible in the infinite population limit.)</p>
<p>This is illustrated in the plot below. In the large population limit, the probability in the sliver can be thought of as the number of data points around a particular <span class="math inline">\(x\)</span> divided by the total size of the data in the population…</p>
<p><img src="/assets/sliver.png" /></p>
<p>Now when you add all of these probability slivers over the range from <span class="math inline">\(a\)</span> to <span class="math inline">\(b\)</span> you must get 1. You can also consider the area under the density curve up to some value <span class="math inline">\(x\)</span>: this function, <span class="math inline">\(P(x)\)</span>, is called the cumulative distribution function (CDF); sometimes you will see it just called the distribution function. And <span class="math inline">\(p(x)\)</span> is called the density function.</p>
<p>When multiplied by the total number of data points in the population, <span class="math inline">\(dP(x)\)</span>, or the change in the CDF at <span class="math inline">\(x\)</span>, thus gives us the total number of data points at that <span class="math inline">\(x\)</span>. So it allows us to have different amounts of data at different <span class="math inline">\(x\)</span>.</p>
<p>So the formula for the risk functional can now be re-written:</p>
<p><span class="math display">\[R(h) = \int_a^b \ell(x) dP(x) = \int_a^b (f(x) - h(x))^2 p(x) dx.\]</span></p>
<p>In statistics, an integral of the product of a function and a probability density is called an expectation (or mean) value. This is because such an integral computes a weighted mean with the weights coming from our probability slivers.</p>
<p>Thus our loss functional can be written as an expectation value over a PDF, denoted using the <span class="math inline">\(E_{pdf(x)}[function]\)</span> notation:</p>
<p><span class="math display">\[R(h) = E_{p(x)}[(h(x) - f(x))^2] = \int dx p(x) (h(x) - f(x))^2 .\]</span></p>
<p>(Note that by comparing the previous case of uniform probability at all points <span class="math inline">\(x\)</span> with this, we can see that in the uniform case, <span class="math inline">\(p(x) = \frac{1}{b-a}\)</span>, equal density and thus slivers over the entire range, as we might have expected.)</p>
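<p>A numerical sketch of the weighted risk, using the same illustrative <span class="math inline">\(f(x) = x^2\)</span> and <span class="math inline">\(h(x) = x\)</span> but a hypothetical non-uniform density <span class="math inline">\(p(x) = 3x^2\)</span> on <span class="math inline">\([0, 1]\)</span>:</p>

```python
import numpy as np

dx = 1e-4
x = np.arange(dx / 2, 1.0, dx)          # bin centers on [0, 1]
p = 3 * x ** 2                          # hypothetical density p(x) = 3x^2

# The probability slivers p(x) dx must sum to 1 over the range.
print(abs(p.sum() * dx - 1.0) < 1e-6)   # True

# R(h) = integral of (f - h)^2 p(x) dx; the exact value here is 1/35.
risk = ((x ** 2 - x) ** 2 * p).sum() * dx
print(abs(risk - 1 / 35) < 1e-6)        # True
```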
<p>Now for a fixed <span class="math inline">\(h(x)\)</span>, say perhaps a <span class="math inline">\(h(x) = g(x)\)</span> found from some as-yet unknown fitting procedure, this quantity <span class="math inline">\(R(h)\)</span> is uniquely known as long as <span class="math inline">\(p(x)\)</span> is known exactly.</p>
<p>But unless you have been living in a parallel universe where the earth is free of Covid-19, you will realize that we do not have <span class="math inline">\(p(x)\)</span> exactly. The histogram and pdf plot above showed us what it looks like on the population, and even the population has a finite number of points. Our training sample has far fewer points.</p>
<p>Thus <span class="math inline">\(R(h)\)</span> is actually a stochastic quantity dependent on the points in our training sample, and how they are used to construct <span class="math inline">\(p(x)\)</span>.</p>
<p>Things seem to be bleak for learning. But we can make approximations…</p>
</section>
</section>
<section id="fitting-and-evaluation-on-the-sample" class="level2">
<h2>Fitting and Evaluation on the sample</h2>
<p>So far we have avoided the question of how to find the best fit <span class="math inline">\(g(x)\)</span> amongst some hypotheses <span class="math inline">\(h(x) \in {\cal H}\)</span>. Let’s get to it!</p>
<p>We’ll need to turn the evaluation process we just described into a procedure for “fitting”: a way to find the best hypothesis within a hypothesis space.</p>
<p>Our problem here though, is that as mentioned earlier, we do not know <span class="math inline">\(p(x)\)</span>. We merely have a finite training sample and somehow must use it to make an estimation of this density function.</p>
<p>This process is called Empirical Risk Minimization (ERM), as we must minimize the risk calculated using a <span class="math inline">\(p(x)\)</span> estimated from an “empirically observed” training sample. As we shall see in the next blog in this series, ERM is an application of the law of large numbers. Here we merely state the law of large numbers and move on.</p>
<p>What we really want to calculate is:</p>
<p><span class="math display">\[R(h) = E_{p(x)}[(h(x) - f(x))^2] = \int dx p(x) (h(x) - f(x))^2 .\]</span></p>
<p>The Law of large numbers tells us that this integral can be calculated if we have a large number of draws from the probability distribution <span class="math inline">\(p(x)\)</span>:</p>
<p><span class="math display">\[R(h) = \lim_{n \to \infty} \frac{1}{n} \sum_{x_i \sim p(x)} (h(x_i) - f(x_i))^2\]</span></p>
<p>The notation <span class="math inline">\(x_i \sim p(x)\)</span> denotes a draw <span class="math inline">\(x_i\)</span> from <span class="math inline">\(p(x)\)</span>.</p>
<p>In <code>numpy</code> this usually entails doing something like <code>np.random.randn(10)</code>, which gives us 10 draws from a standard normal distribution:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1"></a>np.random.randn(<span class="dv">10</span>)</span></code></pre></div>
<pre><code>array([ 0.03804868, 0.78438344, -1.60312937, 0.35508712, -1.13228912,
-0.04496937, -0.05488397, -0.18335321, 0.27067754, -0.53343472])</code></pre>
<p>Or we might use <a href="https://en.wikipedia.org/wiki/Rejection_sampling">rejection sampling</a> or <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">MCMC</a> or other methods.</p>
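<p>To see the law of large numbers at work numerically, here is a minimal sketch. The target <span class="math inline">\(f\)</span>, the hypothesis <span class="math inline">\(h\)</span>, and the choice of a standard normal for <span class="math inline">\(p(x)\)</span> are all made up for illustration; they are not the functions used elsewhere in this post:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins (not from this post): a target f, a hypothesis h,
# and draws x_i ~ p(x) with p taken to be the standard normal.
f = lambda x: np.sin(x)
h = lambda x: 0.8 * x

for n in (100, 10_000, 1_000_000):
    x = rng.standard_normal(n)            # n draws x_i ~ p(x)
    R_hat = np.mean((h(x) - f(x)) ** 2)   # (1/n) sum (h(x_i) - f(x_i))^2
    print(n, R_hat)
```

<p>As <code>n</code> grows, the average settles down towards the true value of the risk integral.</p>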
<p>But we do not have an infinitely large training population with which to determine <span class="math inline">\(p(x)\)</span> exactly, so we can’t do this. Moreover, obtaining draws from complex distributions is hard.</p>
<p>Indeed, the best we can do is to assume that the <em>empirical distribution</em> approximates the real <span class="math inline">\(p(x)\)</span>. Or, in other words, the actual <span class="math inline">\(x_i\)</span> in our training sample are draws from <span class="math inline">\(p(x)\)</span>. So we replace the <span class="math inline">\(x_i \sim p(x)\)</span> in the formula above with <span class="math inline">\(x_i \in \cal{D}\)</span> and get rid of the limit<span><label for="sn-1" class="margin-toggle">⊕</label><input type="checkbox" id="sn-1" class="margin-toggle"/><span class="marginnote"> For those familiar with delta functions, this is equivalent to writing <span class="math inline">\(p(x) = \frac{1}{N}\sum_{x_i \in \cal{D}} \delta(x - x_i)\)</span><br />
<br />
</span></span>:</p>
<p><span class="math display">\[R(h) = \frac{1}{N} \sum_{x_i \in \cal{D}} (h(x_i) - f(x_i))^2\]</span></p>
<p>where <span class="math inline">\(N\)</span> is the number of points in <span class="math inline">\(\cal{D}\)</span>.</p>
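<p>The empirical risk is a one-liner in code. Here is a sketch, where the target <span class="math inline">\(f\)</span> and the training sample are our own made-up choices, purely for illustration:</p>

```python
import numpy as np

def empirical_risk(h, x, y):
    """R_D(h): mean squared error of h over the training sample (x_i, y_i)."""
    return np.mean((h(x) - y) ** 2)

# Hypothetical training sample: y_i = f(x_i) for a made-up f.
f = lambda x: np.sin(2 * np.pi * x)
x_train = np.linspace(0, 1, 20)
y_train = f(x_train)

print(empirical_risk(f, x_train, y_train))                  # 0.0: f fits itself perfectly
print(empirical_risk(lambda x: 0.0 * x, x_train, y_train))  # the zero function does worse
```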
<p>This is in general a stochastic quantity.<span><label for="sn-2" class="margin-toggle">⊕</label><input type="checkbox" id="sn-2" class="margin-toggle"/><span class="marginnote"> much in the way finite sized Monte Carlo estimates are stochastic<br />
<br />
</span></span> Why? For a given <span class="math inline">\(h(x)\)</span> this depends on the points in the training set. But for our current case, we have only one training set <span class="math inline">\({\cal D}\)</span>, and for this training sample, the above quantity is fixed. We’ll come back to the stochasticity later in the series when we talk about Bayes Risk.</p>
<p>Now it’s time to compare functions within a hypothesis space and get the best fit! Let’s use lines <span class="math inline">\(h_1(x) = a_0 + a_1 x\)</span> to fit our points <span class="math inline">\((x_i, y_i=f(x_i)) \in \cal{D}\)</span>.</p>
<p><span class="math display">\[ R_{\cal{D}}(h_1(x)) = \frac{1}{N} \sum_{y_i \in \cal{D}} (y_i - h_1(x_i))^2. \]</span></p>
<p>This can be pictured:</p>
<p><img src="/assets/linreg.png" /></p>
<p>The red dots here are the <span class="math inline">\(y_i\)</span> generated from some unknown <span class="math inline">\(f(x)\)</span> at <span class="math inline">\(x=x_i\)</span>, while the red line represents one of the <span class="math inline">\(h_1(x)\)</span> we are trying to use.</p>
<p>What this formula says then is: <em>the cost or risk is just the total squared distance to the line from the observation points</em>.</p>
<p>You had probably already intuited this, so this may seem like a roundabout way to get here. But the method we have outlined generalizes and can be used for other hypothesis spaces in regression and classification, so your effort in reading this far is not lost.</p>
<p>We also make explicit in our notation the in-sample data <span class="math inline">\(\cal{D}\)</span> (we write <span class="math inline">\(R_{\cal{D}}(h_1(x))\)</span>), because the value of the risk depends upon the points (our training sample) at which we made our observation (if we had made these observations <span class="math inline">\(y_i\)</span> at a different set of <span class="math inline">\(x_i\)</span>, the value of the risk would be somewhat different).</p>
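<p>To make this dependence on the sample concrete, here is a sketch (again with a made-up <span class="math inline">\(f\)</span> and line, not the ones in our plots): the same <span class="math inline">\(h_1\)</span> scored on two different samples drawn from the same population gives two different values of the risk:</p>

```python
import numpy as np

f = lambda x: np.sin(2 * np.pi * x)   # made-up stand-in for the unknown f
h = lambda x: 1.0 - 2.0 * x           # one fixed line h_1(x)

# The same h_1 evaluated on two different training samples of 15 points each:
risks = []
for seed in (0, 1):
    x = np.random.default_rng(seed).uniform(0, 1, 15)
    risks.append(np.mean((h(x) - f(x)) ** 2))

print(risks)  # two different values of R_D(h_1)
```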
<p>Now, given these observations, and the hypothesis space <span class="math inline">\(\cal{H}_1\)</span>, we minimize the risk over all possible functions in the hypothesis space to find the <strong>best fit</strong> function <span class="math inline">\(g_1(x)\)</span>:</p>
<p><span class="math display">\[ g_1(x) = \arg\min_{h_1(x) \in \cal{H}} R_{\cal{D}}(h_1(x)).\]</span></p>
<p>Here the notation</p>
<p><span class="math inline">\(g_1(x) = \arg\min_{h_1(x) \in \cal{H}} R_{\cal{D}}(h_1(x))\)</span></p>
<p>means: look at all <span class="math inline">\(h_1(x)\)</span> and give me the function <span class="math inline">\(g_1(x) = h_1\)</span> at which the risk <span class="math inline">\(R_{\cal{D}}(h_1)\)</span> is minimized; i.e. the <em>minimization is over functions</em> <span class="math inline">\(h_1\)</span>.</p>
<p>Generalizing to any hypothesis space we can write:</p>
<p><span class="math display">\[ g(x) = \arg\min_{h(x) \in \cal{H}} R_{\cal{D}}(h(x)),\]</span></p>
<p>where <span class="math inline">\(\cal{H}\)</span> is a general hypothesis space of functions.</p>
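<p>For lines, this minimization is ordinary least squares, and <code>numpy</code> will do the linear algebra for us. A sketch, with a made-up target standing in for the unknown <span class="math inline">\(f\)</span>:</p>

```python
import numpy as np

f = lambda x: np.sin(2 * np.pi * x)   # made-up stand-in for the unknown f
x_train = np.linspace(0, 1, 30)
y_train = f(x_train)

# np.polyfit solves the least-squares problem: it returns the slope a1 and
# intercept a0 that minimize R_D over all lines h_1(x) = a0 + a1 * x.
a1, a0 = np.polyfit(x_train, y_train, deg=1)
g1 = lambda x: a0 + a1 * x

print(a0, a1, np.mean((g1(x_train) - y_train) ** 2))
```

<p>No other line can do better on this sample: that is exactly what the arg-min says.</p>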
<p>The technical details of this minimization are not currently interesting to us, but let us say that lines are nicely parametrized by their slopes and intercepts, and the fitting is then a matter of some simple linear algebra. The end result is a line drawn with the values of intercept and slope that gives minimum risk amongst all lines. As we had sneak-peeked above, we have for our data:</p>
<p><img src="/assets/whatwehaveandfit.png" /></p>
</section>
<section id="the-structure-of-learning" class="level2">
<h2>The Structure of Learning</h2>
<p>Let us summarize the concepts we have seen about noise-free learning.</p>
<p>We have a target function <span class="math inline">\(f(x)\)</span> that we do not know. But we do have a sample of data points from it, <span class="math inline">\((x_1,y_1=f(x_1)), (x_2,y_2=f(x_2)), ..., (x_n,y_n=f(x_n))\)</span>. We call this the <strong>training sample</strong> or <strong>training set</strong> <span class="math inline">\(\cal{D}\)</span>. We are interested in using this sample to estimate a function <span class="math inline">\(g\)</span> to approximate the function <span class="math inline">\(f\)</span>, and which can be used for prediction on the entire population, or any other sample from it, also called <strong>out-of-sample prediction</strong>.</p>
<p>There are two ways statistics enters this approximation problem. Firstly, we are trying to reconstruct the original function from a small-ish sample <span class="math inline">\(\cal{D}\)</span> rather than a large-ish population. Secondly, we do not know <span class="math inline">\(f(x)\)</span>; instead we estimate, on our sample, a function <span class="math inline">\(g(x)\)</span> by some fitting procedure from a set of functions <span class="math inline">\(h \in {\cal H}\)</span>, the hypothesis space. This means that there will always be some error incurred because of the combination of finite sampling and this <em>mis-specification</em> of the model or hypothesis.</p>
<p>To do this fit, we use an algorithm, called the <strong>learner</strong>, which chooses functions from the hypothesis set <span class="math inline">\(\cal{H}\)</span> and computes a cost measure or risk functional <span class="math inline">\(R\)</span> (like the sum of the squared distance over all points in the data set) for each of these functions. It then chooses the function <span class="math inline">\(g\)</span> which <strong>minimizes</strong> this cost measure amongst all the functions in <span class="math inline">\(\cal{H}\)</span>, and thus gives us a final hypothesis <span class="math inline">\(g\)</span> which we then use to approximate or estimate <span class="math inline">\(f\)</span> <strong>everywhere</strong>, not just at the points in our data set. Now we can predict <span class="math inline">\(y\)</span> outside of our sample.</p>
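<p>The whole learner loop can be sketched in a few lines for a finite hypothesis set (the data and hypotheses here are hypothetical, chosen only to illustrate the structure):</p>

```python
import numpy as np

def learner(hypotheses, x, y):
    """Pick the g minimizing R_D(h) over a finite hypothesis set (a sketch)."""
    risks = [np.mean((h(x) - y) ** 2) for h in hypotheses]
    return hypotheses[int(np.argmin(risks))]

# Hypothetical data from f(x) = 2x + 1, and a tiny hypothesis set of lines.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + 1

H = [lambda x, a=a: a * x + 1 for a in (0.0, 1.0, 2.0, 3.0)]
g = learner(H, x_train, y_train)

print(g(0.5))  # 2.0: the learner picked the a = 2 line
```

<p>Real learners minimize over continuously parametrized families rather than four candidate lines, but the structure is the same: compute the risk, take the arg-min, and use <code>g</code> everywhere.</p>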
<p><img src="/assets/BasicModel.png" /></p>
</section>
<section id="what-next" class="level2">
<h2>What next?</h2>
<p>Something should give you a little pause about our procedure so far. We have just one sample. How do we make sure our fitting procedure is robust to this choice (or really lack of choice) of training set? We have taken the empirical risk, which is really an approximate and stochastic quantity because we do not have full information about <span class="math inline">\(p(x)\)</span>, and estimated it deterministically using only one sample. How do we ensure generalization? We shall come to these in the next installments.</p>
</section>
<p>Rahul Dave</p>