Typo at 11:39 it should be
|| [A]X - b ||^2
Don't worry, this doesn't affect anything in the video :)
Not using sum notation in the first proof is making it so much easier for me to understand. Brilliant!
Excellent video. Love the proof behind the parabola and the global min that the squared residuals must eventually attain. Bravo sir.
Thanks!
@@virtually_passed and why not just set a to zero allthe time? sint that easier--otherwise I don't see how to tell if your line starts close to origin or not
Within seconds of the video playing I immediately got an intuitive explanation of the least squares method better than I've ever had
Nice video, but it's a shame you don't give more intuition for the choice of the least squares vs. the other distance measures and the fact that this is just a projection onto a linear function space - that realisation is what really made linear regression click for me and made it possible to trivially generalise it to other functions.
Glad you liked the video and thanks for the feedback!
Still, the beginning was just introduction. He doesn't have to pick it up again if he wants to talk about the square method.
@@red_rassmueller1716Then why mention them in the first place? You can't blame us for being curious about an unresolved comparison.
I agree with this comment - I've always wondered why we don't ever use the other two measures, and this would have been a good opportunity to naswer the question. Could you maybe point to any other resources that do?
Wow, I've been binge-watching the SoME2 videos, and I've been impressed with everyone's effort. This video especially is so sick!
Thanks so much :)
Very simple, yet effective, explanation; I come out of this video happy knowing I learned something new which I would have never tackled by myself. Great work!
Thanks for the kind words!
I watched it a few days after you uploaded it, but I was in bed, almost asleep. Today I watched it again. It's amazing, and you are an excellent teacher!!! Keep going!!!
Thanks!
Great video. Here's a cool fact: The first row of the matrix equation at 14:27 says that the sum of the residuals must be zero, which (after a bit of algebra) proves that the least-squares line must map the average of x to the average of y.
Very cool fact! Thanks for sharing! I'd never heard of this before so I decided to prove it for myself:
r1 + r2 + ... + rn = 0
(a+bx1-y1) + (a+bx2-y2) + ... + (a+bxn-yn) = 0
n*a + b(x1+x2+...+xn) - (y1+y2+...+yn) = 0
divide both sides by 'n'
a + b*x_avg - y_avg = 0
y_avg = a + b*x_avg
Therefore the point P = (x_avg, y_avg) will lie on the line y = a+bx. Very neat!
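If you prefer code, here's a rough NumPy check of the same fact (the data values are made up, just for illustration):

import numpy as np

# made-up data, just for illustration
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

# least-squares fit of y = a + b*x
A = np.column_stack([np.ones_like(x), x])
a, b = np.linalg.lstsq(A, y, rcond=None)[0]

# the fitted line passes through (x_avg, y_avg)
print(a + b * x.mean(), y.mean())  # both numbers agree (up to floating-point rounding)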
@@virtually_passed ...at which point we could consider that point P to be a 'new' origin, and use coordinates relative to it to find the best fit of the data passing through that point - the simpler 1-dimensional case explored earlier in the video.
yeah, partway through the video I stopped to remake it in Desmos to see if the horizontal component could be used in some way because I was curious (though I didn't get anywhere with it), and I first offset them all by the x and y averages and did the 1D case
This is neat indeed!
Great video as always! Great visuals that really give insight to the problem, I also appreciate how you color code things and show every step of the computation. A tiny correction: at 11:40 it should be norm *squared*.
Thanks for the comment! I really appreciate the kind words. You're absolutely right!
At 11:40 it should be
error = ||AX-b||^2
Thanks for pointing that out :) fortunately it doesn't affect the rest of the video though :)
@@virtually_passed wait, just because the equation at 7:50 has a bunch of squared terms doesn't tell you it's a parabola though, so why did you say that??
@@virtually_passed oh, and also: the linear b terms involve xy, and if some of those are negative, then even if this might be a parabola it might not always point up - since the negative linear terms might be greater than the positive squared terms... see what I mean??
@@virtually_passed Hope you can respond when you can. Thanks very much.
@@leif1075 Sorry for the late reply! Notice that the error has the form of a parabola: e = k1*b^2 -2*k2 * b + k3
Where the constants k1, k2, and k3 are given by:
k1 = x1^2 + x2^2 + ...
k2 = x1y1 + x2y2 + ...
k3 = y1^2 + y2^2 + ...
Also note that k1 is always >= 0 because any real number squared is non-negative (and k1 > 0 as long as at least one xi is nonzero). It honestly doesn't matter what the values of k2 and k3 are, since the convexity of a parabola is determined entirely by the coefficient of the squared term. I've created a Desmos link for you here to see for yourself why this is true: www.desmos.com/calculator/waagmohtua
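If you'd rather check it in code, here's a small NumPy sketch of the same thing (made-up data, just for illustration):

import numpy as np

# made-up data for the one-unknown fit y = b*x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.3, 3.9])

k1 = np.sum(x**2)   # coefficient of b^2, always >= 0
k2 = np.sum(x*y)
k3 = np.sum(y**2)

b = np.linspace(-2.0, 4.0, 7)
e_direct   = np.array([np.sum((bi*x - y)**2) for bi in b])  # sum of squared residuals
e_parabola = k1*b**2 - 2*k2*b + k3                          # the parabola form above
print(np.allclose(e_direct, e_parabola))  # True
print("vertex (best b) at", k2/k1)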
Absolutely wonderful ! ! !
Combines linear algebra with calculus. This video is a GREAT "commercial" for both topics.
Thanks :)
You're welcome! Really, after all those years (I'm from 1969) this is the first time I see how both can go hand in hand.
The method of least squares was (along with everything else going on at the time) the point when i stopped understanding my linear algebra course at uni. And now i understand it. Thanks a lot
Glad it helped!
Holy $#17 this is like a dream come true, I can't believe you made this interactive! I literally just commented I want interactiveness built into #some2 videos! I haven't even gotten to the video yet... Mad respect guys you're awesome
Thanks for the kind words! We intended to make it more interactive, but we ran out of time. Originally we wanted it to be a "choose your own adventure" thing where you could choose the type of proof, choose whether you wanted to see a proof for 1 unknown (easy version) or 2 unknowns (harder version). Interactivity is still a dream of mine :)
Great explanation and visualization. Well done.
Thanks!
Least squares...
I had always thought of it as a square root sort of thing. I do statistics. I write queries and do data analysis as a job - for 25 years. I have a bit of a clue...
But the changing sizes of the squares as the line moved made me go "Ohhhhhhhhhhhh!". It just suddenly became intuitive.
Great way to explain it- you are getting a comment 30 seconds in. Nice work :)
Glad you liked it :)
Fantastic presentation!
Thanks :)
Good video! I never had to use the multivariable approach, but now I know how.
This is awesome!! It represents the SoME2 spirit perfectly, but with a very original way to explain and present. Thank you so much
Thanks!
Great job - This is absolutely fantastic! You are doing us all a favor.
This an absolutely beautiful explanation of least squares and where it came from. The visual and conceptual combined was really wonderful. Wish I had this in college. It would have spared me a lot of pain. 😄
Glad you enjoyed it!
Thanks for the visuals!
When I was learning OLS, I remember that my primary questions were a) why is the sum a good choice, what other options are there? and b) why squares and not absolute values?
I see that you just jump over these two questions, but in my experience, for somebody who is trying to understand the method (as opposed to memorizing it), these are the central questions which unlock the understanding. So you may want to add some exposition on that in the future; I'm sure many students will appreciate it.
Hi, thanks for your kind words and feedback!
I'm actually in the process of making more videos now so this is really good advice :) thanks! As a short answer to your question:
1) One of the massive advantages of Ordinary Least Squares (OLS) is that it guarantees convexity (i.e., the parabola has only one global optimum). Convexity is a big deal in the field of optimization. Some other fitting methods don't have this feature, meaning that it's possible to get stuck in local optima, so you won't get the best fit.
2) It is also super fast to compute (there's a quick timing sketch at the end of this reply).
There are downsides to this method though which I haven't talked about. One is that it's highly sensitive to large outliers (since it squares the error). But this is partially resolved by adding a regularization term (basically adding a 1-norm and a 2-norm together in the objective).
I'll elaborate more in a future video :)
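In the meantime, here's a tiny NumPy sketch of point (2) - a hypothetical large random dataset, just to illustrate how fast the closed form is:

import time
import numpy as np

# hypothetical random data, just for illustration
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 1_000_000)
y = 2.0 + 0.5*x + rng.normal(0.0, 1.0, x.size)

A = np.column_stack([np.ones_like(x), x])
t0 = time.perf_counter()
a, b = np.linalg.lstsq(A, y, rcond=None)[0]   # one direct solve
print(a, b, f"{time.perf_counter() - t0:.3f} s")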
@@virtually_passed Thank you very much!
@@matveyshishov you're welcome!
@@virtually_passed Regarding computing, it's a bit misleading to claim you don't need iterations to find your parameters. Given a small dataset, you can fit the most complex model with the slowest optimization method quickly. Indeed, for least squares, solving the normal equation is trivial when the data set is small, but difficult with a larger dataset, and one resorts to iterative methods to solve least squares.
@@andrewzhang5345 I agree, thanks for the comment. I've edited my response.
Hold up, I’ve DONE THIS before. But this is a much better explanation. Thank you.
I wish we had all these materials back in school 30 years ago... Nice work
Thanks!
Some recreational mathematics is learning cool stuff you didn’t already know, and some recreational mathematics is re-learning stuff you knew but with a better feel and intuition behind it. I think a lot of people overlook that second one, and this video shows how it can be really cool!
Would love to see a video where you go over the three methods you suggested and their pros and cons, that would be super cool.
Hey thanks for the comment and kind words. A lot of people have requested a summary video like that :) it's on the list :)
Thank you so much for making this topic so so so interesting
Hope to see much more
Hey thanks for the kind words! I've made another video on least squares here and I intended to make a few more:
ua-cam.com/video/wJ35udCsHVc/v-deo.html
Broh... I love you.
This was beautiful!!!
So helpful to understand the Vandermonde matrix...
Thanks!
Whoa! I'm illuminated! Thanks.
You're welcome :)
very very nice! I have been teaching LSQ optimization to undergrads for years, now I will just point them to your video 🙂best of luck for #SoME2
Thanks for the kind words!
Very nice video!
Love your content, luckily I just got recommended this video, I got lost a bit at the end with the multivariable calculus but I understood the reasoning and that is a lot, thanks!!
Thanks for the comment. As long as you get the big picture, that's what matters most. The rest are all details :)
This was great, all of it, amazing job.
Thanks 😊
Awesome visualization
Thanks!
3:01 It _does_ seem subjective when you put it like that. Which is why it's important to point out that the Least Squares method is equivalent to Maximum Likelihood Estimation for normal data, which makes it objectively superior.
I don't think it is objective to assume that the MLE is the best estimator. There are plenty of circumstances where you actually want something else.
Nice useful channel. Great stuff ... thanks. Cheers.
Thanks!
Love it!
:)
wow, can't understand how this channel only has 13k subs. awesome video!
Thanks!
Maybe mentioned already, but I think that what you demonstrate is the Reduced Major Axis method, where the error can be in two variables. The least-squares method assumes an input parameter without error (say the x axis) and an output parameter with error (say the y axis). The least-squares method reduces (in the case of an error in y) the vertical distance between the line and the actual points. At least that is how I understood it while using it some time ago.
I can't remember who originally said it but one of my favourite quotes about proofs is that "you shouldn't set out to prove something that isn't already almost intuitive."
Amazing video! Very enlightening
glad you liked it!
Great video, thank you so much for your explanation!
Really glad you liked it :)
Wow that was such a well-made video
Thanks!
I'm loving this
I was never really good at math in school (only making it to algebra 1/2 and geometry) and I'm already half way through (wanted to pause it so I don't miss any) and it's amazing
I've been able to understand everything very well (some time programming probably helped) but you have made it so accessible and I love how you take a moment to pause and explain what the key points are (like that there's one global minimum) and that we should notice them to remember for later it's very helpful in keeping track of everything
If I'm ever teaching something I'm definitely stealing that idea
10/10 will Like and Subscribe
Edit: just finished it with the matrices and it was still very understandable (even if I don't fully understand it) I was able to grasp enough to see and understand the power of this
And when you coded it it also helped a lot cus it brought it to a language I knew instead of one I'm still learning
Still 10/10 would recommend
Hey thanks so much for the kind words. I spent a lot of effort trying to make the video as accessible as possible so I'm glad it worked for you!
Nice video. It took far too long for me to understand this, because I didn't have the words to articulate my question of why squares instead of the L1 metric throughout high school and then uni, or I would be brushed off with a silly answer like "it's just the best way". A similar question that I had unanswered for a long time is why e is the number that is raised to i*theta for polar coordinates in the complex plane, and it was often dismissed with the fact that sin and cos were connected to e, but not why or how.
When I did have a good professor who explained it well I was so happy. I wondered if there were ever times that we would want to use higher norms or Lp-spaces, because some of those are easily solved as well, but they told me that it would give undue weight to outliers. I was satisfied with that answer at the time, but now I wonder if there are any applications where you do want the focus to be on outliers where those datapoints are actually an important part of telling the story of what the data means.
Excellent video. Make more video on statistics.
I understood JUST enough linear algebra to understand how clever that is. I started to phase out on the multivariate part (that's where I started flagging in college), but dang, that was a really cool reveal that the 'Jacobian' was just A transpose.
Glad you managed to follow it! Linear algebra is very powerful!
Wow. You more or less just summarized concisely what I spent weeks learning in 4000 level econometrics courses.
Could you do one for multivariable (multidimensional) values?
Thanks for the comment. What do you mean by multidimensional values? Do you mean to teach multivariable calculus? Or teach LS with multiple unknowns? :)
Excellent. Thank you.
:)
Words don't appropriately express gratitude, but thanks.
Wonderful video. Need more videos, sir
Thanks :)
this is really awesome! Although, as a math major, I would've liked to see an expansion of the formula for n dimensions (I would assume it uses r_i^n and the Jacobian and shouldn't be very hard to generalize, although I may be wrong)
Wow, it’s surprising how compact the expression ended up being! Very nice video.
I wonder about one of the other approaches you showed at the beginning, namely the “minimize perpendicular distance” method. That one appeals to me because it doesn’t seem to care about the rotation of our coordinate axes. If we were to turn that into a sort of “least circles” fit, would the resulting expression be anywhere near as neat or useful?
Hey, thanks for your comment.
The method you're referring to is formally called Orthogonal Distance Regression. If you want all the details I'd recommend reading the book Numerical Optimization by Nocedal and Wright.
In short, this method is superior in many ways but is generally more computationally expensive because the "Jacobian" matrix shown at 14:45 is no longer a constant in the general case, and so the minimization requires iterations.
Hope that makes sense :)
@@virtually_passed Thanks for the detailed answer! It made a lot of sense
Great video. Many thanks for the visualization of the problem.
If I remember correctly, if we increase the length of the x vector, we could fit polynomials as well. Can you confirm this?
Correct. For example you could try and fit data to the function
y = a + bx + cx^2 + dx^3
In this case the vector X would be:
X = [a,b,c,d]
The A matrix will also have more columns as well.
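For example, a rough NumPy sketch of the cubic case (the data values are made up, just for illustration):

import numpy as np

# made-up data, just for illustration
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([-7.9, -1.2, 0.1, 1.1, 8.2, 26.8])

# one column of A per unknown in X = [a, b, c, d]
A = np.column_stack([np.ones_like(x), x, x**2, x**3])
X = np.linalg.pinv(A) @ y   # same pseudo-inverse formula as in the video
print(X)                    # estimated a, b, c, d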
Minimizing the perpendicular distance (squared) is also sometimes used, especially when there is uncertainty in both x and y. It is, however, far more complex computationally. Most fitting packages do not support it, but it is possible, and I have used it in the past (including estimating the error of the parameter estimates).
Agreed :)
you are incredible ❤❤❤❤❤
I really like this video
Thanks 🙏
Excellent video. Can you do another video explaining the pros and cons of the other methods (i.e., the vertical distance and the perpendicular distance methods) compared to the least squares method?
Hi thanks for the comment. Quite a few others have requested a video like that. It's on the list :)
Hi! Awesome video! Which tools do you use to create the interactive exercises?
I collaborated with someone who did most of the heavy lifting regarding the simulation. We used P5.js to make all the simulations. A link to his GitHub and his website is in the description :)
The 2 norm at 11:50 or so should be squared. Very nice presentation
Thanks! You're right! I've made a post about this.
Great video
Thanks :)
Amazing ❤
Very elegant video, great concept to choose! Very 3b1b. Is there another standard to aspire to? 🤔
Thanks for the kind words. 3b1b is a hero of mine :)
awesome thanks, can you do one with nonlinear curve fitting as well
I actually intend to do just that! First I want to make a video on another proof of linear least squares using the column space of A. Then, if I have time, I'll do one on orthogonal fitting using nonlinear least squares
@@virtually_passed awesome, waiting.....
Very nice visual approach, but as a physicist, I am missing the motivation of " y errors only" vs "x and y errors". In other words, one could rotate the squares and go back to the ODR that is hinted to in the beginning and still get a least-squares method. (BTW unlucky choices: vector X and vector b) A video about ODR and/or SVD would be nice.
I was really hoping you'd follow through on that promise to explain why Least Squares is better than the other two approaches.
I intend to. Meanwhile, I've written quite a bit on this in other people's comments. :)
Excelent video!
Virtually Based. Thanks for the video, subscribed
:)
Nice video
Thanks!
The sad part of this amazing applied math video is that it ends!!!
Thanks! I have a follow-up proof video about least squares if you're interested ☺️
How about minimizing the areas setwise, discounting the intersections of the drawn squares? It would discount dense parts and should make a better fit
What an interesting idea! I don't know of any methods that do that. A consequence of this method is that a bunch of clumped points would have a similar weighting as a single point. That could be quite useful, actually! Interesting idea.
This is so much easier and more elegant to derive using linear algebra alone. There is no need to use Multivariable Calculus.
I agree it's beautiful and elegant to derive it using linear algebra alone! I actually just made a video doing exactly that :)
ua-cam.com/video/wJ35udCsHVc/v-deo.html
Least-squares fitting is highly sensitive to outlier points, so the fit gets distorted by bad points. A more robust estimate can be obtained by minimizing the median instead of the squared error, which is biased by outliers. Try it!
Yes! Which is why sometimes the objective function is the sum of the 2 norm and 1 norm to make a more robust fit :)
okay, but why would the single line methods not work? especially the vertical line ones which would seem to do the same thing but without the squaring?
That's a great question! The short answer is that it does work! The 'vertical lines' method is actually used in some applications! If you go through the math, the objective function that we try to minimize here is the "1-norm" of the residual vector. This is because we try to minimize the sum of the absolute values of all of the residuals.
In fact, sometimes the least squares method is used in conjunction with the 1-norm method in an attempt to make the fit more robust to outliers. If you want to see more, click on this amazing video by Steve Brunton:
ua-cam.com/video/GaXfqoLR_yI/v-deo.html&ab_channel=SteveBrunton
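To see the difference yourself, here's a rough sketch comparing the two fits (made-up data with one hypothetical outlier; the 1-norm fit is done numerically with SciPy, just as an illustration):

import numpy as np
from scipy.optimize import minimize

# made-up data with one outlier, just for illustration
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 1.1, 2.1, 2.9, 10.0])   # last point is an outlier
A = np.column_stack([np.ones_like(x), x])

p_ls = np.linalg.lstsq(A, y, rcond=None)[0]             # 2-norm (least squares)
p_l1 = minimize(lambda p: np.sum(np.abs(A @ p - y)),    # 1-norm ("vertical lines")
                p_ls, method="Nelder-Mead").x
print("least squares:", p_ls)   # pulled towards the outlier
print("1-norm fit:   ", p_l1)   # much less affected by it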
Do you have any recommendation of a material that connects this topic with QR Factorization?
Hi, great question! I'm sure there are many resources online, but I use Chapter 10 of the book "Numerical Optimization" by Nocedal and Wright. Good luck!
@@virtually_passed Thanks! Also, very good video. I shared with my university colleagues. I really found it very well done
@@luanmartins8068 thanks!
Nice! What are the gray circles that appear in the background for about one frame at a time?
I use Microsoft OneNote to do all the handwritten mathematical equations. Sadly, whenever I press too hard on my touchscreen with my hand, OneNote displays that annoying graphic. I tried to get rid of most of them, but sadly I couldn't get rid of them all :(
@@virtually_passed Ah. I thought it was an easter egg or a subliminal message. Good luck in the contest.
@@PhilipSmolen thanks!
can this idea be extended to fit any degree of polynomial function?
Yes! The A matrix just gets larger :)
Why can't you "divide" A transpose from both sides of that final equation?
Good question. When dealing with matrices we can't divide anymore. We need to multiply both sides by the inverse matrix, and that operation is only defined for square matrices. A^T isn't square in general (there could be more rows than columns or vice versa). However, in the very unlikely case that A happens to be square (i.e. there are just as many unique data points as unknowns) then you can invert A^T and the pseudo-inverse will collapse into the regular inverse of A.
Hope that makes sense
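Here's a small NumPy illustration of that (the numbers are made up): A^T on its own is 2x5 and has no inverse, but A^T A is 2x2 and does:

import numpy as np

# a tall A: 5 data points, 2 unknowns (made-up numbers, just for illustration)
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 2.1, 2.9, 4.2, 4.8])

# np.linalg.inv(A.T) would fail here: A^T is 2x5, not square.
X = np.linalg.inv(A.T @ A) @ A.T @ b   # but A^T A is 2x2, so it IS invertible
print(X)
print(np.linalg.pinv(A) @ b)           # the pseudo-inverse gives the same answer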
Does this naturally extend to higher dimensional points? How would one find the best fitting line to a 3d point cloud?
That's a great question! This method can indeed be extended to 3D data. Let's say you have n data points:
(x1,y1,z1), (x2,y2,z2), (x3,y3,z3), .... , (xn,yn,zn)
And let's say you wanted to fit the plane z = a + bx + cy to these data points. Here the unknowns are X = [a, b, c].
Just like in the 2D case you can construct a residual vector. But in this case, the residuals would be the error between the z coordinate on the plane and the z coordinate of the data. Ie
ri = a + b*xi + c*yi - zi
And so the A matrix will look like this: A =
[ 1 x1 y1
1 x2 y2
1 x3 y3
1 x4 y4
....
1 xn yn]
and the b vector will look like this: b =
[z1
z2
z3
z4
...
zn]
Then you can use the same formula to find vector X = pinv(A)*b
Hope that helps :)
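Here's the same recipe as a rough NumPy sketch (the 3D data values are made up, just for illustration):

import numpy as np

# made-up 3D data points, just for illustration
x = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
z = np.array([1.1, 2.0, 3.1, 2.9, 4.1, 5.0])

# fit the plane z = a + b*x + c*y
A = np.column_stack([np.ones_like(x), x, y])
a, b, c = np.linalg.pinv(A) @ z
print(a, b, c)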
@@virtually_passed The plane? So you'd have to subsequently project the points onto the resulting plane and do a 2D "least squares" to get the line? There's no shortcut? Because that's what I was doing already, just the other way around. Project to the XY & XZ planes, least squares, combine to a 3D line.
@@benjaminmiller3620 Hey mate, sorry I think I must have explained it poorly before. At no point is it needed to project the data to the XY and XZ planes. It's going to be hard to explain this without an image. Can you send an email to me at virtuallypassed@gmail.com and I'll reply with some images which will make that clearer :)
In that email can you please provide me more details about the problem too? What is the exact form of the equation of the '3D line' you want to fit the data to? Is it actually a line? Or a surface?
@@virtually_passed A line. *r* = *r_0* + _t_ * *v* (I prefer the vector equation.) Not sure where you got "surface" from.
@@benjaminmiller3620 Hey Benjamin, I just replied to your email. I suggest using PCA. Details in the email :)
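For anyone following along, here is a rough sketch of the PCA idea mentioned above (not necessarily exactly what was in the email): centre the point cloud, take its SVD, and use the first principal direction as the line direction. The point cloud below is made up, just for illustration.

import numpy as np

# made-up 3D point cloud scattered around a line, just for illustration
rng = np.random.default_rng(1)
t = rng.uniform(-1.0, 1.0, 50)
P = np.column_stack([1 + 2*t, -0.5 + t, 3*t]) + rng.normal(0.0, 0.05, (50, 3))

r0 = P.mean(axis=0)               # a point on the line (the centroid)
_, _, Vt = np.linalg.svd(P - r0)  # SVD of the centred points
v = Vt[0]                         # first principal direction = line direction
print("r0 =", r0, "v =", v)       # the fitted line is r = r0 + t*v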
is the reason for using squares for the error to guarantee that it’s always positive? (there’s a minimum)
Yes! That's a big motivational factor! Admittedly you could have also taken the absolute value, but squaring makes the math much easier
How you not gonna name your collaborator at the start?
I'll just buckle up and do the regression by hand. I guessed the value for b correctly. I don't need scary algorithms and maths.
Nothing wrong eyeballing it for simple cases :) Most programs have this inbuilt under the hood so you likely don't need to worry about the theory anyway :)
Could this be done with circles, with the points making circles, tangent to the line of best fit?
Yes it can! More generally, you can use it to fit ellipses. You just need to do a clever transformation. Hint: let error = x^2+y^2
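One common way to make that linear (an "algebraic" circle fit - not necessarily the exact transformation meant above) is to write the circle as x^2 + y^2 = 2a*x + 2b*y + c, which is linear in (a, b, c). A rough NumPy sketch with made-up data:

import numpy as np

# made-up points scattered around a circle, just for illustration
rng = np.random.default_rng(2)
theta = rng.uniform(0.0, 2*np.pi, 40)
x = 3.0 + 2.0*np.cos(theta) + rng.normal(0.0, 0.05, 40)
y = -1.0 + 2.0*np.sin(theta) + rng.normal(0.0, 0.05, 40)

# x^2 + y^2 = 2a*x + 2b*y + c is linear in the unknowns (a, b, c)
A = np.column_stack([2*x, 2*y, np.ones_like(x)])
rhs = x**2 + y**2
a, b, c = np.linalg.pinv(A) @ rhs
radius = np.sqrt(c + a**2 + b**2)   # since c = R^2 - a^2 - b^2
print("centre:", (a, b), "radius:", radius)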
What I don't understand: doesn't the line depend on the orientation of the coordinate system?
I don't know if it does, but I would expect so, and - graphically - that bugs me. I know it makes sense to square the errors (parallel to the y-axis) when dealing with a data set from a measurement.
But when I draw points on the floor and ask you, what the best line through those points is, it shouldn't depend on the coordinate system.
i guess once you are IN a coordinate system, then the corresponding x, y data will give you unique values of a and b. Changing the coordinate system will change the x, y and also the corresponding a and b... so different coordinate systems will give you different a and b, making the line fit every time. It is in this sense that you will always get a fit, independent of which coordinates you choose. But you will have to CHOOSE first in order to proceed. The choice itself is free.
Hey, that's a really interesting question! If I understand you correctly, you're claiming that if you have another axis x', y' that's 10 degrees rotated clockwise from the traditional axis x,y then the fitted curve will be slightly different. Is that correct?
I haven't done the math on it, but I strongly suspect you're right. But consider trying to fit the data with a parabola y = a+bx+cx^2 instead. In this case, the parameters the LS fitting would need to find are (a, b and c). However, in the rotated coordinate system, if you tried to fit the parabola y' = a' + b'x' + c'x'^2, then you'll find there are no values of (a', b' and c') that could ever make these two parabolas look the same! And that's because a rotated parabola has an entirely different equation in the original coordinate system. So when you think about it this way, it seems quite reasonable, in my subjective opinion, that a different coordinate system can make slightly different fits. In which case, you would need to define your coordinate system first, and then perform the fit :) Hope that helps :D
@@virtually_passed Haven't done the math either, but that's just what I suspected.
Maybe squaring the perpendicular distances to the line and minimizing that sum would give you the same line always, independent of where the coordinate system is.
You forgot to explain why we chose squares over linear distances.
What about method of least circles...
Cool idea! The answer will actually be the same. Here's why:
Instead of minimizing:
r1^2 + r2^2 + ... + rn^2
You will be minimizing:
(π/4)*r1^2 + (π/4)*r2^2 + ... + (π/4)*rn^2 (this is because a circles area is πD^2/4)
(π/4) * (r1^2 + r2^2 + ... + rn^2)
Notice this is just a scaled version of the same minimization problem from before, so the parabola will just be a bit less steep but will have the same optimum.
Everyone is working hard eh.
In my textbook and some other websites the gradient is given by this formula:
b = S_xy / S_xx = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x²) - (sum(x))²)
That is not the same as the formula here (sum(xy)/sum(x²)). Why?
Hi thanks for the question. The formula you are referring to finds the value of 'b' that fits the line y=a+bx.
The formula that I derived at 8:27 finds the value of 'b' that fits the line y=bx. This is why the formula is different.
However, later on in my video (16:00) I derive an even more general formula for fitting any polynomial with any number of unknowns (not just lines!). If you were to use that formula for the special case of a line y=a+bx, you'll get the same answer as the one you provided.
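Here's a rough NumPy check of that (the data values are made up, just for illustration):

import numpy as np

# made-up data, just for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = x.size

# line through the origin, y = b*x (the 8:27 formula)
b_origin = np.sum(x*y) / np.sum(x**2)

# line with an intercept, y = a + b*x (the textbook formula)
b_textbook = (n*np.sum(x*y) - np.sum(x)*np.sum(y)) / (n*np.sum(x**2) - np.sum(x)**2)

# the general matrix formula (16:00) applied to y = a + b*x gives the textbook slope
A = np.column_stack([np.ones_like(x), x])
a_gen, b_gen = np.linalg.pinv(A) @ y
print(b_origin, b_textbook, b_gen)  # b_textbook == b_gen; b_origin differs in general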
Error function: *forms a parabola*
Me: :o
I mean, it's just one more step than the perpendicular lines. After all, if you have a distance, there's no need to square it; sure, it keeps all distances positive, but so does the absolute value in 2D...
this is not for beginners but for anyone who got a B in statistics this is better than 3b1b
Well, I think I don't have "basic high school calculus"
You have `invert (transpose A * A) * transpose A`... shouldn't that simplify to `invert A`? The inverse of a product is the product of the inverses, but in the opposite order, then the `invert (transpose A)` would cancel with `transpose A` by associativity.
That's a great question!
If I understand your question correctly you are saying the following, right?
X = inv(A^T A) A^T b
=inv(A) inv(A^T) A^T b
=inv(A) I b
=inv(A) b
This can only be true if A is a square matrix! Because the rule inv(AB) = inv(B) inv(A) only applies if A and B are square matrices - the traditional inverse is only defined for a square matrix. Hope that helps! :)
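A quick NumPy check of when that collapse does and doesn't happen (made-up matrices, just for illustration):

import numpy as np

# square, invertible A: the pseudo-inverse collapses into the ordinary inverse
A_square = np.array([[1.0, 2.0],
                     [3.0, 4.0]])
print(np.allclose(np.linalg.pinv(A_square), np.linalg.inv(A_square)))  # True

# tall A (more data points than unknowns): only the pseudo-inverse exists
A_tall = np.array([[1.0, 0.0],
                   [1.0, 1.0],
                   [1.0, 2.0]])
print(np.linalg.pinv(A_tall))   # well-defined: inv(A^T A) A^T
# np.linalg.inv(A_tall) would raise LinAlgError, since A_tall is not square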
@@virtually_passed Ah! I was thinking about that, but I forgot that A^T * A would be square (and possibly invertible), even if A isn't.
@@MCLooyverse correct :)
Some random advice - don't tell us that you're manipulating us by telling us that it's a parabola. Instead, just suggest its shape resembles a parabola/hyperbola - get us thinking: 'Huh - that's interesting. Is it a parabola? Is it a hyperbola?' That has us thinking about its shape, and looking for what might be defining its shape -> that engages us in the lesson more than just monologuing at us, and won't anger some of us anywhere near as much as a bold statement of 'I'm manipulating you for your own good'.
Ooo thanks for the pedagogy advice!
@@virtually_passed FWIW, The Action Lab recently did a video that involved putting superconductor into an induction heater.
At face value, he appeared puzzled by the outcome, however, if you consider the whole video, he probably expected that outcome before he ever started filming -> it’s an example of engaging the audience by presenting them with something unexpected+unexplained.
He’s doing much what you did vis-a-vis the parabola being true, but he put the focus on the subject without calling out that he was selectively feeding information to the audience.
Good luck with your future ventures.
Yep, I got nothing. Absolutely no idea why you use the area of a square rather than just the length of the line. Then when you started using matrices, I was lost.
You could have completed the square instead of calculating de/db. You would have found b without calculus.
You're absolutely right!
@@virtually_passed and if you replaced all the sums with Sum x^2, Sum xy and sum Y^2 then you could have done two things - solved everything without matrices, and also shown how incredibly efficient this algorithm is because you can incrementally add and remove points from those sums.
@@idjles Indeed the example I showed with 2 unknowns (a and b) can be solved without matrices. However, the method I used to solve it can be applied to a polynomial with 'n' parameters! Deriving a solution for 'n' unknowns without matrices will be very very hard and messy :)
How do we know that the curve is a straight line, that the function of the data is linear? It seems like it takes on a logarithmic appearance. Few equations in the real world are linear. Seems like this could be an example of the problem of "lying with statistics".
Hi, that's a really great question! The form of the equation that you want to fit the data to has to come from some external information about the system you're analyzing. Typically engineers or physicists have a model of the thing they're trying to analyse. For example, if this data was force vs. distance for a spring then the model will probably be linear, or a cubic. If it was population vs. time then you'll use an exponential.
You might be tempted to avoid this problem by trying to fit a curve with many many unknown parameters (perhaps by fitting a polynomial of degree 100 or something). But this is a bad idea because then you will just be overfitting.
If you genuinely know nothing about the data you're measuring, and so you have no model (eg you're studying a part of the human brain or something) then there are other things you can do, but that goes beyond least squares.
@@virtually_passed awesome, thank you for the detailed and prompt reply! Perhaps my question could be material for your next video? I just found your channel with this video, excited to binge.
How did I get here from watching animators
\_o_0_/
It was understandable up until you introduced matrices; then I tried to keep following, but got lost. I need to learn more math...
Thanks for the comment! Yeah, linear algebra can be quite tough. As long as you understood the first part though (solving for 1 unknown), that's the most important thing! The other half of the video is a way to solve for 'n' unknowns and it's basically the same idea :)
Saying it can be "easily and efficiently implemented in software" is quite misleading when the evidence offered is just a single function call.
A single function call can be incredibly complex and inefficient.
All that demonstrates is that it can be easily implemented.
Why did you not show the vertical and perpendicular options before spending multiple minutes essentially repeating that the square option was the best?
Also, why are the squares drawn the way they are and not some other way? Why use the vertical as a basis and not a horizontal or a perpendicular? I'm almost halfway through the video and I feel like I'm getting dragged through the problem and its "best" solution instead of being told about the approach to the problem. I feel like I'm not being allowed to see the steps that get us to the answer, I'm just sitting through long praise of the good answer. Honestly, why are we proving that squares yield parabolas? There is no intuitive reason why we're talking about parabolas by that point. And that's multiple minutes spent listening to maths I had no clue why I was listening to.
And the rest of the video is more maths that was more like being told how to write an algorithm than why use that algorithm.
Hey, thanks so much for your comment. I really appreciate the feedback. I think I'll create another video that will describe the differences in these fitting methods in more detail. In short, there are pros and cons for each of the fitting methods you've mentioned. Ultimately, the 'best' method depends on the type of problem you have. However, the point of this video was to explain what the ordinary least squares method is, and to provide just a bit of motivation as to why it's so widely used. It's widely used because it's 1) very computationally efficient, 2) simple to implement in software, and 3) results in a convex optimization problem (the parabola only has one minimum).
I hope that helps explain things :)
promosm
What does that mean? :}