Thank you. One of the best explanations of L1 vs L2 regularization!
I have taken courses and put a lot of effort into reading material online, but your explanation is by far the one that will remain indelible in my mind. Thank you
Oh my G. After 5 years of confusion, I finally understood Lp regularization!
Thank you so much Alex!
@Emm -- not sure how/if I can reply to your comment.
An iso-surface is the set of points such that a function f(x) has constant value, e.g. all x such that f(x) = c. For a Gaussian distribution, for example, this is an ellipse, shaped according to the eigenvectors and eigenvalues of the covariance matrix.
So, the iso-surfaces of theta1^2 + theta2^2 are circles, while the iso-surfaces of |theta1|+|theta2| look like diamonds. The iso-surface of the squared error on the data is also ellipsoidal, with a shape that depends on the data.
Alpha scales the importance of the regularization term in the loss function, so higher alpha means more regularization.
I didn't prove the sparsity assertion in the recording, but effectively, the "sharpness" of the diamond shape on the axes (specifically, the discontinuous derivative at e.g. theta1 = 0) means that the optimum of the combined (data + regularization) objective can fall at a point where some of the parameters are exactly zero. If the penalty were differentiable at those points, this would essentially never happen -- the optimum would almost always be at some (possibly small, but) non-zero value.
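A quick way to see this numerically is to fit both penalties and count how many coefficients come out exactly zero. Here is a minimal sketch using scikit-learn's Lasso and Ridge (not tools used in the video; the data and alpha value are arbitrary):

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    # Toy regression problem where only the first 3 of 10 features matter.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    true_theta = np.zeros(10)
    true_theta[:3] = [2.0, -1.5, 0.5]
    y = X @ true_theta + 0.1 * rng.normal(size=100)

    lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty (diamond iso-surfaces)
    ridge = Ridge(alpha=0.1).fit(X, y)   # squared L2 penalty (circular iso-surfaces)

    print("exact zeros, L1:", np.sum(lasso.coef_ == 0))  # typically several
    print("exact zeros, L2:", np.sum(ridge.coef_ == 0))  # typically none; just small values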
Best explanation of regularization I ever saw! Concise, detailed just enough, and covers all the practically important aspects. Thank you Sir!
Nice video, this is what I dig for on YouTube: an actual concise, clear explanation worth any paid course.
Very few videos online give some key concepts here, like what we're truly trying to minimize with the penalty expression. Most just give the equation but never explain the intuition behind L1 and L2. Kudos man
Whoa, I wasn't ready for the superellipse, that's a nice surprise. That helps me understand the limit case of p -> inf. Also exciting to think about rational values of p such as the 0.5 case.
Major thanks for the picture at 7 minutes in. I learned about the concept of compressed sensing the other day, but didn't understand how optimization under regularized L1 norm leads to sparsity. This video made it click for me. :)
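For anyone who wants to play with these shapes, here is a minimal matplotlib sketch (my own, not from the video) that draws the iso-surface |theta1|^p + |theta2|^p = 1 for a few values of p, including p = 0.5 and a large p that approaches the square-like superellipse limit:

    import numpy as np
    import matplotlib.pyplot as plt

    grid = np.linspace(-1.5, 1.5, 400)
    T1, T2 = np.meshgrid(grid, grid)

    for p in [0.5, 1, 2, 10]:                     # star-like, diamond, circle, near-square
        level = np.abs(T1) ** p + np.abs(T2) ** p
        plt.contour(T1, T2, level, levels=[1.0])  # the iso-surface at value 1

    plt.gca().set_aspect("equal")
    plt.title("Iso-surfaces of |theta1|^p + |theta2|^p for p = 0.5, 1, 2, 10")
    plt.show()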
Best explanation yet on what ridge regression does.
Wonderful video to give some intuition on L1 vs L2. Thank you!
great video even after 10 years! thanks! :)
First heard of this via more theoretical material. Very cool to see a discussion from a more applied (?) perspective.
This really is an incredible explanation of the idea behind regularization. Thanks a lot for your insight!
Great presentation with very reasonable depth!
Thank you! That's a very clear and concise explanation.
Thank you!! This really helped to understand the difference between L1 and L2.
Very clear explained, helped a lot, thanks Alex!
Wow, that was such a great explanation. Thank you.
Next time, I'd love it if you included the effect lambda has on regularization, including visuals!
Thank you Alexander - very well explained !
What excellent videos you have posted! Congratulations!
Many thanks for the brilliant video !!
English major: Brevity is the soul of wit.
Statistics/Math major: Verbal SCAD-type regularization is the soul of wit.
I just found your videos now; thank you for such a wonderful explanation, it really helps me understand this term.
OMG! This stuff is just way too cool! I love maths.
As my old friend Borat would say: Very Nice!
Why don't we draw concentric circles and diamonds as well, to represent the optimization space of the regularization term?
Awesome description, thanks 🙏
This is superb. Thanks for putting it together.
Thanks! This helps me understand the regularization term a lot.
I learned a lot from this video. Thank you!
awesome explanation. thank you
Thank you for the great explanation. Some questions:
1. At 2:09 the slide says that the regularization term alpha x theta x thetaTranspose is known as the L2 penalty. However, going by the formula for the Lp norm, isn't your term missing the square root? Shouldn't the L2 regularization be alpha x squareroot(theta x thetaTranspose)?
2. At 3:27 you say "the decrease in the mean squared error would be offset by the increase in the norm of theta". Judging from the tone of your voice, I would guess that statement should be self-apparent from this slide. However, am I correct in understanding that this concept is not explained here; rather, it is explained two slides later?
+RandomUser20130101 "L2 regularization" is used loosely in the literature to mean either the Euclidean distance or the squared Euclidean distance. Certainly the L2 norm has a square root, and in some cases (L2,1 regularization, for example; see en.wikipedia.org/wiki/Matrix_norm) the square root is important, but often it is not; it does not change, for example, the isosurface shape. So there should exist values of alpha (the regularization strength) that make the two penalties equivalent; alternatively, the path of solutions traced out as alpha changes is the same.
Offset by increase: regularization is being explained in these slides generally; using the (squared) norm of theta is introduced as a notion of "simplicity" in the previous slides, and I think it is not hard to see (certainly if you actually solve for the values) that producing the regression curve in the upper right of the slide at 3:27 requires large coefficient values, which creates a trade-off between the two terms. Two slides later is the geometric picture in parameter space, which certainly also illustrates this trade-off.
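To make the "same path of solutions" point concrete, here is a rough numerical sketch (a toy example of my own using scipy, not material from the lecture; the data and alpha are arbitrary): at the minimizer of MSE + alpha * ||theta||^2, the unsquared penalty with alpha' = 2 * alpha * ||theta*|| has the same gradient condition, so both convex objectives share that minimizer.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
    mse = lambda th: np.mean((X @ th - y) ** 2)

    alpha = 0.5
    # Minimizer of MSE + alpha * ||theta||_2^2 (squared penalty).
    th_sq = minimize(lambda th: mse(th) + alpha * np.sum(th ** 2), np.ones(3)).x

    # Matching strength for the unsquared norm: alpha' = 2 * alpha * ||theta*||.
    alpha_prime = 2 * alpha * np.linalg.norm(th_sq)
    th_unsq = minimize(lambda th: mse(th) + alpha_prime * np.linalg.norm(th), np.ones(3)).x

    print(np.allclose(th_sq, th_unsq, atol=1e-3))  # should print True (same solution, up to optimizer tolerance)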
+Alexander Ihler Thank you for the info.
Thanks for the videos, I really enjoy learning from them!
Sometimes I wish some profs would present a YouTube playlist of good videos instead of giving their lectures themselves. This is so much better explained. There are so many good resources on the net; why are there still so many bad lectures given?
Fuck, that's true, but depressing.
How do you identify the extremities of the ellipse from its equation?
Finally I know what those isosurface diagrams found in PRML mean.
Apologies, but what is the rationale for the concentric ellipses? I understood the L1/L2 area though.
Awesome explanation, sir. Thanks much!
Thank you, that was an elegant explanation.
Great video! Did help me a lot!
Beautiful!
That was a great video Alex!
The most perfect video
Thank you for the excellent video.
Nice, clear explanation. Thnx.
Thank you so much!!!
Thanks, you make great videos :)
Awesome!
Sorry, I can give only one like :)
I love your accent
How is L1 regularization performed?
Just replace the "regularizing" cost term that is the sum of squared parameter values (the L2 penalty) with one that is the sum of the absolute values of the parameters (the L1 penalty).
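In code terms, a minimal numpy sketch of the two objectives might look like this (my own illustrative version, with alpha as the regularization strength; the function names are just placeholders):

    import numpy as np

    def l2_penalized_loss(theta, X, y, alpha):
        # Squared-error data term plus the sum of squared parameters (ridge / L2 penalty).
        return np.mean((X @ theta - y) ** 2) + alpha * np.sum(theta ** 2)

    def l1_penalized_loss(theta, X, y, alpha):
        # Same data term, but the penalty is the sum of absolute parameter values (lasso / L1).
        return np.mean((X @ theta - y) ** 2) + alpha * np.sum(np.abs(theta))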
Thank You!
Lasso gives sparse parameter vectors. QUOTE OF THE DAY. Go ahead and finish the report :P
6:47
What is the best in the real world? Why does your boss keep paying you?