Wow, the picture with the RSS contours and the intuition for why the lasso sets coefficients to exactly zero is beautiful. I haven't seen an illustration like this before. Thank you!
My supervisor/collaborator in ML introduced me to Tibshirani's work during my MSc. and I feel like I was once blind and now can see. Thank you so much for these videos...
I like Daniele; she brings a youthful vibrancy to the presentation.
DAYUMN! You came up with this? That's epic B)
Amazing content! Thanks!
One question. Why don't we manually set a threshold epsilon and drop the coefficients that fall below it in Ridge, since Ridge should run faster thanks to its differentiability?
On Lasso being more likely to produce a coefficient of exactly ZERO, unlike Ridge: is it fair to say that for coefficients less than 1, Ridge will impose a smaller penalty than Lasso because of the squared term (the square of a number less than 1 is smaller than the number itself)? And on the other hand, if we anticipate coefficients greater than 1, might we be better off with Ridge because of the larger penalty and greater shrinkage?
You normalize the data before passing it to either method, so coefficient sizes are not that meaningful, and I don't think that's a good/correct intuition. Also, it doesn't explain why LASSO makes coefficients exactly zero rather than just shrinking them more. I think a better intuition is to look at the orthogonal case, where all the x_i are uncorrelated. There, what LASSO does is "soft-thresholding" (everything within a band around zero becomes EXACTLY ZERO), while Ridge just shrinks ALL COEFFICIENTS by the same factor, 1/(1+lambda) -- see the quick sketch after this comment.
It is actually well known that LASSO *overshrinks* the coefficients (for inference) -- so for purely predictive purposes, where you don't care about sparsity, you're probably better off with Ridge. Take a look at Elastic Net, and especially the original paper by Zou and Hastie, where they discuss LASSO's limitations (see elastic net vs naive elastic net for some corrections).
Also, generally speaking, LASSO makes sense when you have a lot of features and want to remove some of them -- and in those cases it's very hard to predict which coefficients are going to be > 1 ...
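To make the orthogonal-case intuition concrete, here's a minimal Python sketch of the two closed-form updates (the coefficient values and lambda are made up for illustration; this is not the video's code):

```python
import numpy as np

def soft_threshold(b, lam):
    # Lasso estimate under an orthonormal design: shrink each OLS
    # coefficient toward zero, and anything inside [-lam, lam]
    # becomes exactly zero.
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def ridge_shrink(b, lam):
    # Ridge estimate under the same design: scale every coefficient
    # by the same factor 1/(1 + lam); nothing becomes exactly zero.
    return b / (1.0 + lam)

ols = np.array([-2.0, -0.3, 0.05, 0.4, 1.5])  # hypothetical OLS coefficients
lam = 0.5
print(soft_threshold(ols, lam))  # -> [-1.5, 0, 0, 0, 1.0]  (three exact zeros)
print(ridge_shrink(ols, lam))    # -> [-1.33, -0.2, 0.033, 0.27, 1.0]  (no zeros)
```

The same OLS coefficients go in, but only the soft-thresholded (Lasso-style) version returns exact zeros.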
Why is it more likely for lasso to touch the corners of the diamond than for ridge to touch the points on the circle on each axis that would make one of the predictors equal to zero?
Consider two parameters b1 and b2, say b1 = 0.1 and b2 = 0.1. Their squares are 0.01 each, already tiny. Ridge won't bother to set them exactly to zero; instead it just makes them even smaller, say 0.001, whose square is 0.000001, a negligible number, so the squared penalty gives almost no incentive to push all the way to zero. Lasso's penalty |b| stays linear near zero, so it keeps pushing until the coefficient is exactly zero.
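Putting rough numbers on that comparison (looking only at the penalty terms themselves, ignoring the RSS part; the values are arbitrary):

```python
# Penalty paid for a single small coefficient b:
# Lasso pays |b|, which stays linear near zero,
# while Ridge pays b**2, which vanishes much faster.
for b in [0.1, 0.01, 0.001]:
    print(f"b = {b}: lasso |b| = {abs(b)}, ridge b^2 = {b**2:.6g}")
```

By b = 0.001 the ridge penalty is already down to 1e-06, which is why shrinking a bit further is "good enough" for Ridge but Lasso still gains by going to exactly zero.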
6:24 minimization equations