Support StatQuest by buying my book The StatQuest Illustrated Guide to Machine Learning or a Study Guide or Merch!!! statquest.org/statquest-store/
Subscriiibed ! DOUBLE BAM!!! 😂😂
You are truly an angel. Your videos on Ridge, Lasso and Elastic Net really help with my understanding. They're way better than the lectures at my university.
Thanks!
Still the best stat videos on UA-cam
You have no idea how much you've helped me. You'll be in the acknowledgments of my diploma
Wow, thanks!
He'd probably rather be in the acknowledgments of your checkbook, then
@@johnnyt5108 I am sure many people will do that by buying the reports and the book Josh has written.
update after 2 years: did u include him on ur diploma?
Agree with you, yet "unfortunately, no one asked me!".
this channel is saving my ass when it comes to applied ml class. so frustrating when a dude who has been researching Lasso for 10 years just breaks out some linear algebra derivation and then acts like you're supposed to instantly understand it...... thanks for taking the time to come up with an exposition that makes sense.
Thanks!
I'm just gonna take a minute to appreciate the effort you put into your jokes to make the video more interesting. It's quite underrated.
Thank you!
I was wondering why I missed out on this video while going through the ones on Ridge and Lasso Regression from Sept-Oct 2018. Then I noticed this is a video you put out only a few days ago. Awesome. Much gratitude from Malaysia. 🙇
Thanks! :)
Many people on the Internet explain regularization of regression using polynomial features, claiming that ridge and lasso are used to reduce the curvature of the line, but in that case you really just need to find the right degree of the polynomial. You are one of the few who have shown the real essence of regularization in linear regression: the bottom line is that we simply penalize the model, trading a little bias for lower variance through changes to the slope.
By the way, real overfitting in regression can be observed well in data with a large number of features, some of which correlate strongly with each other, together with a relatively small number of samples, and that is where the L1/L2 (Lasso/Ridge) penalties are useful.
Thank you so much for a very good explanation.
Thanks!
The visualization really sells it.
Thanks! :)
I've never been good with this kind of math/statistics because when I encounter the formulas in books I tend to forget or not understand the symbols. Your videos make it possible to go beyond the notation and learn the idea behind these concepts so I can apply them in machine learning. Thank you!
Bam! :)
Hello Josh, Ridge and Lasso clearly visualized :) I must say that the one thing that makes your videos clearly explained to curious minds like me is the visual illustrations you provide in your stats videos. Glad. Thank you very much for your efforts.
Thank you very much! :)
Thanks, Josh, for this amazing video. I promise to support this channel once I land a job offer as a data scientist. This is the only video on YouTube that practically shows all the algorithms.
Thank you and Good luck!
Best visuals ever! No matter how much I think I know about stats, I always learn something from your videos. Thanks.
Thanks so much! BAM! :)
I thought I was familiar with the concept of regularization, but your videos always help me grasp the concept more easily and, of course, deeper!
Thanks!
Fantastic, Josh!! Thank you very, very much. We all owe you so many thanks. "I" owe you a lot. 😊😊👍👍
Awesome! Thanks! :)
You are awesome, Josh. It always bothered me why L1 would push coefficients to 0 and L2 wouldn't, and you explained it so simply.
Thank you! :)
dude is creating quality videos and replies to every comment!
talk about dedication!
thanks a lot
bam! :)
As a result, when the slope becomes 0 for a large lambda in lasso, we can use lasso for feature selection.
Nice video, Josh!!
Bam! :)
Can we start a petition to change the lasso and ridge names to absolute value penalty and squared penalty pwease?
That would be awesome! :)
@@statquest I am listening to u on spotify
@@JoaoVitorBRgomes Bam!
You are a super professor and I'll give you an infinity BAM !!!!!!!!!!! I really like the way you repeat the earlier discussed topics to refresh the student's memory; that is really helpful, and you have a lot of patience. Once again you proved that a picture is worth a thousand words.
Thank you very much! :)
I needed this information for my data science class and didn't expect such a well crafted and humorous video!
You are doing great work sir!
Wow, thank you!
Just found this channel today, great illustrations! Thanks for keeping the voice speed down; that makes it easy for me to follow!
Awesome, thank you!
These videos are so clear and fun, they helped me a lot with modeling and statistics in biology.
Thank you! :)
Just wanna say... you are my Guru (it means teacher) in data science... more love to you from India
Thank you! :)
Fortunately, I asked you :)
I agree squared and absolute penalty are better word choices for these regularization methods. Thanks again for making my ML at Scale a tad bit easier.
BAM! Thank you very much! :)
And that's the reason why lasso does a kind of feature selection and can set many weights to 0, unlike ridge regression. Now I know the reason behind it. Thanks a lot ❤
BAM! :)
This explained everything i needed to know in 9 minutes. Absolute genius, thank you!
Glad it was helpful!
I just became the 104th patron of the channel!
TRIPLE BAM!!! Thank you very much!!! :)
OMG!!! I always thought that Ridge was a better method for reducing overfitting because its squared penalty reduces the weights more heavily and pushes them faster toward 0. Now you've changed my mind.
bam! They both have strengths and weaknesses.
Great work Josh! Your songs get me every time.
Bam! :)
The explanation can't be any better than this....!
bam! :)
The best. Definitely gonna come back and donate once I land a job.
Wow! Thank you!
Dude, you succeeded at helping me and at making this funny while I'm struggling with my ML homework. Thank you so much.
Glad I could help!
incredible videos, been watching all of your videos during quarantine for my future job interview. Still waiting for the time series tho. Thanks sir
Thanks!
You are a life saver! I have been trying to understand this for years now!! Thanks a ton!!!
Bam! :)
Thank you, the regularization series videos from 2018 to 2020 are so helpful. 😀
Thanks!
This is the perfect explanation I was searching for of why L1 can be used for feature importance!!!
bam! :)
Thanks a lot for this wonderful lesson... loved it. Seeing how the function behaves with different parameters makes it etched in the memory.
Glad it helped!
Hi Josh, please accept my heartfelt thanks for such a wonderful video. I guess your videos are an academy in themselves. Just follow along with your videos and BAM!! You are a master of Data Science and Machine Learning. 👏
Wow, thank you!
Thank you very much for this video! It helped me visually understand how Lasso regression can remove some predictors from the final model!
Glad it was helpful!
Thanks for taking the time to explain ML concepts in an amazing manner with clear visualizations.
Great work.
WOW! Thank you so much for supporting StatQuest! TRIPLE BAM!!! :)
@@statquest Hey Josh! What's your preferred way of being supported? Would Paypal be better than Patreon?
@@heteromodal It's really up to you - whatever is more convenient, and whether or not you want to be a long-time supporter.
@@statquest I meant assuming i make a fixed sum donation - would you see more of it through PP or Patreon :)
@@heteromodal If it's a one-time donation, then PayPal is probably the best.
You are really doing a great, great job. This channel is the best way to learn a lot of the right and important things in a short time.
Thank you very much! And thank you for your support!! BAM! :)
Great, many thanks, very understandable and clear. It gave me a good intuition of how lasso regression shrinks some variables to zero.
Glad it was helpful!
The best visualization I've ever seen
Thank you! :)
You simply amaze me with each of your videos. The best part is that the way you explain stuff is so original and simple. I would really love it if you could also pen a book on AI/ML. It would be a bestseller, I reckon. Keep up the good work and keep enlightening us :)
Wow, thank you!
This aged very well (he has a book now lol)
Ridge regression! Good topic to cover as always!
Thanks! :)
God of explanation !!! 🙏🏻🙏🏻🙏🏻 Awesome stuff 🙂🙂
Thank you! 🙂
Very well explained. This one video cleared all my doubts, along with the practical calculations and visualization. Kudos for the great job.
Thanks! :)
Thanks to this video I finally understand why lasso and ridge have the so-called shrinking effect.
bam!
You should receive a Nobel Prize.
BAM! :)
Thank you so much
Blessing from Spain/Morocco
Thanks!
Amazing series on regularization (as usual)!
I just didn't quite understand why in ridge regression the weights/parameters never ever reach zero. I didn't give it much thought, but it didn't pop right out at me like it usually does in your videos lol. But again, great series!
Thanks!
Mind Blowing! Thank you for such valuable content
Thanks!
Your videos are very explanatory for studying this field...
Glad you think so!
this video.....you are my savior ❤️❤️❤️
bam!
Really enjoying these videos, Josh. Please keep 'em coming. Although I understand the distinction between correlation and interaction, I'd be interested to see how you might explain it in your inimitable fashion.
I'll put that on the to-do list.
@@statquest I'm pushing my luck here, but one more item, if I may: the difference between PCA and factor analysis. Often, these are distinguished in general terms (e.g., they are concerned with the total variance vs the shared variance, respectively), but I think that the best way to distinguish them would be to apply both methods to the same data set. I would be most interested in seeing that done.
It could be interesting to see the explanation for a multidimensional problem with more than 2 features, but very nice video!
Be grateful we've got such a nice guy.
Excellent Explanations 👍👍👍
Great work 👍👍👍
Thank you!
"Unfortunately, no one asked me" 😀.
Unique content. Hats off!
Also, it would be a great help if you could explain the following points:
1. How lasso regression excludes the useless variables.
2. Why ridge regression does a little better when most variables are useful.
Thanks,
Manjusha
Thanks!
L1 and L2 norm are very common phrases; if you aren't familiar with them, that is on you... They are much clearer language because they make it immediately clear that this is just a distance defined by whichever norm you are using in your space. Calling it squared or absolute value obscures the fact that it is a norm and not some other motivation.
Noted!
"I got ... calling a young StatQuest phone" 😁
(The ladies might love your work, fam.)
Bam!
But how do we pick the right penalty? As a college professor in econ, I find your lectures and dry humor perfect as I tool up in ML.
We use cross validation to test a bunch of different penalties and select the one that performs the best.
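To make that concrete, here is a minimal sketch of the idea, assuming scikit-learn (where the penalty strength lambda is called alpha) and made-up toy data:

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

# Hypothetical toy data: predict height from weight, plus some noise.
rng = np.random.default_rng(0)
weight = rng.uniform(40, 100, size=30)
height = 0.5 * weight + 100 + rng.normal(0, 5, size=30)
X = weight.reshape(-1, 1)

# Test a grid of penalty strengths and let 5-fold cross validation
# pick the one that predicts held-out data best.
alphas = np.logspace(-3, 3, 50)
ridge = RidgeCV(alphas=alphas, cv=5).fit(X, height)
lasso = LassoCV(alphas=alphas, cv=5).fit(X, height)

print("Best ridge penalty:", ridge.alpha_)
print("Best lasso penalty:", lasso.alpha_)
```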
That's a great video, Josh!
6:10 they should definitely have asked you 😂
BAM! :)
I first like your videos then watch them!
BAM!
Your videos are super intuitive... thanks a lot, sir
Thanks and welcome!
Really excellent video Josh. You consistently do a great job, and I appreciate it. Could you make a video showing the use of Ridge regression and especially Lasso regression in parameter selection? I had to do that once, and it is complicated. From your example it seems that using neither penalty gives you the best response. So, in what circumstances do you want to use the regression to improve your result? If you are using lasso regression to find the top 3 predictive parameters, how does this work? What are the dangers? How do you optimally use it? A complicated subject for sure! I'm sorry if this is covered in your videos on Lasso and Ridge regression individually, I am watching them next. I agree with your naming convention btw, squared and absolute-value penalty is MUCH more intuitive!
Watch the other regularization videos first. I cover some of what you would like to know about parameter selection in my video on Elastic-Net in R: ua-cam.com/video/ctmNq7FgbvI/v-deo.html
@@statquest I will check out those videos, thanks. I actually did use elastic net regularization. The whole issue is complex (for somebody without a decent stats background) because the framework of how everything works isn't covered very well AND simply anywhere that I could find, without going down several pretty deep rabbit holes. Some of the parameter selections that I remember were suggested depended on the assumption that the parameters were independent, which was NOT the case in my situation. I'm still not sure what the best approach would have been.
@@statquest As an additional note, I've always found that examples and exercises are even more important than theory, while theory is essential at times too. In many math classes concepts were laid out in formal and generalized glory, but I couldn't get the concept at all until I put hard numbers or examples to it. It's probably not the subject of your channel or in your interest, but I think some really hand-holding examples of using these concepts in some kaggle projects, or going through what some interesting papers did, would be a great way of bringing the theory and the real world together.
@@omnesomnibus2845 I do webinars that focus on the applied side of all these concepts. So we can learn the theory, and then practice it with real data.
@@statquest That's great!
Dude you're killing it!
Thank you! :)
Just an incredible explanation!
Thank you!
Awesome! And I should mention, actually: we are asking YOU!
Bam!
Love the They Might Be Giants-esque intro.
XTC vs Adam Ant! :)
"Unfortunately, no one asked me" 🤣🤣🤣
:)
Can you do a lecture on Kohonen Self Organising Maps?
Thanks a lot for these awesome videos, you deserve a million followers and a lot of credit :)
I just love these and they are KISS: so simple and understandable. I owe you a lot of thanks and credit :D
Thank you so much 😀!
Thank you for your work as always. It's AWESOME. I just have some questions. Why is there a kink in the SSR + penalty curve for Lasso Regression? Is it because we are adding lambda * |slope|, which is a piecewise-linear component? And does the curve for Ridge Regression stay a parabola because we are adding lambda * slope^2, which is a parabolic component?
I believe that is correct.
Hi. Great video! I had the same query as to why we don't see a similar kink in the Ridge Regression cost-function-vs-slope curve.
Thank you Josh
Any time!
This guy is amazing.... BAM!!!
Thanks! :)
You're god of studies
:)
Great videos. Very helpful. Thanks !
Glad you like them!
Great videos, thank you very much!!!
Glad you like them!
Explain stats to a 10-year-old?
Me: "You kid, Subscribe and drill through all the content of StatQuest with Josh Starmer"
:)
This is amazing - thanks for this
Thanks!
Thank you!!!!! I have a question: do you have a video on time series models or time series forecasting?? Please, please make those videos with your amazing explanations!!!! :):)
I don't have one yet, but it's on the to-do list. :)
StatQuest with Josh Starmer ohhh good to hear!!!! Thank you for the response! I will wait for the time series!!
Great master, thanks for your great effort
Thank you!
You saved my degree
bam!
Hi, thanks for the great videos. I don't understand why we get this "kink" with Lasso regression and not Ridge.
The "kink" comes from the absolute value function.
Great explanation!
Thanks!
THAT IS SOOOOOO GOOD MAN
Thanks! :)
All your videos are great, but the regularization ones have been a fantastic help. Was wondering if you were planning any on selective inference from lasso models? That would complete the set for me haha
Not yet!
Very well done as usual.
Thank you very much! :)
So you mean this StatQuest answered the question "why Lasso regression can remove useless variables and Ridge cannot", am I right?
yes
Amazing video as always, Josh! Just to be sure I got it correctly: the plot of the sum of squared residuals vs. the slope is a parabola in 2D. So when we do the same thing in 3D, i.e. with 2 parameters, does it represent the same bowl-shaped cost function that we try to minimise?
Yes
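For anyone curious, here is a small sketch of that idea with made-up toy data: it evaluates the sum of squared residuals over a grid of (intercept, slope) pairs, and the resulting surface is a bowl (a paraboloid) with a single minimum.

```python
import numpy as np

# Hypothetical toy data: a handful of (weight, height) points.
weight = np.array([1.0, 2.0, 3.0, 4.0])
height = np.array([1.1, 1.9, 3.2, 3.9])

# Evaluate the sum of squared residuals over a grid of (intercept, slope) pairs.
intercepts = np.linspace(-2, 2, 201)
slopes = np.linspace(-1, 3, 201)
I, S = np.meshgrid(intercepts, slopes)
ssr = ((height - (I[..., None] + S[..., None] * weight)) ** 2).sum(axis=-1)

# The surface has one global minimum (the bottom of the bowl), which for this
# toy data sits near intercept ~ 0.1 and slope ~ 1.
row, col = np.unravel_index(ssr.argmin(), ssr.shape)
print("Minimum near intercept =", round(I[row, col], 2), "slope =", round(S[row, col], 2))
```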
Hey Josh!! Can you please make a video on the K-modes algorithm for categorical variables (unsupervised learning), with an example, please?
Ridge Regression (L2-norm) never shrinks coefficients to zero, but Lasso Regression (L1-norm) may shrink coefficients to zero, and that's the reason Lasso can perform feature selection while Ridge can't.
bam! :)
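A quick, hedged illustration of that point, assuming scikit-learn and made-up toy data where only the first feature actually matters:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical toy data: y depends only on the first feature;
# the other four features are pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Lasso typically drives the useless coefficients to exactly 0,
# while ridge only shrinks them toward 0.
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Ridge coefficients:", np.round(ridge.coef_, 3))
```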
Thank you very much!
bam!
Underrated
Thanks!
Hi Josh, would you consider explaining the nuances of arithmetic, geometric, and harmonic means? I couldn't find it in the Quests.
I'll put it on the to-do list.
@@statquest thank you!
Clear and apt..
Thanks! :)
Excellent.
I have just one question. In the case of the L1 penalty, isn't the line with lambda equal to 40 (or slope 0) a bad line? I mean, with the blue line we were getting a better fit, since it didn't completely ignore weight when predicting height and its sum of squared residuals was smallest.
What time point, minutes and seconds, are you asking about?
@@statquest 7:16
@@usamahussain4461 Yes. For both L1 and L2 you need to test different values for lambda, including setting it to 0, to find the optimal value.
Thank you for helping us understand statistics! May I request a video on Dirichlet regression?
L2 = weight penalization (smooths out the loss curve and reduces overfitting, but a higher lambda can kill model training).
L1 = weight elimination (dragging weights to zero; useful for learnable ignoring of variables, and useful for high-dimensional data at times).
I have used both of these before with a similar mindset; even in deep learning I used a similar analogy to reason about what was happening. The visualisation really did help, so I just wanted to know: does this simplistic way of viewing the behaviour make sense??? Or am I missing something....
Hi Josh, great videos as always!!! I am wondering, are there any guidelines on how we should pick which one? Under what cases will ridge be better, and under what cases will lasso be better?
I talk about that a little bit in this video: ua-cam.com/video/ctmNq7FgbvI/v-deo.html