This channel will explode soon - the quality of the content is too good, thank you!
I am studying for an MSc in Stats at a decent uni and I have to say that your channel is damn amazing. Good job there, the intuition that you manage to put in your videos is mind-blowing. You gained a subscriber :)
Thank you! Very happy to have you. More good stuff coming soon :)
And if you think it'd be helpful to your classmates, please share it with them 😁
I have been reading a lot on the bias-variance trade-off and have been using it for some time now. But the way you explained it with amazing visuals was mind-blowing and very intuitive to understand. Totally like your content and will keep waiting for more content like this in the future.
Excellent! More coming soon!
Your videos have the perfect balance between rigor and simplicity. Kudos to you! Keep making such great content. You're destined to be really successful. 🎉
I appreciate that! Hope you're right :)
This is probably the best take on the bias-variance trade-off I have ever seen on YouTube; the one from ritvikmath is a close second.
Please don't ever stop making videos like this, great stuff :)
Currently, the plan is to keep going - Thanks!
I love the humor at the end ("if you make the heroic move of checking my sources in the description"). I'm learning so much from you, thank you!
Wow this video truly opened my mind.
I have heard this term from ML people many, many times, but it remained vague until I watched this video!
Incredible work man! I’m truly looking forward to more content!
Thank you! More coming!
Great video!! The beginning as a creator on YT is pretty hard, so don't give up!
Thank you! I won’t, especially with the encouragement
The moment you flashed the decomposed equation, it clicked for me that this looks a lot like the Epistemic and Aleatoric Uncertainty components. P.S.: We need much more quality content like this on high-end academic literature, please keep going full throttle. You earned my subscription!
Thank you very much! I’m not familiar with those components, but I’m glad to hear you are seeing relationships I don’t :) and will do, I have 4-5 videos in the pipeline. New one every 3 weeks!
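For readers who want to see it written out: below is a standard form of the decomposition being discussed, assuming squared-error loss and data generated as y = f(x) + ε with noise variance σ². The aleatoric/epistemic labels are the commenter's loose mapping, not terminology from the video.

```latex
\mathbb{E}\!\left[\bigl(y - \hat{f}(x_0)\bigr)^2\right]
  = \underbrace{\sigma^2}_{\text{irreducible noise } (\approx\,\text{aleatoric})}
  + \underbrace{\bigl(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)]\bigr)^2\right]}_{\text{variance } (\approx\,\text{epistemic})}
```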
Until 7:22, I thought this was very theoretical, but as soon as you started the animations, everything made more sense and became clear. Truly incredible, amazing work. Lots of love from India, and please keep up the good work. You are the 3blue1brown of data science.
Thank you, encouragement like this means a lot. I’ll make sure to keep the good stuff coming :)
@@Mutual_Information I am a student at IIT Kanpur (one of the premier institutes of India), and I am currently taking a course,
Statistical Methods for Business Analytics.
Here is the link to the playlist (the lecture slides are in the description):
ua-cam.com/play/PLEDn2e5B93CZL-T8Srj_wz_5FIjLMMoW-.html
Just play any video in it and tell me whether you would be willing to learn from these videos. The way of teaching is lagging far behind in our country.
Hey DJ, the quality of your videos is mind-blowing, I subscribed even before watching the video to the end. I'm 100% sure your channel will blow up in the near future!
Thank you brother! I’m very happy to hear you like them and excited to have you as a sub. More to come!
Highly underrated video! Great work
Simply awesome explanation!
This is GOLD
Thank you very much! :)
Awesome graphical visualization.
Great video, thanks! I've never seen this explained in a regression context, only for classification in terms of VC dimension.
Glad you appreciate it. This is an old video; since making it I've learned to lighten up on the on-screen text, but I'm glad it still works for some.
clearly explained thanks!
this needs more views
Thank you, very clear video
This is sooooo good. Thanks a lot for sharing your knowledge in such an amazing explanation!
Beautifully done!
Amazing video!
Great video, thank you!
thanks, you're carrying my MSc
Great video!
Well explained! Thanks!!
really nice content and intuitions, liked it a lot!
Thank you!
fantastic visualizations
Awesome info!
Subscribed! I want to learn this stuff but I'm not sure where to start!
Well I may be biased, but I think this channel is a fine place to start :)
I can see a bright future for this channel. Good job man. Keep uploading ❤️
From United States Of India 🇮🇳😆
Will do!
This video clearly deserves a lot more views. Keep up the good work.
Thanks! Slowly things are improving. I think eventually more people will come to appreciate this one.
Have you seen recent results in deep learning that show larger neural networks have both lower bias and lower variance than smaller models? Past a point, more parameters give less variance, which is amazing! See “Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition” by Adlam et al.
I hadn't seen this before but now that I've read some of it, it's quite an interesting idea. Maybe it explains some of the weird behavior observed in the Grokking paper? I still am mystified by how these deep NNs sometimes defy the typical U shape of test error.. wild! Thanks for sharing
A masterpiece of yt
I'm glad you think so.. I was actually thinking about re-doing this one
Thank you! this is amazing content.
Subscribed. Would you mind sharing how you quickly make the visuals with the math equations? I'd love to use a similar resource for my students.
Hey Jad. I have plans to open source my code for this, but it’s not ready yet. I’ll make an announcement when it’s ready.
Excellent video! One question I have is: in practice, what is the relationship between EPE and the mean squared error (MSE) loss we usually optimize for in regression problems? Is EPE an expected value of MSE? Or is MSE only related to the bias term in EPE? Or are they completely unrelated?
Glad you enjoyed it! They are certainly related :) To make MSE and EPE comparable, the first thing we'd have to do is integrate EPE(x_0) over the domain of x, which we can call EPE, as you do. In that case, MSE is a biased estimate of EPE (to answer your question, it's an estimate of the whole of EPE - not any one of the terms). The MSE is going to be more optimistic/lower than EPE. This is because when fitting, you chose the parameters to make MSE low.. if you had many parameters, you could make MSE really low (overfitting!). But EPE measures how good your model is relative to p(x, y) - more parameters don't necessarily mean a better model! To get a better estimate, you could look at MSE out of sample. And that's what we do to determine those hypers. (A small numerical sketch of this point follows below this thread.)
@@Mutual_Information Thanks so much for taking the time to reply! I will need some time, and probably another pass of the video and putting things on paper, before I digest it all :-D but you have given me all the elements of the explanation. Keep up the good work, your videos are some of the best out there, you set the bar very high! :-)
@@bajdoub thanks! It means a lot. I’ll try to keep the standard high :)
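The point in the reply above is easy to check numerically. Here is a minimal sketch, assuming a made-up sine data-generating process, polynomial fits via numpy, and squared-error loss (none of these specifics come from the video): in-sample MSE keeps dropping as parameters are added, while MSE on a large fresh sample, which stands in for EPE, eventually gets worse.

```python
# Minimal sketch: in-sample MSE is an optimistic estimate of EPE.
# The sine function, noise level, sample sizes, and degrees are made-up choices.
import numpy as np

rng = np.random.default_rng(1)

def sample_data(n, noise_sd=0.3):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, noise_sd, n)
    return x, y

x_train, y_train = sample_data(30)
x_test, y_test = sample_data(10_000)  # large fresh sample, stands in for EPE

for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)                  # minimizes in-sample MSE
    mse_in = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)  # optimistic
    mse_out = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)   # ~ EPE
    print(f"degree {degree}: in-sample MSE = {mse_in:.3f}, out-of-sample (~EPE) = {mse_out:.3f}")
```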
Do you use the Manim Python Library for your animation?
No, though I should explore that one day. I use a personal library that leans heavily on Altair, a declarative Python plotting library built on Vega-Lite.
@@Mutual_Information Cool!
Excellent.
Super cool stuff.
I love the channel. I have a few topic requests... KL Divergence. Diffusion Networks. Policy Gradient RL models.
Policy Gradient RL methods will be out this summer! Diffusion.. that's a whole beast I don't have plans for right now. I'd need to learn quite a bit to get up to speed. KL Divergence, for sure I'll do that. Possibly later this year.
@@Mutual_Information Diffusion. Did you see DALL-E 2? It's a milestone. I can't wait for the music and videos a system like this will create.
It feels like you're reading out of that textbook on the table behind you
The whole channel started b/c I actually wanted to write a book on ML.. but then I figured few people would read it, so I might as well communicate the same ideas on a YT channel, where they had a better chance. Literally, I'd say "It's a textbook in video format". But then I realized that can make the videos very dense and a little dry. So I've evolved a bit since then.
So can I understand bias and variance in terms of a sampling distribution from which my specific model is drawn? If the model is very complex, the mean of this sampling distribution will be quite close to the true value, but since the variance of this distribution is so large, it is unlikely that my specific fitted model represents the true value (though not impossible?). And if the model is very low in complexity, the variance of the sampling distribution will be quite small, but since the expected value of the sampling distribution is far from the true value, it is again very unlikely that my specific model represents the true value?
That sounds about right. Think of it this way. There is some true data generating mechanism that is unknown to your model. A complex model is more likely to be able to capture it. In doing so, if you re-sample from the true data generating process.. fit the model.. and look at the average of those fits.. then those will equal the average of the true distribution. This is what I mean when I say "The complex model can 'capture' the true data generating mechanism". Aka, the model is low bias. However, the cost of such flexibility is that the model produces very different ("high variance") fits over different re-samplings of the data.
Does that make sense?
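To make the "sampling distribution of fitted models" picture above concrete, here is a minimal simulation sketch (a toy setup, not code from the video): refit a simple and a flexible polynomial on many resampled datasets and look at the center (bias) and spread (variance) of their predictions at a single test point.

```python
# Minimal sketch of the bias^2 and variance of a model's prediction at one point x0,
# estimated by repeatedly resampling data from a made-up "true" process.
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)  # the "unknown" data-generating mechanism

def simulate(degree, n_datasets=500, n_points=20, noise_sd=0.3, x0=0.5):
    """Refit a polynomial of the given degree on many resampled datasets and
    return the bias^2 and variance of its prediction at x0."""
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        x = rng.uniform(0, 1, n_points)
        y = true_f(x) + rng.normal(0, noise_sd, n_points)
        coefs = np.polyfit(x, y, degree)        # fit on this resampled dataset
        preds[i] = np.polyval(coefs, x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2  # center of the sampling dist. vs truth
    variance = preds.var()                      # spread of the sampling dist.
    return bias_sq, variance

for degree in (1, 3, 9):
    b2, v = simulate(degree)
    print(f"degree {degree}: bias^2 = {b2:.4f}, variance = {v:.4f}")
```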
awesome
Woo!
Haha thank you sister
@@Mutual_Information you're welcome brother. How are you? How was your day?
Man, please more pictures..
Please provide subtitles for foreign language speakers!
I have a list of outstanding changes I need to make and this is one of them. I’ll make it a priority! Thanks for the feedback!
😮😮😯❤️
Great explanation! Thanks so much.