I love these historical papers
A 7-year-old paper is called historical. This is what should be called progress.
For somebody who left machine learning 3 years ago to move to software development, these videos are pure gold for catching up with the cutting edge and knowing whether what I learned back then is still relevant today.
Going the other direction here: 30 years of coding, but now I won't touch any coding that isn't machine learning :D His videos are certainly gold.
The paper that everyone cites in their introduction. Thanks for sharing!!
What I love most about these is when he mentions what is being used now compared to the paper, plus all the little gems of information he adds here and there on top of the paper.
26:20 The VGG (2014) paper "Very deep convolutional networks for large-scale image recognition" mentions explicitly that they tried LRN and found it's not worth the trouble:
> First, we note that using local response normalisation (A-LRN network) does not improve on the model A without any normalisation layers. We thus do not employ normalisation in the deeper architectures (B-E).
Again, in "Batch normalization: Accelerating deep network training by reducing internal covariate shift" (2015):
> Remove Local Response Normalization. While Inception and other networks (Srivastava et al., 2014) benefit from it, we found that with Batch Normalization it is not necessary.
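For readers who have not seen it, here is the LRN scheme as defined in the AlexNet paper (added for reference, not part of the original comment): each activation is normalized by the summed squared activity of n adjacent channels at the same spatial position,

$$ b^{i}_{x,y} = a^{i}_{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a^{j}_{x,y}\big)^{2} \Big)^{\beta} $$

with k = 2, n = 5, alpha = 1e-4, beta = 0.75 in AlexNet (N is the number of channels). This is the layer the quotes above say later architectures dropped.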
30:20 SqueezeNet and LeNet talked about this: stacking three 3x3 convs covers the same receptive field as a single 7x7 conv, but with far fewer parameters and much less compute.
31:37 Global Max Pool was introduced later, I think; it helps when the dimensions of the image are variable. 32:00 Global max pooling also means that the dense params can be dialed back a lot (see the small sketch below).
66,406 citations as of now. Crazy. Good paper, keep it coming.
Can you do one on the Adam optimizer? I think it is so ubiquitous that people don't even cite it anymore XD
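A minimal PyTorch sketch of the global-pooling point above (my own illustration, not from the video; the layer and channel sizes are invented): a global-pooling head works for any input size and needs far fewer dense parameters than a flatten head.

```python
import torch
import torch.nn as nn

# Toy feature extractor; channel counts are made up for this sketch.
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
)

# Flatten head: tied to one fixed input size (assumes 32x32 feature maps here).
flatten_head = nn.Sequential(nn.Flatten(), nn.Linear(128 * 32 * 32, 10))

# Global-pooling head: input-size agnostic, and the dense layer shrinks to 128 -> 10.
pool_head = nn.Sequential(nn.AdaptiveMaxPool2d(1), nn.Flatten(), nn.Linear(128, 10))

print(sum(p.numel() for p in flatten_head.parameters()))  # ~1.3M parameters
print(sum(p.numel() for p in pool_head.parameters()))     # ~1.3K parameters

# The pooled head handles variable image dimensions without any changes.
for h, w in [(32, 32), (48, 64)]:
    x = torch.randn(1, 3, h, w)
    print(pool_head(features(x)).shape)  # torch.Size([1, 10]) for both sizes
```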
Thanks a ton for this series! And the clarification of which techniques stayed and which are gone is highly appreciated!
I genuinely wish I had a teacher like Yannic five years ago
I love the classic paper series. I do not have a master's or a PhD, but I want to learn deep learning, and this series helps us cover the basics.
I don't know why I am paying for college... Your videos are amazing!
Thank you so much for spending the time to walk through the paper. The world's a better place because of folks like you!
Wow! This is such a great idea (classic paper series)! Love it :D
4:20 One of the main reasons it's not used as much is that BatchNorm has much the same overall effect, but works better. That being said, I still find dropout quite effective at reducing overfitting, especially with smaller sample sizes and when used on the dense layers at the end.
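A small sketch of that pattern (my own hedged illustration, not anything from the video): batch norm in the conv stack, dropout only on the dense layers at the end.

```python
import torch.nn as nn

model = nn.Sequential(
    # Conv stack regularized by batch normalization.
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4),
    nn.Flatten(),
    # Dropout only in the dense classifier head, where overfitting bites hardest.
    nn.Dropout(p=0.5),
    nn.Linear(64 * 4 * 4, 256), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)
model.train()  # dropout is active in training; model.eval() disables it at test time
```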
In the beginning was the AlexNet, and the AlexNet was with DNN, and the AlexNet was DNN.
Interesting! You have sparked my interest in learning about the present state of dropout layers.
I was told that this guy is on a break?!
He is so productive even while on a break.
Apparently not from a reliable source ;)
He is; the classic paper videos are usually pre-recorded.
enjoyed the way you presented it, thank you
Wonderfully presented, thank you! :) I look forward to taking the rest of the journey through this subject with you and your channel. :)
You should make more classics... I just love knowing about them.
Thanks for the explanation. It helped me understand and learn a lot of things I couldn't have if I had read the paper by myself.
Love your channel. How can someone stay up to date with new advancements in ML/DL? It's counterintuitive to me that larger models overfit less.
To prevent overfitting, the loss landscape should be as smooth as possible so that the model can generalize better. If I remember correctly, residual connections + batch normalization help smooth the loss landscape (see the small sketch below), maybe because they allow us to build even deeper models.
My bet for the future is actually on Bayesian networks, which use a learnable version of Gaussian dropout.
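For concreteness (my own hedged illustration, not from the comment or the paper): a basic residual block with batch normalization, the combination credited above with smoothing the loss landscape.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """3x3 conv -> BN -> ReLU -> 3x3 conv -> BN, with an identity skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # the skip term keeps gradients flowing in deep stacks

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```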
Thank you for a beautiful explanation with a retrospective overview.
At test time, the crops and reflections were also used by another paper called OverFeat, which crushed the '13 ImageNet detection challenge, I suppose. (46:06)
You need a microphone that's not so sensitive that it records every tiny little sound. Actually, it could be automatic gain control (AGC) increasing the amplification when you are not speaking, so every swallow gets recorded louder than you want. Recommendation: turn off AGC if you can.
I love your channel. I'm one of the first to watch your videos and smash the like button.
A lot of the stuff mentioned, like dropout, went out of fashion when batch normalization was introduced. I still use it in denoisers, though.
Please, more videos like this one!
Thanks for this great video! I really enjoyed.
Amazing, can you do their book as well :D
Does anyone know where the statement at 7:56 that large models don't overfit comes from?
search for "deep learning double descent"
This is helping me with my literature review assignment, hahaha 😂 thank you!
Hi, I have been following your channel since DETR. Can you make a video explaining DeepSORT?
TIA
Great video! Thank you!
So you're saying people don't care about parameter size?
The link doesn't work properly because the download doesn't start.
Do you have a powerpoint presentation on this paper?
Great channel, subscribed, liked!
I can nearly guarantee that their net did not really overfit. I've trained many nets, and past lab mates have increased the number of parameters, and the test loss never keeps increasing. Feel free to check my Google Scholar for paper examples of what I mean, but I am sure their nets did fine, especially since they didn't quantify whether their net overfit.
Regardless, fun paper!
Can AlexNet detect multiple objects in a single frame?
So if dropout is not being used, what is being used now?
My bet for the future is on Bayesian networks, which use a sort of learnable version of Gaussian dropout.
Global Average Pooling
Does anyone know what he's using to annotate over the papers?
OneNote
@@YannicKilcher Thanks
Wow. It's a real classic.
Yannic is gonna be in the Stone Age soon. Can't wait for the invention of the wheel.
38:33 Why don't people use dropout anymore?
It's usually fine without.
👍
"Action potential" is the name of the spike/signal itself, not the name of the activation threshold. It is a really dumb name :(
Why don't people care about overfitting now?