Nice paper about regularization.
Such an elegant solution for disentangling the manifolds in the hidden states. Most of the networks I have seen basically learn only in the last layers, while the backbone just extracts kind of random features.
So many papers in rapid succession. This guy is on fire!
\m/
or I'm just procrastinating on doing the dishes :p
It's 9 months later and based on the rate of new videos I'm starting to worry you'll never get around to those dishes
@@valthorhalldorsson9300 Sooner than he gets to those dishes, a robot arm will be doing them.
If the bottleneck layer makes the data linearly separable, it may as well just be the last hidden layer. In that case this seems to be a technique for making the last hidden representation not just linearly separable but well spaced. And I think it would push the softmax inputs toward a region where the softmax is approximately linear.
This is so coooool. It’s like saying here’s a cat, here’s a dog, here’s a mix of both.
It's exactly that. Basically it's the extension of MixUp data augmentation to the whole NN. Each layer has an input and an output, and each layer individually learns the best representation. Now we treat the latent representations from the previous layer (e.g. cat, dog) as our input and smooth those accordingly.
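To make that concrete, here is a rough sketch of one Manifold Mixup training step (my own toy code, not the paper's implementation; the `model.blocks` / `model.classifier` split is just an assumption): pick a random layer, mix the hidden states of randomly paired examples with a Beta-sampled lambda, and mix the one-hot labels with the same lambda.

```python
import numpy as np
import torch
import torch.nn.functional as F

def manifold_mixup_step(model, x, y, num_classes, alpha=2.0):
    """Compute the Manifold Mixup loss for one batch.

    Assumes a hypothetical `model` exposing `model.blocks` (a list of layers)
    and `model.classifier` (the final linear head).
    """
    k = np.random.randint(len(model.blocks))   # which representation to mix (0 = raw input, i.e. plain MixUp)
    lam = float(np.random.beta(alpha, alpha))  # mixing coefficient
    perm = torch.randperm(x.size(0))           # random pairing of examples within the batch

    h = x
    for i, block in enumerate(model.blocks):
        if i == k:                             # mix the hidden states entering this block
            h = lam * h + (1 - lam) * h[perm]
        h = block(h)

    logits = model.classifier(h)
    y_soft = F.one_hot(y, num_classes).float()
    y_soft = lam * y_soft + (1 - lam) * y_soft[perm]               # mix the labels with the same lambda
    return -(y_soft * F.log_softmax(logits, dim=1)).sum(1).mean()  # soft-label cross-entropy
```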
I agree with your point that not every layer (especially the lower layers) will or should be linearly separable.
However, I think the objective of Manifold Mixup is to act more as a regularization penalty: a given layer should be non-linearly separable only insofar as the benefits (to accuracy) outweigh the mixup penalty. The mixup adds a bias towards linearity, but not a strict requirement.
Like all regularization methods, there will probably have to be a lot more fine-tuning and testing before we know if, when, and how it gives the right bias-variance trade-off.
Thanks! Your paper explanation is really awesome!!!
Great explanation, thanks!
Wow! Super interesting paper and great insights.
Great video, thanks a million!
amazing technique!
I like the video, but it's at 256 likes right now so I can't disturb the balance, sorry!
Now you can push towards 512 😁
As I understand it, this technique is also good for NN pruning.
Can you elaborate?
Thank you! :)
Nice
This video is another great Colab candidate: colab.research.google.com/drive/1qUDe3ENm3fnxND7iibyEF1Ixcw7nu4mK . Thanks again Yannic! Your video inspired me to create a Colab IPython notebook that tests out this architecture. I love the concept! It was a pain to implement using TensorFlow Keras layers, but it does appear to help. I also decided that instead of just comparing it to a vanilla classifier, we could compare it to the "worst" classifier from your other video about "Focusing on the Biggest Losers". Have a great weekend!
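For anyone who wants to try it without opening the notebook, here is a bare-bones Keras-style sketch of the idea (my own toy code, not the linked Colab; the `encoder`/`head` split and the layer sizes are just placeholders): mix the hidden features and the one-hot labels with the same Beta-sampled lambda inside a custom train step.

```python
import numpy as np
import tensorflow as tf

# Hypothetical split of the network: `encoder` up to the layer we mix at, `head` for the rest.
encoder = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
])
head = tf.keras.layers.Dense(10)  # logits for 10 classes

optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

def train_step(x, y_onehot, alpha=2.0):
    lam = float(np.random.beta(alpha, alpha))               # mixing coefficient
    perm = tf.random.shuffle(tf.range(tf.shape(x)[0]))      # random pairing within the batch
    with tf.GradientTape() as tape:
        h = encoder(x, training=True)
        h_mix = lam * h + (1 - lam) * tf.gather(h, perm)                # mix hidden features
        y_mix = lam * y_onehot + (1 - lam) * tf.gather(y_onehot, perm)  # mix labels the same way
        loss = loss_fn(y_mix, head(h_mix))
    variables = encoder.trainable_variables + head.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```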
So many new hyperparameters...
I am not an expert, and I have not read the paper carefully, but this method seems more like a fancy data augmentation method than a regularization method.
Also, there is something to be said about the spiral example: I personally think that batch norm does a very good job there. It only looks "not good enough" because we humans are biased; we "know" the true representation from experience and by guessing the intentions of whoever made the dataset :)
Good point. I would say it's somewhere in between. You sort of create new 'averaged' samples to teach the model to be 'unsure' sometimes, and this way the model converges to a more stable representation.
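As a toy illustration of those 'averaged' samples (made-up numbers; the shapes and lambda are just placeholders): blending a cat and a dog image with lambda = 0.7 gives a mixed input and a soft [0.7, 0.3] target.

```python
import numpy as np

lam = 0.7                                  # toy mixing coefficient
x_cat = np.random.rand(32, 32, 3)          # stand-in "cat" image
x_dog = np.random.rand(32, 32, 3)          # stand-in "dog" image
y_cat = np.array([1.0, 0.0])               # one-hot target: [cat, dog]
y_dog = np.array([0.0, 1.0])

x_mix = lam * x_cat + (1 - lam) * x_dog    # the 'averaged' sample
y_mix = lam * y_cat + (1 - lam) * y_dog    # soft target [0.7, 0.3]: the model is trained to be 'unsure'
```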
@@levikok1810 that analogy reminds me of DINO and CutMix