The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks

  • Published 21 May 2024
  • Stunning evidence for the hypothesis that neural networks work so well because their random initialization almost certainly contains a nearly optimal sub-network that is responsible for most of the final performance.
    arxiv.org/abs/1803.03635
    Abstract:
    Neural network pruning techniques can reduce the parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising accuracy. However, contemporary experience is that the sparse architectures produced by pruning are difficult to train from the start, which would similarly improve training performance.
    We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the "lottery ticket hypothesis:" dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations. The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.
    We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10-20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.
    Authors: Jonathan Frankle, Michael Carbin
    Links:
    YouTube: / yannickilcher
    Twitter: / ykilcher
    BitChute: www.bitchute.com/channel/yann...
    Minds: www.minds.com/ykilcher
  • Science & Technology
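The pruning procedure the abstract alludes to is iterative magnitude pruning with rewinding: train, drop the smallest-magnitude surviving weights, reset the survivors to their original initialization, and repeat. A minimal numpy sketch of just the masking arithmetic (the "trained" weights here are a random stand-in; real code would train the network between rounds):

```python
import numpy as np

rng = np.random.default_rng(0)

def magnitude_mask(weights, mask, prune_frac):
    """Drop the smallest-magnitude weights among the current survivors."""
    alive = np.abs(weights[mask])
    threshold = np.quantile(alive, prune_frac)
    return mask & (np.abs(weights) >= threshold)

# Toy stand-ins: real code would train the network between pruning rounds.
init_weights = rng.normal(size=1000)   # theta_0: the saved initialization
mask = np.ones(1000, dtype=bool)       # start fully dense

for _ in range(5):                     # iterative pruning rounds
    trained = init_weights + rng.normal(scale=0.1, size=1000)  # stand-in for trained weights
    mask = magnitude_mask(trained, mask, prune_frac=0.2)       # prune 20% of survivors
    weights = np.where(mask, init_weights, 0.0)                # rewind survivors to theta_0

print(mask.mean())  # fraction surviving after 5 rounds, roughly 0.8 ** 5
```

Five rounds at 20% per round leave roughly a third of the weights: the paper's "winning tickets" are this surviving mask applied to the original initialization.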

COMMENTS • 41

  • @JackofSome
    @JackofSome 4 years ago +16

    Yannic you're spoiling us. I hope you're able to keep your pace once (if???) this virus dies down a bit.

  • @MrNightLifeLover
    @MrNightLifeLover 4 years ago +4

    Very well explained, thanks! Please keep reviewing papers!

  • @jivan476
    @jivan476 2 years ago +13

    Could it be that the "winning tickets" can be identified after only a handful of training epochs instead of after a full training (e.g. 50 epochs or more)? If yes, it would mean that we can train for 3-4 epochs, prune 50% of the weights, then re-start the training on these weights only (with same initialisation as before), rinse and repeat. In theory it could allow faster training.

    • @wenhanzhou5826
      @wenhanzhou5826 1 year ago

      I think yes, because there is a paper that discusses how different weight initializations will create different local minima in the loss landscape for the same data. What you can do is start with a really big network and a large learning rate. The network will find one of the local minima quickly, and then you can just start pruning to get to the lowest point of that minimum.
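The early-rewind schedule proposed in this thread — train for a few epochs, prune half the weights, reset the survivors to their original initialization, repeat — might look like the following sketch. `pretend_train` is a hypothetical stand-in for actually running SGD:

```python
import numpy as np

rng = np.random.default_rng(1)

def pretend_train(weights, mask, epochs):
    """Hypothetical stand-in: real code would run SGD on the masked network."""
    return np.where(mask, weights + 0.01 * epochs * rng.normal(size=weights.shape), 0.0)

theta0 = rng.normal(size=500)    # saved initialization
mask = np.ones(500, dtype=bool)

# Early-rewind schedule: short training bursts instead of training to convergence.
for _ in range(3):
    trained = pretend_train(theta0, mask, epochs=4)  # only a few epochs
    thr = np.quantile(np.abs(trained[mask]), 0.5)    # prune 50% of survivors...
    mask &= np.abs(trained) >= thr                   # ...and rewind them to theta0

final = pretend_train(np.where(mask, theta0, 0.0), mask, epochs=50)  # full run on the ticket
print(mask.mean())  # about 0.5 ** 3 of the weights remain
```

Whether tickets found after 3-4 epochs match those found after full training is exactly the open question the comment raises; follow-up work on "rewinding" studies when in training the ticket becomes identifiable.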

  • @sayakpaul3152
    @sayakpaul3152 4 years ago +5

    Thanks for the wonderfully detailed walkthrough :)
    It might be worth mentioning that while training neural nets, it's also possible to train them in a pruning-aware fashion, with all the good stuff like pruning schedules, maximum achievable sparsity, etc.

  • @milkteamx7183
    @milkteamx7183 10 months ago +1

    Amazing explanation! Thank you so much! I just looked through your channel and am excited to find that you have many of these videos. Just subscribed!

  • @nbrpwng
    @nbrpwng 4 years ago +6

    This is actually reminiscent of how human brains develop from childhood to adulthood. At birth, humans have far more connections between their neurons and connections primarily die off as they learn and mature, much more than new neurons and connections are formed. And yet humans can still learn despite connection removal, and possibly because of it.

    • @gorgolyt
      @gorgolyt 4 years ago +2

      Great observation. That could simply be pruning, which doesn't decrease performance, and improves energy efficiency for the organism. But it could be something deeper and more important.

  • @wolfgangmitterbaur3942
    @wolfgangmitterbaur3942 2 years ago +1

    Thanks a lot for this video. It explains the essentials of the paper very well, and it is easy to follow for a non-native speaker, which is important as well!

  • @TimScarfe
    @TimScarfe 4 years ago +5

    Great video! Looking forward to having a discussion on our street talk podcast!

  • @freemind.d2714
    @freemind.d2714 3 years ago +1

    A very good hypothesis, it makes a lot of sense.

  • @jrkirby93
    @jrkirby93 4 years ago +10

    I love the idea of sparse neural nets. It feels kinda icky looking at these grossly overparameterized models that are often SOTA and thinking: "Right now, this is the best way of doing this."
    Pruning is a good technique for finding sparse neural nets. I thought this was a great paper when I first read it.
    But I've been working on my own research that approaches sparse NNs from the other direction. Instead of starting with fully connected layers and pruning, I start with extremely sparse layers and build them up, one edge at a time. It requires quite a different training procedure, though. Instead of back-propagation and gradient descent, I take advantage of the piecewise linear properties of ReLU to guarantee a fully piecewise linear neural net. This allows me to explicitly find the optimal next best edge - and its optimal value - in a single optimization step.
    I hope to finish implementing my research in the coming weeks, and would be happy to show you in more detail if you're interested.

    • @jepkofficial
      @jepkofficial 3 years ago +3

      What happened with this research?

    • @jrkirby93
      @jrkirby93 3 years ago +3

      @@jepkofficial Wow was that really 6 months ago? I still haven't finished implementing it. Hard to focus when working alone on independent research. Thanks for the reminder, I should return to that project and get it done.

    • @Leibniz_28
      @Leibniz_28 3 years ago +1

      How's the research going?

    • @laurenpinschannels
      @laurenpinschannels 2 years ago +3

      checking in on this again, on the off chance you didn't get distracted from this one :)

    • @Poof57
      @Poof57 2 years ago +1

      @@jrkirby93 woohoo another reminder here :P

  • @user-sh5hn2gn1k
    @user-sh5hn2gn1k 8 months ago

    Hi @Yannic Kilcher!
    Isn't there a possibility that the weights that are not close to zero (or are very small in magnitude) are the weights that should be pruned?
    Wouldn't it be a better idea to monitor the weights during the initial training (with the complete network) and prune based upon which weights have traveled much further during that training? 🤔
    Kindly enlighten me on this!
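One reading of the idea above — rank weights by how far they moved during an initial full-network training run, keep the biggest movers, and rewind them to their init — can be sketched as follows. The "trained" weights are a random stand-in, not the paper's method, which ranks by final magnitude instead:

```python
import numpy as np

rng = np.random.default_rng(7)

theta0 = rng.normal(size=200)                      # saved initialization
thetaT = theta0 + rng.normal(scale=0.5, size=200)  # stand-in for trained weights

movement = np.abs(thetaT - theta0)                 # how far each weight traveled
keep = movement >= np.quantile(movement, 0.8)      # keep the top-20% movers
ticket = np.where(keep, theta0, 0.0)               # rewind survivors to their init

print(keep.mean())  # about 0.2 of the weights survive
```

Swapping the magnitude criterion for a movement criterion like this is a one-line change in a pruning loop, which makes the comparison the commenter asks about cheap to run.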

  • @user-sh5hn2gn1k
    @user-sh5hn2gn1k 8 months ago

    Hi @Yannic Kilcher!
    It seems that the random initialization is very important before pruning, right? Because only the lucky (in terms of random initialization) weights are kept after pruning. If the random initialization is bad and there are no (or very few) lucky candidate weights, then what do we do in that case?
    Is there any particular random initialization recommended by the paper or by practice?
    There are some recommended random initialization methods like Glorot or He.

  • @user-sh5hn2gn1k
    @user-sh5hn2gn1k 8 months ago

    Hi @Yannic Kilcher!
    Can't we control the Random Initialization to keep almost every weight in the network (to get the most out of the original network)?
    Can't every weight win the lottery?

  • @thejll
    @thejll 2 months ago

    Very interesting. Does anyone know of software that allows doing this pruning?

  • @chesstanay
    @chesstanay 1 month ago

    Where can I read more about the related finding at 17:16?

  • @kevalan1042
    @kevalan1042 3 years ago

    did they check if those initial weights already tend to be relatively large?

  • @eugening
    @eugening 4 years ago

    Good discussion. The sound is a bit too soft.

  • @HappyManStudiosTV
    @HappyManStudiosTV 4 years ago +2

    hey! have you seen uber's follow up work? they basically say that the trick is just to prune weights that are going *towards* 0, not weights near 0

  • @joirnpettersen
    @joirnpettersen 4 years ago +6

    What if instead of pruning the weights, you assume the low-magnitude weights were initialized incorrectly, and re-train the dense network where the high-magnitude weights are kept at their initial values and the low-magnitude weights get new values?

    • @YannicKilcher
      @YannicKilcher  4 years ago +1

      I've never heard this idea. Nice, might be worth a try. I doubt you're gonna get a massive improvement, but it might be interesting to analyze whether you could find an even smaller winning hypothesis.

  • @vishwajitkumarvishnu3878
    @vishwajitkumarvishnu3878 4 years ago +4

    How do you read and understand any paper so fast? Does it come with practice, or is there a way to read the different sections? I want to do that. Uploading a video on how to read a paper might help :)

    • @YannicKilcher
      @YannicKilcher  4 years ago +13

      After you've read a bunch, the structure, the methods, and the ideas become repetitive across the entire field, and that speeds up the reading process a lot. I guess I can do a video on that, but it will be pretty straightforward and obvious.

    • @vishwajitkumarvishnu3878
      @vishwajitkumarvishnu3878 4 years ago +2

      @@YannicKilcher it'll be helpful if you make a video. Thanks a lot

  • @MrSb192
    @MrSb192 2 years ago +2

    Question: suppose we have a network N that we train up to a certain accuracy on some data, prune p% of the weights using some algorithm (one-shot, IMP, etc.), and revert the remaining weights to their initial values. Now, is there any way to ensure that the resulting pruned network will always perform better than the original when trained for the same number of iterations? I mean, is there any pruning algorithm that can guarantee finding a lottery ticket within the network every time we use it? Or is it just trial and error (which, I guess, is why the term lottery ticket is used)?

  • @araldjean-charles3924
    @araldjean-charles3924 8 months ago

    For the initial conditions that work, has anybody looked at how much wiggle room you have? Is there an epsilon-neighborhood of the initial state you can safely start from, and how small is epsilon?

  • @herp_derpingson
    @herp_derpingson 4 years ago +4

    Reminds me of dropout for some reason. Except we are throwing away the dropped out neurons.

    • @JungleEd17
      @JungleEd17 4 years ago +1

      I watched it 2x, but I think the connections are thrown out, not the neurons.
      What's interesting here though:
      1. The weights are what are important.
      2. Pruning involves throwing out both weights AND structure.
      Why not keep the structure but choose new weights? Perhaps it just randomly started at a plateau of a local minimum, or the randomization ended up creating redundancies. Jump the weights really far away and try again.

    • @fsxaircanada01
      @fsxaircanada01 4 years ago

      I think the motivation is that activations are not the biggest source of memory access and energy loss. If we can get rid of 90% of weights, then it could mean speed and energy improvements

  • @Blooper1980
    @Blooper1980 3 years ago +1

    Interesting.. Just need to take the stick out of your mouth next time.