Wow!! I mean, how do you keep up and sustain this rate of reading and producing these videos, man! I came across your channel just about a week and a half ago (while searching for explanations of DETR) and have since had a tough time just sifting through your very interesting videos, and picked up quite a bit along the way! You are indeed a role model! :) Thanks a lot for what you are doing!
_/\_
Amazing! The secret is slowly revealing itself.
Wow, Video on this paper is out already? That was quick!
Really well explained, keep it up 👍
Amazing video presenting an amazing paper. AFAIK sparsity currently doesn't translate into much real-world performance gain (on GPUs and TPUs), but that should start changing soon. 10x more layers with 90% of the weights pruned using SynFlow should greatly outperform while having a similar (final) parameter count.
Awesome! I was waiting for today’s paper 🤣
Simply Superb. Great work!
12:18 It's interesting how the graphs are sigmoid-shaped. I would expect the graph to start out flat-ish because of redundant connections and then fall linearly, but it seems to flatten out near the end. Basically, it retains accuracy even as most of the parameters get pruned. It would be interesting to see what happens if we start adding parameters back and training this near-critical model again. Would it trace the same path back upward? Or would it do better? Or worse?
.
38:38 Intuitively, this is equivalent to asking a node, "How much of the final output do you contribute to?". However, since we are taking absolute values, say there is a node computing 1x + 9999 that passes an activation of +9999 to the next node, which computes 1x - 9999. This score would rate both of these nodes highly, yet in reality they always negate each other and never contribute to the final output (see the sketch at the end of this comment). Then again, checking interactions between neurons is practically intractable.
.
I really liked the dataless approach in this paper. I think it will inspire more papers to try similar things. Good stuff.
.
43:30 IDK. A network initialized with all weights equal to 1 would make this algorithm go crazy with excitement. Let's see whether we end up with a new family of network initialization policies.
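A minimal sketch of the cancellation point above, in plain PyTorch (the toy network and the number 9999 are made up for illustration, they are not from the paper): the signed network below computes exactly zero for every input because its two paths cancel, yet a SynFlow-style score computed on |W| with an all-ones input rates every weight on both paths highly.

```python
import torch

c = 9999.0
# Toy two-path net: output = 1*h1 - 1*h2 with h1 = h2 = c*x,
# so the signed network outputs exactly 0 for every input x.
W1 = torch.tensor([[c], [c]], requires_grad=True)     # input -> 2 hidden units
W2 = torch.tensor([[1.0, -1.0]], requires_grad=True)  # hidden -> output

x = torch.randn(1, 1)
print((W2 @ (W1 @ x)).item())  # 0.0: the two paths always cancel

# A SynFlow-style flow uses |W| and an all-ones input, so nothing cancels:
R = (W2.abs() @ (W1.abs() @ torch.ones(1, 1))).sum()
R.backward()
print((W1.grad * W1).abs())  # both hidden weights score ~9999
print((W2.grad * W2).abs())  # both output weights score ~9999
```

Detecting this kind of sign-based interaction in general seems to require looking at actual activations, which is exactly what the dataless setting gives up.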
- yea indeed, I think there are dozens of questions like this that could all turn out either way and nobody knows
- maybe here one can count on the gaussian initializations to basically never make this happen, because it would be quite the coincidence
- I guess we're both waiting for the first generation of "adversarial initializations" to screw with the pruning methods :D
This could help identify motifs for Synthetic Petri Dish.
Exceptional work once again!
Yannic, you need to tell us how you manage your time and how you manage to put out a video daily. For me, just reading a paper takes 2-3 days to understand it fully, let alone explain it to others.
He's been publishing papers in ML since at least 2016, so he's had a bunch of practice. Plus he's lightning Yannic :D
Once you get a bit of practice, it gets easier :)
@@YannicKilcher Building the cache, so to speak :)
duude, Do you ever take a break? haha, love it though!
What if you had a constraint for the pruning to keep at least some percentage (say 10%) of the connections in each layer, to prevent layer collapse? (A rough sketch of what I mean follows this comment.)
Edit:
here's the answer 33:50
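A minimal sketch of that kind of per-layer floor on top of global score ranking, assuming NumPy; the function, the 10% default, and the fallback rule are hypothetical illustrations of the comment above, not how the paper itself avoids collapse (its answer, per the edit, is the iterative conservation-based scoring discussed at 33:50).

```python
import numpy as np

def global_prune_with_floor(layer_scores, keep_fraction, layer_floor=0.10):
    """Hypothetical guard against layer collapse: threshold scores globally,
    but never let a layer fall below `layer_floor` of its connections."""
    flat = np.concatenate([s.ravel() for s in layer_scores])
    k = max(1, int(len(flat) * keep_fraction))
    global_thresh = np.sort(flat)[::-1][k - 1]          # k-th largest score overall

    masks = []
    for s in layer_scores:
        mask = s >= global_thresh
        min_keep = max(1, int(layer_floor * s.size))
        if mask.sum() < min_keep:
            # Fall back to keeping this layer's top `min_keep` connections.
            local_thresh = np.sort(s.ravel())[::-1][min_keep - 1]
            mask = s >= local_thresh
        masks.append(mask)
    return masks

# Example: three layers of random saliency scores, keep 1% globally.
scores = [np.random.rand(64, 32), np.random.rand(32, 32), np.random.rand(32, 10)]
masks = global_prune_with_floor(scores, keep_fraction=0.01)
print([int(m.sum()) for m in masks])  # each layer keeps at least ~10% of its weights
```

The trade-off is that the floored layers keep connections a pure global ranking would have pruned, which is presumably part of why the paper goes the conservation route instead.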
I fully agree with your statement about models being inflated; compression will still have its place.
An unavoidable thought is: why not inject additional layers of weights into the weight matrix Q as required, so that you increase the spatial resolution/search space to capture the extra detail required for a more accurate model? The other question is the best initialization pattern for the Q-matrix perturbation weights. I don't believe random weights are wrong, they work, but I don't think they are the best.
One would want an even distribution of perturbations, to give backpropagation the best possible chance of enhancing the weights according to the forward propagation.
Clearly, if we go to inflationary models, then one should also add the inflation perturbations more intelligently, not randomly.
The thing for me is that one might have an iterative algorithm that builds the structure more intelligently by swapping out sections of weights for different weight structures, which would allow different types of mathematical equations to be represented. A^b, or something like X^2 + Y^2, could be captured here with a linear expansion formula. At one point I searched the internet for an X^B formula for computation and didn't find it, to see how to construct a set of weights that models that equation in a NN. I did discover it a few days back, more or less, from my understanding of a probability book, by the looks of it.
Strangely enough, if you inflate models with different sections of weights that resemble different mathematical relationships, then we get more insight into the mathematics going on, as we could hold a parallel mathematical construction for the Q matrix (for however long it remains that way). The next step, I guess, is how one improves the abstraction of the matrices for hardware computation: lots of empty spaces and missing weights, the ability to restructure the Q matrix without accuracy loss into an equivalent form, merging and splitting layers into multiple layers, so that one has dense matrices of weights for hardware computation. With regards to the Q matrix with perturbations, I think for better matching it should be a matrix with incremental symmetry around the F0/FN upper and lower bounds and F(N/2), so basically values that increase incrementally by small, equal amounts; this means you won't have random weight lines, and in my mind you should get better patterns than with values jumping randomly all around the Q matrix of weights.
That all seems predicated on these mathematical functions actually being the best kind of representations for these kinds of problems, which is a very strong hypothesis
Can we just add the new loss function to the original loss function of the model and train the original network with this pruning cost (and the one-shot data) included? I.e., prune the model while training it.
Yea I guess that's a possibility
Yannic, thanks for making these videos, they really help a lot. :D I wanted to know: these pruning techniques are not going to improve the FLOPs of my model, right? Because we are just masking the weights in order to prune. Or is there another way to reduce FLOPs?
Yes, for now that's the case. But you could apply the same methods to induce block-sparsity, which you could then leverage to get to faster neural networks.
@@YannicKilcher I'll look into it.
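To make the exchange above concrete, here is a minimal sketch (assuming PyTorch; the layer size and the ~1% keep rate are arbitrary) of why a mask alone doesn't reduce FLOPs: the pruned layer is still stored densely and still runs the full dense matmul.

```python
import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024, bias=False)
mask = (torch.rand_like(layer.weight) > 0.99).float()  # keep roughly 1% of weights

with torch.no_grad():
    layer.weight.mul_(mask)  # "pruned", but the zeros are still stored and multiplied

x = torch.randn(32, 1024)
y = layer(x)  # dense GEMM: ~32 * 1024 * 1024 multiply-adds, mask or no mask
```

To actually see speedups, the sparsity has to be something the kernels can exploit, e.g., removing whole neurons/channels or blocks of weights (structured sparsity), which is what the block-sparsity suggestion above amounts to.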
I was wondering why they use the Hadamard product instead of the dR/d(theta) score alone as a metric to evaluate a parameter's effect on the loss. I understand that this new score won't obey the conservation theorem, but if the prime issue was avoiding layer collapse, could we just throw out the conservation part and use this score in a way that prevents layer collapse (e.g., by adding a contingency to the algorithm that avoids it, maybe using a local masking technique, which I know is sub-par in performance)? Has this been done? Any thoughts?
True. I think the parameter magnitude itself might carry some meaningful information. Like, when it's close to 0, a large gradient on it shouldn't "mean" as much. But that's just my intuition
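For reference, here are the quantities being discussed, written in LaTeX as I recall them from the paper: the synaptic flow objective R_SF is evaluated with absolute-valued weights (equivalently, on an all-ones input), and the SynFlow score is the Hadamard product of its gradient with the parameters themselves.

```latex
R_{\mathrm{SF}} \;=\; \mathbb{1}^{\top}\!\left(\prod_{l=1}^{L}\lvert W^{[l]}\rvert\right)\!\mathbb{1},
\qquad
S(\theta) \;=\; \frac{\partial R_{\mathrm{SF}}}{\partial \theta}\,\odot\,\theta .
```

Without the \odot \theta factor, a near-zero weight with a large gradient would score as high as a large weight with the same gradient, and the neuron-wise conservation property the paper proves for scores of this form would no longer apply; whether a separate collapse-avoiding heuristic could make up for that is, as far as I know, untested.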
How can this work for architectures like ResNet, which have skip connections across layers, without looking at the data? They show results in the paper for ResNet, but somehow that doesn't make sense to me. Does anybody know what I am missing?
You might take a look at their code at github.com/ganguli-lab/Synaptic-Flow . Why do you think that using the training-data would be required for dealing with the shortcut connections?
I think the same analysis still applies, though you're right the interesting part in ResNets is their skip connection, so technically they never have to deal with layer collapse.
The problem looks so much like a max-flow or min-cost-flow problem.
I wonder if, instead of the all-ones datapoint, you could use a normalized average of all your training datapoints.
It works for images with a CNN as the first layer. How about an MLP with different features as inputs? It will be problematic for the first layer, i.e., feature selection, since the method never sees the data and has no idea which features are more important.
I guess we'll have to try. But you'd leave all neurons there, just prune the connections.
the algorithm essentially improves the gradient of a network to make it train better. It does not solve everything.
SNIP SNAP SNIP SNAP SNIP SNAP! Do you know the toll that iterative pruning has on a neural network?
Must be horrible :D
Priceless comment.
Is it possible to iteratively grow the network rather than pruning it, or does that collapse to be essentially the same thing?
Oh just heard your similar comments right at the end of the video. Cool.
Phenomenal.
Would you be able to step through the code of one of these papers?
sure, but this one is just like 3 lines, have a look
@@YannicKilcher Fair point. Would look forward to that kind of thing on other papers.
Thanks for the incredibly insightful content!
So -- if I understood this correctly -- you would in principle be able to
1) take a huge model (which normally requires an entire datacenter to train),
2) prune it down to some reasonable size -- and presumably prune it on a relatively small computer, since the method does not use any data in the pruning process, and
3) finally train the smaller pruned model to high accuracy (or SOTA given the network size) -- presumably also on a relatively small computer
I think that would be correct, if training on a CPU. I don't know how current GPUs handle pruned networks or how much it benefits them. GPUs may need some additional hardware features to really benefit from using a pruned network.
This method applied in reverse could also be quite interesting, i.e. for model search. Assuming the accuracy of a pruned network is reflective of the accuracy of the full network, then you could use Synflow to train and test various pruned models, before scaling up the best performing model and training that... but yes, new hardware might need to be developed.
Nearly identical accuracy with 1% or even 0.1% of the weights at initialization? That is fascinating. What seems a bit mind-bending to me is that this pruning can be done data-independently - only by feeding 1s through the network. Crazy - maybe the future is poised to be sparse, and fully-connected initialization will become a thing of the past. ;-)
If layer collapse (aka layer-dependent average synaptic saliency score magnitude) is the problem: why not perform pruning layer-wise in general? How would the baseline methods perform if the pruning selection were done for each layer individually instead of sorting the scores globally for the whole network?
in the paper they claim that layer-wise pruning gives much worse results
@@YannicKilcher I see, they reference "What is the State of Neural Network Pruning?" arxiv.org/abs/2003.03033 ... maybe layer-wise (or fan-in/fan-out dependent) normalization of the saliency scores might be a way to compensate for the magnitude differences. ;-) Btw, the "linearization" trick they use for ReLUs (W.abs() and then passing a vector of 1s) is nice ... for other activation functions this will probably require a bit more work.
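For anyone curious what that linearization trick looks like in code, here is a minimal sketch in the spirit of the official Synaptic-Flow repository (assuming a plain PyTorch feed-forward net without batch norm; the function names are my own):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def linearize(model):
    # Replace every parameter by its absolute value, remembering the signs.
    signs = {}
    for name, param in model.state_dict().items():
        signs[name] = torch.sign(param)
        param.abs_()
    return signs

@torch.no_grad()
def restore(model, signs):
    for name, param in model.state_dict().items():
        param.mul_(signs[name])

def synflow_scores(model, input_shape):
    signs = linearize(model)
    # A single all-ones "datapoint": with |W|, a ReLU net acts as a positive
    # linear map, so the summed output is the total synaptic flow R.
    x = torch.ones(1, *input_shape)
    R = model(x).sum()
    R.backward()
    # Synaptic saliency per parameter: |dR/dtheta * theta|.
    scores = {n: (p.grad * p).detach().abs() for n, p in model.named_parameters()}
    model.zero_grad()
    restore(model, signs)
    return scores

# Example: scores for a tiny MLP on 784-dimensional inputs.
mlp = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
scores = synflow_scores(mlp, (784,))
print({n: tuple(s.shape) for n, s in scores.items()})
```

As noted above, for activations that are not positively homogeneous like ReLU, taking |W| and feeding ones no longer gives such a clean "flow" interpretation, so some extra care would be needed.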
Not sure what this thing learns, the dataset or the architecture...
Great video once again! Just one question: do you have a goal of making at least one video a day? I found this channel while searching whether anyone had had the idea of making "reading a paper" videos. Now I have another idea. Will implement it soon and share it here. :)
I try to make one every day, but I'll probably fail at some point
I wonder what pruning method the human brain uses. At birth, the number of synapses per neuron is 2,500, and it grows to 15,000 by about age 2. From then on they get pruned, mostly between ages 2 and 10 but continuing at a slower rate until the late 20s. The adult brain only retains 50% of the synapses it had as a 2-year-old.
This has been the most interesting part of the lottery ticket thing to me - it's amazing how many parallels there are between biological neurons and artificial ones. I think the lottery ticket hypothesis paper found good performance between 50% and 70% pruning
I guess "what fires together wires together" is also a good intuition in the reverse sense for pruning. Like muscles the body will likely also try to optimize the brain based on usage/functional relevance. But there is definitely some stability in the system otherwise we would quickly lose all memories that are not recalled frequently. ;-)
Do you have a PayPal? I don't have much but I at least want to buy you a cup of coffee.
thank you very much :) but I'm horribly over-caffeinated already :D
Nice
this is the Pareto law
this sounds like it's trying to solve a minor problem in a really convoluted way
This paper contains no new or deep idea. They do use data when pruning the network. It is the data on which the network was trained. Moreover, the lottery ticket hypothesis is trivial. Once stated rigorously, it takes less than four lines to prove it.
Enlighten us, please and prove it in four lines :D
@@YannicKilcher Sure, send me the statement of the hypothesis with the definitions of all technical terms used in it.
@@jerryb2735 no, you define and prove it. You claim to be able to do both, so go ahead
@@YannicKilcher False, read my claim carefully.
They are sooo on the wrong track...
First :D