Wow!! I mean, how do you keep up and sustain this rate of reading and producing these videos, man! I came across your channel just about a week and a half ago (while searching for explanations of DETR) and have since had a tough time just sifting through your very interesting videos, and picked up quite a bit along the way! You are indeed a role model! :) Thanks a lot for what you are doing!
_/\_
Amazing! The secret is slowly revealing itself.
Wow, Video on this paper is out already? That was quick!
Really well explained, keep it up 👍
Amazing video presenting an amazing paper. AFAIK sparsity currently doesn't translate into much real-world performance gain (on GPUs and TPUs), but that should start changing soon. 10x more layers with 90% of the weights pruned using SynFlow should greatly outperform while having a similar (final) parameter count.
Awesome! I was waiting for today’s paper 🤣
Simply Superb. Great work!
12:18 It's interesting how the graphs are sigmoid-shaped. I would expect the graph to start out flat-ish because of redundant connections and then fall linearly, but it seems to flatten out near the end. Basically, it retains accuracy even as most of the parameters get pruned. It would be interesting to see what happens if we start adding parameters back and training this near-critical model again. Would it trace the same path back upward? Or would it do better? Or worse?
.
38:38 Intuitively, this is equivalent to asking a node, "How much of the final output do you contribute to?". However, since we are taking absolute values, say there is a node computing 1x + 9999 that passes an activation of +9999 to the next node, which computes 1x - 9999. This score would rate both of these nodes highly, yet in reality they always negate each other and never contribute to the final output (see the sketch at the end of this comment). Then again, checking interactions between neurons is practically intractable.
.
I really liked the dataless approach in this paper. I think it will inspire more papers to try similar things. Good stuff.
.
43:30 IDK. A network initialized with all weights equal to 1 would make this algorithm go crazy with excitement. Let's see whether we end up with a new family of network initialization policies.
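A minimal sketch of the cancellation point above, in plain PyTorch (the toy network and the number 9999 are made up for illustration, they are not from the paper): the signed network below computes exactly zero for every input because its two paths cancel, yet a SynFlow-style score computed on |W| with an all-ones input rates every weight on both paths highly.

```python
import torch

c = 9999.0
# Toy two-path net: output = 1*h1 - 1*h2 with h1 = h2 = c*x,
# so the signed network outputs exactly 0 for every input x.
W1 = torch.tensor([[c], [c]], requires_grad=True)     # input -> 2 hidden units
W2 = torch.tensor([[1.0, -1.0]], requires_grad=True)  # hidden -> output

x = torch.randn(1, 1)
print((W2 @ (W1 @ x)).item())  # 0.0: the two paths always cancel

# A SynFlow-style flow uses |W| and an all-ones input, so nothing cancels:
R = (W2.abs() @ (W1.abs() @ torch.ones(1, 1))).sum()
R.backward()
print((W1.grad * W1).abs())  # both hidden weights score ~9999
print((W2.grad * W2).abs())  # both output weights score ~9999
```

Detecting this kind of sign-based interaction in general seems to require looking at actual activations, which is exactly what the dataless setting gives up.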
- yea indeed, I think there are dozens of questions like this that could all turn out either way and nobody knows
- maybe here one can count on the gaussian initializations to basically never make this happen, because it would be quite the coincidence
- I guess we're both waiting for the first generation of "adversarial initializations" to screw with the pruning methods :D
This could help identify motifs for Synthetic Petri Dish.
Exceptional work once again!
Yannic, you need to tell us how you manage your time and how you manage to put out a video daily. For me, just reading a paper takes 2-3 days to understand it fully, let alone explain it to others.
He's been publishing papers in ML since at least 2016, so he's had a bunch of practice. Plus he's lightning Yannic :D
Once you get a bit of practice, it gets easier :)
@@YannicKilcher Building the cache, so to speak :)
duude, Do you ever take a break? haha, love it though!
What if you had a constraint for the pruning to keep at least some percentage (say 10%) of the connections in each layer, to prevent layer collapse? (A rough sketch of what I mean follows this comment.)
Edit:
here's the answer 33:50
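A minimal sketch of that kind of per-layer floor on top of global score ranking, assuming NumPy; the function, the 10% default, and the fallback rule are hypothetical illustrations of the comment above, not how the paper itself avoids collapse (its answer, per the edit, is the iterative conservation-based scoring discussed at 33:50).

```python
import numpy as np

def global_prune_with_floor(layer_scores, keep_fraction, layer_floor=0.10):
    """Hypothetical guard against layer collapse: threshold scores globally,
    but never let a layer fall below `layer_floor` of its connections."""
    flat = np.concatenate([s.ravel() for s in layer_scores])
    k = max(1, int(len(flat) * keep_fraction))
    global_thresh = np.sort(flat)[::-1][k - 1]          # k-th largest score overall

    masks = []
    for s in layer_scores:
        mask = s >= global_thresh
        min_keep = max(1, int(layer_floor * s.size))
        if mask.sum() < min_keep:
            # Fall back to keeping this layer's top `min_keep` connections.
            local_thresh = np.sort(s.ravel())[::-1][min_keep - 1]
            mask = s >= local_thresh
        masks.append(mask)
    return masks

# Example: three layers of random saliency scores, keep 1% globally.
scores = [np.random.rand(64, 32), np.random.rand(32, 32), np.random.rand(32, 10)]
masks = global_prune_with_floor(scores, keep_fraction=0.01)
print([int(m.sum()) for m in masks])  # each layer keeps at least ~10% of its weights
```

The trade-off is that the floored layers keep connections a pure global ranking would have pruned, which is presumably part of why the paper goes the conservation route instead.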
I fully agree with your statement about models being inflated; compression will still have its place.
An unavoidable thought is: why not inject additional layers of weights into the weight matrix Q as required, so that you increase the spatial resolution/search space to capture the extra detail required for a more accurate model? The other question is the best initialization pattern for the Q-matrix perturbation weights. I don't believe random weights are wrong, they work, but I don't think they are the best.
One would want an even distribution of perturbations, to give backpropagation the best possible chance of enhancing the weights according to the forward propagation.
Clearly, if we go to inflationary models, then one should also add the inflation perturbations more intelligently, not randomly.
The thing for me is that one might have an iterative algorithm that builds the structure more intelligently by swapping out sections of weights for different weight structures, which would allow different types of mathematical equations to be represented. A^b, or something like X^2 + Y^2, could be captured here with a linear expansion formula. At one point I searched the internet for an X^B formula for computation and didn't find it, to see how to construct a set of weights that models that equation in a NN. I did discover it a few days back, more or less, from my understanding of a probability book, by the looks of it.
Strangely enough, if you inflate models with different sections of weights that resemble different mathematical relationships, then we get more insight into the mathematics going on, as we could hold a parallel mathematical construction for the Q matrix (for however long it remains that way). The next step, I guess, is how one improves the abstraction of the matrices for hardware computation: lots of empty spaces and missing weights, the ability to restructure the Q matrix without accuracy loss into an equivalent form, merging and splitting layers into multiple layers, so that one has dense matrices of weights for hardware computation. With regards to the Q matrix with perturbations, I think for better matching it should be a matrix with incremental symmetry around the F0/FN upper and lower bounds and F(N/2), so basically values that increase incrementally by small, equal amounts; this means you won't have random weight lines, and in my mind you should get better patterns than with values jumping randomly all around the Q matrix of weights.
That all seems predicated on these mathematical functions actually being the best kind of representations for these kinds of problems, which is a very strong hypothesis
Can we just add the new loss function to the original loss function of the model and train the original network with this pruning cost (and the one-shot data) included? I.e., prune the model while training it.
Yea I guess that's a possibility
Yannic, thanks for making these videos, they really help a lot. :D I wanted to know: these pruning techniques are not going to improve the FLOPs of my model, right? Because we are just masking the weights in order to prune. Or is there another way to reduce FLOPs?
Yes, for now that's the case. But you could apply the same methods to induce block-sparsity, which you could then leverage to get to faster neural networks.
@@YannicKilcher I'll look into it.
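To make the exchange above concrete, here is a minimal sketch (assuming PyTorch; the layer size and the ~1% keep rate are arbitrary) of why a mask alone doesn't reduce FLOPs: the pruned layer is still stored densely and still runs the full dense matmul.

```python
import torch
import torch.nn as nn

layer = nn.Linear(1024, 1024, bias=False)
mask = (torch.rand_like(layer.weight) > 0.99).float()  # keep roughly 1% of weights

with torch.no_grad():
    layer.weight.mul_(mask)  # "pruned", but the zeros are still stored and multiplied

x = torch.randn(32, 1024)
y = layer(x)  # dense GEMM: ~32 * 1024 * 1024 multiply-adds, mask or no mask
```

To actually see speedups, the sparsity has to be something the kernels can exploit, e.g., removing whole neurons/channels or blocks of weights (structured sparsity), which is what the block-sparsity suggestion above amounts to.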
I was wondering why they use the Hadamard product instead of the dR/d(theta) score alone as a metric to evaluate a parameter's effect on the loss. I understand that this new score won't obey the conservation theorem, but if the prime issue was avoiding layer collapse, could we just throw out the conservation part and use this score in a way that prevents layer collapse (e.g., by adding a contingency to the algorithm that avoids it, maybe using a local masking technique, which I know is sub-par in performance)? Has this been done? Any thoughts?
True. I think the parameter magnitude itself might carry some meaningful information. Like, when it's close to 0, a large gradient on it shouldn't "mean" as much. But that's just my intuition
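For reference, here are the quantities being discussed, written in LaTeX as I recall them from the paper: the synaptic flow objective R_SF is evaluated with absolute-valued weights (equivalently, on an all-ones input), and the SynFlow score is the Hadamard product of its gradient with the parameters themselves.

```latex
R_{\mathrm{SF}} \;=\; \mathbb{1}^{\top}\!\left(\prod_{l=1}^{L}\lvert W^{[l]}\rvert\right)\!\mathbb{1},
\qquad
S(\theta) \;=\; \frac{\partial R_{\mathrm{SF}}}{\partial \theta}\,\odot\,\theta .
```

Without the \odot \theta factor, a near-zero weight with a large gradient would score as high as a large weight with the same gradient, and the neuron-wise conservation property the paper proves for scores of this form would no longer apply; whether a separate collapse-avoiding heuristic could make up for that is, as far as I know, untested.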
How can this work for architectures like ResNet, which have skip connections across layers, without looking at the data? They show results in the paper for ResNet, but somehow that doesn't make sense to me. Does anybody know what I am missing?
You might take a look at their code at github.com/ganguli-lab/Synaptic-Flow . Why do you think that using the training-data would be required for dealing with the shortcut connections?
I think the same analysis still applies, though you're right the interesting part in ResNets is their skip connection, so technically they never have to deal with layer collapse.
The problem looks so much like a max-flow or min-cost-flow problem.
I wonder if, instead of the all-ones datapoint, you could use a normalized average of all your training datapoints.
It works for images with a CNN as the first layer. How about an MLP with different features as inputs? It will be problematic for the first layer, i.e., feature selection, since the method never sees the data and has no idea which features are more important.
I guess we'll have to try. But you'd leave all neurons there, just prune the connections.
the algorithm essentially improves the gradient of a network to make it train better. It does not solve everything.
SNIP SNAP SNIP SNAP SNIP SNAP! Do you know the toll that iterative pruning has on a neural network?
Must be horrible :D
Priceless comment.
Is it possible to iteratively grow the network rather than pruning it, or does that collapse to be essentially the same thing?
Oh just heard your similar comments right at the end of the video. Cool.
Phenomenal.
Would you be able to step through the code of one of these papers?
sure, but this one is just like 3 lines, have a look
@@YannicKilcher Fair point. Would look forward to that kind of thing on other papers.
Thanks for the incredibly insightful content!
So -- if I understood this correctly -- you would in principle be able to
1) take a huge model (which normally requires an entire datacenter to train),
2) prune it down to some reasonable size -- and presumably prune it on a relatively small computer, since the method does not use any data in the pruning process, and
3) finally train the smaller pruned model to high accuracy (or SOTA given the network size) -- presumably also on a relatively small computer
I think that would be correct, if training on a CPU. I don't know how current GPUs handle pruned networks or how much it benefits them. GPUs may need some additional hardware features to really benefit from using a pruned network.
This method applied in reverse could also be quite interesting, i.e. for model search. Assuming the accuracy of a pruned network is reflective of the accuracy of the full network, then you could use Synflow to train and test various pruned models, before scaling up the best performing model and training that... but yes, new hardware might need to be developed.
Nearly identical accuracy with 1% or even 0.1% of the weights at initialization? That is fascinating. What seems a bit mind-bending to me is that this pruning can be done data-independently - only by feeding 1s through the network. Crazy - maybe the future is poised to be sparse, and fully-connected initialization will become a thing of the past. ;-)
If layer collapse (aka layer-dependent average synaptic saliency score magnitude) is the problem: why not perform pruning layer-wise in general? How would the baseline methods perform if the pruning selection were done for each layer individually instead of sorting the scores globally for the whole network?
in the paper they claim that layer-wise pruning gives much worse results
@@YannicKilcher I see, they reference "What is the State of Neural Network Pruning?" arxiv.org/abs/2003.03033 ... maybe layer-wise (or fan-in/fan-out dependent) normalization of the saliency scores might be a way to compensate for the magnitude differences. ;-) Btw, the "linearization" trick they use for ReLUs (W.abs() and then passing a vector of 1s) is nice ... for other activation functions this will probably require a bit more work.
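For anyone curious what that linearization trick looks like in code, here is a minimal sketch in the spirit of the official Synaptic-Flow repository (assuming a plain PyTorch feed-forward net without batch norm; the function names are my own):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def linearize(model):
    # Replace every parameter by its absolute value, remembering the signs.
    signs = {}
    for name, param in model.state_dict().items():
        signs[name] = torch.sign(param)
        param.abs_()
    return signs

@torch.no_grad()
def restore(model, signs):
    for name, param in model.state_dict().items():
        param.mul_(signs[name])

def synflow_scores(model, input_shape):
    signs = linearize(model)
    # A single all-ones "datapoint": with |W|, a ReLU net acts as a positive
    # linear map, so the summed output is the total synaptic flow R.
    x = torch.ones(1, *input_shape)
    R = model(x).sum()
    R.backward()
    # Synaptic saliency per parameter: |dR/dtheta * theta|.
    scores = {n: (p.grad * p).detach().abs() for n, p in model.named_parameters()}
    model.zero_grad()
    restore(model, signs)
    return scores

# Example: scores for a tiny MLP on 784-dimensional inputs.
mlp = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
scores = synflow_scores(mlp, (784,))
print({n: tuple(s.shape) for n, s in scores.items()})
```

As noted above, for activations that are not positively homogeneous like ReLU, taking |W| and feeding ones no longer gives such a clean "flow" interpretation, so some extra care would be needed.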
Not sure what this thing learns, the dataset or the architecture...
Great video once again! Just one question: do you have a goal of making at least one video a day? I found this channel while searching whether anyone had had the idea of making "reading a paper" videos. Now I have another idea. Will implement it soon and share it here. :)
I try to make one every day, but I'll probably fail at some point
I wonder what pruning method the human brain uses. At birth, the number of synapses per neuron is 2,500, and it grows to 15,000 by about age 2. From then on they get pruned, mostly between ages 2 and 10 but continuing at a slower rate until the late 20s. The adult brain only retains 50% of the synapses it had as a 2-year-old.
This has been the most interesting part of the lottery ticket thing to me - it's amazing how many parallels there are between biological neurons and artificial ones. I think the lottery ticket hypothesis paper found good performance between 50% and 70% pruning
I guess "what fires together wires together" is also a good intuition in the reverse sense for pruning. Like muscles the body will likely also try to optimize the brain based on usage/functional relevance. But there is definitely some stability in the system otherwise we would quickly lose all memories that are not recalled frequently. ;-)
Do you have a PayPal? I don't have much but I at least want to buy you a cup of coffee.
thank you very much :) but I'm horribly over-caffeinated already :D
Nice
this is the Pareto law
this sounds like it's trying to solve a minor problem in a really convoluted way
This paper contains no new or deep idea. They do use data when pruning the network. It is the data on which the network was trained. Moreover, the lottery ticket hypothesis is trivial. Once stated rigorously, it takes less than four lines to prove it.
Enlighten us, please and prove it in four lines :D
@@YannicKilcher Sure, send me the statement of the hypothesis with the definitions of all technical terms used in it.
@@jerryb2735 no, you define and prove it. You claim to be able to do both, so go ahead
@@YannicKilcher False, read my claim carefully.
They are sooo on the wrong track...
First :D