what a teacher he is ...watching his video is equivalent to reading 10 articles and watching 100 videos
You are high on ML.
Watched 3 times during the past 2 years, each time I learn something new!
or each time you didn't listen to him carefully?
Finally it clicks in my head. Thanks Andrew !!!
Your brain clicks?
The residual block guarantees that your deep NN performs at least as well as the shallow one, so there's no reason to train a shallow NN rather than a deep NN, because theoretically a deeper NN outperforms a shallower NN.
I don't really understand what he means when he says the identity function is easy for the residual block to learn. It hasn't really learnt anything if all we do is append the initial input to the end. Given that we're saying the conv blocks are effectively superfluous because the weights are close to zero, I can't see what's gained in the whole process. We just appear to have extra calculation for its own sake when we already have the output of layer a[l].
Basically the baseline here is that you're hoping to improve performance. In the worst-case scenario the deeper layers don't learn anything, yet your performance doesn't take a hit thanks to the skip connections. But in most cases these layers will learn something too, which can only help improve performance. So yes, although there are a lot of extra calculations, you might get better performance. Again, it depends on the application and the trade-offs.
Thanks! Now I see why ResNet works. Thanks Andrew!
I have a question: if the result of the second network with two extra layers and a skip connection is the same as the first network without them, because a[l+2] is likely to become a[l], then why do we need to add the extra layers just to make it deeper? Or, as Andrew said, maybe we will be lucky and the extra layers learn something while we don't hurt the performance? Or is the case in the video where a[l+2] equals a[l] an edge case, and usually the extra layers can always learn something more while we retain the original performance?
So if the input encounters a skip-connection route, does it take both paths, or does it always take the skip connection? If it's the latter, what's the point in even including all those skipped layers?
Andrew Ng is a true technologist, soo involved in his lectures that he almost started Kung Fu art @8:47 when it sounded like Kung kung kung fu, kung kung kung fu ... 😆😆😆
Question: let's assume I have learned an identity function, so a[l+2] = a[l]. Then what? I feel like we are doing f(x) + 0 = f(x), so what's the point of "adding nothing"? Since I am not following here, I can't tell why residual networks are good for training deeper NNs.
The residual block ensures that our layer at least learns the output from the previous layer, so the performance doesn't get worse. This is helpful because plain networks often struggle to learn even the identity mapping as depth increases, leading to worse performance.
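In case a concrete example helps, here's a small NumPy sketch of a residual block's forward pass (the names and values are just illustrative, not the course code): if W and b shrink to zero, the output collapses to a[l], i.e. the identity, while non-zero weights let the block add something on top of a[l].

import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(a_l, W1, b1, W2, b2):
    """Forward pass of a two-layer residual block: a[l+2] = g(z[l+2] + a[l])."""
    a_l1 = relu(W1 @ a_l + b1)      # first layer
    z_l2 = W2 @ a_l1 + b2           # second layer, pre-activation
    return relu(z_l2 + a_l)         # skip connection adds a[l] before the final ReLU

a_l = np.array([1.0, 2.0, 0.5])
zeros_W, zeros_b = np.zeros((3, 3)), np.zeros(3)

# If weight decay drives W and b to zero, the block just outputs a[l] (the identity).
print(residual_block(a_l, zeros_W, zeros_b, zeros_W, zeros_b))   # -> [1.  2.  0.5]

# With non-zero weights, the block learns something on top of a[l].
rng = np.random.default_rng(0)
W1, W2 = 0.1 * rng.normal(size=(3, 3)), 0.1 * rng.normal(size=(3, 3))
print(residual_block(a_l, W1, zeros_b, W2, zeros_b))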
But how do we know how deep we should skip to? For example, how do we know whether the 10th layer will or will not improve from the input of the 4th layer?
It's my understanding that it may still be somewhat dependent on the problem. The skip connections are essentially restoring the identity of the input from the first layer of the block, thus keeping the block output similar to the input. The feature space that you are learning in those intermediate layers of the residual block is something you may need to consider for your individual problem, in terms of there being too much or too little parameter space. This is also dependent on the quantity and variability of your data. In general, ResNets are useful because as layers are stacked, the solution space increases hugely. That said, keeping the solution space somewhere around the input constrains it so that it doesn't grow out of control. Two or three intermediate layers are usually enough for the block to learn a reasonable amount, but you may want to consider the width of those intermediate layers as well.
As for why you may not want to stack 10 layers inside a residual block, consider the reason we use residual blocks in the first place. Stacking too many layers balloons the solution space, in which SGD will try all sorts of solutions and will have difficulty converging on a reasonable one. Thus we usually want to keep the blocks small, because we want to avoid the whole problem of stacking too many layers, especially inside a residual block, since residual blocks are the individual building blocks of the whole network.
By empirical analysis
We don't, but it doesn't hurt to save our game too often
Is L2 regularization kind of mandatory here?
In case we learn something useful in g(w[l+2] * a[l+1] + b[l+2]), isn't it possible that adding the activations of the previous layer a[l] can actually decrease performance? So in that case a plain network would do a better job?
I guess if the performance is worse, then W and b will go to 0 and nothing is learned. a[l] is preserved through the 'game progress saving' technique, so a[l+2] is at least as good as a[l].
I have a question:
You mention that the W matrix and b MAY decay IF we use weight-decay. Isn't that a big IF though?
Like, is weight decay a part of the residual network / skip connections? In most cases W and b will not decay to 0, so how are residual networks / skip connections useful in those cases?
"Residual networks can easily learn the identity function" - but isn't this true only when the weights and biases are 0? In the real situation, why would this happen? Its not making sense to me why you would skip connections and have the learned weights go to "0". If someone could please clarify this, I would be extremely grateful.
It essentially means that the residual layer can easily learn the identity function over the input by setting the weights to zero. This leads to the layer giving an output that is at least NOT WORSE than the output of the previous layer. On the other hand, plain networks may struggle to learn the identity mapping and as a result can lead to worse performance with increasing layers.
Hi, did you figure it out? I'm stuck on it now :(
Very good explanation. Need to watch it again.
i love this video! thanks professor andrew!!
Thanks Andrew! now I understand it.
Thank you !
Thank you very much Andrew. Could anyone please explain what 3x3 conv means? I would really appreciate that.
do you mean 3x3 filters?
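Not sure if this is exactly what you're after, but here's a toy NumPy example of what a 3x3 convolution does: slide a 3x3 filter over the image and take an elementwise product-and-sum at each position (the image and filter values here are made up for illustration).

import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)   # tiny 5x5 "image"
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])                  # a 3x3 filter (vertical-edge-like)

# "3x3 conv" = slide the 3x3 filter over the image; no padding, stride 1 -> 3x3 output.
out = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

print(out)   # each entry is the filter's response at one position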
If a layer is meant to be skipped, why was it there in the first place?
If the residual block has not learnt anything or is not useful, regularisation will help negate the effect of that layer and let the previous activations pass through, thereby not sacrificing the performance of the layer.
If the residual block has learnt something useful, then even after regularisation the knowledge learnt is kept, and the activations from the previous layer are also added, again not sacrificing the performance of the layer.
So it lets you keep deep layers, with the ability to either learn or not learn information.
@@mufaddalkanpurwala462 Hi, thanks for your explanation. There are some points I'm still not clear on:
- L2 regularisation makes W close to 0 but not exactly 0. Moreover, W is a matrix, so it's very unlikely for all of its elements to be 0. So how is the layer skipped?
- Why would we want to add the activations from the previous layer to the knowledge learned? Why won't adding them sacrifice the performance of the layer?
Hope you can help me with this, thanks a lot!
Can someone please correct me if I'm wrong? As far as I understand, if L2 regularization (weight decay) causes z[L+2] to become 0, ReLU just carries a[L] to the next layer. Otherwise, it learns from both z[L+2] and a[L]. So it bypasses the vanishing gradient problem but increases the exploding gradient problem. Am I right?
I also have this question
same question
same here
Just amazing. So clear.
am I the only one that thought that the pointer at 0:55 was actually a bug on their screen?
(copied from the previous video in series) Just a question for those out there - has anyone been able to use techniques from ResNets to improve the convergence speed of deep fully connected networks? Usually people use skip connections in the context of convolutional neural nets but I haven't seen much gain in performance with fully connected ResNets, so just wondering if there's something else I may be missing.
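Not an answer to the performance question, but in case anyone wants to experiment, here's a minimal sketch of a fully connected residual block in Keras (this assumes TensorFlow is installed; the layer sizes and block count are arbitrary, not from the course):

import tensorflow as tf

def dense_residual_block(x, units):
    """Two Dense layers plus a skip connection (widths must match for the add)."""
    shortcut = x
    h = tf.keras.layers.Dense(units, activation="relu")(x)
    h = tf.keras.layers.Dense(units)(h)           # no activation before the add
    h = tf.keras.layers.Add()([h, shortcut])      # the skip connection
    return tf.keras.layers.Activation("relu")(h)

inputs = tf.keras.Input(shape=(64,))
x = tf.keras.layers.Dense(64, activation="relu")(inputs)
for _ in range(4):                                # stack a few residual blocks
    x = dense_residual_block(x, 64)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()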
Thanks
Why does the number of filters double after pooling is applied? Any suggestions?
The dimension of the image is reduced. Pooling allows the network to learn more features over a larger window of the image, at the cost of lower resolution.
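For instance, here's a quick shape check with Keras (assuming TensorFlow is installed; the sizes are made up) showing how pooling halves the spatial dimensions while the next conv typically doubles the number of filters:

import tensorflow as tf

x = tf.keras.Input(shape=(56, 56, 64))                 # 56x56 feature map, 64 channels
p = tf.keras.layers.MaxPool2D(pool_size=2)(x)          # halves H and W -> (28, 28, 64)
c = tf.keras.layers.Conv2D(128, 3, padding="same")(p)  # filters often doubled -> (28, 28, 128)
print(p.shape, c.shape)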
@@snippletrap thank u. Very good explanation
I still can't get the intuition for why skip connections work better. It still seems experimental to me. ^^;
But we could also take w and b such that a[l+1] = a[l] and a[l+2] = a[l+1]. By doing so, we can get the identity function without ResNets. Isn't that right?
@Sunny kumar You can't explicitly set w and b like that; they are set by the gradient descent algorithm. If you're wondering how w can become 0, it is possible by applying L1 regularisation (read about this).
Can anybody tell me what an identity function is?
g(x) = x (Same as linear function)
It's a function that outputs exactly the same value as its input, like y = x. For example, in the ReLU function, if the input x > 0 then the output y = x. So in that region ReLU acts as an identity function.
y=x;
What is weight decay? Anyone please help, thanks.
L2 regularization
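In case it helps intuition: weight decay just adds a term that shrinks the weights a little on every update. A rough NumPy sketch (the learning rate, decay value, and weights are made up):

import numpy as np

# One SGD step with L2 regularization (weight decay): the decay term pulls W toward 0.
def sgd_step(W, grad_W, lr=0.01, weight_decay=0.1):
    return W - lr * (grad_W + weight_decay * W)

W = np.array([[0.5, -0.3], [0.2, 0.8]])
zero_grad = np.zeros_like(W)          # pretend the loss gradient is zero for this block
for _ in range(5000):
    W = sgd_step(W, zero_grad)        # the weights still shrink steadily toward zero
print(W)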
Doesn't the performance get affected if z[l+2] is negative?
Yes. If g(.) is ReLU: a[l+2] = g(z[l+2] + a[l]) = z[l+2] + a[l] if z[l+2] > -a[l] else 0. Since a[l] is always non-negative, if z[l+2] gets a negative value whose magnitude is larger than a[l], it results in a[l+2] being 0.
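A quick numeric check of that case analysis, with toy values I made up:

import numpy as np

def relu(x):
    return np.maximum(0, x)

a_l  = np.array([2.0, 2.0, 2.0])
z_l2 = np.array([1.0, -1.0, -5.0])   # positive, mildly negative, strongly negative

# a[l+2] = ReLU(z[l+2] + a[l]); only when z[l+2] < -a[l] does the output get zeroed.
print(relu(z_l2 + a_l))              # -> [3. 1. 0.]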
@@giofou711 The activation function applied to z(l+2) is ReLU, which has a minimum value of zero, so the output a(l+2) is 0 at minimum. It's kind of an attempt to solve vanishing gradients. I'm really interested in whether adding small random numbers instead of a(l), i.e. not g(z(l+2)+a(l)) but g(z(l+2)+random()), would work well. That's the first question that comes to my mind. I hope they investigate it; if you know of a paper on this, I'd appreciate you sharing it.
So does it mean that ResNet models must be trained with a certain degree of weight decay for this to make sense, and that otherwise it's just equivalent to a plain network?
For learning identity
I’m still left wondering, why is it good to learn the identity? A lot of videos I see just say “the identity is good to learn” but I don’t intuitively see why a model would want to learn that, and why the inability to learn the identity causes instability in deeper networks.
If the network learns the identity, then at least adding additional layers will not decrease performance.
@@MrBemnet1 That makes sense. But when adding more layers, are the extra ReLU functions at the end the only difference compared to having a shallower network?
Wow, my lecturer made such a mess of explaining why the layers just learn the identity mapping; this was much easier to understand.
This video has a weird signal tone coming from it.
His voice puts me to sleep. Good video though.
Setting the speed to 1.5 would help.
@@chauphamminh1121 god speed, seraph.