C4W2L04 Why ResNets Work

  • Published 24 Dec 2024

COMMENTS • 69

  • @kartiksirwani4657
    @kartiksirwani4657 4 years ago +114

    What a teacher he is... watching his video is equivalent to reading 10 articles and watching 100 videos.

  • @muzeroj173
    @muzeroj173 3 years ago +4

    Watched it 3 times over the past 2 years; each time I learn something new!

  • @billykotsos4642
    @billykotsos4642 4 years ago +13

    Finally it clicks in my head. Thanks, Andrew!!!

  • @iammakimadog
    @iammakimadog 3 years ago +16

    The residual block guarantees that your deep NN performs at least as well as the shallow one, so there's no reason to train a shallow NN rather than a deep one, because theoretically a deeper NN outperforms a shallower one.
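    To make the point above concrete, here is a minimal sketch of a residual block, assuming PyTorch and fully connected layers with illustrative sizes (not code from the lecture); the lecture's notation a[l], z[l+2] is noted in the comments:

        import torch.nn as nn
        import torch.nn.functional as F

        class ResidualBlock(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.fc1 = nn.Linear(dim, dim)
                self.fc2 = nn.Linear(dim, dim)

            def forward(self, a_l):
                a_l1 = F.relu(self.fc1(a_l))   # a[l+1] = g(z[l+1])
                z_l2 = self.fc2(a_l1)          # z[l+2]
                return F.relu(z_l2 + a_l)      # a[l+2] = g(z[l+2] + a[l]), the skip connection

    If fc2's weights and bias shrink toward zero, z[l+2] is roughly 0 and the output is ReLU(a[l]) = a[l] (a[l] is already non-negative), so the deeper network can always fall back to behaving like the shallower one.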

  • @razorphone77
    @razorphone77 5 years ago +14

    I don't really understand what he means when he says the identity function is easy for the residual block to learn. It hasn't really learnt anything if all we do is append the initial input to the end. Given that we're saying the conv blocks are effectively superfluous because the weights are close to zero, I can't see what's gained in the whole process. We just appear to have extra calculation for the sake of it when we already have the output of layer a[l].

    • @SuperVaio123
      @SuperVaio123 5 years ago +22

      Basically the baseline here is that you're hopefully trying to improve performance. In the worst-case scenario the deeper layers don't learn anything, and yet your performance doesn't take a hit thanks to your skip connections. But in most cases these layers will learn something too, which can only help performance. So yes, although there are a lot of extra calculations, you might get better performance. Again, it depends on the application and the trade-offs.
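      A tiny NumPy illustration of the worst case described in the reply above, with made-up numbers (not from the video): if weight decay pushes the block's second-layer weights and bias all the way to zero, the block reduces to the identity, so nothing is lost; anything the weights do learn is added on top of that baseline.

          import numpy as np

          def relu(x):
              return np.maximum(0, x)

          a_l = np.array([0.7, 0.0, 1.3])          # activations entering the block (already non-negative)
          W2, b2 = np.zeros((3, 3)), np.zeros(3)   # the block's second layer, pushed to zero by weight decay

          a_l1 = relu(np.array([0.2, 0.5, 0.1]))   # a[l+1], whatever the block's first layer produced
          a_l2 = relu(W2 @ a_l1 + b2 + a_l)        # g(z[l+2] + a[l])
          print(np.array_equal(a_l2, a_l))         # True: the block computes the identity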

  • @firstpenguin5653
    @firstpenguin5653 3 years ago +1

    Thanks! This is why ResNets work and, at the same time, why Andrew works!

  • @bowbowzai3757
    @bowbowzai3757 10 months ago +1

    I have a question: if the result of the second network, with two extra layers and a skip connection, is the same as the first network without them (because a[l+2] is likely to become a[l]), then why do we need to add the extra layers just to make it deeper? Or, just like Andrew said, maybe we will be lucky and the extra layers learn something while we don't hurt the performance? Or is the case in the video where a[l+2] equals a[l] an edge case, and usually the extra layers can learn more things while we retain the original performance?

  • @RH-mk3rp
    @RH-mk3rp 2 years ago +1

    So if the input encounters a skip connection route, does it take both paths or does it always take the skip connection? If it's the latter, then what's the point of even including all those skipped layers?

  • @ahmadsaeedkhattak20
    @ahmadsaeedkhattak20 1 year ago

    Andrew Ng is a true technologist, so involved in his lectures that he almost started Kung Fu art @8:47 when it sounded like Kung kung kung fu, kung kung kung fu ... 😆😆😆

  • @43SunSon
    @43SunSon 2 years ago +1

    Question: let's assume I do have an identity function learned, so a[l+2] = a[l]; then what? I feel like we are doing f(x) + 0 = f(x), so what's the point of "adding nothing"? Since I am not following here, I can't tell why Residual Networks are good for deeper NN training.

    • @kartikeyakhare5089
      @kartikeyakhare5089 1 year ago +2

      The residual block ensures that our layer at least learns the output from the previous layer, so the performance doesn't get worse. This is helpful because plain networks often struggle to learn even the identity mapping as depth increases, leading to worse performance.

  • @X_platform
    @X_platform 7 years ago +9

    But how do we know how deep we should skip? For example, how do we know whether the 10th layer will or will not improve from the input of the 4th layer?

    • @joeycarson5510
      @joeycarson5510 6 years ago +3

      It's my understanding that it may still be somewhat dependent on the problem. The skip connections essentially restore the identity of the input from the first layer of the block, thus keeping the block output similar to the input. The feature space that you are learning in those intermediate layers of the residual block is something you may need to consider for your individual problem, in terms of there being too much or too little parameter space. This also depends on the quantity and variability of your data. In general, ResNets are useful because as layers are stacked, the solution space grows hugely; keeping the block output near its input constrains that space so it doesn't grow out of control. Two or three intermediate layers are usually enough for the block to learn a reasonable amount, but you may want to consider the width of those intermediate layers as well.
      As for why you may not want to stack 10 layers inside a residual block, consider the reason we use residual blocks in the first place. Stacking too many layers balloons the solution space, so SGD will try all sorts of solutions and it will be difficult to converge on a reasonable one. Thus we usually want to keep the blocks small, because we want to avoid that whole problem of stacking too many layers, especially inside the residual block, since residual blocks are the individual building blocks of the whole network.

    • @RehanAsif
      @RehanAsif 5 years ago

      By empirical analysis

    • @zxynj
      @zxynj 2 years ago

      We don't, but it doesn't hurt to save our game too often
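      For reference, the pattern echoed in this thread is the one sketched below: each skip spans a short block of two or three layers, and depth comes from stacking many such blocks rather than from one skip that jumps ten layers. A minimal PyTorch sketch with made-up sizes, not the exact ResNet recipe:

          import torch.nn as nn
          import torch.nn.functional as F

          class ConvBlock(nn.Module):
              # A short residual block: the skip connection spans just two 3x3 conv layers.
              def __init__(self, ch):
                  super().__init__()
                  self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
                  self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)

              def forward(self, x):
                  return F.relu(self.conv2(F.relu(self.conv1(x))) + x)

          # Depth comes from stacking many short blocks.
          trunk = nn.Sequential(*[ConvBlock(64) for _ in range(8)])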

  • @heejuneAhn
    @heejuneAhn 2 years ago +1

    Is the L2 regularization kind of mandatory here?

  • @robingutsche1117
    @robingutsche1117 3 years ago +3

    In case we learn something useful in g(W[l+2] a[l+1] + b[l+2]), isn't it possible that adding the activations of the previous layer a[l] can actually decrease performance? So in that case a plain network would do a better job?

    • @zxynj
      @zxynj 2 years ago +1

      I guess if the performance is worse, then W and b will go to 0 and nothing is learned. a[l] is preserved through the 'game progress saving' technique, so a[l+2] is at least as good as a[l].

  • @anasputhawala6390
    @anasputhawala6390 2 years ago

    I have a question:
    You mention that the W matrix and b MAY decay IF we use weight decay. Isn't that a big IF though?
    Like, is weight decay a part of the residual network / skip connections? In most cases W and b will not decay to 0, so how are residual networks / skip connections useful in those cases?

  • @anirudhgangadhar6158
    @anirudhgangadhar6158 3 years ago +1

    "Residual networks can easily learn the identity function" - but isn't this true only when the weights and biases are 0? In a real situation, why would this happen? It's not making sense to me why you would add skip connections and then have the learned weights go to 0. If someone could please clarify this, I would be extremely grateful.

    • @1233-f7h
      @1233-f7h 2 years ago +1

      It essentially means that the residual layer can easily learn the identity function over the input by setting the weights to zero. This lets the layer give an output that is at least NOT WORSE than the output of the previous layer. On the other hand, plain networks may struggle to learn the identity mapping and as a result can get worse with increasing layers.

    • @derekthompson2301
      @derekthompson2301 2 years ago

      Hi, did you figure it out? I'm stuck on it now :(

  • @sandipansarkar9211
    @sandipansarkar9211 4 years ago

    Very good explanation. Need to watch it again.

  • @도정찬
    @도정찬 3 years ago +1

    I love this video! Thanks, Professor Andrew!!

  • @hackercop
    @hackercop 3 years ago

    Thanks, Andrew! Now I understand it.

  • @ahmedb2559
    @ahmedb2559 1 year ago

    Thank you!

  • @mohammedalsubaie3512
    @mohammedalsubaie3512 2 years ago

    Thank you very much, Andrew. Could anyone please explain what a 3x3 conv means? I would really appreciate that.
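    In case it helps: a "3x3 conv" is a convolutional layer whose filters are 3 pixels wide by 3 pixels tall. A small PyTorch sketch with made-up channel counts; with padding=1 ("same" padding) the spatial size is preserved, which is what lets a skip connection be added element-wise:

        import torch
        import torch.nn as nn

        conv3x3 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=3, padding=1)   # 3x3 filters, "same" padding

        x = torch.randn(1, 64, 56, 56)   # (batch, channels, height, width)
        y = conv3x3(x)
        print(y.shape)   # torch.Size([1, 64, 56, 56]): same spatial size, so y + x is a valid skip addition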

  • @elgs1980
    @elgs1980 4 years ago +1

    If a layer is meant to be skipped, why was it there in the first place?

    • @mufaddalkanpurwala462
      @mufaddalkanpurwala462 4 years ago +3

      If the residual block has not learnt anything useful, regularisation helps negate the effect of those layers, and the skip connection passes the previous activations through, so the performance of the layer is not sacrificed.
      If the residual block has learnt something useful, then even after regularisation the learnt knowledge is kept, and the activations from the previous layer are added on top, again without sacrificing performance.
      So it lets you keep deep layers, with the freedom to learn or not learn information.

    • @derekthompson2301
      @derekthompson2301 2 years ago

      @@mufaddalkanpurwala462 Hi, thanks for your explanation. There are some points I'm still not clear on:
      - L2 regularisation makes W close to 0 but not exactly 0. Moreover, W is a matrix, so it's very unlikely for all of its elements to be 0. So how is the layer skipped?
      - Why would we want to add the activations from the previous layer to the knowledge learned? Why won't adding them sacrifice the performance of the layer?
      Hope you can help me with this, thanks a lot!
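      On the first question above (W close to 0 but not exactly 0), a tiny NumPy check with made-up numbers suggests the intuition still holds approximately: the skip term dominates, so the output stays close to a[l] rather than exactly equal to it.

          import numpy as np

          def relu(x):
              return np.maximum(0, x)

          rng = np.random.default_rng(0)
          a_l = relu(rng.normal(size=4))        # activations entering the block
          W1 = 0.01 * rng.normal(size=(4, 4))   # small but not exactly zero
          W2 = 0.01 * rng.normal(size=(4, 4))

          a_l1 = relu(W1 @ a_l)                 # a[l+1]
          a_l2 = relu(W2 @ a_l1 + a_l)          # g(z[l+2] + a[l])
          print(np.max(np.abs(a_l2 - a_l)))     # tiny: approximately the identity, not exactly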

  • @baranaldemir5570
    @baranaldemir5570 4 years ago +1

    Can someone please correct me if I'm wrong? As far as I understand, if L2 regularization (weight decay) causes z[l+2] to become 0, ReLU just carries a[l] to the next layer. Otherwise, it learns from both z[l+2] and a[l]. So it bypasses the vanishing gradient problem but increases the exploding gradient problem. Am I right?
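    On the vanishing-gradient part of the question above, a small PyTorch autograd check (illustrative only, not from the lecture): even if the block's weights were driven to zero, the gradient still reaches a[l] undiminished through the skip path.

        import torch
        import torch.nn as nn

        a_l = torch.relu(torch.randn(4, 8)).requires_grad_()   # non-negative activations entering the block
        fc = nn.Linear(8, 8)
        nn.init.zeros_(fc.weight)   # pretend weight decay has driven the block's weights to zero
        nn.init.zeros_(fc.bias)

        a_l2 = torch.relu(fc(torch.relu(a_l)) + a_l)   # residual block output with the skip connection
        a_l2.sum().backward()
        print(a_l.grad.abs().sum())   # non-zero: the gradient flows to a[l] through the skip path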

  • @rm175
    @rm175 2 years ago

    Just amazing. So clear.

  • @6884
    @6884 2 years ago

    Am I the only one who thought the pointer at 0:55 was actually a bug on their screen?

  • @patrickyu8470
    @patrickyu8470 2 years ago

    (copied from the previous video in the series) Just a question for those out there - has anyone been able to use techniques from ResNets to improve the convergence speed of deep fully connected networks? Usually people use skip connections in the context of convolutional neural nets, but I haven't seen much gain in performance with fully connected ResNets, so I'm just wondering if there's something else I may be missing.

  • @ati43888
    @ati43888 8 months ago

    Thanks

  • @shuyuwang4867
    @shuyuwang4867 4 years ago

    Why does the filter number double after pooling is applied? Any suggestions?

    • @snippletrap
      @snippletrap 4 years ago +2

      The dimension of the image is reduced. Pooling allows the network to learn more features over a larger window of the image, at the cost of lower resolution.

    • @shuyuwang4867
      @shuyuwang4867 4 years ago

      @@snippletrap Thank you. Very good explanation.
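      To add a sketch to the reply above: when pooling (or a strided convolution) halves the spatial resolution, the number of filters is typically doubled so the layer keeps comparable capacity per position. In a ResNet the skip path then needs a 1x1 convolution (the Ws term from the lecture) so the shapes match for the addition. A hedged PyTorch example with made-up sizes:

          import torch
          import torch.nn as nn

          x = torch.randn(1, 64, 56, 56)   # 64 channels at 56x56

          down = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)   # halves height/width, doubles channels
          skip = nn.Conv2d(64, 128, kernel_size=1, stride=2)              # 1x1 projection so the skip matches

          y = torch.relu(down(x) + skip(x))
          print(y.shape)   # torch.Size([1, 128, 28, 28])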

  • @heejuneAhn
    @heejuneAhn 2 years ago

    Still, I cannot get the intuition for why the skip connection works better. It still seems experimental to me. ^^;

  • @Ashokkumar-ds1nq
    @Ashokkumar-ds1nq 4 years ago

    But we can also take w and b as 1 so that a[l+1]=a[l] and a[l+2]=a[l+1]. By doing so, we can get the identity function without ResNets, can't we?

    • @5paceb0i
      @5paceb0i 4 years ago

      @Sunny kumar You can't explicitly set w and b to 1; they are set by the gradient descent algorithm. If you are confused about how w can become 0, it is possible by applying L1 regularisation (read about this).

  • @vinayakpevekar
    @vinayakpevekar 6 years ago +11

    Can anybody tell me what the identity function is?

    • @ajaysubramanian7026
      @ajaysubramanian7026 6 years ago +2

      g(x) = x (Same as linear function)

    • @永田善也-r2l
      @永田善也-r2l 6 years ago +12

      It's a function that outputs exactly the same value as its input, like y = x. For example, in the ReLU function, if the input x > 0 then the output y = x, so in that region ReLU is an identity function.

    • @mohammedsamir9833
      @mohammedsamir9833 5 years ago

      y=x;

  • @MuhannadGhazal
    @MuhannadGhazal 4 years ago

    What is weight decay? Anyone, please help. Thanks.
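    For anyone else wondering: "weight decay" is L2 regularisation seen from the optimizer's side; a penalty proportional to the squared weights is, in effect, added to the loss, so every update nudges the weights toward zero. A minimal PyTorch sketch with made-up sizes:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        model = nn.Linear(10, 10)
        # weight_decay adds an L2-style shrinkage term, pulling every weight toward zero at each step
        opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

        x, y = torch.randn(32, 10), torch.randn(32, 10)
        loss = F.mse_loss(model(x), y)
        loss.backward()
        opt.step()   # this step also shrinks the weights slightly because of weight_decay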

  • @shashankcharyavusali5914
    @shashankcharyavusali5914 6 years ago

    Doesn't the performance get affected if z[l+2] is negative?

    • @giofou711
      @giofou711 6 years ago

      Yes. If g(.) is ReLU: a[l+2] = g(z[l+2] + a[l]) = z[l+2] + a[l] if z[l+2] > -a[l] else 0. Since a[l] is always non-negative, if z[l+2] gets a negative value whose magnitude is larger than a[l], it results in a[l+2] being 0.

    •  5 years ago +1

      The activation applied to z[l+2] + a[l] is ReLU, which has a minimum value of zero, so a[l+2] will be 0 at minimum. It's kind of an attempt to address vanishing gradients. I'm really interested in whether it would still work well if they added small random numbers instead of a[l], i.e., not g(z[l+2]+a[l]) but g(z[l+2]+random()). That's the first question that comes to my mind. I hope someone investigates it; I haven't looked into it myself, but if you know of such a paper, please share it.

  • @jorjiang1
    @jorjiang1 5 years ago +1

    So does it mean that ResNet models must be trained with a certain degree of weight decay for this to make sense, and otherwise it is just equivalent to a plain network?

  • @yongwookim1
    @yongwookim1 4 months ago

    For learning identity

  • @paulcurry8383
    @paulcurry8383 4 years ago

    I’m still left wondering, why is it good to learn the identity? A lot of videos I see just say “the identity is good to learn” but I don’t intuitively see why a model would want to learn that, and why the inability to learn the identity causes instability in deeper networks.

    • @MrBemnet1
      @MrBemnet1 3 years ago +2

      If the network learns the identity, then at least adding additional layers will not decrease performance.

    • @frasergilbert2949
      @frasergilbert2949 3 years ago

      @@MrBemnet1 That makes sense. But by adding more layers, are the extra ReLU functions at the end the only difference, compared to having a shallower network?

  • @ruchirjain1163
    @ruchirjain1163 3 years ago +3

    Wow, my lecturer made such a mess of explaining why the layers just learn the identity mapping; this was much easier to understand.

  • @arpitaingermany
    @arpitaingermany 5 months ago

    This video has a weird signal tone coming from it.

  • @madisonforsyth9184
    @madisonforsyth9184 5 years ago +44

    His voice puts me to sleep. Good video tho.