[Classic] Deep Residual Learning for Image Recognition (Paper Explained)

  • Published 17 Nov 2024

COMMENTS •

  • @YannicKilcher
    @YannicKilcher  4 years ago +65

    This is a pre-recorded scheduled release :D still on break :)

    • @Phobos11
      @Phobos11 4 years ago +6

      Yannic Kilcher, a welcome surprise 😄

  • @VALedu11
    @VALedu11 4 years ago +50

    for someone like me who has ventured into neural nets recently, this explanation is a boon. It was like listening to the classics. Legendary paper and an equally awesome explanation.

  • @Notshife
    @Notshife 4 years ago +54

    Yep, revisiting this classic paper in your usual style was still interesting to me. Thanks as always

  • @li-lianang8304
    @li-lianang8304 2 years ago +7

    I've watched like 5 other videos explaining ResNets and this was the only video I needed. Thank you so much for explaining it so clearly!!

  • @thomaesm
    @thomaesm 3 years ago +8

    I really wanted to drop you a line that I really, really enjoyed your paper walkthrough; super informative and entertaining! Thank you so much for uploading this! :)

  • @cycman98
    @cycman98 4 years ago +6

    Visiting old and influential papers seems like a great idea

  • @scottmiller2591
    @scottmiller2591 4 years ago +5

    I was doing something similar for a few decades before this paper came out (no ReLU on the stage output, though). I was engaged in studies in layer by layer training, and the argument for me was "why spend all that time generating a good output for layer k, just to distort it in layer k+1?" Also, I think the physicist in me liked the notion of nonlinear perturbation of a linear model, since linear models work really well a lot of the time (MNIST, I'm looking at you). At any rate, this approach worked quite well in the time series signal processing I was doing, and when the paper came out, I read with relish to see what else they had found that was new. Unfortunately, like you I found that underneath the key idea was a heap of tricks to make the whole thing hang together which seemed to obscure how much was ResNet and how much was tricks.

  • @zawdvfth1
    @zawdvfth1 4 years ago +2

    "Sadly, the world has taken the ResNet, but the world hasn't all taken the research methodology of this paper." I really appreciate your picks are not only those papers surpassing the performance of the state of the art, but also those with intriguing insights or papers inspiring us by their ways of conducting experiments and testing hypotheses. Most vanish, but residual, as it moves forward.

  • @jingrenxu3250
    @jingrenxu3250 3 years ago +1

    Wow, you read the author names perfectly!

  • @alandolhasz7863
    @alandolhasz7863 4 years ago +3

    I've used Resnets quite a bit and thought I understood the paper reasonably well when I read it, but I was wrong. Great video!

  • @SunilMeena-do7xn
    @SunilMeena-do7xn 4 years ago +3

    Thanks Yannic. Revisiting these classic papers is very helpful for beginners like me.

  • @milindbebarta2226
    @milindbebarta2226 1 year ago +1

    This is probably one of the better videos on these classic research papers on YouTube. I've seen some terrible explanations, but you did pretty well. Good job!

  • @nathandfox
    @nathandfox 3 years ago

    Revisiting classic papers is SO NICE for new people entering the field, to understand the history of the million tricks that get automatically applied nowadays.

  • @yoyoyoyo7813
    @yoyoyoyo7813 3 years ago

    I'm struggling to understand papers, but your explanation really hand-held me through this particular paper. For that, you are awesome. Thank you so much.

  • @LNJP13579
    @LNJP13579 4 years ago +2

    Yannic - you are doing a superb job. Your quality content has a "lower dopamine rush effect". Thus, it would not go viral, but with time you will be a force to reckon with. Not many can explain with so much clarity, depth & speed (one paper daily). I have one request: if you could create an ACTIVE mapping of papers to CITATIONS (and similar metrics), I would get to choose the MOST RELEVANT PAPERS to watch. It would be a great time saver & drastically improve views on the better-metric videos :) .

  • @alexandrostsagkaropoulos
    @alexandrostsagkaropoulos 1 year ago

    Your explanations resonate so well with me that it is like pushing knowledge directly into my head. Does anyone else have the same feeling?

  • @slackstation
    @slackstation 4 years ago +3

    Great paper. It must be obvious to you but, to a layman, I finally understand where the "Res" in "ResNet" comes from. Great work.

  • @xuraiis3100
    @xuraiis3100 4 years ago +8

    10:50 This should have been so obvious, how did I never think of it like that 😨

  • @frederickwilliam6497
    @frederickwilliam6497 4 years ago +7

    Building hype for attention is all you need v2! Nice selection!

  • @anadianBaconator
    @anadianBaconator 4 years ago +60

    That was a short break

  • @MrjbushM
    @MrjbushM 4 years ago +1

    Thanks for these videos in the classics series. Not all of us have a master's or PhD degree; these classic papers help us understand the main and core ideas of deep learning, papers that are important and that pushed the field forward.

  • @timdernedde993
    @timdernedde993 4 years ago +4

    Really enjoyed this video! I think going through these older papers that had a lasting impact over multiple years is really insightful, especially for those who are fairly new to the field, like me.

  • @RaviAnnaswamy
    @RaviAnnaswamy 2 years ago

    I like how you have highlighted that if a small architecture exists that can solve a problem, residual connections will help discover it from within a larger architecture - I think this is a great explanation of the power of residual connections. This has two nice implications. First, I do not need to worry about finding exactly how many layers are appropriate; I can start with a supersized architecture and let training reduce it to the subset that is needed! Let the data carve out the subnetwork architecture. Secondly, even if the subnetwork is small, it is harder to train a small network directly; it is easier to train a larger network with more degrees of freedom which functionally reduces to the smaller network. One can distill later.

  • @MyU2beCall
    @MyU2beCall 4 years ago

    COOL ! To discuss those classics. A formidable tribute to the writers and a great way to emphasize their contributions to the history of Artificial Intelligence.

  • @woolfel
    @woolfel 4 years ago +2

    nice explanation. I've read the paper before and missed a lot of details. still more insights to learn from that paper.

  • @__init__k917
    @__init__k917 3 years ago

    Would love to see more papers like these which have used unique training tricks. I request you do more videos on papers which solve the problems of training neural networks: tips and tricks and why they work. Why local response normalisation works, what's the best way to initialise your network layers for a vision task or an NLP task. In a nutshell, what works and why.🙏

  • @RefaelVivanti
    @RefaelVivanti 4 years ago +2

    Thanks, this was fun. I knew some of it but you put it in context.
    Please do more of these classics. If you can, maybe something on the UNET/fully convolutional basic papers.

  • @reasoning9273
    @reasoning9273 2 years ago

    Great video! I have watched like five videos about ResNet on youtube and this one is by far the best. Thanks.

  • @briancase6180
    @briancase6180 3 years ago

    This is a great series. I'm a very experienced software and hardware engineer who's just now getting serious about learning about ML and deep learning and the whole space. So, what really helps me at this point is not NN 101 but what is the landscape, what do all the acronyms mean, what is the relative importance of various ideas and techniques. This review of classic material is extremely helpful: it paints a picture of the world and helps me put things in their places in my mental model. Then I can dive deeper when I see something important for my current tasks and needs. Keep these coming!

  • @MiottoGuilherme
    @MiottoGuilherme 4 years ago +1

    Great video! I think there is a lot of value in reviewing old papers when they are cited all the time by new ones. That is exactly the case for ResNets.

  • @johngrabner
    @johngrabner 4 years ago +3

    Would love a video enumerating, with explanations, all the lessons learned, organized by importance to modern solutions.

  • @DiegoJimenez-ic8by
    @DiegoJimenez-ic8by 4 years ago +2

    Thanks for visiting iconic papers, great content!!!

  • @rockapedra1130
    @rockapedra1130 3 years ago +1

    Another excellent summary! Yannic is one of the best educators out there!

  • @WLeigh-pt6qs
    @WLeigh-pt6qs 3 years ago

    Hey Yannic, you are such good company for learning deep learning. You lifted me out of all the struggles. Thank you for sharing your insight.

  • @duncanmays68
    @duncanmays68 3 years ago +3

    I disagree with the assertion that the layers are learning “smaller” functions in ResNets. The results cited to support this claim, that the activations of the layers in the ResNets are smaller than those in comparable feed-forward networks, can be caused by small weights and large biases, which L-2 regularization would encourage since it only operates on weights and not biases. The average magnitude of the weights in a layer has no relation to the complexity of the function they encode, since the weights of a layer can simply be scaled down without drastically changing this function. Moreover, in their paper on the Lottery Ticket Hypothesis, Frankle et al. find that ResNets are generally less compressible than feed-forward networks, meaning the functions they encode are more complex than in comparable feed-forward networks.
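
    For reference, a minimal PyTorch sketch of the weight-only L2 regularization (weight decay) the comment describes; the toy model and the values are illustrative, not taken from the paper:

    ```python
    import torch
    from torch import nn

    # Toy model; the point is only the parameter grouping below.
    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

    # Apply L2 weight decay to weight matrices only, leaving biases unregularized,
    # which is the setup the comment's argument assumes.
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        (no_decay if name.endswith("bias") else decay).append(param)

    optimizer = torch.optim.SGD(
        [
            {"params": decay, "weight_decay": 1e-4},    # weights: regularized
            {"params": no_decay, "weight_decay": 0.0},  # biases: left alone
        ],
        lr=0.1,
        momentum=0.9,
    )
    ```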

  • @xxlvulkann6743
    @xxlvulkann6743 3 months ago

    I appreciate these videos! Very helpful for putting ML developments in context

  • @rippleproject7467
    @rippleproject7467 3 years ago +2

    I think the identity for a 3x3 matrix would be a diagonal of 1s instead of a 1 in the center. @Yannic Kilcher 08:50
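
    For context on the point at 08:50, here is a small PyTorch check (my reading of the video, not its code): for a convolution kernel the identity is a single 1 at the center, while the diagonal-of-ones form is the identity of a dense weight matrix:

    ```python
    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 1, 5, 5)

    # 3x3 convolution kernel whose only nonzero entry is a 1 at the center.
    kernel = torch.zeros(1, 1, 3, 3)
    kernel[0, 0, 1, 1] = 1.0

    y = F.conv2d(x, kernel, padding=1)
    print(torch.allclose(x, y))  # True: this kernel passes the input through unchanged
    ```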

  • @Annachrome
    @Annachrome 1 year ago

    Self-learning ANNs and coming across these papers is daunting - tysm!!

  • @to33x
    @to33x 4 years ago +2

    Came here from DongXii to support our NIO superstar, Ren Shaoqing!

  • @ahmedabbas2595
    @ahmedabbas2595 2 years ago

    This is beautiful! a beautiful paper and a beautiful explanation, simplicity is genius!

  • @RaviAnnaswamy
    @RaviAnnaswamy 2 years ago

    Very enjoyable, insight-filled presentation, Yannic, thanks! It almost seems like residual connections allow the network to use only the layers that don't corrupt the insight. Since every fully connected or convolutional layer is a destructive operation (a reduction) on its inputs, the signal may get distorted beyond recovery over a few blocks. By having a sideline crosswire where not only the original input but any derived computation can potentially be preserved at each step, the network is freed from the 'tyranny of transformation'. :)
    Both the paper and Yannic highlight the idea that the goal shifts from 'deriving new insights from data' to 'preserving the input as long (deep) as needed': while all other types of layers in a network distort information or derive inferences from data, the residual connection allows preserving information and protecting it from being automatically distorted, so that any information can be safely copied over to any later layer (a minimal code sketch of such a block follows below).

    • @RaviAnnaswamy
      @RaviAnnaswamy 2 years ago

      The residual connection can be seen as similar to the invention of zero in arithmetic.
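
      A minimal sketch of the basic residual block described in this thread (two 3x3 convolutions with batch norm, the input added back before the final ReLU); the channel count and input size are illustrative:

      ```python
      import torch
      from torch import nn

      class BasicBlock(nn.Module):
          """Basic residual block: the shortcut carries x through unchanged,
          so the two conv layers only have to learn the residual F(x)."""

          def __init__(self, channels: int):
              super().__init__()
              self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
              self.bn1 = nn.BatchNorm2d(channels)
              self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
              self.bn2 = nn.BatchNorm2d(channels)
              self.relu = nn.ReLU(inplace=True)

          def forward(self, x: torch.Tensor) -> torch.Tensor:
              residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
              return self.relu(x + residual)  # identity shortcut preserves x

      block = BasicBlock(64)
      print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
      ```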

  • @hleyjr
    @hleyjr 3 years ago +1

    Thank you for explaining it! So much easier for a beginner like me to understand

  • @aadil0001
    @aadil0001 4 years ago

    Revisiting the classics which had massively changed and forged the direction for DL research is so fun. Loved the way you explained the things. So cool. Thanks a lot :)

  • @lilhikaru8361
    @lilhikaru8361 3 years ago

    Excellent video featuring an extraordinary paper. Good job bro

  • @emmarbee
    @emmarbee 4 years ago +1

    Loved it and subscribed! And yes please do more of classics!

  • @dhruvgrover7416
    @dhruvgrover7416 4 years ago +2

    Loved the way you are reviewing papers.

  • @anheuser-busch
    @anheuser-busch 4 years ago

    Thanks for this! And I really enjoy going through the old papers, since you can pick up things you missed when first reading them. Enjoy the break!!

  • @Parisneo
    @Parisneo 3 years ago

    I loved this paper. ResNets are still cool. Nowadays there are more complicated versions of these nets, but the ideas still pretty much hold.
    Nice video by the way.

  • @aa-xn5hc
    @aa-xn5hc 4 years ago

    I love this series on historical papers

  • @wamkong
    @wamkong 3 years ago

    Great discussion of the paper. Thanks for doing this.

  • @lucashou4920
    @lucashou4920 1 year ago

    Amazing explanation. Keep up the good work!

  • @kbkim-f4z
    @kbkim-f4z 3 years ago

    This is it!!!!! Great thanks from South Korea!!!!!

  • @gringo6969
    @gringo6969 4 years ago +1

    Great idea to review classic papers.

  • @TimScarfe
    @TimScarfe 4 years ago +1

    I love the old papers idea! Nice video

  • @yourdudecodes
    @yourdudecodes 1 month ago

    Loved it. Thanks
    - Deep Learning Enthusiast

  • @herp_derpingson
    @herp_derpingson 4 years ago +1

    24:06 I think LeNet also did something similar but my memory fades.
    .
    Legendary paper. Great work. Too bad, I think in the last two years we haven't seen any major breakthroughs.

    • @tylertheeverlasting
      @tylertheeverlasting 4 years ago +1

      Is the large-scale use of transformers not a big breakthrough?

    • @herp_derpingson
      @herp_derpingson 4 years ago +1

      @@tylertheeverlasting Transformers came out in 2017, if I remember it right.

  • @ramchandracheke
    @ramchandracheke 4 years ago

    Hats off to Dedication level 💯

  • @chaima7774
    @chaima7774 2 years ago

    Thanks for these great explanations, still a beginner in deep learning but I understood the paper very well!

  • @danbochman
    @danbochman 4 years ago +1

    Love the [Classic] series.

  • @julianoamadeulopesmoura5666
    @julianoamadeulopesmoura5666 4 years ago +1

    I've got the impression that you're a very good Chinese speaker from your pronunciation of the authors' names.

  • @GauravSharma-ui4yd
    @GauravSharma-ui4yd 4 years ago +3

    What is the Inception-net hypothesis? In the Xception paper, the author explains the hypothesis behind Inception-net, but I couldn't fully grasp it and got a bit lost. Can you explain that?

    • @YannicKilcher
      @YannicKilcher  4 years ago

      I'm sorry I have no clue what the inception-net hypothesis is, but also I don't know too much about inception networks.

  • @MrMIB983
    @MrMIB983 4 years ago +1

    Universal transformer please! Love your videos, great job

  • @gorgolyt
    @gorgolyt 3 years ago +2

    Little question about the connections when the shape changes: a simple 1x1 convolution can give the right depth but the feature maps would still be the original size. So I assume the 1x1 convolutions are also with stride 2?

    • @xxlvulkann6743
      @xxlvulkann6743 3 months ago

      This is correct and is specified in the paper. I had the same question
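
      A sketch of what that projection shortcut looks like (option B in the paper; batch norm on the projection as in common implementations, and the channel numbers are illustrative): the 1x1 convolution is applied with stride 2 so that both the channel count and the spatial size match the downsampled main path:

      ```python
      import torch
      from torch import nn

      # 1x1 projection with stride 2: doubles the channels, halves the spatial size.
      shortcut = nn.Sequential(
          nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
          nn.BatchNorm2d(128),
      )

      x = torch.randn(1, 64, 56, 56)
      print(shortcut(x).shape)  # torch.Size([1, 128, 28, 28])
      ```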

  • @romagluskin5133
    @romagluskin5133 4 years ago +1

    what a fantastic summary, thank you very much !

  • @shardulparab3102
    @shardulparab3102 4 years ago +1

    Another great one! I would like to request a review of angular losses, especially ArcFace, as they have begun being adopted for multiple classification tasks - another *classic* review.
    Thanks!

  • @matthewevanusa8853
    @matthewevanusa8853 3 years ago

    Best explanation I have seen, nice work

  • @OwenCampbellMoore
    @OwenCampbellMoore 4 years ago

    Love these reviews of earlier landmark papers! Thanks!!!

  • @lolitzshelly
    @lolitzshelly 3 years ago +1

    Thank you for this clear explanation!

  • @norik1616
    @norik1616 4 years ago +6

    I love how you will *not* review papers based on impact, except when you do :D
    JK, please mix in more [classic] papers, or whatever else you feel like - just keep the drive for ML. It's contagious! 💦
    An idea: a combined review/your take on a whole class of models (e.g. MobileNet and its variants &| YOLO variants)

  • @seanbenhur
    @seanbenhur 4 years ago +1

    Please make more videos on classic papers... like YOLO, Inception!!

  • @goldfishjy95
    @goldfishjy95 3 years ago

    Thank you! This is unbelievably helpful as someone who's just starting out. Subscribed!

  • @animeshsinha
    @animeshsinha 1 year ago

    Thank You for this beautiful explanation!!

  • @bernardoramos9409
    @bernardoramos9409 4 years ago +2

    These skip connections were also "learned" automatically by AutoML

  • @epicmarc
    @epicmarc 3 years ago

    I was wondering if there is any consensus now on why ResNet works so well? The common answer seems to be that the residual connections help avoid the vanishing gradient problem, but in the paper the authors argue that they don't think this is the case!

  • @yahuiz7877
    @yahuiz7877 2 years ago

    looking forward to more videos like this!

  • @mariosconstantinou8271
    @mariosconstantinou8271 2 years ago

    In 3.4 - Implementation, it says that they use BN after each conv layer and before the activation. Does this hold true for ResNet-50+? In the bottleneck blocks, do they add BN after the first 1x1 conv layer, then after the 3x3, and lastly after the 1x1 again? Or was the implementation section discussing the ResNet-34 structure?
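
    A sketch of a ResNet-50-style bottleneck block under that reading of Section 3.4, i.e. BN after every convolution (the 1x1, the 3x3, and the final 1x1) and before the activation, with the last ReLU applied after the shortcut addition; this matches common implementations, and the channel sizes are illustrative:

    ```python
    import torch
    from torch import nn

    class Bottleneck(nn.Module):
        """1x1 -> 3x3 -> 1x1 bottleneck with BN after each conv, before the activation."""

        def __init__(self, in_ch: int, mid_ch: int):
            super().__init__()
            self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)
            self.bn1 = nn.BatchNorm2d(mid_ch)
            self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(mid_ch)
            self.conv3 = nn.Conv2d(mid_ch, 4 * mid_ch, 1, bias=False)
            self.bn3 = nn.BatchNorm2d(4 * mid_ch)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.relu(self.bn2(self.conv2(out)))
            out = self.bn3(self.conv3(out))   # BN, but no ReLU before the addition
            return self.relu(out + x)         # shortcut addition, then the final ReLU

    block = Bottleneck(256, 64)
    print(block(torch.randn(1, 256, 14, 14)).shape)  # torch.Size([1, 256, 14, 14])
    ```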

  • @timobohnstedt5143
    @timobohnstedt5143 3 years ago

    A question about the parameters of the ResNet: as far as I understood it, you combine the input with the output of another layer. This enables you to train more stable networks. Why does this lead to fewer parameters than VGG? I would guess that this is because you perform the more costly operations (more filters) on layers that are already reduced in their dimensions due to the stride? Is this correct?
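
    One back-of-the-envelope comparison that may help (my summary, using the standard published layer sizes): VGG spends most of its parameters on fully connected layers applied to the full 7x7x512 feature map, while ResNet global-average-pools down to a single vector before one small classifier, and its heavier convolutions run on already-downsampled feature maps:

    ```python
    # Rough weight counts for the classifier heads (standard layer sizes):
    vgg_fc1     = 512 * 7 * 7 * 4096   # first fully connected layer of VGG-16/19
    resnet34_fc = 512 * 1000           # the only fully connected layer of ResNet-34

    print(f"VGG fc1:      {vgg_fc1 / 1e6:.1f}M weights")     # ~102.8M
    print(f"ResNet-34 fc: {resnet34_fc / 1e6:.2f}M weights")  # ~0.51M
    ```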

  • @nirmaladhikaree9609
    @nirmaladhikaree9609 6 months ago

    I don't understand from the start of the quoted statement (which I will write) up to 9:28. You are saying, "instead of learning to transform X via neural networks into X, which is an identity function, why don't we have X stay X and then learn whatever we need to change?"
    Can you explain this part to me with some analogy? I am a beginner here. Thanks!!
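
    For reference, the statement being paraphrased there is Eq. (1) of the paper: the stacked layers learn only the residual, while the identity comes along the shortcut for free, so "changing nothing" just means driving the residual toward zero:

    ```latex
    % Eq. (1) of the paper: the block's layers learn F(x), not the full mapping H(x).
    \[
      y = F(x, \{W_i\}) + x, \qquad F(x, \{W_i\}) := H(x) - x .
    \]
    % If the optimal H is the identity, the block only has to push F(x) toward 0,
    % which is easier than learning H(x) = x from scratch.
    ```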

  • @bijjalanaganithin3798
    @bijjalanaganithin3798 4 years ago +1

    Loved the explanation. Thank you so much!

  • @oncedidactic
    @oncedidactic 3 years ago

    This is really valuable tbh. Great video!

  • @PetrosV5
    @PetrosV5 3 years ago

    Amazing narration, keep up the excellent work.

  • @sabako123
    @sabako123 3 years ago

    Thank you Yannic for this great work

  • @tungvuthanh5537
    @tungvuthanh5537 3 years ago

    This helped me so much , big thanks to you

  • @SirDumbledore16
    @SirDumbledore16 1 year ago +1

    that chuckle at 13:06 😂

  • @lucidraisin
    @lucidraisin 4 years ago +2

    You are back! I was getting withdrawals lol

  • @CSBGAGANHEGDE
    @CSBGAGANHEGDE 2 years ago

    You have broken down the language in the paper into a very simple and easily digestible form. Thank you.

  • @fugufish247
    @fugufish247 3 years ago

    Fantastic explanation

  • @User127169
    @User127169 2 years ago

    Great video. At 27:25 Yannic says: "Overfitting was still a thing back then". Is it a solved problem today? Can anyone please share some resources or papers about these new augmentation methods he is referring to?

    • @RaviAnnaswamy
      @RaviAnnaswamy 2 years ago +1

      Regarding overfitting - I think it is solved in practice using regularization in many, many architectures.
      Pre-2010, regularization mostly meant controlling the number and size of coefficients or weights. Post-2010, more tricks have been developed.
      Overfitting happens when a powerful, high-capacity network eagerly memorizes data fast (using incidental hints to associate input with output rather than the true relationships). If the learning of associations can be made slow and deliberate, and this quick learning of hints can be 'delayed', then there is a possibility that the network learns true relations.
      Using these mechanisms (scaled weight initialization, dropout, adding more layers with residual bypass, batch and layer normalization, weight decay, a very low learning rate, a large dataset, reasonable augmentations, fewer epochs on the data) we slow down the training curve to the point that whatever is being learned also holds on the validation data.
      So modern network training is the art of tweaking the mix of these while watching the training curve and validation curve co-evolve.
      I think the innovation since 2016-17 is that a processing layer, a norm layer, dropout and the activation are bundled into a block. What was once a single feedforward layer is now a block that is a self-contained, 'damped', regularized learner unit. The block concept avoids having to tweak each section of the architecture. If a single block can be 'guaranteed' to learn reasonably well without overfitting, then you can stack as many of them as needed for the size of the dataset and the structure of the domain. Almost like n repeaters used to transmit electricity or radio waves: if the distance is greater, you simply add yet another transformer or transmitter block, respectively.
      I also think that all the big players have mastered this into practical engineering, so that they confidently train 100+ layers (even GPT-2 has 12 transformer blocks, each with at least 10 layers) without worrying about overfitting, because all these tricks are baked into the blocks.
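
      A compressed sketch of how a few of those ingredients typically appear together in code; the values are common defaults, not taken from the paper or the reply above:

      ```python
      import torch
      from torch import nn
      from torchvision import transforms

      # Data augmentation: random crops and flips, as commonly used on ImageNet-style data.
      augment = transforms.Compose([
          transforms.RandomResizedCrop(224),
          transforms.RandomHorizontalFlip(),
          transforms.ToTensor(),
      ])

      # Tiny model with a normalization layer and dropout, just to show where they sit.
      model = nn.Sequential(
          nn.Conv2d(3, 64, 3, padding=1, bias=False),
          nn.BatchNorm2d(64),
          nn.ReLU(inplace=True),
          nn.AdaptiveAvgPool2d(1),
          nn.Flatten(),
          nn.Dropout(p=0.5),
          nn.Linear(64, 1000),
      )

      # Weight decay (L2) and a modest learning rate on the optimizer.
      optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
      ```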

  • @shambhaviaggarwal9977
    @shambhaviaggarwal9977 3 years ago

    Thank you so much! Keep making such awesome videos

  • @LouisChiaki
    @LouisChiaki 4 years ago

    Nice review about residual network!

  • @GoriIIaTactics
    @GoriIIaTactics 4 years ago

    One thing I'm confused about is why ResNet requires fewer computational cycles.
    EDIT: NVM, I saw the filter numbers. Now I'm confused as to why they designed each layer to have fewer channels than in VGG.

  • @sebastianamaruescalantecco7916
    @sebastianamaruescalantecco7916 3 years ago

    Thank you very much for the explanation! I'm just starting to use pretrained nets and wondered how I could improve the performance of my models, and this video cleared up many doubts I had. Keep up the amazing work!

  • @davidvc4560
    @davidvc4560 2 years ago

    excellent explanation

  • @rodrigogoni2949
    @rodrigogoni2949 1 year ago

    Very clear thank you!

  • @메린-q8b
    @메린-q8b 4 years ago

    I have been studying ResNet for a few days, and this video helped me understand most of the things I didn't understand. Thanks. But I have a question. It starts with the idea that when the network deepens, the newly added layers could avoid the problem the authors found simply by performing an identity mapping (7:00). But why do they add skip connections from the beginning of the network? For that reason, I thought one could train without skip connections up to the depth of VGG, which has been proven to work, and apply skip connections only to the later layers. Is this something that has to be determined experimentally?

  • @kamyarjanparvari4244
    @kamyarjanparvari4244 2 years ago

    Very Helpful. thanks a lot. 👍👌

  • @lenayoharna4030
    @lenayoharna4030 2 years ago

    such a great explanation... tysm

  • @everythingaccount9619
    @everythingaccount9619 2 years ago

    Hey guys, what is meant by "the weight layers tend towards the zero function"? Can someone explain this to me? I am very new to all of this.

  • @ghfghf7
    @ghfghf7 4 years ago +1

    Give us a chance to catch up!

  • @KB-zg8ho
    @KB-zg8ho 3 years ago

    Can you please continue explaining more papers?