PyTorch Hooks Explained - In-depth Tutorial

  • Published 14 Oct 2024

COMMENTS • 139

  • @rockapedra1130
    @rockapedra1130 6 months ago +3

    These lectures are gold!!

  • @altostratous
    @altostratous 3 years ago +2

    Most professional video I've ever seen in programming.

  • @scottmiller2591
    @scottmiller2591 3 years ago +4

    Very nice presentation: good pacing, good use of animation, good examples, for a complicated subject that is not explained clearly in one place in the documentation, but scattered throughout without a unifying set of examples.

  • @hilmandayo
    @hilmandayo 4 years ago +10

    I LOVE the small details/catch-ups you threw into the video!
    They can clear up a lot of doubts that beginners will probably face.
    Keep the videos coming! You contribute a lot to the world with this kind of video.
    By the way, your channel has made a huge transition, from producing totally random videos about exercise, etc., to deep learning, haha.

    • @elliotwaite
      @elliotwaite  4 years ago

      Thanks, Hilman! I'll keep the videos coming.
      Hah, yep, this channel has been through some interesting phases. But I think I've found my niche.

  • @kartheekakella2757
    @kartheekakella2757 4 years ago +13

    awesome vid! this channel's gonna go viral, take my word for it..

    • @elliotwaite
      @elliotwaite  4 years ago +1

      Thanks, Sukruth! You're the first to comment this prediction as far as I can remember. I'll try to make it come true.

    • @xzl20212
      @xzl20212 2 years ago

      @@elliotwaite I really appreciate the quality of your videos. Glad you do not sacrifice quality for subscriptions.

  • @rsd2dcc
    @rsd2dcc 3 years ago +3

    Finally, I've nailed the hooks. Thank you :)

  • @samanthaqiu3416
    @samanthaqiu3416 4 years ago +4

    Very interesting. I saw your autograd video and it was very cool. Something that gets confusing for me is when you need to retain the graph in order to use the gradients computed in a first backward pass in a second meta-loss calculation.

    • @elliotwaite
      @elliotwaite  4 years ago +6

      Ah, the reason you have to set retain_graph=True is that the default behavior of the backward method is that after the gradients have been passed through, it deletes the data stored in the backward graph that was needed to calculate those gradients (such as the data for the tensors that were used in the forward pass). This is the default behavior because most of the time people only do one pass through the backward graph, and deleting the graph's data saves memory. So you have to specify that you want to keep the graph's data in memory if you want to do a second pass through it.
      Let me know if I didn't answer your question, or if there is anything you're still unsure about.

  • @abhijitdeo2683
    @abhijitdeo2683 3 years ago +1

    Dude this content is gold

    • @abhijitdeo2683
      @abhijitdeo2683 3 years ago +1

      If you are planning member-only content and stuff too, I'm in, man... this is literally gold.

    • @elliotwaite
      @elliotwaite  3 years ago +1

      @@abhijitdeo2683, thanks! I don't have any plans for member only content at the moment, but I appreciate your comment.

  • @datascience3008
    @datascience3008 1 year ago

    It's amazing how you reply to all the comments you receive.

    • @elliotwaite
      @elliotwaite  1 year ago

      🙂 I enjoy the subject matter. Also, it's not too many to get overwhelmed by at the moment.

  • @junweizheng1994
    @junweizheng1994 2 years ago +1

    My first comment on YouTube. This video is amazing, and I can imagine you have done lots of work to make it. I really appreciate that. Good content, good presentation, good slides. This channel will get popular if you continue making great videos like this!

    • @elliotwaite
      @elliotwaite  2 years ago +1

      Thank you! I hope to make more videos in the future.

    • @chaupham1186
      @chaupham1186 1 year ago +1

      @@elliotwaite Looking forward to it! Thanks for great videos

  • @pizhichil
    @pizhichil 4 years ago +1

    Thank you so much for this video. As always, very helpful. Without it, it would have taken a large effort to understand everything... thanks.

    • @elliotwaite
      @elliotwaite  4 years ago

      Thanks, Amit! I'm glad you found it helpful.

  • @khushpatelmd
    @khushpatelmd 3 years ago +1

    Thank you so much. Please make more videos. You are an incredible teacher!!

    • @elliotwaite
      @elliotwaite  3 years ago +1

      Thanks for the comment, glad you liked the video. I have been thinking about starting to make videos again, and your encouragement helps.

  • @raunakkbanerjee9016
    @raunakkbanerjee9016 2 months ago

    Excellent video.. crystal clear explanations

    • @elliotwaite
      @elliotwaite  2 months ago

      @@raunakkbanerjee9016 thanks!

  • @jiangpengli86
    @jiangpengli86 1 month ago

    Marvelous tutorial. Thank you so much.

    • @elliotwaite
      @elliotwaite  1 month ago

      @@jiangpengli86 I'm glad to hear you liked it.

  • @fernandofariajunior
    @fernandofariajunior 4 months ago

    This video is so helpful, thanks for making it!

    • @elliotwaite
      @elliotwaite  4 months ago

      Thanks. I'm glad you liked it.

  • @zichenwang8068
    @zichenwang8068 1 year ago

    Thank you so much for sharing this high-quality tutorial.

    • @elliotwaite
      @elliotwaite  1 year ago +1

      Woah, this is the first Super Thanks I've ever received. Thanks, Zichen! 😊 I'm glad you found the tutorial helpful.

  • @catthrowvandisc5633
    @catthrowvandisc5633 4 years ago +1

    Hey Elliot, thank you for this as well! I came to your channel for your autograd video and it really helped me quickly get a clearer picture. This one's just as good. I really like how you incrementally take the problems deeper in each of your videos; they are very helpful for cementing understanding. Would you be able to do one on PyTorch distributed training? I couldn't find a good video explanation to help with it.

    • @elliotwaite
      @elliotwaite  4 years ago +2

      Glad to hear you're finding these videos helpful. And thanks for the suggestion! I still don't have a good understanding of how PyTorch distributed training works either yet, but it seems like something I should learn at some point. I'm not sure when I'll get around to this one, since it seems like it might take a bit of research to get a deeper understanding of it, but I'll definitely add it to my list of potential future videos. And if I decide to make it, I'll leave another reply here letting you know when it's posted.

  • @andreborgescavalcante4589
    @andreborgescavalcante4589 2 years ago

    One related thing: to read the grad properties of intermediate tensors, we only need to first call retain_grad() and then return that tensor as an output of the forward method.

    • @elliotwaite
      @elliotwaite  2 years ago

      Thanks for mentioning this tip.
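The retain_grad() tip mentioned above can be sketched minimally (the values are made up for illustration):

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = a * 3        # intermediate (non-leaf) tensor: .grad is normally not kept
b.retain_grad()  # ask autograd to keep b's gradient after backward
c = b * 4
c.backward()
print(b.grad)  # tensor(4.)   (dc/db)
print(a.grad)  # tensor(12.)  (dc/da)
```

Without the retain_grad() call, b.grad would stay None (and newer PyTorch versions emit a warning when you access it).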

  • @abdelmananabdelrahman4099
    @abdelmananabdelrahman4099 1 year ago

    Wow. Great content! We need more of these videos.

    • @elliotwaite
      @elliotwaite  1 year ago

      Thank you for the encouraging comment. I may make more in the future.

  • @ohotpow
    @ohotpow 3 years ago

    Very good video! It should be linked in the PyTorch documentation.

  • @UrBigSisKey
    @UrBigSisKey 1 year ago +1

    This video was the best resource I could find on the internet on this topic. It really helped me so much, thank you for all your efforts and the clear explanation! 😊

  • @shaozhuowang3403
    @shaozhuowang3403 4 years ago +3

    It's great as always, thank you guys.

  • @aymensekhri2133
    @aymensekhri2133 2 years ago

    Very amazing. Could you please take, for example, the state-of-the-art models in deep learning, break them down, and explain how the flow works, especially for those models that contain very specific PyTorch methods like "register hooks"? I have noticed that on YouTube most YouTubers focus on the big terms in PyTorch and explain the simple concepts, but once we get to the SOTA models we find many new and complex things.

    • @elliotwaite
      @elliotwaite  2 years ago

      That's a good suggestion for potential future videos, thanks. I've noticed that as well, that YouTubers usually don't break down the more complex PyTorch models. I'm currently busy working on another project and have taken a break from making YouTube videos, but I'll add this idea to my list of video ideas in case I get back into making YouTube videos in the future.

  • @rayanzaki9314
    @rayanzaki9314 3 years ago +1

    Awesome explanation. It was really very helpful. Could you make a video on quantization in PyTorch? Thanks.

    • @elliotwaite
      @elliotwaite  3 years ago

      Thanks! Glad you liked it. And thanks for the suggestion. I'll add that to my list of potential future videos. Quantization is something I also want to learn more about at some point, and I'll probably make a video about it when I do.

  • @ArunKumar-bp5lo
    @ArunKumar-bp5lo 2 years ago

    Nicely explained visually.

  • @shubhamthapa7586
    @shubhamthapa7586 3 years ago

    Wow, thanks for making this video, finally my doubts are cleared now!

    • @elliotwaite
      @elliotwaite  3 years ago

      Glad it helped.

    • @shubhamthapa7586
      @shubhamthapa7586 3 years ago

      @@elliotwaite Yeah, I was trying to implement Grad-CAM, so I thought of clearing up the concept of hooks first, and this video is just perfect for that.

  • @carlossegura403
    @carlossegura403 3 years ago

    Great summary of PyTorch hooks, and great video quality!

  • @sudhirdeshmukh8445
    @sudhirdeshmukh8445 3 years ago

    Hi Elliot, thanks for yet another wonderful PyTorch video. I was just wondering why "@staticmethod" is mentioned before the forward function of a module. Why use "@staticmethod", and when and where? REF: 15:21 in the video.
    Thank you

    • @elliotwaite
      @elliotwaite  3 years ago +1

      "@staticmethod" is just a function decorator that makes it so that when the method is called, its first argument isn't auto-filled with the value of the instance that is calling the method, or in other words, it just makes it so that the first parameter of the method doesn't have to be "self". You can use it on any methods where the "self" parameter is not used, in which case you can add the "@staticmethod" decorator and remove the "self" parameter. Some tests show that using it when appropriate provides a tiny performance boost, but I think it mostly just makes it cleaner in the sense that you don't list any unused parameters. You can find more info about it here: docs.python.org/3/library/functions.html#staticmethod
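A minimal sketch of the behavior described in the reply above (the class and method names are made up for the example):

```python
class MathUtils:
    @staticmethod
    def double(x):  # no `self` parameter: the instance isn't auto-passed
        return x * 2

    def quadruple(self, x):
        # A regular method can still call the static method.
        return MathUtils.double(MathUtils.double(x))

# Callable on the class itself or on an instance, identically.
print(MathUtils.double(3))       # 6
print(MathUtils().double(3))     # 6
print(MathUtils().quadruple(3))  # 12
```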

  • @programmer8064
    @programmer8064 1 year ago

    Thank you so much!!!!!!!!!!!!!!!!!!!!! I love this video!!!

  • @mariogalindoq
    @mariogalindoq 2 years ago

    Elliot: very interesting, but could you give examples of using hooks? That is, why would it be useful to use hooks? In which problems do you need hooks? What can be done with hooks that can't be done another way?

    • @elliotwaite
      @elliotwaite  2 years ago

      Good question. I probably should have included more specific use cases in the video. The best use case I can think of for hooks is when you want to change how data flows through an existing module (so these would be using the module style hooks, `register_forward_hook` and `register_forward_pre_hook`). For example, these hooks were used in PyTorch's library to implement these tools:
      Quantization: pytorch.org/docs/stable/quantization.html
      Pruning: pytorch.org/tutorials/intermediate/pruning_tutorial.html
      Spectral Norm: pytorch.org/docs/stable/generated/torch.nn.utils.spectral_norm.html
      Weight Norm: pytorch.org/docs/stable/generated/torch.nn.utils.weight_norm.html
      You can download the PyTorch library and search the source code for "register_forward_" to see how they were used.
      Good use cases for the Tensor hooks are harder to think of, but they could be used to experiment with novel ways of managing the gradients as they flow through the backward graph. For example, to do gradient clipping, most people just clip the gradients at the end of the backward pass, but maybe it would be better in certain cases to clip the gradients as they are flowing through the backward graph instead, which could be done with Tensor hooks (that might actually be a good experiment to try 🤔).
      I personally haven't ever run into a situation where I needed to use them in the projects I've worked on, but it's good to know they're there if needed, and I wanted to explain them since my viewers had asked about them.
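The clip-as-it-flows idea mentioned in the reply above could be sketched with a Tensor hook like this (the tensor values and clip range are made up for illustration):

```python
import torch

def clip_hook(grad):
    # Returning a tensor from a hook replaces the gradient that
    # continues flowing backward through the graph.
    return grad.clamp(-1.0, 1.0)

a = torch.tensor(5.0, requires_grad=True)
b = a * 10
a.register_hook(clip_hook)  # clip a's gradient (10.0) down to 1.0
b.backward()
print(a.grad)  # tensor(1.)
```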

  • @markomitrovic4925
    @markomitrovic4925 1 year ago

    Thank you for the explanation :)

  • @phuclai4492
    @phuclai4492 1 year ago

    Great video, I love it!!! I hope you make more great videos in the future.

  • @andreborgescavalcante4589
    @andreborgescavalcante4589 2 years ago

    Amazing video.

  • @jizhang2407
    @jizhang2407 3 years ago

    @11:00, I don't get why `return grad + 2` will update the `grad` variable, if this is not done by `grad += 2` and then `return grad` in the c_hook function... Can anybody enlighten me? Thanks. Anyway, brilliant tutorial, and I learned a lot. Thank you, Elliot.

    • @elliotwaite
      @elliotwaite  3 years ago

      Thanks, glad you liked the tutorial. About your question, I think during the backward pass, the PyTorch code does something like this:
      grad = registered_hook_function(grad)
      So it updates the grad variable with whatever was returned from the hook function, but not in the same way as an in-place operation would, because the grad variable is now pointing to a different tensor, the tensor returned from the hook function (unless the returned tensor is the same tensor that was passed into the hook function). This new tensor is then passed along as the gradient to the next backward nodes in the graph.
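The `grad = registered_hook_function(grad)` behavior described above can be checked directly (the tensor values are made up for the example):

```python
import torch

a = torch.tensor(3.0, requires_grad=True)
b = a * 2

# Internally the backward pass does roughly: grad = hook(grad),
# so the returned tensor replaces the gradient, no in-place op needed.
b.register_hook(lambda grad: grad + 2)
b.backward()
print(a.grad)  # tensor(6.)  ((1 + 2) flows into MulBackward0: 3 * 2 = 6)
```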

  • @jonatan01i
    @jonatan01i 3 years ago

    Thank you for this video, very informative, helped me a lot! Thanks!

  • @oheldad
    @oheldad 2 years ago

    Great tutorial, well done!

  • @az8134
    @az8134 3 years ago +1

    damn, I never looked into those details before.

  • @pouyaparsa5851
    @pouyaparsa5851 4 years ago +2

    Nice job, thank you
    I wonder, is there any way to see these nodes in code and print their properties, like where they point to and so on?
    In other words, could we go any further than knowing grad_fn is actually there?

    • @elliotwaite
      @elliotwaite  4 years ago +1

      Yeah, you can inspect some things, like the nodes in the backwards graph. I advise using a debugger. I use the one in PyCharm. The grad_fn property will point to a node in the backward graph, and then that node will have a next_functions property which will be a tuple of tuples that contain other nodes in the backward graph, and so on. For example:
      a = torch.tensor(2.0, requires_grad=True)
      b = (a * 3) * 4
      print(b.grad_fn.next_functions[0][0].next_functions[0][0])
      # Will print out: <AccumulateGrad object at 0x...>
      print(b.grad_fn.next_functions[0][0].next_functions[0][0].variable)
      # Will print out: tensor(2., requires_grad=True)
      The first print function will work its way back through the backward graph until it gets to the AccumulateGrad node for the `a` tensor, so it will print out that AccumulateGrad node object. And the second print statement will print out the variable associated with that AccumulateGrad node, which is the `a` tensor, so it will print out the `a` tensor.
      However, the things you can access are limited. For example, I don't think you can access which tensors are associated with the intermediate nodes, like the MulBackward0 nodes, since I think that information is stored on the C++ side of things.
      Good question. Thanks, Pouya!

    • @pouyaparsa5851
      @pouyaparsa5851 4 years ago +1

      @@elliotwaite thanks for this perfect answer !

  • @thevikinglord9209
      @thevikinglord9209 2 years ago

    Nice video, so how do you check out the graph?

    • @elliotwaite
      @elliotwaite  2 years ago

      Do you mean how do you check out the backward graph of the code you wrote? If so, I explored the backward graph by using a debugger and looking at the `grad_fn` property on the tensors.

  • @ハェフィシェフ
    @ハェフィシェフ 2 years ago

    Do you make the diagrams yourself or did you write code for it?

    • @elliotwaite
      @elliotwaite  2 years ago +1

      I made them myself using Figma.

  • @ravivaishnav20
    @ravivaishnav20 4 years ago +1

    Awesome explanation. Could you please give some intuition on data parallelism in PyTorch? And is there any way we can use a Colab GPU together with our laptop GPU?

    • @elliotwaite
      @elliotwaite  4 years ago

      Thanks!
      I'll add data parallelism in PyTorch to my list of potential future videos (but not sure when I might get around to making it). For RL tasks, I've just been using MPI for Python (mpi4py.readthedocs.io/en/stable/). An example of it being used can be seen in OpenAI's Spinning Up in Deep RL code:
      github.com/openai/spinningup/blob/master/spinup/utils/mpi_pytorch.py
      github.com/openai/spinningup/blob/master/spinup/utils/mpi_tools.py
      github.com/openai/spinningup/blob/master/spinup/algos/pytorch/vpg/vpg.py
      And there's also PyTorch's built-in distributed training tools, but I haven't dived into those much yet.
      Using a Colab GPU and your laptop's GPU in parallel should be possible, but I'm not sure of the details of how you would get it to work. I would imagine you'd establish a way to communicate between the two processes running on the separate machines, then synchronize the models at the start of training, then each model would compute the gradients for a separate batch of data, and then those gradients would get averaged using the communication method before being used to update the models.

  • @케이케이-u8y
    @케이케이-u8y 3 years ago

    Hi Elliot, very good video. I have a question about 14:39 in your video.
    I define def d_hook(grad): grad *= 100
    With e = c + d, I wrote d.register_hook(d_hook) and it affects c: c's gradient = 100.
    But if I do e = c + 1 * d and use the same hook func grad *= 100 registered with d.register_hook(d_hook), it does not affect c: c's gradient = 1.
    I don't know why there is such a difference.

    • @elliotwaite
      @elliotwaite  3 years ago +1

      When going backwards through c + d, the same gradient gets passed to both c and d, so multiplying the d gradient in-place by 100 will also affect the c gradient. When going backwards through c + 1 * d, the same gradient is passed to both c and the 1 * d term, and then the gradient passed to the 1 * d term is multiplied by 1 to get the gradient for d, and this multiplication by one ends up creating a new tensor, so now when you update the d gradient in-place it is no longer the same tensor used for the c gradient, so it doesn't affect the c gradient.
      I hope that makes sense. Let me know if that doesn't answer your question.

    • @케이케이-u8y
      @케이케이-u8y 3 years ago +1

      @@elliotwaite Thanks, your answer is really great.
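The shared-gradient-tensor behavior discussed in this thread can be sketched as follows (the tensor values are made up; the effect on c's grad depends on the in-place mutation being visible through the shared tensor, as discussed in the video):

```python
import torch

def d_hook(grad):
    grad *= 100  # in-place: mutates the gradient tensor itself

c = torch.tensor(1.0, requires_grad=True)
d = torch.tensor(1.0, requires_grad=True)
e = c + d  # AddBackward0 passes the SAME gradient tensor to both inputs
d.register_hook(d_hook)
e.backward()
print(d.grad)  # tensor(100.)
print(c.grad)  # also shows 100. in the video's PyTorch version, since the
               # in-place op mutates the tensor shared with c's gradient
```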

  • @PankajGupta-ki9gx
    @PankajGupta-ki9gx 2 years ago

    Can you make a full-fledged series, in the form of a playlist, covering various PyTorch functionalities and built-in classes, since the documentation is quite tough to interpret for custom datasets?
    Thank you

    • @elliotwaite
      @elliotwaite  2 years ago

      Thanks for the suggestion. I haven't been interested in making PyTorch videos lately, but I'll add your recommendation to my list of potential future video ideas.

  • @rachelliu7253
    @rachelliu7253 2 years ago

    Thanks so much

    • @elliotwaite
      @elliotwaite  2 years ago

      You're welcome. Thanks for the comment.

  • @samllanwarne6512
    @samllanwarne6512 3 years ago +1

    At 4:55, when you multiply by 4, shouldn't the AccumulateGrad for a be 8 and the AccumulateGrad for b be 12, the other way around from the video?

    • @elliotwaite
      @elliotwaite  3 years ago +1

      I think the gradients are correct in this case, but it is a bit counterintuitive, and this has mixed me up before. The counterintuitive part is that when you backprop through a multiplication, the gradients actually get multiplied by the flip of the input values. For example, when you backprop through A * B, the gradient for A is B * the incoming gradient, and the gradient for B is A * the incoming gradient. This is because for each little increase in A, it will increase the output by that little increase times B, and for each little increase in B, it will increase the output by that little increase times A.

    • @samllanwarne6512
      @samllanwarne6512 3 years ago +1

      @@elliotwaite Thank you Elliot!
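The "flip" described in the thread above can be verified with a small example (the values a = 2 and b = 3 are made up for illustration):

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = (a * b) * 4
c.backward()
# Backprop through a * b swaps the input values:
# dc/da = b * 4 = 12, dc/db = a * 4 = 8.
print(a.grad)  # tensor(12.)
print(b.grad)  # tensor(8.)
```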

  • @peasant12345
    @peasant12345 1 year ago

    I still don't follow why c.grad will be modified to 100 at 14:32. Is there an enforced integrity constraint, something like the grads of the two sides of an add operation must be equal?

    • @elliotwaite
      @elliotwaite  1 year ago

      Yeah, finding the gradient is basically saying, "if I change the input by a little bit, by how much will that change the output." And the add operation (or the sum of any number of inputs) will result in the inputs all having the same gradient with respect to the output, because changing any of the inputs by a little bit (dx) will change the output by that exact same amount (dx). So the gradient for all the inputs of a sum with respect to the output is 1, which means that when backpropagating the gradients through the add (or sum) operation, the code can use the optimization of just passing the same gradient tensor to all the inputs. But this optimization is only safe if none of the backward hook functions apply any in-place operations to this shared tensor, because the result of an in-place operation would be visible to all that are using that shared tensor.

  • @TheMazyProduction
    @TheMazyProduction 4 years ago +2

    What do you use to make these diagrams?

    • @elliotwaite
      @elliotwaite  4 years ago +2

      For this one I used Figma. I learned from this tutorial: ua-cam.com/video/OM-lTzFm9JQ/v-deo.html

    • @TheMazyProduction
      @TheMazyProduction 4 years ago +1

      @@elliotwaite Perfect, I needed something like this to make flowcharts!

  • @jonatan01i
    @jonatan01i 3 years ago

    Besides the inplace operation on the grads,
    is there any difference between
    - using hooks
    and
    - going through the model.parameters()'s grads
    in order to modify them before a call on optimizer.step()?

    • @elliotwaite
      @elliotwaite  3 years ago +1

      As far as I'm aware, there won't be any difference. When you call optimizer.step(), the optimizer will only be concerned with what the current grad values of the parameters are, and it won't matter how those grad values were assigned.

    • @jonatan01i
      @jonatan01i 3 years ago +1

      @@elliotwaite Makes sense, thank you!
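The point in the reply above, that the optimizer only reads whatever is in .grad at step() time, can be sketched like this (the model, learning rate, and clip range are made up for the example):

```python
import torch
from torch import nn

model = nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

model(torch.ones(1, 2)).sum().backward()

# Modify the grads directly before stepping; this has the same effect
# as modifying them in hooks, since step() only reads the final .grad.
with torch.no_grad():
    for p in model.parameters():
        p.grad.clamp_(-0.5, 0.5)
opt.step()
```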

  • @Aditya-ne4lk
    @Aditya-ne4lk 4 years ago +1

    a.grad will have the same shape as a, correct?

    • @elliotwaite
      @elliotwaite  4 years ago +1

      Yep. The gradient tensor will have a gradient value corresponding to each of the values in the A tensor, so it will have the same shape as A.
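A quick check of the shape-matching described above (the tensor shape is made up for the example):

```python
import torch

a = torch.randn(3, 4, requires_grad=True)
(a * 2).sum().backward()
# One gradient value per element of a, so the shapes match.
print(a.grad.shape)  # torch.Size([3, 4])
```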

  • @AmitYadav-zs4ft
    @AmitYadav-zs4ft 3 years ago

    Hi, what are you using to display those CPU/memory specs in your upper right corner? In Linux we have System Monitor, but I am looking for an alternative on Mac. Thanks.

    • @elliotwaite
      @elliotwaite  3 years ago +1

      iStat Menus is the one I use: bjango.com/mac/istatmenus/

  • @jizhang2407
    @jizhang2407 3 years ago

    @14:21, I don't understand why the in-place operation inside d_hook also changes the gradient passed to MulBackward0. Isn't the gradient of e, i.e. 1.0, passed to both c and d as two "1.0"s, i.e. the same value but as two independent tensors? Can anybody enlighten me? Thanks.

    • @elliotwaite
      @elliotwaite  2 years ago

      Since the AddBackward0 node doesn't change the gradient, it saves memory by not duplicating the data, and just passes along the same "1.0" tensor object to both c and d (or in other words, it passes along pointers that point to the same underlying data). This is why, when that data is changed in d_hook by an in-place operation, it also affects the data that is seen in the c node.
      I hope that helps clarify.
      P.S. - Sorry for the late reply. I'm not sure how I missed your comment earlier. Thanks for the question.

  • @michpo1445
    @michpo1445 1 year ago

    What is the program you're using to graphically design the PyTorch code?

    • @elliotwaite
      @elliotwaite  1 year ago

      I make the designs with Figma.

    • @michpo1445
      @michpo1445 1 year ago

      @@elliotwaite Thanks, but to clarify: does this tool create the PyTorch code for you, or do you just use it to graphically represent what you are coding?

    • @elliotwaite
      @elliotwaite  1 year ago +1

      @@michpo1445 it wasn't auto-generated, I just designed the slides by hand to match the info I was seeing in the Python debugger.

  • @nezgi8220
    @nezgi8220 3 years ago

    What about stacking a and b into another tensor? How are grads calculated if many of these grad-requiring mini tensors are stacked into a big tensor?

    • @elliotwaite
      @elliotwaite  3 years ago

      In the backward pass, the gradient will get distributed to each of the tensors that were stacked together, only passing along to each tensor the part of the gradient that corresponds with that tensor.

    • @nezgi8220
      @nezgi8220 3 years ago

      @@elliotwaite Indeed it is, I tested it empirically. What a miracle!
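The gradient distribution for stacked tensors described above can be tested empirically like this (the tensor values are made up for the example):

```python
import torch

a = torch.tensor([1.0, 2.0], requires_grad=True)
b = torch.tensor([3.0, 4.0], requires_grad=True)
s = torch.stack([a, b])  # shape (2, 2)
w = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
(s * w).sum().backward()
# Each stacked tensor receives only its slice of the gradient.
print(a.grad)  # tensor([1., 2.])
print(b.grad)  # tensor([3., 4.])
```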

  • @anas.2k866
    @anas.2k866 2 years ago

    Thanks. So we can't track the gradient in backward when it is in a module? Is there any other way?

    • @elliotwaite
      @elliotwaite  2 years ago

      After I made this video, PyTorch added a new hook called register_full_backward_hook() that works on modules. It is called whenever the gradient with respect to the inputs is computed. The docs for it are here:
      pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.register_full_backward_hook
      However, if you are asking about tracking the gradients inside the module, and not just the gradients of the inputs, to do that I think you would have to update the actual module code by adding calls to register hooks on the intermediate tensors.

    • @anas.2k866
      @anas.2k866 2 years ago

      @@elliotwaite Ah, thank you. So if I put this hook on layer 5 of my multilayer perceptron and I launch loss.backward(), the grad_input is the gradient of the loss with respect to the weights and bias of layer 5, is it not? And what is the grad_output? Thanks again for your huge effort!!!!

    • @elliotwaite
      @elliotwaite  2 years ago

      @@anas.2k866 hooks are usually used to intercept gradients in places where you wouldn't otherwise have access to them.
      The gradient of the loss with respect to the weights and bias of layer 5 will already be accessible through the `.grad` attribute of the weight/bias tensors of that module.
      Registering a hook on a module would be used for something else. It would be used to access the gradients just before they enter the module in the backward pass and just after they leave the module in the backward pass. The gradients just before entering the module in the backward pass will be the `grad_output` value of the hook function, because those will be the gradients with respect to the output of the module. And after those gradients flow backward through the module, you'll get the `grad_input` values, the gradients of the loss with respect to the inputs of the module. The variable names `grad_input` and `grad_output` are using input/output to refer to input/output of the forward pass, which is why they are the reverse names of what they are when flowing backward through the graph (`grad_output` is the input gradient in the backward pass and `grad_input` is the output gradient in the backward pass).

    • @anas.2k866
      @anas.2k866 2 years ago

      @@elliotwaite Ah, OK. So grad_output is the gradient of the loss with respect to the output of the module, which in my case is the gradient of the loss with respect to the activations of the neurons in layer 5. And grad_input is the gradient of the loss with respect to the activations of the neurons in layer 4?

    • @elliotwaite
      @elliotwaite  2 years ago +1

      @@anas.2k866 Yep
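The grad_input/grad_output naming discussed in this thread can be observed with register_full_backward_hook (the network shape here is made up for the example):

```python
import torch
from torch import nn

net = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 1))
captured = []

def bwd_hook(module, grad_input, grad_output):
    # grad_output: gradient w.r.t. the module's forward-pass output
    # (i.e. what enters the module in the backward pass).
    # grad_input: gradient w.r.t. the module's forward-pass input
    # (i.e. what leaves the module in the backward pass).
    captured.append((grad_output[0].shape, grad_input[0].shape))

net[0].register_full_backward_hook(bwd_hook)
x = torch.ones(1, 2, requires_grad=True)
net(x).sum().backward()
print(captured)  # [(torch.Size([1, 2]), torch.Size([1, 2]))]
```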

  • @liweidai4474
    @liweidai4474 3 years ago

    There is a register_full_backward_hook() method now, which is recommended over register_backward_hook(). You can check it out. One thing that bothers me is that the documentation clearly says that modifying inputs or outputs in-place is not allowed when using backward hooks and will raise an error. But what if I have a predefined model which has some in-place operators? I just want to know the grads w.r.t. the tensors before and after the in-place op. Is there any way to accomplish this other than modifying my code to not use the in-place op? How about the register-hooks-for-tensors method?

    • @elliotwaite
      @elliotwaite  3 years ago +1

      Thanks for letting me know about the new register_full_backward_hook() method. I've added a note about this to the video description.
      And about your question, I don't know the answer to this one.

  • @BlackHermit
    @BlackHermit 3 years ago

    So, still no answer as to why the 0 is there in "MulBackward0"?

    • @elliotwaite
      @elliotwaite  3 years ago +2

      I think I just figured it out. It allows for the same operation to be called in multiple ways (function overloading), and each different overloaded way of calling that operation gets a different index number for its backward version.
      The part of the PyTorch library that generates these backward operation names can be found here (and the comment above the code also describes that this is done to de-duplicate overloaded operation names):
      github.com/pytorch/pytorch/blob/master/tools/autograd/load_derivatives.py#L355
      For example, the `min` operation can be called in multiple ways. In the code below, I call the `min` operation in two of these different ways, and the resulting backward operations associated with different output tensors end up having different index numbers. "torch.min(a)" generates a MinBackward1 operation, and "torch.min(a, dim=0, keepdim=False)" generates a MinBackward0 operation.
      Code example:
      import torch
      a = torch.tensor([2.0, 3.0], requires_grad=True)
      b = torch.min(a)
      (c, c_indices) = torch.min(a, dim=0, keepdim=False)
      print(b)  # Prints: tensor(2., grad_fn=<MinBackward1>)
      print(c)  # Prints: tensor(2., grad_fn=<MinBackward0>)

    • @BlackHermit
      @BlackHermit 3 years ago +1

      @@elliotwaite Oh, interesting. Thanks!

  • @Jimmy-et1bp
    @Jimmy-et1bp 3 years ago

    How do the forward_pre_hook and forward_hook affect the a, b, c gradients?

    • @elliotwaite
      @elliotwaite  3 years ago

      Any operations performed within the forward_pre_hook or forward_hook functions will affect the gradients the same as any computations performed in the module's forward method. It's almost as if you are just inserting the forward_pre_hook function's code into the beginning of the forward method, and inserting the forward_hook function's code into the end of the forward method (forward_hook probably should have been named forward_post_hook, I'm not sure why it wasn't).
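The "inserted into the forward method" picture described above can be sketched with module hooks (the layer size and hook bodies are made up for the example):

```python
import torch
from torch import nn

lin = nn.Linear(2, 2)

def pre_hook(module, args):
    # Runs before forward(); returning a tuple replaces the inputs.
    (x,) = args
    return (x * 2,)

def post_hook(module, args, output):
    # Runs after forward(); returning a value replaces the output.
    return output + 1

lin.register_forward_pre_hook(pre_hook)
lin.register_forward_hook(post_hook)

x = torch.ones(1, 2)
y = lin(x)  # behaves like lin.forward(x * 2) + 1
```

Because the hook operations become part of the computation graph, backprop flows through them just as if they were written inside forward().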

  • @MaximYudayev
    @MaximYudayev 3 years ago

    Hi Elliot. Great videos. Sub’d :) It would be outstanding if we could dive into customization of the quantization workflow! Things like making custom modules compatible with the fusing and quantization workflows, as well as expanding the data type formats. Thank you!

    • @elliotwaite
      @elliotwaite  3 years ago

      Thanks. Glad you liked the videos.
      So far I've only briefly looked into PyTorch's quantization capabilities, but it looks interesting. I'm not sure if I'll ever get around to making a video about it, since I've been more focused on learning JAX these days, but I'll add the idea to my list of potential future YouTube videos. Thanks for the recommendation.

  • @shvprkatta
    @shvprkatta 3 years ago

    Thanks a ton, Elliot!... It would have taken a lot of time to understand these concepts otherwise...

    • @elliotwaite
      @elliotwaite  3 years ago +1

      Thanks! Glad you found it helpful.

  • @randomforrest9251
    @randomforrest9251 3 years ago

    So why is it called MulBackward0?

    • @elliotwaite
      @elliotwaite  3 years ago +2

      I recently figured it out. It allows for the same operation to be called in multiple ways (function overloading), and each different overloaded way of calling that operation gets a different index number for its backward version.
      The part of the PyTorch library that generates these backward operation names can be found here (and the comment above the code also describes that this is done to de-duplicate overloaded operation names):
      github.com/pytorch/pytorch/blob/master/tools/autograd/load_derivatives.py#L565
      For example, the `min` operation can be called in multiple ways. In the code below, I call the `min` operation in two of these different ways, and the resulting backward operations associated with different output tensors end up having different index numbers. "torch.min(a)" generates a MinBackward1 operation, and "torch.min(a, dim=0, keepdim=False)" generates a MinBackward0 operation.
      Code example:
      import torch
      a = torch.tensor([2.0, 3.0], requires_grad=True)
      b = torch.min(a)
      (c, c_indices) = torch.min(a, dim=0, keepdim=False)
      print(b)  # Prints: tensor(2., grad_fn=<MinBackward1>)
      print(c)  # Prints: tensor(2., grad_fn=<MinBackward0>)

    • @randomforrest9251
      @randomforrest9251 3 years ago +1

      @@elliotwaite thank you a lot!