Batch Normalization - EXPLAINED!

  • Published Jun 4, 2024
  • What is Batch Normalization? Why is it important in Neural networks? We get into math details too. Code in references.
    Follow me on M E D I U M: towardsdatascience.com/likeli...
    REFERENCES
    [1] 2015 paper that introduced Batch Normalization: arxiv.org/abs/1502.03167
    [2] The paper that claims Batch Norm does NOT reduce internal covariate shift as claimed in [1]: arxiv.org/abs/1805.11604
    [3] Using BN + Dropout: arxiv.org/abs/1905.05928
    [4] Andrew Ng on why normalization speeds up training: www.coursera.org/lecture/deep...
    [5] Ian Goodfellow on how Batch Normalization helps regularization: www.quora.com/Is-there-a-theo...
    [6] Code Batch Normalization from scratch: kratzert.github.io/2016/02/12...

COMMENTS • 127

  • @ssshukla26
    @ssshukla26 4 years ago +48

    Shouldn't gamma approximate the true variance of the neuron activation, and beta approximate the true mean of the neuron activation? I am just confused...

    • @CodeEmporium
      @CodeEmporium  4 years ago +25

      You're right. Misspoke there. Nice catch!

    • @ssshukla26
      @ssshukla26 4 years ago

      @@CodeEmporium Cool

    • @dhananjaysonawane1996
      @dhananjaysonawane1996 3 years ago +1

      How is this approximation happening?
      And how do we use beta, gamma at test time? We have only one example at a time during testing.

    • @FMAdestroyer
      @FMAdestroyer 2 years ago +1

      @@dhananjaysonawane1996 In most frameworks, when you create a BN layer, the shift and scale (beta and gamma) are both learnable parameters, usually represented as the weights and bias of the layer. You can deduce that from the PyTorch BatchNorm2d layer's description below:
      "The mean and standard-deviation are calculated per-dimension over the mini-batches and γ and β are learnable parameter vectors of size C (where C is the input size)."

    • @AndyLee-xq8wq
      @AndyLee-xq8wq 1 year ago

      Thanks for clarification!

  • @efaustmann
    @efaustmann 4 years ago +22

    Exactly what I was looking for. Very well researched and explained in a simple way with visualizations. Thank you very much!

  • @jodumagpi
    @jodumagpi 4 years ago

    This is good! I think that giving an example as well as the use cases (advantages) before diving into the details always gets the job done

  • @sumanthbalaji1768
    @sumanthbalaji1768 4 years ago +8

    Just found your channel and binged through all your videos, so here's a general review. As a student, I assure you your content is on point and goes in depth, unlike other channels that just skim the surface. Keep it up and don't be afraid to go more in depth on concepts. We love it. Keep it up brother, you have earned a supporter till your channel's end

    • @CodeEmporium
      @CodeEmporium  4 years ago +2

      Thanks ma guy. I'll keep pushing up content. Good to know my audience loves the details ;)

    • @sumanthbalaji1768
      @sumanthbalaji1768 4 years ago

      @@CodeEmporium Damn, did not actually expect you to reply lol. Maybe let me throw a topic suggestion then: more NLP please, take a look at summarisation tasks as a topic. Would be damn interesting.

  • @maxb5560
    @maxb5560 4 years ago +1

    Love your videos. They help me a lot in understanding machine learning more and more

  • @yeripark1135
    @yeripark1135 2 years ago

    I clearly understand the need for batch normalization and its advantages! Thanks!!

  • @ultrasgreen1349
    @ultrasgreen1349 1 year ago

    That's actually a very, very good and intuitive video. Honestly, thank you.

  • @balthiertsk8596
    @balthiertsk8596 2 years ago

    Hey man, thank you.
    I really appreciate this quality content!

  • @parthshastri2451
    @parthshastri2451 3 years ago +9

    Why did you plot the cost against height and age? Isn't it supposed to be a function of the weights in a neural network?

  • @EB3103
    @EB3103 2 years ago +2

    The loss is not a function of the features but a function of the weights

  • @Slisus
    @Slisus 2 years ago

    Awesome video. I really like how you go into the actual papers behind it.

  • @ahmedshehata9522
    @ahmedshehata9522 2 years ago

    You are really good because you reference the papers and introduce the idea

  • @user-nx8ux5ls7q
    @user-nx8ux5ls7q 2 years ago

    Do we calculate the mean and SD across a mini-batch for a given neuron, or across all the neurons in a layer? Andrew Ng says it's across each layer. Thanks.

  • @luisfraga3281
    @luisfraga3281 3 years ago

    Hello, I wonder what happens if we don't normalize the image input data (RGB 0-255) and then use batch normalization? Is it going to work smoothly, or is it going to mess up the learning?

  • @dragonman101
    @dragonman101 3 years ago +1

    Quick note: at 6:50 there should be brackets after the 1/3 (see below)
    Yours: 1/3 (4 - 5.33)^2 + (5 - 5.33)^2 + (7 - 5.33)^2
    Should be: 1/3 [(4 - 5.33)^2 + (5 - 5.33)^2 + (7 - 5.33)^2]
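
    A tiny Python check of the corrected expression, using the same three values (the 1/3 has to multiply the whole sum of squared deviations, not just the first term):

        xs = [4, 5, 7]
        mean = sum(xs) / len(xs)                            # 5.33...
        var = sum((x - mean) ** 2 for x in xs) / len(xs)    # 1/3 * [(4-5.33)^2 + (5-5.33)^2 + (7-5.33)^2]
        print(round(mean, 2), round(var, 2))                # 5.33 1.56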

  • @angusbarr7952
    @angusbarr7952 4 years ago +16

    Hey! Just cited you in my undergrad project because your example finally made me understand batch norm. Thanks a lot!

    • @CodeEmporium
      @CodeEmporium  4 years ago +4

      Sweet! Glad it was helpful homie

  • @taghyeertaghyeer5974
    @taghyeertaghyeer5974 1 year ago +3

    Hello, thank you for your video.
    I am wondering about batch normalisation speeding up training: you showed at 2:42 the contour plot of the loss as a function of height and age. However, the loss function contours should be plotted against the weights (the optimization is performed in weight space, not input space). In other words, why did you base your argument on a loss function with height and age as the variables (they should be held constant during optimization)?
    Thank you! Lana

    • @marcinstrzesak346
      @marcinstrzesak346 8 months ago

      For me, it also seemed quite confusing. I'm glad someone else noticed it too.

    • @atuldivekar
      @atuldivekar 4 months ago

      The contour plot is being shown as a function of height and age to show the dependence of the loss on the input distribution, not the weights

  • @hervebenganga8561
    @hervebenganga8561 1 year ago

    This is beautiful. Thank you

  • @sriharihumbarwadi5981
    @sriharihumbarwadi5981 4 years ago +1

    Can you please make a video on how batch normalization and l1/l2 regularization interact with each other ?

  • @MaralSheikhzadeh
    @MaralSheikhzadeh 2 years ago

    Thanks, this video helped me understand BN better. And I liked your sense of humor; it made watching more fun. :)

  • @pranavjangir8338
    @pranavjangir8338 3 years ago +1

    Isn't Batch Normalization also used to counter the exploding gradient problem? Would have loved some explanation on that too.

  • @seyyedpooyahekmatiathar624
    @seyyedpooyahekmatiathar624 4 years ago +2

    Subtracting the mean and dividing by std is standardization. Normalization is when you change the range of the dataset to be [0,1].
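
    A quick NumPy illustration of that terminology, with made-up numbers (the BN transform uses the first of the two):

        import numpy as np

        x = np.array([4.0, 5.0, 7.0])

        standardized = (x - x.mean()) / x.std()           # zero mean, unit variance ("standardization")
        rescaled = (x - x.min()) / (x.max() - x.min())    # squashed into [0, 1] ("min-max normalization")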

  • @chandnimaria9748
    @chandnimaria9748 8 months ago

    Just what I was looking for, thanks.

  • @abhishekp4818
    @abhishekp4818 4 years ago

    @CodeEmporium, could you please tell me why we need to normalize the outputs of an activation function when they are already within a small range (for example, sigmoid ranges from 0 to 1)?
    And if we do normalize them, then how do we compute the updates of its parameters during backpropagation?
    Please answer.

    • @boke6184
      @boke6184 4 years ago

      The activation function should be modifying the predictability of error or learning too

  • @ayandogra2952
    @ayandogra2952 3 years ago

    Amazing work
    really liked it

  • @SaifMohamed-de8uo
    @SaifMohamed-de8uo 6 days ago

    Great explanation thank you!

  • @mizzonimirko
    @mizzonimirko 1 year ago

    I do not understand properly how this is going to be implemented. At the end of an epoch, we actually perform those operations, right? At that point, the layer where I have applied it is normalized, right?

  • @iliasaarab7922
    @iliasaarab7922 3 years ago

    Great explanation, thanks!

  • @ryanchen6147
    @ryanchen6147 1 year ago +2

    At 3:27, I think your axes should be the *weight* for the height feature and the *weight* for the age feature if that is a contour plot of the cost function.

    • @mohameddjilani4109
      @mohameddjilani4109 1 year ago +1

      Yes, that was an error that persists across a long stretch of the video.

  • @oheldad
    @oheldad 4 years ago +6

    Hey there. I'm on my way to becoming a data scientist, and your videos help me a lot! Keep going, I'm sure I am not the only one you inspired :) thank you!!

    • @CodeEmporium
      @CodeEmporium  4 years ago +1

      Awesome! Glad these videos help! Good luck with your Data science ventures :)

    • @ccuuttww
      @ccuuttww 4 years ago +2

      Your aim should not be to become a data scientist to fit other people's expectations; you should become a person who can deal with data and estimate any unknown parameter to your own standard.

    • @oheldad
      @oheldad 4 years ago

      @@ccuuttww Don't know why you decided that I'm fulfilling others' expectations of me - it's not true. I'm in the last semester of my electrical engineering degree, and decided to change path a little :)

    • @ccuuttww
      @ccuuttww 4 years ago

      Because most people think in the following pattern: finish all the exam semesters and graduate with good marks, mass-send CVs, and try to get a job titled "Data Scientist",
      then try to fit what they learned at university to their jobs like a trained monkey. However, you are not dealing with a real-world situation; you are just trying to deal with your customer or your boss. Since this topic never has a standard answer, you can only define it yourself, and your client only trusts your title.
      I feel this is really bad.

  • @ccuuttww
    @ccuuttww 4 years ago +1

    I wonder, is it suitable to use a population estimator?
    I think nowadays most machine learning learners/students/fans
    spend very little time on statistics. After several years of study, I find that model selection and statistical theory are the most important parts,
    especially Bayesian learning, the most underrated topic today.

  • @shaz-z506
    @shaz-z506 4 years ago

    Good video, could you please make a video on capsule networks?

  • @JapiSandhu
    @JapiSandhu 2 years ago

    Can I add a Batch Normalization layer after an LSTM layer in PyTorch?

  • @SillyMakesVids
    @SillyMakesVids 4 years ago

    Sorry, but where did gamma and beta come from, and how are they used?

  • @mohammadkaramisheykhlan9
    @mohammadkaramisheykhlan9 2 years ago

    How can we use batch normalization on the test set?

  • @user-nx8ux5ls7q
    @user-nx8ux5ls7q 2 years ago

    Also, can someone say how to make gamma and beta learnable? Gamma can be thought of as an additional weight attached to the activation, but how about beta? How do you train that?

  • @user-wf2fq2vn5m
    @user-wf2fq2vn5m 3 years ago

    Awesome explanation.

  • @lamnguyentrong275
    @lamnguyentrong275 4 years ago +3

    Wow, easy to understand, and a clear accent. Thank you, sir. You've done a great job.

  • @thoughte2432
    @thoughte2432 3 years ago +4

    I found this a really good and intuitive explanation, thanks for that. But there was one thing that confused me: isn't the effect of batch normalization the smoothing of the loss function? I found it difficult to associate the loss function directly to the graph shown at 2:50.

    • @Paivren
      @Paivren 10 months ago

      yes, the graph is a bit weird in the sense that the loss function is not a function of the features but of the model parameters.

  • @God-vl5uz
    @God-vl5uz 14 days ago

    Thank you!

  • @danieldeychakiwsky1928
    @danieldeychakiwsky1928 3 years ago +7

    Thanks for the video. I wanted to add that there's debate in the community over whether to normalize pre vs. post non-linearity within the layers, i.e., for a given neuron in some layer, do you normalize the result of the linear function that gets piped through non-linearity or do you pipe the linear combination through non-linearity and then apply normalization, in both cases, over the mini-batch.

    • @kennethleung4487
      @kennethleung4487 3 years ago +3

      Here's what I found from MachineLearningMastery:
      o Batch normalization may be used on inputs to the layer before or after the activation function in the previous layer
      o It may be more appropriate after the activation function for S-shaped functions like the hyperbolic tangent and logistic function
      o It may be appropriate before the activation function for activations that may result in non-Gaussian distributions like the rectified linear activation function, the modern default for most network types
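
      A small PyTorch sketch of the two placements being compared (a sketch only; layer sizes are made up, and which variant works better is architecture-dependent, as noted above):

          import torch.nn as nn

          # Variant A: linear -> BN -> non-linearity (pre-activation, as in the original paper)
          pre_act = nn.Sequential(
              nn.Linear(64, 32, bias=False),   # the bias is redundant when BN follows
              nn.BatchNorm1d(32),
              nn.ReLU(),
          )

          # Variant B: linear -> non-linearity -> BN (post-activation)
          post_act = nn.Sequential(
              nn.Linear(64, 32),
              nn.ReLU(),
              nn.BatchNorm1d(32),
          )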

  • @manthanladva6547
    @manthanladva6547 4 years ago

    Thanks for the awesome video
    Got many ideas about Batch Norm

  • @uniquetobin4real
    @uniquetobin4real 4 years ago

    The best I have seen so far

  • @strateeg32
    @strateeg32 2 years ago

    Awesome thank you!

  • @anishjain8096
    @anishjain8096 4 years ago

    Hey brother, can you please tell me how on-the-fly data augmentation increases the image dataset? Everywhere on blogs and videos they say it increases the data size, but how?

    • @CodeEmporium
      @CodeEmporium  4 years ago

      For images, you would need to make minor distortions (rotation, crop, scale, blur) in an image such that the result is a realistic input. This way, you have more training data for your model to generalize
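
      A hedged torchvision sketch of what "on the fly" means here: the files on disk stay the same, but every time an image is loaded the pipeline applies a different random distortion, so the model sees a fresh variation each epoch (the transform choices below are just examples):

          from torchvision import transforms

          augment = transforms.Compose([
              transforms.RandomResizedCrop(224),        # random crop + rescale
              transforms.RandomHorizontalFlip(),        # random mirror
              transforms.RandomRotation(degrees=10),    # small random rotation
              transforms.ToTensor(),
          ])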

  • @aminmw5258
    @aminmw5258 1 year ago

    Thank you bro.

  • @superghettoindian01
    @superghettoindian01 1 year ago

    I see you are checking all these comments - so I will try to comment on all the videos I watch going forward and how I'm using them.
    Currently using this video as a supplement to Andrej Karpathy's makemore series, pt. 3.
    The other video has a more detailed implementation of batch normalization but you do a great job of summarizing the key concepts. I hope one day you and Andrej can create a video together 😊.

    • @CodeEmporium
      @CodeEmporium  1 year ago +1

      Thanks a ton for the comment. Honestly, any critical feedback is appreciated. So thank you. It would certainly be a privilege to collaborate with Andrej. Maybe in the future :)

  • @hemaswaroop7970
    @hemaswaroop7970 4 years ago

    Thanks, Man!

  • @akremgomri9085
    @akremgomri9085 17 days ago

    Very good explanation. However, there is something I didn't understand. Doesn't batch normalisation modify the input data so that m=0 and v=1, as explained in the beginning?? So how the heck did we move from normalisation being applied to inputs, to normalisation affecting the activation function? 😅😅

  • @sanjaykrish8719
    @sanjaykrish8719 4 years ago

    Fantastic explanation using contour plots.

    • @CodeEmporium
      @CodeEmporium  4 years ago +1

      Thanks! Contour plots are the best!

  • @Seto-fs4sj
    @Seto-fs4sj 4 years ago

    What about layer norm?

  • @SunnySingh-tp6nt
    @SunnySingh-tp6nt 1 month ago

    can I get these slides?

  • @enveraaa8414
    @enveraaa8414 3 years ago

    Bro you have made the perfect video

  • @nobelyhacker
    @nobelyhacker 2 years ago

    Nice video, but I guess there is a little error at 6:57? I guess you have to multiply the whole thing by 1/3, not only the first term.

  • @erich_l4644
    @erich_l4644 4 years ago +1

    This was so well put together - why less than 10k views? Oh... it's batch normalization

  • @rockzzstartzz2339
    @rockzzstartzz2339 4 years ago

    Why use beta and gamma?

  • @PavanTripathi-rj7bd
    @PavanTripathi-rj7bd 1 year ago

    great explanation

    • @CodeEmporium
      @CodeEmporium  1 year ago

      Thank you! Enjoy your stay on the channel :)

  • @samratkorupolu
    @samratkorupolu 3 years ago

    Wow, you explained it pretty clearly

  • @priyankakaswan7528
    @priyankakaswan7528 3 years ago

    The real magic starts at 6:07; this video was exactly what I needed.

  • @kriz1718
    @kriz1718 4 years ago

    Very helpful!!

  • @JapiSandhu
    @JapiSandhu 2 years ago

    this is a great video

  • @gyanendradas
    @gyanendradas 4 years ago

    Can you make a video on all types of pooling layers?

    • @CodeEmporium
      @CodeEmporium  4 years ago +1

      Interesting. I'll look into this. Thanks for the idea

  • @sultanatasnimjahan5114
    @sultanatasnimjahan5114 6 months ago

    thanks

  • @pranaysingh3950
    @pranaysingh3950 2 years ago

    Thanks!

  • @ajayvishwakarma6943
    @ajayvishwakarma6943 4 years ago

    Thanks buddy

  • @abheerchrome
    @abheerchrome 3 years ago

    Great video bro, keep it up.

  • @themightyquinn100
    @themightyquinn100 1 year ago

    Wasn't there an episode where Peter was playing against Larry Bird?

  • @aaronk839
    @aaronk839 4 years ago +26

    Good explanation until 7:17 after which, I think, you miss the point which makes the whole thing very confusing. You say: "Gamma should approximate to the true mean of the neuron activation and beta should approximate to the true variance of the neuron activation." Apart from the fact that this should be the other way around, as you acknowledge in the comments, you don't say what you mean by "true mean" and "true variance".
    I learned from Andrew Ng's video (ua-cam.com/video/tNIpEZLv_eg/v-deo.html) that the actual reason for introducing two learnable parameters is that you actually don't necessarily want all batch data to be normalized to mean 0 and variance 1. Instead, shifting and scaling all normalized data at one neuron to obtain a different mean (beta) and variance (gamma) might be advantageous in order to exploit the non-linearity of your activation functions.
    Please don't skip over important parts like this one with sloppy explanations in future videos. This gives people the impression that they understand what's going on, when they actually don't.
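
    A small numerical sketch of that point, assuming the per-activation transform y = gamma * x_hat + beta from the paper: if the network happens to learn gamma equal to the batch standard deviation and beta equal to the batch mean, the normalization is undone entirely, so the layer is free to keep whatever mean and variance suit its activation function:

        import torch

        x = torch.randn(256) * 3.0 + 7.0                       # pre-BN activations (made-up distribution)
        x_hat = (x - x.mean()) / torch.sqrt(x.var(unbiased=False) + 1e-5)

        gamma = torch.sqrt(x.var(unbiased=False))              # learned scale equal to the batch std
        beta = x.mean()                                        # learned shift equal to the batch mean
        y = gamma * x_hat + beta

        print(torch.allclose(y, x, atol=1e-3))                 # True: the identity mapping is recoverable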

    • @dragonman101
      @dragonman101 3 years ago +3

      Thank you very much for this explanation. The link and the correction are very helpful and do provide some clarity to a question I had.
      That being said, I don't think it's fair to call his explanation sloppy. He broke down complicated material in a fantastic and clear way for the most part. He even linked to research so we could do further reading, which is great because now I have a solid foundation to understand what I read in the papers. He should be encouraged to fix his few mistakes rather than slapped on the wrist.

    • @sachinkun21
      @sachinkun21 2 years ago

      Thanks a ton!! I was actually looking for this comment, as I had the same question as to why we even need to approximate!

  • @PierreH1968
    @PierreH1968 3 years ago

    Great explanation, very helpful!

  • @elyasmoshirpanahi7184
    @elyasmoshirpanahi7184 1 year ago

    Nice content

  • @its_azmii
    @its_azmii 4 years ago

    Hey, can you link the graph that you used, please?

  • @novinnouri764
    @novinnouri764 2 years ago

    Thanks

  • @GauravSharma-ui4yd
    @GauravSharma-ui4yd 4 years ago

    Awesome, keep going like this

    • @CodeEmporium
      @CodeEmporium  4 years ago +1

      Thanks for watching every video Gaurav :)

  • @ai__76
    @ai__76 2 years ago

    Nice animations

  • @sevfx
    @sevfx 1 year ago

    Great explanation, but missing parentheses at 6:52 :p

  • @akhileshpandey123
    @akhileshpandey123 2 years ago

    Nice explanation :+1

  • @adosar7261
    @adosar7261 1 year ago

    And why not just normalize the whole training set instead of using batch normalization?

    • @CodeEmporium
      @CodeEmporium  1 year ago

      Batch normalization will normalize through different steps of the network. If we want to “normalize the whole training set”, we need to pass all training examples at once to the network as a single batch. This is what we see in “batch gradient descent”, but isn’t super common for large datasets because of memory constraints.
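
      A tiny illustration of the difference, with made-up data: one fixed set of statistics for the whole training set versus the per-mini-batch statistics that batch norm actually uses at each training step:

          import torch

          data = torch.randn(1000, 1) * 2.0 + 5.0    # the whole (hypothetical) training set
          global_mean = data.mean()                  # one fixed statistic for the full set

          for batch in data.split(32):               # what BN sees during training
              batch_mean = batch.mean()              # recomputed for every mini-batch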

  • @Acampandoconfrikis
    @Acampandoconfrikis 3 years ago

    Hey 🅱eter, did you make it to the NBA?

  • @boke6184
    @boke6184 4 years ago

    This is good for ghost box

  • @99dynasty
    @99dynasty 1 year ago

    BatchNorm reparametrizes the underlying optimization problem to make it more stable (in the sense of loss Lipschitzness) and smooth (in the sense of “effective” β-smoothness of the loss).
    Not my words

  • @lazarus8011
    @lazarus8011 3 days ago

    Good video
    here's a comment for the algorithm

  • @irodionzaytsev
    @irodionzaytsev 2 years ago

    The only difficult part of batch norm, namely the backprop, isn't explained.

  • @nyri0
    @nyri0 2 years ago

    Your visualizations are misleading. Normalization doesn't turn the shape on the left into the circle seen on the right. It will be less elongated but still keep a diagonal ellipse shape.

  • @xuantungnguyen9719
    @xuantungnguyen9719 3 years ago

    good visualization

  • @sealivezentrum
    @sealivezentrum 3 years ago +1

    fuck me, you explained way better than my prof did

  • @SAINIVEDH
    @SAINIVEDH 3 years ago

    For RNNs, Batch Normalisation should be avoided; use Layer Normalisation instead.
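
    A minimal PyTorch sketch of that suggestion, applying LayerNorm to the LSTM outputs so statistics are computed over the feature dimension rather than over the batch (sizes are assumptions):

        import torch
        import torch.nn as nn

        lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
        ln = nn.LayerNorm(32)                  # normalizes each time step over its 32 features

        x = torch.randn(4, 10, 16)             # (batch, time, features)
        out, _ = lstm(x)                       # (4, 10, 32)
        out = ln(out)                          # stats per example and per time step; no dependence on the batch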

  • @eniolaajiboye4399
    @eniolaajiboye4399 2 years ago

    🤯

  • @alexdalton4535
    @alexdalton4535 3 years ago

    Why didn't Peter make it..

  • @roeeorland
    @roeeorland 1 year ago

    Peter is most definitely not 1.9m
    That’s 6’3

  • @rodi4850
    @rodi4850 4 years ago +2

    Sorry to say, but a very poor video. The intro was way too long, and explaining the math and why BN works was left to 1-2 minutes.

    • @CodeEmporium
      @CodeEmporium  4 years ago +5

      Thanks for watching till the end. I tried going for a layered approach to the explanation - get the big picture. Then the applications. Then details. I wasn't sure how much more math was necessary. This was the main math in the paper, so I thought that was adequate. Always open to suggestions if you have any. If you've looked at my recent videos, you can tell the delivery is not consistent. Trying to see what works

    • @PhilbertLin
      @PhilbertLin 4 years ago

      I think the intro with the samples in the first few minutes was a little drawn out but the majority of the video spent on intuition and visuals without math was nice. Didn’t go through the paper so can’t comment on how much more math detail is needed.

  • @ahmedelsabagh6990
    @ahmedelsabagh6990 3 years ago

    55555 you get it :) HaHa