Reparameterization Trick - WHY & BUILDING BLOCKS EXPLAINED!

  • Published Jan 4, 2022
  • This tutorial provides an in-depth explanation of challenges and remedies for gradient estimation in neural networks that include random variables.
    While the final implementation of the method (called the Reparameterization Trick) is quite simple, it is interesting and important to understand how and why the method can be applied in the first place.
    Recommended videos to watch before this one
    Evidence Lower Bound
    • Evidence Lower Bound (...
    3 Big Ideas - Variational AutoEncoder, Latent Variable Model, Amortized Inference
    • Variational Autoencode...
    KL Divergence
    • KL Divergence - CLEARL...
    Links to various papers mentioned in the tutorial
    Auto-Encoding Variational Bayes
    arxiv.org/abs/1312.6114
    Doubly Stochastic Variational Bayes for non-Conjugate Inference
    proceedings.mlr.press/v32/tit...
    Stochastic Backpropagation and Approximate Inference in Deep Generative Models
    arxiv.org/abs/1401.4082
    Gradient Estimation Using Stochastic Computation Graphs
    arxiv.org/abs/1506.05254
    A thread with some insights about the name - "The Law Of The Unconscious Statistician"
    math.stackexchange.com/questi...
    #gradientestimation
    #elbo
    #variationalautoencoder
  • Science & Technology

COMMENTS • 72

  • @anselmud
    @anselmud 2 years ago +22

    I watched your videos on KL, ELBO, and VAE in sequence, and now this one. They helped me a lot to clarify my understanding of Variational Auto-Encoders. Pure gold. Thanks!

    • @KapilSachdeva
      @KapilSachdeva  2 years ago

      🙏 ...glad that you found them helpful!

  • @mikhaeldito
    @mikhaeldito 2 years ago +20

    Glad that someone finally takes the time to decrypt the symbols in the loss function equation!! What a great channel :)

  • @sklkd93
    @sklkd93 1 year ago +5

    Man, these have to be the best ML videos on YouTube. I don't have a degree in Stats and you are absolutely right - the biggest roadblock for understanding is just parsing the notation. The fact that you explain the terms and give concrete examples for them in the context of the neural network is INCREDIBLY helpful.
    I've watched half a dozen videos on VAEs and this is the one that finally got me to a solid mathematical understanding.

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      🙏 I don’t have a degree in stats either 😄

    • @RajanNarasimhan
      @RajanNarasimhan 2 months ago

      @@KapilSachdeva
      what was your path to decoding this? I am curious about where you started and how you ended up here. I am sure that's just as interesting as this video.

  • @ssshukla26
    @ssshukla26 2 years ago +5

    I knew the concept; now I know the maths. Thanks for the videos, sir.

  • @adamsulak8751
    @adamsulak8751 5 months ago +1

    Incredible quality of teaching 👌.

  • @ThePRASANTHof1994
    @ThePRASANTHof1994 1 year ago +1

    I just found treasure! This was the clearest explanation I've come across so far... And now I'm going to binge-watch this channel's videos like I do Netflix shows. :D

  • @television9233
    @television9233 2 years ago +4

    For the quiz at the end:
    From what I understood, the Encoder network (parametrized by phi) predicts some mu and sigma (based on the input X), which then define a normal distribution that the latent variable is sampled from.
    So I think the answer is 2, "predicts", not "learns".
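
A minimal sketch, in PyTorch with made-up layer sizes (not code from the video), of the "predicts" behaviour described in the comment above: the encoder outputs mu and a log-variance for each input x, and z is then drawn via the reparameterization z = mu + sigma * eps with eps ~ N(0, I).

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, x_dim=784, h_dim=256, z_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu_head = nn.Linear(h_dim, z_dim)      # predicted mean of q(z|x)
        self.logvar_head = nn.Linear(h_dim, z_dim)  # predicted log-variance of q(z|x)

    def forward(self, x):
        h = self.hidden(x)
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        eps = torch.randn_like(mu)              # eps ~ N(0, I), parameter-free
        z = mu + torch.exp(0.5 * logvar) * eps  # reparameterized sample of z
        return z, mu, logvar

z, mu, logvar = Encoder()(torch.randn(8, 784))  # dummy batch of 8 inputs
print(z.shape)                                  # torch.Size([8, 20])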

  • @leif-martinsunde1364
    @leif-martinsunde1364 1 year ago +2

    Wonderful video Kapil. Thanks from the University of Oslo.

  • @vslaykovsky
    @vslaykovsky 1 year ago +2

    This explanation is what I was looking for for many days! Thank you!

  • @user-lm7nn2jm3h
    @user-lm7nn2jm3h 8 months ago +1

    I have watched so many ML / deep learning videos from so many creators, and you are the best. I feel like I finally understand what's going on. Thank you so much

  • @mohdaquib9808
    @mohdaquib9808 2 years ago +3

    Thanks a lot sir for your excellent explanation. It made me understand the key idea behind the reparameterization trick.

  • @prachijadhav9098
    @prachijadhav9098 2 years ago +1

    I was looking for this. It’s full of essential information. Convention matters, and you clearly explained the differences in this context.

  • @ayushsaraf8421
    @ayushsaraf8421 8 months ago +1

    This series was so informative and enjoyable. Absolutely love it! Hope to understand diffusion models much better and have some ideas about extensions.

  • @chyldstudios
    @chyldstudios 1 year ago +1

    Enjoyed watching your clear explanation of the re-parameterization trick. Well done!

  • @SY-me5rk
    @SY-me5rk 2 years ago +2

    I hope to also learn your style of delivery from these videos. It's so effective at breaking down the complexity of topics. Looking forward to whatever your next video is.

  • @alirezamogharabi8733
    @alirezamogharabi8733 1 year ago +1

    Really appreciate it; I enjoyed your teaching style and great explanations! Thank you ❤️❤️

  • @inazuma3gou
    @inazuma3gou 1 year ago +1

    Wow~ an amazing tutorial. Thank you!

  • @longfellowrose1013
    @longfellowrose1013 1 year ago +2

    Amazing video on VAE and VI. Could you make a tutorial about variational inference in Latent Dirichlet Allocation? Descriptions and explanations of this part of the work are rather rare.

  • @atharvajoshi4243
    @atharvajoshi4243 9 months ago +1

    Thank you for this series. It has really helped me understand the theoretical basis of the VAE model. I had a couple of questions:
    Q1) At 21:30, is dx = d(epsilon) only because we have a linear location-scale transform, or is that a general property of LOTUS?
    Q2) At 9:00, how are the terms combined to give the joint distribution when the parameters of the distributions are different? We would have the log of the multiplication of the probabilities, but the two thetas are different, right? Sorry if this is a stupid question.

    • @KapilSachdeva
      @KapilSachdeva  9 months ago

      Q1) It has nothing to do with LOTUS; it is just the linear location-scale transform.
      Q2) Theta here represents the parameters of the “joint distribution”. Do not think of it as the log of a multiplication of probabilities; rather, think of it as a distribution of two random variables, with theta representing the parameters of this distribution.
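
A quick numeric check of the Q1 answer above, using numpy and an arbitrary test function (purely illustrative, not the video's derivation): under the linear location-scale transform z = mu + sigma * eps, a Monte Carlo estimate of E[f(z)] taken under N(mu, sigma^2) agrees with the estimate obtained by pushing samples of the parameter-free base distribution N(0, 1) through the transform.

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.7                    # made-up target-distribution parameters
f = lambda z: z ** 2 + np.sin(z)        # arbitrary test function

z = rng.normal(mu, sigma, size=1_000_000)   # sample N(mu, sigma^2) directly
eps = rng.standard_normal(1_000_000)        # sample the base N(0, 1)

print(f(z).mean())                      # direct Monte Carlo estimate of E[f(z)]
print(f(mu + sigma * eps).mean())       # reparameterized estimate -- agrees closely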

  • @somasundaramsankaranarayan4592
    @somasundaramsankaranarayan4592 14 days ago

    At 6:39, the distribution p_\theta(x|z) cannot have mean mu and stddev sigma, as the mean and std dev live in the latent space (the space of z) and x lives in the input space.

  • @ArashSadr
    @ArashSadr 2 years ago +1

    As always, I am stunned by your video! May I ask what software you use to produce such videos?

    • @KapilSachdeva
      @KapilSachdeva  2 years ago +1

      🙏 Thanks, Arash, for the kind words.
      I use PowerPoint primarily; for the very few advanced animations, I use manim (github.com/manimCommunity/manim).

  • @slemanbisharat6390
    @slemanbisharat6390 1 year ago

    Thank you, sir, for the clear explanation. I want to ask about the expression p(xi, z): is this the joint probability, or is it the likelihood under z and theta?

  • @jimmylovesyouall
    @jimmylovesyouall 1 year ago +1

    At 6:35, isn't the output of the decoder the reconstruction of X, not μ and σ?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago +1

      The output of the decoder could be either of the following:
      a) a direct prediction of X (the input vector), or
      b) a prediction of the mu and sigma of the distribution from which X came.
      Note that the mu and sigma, if predicted (by the decoder), will be those of X and not Z.
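
A sketch of option (b), assuming PyTorch and made-up layer sizes (not the video's code): the decoder predicts the mu and log-variance of the distribution X is assumed to come from, and both outputs have the dimensionality of X, not of Z (option (a) would simply return a direct reconstruction of X instead).

import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    def __init__(self, z_dim=20, h_dim=256, x_dim=784):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU())
        self.mu_head = nn.Linear(h_dim, x_dim)      # mean of p(x|z), lives in X space
        self.logvar_head = nn.Linear(h_dim, x_dim)  # log-variance of p(x|z)

    def forward(self, z):
        h = self.hidden(z)
        return self.mu_head(h), self.logvar_head(h)

x_mu, x_logvar = GaussianDecoder()(torch.randn(8, 20))  # dummy batch of latents
print(x_mu.shape, x_logvar.shape)   # both torch.Size([8, 784]), i.e. X-sized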

  • @rubyshrestha5747
    @rubyshrestha5747 2 years ago +1

    Thank you for a detailed explanation. I had one question though. I am not able to understand why we cannot take the derivative with respect to theta inside the integral when the integral is over x (at 20:52). Could you please help me get insight into this?

    • @KapilSachdeva
      @KapilSachdeva  2 years ago +2

      Hello Ruby, thanks for your comment and, more importantly, for paying attention. The reason you are confused here is that I have a typo in this example. The dx in this example should have been dtheta.
      Now that I look back, I am not happy with this simpler example that I tried to use before explaining it for the ELBO. Not only is there a typo, but it can also create confusion. I would suggest ignoring this (so-called simpler) example and seeing it directly for the ELBO. Apologies!

  • @MLDawn
    @MLDawn 8 months ago

    Absolutely brilliant! One issue that I have is that the Leibniz integral rule is concerned with the support of the integral being a function of the variable w.r.t. which we are trying to take the derivative. I don't see how this applies to our case in your video! Isn't the support here just constant lower and upper bound values for the Phi parameter? In other words, am I wrong in saying that the support is NOT a function of Phi, and thus we should be able to move the derivative inside the integral? I would appreciate your feedback on this. Thanks

    • @KapilSachdeva
      @KapilSachdeva  8 months ago

      This is where the notation creates confusion. You should think of phi as a function (a neural network in this case) that you are learning/discovering.

  • @blasttrash
    @blasttrash 1 year ago +1

    19:09 Since the base distribution is free of our parameters, when we backprop and differentiate, we don't have to differentiate through the unit normal distribution? Is this correct?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago +1

      Correct. Now this should also make you ask whether this assumption of the prior being standard normal is a good one.
      There are variants of the variational autoencoder in which you can also learn/estimate the parameters of the prior distribution.
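
A small check of the point above, using a few assumed PyTorch tensors rather than anything from the video: the sample from the standard normal base distribution carries no learnable parameters, so backpropagation produces gradients only for mu and sigma.

import torch

mu = torch.tensor([0.3], requires_grad=True)           # learnable location
log_sigma = torch.tensor([-0.5], requires_grad=True)   # learnable (log) scale

eps = torch.randn(1)                   # base sample: no parameters, no grad
z = mu + torch.exp(log_sigma) * eps    # reparameterized sample
loss = (z ** 2).sum()                  # any downstream scalar loss
loss.backward()

print(eps.requires_grad)               # False: nothing to differentiate here
print(mu.grad, log_sigma.grad)         # gradients flow only through mu and sigma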

  • @midhununni951
    @midhununni951 7 months ago +1

    Incredibly clear, and thank you so much for these videos. Looking forward to more...

  • @spandanbasu5653
    @spandanbasu5653 1 year ago

    I have a question. From the change of variable concept, we assign z to be a deterministic function of a sample from the base distribution and the parameters of the target distribution. But when we apply this in the case of the ELBO, we assign z to be a deterministic function of Phi, x, and epsilon, where Phi is the parameters of the encoder network, not the parameters of the target distribution p(z|x). Would this not create an inconsistency in the application?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      The ELBO (the loss function) is used during the “training” of the neural network. During training you are learning the parameters of the encoder (and decoder) networks. Once the networks are trained, q(z|x) will be an approximation of p(z|x).

  • @RAP4EVERMRC96
    @RAP4EVERMRC96 1 year ago +1

    8:48 How can the terms be combined if one follows the conventional syntax (sigma denoting the parameters of the density function) and the other the non-conventional syntax (sigma denoting the parameters of the decoder, leading to estimates of the parameters of the density function)? In essence, the sigmas they are referencing are not the same.

    • @KapilSachdeva
      @KapilSachdeva  1 year ago +1

      Assuming that when you mentioned “sigma” you meant “theta”:
      This is yet another example of abuse of notation, and hence your confusion is normal. Even though I say that theta is the parameters of the decoder network, in this situation think of the network as having predicted mu and sigma (watch the VAE tutorial); in the symbolic expression, when combining the two terms, we are considering theta to be the set of mu and sigma.

    • @RAP4EVERMRC96
      @RAP4EVERMRC96 1 year ago +1

      @@KapilSachdeva Thanks for clearing that up, and yes, I meant theta. I always mix them up.

    • @KapilSachdeva
      @KapilSachdeva  1 year ago +1

      😊

  • @vslaykovsky
    @vslaykovsky 1 year ago +1

    3:00 shouldn't it be the "negative reconstruction error" instead?

    • @KapilSachdeva
      @KapilSachdeva  1 year ago

      Since in optimization we minimize, we minimize the negative ELBO, which results in the negative reconstruction error.
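
A hedged sketch of what minimizing the negative ELBO can look like in code, assuming a Gaussian q(z|x), a standard normal prior, and a binary cross-entropy reconstruction term (a common choice, not necessarily the video's): the loss is the reconstruction error plus the KL divergence, so minimizing it maximizes the ELBO, whose first term is the negative reconstruction error mentioned in the reply above.

import torch
import torch.nn.functional as F

def neg_elbo(x_recon, x, mu, logvar):
    # reconstruction error: negative log-likelihood of x under the decoder output
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl      # minimizing this maximizes the ELBO

x = torch.rand(8, 784)         # dummy data in [0, 1]
x_recon = torch.rand(8, 784)   # dummy reconstruction (would come from the decoder)
mu, logvar = torch.zeros(8, 20), torch.zeros(8, 20)
print(neg_elbo(x_recon, x, mu, logvar))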

  • @anupgupta3644
    @anupgupta3644 1 year ago +1

    It predicts the parameters of the latent variable.

  • @omidmahjobian3377
    @omidmahjobian3377 2 years ago +3

    GEM

  • @medomed1105
    @medomed1105 2 years ago

    Is there a difference between VAE and GAN?

    • @KapilSachdeva
      @KapilSachdeva  2 years ago +1

      They are two different architectures with some goals that are shared.
      VAEs were primarily designed to do efficient latent variable model inference (see the previous tutorial for more details to understand this line), but they can be used as generative models.
      GAN is a generative architecture whose training regime (loss function, setup, etc.) is very different from that of the VAE. For a long time GANs produced much better images, but now VAEs have also caught up in the quality of generated images.
      Both architectures are somewhat difficult to train. VAEs are relatively easier to train, though.
      Hope this sheds some light.

    • @medomed1105
      @medomed1105 2 years ago +1

      @@KapilSachdeva Thank you very much.
      If there is a possibility to make a tutorial about GANs, it would be very much appreciated.
      Thanks again

    • @KapilSachdeva
      @KapilSachdeva  2 years ago

      🙏

  • @Daydream_Dynamo
    @Daydream_Dynamo 4 days ago

    It learns the parameters, right?