Evidence Lower Bound (ELBO) - CLEARLY EXPLAINED!

  • Published 27 Nov 2024
  • This tutorial explains what ELBO is and shows its derivation step by step.
    #variationalinference
    #kldivergence
    #bayesianstatistics
  • Science & Technology

COMMENTS • 123

  • @AndreiMargeloiu (3 years ago, +40)

    Crystal clear explanation, the world needs more people like you!

  • @TheProblembaer2 (9 months ago, +4)

    Again, thank you. This is incredibly well explained; the small steps and the explanations behind them, pure gold.

  • @sonny1552 (1 year ago, +2)

    Best explanation ever! I found this video for my understanding of VAE at first, but I recently found that this is also directly related to diffusion models. Thanks for making this video.

  • @genericperson8238 (2 years ago, +9)

    Absolutely beautiful. The explanation is so insanely well thought out and clear.

  • @T_rex-te3us (1 year ago, +1)

    Insane explanation Mr. Sachdeva! Thank you so much - I wish you all the best in life

  • @9speedbird (1 year ago, +1)

    That was great, been going through paper after paper, all I needed was this! Thanks!

  • @danmathewsrobin5991 (3 years ago, +4)

    Fantastic tutorial!! Hoping to see more similar content. Thank you

  • @thatipelli1 (3 years ago, +4)

    Thanks, your tutorial cleared my doubts!!

  • @ajwadakil6892 (1 year ago, +1)

    Great explanation. Could you tell me which books/articles I may refer to for further and deeper reading on variational inference, Bayesian statistics, and related probability concepts?

    • @KapilSachdeva (1 year ago, +2)

      For Bayesian statistics, I would recommend reading:
      Statistical Rethinking by Richard McElreath [see this page for more information - xcelab.net/rm/]
      A good overview is the paper "Variational Inference: A Review for Statisticians" by David M. Blei et al.:
      arxiv.org/abs/1601.00670
      For basic/foundational variational inference, PRML is a good source:
      www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf
      There are many books and lecture notes on probability theory. Pick any one.

  • @bevandenizclgn9282 (8 months ago)

    Best explanation I have found so far, thank you!

  • @BirthdayDoggy (10 months ago, +1)

    Thank you so much for this explanation :) Very clear and well explained. I wish you all the best

  • @kappa12385 (2 years ago, +1)

    Brilliantly taught, brother. Thoroughly enjoyed it.

  • @schrodingerac (3 years ago, +3)

    Excellent presentation and explanation.
    Thank you very much, sir.

  • @AruneshKumarSinghPro (1 year ago, +2)

    This one is a masterpiece. Could you please make a video on hierarchical variational autoencoders when you have time? Looking forward to it.

  • @Aruuuq (3 years ago, +3)

    Amazing tutorial! Keep up the good work.

  • @brookestephenson4354 (3 years ago, +3)

    Very clear explanation! Thank you very much!

    • @KapilSachdeva (3 years ago)

      Thanks Brooke. Happy that you found it helpful!

  • @HelloWorlds__JTS (1 year ago, +1)

    Great explanations! I do have one correction to suggest: At (6:41) you say D_KL is always non-negative; but this can only be true if q is chosen to bound p from above over enough of their overlap (... for the given example, i.e. reverse-KL).

    • @KapilSachdeva (1 year ago)

      🙏 Correct

    • @HelloWorlds__JTS (10 months ago)

      @@KapilSachdeva I was wrong to make my earlier suggestion, because p and q are probabilities. I can give details if anyone requests it, but it's trivial to see using total variation distance or Jensen's inequality.
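
      For readers following this thread: the non-negativity of KL divergence for any two probability densities follows from Jensen's inequality; a standard sketch (not from the video):

      ```latex
      % Since -log is convex, Jensen's inequality applied to the ratio p/q gives:
      -D_{\mathrm{KL}}\!\left(q \,\|\, p\right)
        = \mathbb{E}_{q(z)}\!\left[\log \tfrac{p(z)}{q(z)}\right]
        \le \log \mathbb{E}_{q(z)}\!\left[\tfrac{p(z)}{q(z)}\right]
        = \log \int p(z)\, dz
        = \log 1 = 0,
      \qquad\text{hence } D_{\mathrm{KL}}(q \,\|\, p) \ge 0 .
      ```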

  • @vi5hnupradeep (3 years ago, +2)

    Thank you so much, sir! I'm glad that I found your video 💯

  • @sahhaf1234 (1 year ago, +1)

    Good explanation. I can follow the algebra easily. The problem is this: what is known and what is not known in this formulation? In other words, @0:26, I think we try to find the posterior. But, do we know the prior? Do we know the likelihood? Or, is it that we do not know them but can sample them?

    • @KapilSachdeva (1 year ago)

      Good questions, and you have mostly answered them yourself. The prior is what you assume. The likelihood function you need to know (or model). The most difficult part is computing the normalizing constant, which most of the time is computationally intractable.
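
      In symbols, the setup described above (a sketch of the standard Bayesian decomposition; z are the latent variables, x the observed data):

      ```latex
      % Posterior = (likelihood x prior) / evidence.  The numerator comes from the
      % modelling assumptions; the denominator integrates over all z and is usually
      % computationally intractable, which is what motivates the ELBO.
      p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)},
      \qquad
      p(x) = \int p(x \mid z)\, p(z)\, dz .
      ```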

  • @easter.bunny.6 (5 months ago)

    Thanks for the lecture, sir! I have a question at 4:54: how did you expand E[log p_theta(x)] into the integral ∫ q(z|x) log p_theta(x) dz? Thanks!
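
    For readers with the same question, the step follows from the definition of an expectation under q(z|x) together with the fact that log p_theta(x) does not depend on z (a sketch in the comment's notation):

    ```latex
    % q(z|x) is a density in z, so it integrates to 1; log p_theta(x) is constant in z.
    \mathbb{E}_{q(z \mid x)}\!\left[\log p_\theta(x)\right]
      = \int q(z \mid x)\, \log p_\theta(x)\, dz
      = \log p_\theta(x) \int q(z \mid x)\, dz
      = \log p_\theta(x).
    ```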

  • @mahayat (3 years ago, +3)

    best and clear explanation!

  • @wolfgangpaier6208 (1 year ago, +1)

    Hi, I really appreciate your video tutorial because it’s super helpful and easy to understand. I only have one question left. At 10:27 you replaced the conditional distribution q(z|x) by q(z). Is this also true for Variational Auto-Encoders? Because for VAEs, if I understand right, q(z) is approximated by a neural network that predicts z from x. So I would expect that it’s a conditional distribution where z depends on x.

    • @KapilSachdeva (1 year ago, +1)

      In the case of a VAE it will always be a conditional distribution. Your understanding is correct 🙏

    • @wolfgangpaier6208 (1 year ago)

      @@KapilSachdeva ok. Thanks a lot for the fast response 🙏

  • @chethankr3598 (1 year ago, +1)

    This is an awesome explanation. Thank you.

  • @mmattb (1 year ago, +1)

    Sorry to bother you again Kapil - is the integral at 5:05 supposed to have d(z|x) instead of dz? If not, I'm certainly confused haha.

    • @KapilSachdeva (1 year ago)

      No bother at all. Conceptually you can think of it like that, but I have not seen the differential part of an integral written with the conditional (pipe) notation. It is just a notational convention here; your understanding is correct.

  • @anshumansinha5874 (1 year ago)

    So, we have to maximise the ELBO (@9:28), right? As that would bring it closer to the log-likelihood of the original data.
    1. Will that mean we should find the parameters 'phi' which increase the reconstruction error (as it is the first term)?
    2. And find 'phi' such that the second term gets minimised? Which would mean q_phi(z|x) should be as close as possible to the prior p(z)?
    But don't we need to minimise the reconstruction error while not going far from the assumed prior p(z)? How do we get these inferences from the equation derived @9:28?

    • @KapilSachdeva (1 year ago)

      We minimize the “negative” ELBO

    • @YT-yt-yt-3 (4 months ago)

      @@KapilSachdeva The terminology and signs surrounding KL divergence and the ELBO are what make them seem complex; otherwise it's a simple concept. Is it really a 'reconstruction error'? I mean, isn't it the likelihood of observing the data given z that needs to be maximized? Why is it called an error?
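
      For reference, the quantity being discussed, written out (a sketch in common VAE notation; the first term is an expected log-likelihood to be maximized, which is why calling it a "reconstruction error" only really makes sense after negating it):

      ```latex
      % ELBO = expected reconstruction log-likelihood minus KL to the prior;
      % training typically minimizes the negative ELBO.
      \mathrm{ELBO}(\theta, \phi)
        = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
          - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right),
      \qquad
      \text{training loss} = -\,\mathrm{ELBO}(\theta, \phi).
      ```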

  • @mmattb (1 year ago, +1)

    One more question: at 10:11 I can see the right hand term looks like a KL divergence between the distributions, but I'm confused: what would you integrate over if you expanded that? In the KL formulation typically the top and bottom of the fraction are distributions over the same variable. Is it just an intuition to call this KL, or is it literally a KL divergence; if the latter, do you mind writing out the general formula for KL when the top and bottom are distributions over different variables (z|x vs z in this case)?

    • @KapilSachdeva (1 year ago)

      Z|X just means that you condition Z on X, but it still remains a (conditional) distribution over Z. Hence your statement about the KL divergence being over the same variable is still valid. Hope this makes sense.

    • @mmattb (1 year ago, +1)

      @@KapilSachdeva ohhhh so both of them are defined over the same domain as Z. That makes sense. Thanks again.

    • @KapilSachdeva (1 year ago)

      🙏
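
      Written out, the point settled in this thread (a sketch): both densities are functions of the same variable z, with x entering only as a fixed conditioning value, so the KL divergence is an ordinary integral over z:

      ```latex
      D_{\mathrm{KL}}\!\left(q(z \mid x) \,\|\, p(z)\right)
        = \int q(z \mid x)\, \log \frac{q(z \mid x)}{p(z)}\, dz .
      ```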

  • @abhinav9058 (2 years ago, +2)

    Subscribed, sir, awesome tutorial.
    Learning variational autoencoders 😃

  • @chadsamuelson1808 (2 years ago, +1)

    Amazingly clear explanation!

  • @kadrimufti4295 (6 months ago)

    At the 4:45 mark, how did you expand the third term Expectation into its integral form in that way? How is it an "expectation with respect to z" when there is no z but only x?

  • @lihuil3115 (2 years ago, +2)

    best explanation ever!

  • @alexfrangos2402 (1 year ago, +1)

    Amazing explanation, thank you so much!

  • @ziangshi182 (1 year ago, +1)

    Fantastic Explanation!

  • @wadewang574 (1 year ago, +1)

    At 4:40, how can we see that the third component is an expectation with respect to z rather than x?

    • @KapilSachdeva (1 year ago)

      Because the KL divergence (which is itself an expected value) is between p(z|x) and q(z|x).
      You need a good understanding of KL divergence and of the expected value to see it.

  • @riaarora3126 (2 years ago, +1)

    Wow, clarity supremacy

    • @KapilSachdeva (2 years ago)

      🙏 😀 “clarity supremacy” …. Good luck with your learnings.

  • @UdemmyUdemmy (1 year ago, +1)

    You are a legend!

  • @satadrudas3675 (11 months ago, +1)

    Explained very well. Thanks.

  • @alfcnz (3 years ago, +2)

    Thanks! 😍😍😍

  • @the_akhash (2 years ago, +1)

    Thanks for the explanation!

  • @user-or7ji5hv8y (3 years ago, +3)

    Just one question, at ua-cam.com/video/IXsA5Rpp25w/v-deo.html, when you expanded log p(x), how did you know to use q(z | x) instead of simply q(x)? Thank you.

    • @KapilSachdeva (3 years ago, +4)

      We are after approximating the posterior p(z|x). We do this approximation using q, a distribution we know how to sample from and whose parameters we intend to find using an optimization procedure. So the distribution q is different from p but is still about (or for) z|x. In other words, it is an "assumed" distribution for "z|x".
      The symbol/notation "E_q" (sorry, can't write latex/typeset in the comments 😟) means that it is an expectation where the probability distribution is "q". Whatever is in the subscript of the symbol E indicates the probability distribution.
      Since in this entire tutorial q is a distribution of z given x (i.e. z|x), the notations E_q and E_q(z|x) are the same, i.e. q and q(z|x) are the same. This is why, when expanded, it was q(z|x) and not q(x).
      Watch my video on Importance Sampling (at least the starting portion, where I clarify the expectation notation & symbols). Here is the link to the video - ua-cam.com/video/ivBtpzHcvpg/v-deo.html

    • @ericzhang4486 (3 years ago)

      @@KapilSachdeva Does that mean the expectation of log p(x) doesn't depend on the distribution q, since in the end E_q[log p(x)] becomes log p(x)?

    • @KapilSachdeva (3 years ago)

      @@ericzhang4486 Since log p(x) does not have any 'z' in it, log p(x) is treated as a constant when the sampling distribution used for computing the expectation is q(z) (or even q(z|x)). This is why the equation simplifies by taking this constant out of the integral. Let me know if this helps you understand it.

    • @ericzhang4486 (3 years ago)

      @@KapilSachdeva It makes perfect sense. Thank you so much!

    • @ericzhang4486 (3 years ago)

      I came to your video from equation 1 in the DALL-E paper (arxiv.org/pdf/2102.12092.pdf). If possible, could you give me a little enlightenment on how the ELBO is derived in that case? Feel free to skip this if you don't have time. Thank you!

  • @anshumansinha5874 (1 year ago)

    Why is the first term the reconstruction error? I mean, we are getting back x from the latent variable z, but shouldn't reconstruction be x - x', i.e. the initial x and the final x from p(x|z)? Also, how do I read that expression? E_q[log p(x|z)] = ∫ q(x) log p(x|z) dx, i.e. we want to average the function of the random variable x weighted by q(x); what does that mean in the context of a VAE?

    • @KapilSachdeva (1 year ago)

      > Why is the first term the reconstruction error?
      One way to see the error in reconstruction is x - x', i.e. the difference or the square of the difference. This is what you are familiar with. Another way is to see it in terms of "likelihood"; that type of objective function is called maximum likelihood estimation. Read up on MLE if you are not familiar with it. In other words, what we have is another objective/loss function that you maximize/minimize.
      That said, you can indeed replace E[log p(x|z)] with the MSE. It is done in quite a few implementations. I talk about it in the VAE tutorial as well.
      > What does that mean in the sense of a VAE?
      For that you will want to watch the VAE tutorial, where I explain why we need to do this. If it is not clear from that tutorial, ask the question in the comments of that video.
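
      As a concrete illustration of the reply above, here is a minimal sketch (not the video's code) of a VAE-style loss in which E[log p(x|z)] is replaced by a mean-squared reconstruction error, plus the analytic KL term for a diagonal-Gaussian q(z|x) against a standard-normal prior. The names `x`, `x_recon`, `mu`, and `logvar` are hypothetical tensors assumed to come from an encoder/decoder pair:

      ```python
      import torch
      import torch.nn.functional as F

      def vae_loss(x, x_recon, mu, logvar):
          """Negative-ELBO-style loss: MSE reconstruction + analytic KL.

          Assumes q(z|x) = N(mu, diag(exp(logvar))) and prior p(z) = N(0, I),
          for which the KL divergence has the closed form used below.
          """
          # Reconstruction term: stands in for -E_q[log p(x|z)] (up to constants)
          recon = F.mse_loss(x_recon, x, reduction="sum")

          # KL(q(z|x) || p(z)) for a diagonal Gaussian vs. a standard normal
          kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

          return recon + kl
      ```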

  • @peterhall6656 (1 year ago, +1)

    Top drawer explanation.

  • @AI_ML_DL_LLM (1 year ago, +1)

    It is a great one; it would be even greater if you could start with a simple numerical example.

  • @MrArtod (3 years ago, +1)

    Best explanation, thx!

  • @mikhaildoroshenko2169 (2 years ago)

    Can we choose the prior distribution of z in any way we want or do we have to estimate it somehow?

    • @KapilSachdeva (2 years ago, +1)

      In Bayesian statistics, choosing/selecting the prior is one of the challenging aspects.
      The prior distribution can be chosen based on your domain knowledge (when you have small datasets) or estimated from the data itself (when your dataset is large).
      The method of "estimating" the prior from data is called "Empirical Bayes" (en.wikipedia.org/wiki/Empirical_Bayes_method).
      There are a few modern research papers that try to "learn" the prior as an additional step in a VAE.

  • @UdemmyUdemmy (1 year ago, +1)

    This one video is worth a million gold particles.

  • @mammamiachemale (2 years ago, +1)

    I love you, great!!!

  • @Pruthvikajaykumar (2 years ago, +1)

    Thank you so much

  • @Maciek17PL (1 year ago)

    What is log p_theta(x) (with the blue theta) at 5:40? Is it a pdf or a single number?

    • @KapilSachdeva (1 year ago, +1)

      It would be a density, but when used for optimization you get a scalar value for a given batch of samples.

  • @RAP4EVERMRC96 (1 year ago)

    At 4:33, why is it plus the expected value of log p(x), as opposed to minus?

  • @heshamali5208 (2 years ago)

    Around minute 9:20, how is it log p(z|x) / p(z)? It was addition; shouldn't it be log p(z|x) * p(z)? Please correct me, sir. Thanks.

    • @KapilSachdeva (2 years ago)

      Hello Hesham, I do not see the expression "log p(z|x)/p(z)" anywhere in the tutorial. Could you check again which screen is causing the confusion? Perhaps there is a typo in the above comment?

    • @heshamali5208 (2 years ago)

      @@KapilSachdeva Thanks for your kind reply, sir. I mean that in the third line at minute 9:22, we moved from E_q[log q(z|x)] + E_q[log p(z)] to E_q[log q(z|x)/p(z)], and I don't know why it is division and not multiplication, since it was addition before taking the common log.

    • @KapilSachdeva (2 years ago)

      Here is how you should see it. I did not show one intermediate step, hence your confusion.
      Let's look at only the last two terms in the equation:
      -E[log q(z|x)] + E[log p(z)]
      = -E[log q(z|x) - log p(z)]   {I have factored the expectation out since it is common to both terms}
      = -E[log q(z|x) / p(z)]
      Hope this clarifies it now.

    • @heshamali5208 (2 years ago, +1)

      @@KapilSachdeva OK, thanks sir. It is clear now.
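
      For reference, the intermediate step above written out, with the final identification as a KL divergence (a sketch; E_q denotes the expectation under q(z|x)):

      ```latex
      -\,\mathbb{E}_{q}\!\left[\log q(z \mid x)\right] + \mathbb{E}_{q}\!\left[\log p(z)\right]
        = -\,\mathbb{E}_{q}\!\left[\log q(z \mid x) - \log p(z)\right]
        = -\,\mathbb{E}_{q}\!\left[\log \frac{q(z \mid x)}{p(z)}\right]
        = -\,D_{\mathrm{KL}}\!\left(q(z \mid x) \,\|\, p(z)\right).
      ```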

  • @heshamali5208 (2 years ago)

    Why, when maximizing the first component, is the second component minimized automatically?

    • @KapilSachdeva (2 years ago)

      Let's say:
      fixed_amount = a + b
      If `a` increases then `b` must decrease in order to respect the above equation.
      The log evidence is fixed. It is the total probability after taking into account all parameters and hidden variables. As the tutorial shows, it consists of two components, so if you maximize one component the other must decrease.
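
      In equation form, the "fixed amount" argument above (a sketch using the standard identity from the tutorial's derivation):

      ```latex
      % log p(x) does not depend on q, so it is fixed with respect to the variational
      % parameters; increasing the ELBO must therefore decrease the KL term.
      \log p(x)
        = \mathrm{ELBO}(q)
        + D_{\mathrm{KL}}\!\left(q(z \mid x) \,\|\, p(z \mid x)\right).
      ```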

    • @heshamali5208 (2 years ago)

      @@KapilSachdeva Thanks, sir. My last question is how, computationally, I can calculate KL(Q(Z)||P(Z)). How do I know P(Z), when all I can get is the latent variable Z, which in my understanding comes from Q(Z)? So how do I make sure that the predicted distribution of Z is as close as possible to the actual distribution of Z? I now know how I can get P(X|Z); my question is how do I calculate the regularization term?

    • @KapilSachdeva (2 years ago)

      I explain this in the tutorial on variational auto encoder. ua-cam.com/video/h9kWaQQloPk/v-deo.html

    • @heshamali5208 (2 years ago)

      @@KapilSachdeva Thanks sir for your fast reply.

  • @yongen5398 (3 years ago, +1)

    Haha, that "I have cheated you" at 7:36.

  • @anshumansinha5874 (1 year ago)

    Why do we even need to know the posterior p(z|x)? I think you could start with that.

    • @KapilSachdeva (1 year ago, +1)

      For that watch the “towards Bayesian regression” series on my channel.

    • @anshumansinha5874 (1 year ago)

      @@KapilSachdeva Oh great, that'll be a lot of help! And a great video series!

  • @NadavBenedek (10 months ago)

    Not clear enough. In the first minute you say "intractable", but you need to give an example of why this is intractable and why other terms are not. Also, explain why the denominator is intractable while the numerator is not.

  • @vivekpokharel4731 (2 months ago)

    Thank you so much.