Evidence Lower Bound (ELBO) - CLEARLY EXPLAINED!

  • Published 27 Nov 2024
  • This tutorial explains what ELBO is and shows its derivation step by step.
    #variationalinference
    #kldivergence
    #bayesianstatistics
  • Science & Technology

COMMENTS • 123

  • @AndreiMargeloiu (3 years ago, +40)

    Crystal clear explanation, the world needs more people like you!

  • @TheProblembaer2 (9 months ago, +4)

    Again, thank you. This is incredibly well explained; the small steps and the explanations behind them, pure gold.

  • @sonny1552 (1 year ago, +2)

    Best explanation ever! I found this video for my understanding of VAE at first, but I recently found that this is also directly related to diffusion models. Thanks for making this video.

  • @genericperson8238 (2 years ago, +9)

    Absolutely beautiful. The explanation is so insanely well thought out and clear.

  • @T_rex-te3us (1 year ago, +1)

    Insane explanation Mr. Sachdeva! Thank you so much - I wish you all the best in life

  • @9speedbird (1 year ago, +1)

    That was great, been going through paper after paper, all I needed was this! Thanks!

  • @danmathewsrobin5991 (3 years ago, +4)

    Fantastic tutorial!! Hoping to see more similar content. Thank you

  • @thatipelli1 (3 years ago, +4)

    Thanks, your tutorial cleared my doubts!!

  • @ajwadakil6892 (1 year ago, +1)

    Great explanation. Could you tell me which books/articles I may refer to for further and deeper reading on variational inference, Bayesian statistics, and related probability concepts?

    • @KapilSachdeva (1 year ago, +2)

      For Bayesian statistics, I would recommend reading:
      Statistical Rethinking by Richard McElreath [see this page for more information - xcelab.net/rm/]
      A good overview is the paper "Variational Inference: A Review for Statisticians" by David M. Blei et al.:
      arxiv.org/abs/1601.00670
      For basic/foundational variational inference, PRML is a good source:
      www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf
      There are many books and lecture notes on probability theory. Pick any one.

  • @bevandenizclgn9282 (8 months ago)

    Best explanation I have found so far, thank you!

  • @BirthdayDoggy (10 months ago, +1)

    Thank you so much for this explanation :) Very clear and well explained. I wish you all the best

  • @kappa12385 (2 years ago, +1)

    Brilliantly taught, brother. Thoroughly enjoyed it.

  • @schrodingerac (3 years ago, +3)

    Excellent presentation and explanation.
    Thank you very much, sir.

  • @AruneshKumarSinghPro (1 year ago, +2)

    This one is a masterpiece. Could you please make a video on hierarchical variational autoencoders when you have time? Looking forward to it.

  • @Aruuuq (3 years ago, +3)

    Amazing tutorial! Keep up the good work.

  • @brookestephenson4354 (3 years ago, +3)

    Very clear explanation! Thank you very much!

    • @KapilSachdeva (3 years ago)

      Thanks Brooke. Happy that you found it helpful!

  • @HelloWorlds__JTS (1 year ago, +1)

    Great explanations! I do have one correction to suggest: At (6:41) you say D_KL is always non-negative; but this can only be true if q is chosen to bound p from above over enough of their overlap (... for the given example, i.e. reverse-KL).

    • @KapilSachdeva (1 year ago)

      🙏 Correct

    • @HelloWorlds__JTS (10 months ago)

      @@KapilSachdeva I was wrong to make my earlier suggestion, because p and q are probabilities. I can give details if anyone requests it, but it's trivial to see using total variation distance or Jensen's inequality.
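
      For readers following this thread: the non-negativity of KL divergence for any two probability densities follows from Jensen's inequality; a standard sketch (not from the video):

      ```latex
      % Since -log is convex, Jensen's inequality applied to the ratio p/q gives:
      -D_{\mathrm{KL}}\!\left(q \,\|\, p\right)
        = \mathbb{E}_{q(z)}\!\left[\log \tfrac{p(z)}{q(z)}\right]
        \le \log \mathbb{E}_{q(z)}\!\left[\tfrac{p(z)}{q(z)}\right]
        = \log \int p(z)\, dz
        = \log 1 = 0,
      \qquad\text{hence } D_{\mathrm{KL}}(q \,\|\, p) \ge 0 .
      ```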

  • @vi5hnupradeep (3 years ago, +2)

    Thank you so much, sir! I'm glad that I found your video 💯

  • @sahhaf1234 (1 year ago, +1)

    Good explanation. I can follow the algebra easily. The problem is this: what is known and what is not known in this formulation? In other words, @0:26, I think we try to find the posterior. But, do we know the prior? Do we know the likelihood? Or, is it that we do not know them but can sample them?

    • @KapilSachdeva (1 year ago)

      Good questions, and you have mostly answered them yourself. The prior is what you assume. The likelihood function you need to know (or model). The most difficult part is computing the normalizing constant, which most of the time is computationally intractable.
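
      In symbols, the setup described above (a sketch of the standard Bayesian decomposition; z are the latent variables, x the observed data):

      ```latex
      % Posterior = (likelihood x prior) / evidence.  The numerator comes from the
      % modelling assumptions; the denominator integrates over all z and is usually
      % computationally intractable, which is what motivates the ELBO.
      p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)},
      \qquad
      p(x) = \int p(x \mid z)\, p(z)\, dz .
      ```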

  • @easter.bunny.6 (5 months ago)

    Thanks for the lecture, sir! I have a question at 4:54: how did you expand E[log p_theta(x)] into the integral ∫ q(z|x) log p_theta(x) dz? Thanks!
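
    For readers with the same question, the step follows from the definition of an expectation under q(z|x) together with the fact that log p_theta(x) does not depend on z (a sketch in the comment's notation):

    ```latex
    % q(z|x) is a density in z, so it integrates to 1; log p_theta(x) is constant in z.
    \mathbb{E}_{q(z \mid x)}\!\left[\log p_\theta(x)\right]
      = \int q(z \mid x)\, \log p_\theta(x)\, dz
      = \log p_\theta(x) \int q(z \mid x)\, dz
      = \log p_\theta(x).
    ```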

  • @mahayat (3 years ago, +3)

    best and clear explanation!

  • @wolfgangpaier6208 (1 year ago, +1)

    Hi, I really appreciate your video tutorial because it’s super helpful and easy to understand. I only have one question left. At 10:27 you replaced the conditional distribution q(z|x) by q(z). Is this also true for Variational Auto-Encoders? Because for VAEs, if I understand right, q(z) is approximated by a neural network that predicts z from x. So I would expect that it’s a conditional distribution where z depends on x.

    • @KapilSachdeva (1 year ago, +1)

      In the case of a VAE it will always be a conditional distribution. Your understanding is correct 🙏

    • @wolfgangpaier6208 (1 year ago)

      @@KapilSachdeva ok. Thanks a lot for the fast response 🙏

  • @chethankr3598 (1 year ago, +1)

    This is an awesome explanation. Thank you.

  • @mmattb (1 year ago, +1)

    Sorry to bother you again Kapil - is the integral at 5:05 supposed to have d(z|x) instead of dz? If not, I'm certainly confused haha.

    • @KapilSachdeva (1 year ago)

      No bother at all. Conceptually you can think of it like that, but I have not seen the differential part of an integral written with the conditional (pipe) notation. It is just a notational convention here; your understanding is correct.

  • @anshumansinha5874 (1 year ago)

    So, we have to maximise the ELBO (@9:28), right? As that would bring it closer to the log-likelihood of the original data.
    1. Will that mean we should find the parameters 'phi' which increase the reconstruction error (as it is the first term)?
    2. And find 'phi' such that the second term gets minimised? Which would mean q_phi(z|x) should be as close as possible to the prior p(z)?
    But don't we need to minimise the reconstruction error while not going far from the assumed prior p(z)? How do we get these inferences from the equation derived @9:28?

    • @KapilSachdeva (1 year ago)

      We minimize the “negative” ELBO

    • @YT-yt-yt-3 (4 months ago)

      @@KapilSachdeva The terminology and signs surrounding KL divergence and the ELBO are what make them seem complex; otherwise it's a simple concept. Is it really a 'reconstruction error'? I mean, isn't it the likelihood of observing the data given z that needs to be maximized? Why is it called an error?
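
      For reference, the quantity being discussed, written out (a sketch in common VAE notation; the first term is an expected log-likelihood to be maximized, which is why calling it a "reconstruction error" only really makes sense after negating it):

      ```latex
      % ELBO = expected reconstruction log-likelihood minus KL to the prior;
      % training typically minimizes the negative ELBO.
      \mathrm{ELBO}(\theta, \phi)
        = \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
          - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, p(z)\right),
      \qquad
      \text{training loss} = -\,\mathrm{ELBO}(\theta, \phi).
      ```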

  • @mmattb (1 year ago, +1)

    One more question: at 10:11 I can see the right hand term looks like a KL divergence between the distributions, but I'm confused: what would you integrate over if you expanded that? In the KL formulation typically the top and bottom of the fraction are distributions over the same variable. Is it just an intuition to call this KL, or is it literally a KL divergence; if the latter, do you mind writing out the general formula for KL when the top and bottom are distributions over different variables (z|x vs z in this case)?

    • @KapilSachdeva (1 year ago)

      Z|X just means that you condition Z on X, but it still remains a (conditional) distribution over Z. Hence your statement about the KL divergence being over the same variable is still valid. Hope this makes sense.

    • @mmattb (1 year ago, +1)

      @@KapilSachdeva ohhhh so both of them are defined over the same domain as Z. That makes sense. Thanks again.

    • @KapilSachdeva (1 year ago)

      🙏
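
      Written out, the point settled in this thread (a sketch): both densities are functions of the same variable z, with x entering only as a fixed conditioning value, so the KL divergence is an ordinary integral over z:

      ```latex
      D_{\mathrm{KL}}\!\left(q(z \mid x) \,\|\, p(z)\right)
        = \int q(z \mid x)\, \log \frac{q(z \mid x)}{p(z)}\, dz .
      ```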

  • @abhinav9058 (2 years ago, +2)

    Subscribed, sir, awesome tutorial.
    Learning variational autoencoders 😃

  • @chadsamuelson1808 (2 years ago, +1)

    Amazingly clear explanation!

  • @kadrimufti4295 (6 months ago)

    At the 4:45 mark, how did you expand the third term Expectation into its integral form in that way? How is it an "expectation with respect to z" when there is no z but only x?

  • @lihuil3115 (2 years ago, +2)

    best explanation ever!

  • @alexfrangos2402 (1 year ago, +1)

    Amazing explanation, thank you so much!

  • @ziangshi182 (1 year ago, +1)

    Fantastic Explanation!

  • @wadewang574 (1 year ago, +1)

    At 4:40, how can we see that the third component is an expectation with respect to z rather than x?

    • @KapilSachdeva (1 year ago)

      Because the KL divergence (which is itself an expected value) is between p(z|x) and q(z|x).
      You need a good understanding of KL divergence and of the expected value to see it.

  • @riaarora3126 (2 years ago, +1)

    Wow, clarity supremacy

    • @KapilSachdeva (2 years ago)

      🙏 😀 “clarity supremacy” …. Good luck with your learnings.

  • @UdemmyUdemmy (1 year ago, +1)

    You are a legend!

  • @satadrudas3675 (11 months ago, +1)

    Explained very well. Thanks.

  • @alfcnz (3 years ago, +2)

    Thanks! 😍😍😍

  • @the_akhash (2 years ago, +1)

    Thanks for the explanation!

  • @user-or7ji5hv8y (3 years ago, +3)

    Just one question, at ua-cam.com/video/IXsA5Rpp25w/v-deo.html, when you expanded log p(x), how did you know to use q(z | x) instead of simply q(x)? Thank you.

    • @KapilSachdeva (3 years ago, +4)

      We are after approximating the posterior p(z|x). We do this approximation using q, a distribution we know how to sample from and whose parameters we intend to find using an optimization procedure. So the distribution q is different from p but is still about (or for) z|x. In other words, it is an "assumed" distribution for "z|x".
      The symbol/notation "E_q" (sorry, can't write latex/typeset in the comments 😟) means that it is an expectation where the probability distribution is "q". Whatever is in the subscript of the symbol E indicates the probability distribution.
      Since in this entire tutorial q is a distribution of z given x (i.e. z|x), the notations E_q and E_q(z|x) are the same, i.e. q and q(z|x) are the same. This is why, when expanded, it was q(z|x) and not q(x).
      Watch my video on Importance Sampling (at least the starting portion, where I clarify the expectation notation & symbols). Here is the link to the video - ua-cam.com/video/ivBtpzHcvpg/v-deo.html

    • @ericzhang4486 (3 years ago)

      @@KapilSachdeva Does that mean the expectation of log p(x) doesn't depend on the distribution q, since in the end E_q[log p(x)] becomes log p(x)?

    • @KapilSachdeva (3 years ago)

      @@ericzhang4486 Since log p(x) does not have any 'z' in it, log p(x) is treated as a constant when the sampling distribution used for computing the expectation is q(z) (or even q(z|x)). This is why the equation simplifies by taking this constant out of the integral. Let me know if this helps you understand it.

    • @ericzhang4486 (3 years ago)

      @@KapilSachdeva It makes perfect sense. Thank you so much!

    • @ericzhang4486 (3 years ago)

      I came to your video from equation 1 in the DALL-E paper (arxiv.org/pdf/2102.12092.pdf). If possible, could you give me a little enlightenment on how the ELBO is derived in that case? Feel free to skip this if you don't have time. Thank you!

  • @anshumansinha5874 (1 year ago)

    Why is the first term the reconstruction error? I mean, we are getting back x from the latent variable z, but shouldn't reconstruction be x - x', i.e. the initial x and the final x from p(x|z)? Also, how do I read that expression? E_q[log p(x|z)] = ∫ q(x) log p(x|z) dx, i.e. we want to average the function of the random variable x weighted by q(x); what does that mean in the context of a VAE?

    • @KapilSachdeva (1 year ago)

      > Why is the first term the reconstruction error?
      One way to see the error in reconstruction is x - x', i.e. the difference or the square of the difference. This is what you are familiar with. Another way is to see it in terms of "likelihood"; that type of objective function is called maximum likelihood estimation. Read up on MLE if you are not familiar with it. In other words, what we have is another objective/loss function that you maximize/minimize.
      That said, you can indeed replace E[log p(x|z)] with the MSE. It is done in quite a few implementations. I talk about it in the VAE tutorial as well.
      > What does that mean in the sense of a VAE?
      For that you will want to watch the VAE tutorial, where I explain why we need to do this. If it is not clear from that tutorial, ask the question in the comments of that video.
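
      As a concrete illustration of the reply above, here is a minimal sketch (not the video's code) of a VAE-style loss in which E[log p(x|z)] is replaced by a mean-squared reconstruction error, plus the analytic KL term for a diagonal-Gaussian q(z|x) against a standard-normal prior. The names `x`, `x_recon`, `mu`, and `logvar` are hypothetical tensors assumed to come from an encoder/decoder pair:

      ```python
      import torch
      import torch.nn.functional as F

      def vae_loss(x, x_recon, mu, logvar):
          """Negative-ELBO-style loss: MSE reconstruction + analytic KL.

          Assumes q(z|x) = N(mu, diag(exp(logvar))) and prior p(z) = N(0, I),
          for which the KL divergence has the closed form used below.
          """
          # Reconstruction term: stands in for -E_q[log p(x|z)] (up to constants)
          recon = F.mse_loss(x_recon, x, reduction="sum")

          # KL(q(z|x) || p(z)) for a diagonal Gaussian vs. a standard normal
          kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

          return recon + kl
      ```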

  • @peterhall6656 (1 year ago, +1)

    Top drawer explanation.

  • @AI_ML_DL_LLM (1 year ago, +1)

    It is a great one; it would be even greater if you could start with a simple numerical example.

  • @MrArtod (3 years ago, +1)

    Best explanation, thx!

  • @mikhaildoroshenko2169 (2 years ago)

    Can we choose the prior distribution of z in any way we want or do we have to estimate it somehow?

    • @KapilSachdeva (2 years ago, +1)

      In Bayesian statistics, choosing/selecting the prior is one of the challenging aspects.
      The prior distribution can be chosen based on your domain knowledge (when you have small datasets) or estimated from the data itself (when your dataset is large).
      The method of "estimating" the prior from data is called "Empirical Bayes" (en.wikipedia.org/wiki/Empirical_Bayes_method).
      There are a few modern research papers that try to "learn" the prior as an additional step in a VAE.

  • @UdemmyUdemmy (1 year ago, +1)

    This one video is worth a million gold particles.

  • @mammamiachemale (2 years ago, +1)

    I love you, great!!!

  • @Pruthvikajaykumar (2 years ago, +1)

    Thank you so much

  • @Maciek17PL (1 year ago)

    What is log p_theta(x) (with the blue theta) at 5:40? Is it a pdf or a single number?

    • @KapilSachdeva (1 year ago, +1)

      It would be a density, but when used for optimization you get a scalar value for a given batch of samples.

  • @RAP4EVERMRC96 (1 year ago)

    At 4:33, why is it plus the expected value of log p(x), as opposed to minus?

  • @heshamali5208 (2 years ago)

    Around minute 9:20, how is it log p(z|x) / p(z)? It was addition; shouldn't it be log p(z|x) * p(z)? Please correct me, sir. Thanks.

    • @KapilSachdeva (2 years ago)

      Hello Hesham, I do not see the expression "log p(z|x)/p(z)" anywhere in the tutorial. Could you check again which screen is causing the confusion? Perhaps there is a typo in the above comment?

    • @heshamali5208 (2 years ago)

      @@KapilSachdeva Thanks for your kind reply, sir. I mean that in the third line at minute 9:22, we moved from E_q[log q(z|x)] + E_q[log p(z)] to E_q[log q(z|x)/p(z)], and I don't know why it is division and not multiplication, since it was addition before taking the common log.

    • @KapilSachdeva (2 years ago)

      Here is how you should see it. I did not show one intermediate step, hence your confusion.
      Let's look at only the last two terms in the equation:
      -E[log q(z|x)] + E[log p(z)]
      = -E[log q(z|x) - log p(z)]   {I have factored the expectation out since it is common to both terms}
      = -E[log q(z|x) / p(z)]
      Hope this clarifies it now.

    • @heshamali5208 (2 years ago, +1)

      @@KapilSachdeva OK, thanks sir. It is clear now.
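
      For reference, the intermediate step above written out, with the final identification as a KL divergence (a sketch; E_q denotes the expectation under q(z|x)):

      ```latex
      -\,\mathbb{E}_{q}\!\left[\log q(z \mid x)\right] + \mathbb{E}_{q}\!\left[\log p(z)\right]
        = -\,\mathbb{E}_{q}\!\left[\log q(z \mid x) - \log p(z)\right]
        = -\,\mathbb{E}_{q}\!\left[\log \frac{q(z \mid x)}{p(z)}\right]
        = -\,D_{\mathrm{KL}}\!\left(q(z \mid x) \,\|\, p(z)\right).
      ```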

  • @heshamali5208 (2 years ago)

    Why, when maximizing the first component, is the second component minimized automatically?

    • @KapilSachdeva (2 years ago)

      Let's say:
      fixed_amount = a + b
      If `a` increases then `b` must decrease in order to respect the above equation.
      The log evidence is fixed. It is the total probability after taking into account all parameters and hidden variables. As the tutorial shows, it consists of two components, so if you maximize one component the other must decrease.
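
      In equation form, the "fixed amount" argument above (a sketch using the standard identity from the tutorial's derivation):

      ```latex
      % log p(x) does not depend on q, so it is fixed with respect to the variational
      % parameters; increasing the ELBO must therefore decrease the KL term.
      \log p(x)
        = \mathrm{ELBO}(q)
        + D_{\mathrm{KL}}\!\left(q(z \mid x) \,\|\, p(z \mid x)\right).
      ```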

    • @heshamali5208 (2 years ago)

      @@KapilSachdeva Thanks, sir. My last question is how, computationally, I can calculate KL(Q(Z)||P(Z)). How do I know P(Z), when all I can get is the latent variable Z, which in my understanding comes from Q(Z)? So how do I make sure that the predicted distribution of Z is as close as possible to the actual distribution of Z? I now know how I can get P(X|Z); my question is how do I calculate the regularization term?

    • @KapilSachdeva (2 years ago)

      I explain this in the tutorial on variational auto encoder. ua-cam.com/video/h9kWaQQloPk/v-deo.html

    • @heshamali5208 (2 years ago)

      @@KapilSachdeva Thanks sir for your fast reply.

  • @yongen5398 (3 years ago, +1)

    Haha, that "I have cheated you" at 7:36.

  • @anshumansinha5874 (1 year ago)

    Why do we even need to know the posterior p(z|x)? I think you could start with that.

    • @KapilSachdeva (1 year ago, +1)

      For that watch the “towards Bayesian regression” series on my channel.

    • @anshumansinha5874 (1 year ago)

      @@KapilSachdeva Oh great, that'll be a lot of help! And a great video series!

  • @NadavBenedek (10 months ago)

    Not clear enough. In the first minute you say "intractable", but you need to give an example of why this is intractable and why other terms are not. Also, explain why the denominator is intractable while the numerator is not.

  • @vivekpokharel4731 (2 months ago)

    Thank you so much.