Crystal clear explanation, the world needs more people like you!
🙏🙏
Again, thank you. This is incredibly well explained: the small steps and the explanation behind them, pure gold.
🙏
Absolutely beautiful. The explanation is so insanely well thought out and clear.
🙏
Best explanation ever! I found this video for my understanding of VAE at first, but I recently found that this is also directly related to diffusion models. Thanks for making this video.
🙏
Insane explanation Mr. Sachdeva! Thank you so much - I wish you all the best in life
🙏
That was great, been going through paper after paper, all I needed was this! Thanks!
🙏
Best explanation I have found so far, thank u!
Thanks, your tutorial cleared my doubts!!
Thank you so much for this explanation :) Very clear and well explained. I wish you all the best
🙏
Fantastic tutorial!! Hoping to see more similar content. Thank you
🙏
excellent presentation and explanation
Thank you very much sir
🙏
Superbly taught, brother. Really enjoyed it.
🙏
Very clear explanation! Thank you very much!
Thanks Brooke. Happy that you found it helpful!
best and clear explanation!
🙏
This is an awesome explanation. Thank you.
🙏
This one is a masterpiece. Can you please make a video on Hierarchical Variational AutoEncoders when you have time? Looking forward to it.
🙏
Thank you so much, sir! I'm glad that I found your video 💯
🙏
Amazingly clear explanation!
🙏
Amazing tutorial! Keep up the good work.
🙏
best explanation ever!
🙏
Fantastic Explanation!
🙏
5:55 why is log P_theta(x) constant? Doesn't it depend on theta? When we do optimization to find theta for p_theta(z) and p_theta(x|z) wouldn't this cause problems?
Great explanations! I do have one correction to suggest: At (6:41) you say D_KL is always non-negative; but this can only be true if q is chosen to bound p from above over enough of their overlap (... for the given example, i.e. reverse-KL).
🙏 Correct
@@KapilSachdeva I was wrong to make my earlier suggestion, because p and q are probabilities. I can give details if anyone requests it, but it's trivial to see using total variation distance or Jensen's inequality.
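For anyone who wants the details mentioned above, here is a short sketch of the Jensen's inequality argument. It assumes only that q and p are normalized densities over the same variable z:

```latex
\begin{aligned}
-D_{KL}(q \,\|\, p)
  &= \mathbb{E}_{q(z)}\!\left[\log \frac{p(z)}{q(z)}\right] \\
  &\le \log \mathbb{E}_{q(z)}\!\left[\frac{p(z)}{q(z)}\right]
     \quad \text{(Jensen, because } \log \text{ is concave)} \\
  &= \log \int_{q(z) > 0} p(z)\, dz \;\le\; \log 1 = 0
\end{aligned}
```

So D_KL(q || p) >= 0 for any pair of probability distributions, and the same argument works with the roles of q and p swapped.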
At 4:40, how can we see that the third component is an expectation with respect to z instead of x?
Because the KL divergence (which is itself an expected value) is between p(z|x) and q(z|x), and both are distributions over z.
You need a good understanding of KL divergence and expected value to follow this step.
0:50 Can I conclude that these thetas are different from each other, unrelated and independent?
Great explanation. Can you tell me which books or articles I could refer to for further and deeper reading on variational inference, Bayesian statistics and related in-depth probability concepts?
For Bayesian Statistics, I would recommend reading:
Statistical Rethinking by Richard McElreath [See this page for more information - xcelab.net/rm/]
A good overview is this paper (Variational Inference: A Review for Statisticians by David M. Blei et al)
arxiv.org/abs/1601.00670
For Basic/Foundational Variational Inference, PRML is a good source
www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf
There are many books and lecture notes on Probability theory. Pick any one.
So, we have to maximise the ELBO (@9:28), right? As that would bring it closer to the log likelihood of the original data.
1. Will that mean we should find the parameter 'phi' which increases the reconstruction error (as it is the first term)?
2. And find 'phi' such that the second term gets minimised? Which would mean q_phi(z|x) should be as close as possible to the prior p(z)?
But don't we need to minimise the reconstruction error while not going far from the assumed prior p(z)? How do we get these inferences from the derived equation @9:28?
We minimize the “negative” ELBO
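To spell out the sign convention, here is a sketch using the q_phi and p_theta names from the question above and the equation at 9:28:

```latex
\begin{aligned}
\text{ELBO}(\theta, \phi)
  &= \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]
     - D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big) \\
\text{loss}(\theta, \phi) = -\text{ELBO}(\theta, \phi)
  &= \underbrace{-\,\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]}_{\text{reconstruction error}}
     + \underbrace{D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)}_{\text{regularizer}}
\end{aligned}
```

The first term of the ELBO is an expected log-likelihood, so maximizing it lowers the reconstruction error. The "error" is its negative, which is what appears in the loss being minimized.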
@@KapilSachdeva The terminology and signs surrounding KL divergence and ELBO are what make them complex; otherwise it's a simple concept. Is it really a 'reconstruction error'? I mean, isn't it the likelihood of observing the data given z that needs to be maximized? Why is it called an error?
One more question: at 10:11 I can see the right hand term looks like a KL divergence between the distributions, but I'm confused: what would you integrate over if you expanded that? In the KL formulation typically the top and bottom of the fraction are distributions over the same variable. Is it just an intuition to call this KL, or is it literally a KL divergence; if the latter, do you mind writing out the general formula for KL when the top and bottom are distributions over different variables (z|x vs z in this case)?
Z|X just means that you got the Z given X but it still remains the (conditional) distribution for Z. Hence your statement about using KL divergence over the same variable is still valid. Hope this makes sense.
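Written out, as a sketch in the notation of this thread, the term at 10:11 is a literal KL divergence in which x is held fixed and the integral runs over z only:

```latex
D_{KL}\big(q(z|x) \,\|\, p(z)\big)
  = \int q(z|x)\, \log \frac{q(z|x)}{p(z)}\, dz
  = \mathbb{E}_{q(z|x)}\big[\log q(z|x) - \log p(z)\big]
```

For each fixed x, both q(z|x) and p(z) are densities over the same variable z, so the usual definition applies unchanged.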
@@KapilSachdeva ohhhh so both of them are defined over the same domain as Z. That makes sense. Thanks again.
🙏
Amazing explanation, thank you so much!
🙏
Subscribed sir awesome tutorial
Learning variational auto encoder 😃
🙏
U are a legend!
🙏
Thanks for the lecture sir! I have a question at 4:54, how did you expand that E[log_p_theta(x)] into Integral(q(z|x)log_p_theta(x)dz)? Thanks!
Explained very well. Thanks
🙏
Top drawer explanation.
🙏
Good explanation. I can follow the algebra easily. The problem is this: what is known and what is not known in this formulation? In other words, @0:26, I think we try to find the posterior. But, do we know the prior? Do we know the likelihood? Or, is it that we do not know them but can sample them?
Good questions, and you have mostly answered them yourself. The prior is what you assume. The likelihood function you need to know (or model). But the most difficult part will be computing the normalizing constant. Most of the time it is computationally intractable.
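A small numerical illustration of the reply above. This is a sketch that is not taken from the video: the Gaussian prior, the Gaussian likelihood and the observed value `x_obs` are assumptions chosen so that the exact answer is known. With a one-dimensional z, the normalizing constant can be computed by brute-force integration on a grid:

```python
import numpy as np

def normal_pdf(v, mean, var):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * (v - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

x_obs = 1.5                                   # the single observed data point (assumed)
z_grid = np.linspace(-10.0, 10.0, 20001)      # grid over the latent variable z
dz = z_grid[1] - z_grid[0]

prior = normal_pdf(z_grid, 0.0, 1.0)          # p(z): assumed
likelihood = normal_pdf(x_obs, z_grid, 1.0)   # p(x|z): modelled
joint = likelihood * prior                    # numerator of Bayes' rule

evidence = np.sum(joint) * dz                 # p(x): the normalizing constant (Riemann sum)
posterior = joint / evidence                  # p(z|x) evaluated on the grid

# For this conjugate toy model the exact evidence is N(x; 0, 2).
evidence_exact = normal_pdf(x_obs, 0.0, 2.0)
posterior_mean = np.sum(z_grid * posterior) * dz

print(evidence, evidence_exact, posterior_mean)
```

The numerator is cheap to evaluate at any single z. The denominator needs a sum over every value of z. A grid with this many points per dimension is fine for one latent dimension, but the number of grid points grows exponentially with the dimension of z, which is why the normalizing constant is usually intractable.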
Thanks for the explanation!
🙏
Wow, clarity supremacy
🙏 😀 “clarity supremacy” …. Good luck with your learnings.
Hi, I really appreciate your video tutorial because it’s super helpful and easy to understand. I only have one question left. At 10:27 you replaced the conditional distribution q(z|x) by q(z). Is this also true for Variational Auto-Encoders? Because for VAEs, if I understand right, q(z) is approximated by a neural network that predicts z from x. So I would expect that it’s a conditional distribution where z depends on x.
In the case of VAE it will always be conditional distribution. Your understanding is correct 🙏
@@KapilSachdeva ok. Thanks a lot for the fast response 🙏
Sorry to bother you again Kapil - is the integral at 5:05 supposed to have d(z|x) instead of dz? If not, I'm certainly confused haha.
No bother at all. Conceptually you can think of it like that, but I have not encountered the differential part of an integral written with the conditional (the pipe) notation. So it is just a notation thing here. Your understanding is correct.
Thanks! 😍😍😍
🙏
Best explanation, thx!
🙏
it is a great one, would be greater if you could start with a simple numerical example
Interesting. Will think about it. 🙏
Why is the first term the reconstruction error? I mean, we are getting back x from the latent variable z, but shouldn't the reconstruction error be x - x', i.e. the initial x versus the final x from (x|z)? Also, how should I read that expression? E_q[log p(x|z)] = ∫ q(x) log p(x|z) dx, i.e. we want to average the function of the random variable x with the weight q(x); what does that mean in the sense of VAE?
> Why is the first term reconstruction error?
One way to see the error in reconstruction is x - x', i.e. the difference or the square of the difference. This is what you are familiar with. Another way is to see it in terms of "likelihood". That type of objective function is called maximum likelihood estimation. Read up on MLE if you are not familiar with it. In other words, what we have is another objective/loss function that you will maximize/minimize.
That said, you can indeed replace the E[log p(x|z)] term with the (negative) MSE, which corresponds to assuming a Gaussian likelihood with fixed variance. It is done in quite a few implementations. In the VAE tutorial, I talk about it as well.
> what does that mean in the sense of VAE?
For that you will want to watch the VAE tutorial. In it I explain why we need to do this! If it is not clear from that tutorial, ask the question in the comments of that video.
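On the point about replacing E[log p(x|z)] with the MSE, here is a minimal sketch. It assumes the decoder output `x_recon` is the mean of a Gaussian likelihood with a fixed standard deviation `sigma`; the function name and the example numbers are made up for illustration:

```python
import numpy as np

def gaussian_log_likelihood(x, x_recon, sigma=1.0):
    """log N(x; x_recon, sigma^2 I): the log p(x|z) term for one data point."""
    d = x.size
    sq_err = np.sum((x - x_recon) ** 2)
    return -0.5 * sq_err / sigma**2 - 0.5 * d * np.log(2.0 * np.pi * sigma**2)

x = np.array([0.2, -1.0, 0.5, 1.3])
recon_good = x + 0.1     # a close reconstruction
recon_bad = x + 1.0      # a poor reconstruction

ll_good = gaussian_log_likelihood(x, recon_good)
ll_bad = gaussian_log_likelihood(x, recon_bad)
mse_good = np.mean((x - recon_good) ** 2)
mse_bad = np.mean((x - recon_bad) ** 2)

# The log-likelihood is a constant minus a scaled squared error, so a
# higher likelihood and a lower MSE say the same thing.
print(ll_good, ll_bad, mse_good, mse_bad)
```

Under that Gaussian assumption, maximizing the likelihood term and minimizing the squared error differ only by a constant and a scale factor, which is why both readings of "reconstruction error" lead to the same optimum.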
4:33 why is it + ('plus') the expected value of log p(x) as opposed to - ('minus')?
nvmd got it
🙏
Just one question, at ua-cam.com/video/IXsA5Rpp25w/v-deo.html, when you expanded log p(x), how did you know to use q(z | x) instead of simply q(x)? Thank you.
We are after approximating the posterior p(z|x). We do this approximation using q, a distribution we know how to sample from and whose parameters we intend to find using an optimization procedure. So the distribution q would be different from p but would still be about (or for) z|x. In other words, it is an "assumed" distribution for "z|x".
The symbol/notation "E_q" .... (sorry can't write latex/typeset in the comments 😟) means that it is an expectation where the probability distribution is "q". Whatever is in the subscript of the symbol E indicates the probability distribution.
Since in this entire tutorial q is a distribution of z given x (i.e. z|x), the notations E_q and E_q(z|x) are the same, i.e. q and q(z|x) are the same. This is why, when it was expanded, it was q(z|x) and not q(x).
Watch my video on Importance Sampling (starting portion at least where I clarify the Expectation notation & symbols). Here is the link to the video - ua-cam.com/video/ivBtpzHcvpg/v-deo.html
@@KapilSachdeva does that mean the expectation of log p(x) doesn't depend on the distribution q, since in the end E_q[log p(x)] becomes log p(x)?
@@ericzhang4486 since log p(x) does not have any 'z' in it, log p(x) is treated as a constant when the sampling distribution used to compute the expectation is q(z) (or even q(z|x)). This is why the equation gets simplified by taking this constant out of the integral. Let me know if this helps you understand it.
@@KapilSachdeva it makes perfect sense. Thank you so much!
I came to your video from equation 1 in the DALL-E paper (arxiv.org/pdf/2102.12092.pdf). If it's possible, could you give me a little enlightenment on how the ELBO is derived in that case? Feel free to skip this if you don't have time. Thank you!
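Written out as a sketch in the notation of this thread, the step is:

```latex
\mathbb{E}_{q(z|x)}\big[\log p_\theta(x)\big]
  = \int q(z|x)\, \log p_\theta(x)\, dz
  = \log p_\theta(x) \int q(z|x)\, dz
  = \log p_\theta(x)
```

The last equality holds because q(z|x) is a normalized distribution over z, so its integral is 1.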
what is log p blue theta (x) at 5:40? is it a pdf or a single number?
it would be a density but if used for optimization you would get a scalar value for a given batch of samples
Can we choose the prior distribution of z in any way we want or do we have to estimate it somehow?
In Bayesian Statistics, choosing a prior is one of the challenging aspects.
The prior distribution can be chosen based on your domain knowledge (when you have small datasets) or estimated from the data itself (when your dataset is large).
The method of "estimating" the prior from data is called "Empirical Bayes" (en.wikipedia.org/wiki/Empirical_Bayes_method)
There are a few modern research papers that try to "learn" the prior as an additional step in VAE.
Why does maximizing the first component directly minimize the second component?
let's say
fixed_amount = a + b
if `a` increases then `b` must decrease in order to respect the above equation.
##
log_evidence is fixed. It is the total (log) probability of the data after taking into consideration all parameters and hidden variables. As the tutorial shows, it consists of two components. If you maximize one component then the other must decrease.
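A tiny numerical check of the `fixed_amount = a + b` picture. This is only a sketch: the three-state latent variable and all the probabilities below are made-up numbers, chosen so that everything can be computed exactly:

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])        # p(z) for z in {0, 1, 2} (assumed)
likelihood = np.array([0.1, 0.6, 0.3])   # p(x|z) for the one observed x (assumed)

joint = prior * likelihood               # p(x, z)
log_evidence = np.log(joint.sum())       # log p(x): does not depend on q
posterior = joint / joint.sum()          # exact p(z|x)

def elbo(q):
    """E_q[log p(x, z) - log q(z)]."""
    return np.sum(q * (np.log(joint) - np.log(q)))

def kl_to_posterior(q):
    """KL(q(z) || p(z|x))."""
    return np.sum(q * (np.log(q) - np.log(posterior)))

candidates = {
    "uniform": np.array([1 / 3, 1 / 3, 1 / 3]),
    "closer": np.array([0.2, 0.6, 0.2]),
    "exact posterior": posterior,
}

for name, q in candidates.items():
    a, b = elbo(q), kl_to_posterior(q)
    print(f"{name:>16}: ELBO={a:.4f}  KL={b:.4f}  ELBO+KL={a + b:.4f}")
print(f"log evidence = {log_evidence:.4f}")
```

For every choice of q the last column equals the log evidence, so any gain in the ELBO is matched by an equal drop in the KL divergence to the true posterior. When q is the exact posterior the KL is zero and the ELBO reaches the log evidence.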
@@KapilSachdeva Thanks sir. My last question is how I can computationally calculate Q(Z)||P(Z). How do I know P(Z), when all I can get is the latent variable Z, which in my understanding is Q(Z)? So how do I make sure that the predicted distribution of Z is as close as possible to the actual distribution of Z? I know now how I could get P(X|Z). My question is: how do I calculate the regularization term?
I explain this in the tutorial on variational auto encoder. ua-cam.com/video/h9kWaQQloPk/v-deo.html
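Until you get to that tutorial, here is a minimal sketch of one common setup. It is an assumption, not necessarily what the linked video does: P(Z) is simply chosen to be a standard normal N(0, I), and Q(Z) is a diagonal Gaussian whose `mu` and `log_var` would come from the encoder. In that case the regularization term has a closed form, and a Monte Carlo estimate from samples of Q agrees with it:

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

def log_normal_diag(z, mu, log_var):
    """Log density of a diagonal Gaussian, summed over the last axis."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + log_var + (z - mu) ** 2 / np.exp(log_var), axis=-1
    )

# Hypothetical encoder outputs for one input x.
mu = np.array([1.0, -0.5])
log_var = np.array([0.0, -1.0])

kl_exact = kl_diag_gaussian_to_standard_normal(mu, log_var)

# Monte Carlo version: sample z from Q, then average log Q(z) - log P(z).
rng = np.random.default_rng(0)
z = mu + np.exp(0.5 * log_var) * rng.standard_normal((200_000, 2))
log_q = log_normal_diag(z, mu, log_var)
log_p = log_normal_diag(z, np.zeros(2), np.zeros(2))
kl_mc = np.mean(log_q - log_p)

print(kl_exact, kl_mc)
```

The point is that P(Z) is not something you have to discover: it is the prior you assumed, so its density can be evaluated at any z. Only Q(Z) is learned, and the KL term pulls it toward that assumed prior.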
@@KapilSachdeva Thanks sir for your fast reply.
I love you, great!!!
😄🙏
Thank you so much
🙏
In minute 9:200, how is it log p(z|x) / p(z)? It was an addition. Shouldn't it be log p(z|x) * p(z)? Please correct me, sir. Thanks.
Hello Hesham, I do not see the expression "log p(z|x)/p(z)" anywhere in the tutorial. Could you check again the screen which is causing the confusion, and whether you may have a typo in the above comment?
@@KapilSachdeva thanks for your kind reply sir. I mean in the third line at minute 9:22, we moved from Eq[log q(z|x)] + Eq[log p(z)] to --> Eq[log q(z|x) / p(z)], and I don't know why it is a division and not a multiplication, as it was an addition before taking a common log.
@@heshamali5208 Here is how you should see it. I did not show one intermediary step and hence your confusion.
Let’s look at only the two last terms in the equation.
-E[log q(z|x)] + E[log p(z)]
= -E[log q(z|x) - log p(z)] {I have taken the expectation out as it is common; note the minus sign in front}
= -E[log (q(z|x) / p(z))] {a difference of logs is the log of a quotient}
Hope this clarifies now.
@@KapilSachdeva ok thanks sir. it is clear now
haha that "I have cheated you" at 7:36
😀
This one video is worth a million gold particles..
🙏
Why do we even need to know the posterior p(z|x)? I think you could start with that.
For that watch the “towards Bayesian regression” series on my channel.
@@KapilSachdeva Oh great, that'll be a lot of help! And great video series!
Not clear enough. In the first minute you say 'intractable', but you need to give an example of why this is intractable and why other terms are not. Also, explain why the denominator is intractable while the numerator is not.
Thank you so much.