I watched in sequence your videos about KL, ELBO, VAE and now this. They helped me a lot to clarify my understanding on Variational Auto-Encoders. Pure gold. Thanks!
🙏 ...glad that you found them helpful!
Glad that someone finally takes the time to decrypt the symbols in the loss function equation!! What a great channel :)
🙏
Man, these have to be the best ML videos on YouTube. I don't have a degree in Stats and you are absolutely right - the biggest roadblock for understanding is just parsing the notation. The fact that you explain the terms and give concrete examples for them in the context of the neural network is INCREDIBLY helpful.
I've watched half a dozen videos on VAEs and this is the one that finally got me to a solid mathematical understanding.
🙏 I don’t have a degree in stats either 😄
@KapilSachdeva what was your path to decoding this? I am curious about where you started and how you ended up here. I am sure that's just as interesting as this video.
I have watched so many ML / deep learning videos from so many creators, and you are the best. I feel like I finally understand what's going on. Thank you so much!
🙏
I just found treasure! This was the clearest explanation I've come across so far... And now I'm going to binge-watch this channel's videos like I do Netflix shows. :D
🙏 …. all tutorials are PG :)
I knew the concept; now I know the maths. Thanks for the videos, sir.
🙏
This explanation is what I was looking for for many days! Thank you!
🙏
For the quiz at the end:
From what I understood, the Encoder network (parametrized by phi) predicts some mu and sigma (based on input X) which then define a normal distribution that the latent variable is sampled from.
So I think the answer is 2: "predicts", not "learns".
Your answer is 100% correct 🤗
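For readers who want to see this concretely, here is a minimal PyTorch-style sketch of such an encoder (layer sizes and names are illustrative assumptions, not the exact network from the video). The point is that mu and sigma are not free parameters being learned; they are predicted from each input x by the phi-parameterized network.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input x to the parameters (mu, log_var) of q_phi(z|x)."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=20):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.mu_head = nn.Linear(hidden_dim, latent_dim)       # predicts mu of z
        self.log_var_head = nn.Linear(hidden_dim, latent_dim)  # predicts log(sigma^2) of z

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mu_head(h), self.log_var_head(h)
```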
I was looking for this. It's full of essential information. Convention matters, and you clearly explained the differences in this context.
🙏
I hope to also learn your style of delivery from these videos. It's so effective in breaking down the complexity of topics. Looking forward to whatever your next video is.
🙏 Thanks.
Amazing video on VAE and VI. Could you make a tutorial about variational inference in Latent Dirichlet Allocation? Descriptions and explanations of that part of the work are rather rare.
🙏
Wonderful video Kapil. Thanks from the University of Oslo.
🙏
Enjoyed watching your clear explanation of the re-parameterization trick. Well done!
🙏
Thanks a lot sir for your excellent explanation. It made me understand the key idea behind the reparameterization trick.
🙏
Incredible quality of teaching 👌.
🙏
This series was so informative and enjoyable. Absolutely love it! Hope to understand diffusion models much better and have some ideas about extensions.
🙏
Thank you for this series. It has really helped me understand the theoretical basis of the VAE model. I had a couple of questions:
Q1) At 21:30, is dx=d(epsilon) only because we have a linear location-scale transform or is that a general property of LOTUS?
Q2) At 9:00, how are the terms combined to give the joint distribution when the parameters of the distribution are different? We would have the log of the multiplication of the probabilities but the two thetas are different right? Sorry if this is a stupid question.
Q1) It has nothing to do with LOTUS; it is just the linear location-scale transform.
Q2) Theta here represents the parameters of the "joint distribution". Do not think of it as the log of a multiplication of probabilities; rather, think of it as a distribution of two random variables, with theta representing the parameters of that distribution.
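To make both answers concrete, here is the one-dimensional location-scale case for Q1 and the standard factorization behind Q2, written in the usual notation (the symbols may differ slightly from the slides):

```latex
% Q1: linear location-scale transform (one-dimensional case)
z = \mu + \sigma\,\epsilon, \quad \epsilon \sim \mathcal{N}(0,1)
\;\Rightarrow\; dz = \sigma\, d\epsilon, \qquad
q(z)\,dz = \tfrac{1}{\sigma}\,\mathcal{N}(\epsilon;0,1)\,\sigma\, d\epsilon
         = \mathcal{N}(\epsilon;0,1)\, d\epsilon

% Q2: the joint is the product of the two terms; theta collectively denotes
% whichever parameters appear on the right-hand side (in the standard VAE the
% prior is a fixed standard normal)
\log p_\theta(x, z) = \log p_\theta(x \mid z) + \log p(z)
```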
Can you turn on the transcript for this? Great explanation!
At 6:39, the distribution p_\theta(x|z) cannot have mean mu and stddev sigma, as the mean and std dev live in the latent space (the space of z) while x lives in the input space.
Really appreciated; I enjoyed your teaching style and great explanations! Thank you ❤️❤️
🙏
Wow~ an amazing tutorial. Thank you!
🙏
As always I am stunned by your video! May I ask with what software you produce such videos?
🙏 thanks Arash for the kind words.
I use PowerPoint primarily; for the very few advanced animations I use manim (github.com/manimCommunity/manim).
Absolutely brilliant! One issue that I have is that the Leibniz integral rule is concerned with the support of the integral being a function of the variable w.r.t. which we are trying to take the derivative. I don't see how this applies to our case in your video! Isn't the support here just constant lower and upper bound values, independent of the Phi parameter? In other words, am I wrong in saying that the support is NOT a function of Phi, and thus we should be able to move the derivative inside the integral? I would appreciate your feedback on this. Thanks!
This is where the notation creates confusion. You should think of phi as a function (a neural network in this case) that you are learning/discovering.
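One way to make this concrete (standard notation, not necessarily the exact symbols on the slide): the dependence on phi sits inside the density itself, not in the limits of integration.

```latex
% The phi-dependence is in the density q_phi, not in the integration limits
\nabla_\phi \, \mathbb{E}_{q_\phi(z \mid x)}\!\big[f(z)\big]
  = \nabla_\phi \int q_\phi(z \mid x)\, f(z)\, dz
  = \int f(z)\, \nabla_\phi \, q_\phi(z \mid x)\, dz
```

So even when the exchange is valid, the last integral is no longer an expectation under q_phi(z|x) and cannot be estimated by simply sampling z from it; the reparameterization trick instead rewrites z = g_phi(x, epsilon) with epsilon drawn from a phi-free base distribution, so the gradient falls on the deterministic transform.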
Thank you for a detailed explanation. I had one question though: I am not able to understand why we cannot take the derivative with respect to theta inside the integral when the integral is over x (at 20:52). Could you please help me get some insight into this?
Hello Ruby, thanks for your comment and, more importantly, for paying attention. The reason you are confused here is that I have a typo in this example: the dx in this example should have been dtheta.
Now that I look back, I am not happy with this simpler example that I tried to use before explaining it for the ELBO. Not only is there a typo, but it can create confusion. I would suggest ignoring this (so-called simpler) example and going directly to the ELBO version. Apologies!
19:09 Since the base distribution is free of our parameters, when we backprop and differentiate, we don't have to differentiate through the unit normal distribution? Is this correct?
Correct. Now this should also make you ask whether the assumption of a standard normal prior is a good one.
There are variants of the variational autoencoder in which you can also learn/estimate the parameters of the prior distribution.
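A minimal sketch of this in code (PyTorch-style, variable names illustrative): epsilon comes from a fixed standard normal with no learnable parameters, so gradients flow only through mu and sigma.

```python
import torch

def reparameterize(mu, log_var):
    """z = mu + sigma * eps, with eps ~ N(0, I).

    eps is drawn from a parameter-free standard normal, so backprop
    never differentiates through the base distribution; gradients
    flow only through mu and sigma (and hence the encoder).
    """
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + std * eps
```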
Thank you sir, clear explanation. I want to ask regarding the expression p(xi, z): is this the joint probability, or is it the likelihood under z and theta?
Joint probability.
At 6:35, isn't the output of the decoder the reconstruction of X, not μ and σ?
The output of the decoder could be either of the following:
a) a direct prediction of X (the input vector), or
b) a prediction of the mu and sigma of the distribution from which X came.
Note that the mu and sigma, if predicted by the decoder, will be those of X and not Z.
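A sketch of option (b) in PyTorch-style code (sizes and names are illustrative); option (a) corresponds to returning only a direct reconstruction, e.g. just the `mu_x` head.

```python
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    """Option (b): predict mu and sigma of the distribution X came from.

    These are parameters of the distribution over X (input space),
    not over Z (latent space).
    """
    def __init__(self, latent_dim=20, hidden_dim=256, output_dim=784):
        super().__init__()
        self.hidden = nn.Linear(latent_dim, hidden_dim)
        self.mu_x_head = nn.Linear(hidden_dim, output_dim)
        self.log_var_x_head = nn.Linear(hidden_dim, output_dim)

    def forward(self, z):
        h = torch.relu(self.hidden(z))
        return self.mu_x_head(h), self.log_var_x_head(h)
```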
18:18 Shouldn't it be the inverse of f of x instead of f of x?
I have a question. From the change-of-variable concept, we define z to be a deterministic function of a sample from the base distribution and the parameters of the target distribution. But when we apply this in the case of the ELBO, we define z to be a deterministic function of Phi, x, and epsilon, where Phi is the parameters of the encoder network, not the parameters of the target distribution p(z|x). Would this not create an inconsistency in the application?
ELBO (the loss function) is used during the "training" of the neural network. During training you are learning the parameters of the encoder (and decoder) networks. Once the networks are trained, q(z|x) will be an approximation of p(z|x).
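Written out in the usual VAE notation (symbols may differ slightly from the slides), there is no inconsistency: the parameters of the approximate posterior are themselves outputs of the phi-parameterized encoder, so phi reaches z only through them.

```latex
% mu_phi(x) and sigma_phi(x) denote the encoder's outputs for input x
z = g_\phi(x, \epsilon) = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)
```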
8:48 How can the terms be combined if one follows the conventional syntax (sigma denoting the parameters of the density function) and the other the non-conventional syntax (sigma denoting the parameters of the decoder, leading to estimates of the parameters of the density function)? In essence, the sigmas they are referencing are not the same.
I assume that when you mentioned "sigma" you meant "theta".
This is yet another example of abuse of notation, so your confusion is normal. Even though I say that theta is the parameters of the decoder network, in this situation think of the network as having predicted mu and sigma (watch the VAE tutorial); in the symbolic expression, when combining the two terms, we are considering theta to be the set of mu and sigma.
@KapilSachdeva Thanks for clearing that up, and yes, I meant theta. I always mix them up.
😊
3:00 Shouldn't it be the "negative reconstruction error" instead?
Since in optimization we minimize, we minimize the negative ELBO, which will result in the negative reconstruction error.
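To see the sign convention in code, here is a common negative-ELBO loss for a VAE with a Bernoulli decoder and a standard normal prior (a typical setup, not necessarily the exact one in the video): the first term of the minimized loss is the reconstruction error, i.e. the negative of the log-likelihood term inside the ELBO.

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, log_var):
    """Loss = -ELBO = reconstruction error + KL(q_phi(z|x) || N(0, I)).

    binary_cross_entropy gives a one-sample estimate of -E_q[log p_theta(x|z)]
    for a Bernoulli decoder; the KL term has the well-known closed form for a
    diagonal Gaussian posterior and a standard normal prior.
    """
    recon_error = F.binary_cross_entropy(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_error + kl
```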
perfect
Thank you
Incredibly clear, and thank you so much for these videos. Looking forward to more...
🙏
It predicts the parameters of the latent variable's distribution.
Correct. 🙏
Thank you sir :)
It learns the parameters, right?
Is there a difference between VAE and GAN?
They are two different architectures with some shared goals.
VAEs were primarily designed to do efficient inference in latent variable models (see the previous tutorial for more details on this point), but they can also be used as generative models.
A GAN is a generative architecture whose training regime (loss function, setup, etc.) is very different from a VAE's. For a long time GANs produced much better images, but VAEs have now caught up in the quality of generated images.
Both architectures are somewhat difficult to train, though VAEs are relatively easier.
Hope this sheds some light.
@KapilSachdeva Thank you very much.
If there is a possibility of making a tutorial about GANs, it would be very much appreciated.
Thanks again
🙏
GEM
🙏