As a ComfyUI user (a tool working with Stable Diffusion), I've always been curious about how the Stable Diffusion Model creates images. After reading many articles and watching countless UA-cam videos that were either too academic or too superficial, this is the only video that really satisfied my curiosity. Thank you so much for making such a valuable video. Wishing your channel continued growth and looking forward to more great content like this!
You brought some serious points to light that I couldn’t previously see, thank you!
Great Intuitive Explanation!
Awesome video Mate❤
Looking forward to the next one.
Thanks a lot!
Good explanation of Diffusion Model.
I really love your channel, Keep up the good work
Thanks a lot!!
Great video!
great content with so many deep concepts..
Best of the best, very clear, thank you.
You know sir, I found your channel purely by accident. But thank god I found it. Whatever you are teaching us is absolute gold.
But there is this one thing that I gotta ask. I am really curious to know about your background. You never shared your LinkedIn profile with us. How come you know your stuff so deeply?
Finally sir, you are awesome. Have a nice day.
Thanks a lot… glad you are enjoying the channel. I don’t share details about my background for privacy reasons, and also because they aren’t pertinent to the topics of my videos. Hope you can respect that. Thanks again for supporting the channel!
@@avb_fj Yeah, I understand. No problem sir. But seeing your deep knowledge of your subjects, it's quite natural for one to wonder: who the hell is this guy?
But that's ok. Whoever you are, you are so cool. Thanks for replying.
This is a very informative video, thank you so much! Please explain how to code a Rectified Flow neural network next 🙏🙏 And how it differs from Stable Diffusion 🤔
Thank you for your great work! Where can I get the dataset for the conditional generative model?
You can search for CelebA dataset on Kaggle.
www.kaggle.com/datasets/jessicali9530/celeba-dataset/data
@@avb_fj Thank you, sir!
@@avb_fj Sir, can I have the whole code from the beginning till the end?
@@tanvikumari5406 The code and a full walkthrough is available on our Patreon page.
www.patreon.com/NeuralBreakdownwithAVB
Great Job! Especially considering that these models are not easy to train.
I also never considered training CelebA with text conditioning, which seemed to produce good results given the training time.
A critique: you made a mistake when describing CLIP and cross-attention. CLIP uses a transformer image encoder and a transformer text encoder that are jointly trained - it may be possible to use a frozen VAE for the image encoder, but that would probably constrain the latent space and prevent strong semantic alignment.
For cross-attention, K and V come from CLIP whereas Q comes from the image tokens (you reversed them on your slide). Flipping them would also likely work, but then the cross-attention would be modulating existing image features rather than introducing new features based on the conditioning.
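Since the thread settles on Q coming from the image tokens and K/V from the text embeddings, here is a minimal NumPy sketch of that cross-attention pattern. All names, shapes, and the random weights are illustrative assumptions, not code from the video:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_tokens, text_tokens, Wq, Wk, Wv):
    """Q is projected from the image (U-Net) tokens; K and V are
    projected from the text embeddings, per the LDM convention."""
    Q = image_tokens @ Wq                      # (n_img, d)
    K = text_tokens @ Wk                       # (n_txt, d)
    V = text_tokens @ Wv                       # (n_txt, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (n_img, n_txt) similarities
    return softmax(scores, axis=-1) @ V        # one updated vector per image token

rng = np.random.default_rng(0)
d = 8
img = rng.normal(size=(16, d))   # 16 image tokens
txt = rng.normal(size=(5, d))    # 5 text tokens
out = cross_attention(img, txt, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (16, 8)
```

Note the output keeps the image-token count: each image token gets a text-conditioned update, which is why flipping Q and K/V changes what the layer is doing.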
Thanks for pointing that out… here are some details that weren't in the video related to CLIP training. The original CLIP model does train a ViT for the image encoder, but I didn't want to train a ViT, so I just used the VAE that I already had as my image encoder, treating each latent channel as its own image token. For the text encoder, I just fine-tuned the last 2 layers of one of the smaller BERT models. Some of the details and the reasoning didn't make it into the video. Basically I didn't intend to replicate the paper, just the basic joint-embedding idea… plus, my CLIP model was only going to be used for the LDM on CelebA, so it doesn't have to be as general and foundational as the original, so I cut a bunch of corners on that module. That said, I should've mentioned how the paper does it in the video for clarity!
Edit: Also yeah, looking at the paper and the diffusers source code - the text embeddings are indeed used for K, V and the image embeddings for Q. That's my bad for not being accurate in the video.
So thanks once again for your comment!
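For readers following this exchange, the "basic joint embedding idea" both commenters refer to is CLIP's symmetric contrastive loss: matched (image, text) pairs in a batch should score higher than all mismatches. A minimal NumPy sketch under assumed shapes (not the channel's actual implementation):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (image, text) pairs:
    the i-th image's positive is the i-th text, and vice versa."""
    # L2-normalize so the dot product is a cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix
    labels = np.arange(len(logits))              # targets lie on the diagonal

    def ce(l):  # cross-entropy with the diagonal as targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->text and text->image directions
    return (ce(logits) + ce(logits.T)) / 2

rng = np.random.default_rng(0)
loss = clip_contrastive_loss(rng.normal(size=(4, 32)), rng.normal(size=(4, 32)))
```

This is why only the encoders' output spaces need to align: any image encoder (a ViT, or a repurposed VAE as described above) and any text encoder can be plugged in, as long as gradients flow through this loss.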
@@avb_fj I didn't realize that you also trained a CLIP model. That makes this result even more impressive! If you were going that far though, you probably could have trained a small BERT model from scratch too.
And yes, training a ViT from scratch would have taken significantly longer - they're known for slow convergence (probably slower with CLIP, since both the ViT and text encoder are randomly initialized and only receive gradient updates through the contrastive loss).
Could you speak to the GPU you used to train, and roughly how long it took? Your approach might make for a good university term project for a DL course.
@@hjups Yeah, training the BERT model was where I drew the line haha. Just used the classic distilbert-base-uncased model, froze all the layers but the last couple, and slapped a linear layer at the end to map it to my desired output shape.
I trained on my MacBook Pro's M2 chip.
Good video! I'm very impressed with your results. But, one thing I'm confused about is, during sampling, there was a +σ_t * z term. I assume z is noise, but what is the sigma term? What defines how much extra noise to add each sampling step?
Yes, the sigma term controls the variance of the Gaussian noise added back to the image at each step. It is generally set as a function of beta, e.g. σ_t = sqrt(β_t). Check out Section 3.2 (the reverse process) in the paper: arxiv.org/pdf/2006.11239
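To make the reply above concrete, here is a minimal NumPy sketch of one DDPM reverse step with σ_t = sqrt(β_t), following the standard update from the paper (an illustration, not the channel's actual training code; the noise schedule and shapes are assumed):

```python
import numpy as np

def ddpm_reverse_step(x_t, eps_pred, t, betas, rng):
    """One reverse-diffusion step:
    x_{t-1} = (x_t - beta_t/sqrt(1-alphabar_t) * eps) / sqrt(alpha_t) + sigma_t * z,
    with sigma_t = sqrt(beta_t)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)               # cumulative product of alphas
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_pred) / np.sqrt(alphas[t])
    # z is fresh Gaussian noise; at t == 0 no noise is added
    z = rng.normal(size=x_t.shape) if t > 0 else np.zeros_like(x_t)
    sigma_t = np.sqrt(betas[t])
    return mean + sigma_t * z

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)            # common linear schedule
x = rng.normal(size=(3, 4))                      # stand-in for a noisy latent
eps = rng.normal(size=(3, 4))                    # stand-in for the model's noise prediction
x_prev = ddpm_reverse_step(x, eps, t=500, betas=betas, rng=rng)
```

The σ_t * z term is what makes sampling stochastic: the model predicts the mean of the reverse transition, and σ_t controls the spread around it.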
@@avb_fj Gotcha, thanks. By the way, why is the alpha term always encoded as an Embedding layer instead of being encoded as a linear layer of dim (1, 1024) (or whatever the feature dimension is), where the input into the linear layer is the noise level (from 0.0 to 1.0)?
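The question above contrasts two ways of feeding the timestep (or noise level) into the network. A quick NumPy sketch of both, with made-up sizes; a lookup table gives every discrete timestep an independent learned vector, while a linear layer on a scalar noise level constrains all timesteps to a line in feature space (in practice, sinusoidal embeddings are also a common middle ground):

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim = 1000, 32     # number of timesteps, feature dimension

# Option 1: learned lookup table - one independent row per discrete timestep t
embedding_table = rng.normal(size=(T, dim))
t = 250
emb_lookup = embedding_table[t]                  # (dim,) - arbitrary per-t vector

# Option 2: linear layer on the continuous noise level in [0, 1]
W, b = rng.normal(size=(1, dim)), np.zeros(dim)
noise_level = t / T
emb_linear = np.array([[noise_level]]) @ W + b   # (1, dim) - scaled copy of one row
```

With Option 2, every timestep's embedding is the same direction W scaled by the noise level, which is far less expressive than the table, though it does generalize to noise levels never seen in training.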
Great video...
If possible, please provide the code as well...
I’ll be uploading the code on Patreon and channel members later this week…
@@avb_fj Do you also include any easy to use tools for creating models? This seems like a fun project I can sink my time into.
@@kalebbroo2 The Patreon post includes a GitHub link and an additional code-walkthrough video. The GitHub repo has the entire codebase I used to train the models in the video. Mostly Python files.
dataset please
You can search for CelebA dataset on Kaggle.
www.kaggle.com/datasets/jessicali9530/celeba-dataset/data
The AI mind, the all-mind... hint hint, this is just the beginning.