*Github Code* - github.com/explainingai-code/DDPM-Pytorch
*DDPM Math Explanation Video* - ua-cam.com/video/H45lF4sUgiE/v-deo.html
Thank you! It was amazing. While there is limited content available for diffusion models, you did a really nice job. ❤
Thank you for your kind words :)
I am very thankful for your nice video; it's the best explanation of the diffusion model I have seen!
Thank you so much for your encouraging words!
Nicely explained! Keep the good work going! 😁
This is incredible
Hi, amazing explanation! Thanks for all the effort you put into making the video.
Can you please share the details of the UNet model that you've used (maybe a link to a paper/blog)? Thank you!
Thank you for the appreciation! For the UNet model, I just mimicked the architecture from the huggingface Unet2DModel class in the diffusers library (huggingface.co/docs/diffusers/en/api/models/unet2d) with minor changes (at what point concatenation and upsampling happen in the upblock). The diffusers Unet2DModel class (which itself is based on the UNet paper, arxiv.org/abs/1505.04597) and this comment thread (ua-cam.com/video/vu6eKteJWew/v-deo.html&lc=UgzBFfe4anyDf4txEZx4AaABAg) should give you all the necessary information regarding the UNet model. Do let me know if that ends up not being the case.
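For anyone who wants a quick reference, here is a minimal sketch of instantiating that diffusers Unet2DModel; the channel counts and block types below are illustrative assumptions, not the exact configuration used in the video's code.

import torch
from diffusers import UNet2DModel

# Illustrative configuration only -- swap in the channels/blocks you actually need.
model = UNet2DModel(
    sample_size=28,                      # spatial size of the square input images
    in_channels=1,                       # e.g. greyscale MNIST
    out_channels=1,                      # predicted noise has the same shape as the input
    layers_per_block=2,
    block_out_channels=(32, 64, 128, 256),
    down_block_types=("DownBlock2D", "DownBlock2D", "DownBlock2D", "AttnDownBlock2D"),
    up_block_types=("AttnUpBlock2D", "UpBlock2D", "UpBlock2D", "UpBlock2D"),
)

x_t = torch.randn(4, 1, 28, 28)          # batch of noisy images
t = torch.randint(0, 1000, (4,))         # diffusion timesteps
noise_pred = model(x_t, t).sample        # predicted noise, same shape as x_t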
Great video, thank you.
When passing the transposed feature maps into the ConvBlocks, do you deliberately skip adding positional encodings to them (like ViT does, for example), or did it just turn out that way?
I didn’t intentionally skip position embeddings. The reason they are not included is that this code mimics the official implementation provided by the authors of Latent Diffusion, where positional embeddings are skipped.
I’m not entirely sure why the authors made that choice, but I’ve discussed this in a bit more detail in the issue linked below.
If you're interested, then do check it out:
github.com/explainingai-code/DDPM-Pytorch/issues/4
Amazing.
How does self attention work in convnets (instead of transformers)? 😊
Very well explained. What changes would we need to make if we used our own dataset? Specifically greyscale images.
Thank you. Have replied on github regarding this.
Yeah, it was me @@Explaining-AI
Thank you so much, sir.
Glad you found it helpful
Thank you so much!
Hi there, thanks for the video. May I ask a question? To my understanding, multi-headed attention first applies 3 feed-forward networks for key, query, and value. In this model you applied multi-headed attention on the image where the channels act as the sequence length and the flattened image acts as the token_length. That should mean the query network, for example, would be a Linear(token_length/4, token_length/4), which means its parameter count would be (token_length*token_length)/16 = ((h*w)**2)/16, which is huge. Or am I wrong?
Thank you! @binyaminramati3010
So the channel dimension here is the embedding dimension and H*W is the sequence length.
If you notice, before attention we do a transpose; this is to make the channel dimension the embedding dimension.
Assume the feature map is 128x7x7 (CxHxW), and let's assume we only have one head.
That means we have a sequence of 49 tokens (feature map cells), each of 128 dimensions.
The Q/K/V weight matrices will each be 128x128.
The attention weights (QK^T) will be 49x49.
The weighted values will be 49x128.
So no huge computation is required as such, right? Or am I not understanding your question correctly?
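To make those shapes concrete, here is a minimal sketch of the transpose-then-attend pattern described above, using torch.nn.MultiheadAttention with a single head (an illustration, not the repo's exact code).

import torch
import torch.nn as nn

b, c, h, w = 2, 128, 7, 7                 # feature map: batch x channels x height x width
feat = torch.randn(b, c, h, w)

# Flatten the spatial dims and transpose so that C (128) becomes the embedding
# dimension and H*W (49) becomes the sequence length.
seq = feat.flatten(2).transpose(1, 2)     # (b, 49, 128)

attn = nn.MultiheadAttention(embed_dim=c, num_heads=1, batch_first=True)
out, attn_weights = attn(seq, seq, seq)   # self-attention over the 49 "pixel tokens"

print(out.shape)                          # torch.Size([2, 49, 128])  -> weighted values
print(attn_weights.shape)                 # torch.Size([2, 49, 49])   -> QK^T attention weights

# Reshape back to the original feature-map layout.
out = out.transpose(1, 2).reshape(b, c, h, w)

The Q, K, and V projections inside nn.MultiheadAttention are each 128x128 here, so the parameter count depends on the channel count, not on (H*W)^2.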
@@Explaining-AI Thank you, I missed the transpose. And again, applause for the impressive content 👏
Amazing explanation. But I have a question: I want to train on my custom RGB data with shape 128x128 or 256x256, but I always get out-of-memory errors, even though the model only has about 10M parameters. Can you help with that?
Moreover, I set the batch size in the config to 1, and I trained on a T4 GPU.
@@colder4163 It's most likely because of the image size. Can you try with 64x64? I have responded on what changes need to be made for this here: github.com/explainingai-code/DDPM-Pytorch/issues/1#issuecomment-2236651773
@@Explaining-AI Oh I see, thank you so much.
@@Explaining-AI If I want to restore blurred images, like motion blur or exposure blur, what should I do? Could you give me some advice?
Thanks for the very informative video! I am having trouble using my own dataset in this. I'm doing this on a MacBook in Google Colab. Currently, I have mounted my Drive to the Colab notebook and pulled in my dataset from my Drive through the default.yaml. However, I am getting an error saying that num_samples should be positive, and not 0. I am not sure what you mean by "Put the image files in a folder created within the repo root (example: data/images/*.png ).". What is this repo root and where can I find it? Is it local on my computer? Could you help with this? Thank you in advance!
You are welcome! The path in the config can either be a relative path from the "DDPM-Pytorch" directory (the repo root) or an absolute path. Currently the config assumes that inside the DDPM-Pytorch directory there is a data/images folder which holds all the image files.
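A quick way to sanity-check that the path resolves the way you expect (and that the "num_samples should be positive" error is just an empty or wrongly-pointed folder) is to count the files yourself. The 'data/images' path below is just an example; use whatever path you put in default.yaml.

import glob
import os

im_path = 'data/images'                  # the path from the config, relative to the DDPM-Pytorch directory
print(os.path.abspath(im_path))          # confirm this points where you think it does
files = glob.glob(os.path.join(im_path, '*.png'))
print(len(files))                        # if this prints 0, the dataset is empty and num_samples will be 0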
Unbelievable!
Thank you so much for the video. It was amazing, and your video explained many things that I couldn't understand anywhere else. Though I have a question regarding the up channels. You have given down channels as [32, 64, 128, 256]. As per your code, the channels for the first upsample will be (256, 64), but after concatenating from the last down layer, the number of channels for the first convolution of the resnet layer should be 128 + 256 = 384, whereas as per your code it is 256. The same thing will happen for each upblock: in the second case 128 + 64 should be the in channels, but as per your code it is 128, and the third upsample layer should have in channels 64 + 32 = 96, but as per your code it is 64. I think there is a little miscalculation.
Hello, according to the code, the first down-layer feature to be concatenated is not from the last down layer but from the second-last down layer. It's a bit easier to explain with a diagram, so can you take a look at the text below representing what's happening and let me know if you still have any issues.
Downblocks                              Upblocks
 32 --------------------------------->  64 -> 16
  | down                                 | upsample (& concat)
 64 ------------------------> 128 -> 32
  | down                        | upsample (& concat)
128 --------------> 256 -> 64
  | down              | upsample (& concat)
256 ----- 256 ----- 128
@Explaining-AI Sorry, my mistake. I got it. You are saving the feature tensors before passing them through the down block, hence the math works out if we consider that. But don't we normally concatenate the feature tensor obtained after passing through the downblock? In my brief experience with UNets I have usually seen that. That's why I thought there was a mistake.
@@takihasan8310 Yes, you are right. That way is indeed closer to the "official UNet" implementation. After spending a limited amount of time on this, I found this way enabled me to write simpler code, so I went with it. And as long as the network has layers of downsampling followed by layers of upsampling, together with concatenation of downblock feature maps, I would say it still qualifies as a UNet per se. But yes, definitely not the official paper's UNet implementation.
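For readers following the thread, here is a minimal sketch of the pattern being discussed: the feature map is saved before each downblock, and each upblock concatenates its upsampled input with the corresponding saved feature. The conv_block helper and the pooling/upsampling layers are hypothetical stand-ins, not the repo's actual classes.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Hypothetical stand-in for the repo's resnet/attention blocks.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.SiLU())

down_channels = [32, 64, 128, 256]
downs = nn.ModuleList([conv_block(down_channels[i], down_channels[i + 1]) for i in range(3)])
downsample = nn.AvgPool2d(2)
mid = conv_block(256, 128)
ups = nn.ModuleList([conv_block(256, 64), conv_block(128, 32), conv_block(64, 16)])
upsample = nn.Upsample(scale_factor=2)

x = torch.randn(1, 32, 32, 32)
saved = []
for down in downs:
    saved.append(x)                         # feature saved BEFORE the downblock
    x = downsample(down(x))
x = mid(x)                                  # 256 -> 128 channels
for up in ups:
    skip = saved.pop()                      # 128-, then 64-, then 32-channel skip
    x = upsample(x)
    x = up(torch.cat([x, skip], dim=1))     # in-channels: 128+128=256, 64+64=128, 32+32=64
print(x.shape)                              # torch.Size([1, 16, 32, 32])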
I am getting a CUDA out-of-memory error when using my own dataset. The dataset consists of .npy files.
Hello, if you have already tried reducing the batch size and are still getting this error, could you take a look at github.com/explainingai-code/DDPM-Pytorch/issues/1, specifically this comment: github.com/explainingai-code/DDPM-Pytorch/issues/1#issuecomment-1862244458, and see if that helps get rid of the out-of-memory error.
@Explaining-AI
Sorry to bother you, but I don't know why, whenever I am training on any dataset (I tried MNIST, CIFAR-10, etc.), the MSE loss is always NaN. Is this expected? I checked my transformation and it is correct: first transforms.ToTensor(), then transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]). All the losses are NaN values; will the model learn anything meaningful?
Were you able to get rid of this issue? Is it possible for you to send me a link to your repo, in case you have changed any part of the code or the training parameters?
Hi Sir, I would like to request that you kindly make a change in the Stable Diffusion model repository regarding the size of the images, because this repository does not support large image sizes and requires very high GPU memory; for 256x256 images it requires almost 200 GB, which is very costly. Also, if possible, include a few evaluation metrics for quantitative analysis between the original and the generated images. Waiting for the next video!
Hi @muhammadawais2173, I will next start working on the Stable Diffusion video, but unfortunately it will take me a month to get it up with code and video. Sorry, but it's going to take that long given my other work. In case you are really blocked because of this, might I suggest using the Hugging Face diffusers library? They will anyway have a much more efficient implementation than mine :)
@@Explaining-AI Thank you so much. I will go through it. In fact, I already went through many diffusion model implementations, but you explained it very well and in the easiest way; also, your model gives satisfactory results compared to others.
Amazing.