How diffusion models work - explanation and code!

  • Published 15 Jun 2024
  • A gentle introduction to diffusion models that skips the math derivations and instead focuses on the concepts behind diffusion models as described in the DDPM paper.
    Full code and PDF slides available at: github.com/hkproj/pytorch-ddpm
    Chapters
    00:00 - Introduction
    00:46 - Generative models
    03:51 - Latent space
    07:35 - Forward and reverse process
    09:00 - Mathematical definitions
    13:00 - Training loop
    15:05 - Sampling loop
    16:36 - U-Net
    18:31 - Training code
    19:28 - Sampling code
    20:34 - Full code
  • Science & Technology

COMMENTS • 21

  • @umarjamilai
    @umarjamilai  11 months ago +3

    Full code and PDF slides available at: github.com/hkproj/pytorch-ddpm

    • @christopherhornle4513
      @christopherhornle4513 10 months ago +1

      Can you explain how the (text) guidance with CLIP works? I can't find any information other than that CLIP is used to influence the UNet during training through attention layers (also indicated by the "famous" LDM figure). However, it is not mentioned how the CLIP embeddings are aligned with the latents used by the UNet or VAE. I suppose CLIP must be involved in the training process somehow? Otherwise, how could the embeddings be compatible?

    • @umarjamilai
      @umarjamilai  10 months ago

      @@christopherhornle4513 Hi Christopher! I'm preparing a video on how to code Stable Diffusion from scratch, without using any external library except PyTorch. I'll explain how the UNet works and how CLIP works (with Classifier and Classifier-Free Guidance). I'll also explain advanced topics like score-based generative models and k-diffusion. The math is very hard, but I'll try to explain the concepts behind the maths rather than the proofs, so that people with little or no maths background can understand what's going on even if they don't understand every detail. Since time is limited and the topic is vast, it will take me some more time before the video is ready. Please stay tuned!

    • @christopherhornle4513
      @christopherhornle4513 10 months ago

      @@umarjamilai That sounds awesome, thank you! I pretty much know how CLIP and the UNet work independently of each other, and cross-attention is also clear. I am just wondering how the text embeddings are compatible with the UNet if they come from a separate model (CLIP). I guess the UNet is trained by feeding in the CLIP text embeddings via attention to reproduce CLIP images (with a frozen VAE). It's just strange that it's not mentioned in the places I looked.

    • @umarjamilai
      @umarjamilai  10 months ago +1

      @@christopherhornle4513 Let me simplify it for you: the UNet is a model trained to predict the noise that was added to an image at a particular step of a time schedule, so given X + Noise and the time step T, the UNet has to predict the noise (equivalently, recover X). During training we provide not only X + Noise and T, but also the CLIP embeddings of the caption associated with the image, so the UNet receives X (image) + Noise, T (time step) and CLIP_EMBEDDINGS (embeddings of the image's caption). When T = 1000, the image is completely noisy, following the Normal distribution.

      When you sample (generate an image), you start from pure noise (T = 1000). Since it is pure noise, the model could output any image when denoising, because the noise doesn't correspond to any particular image. To "guide" the denoising process, the UNet needs some "help", and that guidance is your prompt. Since CLIP's embeddings roughly act as a language model, if an image was trained with the caption "red car with man driving" and you use the prompt "red car with woman driving", CLIP's embeddings will tell the UNet how to denoise the image so as to produce something close to your prompt.

      Summarizing: the UNet and CLIP are connected because CLIP's embeddings (obtained by encoding the caption of the image being trained on) are used when training the UNet (they are passed to each layer of the UNet), and CLIP's embeddings of your prompt are used as input to the UNet to help it denoise when generating an image. I hope this clarifies the process. In my next video, which hopefully will come within two weeks, I'll explain everything in detail.
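
A minimal PyTorch sketch of the training step described in the reply above: noise an image at a random timestep, pass the noisy image, the timestep and the caption embeddings to the UNet, and regress the predicted noise against the true noise. `unet` and `text_encoder` are hypothetical placeholders for a text-conditioned model; this illustrates the idea and is not code from the linked github.com/hkproj/pytorch-ddpm repository.

```python
import torch
import torch.nn.functional as F

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule (DDPM)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

def train_step(unet, text_encoder, optimizer, images, captions):
    """One gradient step: predict the noise added at a random timestep t,
    conditioned on the embeddings of the caption."""
    b = images.shape[0]
    t = torch.randint(0, T, (b,), device=images.device)   # random timesteps
    noise = torch.randn_like(images)                       # eps ~ N(0, I)

    # Forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    a_bar = alphas_cumprod.to(images.device)[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * images + (1.0 - a_bar).sqrt() * noise

    # Caption embeddings from a frozen text encoder (e.g. CLIP)
    with torch.no_grad():
        context = text_encoder(captions)

    # The UNet receives the noisy image, the timestep and the text
    # embeddings (fed to its attention layers) and predicts the noise.
    noise_pred = unet(x_t, t, context)
    loss = F.mse_loss(noise_pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```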

    • @christopherhornle4513
      @christopherhornle4513 10 months ago

      Thank you very much!! Now I understand: during training the UNet learns to predict the noise given the text embedding (plus the timestep and any other conditioning, if provided). So it learns which (text) embeddings are associated with specific image features and with the noise predictions for those images. During sampling we start from noise (no encoded image) and provide an embedding; the model uses it as guidance to denoise toward the features it has learned to associate with that embedding.
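
The matching sampling loop, as a sketch of the summary above: start from pure Gaussian noise and apply the DDPM reverse step for T iterations, feeding the prompt embeddings to the UNet at every step so denoising is guided toward the prompt. `unet` and `text_encoder` are the same placeholders as in the training sketch; the noise schedule is redefined here so the snippet stands alone.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(unet, text_encoder, prompt, shape=(1, 3, 64, 64)):
    context = text_encoder([prompt])       # embeddings of the prompt
    x = torch.randn(shape)                 # x_T: pure Gaussian noise

    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = unet(x, t_batch, context)    # predicted noise at step t

        # DDPM reverse step: estimate the slightly less noisy x_{t-1}
        coef = (1.0 - alphas[t]) / (1.0 - alphas_cumprod[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:
            # re-inject a little noise at every step except the last
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                               # approximation of x_0
```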

  • @oiooio7879
    @oiooio7879 11 months ago +1

    This is a great conceptual breakdown of diffusion models, thank you!

  • @swapankumarsarkar1737
    @swapankumarsarkar1737 6 days ago

    Dear Sir, please make a video with a detailed explanation of the diffusion model code. It would be helpful. Thanks for understanding, and for the valuable video.

  • @thelookerful
    @thelookerful 3 months ago +1

    So cool! Thank you for your explanations

  • @pikachu123454
    @pikachu123454 9 months ago +4

    I would love a video of you breaking down the math :)

    • @umarjamilai
      @umarjamilai  9 months ago +2

      Hi! A new video is coming soon :) stay tuned!

  • @mokira3d48
    @mokira3d48 11 months ago

    If you could implement an example in the next tutorial, like you did for the transformer, it would be great! 😊

    • @umarjamilai
      @umarjamilai  11 months ago +2

      You can start by browsing the code I've shared; it's fully working code to train a diffusion model. I'll try to make a video explaining each line of code as well.

    • @mokira3d48
      @mokira3d48 11 months ago

      @@umarjamilai Okay, thanks!

  • @shajidmughal3386
    @shajidmughal3386 16 days ago

    I came here from your VAE video. After that, should I do the 5-hour-long Stable Diffusion video or this one? What do you suggest?

    • @jerrylin2790
      @jerrylin2790 11 days ago

      I watched the 5-hour one first and then came to this. Now I would say I know how to train the model, thanks to Umar.

  • @user-us5eo6ev2x
    @user-us5eo6ev2x 2 months ago

    Can you do the code for inpainting with a diffusion model, please?

  • @ramprasath6424
    @ramprasath6424 11 months ago

    Can you do a BERT coding video?

    • @umarjamilai
      @umarjamilai  11 months ago

      Thanks for the suggestion, I'll try my best

    • @umarjamilai
      @umarjamilai  7 months ago +6

      Hi! My new video on BERT is out: ua-cam.com/video/90mGPxR2GgY/v-deo.html