How diffusion models work - explanation and code!

  • Published 15 Jun 2024
  • A gentle introduction to diffusion models that skips the math derivations and instead focuses on the concepts behind diffusion models as described in the DDPM paper.
    Full code and PDF slides available at: github.com/hkproj/pytorch-ddpm
    Chapters
    00:00 - Introduction
    00:46 - Generative models
    03:51 - Latent space
    07:35 - Forward and reverse process
    09:00 - Mathematical definitions
    13:00 - Training loop
    15:05 - Sampling loop
    16:36 - U-Net
    18:31 - Training code
    19:28 - Sampling code
    20:34 - Full code
  • Science & Technology

COMMENTS • 21

  • @umarjamilai
    @umarjamilai  11 months ago +3

    Full code and PDF slides available at: github.com/hkproj/pytorch-ddpm

    • @christopherhornle4513
      @christopherhornle4513 10 months ago +1

      Can you explain how the (text) guidance with CLIP works? I can't find any information other than that CLIP is used to influence the UNet during training through attention layers (also indicated by the "famous" LDM figure). However, it is not mentioned how the CLIP embeddings are aligned with the latents used by the UNet or VAE. I suppose CLIP must be involved in the training process somehow? Otherwise, how could the embeddings be compatible?

    • @umarjamilai
      @umarjamilai  10 months ago

      @@christopherhornle4513 Hi Christopher! I'm preparing a video on how to code Stable Diffusion from scratch, without using any external library except PyTorch. I'll explain how the UNet works and how CLIP works (with Classifier and Classifier-Free Guidance). I'll also explain advanced topics like score-based generative models and k-diffusion. The math is very hard, but I'll try to explain the concepts behind the maths rather than the proofs, so that people with little or no maths background can understand what's going on even if they don't understand every detail. Since time is limited and the topic is vast, it will take me some more time before the video is ready. Please stay tuned!

    • @christopherhornle4513
      @christopherhornle4513 10 months ago

      @@umarjamilai That sounds awesome, thank you! I pretty much know how CLIP and the UNet work independently of each other, and cross-attention is also clear. I am just wondering how the text embeddings are compatible with the UNet if they come from a separate model (CLIP). I guess the UNet is trained by feeding in the CLIP text embeddings via attention to reproduce CLIP images (with a frozen VAE). It's just strange that it's not mentioned in the places I looked.

    • @umarjamilai
      @umarjamilai  10 months ago +1

      @@christopherhornle4513 Let me simplify it for you: the UNet is a model trained to predict the noise that was added to an image at a particular step of a time schedule, so given X + Noise and the time step T, the UNet has to predict the noise (equivalently, recover X). During training we provide not only X + Noise and T, but also the CLIP embeddings of the caption associated with the image, so the UNet receives X (image) + Noise, T (time step) and CLIP_EMBEDDINGS (embeddings of the image's caption). When T = 1000, the image is completely noisy, following the Normal distribution.

      When you sample (generate an image), you start from pure noise (T = 1000). Since it is pure noise, the model could output any image when denoising, because the noise doesn't correspond to any particular image. To "guide" the denoising process, the UNet needs some "help", and that guidance is your prompt. Since CLIP's embeddings roughly act as a language model, if an image was trained with the caption "red car with man driving" and you use the prompt "red car with woman driving", CLIP's embeddings will tell the UNet how to denoise the image so as to produce something close to your prompt.

      Summarizing: the UNet and CLIP are connected because CLIP's embeddings (obtained by encoding the caption of the image being trained on) are used when training the UNet (they are passed to each layer of the UNet), and CLIP's embeddings of your prompt are used as input to the UNet to help it denoise when generating an image. I hope this clarifies the process. In my next video, which hopefully will come within two weeks, I'll explain everything in detail.
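
A minimal PyTorch sketch of the training step described in the reply above: noise an image at a random timestep, pass the noisy image, the timestep and the caption embeddings to the UNet, and regress the predicted noise against the true noise. `unet` and `text_encoder` are hypothetical placeholders for a text-conditioned model; this illustrates the idea and is not code from the linked github.com/hkproj/pytorch-ddpm repository.

```python
import torch
import torch.nn.functional as F

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # linear noise schedule (DDPM)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

def train_step(unet, text_encoder, optimizer, images, captions):
    """One gradient step: predict the noise added at a random timestep t,
    conditioned on the embeddings of the caption."""
    b = images.shape[0]
    t = torch.randint(0, T, (b,), device=images.device)   # random timesteps
    noise = torch.randn_like(images)                       # eps ~ N(0, I)

    # Forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
    a_bar = alphas_cumprod.to(images.device)[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * images + (1.0 - a_bar).sqrt() * noise

    # Caption embeddings from a frozen text encoder (e.g. CLIP)
    with torch.no_grad():
        context = text_encoder(captions)

    # The UNet receives the noisy image, the timestep and the text
    # embeddings (fed to its attention layers) and predicts the noise.
    noise_pred = unet(x_t, t, context)
    loss = F.mse_loss(noise_pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```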

    • @christopherhornle4513
      @christopherhornle4513 10 months ago

      Thank you very much!! Now I understand: during training the UNet learns to predict the noise given the text embedding (plus the timestep and any other conditioning, if provided). So it learns which (text) embeddings are associated with specific image features and with the noise predictions for those images. During sampling we start from noise (no encoded image) and provide an embedding; the model uses it as guidance to denoise toward the features it has learned to associate with that embedding.
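
The matching sampling loop, as a sketch of the summary above: start from pure Gaussian noise and apply the DDPM reverse step for T iterations, feeding the prompt embeddings to the UNet at every step so denoising is guided toward the prompt. `unet` and `text_encoder` are the same placeholders as in the training sketch; the noise schedule is redefined here so the snippet stands alone.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(unet, text_encoder, prompt, shape=(1, 3, 64, 64)):
    context = text_encoder([prompt])       # embeddings of the prompt
    x = torch.randn(shape)                 # x_T: pure Gaussian noise

    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = unet(x, t_batch, context)    # predicted noise at step t

        # DDPM reverse step: estimate the slightly less noisy x_{t-1}
        coef = (1.0 - alphas[t]) / (1.0 - alphas_cumprod[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:
            # re-inject a little noise at every step except the last
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x                               # approximation of x_0
```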

  • @oiooio7879
    @oiooio7879 11 months ago +1

    This is a great conceptual breakdown of diffusion models, thank you!

  • @swapankumarsarkar1737
    @swapankumarsarkar1737 6 days ago

    Dear Sir, please make a video with a detailed explanation of the diffusion model code. It would be helpful. Thanks for understanding, and for the valuable video.

  • @thelookerful
    @thelookerful 3 months ago +1

    So cool! Thank you for your explanations

  • @pikachu123454
    @pikachu123454 9 months ago +4

    I would love a video of you breaking down the math :)

    • @umarjamilai
      @umarjamilai  9 months ago +2

      Hi! A new video is coming soon :) stay tuned!

  • @mokira3d48
    @mokira3d48 11 months ago

    If you could implement an example in the next tutorial, like you did for the transformer, it would be great! 😊

    • @umarjamilai
      @umarjamilai  11 months ago +2

      You can start by browsing the code I've shared; it's fully working code to train a diffusion model. I'll try to make a video explaining each line of code as well.

    • @mokira3d48
      @mokira3d48 11 months ago

      @@umarjamilai Okay, thanks!

  • @shajidmughal3386
    @shajidmughal3386 16 days ago

    I came here from your VAE video. After that, should I do the 5-hour-long Stable Diffusion video or this one? What do you suggest?

    • @jerrylin2790
      @jerrylin2790 11 days ago

      I watched the 5-hour one first and then came to this. Now I would say I know how to train the model, thanks to Umar.

  • @user-us5eo6ev2x
    @user-us5eo6ev2x 2 months ago

    Can you do the code for inpainting with a diffusion model, please?

  • @ramprasath6424
    @ramprasath6424 11 months ago

    Can you do a BERT coding video?

    • @umarjamilai
      @umarjamilai  11 months ago

      Thanks for the suggestion, I'll try my best

    • @umarjamilai
      @umarjamilai  7 months ago +6

      Hi! My new video on BERT is out: ua-cam.com/video/90mGPxR2GgY/v-deo.html