Diffusion models from scratch in PyTorch

  • Published May 17, 2024
  • ▬▬ Resources/Papers ▬▬▬▬▬▬▬
    - Colab Notebook: colab.research.google.com/dri...
    - DDPM: arxiv.org/pdf/2006.11239.pdf
    - DDPM Improved: arxiv.org/pdf/2105.05233.pdf
    - Awesome Diffusion Models Github: github.com/heejkoo/Awesome-Di...
    - Outlier Diffusion Model Video: • Diffusion Models | Pap...
    - Positional Embeddings: machinelearningmastery.com/a-...
    ▬▬ Used Icons ▬▬▬▬▬▬▬▬▬▬
    All Icons are from flaticon: www.flaticon.com/authors/freepik
    ▬▬ Used Music ▬▬▬▬▬▬▬▬▬▬▬
    Music from Uppbeat (free for Creators!):
    uppbeat.io/t/prigida
    Song: Spooky Loops
    License code: QKVNF1BODEDX33HO
    ▬▬ Timestamps ▬▬▬▬▬▬▬▬▬▬▬
    00:00 Introduction
    00:30 Generative Deep Learning
    02:58 Diffusion Models Papers / Resources
    04:06 What are diffusion models?
    05:06 How to implement them?
    05:29 [CODE] Cars Dataset
    06:50 Forward process
    10:15 Closed form sampling
    12:15 [CODE] Noise Scheduler
    16:10 Backward process (U-Net)
    19:32 Timestep Embedding
    20:52 [CODE] U-Net
    25:35 Loss
    26:28 [CODE] Loss
    28:53 Training and Results
    30:05 Final remarks
    ▬▬ Support me if you like 🌟
    ►Support me on Patreon: bit.ly/2Wed242
    ►Buy me a coffee on Ko-Fi: bit.ly/3kJYEdl
    ►Coursera: imp.i384100.net/b31QyP
    ►Link to this channel: bit.ly/3zEqL1W
    ►E-Mail: deepfindr@gmail.com
    ▬▬ My equipment 💻
    - Microphone: amzn.to/3DVqB8H
    - Microphone mount: amzn.to/3BWUcOJ
    - Monitors: amzn.to/3G2Jjgr
    - Monitor mount: amzn.to/3AWGIAY
    - Height-adjustable table: amzn.to/3aUysXC
    - Ergonomic chair: amzn.to/3phQg7r
    - PC case: amzn.to/3jdlI2Y
    - GPU: amzn.to/3AWyzwy
    - Keyboard: amzn.to/2XskWHP
    - Bluelight filter glasses: amzn.to/3pj0fK2

COMMENTS • 187

  • @kiunthmo
    @kiunthmo 1 year ago +35

    Really well explained, and a compact notebook. It's basically all written directly from Torch, very refreshing to see when so much content is heavily reliant on APIs.

  • @tanbui7569
    @tanbui7569 1 year ago +13

    Extremely fantastic implementation. I understood the whole idea of diffusion, and all the mathematical details made sense to me just from your code.

  • @MassivaRiot
    @MassivaRiot 1 year ago +9

    Absolutely phenomenal content! Love it ❤️

  • @LiquidMasti
    @LiquidMasti 1 year ago +1

    Loved the simple implementation also thanks for sharing additional articles

  • @sergiobromberg9233
    @sergiobromberg9233 1 year ago +1

    Thank you! I really liked your graphic interpretation of the beta scheduling. It's missing in many other videos about diffusion.

  • @peterthegreat7125
    @peterthegreat7125 1 year ago

    Oh my god, this explanation is SUPER CLEAR! 🤯

  • @AndrejKarpathy
    @AndrejKarpathy 1 year ago +8

    Quite good!

  • @anonymousperson9757
    @anonymousperson9757 1 year ago +1

    Thanks for this amazing video! Do you plan on extending this video to include conditional generation at some point in the future? I would love to see an implementation of the SR3/Palette models that use DDPM for image to image translation tasks such as super-resolution, JPEG restoration etc. In this case, the reverse diffusion process is conditioned on the input image.

  • @ioannisd2762
    @ioannisd2762 1 month ago

    Amazing video! Highly suggested before diving into the paper

  • @cankoban
    @cankoban 1 year ago +4

    Great effort, thank you! The simplified version of it is still complicated though :D I probably need to watch this a couple more times after reading the resources you attached.

  • @orrimoch5226
    @orrimoch5226 1 year ago +3

    Thanks! Great animation and explanations..amazing 🙏

  • @shakibyazdani9276
    @shakibyazdani9276 1 year ago +2

    what a great explanation, I will take a deeper look at the code. Thanks

  • @andreray6562
    @andreray6562 1 year ago

    Thanks for putting this together

  • @xczhou3340
    @xczhou3340 9 months ago

    Thanks for the amazing video !

  • @ShobeirKSMazinani
    @ShobeirKSMazinani 1 year ago

    What a great video! Loved it!

  • @tidianec
    @tidianec 1 year ago

    That was really clear. Thank you !

  • @sohampyne8009
    @sohampyne8009 1 year ago +1

    Really nice exposition. Can you please elaborate on the specifications of the machine it was trained on and approx how long the training took?

  • @harsh9558
    @harsh9558 5 months ago

    The explanation was awesome 🔥

  • @user-yp4ye9kf3b
    @user-yp4ye9kf3b 1 year ago

    Excellent explanation. Learned a lot from your video, thank you~

  • @senpeng6441
    @senpeng6441 1 year ago

    Really good introduction! Thanks

  • @user-fu3jx7mj2o
    @user-fu3jx7mj2o 1 year ago

    Thank you for the awesome guide :D
    Just one simple question: in the plotted image, are we looking at x0, x1, ..., x10, where x0 is the image at the very left (the denoised version) and x10 the image at the very right (the most noised)?

  • @user-wn3hb3vc1v
    @user-wn3hb3vc1v 1 year ago

    Thank you! This is the best video I've ever seen

  • @kidzheng8531
    @kidzheng8531 1 year ago +2

    Thanks for sharing this tutorial. It's so kind for beginners.

  • @usama57926
    @usama57926 1 year ago +1

    Thank you! It was helpful

  • @ViduzTube
    @ViduzTube 1 year ago +10

    Very good job! A note: I think you should add torch.clamp(image, -1.0, 1.0) after each forward_diffusion_sample() call. You can check the behavior with and without the clamp when simulating forward diffusion. Images shown without the clamp seem "not naturally noisy", as the pixel range is no longer between -1 and 1. I don't know how much this affects the final training result; it should be tried.

    • @DeepFindr
      @DeepFindr 1 year ago +2

      Yes very good point. Later I also realized this and it actually led to an improvement (on a different dataset however). :)

    • @oliverliu9248
      @oliverliu9248 1 year ago

      Thank you! When I was watching I was wondering what would happen as you add the variance and the value exceeds 1. Your answer helped me understand it.
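
    A minimal sketch of the clamping suggested in this thread, assuming the closed-form forward step from the video (the linear beta schedule and the exact function signature are assumptions, not the notebook verbatim):

        import torch

        T = 300
        betas = torch.linspace(1e-4, 0.02, T)                 # assumed linear schedule
        alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)    # alpha_bar_t

        def forward_diffusion_sample(x_0, t):
            # Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).
            noise = torch.randn_like(x_0)
            sqrt_ab = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
            sqrt_one_minus_ab = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
            x_t = sqrt_ab * x_0 + sqrt_one_minus_ab * noise
            return torch.clamp(x_t, -1.0, 1.0), noise         # clamp keeps pixels in [-1, 1]

        x_0 = torch.rand(4, 3, 64, 64) * 2 - 1                # fake batch scaled to [-1, 1]
        t = torch.randint(0, T, (4,))
        x_t, eps = forward_diffusion_sample(x_0, t)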

  • @tensenpark
    @tensenpark 1 year ago

    Damnnn I wish I watched your video first thing when trying to understand this. Great explanation

  • @Zindit
    @Zindit 1 year ago

    Forward processing is very clear. Could you categorize the code blocks of backward processing?

  • @curiousseeker3784
    @curiousseeker3784 9 months ago

    Question: at 15:18, why did we not scale directly between -1 and 1? Or are there two different tensors we are scaling, one between 0 and 1 and the other between -1 and 1?

  • @adamtran5747
    @adamtran5747 1 year ago

    love the content brother

  • @user-dc2vc5ju3m
    @user-dc2vc5ju3m 1 month ago

    Amazing explanation

  • @LuisPereira-bn8jq
    @LuisPereira-bn8jq 1 year ago +3

    Thanks a lot for the video, really helpful for someone trying to grasp these models.
    Also, a little typo I noticed: at 16:06 in the cell "# Simulate forward diffusion", noise is being added a little faster than intended.
    The culprit is the line "image, noise = forward_diffusion_sample(image,t)", due to it rewriting the variable "image" at each step in the loop, despite the fact that forward_diffusion_sample was built expecting the initial non-noisy image. So from the second iteration step onwards we're adding noise to an already noisy image.

    • @DeepFindr
      @DeepFindr 1 year ago +1

      Hehe, thanks for this finding, this is indeed a bug. I just checked; it doesn't look very different with the correction (assigning to a new variable). If I'm not mistaken, this led to a multiplication by 2, as in every step the pre-computed noise for this t is added, plus the cumulative noise until t (which should be the same as the pre-computed one), hence leading to twice the noise as intended. Anyways, thanks for this comment! :)

    • @LuisPereira-bn8jq
      @LuisPereira-bn8jq 1 year ago +5

      ​@@DeepFindr Hi again. Yeah, the bug didn't really affect the images much, but it might confuse some viewers about whether you're computing x_t from x_0 or from x_{t-1}.
      As for the "multiplication by 2" bit, it's not going to be exactly that since the betas are changing and you're adding (t-1)-step noise to t-step noise. Moreover, adding a N(0,1) to another independent N(0,1) is a N(0,2), whose standard deviation is sqrt(2), so what was happening should be closer to multiplication by sqrt(2), even if also not exactly that.
      Anyway, since my previous comment I've now finished the video and trained it for 100 epochs so far (with comparable results to yours).
      I have two more comments in the latter bits of the video, namely the "sample_timestep" function at 26:59:
      - I was rather confused for a while as to why we were returning "model_mean" rather than just "x" for t=0. Though eventually I realized that the t's in the code are offset from the t's in the paper: the code is 0-indexed but the paper is 1-indexed. So the t=0 case in the sample_timestep is really inferring x_0 from x_1 in terms of the paper.
      It might be worth adding a comment about this in either the video or the code.
      - it took me quite a bit to understand the output of the sample_timestep function. I think I mostly got it now, but this is a really subtle step that is worth demystifying.
      Here's my current understanding: in effect our model is always trying to predict x_0 from x_t, but we don't expect the prediction to be great for large t. However, the distribution of p(x_(t-1) | x_t, x_0) is a fully known normal distribution, so instead we use the predicted x_0 to approximate this p(...), then sample from that to get x_(t-1).
      In retrospect, I've seen multiple videos on diffusion try to describe this process in words as "we predict the full noise, but then we add some of the noise back", but that vague description never made sense to me.
      So maybe an extra comment on this could help a future viewer as well.
      Anyway, let me thank you again for the video. My hope is to eventually actually understand stuff like stable diffusion with all its bells and whistles, and this already helped a lot.
      And on that note, I noticed that the weights for the network in the video take up 700M, compared to something like 4GB for stable diffusion, so it's maybe not so surprising that this would require a while to train from scratch.

    • @DeepFindr
      @DeepFindr 1 year ago

      @Luis Pereira yes I totally agree, in retrospect some things could've been more in depth. Meanwhile I've also experimented more and read other papers about these models (and also the connection to score based approaches) which could also be added here. Maybe I'll make an update video some day :)

    • @LuisPereira-bn8jq
      @LuisPereira-bn8jq 1 year ago

      @@DeepFindr No worries. My experience is that in retrospect nearly everything could have been improved in some way or other.
      And if you ever find the time for another video, I at least would be interested. There are a decent number of good YouTube videos on this topic, but this is one of the best I've found.
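
    For reference alongside this thread, a rough sketch of the sampling step being discussed (the helper names, the argument layout, and the simple choice sigma_t^2 = beta_t are assumptions, not the notebook verbatim); it also marks the 0-indexed t == 0 case mentioned above:

        import torch

        @torch.no_grad()
        def sample_timestep(model, x, t, betas, alphas_cumprod):
            # One reverse step x_t -> x_{t-1}, following Algorithm 2 of the DDPM paper.
            beta_t = betas[t].view(-1, 1, 1, 1)
            alpha_t = 1.0 - beta_t
            sqrt_one_minus_ab = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)

            # Mean of the reverse step, built from the model's predicted noise.
            model_mean = (x - beta_t / sqrt_one_minus_ab * model(x, t)) / alpha_t.sqrt()

            if int(t[0]) == 0:
                # The code's t == 0 is the paper's t == 1: return the mean, add no noise back.
                return model_mean
            noise = torch.randn_like(x)
            return model_mean + beta_t.sqrt() * noise   # simple choice sigma_t^2 = beta_t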

  • @lchunleo
    @lchunleo 6 months ago

    Thanks for the clear explanation and code. But I wonder how I can use the trained model to generate images? Can you advise?

  • @alexvass
    @alexvass 1 year ago

    great video, very clear

  • @sienloonglee4238
    @sienloonglee4238 1 year ago

    very nice video and very easy to understand

  • @erank3
    @erank3 1 year ago +5

    Thanks so much !! This is gold! Keep them coming. I’m curious to see the results of the model, can you share some more pictures?

    • @DeepFindr
      @DeepFindr 1 year ago +2

      Thank you! Happy that you liked it!
      I only have the pictures at the end of this video. Unfortunately I didn't save the model weights after the longer training, because I thought I wouldn't need them anymore :/

  • @user-nn5fp7tl2j
    @user-nn5fp7tl2j 20 days ago

    Beautifully explained

  • @roblee5721
    @roblee5721 1 year ago +3

    Very nice video with a good explanation. I would like to point out that in your Block class, the same batchnorm is used in different places. Batchnorm is trainable and has weights, so you might want to treat it more like an actual layer rather than a memory-less operation like pooling or ReLU.

    • @DeepFindr
      @DeepFindr 1 year ago

      Hi, thanks for pointing this out. This was a little bug, which I've corrected in the original notebook. :)
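
    A minimal sketch of a Block with separate BatchNorm layers, as suggested in this thread (the exact layer layout is an assumption, simplified from the notebook's up/down blocks):

        import torch
        from torch import nn

        class Block(nn.Module):
            def __init__(self, in_ch, out_ch, time_emb_dim):
                super().__init__()
                self.time_mlp = nn.Linear(time_emb_dim, out_ch)
                self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
                self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
                self.bnorm1 = nn.BatchNorm2d(out_ch)   # one BatchNorm per conv: each holds
                self.bnorm2 = nn.BatchNorm2d(out_ch)   # its own affine weights and running stats
                self.relu = nn.ReLU()

            def forward(self, x, t_emb):
                h = self.bnorm1(self.relu(self.conv1(x)))
                h = h + self.relu(self.time_mlp(t_emb))[..., None, None]   # add timestep embedding
                return self.bnorm2(self.relu(self.conv2(h)))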

  • @CyberwizardProductions
    @CyberwizardProductions 1 year ago

    Really interesting explanation, thank you for doing this.

  • @frederictost6659
    @frederictost6659 1 year ago

    Thank you for the video.

  • @MonkkSoori
    @MonkkSoori 1 year ago

    Hello thank you for the video and code. I have two questions:
    Q1- In the Block module 24:16 why is the input channel in `self.conv1` multiplied by 2? The input channel is twice the size of the output based on the `up_channels` list in the `init` of your SimpleUnet class. Is this related to adding "residual x as additional channels" at 24:50?
    Q2- How do you control which direction the diffusion goes in? I know this is a very simplified example model, but how would you add the ability to steer the generation towards a certain class of car, or a car based on a text description like "red SUV"? Is there a good explanatory paper, blog post, or video on this that you can recommend (preferably practical, without a lot of math)?
    (Thank you again for the video)
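
    Regarding Q1, the factor of 2 matches the residual/skip concatenation mentioned at 24:50; a tiny sketch of the shape arithmetic (the example sizes are made up for illustration):

        import torch

        x = torch.randn(1, 128, 16, 16)       # decoder feature map
        skip = torch.randn(1, 128, 16, 16)    # matching encoder feature map (the "residual x")
        x = torch.cat([x, skip], dim=1)       # -> shape (1, 256, 16, 16): channel count doubles,
                                              #    so the block's first conv needs 2x input channels
        print(x.shape)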

  • @chyldstudios
    @chyldstudios 1 year ago +1

    Amazing work, thanks for sharing!

  • @michael2826
    @michael2826 1 month ago

    Thanks! From the training result, what I saw was an image that went from its original version to a less noisy one? I was expecting to see a noisy image converted to a less noisy one or its original one?

  • @AI_Financier
    @AI_Financier 1 year ago +1

    Thank you mate for the video, can you make one for the "conditional Diffusion" too? thanks

  • @FelipeOliveira-gt9bf
    @FelipeOliveira-gt9bf 1 year ago +1

    Awesome video! What software are you using to draw these examples?

    • @DeepFindr
      @DeepFindr 1 year ago +1

      Thanks!
      It's nothing fancy - a mix of PowerPoint and DaVinci Resolve. :)

  • @kudre302
    @kudre302 3 months ago

    Great video! Can you give any tips for generating 128x128 images with this model, please?

  • @infocus2160
    @infocus2160 1 year ago +1

    Many thanks. Excellent explanation. Can we use diffusion models for deblurring images? These are generative models, and I want to use them for image restoration problems. Thanks

    • @DeepFindr
      @DeepFindr 1 year ago +2

      Hi!
      Yes they can be used for image restoration as well. Have you seen this paper: arxiv.org/abs/2201.11793
      :)

    • @infocus2160
      @infocus2160 1 year ago

      @@DeepFindr Excellent thanks. You are amazing.

  • @michakowalczyk7411
    @michakowalczyk7411 1 year ago +35

    Always been told after math classes: "You won't need that in real life anyways" xd

    • @DeepFindr
      @DeepFindr 1 year ago +6

      I can relate xD

    • @snoosri
      @snoosri 9 months ago +2

      but what is real life?

    • @michakowalczyk7411
      @michakowalczyk7411 9 months ago +1

      ​@@snoosri Don't take words so literally, especially on media. I believe you can grasp the meaning from context my friend :)

    • @superpie0000
      @superpie0000 9 months ago

      @@snoosri Underrated comment

  • @user-co6pu8zv3v
    @user-co6pu8zv3v 1 year ago

    Thank you!

  • @jby1985
    @jby1985 11 months ago

    At 10:19, q is the noise and the next forward image is x+q. Do I understand it right? Or do we just use q and x interchangeably?

  • @rajatsubhrachakraborty6767
    @rajatsubhrachakraborty6767 1 year ago

    How can we save all the generated images? As far as my understanding goes, at the end of training there would be generated images of Stanford cars produced from completely noised images.

  • @SandeepSinghPlus
    @SandeepSinghPlus 9 months ago +1

    The Stanford Cars dataset is no longer available in PyTorch datasets. Do you have any alternate locations for the same data?
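
    One possible workaround (an assumption, not an official fix, since the dataset is unavailable in several comments below): download the images manually, e.g. from Kaggle, and point a torchvision ImageFolder at the local copy; the rest of the notebook should then work with the same transforms.

        from torchvision import datasets, transforms

        IMG_SIZE = 64
        data_transform = transforms.Compose([
            transforms.Resize((IMG_SIZE, IMG_SIZE)),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),                       # scales pixels to [0, 1]
            transforms.Lambda(lambda x: x * 2 - 1),      # rescale to [-1, 1] as in the video
        ])

        # Assumes a local folder like ./data/cars/<some_subfolder>/*.jpg,
        # since ImageFolder expects at least one class subdirectory.
        dataset = datasets.ImageFolder("./data/cars", transform=data_transform)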

  • @marcinwaesa8713
    @marcinwaesa8713 1 year ago +3

    When you were explaining the code for the noise scheduler, the T value changed from 200 to 300, which I think should also be reflected in different (smaller) betas, because we end up with a smaller cumulative alpha.

  • @xingyubian5654
    @xingyubian5654 1 year ago

    goated content

  • @user-sc8hg7lw8t
    @user-sc8hg7lw8t 5 months ago

    Hi, thank you for the well-explained video! I've been following your code and training the same model on the StanfordCars dataset. At epoch 65, the sampled images of my training just come out as grey images. Is there something wrong with my training? Should I adjust the learning rate?

    • @neelsortur1036
      @neelsortur1036 4 months ago

      also having this issue, did you figure it out?

  • @mehdidehghani7706
    @mehdidehghani7706 1 year ago

    Thank you very much for this video

  • @arnabkumarpan5615
    @arnabkumarpan5615 10 months ago

    You are seeing something that's going to change the way we see our universe in the upcoming 2-3 years! Save my comment!

  • @leonliang9185
    @leonliang9185 1 year ago +2

    Bro, one day in the future when this channel becomes famous, don't forget I am one of your early fans!

  • @arymansrivastava6313
    @arymansrivastava6313 11 months ago

    Will it be possible to generate new images using this model if it is saved after training? Please share how to generate new images, if possible.

  • @usama57926
    @usama57926 1 year ago +1

    Can you make a video on *Conditional generation in Diffusion models*?

  • @nicolasf1219
    @nicolasf1219 11 months ago

    So, stupid question:
    In the SimpleUnet class we define the output layer with a parameter of 3 to regain the number of channels our image has. Couldn't we then just pass the image_channels variable there? What if my image is grayscale and has only 1 channel?

  • @derekyun5109
    @derekyun5109 1 year ago +1

    Looks like the torchvision dataset for StanfordCars is now deprecated or something; the original URL from which the function pulls the data is closed.

  • @SeonhoonKim
    @SeonhoonKim 1 year ago

    Thanks a lot for your contribution to this... But I'm a bit confused: at 7:30, is q(Xt | Xt-1) the distribution that "the sampled noise" follows, OR the one that "the noised image" follows?

    • @DeepFindr
      @DeepFindr 1 year ago

      It's the distribution of the noised image :) the distribution of the noise is always gaussian. This formula expresses the mixture of the original input and the noise distribution, hence the distribution of the noised image.

    • @SeonhoonKim
      @SeonhoonKim 1 year ago

      @@DeepFindr Thanks for your reply!! Just one more, please? Then q(Xt | Xt-1) = N(Xt; ..., BtI) means the variance of Xt is Bt? Someone says V(Xt) eventually becomes 1 in every step, so I'm a bit confused...

    • @DeepFindr
      @DeepFindr 1 year ago

      @@SeonhoonKim bt is just the variance of this single step. Have a look at the "closed form" part with alpha. Ideally alpha bar becomes 0 at the end (the cumulative product) which leads to a variance of 1

  • @terguunzoregtiin8791
    @terguunzoregtiin8791 1 year ago

    Really helpful content, and the recommended resources are very good, thanks

  • @jungminhwang8115
    @jungminhwang8115 11 months ago

    Hi, thanks for the awesome work. I would like to reduce the image size, but when I changed it, training is not working; could you send me some info? I would also like to change your code to the DDIM method; is it enough to only change the sampling part? Could you send me detailed info?

  • @curiousseeker3784
    @curiousseeker3784 9 months ago

    that's insane math

  • @DmitryFink
    @DmitryFink 10 months ago

    How many epochs does it take to produce anything that does not look like noise? I've downloaded the dataset from Kaggle and replaced the data loader code in the Colab. The forward process works, however training doesn't seem to work: the loss is stuck from the very beginning at ~0.81, it doesn't go down, and the sampled pictures still look like noise. I am at epoch 65 and it does not seem to improve at all.

  • @amortalbeing
    @amortalbeing 2 months ago +1

    @13:09 Why isn't sqrt_recip_alphas used anywhere?
    Also, why do you calculate sqrt_one_minus_alphas_cumprod, when in the equation we only have 1-alphas_cumprod? Is that a typo?
    What's alphas_cumprod_prev exactly?
    Can someone please explain what is being done here? And what's the posterior_variance?
    Thanks a lot in advance
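
    A sketch of where those precomputed terms come from (assumed linear schedule; the names mirror the notebook, the formulas follow the DDPM paper), which may help with the questions above:

        import torch
        import torch.nn.functional as F

        T = 300
        betas = torch.linspace(1e-4, 0.02, T)
        alphas = 1.0 - betas
        alphas_cumprod = torch.cumprod(alphas, dim=0)                        # alpha_bar_t
        alphas_cumprod_prev = F.pad(alphas_cumprod[:-1], (1, 0), value=1.0)  # alpha_bar_{t-1}, with alpha_bar_0 = 1

        # Used in the reverse/sampling step to scale the predicted mean (Algorithm 2).
        sqrt_recip_alphas = torch.sqrt(1.0 / alphas)

        # Closed-form forward process: x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps.
        # (1 - alpha_bar_t) is a variance, so sampling multiplies by its square root (a std-dev).
        sqrt_alphas_cumprod = torch.sqrt(alphas_cumprod)
        sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - alphas_cumprod)

        # Variance of the posterior q(x_{t-1} | x_t, x_0), used as sigma_t^2 when sampling.
        posterior_variance = betas * (1.0 - alphas_cumprod_prev) / (1.0 - alphas_cumprod)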

  • @jeonghwanh8617
    @jeonghwanh8617 1 year ago

    best explanation, and perhaps the sigma in the normal distribution graph should be sigma^2

  • @bluebear7870
    @bluebear7870 1 year ago

    I have a question, sir.
    At 13:22, the formula is (1 - alpha_bar); why does the code use sqrt(1 - alpha_bar)?

  • @sanjeevlvac1784
    @sanjeevlvac1784 5 months ago

    If anyone wants to implement DDPM for time-series data, which model would be good instead of a U-Net? Any suggestions?

  • @cerann89
    @cerann89 1 year ago +2

    Thanks for a great tutorial. I think there is a small bug though in the implementation of the output layer of the U-Net: the output channel dimension is swapped with the kernel size and set to a fixed 3. Shouldn't it look like this instead: self.output = nn.Conv2d(up_channels[-1], out_dim, 3)

    • @DeepFindr
      @DeepFindr 1 year ago +1

      Oh yes :D bugs everywhere.
      With output dim 1 it would just produce black and white images, so this bug led to color ;-) have you tried it with another kernel size? Did it make a difference?

    • @cerann89
      @cerann89 1 year ago +2

      @@DeepFindr I actually tried it on medical MRI images which have only one color dim (greyscale). That is where the error was triggered. I kept the kernel size at 3, so no I can’t give any input on the influence of the kernel size.
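
    A sketch of the fixed output layer as proposed in this thread (the names up_channels and out_dim follow the notebook; the example values are assumptions):

        from torch import nn

        up_channels = (1024, 512, 256, 128, 64)     # example decoder widths
        out_dim = 3                                 # number of image channels (1 for greyscale)
        output = nn.Conv2d(up_channels[-1], out_dim, kernel_size=3)   # channels first, kernel size last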

  • @tamascsepely235
    @tamascsepely235 4 months ago

    Could you help me with this: if I want to stop the training and resume it later, how can I save the model and load it later?

  • @yangjun330
    @yangjun330 1 year ago

    Thanks a lot. Can I ask how to choose the number of timesteps in diffusion? Is a larger number of timesteps better?

    • @DeepFindr
      @DeepFindr 1 year ago

      Basically it's a hyperparameter. Not only the step size is relevant, but rather the beta schedule (so start and end values). In my experiments I was simply visualizing the data distributions to determine a good value. You have a good schedule if the last distribution follows a normal gaussian with zero mean and std 1. Also, I have the feeling that a higher number of steps leads to higher fidelity, but I didn't further look into this
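
    A quick numerical check of that rule of thumb (assumed linear schedule and fake uniform data, not the notebook's actual dataset): after the last step the noised data should be roughly zero-mean with standard deviation close to 1.

        import torch

        T = 300
        betas = torch.linspace(1e-4, 0.02, T)
        alpha_bar_T = torch.cumprod(1.0 - betas, dim=0)[-1]      # close to 0 for a good schedule

        x_0 = torch.rand(512, 3, 64, 64) * 2 - 1                 # fake data scaled to [-1, 1]
        x_T = alpha_bar_T.sqrt() * x_0 + (1 - alpha_bar_T).sqrt() * torch.randn_like(x_0)
        print(x_T.mean().item(), x_T.std().item())               # approximately 0 and 1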

  • @adamgrygielski7395
    @adamgrygielski7395 1 year ago +3

    Shouldn't you use separate BN layers for the 1st and 2nd convolution in a block? In your implementation batch statistics are shared between the two layers, which seems to be a bug.

    • @DeepFindr
      @DeepFindr 1 year ago +4

      Yep, you are right. I updated the notebook.
      Actually I also found that bug in a local version of the code and forgot to adjust the notebook. Bnorm layers can't be shared as each layer learns individual normalization coefficients.
      Thanks for pointing this out :)

  • @int16_t
    @int16_t 8 months ago

    Say I have a still image x0 and a pre-initialized noisy image N. I think I can apply noise to x0 by "(1-B)x0 + BN". When B=1, the output is N, the noisy image; when B=0, the output is the still image. But that's just the linear version.

  • @user-bh8kn3zt5z
    @user-bh8kn3zt5z 5 months ago

    Hello, I am confused about a few things. 1. Why did you choose T=300, when generally it's T=1000? What decides the number of time steps? 2. There is a variable num_images in the simulate forward diffusion section. Why are we dividing T by num_images, and what does it mean?

  • @FrankWu-hc1dl
    @FrankWu-hc1dl 1 month ago

    Hey I have a question: I think in the Colab notebook you only sample one time step from each image in a batch, but I was wondering why we don't take more intermediate time steps from each sample?

  • @asheeshmathur
    @asheeshmathur 9 months ago

    Excellent tutorial, but it looks like the code needs to be updated: the Stanford Cars dataset is gone. It is available on Kaggle; could you please update the notebook accordingly?

  • @hilmiyafia
    @hilmiyafia 1 year ago +1

    In the Google Colab there are log and exp in the Sinusoidal Embedding block. You did not explain where those come from. I don't see them in the formula at 20:26.

    • @DeepFindr
      @DeepFindr 1 year ago +1

      Hi :)
      Some implementations of positional embeddings are calculated in log space, that's why you see exp and log there. This usually improves numerical stability and is sometimes also done for loss functions
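
    For reference, a common way to write the embedding in log space (a generic sketch of this pattern, not necessarily the notebook's exact code):

        import math
        import torch
        from torch import nn

        class SinusoidalPositionEmbeddings(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.dim = dim

            def forward(self, t):
                half_dim = self.dim // 2
                # Frequencies 1 / 10000^(i / (half_dim - 1)), computed as
                # exp(-log(10000) * i / (half_dim - 1)) in log space for numerical stability.
                freqs = torch.exp(-math.log(10000.0) * torch.arange(half_dim, device=t.device) / (half_dim - 1))
                args = t[:, None].float() * freqs[None, :]
                return torch.cat([args.sin(), args.cos()], dim=-1)

        emb = SinusoidalPositionEmbeddings(32)(torch.tensor([0, 10, 100]))   # shape (3, 32)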

  • @thenial8245
    @thenial8245 1 year ago

    How to save the model and generate images after training?

  • @tilkesh
    @tilkesh 9 months ago

    Thx

  • @egoistChelly
    @egoistChelly 5 months ago +2

    The dataset is no longer available.

  • @junpengqiu4054
    @junpengqiu4054 1 year ago

    Great walkthrough! I just want to point out a missing term in the implementation of sample_timestep:
    when returning the result, you forgot to multiply model_mean by one over the square root of alpha_t (i.e. 1 + betas_t). To match Algorithm 2 from the paper, it should be:
    return model_mean / torch.sqrt(1. - betas_t) + torch.sqrt(posterior_variance_t) * noise
    However, even plugging this term back into the return statement, I did not see much difference in the training result ;P So missing that term might not be a big deal.

  • @catfood7859
    @catfood7859 1 year ago

    Thanks for the great video first of all! I trained the model on a human face dataset using your code, and the sampling results show checkerboard (grid-pattern) artifacts. How can I solve this?

    • @DeepFindr
      @DeepFindr 1 year ago

      Hi! Make sure to train the model long enough (e.g. set 1000 epochs and see what happens). Also you might want to fine-tune the model architecture and add more advanced layers like attention.
      I also encountered weird patterns at first, but after training longer the quality got better.

    • @catfood7859
      @catfood7859 1 year ago +1

      @@DeepFindr Thanks for the advice, I'll try it : )

    • @KJPCox
      @KJPCox 1 year ago +1

      It helps if you set the final layer's filter size to 1

  • @henriwang8603
    @henriwang8603 11 months ago +2

    Maybe it's because I came to this video too late, but the Stanford Cars dataset link is now invalid: a 404 error.

  • @sriharsha580
    @sriharsha580 1 year ago

    Why is the time embedding appended to the features after the first CNN layer in the U-Net? Why not add the time embedding at the initial step (before the U-Net)?

    • @DeepFindr
      @DeepFindr 1 year ago

      You could also do that, but I added the timestep in each of the Unet blocks.
      I think that there are many possibilities to try things out :)

  • @namirahrasul
    @namirahrasul 24 days ago

    Also, StanfordCars is no longer available; can you please change it?

  • @nikhilprem7998
    @nikhilprem7998 2 months ago +1

    The StanfordCars dataset is no longer available; what alternative can I use?

    • @henrysun6430
      @henrysun6430 2 months ago +1

      ^ having the same problem

  • @glacialclaw1211
    @glacialclaw1211 6 months ago

    How do I feed my own training dataset into the ipynb script?

  • @JDechnics
    @JDechnics 1 year ago +2

    How did you make sure that the cars generated at the end are truly original generations and not just copies of some cars in the dataset?

    • @DeepFindr
      @DeepFindr 1 year ago +4

      This actually relates to all generative models - how to make sure that the model doesn't simply memorize the train set.
      For example I've also seen this discussion for GANs: www.lesswrong.com/posts/g9sQ2sj92Nus9DeKX/gan-discriminators-don-t-generalize
      There is also some research in that direction: openreview.net/forum?id=PlGSgjFK2oJ
      To answer your question: you need to sample some data points and compare them with the nearest matches in the Dataset to be sure the model didn't overfit. More data always helps of course, to make it less likely that the model memorizes the whole dataset.

  • @namirahrasul
    @namirahrasul 24 days ago

    I didn't understand... why do we have to convert the images to tensors?

  • @chrislloyd1734
    @chrislloyd1734 1 year ago +18

    How can a model that is only 3.2GB, produce almost infinite image combinations that can be produced from just a simple text prompt, with so many language variables. What I am interested in, is how a prompt of say a "monkey riding a bicycle" can produce something that visually represents the prompt. How are the data images tagged and categorized in training to do this? As a creative person we often say that an idea is still misty and is not formed yet. What strikes me about this diffusion process is the similarity in how our minds at a creative level seem to work. We iterate and de-noise the concept until it becomes concrete using a combination of imagination and logic. It is the same process that you described to arrive at the finished formula. What also strikes me about the images produced by these diffusion algorithms is that they look so creative and imaginative. Even artists are shocked when they see them for the first time and realize a machine made them. My line of thinking here is that we use two main tools to acquire and simulate knowledge and experience. They are images and language. Maybe this input is then stored in a similar way as a diffusion model within our memory. Logic, creativity and ideas are just a consequence of reconstituting this data due to our current social or environmental needs. This could explain our thinking process and why our memory is of such low resolution. The de-noising process could also explain many human conditions such as depression and even why we dream etc. This brings up the interesting question " Could a diffusion model be created to simulate a human personality"? Or provide new speed think concepts and formulas for the solving of a multitude of complex problems for that matter. The path would be 1) diffusion model idea/concept 2) ask a GAN like gpt-3 to check if it works 3) feed back to the diffusion model and keep iterating until it does in much the same way as de-noising a picture. Just a thought from a diffusion brain.

    • @flubnub266
      @flubnub266 1 year ago +5

      It's because the subset of possible images we humans are interested in is actually very specific. If you think about it, infinite combinations isn't that complicated. It's when we want specific things that you need more information. It only takes a few KB of code to make a pseudorandom number generator that can theoretically output every possible image, but we would see the vast majority of those permutations as boring rainbow noise. Ironically, the storage space used by generative models is needed to essentially explain what we DON'T want, so that we are left with the very specific subset that does meet our requirements.

  • @arnob3196
    @arnob3196 1 year ago +1

    How long did it take to train 500 epochs on your RTX 3060?

    • @DeepFindr
      @DeepFindr 1 year ago

      Hi! Good question, it was certainly several hours. I ran it overnight

  • @tendocat8778
    @tendocat8778 4 months ago

    You saved my PhD

  • @isaacsalvador4188
    @isaacsalvador4188 1 year ago

    Do you know how to modify this diffusion model to accept a custom data set?

    • @DeepFindr
      @DeepFindr 1 year ago

      Yes, simply exchange the Dataset class with a custom dataset from pytorch. As long as it's images, the rest should work fine :)

  • @AbhishekSingh-hz6rv
    @AbhishekSingh-hz6rv 7 months ago

    Did anyone get an output? I don't know why I am getting only noisy images at epoch 0.

  • @sachinmotwani2905
    @sachinmotwani2905 1 year ago +3

    Unable to access the dataset - stanford-cars.

  • @xhinker
    @xhinker 10 months ago

    At 11:45, the third line of the formula should have a bar on the top of alpha_t

  • @omarlopezrincon
    @omarlopezrincon 1 year ago

    Did you share the code of the implementation for your personal GPU?

    • @DeepFindr
      @DeepFindr 1 year ago +1

      Hi, I don't have it anymore, but it's basically the same as this one, just in a Python file.
      Also there were some comments below about changing parts of the code, e.g. the positional embeddings and final filter sizes, that might improve the performance.

    • @omarlopezrincon
      @omarlopezrincon 1 year ago

      @@DeepFindr Thanks, are you a researcher?

    • @DeepFindr
      @DeepFindr 1 year ago +1

      I work in applied research in industry. For me it's the sweet spot between pure research and building software products :)

  • @vladilek
    @vladilek 2 months ago

    That is really cute, though