'3/4/5-levels' looks like a very powerful way of explaining concepts. I'd like to see the higher levels be longer, and really drill down into the heart of the matter. So that the final level is communicating at an expert level. +1 / subbed.
now that's some quantum technology, man... Being one of the beta testers of Stable Diffusion helps me understand this even more.
Awesome!
Now, that's a great explanation for Diffusion Models.
You're so beautiful and explain the diffusion model in the simplest way. As the Chinese saying goes: tough people don't waste words! (人狠话不多)
This is a very good intro for a quick understanding of the concept 👍
Glad it was helpful!
This was so helpful. Love this format of starting easy and adding layers of explanation.
Great to hear, thanks!
Great video for beginners! Really helpful, Thank you!
The way you describe it is how to get the 'original' picture back. But all the content generated is new (in that combination). I was waiting for an explanation of the step that describes how a new combination (so, new content) is generated into existence from the latent space through diffusion, not only the method for getting the starting picture back from the noise...
You pair it with CLIP, which takes in a text string and an image and outputs the distance between them. You denoise while lowering this distance.
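Roughly, in code, that CLIP-guidance idea looks like the sketch below. `denoiser` and `clip_model` are hypothetical stand-ins for a trained diffusion network and a CLIP-style encoder; real guided pipelines differ in the details (step sizes, schedules, augmentations):

```python
import torch
import torch.nn.functional as F

def clip_guided_step(x, t, text_emb, denoiser, clip_model, guidance_scale=100.0):
    """One denoising step nudged toward the prompt (sketch; all objects hypothetical).
    denoiser(x, t) -> predicted noise; clip_model.encode_image(img) -> embedding."""
    x = x.detach().requires_grad_(True)
    eps = denoiser(x, t)                      # model's estimate of the noise in x
    x0_est = x - eps                          # crude estimate of the clean image
    img_emb = clip_model.encode_image(x0_est)
    # Cosine distance between image and text embeddings; lower = closer match.
    dist = 1.0 - F.cosine_similarity(img_emb, text_emb).mean()
    grad = torch.autograd.grad(dist, x)[0]
    # Remove a bit of predicted noise AND move against the CLIP-distance gradient.
    return (x - 0.1 * eps - guidance_scale * grad).detach()
```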
The video did explain this very briefly. It's as simple as creating a random noise image and feeding it into the same model you used during training. It will turn it into a real image exactly the same way it happened during training. You just get rid of the part that turns an image into noise and use the part that turns noise into an image. You don't have to use CLIP or any text at all. That's a whole other ball game
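For anyone who wants to see what "feed random noise into the trained model" looks like concretely, here's a minimal sketch, assuming a DDPM-style model trained to predict the added noise (`model` is a hypothetical stand-in; the schedule constants follow the standard linear schedule):

```python
import torch

@torch.no_grad()
def sample(model, shape=(1, 3, 64, 64), T=1000):
    """Start from pure Gaussian noise and denoise step by step (DDPM sketch).
    `model(x, t)` is assumed to return the predicted noise eps_theta."""
    betas = torch.linspace(1e-4, 0.02, T)            # standard linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                           # random noise image, no original needed
    for t in reversed(range(T)):
        eps = model(x, torch.tensor([t]))            # predict the noise present in x_t
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()   # DDPM posterior mean
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise           # sigma_t^2 = beta_t variant
    return x
```

Note that nothing here requires text or CLIP: you just run the noise-to-image half of what was trained.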
It would be good if you also explained the reverse process in as much detail as you explained the forward process.
Appreciate your explanation skill.
Q. What is a diffusion model?
Ans. Let's say you tell your best friend, Sarah, about this amazing new flavor. Sarah gets excited and tells her friend, Tom. Then Tom tells his cousin, Emily. Emily, in turn, tells her family, and the news keeps spreading from person to person, creating a chain reaction. This process of your ice cream flavor information spreading from one person to another is like how a drop of ink spreads in water. At first, it's just a small spot, but then it spreads out and covers more and more area as time goes on.
In the diffusion model, experts study how things, whether it's information, ideas, or products like your ice cream flavor, spread through a community of people. They try to understand how fast it spreads, how many people it reaches, and what factors influence its spread. By understanding these patterns, they can learn a lot about how people share and adopt new things!
Such a great video to dive in! I'm live streaming learning about Diffusion, right now!
Really good video! I have checked a few blogs explaining how diffusion models work and still could not understand. But after seeing your video once, I have a better understanding of how diffusion actually works! Thanks a lot!
That's great to hear, Zhao! Thank you for watching. :)
It doesn't hold up to scrutiny or retraining. This is a core concept, not what was actually used to create Stable Diffusion. Kohya and similar diffusers-related programs show the more commonly used training methods and math if you analyze the code and pay attention to the optimization routes and the specific inference-selection code itself. It doesn't fit the paper and often branches outward into almost renegade realms to provide clarity to the input and output of the fine-tuning or LoRA creation.
For example:
A common way to train is step by step, but there are also a great number of low-step, high-density training methods now. Quite a few, actually. This singular concept was only the original one.
Another problem here is that the part 1 example is based on physics, when there's actually a substantial amount of math based entirely on chaos theory now for diffusion analysis and inference. There's been a great deal of added complexity since this original video.
Diffusion models actually predict a bit of noise to remove from the input noisy image at inference time. The noise is added to images just to produce training data.
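In standard DDPM notation (Ho et al., 2020), that's exactly the split: the forward process only manufactures training pairs, and each inference step subtracts a predicted bit of noise:

```latex
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \quad \text{(forward: produce training data)}

x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z \quad \text{(inference: remove predicted noise)}
```

Here \epsilon_\theta is the trained network's noise prediction; training minimizes \|\epsilon - \epsilon_\theta(x_t, t)\|^2.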
Great explanation of diffusion models. Thank you.
Glad it was helpful!
Thank you for your explanation!
You're welcome!
Thanks for this video, it was very insightful. I still have a lot to learn about this topic that will revolutionize our world so much.
I want to understand diffusion models so I can understand how it's possible for artificial intelligence to produce an image. Your explanation helps. A bit.
This video was awesome! Well done :) and thank you
thanks for this great presentation
This was a great explanation! I tried to read the blog first but the maths notation was way over my head
Thank you Chewie :)
my brain cannot handle this
wow someone finally pulled this off
Nice explanation! I feel like the video title is misleading; it is just one explanation going deeper, incomplete without the deeper levels of knowledge, and it differs a lot from other videos that start the explanation from zero at each level. This is more like 4 shades of Diffusion :D Thanks for sharing!
Sorry, but I don't understand something very important. WHY would you add the noise and then subtract the noise? Correct me if I'm wrong, but the rightmost noise image in this example is basically an encoded image of the original dog image, which can be decoded deterministically with the neural network in multiple steps. That's nice and dandy. And I do understand that the noise image is not like a RAR archive, which, were it to be slightly modified, would just yield corruption errors; instead, a modified noise image would still generate... an image. NOW.
1. How do you get from the user text prompt to the noise image of what the user WANTED, that will THEN be denoised (decoded)?
2. How is it that not every OTHER noise result from the text prompt (except previously deterministically encoded images like this dog image, for example) outputs just a garbled mess? And yes, I know that is sometimes the case; I use Stable Diffusion daily.
This only describes how the model is trained. Training set picture -> noise -> reconstructed training set picture.
During actual generation there is no "original", you start from random noise but the model has learned to denoise given the relevant tags.
❤🎉 amazing lecture
Can a diffusion model add noise to image1 and then, in the reverse process, make a different image (not the same one)?????? Please respond!
Excellent
so much good info! Thank you!!!
You're very welcome!
Thanks so much for a useful presentation…what a good idea to present in several levels!
Thanks for clear explanations and link to the blog!
You're very welcome!
AMAZING !! Thanks so much!
Thank you very much: well done.
really amazing video thank you very much! subbed! :)
Wow this is so helpful.
Great to hear!
Thanks for the nice explanation. I'd appreciate it if you could present a similar type of explanation comparing DDPM vs. DDIM.
You're very welcome Vipin! Noted your recommendation!
Thank You !
Can diffusion models be used for denoising audio? If yes, how?
Can you post a follow-up explanation of how text-conditioned generation works? Imagen, I guess, for example used T5, but how does that text embedding actually affect the generated image, and how is it trained? Because in the end we have a system where the "noise" is generated with some text embedding, so I am curious how that process works.
Thank you for this suggestion Rasmus, noted!
The text is transformed into visual embeddings (visual concepts) using another model, and it is fed to the diffusion model alongside the noisy image. The other model is CLIP, which was trained separately on images with associated descriptions.
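In code terms, the conditioning has roughly this shape. A sketch only: `text_encoder`, `unet`, and `scheduler_step` are hypothetical stand-ins, and the `encoder_hidden_states` name mirrors how Stable Diffusion's U-Net receives the text embedding through cross-attention in the Hugging Face diffusers convention:

```python
import torch

def generate(text_encoder, unet, scheduler_step, prompt="a photo of a dog", num_steps=50):
    """Sketch of text-conditioned generation; all three callables are hypothetical
    stand-ins (text encoder, conditional U-Net, denoising scheduler step)."""
    text_emb = text_encoder(prompt)                  # prompt -> sequence of text embeddings
    x = torch.randn(1, 4, 64, 64)                    # start from random latent noise
    for t in reversed(range(num_steps)):
        # The U-Net sees BOTH the noisy latent and the text embedding; cross-attention
        # lets image features attend to the prompt tokens at every step.
        eps = unet(x, t, encoder_hidden_states=text_emb)
        x = scheduler_step(x, eps, t)                # remove a bit of the predicted noise
    return x
```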
Regarding level 3: is every single pixel diffused at each step, or is there a subset that is randomly chosen? Is the sampling separate for every pixel, or do we take one value and then multiply it by each pixel? Subsequent diffusions work on the already-diffused value, I guess (we don't try to remember what the mean of the original pixel was, but just use the new one)?
thanks for this
You bet!
This is an excellent video. Love the format. Well done, more please!
Thanks, will do!
Fascinating stuff, Great explanation.
Thank you!
Thank you so much for the elegant explanation.
Great explanation!
Glad you think so!
A very confusing, yet somehow great explanation of diffusion models. Thank you!
A confusing but somehow positive feedback. :D Thank you!
Excellent:)
Anyone else see a horse in this drop of paint? 1:00
what role do images in the training set play? are diffusion models violating copyright or not?
This depends on the data the model is trained on! There is not just one specific diffusion model; instead you can train as many models as you like. If your training data contains copyrighted images you won't be able to use the model for commercial purposes, but there are many open-source, non-copyrighted datasets out there!
Tysm
You're very welcome. :)
Beautiful❤
Good job.
Can someone explain why do we need to know the initial position of the ink in water if we already knew where the ink was first introduced?
Think of it not as the position but as the shape of the ink. We're trying to reach the initial shape right after the moment it was dropped.
Great Video! B)
Thank you!
that was aweeeeessssommmmmmeeeee
The explanation of adding noise was well done, but the reverse process--by far the strangest process--was not really explained at all. You introduced, but did not explain, some learning process. This unexplained process "somehow" gets back the image. Every "explanation" of SD always skips over this step! Why? (Also skipped, how the text prompt is "combined" with the image. Folks mumble about CL??, but never clearly explain it.) You are a very very good presenter. Please take 15 minutes to "explain" SD.
Yeah, I am having this frustration as well, except I think I may have understood the concepts more poorly than you.
Regarding the process of how the Level 4 part "somehow" gets back to the image, it could be because UNets are just really complicated, so it almost has to be handwaved? Every explanation I've seen of them (which is not many) immediately descends into highly technical language. It's evidently a step wise process but I don't understand really anything about what is happening in each of those steps, and what data is used during them.
I also don't think I understand the point of _gradually_ adding noise to the original image if you just end up with 100% noise at the end, and then that's where the denoising process starts. Exactly when and how are the partially noised images used? In the UNet? If this is the case, either the explanations of UNets I've seen are missing that info, or they're explaining it in a way that I completely fail to comprehend.
In addition, the explanations I've seen tend to use a single image as an example of how the model is trained. But I understand that these models are trained on many images. So the steps laid out in this video are repeated on thousands of images to train the model to generate an image of a dog (or any image??), but how is information from repeating that process combined into the algorithm or latent space or whatever? Do you start with a virgin model or some generalized model or latent space, which then gets modified when you train it on the first image, and then you carry those modifications over when you train the next image? It seems like that ought to be how it works, but if it is, I think a great explanation for how this stuff works would make that explicit.
And then, yeah, how do text prompts work? Both at a basic level, with just a single word prompt like "dog," but also, how are complicated multi-prompt words managed? (I imagine many of the common "mistakes" of diffusion models might be illustrative.)
A U-Net is a standardized deep-learning model that takes an image as input and produces another image (with the same dimensions) as output. It is trained the conventional way, with the so-called gradient descent algorithm, which aims to minimize the least-squares error loss function. In this case, the model aims to predict a mask of the image which represents the noise that was added in the previous step, so that we can simply subtract that noise from the noisy image to get back to the original image.
I hope that was at least somewhat helpful? :)
@@abail7010 UNet predicts the noise, and a scheduler removes the noise from the image right?
@@jocke8277 On a high level, yes that is true! :)
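For the curious, the objective described in this thread fits in a few lines. A minimal sketch, assuming `unet` is a hypothetical noise-prediction network and `alpha_bars` the usual cumulative noise schedule:

```python
import torch
import torch.nn.functional as F

def training_step(unet, x0, alpha_bars, optimizer):
    """One gradient-descent step: teach the U-Net to predict the noise that was added."""
    t = torch.randint(0, len(alpha_bars), (x0.shape[0],))   # random timestep per image
    eps = torch.randn_like(x0)                              # the noise we will add
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps            # jump straight to noisy step t
    loss = F.mse_loss(unet(x_t, t), eps)                    # least-squares error on the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Run over thousands of images, this is how one model accumulates what it learned from all of them: each step nudges the same shared weights.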
@AssemblyAI
Why 255 (probability density graph)? Does it have to do with binary? Network engineer here, and I am trying to draw correlations between IP address ranges being 255, subnet ranges being 255, and the graph you displayed. They all have binary masks in common, which is why I am asking.
It has to do with binary indeed. 0-255 just represents all the possible 8-bit values. When dealing with standard colored pixels we have an 8-bit value for each of the red, green, and blue channels. Having a 24-bit value per pixel is simply the standard; it already gives 16+ million unique colors for a pixel. en.wikipedia.org/wiki/List_of_monochrome_and_RGB_color_formats#24-bit_RGB
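The arithmetic, for anyone who wants to verify:

```python
values_per_channel = 2 ** 8            # 256 values per channel: 0-255
colors = values_per_channel ** 3       # three channels: R, G, B
print(colors)                          # 16777216 -> the "16+ million" unique colors
assert colors == 2 ** 24               # i.e., 24-bit RGB
```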
NICE
"Full noise" that contains a message is not "white noise". These input "white noise" images are just a puzzle containing info for a computer algorithm to solve. I would not, at this point want to bet our future - or even crossing the street - on "advanced AI"
"Full noise" is just used for AI to see patterns in it like a kid see shapes in a noisy TV screen. It is just a way to give imagination to AI. To get what you want you can guide the generation using text prompts.
Wow, this is really great, it definitely helped me understand how these models are working.
However, I did have one question. In your explanation of how Gaussian noise is created for an image, I was a bit confused. I have had to generate an image of pure noise following a Gaussian distribution before, but in those cases I just generated it by calling, for each pixel, a function to get a random number following a Gaussian distribution, usually centered where 0.5 would be the zero value for that distribution, basically remapping the -1 to 1 distribution to 0 to 1, i.e. Xnew = (X/2) + 0.5. Hopefully that makes sense. But the way you described it sounded like the noise was created by placing a sort of splat on the image following a Gaussian distribution, and then placing subsequent splats in positions based on that first splat's position on the image. I guess this is needed so you can generate all the in-between time steps from image to pure noise, rather than just the final image. But I didn't quite get exactly how you are creating the noise. For example, are you actually splatting a sort of Gaussian distribution that spans several pixels for each position, or is it just affecting that one pixel? I could see it happening both ways and wasn't quite certain from your explanation which one was happening. That is, do you come up with the position, then on that one pixel create a single value that follows the Gaussian distribution curve? Or are you placing some splat that is brightest at its center but falls off to zero, following a Gaussian distribution curve? If the latter, how wide is that, i.e. what would be the radius in pixels? And in either case, how is that mixed with the image? Do you multiply the image by the value in that pixel from the noise you generated, or do you blend between them?
I doubt anyone will read this, as it's quite a long comment/question for a YouTube video, but I thought it wouldn't hurt to try, as I am very interested in how these models work and the under-the-hood details...
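To the splat question above: in the standard DDPM formulation (a sketch of that formulation below, which may differ from whatever the video's visuals suggested), the noise is drawn independently per pixel and per channel (no spatial splat or falloff radius), and it is blended with the image by a weighted sum, not multiplied in:

```python
import torch

def noisy_image(x0, t, alpha_bars):
    """Standard DDPM forward process (sketch). Noise is i.i.d. N(0,1) for every
    pixel and channel, and is mixed into the image by a weighted sum."""
    eps = torch.randn_like(x0)                        # one independent Gaussian sample per value
    ab = alpha_bars[t]                                # cumulative schedule value in (0, 1)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps   # blend: scale image down, noise up
```

As t grows, alpha_bars[t] shrinks toward 0, so the image term fades and the noise term dominates; that is how all the in-between time steps between image and pure noise are produced.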
Great explanation, easy to follow. So in essence, the first step is fixed, then it's variable for the decoding, if I understand it right?
The first step is just to generate training data: final images with corresponding noisy images and the number of steps used to add noise.
Learnt a lot of new things from this video.
Why it is called a U-Net.
Why it is called a diffusion model.
What a diffusion model does and how it does it.
Thanks
Thanks
You're very welcome!
Great
Having just watched 5 videos on this, umm, "topic?" I feel as if I have been in a coma for 25 years. I am looking for the simplest possible explanation on how this whole AI thing works, yet there don't seem to be any videos that can explain that without using already established terminology that, to me, is completely foreign. Your video is obviously well made, and you are good at explaining this, especially with the example of a drop of paint in water, but I am obviously so far from even beginning anything beyond. Apart from understanding "noise", I have no clue as to what "diffusion", or "model" or anything means. I could always watch videos on any topic, i.e. quantum physics, rocket science, robotics, or anything, and get the basic idea, but this time I feel like I'm years behind... If you could make a video explaining this as if you would explain it to someone in kindergarten, I would definitely come back and watch
I think the last level makes sense for people who actually do deep learning (like me). To get the proper background you need to learn about neural networks first.
Good explanation, but I do hate when papers add needless maths and physics that are tangential at best, when they should be describing their model in a simple way.
“OK level one… non-equilibrium thermodynamics” 🥴
level 0 - annealed Langevin dynamics
Hahah I understand the frustration :D But it's just what Diffusion Models are based on so you don't actually have to understand non-equilibrium thermodynamics. :D -Mısra
would you say the process is fractal?
It most definitely is not fractal
I wish my brain was smart enough to understand!
This feels like it should not be possible...then again, its not too different from us humans imagining faces in the clouds. Computers just take this hallucination to the next level.
I have one question. I hate maths but I love to train models. I tried to learn math but god, it's 😵😵. Any advice?
Stick to applied ML then. In that case you can make use of existing frameworks and libraries to implement models for solving problems, without knowing the workings under the hood.
1. But if you do want to understand the math, the only way is to refer to better learning resources and keep trying iteratively. Often it's not the math alone, but the way it is taught, that makes a whole lot of difference in one's understanding. For example, back in grad school I used to refer to Salman Khan's math videos to get an actual understanding of linear algebra concepts (which I could not attain even after reading a few standard books).
2. Having said that, each one of us has to maintain a trade-off between math deep dives and actual implementation. No one knows everything 100%.
@@ujjalkrdutta7854 What you said about math is true. I'm sticking with applied ml for now. There is a lot to explore there. Thank you for your time
Any level >= 5 ?
0:48 - The first sentence out of your mouth was not level 1 dawg. "thermodynamic equilibrium". You should start with showing how a drop of paint spreads in water (a phenomenon everyone either already knows or can easily see) and then explain what's going on and give the definition
The whole language part is missing.
Wow! There’s another video if yours below this one, and your hair is so different that I didn’t recognize that it’s you.
I feel like these steps are just steps, not increasing in difficulty.
Fine, you add noise to an image and then restore it. VERY simple concepts (even if very hard in practice). But the magic of DALL-E, Midjourney & Stable Diffusion is the creation of NEW images. This is the third video I'm watching that explains the same trivial diffusion concept. Guess I'll have to ask ChatGPT instead.
Exactly! I've watched and read numerous explanations of diffusion models, but not one so far has told me how the process ends with an image DIFFERENT from the one with which it began.
In the level 1 explanation, what’s the point of introducing the phrase “thermodynamic equilibrium”? Most lay people understand what it means when we say food coloring diffuses into clear water. Reminding the viewer why that happens from a physics standpoint makes the level 1 explanation less clear, not more clear.
I came to the comments to see if this was Mandy Moore.
Your level 2 should have been level 1
I dunno about all that, I just type in 'boobs' and the thing delivers. Whatever math those silicon wafers decide to subject themselves to, that's on them.
Lost me at level one 😅
6 minutes explaining nothing and at the end.. blablabla super fast about convolution... and nothing clear :/
Who is the lady? Her @
You are beautiful
Sorry, but this video is very frustrating. Nothing was explained in terms of either the technique for reversing or how it relates to new image creation when prompting, which is obviously what we are mostly interested on.
Then this just isn’t the video for you. This was purely explaining the concept
Helped me a lot
This stuff just sucks man
Can we take a moment to appreciate how silly it is to say, "we're gonna explain this in 4 levels - 1 being the easiest, 4 being the hardest"
and immediately starting level 1 with: "diffusion models were inspired by non-equilibrium thermodynamics from physics and as you can understand from the name this field deals with systems that are not in thermodynamic equilibrium"
next time ask ChatGPT to write it for you lmao, imagine going up to a five year old and being like,
"Hey kid, you're familiar with thermodynamic equilibrium right? Well the area of machine learning concerned with image generation using diffusion models takes that principle, but is inspired by its inverse."
Please note that what she referenced in Level 1 is secondary school stuff. The authors obviously assumed *this* basis to build upon, not that of a five-year-old kid.
who all think she's AI generated ??
She might not have a technical background. No technical person would mispronounce variance as variation.