'3/4/5-levels' looks like a very powerful way of explaining concepts. I'd like to see the higher levels be longer, and really drill down into the heart of the matter. So that the final level is communicating at an expert level. +1 / subbed.
now that's some quantum technology, man... Being one of the beta testers of Stable Diffusion helps me understand this even more.
Awesome!
Now, that's a great explanation for Diffusion Models.
You're so beautiful and explain the diffusion model in the simplest way. As the Chinese saying goes: tough people don't waste words! (人狠话不多)
This is a very good intro for a quick understanding of the concept 👍
Glad it was helpful!
This was so helpful. Love this format of starting easy and adding layers of explanation.
Great to hear, thanks!
Great video for beginners! Really helpful, Thank you!
The way you describe it is how to get the 'original' picture back. But all the content generated is new (in that combination). I was waiting for an explanation of the step that describes how a new combination (so, new content) is generated into existence from the latent space through diffusion, not only the method for getting the starting picture back from the noise...
You pair it with CLIP, which takes in a text string and an image and outputs the distance between them. You denoise while lowering this distance.
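Roughly, in code, that CLIP-guidance idea looks like the sketch below. `denoiser` and `clip_model` are hypothetical stand-ins for a trained diffusion network and a CLIP-style encoder; real guided pipelines differ in the details (step sizes, schedules, augmentations):

```python
import torch
import torch.nn.functional as F

def clip_guided_step(x, t, text_emb, denoiser, clip_model, guidance_scale=100.0):
    """One denoising step nudged toward the prompt (sketch; all objects hypothetical).
    denoiser(x, t) -> predicted noise; clip_model.encode_image(img) -> embedding."""
    x = x.detach().requires_grad_(True)
    eps = denoiser(x, t)                      # model's estimate of the noise in x
    x0_est = x - eps                          # crude estimate of the clean image
    img_emb = clip_model.encode_image(x0_est)
    # Cosine distance between image and text embeddings; lower = closer match.
    dist = 1.0 - F.cosine_similarity(img_emb, text_emb).mean()
    grad = torch.autograd.grad(dist, x)[0]
    # Remove a bit of predicted noise AND move against the CLIP-distance gradient.
    return (x - 0.1 * eps - guidance_scale * grad).detach()
```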
The video did explain this very briefly. It's as simple as creating a random noise image and feeding it into the same model you used during training. It will turn it into a real image exactly the same way it happened during training. You just get rid of the part that turns an image into noise and use the part that turns noise into an image. You don't have to use CLIP or any text at all. That's a whole other ball game
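For anyone who wants to see what "feed random noise into the trained model" looks like concretely, here's a minimal sketch, assuming a DDPM-style model trained to predict the added noise (`model` is a hypothetical stand-in; the schedule constants follow the standard linear schedule):

```python
import torch

@torch.no_grad()
def sample(model, shape=(1, 3, 64, 64), T=1000):
    """Start from pure Gaussian noise and denoise step by step (DDPM sketch).
    `model(x, t)` is assumed to return the predicted noise eps_theta."""
    betas = torch.linspace(1e-4, 0.02, T)            # standard linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                           # random noise image, no original needed
    for t in reversed(range(T)):
        eps = model(x, torch.tensor([t]))            # predict the noise present in x_t
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()   # DDPM posterior mean
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise           # sigma_t^2 = beta_t variant
    return x
```

Note that nothing here requires text or CLIP: you just run the noise-to-image half of what was trained.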
It would be good if you also explained the reverse process in as much detail as you explained the forward process.
Appreciate your explanation skill.
Q. What is a diffusion model?
Ans. Let's say you tell your best friend, Sarah, about this amazing new flavor. Sarah gets excited and tells her friend, Tom. Then Tom tells his cousin, Emily. Emily, in turn, tells her family, and the news keeps spreading from person to person, creating a chain reaction. This process of your ice cream flavor information spreading from one person to another is like how a drop of ink spreads in water. At first, it's just a small spot, but then it spreads out and covers more and more area as time goes on.
In the diffusion model, experts study how things, whether it's information, ideas, or products like your ice cream flavor, spread through a community of people. They try to understand how fast it spreads, how many people it reaches, and what factors influence its spread. By understanding these patterns, they can learn a lot about how people share and adopt new things!
Such a great video to dive in! I'm live streaming learning about Diffusion, right now!
Really good video! I have checked a few blogs explaining how diffusion models work and still could not understand. But after seeing your video once, I have a better understanding of how diffusion actually works! Thanks a lot!
That's great to hear, Zhao! Thank you for watching. :)
It doesn't hold up to scrutiny or retraining. This is a core concept, not what was actually used to create Stable Diffusion. Kohya and similar diffusers-related programs show the more commonly used training methods and math if you analyze the code and pay attention to the optimization routes and the specific inference-selection code itself. It doesn't fit the paper and often branches outward into almost renegade realms to provide clarity to the input and output of the fine-tuning or LoRA creation.
For example:
A common way to train is step by step, but there are also a great number of low-step, high-density training methods now. Quite a few, actually. This singular concept was only the original one.
Another problem here is that the part 1 example is based on physics, when there's actually a substantial amount of math based entirely on chaos theory now for diffusion analysis and inference. There's been a great deal of added complexity since this original video.
Diffusion models actually predict a bit of noise to remove from the input noisy image at inference time. The noise is added to images just to produce training data.
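In standard DDPM notation (Ho et al., 2020), that's exactly the split: the forward process only manufactures training pairs, and each inference step subtracts a predicted bit of noise:

```latex
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \quad \text{(forward: produce training data)}

x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z \quad \text{(inference: remove predicted noise)}
```

Here \epsilon_\theta is the trained network's noise prediction; training minimizes \|\epsilon - \epsilon_\theta(x_t, t)\|^2.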
Great explanation of diffusion models. Thank you.
Glad it was helpful!
Thank you for your explanation!
You're welcome!
Thanks for this video, it was very insightful. I still have a lot to learn about this topic that will revolutionize our world so much.
I want to understand diffusion models so I can understand how it's possible for artificial intelligence to produce an image. Your explanation helps. A bit.
This video was awesome! Well done :) and thank you
thanks for this great presentation
This was a great explanation! I tried to read the blog first but the maths notation was way over my head
Thank you Chewie :)
my brain cannot handle this
wow someone finally pulled this off
Nice explanation! I feel like the video title is misleading; it is just one explanation going deeper, incomplete without the deeper levels of knowledge, and it differs a lot from other videos that start the explanation from zero at each level. This is more like 4 shades of Diffusion :D Thanks for sharing!
Sorry, but I don't understand something very important. WHY would you add the noise and then subtract the noise? Correct me if I'm wrong, but the rightmost noise image in this example is basically an encoded image of the original dog image, which can be decoded deterministically with the neural network in multiple steps. That's nice and dandy. And I do understand that the noise image is not like a RAR archive, which, were it to be slightly modified, would just yield corruption errors; instead, a modified noise image would still generate... an image. NOW.
1. How do you get from the user text prompt to the noise image of what the user WANTED, that will THEN be denoised (decoded)?
2. How is it that not every OTHER noise result from the text prompt (except previously deterministically encoded images like this dog image, for example) outputs just a garbled mess? And yes, I know that is sometimes the case; I use Stable Diffusion daily.
This only describes how the model is trained. Training set picture -> noise -> reconstructed training set picture.
During actual generation there is no "original", you start from random noise but the model has learned to denoise given the relevant tags.
❤🎉 amazing lecture
Can a diffusion model add noise to image1 and then, in the reverse process, make a different image (not the same one)?????? Please respond!
Excellent
so much good info! Thank you!!!
You're very welcome!
Thanks so much for a useful presentation…what a good idea to present in several levels!
Thanks for clear explanations and link to the blog!
You're very welcome!
AMAZING !! Thanks so much!
Thank you very much: well done.
really amazing video thank you very much! subbed! :)
Wow this is so helpful.
Great to hear!
Thanks for the nice explanation. I'd appreciate it if you could present a similar type of explanation comparing DDPM vs. DDIM.
You're very welcome Vipin! Noted your recommendation!
Thank You !
Can diffusion models be used for denoising audio? If yes, how?
Can you post a follow-up explanation of how text-conditioned generation works? Imagen, I guess, for example used T5, but how does that text embedding actually affect the generated image, and how is it trained? Because in the end we have a system where the "noise" is generated with some text embedding, so I am curious how that process works.
Thank you for this suggestion Rasmus, noted!
The text is transformed into visual embeddings (visual concepts) using another model, and it is fed to the diffusion model alongside the noisy image. The other model is CLIP, which was trained separately on images with associated descriptions.
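In code terms, the conditioning has roughly this shape. A sketch only: `text_encoder`, `unet`, and `scheduler_step` are hypothetical stand-ins, and the `encoder_hidden_states` name mirrors how Stable Diffusion's U-Net receives the text embedding through cross-attention in the Hugging Face diffusers convention:

```python
import torch

def generate(text_encoder, unet, scheduler_step, prompt="a photo of a dog", num_steps=50):
    """Sketch of text-conditioned generation; all three callables are hypothetical
    stand-ins (text encoder, conditional U-Net, denoising scheduler step)."""
    text_emb = text_encoder(prompt)                  # prompt -> sequence of text embeddings
    x = torch.randn(1, 4, 64, 64)                    # start from random latent noise
    for t in reversed(range(num_steps)):
        # The U-Net sees BOTH the noisy latent and the text embedding; cross-attention
        # lets image features attend to the prompt tokens at every step.
        eps = unet(x, t, encoder_hidden_states=text_emb)
        x = scheduler_step(x, eps, t)                # remove a bit of the predicted noise
    return x
```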
Regarding level 3: is every single pixel diffused at each step, or is there a subset that is randomly chosen? Is the sampling separate for every pixel, or do we take one value and then multiply it by each pixel? Subsequent diffusions work on the already-diffused value, I guess (we don't try to remember what the mean of the original pixel was, but just use the new one)?
thanks for this
You bet!
This is an excellent video. Love the format. Well done, more please!
Thanks, will do!
Fascinating stuff, Great explanation.
Thank you!
Thank you so much for the elegant explanation.
Great explanation!
Glad you think so!
A very confusing, yet somehow great explanation of diffusion models. Thank you!
A confusing but somehow positive feedback. :D Thank you!
Excellent:)
Anyone else see a horse in this drop of paint? 1:00
what role do images in the training set play? are diffusion models violating copyright or not?
This depends on the data the model is trained on! There is not just one specific diffusion model; instead you can train as many models as you like. If your training data contains copyrighted images you won't be able to use the model for commercial purposes, but there are many open-source, non-copyrighted datasets out there!
Tysm
You're very welcome. :)
Beautiful❤
Good job.
Can someone explain why do we need to know the initial position of the ink in water if we already knew where the ink was first introduced?
Think of it not as the position but as the shape of the ink. We're trying to reach the initial shape right after the moment it was dropped.
Great Video! B)
Thank you!
that was aweeeeessssommmmmmeeeee
The explanation of adding noise was well done, but the reverse process--by far the strangest process--was not really explained at all. You introduced, but did not explain, some learning process. This unexplained process "somehow" gets back the image. Every "explanation" of SD always skips over this step! Why? (Also skipped, how the text prompt is "combined" with the image. Folks mumble about CL??, but never clearly explain it.) You are a very very good presenter. Please take 15 minutes to "explain" SD.
Yeah, I am having this frustration as well, except I think I may have understood the concepts more poorly than you.
Regarding the process of how the Level 4 part "somehow" gets back to the image, it could be because UNets are just really complicated, so it almost has to be handwaved? Every explanation I've seen of them (which is not many) immediately descends into highly technical language. It's evidently a step wise process but I don't understand really anything about what is happening in each of those steps, and what data is used during them.
I also don't think I understand the point of _gradually_ adding noise to the original image if you just end up with 100% noise at the end, and then that's where the denoising process starts. Exactly when and how are the partially noised images used? In the UNet? If this is the case, either the explanations of UNets I've seen are missing that info, or they're explaining it in a way that I completely fail to comprehend.
In addition, the explanations I've seen tend to use a single image as an example of how the model is trained. But I understand that these models are trained on many images. So the steps laid out in this video are repeated on thousands of images to train the model to generate an image of a dog (or any image??), but how is information from repeating that process combined into the algorithm or latent space or whatever? Do you start with a virgin model or some generalized model or latent space, which then gets modified when you train it on the first image, and then you carry those modifications over when you train the next image? It seems like that ought to be how it works, but if it is, I think a great explanation for how this stuff works would make that explicit.
And then, yeah, how do text prompts work? Both at a basic level, with just a single word prompt like "dog," but also, how are complicated multi-prompt words managed? (I imagine many of the common "mistakes" of diffusion models might be illustrative.)
A U-Net is a standardized deep-learning model that takes an image as input and produces another image (with the same dimensions) as output. It is trained the conventional way, with the so-called gradient descent algorithm, which aims to minimize the least-squares error loss function. In this case, the model aims to predict a mask of the image which represents the noise that was added in the previous step, so that we can simply subtract that noise from the noisy image to get back to the original image.
I hope that was at least somewhat helpful? :)
@@abail7010 UNet predicts the noise, and a scheduler removes the noise from the image right?
@@jocke8277 On a high level, yes that is true! :)
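For the curious, the objective described in this thread fits in a few lines. A minimal sketch, assuming `unet` is a hypothetical noise-prediction network and `alpha_bars` the usual cumulative noise schedule:

```python
import torch
import torch.nn.functional as F

def training_step(unet, x0, alpha_bars, optimizer):
    """One gradient-descent step: teach the U-Net to predict the noise that was added."""
    t = torch.randint(0, len(alpha_bars), (x0.shape[0],))   # random timestep per image
    eps = torch.randn_like(x0)                              # the noise we will add
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps            # jump straight to noisy step t
    loss = F.mse_loss(unet(x_t, t), eps)                    # least-squares error on the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Run over thousands of images, this is how one model accumulates what it learned from all of them: each step nudges the same shared weights.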
@AssemblyAI
Why 255 (probability density graph)? Does it have to do with binary? Network engineer here, and I am trying to draw correlations between IP address ranges being 255, subnet ranges being 255, and the graph you displayed. They all have binary masks in common, which is why I am asking.
It has to do with binary indeed. 0-255 just represents all the possible 8-bit values. When dealing with standard colored pixels we have an 8-bit value for each of the red, green, and blue channels. Having a 24-bit value per pixel is simply the standard; it already gives 16+ million unique colors for a pixel. en.wikipedia.org/wiki/List_of_monochrome_and_RGB_color_formats#24-bit_RGB
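The arithmetic, for anyone who wants to verify:

```python
values_per_channel = 2 ** 8            # 256 values per channel: 0-255
colors = values_per_channel ** 3       # three channels: R, G, B
print(colors)                          # 16777216 -> the "16+ million" unique colors
assert colors == 2 ** 24               # i.e., 24-bit RGB
```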
NICE
"Full noise" that contains a message is not "white noise". These input "white noise" images are just a puzzle containing info for a computer algorithm to solve. I would not, at this point want to bet our future - or even crossing the street - on "advanced AI"
"Full noise" is just used for AI to see patterns in it like a kid see shapes in a noisy TV screen. It is just a way to give imagination to AI. To get what you want you can guide the generation using text prompts.
Wow, this is really great, it definitely helped me understand how these models are working.
However, I did have one question. In your explanation of how Gaussian noise is created for an image, I was a bit confused. I have had to generate an image of pure noise following a Gaussian distribution before, but in those cases I just generated it by calling, for each pixel, a function to get a random number following a Gaussian distribution, usually centered where 0.5 would be the zero value for that distribution, basically remapping the -1 to 1 distribution to 0 to 1, i.e. Xnew = (X/2) + 0.5. Hopefully that makes sense. But the way you described it sounded like the noise was created by placing a sort of splat on the image following a Gaussian distribution, and then placing subsequent splats in positions based on that first splat's position on the image. I guess this is needed so you can generate all the in-between time steps from image to pure noise, rather than just the final image. But I didn't quite get exactly how you are creating the noise. For example, are you actually splatting a sort of Gaussian distribution that spans several pixels for each position, or is it just affecting that one pixel? I could see it happening both ways and wasn't quite certain from your explanation which one was happening. That is, do you come up with the position, then on that one pixel create a single value that follows the Gaussian distribution curve? Or are you placing some splat that is brightest at its center but falls off to zero, following a Gaussian distribution curve? If the latter, how wide is that, i.e. what would be the radius in pixels? And in either case, how is that mixed with the image? Do you multiply the image by the value in that pixel from the noise you generated, or do you blend between them?
I doubt anyone will read this, as it's quite a long comment/question for a YouTube video, but I thought it wouldn't hurt to try, as I am very interested in how these models work and the under-the-hood details...
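To the splat question above: in the standard DDPM formulation (a sketch of that formulation below, which may differ from whatever the video's visuals suggested), the noise is drawn independently per pixel and per channel (no spatial splat or falloff radius), and it is blended with the image by a weighted sum, not multiplied in:

```python
import torch

def noisy_image(x0, t, alpha_bars):
    """Standard DDPM forward process (sketch). Noise is i.i.d. N(0,1) for every
    pixel and channel, and is mixed into the image by a weighted sum."""
    eps = torch.randn_like(x0)                        # one independent Gaussian sample per value
    ab = alpha_bars[t]                                # cumulative schedule value in (0, 1)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps   # blend: scale image down, noise up
```

As t grows, alpha_bars[t] shrinks toward 0, so the image term fades and the noise term dominates; that is how all the in-between time steps between image and pure noise are produced.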
Great explanation, easy to follow. So in essence, the first step is fixed, then it's variable for the decoding, if I understand it right?
The first step is just to generate training data: final images with corresponding noisy images and the number of steps used to add noise.
Learnt a lot of new things from this video.
Why it is called a U-Net.
Why it is called a diffusion model.
What a diffusion model does and how it does it.
Thanks
Thanks
You're very welcome!
Great
Having just watched 5 videos on this, umm, "topic?" I feel as if I have been in a coma for 25 years. I am looking for the simplest possible explanation on how this whole AI thing works, yet there don't seem to be any videos that can explain that without using already established terminology that, to me, is completely foreign. Your video is obviously well made, and you are good at explaining this, especially with the example of a drop of paint in water, but I am obviously so far from even beginning anything beyond. Apart from understanding "noise", I have no clue as to what "diffusion", or "model" or anything means. I could always watch videos on any topic, i.e. quantum physics, rocket science, robotics, or anything, and get the basic idea, but this time I feel like I'm years behind... If you could make a video explaining this as if you would explain it to someone in kindergarten, I would definitely come back and watch
I think the last level makes sense for people who actually do deep learning (like me). To get the proper background you need to learn about neural networks first.
Good explanation, but I do hate when papers add needless maths and physics that are tangential at best, when they should be describing their model in a simple way.
“OK level one… non-equilibrium thermodynamics” 🥴
level 0 - annealed Langevin dynamics
Hahah I understand the frustration :D But it's just what Diffusion Models are based on so you don't actually have to understand non-equilibrium thermodynamics. :D -Mısra
would you say the process is fractal?
It most definitely is not fractal
I wish my brain was smart enough to understand!
This feels like it should not be possible...then again, its not too different from us humans imagining faces in the clouds. Computers just take this hallucination to the next level.
I have one question. I hate maths but I love to train models. I tried to learn math but god, it's 😵😵. Any advice?
Stick to applied ML then. In that case you can make use of existing frameworks and libraries to implement models for solving problems, without knowing the workings under the hood.
1. But if you do want to understand the math, the only way is to refer to better learning resources and keep trying iteratively. Often it's not the math alone, but the way it is taught, that makes a whole lot of difference in one's understanding. For example, back in grad school I used to refer to Salman Khan's math videos to get an actual understanding of linear algebra concepts (which I could not attain even after reading a few standard books).
2. Having said that, each one of us has to maintain a trade-off between math deep dives and actual implementation. No one knows everything 100%.
@@ujjalkrdutta7854 What you said about math is true. I'm sticking with applied ml for now. There is a lot to explore there. Thank you for your time
Any level >= 5 ?
0:48 - The first sentence out of your mouth was not level 1 dawg. "thermodynamic equilibrium". You should start with showing how a drop of paint spreads in water (a phenomenon everyone either already knows or can easily see) and then explain what's going on and give the definition
The whole language part is missing.
Wow! There’s another video if yours below this one, and your hair is so different that I didn’t recognize that it’s you.
I feel like these steps are just steps, not increasing in difficulty.
Fine, you add noise to an image and then restore it. VERY simple concepts (even if very hard in practice). But the magic of DALL-E, Midjourney & Stable Diffusion is the creation of NEW images. This is the third video I'm watching that explains the same trivial diffusion concept. Guess I'll have to ask ChatGPT instead.
Exactly! I've watched and read numerous explanations of diffusion models, but not one so far has told me how the process ends with an image DIFFERENT from the one with which it began.
In the level 1 explanation, what’s the point of introducing the phrase “thermodynamic equilibrium”? Most lay people understand what it means when we say food coloring diffuses into clear water. Reminding the viewer why that happens from a physics standpoint makes the level 1 explanation less clear, not more clear.
I came to the comments to see if this was Mandy Moore.
Your level 2 should have been level 1
I dunno about all that, I just type in 'boobs' and the thing delivers. Whatever math those silicon wafers decide to subject themselves to, that's on them.
Lost me at level one 😅
6 minutes explaining nothing and at the end.. blablabla super fast about convolution... and nothing clear :/
Who is the lady? Her @
You are beautiful
Sorry, but this video is very frustrating. Nothing was explained in terms of either the technique for reversing or how it relates to new image creation when prompting, which is obviously what we are mostly interested on.
Then this just isn’t the video for you. This was purely explaining the concept
Helped me a lot
This stuff just sucks man
Can we take a moment to appreciate how silly it is to say, "we're gonna explain this in 4 levels - 1 being the easiest, 4 being the hardest"
and immediately starting level 1 with: "diffusion models were inspired by non-equilibrium thermodynamics from physics and as you can understand from the name this field deals with systems that are not in thermodynamic equilibrium"
next time ask ChatGPT to write it for you lmao, imagine going up to a five year old and being like,
"Hey kid, you're familiar with thermodynamic equilibrium right? Well the area of machine learning concerned with image generation using diffusion models takes that principle, but is inspired by its inverse."
Please note that what she referenced in Level 1 is secondary school stuff. The authors obviously assumed *this* basis to build upon, not that of a five-year-old kid.
who all think she's AI generated ??
She might not have a technical background. No technical person would mispronounce variance as variation.