The fourth little pig said, "I'm going to build my house out of A100s and the big bad wolf won't ever get in because then he wouldn't be able to train his image models anymore." The big bad wolf said, "I'll huff and I'll puff and I'll just use 4090s to train hypernetworks instead, and I'll blow your house down!" Took him a while, though.
What is koiboi?
How does koiboi work?
Why should we care?
A: because he makes awesome videos about the underlying stuff not many talk about... Tnx!
Lel
as someone who is not actually involved with machine learning, I really enjoy your videos. They ensure that I stay up to date and fundamentally understand the subject. Thank you very much & keep going!
Your videos are so great. It’s a pity you don’t make them any more
Laughed at 20:30 😆. Thanks for the explanation, I've always wondered how ControlNet worked like black magic!
Love this style of video. Please keep deep diving into papers in this way!
I think this must be the first SD video I have seen where I actually understood and learned something. Thanks!
Great video as always, just if you can be a bit more dramatic on camera that would be perfect.
-_-
Great work! But we as a community really should team up and buy him a more decent mic 😁
lol I'll order a good one off amazon next time I do a shop
19:40 - "I just downloaded this image of a gentleman dabbing, from the Internet." LOLLOOLLLOL, this made my day! Understatement of the month.
Fantastic as usual :) Thank you for that delightful detailed explanation!
This video is just perfect. I really love your style of making videos. That makes it really fun to watch :)
Thanks for your videos, helps me stay updated here. Really good work!
Love your videos. There's not a lot of info on SD for people who are technical and want details, but who are not AI experts or hardcore programmers. You've found the sweet spot. Your explanations have helped me quickly make a great leap forward in understanding and using SD. Thanks for putting these together.
Wonderful video! You do such a great job breaking down these concepts.
oh boy, always koi great video
The boy is back!! 😁
You are a little bit of a totalitarian (based on another video I saw), but you're good at making educational videos on this topic. Kudos.
I love these videos. I have been using Stable Diffusion a lot lately, and as a software developer it is great to actually know how it works "under the hood".
The fact it was trained in under a week on a single consumer-grade GPU is massive. I wonder, could ControlNet be applied to LLMs like Llama/Alpaca/etc?
Cool stuff, and great paper! If I understand the concept correctly then there is one other major advantage of this hypernetwork-like approach, which is modularity. You can train multiple smaller external networks and swap them, or even interject them, without having to redo the whole main model. This means you can swap in feature-detecting external networks to improve your image(s) on the specific aspects that you find important, without the hassle of finding the correct model or the training time it would take to achieve such a model. That's huge!
Good job dude. I've been trying to understand this for a few days and this made it all click. Well done.
Amazing stuff, great explanation way better than many tutorials. Thank you for your work and time.
I was looking everywhere for an explanation like this. So helpful, thank you!
Thanks for sharing the detailed mechanisms. Doing the heavy lifting for us!
You are so gifted at explaining AI stuff. I'd need your systematic way of explaining in the LLM/NLP/GPT realm. My AI life started with Stable Diffusion, but I don't see a way to make a living with images. The more lucrative way seems to be helping companies to "integrate" their domain knowledge into an LLM to build knowledge systems. With your explanation through comparison of fine-tuning strategies (DB, TI, LoRA, HN) in mind, I get confused in the text-only realm. I see zero-shot prompts, few-shot prompts, "context from a chat history", embeddings of mini-, midi-, maxi-text corpora - sometimes stored locally in a JSON, sometimes in a vector store, sometimes elsewhere, some other way. Everything is somehow called fine-tuning or embeddings without a clear demarcation. I'd need a systematic overview of all the possibilities.
Awe my ma se kind, thanks for the deep dive.
Thank you!
You are an fn rockstar
Amazing video, I hope you make more of videos like this!
best explanation i've heard - thanks for taking the time to put this together!
Thanks for the video! Another cool aspect, besides being able to apply multiple ControlNet models at the same time, is that you can apply them to variations of the original locked model. There are all these merges and mixes of 1.5, but they are close enough to the original that these external networks work with them as well.
With all this modularity, you can actually have the case of using, say, Realistic Vision (which lists 10 other variations of SD1.5 that it merged together), put three modes of ControlNet on that, as well as Latent Couple to split the image into regions, and Composable LoRA so you can have different LoRAs apply in those different regions.
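A rough sketch of that stacking idea in code, assuming the diffusers library and the standard lllyasviel SD1.5 ControlNet checkpoints. The Realistic Vision model ID and the hint image filenames are placeholders, and Latent Couple / Composable LoRA are A1111 extensions that aren't shown here.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Two external networks trained against vanilla SD1.5...
canny_net = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
depth_net = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)

# ...attached to a merged SD1.5 variant instead of the original base model.
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V2.0",      # placeholder: any close-enough 1.5 merge
    controlnet=[canny_net, depth_net],     # multiple ControlNets at once
    torch_dtype=torch.float16,
).to("cuda")

canny_hint = load_image("canny_edges.png")  # placeholder conditioning images
depth_hint = load_image("depth_map.png")

image = pipe(
    "a man dabbing, photo, detailed",
    image=[canny_hint, depth_hint],
    controlnet_conditioning_scale=[0.6, 0.6],  # per-network weights, as suggested further down
).images[0]
image.save("dab.png")
```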
Just subscribed before you go crazy. I love to learn how things work and this was a good video.
Once I was in a basic computer class and I was laughing while the teacher was describing the things the program could do (it was a "Word and Excel" class). The teacher stopped and asked what was so funny, and I explained that I was just marveled by the power of those tools... Well, I laughed through this entire video in the same way; the way AI in general is evolving is mind-blowing. Also, how some people find ways to solve big problems with simple solutions always puts a smile on my face.
A really big thank you to you for all the effort you put into this. It is common to learn about the "how 2" but very rare to learn the "how" as well. Since AI and its influences will change SO much in the next few years and beyond, I am very glad you included a very, very important part of this powerful technology: basic understanding. It allows others to make appropriate assessments. And since AI can be used to change, reshape, or distort just about anything, it's hugely important, and (unfortunately) often overlooked, to also understand how all of this is possible. The video taught me a lot. I enjoy it, it's interesting, and most of all it's responsible the way you do it. So thank you! :D
Excellent ELI5 explanation and overview, thank you for taking the time to create the video, and for your use of many visuals and diagrams. I'm a visual person, and it helps me understand things better, and to see (literally) how all the different parts relate and work together. I'd never be able to visualize and understand it like this just by reading the paper like you have.
You should make a version of the dabbing image with Jabba the Hutt, and label it "Jabba Dab-a-Do!". I know he's not Fred Flintstone...so maybe that joke wouldn't work, unless you put Jabba next to Fred, who's looking at him with an expression of "Wtf dude?"
Anyways thanks for this video, I like your style of explaining complex topics in simple terms for simpletons like me.
So entertaining and insightful, thank you koiboi. Really enjoyed it!
Another amazing video! Love seeing these in my feed. Entertaining and informative
Great explanation. A few questions came to my mind after watching this:
1. Will it also work without any new controlling variable like a depth map? I think the method should still work and it could be used for transfer learning instead of traditional fine-tuning.
2. Will ControlNet without the new controlling variable work the same as the hypernetwork you described in the previous video? To me it doesn't seem like that's the case: as I understood it, a hypernetwork outputs additional layers that dynamically change the structure of the original network, while ControlNet outputs activations that are then summed with the activations of the original network (see the sketch after this list).
3. If it's really more efficient than other methods of fine-tuning models then could we use it for other tasks? Like instead of fine-tuning transformer or CNN we could also add a controlnet to them making the training process (hopefully) faster.
4. Are there tests of how many training examples are needed to train this controlnet? I understood that it was trained for a pretty long time on a pretty big dataset but it seems like it could also work for small datasets. LoRA, hypernetwork, dreambooth and textual inversion work for small datasets and controlnet doesn't seem like that much different of an idea compared to those methods.
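For question 2 in particular, here is a toy PyTorch sketch of the wiring being asked about: a frozen block, a trainable copy, and zero-initialized 1x1 convolutions whose output is added to (not substituted for) the original activations. The block contents and shapes are simplified assumptions, nothing like the real SD UNet.

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 conv whose weights and bias start at zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    def __init__(self, locked_block: nn.Module, channels: int):
        super().__init__()
        self.locked = locked_block                # original block, kept as-is
        self.copy = copy.deepcopy(locked_block)   # trainable copy
        self.zero_in = zero_conv(channels)        # injects the condition
        self.zero_out = zero_conv(channels)       # gates the copy's contribution
        for p in self.locked.parameters():        # freeze the original
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor, condition: torch.Tensor) -> torch.Tensor:
        # The locked path is exactly the original network; the copy sees the input
        # plus the (zero-conv'd) condition, and its output is *added*, not swapped
        # in. That addition is the difference from a hypernetwork that generates
        # new layers/weights for the main model.
        return self.locked(x) + self.zero_out(self.copy(x + self.zero_in(condition)))

block = ControlledBlock(nn.Conv2d(4, 4, kernel_size=3, padding=1), channels=4)
out = block(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64))
print(out.shape)  # torch.Size([1, 4, 64, 64])
```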
Love the explanation, but the dabbing at the end was a special treat 😊
I would say it is the difference between Straciatella icecream and vanilla ice with chocolate sprinkles, both are vanilla icecream but one type straciatella you trained the whole icecream to be different and whatever you do you will allways get straciatella and it would be a hell of a job to only get vanilla out of that box, where as regular vanilla, you can add the chocolate sprinkles and add it to where it is satisfactory to you or just omit them according to your taste, but it's a lot faster, because you just add a bit to the big model without losing the big model, whereas in straicatella you change the model to something completely different and you can't finetune it anymore
I like your sense of humor, keep these videos coming 😁
AI and controlNet evolve so fast that you almost can do a monthly vido how to install Automatic1111, ControlNett and what fun and intressent models a beginner should get :)
Thank you for the videos, explanations and updates.
Thank you for the explanation. I wonder is something like this can be done with a LORA type of training. I mean, maybe we can do customized control nets? And maybe eg subject training can be more efficiently done with higher accuracy?
i would love to also see videos from you about the little innovations. like the tile upscaling, self attention guidance, or segmenting.
there are already youtubers talking about it but they just show how to apply the magic. 🐒
YOU would actually explain the tech behind it so we can truly understand how it works and why and when we wanna use it. :)
edit: no shit. your videos are like the best resource so far for me.
A great mid-level explanation. Very helpful for those versed in machine-learning to bridge the gap.
Love this so much! I actually ended up reading this paper two nights ago. Very easy read. Love your insight about it. Do you have a discord or patreon?
s8rVscu2pM , hope this helps!
Hopefully comment stays up. Doing my best here to keep the invites up. I don't see my previous comments mentioning that.
I DO have a discord (forgot to link actually)! But fair warning: I don't really know how to manage discords properly. discord.gg/CNTQPUqK
I've been obsessing over this paper for like 3 days now, I'm thinking of maybe doing like, an ablation study to determine how useful zero convolutions actually are, cuz like, wut???
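For what it's worth, a hypothetical ablation toggle for that zero-convolution question could be as small as flipping the init of the bridging convs and comparing training runs; this helper is made up for illustration.

```python
import torch.nn as nn

def bridge_conv(channels: int, zero_init: bool = True) -> nn.Conv2d:
    """Same wiring either way; only the starting point of the bridge differs."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    if zero_init:                    # ControlNet's choice: start as a no-op
        nn.init.zeros_(conv.weight)
        nn.init.zeros_(conv.bias)
    return conv                      # ablation run: zero_init=False
```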
@@pipinstallyp you're getting shadowbanned by Neal Mohan bro.. watch out D:
@@lewingtonn It's about time Neal captures my soul eternally into an NFT and I become forever a citizen of YouTube's hivemind.
Excellent explanation. As always. Love your work!
Keep 'em coming, this is fantastic on the topic.
I've been having fun using controlnet. Pretty interesting that we dont actually know why something is better. It's like throwing at a wall and seeing what sticks
Amazing vid mate. Super clear and concise.
I love these educational videos.
Thanks for the information. I wonder how many other specific factors there are to train for. I'd love one for lighting.
Thank for the video! Look forward to more!
Excellent work, keep doing it ❤
This is so cool! The main model seems to be an artist who is good at drawing everything, and control net seems to be the expertise in drawing a cat. So when you only train the control net, you are training the expertise in drawing a cat. But if you don't freeze the main model and train them all, you are training the artist to be someone who can only draw good cats, and that's overfitting.
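A minimal sketch of what "freeze the artist, only train the expert" looks like in training code, assuming the diffusers ControlNet utilities; the base model ID is a placeholder.

```python
import torch
from diffusers import ControlNetModel, UNet2DConditionModel

# The "artist": the pretrained UNet, loaded and then locked.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"   # placeholder base model
)
unet.requires_grad_(False)

# The "expert": a trainable copy of the UNet's encoder half.
controlnet = ControlNetModel.from_unet(unet)
controlnet.requires_grad_(True)

# The optimizer never even sees the artist's parameters, so they cannot be
# overwritten (and the artist cannot be overfit into a cats-only painter).
optimizer = torch.optim.AdamW(controlnet.parameters(), lr=1e-5)
```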
Love your video! great explanation!
Awesome video, accurate, but still it made me giggle at times xD
This seems to be the same approach the LLMs are taking now: AGENTS or SWARMS are the future of general-purpose LLMs. Same thing with image creation: split out the smaller tasks and only require them to activate when needed.
Well done, thanks for creating this video!
great vid thanks! 20:30 killed me lmao
Do you have those C part 4tries somewhere? I would like to watch them, as they probably have a lot of nice info indeed (some other side-of-camera series?). Newcomer here, enjoying your work!
Dude, you're awesome, thanks for this
Using the full ControlNet seems to be somehow merging the output from Deliberate (my original choice) and SD 1.5 (the downloaded ControlNet). Is ControlNet a type of LoRA?
> Is ControlNet a type of LoRA?
My thought exactly. ControlNet might not require the low rank decomposition, but at a high level it's the same idea, or am I missing something?
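A toy comparison of the two ideas being debated here, with made-up modules: LoRA perturbs a frozen layer's weights with a low-rank update, while a ControlNet-style branch leaves the weights alone and adds activations computed from an extra signal. Both are small trainable add-ons to a frozen model, which is why they feel similar at a high level.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank weight update: W x + B A x."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)    # A
        self.up = nn.Linear(rank, base.out_features, bias=False)     # B
        nn.init.zeros_(self.up.weight)                                # starts as a no-op

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

class ControlledLinear(nn.Module):
    """Frozen linear layer plus a trainable side branch fed by an extra condition."""
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.branch = nn.Linear(base.in_features, base.out_features)
        nn.init.zeros_(self.branch.weight)                            # zero-init bridge
        nn.init.zeros_(self.branch.bias)

    def forward(self, x, condition):
        return self.base(x) + self.branch(x + condition)

x, cond = torch.randn(1, 16), torch.randn(1, 16)
print(LoRALinear(nn.Linear(16, 16))(x).shape)              # torch.Size([1, 16])
print(ControlledLinear(nn.Linear(16, 16))(x, cond).shape)  # torch.Size([1, 16])
```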
This is so good man. Keep these coming. Are you a researcher or what? What's your Twitter?
you are doing (a dabbing) Gods work, thank you for this video
this is such a great video, would you be able to make a video on IPAdapters?
I'm new to Stable Diffusion, only been using it for a few days and it was fun.
now, I'm only 10 minutes into the video but I wanna make sure I'm understanding it correctly.
basically, there are 2 people working on a task.
person 1 doing the main task, person 2 helping.
we feed them both with a task.
the first guy starts doing the job, but then the 2nd guy stops the 1st guy and is like "wait a minute. I know how to do this. let me handle it and then I'll teach you later."
2nd guy: "so here's how you do it. this and that, blablablabla, done"
both done with the job and we get the result exactly how we want.
if we don't have the 2nd guy, the 1st wouldn't've known how to do our specific job, because he wasn't trained to do that.
am I correct at this point?
come back koiboi, come back...
Your videos are amazing!
controlnet's outta control is what it is lol you blink and in an instant there's 5 new updates and diff applications of it
yeah I'm like... way behind ngl
@@lewingtonn it's so hard! lol I literally can't keep up cause I don't have a fast computer (Mac M1 MacBook Pro with 8GB RAM), but videos like yours do help ppl like me, since we can't do as much testing as we'd like etc
I've been programming for 30 years. This AI brings me back to the early days of computing, when hacking was so much fun.
Man this video is so helpful. Ty
Could this also be used to optimize the training for a new base model?
Literally yes!!
Great analogy! Very clear!
is it possible to train several submodels and add the outputs from each one? Because if yes, that would be something you can't really do with normal fine tuning, I think.
Already done through having multiple networks working inside Automatic1111. Available in Settings -> ControlNet. Actually, that is the best way to do it. Canny + HED models are a nice balance between coherence and speed. Try out smaller weights - about 0.6 per ControlNet.
Pretrained half-UNet external network, bonded to the original SD model; also, CN is more efficient and easier to train than having a model trained with an additional depth channel. Got it. But how exactly does CN intervene in the SD layers? Does it affect the latent-space noise, or what?
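One concrete way to see where CN cuts in, assuming the diffusers implementation: the ControlNet forward pass produces one residual tensor per UNet down-block plus one for the mid-block, and the frozen UNet simply adds those to its own activations. It does not edit the latent noise directly. Model IDs and the random tensors below are placeholders.

```python
import torch
from diffusers import ControlNetModel, UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"   # placeholder base model
)
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")

latents = torch.randn(1, 4, 64, 64)      # noisy latent, not an image
timestep = torch.tensor([10])
text_emb = torch.randn(1, 77, 768)       # stand-in for CLIP text embeddings
hint = torch.randn(1, 3, 512, 512)       # conditioning image (e.g. canny edges)

down_res, mid_res = controlnet(
    latents, timestep,
    encoder_hidden_states=text_emb,
    controlnet_cond=hint,
    return_dict=False,
)

noise_pred = unet(
    latents, timestep,
    encoder_hidden_states=text_emb,
    down_block_additional_residuals=down_res,   # added to the skip connections
    mid_block_additional_residual=mid_res,      # added at the UNet bottleneck
).sample
```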
Niiiiiceee!! Very comprehensive
When you did the comparison of depth2img and controlnet, I assume you were using the usual SD 1.5 controlnet, right? So it's not _entirely_ a fair comparison of the techniques, because depth2img is based on SD 2.0 (which is worse than 1.5).
There have been some SD 2.X controlnet models released recently, though. I don't know whether that would be enough to make a completely fair comparison against the 2.0-depth2img model possible, though.
The depth map of 2.1 is only 64-pixel resolution, whereas the ControlNet depth input is 512 pixels, which means that ControlNet depth maps are of a higher quality.
I'm not 100% sure but I thiiiink the controlnet paper authors condensed the depth map down to 64x64 using another small neural network... which is better than just resizing the depth map of course but still
@@dibbidydoo4318 I also expect the controlnet results to be better for theoretical reasons. But would be nice to see it shown in practice too.
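A guess at what "condensed the depth map with a small network" could look like: a few strided convolutions taking a 512x512 hint down to the 64x64 latent resolution. This is a simplified stand-in for illustration, not the paper's actual hint encoder.

```python
import torch
import torch.nn as nn

hint_encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.SiLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1, stride=2), nn.SiLU(),   # 512 -> 256
    nn.Conv2d(32, 64, kernel_size=3, padding=1, stride=2), nn.SiLU(),   # 256 -> 128
    nn.Conv2d(64, 128, kernel_size=3, padding=1, stride=2), nn.SiLU(),  # 128 -> 64
    nn.Conv2d(128, 320, kernel_size=3, padding=1),  # match the UNet's first feature width
)

depth_map = torch.randn(1, 3, 512, 512)
print(hint_encoder(depth_map).shape)   # torch.Size([1, 320, 64, 64])
```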
You mentioned LoRA as a full training architecture, but isn't it external? 17:25
I can't say much from a technical perspective, since there's a lot about the overall technology that I haven't had time to read up on. However, the logic behind it kind of makes sense to me: by freezing the main model and training the secondary one, you're limiting the scope of what has to be learned. Similarly, to me it almost seems like the method is less "learn something new" and more "find the difference between what it already knows and what it's being given".
Is there a way to run Automatic1111 or ComfyUI locally with ControlNet while abstracting the Stable Diffusion layer using an API to Hugging Face? The idea is to run the user interface (Automatic1111 or ComfyUI) and ControlNet locally on my machine while offloading the heavy lifting (the actual image generation by Stable Diffusion) to an API like Hugging Face. I just want to benefit from the flexibility and control offered by ControlNet while not being limited by local hardware for the image generation process.
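A minimal sketch of the offloading part of that idea, assuming the hosted Hugging Face Inference API for plain text-to-image. The URL, model ID, and token are placeholders, and a plain hosted endpoint generally will not accept ControlNet conditioning, so in practice you would point this at your own ControlNet-capable inference endpoint.

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/runwayml/stable-diffusion-v1-5"  # placeholder
HEADERS = {"Authorization": "Bearer hf_your_token_here"}  # placeholder token

def generate(prompt: str) -> bytes:
    # Heavy lifting happens server-side; only the prompt leaves the machine.
    response = requests.post(API_URL, headers=HEADERS, json={"inputs": prompt})
    response.raise_for_status()
    return response.content  # raw image bytes

with open("out.png", "wb") as f:
    f.write(generate("a man dabbing on a mountain, photo"))
```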
I wonder if this approach is generally true? Maybe the secret to general AI isn't one huge network, but hundreds of smaller, more tightly trained models for specific tasks all connected together in a clever way
this is so so so helpful!
I'm wondering how many images you used to train your ControlNet to produce the dabbing result, or did you just use the Hugging Face ControlNet pretrained model?
So we don't need to download the 45 GB to use ControlNet if the external networks are included in A1111?
I’m reading the paper. Thx.
I'm a bit confused about the difference between the 5gb and 700mb versions of the Controlnet checkpoints. You mentioned that the smaller ones are the external network only, while the bigger ones contain SD1.5. Does this mean I should be using the bigger ones on custom checkpoints downloaded from, e.g., Civitai, and the smaller ones should only be used with vanilla SD1.5?
Awesome and useful tyvm. But I wonder why there is no powerful open-source AI like chatGPT? I mean something like SD but for language. Is language harder to train?
ChatGPT is great but it is like Midjourney. They don't give users as much control over the result as SD does.
Amazing video 👍
I am very new to this whole diffusion topic, but isn't depth2img possible in Stable Diffusion 2?
edit: I realized you probably refer to 1.5 only, correct?
The first 10 minutes: Basically saying that ControlNet is about making and training an external expert, like a hypernetwork "contract lawyer". So if I get this video right, ChatGPT urgently requires a hypernetwork "contract lawyer" or external sub-routine for simple math, Excel-style table sorting, and coding.
C. 10:00 to 15:00: What one may understand after studying ML at university for about 50 years.
15:00 to 19:20: Pretty much regurgitating the simple fact that what makes ControlNet special is the external contract lawyer.
19:20 to 22:20: A few priddy pictures where the guy's going absolutely bonkers about a supposed difference between the outputs from Depth2img and ControlNet I can't see at all, outside of Depth2img obviously having been told or trained to use a fresco style, whereas ControlNet was told to or trained to partly also use a photo style. It seems like he's even ignoring those Depth2img results where they have the same shirt folds and cherrypicking those few ControlNet results that are pretty much the same.
22:20 to 23:00: Explaining that Depth2img needs huge industrial GPUs worth $15-30,000 a piece, whereas ControlNet can be used with consumer GPUs.
23:00 to end: More gushing about a quality difference I can't see. I mean, it's cool that unlike Depth2img, you can do it at home, but as said, I don't see the actual visual difference.
So, bottom line of the entire 25-minute video: ControlNet means you train external little sub-routines. That's why you can do it at home: the sub-routines are much smaller, and you don't have to be a millionaire to use it.
are you available for consultation?
would love to join your discord but invite is expired? Also, do you have Patreon?
If you train a model, will using ControlNet not work super well, since it was trained on the base model?
Is there any advantage to training your own controlnet?
is there any reason you couldn't apply this control net strategy to work for language models?
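Nothing in the recipe seems image-specific, so as pure speculation prompted by this question, a ControlNet-flavoured wrapper around a frozen transformer layer might look like the sketch below; all names, shapes, and the control signal are made up for illustration.

```python
import copy
import torch
import torch.nn as nn

class ControlledTransformerLayer(nn.Module):
    def __init__(self, locked_layer: nn.TransformerEncoderLayer, d_model: int):
        super().__init__()
        self.locked = locked_layer
        self.copy = copy.deepcopy(locked_layer)       # trainable copy
        self.locked.requires_grad_(False)             # frozen "big model" layer
        self.copy.requires_grad_(True)
        self.zero_proj = nn.Linear(d_model, d_model)  # zero-init bridge
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, x: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # Locked path untouched; the copy's output rides in additively.
        return self.locked(x) + self.zero_proj(self.copy(x + control))

layer = ControlledTransformerLayer(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), d_model=64
)
tokens = torch.randn(2, 10, 64)      # (batch, sequence, features)
control = torch.randn(2, 10, 64)     # whatever extra signal you want to inject
print(layer(tokens, control).shape)  # torch.Size([2, 10, 64])
```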
Awesome!
Excellent !!!!
I applaud your idea of answering the question of 'What is xxx"? I was then completely confused by the example used of "Jesus dabbing". WTF is 'Dabbing'? I'm afraid I only speak English and not the gobbledegook that 'dabbing' seems to describe. Could somebody please enlighten me in words of one syllable or less, please?
Please make a video about LoCon and also the vae revolution.
Bro please link me, never heard of this
@@lewingtonn cant link here. It deletes my comment
I appreciate the technical explanation but, in all honesty, I boiled it down after a single picture:
ControlNet prioritizes image-to-image for denoising the subject, and text-to-image fills in the details.
I love how these researchers basically applied layers, or rather layering processes, to a neural network. Like, they are basically breaking down what a designer would do manually, but they have to do EVERYTHING from scratch. LOL. What a roundabout way to develop a tool for people to be lazy.
er... it is more like multiple AIs working together cooperatively. Like an architect who has an engineer to design the building, an artist to draw the concept, a landscaper to design the landscape, etc.