Efficient Text-to-Image Training (16x cheaper than Stable Diffusion) | Paper Explained

  • Published 26 Sep 2024
  • Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training on 32x32. Other works usually use a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme: through its novel design, we achieve a 42x spatial compression.
    If you want to dive deeper into Würstchen, here are the links to the paper & code:
    Arxiv: arxiv.org/abs/...
    Huggingface: huggingface.co...
    Github: huggingface.co...
    We also created a community Discord for people interested in Generative AI:
    / discord
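
    To put the compression numbers above in perspective, here is a quick back-of-the-envelope calculation (plain Python, not code from the paper): the number of latent positions a diffusion model has to process shrinks quadratically with the spatial compression factor.

        image_size = 1024
        for factor in (8, 42):              # a typical 8x compression vs. Würstchen's ~42x
            side = image_size / factor      # approximate side length of the latent grid
            print(f"{factor}x compression: ~{side:.0f} x {side:.0f} latent grid "
                  f"(~{side * side:.0f} spatial positions)")

    At 8x compression a 1024x1024 image still leaves a 128x128 grid (roughly 16,000 positions), while at ~42x it shrinks to roughly 24x24 (roughly 600 positions), which is where the large training and inference savings come from.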

COMMENTS • 81

  • @outliier
    @outliier  1 year ago +5

    Join our Discord for Generative AI: discord.com/invite/BTUAzb8vFY

  • @xiaolongye-y4g
    @xiaolongye-y4g 11 months ago +5

    You are definitely the most detailed and understandable person I have ever seen.

  • @ml-ok3xq
    @ml-ok3xq 7 months ago +8

    Congrats on Stable Cascade 🎉

  • @dbender
    @dbender 7 months ago +2

    Super nice video which explains the architecture behind Stable Cascade. Stage B was nicely visualized, but I still need a bit more time to fully grasp it. Well done!

  • @ratside9485
    @ratside9485 1 year ago +3

    We need more Würstchen! 🙏🍽️

  • @dbssus123
    @dbssus123 1 year ago +2

    Awesome!!! I always wait for your videos

  • @hayhay_to333
    @hayhay_to333 1 year ago +1

    Damn, you're so smart. Thanks for explaining this to us. I hope you'll make millions of dollars.

  • @macbetabetamac8998
    @macbetabetamac8998 1 year ago +1

    Amazing work, mate! 🙏

  • @omarei
    @omarei 1 year ago +2

    Awesome

  • @mik3lang3lo
    @mik3lang3lo 1 year ago +1

    Great job as always

  • @e.galois4940
    @e.galois4940 1 year ago +3

    Thanks very much

  • @mtolgacangoz
    @mtolgacangoz 4 months ago

    Brilliant work!

  • @TheAero
    @TheAero 1 year ago

    Why use a second encoder? Isn't that what the VQGAN is supposed to do?

    • @outliier
      @outliier  1 year ago +1

      Yes, but the VQGAN can only do a certain amount of spatial compression; beyond that it gets really bad. That's why we introduce a second one.

    • @TheAero
      @TheAero 1 year ago

      @@outliier So can we replace the GAN encoder with a better pre-trained encoder and reduce the expense of using 2 encoders instead of one? So fundamentally, start with a simple encoder, then replace it with a better pre-trained one and continue training, so that you also improve the decoder?
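
    A toy sketch of the two-stage idea from the reply above (illustrative layers and strides only, not the actual Würstchen architecture, which uses a trained VQGAN plus a second learned encoder and reaches roughly 42x overall): a first encoder applies a moderate spatial compression, and a second encoder compresses that result much further.

        import torch
        import torch.nn as nn

        first_stage = nn.Conv2d(3, 16, kernel_size=4, stride=4)    # moderate, VQGAN-like compression
        second_stage = nn.Conv2d(16, 16, kernel_size=8, stride=8)  # much stronger second compression

        x = torch.randn(1, 3, 1024, 1024)   # a 1024x1024 RGB image
        z1 = first_stage(x)                 # -> (1, 16, 256, 256)
        z2 = second_stage(z1)               # -> (1, 16, 32, 32)
        print(z1.shape, z2.shape)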

  • @eswardivi
    @eswardivi 1 year ago +3

    Amazing work. I am wondering how this video was made, i.e. the editing process and the cool animations.

    • @outliier
      @outliier  1 year ago +3

      Thank you a lot! I edit all videos in Premiere Pro, and some of the animations, like the compute GPU-hours comparison between Stable Diffusion and Würstchen, were made with manim (the library from 3blue1brown).
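
    For anyone curious about the manim mention above, here is a minimal sketch of an animated bar chart in Manim Community Edition; the GPU-hour values are placeholders for illustration, not the real figures.

        from manim import BarChart, Create, Scene

        class GpuHoursComparison(Scene):
            def construct(self):
                chart = BarChart(
                    values=[200_000, 25_000],                       # placeholder GPU-hour values
                    bar_names=["Stable Diffusion", "Würstchen"],
                    y_range=[0, 200_000, 50_000],
                )
                self.play(Create(chart))

    Rendering it with something like "manim -pql scene.py GpuHoursComparison" produces the animation.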

  • @adrienforbu5165
    @adrienforbu5165 1 year ago +1

    Amazing explanations, good job

  • @timeTegus
    @timeTegus 1 year ago +1

    I love the video. :) And I would love more detail 😮😮😮😮

    • @outliier
      @outliier  1 year ago +1

      Noted. In the case of Würstchen, you can take a look at the paper: arxiv.org/abs/2306.00637

  • @jonmichaelgalindo
    @jonmichaelgalindo 11 months ago

    Awesome work and great insights! ❤

  • @arpanpoudel
    @arpanpoudel 1 year ago +2

    thanks for the awesome content.

  • @Даниильчик
    @Даниильчик 6 months ago +1

    Hi! If it's not a secret, where do you get the datasets for training text2img models? Great video!

  • @mohammadaljumaa5427
    @mohammadaljumaa5427 1 year ago +1

    Amazing job, and I really love the idea of reducing the size of the models, since it just makes so much sense to me!! I have a small question: what GPUs did you use for training? Did you use a cloud provider for that, or do you have your own local station? If the latter, I'm interested to know which hardware components you have. Just curious, because I'm trying to decide between using cloud providers for training vs buying a local station 😊

    • @outliier
      @outliier  1 year ago

      Hey there. We were using the Stability cluster.

    • @outliier
      @outliier  1 year ago

      Local would be much more expensive, I guess. What GPUs are you thinking of buying/renting, and how many?

  • @lookout816
    @lookout816 1 year ago +1

    Great video 👍👍

  • @ChristProg
    @ChristProg 4 months ago

    Thank you so much. But please, I would prefer that you go into the maths and operations in more detail regarding the training of Würstchen 🎉🎉 thank you

  • @factlogyofficial
    @factlogyofficial 1 year ago +1

    Good job guys!!

  • @xyzxyz324
    @xyzxyz324 11 months ago

    well explained, thank you!

  • @nexyboye5111
    @nexyboye5111 1 month ago

    good job guyz!

  • @jollokim1948
    @jollokim1948 9 months ago

    Hi Dominic,
    This is some great work you have accomplished, and definitely a step in the right direction for democratizing the diffusion method.
    I have some questions, and a little bit of critique, if that would be okay.
    You say you achieve a compression rate of 42x; however, is this a fair statement when that vector is never decompressed into an actual image?
    It looks more like your Stage C can create some sort of feature vectors of images in a very low-dimensional space using the text descriptions, which are then used to guide the actual image creation, along with the embedded text, in Stage B.
    In my opinion it looks more like you have used Stage C to learn a feature-vector representation of the image, which is used as a condition, similar to how language-free text-to-image models might use the image itself as guidance during training.
    However, I don't believe this to be a 42x image compression without the decompression. Have you tried connecting a decoder onto the vectors coming out of Stage C?
    (I would believe that vector might not be big enough to create high-resolution images because of its dimensionality.)
    I hope you can answer some of my questions or clear up any misunderstandings on my part.
    I'm currently doing my thesis on fast diffusion models and found your concept of extreme compression very compelling. Directions on where to go next regarding this topic are also very much appreciated :)
    Best of luck with further research.

  • @NoahElRhandour
    @NoahElRhandour 1 year ago +3

    🔥🔥🔥

  • @jeffg4686
    @jeffg4686 6 months ago

    Nice !

  • @flakky626
    @flakky626 7 months ago +1

    Can you please tell me where you studied all of your ML/deep learning? (Courses?)

  • @KienLe-md9yv
    @KienLe-md9yv 4 months ago

    At inference, the input of Stage A (the VQGAN decoder) is discrete latents. Continuous latents need to be quantized to discrete latents (the discrete latents are also chosen from the codebook, by mapping each vector in the continuous latents to its nearest codebook vector). But the output of Stage B is continuous latents, and the output of Stage B goes directly into Stage A... is that right? How does Stage A (the VQGAN decoder) handle continuous latents? I checked the VQGAN paper and this Würstchen paper; it is not clear. Please help me with that. Thank you

    • @outliier
      @outliier  4 months ago

      The VQGAN decoder can also decode continuous latents. It's as easy as that.
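
    A minimal sketch of why that works (toy code, not the actual Würstchen implementation): the decoder itself is just a convolutional network, so nothing forces its input to be a quantized codebook entry; the nearest-neighbour quantization is a separate step that can simply be skipped at inference.

        import torch
        import torch.nn as nn

        latent_dim = 4
        decoder = nn.Sequential(                     # stand-in for a VQGAN decoder
            nn.ConvTranspose2d(latent_dim, 64, kernel_size=4, stride=4),
            nn.GELU(),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),
        )
        codebook = torch.randn(8192, latent_dim)     # hypothetical codebook

        def quantize(z):
            # training-time path: snap each spatial vector to its nearest codebook entry
            b, c, h, w = z.shape
            flat = z.permute(0, 2, 3, 1).reshape(-1, c)
            idx = torch.cdist(flat, codebook).argmin(dim=1)
            return codebook[idx].reshape(b, h, w, c).permute(0, 3, 1, 2)

        z = torch.randn(1, latent_dim, 32, 32)       # e.g. a continuous Stage B output
        print(decoder(z).shape)                      # continuous latents decode directly
        print(decoder(quantize(z)).shape)            # the quantized path gives the same shape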

  • @hipy-tz3qt
    @hipy-tz3qt 1 year ago +1

    Awesome! I have a question: who decided to call it "Würstchen" and why? I am German and just wondering

    • @akashdutta6235
      @akashdutta6235 1 year ago

      Man who loves hot dogs😂

    • @outliier
      @outliier  1 year ago +1

      We called it Würstchen because Pablo is from Spain and we called our first model Paella. And I'm from Germany as well, so I thought let's call the next model after something German lol

  • @jeanbedry3941
    @jeanbedry3941 1 year ago +2

    This is great; models that are intuitive to understand are the best ones, I find. Great job of explaining it as well.

  • @digiministrator
    @digiministrator 10 months ago

    Hello,
    How do I make a seamless pattern with Würstchen? I've tried a few prompts, but the edges are always problematic.

    • @outliier
      @outliier  10 months ago

      Someone on the Discord was talking about circular padding on the convolutions. Maybe you can try that.
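
    A small sketch of that circular-padding idea (an assumption about how one might apply it in PyTorch, not official Würstchen code): switching the decoder's Conv2d layers to circular padding makes the borders wrap around, which tends to produce tileable outputs.

        import torch.nn as nn

        def make_seamless(model: nn.Module) -> nn.Module:
            # Wrap-around padding so the left/right and top/bottom edges line up
            for module in model.modules():
                if isinstance(module, nn.Conv2d):
                    module.padding_mode = "circular"
            return model

    One would apply this to the image-producing decoder before generating the texture; whether it fully removes the seams still depends on the model.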

  • @MiyawMiv
    @MiyawMiv 1 year ago +1

    Awesome

  • @davidyang102
    @davidyang102 1 year ago +1

    Why do you still do Stage A? Would it be possible to just do Stage B directly from the image? I assume the issue is that Stage A is cheaper to train than Stage B?

    • @outliier
      @outliier  1 year ago +1

      Yeah, you can. We actually even tried that out. But it takes longer to learn, and as of now we didn't achieve quite the same results with a single compression stage. The VQGAN is just really neat and already provides a free compression, which simplifies things for Stage B a lot, I think. But definitely more experiments could be made here :D

    • @davidyang102
      @davidyang102 1 year ago +1

      @@outliier Really cool work. Is the use of diffusion models to compress data in this way a generic technique that can be used anywhere? For example, could I use it to compress text?

    • @pablopernias
      @pablopernias 1 year ago

      @@davidyang102 The only issue with text is its discrete nature. If you're OK with having continuous latent representations for text instead of discrete tokens, then I think it could theoretically work, although we haven't properly tried it with anything other than RGB images. The important thing is having a powerful enough signal so the diffusion model can rely on it and only needs to fill in missing details instead of having to make a lot of information up.

  • @KienLe-md9yv
    @KienLe-md9yv 4 months ago

    So, apparently, it sounds like Würstchen is essentially Stage C. Am I right?

    • @outliier
      @outliier  4 months ago

      What do you mean exactly?

  • @darrynrogers204
    @darrynrogers204 1 year ago

    I very much like the image you are using at the opening of the video. The glitchy 3D graph that looks like an image generation gone wrong. How was it generated? Was it intentional or a bit of buggy code?

    • @outliier
      @outliier  1 year ago

      Hey, which glitchy 3D graph? Could you give the timestamp?

    • @darrynrogers204
      @darrynrogers204 1 year ago

      @@outliier The one at 0:01, right at the start. It says "outlier" at the bottom in mashed-up AI text. It's also the same image that you are using for your YouTube banner on your channel page.

  • @saulcanoortiz7902
    @saulcanoortiz7902 8 months ago

    How do you create the dynamic videos of NNs? I want to create a YouTube channel explaining theory & code in Spanish. Best regards.

  • @streamtabulous
    @streamtabulous 1 year ago

    What about decompression times? Are they faster, and would they use fewer resources on older systems?
    Curious if the models from this would benefit users; i.e. most still use the 1.5 and v2 models of SD because the decompression times of SDXL models take so long.

    • @outliier
      @outliier  1 year ago +1

      Hey, we have a comparison of inference times against SDXL in the blog post here: huggingface.co/blog/wuerstchen
      And I think the model should be comparable to SD 1.x in terms of speed.

    • @streamtabulous
      @streamtabulous 1 year ago

      @@outliier I thought those were compression-only times, not decompression times; that's awesome to read.
      People like you are heroes to me.

    • @outliier
      @outliier  1 year ago +2

      @@streamtabulous Hey, those bar charts are for full sampling times, from feeding in the prompt until you receive the final image in pixel space. That is so kind of you, I appreciate it a lot. But people like Pablo, the HF team and the other people helping us out are the real reason this was possible. And I promise this is only the start.

    • @streamtabulous
      @streamtabulous 1 year ago

      @@outliier The whole team is a godsend. I'm on a disability pension (neuromuscular), so I can't afford to pay for something like Adobe Firefly, which is a tick-off for charging given they use Stable Diffusion.
      Being disabled, I game, so I have a GTX 1070, with an RTX 3060 in another system.
      One of the things I miss doing is art and helping people by restoring their photos for free. I have Stable Diffusion on my PCs and I love that it lets me do things I couldn't before, including photo restorations; it makes my life better because doing that gives me joy.
      Knowing from work like yours and your team's that in the near future I'll be able to do not just better art but better, faster, higher-quality photo restorations for people with my hardware means a massive amount to me.
      I'm doing a video tomorrow to help teach people how I use SD and models to restore photos. I only found SD a few weeks ago, but I am working out how to use it in ways to help people with damaged old photos.

  • @davidgruzman5750
    @davidgruzman5750 1 year ago

    Thank you for the explanations! I am a bit puzzled: why do we call the state in the inner layers of an AE "latent", since we can actually observe it?

    • @outliier
      @outliier  1 year ago

      Which "state" are you referring to? The ones from Stage B?

    • @davidgruzman5750
      @davidgruzman5750 1 year ago

      @@outliier I am referring to the one you mention at the 1:27 point of the video. It is probably Stage A.

    • @outliier
      @outliier  1 year ago +1

      @@davidgruzman5750 Ah, got it. Well, you can observe it, but you can't really understand it, right? If you print or visualise the latents, they are not really meaningful. There are strategies to make them more meaningful, though. But just by themselves they are hard to understand. That's what we usually call latents, I would say.

  • @muhammadrezahaghiri
    @muhammadrezahaghiri 1 year ago

    That is a great project; I am excited to test it.
    Out of curiosity, how is it possible to fine-tune the model?

    • @outliier
      @outliier  1 year ago +1

      Hey, there is no official code for that yet. If you are interested, you can give it a shot yourself. With the diffusers release in the next few days, this should become much easier, I think.

    • @swannschilling474
      @swannschilling474 1 year ago +1

      This is very interesting!! 😊

  • @JT-hg7mj
    @JT-hg7mj 11 months ago

    Did you use the same dataset as SDXL?

  • @krisman2503
    @krisman2503 1 year ago

    Hey, does it recover from pure noise or from an encoded x_T during inference?

    • @outliier
      @outliier  1 year ago

      During inference you start from pure noise and begin denoising; after every denoising step, you noise the image again, then denoise again, then noise, and so on.
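
    A rough sketch of that loop (generic ancestral-style sampling, not the exact Würstchen sampler; the hypothetical denoise function stands in for the trained model predicting the clean latent at step t):

        import torch

        def sample(denoise, shape, alphas_cumprod):
            x = torch.randn(shape)                        # start from pure noise
            for t in reversed(range(len(alphas_cumprod))):
                x0_pred = denoise(x, t)                   # denoise step
                if t == 0:
                    return x0_pred                        # final step: keep the clean prediction
                a_prev = alphas_cumprod[t - 1]
                noise = torch.randn_like(x)
                # re-noise the prediction down to the previous, smaller noise level
                x = (a_prev ** 0.5) * x0_pred + ((1 - a_prev) ** 0.5) * noise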

  • @beecee793
    @beecee793 1 year ago

    If I need X time to run inference with SD on a given example GPU, what would I need, and how fast would inference with this be in the same environment? Will it run on my toaster?

    • @outliier
      @outliier  1 year ago +1

      Hey, take a look at the blog post. It has an inference-time bar chart: huggingface.co/blog/wuerstchen

    • @beecee793
      @beecee793 1 year ago

      @@outliier Thank you

  • @lawtonkovac4215
    @lawtonkovac4215 1 year ago

    💘 promo sm

  • @leab.6600
    @leab.6600 1 year ago +2

    Super helpful

  • @EvanSpades
    @EvanSpades 4 months ago

    Love this - what a fantastic achievement!

  • @aiartbx
    @aiartbx 1 year ago +1

    Looks very interesting. Depending on how fast the generation is, real-time diffusion seems closer than expected.
    Btw, is there any Hugging Face Space demo where we can try this?

    • @outliier
      @outliier  1 year ago +1

      Hey thank you! The demo is available here: huggingface.co/spaces/warp-ai/Wuerstchen
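
    For anyone who prefers running it locally instead of in the Space, here is a short sketch using the diffusers integration mentioned above (the model id is assumed from the linked demo and blog post, and a CUDA GPU is assumed):

        import torch
        from diffusers import AutoPipelineForText2Image

        pipe = AutoPipelineForText2Image.from_pretrained(
            "warp-ai/wuerstchen", torch_dtype=torch.float16   # assumed model id
        ).to("cuda")

        image = pipe(
            "an astronaut riding a horse, photorealistic",
            height=1024,
            width=1024,
        ).images[0]
        image.save("wuerstchen_sample.png")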