Visual AutoRegressive Modeling: Scalable Image Generation via Next-Scale Prediction

  • Published 15 Dec 2024

COMMENTS • 19

  • @hjups
    @hjups 7 months ago +6

    They didn't discuss a proof of linear scaling with size, only generation time. My guess is that their linear scaling comes from training the VQVAE in tandem with the generation model, which DiT does not do. The frozen VAE sets a minimum limit for the FID score, which I believe for ImageNet 256x256 is somewhere around 1.4 and would require perfect latents. Pixel-space models wouldn't have that issue, but they are much more expensive to train and run.
    That aside, VAR is a clever idea and the generation speed is impressive - I do wonder if they could achieve better results (perhaps with smaller models) if they combined the idea with MaskGiT. It would be a little slower (although a smaller model could make up for that), but it would allow for a self-correction step.

    • @keyutian
      @keyutian 6 months ago +3

      As the author of this work, I'd like to provide two more details XD: 1. VAR uses a frozen VQVAE too. Both VAR's VQVAE and DiT's VAE are trained only before the generation-model training phase. Once trained, the VQVAE/VAE is frozen.
      2. MaskGiT has no way to self-correct, as each token is only generated once. But VAR can potentially do this, because the tokens of each scale are eventually added together to get an output. So if mistakes are made at early scales, later-scale generation can correct them due to the autoregressive nature.
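
A minimal sketch of the residual, multi-scale accumulation described above (the codebook, the scale list, and bicubic upsampling are illustrative assumptions, not the paper's exact operators):

```python
# Per-scale token maps are embedded, upsampled to the final resolution, and
# summed, so a later (finer) scale can offset an error made at an earlier one.
import torch
import torch.nn as nn
import torch.nn.functional as F

def reconstruct_latent(scale_token_maps, codebook: nn.Embedding, final_hw: int):
    """scale_token_maps: list of (B, h_k, w_k) LongTensors, ordered coarse to fine."""
    B = scale_token_maps[0].shape[0]
    C = codebook.embedding_dim
    f_hat = torch.zeros(B, C, final_hw, final_hw)
    for ids in scale_token_maps:
        z_k = codebook(ids).permute(0, 3, 1, 2)                   # (B, C, h_k, w_k)
        z_k = F.interpolate(z_k, size=(final_hw, final_hw),
                            mode="bicubic", align_corners=False)  # upscale to full size
        f_hat = f_hat + z_k                                       # residuals accumulate
    return f_hat  # the VQVAE decoder turns this into pixels
```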

    • @hjups
      @hjups 6 months ago +3

      @@keyutian Thanks for responding!
      1) That was originally unclear from your paper given how you started section 3. But I see now that you split up the training approach in sub-headings. That certainly makes things easier to train!
      2) That depends on how MaskGiT is sampled, correct? You can choose to re-mask an already un-masked token at each step, which would allow the model to self-correct. I do not believe MaskGiT did that, but I believe the Token-Critic paper proposed such behavior (although maybe they also respected the unmasked tokens). Regardless, re-masking does work during sampling for those types of models (a rough sketch of such a sampler follows this comment).
      My point about self-correction was that image models (at least for diffusion) tend to be highly biased toward the low-resolution structure. If an error is generated at a smaller scale, then it tends to propagate forward regardless of the conditioning methodology (e.g. in super-resolution models using concat conditioning or cross-attention). Reducing those errors before propagation would likely yield better results.
      On another note, what are your thoughts on VAR's FID behavior compared to the image quality, especially in the context of diffusion models?
      I am wondering if the use of the quantized latent space is both a blessing and a curse. A VAE and a VQVAE may be trained to achieve a similar rFID, but in practice it would be impossible to achieve that level with a VAE due to the continuous nature of the latent space (except in the case of pure reconstruction). However, a VQVAE has a finite number of codebook entries, meaning that the rFID score could be achieved if those tokens were generated exactly, making the problem easier.
      However, VQVAEs suffer from fine-detail noise / artifacts (VAEs do too, but not to the same degree), which comes from the quantization of the continuous image space. This appears to be a tradeoff of structural generation for fine image quality (which can be desirable in certain application spaces); however, this tradeoff is not properly captured by most fidelity metrics.
      But I did notice that VAR does not handle fine detail well, especially in the online demo - this is especially apparent in high-noise classes like "Fountain".
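
A rough sketch of the kind of re-masking sampler mentioned in the reply above, in the spirit of MaskGIT/Token-Critic-style iterative decoding (the model interface, confidence schedule, and argmax decoding are assumptions for illustration, not either paper's exact procedure):

```python
# Iterative decoding over a (B, N) grid of token ids where previously committed
# tokens may be re-masked each round, giving the model a chance to self-correct.
import torch

def remasking_sample(model, B, N, steps=12, mask_id=0):
    tokens = torch.full((B, N), mask_id, dtype=torch.long)
    for t in range(steps):
        logits = model(tokens)                   # (B, N, vocab), assumed interface
        conf, pred = logits.softmax(-1).max(-1)  # per-token confidence and argmax
        # Keep a growing fraction of the most confident predictions and re-mask
        # the rest, including tokens that were committed in earlier rounds.
        keep = max(1, int(N * (t + 1) / steps))
        thresh = conf.sort(dim=-1, descending=True).values[:, keep - 1:keep]
        tokens = torch.where(conf >= thresh, pred, torch.full_like(pred, mask_id))
    return tokens
```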

  • @alexandernanda2261
    @alexandernanda2261 7 months ago +1

    Best explanation on YouTube

  • @fusionlee844
    @fusionlee844 7 months ago +1

    Thanks, I've been looking for some new generative structures for research these days.

    • @gabrielmongaras
      @gabrielmongaras 7 months ago

      Yea it's always nice to see new generative structures other than the standard autoregressive next-token prediction model or diffusion model!

  • @bruceokla
    @bruceokla 7 months ago +1

    Nice work

  • @Ali-wf9ef
    @Ali-wf9ef 7 days ago

    I didn't quite understand what the quantization part is achieving. Does it serve only as the tokenization step? Or does it contribute to reducing the picture's resolution somehow?

    • @gabrielmongaras
      @gabrielmongaras 7 days ago +1

      It just serves as a way to tokenize the image, since we want to do autoregression. Vector quantization is an easy way to make a model autoregressive on images, though it does reduce quality because you're collapsing a continuous signal into a discrete set of tokens. It's probably far from the best way to model an image, though.
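
To make the quantization step concrete, here is a minimal sketch of vector quantization as a tokenizer (the shapes and the codebook tensor are illustrative assumptions):

```python
# Each continuous latent vector is replaced by the index of its nearest codebook
# entry; those indices are the discrete tokens the autoregressive model predicts.
import torch

def quantize(latents, codebook):
    """latents: (B, C, H, W) encoder features; codebook: (K, C) code vectors."""
    B, C, H, W = latents.shape
    flat = latents.permute(0, 2, 3, 1).reshape(-1, C)   # (B*H*W, C)
    dists = torch.cdist(flat, codebook)                 # distance to every code
    ids = dists.argmin(dim=-1).view(B, H, W)            # discrete token ids
    quantized = codebook[ids].permute(0, 3, 1, 2)       # snapped back to (B, C, H, W)
    return ids, quantized
```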

  • @xplained6486
    @xplained6486 7 months ago

    Isn't this basically a diffusion model, but instead of noising they do blurring (through downsampling) and try to revert the blur (instead of the noise in a DM)? And the vector quantization is similar to the one from Stable Diffusion, as far as I understand. But how does it compare to the general concept of score matching?

    • @gabrielmongaras
      @gabrielmongaras 7 months ago +3

      I like to think of diffusion models as reversing some sort of transformation (like in the Cold Diffusion paper). That's kind of why I think of this as similar to a diffusion process. Where diffusion reverses the process of corrupting an image with noise, this model reverses the process of reducing an image's resolution. However, the objective is significantly different. In diffusion, we train the model to predict all the noise, whereas here we train it to autoregressively predict the next scale. The nice thing about the diffusion objective is that it allows an arbitrary number of steps for generation; this model does not, since it's forced to predict the next step.
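
A hedged sketch of the objective difference described above: diffusion regresses the added noise, while next-scale training is a plain cross-entropy over the next scale's token ids (the networks, the toy noise schedule, and the tensor shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def diffusion_loss(unet, x0, t, noise):
    # Corrupt the image with noise at time t and train the network to predict that noise.
    alpha_bar = torch.cos(t * torch.pi / 2) ** 2           # toy schedule, t in [0, 1]
    x_t = (alpha_bar.sqrt()[:, None, None, None] * x0
           + (1 - alpha_bar).sqrt()[:, None, None, None] * noise)
    return F.mse_loss(unet(x_t, t), noise)

def next_scale_loss(transformer, coarser_scale_tokens, target_tokens):
    # Given all coarser scales, predict the token ids of the next scale with
    # cross-entropy: like next-token prediction, but one whole scale at a time.
    logits = transformer(coarser_scale_tokens)              # (B, N_k, vocab), assumed
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_tokens.reshape(-1))
```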

  • @lawrencephillips786
    @lawrencephillips786 7 months ago +1

    Where are the actual learnable NN parameters in Algorithms 1 and 2? In the interpolation step? Also, you depict r1, r2, and so on as sets of 1, 2, and so on tokens, but shouldn't it be the squares (1, 4, and so on)?

    • @NgOToraxxx
      @NgOToraxxx 7 months ago

      In the encoder and decoder, and a few in the convs after the upscaling (phi).
      The token count is a square for each resolution scale, yes, although the tokens within a scale are predicted in parallel.

    • @lawrencephillips786
      @lawrencephillips786 6 months ago

      @@NgOToraxxx Thanks! What do you mean by the square being predicted in parallel? Do you mean this as a causal masking step like with GPT?

    • @NgOToraxxx
      @NgOToraxxx 6 months ago +1

      @@lawrencephillips786 There's causal masking, yes, but instead of predicting just 1 token from all previous ones, it predicts the tokens of an entire scale from the previous scales. And the scale can attend to itself too (it's initialized by upscaling the previous scale). There are some more details in issue #1 on VAR's GitHub repo.
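
A minimal sketch of the block-wise mask being described, where each scale attends to itself and to all coarser scales but not to finer ones (the scale sizes 1, 4, 9, ... are the squares discussed above; applying the mask additively to the attention logits is an assumption about the implementation):

```python
import torch

def blockwise_causal_mask(scale_sizes=(1, 4, 9, 16)):
    # Rows/columns are ordered coarse to fine; tokens in scale k may attend to
    # every token in scales 1..k, so a whole scale is predicted in parallel.
    n = sum(scale_sizes)
    mask = torch.full((n, n), float("-inf"))
    start = 0
    for size in scale_sizes:
        end = start + size
        mask[start:end, :end] = 0.0   # own scale + all coarser scales visible
        start = end
    return mask  # added to the attention logits before softmax
```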

  • @marinepower
    @marinepower 7 months ago +2

    This paper literally makes no sense. The whole point of autoregressive modeling is to independently sample one token at a time, and condition future tokens on past tokens. This method breaks all that by having the model sample hundreds (perhaps thousands) of tokens independently in one inference step, despite all tokens being in one large joint probability space. The only way this works is if you overtrain your model to such an absurd degree that there is basically no ambiguity anywhere in your generation step, and then you can simply take the argmax over every single logit and have it work out.
    And, if you look at their methodology, that's exactly what you find. This model was trained for **350** epochs over the same data. That is **absurd**. So yeah, don't expect this method to work unless you wildly overtrain it. It has some good ideas (e.g. hierarchical generation), but the rest of its claims are dubious at best.

    • @keyutian
      @keyutian 6 months ago +2

      I strongly disagree with this.
      First, if "independent generation" makes no sense, that's basically saying BERT, MaskGIT, UniLM, and more models make no sense, which is obviously not true.
      If you think about it, when VAR/BERT/MaskGIT/UniLM generates a bunch of tokens in parallel, these tokens **can attend to each other**, which may alleviate that ambiguity to a large extent.
      Second, for that 350-epoch training, well, DiT was trained on the same data for **1400** epochs.
      It's common to train a generative model on ImageNet for hundreds of epochs, because ImageNet is a relatively small dataset by today's standards.

    • @hjups
      @hjups 6 months ago +1

      Most models trained on benchmark datasets are over-trained; this is due to a combination of the smaller dataset sizes (as keyutian mentioned) and a minimum number of training samples required to establish coherent image features. For reference, ImageNet contains around 1.28M images, whereas SD1.5 trained on a 128M image subset of LAION-5B (100x more images). While adding more images would be ideal, it's not possible with a fixed benchmark, whereas Stability could easily add more images to the training set (which they did for SDXL). That said, if you need proof that this works in practice, look at Pixart-alpha. They trained on only 28M images (which included ImageNet pre-training for ~240 epochs to establish the initial image structures and features).

    • @marinepower
      @marinepower 6 months ago +1

      @@keyutian Just because it works in practice doesn't mean it's good science. I think the entire field of ML is plagued with techniques that just barely work and aren't theoretically justifiable, but people use them anyway. Maybe it's unfair to single this paper out since others (BERT, DiT, etc.) do it too, but until papers are held to a higher standard, nothing will change.
      One can easily think of ways to improve sampling that don't need to do this (e.g. having multiple predictions per pixel along with confidence values, or iterative partial sampling, where we commit to a set amount of data each iteration, perhaps based on the predicted confidence values, and predict an image over a set number of steps). There are pretty basic things that can be done, and yet no one does them because it's easier to just follow the herd and train for an absurd number of epochs.