VQ-GAN | Paper Explanation
- Published May 30, 2024
- Vector Quantized Generative Adversarial Networks (VQGAN) is a generative model for image modeling, introduced in Taming Transformers for High-Resolution Image Synthesis. The approach is built in two stages. The first stage trains in an autoencoder-like fashion: images are encoded into a low-dimensional latent space and vector-quantized against a learned codebook, and a decoder then projects the quantized latent vectors back to the original image space. Encoder and decoder are fully convolutional. The second stage trains a transformer on the latent space; over the course of training it learns which codebook vectors go together and which do not. The transformer can then be used autoregressively to generate previously unseen images from the data distribution.
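The quantization step described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repo's implementation: a nearest-neighbour codebook lookup with the straight-through gradient trick, with codebook size and latent dimension chosen arbitrarily.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):                 # z: (B, H*W, dim) encoder output
        # L2 distance from every latent vector to every codebook entry
        d = torch.cdist(z, self.codebook.weight)   # (B, H*W, num_codes)
        idx = d.argmin(dim=-1)                     # nearest code per position
        z_q = self.codebook(idx)                   # quantized latents
        # Straight-through estimator: copy decoder gradients to the encoder
        z_q = z + (z_q - z).detach()
        return z_q, idx

vq = VectorQuantizer()
z = torch.randn(2, 16, 64)                         # fake encoder output
z_q, idx = vq(z)
print(z_q.shape, idx.shape)  # torch.Size([2, 16, 64]) torch.Size([2, 16])
```

The returned `idx` grid is exactly what the second-stage transformer is trained on.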
#deeplearning #gan #generative #vqgan
0:00 Introduction
0:42 Idea & Theory
9:20 Implementation Details
13:37 Outro
Further Reading:
• VAE: towardsdatascience.com/unders...
• VQVAE: arxiv.org/pdf/1711.00937.pdf
• Why CNNS are invariant to sizes: www.quora.com/How-are-variabl...
• NonLocal NN: arxiv.org/pdf/1711.07971.pdf
• PatchGAN: arxiv.org/pdf/1611.07004.pdf
PyTorch Code: github.com/dome272/VQGAN
Follow me on instagram lol: / dome271
Really cool video! 😎Can't wait for the next one.
omg u here??? i know u from your videos. thats so cool!
@@NoahElRhandour Haha, I can only reply with: omg, u recognize me??? That is so cool!
Yes, I am here. I have to keep a close eye on the competition! 😆
@@AICoffeeBreak i see :D
What an amazing video. Please keep up the great work! :)
By far the best video on VQVAE. Great job, outlier!
Excellent visualization for this smooth transition from VQVAE -> VQGAN (focus on main idea first and details second). 10/10
Incredible video! Can't tell you how much clearer everything is now. Looking forward to the future of your channel!
That's so nice to hear, and motivating. The next video, about cross-attention, is already in the making!
This is such a great channel!!!! Why didn't I find it earlier? Thanks a lot for the great work...
Your videos are great! Super clearly explained :) Thanks!!
that made some things click in my understanding! thanks a lot
awesome!! More of this please.
after three days of struggling with the paper, I found this amazing explanation of VQ-GAN.
Nice explanation and visualizations!
Didactically, visually, and content-wise absolutely insane, big props
Brilliantly explained
Incredible videos
Great work !!!!
Thank you for this video, now I can be better
So excited for the next one!
Very cool video
Hey! Really great video:) I have one question. Imagine you want to use a diffusion model to learn image-to-image translation, more specifically, from segmentation masks to synthetic images. Then, you can have a tool to create images from hand-painted segmentation masks, and then, you can augment a dataset and see if state-of-the-art segmentation networks trained with the augmented dataset improve its performance. Do you know a diffusion model for this image-to-image translation task with some explanations and available repos?
Thank you so much for the explanation
Hopefully one can now go ahead with CLIP and create a free version of DALL-E-like text-to-image models
The strange pattern in the reconstructed image and the generated image is likely caused by the perceptual loss. I have no idea why, but it disappears when I take the perceptual loss away.
Those pictures that were generated with VQGAN are surprisingly coherent. How do you do that?
great video
Why make 2 loss functions with sg instead of optimizing ||E(x) -z_q||_2^2 directly?
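For context on the question above: the paper's objective splits that term into a codebook loss and a commitment loss using the stop-gradient operator (sg), so the codebook and the encoder can be pulled toward each other at different rates (β weights the encoder side); optimizing ||E(x) − z_q||² directly would move both with equal force. A minimal PyTorch sketch of the two terms, using `detach()` as sg (the β value is the one commonly used, not taken from this video):

```python
import torch

def vq_losses(z_e, z_q, beta=0.25):
    """Codebook + commitment loss from VQ-VAE, with detach() as stop-gradient."""
    codebook_loss = ((z_e.detach() - z_q) ** 2).mean()  # moves codebook toward encoder
    commit_loss = ((z_e - z_q.detach()) ** 2).mean()    # moves encoder toward codebook
    return codebook_loss + beta * commit_loss

z_e = torch.randn(4, 16, requires_grad=True)  # stand-in encoder output E(x)
z_q = torch.randn(4, 16, requires_grad=True)  # stand-in quantized latents
loss = vq_losses(z_e, z_q)
loss.backward()
print(z_e.grad is not None, z_q.grad is not None)  # True True
```

Both tensors receive gradients, but each only through the term where it is not detached.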
Crisp explanation! I would request you to talk a little bit slower; it would be really helpful. Keep up the good work.
cool!
I can't find the VQGAN paper!
I love you, math man ❤️
you are truly an outlier!
How do we decide on what goes to the codebook? Is it filled with random vectors?
It seems to be the case, and they converge over the course of training
Hmm, isn't trying to train the whole network (encoder and decoder) using the discriminator just too complicated? Wouldn't it result in a loss function so complex that minimizing it with gradient descent would be inefficient? I mean, wouldn't it take longer to train?
Hence the following idea: why not use separate discriminators to train the decoder and the encoder separately? Yes, it would be quite a lot more complicated to design, but I guess it's worth giving it a shot 😀
If someone knows whether something like this has already been done (cuz I have a feeling it probably has), may they enlighten me, thanks
Hey, great video. Can you tell me why random sampling of codebook vectors doesn't generate meaningful images? In a VAE we sample from a standard Gaussian; why doesn't the same work for VQ autoencoders?
Because in a VAE you only predict a mean and a standard deviation, so sampling is easier. Sampling the codebook vectors independently ignores the dependencies between positions, and this is why the output isn't meaningful.
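To illustrate the reply above: VQGAN's second stage fixes this by sampling code indices autoregressively, each conditioned on the previous ones, instead of independently. A toy sketch of that sampling loop; the conditional table here is random noise standing in for a trained transformer, so it only shows the mechanics, not real dependencies.

```python
import torch

num_codes, seq_len = 8, 16
# Toy "prior": logits over the next code given the previous one.
# In VQGAN this conditional is produced by a trained transformer.
cond_logits = torch.randn(num_codes, num_codes)

indices = [torch.randint(num_codes, (1,)).item()]   # random first code
for _ in range(seq_len - 1):
    probs = torch.softmax(cond_logits[indices[-1]], dim=-1)
    indices.append(torch.multinomial(probs, 1).item())

# Independent sampling, by contrast, ignores all such dependencies:
independent = torch.randint(num_codes, (seq_len,)).tolist()
print(len(indices), len(independent))  # 16 16
```

The decoded image is only coherent when the index sequence respects the dependencies the prior has learned.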
This is insanely good
Thanks boy :)
Please speak louder in the video; your voice is low. :)