NVAE: A Deep Hierarchical Variational Autoencoder (Paper Explained)
- Published May 17, 2024
- VAEs have been traditionally hard to train at high resolutions and unstable when going deep with many layers. In addition, VAE samples are often more blurry and less crisp than those from GANs. This paper details all the engineering choices necessary to successfully train a deep hierarchical VAE that exhibits global consistency and astounding sharpness at high resolutions.
OUTLINE:
0:00 - Intro & Overview
1:55 - Variational Autoencoders
8:25 - Hierarchical VAE Decoder
12:45 - Output Samples
15:00 - Hierarchical VAE Encoder
17:20 - Engineering Decisions
22:10 - KL from Deltas
26:40 - Experimental Results
28:40 - Appendix
33:00 - Conclusion
Paper: arxiv.org/abs/2007.03898
Abstract:
Normalizing flows, autoregressive models, variational autoencoders (VAEs), and deep energy-based models are among competing likelihood-based frameworks for deep generative learning. Among them, VAEs have the advantage of fast and tractable sampling and easy-to-access encoding networks. However, they are currently outperformed by other models such as normalizing flows and autoregressive models. While the majority of the research in VAEs is focused on the statistical challenges, we explore the orthogonal direction of carefully designing neural architectures for hierarchical VAEs. We propose Nouveau VAE (NVAE), a deep hierarchical VAE built for image generation using depth-wise separable convolutions and batch normalization. NVAE is equipped with a residual parameterization of Normal distributions and its training is stabilized by spectral regularization. We show that NVAE achieves state-of-the-art results among non-autoregressive likelihood-based models on the MNIST, CIFAR-10, and CelebA HQ datasets and it provides a strong baseline on FFHQ. For example, on CIFAR-10, NVAE pushes the state-of-the-art from 2.98 to 2.91 bits per dimension, and it produces high-quality images on CelebA HQ as shown in Fig. 1. To the best of our knowledge, NVAE is the first successful VAE applied to natural images as large as 256×256 pixels.
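The "residual parameterization of Normal distributions" mentioned in the abstract (and covered in the "KL from Deltas" section of the video) makes the posterior predict only deltas relative to the prior, so the KL term depends only on those deltas. A minimal scalar sketch, derived from the standard closed-form KL between two Normals (not the paper's code):

```python
import math

def residual_kl(delta_mu, delta_sigma, sigma_p):
    """KL( N(mu_p + delta_mu, (sigma_p*delta_sigma)^2) || N(mu_p, sigma_p^2) ).
    Note it depends only on the residual deltas and the prior scale, not on mu_p."""
    return 0.5 * ((delta_mu / sigma_p) ** 2 + delta_sigma ** 2
                  - math.log(delta_sigma ** 2) - 1.0)

# When the encoder predicts zero residuals, q equals p and the KL vanishes:
assert abs(residual_kl(0.0, 1.0, sigma_p=2.0)) < 1e-12
```

Parameterizing the deltas directly keeps the KL small and stable early in training, which is part of why the deep hierarchy trains at all.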
Authors: Arash Vahdat, Jan Kautz
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher - Science & Technology
As a new PhD student in this field, I can literally not thank you enough for making this content!
I legit expected a 'Dear fellow scholars' at the start! Lol.
Two very good channels.
What a time to be alive!
Hang on to your papers!
Please entertain the idea of a sit-down paper commentary with Dr. Károly Zsolnai-Fehér. I mean, who thinks that would be beyond awesome?
This one is 2 hour papers
Thank you so much for making this video! It's so hard to find friendly content when you really start digging into these topics, and this is an absolute lifesaver!
This was so helpful! Thank you!
You (or whoever is narrating) are really good at describing things clearly and exactly.
We need an ACNE dataset with no smooth faces to test the true power of these generative methods
Lol, is there such a thing?
CelebA - High School Edition :D
Great talk. Thanks for taking time to read through this. The heavier linear algebra can be a bit daunting without maths background. You help it become a bit more digestible.
One thing missing in explanation is that you prefer a distribution over latent codes in order to get a continuous smooth latent space in which you can sample new (unseen) interpolated latent codes which are still valid ones. This provides generative power of VAE.
Absolutely true!
I wonder how much practical application this still has, given that StyleGAN lets you walk the 'latent space' as well and generates 1024×1024 insanely realistic images.
@@Anonymous-vh9tc You must test it yourself to really see the level of controllability, how fast inference is, the ease of training, and the variety and realism of the generated output.
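The point above about a smooth latent space is what makes interpolation work: codes between two valid latents should still decode to plausible images. A minimal sketch of linear interpolation between latent codes (any decoder call would be model-specific, so it is omitted here):

```python
def lerp(z1, z2, t):
    """Linearly interpolate between two latent codes; with a smooth latent
    space, intermediate codes still decode to plausible images."""
    return [(1.0 - t) * a + t * b for a, b in zip(z1, z2)]

# Midpoint of two 2-D latent codes:
assert lerp([0.0, 0.0], [2.0, 4.0], 0.5) == [1.0, 2.0]
```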
Super awesome paper review! Much thanks.
Awesome explanation. Great job explaining VAEs without ELBO. This is a cool and conceptually simple way of building hierarchical VAEs (unlike say BIVA which is a nightmare)
very well presented, thanks.
I wonder how hard it is to train this NVAE... How long does it take to train compared with StyleGAN2? Their results are very good! It seems like a dream to generate high-quality samples without facing mode collapse. I'm curious about the downsides.
Hi Yannic,
Great explanation. One small doubt I wanted to clear up: the generation of these face images is more like crops. How does this or any generative method work on reconstructing different types of datasets? For example, take the COCO dataset. Will object scale be maintained in this hierarchical architecture if I pass an image of a beach with people at different scales in it? If yes, can it be used as a pretext task for object detection/image classification/segmentation on COCO-like datasets?
It would be worth a try!
I wonder how VQ-VAEs compare - they are much simpler conceptually and practically and seem to address the same issues. You mentioned them briefly in the Jukebox video but they are probably worth their own video.
Excellent 👌
I just saw the images and woow! The model produces crisp images. Wondering what would be the output for videos.
Could you use a similar hierarchical method to learn image classification which avoids some of the adversarial issues of one pass learning
Interesting choice using SE instead of self-attention (which has been proven to generate good images as well, in SAGAN). Maybe it's due to memory limitations that they chose channel attention instead of position-wise attention.
I was wrong - you CAN get details in (hierarchical) VAEs. I, too, was struck by the smoothness and "cutout"-like character of these faces. It seems like it handled lighting very differently, especially small-scale skin texture and oily skin shine. I suppose it would have been more realistic if there had been more z layers with less upconversion at each stage, but from the number of tricks being played it looked like NVIDIA was already struggling to make it fit in memory, and perhaps to converge.
Yes, I agree. They compensate for the usual blurriness of VAEs with more layers, which results in this multi-scale smoothness, which looks just a bit weird.
Can you do a video explaining normalizing flows?
Yeah, that would be a nice topic to have. In the meantime, I can suggest this wonderful blog by Lilian Weng - lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html
I think what the decoder outputs is also a distribution instead of directly an image. The reconstruction error is then the likelihood of the input image X given this distribution from the decoder. You have to sample to get different images from the same latent code. You don't get blurry images due to this multiple sampling, but from the assumption that the image pixels are uncorrelated. VQ-VAE-2 amends this by making the unit-normal prior learnable with an autoregressive PixelCNN model and by training the multiscale encoder/decoder pairs simultaneously.
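The comment above is right that the reconstruction term is a log-likelihood of the input under the decoder's output distribution. A minimal sketch for a single pixel under a Normal decoder (the independence assumption means the image-level term is just the sum over pixels):

```python
import math

def gaussian_log_likelihood(x, mu, sigma):
    """Log-density of pixel value x under the decoder's Normal(mu, sigma^2).
    Summing this over (assumed independent) pixels gives the reconstruction term."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2.0 * sigma ** 2))

# The likelihood peaks when the decoder mean matches the pixel value:
assert gaussian_log_likelihood(0.5, 0.5, 1.0) > gaussian_log_likelihood(0.9, 0.5, 1.0)
```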
"make VAEs great again" .. hm? ;-) Impressive engineering work. Swish in the wild. Much nicer than anything I got out of my VAE models. I wonder whether decoders could be added at each level to train with auxiliary losses at each level.
I wonder if putting a gan discriminator at the end of the decoder could help remove the cartoonish look of the generated images.
perhaps the output images' skins are so smooth because the celebs (the training data) won't have it any other way!
They also show samples from FFHQ, which is a dataset with normal people as opposed to only celebs. There the effect seems the same. So I do not think that the dataset is the problem.
So, the smoothing effect is observed due to the loss function finding an easy way out to generate a smoothed representation of the input face.
Although the faces are unrealistic (they are so clean and smooth), one may borrow techniques from this paper to make an Instagram filter to make people's faces look clean and fancier.
How did you do it? Can you share with me? Thank you.
The description makes me believe VAEs are inferior to GANs, and the paper outlines heavy engineering necessary to put them on par. Can you think of examples where for a generative task, training a VAE is more beneficial than a GAN? What's the "more-blurry, less-crisp" equivalent for say generating Music?
It depends on what you measure them by. The machinery for the two is different so what each of them excels at is different as well. I'm not familiar with the current state but GANs have serious issues with mode collapse so it requires a lot of engineering to prevent mode collapse with GANs and you don't end up with as much variety. In the case of music, you lose a lot of high frequencies if your music is overly "blurry", which is the same as in images.
How do you train such a variational auto-encoder? I mean, the sampling process is not differentiable, is it?
It's called the reparameterization trick. Essentially, you isolate the sampling process from the gradient update path.
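A minimal scalar sketch of that trick: the noise is drawn once from a fixed standard Normal, and the sample becomes a deterministic, differentiable function of the encoder's outputs (mu, log_var).

```python
import math
import random

def reparameterize(mu, log_var, eps=None):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1).
    All randomness lives in eps, so gradients can flow through mu and log_var."""
    if eps is None:
        eps = random.gauss(0.0, 1.0)
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * eps

# With eps fixed, the sample is deterministic: sigma = exp(0) = 1, so z = 2.0 + 1.5
assert reparameterize(2.0, 0.0, eps=1.5) == 3.5
```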
They haven't published any code yet, waiting for this one
Same
github.com/NVlabs/NVAE
These NVAE faces are still in the uncanny valley for me. I don't have this feeling with (for example) StyleGAN.
14:00 nope, it's not just you. There's something uncanny valley about those predictions
Would love to see videogames have faces this crisp
very nice and all but 32 V100 GPUs, 200 hours for ImageNet :S
where is the code? can anybody help?
I don't think it's out yet
As a friend up there pointed out: github.com/NVlabs/NVAE
Seems like you should be operating directly on wavelets instead of pixel space.
You're right that these look like silicone puppets of humans or something. Very close but juuust a little off.
I wonder how that might get fixed.
I think if we give the model lighting info (direction and magnitude of the light in the scene), then it might do a better job.
I wonder if it would be cheap enough to train a small convnet GAN to learn to transform the texture of images from the AE generator - pretty much a [narrowly focused] style transfer to improve "realness" through texture.
@Yannic NLP papers please.
I would be more confident in the entire variational auto-encoder method if they used photos of people with facial irregularities, like people over 65 years old. Why not choose difficult samples?
Zeroth (not first)
First
Apparently, having super smooth skin can look more uncomfortable than being pleasant.
it's the uncanny valley
I think this model can be used for automatic beautification, lol
Lol omg at puppet bit.
Perfect for producing fake baby pictures 😂
Too many faces... LOL
These faces are not real, just like how Yannick is not a real person.