NVAE: A Deep Hierarchical Variational Autoencoder (Paper Explained)
- Published May 17, 2024
- VAEs have been traditionally hard to train at high resolutions and unstable when going deep with many layers. In addition, VAE samples are often more blurry and less crisp than those from GANs. This paper details all the engineering choices necessary to successfully train a deep hierarchical VAE that exhibits global consistency and astounding sharpness at high resolutions.
OUTLINE:
0:00 - Intro & Overview
1:55 - Variational Autoencoders
8:25 - Hierarchical VAE Decoder
12:45 - Output Samples
15:00 - Hierarchical VAE Encoder
17:20 - Engineering Decisions
22:10 - KL from Deltas
26:40 - Experimental Results
28:40 - Appendix
33:00 - Conclusion
Paper: arxiv.org/abs/2007.03898
Abstract:
Normalizing flows, autoregressive models, variational autoencoders (VAEs), and deep energy-based models are among competing likelihood-based frameworks for deep generative learning. Among them, VAEs have the advantage of fast and tractable sampling and easy-to-access encoding networks. However, they are currently outperformed by other models such as normalizing flows and autoregressive models. While the majority of the research in VAEs is focused on the statistical challenges, we explore the orthogonal direction of carefully designing neural architectures for hierarchical VAEs. We propose Nouveau VAE (NVAE), a deep hierarchical VAE built for image generation using depth-wise separable convolutions and batch normalization. NVAE is equipped with a residual parameterization of Normal distributions and its training is stabilized by spectral regularization. We show that NVAE achieves state-of-the-art results among non-autoregressive likelihood-based models on the MNIST, CIFAR-10, and CelebA HQ datasets and it provides a strong baseline on FFHQ. For example, on CIFAR-10, NVAE pushes the state-of-the-art from 2.98 to 2.91 bits per dimension, and it produces high-quality images on CelebA HQ as shown in Fig. 1. To the best of our knowledge, NVAE is the first successful VAE applied to natural images as large as 256×256 pixels.
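The "residual parameterization of Normal distributions" mentioned in the abstract (and covered in the "KL from Deltas" section of the video) makes the posterior predict only deltas relative to the prior, so the KL term depends only on those deltas. A minimal scalar sketch, derived from the standard closed-form KL between two Normals (not the paper's code):

```python
import math

def residual_kl(delta_mu, delta_sigma, sigma_p):
    """KL( N(mu_p + delta_mu, (sigma_p*delta_sigma)^2) || N(mu_p, sigma_p^2) ).
    Note it depends only on the residual deltas and the prior scale, not on mu_p."""
    return 0.5 * ((delta_mu / sigma_p) ** 2 + delta_sigma ** 2
                  - math.log(delta_sigma ** 2) - 1.0)

# When the encoder predicts zero residuals, q equals p and the KL vanishes:
assert abs(residual_kl(0.0, 1.0, sigma_p=2.0)) < 1e-12
```

Parameterizing the deltas directly keeps the KL small and stable early in training, which is part of why the deep hierarchy trains at all.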
Authors: Arash Vahdat, Jan Kautz
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher - Science & Technology
As a new PhD student in this field, I can literally not thank you enough for making this content!
I legit expected a 'Dear fellow scholars' at the start! Lol.
Two very good channels.
What a time to be alive!
Hang on to your papers!
Please entertain the idea of a sit-down paper commentary with Dr. Károly Zsolnai-Fehér. I mean, who thinks that would be beyond awesome?
This one is 2 hour papers
Thank you so much for making this video! It's so hard to find friendly content when you really start digging into these topics, and this is an absolute lifesaver!
This was so helpful! Thank you!
You (or whoever is narrating) are really good at describing things clearly and exactly.
We need an ACNE dataset with no smooth faces to test the true power of these generative methods
Lol, is there such a thing?
CelebA - High School Edition :D
Great talk. Thanks for taking time to read through this. The heavier linear algebra can be a bit daunting without maths background. You help it become a bit more digestible.
One thing missing in explanation is that you prefer a distribution over latent codes in order to get a continuous smooth latent space in which you can sample new (unseen) interpolated latent codes which are still valid ones. This provides generative power of VAE.
Absolutely true!
I wonder how much practical application this still has, given that StyleGAN lets you walk the 'latent space' as well and generates 1024×1024 insanely realistic images.
@@Anonymous-vh9tc You must test it yourself to really see the level of controllability, how fast inference is, the ease of training, and the variety and realism of the generated output.
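The point above about a smooth latent space is what makes interpolation work: codes between two valid latents should still decode to plausible images. A minimal sketch of linear interpolation between latent codes (any decoder call would be model-specific, so it is omitted here):

```python
def lerp(z1, z2, t):
    """Linearly interpolate between two latent codes; with a smooth latent
    space, intermediate codes still decode to plausible images."""
    return [(1.0 - t) * a + t * b for a, b in zip(z1, z2)]

# Midpoint of two 2-D latent codes:
assert lerp([0.0, 0.0], [2.0, 4.0], 0.5) == [1.0, 2.0]
```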
Super awesome paper review! Much thanks.
Awesome explanation. Great job explaining VAEs without ELBO. This is a cool and conceptually simple way of building hierarchical VAEs (unlike say BIVA which is a nightmare)
very well presented, thanks.
I wonder how hard it is to train this NVAE... How long does it take to train compared with StyleGAN2? Their results are very good! It seems like a dream to generate high-quality samples without facing mode collapse. I'm curious about the downsides.
Hi Yannic,
Great explanation. One small doubt I wanted to clear up: the generation of these face images is more like crops. How does this or any generative method work on reconstructing different types of datasets? For example, take the COCO dataset. Will object scale be maintained in this hierarchical architecture if I pass an image of a beach with people at different scales in it? If yes, can it be used as a pretext task for object detection/image classification/segmentation on COCO-like datasets?
It would be worth a try!
I wonder how VQ-VAEs compare - they are much simpler conceptually and practically and seem to address the same issues. You mentioned them briefly in the Jukebox video but they are probably worth their own video.
Excellent 👌
I just saw the images and woow! The model produces crisp images. Wondering what would be the output for videos.
Could you use a similar hierarchical method to learn image classification which avoids some of the adversarial issues of one pass learning
Interesting choice using SE instead of self-attention (which has been proven to generate good images as well, in SAGAN). Maybe it's due to memory limitations that they chose channel attention instead of position-wise attention.
I was wrong - you CAN get details in (hierarchical) VAEs. I, too, was struck by the smoothness and "cutout"-like character of these faces. It seems like it handled lighting very differently, especially small-scale skin texture and oily skin shine. I suppose it would have been more realistic if there had been more z layers with less upconversion at each stage, but from the number of tricks being played it looked like NVIDIA was already struggling to make it fit in memory, and perhaps to converge.
Yes, I agree. They compensate for the usual blurriness of VAEs with more layers, which results in this multi-scale smoothness, which looks just a bit weird.
Can you do a video explaining normalizing flows?
Yeah, that would be a nice topic to have. In the meantime, I can suggest this wonderful blog by Lilian Weng - lilianweng.github.io/lil-log/2018/10/13/flow-based-deep-generative-models.html
I think what the decoder outputs is also a distribution instead of directly an image. The reconstruction error is then the likelihood of the input image X given this distribution from the decoder. You have to sample to get different images from the same latent code. You don't get blurry images due to this multiple sampling, but from the assumption that the image pixels are uncorrelated. VQ-VAE-2 amends this by making the unit-normal prior learnable with an autoregressive PixelCNN model and by training the multiscale encoder/decoder pairs simultaneously.
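The comment above is right that the reconstruction term is a log-likelihood of the input under the decoder's output distribution. A minimal sketch for a single pixel under a Normal decoder (the independence assumption means the image-level term is just the sum over pixels):

```python
import math

def gaussian_log_likelihood(x, mu, sigma):
    """Log-density of pixel value x under the decoder's Normal(mu, sigma^2).
    Summing this over (assumed independent) pixels gives the reconstruction term."""
    return (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2.0 * sigma ** 2))

# The likelihood peaks when the decoder mean matches the pixel value:
assert gaussian_log_likelihood(0.5, 0.5, 1.0) > gaussian_log_likelihood(0.9, 0.5, 1.0)
```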
"make VAEs great again" .. hm? ;-) Impressive engineering work. Swish in the wild. Much nicer than anything I got out of my VAE models. I wonder whether decoders could be added at each level to train with auxiliary losses at each level.
I wonder if putting a gan discriminator at the end of the decoder could help remove the cartoonish look of the generated images.
perhaps the output images' skins are so smooth because the celebs (the training data) won't have it any other way!
They also show samples from FFHQ, which is a dataset with normal people as opposed to only celebs. There the effect seems the same. So I do not think that the dataset is the problem.
So, the smoothing effect is observed due to the loss function finding an easy way out to generate a smoothed representation of the input face.
Although the faces are unrealistic (they are so clean and smooth), one may borrow techniques from this paper to make an Instagram filter to make people's faces look clean and fancier.
How did you do it? Can you share with me? Thank you.
The description makes me believe VAEs are inferior to GANs, and the paper outlines heavy engineering necessary to put them on par. Can you think of examples where for a generative task, training a VAE is more beneficial than a GAN? What's the "more-blurry, less-crisp" equivalent for say generating Music?
It depends on what you measure them by. The machinery for the two is different so what each of them excels at is different as well. I'm not familiar with the current state but GANs have serious issues with mode collapse so it requires a lot of engineering to prevent mode collapse with GANs and you don't end up with as much variety. In the case of music, you lose a lot of high frequencies if your music is overly "blurry", which is the same as in images.
How do you train such a variational auto-encoder? I mean, the sampling process is not differentiable, is it?
It's called the reparameterization trick. Essentially, you isolate the sampling process from the gradient update path.
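A minimal scalar sketch of that trick: the noise is drawn once from a fixed standard Normal, and the sample becomes a deterministic, differentiable function of the encoder's outputs (mu, log_var).

```python
import math
import random

def reparameterize(mu, log_var, eps=None):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, 1).
    All randomness lives in eps, so gradients can flow through mu and log_var."""
    if eps is None:
        eps = random.gauss(0.0, 1.0)
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * eps

# With eps fixed, the sample is deterministic: sigma = exp(0) = 1, so z = 2.0 + 1.5
assert reparameterize(2.0, 0.0, eps=1.5) == 3.5
```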
They haven't published any code yet, waiting for this one
Same
github.com/NVlabs/NVAE
These NVAE faces are still in the uncanny valley for me. I don't have this feeling with (for example) StyleGAN.
14:00 nope, it's not just you. There's something uncanny valley about those predictions
Would love to see videogames have faces this crisp
very nice and all but 32 V100 GPUs, 200 hours for ImageNet :S
where is the code? can anybody help?
I don't think it's out yet
As a friend up there pointed out: github.com/NVlabs/NVAE
Seems like you should be operating directly on wavelets instead of pixel space.
You're right that these look like silicone puppets of humans or something. Very close but juuust a little off.
I wonder how that might get fixed.
I think if we give the model lighting info (direction and magnitude of the light in the scene), then it might do a better job.
I wonder if it would be cheap enough to train a small convnet GAN to learn to transform the texture of images from the AE generator - pretty much a [narrowly focused] style transfer to improve "realness" through texture.
@Yannic NLP papers please.
I would be more confident in the entire variational auto-encoder method if they used photos of people with facial irregularities, like people over 65 years old. Why not choose difficult samples?
Zeroth (not first)
First
Apparently, having super smooth skin can look more uncomfortable than being pleasant.
it's the uncanny valley
I think this model can be used for automatic beautification, lol
Lol omg at puppet bit.
Perfect for producing fake baby pictures 😂
Too many faces... LOL
These faces are not real, just like how Yannick is not a real person.