Lumiere: A Space-Time Diffusion Model for Video Generation (Paper Explained)
- Published 17 May 2024
- #lumiere #texttovideoai #google
LUMIERE by Google Research tackles globally consistent text-to-video generation by extending the U-Net downsampling concept to the temporal axis of videos.
OUTLINE:
0:00 - Introduction
8:20 - Problems with keyframes
16:55 - Space-Time U-Net (STUNet)
21:20 - Extending U-Nets to video
37:20 - Multidiffusion for SSR prediction fusing
44:00 - Stylized generation by swapping weights
49:15 - Training & Evaluation
53:20 - Societal Impact & Conclusion
Paper: arxiv.org/abs/2401.12945
Website: lumiere-video.github.io/
Abstract:
We introduce Lumiere - a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
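The abstract's core idea, downsampling in time as well as space and then upsampling back to full frame rate, can be sketched with plain numpy pooling. This is an illustrative toy, not the paper's learned operators: the average pooling, nearest-neighbour upsampling, factor of 2, and the function names `spacetime_downsample`/`spacetime_upsample` are all assumptions made for the sketch.

```python
import numpy as np

def spacetime_downsample(video, t_factor=2, s_factor=2):
    """Average-pool a video over time AND space (the STUNet idea).
    video: (T, H, W, C); assumes T, H, W are divisible by the factors."""
    T, H, W, C = video.shape
    v = video.reshape(T // t_factor, t_factor,
                      H // s_factor, s_factor,
                      W // s_factor, s_factor, C)
    return v.mean(axis=(1, 3, 5))

def spacetime_upsample(video, t_factor=2, s_factor=2):
    """Nearest-neighbour upsampling back to full frame rate / resolution."""
    return (video.repeat(t_factor, axis=0)
                 .repeat(s_factor, axis=1)
                 .repeat(s_factor, axis=2))

video = np.random.rand(16, 64, 64, 3)    # 16 frames of 64x64 RGB
coarse = spacetime_downsample(video)     # (8, 32, 32, 3): half rate, half res
restored = spacetime_upsample(coarse)    # (16, 64, 64, 3)
```

The point of the temporal axis in the pooling is that the coarsest level of the network sees the whole clip at once, instead of only sparse keyframes.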
Authors: Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri
Links:
Homepage: ykilcher.com
Merch: ykilcher.com/merch
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
LinkedIn: / ykilcher
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Category: Science & Technology
I love the no-nonsense attitude when discussing these papers. I think the goal of ML authors needs to shift more to helping people understand their work, instead of showing off their researchy prowess.
"Sticker" is so cool! The fact that it can learn to hallucinate these styles over time so well is mind blowing...
At @31:29, the anchor frames are learned rather than manually selected. That is the difference between the previous architecture and this one.
It's basically style in the time dimension. Neat.
As always, nice comments and explanation! As for the paper: our team did this in a different domain a few years ago; too bad we didn't publish "4D U-Net with attention". Of course, the application here is a lot more interesting, but the tech is the same.
You’re the best, I always enjoy your opinions 😊
I think the cross connections of the U-Net help the coherency of the frames (with regard to them being "just like key frames").
42:30 I like videos like this because I would never have noticed the citation from the same author's paper after reading the paper for an hour, or half an hour at double speed. Not sure I'd have noticed at all.
@31:44 Those are NOT just key frames.
An ordinary key frame contains only RGB; however, this key frame at 31:44 contains MORE than just RGB.
Those extra channels can encode movement information, even information from other time frames (carried from the downsampling side across frames), and other information that helps with global consistency.
Yeah I thought the same
I think Yannic knew all that; it's basically what he said right after, for a minute. I think what Yannic meant was that the claim of global consistency was a bit overstated.
The problem with the encoded latent key frames is that when you upsample them again and choose a kernel size smaller than the overall dimension, you can again get global artifacts, because the distant resulting frames cannot communicate information with each other (for CNNs this is called the receptive field). Meaning this only works for these very short videos of 5 seconds!
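The receptive-field point above can be made concrete with the standard CNN receptive-field recurrence: for a stack of identical temporal convolutions, the receptive field grows only linearly without striding, but much faster once temporal downsampling (stride) is added, which is exactly what temporal down/up-sampling buys. The function name here is mine, a small sketch rather than anything from the paper.

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field of a stack of identical 1-D convolutions:
    rf grows by (kernel_size - 1) * jump per layer, where the jump
    (distance between adjacent outputs in input coordinates)
    multiplies by the stride at every layer."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# With stride-1 temporal convs, the temporal receptive field grows linearly:
assert receptive_field(4, kernel_size=3) == 9
# With temporal stride (downsampling), it grows geometrically:
assert receptive_field(4, kernel_size=3, stride=2) == 31
```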
It's also worth pointing out that Tero Karras (NVIDIA; did StyleGAN2 and the EDM diffusion framework) released a new architecture for diffusion at the end of last year that significantly improved FID scores for diffusion image generation. We've not really seen many other models trained with his updated architecture yet, so things may take a big step forward very quickly this year.
Man watching this just after OpenAI's announcement of Sora is quite something lol
have you seen the phenaki paper from 2022 Yannic?
Open source this right now!! 😭😭
how about you start working on it?
Sounds like the path on this is 'retro onto current open source'.
@gpeschke Correct
@stacksmasherninja7266 I'm sure they'll spin up their multi-million dollar compute cluster any minute
Lucidrains might be onto it already, but google will never.
Isn't the same concept behind IP Adapter applicable here? That approach seems ideal for temporal consistency, especially in synergy with things like "Tile" ControlNet or others potentially...
I think you could freeze all these weights and make a similar system to make consistent segments for much longer. Really cool, especially the drawing effect!
You could also use the overlapping method to train for longer videos. I think this model required a shitload of top-tier GPUs to train; that is why they shortened the duration, I think.
About the keyframe thing at min. 33: they might blur/approximate the downsampling in the t-dimension too, just as they blur the w/h dimensions. So this is different from the former keyframe method.
Google/YouTube is slightly ahead on training data. Let's hope some useful free video datasets pop up so good open models like Stable Diffusion can be trained. Maybe Stable Video can be improved with this paper. The overlapping and the t-w-h-d filtering are easy to understand and to reproduce.
How much do you reckon it would cost to run an inference of/train Lumiere?
I've trained conv3d U-nets on 128x128x128x16ch data on a gp100 w/ 16GB. It's pretty exciting that you can do this with video. Interestingly, the conv3d net had a 50% better MSE than the previous conv2d version where I stacked more channels.
If you had a video of this on your channel, I would watch and upvote it :)
32:14 I think one of the advantages might be that there are multiple key frames, not just a start and an end. They might provide more temporal information, similar to linear vs. cubic interpolation.
I didn't find the kernel sizes in the paper, so I can't verify that.
Interested in joining the paper reading group mentioned in the video. How can I join? Thanks!
Inflated means adding extra trainable weights to some frozen weights
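A minimal sketch of one common inflation scheme (as used in I3D-style models): the pretrained 2-D spatial kernel is placed at the centre time step of a new 3-D kernel, and the new temporal taps start at zero, so the inflated layer initially behaves exactly like the frozen 2-D layer on each frame. Whether Lumiere initializes its temporal layers exactly this way is an assumption here, and the function name is mine.

```python
import numpy as np

def inflate_2d_to_3d(w2d, t_kernel=3):
    """Turn a frozen spatial conv kernel (Cout, Cin, kH, kW) into a
    space-time kernel (Cout, Cin, kT, kH, kW). The pretrained weights
    sit at the centre time step; all other temporal taps are zero."""
    Cout, Cin, kH, kW = w2d.shape
    w3d = np.zeros((Cout, Cin, t_kernel, kH, kW), dtype=w2d.dtype)
    w3d[:, :, t_kernel // 2] = w2d
    return w3d

w2d = np.random.rand(8, 4, 3, 3)          # frozen text-to-image weights
w3d = inflate_2d_to_3d(w2d)               # new trainable space-time kernel
assert np.allclose(w3d[:, :, 1], w2d)     # centre slice holds frozen weights
assert np.allclose(w3d[:, :, 0], 0)       # new temporal taps start at zero
```

Only the new temporal parameters need to be trained; the spatial ones can stay frozen, which matches the comment above.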
Could you maybe make a video that explains how it differs from Stable Video? I guess that's the only model whose inner workings we actually know...
42:45 Poor Omer, never citing himself again
32:00 "Isn't this just key frames?"
I think it's slightly different in that there are multiple key frames per video frame. Even if it wasn't in a latent space there would be overlap across key frames, so you don't get that boundary issue. It'd be interesting to look into finite element methods (splines) for video generation.
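The overlap-across-windows idea can be sketched as MultiDiffusion-style fusing: each frame's prediction is the average of all overlapping temporal windows that cover it, which is what the uniform-weight MultiDiffusion objective reduces to. The function name and toy shapes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fuse_overlapping(predictions, starts, total_len):
    """Average per-frame predictions from overlapping temporal windows
    so neighbouring segments agree at the seams.
    predictions: list of (win_len, ...) arrays; starts: window start frames."""
    acc = np.zeros((total_len,) + predictions[0].shape[1:])
    count = np.zeros((total_len,) + (1,) * (predictions[0].ndim - 1))
    for pred, s in zip(predictions, starts):
        acc[s:s + len(pred)] += pred
        count[s:s + len(pred)] += 1
    return acc / count

# Two 8-frame windows overlapping by 4 frames cover a 12-frame clip:
a, b = np.ones((8, 2)), 3 * np.ones((8, 2))
fused = fuse_overlapping([a, b], starts=[0, 4], total_len=12)
assert np.allclose(fused[:4], 1)    # only window A covers these frames
assert np.allclose(fused[4:8], 2)   # overlap region: average of 1 and 3
assert np.allclose(fused[8:], 3)    # only window B
```

Because frames in the overlap are a blend of both windows, there is no hard boundary at which two independently generated segments meet.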
The DNN learns the time evolution as well, like an autoregressive next-frame model. This demonstrates how powerful the mathematical concept of a tensor is in the real world. That is why I love mathematics: usefulness, and a simplified representation we can operate on. Imagine if we had never invented the concept of a tensor; how would the data be organized and prepared?
If you have a really strong text to video model with great spaciotemporal consistency (even if it's just for short intervals), I wonder if that suffices to then turn this around and make a fresh text to *image* model where the latent space is just set up to more naturally give consistent video - like, perhaps directions would emerge that correspond to camera pans, rotations, zooms, and the passing of time, and beyond that, maybe even some directions that correspond to archetypical armatures ("biped, quadruped" or whatever) moving in specific, somewhat untangleable ways.
If u have time, can u do a video on the animatediff paper. Its very popular in open source. Thanks :)
Yeah and Stable Video
The name 🎉
Just one thing possibly missing: as far as I remember, a U-Net is not just a CNN autoencoder pipeline; additionally, the output embeddings of the encoder are concatenated to the input embeddings of the decoder at the same level. So is this really a U-Net or just an autoencoder?
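To the question above: yes, those skip concatenations are exactly what makes a U-Net a U-Net rather than a plain autoencoder. A toy 1-D skeleton (all names mine; pooling and convolutions replaced by simple subsampling so only the wiring is shown):

```python
import numpy as np

def unet_forward(x, depth=2):
    """Toy 'U-Net' skeleton on an array of shape (channels, length).
    Encoder: save the activation at each level, then downsample.
    Decoder: upsample, then CONCATENATE the saved activation along the
    channel axis -- the defining U-Net skip connection."""
    skips = []
    for _ in range(depth):
        skips.append(x)                 # remember this level's activation
        x = x[:, ::2]                   # stand-in for pooling: halve length
    for skip in reversed(skips):
        x = np.repeat(x, 2, axis=1)     # stand-in for upsampling
        x = np.concatenate([x, skip], axis=0)   # skip concatenation
    return x

x = np.random.rand(4, 16)
y = unet_forward(x)
assert y.shape == (12, 16)   # channels grew 4 -> 8 -> 12 via concatenation
```

In a real U-Net a convolution follows each concatenation and maps the doubled channels back down; the key point is that fine-grained encoder information bypasses the bottleneck.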
I see Oliver Wang's name in the author list
been putting in some time on set, can confirm the red eyes on the koalas are accurate... usually when their OS bugs, scary trying to hit the off switch...
I can hardly wait for text-to-videogame. My prompt: build a game based on a mix between The Hobbit, Naruto, and xianxia genre tropes.
I think "key frame" animation would be the next useful use-case. ...Oh wait, the paper actually talks about that lol
In a multimodal system, why can't there be an image-analysis model acting like a director, overlooking the bigger picture along the way? Maybe it could trigger a mulligan if it sees a key frame deviate too much from the whole. It would just be a few extra seconds of inference time and resetting a chunk of frames; totally doable imho.
Pretty cool, oversharpening aside, but other than novelty, the lack of control over the details in the content means it's quite far from being actually useful. I think you need explicit control over every element in the video or the AI needs explicit control. If the generator isn't responsive to demands then you are not so much in control as you are sharing the experience.
5 seconds at 16 fps, means probably next gen can do 30 seconds at 24/30/60 fps. If you include options to have a vector DB of key objects or frames as reference you could then make movies consisting of individual but maybe globally consistent video clips.
Why is your video so sharp?
18:50
It is technically "key frames", but the upsampler (SSR) does not know about movement. Remember the frame-interpolation techniques based on optical flow? All they see is linear movement of pixels; that's why it sucks compared to just having a 4-dimensional convolution.
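The linear-motion assumption behind naive flow-based interpolation is easy to break with a tiny example (hypothetical numbers, just to show the failure mode the comment above describes):

```python
def linear_midpoint(x0, x1):
    """Linear motion assumption: the in-between position of a pixel is
    simply the average of its positions in the two key frames."""
    return (x0 + x1) / 2

# An object actually following x(t) = t**2 over t in [0, 2]:
true_mid = 1.0 ** 2                              # real position at t = 1
est_mid = linear_midpoint(0.0 ** 2, 2.0 ** 2)    # linear guess from key frames
assert true_mid == 1.0 and est_mid == 2.0        # off by a factor of two
```

Any accelerating, rotating, or occluded motion violates the linear assumption, whereas a model that processes the full temporal extent can represent the true trajectory.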
This model has just been overshadowed by OpenAI's Sora model 😅
Once GPUs have 32*128GB of memory, we can just do 3D diffusion (forced/nudged with a text prompt every iteration), training on movies and youtube - problem solved lol
So basically they just added a dimension to the U-Net (a very expensive training run, but an obvious and well-trodden idea) and then used their in-house corporate dataset (a giant cost to procure). Yet another "we have immense resources" paper. Cool outcome, but a big "yawn" from me on implementation.
Cartelion
Cool model, bad paper. As you pointed out, the authors manipulated the presentation of results too much, and failed to actually spell out key steps of the method.
Every household will have a YouTube channel and Facebook will cease to exist 🤞
29:50 Stop here. I see attention! Where did the positional encoding go? How does it know what to attend to? People are so mad about transformers that they didn't notice the missing ablation studies for both positional encoding and the attention mechanism in all the pioneering transformer-architecture papers.
I think more fish for Yann LeCat
"how is this not reproducible? Just type "L"" 🤣
mkay. i really like this. apple needs work. im going to contact you at steve jobs cell.
So... in essence it is locally retentive and globally an instantaneous measurement that changes key depending on the reference frame... Jesus, Yannic... you blow my mind daily.
dont... dont stop?
Tech good, tech bad, tech biased 😂😂😂
cool
but we are fucked
The authors are almost completely Israeli, cool
Calling it Lumiere is a bit pretentious considering they're not the first to come up with text-to-video.