Lumiere: A Space-Time Diffusion Model for Video Generation (Paper Explained)
- Published 17 May 2024
- #lumiere #texttovideoai #google
LUMIERE by Google Research tackles globally consistent text-to-video generation by extending the U-Net downsampling concept to the temporal axis of videos.
OUTLINE:
0:00 - Introduction
8:20 - Problems with keyframes
16:55 - Space-Time U-Net (STUNet)
21:20 - Extending U-Nets to video
37:20 - Multidiffusion for SSR prediction fusing
44:00 - Stylized generation by swapping weights
49:15 - Training & Evaluation
53:20 - Societal Impact & Conclusion
Paper: arxiv.org/abs/2401.12945
Website: lumiere-video.github.io/
Abstract:
We introduce Lumiere - a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
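The abstract's core idea, downsampling in time as well as space and then upsampling back to full frame rate, can be sketched with plain numpy pooling. This is an illustrative toy, not the paper's learned operators: the average pooling, nearest-neighbour upsampling, factor of 2, and the function names `spacetime_downsample`/`spacetime_upsample` are all assumptions made for the sketch.

```python
import numpy as np

def spacetime_downsample(video, t_factor=2, s_factor=2):
    """Average-pool a video over time AND space (the STUNet idea).
    video: (T, H, W, C); assumes T, H, W are divisible by the factors."""
    T, H, W, C = video.shape
    v = video.reshape(T // t_factor, t_factor,
                      H // s_factor, s_factor,
                      W // s_factor, s_factor, C)
    return v.mean(axis=(1, 3, 5))

def spacetime_upsample(video, t_factor=2, s_factor=2):
    """Nearest-neighbour upsampling back to full frame rate / resolution."""
    return (video.repeat(t_factor, axis=0)
                 .repeat(s_factor, axis=1)
                 .repeat(s_factor, axis=2))

video = np.random.rand(16, 64, 64, 3)    # 16 frames of 64x64 RGB
coarse = spacetime_downsample(video)     # (8, 32, 32, 3): half rate, half res
restored = spacetime_upsample(coarse)    # (16, 64, 64, 3)
```

The point of the temporal axis in the pooling is that the coarsest level of the network sees the whole clip at once, instead of only sparse keyframes.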
Authors: Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, Oliver Wang, Deqing Sun, Tali Dekel, Inbar Mosseri
Links:
Homepage: ykilcher.com
Merch: ykilcher.com/merch
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
LinkedIn: / ykilcher
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Category: Science & Technology
I love the no-nonsense attitude when discussing these papers. I think the goal of ML authors needs to shift more to helping people understand their work, instead of showing off their researchy prowess.
"Sticker" is so cool! The fact that it can learn to hallucinate these styles over time so well is mind blowing...
At @31:29, the anchor frames are learned rather than manually selected. That is the difference between the previous architecture and this one.
It's basically style in the time dimension. Neat.
As always, nice comments and explanation! As for the paper: our team did this in a different domain a few years ago; too bad we didn't publish "4D U-Net with attention". Of course, the application here is a lot more interesting, but the tech is the same.
You’re the best, I always enjoy your opinions 😊
I think the cross connections of the U-Net help the coherency of the frames (with regard to them being "just like key frames").
42:30 I like videos like this because I would never have noticed the citation from the same author's paper after reading the paper for an hour, or half an hour at double speed. Not sure I'd have noticed at all.
@31:44 Those are NOT just key frames.
An ordinary key frame contains only RGB; however, this key frame at 31:44 contains MORE than just RGB.
Those extra channels can encode movement information, even information from other time frames (carried from the downsampling side across frames), and other information that helps with global consistency.
Yeah I thought the same
I think Yannic knew all that; it's basically what he said right after, for a minute. I think what Yannic meant was that the claim of global consistency was a bit overstated.
The problem with the encoded latent key frames is that when you upsample them again and choose a kernel size smaller than the overall dimension, you can again get global artifacts, because the distant resulting frames cannot communicate information with each other (for CNNs this is called the receptive field). Meaning this only works for these very short videos of 5 seconds!
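The receptive-field point above can be made concrete with the standard CNN receptive-field recurrence: for a stack of identical temporal convolutions, the receptive field grows only linearly without striding, but much faster once temporal downsampling (stride) is added, which is exactly what temporal down/up-sampling buys. The function name here is mine, a small sketch rather than anything from the paper.

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field of a stack of identical 1-D convolutions:
    rf grows by (kernel_size - 1) * jump per layer, where the jump
    (distance between adjacent outputs in input coordinates)
    multiplies by the stride at every layer."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# With stride-1 temporal convs, the temporal receptive field grows linearly:
assert receptive_field(4, kernel_size=3) == 9
# With temporal stride (downsampling), it grows geometrically:
assert receptive_field(4, kernel_size=3, stride=2) == 31
```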
It's also worth pointing out that Tero Karras (NVIDIA; did StyleGAN2 and the EDM diffusion framework) released a new architecture for diffusion at the end of last year that significantly improved FID scores for diffusion image generation. We've not really seen many other models trained with his updated architecture yet, so things may take a big step forward very quickly this year.
Man watching this just after OpenAI's announcement of Sora is quite something lol
have you seen the phenaki paper from 2022 Yannic?
Open source this right now!! 😭😭
how about you start working on it?
Sounds like the path on this is 'retro onto current open source'.
@gpeschke Correct
@stacksmasherninja7266 I'm sure they'll spin up their multi-million dollar compute cluster any minute
Lucidrains might be onto it already, but google will never.
Isn't the same concept behind IP Adapter applicable here? That approach seems ideal for temporal consistency, especially in synergy with things like "Tile" ControlNet or others potentially...
I think you could freeze all these weights and make a similar system to make consistent segments for much longer. Really cool, especially the drawing effect!
You could also use the overlapping method to train for longer videos. I think this model required a shitload of top-tier GPUs to train; that is why they shortened the duration, I think.
About the keyframe thing at min. 33: they might blur/approximate the downsampling in the t-dimension too, just as they blur the w/h dimensions. So this is different from the former keyframe method.
Google/YouTube is slightly ahead on training data. Let's hope some useful free video datasets pop up so good open models like Stable Diffusion can be trained. Maybe Stable Video can be improved with this paper. The overlapping and the t-w-h-d filtering are easy to understand and to reproduce.
How much do you reckon it would cost to run an inference of/train Lumiere?
I've trained conv3d U-nets on 128x128x128x16ch data on a gp100 w/ 16GB. It's pretty exciting that you can do this with video. Interestingly, the conv3d net had a 50% better MSE than the previous conv2d version where I stacked more channels.
If you had a video of this on your channel, I would watch and upvote it :)
32:14 I think one of the advantages might be that there are multiple key frames, not just a start and an end. They might provide more temporal information, similar to linear vs. cubic interpolation.
I didn't find the kernel sizes in the paper, so I can't verify that.
Interested in joining the paper reading group mentioned in the video. How can I join? Thanks!
Inflated means adding extra trainable weights to some frozen weights
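A minimal sketch of one common inflation scheme (as used in I3D-style models): the pretrained 2-D spatial kernel is placed at the centre time step of a new 3-D kernel, and the new temporal taps start at zero, so the inflated layer initially behaves exactly like the frozen 2-D layer on each frame. Whether Lumiere initializes its temporal layers exactly this way is an assumption here, and the function name is mine.

```python
import numpy as np

def inflate_2d_to_3d(w2d, t_kernel=3):
    """Turn a frozen spatial conv kernel (Cout, Cin, kH, kW) into a
    space-time kernel (Cout, Cin, kT, kH, kW). The pretrained weights
    sit at the centre time step; all other temporal taps are zero."""
    Cout, Cin, kH, kW = w2d.shape
    w3d = np.zeros((Cout, Cin, t_kernel, kH, kW), dtype=w2d.dtype)
    w3d[:, :, t_kernel // 2] = w2d
    return w3d

w2d = np.random.rand(8, 4, 3, 3)          # frozen text-to-image weights
w3d = inflate_2d_to_3d(w2d)               # new trainable space-time kernel
assert np.allclose(w3d[:, :, 1], w2d)     # centre slice holds frozen weights
assert np.allclose(w3d[:, :, 0], 0)       # new temporal taps start at zero
```

Only the new temporal parameters need to be trained; the spatial ones can stay frozen, which matches the comment above.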
Could you maybe make a video that explains how it differs from Stable Video? I guess that's the only model whose inner workings we actually know...
42:45 Poor Omer, never citing himself again
32:00 "Isn't this just key frames?"
I think it's slightly different in that there are multiple key frames per video frame. Even if it wasn't in a latent space there would be overlap across key frames, so you don't get that boundary issue. It'd be interesting to look into finite element methods (splines) for video generation.
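The overlap-across-windows idea can be sketched as MultiDiffusion-style fusing: each frame's prediction is the average of all overlapping temporal windows that cover it, which is what the uniform-weight MultiDiffusion objective reduces to. The function name and toy shapes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fuse_overlapping(predictions, starts, total_len):
    """Average per-frame predictions from overlapping temporal windows
    so neighbouring segments agree at the seams.
    predictions: list of (win_len, ...) arrays; starts: window start frames."""
    acc = np.zeros((total_len,) + predictions[0].shape[1:])
    count = np.zeros((total_len,) + (1,) * (predictions[0].ndim - 1))
    for pred, s in zip(predictions, starts):
        acc[s:s + len(pred)] += pred
        count[s:s + len(pred)] += 1
    return acc / count

# Two 8-frame windows overlapping by 4 frames cover a 12-frame clip:
a, b = np.ones((8, 2)), 3 * np.ones((8, 2))
fused = fuse_overlapping([a, b], starts=[0, 4], total_len=12)
assert np.allclose(fused[:4], 1)    # only window A covers these frames
assert np.allclose(fused[4:8], 2)   # overlap region: average of 1 and 3
assert np.allclose(fused[8:], 3)    # only window B
```

Because frames in the overlap are a blend of both windows, there is no hard boundary at which two independently generated segments meet.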
The DNN learns the time evolution as well, like an autoregressive next-frame model. This demonstrates how powerful the mathematical concept of a tensor is in the real world. That is why I love mathematics: usefulness, and a simplified representation we can operate on. Imagine if we had never invented the concept of a tensor; how would the data be organized and prepared?
If you have a really strong text to video model with great spaciotemporal consistency (even if it's just for short intervals), I wonder if that suffices to then turn this around and make a fresh text to *image* model where the latent space is just set up to more naturally give consistent video - like, perhaps directions would emerge that correspond to camera pans, rotations, zooms, and the passing of time, and beyond that, maybe even some directions that correspond to archetypical armatures ("biped, quadruped" or whatever) moving in specific, somewhat untangleable ways.
If u have time, can u do a video on the animatediff paper. Its very popular in open source. Thanks :)
Yeah and Stable Video
The name 🎉
Just one thing possibly missing: as far as I remember, a U-Net is not just a CNN autoencoder pipeline; additionally, the output embeddings of the encoder are concatenated to the input embeddings of the decoder at the same level. So is this really a U-Net or just an autoencoder?
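To the question above: yes, those skip concatenations are exactly what makes a U-Net a U-Net rather than a plain autoencoder. A toy 1-D skeleton (all names mine; pooling and convolutions replaced by simple subsampling so only the wiring is shown):

```python
import numpy as np

def unet_forward(x, depth=2):
    """Toy 'U-Net' skeleton on an array of shape (channels, length).
    Encoder: save the activation at each level, then downsample.
    Decoder: upsample, then CONCATENATE the saved activation along the
    channel axis -- the defining U-Net skip connection."""
    skips = []
    for _ in range(depth):
        skips.append(x)                 # remember this level's activation
        x = x[:, ::2]                   # stand-in for pooling: halve length
    for skip in reversed(skips):
        x = np.repeat(x, 2, axis=1)     # stand-in for upsampling
        x = np.concatenate([x, skip], axis=0)   # skip concatenation
    return x

x = np.random.rand(4, 16)
y = unet_forward(x)
assert y.shape == (12, 16)   # channels grew 4 -> 8 -> 12 via concatenation
```

In a real U-Net a convolution follows each concatenation and maps the doubled channels back down; the key point is that fine-grained encoder information bypasses the bottleneck.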
I see Oliver Wang's name in the author list
been putting in some time on set, can confirm the red eyes on the koalas are accurate... usually when their OS bugs, scary trying to hit the off switch...
I can hardly wait for text-to-videogame. My prompt: build a game based on a mix between The Hobbit, Naruto, and xianxia genre tropes.
I think "key frame" animation would be the next useful use-case. ...Oh wait, the paper actually talks about that lol
In a multimodal system, why can't there be an image-analysis model acting like a director, overlooking the bigger picture along the way? Maybe it could trigger a mulligan if it sees a key frame deviate too much from the whole. It would just be a few extra seconds of inference time and resetting a chunk of frames; totally doable imho.
Pretty cool, oversharpening aside, but other than novelty, the lack of control over the details in the content means it's quite far from being actually useful. I think you need explicit control over every element in the video or the AI needs explicit control. If the generator isn't responsive to demands then you are not so much in control as you are sharing the experience.
5 seconds at 16 fps, means probably next gen can do 30 seconds at 24/30/60 fps. If you include options to have a vector DB of key objects or frames as reference you could then make movies consisting of individual but maybe globally consistent video clips.
Why is your video so sharp?
18:50
It is technically "key frames", but the upsampler (SSR) does not know about movement. Remember the frame-interpolation techniques based on optical flow? All they see is linear movement of pixels; that's why it sucks compared to just having a 4-dimensional convolution.
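The linear-motion assumption behind naive flow-based interpolation is easy to break with a tiny example (hypothetical numbers, just to show the failure mode the comment above describes):

```python
def linear_midpoint(x0, x1):
    """Linear motion assumption: the in-between position of a pixel is
    simply the average of its positions in the two key frames."""
    return (x0 + x1) / 2

# An object actually following x(t) = t**2 over t in [0, 2]:
true_mid = 1.0 ** 2                              # real position at t = 1
est_mid = linear_midpoint(0.0 ** 2, 2.0 ** 2)    # linear guess from key frames
assert true_mid == 1.0 and est_mid == 2.0        # off by a factor of two
```

Any accelerating, rotating, or occluded motion violates the linear assumption, whereas a model that processes the full temporal extent can represent the true trajectory.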
This model has just been overshadowed by OpenAI's Sora model 😅
Once GPUs have 32*128GB of memory, we can just do 3D diffusion (forced/nudged with a text prompt every iteration), training on movies and youtube - problem solved lol
So basically they just added a dimension to the U-Net (a very expensive training run, but an obvious and well-trodden idea) and then used their in-house corporate dataset (a giant cost to procure). Yet another "we have immense resources" paper. Cool outcome, but a big "yawn" from me on implementation.
Cartelion
Cool model, bad paper. As you pointed out, the authors manipulated the presentation of results too much, and failed to actually spell out key steps of the method.
Every household will have a YouTube channel and Facebook will cease to exist 🤞
29:50 Stop here. I see attention! Where did the positional encoding go? How does it know what to attend to? People are so mad about transformers that they didn't notice the missing ablation studies for both positional encoding and the attention mechanism in all the pioneering transformer-architecture papers.
I think more fish for Yann LeCat
"how is this not reproducible? Just type "L"" 🤣
mkay. i really like this. apple needs work. im going to contact you at steve jobs cell.
So... in essence it is locally retentive and globally an instantaneous measurement that changes key depending on the reference frame... Jesus, Yannic... you blow my mind daily.
dont... dont stop?
Tech good, tech bad, tech biased 😂😂😂
cool
but we are fucked
The authors are almost completely Israeli, cool
Calling it Lumiere is a bit pretentious considering they're not the first to come up with text-to-video.