NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (ML Research Paper Explained)

  • Published 25 Nov 2024

COMMENTS • 150

  • @YannicKilcher
    @YannicKilcher  3 роки тому +25

    OUTLINE:
    0:00 - Intro & Overview
    4:50 - View Synthesis Task Description
    5:50 - The fundamental difference to classic Deep Learning
    7:00 - NeRF Core Concept
    15:30 - Training the NeRF from sparse views
    20:50 - Radiance Field Volume Rendering
    23:20 - Resulting View Dependence
    24:00 - Positional Encoding
    28:00 - Hierarchical Volume Sampling
    30:15 - Experimental Results
    33:30 - Comments & Conclusion

    • @G12GilbertProduction
      @G12GilbertProduction 3 роки тому

      Captions, dear bud'! Caaaaaaaaptions!

    • @jimj2683
      @jimj2683 Рік тому

      Imagine this, but using an AI model that is trained on vast amounts of 3d data from the real world. It would be able to fill in the gaps with all the experience it has much more accurately.

    • @THEMATT222
      @THEMATT222 Рік тому

      Noice 👍

  • @thierrymilard1544
    @thierrymilard1544 3 роки тому +40

    First time I really truly understand NeRF. Wonderfully simple explanation. Thanks a lot!

  • @Jianju69
    @Jianju69 2 роки тому +9

    This type of pre-digestion for a complex technical paper is very expedient. Thank you.

  • @TheDukeGreat
    @TheDukeGreat 3 роки тому +60

    Wait, did Yannic just release a review of a paper I already read? So proud of myself :D

    • @NikolajKuntner
      @NikolajKuntner 3 роки тому

      Incidentally, same.

    • @Kolopuster
      @Kolopuster 3 роки тому +5

      It's even my thesis subject :O

    • @DMexa
      @DMexa Рік тому

      So proud of Yannic, bro, he is sharing this awesome knowledge!

  • @gravkint8376
    @gravkint8376 Рік тому +2

    Gotta present this paper for a seminar at uni so this video makes it so much easier. Thank you so much for this!

  • @adriansalazar8303
    @adriansalazar8303 Рік тому +3

    One of the best NeRF explanations available. Thank you so much, it helped a lot.

  • @peter5470
    @peter5470 10 місяців тому

    My guy, this has to be the best tutorial on NeRF I've seen, finally understood everything

  • @muhammadaliyu3076
    @muhammadaliyu3076 3 роки тому +5

    UC Berkeley - I salute this university when it comes to AI research. In most big papers, you will definitely see one or more scholars from it.

  • @免費仔的驕傲
    @免費仔的驕傲 2 роки тому +1

    Man, you have so many clear notes explaining papers. I got tons of help from your videos.

  • @dsp4392
    @dsp4392 2 роки тому +2

    Excellent explanation. Realtime 3D Street View should be right around the corner now.

  • @LouisChiaki
    @LouisChiaki 3 роки тому +32

    I feel this approach has probably been used in physics and 3D image reconstruction for a long time with the Fourier decomposition technique (which is renamed as positional encoding here). The main point is that it is 1 model per object, so I feel like it is a curve-fitting problem. Though using gradient descent and a neural-network-like framework probably makes it much easier to model.

    • @paulothink
      @paulothink 2 роки тому +2

      Would that be analogous to the DCT technique in video codecs? Hopefully this could shed some light on a potentially better video codec 👀

    • @Stopinvadingmyhardware
      @Stopinvadingmyhardware Рік тому

      DTFT?

    • @tomfahey2823
      @tomfahey2823 Рік тому

      The "1 model per object" is an interesting, if not surprising, evolution in itself, as it can be seen as a further step in the direction of neural computing (as opposed to algorithmic computing), where memory/data and computation are encoded in the same structure (the neural network), in a manner not dissimilar to our own brains.
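The Fourier-feature reading of positional encoding discussed in the thread above boils down to mapping each input coordinate through a bank of sines and cosines before the MLP sees it. A minimal numpy sketch of that mapping (function and variable names are illustrative, not taken from the paper's code; the paper uses roughly 10 frequency bands for positions and 4 for viewing directions):

```python
import numpy as np

def positional_encoding(p, num_freqs=10):
    # gamma(p): each coordinate is expanded into sines and cosines at
    # geometrically increasing frequencies so the MLP can fit fine detail.
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi        # 2^k * pi, k = 0..L-1
    scaled = p[..., None] * freqs                         # shape (..., D, L)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)                 # shape (..., 2*L*D)

x = np.array([0.1, -0.4, 0.7])            # one 3D sample position
print(positional_encoding(x).shape)        # (60,) with 10 frequency bands
```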

  • @trejohnson7677
    @trejohnson7677 3 роки тому +2

    The "overfitting" is one of the core principles in Functional Programming/Dataflow Programming. Very awesome to see, will have to check whether or not it was a locally unique idea, or if it is directly pulling from the aforementioned knowledge bases.

    • @heejuneAhn
      @heejuneAhn 2 роки тому

      Can we say memorizing instead of "overfitting"? It sounds more intuitive to me.

  • @willd1mindmind639
    @willd1mindmind639 3 роки тому +3

    It more closely represents what happens in the brain, where the neural networks represent a coherent, high-fidelity representation of real-world signal information. However, that kind of detailed scene representation normally has a lot of temporal decay, with the "learning" being a set of generalized learning elements extracted from such input info. For example, you could "learn" a generalized coordinate space (up, down, left, right, near, far), depth perception, perspective, surface information (convex, concave, etc.), shape information, etc. But that would be another set of networks for specific tasks with less temporal decay and more generalization parameters to allow higher-order understanding such as object classification, logical relationships between objects, and so forth.

  • @NoobMLDude
    @NoobMLDude 2 роки тому

    Thanks for the Great explanation. Finally understand the central ideas behind NeRF.

  • @nitisharora41
    @nitisharora41 5 місяців тому

    Thanks for creating such a detailed video on NeRF

  • @vslaykovsky
    @vslaykovsky 3 роки тому +15

    "Two papers down the line" we'll probably see a paper that also infers positions and directions of photos.

  • @Milan_Openfeint
    @Milan_Openfeint 3 роки тому +1

    Nice, Image Based Rendering strikes back after 25 years.

  • @任辽-n8n
    @任辽-n8n 2 роки тому

    A very detailed explanation, thanks to you!

  • @howdynamic6529
    @howdynamic6529 2 роки тому +2

    Thank you for the clear-cut and thorough explanation! I was able to follow and that is definitely saying something because I come from a different world, model-based controls :)

  • @aayushlamichhane
    @aayushlamichhane Рік тому +3

    Awesome explanation! Please don't stop making these.

  • @ethanjiang4091
    @ethanjiang4091 3 роки тому

    I watched a video on the same topic before but got lost. Now I get it after watching your video.

  • @kameelamareen
    @kameelamareen Рік тому +1

    Beautiful and Super Intuitive video ! Thanks :3

  • @thecheekychinaman6713
    @thecheekychinaman6713 9 місяців тому

    Crazy to think that this came out 2 years ago, advancement in the field is crazy

  • @YangLi-x9s
    @YangLi-x9s Рік тому +1

    Pretty clear and great thanks to you!!

  • @ChuanGuo-y3i
    @ChuanGuo-y3i 2 роки тому

    This video helps a lot for a fresher like me to understand NeRF, thanks!

  • @Dave_Lee
    @Dave_Lee 2 роки тому

    Great video. Thanks Yannic!

  • @alpers.2123
    @alpers.2123 3 роки тому +27

    Dear fellow scholars...

  • @Snuson
    @Snuson Рік тому

    Loved the video. Learned a lot. Thanks

  • @daanhoek1818
    @daanhoek1818 2 роки тому +4

    Really cool. I love getting into this stuff. I'm a compsci student in my first year, but considering switching and going for AI. Such an interesting field.
    What a time to be alive! ;)

    • @minjunesong6667
      @minjunesong6667 2 роки тому +1

      I'm also a first year student, feeling same here. Which school do u go to?

    • @daanhoek1818
      @daanhoek1818 2 роки тому +1

      @@minjunesong6667 The university of Amsterdam

  • @truy7399
    @truy7399 Рік тому

    I was searching for Nerf guns; this is better than what I asked for.

  • @shempincognito4401
    @shempincognito4401 2 роки тому

    Awesome explanation! Thanks for the video.

  • @michaellellouch3682
    @michaellellouch3682 3 роки тому

    Superbly explained, thanks!

  • @bilalbayrakdar7100
    @bilalbayrakdar7100 Рік тому

    bro you are killin' it, pretty damn good explanation thanks

  • @juang.8799
    @juang.8799 Рік тому

    Thanks for the explanation!!

  • @canhdz169
    @canhdz169 15 днів тому

    Thank you so much, it helped a lot.

  • @SanduniPremaratne
    @SanduniPremaratne 3 роки тому +4

    How are the (x,y,z) coordinates obtained for the input data?
    I assume a pose estimation method was used to get the two angles?

  • @ferranrigual
    @ferranrigual 10 місяців тому

    Amazing video, thanks a lot.

  • @tnmygrwl
    @tnmygrwl 3 роки тому +2

    Had been waiting for this for a while now. 🔥

  • @hbedrix
    @hbedrix 2 роки тому

    awesome video! Really appreciate you doing this!

  • @Bellenchia
    @Bellenchia 3 роки тому +1

    I heard about Neural Radiance Fields on the TWIML podcast earlier this year, and never connected that it was the same paper Károly (and now, Yannic) talked about.
    It's funny how we associate a paper with a photo or graphic a lot of the time.

    • @Bellenchia
      @Bellenchia 3 роки тому

      Fellow visual learners feel free to @ me

    • @Bellenchia
      @Bellenchia 3 роки тому

      Also want to mention that you did a much better job explaining this than Pavan Turaga did on the episode in question, so well done Yannic. The episode I'm talking about is called Trends in Computer Vision or something along those lines for those interested.

  • @siyandong2564
    @siyandong2564 2 роки тому +1

    Nice explanation!

  • @6710345
    @6710345 3 роки тому +11

    Yannic, would you ever review your own paper? 🤔

  • @thomsontg1730
    @thomsontg1730 3 роки тому

    Great explanation, I really enjoyed watching it.

  • @michaelwangCH
    @michaelwangCH 3 роки тому

    Cool effect. I saw this on Two Minute Papers. Training a NN from different perspectives of the same object - hard to get the right data.

  • @晨希刘
    @晨希刘 3 роки тому

    Wonderful videos! Thanks for sharing~

  • @vaishnavikhindkar9444
    @vaishnavikhindkar9444 Рік тому +1

    Great video. Can you please make one on LeRF (Language embedded Radiance Fields)?

  • @Dyxuki
    @Dyxuki 5 місяців тому +1

    this is VERY cool.
    but I believe that if given multiple views (images) of a scene, and if we are able to match enough points, it's possible to generate a photogrammetric model of the scene (a little bit like stereo vision, but with many more views so the generated model is more complete), and once we have that, we can simply sample or reproject it to any point of view. Isn't that a simpler way of solving the same problem?

  • @우시후-i4w
    @우시후-i4w 10 місяців тому

    Thank u😮😮😮😮😮 amazing description

  • @qwerty123443wifi
    @qwerty123443wifi 3 роки тому

    Awesome! Was hoping you'd do a video on this one

  • @usama57926
    @usama57926 2 роки тому

    This is mind blowing

  • @bona8561
    @bona8561 2 роки тому +3

    Hi Yannic, I found this video very helpful. Could you do a follow-up on Instant NeRF by Nvidia?

  • @isbestlizard
    @isbestlizard 2 роки тому

    You could stack lots of objects so long as you know the transformation from object to world coordinates and give each object a bounding volume in world space for the ray tracer to bother calculating. If you had a supercomputer you could render worlds with thousands of overlapping and moving objects :D

  • @jaysethii
    @jaysethii Рік тому

    Phenomenal video!

  • @IoannisNousias
    @IoannisNousias 3 роки тому +3

    Great explanation as always, Yannic! Will you be doing a follow-up on their next paper (D-NeRF), which handles dynamic scenes?

  • @AdmMusicc
    @AdmMusicc Рік тому

    This is an amazing explanation! I have a doubt though. You talked about the major question of the training images not having information about "density". How are we even computing the loss in that case for each image? You said we compare what we see with what the model outputs. But how does the model give different density information for a particular pixel if we don't have that kind of information in the input? How will having a differentiable function that can backtrack all the way to the input space be of any help if we don't have any reference or ground truth for the densities in the training images?

  • @dr.mikeybee
    @dr.mikeybee 3 роки тому

    View synthesis shows the power of interpolation!

  • @agnivsharma9163
    @agnivsharma9163 3 роки тому +3

    Can anyone tell me how we get the density parameter during training, since we don't have the full 3D scene?

    • @jeteon
      @jeteon 3 роки тому +2

      The density is something the network makes up. It only exists so that you can use it to say what each pixel from a new viewpoint should look like. If you ask the network to generate a picture you already have then the loss from comparing them gives the network enough information to find a density.
      The density it finds doesn't mean anything outside of the specific algorithm that generates images from the network outputs. Just think of the network as generating 4 numbers and then coupled to some other function h(a, b, c, d) that we use to generate pictures from those 4 numbers. We can name the 2nd number "red" but the network doesn't "care", it's just an output, same as what they chose to call "density".

    • @shawkontzu642
      @shawkontzu642 3 роки тому +2

      The output for training is not the (color, density) array but the rendered images. After the network predicts (color, density) for sample points, this info is then rendered into images using a volume rendering technique, so the loss is the error between the rendered images and the training images instead of the (color, density) array itself.

    • @agnivsharma9163
      @agnivsharma9163 3 роки тому

      Thank you so much for both the replies. Now, that you have explained it, it makes much more sense to do it like that. It also helped me clarify a few doubts which I had with follow-up NERF based papers. Huge help!

    • @jeteon
      @jeteon 3 роки тому

      @@shawkontzu642 Yes, I agree. The (color, density) is an intermediate output that gets fed into h(a, b, c, d) whose outputs are rendered images. The h function doesn't need to be trained though. It is just the volume rendering technique.
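As the replies above say, density is never supervised directly: the network's (color, density) predictions at sample points along a camera ray are composited into a single pixel color by the volume rendering step, and the loss compares that rendered pixel with the corresponding training-image pixel. A minimal numpy sketch of the compositing for one ray (names are illustrative; the actual method adds stratified and hierarchical sampling on top of this):

```python
import numpy as np

def composite_ray(colors, sigmas, ts):
    # colors: (N, 3) RGB predicted at each sample along the ray
    # sigmas: (N,)   volume densities predicted at those samples
    # ts:     (N,)   distances of the samples along the ray
    deltas = np.append(np.diff(ts), 1e10)                   # spacing between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)                 # opacity of each segment
    trans = np.cumprod(np.append(1.0, 1.0 - alphas))[:-1]   # transmittance reaching each sample
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)          # rendered pixel color

# Training minimizes e.g. ||composite_ray(...) - true_pixel||^2, so gradients
# flow back through the compositing into the (color, density) outputs.
```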

  • @albertoderfisch1580
    @albertoderfisch1580 3 місяці тому

    BRAVO!!!

  • @laurentvit3117
    @laurentvit3117 6 місяців тому

    Duuuude I'm learning NeRF, and this video is a jewel, thank you!

  • @CristianGarcia
    @CristianGarcia 3 роки тому +1

    I was a bit confused, I thought this paper already had been reviewed but it was actually the SIREN paper.

  • @Bellenchia
    @Bellenchia 3 роки тому +1

    I think the fact that it takes 1-2 days on a V100 is the biggest gotcha

    • @jeroenput258
      @jeroenput258 3 роки тому

      Another gotcha: the view dependent part only comes into play in the last layer. It really doesn't do much but optimize some common geometry.

    • @Bellenchia
      @Bellenchia 3 роки тому

      @@jeroenput258 maybe this is akin to GPT-3's large store of knowledge, where the top layers all store information about the image and the last layer basically just constructs an appropriate image for a given view using the shapes and textures it learned

    • @jeroenput258
      @jeroenput258 3 роки тому

      Perhaps, but I don't know enough about GPT-3 to answer that. What I do find odd is that when you move the ray direction parameter up just one layer the whole thing falls apart. It's really strange Nerf even works imo.

  • @Siva-wv5zk
    @Siva-wv5zk 3 місяці тому

    Great explanation, thanks Yannic. Shouldn't it be called 6D - x, y, z, direction, color, and density?

  • @jonatan01i
    @jonatan01i 3 роки тому +1

    I've started the "I invented everything" video yesterday and paused to continue today, but it's private now :(

  • @sebastianreyes8025
    @sebastianreyes8025 2 роки тому

    I noticed many of the scenes were from UC Berkeley, kinda trippy. The engineering school there gave me a bit of PTSD ngl.

  • @masterodst1
    @masterodst1 3 роки тому

    It'd be cool if someone combined this with volumetric or lightfield displays

  • @dhawals9176
    @dhawals9176 3 роки тому

    Ancient, huh! Nice way to put it.

  • @starship9874
    @starship9874 3 роки тому +1

    Hey will you ever do a video explaining Knowledge Graphs / Entity embeddings? For example by talking about the "Kepler, a unified model for KE and PLM" paper

  • @paulcassidy4559
    @paulcassidy4559 3 роки тому

    hopefully this comment isn't in bad taste - but the changes in lighting patterns on the right hand side at 2:51 reminded me a lot of how light behaves while under the influence of psychedelics. yes I'll see myself out...

  • @AThagoras
    @AThagoras 3 роки тому +3

    I must be not understanding something. How do you get the density from 2d images?

    • @arushirai3776
      @arushirai3776 3 роки тому +1

      I don't understand either

    • @jeteon
      @jeteon 3 роки тому

      The 2D images give you multiple perspectives of the same point in space just from different angles. If you combine that information (in this case using the neural network) then you can get a good idea of whether or not there is something at a particular point. Density is not how much mass there is in the volume around a point but rather how much stuff there is at that point that interacts with light.
      Think of it like what people naturally do when they pick something up and look at it from different angles. Each of those angles is like the pictures and the internal idea you would form of what the 3D object looks like is the model that gets trained. By the time you've analysed an object this way, you also can make pretty good guesses about which parts of the object reflect light, how much light, from which angles and what colour to expect. That's basically the model. The density is how much that point dominates the light you get from that point and could be something like 0 to 1 being from completely invisible to completely opaque.
      Also, if you just look at the pictures you train on, your brain can build this model so that you have a good sense of whether or not a purple spot makes sense on the yellow tractor.

    • @thanvietduc1997
      @thanvietduc1997 2 роки тому

      You ask the neural network for density information, not the images. The pixels (RGB values) in those images serve as the target for the neural network to train on.

    • @AThagoras
      @AThagoras 2 роки тому

      @@thanvietduc1997 OK. that makes sense. Thanks.

  • @firecloud77
    @firecloud77 2 роки тому

    When will this become available for image/video software?

  • @maryguty1705
    @maryguty1705 2 місяці тому

    So the network only works on one scene? And it is more of a 3D model compressor than a 3D scene generator, am I understanding this correctly?

  • @R0m0uT
    @R0m0uT 2 роки тому

    This sounds as if presentation could be entirely done in a raymarching shader on the GPU as I suspect the evaluation of the model can be implemented as a shader.

  • @ceovizzio
    @ceovizzio 2 роки тому +1

    Great explanation, Yannic! I'd like to know if this technique could be used for 3D modelling?

    • @pretzelboi64
      @pretzelboi64 2 роки тому

      Yes, you can construct a triangle mesh from NeRF density data

  • @herp_derpingson
    @herp_derpingson 3 роки тому +1

    Reminds me of that SIREN paper.

  • @jonatan01i
    @jonatan01i 3 роки тому

    This is better than magic.

  • @dr.mikeybee
    @dr.mikeybee 3 роки тому

    How compact are minimally accurate models? How many parameters?

  • @brod515
    @brod515 7 місяців тому

    where does the scene come from?

  • @JeSuisUnKikoolol
    @JeSuisUnKikoolol 3 роки тому +2

    I would be very surprised to see a similar technology used to render objects inside games. According to the paper, sampling takes 30 seconds on a high-end GPU. As games often run at 60 fps, this would only be viable with a speedup of 1800x, and that's assuming we only have to render a single object (so realistically speaking we could add another factor of 100x).
    This does not mean it is not possible with more research and better hardware, but if we compare this to the traditional way of rendering in games, I'm not really sure there is an advantage.
    It's not even something we could not do otherwise, as we already have photogrammetry to generate meshes from images.
    For unbiased ("photorealistic") rendering I could see some use, but the learning time is way too high for the moment. One application could be to render a few frames of an animation and use the model to "interpolate" between the frames.

    • @ksy8585
      @ksy8585 2 роки тому

      Now it has reached a 1000x speedup in both training and inference. What a speed of progress. There is more chance of using this technology where you take a few pictures of an object existing in the real world and reconstruct it as a 3D representation by training a neural network. Then you can manipulate it or synthesize 2D images from novel viewpoints, lighting, time-steps (if a video), and so on.

    • @JeSuisUnKikoolol
      @JeSuisUnKikoolol 2 роки тому

      @@ksy8585 Very impressive. Do you have a link?

  • @mort.
    @mort. Рік тому

    Is this an in-depth breakdown of what photogrammetry is, or is this different?

  • @WhenThoughtsConnect
    @WhenThoughtsConnect 3 роки тому

    take pics from all angles of apple, maximize the score to label apple

  • @ilhamwicaksono5802
    @ilhamwicaksono5802 Рік тому

    THE BEST

  • @marknadal9622
    @marknadal9622 2 роки тому

    Help! How do they determine depth density from a photo? Wouldn't you need prior trained data to know how far away an object is, from a single photo?

  • @sarvagyagupta1744
    @sarvagyagupta1744 3 роки тому +1

    Hey Yannic. I've been waiting for you to talk about this. Thanks. One question though. The viewing angle, is it like the latitude and longitude angles? We need two values because we want to know how that point looks from both the horizontal and vertical angles, right?

    • @quickdudley
      @quickdudley 3 роки тому

      I'd been assuming it was pan and tilt. The full algorithm would need to know the roll of the camera but I don't think that would influence any lighting effects.

    • @ashastra123
      @ashastra123 2 роки тому

      It's spherical coordinates (minus the radius, for obvious reasons)

    • @sarvagyagupta1744
      @sarvagyagupta1744 2 роки тому

      @@ashastra123 But then spherical coordinates have two angles, one w.r.t. the y-axis and the other w.r.t. the x-axis. So are we using the same nomenclature here?

    • @ghostoftsushimaps4150
      @ghostoftsushimaps4150 2 роки тому

      Good question. I have also been assuming one angle measures how much left-right and the other how much up-down on the surface of a sphere. So I am also assuming the viewing angles are like lat/long angles.
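For what it's worth, the two angles being discussed can be read as spherical coordinates (theta, phi) and converted into the unit direction vector that the network actually takes as input; a small sketch, with the exact angle convention being an assumption rather than something the paper fixes:

```python
import numpy as np

def angles_to_direction(theta, phi):
    # theta: polar angle from the z-axis (radians)
    # phi:   azimuth around the z-axis (radians)
    return np.array([
        np.sin(theta) * np.cos(phi),
        np.sin(theta) * np.sin(phi),
        np.cos(theta),
    ])

d = angles_to_direction(np.pi / 3, np.pi / 4)
print(d, np.linalg.norm(d))    # unit length, which is why the radius is not needed
```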

  • @jasonvolk4146
    @jasonvolk4146 3 роки тому +4

    Reminds me of that scene from Enemy Of The State ua-cam.com/video/3EwZQddc3kY/v-deo.html -- made over 20 years ago!

  • @paulcurry8383
    @paulcurry8383 2 роки тому

    Why is this “overfitting”? Wouldn’t overfitting in this case be if the network snaps the rays to the nearest data point with that angle and doesn’t interpolate?

  • @piotr780
    @piotr780 Рік тому

    So there are really two networks (coarse and fine), or is this some kind of trick?

  • @VERY_TALL_MAN
    @VERY_TALL_MAN Рік тому

    It’s NeRF or Nothin’ 😎

  • @sofia.eris.bauhaus
    @sofia.eris.bauhaus 3 роки тому +2

    i don't think this should be called "overfitting". as far as i'm concerned, overfitting means learning the input data (or at least big chunks of it) as one big pattern itself, instead of finding the patterns within the data and generalizing from it. this system may be able to reproduce the input data faithfully (i haven't compared them 🤷) but it clearly learned to generalize the spatial patterns of the scene.

    • @mgostIH
      @mgostIH 3 роки тому +1

      It doesn't really generalize anything outside the data it has seen, its job is to just learn *really* well the points we care about, but anything outside that range isn't important.
      Think of it like if you were to train a network on a function f(x) and you are interested on the domain [-1,1]. Overfitting on this domain would mean that the network is extremely precise inside this interval but does something random outside of it, while generalizing means that we also care about having a good estimate of the function outside the domain.
      Here our domain is the parts where we can send rays to, it doesn't really matter what the model thinks is outside the box we never sampled on.

    • @laurenpinschannels
      @laurenpinschannels 3 роки тому +2

      in this context, overfitting might be replaced with undercompression

    • @sofia.eris.bauhaus
      @sofia.eris.bauhaus 3 роки тому

      @@mgostIH yeah, and a network that is trained on birds will probably never generate a good squirrel. i don't think neural nets tend to be good at producing things unlike anything they have ever seen before.

    • @jeteon
      @jeteon 3 роки тому

      🤔 That's actually a good point. If it "overfit" it wouldn't be able to interpolate novel viewpoints, just the pictures it was trained on.

  • @NeoShameMan
    @NeoShameMan 3 роки тому +1

    I'm studying light fields; the premise makes it not that impressive to me. Program a Lytro-like renderer and you'll know what I mean.

  • @GustavBoye-cs9vz
    @GustavBoye-cs9vz 5 місяців тому

    7:05 - 7:45 So we use the same neural network for multiple different scenes? That's smart, because then we don't need to retrain it every time.

  • @rezarawassizadeh4601
    @rezarawassizadeh4601 Рік тому

    I think saying that each scene is associated with one single neural network (the NN is overfitted for that scene) is not correct.

  • @ankurkumarsrivastava6958
    @ankurkumarsrivastava6958 Рік тому

    Code?

  • @usama57926
    @usama57926 2 роки тому

    But can this be used in real time?

  • @dvfh3073
    @dvfh3073 3 роки тому +2

    5:50

  • @govindnarasimman6819
    @govindnarasimman6819 2 роки тому

    Finally something without CNNs. Bravo, guys.

  • @pratik245
    @pratik245 2 роки тому

    Deep Tesla

  • @fintech1378
    @fintech1378 10 місяців тому

    Python code?

  • @Anjum48
    @Anjum48 3 роки тому

    Obligatory "we're living in a simulation" comment

  • @yunusemrekarpuz668
    @yunusemrekarpuz668 Рік тому

    It's like the end of photogrammetry

    • @notram249
      @notram249 Рік тому

      NeRF is a step forward in photogrammetry

  • @muzammilaziz9979
    @muzammilaziz9979 3 роки тому

    Yannic "not so lightspeed" Kilcher