OUTLINE:
0:00 - Intro & Overview
4:50 - View Synthesis Task Description
5:50 - The fundamental difference to classic Deep Learning
7:00 - NeRF Core Concept
15:30 - Training the NeRF from sparse views
20:50 - Radiance Field Volume Rendering
23:20 - Resulting View Dependence
24:00 - Positional Encoding
28:00 - Hierarchical Volume Sampling
30:15 - Experimental Results
33:30 - Comments & Conclusion
Captions, dear bud'! Caaaaaaaaptions!
Imagine this, but using an AI model that is trained on vast amounts of 3D data from the real world. It would be able to fill in the gaps much more accurately with all the experience it has.
Noice 👍
First time I really truly understand NeRF. Wonderful, simple explanation. Thanks a lot!
This type of pre-digestion for a complex technical paper is very expedient. Thank you.
Wait, did Yannic just release a review of a paper I already read? So proud of myself :D
Incidentally, same.
It's even my thesis subject :O
So proud of Yannic, bro, he is sharing this awesome knowledge!
Gotta present this paper for a seminar at uni so this video makes it so much easier. Thank you so much for this!
One of the best NeRF explanations available. Thank you so much, it helped a lot.
My guy, this has to be the best tutorial on NeRF I've seen, finally understood everything
UC Berkeley - I salute this university when it comes to A.I. research. In most big papers, you will definitely see one or more scholars from it.
Man, you have so many clear notes explaining papers. I got tons of help from your videos.
Excellent explanation. Realtime 3D Street View should be right around the corner now.
I feel this approach has probably been used in physics and 3D image reconstruction for a long time with the Fourier decomposition technique (which is renamed as the positional encoding here). The main point is that it is 1 model per object, so I feel like it is a curve fitting problem. Though using gradient descent and a neural-network-like framework probably makes it much easier to model.
Would that be analogous to the DCT technique in video codecs, and could this hopefully shed some light on a potentially better video codec? 👀
DTFT?
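A minimal NumPy sketch of the Fourier-feature mapping the thread above is pointing at (NeRF's positional encoding); the function name and the choice of 10 frequencies are mine, not the paper's code:
import numpy as np

def positional_encoding(p, num_freqs=10):
    # p: array of shape (..., d) with coordinates roughly in [-1, 1].
    # Returns sin/cos features of shape (..., d * 2 * num_freqs).
    freq_bands = 2.0 ** np.arange(num_freqs)        # 1, 2, 4, ..., 2^(L-1)
    scaled = p[..., None] * freq_bands * np.pi      # (..., d, L)
    feats = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return feats.reshape(*p.shape[:-1], -1)

xyz = np.array([[0.1, -0.4, 0.7]])                  # one 3D point
print(positional_encoding(xyz).shape)               # (1, 60)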
The "1 model per object" is an interesting, if not surprising, evolution in itself, as it can be seen as a further step in the direction of neural computing (as opposed to algorithmic computing), where memory/data and computation are encoded in the same structure (the neural network), in a manner not dissimilar to our own brains.
The "overfitting" is one of the core principles in Functional Programming/Dataflow Programming. Very awesome to see, wil have to check whether or not it was a locally unique idea, or if it is directly pulling from the aforementioned knowledgebases.
Can we say memorizing instead of "overfitting"? It sounds more intuitive to me.
It more closely represents what happens in the brain, where the neural networks represent a coherent, high-fidelity representation of real-world signal information. However, that kind of detailed scene representation normally has a lot of temporal decay, with the "learning" being a set of generalized learning elements extracted from such input info. For example, you could "learn" a generalized coordinate space (up, down, left, right, near, far), depth perception, perspective, surface information (convex, concave, etc.), shape information, etc. But that would be another set of networks for specific tasks, with less temporal decay and more generalization parameters to allow higher-order understanding such as object classification, logical relationships between objects and so forth.
Thanks for the great explanation. I finally understand the central ideas behind NeRF.
Thanks for creating such a detailed video on NeRF
"Two papers down the line" we'll probably see a paper that also infers positions and directions of photos.
Nice, Image Based Rendering strikes back after 25 years.
A very detailed and careful explanation, thanks to you!
Thank you for the clear-cut and thorough explanation! I was able to follow and that is definitely saying something because I come from a different world, model-based controls :)
Awesome explanation! Please don't stop making these.
I watched a video on the same topic before but got lost. Now I get it after watching your video.
Beautiful and super intuitive video! Thanks :3
Crazy to think that this came out 2 years ago; the advancement in the field is crazy.
Pretty clear and great, thanks to you!!
This video helps a fresher like me a lot in understanding NeRF, thanks!
Great video. Thanks Yannic!
Dear fellow scholars...
Loved the video. Learned a lot. Thanks
Really cool. I love getting into this stuff. I'm a compsci student in my first year, but considering switching and going for AI. Such an interesting field.
What a time to be alive! ;)
I'm also a first-year student, feeling the same here. Which school do you go to?
@@minjunesong6667 The university of Amsterdam
I was searching for nerf guns; this is better than what I asked for.
Awesome explanation! Thanks for the video.
Superbly explained, thanks!
bro you are killin' it, pretty damn good explanation thanks
Thanks for the explanation!!
Thank you so much, it helped a lot.
How are the (x,y,z) coordinates obtained for the input data?
I assume a pose estimation method was used to get the two angles?
Amazing video, thanks a lot.
Had been waiting for this for a while now. 🔥
awesome video! Really appreciate you doing this!
I heard about Neural Radiance Fields on the TWIML podcast earlier this year, and never connected that it was the same paper Károly (and now, Yannic) talked about.
It's funny how we associate a paper with a photo or graphic a lot of the time.
Fellow visual learners feel free to @ me
Also want to mention that you did a much better job explaining this than Pavan Turaga did on the episode in question, so well done Yannic. The episode I'm talking about is called Trends in Computer Vision or something along those lines for those interested.
Nice explanation!
Yannic, would you ever review your own paper? 🤔
Great explanation, I really enjoyed watching it.
Cool effect. I saw this on Two Minute Papers. To train a NN from different perspectives of the same object - it's hard to get the right data.
Wonderful videos! Thanks for sharing~
Great video. Can you please make one on LeRF (Language embedded Radiance Fields)?
this is VERY cool.
but I believe that if we are given multiple views (images) of a scene, and if we are able to match enough points, it's possible to generate a photogrammetric model of the scene (a little bit like stereo vision, but with many more views so the generated model is more complete), and once we have that, we can simply sample or reproject it to any point of view. Isn't that a simpler way of solving the same problem?
Thank u😮😮😮😮😮 amazing description
Awesome! Was hoping you'd do a video on this one
This is mind blowing
Hi Yannic, I found this video very helpful. Could you do a follow-up on Instant NeRF by Nvidia?
You could stack lots of objects so long as you know the transformation from object to world coordinates and give each object a bounding volume in world space for the ray tracer to bother calculating. If you had a supercomputer, you could render worlds with thousands of overlapping and moving objects :D
Phenomenal video!
D-NeRF
Great explanation as always Yannic! Will you be doing a follow up on their next paper (D-NeRF), which handles dynamic scenes?
This is an amazing explanation! I have a doubt though. You talked about the major question of training images not having information about "density". How are we even computing the loss in that case for each image? You said we compare what we see with what the model outputs. But how does the model give different density information for a particular pixel if we don't have that kind of information in the input? How will having a differentiable function that can backtrack all the way to the input space be of any help if we don't have any reference or ground truth for the densities in the training images?
View synthesis shows the power of interpolation!
Can anyone tell, how do we get the density parameter during training?
Since, we don't have the full 3D scene?
The density is something the network makes up. It only exists so that you can use it to say what each pixel from a new viewpoint should look like. If you ask the network to generate a picture you already have then the loss from comparing them gives the network enough information to find a density.
The density it finds doesn't mean anything outside of the specific algorithm that generates images from the network outputs. Just think of the network as generating 4 numbers that are then fed to some other function h(a, b, c, d) that we use to generate pictures from those 4 numbers. We can name the 2nd number "red" but the network doesn't "care"; it's just an output, same as what they chose to call "density".
The output used for training is not the (color, density) array but the rendered images. After the network predicts (color, density) for the sample points, this info is then rendered into images using the volume rendering technique, so the loss is the error between rendered images and training images instead of the (color, density) array itself.
Thank you so much for both the replies. Now, that you have explained it, it makes much more sense to do it like that. It also helped me clarify a few doubts which I had with follow-up NERF based papers. Huge help!
@@shawkontzu642 Yes, I agree. The (color, density) is an intermediate output that gets fed into h(a, b, c, d) whose outputs are rendered images. The h function doesn't need to be trained though. It is just the volume rendering technique.
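To make the two replies above concrete, here is a rough NumPy sketch of the volume-rendering step that composites per-sample (color, density) into a pixel, with the loss taken against the training photo. The sample count and step size are made up; the real pipeline is differentiable end-to-end, NumPy is only used here to show the shape of the computation:
import numpy as np

def render_ray(colors, densities, deltas):
    # colors: (N, 3) RGB at N sample points along one ray
    # densities: (N,) non-negative sigma at those points
    # deltas: (N,) spacing between consecutive samples
    alphas = 1.0 - np.exp(-densities * deltas)                       # per-segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]   # light surviving so far
    weights = trans * alphas                                         # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                   # final RGB for this pixel

pred_rgb = render_ray(np.random.rand(64, 3), np.random.rand(64), np.full(64, 0.03))
true_rgb = np.array([0.8, 0.2, 0.1])          # the corresponding pixel of a training image
loss = np.mean((pred_rgb - true_rgb) ** 2)    # the loss only ever sees rendered pixels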
BRAVO!!!
Duuuude, I'm learning NeRF, and this video is a jewel, thank you!
I was a bit confused; I thought this paper had already been reviewed, but it was actually the SIREN paper.
I think the fact that it takes 1-2 days on a V100 is the biggest gotcha
Another gotcha: the view dependent part only comes into play in the last layer. It really doesn't do much but optimize some common geometry.
@@jeroenput258 maybe this is akin to GPT-3's large store of knowledge, where the top layers all store information about the image and the last layer basically just constructs an appropriate image for a given view using the shapes and textures it learned
Perhaps, but I don't know enough about GPT-3 to answer that. What I do find odd is that when you move the ray direction parameter up just one layer, the whole thing falls apart. It's really strange NeRF even works, imo.
Great explanation, thanks Yannic. Shouldn't it be called 6D - x, y, z, direction, color and density?
I've started the "I invented everything" video yesterday and paused to continue today, but it's private now :(
I noticed many of the scenes were from UC Berkeley, kinda trippy. The engineering school there gave me a bit of PTSD, ngl.
It'd be cool if someone combined this with volumetric or light field displays
Ancient, huh! Nice way to put it.
Hey will you ever do a video explaining Knowledge Graphs / Entity embeddings? For example by talking about the "Kepler, a unified model for KE and PLM" paper
hopefully this comment isn't in bad taste - but the changes in lighting patterns on the right-hand side at 2:51 reminded me a lot of how light behaves while under the influence of psychedelics. yes, I'll see myself out...
I must be not understanding something. How do you get the density from 2d images?
I don't understand either
The 2D images give you multiple perspectives of the same point in space just from different angles. If you combine that information (in this case using the neural network) then you can get a good idea of whether or not there is something at a particular point. Density is not how much mass there is in the volume around a point but rather how much stuff there is at that point that interacts with light.
Think of it like what people naturally do when they pick something up and look at it from different angles. Each of those angles is like the pictures and the internal idea you would form of what the 3D object looks like is the model that gets trained. By the time you've analysed an object this way, you also can make pretty good guesses about which parts of the object reflect light, how much light, from which angles and what colour to expect. That's basically the model. The density is how much that point dominates the light you get from that point and could be something like 0 to 1 being from completely invisible to completely opaque.
Also, if you just look at the pictures you train on, your brain can build this model so that you have a good sense of whether or not a purple spot makes sense on the yellow tractor.
You ask the neural network for density information, not the images. The pixels (RGB values) in those images serve as the target for the neural network to train on.
@@thanvietduc1997 OK. that makes sense. Thanks.
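As a sketch of how "multiple perspectives of the same point" enter the picture: each training pixel becomes a ray in world space via its camera pose. The pose is assumed known (e.g. from structure-from-motion), and the pinhole convention below is one common choice, not necessarily the paper's exact code:
import numpy as np

def pixel_ray(cam_to_world, i, j, focal, width, height):
    # Turn pixel (i, j) of one training photo into a ray (origin, direction) in world space.
    x = (j - width / 2.0) / focal
    y = -(i - height / 2.0) / focal
    dir_cam = np.array([x, y, -1.0])                 # camera looks down -z
    dir_world = cam_to_world[:3, :3] @ dir_cam
    dir_world /= np.linalg.norm(dir_world)
    return cam_to_world[:3, 3], dir_world

pose = np.eye(4)                                     # hypothetical identity pose
origin, direction = pixel_ray(pose, i=100, j=200, focal=500.0, width=400, height=300)
# Rays through the same scene point from different photos agree near that point,
# which is what lets the network pin down where the light-interacting "stuff" is.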
When will this become available for image/video software?
So the network only works on one scene? And it is more of a 3D model compressor than a 3D scene generator, am I understanding this correctly?
This sounds as if presentation could be entirely done in a raymarching shader on the GPU as I suspect the evaluation of the model can be implemented as a shader.
Great explanation, Yannic! I'd like to know if this technique could be used for 3D modelling?
Yes, you can construct a triangle mesh from NeRF density data
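A sketch of the mesh-extraction idea in that reply, assuming you can query the trained model's density on a regular grid; query_density here is a fake stand-in (a solid sphere) and the level threshold is hand-picked:
import numpy as np
from skimage import measure                     # scikit-image

def query_density(xyz):
    # Stand-in for a trained NeRF's density output: a solid sphere of radius 0.5.
    return (np.linalg.norm(xyz, axis=-1) < 0.5).astype(float) * 50.0

n = 64
axes = [np.linspace(-1, 1, n)] * 3
grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)      # (n, n, n, 3)
sigma = query_density(grid.reshape(-1, 3)).reshape(n, n, n)

verts, faces, normals, values = measure.marching_cubes(sigma, level=25.0)
print(verts.shape, faces.shape)                 # a triangle mesh you can export or render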
Reminds me of that SIREN paper.
This is better than magic.
How compact are minimally accurate models? How many parameters?
where does the scene come from?
I would be very surprised to see a similar technology used to render objects inside games. According to the paper, sampling takes 30 seconds on a high-end GPU. As games often run at 60 fps, this would only be viable with a speedup of 1800x, and that's assuming we only have to render a single object (so realistically speaking we could add another factor of 100x).
This does not mean it is not possible with more research and better hardware, but if we compare this to the traditional way of rendering in games, I'm not really sure there is an advantage.
It's not even something we could not do before, as we already have photogrammetry to generate meshes from images.
For unbiased ("photorealistic") rendering I could see some use, but the learning time is way too high for the moment. One application could be to render a few frames of an animation and use the model to "interpolate" between the frames.
Now it has reached a 1000x speedup in both training and inference. What a speed of progress. There is more chance of using this technology where you take a few pictures of an object existing in the real world and reconstruct it as a 3D model by training a neural network. Then you can manipulate it or synthesize 2D images from a novel viewpoint, lighting, time step (if a video) and so on.
@@ksy8585 Very impressive. Do you have a link ?
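The back-of-the-envelope arithmetic from the 30-seconds-per-frame comment above, spelled out; the 60 fps target and the extra 100x factor for many objects are the commenter's assumptions:
seconds_per_frame = 30.0        # paper's reported render time on a high-end GPU
target_fps = 60
single_object_speedup = seconds_per_frame * target_fps      # 1800x
many_objects_speedup = single_object_speedup * 100          # ~180,000x for a full scene
print(single_object_speedup, many_objects_speedup)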
Is this an in-depth breakdown of what photogrammetry is, or is this different?
take pics of an apple from all angles, maximize the score to label it an apple
THE BEST
Help! How do they determine depth density from a photo? Wouldn't you need prior trained data to know how far away an object is, from a single photo?
Yes, search for monocular depth estimation
@@YannicKilcher Thank you!
Hey Yannic. I've been waiting for you to talk about this. Thanks. One question though. The viewing angle, is it like the latitude and longitude angles? We need two values because we want to know how that point looks from both the horizontal and vertical angle, right?
I'd been assuming it was pan and tilt. The full algorithm would need to know the roll of the camera but I don't think that would influence any lighting effects.
It's spherical coordinates (minus the radius, for obvious reasons)
@@ashastra123 but then spherical coordinates have two angles, one w.r.t. the y axis and the other w.r.t. the x axis. So are we using the same nomenclature here?
good question, I've also been assuming one angle to measure how much left-right, and the other to measure how much up-down, on the surface of a sphere. So I am also assuming the viewing angle is like lat/long angles.
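For what it's worth, a small sketch of turning the two angles discussed above into a unit viewing direction; the spherical convention (theta from the vertical axis, phi around it) is one common choice, not necessarily the paper's:
import numpy as np

def view_direction(theta, phi):
    # Unit direction from two angles; no radius needed since only the direction matters.
    return np.array([
        np.sin(theta) * np.cos(phi),
        np.sin(theta) * np.sin(phi),
        np.cos(theta),
    ])

d = view_direction(np.pi / 3, np.pi / 4)
print(d, np.linalg.norm(d))     # always length 1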
Reminds me of that scene from Enemy Of The State ua-cam.com/video/3EwZQddc3kY/v-deo.html -- made over 20 years ago!
Why is this “overfitting”? Wouldn’t overfitting in this case be if the network snaps the rays to the nearest data point with that angle and doesn’t interpolate?
so there are really two networks (coarse and fine), or is this some kind of trick?
It’s NeRF or Nothin’ 😎
i don't think this should be called "overfitting". as far as i'm concerned, overfitting means learning the input data (or at least big chunks of it) as one big pattern itself, instead of finding the patterns within the data and generalizing from it. this system may be able to reproduce the input data faithfully (i haven't compared them 🤷) but it clearly learned to generalize the spatial patterns of the scene.
It doesn't really generalize anything outside the data it has seen, its job is to just learn *really* well the points we care about, but anything outside that range isn't important.
Think of it like if you were to train a network on a function f(x) and you are interested on the domain [-1,1]. Overfitting on this domain would mean that the network is extremely precise inside this interval but does something random outside of it, while generalizing means that we also care about having a good estimate of the function outside the domain.
Here our domain is the parts where we can send rays to; it doesn't really matter what the model thinks is outside the box we never sampled on.
in this context, overfitting might be replaced with undercompression
@@mgostIH yeah, and a network that is trained on birds will probably never generate a good squirrel. i don't think neural nets tend to be good at producing things unlike anything they have ever seen before.
🤔 That's actually a good point. If it "overfit" it wouldn't be able to interpolate novel viewpoints, just the pictures it was trained on.
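A toy NumPy illustration of the [-1, 1] argument a few replies up; the target function, noise level and polynomial degree are all made up:
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, 200)                      # we only ever sample inside [-1, 1]
y_train = np.sin(3 * x_train) + 0.01 * rng.normal(size=200)

coeffs = np.polyfit(x_train, y_train, deg=9)           # deliberately flexible fit

inside = np.linspace(-1, 1, 100)
outside = np.linspace(1, 2, 100)
err_in = np.mean((np.polyval(coeffs, inside) - np.sin(3 * inside)) ** 2)
err_out = np.mean((np.polyval(coeffs, outside) - np.sin(3 * outside)) ** 2)
print(err_in, err_out)    # tiny where we trained, typically huge where we never sampled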
I'm studying light fields; the premise makes it not that impressive to me. Program a Lytro-like renderer and you'll know what I mean.
7:05 - 7:45 So we use the same neural network for multiple different scenes? - That's smart because then we don't need to retrain it every time.
I think saying that each scene is associated with one single neural network (the NN is overfitted for that scene) is not correct.
Code?
But can this be used real time?
5:50
finally something without CNNs. bravo, guys.
Deep Tesla
Python code?
Obligatory "we're living in a simulation" comment
It's like the end of photogrammetry
NeRF is a step forward in photogrammetry
Yannick "not so lightspeed" Kilcher