Thanks for watching! Did you follow along with the exercise and try and find issues yourself? What did you find? 👇
Also don't forget you can try everything Brilliant has to offer, free, for a full 30 days: visit brilliant.org/TheCherno. You’ll also get 20% off an annual premium subscription.
I sign the petition to see this game running on GPU
Me too
me too
yes, That would be interesting...
+1
me five!
I would ABSOLUTELY like it if you could run that code on the GPU
Yes please, make this run on the gpu; maybe even increase the resolution to full screen and compare the results.
Recommended settings for that game:
CPU: Intel Core i3 8100
GPU: Yes
Some of your assumptions are completely wrong - like with the sin/cos values:
Recalculating those values would be significantly slower, but the compiler can see that they are identical and will not redo the work every time. On the other hand, explicitly storing those intermediate values has no chance of being a cold memory read, as they would only be used right after being calculated.
With the memory access in different parts of the array: Nah, that really isn't a problem. Going backwards for the sky is likely far worse - but without an actual isolated benchmark there is no way of saying what is going on.
And what really is slow is setting every single pixel with SDL_RenderDrawPoint - this is extremely slow. The function is doing renderer setup, checks, allocations and a lot more for every single pixel. Using your own pixel buffer and then sending in the whole thing at once will be much, much faster.
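For reference, a rough sketch of that pixel-buffer approach with SDL2, using made-up names and a streaming texture (not the project's actual code):

```cpp
// Write every pixel into our own buffer, then upload it and issue a single
// draw per frame instead of one SDL_RenderDrawPoint call per pixel.
#include <SDL.h>
#include <cstdint>
#include <vector>

SDL_Texture* CreateFrameTexture(SDL_Renderer* renderer, int width, int height)
{
    // Streaming texture that we overwrite from the CPU every frame.
    return SDL_CreateTexture(renderer, SDL_PIXELFORMAT_ARGB8888,
                             SDL_TEXTUREACCESS_STREAMING, width, height);
}

void PresentFrame(SDL_Renderer* renderer, SDL_Texture* frame,
                  std::vector<uint32_t>& pixels, int width, int height)
{
    // The per-pixel loops write into pixels[y * width + x] here
    // instead of calling into the renderer for each point.
    for (int i = 0; i < width * height; ++i)
        pixels[i] = 0xFF202020; // placeholder colour

    SDL_UpdateTexture(frame, nullptr, pixels.data(), width * (int)sizeof(uint32_t));
    SDL_RenderCopy(renderer, frame, nullptr, nullptr);
    SDL_RenderPresent(renderer);
}
```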
If they were locally cached, the values would not be cold. I was thinking they would be cached once before rendering and then fetched each frame. They may end up being prefetched though, so they would still be in cache.
Can CPUs detect reverse loop offsets and prefetch?
I was thinking the same thing about drawing pixels. There is going to be a massive overhead individually drawing pixels. I am surprised he didn't mention that. Perhaps I will compare.
@@xeridea "Can CPUs detect reverse loop offsets and prefetch?"
Can? pretty sure - yes. But as there are other things going on it is still better to avoid that.
I have seen instances where the branch-predictor managed to get better than chance performance on data that was basically random, and memory-prefetch for lists.
Yep! Cherno missed the elephant in the room this time! )
If the values are stored on the stack they would basically never be in cold memory right? Because the CPU is accessing the stack all the time when you call functions, push function arguments to the stack, write to a stack allocated buffer, etc. So the area of memory that contains the stack would be in the CPU cache most of the time wouldn't it since it's being used constantly?
Doesn't the compiler just inline them?
9:34 Make the raytracing series run in a shader, it would be really cool to see how you would implement it. Maybe another cool video idea would be compute shaders with vulkan, or a vulkan series in general, kind of like the opengl series.
+1
or maybe running ray tracing with CUDA that gives a lot more low-level control
@@emomaxd2462 OpenCL would probably be a better bet, no? But I think an OpenGL shader would be the best bet as it is the closest to practical application in game engines.
You should do a collaboration video with one lone coder :D that would be awesome.
lerp is available in the standard library since C++20 as std::lerp in <cmath>
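For reference, a one-line usage sketch:

```cpp
#include <cmath>   // std::lerp lives in <cmath> since C++20

float mid = std::lerp(0.0f, 10.0f, 0.5f); // 5.0f: computes a + t * (b - a)
```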
whatfffff 💀😭😭😭
15:30 While you're right in most cases that caching does come with the cost of memory and of reading from memory, in this specific case it's just a constant that not only never changes, it can be computed at compile time and will very likely just live directly inside the instruction as an immediate operand. Compilers are very smart.
17:40 - 20:20
It's not *that* bad since it's almost sequential.
Also, 19:25 suggests that reading an array in reverse is always bad, which is wrong (it might not be what was meant, but it's very easy to interpret it like that).
I made a small program to illustrate that but youtube dislikes comments with links and ate it (so I'm reposting my comment without that, I hope I'm not being a bother). The takeaway was that:
- going through an array sequentially or in reverse (sequentially but backwards) doesn't noticeably change performance
- reversing the order of rows (like in the video) or columns causes a small performance hit (about +5% time spent on my machine with x=y=10000 and a loop body consisting of a single addition), possibly not noticeable if the loop body does as much work as the one shown in the video
- iterating over x in the outer loop and y in the inner loop however causes a massive performance hit (about +900% time spent, same context as above), that's the main thing to avoid if possible
- random access is even worse (about +1500% time spent, same context as above)
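For anyone curious, a minimal self-contained sketch along those lines (not the original benchmark, just the row-order vs column-order comparison it describes):

```cpp
// Summing a width x height grid with the loop order swapped. The y-outer
// version walks memory sequentially; the x-outer version strides by `width`
// elements per access, which is the +900% case mentioned above.
#include <cstdint>
#include <vector>

uint64_t SumRowMajor(const std::vector<int>& grid, int width, int height)
{
    uint64_t sum = 0;
    for (int y = 0; y < height; ++y)        // rows outer: contiguous reads
        for (int x = 0; x < width; ++x)
            sum += grid[(size_t)y * width + x];
    return sum;
}

uint64_t SumColumnMajor(const std::vector<int>& grid, int width, int height)
{
    uint64_t sum = 0;
    for (int x = 0; x < width; ++x)         // columns outer: strided reads
        for (int y = 0; y < height; ++y)
            sum += grid[(size_t)y * width + x];
    return sum;
}
```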
6:12 When std::chrono is such a cumbersome namespace that you need to make a wrapper around it.
just like pretty much the entire standard library
I did actually do, for a project a few months ago, this exact code as a GLSL fragment shader. It's quite fun as a project.
I would love to see you convert this loop to use SSE/AVX intrinsics to really start to use the power of modern CPUs, not enough people really know or understand about that stuff
Please teach multithreading in scenarios like this and how to offload work onto multiple CPU cores.
I would say for caching you would want a good mix of instructions.
The issue is that a typical modern CPU does about 4 instructions per cycle, but every instruction takes anywhere between about 4 and 15 cycles to complete.
If you feed the result from one instruction that takes a lot of cycles into another one, it has to wait for it to catch up.
The scheduler often does a good job of this over short stretches, but if you do a loop, that may not be possible.
So inside the loop you would want a good mix of cache accesses, float math and other instructions. The better the mix, the faster it will execute.
Of course in this case, if you want it done quickly, you would really want to use the SIMD instructions.
It's also worth saying that the L1 cache is typically fairly small, but it's basically instant - typically around 32 kB of L1 cache. If you do something like 256-bit SIMD you really would want no more than about 100 of them in cache at any one time, preferably quite a bit less.
I would speculate that a reasonable approach would be to set up a calculation for a block of SIMD work and run it for 20-30 sets at a time, then rework them, and during the rework set up the next block, allowing it to draw from memory while it's calculating the old work.
I love these code review series. Keep up the good work.
Heck yeah! I'd love to see that rewritten on a GPU. In fact, I was thinking of how to rewrite it as a shader as you were going through it.
You would probably speed up a lot using SSE on the inner x loop too (can be combined with threading). The compiler may not be able to do that on its own since it can't know if the rows are aligned or have lengths that are multiples of 4. The texture lookups could not be vectorized, but the math could. Could probably even vectorize the rgba unpacking (it is just bit shifting).
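A rough illustration of that idea with SSE intrinsics, using made-up names (fStartX, fEndX, nScreenWidth) rather than the video's actual variables - only the per-pixel sample interpolation is vectorized here, four x-values at a time:

```cpp
#include <immintrin.h>

void ComputeSampleRow(float fStartX, float fEndX, int nScreenWidth, float* pSampleX)
{
    const __m128 vStart = _mm_set1_ps(fStartX);
    const __m128 vDelta = _mm_set1_ps((fEndX - fStartX) / (float)nScreenWidth);
    __m128 vX = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f); // pixel indices 0..3
    const __m128 vStep = _mm_set1_ps(4.0f);          // advance 4 pixels per iteration

    for (int x = 0; x + 3 < nScreenWidth; x += 4)
    {
        // sampleX = startX + (x / width) * (endX - startX), four lanes at once
        const __m128 vSample = _mm_add_ps(vStart, _mm_mul_ps(vX, vDelta));
        _mm_storeu_ps(&pSampleX[x], vSample);
        vX = _mm_add_ps(vX, vStep);
    }
    // A scalar tail loop would handle widths that aren't multiples of 4.
}
```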
This was really cool, I hope you continue the code review series. :D
Raytracing series comeback when?
Just here upvoting all “let’s get this on the GPU” comments
Really cool video. Now I'm waiting the video about bringing that code to the GPU.
Please do make a video on leveraging the GPU as much as possible. It would be great if you can go into particle emitter calculations on GPU as well.
Mode 7 rendering is cool and was first implemented with hardware in the SNES
Do you think you could make a video explaining the cache and how to optimize for it?
I want to see more like this in the future
15:52 The cost of "caching" mainly depends on where you put it and how you retrieve it. A HashMap for example, while being the most awesome data structure ever, involves quite a bit of math to retrieve a value from a given key. In THIS specific case though, since the variable doesn't depend on anything else at the moment, you'd probably simply keep it in the same struct alongside the fWorldA and fFoVHalf that you're already accessing, so it would be in a very similar place in memory, with no expensive math to retrieve it, and the relative cost of a trigonometry function on the sum of two variables in a struct is definitely higher than retrieving a single variable from that same struct.
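A minimal sketch of that layout, reusing the fWorldA/fFoVHalf names mentioned above (assumed names, not the project's real struct):

```cpp
#include <cmath>

struct WorldView
{
    float fWorldA  = 0.0f;               // view angle
    float fFoVHalf = 3.14159f / 8.0f;    // half field of view

    float fCosLeft = 0.0f;               // cached cos(fWorldA + fFoVHalf)
    float fSinLeft = 0.0f;               // cached sin(fWorldA + fFoVHalf)

    void UpdateAngles()                  // called once per frame, before the pixel loops
    {
        fCosLeft = std::cos(fWorldA + fFoVHalf);
        fSinLeft = std::sin(fWorldA + fFoVHalf);
    }
};
```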
Yes yes yes, please make a video on how to take this code and transform it into a GPU version :D
Keep on making these awesome videos, they're great!!
Makes entire screen white
"Well, yes, looks much better"
One tiny hint, if you have to specially handle an iteration because of an initial zero value, it's better to have that code before the loop and then start the loop at one. It'd be nice if the compiler would always recognize what's happening and do that for you, but it's also significantly more clear if you do it yourself.
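A tiny self-contained illustration of that pattern (made-up example, not the video's code):

```cpp
// Peeling the first iteration: the "element zero" special case moves in front
// of the loop instead of living as a branch inside the loop body.
#include <vector>

std::vector<int> RunningDifference(const std::vector<int>& values)
{
    std::vector<int> diff(values.size());
    if (values.empty())
        return diff;

    diff[0] = values[0];                        // the former "i == 0" special case
    for (size_t i = 1; i < values.size(); ++i)  // loop starts at 1, body stays simple
        diff[i] = values[i] - values[i - 1];
    return diff;
}
```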
Hey @TheCherno
I request you to please make a tutorial on running the exit screen rendering on GPU.
I would love to see you deploy this workload on an iGPU if you're on Intel or AMD, or use a dGPU like Nvidia. Maybe we can dive into some CUDA programming too if needed in future, or raw C++ is fine for now.
Also include this idea to utilise the same with your ray tracing examples.
Netizens please hit the like button below if you feel the same.
19:39 The CPU fetches memory as fixed lines. It basically divides the whole address range into fixed lines of (usually) 64 bytes. When a particular address is accessed, its whole line will be fetched, some of which could be behind it.
Suddenly looping backwards may result in some waste, as a line may have been loaded going forward that doesn't get fully utilised, but the difference would be imperceptible.
+1 for the GPU video, especially if you can make it simple in your usual style. I last used a GPU when OpenGL was still properly pipelining, no shaders, so that's like 2 decades out of date knowledge.
creating this exact look in GPU would be really interesting
Caching can be good if it's only done on code that is looped a lot, and kept small enough to stay within the CPU cache; unless the operations are really expensive, in which case a larger cache would be fine.
AFAIK, looping backwards may not necessarily be horrible because prefetchers can detect offsets and fetch accordingly, but forward is still likely better.
I would say a big slowdown is calling a function to draw each pixel. You could just save everything to a buffer, then do the 1 draw.
From what I could see in the video (I've not got the source code so can't be sure), the values pre-trig functions could all be constexpr. In C++26, the trig functions will also be constexpr. But the first thing I saw was the sin and cos calculations were each repeated 4 times, so I'd start there. Good observations re memory caching and memory access in the inner loop.
Now we just need Cherno and olc to collaborate on an light weight engine and the world would be a little more perfect.
15:50 Given that some of these don't ever change, why even cache them if you could precalculate them with your compiler with a constexpr, I think.
9:40 DO IT, There is always need for GPU coding tutorials!
when i wrote a raytracer in js caching everything made it like 350x faster LMFAO (don't ask why i was writing a raytracer in js)
Why were you writing a raytracer in js
why
I would like to see you do this on the GPU
Yes, GPU it, please
I would actually love to see you review a code of javidx9 himself.
Like his pixel game engine or any other project he has shown in his videos.
Good video! Thanks. Please make the GPU (imported original code) video! :)
YES I WANT THAT!
Petition to continue the ray tracing
Could we please have the return of the ray tracing series? 🙏🙏
9:45 I would absolutely love to see how you take code like this and run it on a GPU
I'd love to see a video where you make something like this run on the GPU please :)
9:37 Yess
Yes make a video on how to run this on a GPU. Thanks!
I LOVE THIS KIND OF VIDEOS
Although you split ground and sky rendering into two loops, you didn't change the sky accessing memory "backwards"; I think that would've made a big(er) change in performance than just splitting the two into two separate loops.
It would be interesting to see if making all those constants const or locking them to an anonymous namespace will make a difference.
I'm surprised you didn't make an rgba32 backbuffer and format the loaded textures as rgba32, then you could avoid all function calling overhead and copy pixels directly from source to destination buffer. SDL uses the GPU behind the scenes so setting up render state and issuing a draw call for each pixel has a much higher overhead than performing all of the work in the CPU and flushing the buffer at the end.
I feel like if you still wanted the cosf to be there for easy reading, you could put it in a const scope so it can just be calculated at compile time. Why cache it when the compiler can inline the results for you 😊
These magic scribbles are well and good but where do I get a cool hoodie like Cherno's?
I’m more interested in how you’d multi-thread the rendering of this than seeing it run on a GPU, which I think would just be a ton of boilerplate code and a fragment shader that looks very similar to the existing code (I could be wrong!). I guess multi-threading would introduce concepts like synchronicity? I.e how do you avoid tearing effects if different cores are spitting out pixels at different rates? Mostly guessing here, I’m new to C++ myself and am strictly a single-thread guy right now.
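A minimal sketch of one way to split the rows across threads (made-up names, not the video's code); each thread owns a disjoint band of the pixel buffer, so no locking is needed during the frame, and presenting the finished buffer once avoids tearing between bands:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

void RenderRows(uint32_t* pixels, int width, int yBegin, int yEnd)
{
    for (int y = yBegin; y < yEnd; ++y)
        for (int x = 0; x < width; ++x)
            pixels[(size_t)y * width + x] = 0xFF000000; // per-pixel work goes here
}

void RenderFrame(uint32_t* pixels, int width, int height)
{
    int threadCount = (int)std::thread::hardware_concurrency();
    if (threadCount < 1) threadCount = 1;
    const int rowsPerThread = (height + threadCount - 1) / threadCount;

    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t)
    {
        const int yBegin = t * rowsPerThread;
        const int yEnd   = std::min(height, yBegin + rowsPerThread);
        if (yBegin >= yEnd) break;
        workers.emplace_back(RenderRows, pixels, width, yBegin, yEnd);
    }
    for (auto& w : workers)
        w.join(); // frame complete; hand the full buffer to the renderer once
}
```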
When new "The Cherno's Adventures in Minecraft"?
lmao, last week I wrote a Cacher class which held cached values and the relevant recomputing functions ... granted, my cached computations were actually costly, not just arithmetic operations and trig (there was noticeable lag without caching)
Just for clarity, javidx9 has migrated development of console game engine to pixel game engine which does use the GPU.
Yes please.
There is the math, and there is the code. Math is interesting, code is boring maintenance. Or just do post-processing filtering like FXAA, or TAA, or MSAA, or just supersampling anti-aliasing from a higher resolution down.
Is the following idea good? I have a 1 MB memory block containing many fragments I want to access many times. My idea is to copy all of those fragments into one smaller buffer that will fit into the CPU cache and then do the calculations on that memory. It will probably be faster even if I add an extra allocation at the beginning and end, assuming I access the memory like a million times.
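A sketch of that gather-into-a-compact-buffer idea, with hypothetical fragment descriptors:

```cpp
// Copy the hot fragments out of the big block into one contiguous,
// cache-friendly buffer, then iterate over that buffer many times.
#include <cstring>
#include <vector>

struct Fragment { size_t offset; size_t size; };  // where each hot piece lives

std::vector<char> GatherHotData(const char* bigBlock,
                                const std::vector<Fragment>& fragments)
{
    size_t total = 0;
    for (const auto& f : fragments) total += f.size;

    std::vector<char> compact(total);
    size_t cursor = 0;
    for (const auto& f : fragments)
    {
        std::memcpy(compact.data() + cursor, bigBlock + f.offset, f.size);
        cursor += f.size;
    }
    return compact;   // work on this instead of the scattered 1 MB block
}
```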
Anybody know what tool Cherno uses to draw stuff on the screen? Kindly drop a comment.
Baited my comment :) Please show us running this on GPU
13:55 ........ me too.
yes me too 🤟🏻
Does this mean looping through an array backwards is a cache miss party?
speed to 0.75, then it's watchable 😅
do you only review c++ code?
Please convert this code to a shader, I have been waiting for something like this since I finished the ray tracing series
Pleeeeeaze do it!!!
MOAR GPU VIDS PLS!!!!!!!!!!!!!!!
It's called Skyward Scammer because you scam Gonzo, the guy running after you, and fly into the sky after taking his money. You're welcome
GPU! GPU!
I would love to see you to port that code to GPU =)
Lets go GPU video
If it is the same calculation, why not just inline it?
My memories of pseudo 3D are mostly from a system running at 8 MHz, with some memory of pseudo 3D on a 1 MHz CPU. The performance of this thing is horrendous when you consider that.
My code is a mess of AI for stuff like SDL and structures, a ton of copy and pasting, and hundreds of if statements.
Super Mario Kart!
gpu gpu GPU GPU ❤
Cool
I want this game running at 1000 fps on a gpu
Absolutely port this code to run on discrete graphics hardware! I have no idea how to do that!
GPU!
cpu to gpu code
That code was originally designed for rendering in Windows Command Prompt so it was not expected to be run on GPU
13:30 I feel incredibly called out LMAO - I was exactly the same way 10 years ago and now I am "Future Me" and have to deal with "Past Me" being all clever and stuff. Help...
This is a pseudo comment
🧐
Really enjoyed it, more videos about optimization with this engine please.
javidx9 & ChiliTomatoNoodle