Thanks for watching! Did you follow along with the exercise and try and find issues yourself? What did you find? 👇
Also don't forget you can try everything Brilliant has to offer, free, for a full 30 days: visit brilliant.org/TheCherno. You’ll also get 20% off an annual premium subscription.
I sign the petition to see this game running on GPU
Me too
me too
yes, That would be interesting...
+1
me five!
I would ABSOLUTELY like it if you could run that code on the GPU
Yes please, make this run on the gpu; maybe even increase the resolution to full screen and compare the results.
Recommended settings for that game:
CPU: Intel Core i3 8100
GPU: Yes
Some of your assumptions are completely wrong - like with the sin/cos values:
Recalculating those values would be significantly slower, but the compiler can see that they are identical and will not redo the work every time. On the other hand, explicitly storing those intermediate values has no chance of being a cold memory read, as they would only be used right after being calculated.
With the memory access in different parts of the array: Nah, that really isn't a problem. Going backwards for the sky is likely far worse - but without an actual isolated benchmark there is no way of saying what is going on.
And what really is slow is setting every single pixel with SDL_RenderDrawPoint - this is extremely slow. The function is doing renderer setup, checks, allocations and a lot more for every single pixel. Using your own pixel buffer and then sending in the whole thing at once will be much, much faster.
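For reference, a rough sketch of that pixel-buffer approach with SDL2, using made-up names and a streaming texture (not the project's actual code):

```cpp
// Write every pixel into our own buffer, then upload it and issue a single
// draw per frame instead of one SDL_RenderDrawPoint call per pixel.
#include <SDL.h>
#include <cstdint>
#include <vector>

SDL_Texture* CreateFrameTexture(SDL_Renderer* renderer, int width, int height)
{
    // Streaming texture that we overwrite from the CPU every frame.
    return SDL_CreateTexture(renderer, SDL_PIXELFORMAT_ARGB8888,
                             SDL_TEXTUREACCESS_STREAMING, width, height);
}

void PresentFrame(SDL_Renderer* renderer, SDL_Texture* frame,
                  std::vector<uint32_t>& pixels, int width, int height)
{
    // The per-pixel loops write into pixels[y * width + x] here
    // instead of calling into the renderer for each point.
    for (int i = 0; i < width * height; ++i)
        pixels[i] = 0xFF202020; // placeholder colour

    SDL_UpdateTexture(frame, nullptr, pixels.data(), width * (int)sizeof(uint32_t));
    SDL_RenderCopy(renderer, frame, nullptr, nullptr);
    SDL_RenderPresent(renderer);
}
```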
If they were locally cached, the values would not be cold. I was thinking they would be cached once before rendering and then fetched each frame. They may end up being prefetched though, so they would still be in cache.
Can CPUs detect reverse loop offsets and prefetch?
I was thinking the same thing about drawing pixels. There is going to be a massive overhead individually drawing pixels. I am surprised he didn't mention that. Perhaps I will compare.
@@xeridea "Can CPUs detect reverse loop offsets and prefetch?"
Can? pretty sure - yes. But as there are other things going on it is still better to avoid that.
I have seen instances where the branch-predictor managed to get better than chance performance on data that was basically random, and memory-prefetch for lists.
Yep! Cherno missed the elephant in the room this time! )
If the values are stored on the stack they would basically never be in cold memory right? Because the CPU is accessing the stack all the time when you call functions, push function arguments to the stack, write to a stack allocated buffer, etc. So the area of memory that contains the stack would be in the CPU cache most of the time wouldn't it since it's being used constantly?
Doesn't the compiler just inline them?
9:34 Make the raytracing series run in a shader, it would be really cool to see how you would implement it. Maybe another cool video idea would be compute shaders with vulkan, or a vulkan series in general, kind of like the opengl series.
+1
or maybe running ray tracing with CUDA that gives a lot more low-level control
@@emomaxd2462 OpenCL would probably be a better bet, no? But I think an OpenGL shader would be the best bet as it is the closest to practical application in game engines.
You should do a collaboration video with one lone coder :D that would be awesome.
lerp is available in the standard library since C++20 as std::lerp in <cmath>
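For reference, a one-line usage sketch:

```cpp
#include <cmath>   // std::lerp lives in <cmath> since C++20

float mid = std::lerp(0.0f, 10.0f, 0.5f); // 5.0f: computes a + t * (b - a)
```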
whatfffff 💀😭😭😭
15:30 While you're right in most cases that caching does come with the cost of memory and of reading from memory, in this specific case it's just a constant that not only never changes, it can be computed at compile time and will very likely just live directly inside the instruction as an immediate operand. Compilers are very smart.
17:40 - 20:20
It's not *that* bad since it's almost sequential.
Also, 19:25 suggests that reading an array in reverse is always bad, which is wrong (it might not be what was meant, but it's very easy to interpret it like that).
I made a small program to illustrate that but youtube dislikes comments with links and ate it (so I'm reposting my comment without that, I hope I'm not being a bother). The takeaway was that:
- going through an array sequentially or in reverse (sequentially but backwards) doesn't noticeably change performance
- reversing the order of rows (like in the video) or columns causes a small performance hit (about +5% time spent on my machine with x=y=10000 and a loop body consisting of a single addition), possibly not noticeable if the loop body does as much work as the one shown in the video
- iterating over x in the outer loop and y in the inner loop however causes a massive performance hit (about +900% time spent, same context as above), that's the main thing to avoid if possible
- random access is even worse (about +1500% time spent, same context as above)
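For anyone curious, a minimal self-contained sketch along those lines (not the original benchmark, just the row-order vs column-order comparison it describes):

```cpp
// Summing a width x height grid with the loop order swapped. The y-outer
// version walks memory sequentially; the x-outer version strides by `width`
// elements per access, which is the +900% case mentioned above.
#include <cstdint>
#include <vector>

uint64_t SumRowMajor(const std::vector<int>& grid, int width, int height)
{
    uint64_t sum = 0;
    for (int y = 0; y < height; ++y)        // rows outer: contiguous reads
        for (int x = 0; x < width; ++x)
            sum += grid[(size_t)y * width + x];
    return sum;
}

uint64_t SumColumnMajor(const std::vector<int>& grid, int width, int height)
{
    uint64_t sum = 0;
    for (int x = 0; x < width; ++x)         // columns outer: strided reads
        for (int y = 0; y < height; ++y)
            sum += grid[(size_t)y * width + x];
    return sum;
}
```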
6:12 When std::chrono is such a cumbersome namespace that you need to make a wrapper around it.
just like pretty much the entire standard library
I did actually do, for a project a few months ago, this exact code as a GLSL fragment shader. It's quite fun as a project.
I would love to see you convert this loop to use SSE/AVX intrinsics to really start to use the power of modern CPUs, not enough people really know or understand about that stuff
Please teach multithreading in scenarios like this and how to offload work onto multiple CPU cores.
I would say for caching you would want a good mix of instructions.
The issue is that a typical modern CPU does about 4 instructions per cycle, but every instruction takes anywhere between about 4 and 15 cycles to complete.
If you feed the result from one instruction that takes a lot of cycles into another one, it has to wait for it to catch up.
The scheduler often does a good job of this over short stretches, but if you do a loop, that may not be possible.
So inside the loop you would want a good mix of cache accesses, float math and other instructions. The better the mix, the faster it will execute.
Of course in this case, if you want it done quickly, you would really want to use the SIMD instructions.
It's also worth saying that the L1 cache is typically fairly small, but it's basically instant - typically around 32 kB of L1 cache. If you do something like 256-bit SIMD you really would want no more than about 100 of them in cache at any one time, preferably quite a bit less.
I would speculate that a reasonable approach would be to set up a calculation for a block of SIMD work and run it for 20-30 sets at a time, then rework them, and during the rework set up the next block, allowing it to draw from memory while it's calculating the old work.
I love these code review series. Keep up the good work.
Heck yeah! I'd love to see that rewritten on a GPU. In fact, I was thinking of how to rewrite it as a shader as you were going through it.
You would probably speed up a lot using SSE on the inner x loop too (can be combined with threading). The compiler may not be able to do that on its own since it can't know if the rows are aligned or have lengths that are multiples of 4. The texture lookups could not be vectorized, but the math could. Could probably even vectorize the rgba unpacking (it is just bit shifting).
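A rough illustration of that idea with SSE intrinsics, using made-up names (fStartX, fEndX, nScreenWidth) rather than the video's actual variables - only the per-pixel sample interpolation is vectorized here, four x-values at a time:

```cpp
#include <immintrin.h>

void ComputeSampleRow(float fStartX, float fEndX, int nScreenWidth, float* pSampleX)
{
    const __m128 vStart = _mm_set1_ps(fStartX);
    const __m128 vDelta = _mm_set1_ps((fEndX - fStartX) / (float)nScreenWidth);
    __m128 vX = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f); // pixel indices 0..3
    const __m128 vStep = _mm_set1_ps(4.0f);          // advance 4 pixels per iteration

    for (int x = 0; x + 3 < nScreenWidth; x += 4)
    {
        // sampleX = startX + (x / width) * (endX - startX), four lanes at once
        const __m128 vSample = _mm_add_ps(vStart, _mm_mul_ps(vX, vDelta));
        _mm_storeu_ps(&pSampleX[x], vSample);
        vX = _mm_add_ps(vX, vStep);
    }
    // A scalar tail loop would handle widths that aren't multiples of 4.
}
```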
This was really cool, I hope you continue the code review series. :D
Raytracing series comeback when?
Just here upvoting all “let’s get this on the GPU” comments
Really cool video. Now I'm waiting the video about bringing that code to the GPU.
Please do make a video on leveraging the GPU as much as possible. It would be great if you can go into particle emitter calculations on GPU as well.
Mode 7 rendering is cool and was first implemented with hardware in the SNES
Do you think you could make a video explaining the cache and how to optimize for it?
I want to see more like this in the future
15:52 The cost of "caching" mainly depends on where you put it and how you retrieve it. A HashMap for example, while being the most awesome data structure ever, involves quite a bit of math to retrieve a value from a given key. In THIS specific case though, since the variable doesn't depend on anything else at the moment, you'd probably simply keep it in the same struct alongside the fWorldA and fFoVHalf that you're already accessing, so it would be in a very similar place in memory, with no expensive math to retrieve it, and the relative cost of a trigonometry function on the sum of two variables in a struct is definitely higher than retrieving a single variable from that same struct.
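A minimal sketch of that layout, reusing the fWorldA/fFoVHalf names mentioned above (assumed names, not the project's real struct):

```cpp
#include <cmath>

struct WorldView
{
    float fWorldA  = 0.0f;               // view angle
    float fFoVHalf = 3.14159f / 8.0f;    // half field of view

    float fCosLeft = 0.0f;               // cached cos(fWorldA + fFoVHalf)
    float fSinLeft = 0.0f;               // cached sin(fWorldA + fFoVHalf)

    void UpdateAngles()                  // called once per frame, before the pixel loops
    {
        fCosLeft = std::cos(fWorldA + fFoVHalf);
        fSinLeft = std::sin(fWorldA + fFoVHalf);
    }
};
```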
Yes yes yes, please make a video on how to take this code and transform it into a GPU version :D
Keep on making these awesome videos, they're great!!
Makes entire screen white
"Well, yes, looks much better"
One tiny hint, if you have to specially handle an iteration because of an initial zero value, it's better to have that code before the loop and then start the loop at one. It'd be nice if the compiler would always recognize what's happening and do that for you, but it's also significantly more clear if you do it yourself.
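A tiny self-contained illustration of that pattern (made-up example, not the video's code):

```cpp
// Peeling the first iteration: the "element zero" special case moves in front
// of the loop instead of living as a branch inside the loop body.
#include <vector>

std::vector<int> RunningDifference(const std::vector<int>& values)
{
    std::vector<int> diff(values.size());
    if (values.empty())
        return diff;

    diff[0] = values[0];                        // the former "i == 0" special case
    for (size_t i = 1; i < values.size(); ++i)  // loop starts at 1, body stays simple
        diff[i] = values[i] - values[i - 1];
    return diff;
}
```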
Hey @TheCherno
I request you to please make a tutorial on running the exit screen rendering on GPU.
I would love to see you deploy this workload on an iGPU if you're on Intel or AMD, or use a dGPU like Nvidia. Maybe we can dive into some CUDA programming too if needed in future, or raw C++ is fine for now.
Also include this idea to utilise the same with your ray tracing examples.
Netizens please hit the like button below if you feel the same.
19:39 The CPU fetches memory as fixed lines. It basically divides the whole address range into fixed lines of (usually) 64 bytes. When a particular address is accessed, its whole line will be fetched, some of which could be behind it.
Suddenly looping backwards may result in some waste, as a line may have been loaded going forward that doesn't get fully utilised, but the difference would be imperceptible.
+1 for the GPU video, especially if you can make it simple in your usual style. I last used a GPU when OpenGL was still properly pipelining, no shaders, so that's like 2 decades out of date knowledge.
creating this exact look in GPU would be really interesting
Caching can be good if it's only done on code that is looped a lot, and kept small enough to stay within the CPU cache; unless the operations are really expensive, in which case a larger cache would be fine.
AFAIK, looping backwards may not necessarily be horrible because prefetchers can detect offsets and fetch accordingly, but forward is still likely better.
I would say a big slowdown is calling a function to draw each pixel. You could just save everything to a buffer, then do the 1 draw.
From what I could see in the video (I've not got the source code so can't be sure), the values pre-trig functions could all be constexpr. In C++26, the trig functions will also be constexpr. But the first thing I saw was the sin and cos calculations were each repeated 4 times, so I'd start there. Good observations re memory caching and memory access in the inner loop.
Now we just need Cherno and olc to collaborate on an light weight engine and the world would be a little more perfect.
15:50 Given that some of these don't ever change, why even cache them if you could precalculate them with your compiler with a constexpr, I think.
9:40 DO IT, There is always need for GPU coding tutorials!
when i wrote a raytracer in js caching everything made it like 350x faster LMFAO (don't ask why i was writing a raytracer in js)
Why were you writing a raytracer in js
why
I would like to see you do this on the GPU
Yes, GPU it, please
I would actually love to see you review a code of javidx9 himself.
Like his pixel game engine or any other project he has shown in his videos.
Good video! Thanks. Please make the GPU (imported original code) video! :)
YES I WANT THAT!
Petition to continue the ray tracing
Could we please have the return of the ray tracing series? 🙏🙏
9:45 I would absolutely love to see how you take code like this and run it on a GPU
I'd love to see a video where you make something like this run on the GPU please :)
9:37 Yess
Yes make a video on how to run this on a GPU. Thanks!
I LOVE THIS KIND OF VIDEOS
Although you split ground and sky rendering into two loops, you didn't change the sky accessing memory "backwards"; I think that would've made a big(er) change in performance than just splitting the two into two separate loops.
It would be interesting to see if making all those constants const or locking them to an anonymous namespace will make a difference.
I'm surprised you didn't make an rgba32 backbuffer and format the loaded textures as rgba32, then you could avoid all function calling overhead and copy pixels directly from source to destination buffer. SDL uses the GPU behind the scenes so setting up render state and issuing a draw call for each pixel has a much higher overhead than performing all of the work in the CPU and flushing the buffer at the end.
I feel like if you still wanted the cosf to be there for easy reading, you could put it in a const scope so it can just be calculated at compile time. Why cache it when the compiler can inline the results for you 😊
These magic scribbles are well and good but where do I get a cool hoodie like Cherno's?
I’m more interested in how you’d multi-thread the rendering of this than seeing it run on a GPU, which I think would just be a ton of boilerplate code and a fragment shader that looks very similar to the existing code (I could be wrong!). I guess multi-threading would introduce concepts like synchronicity? I.e how do you avoid tearing effects if different cores are spitting out pixels at different rates? Mostly guessing here, I’m new to C++ myself and am strictly a single-thread guy right now.
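A minimal sketch of one way to split the rows across threads (made-up names, not the video's code); each thread owns a disjoint band of the pixel buffer, so no locking is needed during the frame, and presenting the finished buffer once avoids tearing between bands:

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

void RenderRows(uint32_t* pixels, int width, int yBegin, int yEnd)
{
    for (int y = yBegin; y < yEnd; ++y)
        for (int x = 0; x < width; ++x)
            pixels[(size_t)y * width + x] = 0xFF000000; // per-pixel work goes here
}

void RenderFrame(uint32_t* pixels, int width, int height)
{
    int threadCount = (int)std::thread::hardware_concurrency();
    if (threadCount < 1) threadCount = 1;
    const int rowsPerThread = (height + threadCount - 1) / threadCount;

    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t)
    {
        const int yBegin = t * rowsPerThread;
        const int yEnd   = std::min(height, yBegin + rowsPerThread);
        if (yBegin >= yEnd) break;
        workers.emplace_back(RenderRows, pixels, width, yBegin, yEnd);
    }
    for (auto& w : workers)
        w.join(); // frame complete; hand the full buffer to the renderer once
}
```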
When new "The Cherno's Adventures in Minecraft"?
lmao, last week I wrote a Cacher class which held cached values and the relevant recomputing functions ... granted, my cached computations were actually costly, not just arithmetic operations and trig (there was noticeable lag without caching)
Just for clarity, javidx9 has migrated development of console game engine to pixel game engine which does use the GPU.
Yes please.
There is the math, and there is the code. Math is interesting, code is boring maintenance. Or just do post-processing filtering like FXAA, or TAA, or MSAA, or just supersampling anti-aliasing from a higher resolution down.
Is the following idea good? I have a 1 MB memory block containing many fragments I want to access many times. My idea is to copy all of those fragments into one smaller buffer that will fit into the CPU cache and then do the calculations on that memory. It will probably be faster even if I add an extra allocation at the beginning and end, assuming I access the memory like a million times.
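A sketch of that gather-into-a-compact-buffer idea, with hypothetical fragment descriptors:

```cpp
// Copy the hot fragments out of the big block into one contiguous,
// cache-friendly buffer, then iterate over that buffer many times.
#include <cstring>
#include <vector>

struct Fragment { size_t offset; size_t size; };  // where each hot piece lives

std::vector<char> GatherHotData(const char* bigBlock,
                                const std::vector<Fragment>& fragments)
{
    size_t total = 0;
    for (const auto& f : fragments) total += f.size;

    std::vector<char> compact(total);
    size_t cursor = 0;
    for (const auto& f : fragments)
    {
        std::memcpy(compact.data() + cursor, bigBlock + f.offset, f.size);
        cursor += f.size;
    }
    return compact;   // work on this instead of the scattered 1 MB block
}
```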
Anybody know what tool Cherno uses to draw stuff on the screen? Kindly drop a comment.
Baited my comment :) Please show us running this on GPU
13:55 ........ me too.
yes me too 🤟🏻
Does this mean looping through an array backwards is a cache miss party?
speed to 0.75, then it's watchable 😅
do you only review c++ code?
Please convert this code to a shader, I have been waiting for something like this since I finished the ray tracing series
Pleeeeeaze do it!!!
MOAR GPU VIDS PLS!!!!!!!!!!!!!!!
It's called Skyward Scammer because you scam Gonzo, the guy running after you, and fly into the sky after taking his money. You're welcome
GPU! GPU!
I would love to see you to port that code to GPU =)
Lets go GPU video
If it is the same calculation, why not just inline it?
My memories of pseudo 3D are mostly from a system running at 8 MHz, with some memory of pseudo 3D on a 1 MHz CPU. The performance of this thing is horrendous when you consider that.
My code is a mess of AI for stuff like SDL and structures, a ton of copy and pasting, and hundreds of if statements.
Super Mario Kart!
gpu gpu GPU GPU ❤
Cool
I want this game running at 1000 fps on a gpu
Absolutely port this code to run on discrete graphics hardware! I have no idea how to do that!
GPU!
cpu to gpu code
That code was originally designed for rendering in Windows Command Prompt so it was not expected to be run on GPU
13:30 I feel incredibly called out LMAO - I was exactly the same way 10 years ago and now I am "Future Me" and have to deal with "Past Me" being all clever and stuff. Help...
This is a pseudo comment
🧐
Really enjoyed it, more videos about optimization with this engine please.
javidx9 & ChiliTomatoNoodle