Who's excited for part 2?
Keep exploring at brilliant.org/TheCherno/ Get started for free, and hurry: the first 200 people get 20% off an annual premium subscription.
Can you please point me to that sweet Visual Studio color scheme you're using?
Plz do a series on the CPU
Can u make a tutorial on creating a game engine cinematics system? Please :)
yes please
can u plz make a complete VIDEO ON THE ASSEMBLER, similar to the ones on the LINKER AND COMPILER
Thanks a lot for looking at my code! For the logging, I was using spdlog, but then I removed it because I wasn't able to import it using FetchContent haha. This is very useful feedback and I can't wait for part 2!
Cheers, good luck for your classes!
2rd
@@blazefirer english is not my first language lol, my b
@@Stowy it's ok. I saw that there was only one reply and I would be the 2nd, so I couldn't resist making the joke
I suggest taking a look at xmake as a replacement for cmake, it probably has spdlog in its repos and is just a pleasure to use in general.
As a predominantly C developer, I agree with and applaud his choice of adding "pp" to the end of file names to differentiate C and C++ header/source files. They are separate and it should be noted. Arena allocators are a good idea and I've implemented several that I use in my own libraries. Heap allocation need not always be super expensive, even with "vectors", and the mitigation technique I learned years ago that still works beautifully to this day is to scale by a factor of two and always reserve memory starting at some power of two. As to the comment about logging, yes, it is a Windows "feature" to slow down by such a large factor when logging to a console. If you use a Linux distro of nearly any variety you'll be surprised by how quickly the terminal updates compared to Windows.
pp
She pp behind my file till I core dump
@@FREAKBAlTsee pp
lol pp
@8:13 Would not recommend starting with _ ever, because it's too easy to make a mistake: "Use of two sequential underscore characters ( __ ) at the beginning of an identifier, or a single leading underscore followed by a capital letter, is reserved for C++ implementations in all scopes."
@13:31 Not only do you want to be using pointers, but ask yourself "Do I need a hierarchy, or do I just need several implementations of void ClassName::Update(float deltaTime)"? Because if you don't need a hierarchy, don't use one! Use type erasure - yes, it's still a function pointer and a potential cache miss, but it will simplify your code structure. Now you have a folder called Entities instead of a type named Entity that everything derives from, and your type-erased entity type now defines the contract a type must fulfill to be an entity, instead of saying you HAVE to derive from Entity to be useful here.
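For example, a minimal type-erasure sketch (my own made-up names, not the project's actual types):
```cpp
#include <functional>
#include <utility>
#include <vector>

// Type-erased "entity": anything with Update(float) can be stored,
// no common Entity base class required.
struct AnyEntity {
    std::function<void(float)> update;

    template <typename T>
    AnyEntity(T entity)
        : update([e = std::move(entity)](float dt) mutable { e.Update(dt); }) {}
};

struct Circle {
    float x = 0.0f, velocity = 1.0f;
    void Update(float dt) { x += velocity * dt; }
};

int main() {
    std::vector<AnyEntity> entities;
    entities.emplace_back(Circle{});
    for (auto& e : entities)
        e.update(0.016f); // still an indirect call, but no inheritance needed
}
```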
@14:51 Also known as a "cache-miss" because the writer was not as concerned about "cache locality"
@19:59 std::ios_base::sync_with_stdio(false) improves that time considerably, but C++ iostreams are notoriously slow, and the reason is all the safeguarding overhead they do. The console is slow because it has to render, which as you know is bleh. Logging libraries are the way to go in this case - and don't have them output to the console, have them output to files. This is a graphical program, so there shouldn't be a "console out" anyway. Create a new global logger named log or something at the very least. There are multithreaded logging libraries that will attempt to put your logs in chronological order if you don't want to split them.
@25:41 On the virtual part: the operating system will allocate you a "page" of memory when your current page is full, so it's basically the same thing as a small arena allocator, but it's so much smaller than what an arena allocator will give you, and the many, many system calls to the OS to ask for more "pages" are what makes allocation take so long. You're giving your CPU cycles over to the OS, and that's going to mess up your execution cache, because it's code that's not in your program that's being called - malloc or whatever is going to be a function in a dynamic library, aka a function pointer, and more cache misses. Profile your system calls! You may find more than you expect. Also align your types (adds padding) so that when you do ask for a value it doesn't have to fetch 2 cache lines because half of your object is on one line and the other half on another.
@31:47 It looks like the number you're looking for is already computed with the collision pairs as well. You seemed to know you needed to make a vector but made it too early! Make instances as close as possible to where you use them.
@32:09 I think the multiple-solver problem is something that should be handled with a template. From what I saw, you don't need to change your solver dynamically at runtime with the same types. Make your solver something the compiler figures out.
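Rough sketch of what I mean by a templated solver (hypothetical names, not the actual project's API):
```cpp
#include <vector>

struct Collision { /* contact data would go here */ };

struct ImpulseSolver {
    void Solve(std::vector<Collision>&, float) { /* push bodies apart, etc. */ }
};

// The solver is a template parameter, so the call is resolved at compile time:
// no virtual dispatch, and the compiler is free to inline it.
template <typename Solver>
class PhysicsWorld {
public:
    void Step(float dt) { m_Solver.Solve(m_Collisions, dt); }
private:
    Solver m_Solver;
    std::vector<Collision> m_Collisions;
};

int main() {
    PhysicsWorld<ImpulseSolver> world;
    world.Step(0.016f);
}
```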
Cherno has a huge backlog of "the topic for another video" topics, please keep them coming. Yes, we do want the CPU cache, memory fragmentation, and what-not videos in the multiverse.
31:45 Good advice about preallocating vectors. If this is a function that runs every frame, I would take it a step further and make the vectors persistent. Clear them at the start of the function and reuse them. This way the memory remains allocated and keeps getting reused. Another option would be to use an auto-resetting frame allocator like you mentioned earlier. However you go about it, the main idea is to not make new heap allocations every frame.
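A minimal sketch of the clear-and-reuse idea (member names are made up):
```cpp
#include <vector>

struct CollisionPair { int a, b; };

class World {
public:
    void Step(float /*dt*/) {
        // clear() keeps the capacity, so after the first few frames
        // this loop stops allocating entirely.
        m_Pairs.clear();
        for (int i = 0; i < m_BodyCount; ++i)
            for (int j = i + 1; j < m_BodyCount; ++j)
                if (Overlaps(i, j))
                    m_Pairs.push_back({i, j});
        // ...solve m_Pairs...
    }
private:
    bool Overlaps(int, int) const { return false; } // placeholder
    int m_BodyCount = 0;
    std::vector<CollisionPair> m_Pairs; // persistent, reused every frame
};
```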
Yes, implementation and profiling of the optimizations would be super interesting to see!
yes yes yes. Can't wish for anything better!
Totally agree, that would be super interesting.
can't wait for a video like that from the best "TheCherno"
At 14:40 where you're talking about cache misses, there's a relevant article which is really good called "Your computer is not a fast PDP-11"
I'd really be interested in seeing how you add the optimizations. In particular, I'd be interested in seeing how you clean up the memory used by the arena allocator once you're left with holes.
I left C++ like 7 years ago, and this brings so many memories and a smile to my face. I've been watching your videos for a few weeks now and must say: good job and keep uploading. :)
What did you switch to ?
@@tathagatmani probably rust or c
@@tathagatmani Actually I switched first to objective-C and then swift :D Doing iOS mobile development now
c++ has evolved a looot … but he seems to be stuck in c++9x style.
I just started learning C++ and I think I've just finished learning the OOP concepts, and this video is so interesting for me actually! The stuff about memory and access times is pretty interesting.
Would love to see you profiling this after your first look at it. I'm sure the stack allocation and growing of the vector each frame hits like a truck. That would also allow you to show some before/after benchmarks!
Great topic! This has been on my mind these last few weeks while implementing my path tracer and SAH BVH, and the optimisations really add up. Especially referencing objects by index and storing them in a 1D array.
If anyone wants to learn more about memory arenas, there's a good write-up called "Untangling Lifetimes: The Arena Allocator" by Ryan Fleury.
You really caught my attention when talking about the CPU cache, as I've done some work with Assembly Language programming WAY in the past, but yeah, understanding how that works is an amazing detail for optimization. OMG, great idea with the logging to a file vs console, I'm just getting to the point in my project where it's starting to become medium sized, and logging is an issue already, so great to know that logging to files is more efficient...plus the macros... it probably helps that I am watching your video at a time when I'm considering re-working my entire codebase for my main project too. LOL OMG, that's amazing that you can pre-allocate memory and pass an allocator to the vector class, I'm totally going to look into this and try it! Great video, thanks for sharing.
For simple C++ logging, you can look at the `sync_with_stdio(false)` and `std::cin.tie(NULL)` calls to accelerate your `cout` code a bit. `printf` will in general be faster though, because it doesn't deal as much with multi-threaded scenarios. There are even faster ways to output logs, but of course, it's non-trivial overhead.
What I would like to see is you optimizing a project based on the recommendations you gave in this video, then comparing the results with the unoptimized solution via a profiler. Would be super interesting! Great video though! :)
This is awesome, makes me want to write my own physics engine as an exercise. Can't wait for the next video!
I'm not a game dev but this is still very educational.
Just realized you are still doing code reviews and this one has 3 videos. So now I've got my afternoon planned out heh.
im not even a game developer, i just work on the web and the cloud doing backend stuff but this is really interesting to watch. Subbed!!
This main function looks so nice. I wish mine could look so inviting.
this series is really fun to watch and very helpful
Would def love to see you profile this and then implement the optimisations and profile again! (threading, arena, allocators, less heap etc) great video!
I really liked this video, all in for part 2
Great video, though each time I wish that we could see the final optimized version of the project :D
At the moment I don't have time to watch it, but later I will, and I'm sure it's interesting, because videos about how the hardware components work etc. are always something I like learning about :D
Regarding the slow Windows terminal you may be interested in Casey Muratori's videos about it and his refterm prototype project
Was going to mention that as well. He goes into some interesting details about conhost if I remember correctly which is doing a lot of crazy things that make consoles slow on Windows.
I wouldn't expect that the virtual memory thing matters all that much considering current CPUs don't prefetch across page boundaries anyway. But things like huge pages do have advantages in terms of TLB lookups and hit rates.
Often the dependencies between chained pointers are more important than the fragmentation. I.e. you could explicitly construct a linked list in contiguous memory, but iterating will still involve the CPU waiting for each load to complete before it can calculate the next pointer. Iterating over the exact same nodes, but using an index instead of the next pointers, will be much faster: the CPU can prefetch the cache lines.
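Rough illustration of the difference, assuming a hypothetical node type:
```cpp
#include <cstddef>
#include <vector>

// Pointer-chasing: each load depends on the previous one, so the CPU
// can't start fetching node N+1 until node N has arrived from memory.
struct ListNode { float value; ListNode* next; };

float SumList(const ListNode* head) {
    float sum = 0.0f;
    for (const ListNode* n = head; n != nullptr; n = n->next)
        sum += n->value;
    return sum;
}

// Same data laid out contiguously and walked by index: the access pattern
// is predictable, so the hardware prefetcher can stream cache lines ahead.
float SumByIndex(const std::vector<float>& values) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < values.size(); ++i)
        sum += values[i];
    return sum;
}
```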
Why is that channel so good? Humanity deserves it? Oh my, what a gift!
this was fascinating. can you make a video about how to make an arena allocator and then show how you use it when creating vectors?
The reason it's slow to write to stdout is that things like std::flush, std::endl and newlines ("\n") flush the contents of the cout buffer into the terminal's stdout buffer (writing to it). This appears instantly because terminals usually have little or no buffering. The same happens with files on disk, although it's perceived as faster because the contents aren't flushed as frequently, due to how the OS buffers them before writing to the file on disk. So it's not that terminals are slow, it's that any I/O is slow in general.
You can avoid this by flushing the cout buffer less frequently (i.e. outside of loops), but it can be an architectural nightmare and is often not needed, since you're probably more interested in up-to-date info when debugging. Do what Cherno (and many other projects) does and use different levels of logging for more granularity.
Would be great to hear you talking about Static vs Dynamic libraries!
I've been loving your stuff and gotta say the plug for brilliant is brilliant ! I'm going to check that out. Thank you so much Sir.
extremely valuable knowledge passed here, thanks Cherno ♥
Terminal logging is slow in C++ because most streams, especially cout, tend to flush constantly, whereas most file logging implementations in C++ don't perform constant, immediate flushes for every write.
For a better std::cout -> console performance:
1. Call ios_base::sync_with_stdio(false);
2. Call std::cin.tie(nullptr);
3. Use '\n' instead of std::endl
My background is mobile games. I'll still say any build system is better than just VS/Xcode/Android Studio/plain makefiles/shell scripts. Despite CMake being a pain, I'd still recommend learning it: in the wild west of build systems, it tends to be the most common and the best supported by IDEs and toolchains.
The most important reasons to use a build system are getting support for new IDEs automatically and being able to add linters, static analysis, fuzzing and unit tests easily to your project later. I've worked with too many projects where you're stuck with ancient versions of VS and no tests because nobody figured out how to add them (and the code is brittle because of that).
The absolute worst thing you can do is end up with a build process where devs use one process to make local builds and completely separate set of tools to make CI builds.
please release a video where you implement your suggestions. It would be so GREAT !!
Would love a video about the CPU cache and the related!
Haven't seen this in the comments, so I'll leave it here. There's an article called "What Every Programmer Should Know About Memory". It explains in detail how the CPU works with memory, how RAM works, why it's so slow, and why CPU cache memory is so fast. I really recommend reading it (you only need to read the first 3-4 chapters).
I do something very similar with that log macro. It's essentially just a macro that wraps cerr and uses the ANSI color codes. From there DLOG and RLOG are called and will log in their respective debug/(sparse) release builds.
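Something along these lines - a stripped-down sketch of the general idea (just the wrapper + color-code part, hypothetical macro names):
```cpp
#include <iostream>

// ANSI color escape codes
#define LOG_COLOR_YELLOW "\033[33m"
#define LOG_COLOR_RESET  "\033[0m"

// Base macro: tag + color, writes to cerr
#define LOG_IMPL(tag, msg) \
    (std::cerr << LOG_COLOR_YELLOW << "[" << tag << "] " \
               << LOG_COLOR_RESET << msg << '\n')

// DLOG only exists in debug builds; RLOG stays in release (used sparsely)
#ifndef NDEBUG
    #define DLOG(msg) LOG_IMPL("DEBUG", msg)
#else
    #define DLOG(msg) ((void)0)
#endif
#define RLOG(msg) LOG_IMPL("RELEASE", msg)

int main() {
    DLOG("delta time: " << 0.016f); // stream-style arguments work too
    RLOG("simulation started");
}
```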
Nice project and good talk about memory improvements! Memory arenas and transient memory are great and are my most used techniques when I do programming these days.
If you are interested, I have a similar physics project (a 2D fluid simulation) that is a little bit more complex due to its multi-threading + integrated benchmark support and 4 versions in different C++ styles, where I tried to show the difference between naive/from-the-book C++ programming and data-oriented programming, but didn't get it exactly right - especially the data-oriented part. Just give me a hint and I will send you the details.
I'm curious how the actual defragmentation process works in a game engine and how it affects performance in a simulation where we have lots of circles dying
FINALLY! I'VE BEEN WAITING FOR THIS EPISODE FOR AGES
@19:45 About the huge time consumption of logging... You should check out Trice! It speeds up your logging performance on embedded systems :)
Yes if you could look at your optimisations and the effect on performance that would be really cool! Often I spend too much time optimising code for very little return. EDIT...but I do note the FPS is massive here anyway so it is difficult to quantify if it's worth it. Maybe throw in something that really puts a strain on the FPS and see the optimisations make it smooth again? Either way great code Stowy and great review Cherno.
I read the C++ standard drafts 2 months ago, and they said that C++23 (C++2b) will keep supporting .h as a standard header file extension. It doesn't mean that .hpp shouldn't be used, but .h will stay supported: it was previously planned to be phased out, but since it is used a lot in C and also in C++, they will keep it.
You can actually get rid of headers entirely if you use modules
The project: c++ gameplay
The cherno explanations: c++ lore
This is the 1th The Cherno video I watch
Wow, really enjoyed this one as a non game/game engine developer!
I wish I could find the motivation and smarts to be able to do stuff like this
this channel should have more subs
C++ ALREADY HAS AN ARENA ALLOCATOR. It works with all std structures/containers, even vector. It's called PMR (polymorphic memory resources).
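For example, a minimal sketch (needs C++17 and <memory_resource>; the body type here is made up):
```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

struct RigidBody { float x, y, vx, vy; };

int main() {
    // One upfront buffer; the monotonic resource hands out chunks of it and
    // never frees individually -- basically an arena. It only falls back to
    // the heap if the buffer runs out.
    std::byte buffer[64 * 1024];
    std::pmr::monotonic_buffer_resource arena{buffer, sizeof(buffer)};

    std::pmr::vector<RigidBody> bodies{&arena};
    bodies.reserve(1000);                     // comes out of the arena, no malloc
    bodies.push_back({0.f, 0.f, 1.f, 0.f});
} // everything is released at once when the arena goes out of scope
```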
A logging setup I've been messing with has the message simply sent to a queue, where a separate thread pulls from the queue and actually logs the thing
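A bare-bones sketch of that idea (single background thread, no log levels, hypothetical class name):
```cpp
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class AsyncLogger {
public:
    AsyncLogger() : m_Worker([this] { Run(); }) {}
    ~AsyncLogger() {
        { std::lock_guard<std::mutex> lock(m_Mutex); m_Done = true; }
        m_Cv.notify_one();
        m_Worker.join();
    }
    void Log(std::string message) {
        { std::lock_guard<std::mutex> lock(m_Mutex); m_Queue.push(std::move(message)); }
        m_Cv.notify_one(); // the hot path only touches the queue, never the file
    }
private:
    void Run() {
        std::ofstream file("log.txt");
        std::unique_lock<std::mutex> lock(m_Mutex);
        while (!m_Done || !m_Queue.empty()) {
            m_Cv.wait(lock, [this] { return m_Done || !m_Queue.empty(); });
            while (!m_Queue.empty()) {
                std::string msg = std::move(m_Queue.front());
                m_Queue.pop();
                lock.unlock();        // do the slow I/O outside the lock
                file << msg << '\n';
                lock.lock();
            }
        }
    }
    std::mutex m_Mutex;
    std::condition_variable m_Cv;
    std::queue<std::string> m_Queue;
    bool m_Done = false;
    std::thread m_Worker; // declared last so the other members exist first
};
```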
I wouldn't worry about fragmentation. It's the heap allocator's job to worry about managing that. And in the general sense, as long as you free memory in the opposite order that you allocated it, fragmentation will not be a problem. I say this as someone who has implemented malloc+free in C. To get a memory leak from allocator fragmentation, you would have to do some insanely stupid things. Of course don't just allocate willy nilly from the heap if you don't have to. Heap allocation carries a performance overhead because when malloc has to get more memory, it has to do so via a system call, which means a context switch, which is slow. That's the `sys` metric given by the `time` command.
Regarding specifically what is said in the video, where you go into low level machine details like the CPU cache, I especially wouldn't worry about that, because that's premature optimization. Worry about choosing efficient algorithms, not about how the machine accomplishes a task. That's the compiler's job. Turn on that -O3 flag. Or -Ofast if you're not worried about slightly less precise math. Sometimes you can justify low level optimizations, like when the Quake devs implemented the fast inverse square root using low level floating point math. But then look what happened--the chipset manufacturers and compiler vendors caught up. Nowadays, the quake inverse square root is no faster (and sometimes slower) than code that a compiler will generate for a more straightforward algorithm. I do not recommend wasting your time optimizing for hardware. The compiler has already done it and you can save a lot more time by choosing a better algorithm. C (and by extension C++) is not a low level language, and your computer is not a fast PDP.
A big problem with that argument is the assumption that the pieces of data necessarily will be fragmented. It's "whataboutism" taken to the extreme. But let's look at an average case where you allocate 100 small objects using a heap allocator: the heap allocator has a free pool of memory, so it slices a chunk off for both the object and the bookkeeping node to manage that memory, and updates the other node to account for the borrow. It does this over and over again until 89 objects in, the pool doesn't have enough memory. So the allocator will do a context switch asking for more memory. The memory comes from the heap, so it will be adjacent to the previous memory, but it will continue to allocate memory until all objects are allocated. The allocator is smart, it doesn't want to waste CPU time by making a bunch of syscalls to allocate tiny blocks of memory, so it does them in bulk. Pages and pools of memory that it marks up and manages. If the addresses were wildly spread out, that would mean the allocator is allocating random pages for every single allocation request, and all those context switches would be a far worse bottleneck than a cache miss. But as it turns out, the heap grows upward. The addresses are all fairly close together.
Now, you can optimize your code to assume that the allocator allocates a huge chunk of memory that's all close together, or you can optimize it to assume that the addresses will be far apart, but in the end, that's all you're doing: assuming. The standard says nothing about how the allocator is implemented. Don't assume. Write better algorithms. If the compiler thinks your array of structs will be more efficient if it turns it into individual arrays of the one element you access, it will do exactly that. That's the ultimate lesson: the compiler is better at optimizing than you are.
Strong argument to use hpp: a potential user does not need to think about extern "C". If it's .hpp, it can only be included directly in C++. .h leaves a lot of room for speculation. Can you include it from C? Can you include it from C++? Do you NEED to wrap it in extern "C"? It's there for a reason.
Cool reference video for quite a lot of topics. Works well as a refresher :-)
Please make a video on handling big data
Along with memory management and time complexity
Logging to the console on Windows is indeed substantially, like Substantially, slower than on Linux. However, there are ways to speed it up as well, both by using Microsoft's new Windows Terminal and by buffering in the program instead of flushing every single log immediately. Still not as fast as on Linux, but it helps a ton.
Thanks, you're giving me a heads up on what to do next. I'm probably going to start making a 2D physics engine.
Thanks btw got your brilliant discount :)
I’m surprised you didn’t mention the fact that variables starting with just an underscore are considered reserved by the language.
23:50 it's called 'placement new'
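i.e. constructing the object into memory you already own; a tiny sketch:
```cpp
#include <cstddef>
#include <new>      // placement new

struct Particle { float x, y; };

int main() {
    alignas(Particle) std::byte storage[sizeof(Particle)]; // preallocated, e.g. from an arena
    Particle* p = new (storage) Particle{1.0f, 2.0f};      // construct in place, no heap allocation
    p->~Particle();                                        // destroy manually; nothing to free
}
```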
Pls bring more of these code reviews!
Please make a video on how to exploit cache lines and CPU cache in order to build blazing fast applications
The reason why writing to the console is slow is that Windows assumes a window, so the output goes through the UI interop, while file writing is just bits on disk.
For speeding up console output, you can unsync from stdio:
```
std::ios_base::sync_with_stdio(false); // don't sync C++ streams with C stdio
std::cin.tie(nullptr);                 // don't flush cout before every cin read
```
@Cherno We can do calloc rather than malloc, which will be a contiguous allocation... that can help, but it still can't beat stack memory.
Both calloc and malloc return a contiguous allocation of memory (calloc just zero-initializes it) - there's actually very little difference between how those two work.
Logging on Linux/macOS: yes, their terminals are orders of magnitude faster than on Windows. The reason is that they are implemented totally differently, and the console on Windows is just slow. I read somewhere why it's hard to change. But files are always faster, that's true.
There are more benefits to contiguous data storage; cutting down on TLB misses and VM page misses comes to mind.
Thank You!
Maybe there is time to have a look into OpenMP for loading and shaping allocated memory 🤔
Great code review.
Certain IDEs require you to use hpp vs just h if you are using any C++.
We see it often in code, but in C++ it's not a good idea to start a variable identifier with an underscore. Some combinations of single/double underscore identifiers are reserved for the compiler implementation by the C++ standard. I would avoid it completely.
Excellent video, memory is always an interesting topic!
My one suggestion would be to change the storage of bodies in DynamicsWorld. On line 23 in the source file (seen at 27:45) the whole 'if (!body->IsDynamic()) continue;' means that static bodies are loaded into the L1 cache and then immediately discarded. Splitting the storage into static and dynamic bodies will ease the pressure on both the cache and the branch predictor.
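Something like this (hypothetical member names, just to illustrate the split):
```cpp
#include <vector>

struct Rigidbody { float x, y, vx, vy; };

class DynamicsWorld {
public:
    void Step(float dt) {
        // Only dynamic bodies are integrated, so only their cache lines get
        // pulled into L1 and there is no per-body branch to predict.
        for (Rigidbody& body : m_DynamicBodies) {
            body.x += body.vx * dt;
            body.y += body.vy * dt;
        }
        // m_StaticBodies still take part in collision detection elsewhere.
    }
private:
    std::vector<Rigidbody> m_DynamicBodies;
    std::vector<Rigidbody> m_StaticBodies;
};
```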
You can log into a queue, and then flush the queue on a separate thread
Physics engine: is an engine about physics!! 👍👍👍
With regards to the .hpp header specification over .h, I find it to be very necessary in a lot of projects which are larger where you have a mixture of both c and cpp code (happens way more often than you might think at some companies where you have legacy code).
It does make a huge difference in those cases, because you need to compile those .h files as C code only in some situations and not as C++, especially if they are separate projects in a larger solution base. It just makes it easier to distinguish directly what you are looking at.
I used to be one of the .h default people, and never did understand why someone would use .hpp until I started working on legacy code bases created by other developers in large teams, now it makes sense because organizationally it serves an actual purpose.
I now just use .hpp as default as a result, because I'd rather not go back after the fact and have to specify hey this is actually a cpp header file and you should compile it in your makefile or whatever build system you are using as C++ code specifically and not C code. Just something to consider.
Not just legacy. Many of us use modern C. There are numerous cases I prefer C for.
But you don't compile header files?
@@user-dh8oi2mk4f What I mean is that usually in external build systems you have some method of determining which files are included in which compilation processes, typically by some kind of pattern matching.
You do NOT want .h files that are strictly C to be pulled unnecessarily into C++ compilation units. This can result in all sorts of unexpected behaviors, especially if you have C headers putting things in global scope with simplified names, which is pretty frequent in legacy code.
If I have multiple binaries in a solution, some of which I need to compile as C and some as C++, then I don't want to pattern match against all .h files when building the C++ code in my build steps specifically.
@@fenril6685 But why would you need to figure out which headers are c and c++? The compiler simply pastes the contents of the includes directly into the source file. I don't understand why you need to know which headers are which. Maybe this is helpful if you mix c and c++ in the same directory, but I don't get how it would help with a build system
Is a std::vector with a preallocated size a decent way to implement this kind of memory management? Or do you need to do it manually? I'm a cpp newbie so pls don't roast me :)
Btw, a great video! Looking forward to pt 2
I am curious about this as well
I forgot to note that it will probably not work well with deleting items (I guess that for this we need a more sophisticated method)...
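For the simple case, reserving up front already avoids the repeated grow-and-copy; a minimal sketch (hypothetical Circle type):
```cpp
#include <cstddef>
#include <vector>

struct Circle { float x, y, radius; };

int main() {
    const std::size_t count = 10000;

    std::vector<Circle> circles;
    circles.reserve(count);                  // one allocation up front
    for (std::size_t i = 0; i < count; ++i)
        circles.push_back({0.f, 0.f, 1.f});  // no reallocation, no copies

    // Without reserve(), the vector reallocates and moves its contents
    // every time the capacity is exceeded (roughly log2(count) times).
}
```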
AABB is an axis-aligned bounding box; a sphere is the best of that kind, no rotation needed either.
How many trees did you fell during logging? If it's so slow, check out the guy that made the Windows console log ultra fast, not b-locking.
Also circles and spheres are memory friendly: you only have a center and a radius. An AABB has a center and width/height/depth, or the 6 coordinates of its planes.
K.i.s.s. - actually, simple default functionality done the proper way is essential,
i.e. the programmer does not always have to redo the work.
GPU APIs (not libs) are very burdensome on their programmers in this messy respect.
javidx9 has some quite excellent videos on how you can make games in C++ and programming in embedded systems, which is really nice if you're into that kind of low level programming 😄 Low Level Learning is also a great channel for that kind of knowledge 😄
CppCon and CppNow also great channels for the more advanced. Amazing talks by Michael Caisse and Luke Valenty this year about what can be done with compile time programming and the type system.
I have no idea why you'd want a pointer there when you KNOW which implementation you use. Hell, why does the class hierarchy even exist? Just use a member variable, not a pointer.
One content, two languages. What I have now written may have a perfect mirror in another language. You can create a program that searches for the perfect language mirror. Thanks to this, you will be able to speak two languages and perform tasks in the shade.Endless enigmatic book in all languages. You can write a book with mirrors in all languages of the world. You can speak two languages at once, you just need to find the perfect reflection, same content, different translation. Infinite Mirrors. Pi 3.14 XBooks. Hybrid language. The algorithm flows through our heads, endless coding, just take off the chameleon masks. Connect words without spaces and you will find hidden tasks in all languages. Our conversations collide in the process, some words as well as numbers in words. We perform tasks hidden between words. You can create a Python coding language from a spoken language. You just need to find the mirrors. Two tongues glued together.
I just started programming in C and I wonder a lot about when to use the heap and when to use the stack. Because I am more comfortable using the stack, I predominantly put all data on the stack. Is there an easy rule of thumb for when to use one or the other?
damn this was a nice review
How comfortable would you feel about making a C++ Graphics course for udemy?
If he just uses a clock for the delta time instead of a fixed time-step, that means his physics engine is not deterministic and thus will produce different results every time he runs a simulation.
Yes, I didn't know that at the time, but I'm working on networking at the moment so I realized that mistake. I'll definitely be careful about that if I ever do something like that again haha
Great video, thank you! In modern C++, is heap memory fragmentation a concern for developers, given that the OS uses virtual memory to map to physical memory? My hypothesis is that even if physical RAM is fragmented, but virtual memory is contiguous, the C++ program's performance will not be affected.
Maybe or maybe not. CPUs don't prefetch across page boundaries, probably because of kernel-side page permissions / residency state. The more pages you access, the more TLB slots you use. TLB misses hurt, but maybe not to the level of framerate problems: it's an extra memory access, paid serially. Huge pages require defragmented memory on the kernel side and have a system-wide limit. Running kernel code to change page residency really hurts; it's many instructions, and possibly a disk access.
Leading underscores are reserved identifiers in Microsoft's code (and in many cases by the C++ standard itself). You should never use leading-underscore variables if you expect your code to work on Windows. Prefer trailing underscores if you must.
11:33 The webcam picture quality begins to tank because of the video encoding all the little gaps between so many moving circles. It's interesting to see a non-FPS-related side-effect appear while testing FPS-related benchmarks.
About logging, can we just create a static class and call its function to log something there (through parameters)
like:
Logger.Log(_currentFps);
and in our release build, we just comment out all the statements in that function.
We would still have an overhead of calling that function and passing parameters, but is it okay to do it like this?
It's simpler and more straightforward to set up, sure, but you have to keep commenting and uncommenting every time you want to change the build type, and you have to remember to do that.
His macro way is much better.
I would be quite surprised if your compiler kept the call to an empty function with max optimization
@@user-dh8oi2mk4f fair. Did not think of that
You suggested allocating things like rigid bodies on the stack because of CPU optimizations, but shouldn't the programmer worry about space? Are you banking on the fact that vectors allocate contiguously on the heap? Or should there be a specific buffer of contiguous heap memory created?
With logging, what I do for stuff that gets called all the time is only log failures, so you know what happened with those but don't flood the log.
What color scheme do you use?
Awesome video btw.
0:39 Ceave Gaming
WORKING thx bro
25:45 Does it really work like this? Do you get fragmentation in any perceivable way? I thought that with virtual memory you're not taking any penalty for reading across pages, beyond using more TLB space because you have multiple pages. Is there any gain in having the actual pages be contiguous?
15:20 Is there a difference between an "Entity Component" system and an "Entity Component System" system/architecture? Both can be implemented with a data-oriented memory layout, correct?