This accessing of L1 cache is the reason game programmers are talking about Data Oriented Design (DOD) in the past few years. It is expensive to send an entire C++ object to L1 cache just to update an x and a y coordinate. You cannot put 100,000 ships on screen quickly. With DOD however, all the x's for all the ships are in a single array. Doing it this way, this array gets sent to the cache, and you rip through all 100,000 x values, then you do the same for the y array. Insanely fast.
Game programmers talk about dod/dop because this style of programming reduces software complexity. And the simpler the software, i.e. the less generic it is, the faster it is generally. An array of floats is simpler than a vector of objects. Dereferencing a float pointer is simpler than calling a getter. Loading 4 floats from an array into an xmm register is simpler than gathering those floats from 4 separate objects, etc.
@@jeeperscreeperson8480 it seems like you are mixing the software complexity definition which is mostly used in order to describe the complexity of a source code as perceived by humans reading said code and "simple" code that generates fewer instructions/those instructions are fast(er than non-dod code). DOD code can be much harder to read/write by an average programmer, it's complexity actually increases because a lot of stuff is not hidden by some abstractions anymore + you control and think about much more stuff than simply putting together some logic and abstracting it away via some interface/design pattern/etc.. And then what you call "simple" code is in fact just "fast" and "optimized" code, that is what has already been mentioned in the previous comment.
For new programmeres playing around and eventually wanting to program a game of some sort; you often end up wanting something to represent a world map or game map, etc, and often this is supposed to be tiles; from a simple pacman to Warcraft. When you make that map, you think you need a 2D array instead of a std::map which is slower. That 2D array is easy to implement as a class where you actually use a 1D array, and get each "tile" by the formula Width + (MaxWidth * Height) That way you just implement the array as f.ex char WorldMap[MaxWidth*MaxHeight] This is just a very simple implementation, you can of course do the same with a vector, and a 1D vector is faster than a 2D vector. The compiler will probably turn a 2d array like int WorldMap[100][100] into a 1D array anyway. And just use the same CurrentWidth + (MaxWidth * CurrentHeight) to index the array anyway.
Its very unlikely in a game engine (engine, not game) that you need/can update 1000 entities' positons in the same function. Its usually lots of different things for all those actors. I like more these kinds of optimizations, but the problem with dop is that it makes it quite hard to implement your game, and its somewhat unnatural.
Well organized talk with excellent explanations of the reason causing performance degradation using easy to understand examples.Would be very helpful event for quite experienced developers
I once got rejected at a job interview because I told them exactly that thing you pointed out about the list/trees, that if you must use it at least it can be accelerated with a custom allocator (preallocating nodes) depending on your use case and if cache misses are a bottleneck. Apparently their "senior" programmer guy didn't agree with that one
No1 tip for high performance: MEASURE EVERYTHING. It doesnt matter how strongly you THINK something is better. The technology stack is complex enough to interact in unforseeable ways.
You want both. Especially since some things are hard to debug/measure accurately being able to deterministcally understand SOME part of whats going on will give u an invaluable heuristic that statistics alone can't grant. (Ie, being able to look at assembly code with a datasheet with ur microcontroller will give u veeerry exact performancr characterisitcs of X instruction over Y))
Dude you have no idea how right you are. I have been trying my hardest to benchmark multithreaded code and get accurate results esp comparing it against linear algorithms... Due to compiler optimizations, CPU optimizations, the overall difficulty of Benchmarking MT code in the first place this entire ordeal has become a nightmare and no matter what results I get and how close they are to what I believe they should be there is still that little voice in my head telling me I just can not trust the output of the Measurements since there is just so damn much going on under the hood idk what to believe about it all.
I wasted so many hours of my life optimising based on theory only for the result to end up being slower… definitely bench mark and do it often if you care for the results
I am just a mechanical engineer doing vibration and signal analysis, but this presentation was very informative, even for me barely starting out in C++ a few days ago.
Quite dry stuff, and not really what I need for my daily work. But gave me a huge amount of new knowledge. Very well presented, really good examples, and in general a clear and entertaining talk. I have really enjoyed watching it!
I know this talk is quite a few years old and it has some solid points to consider but I did want to comment on a few things. One of them is that he compares audio software engineering to game development saying that while game devs worry about code having to be performant at 60 fps, on the other hand audio software development has to worry about 44k sps (samples per second) and this is not a fair comparison. Yes, a game loop may need to run at 60 fps, but within the context of a single frame, many computations are being done on the CPU, and likely way more than a few hundred thousand data-points. The second thing: At one point he says that in general you want to structure your Array of structs next to each other for locality which is true in general but if you utilize the Struct of Arrays (SOA) paradigm, which I don't think he mentions at all you may likely get even better performance out of your code due to the memory being more tightly compacted and also due to not having to fetch certain objects for computations that you don't need.
Good talk. People need to remember that performance in general purpose computing will have its limits, where Fpgas would take could take over. There is only so much performance that can be squeezed out of general purpose computing notwithstanding the versatility and quick development costs of that paradigm. The talk was very good.
At 41:00 He stated no branch predictor can work with Random Numbers but I was wondering if they can detect patterns and the BP is capable of looking ahead why is it not possible for it to generate the Random number in advance and make predictions using that pregenerated Random number
not an expert, but random is usally directly affected by the CPU clock as the source of entropy, so picking a number in advance is not the same number as picking it normally. Also picking extra random numbers and then throwing them away cause of any invalid branches would have an effect, some would be missing, and the following set of random numbers would not match the the original pattern of distribution.
Depending on the CPU architecture, branch prediction can look ahead and precalculate based on different possibilities only up to a certain point before running out of cache. IF it invests entirely on a certain possibility, if that guess is wrong there is a huge penalty since it has to start calculating the branch all over again (from the branching point in code ex. if statement). This limits how far the branch predictor can look ahead when dealing with random numbers (as the number of possibility quickly inflate) in for example a nested if structure.
Summary of what is most surprising: 1. cache associativity hole on round values 2. some modern cpus are good at unaligned access (I was sure we are moving away from that instead?) 3. false sharing
I think there's a mistake at 27:00. 8-way associative cache means there must be 8 blocks(cache lines) per set. So If we have 128 KB 8-way cache, 64 bytes per Block, this means 128 * 1024 / (64 * 8) = 256 sets, 512 bytes per set, 8 blocks per set. Not 8 sets 16KB each like you show in the presentation
52:08 : here, I see a different reason than the one mentionned. One the left code, there is 4 different values accessed ( a[i] b[i] b[i+1] and c[i] ) while on the right version code, only 3 values are accessed ( b[i+1] c[i] and a[i+1] ) so is it really about vectorization ? I don't understand this part can someone explain please ?
The same thing stood out to me initially, but I'm pretty confident it's insignificant here. b[i] and b[i+1] are extremely likely to be on the same cache line.
To vectorise the loop (at 52:00) why not just do this: for (int i = 1; i < 998; ++i) a[i] += b[i] for (int i = 1; i < 998; ++i) b[i+1] += c[i] Not only does this get rid of dependencies, but it results in requiring 2 positions in the cache (for each of the arrays) rather than 3, so better overall practice to reduce crowding. Or does this actually screw up the SIMD vectorisation? I don't know much about it.
I have a question, when Timur Doumler demonstrates random accessing 2-d array is almost 100x slower than row major access, is the rand() being considered. I mean, it is reasonable that random access is slow, but I am not sure how much rand() call affect the performance.
Ya I don't really know why he made that statement. Kind of needlessly provocative IMHO. Its even more skewed in games than you indicate because in games you are not just sending colors to the screen but also dealing with physics systems, multiplayer, tons of custom tools, fast compile times, and a general complexity of game play that is only limited by a designers creativity, etc. Otherwise good talk though.
It's harder to write a program that does some computation in 1/44100 seconds than 44100 such computations in 1 second. It's about bandwidth vs latency. His point is that not only do you need to make computations as efficient as possible, but they also have to have low latency. You don't worry so much about latency in programming a game renderer, and the GPU itself is not optimized for latency, unlike the CPU. Of course he is not an idiot who thinks that the computation he's doing in audio is 200 times faster than game engines. Game engines are written in C++ and C, not Python.
I too think processing 60 frames (or even 250 with modern monitors these days) in a video game, is way way way more expensive than sampling 44kHz audio...
The CPU (which you use to process audio) usually has 8 strong cores while modern GPUs have 3000+ "stupid" cores. It's not the same type of sample and you can't really compare that.
The example at 56:20 is interesting. I didn't quite understand the alignment issues with the offset. Is it because of different offset for both buffers? If I use same offset, say `+2`, for both, would it still have the alignment issue?
The problem is that SIMD instructions only work for aligned data. If both have the same offsets the compiler can special case, the first 2, and then take 4 aligned each.
33 mins 40 secs, I found this out myself trying to optimise a virtual machine. I thought aligning instructions and arguments on 4 or 8 byte boundaries would help. In the end I saw I was only getting a 1% increase in performance.
Actually, frame rendering has to provide much more samples than 24-60. You have to render screenwidth*screenheight samples. So for a 1080p resolution that is about 2*10e6 samples.
Рік тому+3
Maybe, but real-time sound programming is a little different. In a game, for example, you can render the same frame twice and nothing bad will happen. In sound, the membrane of a speaker is a physical object with kinetic energy, and if you send the same signal 2 samples in a row, that means the magnet should apply infinite braking power to stop it for a moment. It's like pulling the emergency brake on a locomotive, the listener will hear an audible click.
So, the correct thing to do is to add that rand()%n calculation into the other two loops as well? Or maybe precompute these random indexes?
3 роки тому
Of course replacing %m with &(m-1) is much faster if m is power of 2 and if m-1 comes from const variable) but he might have kept it for readability instead.
if `rand()` is inside the for loop, then `random loop on 2-d array` would be slow because of cost of `rand()` as well.. [ of course, random 2 d access will be slow.. but should not be that slow compared to column wise access ]
I'm angry too, still wondering if a floppy disk is faster then internal RAM...haha xD But I think it's common knowledge at this point, if you are going in such detail with coding performance then this should be common knowledge. If you are watching a video where a guy explaines that getting data from L1 or L2 cache is time critical you better already know that ROM Memory is slow as fuck in that dimension.
That's a rabbit hole, he can only cover the basics given the time. HashMaps -> Self Balancing Trees -> BTrees -> Esoteric Data Structures to model databases (improve disk read performance when searching) take up an entire lecture on their own.
Really great talk, tho next time maybe try to reduce the number of sub-topics or ask for more time to present them since it looked like you tried to rush in the second half to get everything out.
37:24 It's incredibly shocking how, in the relatively short space of around 30 years, the vast majority of today's programmers - even C++ programmers - have bypassed so much critical expertise in understanding assembly language. I mean it's truly...almost criminally shocking, especially given that... 1) Assembly language is at the absolute core of C++ debugging and, as demonstrated in this video 2) Different compilers can frequently produce different machine code, running at different speeds Higher level languages may well have a more valid reason, but C/C++ programmers have - literally - no excuse.
I suspect in a case with only a few columns, column major might even be faster, because it could use a cache line for each column thus using more of the cache. If you don't line them up so they step on each other's cache slots of course. Most of this is covered in that computer architecture course of a CS bachelors.
In the example at 20:44, wouldn't the rand() be very expensive, since it (at least on linux) uses a system call, with the context switch and everything?
Can someone explain why at 56:14 , calling multiplyAdd with the array pointers offset cancels vectorisation? What data is fetched into the SSE register and how? Thank you.
simd expects the data that will be loaded to be aligned a certain way. By offsetting slightly the compiler could either 1. use unaligned simd load instructions if the cpu supports them or 2. fall back to a non vectorized loop. On older x86 cpus unaligned loads were siginificantly slower but nowerdays it's not so bad. So option 1 may be significantly slower and 2 (almost) certainly will be.
im not sure knowing hardware helps us tremendously, we are bound by data access times whatever we do, and if its significant the impact it means hardware is slow, until we actually design hardware its not our problem.
It’s really only “our problem” when high performance is a business requirement. The people who really stress out about this stuff typically work in gaming, high-frequency trading, low-level networking, etc. The other thing to point out is that we are hitting the physical limits of processor speed and transmission of data between caches. It’s no longer possible for us to just “wait until hardware gets faster.” Either you figure out how to optimize your code and parallelize what you can on multiple threads, or you just eat dirt until some unforeseeable scientific breakthrough allows us to make faster hardware. Those are your options.
You can't make one thing fit all purposes so it is “our” problem. You can't bend physics and force electricity to run faster, but you can bend your approach to data and code and make it easier for the computer to do stuff. If someone managed to send Voyager to space in 1977 and it still operates to this day - we definitely can make a videogame/social network app/whatever to run on a hardware that's a zillion times faster than what people from NASA had 40+ years ago.
The comparison at the beginning is kind of lame... You don't render just 60 frames per second, those frames often consist of 2 million or more samples. In comparison 44000 samples is nearly nothing, even slowest of CPUs can emulate that in real time nowadays.
It applies to both C and C++. To some extent other languages too depending on how they handle things internally. And this is something that has been present for years already. For as long as you keep your data aligned and your array accesses linear you should be good to go. But it's good to know about possible pitfalls in case you need to squeeze some performance out of some of your programs.
The material covered is decent but the way things are explained is quite bad ! It will be safe to say that the people who understand this talk would already have a decent idea of the content.
I was gonna disagree with you but when I watched until the end, I now agree. I think the purpose of the talk is more like raising some questions and curiosity into actual hardware that your c++ code running.
I actually thought this talk was really good. It makes you remember a that if you want to do real proper software development of high performance applications, you MUST know about computer architecture. And this guy Timur seems to really know his stuff. For some reason, SIMD and VLIW and all that fancy stuff that processors do nowadays are quite ignored in the community. But the bottom line is, yeah this talk is not for explaining to someone who is learning. The people who attend this talk are industry experts, who probably studied all of this things but don't have the years of hand on experience in the particular topics they are listening in the talks.
Good content but I feel very distracted with hammmm aaaaaa before each sentence. He really needs to learn how to speak in public. His speech is so stressing so I can't understand a lot of sentences. Aaaam.
"You found books from computer science", yeah the fact he didn't know about and also didn't say "computer architecture" suggest he never attempted to learn a bit of formal computer science. "C++ books never talk about this stuff" LOL are you kidding me? *Skipping thru* Oh wow branch prediction, data alignment, array access i.e. cache locality, false sharing I never heard of it... NOT This is the nth average talk with stuff already heard dozen times.
This accessing of L1 cache is the reason game programmers are talking about Data Oriented Design (DOD) in the past few years. It is expensive to send an entire C++ object to L1 cache just to update an x and a y coordinate. You cannot put 100,000 ships on screen quickly. With DOD however, all the x's for all the ships are in a single array. Doing it this way, this array gets sent to the cache, and you rip through all 100,000 x values, then you do the same for the y array. Insanely fast.
Game programmers talk about dod/dop because this style of programming reduces software complexity. And the simpler the software, i.e. the less generic it is, the faster it is generally. An array of floats is simpler than a vector of objects. Dereferencing a float pointer is simpler than calling a getter. Loading 4 floats from an array into an xmm register is simpler than gathering those floats from 4 separate objects, etc.
@@jeeperscreeperson8480 it seems like you are mixing the software complexity definition which is mostly used in order to describe the complexity of a source code as perceived by humans reading said code and "simple" code that generates fewer instructions/those instructions are fast(er than non-dod code). DOD code can be much harder to read/write by an average programmer, it's complexity actually increases because a lot of stuff is not hidden by some abstractions anymore + you control and think about much more stuff than simply putting together some logic and abstracting it away via some interface/design pattern/etc.. And then what you call "simple" code is in fact just "fast" and "optimized" code, that is what has already been mentioned in the previous comment.
For new programmeres playing around and eventually wanting to program a game of some sort; you often end up wanting something to represent a world map or game map, etc, and often this is supposed to be tiles; from a simple pacman to Warcraft.
When you make that map, you think you need a 2D array instead of a std::map which is slower.
That 2D array is easy to implement as a class where you actually use a 1D array, and get each "tile" by the formula
Width + (MaxWidth * Height)
That way you just implement the array as f.ex
char WorldMap[MaxWidth*MaxHeight]
This is just a very simple implementation, you can of course do the same with a vector, and a 1D vector is faster than a 2D vector.
The compiler will probably turn a 2d array like
int WorldMap[100][100] into a 1D array anyway. And just use the same
CurrentWidth + (MaxWidth * CurrentHeight)
to index the array anyway.
Its very unlikely in a game engine (engine, not game) that you need/can update 1000 entities' positons in the same function.
Its usually lots of different things for all those actors.
I like more these kinds of optimizations, but the problem with dop is that it makes it quite hard to implement your game, and its somewhat unnatural.
@@digitalconsciousness ECS mentioned!!
Well organized talk with excellent explanations of the reason causing performance degradation using easy to understand examples.Would be very helpful event for quite experienced developers
I once got rejected at a job interview because I told them exactly that thing you pointed out about the list/trees, that if you must use it at least it can be accelerated with a custom allocator (preallocating nodes) depending on your use case and if cache misses are a bottleneck. Apparently their "senior" programmer guy didn't agree with that one
Anyone who listens to the people next to them without doing research are dumb.
lol
What company was that?
Ya when you talk like that to your everyday devs they think you are from a different universe or making stuff up lol
Thank you very much. I really like your choice of presenting very intuitive examples.
No1 tip for high performance: MEASURE EVERYTHING.
It doesnt matter how strongly you THINK something is better. The technology stack is complex enough to interact in unforseeable ways.
You want both. Especially since some things are hard to debug/measure accurately being able to deterministcally understand SOME part of whats going on will give u an invaluable heuristic that statistics alone can't grant. (Ie, being able to look at assembly code with a datasheet with ur microcontroller will give u veeerry exact performancr characterisitcs of X instruction over Y))
In addition, its STILL importsnt to understand WHYYY x or y is slower than ghe other so you can work around it and or get hardware vendors to improve
Dude you have no idea how right you are. I have been trying my hardest to benchmark multithreaded code and get accurate results esp comparing it against linear algorithms... Due to compiler optimizations, CPU optimizations, the overall difficulty of Benchmarking MT code in the first place this entire ordeal has become a nightmare and no matter what results I get and how close they are to what I believe they should be there is still that little voice in my head telling me I just can not trust the output of the Measurements since there is just so damn much going on under the hood idk what to believe about it all.
You NEED to have significant expertise in assembly language. No point in having that debugger if you don't FULLY understand that core component...!!!
I wasted so many hours of my life optimising based on theory only for the result to end up being slower… definitely bench mark and do it often if you care for the results
I am just a mechanical engineer doing vibration and signal analysis, but this presentation was very informative, even for me barely starting out in C++ a few days ago.
Quite dry stuff, and not really what I need for my daily work. But gave me a huge amount of new knowledge. Very well presented, really good examples, and in general a clear and entertaining talk. I have really enjoyed watching it!
Glad it was helpful!
These conventions NEVER fail to provide useful knowledge and springboard ideas to help insoire newer developments and technologies!
Excelent presentation, as usual for Timur Doumler! Concise, comprehensive, emotional, fascinating . Recommend!❤
He knows his stuff.
I know this talk is quite a few years old and it has some solid points to consider but I did want to comment on a few things. One of them is that he compares audio software engineering to game development saying that while game devs worry about code having to be performant at 60 fps, on the other hand audio software development has to worry about 44k sps (samples per second) and this is not a fair comparison. Yes, a game loop may need to run at 60 fps, but within the context of a single frame, many computations are being done on the CPU, and likely way more than a few hundred thousand data-points.
The second thing: At one point he says that in general you want to structure your Array of structs next to each other for locality which is true in general but if you utilize the Struct of Arrays (SOA) paradigm, which I don't think he mentions at all you may likely get even better performance out of your code due to the memory being more tightly compacted and also due to not having to fetch certain objects for computations that you don't need.
This is exactly what I was looking for. Great work! Thanks for the info!
Glad it was helpful!
Very intuitive example!Down to earth knowledge!
Thank you, some stuff I knew, but never heard of denormals. Would have to check them out :)
15:30 hash for integers is usually the identity function which is a no-op.
Excellent talk, full of great information and tips!
Glad it was helpful!
What a great talk, thanks!
Good talk. People need to remember that performance in general purpose computing will have its limits, where Fpgas would take could take over. There is only so much performance that can be squeezed out of general purpose computing notwithstanding the versatility and quick development costs of that paradigm. The talk was very good.
This is super informative, thank you for the effort Timur.
Good presentation, knew about all of it except the FPU stuff, but got way more useful details than I had.
At 41:00 He stated no branch predictor can work with Random Numbers but I was wondering if they can detect patterns and the BP is capable of looking ahead why is it not possible for it to generate the Random number in advance and make predictions using that pregenerated Random number
not an expert, but random is usally directly affected by the CPU clock as the source of entropy, so picking a number in advance is not the same number as picking it normally. Also picking extra random numbers and then throwing them away cause of any invalid branches would have an effect, some would be missing, and the following set of random numbers would not match the the original pattern of distribution.
Depending on the CPU architecture, branch prediction can look ahead and precalculate based on different possibilities only up to a certain point before running out of cache. IF it invests entirely on a certain possibility, if that guess is wrong there is a huge penalty since it has to start calculating the branch all over again (from the branching point in code ex. if statement). This limits how far the branch predictor can look ahead when dealing with random numbers (as the number of possibility quickly inflate) in for example a nested if structure.
Summary of what is most surprising:
1. cache associativity hole on round values
2. some modern cpus are good at unaligned access (I was sure we are moving away from that instead?)
3. false sharing
I think there's a mistake at 27:00. 8-way associative cache means there must be 8 blocks(cache lines) per set. So If we have 128 KB 8-way cache, 64 bytes per Block, this means 128 * 1024 / (64 * 8) = 256 sets, 512 bytes per set, 8 blocks per set. Not 8 sets 16KB each like you show in the presentation
Hello! Great content.
Is there a link for a video mentioned at 19:31 ?
Excellent talk!
Nailed, great examples.
47:27 what's the point in wrapping a b c d in atomic if each thread works on its own variable ?
Just curious, Isnt this why data-oriented design is preferred? Seems like this gives a good example for Data oriented Design. Am I wrong??
Nice explanation with code samples and supporting data analysis.
What is the talk he refers to at 49:45?
52:08 : here, I see a different reason than the one mentionned. One the left code, there is 4 different values accessed ( a[i] b[i] b[i+1] and c[i] ) while on the right version code, only 3 values are accessed ( b[i+1] c[i] and a[i+1] ) so is it really about vectorization ? I don't understand this part can someone explain please ?
The same thing stood out to me initially, but I'm pretty confident it's insignificant here. b[i] and b[i+1] are extremely likely to be on the same cache line.
This was really good!!!
To vectorise the loop (at 52:00) why not just do this:
for (int i = 1; i < 998; ++i)
a[i] += b[i]
for (int i = 1; i < 998; ++i)
b[i+1] += c[i]
Not only does this get rid of dependencies, but it results in requiring 2 positions in the cache (for each of the arrays) rather than 3, so better overall practice to reduce crowding. Or does this actually screw up the SIMD vectorisation? I don't know much about it.
Make certain you confirm - and fully understand - what the compiler is doing by critically examining the assembly version of your routine.
This could take double the time. A CPU can sometimes fit two instructions in one cycle in the same core. Therefore it is better to do it in one loop.
Great stuff, at least for someone without degree in SC, thanks!
At 53:20, is there a typo on left side code at line .. a[i] +=b[i+1] instead of a[i] +=b[i]?
I have a question, when Timur Doumler demonstrates random accessing 2-d array is almost 100x slower than row major access, is the rand() being considered. I mean, it is reasonable that random access is slow, but I am not sure how much rand() call affect the performance.
Not considering any oversampling in either domain, isn't it more like ?
gfx : 1920*1080*30
sfx : 44100 * 2
samples per second ?
Why 30?
Ya I don't really know why he made that statement. Kind of needlessly provocative IMHO. Its even more skewed in games than you indicate because in games you are not just sending colors to the screen but also dealing with physics systems, multiplayer, tons of custom tools, fast compile times, and a general complexity of game play that is only limited by a designers creativity, etc. Otherwise good talk though.
It's harder to write a program that does some computation in 1/44100 seconds than 44100 such computations in 1 second. It's about bandwidth vs latency. His point is that not only do you need to make computations as efficient as possible, but they also have to have low latency. You don't worry so much about latency in programming a game renderer, and the GPU itself is not optimized for latency, unlike the CPU.
Of course he is not an idiot who thinks that the computation he's doing in audio is 200 times faster than game engines. Game engines are written in C++ and C, not Python.
I too think processing 60 frames (or even 250 with modern monitors these days) in a video game, is way way way more expensive than sampling 44kHz audio...
The CPU (which you use to process audio) usually has 8 strong cores while modern GPUs have 3000+ "stupid" cores. It's not the same type of sample and you can't really compare that.
Great talk! Thank you! :)
Thanks a lot! Nice explanation!
The example at 56:20 is interesting. I didn't quite understand the alignment issues with the offset. Is it because of different offset for both buffers? If I use same offset, say `+2`, for both, would it still have the alignment issue?
Watched this talk again a year later. I still have the same doubt. Haha.
The problem is that SIMD instructions only work for aligned data. If both have the same offsets the compiler can special case, the first 2, and then take 4 aligned each.
33 mins 40 secs, I found this out myself trying to optimise a virtual machine. I thought aligning instructions and arguments on 4 or 8 byte boundaries would help. In the end I saw I was only getting a 1% increase in performance.
Good explanation
Actually, frame rendering has to provide much more samples than 24-60. You have to render screenwidth*screenheight samples. So for a 1080p resolution that is about 2*10e6 samples.
Maybe, but real-time sound programming is a little different. In a game, for example, you can render the same frame twice and nothing bad will happen. In sound, the membrane of a speaker is a physical object with kinetic energy, and if you send the same signal 2 samples in a row, that means the magnet should apply infinite braking power to stop it for a moment. It's like pulling the emergency brake on a locomotive, the listener will hear an audible click.
Can someone explain to me 54:19 Example please ? I have a hard time understanding it .
Something to with: SIMD vectorisation
16:03, not true. VTune shows you "these things".
🤔How will you find size of cache line [before using alignas(cachelineSize) ]?
What are we supposed to do about problems that specifically require random traversal of arrays?
you wait ;)
I hope you’re not benchmarking memory random access speed with rand function modulus operator on every access at 20:44 🤦♂️
Lol yeah i was thinking that
So, the correct thing to do is to add that rand()%n calculation into the other two loops as well? Or maybe precompute these random indexes?
Of course replacing %m with &(m-1) is much faster if m is power of 2 and if m-1 comes from const variable) but he might have kept it for readability instead.
Yeah. Easily 100 unnecessary cycles just from these two.
Very interesting
I loved this talk. Never knew that a simple contiguous set of variables can be SO MUCH faster than randomly situated ones.
Awesome Thanks...
if `rand()` is inside the for loop, then `random loop on 2-d array` would be slow because of cost of `rand()` as well..
[ of course, random 2 d access will be slow.. but should not be that slow compared to column wise access ]
woudn't it be much better if you generated the random numbers before,
talking about 21:40
You should've included the SSD / harddisk in your memory speed / latency discussion.
what for? If you interacting with hard disk you know it will be slow
I'm angry too, still wondering if a floppy disk is faster then internal RAM...haha xD
But I think it's common knowledge at this point, if you are going in such detail with coding performance then this should be common knowledge.
If you are watching a video where a guy explaines that getting data from L1 or L2 cache is time critical you better already know that ROM Memory is slow as fuck in that dimension.
That's a rabbit hole, he can only cover the basics given the time. HashMaps -> Self Balancing Trees -> BTrees -> Esoteric Data Structures to model databases (improve disk read performance when searching) take up an entire lecture on their own.
Really great talk, tho next time maybe try to reduce the number of sub-topics or ask for more time to present them since it looked like you tried to rush in the second half to get everything out.
37:24 It's incredibly shocking how, in the relatively short space of around 30 years, the vast majority of today's programmers - even C++ programmers - have bypassed so much critical expertise in understanding assembly language. I mean it's truly...almost criminally shocking, especially given that...
1) Assembly language is at the absolute core of C++ debugging and, as demonstrated in this video
2) Different compilers can frequently produce different machine code, running at different speeds
Higher level languages may well have a more valid reason, but C/C++ programmers have - literally - no excuse.
I suspect in a case with only a few columns, column major might even be faster, because it could use a cache line for each column thus using more of the cache. If you don't line them up so they step on each other's cache slots of course. Most of this is covered in that computer architecture course of a CS bachelors.
In the example at 20:44, wouldn't the rand() be very expensive, since it (at least on linux) uses a system call, with the context switch and everything?
It is not a syscall, its a lcg in glibc
Perfect talk except for the ums
Can someone explain why at 56:14 , calling multiplyAdd with the array pointers offset cancels vectorisation? What data is fetched into the SSE register and how? Thank you.
simd expects the data that will be loaded to be aligned a certain way. By offsetting slightly the compiler could either 1. use unaligned simd load instructions if the cpu supports them or 2. fall back to a non vectorized loop. On older x86 cpus unaligned loads were siginificantly slower but nowerdays it's not so bad. So option 1 may be significantly slower and 2 (almost) certainly will be.
@@CBaggers Thank you!!
“want fast python? well tough shit”
im not sure knowing hardware helps us tremendously, we are bound by data access times whatever we do, and if its significant the impact it means hardware is slow, until we actually design hardware its not our problem.
It’s really only “our problem” when high performance is a business requirement. The people who really stress out about this stuff typically work in gaming, high-frequency trading, low-level networking, etc.
The other thing to point out is that we are hitting the physical limits of processor speed and transmission of data between caches. It’s no longer possible for us to just “wait until hardware gets faster.” Either you figure out how to optimize your code and parallelize what you can on multiple threads, or you just eat dirt until some unforeseeable scientific breakthrough allows us to make faster hardware. Those are your options.
Dude, do you think slack running slow on ryzens is because the hardware is slow?
You can't make one thing fit all purposes so it is “our” problem.
You can't bend physics and force electricity to run faster, but you can bend your approach to data and code and make it easier for the computer to do stuff. If someone managed to send Voyager to space in 1977 and it still operates to this day - we definitely can make a videogame/social network app/whatever to run on a hardware that's a zillion times faster than what people from NASA had 40+ years ago.
Cache lines, eh. Thanks!
The comparison at the beginning is kind of lame... You don't render just 60 frames per second, those frames often consist of 2 million or more samples. In comparison 44000 samples is nearly nothing, even slowest of CPUs can emulate that in real time nowadays.
I wish it was more obvious somehow whether or not these videos are about basic topics. I keep getting clickbaited in and wasting time.
Well hard because some are really bad programers.
so we have another thing to worry about while coding in c++, i sometimes believe it is getting too over complicated!
It applies to both C and C++. To some extent other languages too depending on how they handle things internally. And this is something that has been present for years already. For as long as you keep your data aligned and your array accesses linear you should be good to go. But it's good to know about possible pitfalls in case you need to squeeze some performance out of some of your programs.
Can you offer me a job?!
Wp Avada nobody said it was gonna be easy 😀 just remember that C++ let's you work efficiently, but other higher level languages can't provide that
The material covered is decent but the way things are explained is quite bad ! It will be safe to say that the people who understand this talk would already have a decent idea of the content.
I was gonna disagree with you but when I watched until the end, I now agree. I think the purpose of the talk is more like raising some questions and curiosity into actual hardware that your c++ code running.
I actually thought this talk was really good. It makes you remember a that if you want to do real proper software development of high performance applications, you MUST know about computer architecture. And this guy Timur seems to really know his stuff. For some reason, SIMD and VLIW and all that fancy stuff that processors do nowadays are quite ignored in the community. But the bottom line is, yeah this talk is not for explaining to someone who is learning. The people who attend this talk are industry experts, who probably studied all of this things but don't have the years of hand on experience in the particular topics they are listening in the talks.
hmm counter is exploding. Speaker should take a Toastmaster course, that would make his presentation clearer and more engaging.
Good content but I feel very distracted with hammmm aaaaaa before each sentence. He really needs to learn how to speak in public. His speech is so stressing so I can't understand a lot of sentences. Aaaam.
"You found books from computer science", yeah the fact he didn't know about and also didn't say "computer architecture" suggest he never attempted to learn a bit of formal computer science.
"C++ books never talk about this stuff" LOL are you kidding me?
*Skipping thru* Oh wow branch prediction, data alignment, array access i.e. cache locality, false sharing I never heard of it... NOT
This is the nth average talk with stuff already heard dozen times.
Did all of this in my high performance class at my university
Well some of us dont know and now we do.