I suggest you look at `sudo perf top -e branch-misses` - it will just tell you exactly where there are too many mispredicts. Hit enter on the top thing and drill down to what function it is. Build with debug info.
With x86 the biggest effect likely/unlikely has is to rearrange the instructions such that the unlikely branch is moved to a different area of the program. This makes the likely path a serial instruction stream, which is good for instrction cache. Also good for branch prediction in the case there is no branch prediction entry in the table. E.g. the first time thru
Having looked at the comments before watching the entire talk, I was a little worried the talk would end before the speaker got to close the talk. So I'll mention here that the cut off happened during the question section at the end, after the last slide. Still somewhat abrupt!
I watched this talk sometime back around when it was first uploaded, and I am almost certain I missed that joke at the beginning: "it's a talk on performance; the closer you sit, the better the performance." I am so glad YT recommended it to me again.
Awesome talk! It's rare that I would ever watch an hour-long video on UA-cam, but by the time I got to the end of this one I wanted to go back to and watch it again.
1:03:30 He was ready for some more questions and then it ended suddenly with a well deserved applause. Its like games where a cut-scene would end a fight with infinitely spawning enemies. "Well"
Impressively, the talk ends in a Branch Prediction failure. Fedors reaction is amazing : ) From now on, that's exactly what I'll imagine the CPU doing.
Excellent talk about branch predictor and the ways to take advantage of it. When I first saw the title though, for some reason I initially thought that the topic is about functional programming.
If you want to make it as hard as possible for old processors, do taken, taken, not taken, not taken, repeatedly. Every branch will mispredict. They had a two bit counter that incremented if taken, decremented if not taken, and it predicted taken if counter for this branch >= 2. Once it has been saturated one way, it has to mispredict twice in a row to start predicting the other way. Modern processors are not so dumb, they have a bunch of histories, and it selects what history based on whether the last few branches were taken or not (to get correlation).
Best I can tell it may have been the Intel Pentium 4 Willamette, also called NetBurst. I can't find anything online that says what the speaker said -- namely reported the opposite of what it meant to -- but apparently it had pretty bad branch prediction. (I apologize in advanced if I've mixed up some terms. CPU architecture isn't my strong suit!)
Excellent talk. Very sad to see it cut short but it is what it is. I have the book but haven't started yet. I would actually like to know/test if this can be done in GPUs too.
Indeed, a lot of GPU-based implementations of Math functions (such as abs) are intrinsecally branch-free as the divergence between threads in a GPU tend to degrade performance.
No need pretty much? All shader instances within an execution unit will cause all branches for which the prerequisites are fulfilled in any of the instances to be taken (effectively by all these instances), so while at source level you branch, the underlying implementation is effectively branch-free by hardware and system design. Furthermore the cost of branching, when it rarely does happen (aggregated), is pretty much zero, as during the resulting latency window, other work is performed, another context is ready to go and switch in by then. GPUs are INSANELY good at latency hiding and maintaining high nominal utilisation. Branching at source level is actually recommended, for the case that none of the instances in an execution group take a given branch, then the whole execution group can skip it, else the whole group effectively takes it. In the end you don't really have much control over it and often times you have to even pay the associated extra cost of branch-free execution (those other branches being executed as well) in spite of nominally branching. It all even goes back to the very first shader model which had no branching provision at all, at machine language level there were only selective mov instructions, no jump instructions; while shader compiler allowed and recognised branches at source level and converted them correspondingly. Loops had to have a trivially provable max iteration count and would be fully unrolled by the compiler, naturally.
Nice to see a talk like this when there are so many people scoffing at branch free, because they saw one example where a perfectly predictable branch skipped a few ops and ended up slightly faster.
whats even better is a negation of either 1 or 0 with a bitwise AND (-1 is all one-bits). 2 clock cycle latency (NEG; AND) vs 3-4 (multiply): s += -b & x;
In PowerPC, the branch conditional instruction has the prediction baked into it. I've seen, for example, MetroWerks compiler output where its static predictions were very primitive (99% of branches forward were unlikely, 99% of branches backward were likely). I've yet to use the C++20 attributes, but [[likely]] and [[unlikely]] probably give you manual control of the prediction bit for those branches, which is neat. Debug assertions and nullptr checks were totally the first thing I thought of when I learned about these attributes. Even if the compiler is smart enough to recognize a nullptr check and mark it as unlikely (I'm sure it is), it is nice to be able to self-document it with an attribute.
Yes the compiler will always assign a lower probability to a null pointer check indeed. And no, never use likely or unlikely without profiling and looking at the assembly, the C++ standard is ass and says that likely/unlikely means "arbitrarily likely" and what that means in clang and gcc is 80%/20% respectively. It turns out that because of things such as null pointer checks and early returns the compiler can assign to a branch a probability lower than 20% but because unlikely means always 20% marking it as unlikely will actually make it more likely. Again, the C++ standard is complete dogshit and it should have a feature to tell the compiler exactly how likely it is ([[likely 5%]] or something like that), but it doesn't and thus unlikely may jot do what you expect it to do.
The part about the compiler built in likely/unlikely hints is completely off. it is just meta information for the resulting ORDERING OF INSTRUCTIONS in the machine code. This is due to the programmer wanting to avoid (instruction) cache misses, since the branch marked as more likely will immediately follow the conditional check. Furthermore, the unlikely branches may be put at the very very end of a particular procedure (function) or loop with an unconditional jump back to where the branch ends. It may even prevent function inlining in order to improve performance that way (yes, sometimes forcing not to inline improves performance).
Another talk related to this speculative type branching topic was one given by Chandler Carruth here: ua-cam.com/video/2EWejmkKlxs/v-deo.html I'd recommend giving it a watch first if some of the foundational material at the beginning of this talk seemed hard to understand. It's a similar talk to this one except with a focus on loops rather than if-else branches.
Great talk! I would have liked to ask if the array-of-two-bool-index micro-optimization for removing a branch is generally preferable to what I would have done more naively, namely adding or or'ing the two values after multiplying them with 1 and 0, gotten by treating the bool as an int.
Serious question: if we repeat all the time ‘measure before changing’ and that compilers and processors may do better work then why do we think, that after we changed code once, it stays the fastest code? We made code less readable, we removed branch and added more work. Then what if new processor came or new optimization in a compiler etc and original code would be better?
You cannot strictly rely on the code being perpetually faster indeed, but what choice do you have? You can't optimise for the unknown, and pessimising software in the hope that an architecture comes along someday that magically makes your slow code fast obviously doesn't help either. One somewhat warranted assumption you can make is that processors will be optimised towards current, existing code; that they will be similar to today's processors just better and faster, that they will implement similar techniques as today but make them more robust. That processors when they get released and benchmarked by the press, will have to compete on old code, which will be a mix of code profile-optimised for last-gen processors and canonical style code. Another related assumption is that even if the processors open up a different way to do things, old way will not be slower in absolute terms, just in relative terms. So say if your software has minimum specification being an AMD Bulldozer, and you optimise towards that, you have eliminated the performance weakness on the weakest hardware that you target, and that's enough, you don't actually care that the software fails to perform much quicker on a nominally faster processor, you have created a software that is palatable for the entire userbase. I have optimised towards Pentium 4 in the past (well over 10 years ago) knowing that it's the chip family most likely to struggle among all processors people are likely to use, and it didn't make nearly as much of a difference either way on anything like an Athlon XP or a Pentium M or a Core. Obviously neither of these assumptions is absolutely true but they are useful as guiding principles in absence of better information. Of course if you have the freedom - and there exists code which actually does that, though it's exceptionally rare - you can have several different implementations in your software, and profile them when you detect system configuration change or on startup. Of course it has its own hazards as you introduce an indirection at entry into that code, so better be sure it's worth it. Less risky is having the benchmark of optimised vs. canonical code automated in your test suite outside the shipped software. I do think performance benchmark should be part of your CI procedure, because if there are performance regressions - usually an undesired side effect of development such as feature work or bugfixing rather than any sort of platform changes - you'll know about them. These performance regressions are a day to day reality rather than a hypothetical differently behaving platform to come along someday, and your software will generally gradually turn into a disaster if you ignore them.
That's why you don't just measure once, change, and then leave it. You continually measure. If you upgrade your compiler and there's a 5% performance drop, you measure, find the (potentially new) bottleneck, and change. Don't forget that also 99% of the time you shouldn't need to think about this in the first place.
There are many cases where the hardware target is well known and exact. For example an embedded device, controller for some factory, or a transportation system (eg flight mangement sytems cannot make hardware version changes without major testing); orat the big end when you are optimizing for a super computer, a mainframe, or a large server farm deployment (matching hardware) the electrical costs alone can more than offset the cost of targeted optimization and a custom compile. (These cases spend tens of thousands of USD per month on electricity. A 1% efficiency gain during a year is worth a week of developer time.) In a few cases the hardware may be purchased after saftware profiling and optimization results that were made on a selection of sample machines.
Super interesting topic. Great presentation. I wonder where the most to gain is with current software (current hardware is criminally underused due to bad software). Is it branch prediction, is it parallelism, is it cache oriented code? Or is it something else?
Software is vastly different. We develop solvers for different forms of sparse linear equations. This particular case of algorithms is often memory-bound. CPU spends 100 cycles for next portion of data to be loaded from RAM and about 10 cycles to process. So branch misprediction is not a problem in our case. Cache misses and memory speed are.
Depends on the code. General software probably branch prediction. Video, audio, graphics, or other things easily parallelizable probably parallelism and cache. Using SIMD instructions can easily provide an absolutely massive performance gain (perhaps 5-8x) where it makes sense. Multithreading can help with a lot of apps, not just those capable of benefiting from SIMD, though still not always a perfect or easy solution. Cache optimization can benefit, there are some general practices to prevent obvious slowdowns (such as how you iterate multi dimensional arrays), but can be a bit tricky otherwise.
I would say based on relative performance of different processors with different implementations of branch prediction, different number of cores, different SIMD configurations, different cache capacities: all of the above. But branch prediction has a knock on effect on everything, like your high memory latency through the cache hierarchy doesn't really matter so much if you don't do useless requests. On the other hand, bad memory locality just has such a devastating impact on performance that it's really unfortunate to ignore.
Interesting and informative talk! I like the hands-on, example-driven approach. What I don’t like is the constant interruptions (esp. ~ mins 30-40) from the audience questions. These are very hard to follow as a remote viewer and disrupt the flow.
I agree that it reduces the quality of our experience as remote viewers, but you have to remember that these conferences are optimized for the experience of the in-person attendees (as they should be, it's an organic, and expensive event)
There are multiple branches. For example, there could be four branches that can be predicted with ~100% accuracy, and the fifth and final branch is predicted with 50% accuracy. Or just one branch that’s 100% accurate and it just gets evaluated 4x more often than the 50% one.
Perf says that 10% of the branches are mispredicted. However the "interesting" branch was just one of the many branches in the program. There's quite a lot of other code involved, think about the benchmarking framework being used.. For every benchmark instance it might be running quite a bit of logic. Also the for loop termination branch is present. Also, the arrays are filled up before doing the experiment itself, and this also involves some branches (for loop and stuff). All of these branches are well predicted, so the "interesting" ones that are mispredicted are just a part of all the branches of the program.
10% of all branches executed were mispredicted, but the branch in question is only one of several executed in the loop, so it might be 50% mispredicted but when aggregated with a bunch of predicted branches at other sites the final rate could be 10%.
Great talk like always quality content from Fedor! I will definitely buy his book as I'm rly interested in these type of optimisations. Also can someone explain why the optimisation with function pointers doesn't work when functions are inlined?
Functions wouldn't work *ever* . A function call is a jump, so basically a branch. You could get the pointer branch-lessly, but then you would have to stall until the actual instructions that make up the function are loaded. It's basically like loading an array. No matter how fast you can get the pointer to that array, when you dereference it to use it, you still have to wait for the array to get cached. Same with functions. And if you just had if (cond) foo(); else bar(); you would always have a fast branch and a very slow branch, but if you introduce indirection with the function pointer you always have a pretty slow branch by having to cache the function instructions every single time, only after you know where they are stored. If the condition was sufficiently unpredictable, then it may be worth it, but it seems unlikely. And for the inlined function it's even worse because an inlined function doesn't need to build a new stack frame since all the instructions are right there, inlined. Maybe even the caching of instructions is faster this way. If you use pointers you are forcing the compiler not to inline since it needs to be able to point to anything.
@@rauldragu9447 I think what may work is call both functions, store the results in the array, then use the condition to index into the result of the correct function. But if the function is more expensive than a branch misprediction on deciding which function to call, then you should just use the branch. Like Fedor Pikus said, measure! If performance doesn't matter enough for you to make a benchmark and measure, then you shouldn't do any special tricks either.
My low-mid (second from the bottom level) level interview was 70% leadership to start with, followed by 30% extremely basic technical. I didn't expect leadership questions at such a low level, and I messed up that interview so hard, but fortunately they hired all 4 people that made it to interview so it's all good!
You know why, right? Because an extremely technical person has a tendency to get lost in minutia and might spend hours, days and sometimes months thinking about how they can save a few nanoseconds in a program that needs to have 100% uptime, instead. Like this guy.
It can't optimize `c[i] = rand() >= 0` because function `rand()` is treated as "deterministic" random generator. The internal state of RNG must be advanced even though the returned value is discarded. The best one can get is `rand(); c[i] = 1;`
@@somehrjdbddhsjsbnsndndk751 Computers, at least without special hardware, are incapable of making truly random numbers. So rand() with the same given seed will give you the same sequence of "random" numbers every time you call it. If the compiler optimized out the rand() call, then it would be violating one of the rules of the compiler since it would be affecting the output of the program.
@@somehrjdbddhsjsbnsndndk751 Tomasz was saying that `rand()` has side-effects, so it can't be eliminated entirely. This is because `rand()` has an internal state that holds that last randomly generated number to generate the next one. If you didn't call rand() at all, that internal state would not be changed properly, and that could result in observable differences between optimized code and not optimized code, namely that the sequence of random numbers would be generated for a given seed would be applied to different vectors in his program. Now, of course, the programmer might not care about that, and would like this kind of optimization to occur, but the compiler has no way of asking or knowing that the programmer does not care about it. Regardless, I would argue that a compiler warning would be better in this context than an optimization. If I had some condition buried in my code that was always true, then that condition is probably written incorrectly or I could delete it myself. There are some tools which will give you a warning when a condition is always true, but often times, the tool does not have a type that can describe what the real outputs look like. I.e. if something returns an "int" or a "number" rather than a "positive integer", the tool has no way to determine that the returned value will always be greater than or equal to 0.
In examples 1 and 2: isn't there an additional data dependency because both the 'if' and the 'else' branch write to a1? It seems a2 remains at 0 in those examples.
1. this video was made in 2021, with supposedly not an ancient CPU in the test system. I've heard so much about modern CPUs having redundant pipelines that keep evaluating both sides of the branch (and still keeping the branch prediction circuitry, in case the pipelines are "shorter" than the branch code paths). If that's true, why doesn't that make the handling of the ASM code generated by the compiler much more efficient? 10% (minority) wrong predictions should still lead to high efficiency. 2. why isn't the C compiler taking advantage of the SIMD in the fastest version of C branchless? Isn't there some compiler option you could have turned on to get the compiler to make code that performs closer to the optimal ASM impl? .... are compiler implementors and CPU designers lying to us or are they optimizing to unrepresentative/narrow test cases?
1. Evaluating both sides of the branch is gonna perform worse than branch prediction for the vast majority of branches. As Fedor said in the talk, 1% mispredict is already a really bad case, so if you just take both paths always, you do unnecessary work for 99% of all branches. So 1% of your code becomes faster, 99% of it becomes slower. This is not good enough to use as a general solution. I also don't know what CPUs you are talking about. 2. You can tell your compiler to autovectorize using flags, but I believe it's not done by default. One reason for this is that available SIMD instruction sets differ between processors, so the compiler will of course avoid generating code that your CPU may not even be able to run by default. So you have to tell it what SIMD ISA you are targeting. It also absolutely can't do as good of a job as optimal assembly, because 1) it doesn't have as much information as you, so it can't make the same optimizations as you. The guarantees the code makes are different than the guarantees that the algorithm makes. And the strongest optimizations often change how the program runs internally, which the compiler can not do, because it only does transformations that keep the observable program behaviour identical. And 2) SIMD code needs certain preconditions to excell, which are often not present in normal code. So the "optimal assembly" might involve preparing the data in a different way and transforming it in a different, and that code might be scatter across multiple functions. A programmer can make these changes because they can reason that the end result is identical, but it might be way too many instructions for the optimizer to reason about. There really is no reason to attribute malice to compiler or CPU vendors here.
If you depend on a machine that needs to be 99% efficient to keep up with your problem, then you have a poor hardware design to begin with that will fail the next time your requirements change. In most applications the CPU is waiting most of the time, so why in the world do you care? Gaming didn't get faster because they optimized the hell out of branches. It got faster because they are selling GPUs.
Fedor explains why in the last 3 paragraphs of chapter 3: CPU Architecture, Resources, and Performance - Branchless computing. In short, function pointer calls prevent inlining, which is a more beneficial optimisation than avoiding mispredicted branches.
Why code like `rand() & 0x1` generates branches that can be missed? Doesnt this piece of code perform a branchless stream of instruction (with some bitwise-and call)?
Just ran perf stat on python3 hello world and got 6% branch misses. pypy3 had 3% but way more total branches... g++ or clang++ with -O0 or -O3 and I still get 2.5% misses(though a lot less total branches, 800k vs 30M) using iostream with endl same branches and misses for printf() Also 3M instructions for printing hello world? hmmm... 950k instructions and 220k branches for ` int main(){ return 0;} ` must be overhead from perf or the OS, the assembly is basically zero eax then ret
A;so just tested some assembly loops. I'd say around 10G instructions or 1G branches per run is a good level to washout the bulk of the overhead. Maybe bump that an order of magnitude for fine tuning(but then your also going to want some control over cpu core-thread scheduling to prevent context switching from molesting your cache.)
How to find a job for a C++ developer? I'm not joking, I'm a dotnet developer, and I'm asking the serios quesion. Companies (as usual) want to see professional C++ developers after they finished university, which is impossible.
I think a function table is a kind of branching mechanism because the jump address depends on a runtime value which must be predicted to avoid flushing the pipeline.
Function results table, forcing you to evaluate everything. He specifically said that a table of function pointers is usually bad, and if those can be inlined, it's usually terrible.
you write this kind of code and then some other person will look at your code and be like "was this guy on drugs when he wrote this?". I think all this makes C++ a horrid language
That "session is over" was definitely not predicted.
It definitely wasn't branchless
Should have used an array
I suggest you look at `sudo perf top -e branch-misses` - it will just tell you exactly where there are too many mispredicts. Hit enter on the top thing and drill down to what function it is. Build with debug info.
I didn't even know you could use top with perf! Learn something new everyday.
With x86 the biggest effect likely/unlikely has is to rearrange the instructions such that the unlikely branch is moved to a different area of the program. This makes the likely path a serial instruction stream, which is good for instrction cache. Also good for branch prediction in the case there is no branch prediction entry in the table. E.g. the first time thru
Thanks for the input!
Very interesting with a bit of a drama at the end: "Your session is over". 😏
Thanks for repeating the questions.
underappreciated comment
Having looked at the comments before watching the entire talk, I was a little worried the talk would end before the speaker got to close the talk. So I'll mention here that the cut off happened during the question section at the end, after the last slide. Still somewhat abrupt!
Session is over, thank you!
Could have said "I'm sorry, we've run out of time, and we need to prepare the next session".
I watched this talk sometime back around when it was first uploaded, and I am almost certain I missed that joke at the beginning: "it's a talk on performance; the closer you sit, the better the performance." I am so glad YT recommended it to me again.
"The closer you sit, better the performance" - I see what you did there :)
Awesome talk! It's rare that I would ever watch an hour-long video on UA-cam, but by the time I got to the end of this one I wanted to go back to and watch it again.
Glad you enjoyed it!
This is absolutely mind blowing session
1:03:30 He was ready for some more questions and then it ended suddenly with a well deserved applause. Its like games where a cut-scene would end a fight with infinitely spawning enemies. "Well"
Man that ending is depressing.
Impressively, the talk ends in a Branch Prediction failure.
Fedors reaction is amazing : ) From now on, that's exactly what I'll imagine the CPU doing.
Excellent talk about branch predictor and the ways to take advantage of it. When I first saw the title though, for some reason I initially thought that the topic is about functional programming.
the art of writing efficient programs is a very good introductory book into this performance focused universe, highly recommended
what a way to close the session
seriously, what's up with that? i've seen surgeries with less strict timetables
okay, i haven't, but you get the idea
That was amazingly rude. "I'm sorry, we've run out of time and need to prepare the next session" would have been fine.
could not have predicted that
The Tyler Durden close.
@@GeorgeTsirosMost conferences work that way, its the norm, you know it if you have participated of one.
great talk, unfortunate end though :D can anyone tell me which CPU it was that had the faulty branch predictor?
Does that really matter?
Are you going to write code only for one specific CPU, e.g. ARM?
@@vasiliynkudryavtsev no, I just asked out of curiosity
any modern cpu will do the same, his code was written specifically to fool modern cpus
If you want to make it as hard as possible for old processors, do taken, taken, not taken, not taken, repeatedly. Every branch will mispredict. They had a two bit counter that incremented if taken, decremented if not taken, and it predicted taken if counter for this branch >= 2. Once it has been saturated one way, it has to mispredict twice in a row to start predicting the other way. Modern processors are not so dumb, they have a bunch of histories, and it selects what history based on whether the last few branches were taken or not (to get correlation).
Best I can tell it may have been the Intel Pentium 4 Willamette, also called NetBurst. I can't find anything online that says what the speaker said -- namely reported the opposite of what it meant to -- but apparently it had pretty bad branch prediction.
(I apologize in advanced if I've mixed up some terms. CPU architecture isn't my strong suit!)
It was a great ride! Thanks Fedor & @CppCon 👍
Our pleasure!
Excellent talk. Very sad to see it cut short but it is what it is. I have the book but haven't started yet. I would actually like to know/test if this can be done in GPUs too.
Indeed, a lot of GPU-based implementations of Math functions (such as abs) are intrinsecally branch-free as the divergence between threads in a GPU tend to degrade performance.
It can, and it is done. Many aspects to talk about when optimizing for GPUs
No need pretty much? All shader instances within an execution unit will cause all branches for which the prerequisites are fulfilled in any of the instances to be taken (effectively by all these instances), so while at source level you branch, the underlying implementation is effectively branch-free by hardware and system design. Furthermore the cost of branching, when it rarely does happen (aggregated), is pretty much zero, as during the resulting latency window, other work is performed, another context is ready to go and switch in by then. GPUs are INSANELY good at latency hiding and maintaining high nominal utilisation. Branching at source level is actually recommended, for the case that none of the instances in an execution group take a given branch, then the whole execution group can skip it, else the whole group effectively takes it.
In the end you don't really have much control over it and often times you have to even pay the associated extra cost of branch-free execution (those other branches being executed as well) in spite of nominally branching.
It all even goes back to the very first shader model which had no branching provision at all, at machine language level there were only selective mov instructions, no jump instructions; while shader compiler allowed and recognised branches at source level and converted them correspondingly. Loops had to have a trivially provable max iteration count and would be fully unrolled by the compiler, naturally.
hand drawn illustrations are great touch
Nice to see a talk like this when there are so many people scoffing at branch free, because they saw one example where a perfectly predictable branch skipped a few ops and ended up slightly faster.
How many times in your life did you need those 0.3ns back? :-)
@@lepidoptera9337: 0.3 times
cool upload CppCon. I smashed that thumbs up on your video. Maintain up the excellent work.
Much appreciated
18:30 register aliasing blew my mind
I had no idea eax is a logical register!
Always happy to see Fedor!
these concepts apply not only to c++ but many other languages, very good presentation
The session is over, thank you......
Loved it. Thanks Fedor!
Glad you enjoyed it!
“Predictor is good in prediction. We are not“ - nice :)
Respect. I learned a ton of stuff by watching this. Many thanks!!!
Surprised that the humble s += b*x could be so much more efficient. I've been using that one for years
though i would have really expected the compiler to pick up on this
whats even better is a negation of either 1 or 0 with a bitwise AND (-1 is all one-bits). 2 clock cycle latency (NEG; AND) vs 3-4 (multiply):
s += -b & x;
@@47Mortuus I assume that one only makes sense if you are adding integers.
Cpu and compiler engineers are mad geniuses lol
Indeed.
Great talk and I have learned a lot of stuff in only 1 hour :)
Great summary of the subject, I learnt a lot
Glad to hear it!
In PowerPC, the branch conditional instruction has the prediction baked into it. I've seen, for example, MetroWerks compiler output where its static predictions were very primitive (99% of branches forward were unlikely, 99% of branches backward were likely). I've yet to use the C++20 attributes, but [[likely]] and [[unlikely]] probably give you manual control of the prediction bit for those branches, which is neat.
Debug assertions and nullptr checks were totally the first thing I thought of when I learned about these attributes. Even if the compiler is smart enough to recognize a nullptr check and mark it as unlikely (I'm sure it is), it is nice to be able to self-document it with an attribute.
Yes the compiler will always assign a lower probability to a null pointer check indeed.
And no, never use likely or unlikely without profiling and looking at the assembly, the C++ standard is ass and says that likely/unlikely means "arbitrarily likely" and what that means in clang and gcc is 80%/20% respectively. It turns out that because of things such as null pointer checks and early returns the compiler can assign to a branch a probability lower than 20% but because unlikely means always 20% marking it as unlikely will actually make it more likely.
Again, the C++ standard is complete dogshit and it should have a feature to tell the compiler exactly how likely it is ([[likely 5%]] or something like that), but it doesn't and thus unlikely may jot do what you expect it to do.
@@marcossidoruk8033 That is much more of an implementation failure than a standard failure.
thank you very much , excited talk
Marvellous. Thanks. Deep thinking...
You're most welcome
The part about the compiler built in likely/unlikely hints is completely off.
it is just meta information for the resulting ORDERING OF INSTRUCTIONS in the machine code. This is due to the programmer wanting to avoid (instruction) cache misses, since the branch marked as more likely will immediately follow the conditional check.
Furthermore, the unlikely branches may be put at the very very end of a particular procedure (function) or loop with an unconditional jump back to where the branch ends. It may even prevent function inlining in order to improve performance that way (yes, sometimes forcing not to inline improves performance).
The session had ended 🤣
Another talk related to this speculative type branching topic was one given by Chandler Carruth here:
ua-cam.com/video/2EWejmkKlxs/v-deo.html
I'd recommend giving it a watch first if some of the foundational material at the beginning of this talk seemed hard to understand. It's a similar talk to this one except with a focus on loops rather than if-else branches.
Great talk!
I would have liked to ask if the array-of-two-bool-index micro-optimization for removing a branch is generally preferable to what I would have done more naively, namely adding or or'ing the two values after multiplying them with 1 and 0, gotten by treating the bool as an int.
I guess the answer could be: measure it
Really great topic and good info from Fedor, but his style takes some getting used to.
Serious question: if we repeat all the time ‘measure before changing’ and that compilers and processors may do better work then why do we think, that after we changed code once, it stays the fastest code? We made code less readable, we removed branch and added more work.
Then what if new processor came or new optimization in a compiler etc and original code would be better?
You cannot strictly rely on the code being perpetually faster indeed, but what choice do you have? You can't optimise for the unknown, and pessimising software in the hope that an architecture comes along someday that magically makes your slow code fast obviously doesn't help either.
One somewhat warranted assumption you can make is that processors will be optimised towards current, existing code; that they will be similar to today's processors just better and faster, that they will implement similar techniques as today but make them more robust. That processors when they get released and benchmarked by the press, will have to compete on old code, which will be a mix of code profile-optimised for last-gen processors and canonical style code.
Another related assumption is that even if the processors open up a different way to do things, old way will not be slower in absolute terms, just in relative terms. So say if your software has minimum specification being an AMD Bulldozer, and you optimise towards that, you have eliminated the performance weakness on the weakest hardware that you target, and that's enough, you don't actually care that the software fails to perform much quicker on a nominally faster processor, you have created a software that is palatable for the entire userbase. I have optimised towards Pentium 4 in the past (well over 10 years ago) knowing that it's the chip family most likely to struggle among all processors people are likely to use, and it didn't make nearly as much of a difference either way on anything like an Athlon XP or a Pentium M or a Core.
Obviously neither of these assumptions is absolutely true but they are useful as guiding principles in absence of better information.
Of course if you have the freedom - and there exists code which actually does that, though it's exceptionally rare - you can have several different implementations in your software, and profile them when you detect system configuration change or on startup. Of course it has its own hazards as you introduce an indirection at entry into that code, so better be sure it's worth it. Less risky is having the benchmark of optimised vs. canonical code automated in your test suite outside the shipped software.
I do think performance benchmark should be part of your CI procedure, because if there are performance regressions - usually an undesired side effect of development such as feature work or bugfixing rather than any sort of platform changes - you'll know about them. These performance regressions are a day to day reality rather than a hypothetical differently behaving platform to come along someday, and your software will generally gradually turn into a disaster if you ignore them.
That's why you don't just measure once, change, and then leave it. You continually measure. If you upgrade your compiler and there's a 5% performance drop, you measure, find the (potentially new) bottleneck, and change.
Don't forget that also 99% of the time you shouldn't need to think about this in the first place.
There are many cases where the hardware target is well known and exact. For example an embedded device, controller for some factory, or a transportation system (eg flight mangement sytems cannot make hardware version changes without major testing); orat the big end when you are optimizing for a super computer, a mainframe, or a large server farm deployment (matching hardware) the electrical costs alone can more than offset the cost of targeted optimization and a custom compile.
(These cases spend tens of thousands of USD per month on electricity. A 1% efficiency gain during a year is worth a week of developer time.)
In a few cases the hardware may be purchased after saftware profiling and optimization results that were made on a selection of sample machines.
Super interesting topic. Great presentation. I wonder where the most to gain is with current software (current hardware is criminally underused due to bad software). Is it branch prediction, is it parallelism, is it cache oriented code? Or is it something else?
Software is vastly different. We develop solvers for different forms of sparse linear equations. This particular case of algorithms is often memory-bound. CPU spends 100 cycles for next portion of data to be loaded from RAM and about 10 cycles to process.
So branch misprediction is not a problem in our case. Cache misses and memory speed are.
Depends on the code. General software probably branch prediction. Video, audio, graphics, or other things easily parallelizable probably parallelism and cache. Using SIMD instructions can easily provide an absolutely massive performance gain (perhaps 5-8x) where it makes sense. Multithreading can help with a lot of apps, not just those capable of benefiting from SIMD, though still not always a perfect or easy solution. Cache optimization can benefit, there are some general practices to prevent obvious slowdowns (such as how you iterate multi dimensional arrays), but can be a bit tricky otherwise.
I'd say memory layout and parallelization
@@dmitrykargin4060 I find this quite interesting. Do you work with direct solvers or iterative solvers?
I would say based on relative performance of different processors with different implementations of branch prediction, different number of cores, different SIMD configurations, different cache capacities: all of the above.
But branch prediction has a knock on effect on everything, like your high memory latency through the cache hierarchy doesn't really matter so much if you don't do useless requests.
On the other hand, bad memory locality just has such a devastating impact on performance that it's really unfortunate to ignore.
Interesting and informative talk! I like the hands-on, example-driven approach.
What I don’t like is the constant interruptions (esp. ~ mins 30-40) from the audience questions. These are very hard to follow as a remote viewer and disrupt the flow.
I agree that it reduces the quality of our experience as remote viewers, but you have to remember that these conferences are optimized for the experience of the in-person attendees (as they should be, it's an organic, and expensive event)
Really well presented!
Thank you kindly!
Thank you kindly!
In the first example, how do you get 10% only misprediction, if the predicate is random ?
There are multiple branches. For example, there could be four branches that can be predicted with ~100% accuracy, and the fifth and final branch is predicted with 50% accuracy. Or just one branch that’s 100% accurate and it just gets evaluated 4x more often than the 50% one.
Perf says that 10% of the branches are mispredicted. However the "interesting" branch was just one of the many branches in the program. There's quite a lot of other code involved, think about the benchmarking framework being used.. For every benchmark instance it might be running quite a bit of logic. Also the for loop termination branch is present. Also, the arrays are filled up before doing the experiment itself, and this also involves some branches (for loop and stuff). All of these branches are well predicted, so the "interesting" ones that are mispredicted are just a part of all the branches of the program.
It is all branches for the entire program, which includes a lot more than just the code being benchmarked.
there are other branches in the code. There are branches in the loop around the `if` statement. There are branches in the benchmark framework etc.
10% of all branches executed were mispredicted, but the branch in question is only one of several executed in the loop, so it might be 50% mispredicted but when aggregated with a bunch of predicted branches at other sites the final rate could be 10%.
very informative
thanks
Glad it was helpful!
Great talk like always quality content from Fedor! I will definitely buy his book as I'm rly interested in these type of optimisations.
Also can someone explain why the optimisation with function pointers doesn't work when functions are inlined?
Functions wouldn't work *ever* . A function call is a jump, so basically a branch. You could get the pointer branch-lessly, but then you would have to stall until the actual instructions that make up the function are loaded. It's basically like loading an array. No matter how fast you can get the pointer to that array, when you dereference it to use it, you still have to wait for the array to get cached. Same with functions. And if you just had if (cond) foo(); else bar(); you would always have a fast branch and a very slow branch, but if you introduce indirection with the function pointer you always have a pretty slow branch by having to cache the function instructions every single time, only after you know where they are stored. If the condition was sufficiently unpredictable, then it may be worth it, but it seems unlikely.
And for the inlined function it's even worse because an inlined function doesn't need to build a new stack frame since all the instructions are right there, inlined. Maybe even the caching of instructions is faster this way. If you use pointers you are forcing the compiler not to inline since it needs to be able to point to anything.
@@rauldragu9447 I think what may work is call both functions, store the results in the array, then use the condition to index into the result of the correct function. But if the function is more expensive than a branch misprediction on deciding which function to call, then you should just use the branch. Like Fedor Pikus said, measure! If performance doesn't matter enough for you to make a benchmark and measure, then you shouldn't do any special tricks either.
Really good presentation, thank you!
Great talk!
My low-mid (second from the bottom level) level interview was 70% leadership to start with, followed by 30% extremely basic technical. I didn't expect leadership questions at such a low level, and I messed up that interview so hard, but fortunately they hired all 4 people that made it to interview so it's all good!
You know why, right? Because an extremely technical person has a tendency to get lost in minutia and might spend hours, days and sometimes months thinking about how they can save a few nanoseconds in a program that needs to have 100% uptime, instead. Like this guy.
@@lepidoptera9337 lol whatever makes you feel better.
31:21 anyone knows what CPU he is talking about? I'm quite curious
Very great talk, thank you!
It can't optimize `c[i] = rand() >= 0` because function `rand()` is treated as "deterministic" random generator. The internal state of RNG must be advanced even though the returned value is discarded. The best one can get is `rand(); c[i] = 1;`
Deterministic, I don't get it and I don't understand why?
@@somehrjdbddhsjsbnsndndk751 Computers, at least without special hardware, are incapable of making truly random numbers. So rand() with the same given seed will give you the same sequence of "random" numbers every time you call it.
If the compiler optimized out the rand() call, then it would be violating one of the rules of the compiler since it would be affecting the output of the program.
@@somehrjdbddhsjsbnsndndk751 Tomasz was saying that `rand()` has side-effects, so it can't be eliminated entirely. This is because `rand()` has an internal state that holds that last randomly generated number to generate the next one. If you didn't call rand() at all, that internal state would not be changed properly, and that could result in observable differences between optimized code and not optimized code, namely that the sequence of random numbers would be generated for a given seed would be applied to different vectors in his program.
Now, of course, the programmer might not care about that, and would like this kind of optimization to occur, but the compiler has no way of asking or knowing that the programmer does not care about it.
Regardless, I would argue that a compiler warning would be better in this context than an optimization. If I had some condition buried in my code that was always true, then that condition is probably written incorrectly or I could delete it myself. There are some tools which will give you a warning when a condition is always true, but often times, the tool does not have a type that can describe what the real outputs look like. I.e. if something returns an "int" or a "number" rather than a "positive integer", the tool has no way to determine that the returned value will always be greater than or equal to 0.
In examples 1 and 2: isn't there an additional data dependency because both the 'if' and the 'else' branch write to a1? It seems a2 remains at 0 in those examples.
Very creative talk
How they managed to have an in person convention in 2021?
The power of C++.
"Your session is over" :(
What a damn shame that happened!
18:57 shoud lt be register renaming?
1. this video was made in 2021, with supposedly not an ancient CPU in the test system. I've heard so much about modern CPUs having redundant pipelines that keep evaluating both sides of the branch (and still keeping the branch prediction circuitry, in case the pipelines are "shorter" than the branch code paths). If that's true, why doesn't that make the handling of the ASM code generated by the compiler much more efficient? 10% (minority) wrong predictions should still lead to high efficiency.
2. why isn't the C compiler taking advantage of the SIMD in the fastest version of C branchless? Isn't there some compiler option you could have turned on to get the compiler to make code that performs closer to the optimal ASM impl?
.... are compiler implementors and CPU designers lying to us or are they optimizing to unrepresentative/narrow test cases?
1. Evaluating both sides of the branch is gonna perform worse than branch prediction for the vast majority of branches. As Fedor said in the talk, 1% mispredict is already a really bad case, so if you just take both paths always, you do unnecessary work for 99% of all branches. So 1% of your code becomes faster, 99% of it becomes slower. This is not good enough to use as a general solution. I also don't know what CPUs you are talking about.
2. You can tell your compiler to autovectorize using flags, but I believe it's not done by default. One reason for this is that available SIMD instruction sets differ between processors, so the compiler will of course avoid generating code that your CPU may not even be able to run by default. So you have to tell it what SIMD ISA you are targeting. It also absolutely can't do as good of a job as optimal assembly, because 1) it doesn't have as much information as you, so it can't make the same optimizations as you. The guarantees the code makes are different than the guarantees that the algorithm makes. And the strongest optimizations often change how the program runs internally, which the compiler can not do, because it only does transformations that keep the observable program behaviour identical. And 2) SIMD code needs certain preconditions to excell, which are often not present in normal code. So the "optimal assembly" might involve preparing the data in a different way and transforming it in a different, and that code might be scatter across multiple functions. A programmer can make these changes because they can reason that the end result is identical, but it might be way too many instructions for the optimizer to reason about.
There really is no reason to attribute malice to compiler or CPU vendors here.
If you depend on a machine that needs to be 99% efficient to keep up with your problem, then you have a poor hardware design to begin with that will fail the next time your requirements change. In most applications the CPU is waiting most of the time, so why in the world do you care? Gaming didn't get faster because they optimized the hell out of branches. It got faster because they are selling GPUs.
Thank you CppCon. Is the presentation slides available anywhere?
Thank you for your comment. We are investigating this and hope to resolve the issue shortly.
@@CppCon idk what issue you're even trying to resolve lmao. They asked for the slides, this aint a bug report :s gotta love automated bot comments...
Why does the function pointer trick not work ? Perhaps expensive memory look up for the function ?
Fedor explains why in the last 3 paragraphs of chapter 3: CPU Architecture, Resources, and Performance - Branchless computing. In short, function pointer calls prevent inlining, which is a more beneficial optimisation than avoiding mispredicted branches.
cool!
Why code like `rand() & 0x1` generates branches that can be missed? Doesnt this piece of code perform a branchless stream of instruction (with some bitwise-and call)?
I'm not an expert but it depends on the implementation of "rand()" and it may not be inlined by the compiler
I so wanted the bool conditional to be the solution at: @44:12
Why can't compilers do this type of optimization?
Can you explain this in a way that a java programmer would understand?
53:10 - can this technique be used to make an effective Meltdown/Spectre-proof code?
Awkward ending to say the least
give him another hour
bool(x)+bool(y) is a bit scary, booleans shouldn't have arithmetic adition defined, it should have only boolean operations defined.
2:58 beginning
1:03:28 EOF
It’s crazy the quality of people Russia has lost
Do your cats own laptops ? LOL.
You know that you have no real problems in your life if you are focused on the branch prediction pipeline of your CPU. ;-)
Just ran perf stat on python3 hello world and got 6% branch misses. pypy3 had 3% but way more total branches...
g++ or clang++ with -O0 or -O3 and I still get 2.5% misses(though a lot less total branches, 800k vs 30M) using iostream with endl
same branches and misses for printf()
Also 3M instructions for printing hello world?
hmmm... 950k instructions and 220k branches for ` int main(){ return 0;} ` must be overhead from perf or the OS, the assembly is basically zero eax then ret
A;so just tested some assembly loops. I'd say around 10G instructions or 1G branches per run is a good level to washout the bulk of the overhead. Maybe bump that an order of magnitude for fine tuning(but then your also going to want some control over cpu core-thread scheduling to prevent context switching from molesting your cache.)
How to find a job for a C++ developer? I'm not joking, I'm a dotnet developer, and I'm asking the serios quesion. Companies (as usual) want to see professional C++ developers after they finished university, which is impossible.
Ignore all conditions written in the vacancy description. Most companies don't care what they have written there.
Let me make a summary: one can use a function table to replace branches.
58:15 didn't he say that doesn't work though?
That would be a really bad summary, because this is something that does NOT work, as presented in the talk.
He specifically mentioned this as something that almost never works 58:19
I think a function table is a kind of branching mechanism because the jump address depends on a runtime value which must be predicted to avoid flushing the pipeline.
Function results table, forcing you to evaluate everything. He specifically said that a table of function pointers is usually bad, and if those can be inlined, it's usually terrible.
you write this kind of code and then some other person will look at your code and be like "was this guy on drugs when he wrote this?". I think all this makes C++ a horrid language
Very interesting, thanks
Glad you think so!