Branchless Programming in C++ - Fedor Pikus - CppCon 2021

  • Published Jan 4, 2022
  • cppcon.org/
    github.com/CppCon/CppCon2021
    ---
    Have you ever written code like this:
    void f(bool b, long x, long& s) { if (b) s += x; }
    Probably you have. Would you like me to tell you how much performance you left on the table? With a small change, that function could be made 2.5 times faster.
    What about this code:
    if (a[i] && b[i]) do_something(); else do_something_else();
    Would you believe me if I told you that, under some not-so-exotic conditions, this line runs four times slower than it could? It’s true, and I’m going to show you when and why.
    This presentation will explain how modern CPUs handle computations to ensure that the hardware is used efficiently (which is how you get high performance from the processor). We will then learn how conditional code disrupts the ideal flow of computations and the countermeasures employed by the CPU designers to retain good performance in the presence of such code. Sometimes these measures are sufficient, often with the help of the compiler. But when they aren’t, it is up to the programmer to recover lost performance by coding with fewer branches.
    ---
    Fedor Pikus
    Fedor G Pikus is a Chief Engineering Scientist in the Design to Silicon division of Mentor Graphics Corp (a Siemens business). His earlier positions include Senior Software Engineer at Google and Chief Software Architect for Calibre PERC, LVS, DFM at Mentor Graphics. He joined Mentor Graphics in 1998 when he switched from academic research in computational physics to the software industry. Fedor is a recognized expert on high-performance computing and C++; he has presented his work at CppCon, SD West, DesignCon, and in Software Development Journal, and is also an O’Reilly author. His responsibilities as Chief Scientist include planning the long-term technical direction of the Calibre products, directing and training the engineers who work on them, software design and architecture, and research into new design and software technologies. Fedor has over 25 patents and over 100 papers and conference presentations on physics, EDA, software design, and the C++ language.
    ---
    Videos Filmed & Edited by Bash Films: www.BashFilms.com
    YouTube channel managed by Digital Medium Ltd: events.digital-medium.co.uk
  • Science & Technology

COMMENTS • 176

  • @doc7115
    @doc7115 1 year ago +58

    That "session is over" was definitely not predicted.

    • @DavidsKanal
      @DavidsKanal 19 days ago

      It definitely wasn't branchless

  • @denispriyomov6086
    @denispriyomov6086 2 years ago +56

    Very interesting with a bit of a drama at the end: "Your session is over". 😏

  • @douggale5962
    @douggale5962 2 years ago +80

    I suggest you look at `sudo perf top -e branch-misses` - it will just tell you exactly where there are too many mispredicts. Hit enter on the top thing and drill down to what function it is. Build with debug info.

    • @playman350
      @playman350 1 year ago +7

      I didn't even know you could use top with perf! You learn something new every day.

  • @happycats-go8sv
    @happycats-go8sv 1 year ago +16

    1:03:30 He was ready for some more questions and then it ended suddenly with a well-deserved applause. It's like games where a cut-scene ends a fight with infinitely spawning enemies. "Well"

    • @DasAntiNaziBroetchen
      @DasAntiNaziBroetchen 1 year ago +8

      Man that ending is depressing.

    • @garychap8384
      @garychap8384 7 months ago +3

      Impressively, the talk ends in a Branch Prediction failure.
      Fedor's reaction is amazing : ) From now on, that's exactly what I'll imagine the CPU doing.

  • @binarytv2904
    @binarytv2904 2 years ago +8

    "The closer you sit, better the performance" - I see what you did there :)

  • @tolkienfan1972
    @tolkienfan1972 2 years ago +61

    With x86, the biggest effect likely/unlikely has is to rearrange the instructions so that the unlikely branch is moved to a different area of the program. This makes the likely path a serial instruction stream, which is good for the instruction cache. It is also good for branch prediction in the case where there is no branch prediction entry in the table, e.g. the first time through.

  • @Thiago1337
    @Thiago1337 2 years ago +25

    Session is over, thank you!

    • @spacelem
      @spacelem 1 year ago +9

      Could have said "I'm sorry, we've run out of time, and we need to prepare the next session".

  • @masondeross
    @masondeross 1 year ago +4

    I watched this talk sometime back around when it was first uploaded, and I am almost certain I missed that joke at the beginning: "it's a talk on performance; the closer you sit, the better the performance." I am so glad YT recommended it to me again.

  • @stephenjames2951
    @stephenjames2951 2 years ago +29

    Thanks for repeating the questions.

    • @cryp0g00n4
      @cryp0g00n4 2 years ago

      underappreciated comment

  • @endian675
    @endian675 2 years ago +16

    Awesome talk! It's rare that I would ever watch an hour-long video on YouTube, but by the time I got to the end of this one I wanted to go back and watch it again.

    • @CppCon
      @CppCon  2 years ago +8

      Glad you enjoyed it!

  • @serhii-ratz
    @serhii-ratz 1 year ago +19

    This is an absolutely mind-blowing session.

  • @Ryan1729
    @Ryan1729 2 years ago +12

    Having looked at the comments before watching the entire talk, I was a little worried the talk would end before the speaker got to close the talk. So I'll mention here that the cut off happened during the question section at the end, after the last slide. Still somewhat abrupt!

  • @piotrarturklos
    @piotrarturklos 2 years ago +12

    Excellent talk about branch predictor and the ways to take advantage of it. When I first saw the title though, for some reason I initially thought that the topic is about functional programming.

  • @JohnDoe-ly4ls
    @JohnDoe-ly4ls 2 years ago +5

    It was a great ride! Thanks Fedor & @CppCon 👍

    • @CppCon
      @CppCon  2 years ago

      Our pleasure!

  • @Marius-ir1qn
    @Marius-ir1qn 2 years ago +69

    what a way to close the session

    • @GeorgeTsiros
      @GeorgeTsiros 2 years ago +19

      seriously, what's up with that? i've seen surgeries with less strict timetables
      okay, i haven't, but you get the idea

    • @spacelem
      @spacelem 1 year ago +19

      That was amazingly rude. "I'm sorry, we've run out of time and need to prepare the next session" would have been fine.

    • @jake1367
      @jake1367 1 year ago +19

      could not have predicted that

    • @ernststravoblofeld
      @ernststravoblofeld 1 year ago

      The Tyler Durden close.

    • @marcossidoruk8033
      @marcossidoruk8033 11 months ago +4

      @@GeorgeTsiros Most conferences work that way, it's the norm; you know it if you have participated in one.

  • @GeorgeTsiros
    @GeorgeTsiros 2 years ago +3

    Always happy to see Fedor!

  • @ugurcan969
    @ugurcan969 2 years ago +5

    Great talk and I have learned a lot of stuff in only 1 hour :)

  • @jbar7742
    @jbar7742 2 years ago +90

    great talk, unfortunate end though :D can anyone tell me which CPU it was that had the faulty branch predictor?

    • @vasiliynkudryavtsev
      @vasiliynkudryavtsev 2 years ago +1

      Does that really matter?
      Are you going to write code only for one specific CPU, e.g. ARM?

    • @jbar7742
      @jbar7742 2 years ago +37

      @@vasiliynkudryavtsev no, I just asked out of curiosity

    • @z140140
      @z140140 2 years ago +1

      Any modern CPU will do the same; his code was written specifically to fool modern CPUs.

    • @douggale5962
      @douggale5962 2 years ago +3

      If you want to make it as hard as possible for old processors, do taken, taken, not taken, not taken, repeatedly. Every branch will mispredict. They had a two bit counter that incremented if taken, decremented if not taken, and it predicted taken if counter for this branch >= 2. Once it has been saturated one way, it has to mispredict twice in a row to start predicting the other way. Modern processors are not so dumb, they have a bunch of histories, and it selects what history based on whether the last few branches were taken or not (to get correlation).

    • @kaychubies
      @kaychubies 8 months ago +1

      Best I can tell, it may have been the Intel Pentium 4 Willamette, based on the NetBurst microarchitecture. I can't find anything online that says what the speaker said -- namely that its predictor reported the opposite of what it meant to -- but apparently it had pretty bad branch prediction.
      (I apologize in advance if I've mixed up some terms. CPU architecture isn't my strong suit!)

  • @AhmedSam
    @AhmedSam 2 years ago +8

    Loved it. Thanks Fedor!

    • @CppCon
      @CppCon  2 years ago +1

      Glad you enjoyed it!

  • @bsuperbrain
    @bsuperbrain 1 year ago +1

    The Art of Writing Efficient Programs is a very good introductory book for this performance-focused universe; highly recommended.

  • @KeyserTheRedBeard
    @KeyserTheRedBeard 2 years ago +3

    Cool upload, CppCon. I smashed that thumbs up on your video. Keep up the excellent work.

    • @CppCon
      @CppCon  2 years ago

      Much appreciated

  • @jimr5855
    @jimr5855 1 year ago

    Really good presentation, thank you!

  • @boguscoder
    @boguscoder 2 years ago +3

    The hand-drawn illustrations are a great touch.

  • @huzifa30
    @huzifa30 2 years ago +3

    Thank you very much, exciting talk.

  • @innovationscode9909
    @innovationscode9909 2 years ago +2

    Marvellous. Thanks. Deep thinking...

    • @CppCon
      @CppCon  2 years ago

      You're most welcome

  • @DankAirsoft
    @DankAirsoft 2 years ago +10

    Great summary of the subject, I learnt a lot

    • @CppCon
      @CppCon  2 years ago +3

      Glad to hear it!

  • @younghsiang2509
    @younghsiang2509 2 years ago +5

    The session is over, thank you......

  • @krellin
    @krellin 3 months ago

    these concepts apply not only to c++ but many other languages, very good presentation

  • @douggale5962
    @douggale5962 2 years ago +1

    Nice to see a talk like this when there are so many people scoffing at branch-free code, because they saw one example where a perfectly predictable branch skipped a few ops and ended up slightly faster.

    • @lepidoptera9337
      @lepidoptera9337 1 year ago

      How many times in your life did you need those 0.3ns back? :-)

  • @BenjaminBuch
    @BenjaminBuch 1 year ago

    Very great talk, thank you!

  • @leshommesdupilly
    @leshommesdupilly 1 year ago +2

    Cpu and compiler engineers are mad geniuses lol

  • @PedroOliveira-sl6nw
    @PedroOliveira-sl6nw 2 years ago +39

    Excellent talk. Very sad to see it cut short but it is what it is. I have the book but haven't started yet. I would actually like to know/test if this can be done in GPUs too.

    • @jorcyd
      @jorcyd 2 years ago +9

      Indeed, a lot of GPU-based implementations of math functions (such as abs) are intrinsically branch-free, as divergence between threads in a GPU tends to degrade performance.

    • @DagarCoH
      @DagarCoH 2 years ago

      It can, and it is done. Many aspects to talk about when optimizing for GPUs

    • @SianaGearz
      @SianaGearz 1 year ago +1

      No need pretty much? All shader instances within an execution unit will cause all branches for which the prerequisites are fulfilled in any of the instances to be taken (effectively by all these instances), so while at source level you branch, the underlying implementation is effectively branch-free by hardware and system design. Furthermore the cost of branching, when it rarely does happen (aggregated), is pretty much zero, as during the resulting latency window, other work is performed, another context is ready to go and switch in by then. GPUs are INSANELY good at latency hiding and maintaining high nominal utilisation. Branching at source level is actually recommended, for the case that none of the instances in an execution group take a given branch, then the whole execution group can skip it, else the whole group effectively takes it.
      In the end you don't really have much control over it and often times you have to even pay the associated extra cost of branch-free execution (those other branches being executed as well) in spite of nominally branching.
      It all even goes back to the very first shader model which had no branching provision at all, at machine language level there were only selective mov instructions, no jump instructions; while shader compiler allowed and recognised branches at source level and converted them correspondingly. Loops had to have a trivially provable max iteration count and would be fully unrolled by the compiler, naturally.

  • @toufikoran8416
    @toufikoran8416 2 years ago +5

    very informative
    thanks

    • @CppCon
      @CppCon  2 years ago +1

      Glad it was helpful!

  • @MMIIRRKKOO
    @MMIIRRKKOO 2 years ago +2

    Great talk!

  • @bobby9568
    @bobby9568 2 years ago +3

    Really well presented!

    • @CppCon
      @CppCon  2 years ago

      Thank you kindly!

  • @SerFluffikins
    @SerFluffikins 2 years ago +1

    In examples 1 and 2: isn't there an additional data dependency because both the 'if' and the 'else' branch write to a1? It seems a2 remains at 0 in those examples.

  • @serhii-ratz
    @serhii-ratz 1 year ago +1

    “Predictor is good in prediction. We are not“ - nice :)

  • @christophclear1438
    @christophclear1438 2 years ago +23

    Great talk!
    I would have liked to ask whether the array-of-two-bool-index micro-optimization for removing a branch is generally preferable to what I would have done more naively, namely adding or OR'ing the two values after multiplying them by 1 and 0, obtained by treating the bool as an int.

    • @EgD996
      @EgD996 2 years ago +21

      I guess the answer could be: measure it

  • @u263a3
    @u263a3 2 years ago +10

    The session had ended 🤣

  • @robertjmccabe
    @robertjmccabe 2 years ago +1

    Very creative talk

  • @MarcoBergamin
    @MarcoBergamin 1 year ago

    Very interesting, thanks

    • @CppCon
      @CppCon  1 year ago

      Glad you think so!

  • @uy-ge3dm
    @uy-ge3dm 2 years ago +14

    Surprised that the humble s += b*x could be so much more efficient. I've been using that one for years

    • @jbar7742
      @jbar7742 2 years ago +4

      Though I would have really expected the compiler to pick up on this.

    • @47Mortuus
      @47Mortuus 2 years ago +12

      What's even better is a negation of either 1 or 0 with a bitwise AND (-1 is all one bits). 2 clock cycles of latency (NEG; AND) vs 3-4 (multiply):
      s += -b & x;

    • @DasAntiNaziBroetchen
      @DasAntiNaziBroetchen 1 year ago

      @@47Mortuus I assume that one only makes sense if you are adding integers.

  • @BowMcGee
    @BowMcGee 2 years ago +21

    Super interesting topic. Great presentation. I wonder where the most to gain is with current software (current hardware is criminally underused due to bad software). Is it branch prediction, is it parallelism, is it cache oriented code? Or is it something else?

    • @dmitrykargin4060
      @dmitrykargin4060 2 years ago +10

      Software is vastly different. We develop solvers for different forms of sparse linear equations. This particular case of algorithms is often memory-bound. CPU spends 100 cycles for next portion of data to be loaded from RAM and about 10 cycles to process.
      So branch misprediction is not a problem in our case. Cache misses and memory speed are.

    • @xeridea
      @xeridea 2 years ago +3

      Depends on the code. General software probably branch prediction. Video, audio, graphics, or other things easily parallelizable probably parallelism and cache. Using SIMD instructions can easily provide an absolutely massive performance gain (perhaps 5-8x) where it makes sense. Multithreading can help with a lot of apps, not just those capable of benefiting from SIMD, though still not always a perfect or easy solution. Cache optimization can benefit, there are some general practices to prevent obvious slowdowns (such as how you iterate multi dimensional arrays), but can be a bit tricky otherwise.

    • @iuppiterzeus9663
      @iuppiterzeus9663 2 years ago +3

      I'd say memory layout and parallelization

    • @bva0
      @bva0 2 years ago

      @@dmitrykargin4060 I find this quite interesting. Do you work with direct solvers or iterative solvers?

    • @SianaGearz
      @SianaGearz 2 years ago

      I would say based on relative performance of different processors with different implementations of branch prediction, different number of cores, different SIMD configurations, different cache capacities: all of the above.
      But branch prediction has a knock on effect on everything, like your high memory latency through the cache hierarchy doesn't really matter so much if you don't do useless requests.
      On the other hand, bad memory locality just has such a devastating impact on performance that it's really unfortunate to ignore.

  • @Minty_Meeo
    @Minty_Meeo 2 years ago +5

    In PowerPC, the branch conditional instruction has the prediction baked into it. I've seen, for example, MetroWerks compiler output where its static predictions were very primitive (99% of branches forward were unlikely, 99% of branches backward were likely). I've yet to use the C++20 attributes, but [[likely]] and [[unlikely]] probably give you manual control of the prediction bit for those branches, which is neat.
    Debug assertions and nullptr checks were totally the first thing I thought of when I learned about these attributes. Even if the compiler is smart enough to recognize a nullptr check and mark it as unlikely (I'm sure it is), it is nice to be able to self-document it with an attribute.

    • @marcossidoruk8033
      @marcossidoruk8033 11 months ago

      Yes the compiler will always assign a lower probability to a null pointer check indeed.
      And no, never use likely or unlikely without profiling and looking at the assembly. The C++ standard only says that likely/unlikely means "arbitrarily likely", and in clang and gcc that works out to roughly 80%/20%. It turns out that, because of things such as null pointer checks and early returns, the compiler can assign a branch a probability lower than 20%; but because unlikely always means 20%, marking it as unlikely can actually make it more likely.
      Again, the C++ standard is complete dogshit here; it should have a way to tell the compiler exactly how likely a branch is ([[likely 5%]] or something like that), but it doesn't, and thus unlikely may not do what you expect it to do.

    • @mytech6779
      @mytech6779 8 months ago

      @@marcossidoruk8033 That is much more of an implementation failure than a standard failure.

  • @OmarChida
    @OmarChida 2 years ago +14

    Great talk, as always quality content from Fedor! I will definitely buy his book as I'm really interested in this type of optimisation.
    Also, can someone explain why the optimisation with function pointers doesn't work when functions are inlined?

    • @rauldragu9447
      @rauldragu9447 2 years ago +9

      Functions wouldn't work *ever* . A function call is a jump, so basically a branch. You could get the pointer branch-lessly, but then you would have to stall until the actual instructions that make up the function are loaded. It's basically like loading an array. No matter how fast you can get the pointer to that array, when you dereference it to use it, you still have to wait for the array to get cached. Same with functions. And if you just had if (cond) foo(); else bar(); you would always have a fast branch and a very slow branch, but if you introduce indirection with the function pointer you always have a pretty slow branch by having to cache the function instructions every single time, only after you know where they are stored. If the condition was sufficiently unpredictable, then it may be worth it, but it seems unlikely.
      And for the inlined function it's even worse because an inlined function doesn't need to build a new stack frame since all the instructions are right there, inlined. Maybe even the caching of instructions is faster this way. If you use pointers you are forcing the compiler not to inline since it needs to be able to point to anything.

    • @qwertyuiop-cu2ve
      @qwertyuiop-cu2ve 10 months ago +1

      @@rauldragu9447 I think what may work is call both functions, store the results in the array, then use the condition to index into the result of the correct function. But if the function is more expensive than a branch misprediction on deciding which function to call, then you should just use the branch. Like Fedor Pikus said, measure! If performance doesn't matter enough for you to make a benchmark and measure, then you shouldn't do any special tricks either.

  • @VitisCZ
    @VitisCZ 6 months ago

    31:21 anyone knows what CPU he is talking about? I'm quite curious

  • @nitsanbh
    @nitsanbh 3 months ago

    18:30 register aliasing blew my mind
    I had no idea eax is a logical register!

  • @fzz5964
    @fzz5964 2 years ago +1

    cool!

  • @leonid998
    @leonid998 3 months ago

    18:57 should it be register renaming?

  • @omarkhan6217
    @omarkhan6217 2 years ago +3

    give him another hour

  • @skybuck2000
    @skybuck2000 2 years ago

    Why does the function pointer trick not work? Perhaps an expensive memory lookup for the function?

    • @GordonAitchJay
      @GordonAitchJay 2 years ago +2

      Fedor explains why in the last 3 paragraphs of chapter 3: CPU Architecture, Resources, and Performance - Branchless computing. In short, function pointer calls prevent inlining, which is a more beneficial optimisation than avoiding mispredicted branches.

  • @shilangyu
    @shilangyu 1 year ago

    Why does code like `rand() & 0x1` generate branches that can be mispredicted? Doesn't this piece of code produce a branchless stream of instructions (with some bitwise-AND)?

    • @DasAntiNaziBroetchen
      @DasAntiNaziBroetchen 1 year ago

      I'm not an expert but it depends on the implementation of "rand()" and it may not be inlined by the compiler

  • @bishop3000home
    @bishop3000home 2 years ago +10

    Serious question: if we repeat all the time "measure before changing", and compilers and processors may do a better job, then why do we think that, after we changed the code once, it stays the fastest code? We made the code less readable, removed a branch, and added more work.
    Then what if a new processor comes along, or a new optimization in a compiler, etc., and the original code would be better?

    • @SianaGearz
      @SianaGearz 2 years ago +4

      You cannot strictly rely on the code being perpetually faster indeed, but what choice do you have? You can't optimise for the unknown, and pessimising software in the hope that an architecture comes along someday that magically makes your slow code fast obviously doesn't help either.
      One somewhat warranted assumption you can make is that processors will be optimised towards current, existing code; that they will be similar to today's processors just better and faster, that they will implement similar techniques as today but make them more robust. That processors when they get released and benchmarked by the press, will have to compete on old code, which will be a mix of code profile-optimised for last-gen processors and canonical style code.
      Another related assumption is that even if the processors open up a different way to do things, old way will not be slower in absolute terms, just in relative terms. So say if your software has minimum specification being an AMD Bulldozer, and you optimise towards that, you have eliminated the performance weakness on the weakest hardware that you target, and that's enough, you don't actually care that the software fails to perform much quicker on a nominally faster processor, you have created a software that is palatable for the entire userbase. I have optimised towards Pentium 4 in the past (well over 10 years ago) knowing that it's the chip family most likely to struggle among all processors people are likely to use, and it didn't make nearly as much of a difference either way on anything like an Athlon XP or a Pentium M or a Core.
      Obviously neither of these assumptions is absolutely true but they are useful as guiding principles in absence of better information.
      Of course if you have the freedom - and there exists code which actually does that, though it's exceptionally rare - you can have several different implementations in your software, and profile them when you detect system configuration change or on startup. Of course it has its own hazards as you introduce an indirection at entry into that code, so better be sure it's worth it. Less risky is having the benchmark of optimised vs. canonical code automated in your test suite outside the shipped software.
      I do think performance benchmark should be part of your CI procedure, because if there are performance regressions - usually an undesired side effect of development such as feature work or bugfixing rather than any sort of platform changes - you'll know about them. These performance regressions are a day to day reality rather than a hypothetical differently behaving platform to come along someday, and your software will generally gradually turn into a disaster if you ignore them.

    • @Adowrath
      @Adowrath 2 years ago +2

      That's why you don't just measure once, change, and then leave it. You continually measure. If you upgrade your compiler and there's a 5% performance drop, you measure, find the (potentially new) bottleneck, and change.
      Don't forget that also 99% of the time you shouldn't need to think about this in the first place.

    • @mytech6779
      @mytech6779 8 months ago

      There are many cases where the hardware target is well known and exact: for example an embedded device, a controller for some factory, or a transportation system (e.g. flight management systems cannot make hardware version changes without major testing). Or, at the big end, when you are optimizing for a supercomputer, a mainframe, or a large server-farm deployment (matching hardware), the electrical costs alone can more than offset the cost of targeted optimization and a custom compile.
      (These cases spend tens of thousands of USD per month on electricity. A 1% efficiency gain over a year is worth a week of developer time.)
      In a few cases the hardware may even be purchased after software profiling and optimization results obtained on a selection of sample machines.

  • @ciCCapROSTi
    @ciCCapROSTi 1 year ago +1

    Really great topic and good info from Fedor, but his style takes some getting used to.

  • @jieliu756
    @jieliu756 2 years ago

    Thank you CppCon. Are the presentation slides available anywhere?

    • @CppCon
      @CppCon  2 years ago

      Thank you for your comment. We are investigating this and hope to resolve the issue shortly.

    • @AlFredo-sx2yy
      @AlFredo-sx2yy 11 months ago

      @@CppCon idk what issue you're even trying to resolve lmao. They asked for the slides, this ain't a bug report :s gotta love automated bot comments...

  • @47Mortuus
    @47Mortuus 2 years ago +10

    The part about the compiler's built-in likely/unlikely hints is completely off.
    It is just meta information for the resulting ORDERING OF INSTRUCTIONS in the machine code. This is due to the programmer wanting to avoid (instruction) cache misses, since the branch marked as more likely will immediately follow the conditional check.
    Furthermore, the unlikely branches may be put at the very very end of a particular procedure (function) or loop with an unconditional jump back to where the branch ends. It may even prevent function inlining in order to improve performance that way (yes, sometimes forcing not to inline improves performance).

  • @Revoker1221
    @Revoker1221 2 years ago +7

    Another talk related to this speculative type branching topic was one given by Chandler Carruth here:
    ua-cam.com/video/2EWejmkKlxs/v-deo.html
    I'd recommend giving it a watch first if some of the foundational material at the beginning of this talk seemed hard to understand. It's a similar talk to this one except with a focus on loops rather than if-else branches.

  • @HFsrj
    @HFsrj 2 years ago +18

    In the first example, how do you get only 10% misprediction if the predicate is random?

    • @theeinhaender5132
      @theeinhaender5132 2 years ago +2

      There are multiple branches. For example, there could be four branches that can be predicted with ~100% accuracy, and the fifth and final branch is predicted with 50% accuracy. Or just one branch that’s 100% accurate and it just gets evaluated 4x more often than the 50% one.

    • @giacomo.delazzari
      @giacomo.delazzari 2 years ago +4

      Perf says that 10% of the branches are mispredicted. However the "interesting" branch was just one of the many branches in the program. There's quite a lot of other code involved, think about the benchmarking framework being used.. For every benchmark instance it might be running quite a bit of logic. Also the for loop termination branch is present. Also, the arrays are filled up before doing the experiment itself, and this also involves some branches (for loop and stuff). All of these branches are well predicted, so the "interesting" ones that are mispredicted are just a part of all the branches of the program.

    • @meekrab9027
      @meekrab9027 2 years ago

      It is all branches for the entire program, which includes a lot more than just the code being benchmarked.

    • @tomasdzetkulic9871
      @tomasdzetkulic9871 2 years ago

      there are other branches in the code. There are branches in the loop around the `if` statement. There are branches in the benchmark framework etc.

    • @froody7
      @froody7 2 years ago

      10% of all branches executed were mispredicted, but the branch in question is only one of several executed in the loop, so it might be 50% mispredicted but when aggregated with a bunch of predicted branches at other sites the final rate could be 10%.

  • @Sidelobes
    @Sidelobes 2 years ago +3

    Interesting and informative talk! I like the hands-on, example-driven approach.
    What I don’t like is the constant interruptions (esp. ~ mins 30-40) from the audience questions. These are very hard to follow as a remote viewer and disrupt the flow.

    • @vasiapatov4544
      @vasiapatov4544 2 years ago +4

      I agree that it reduces the quality of our experience as remote viewers, but you have to remember that these conferences are optimized for the experience of the in-person attendees (as they should be, it's an organic, and expensive event)

  • @tomaszstanislawski457
    @tomaszstanislawski457 2 years ago +3

    It can't optimize `c[i] = rand() >= 0` because the function `rand()` is a deterministic random generator: the internal state of the RNG must be advanced even though the returned value is discarded. The best one can get is `rand(); c[i] = 1;`

    • @somehrjdbddhsjsbnsndndk751
      @somehrjdbddhsjsbnsndndk751 2 years ago

      Deterministic? I don't get it, and I don't understand why.

    • @gianni50725
      @gianni50725 2 years ago

      @@somehrjdbddhsjsbnsndndk751 Computers, at least without special hardware, are incapable of making truly random numbers. So rand() with the same given seed will give you the same sequence of "random" numbers every time you call it.
      If the compiler optimized out the rand() call, then it would be violating one of the rules of the compiler since it would be affecting the output of the program.

    • @niles_5003
      @niles_5003 1 year ago

      @@somehrjdbddhsjsbnsndndk751 Tomasz was saying that `rand()` has side effects, so it can't be eliminated entirely. This is because `rand()` has internal state that holds the last generated number, which it uses to generate the next one. If you didn't call rand() at all, that internal state would not be advanced properly, and that could result in observable differences between optimized and unoptimized code, namely that the sequence of random numbers generated for a given seed would be applied to different vectors in his program.
      Now, of course, the programmer might not care about that and would like this kind of optimization to occur, but the compiler has no way of asking or knowing that the programmer does not care.
      Regardless, I would argue that a compiler warning would be better in this context than an optimization. If I had some condition buried in my code that was always true, then that condition is probably written incorrectly, or I could delete it myself. There are some tools that will give you a warning when a condition is always true, but often the tool does not have a type that can describe what the real outputs look like. I.e., if something returns an "int" or a "number" rather than a "positive integer", the tool has no way to determine that the returned value will always be greater than or equal to 0.

  • @AbstractObserver
    @AbstractObserver 2 years ago +4

    How did they manage to have an in-person convention in 2021?

  • @jean-luclachance7242
    @jean-luclachance7242 1 year ago

    Why can't compilers do this type of optimization?

  • @vladimir0rus
    @vladimir0rus 2 years ago +6

    "Your session is over" :(

  • @multiHappyHacker
    @multiHappyHacker 2 years ago

    I so wanted the bool conditional to be the solution at 44:12.

  • @PrzemyslawSliwinski
    @PrzemyslawSliwinski 2 years ago +2

    53:10 - can this technique be used to make an effective Meltdown/Spectre-proof code?

  • @chriswysocki8816
    @chriswysocki8816 2 years ago +1

    1. This video was made in 2021, supposedly with a not-ancient CPU in the test system. I've heard so much about modern CPUs having redundant pipelines that keep evaluating both sides of a branch (while still keeping the branch prediction circuitry, in case the pipelines are "shorter" than the branch code paths). If that's true, why doesn't that make the handling of the ASM code generated by the compiler much more efficient? 10% (a minority of) wrong predictions should still lead to high efficiency.
    2. Why isn't the C compiler taking advantage of SIMD in the fastest branchless C version? Isn't there some compiler option you could have turned on to get the compiler to generate code that performs closer to the optimal ASM implementation?
    ... Are compiler implementors and CPU designers lying to us, or are they optimizing for unrepresentative/narrow test cases?

    • @simivb
      @simivb 2 years ago +1

      1. Evaluating both sides of the branch is going to perform worse than branch prediction for the vast majority of branches. As Fedor said in the talk, a 1% mispredict rate is already a really bad case, so if you always take both paths, you do unnecessary work for 99% of all branches. So 1% of your code becomes faster and 99% of it becomes slower. This is not good enough to use as a general solution. I also don't know which CPUs you are talking about.
      2. You can tell your compiler to autovectorize using flags, but I believe it's not done by default. One reason is that available SIMD instruction sets differ between processors, so by default the compiler will avoid generating code that your CPU may not even be able to run; you have to tell it which SIMD ISA you are targeting. It also absolutely can't do as good a job as optimal assembly, because 1) it doesn't have as much information as you, so it can't make the same optimizations as you: the guarantees the code makes are different from the guarantees the algorithm makes, and the strongest optimizations often change how the program runs internally, which the compiler cannot do, because it only performs transformations that keep the observable program behaviour identical. And 2) SIMD code needs certain preconditions to excel, which are often not present in normal code. So the "optimal assembly" might involve preparing the data in a different way and transforming it differently, and that code might be scattered across multiple functions. A programmer can make these changes because they can reason that the end result is identical, but it might be way too many instructions for the optimizer to reason about.
      There really is no reason to attribute malice to compiler or CPU vendors here.

    • @lepidoptera9337
      @lepidoptera9337 1 year ago

      If you depend on a machine that needs to be 99% efficient to keep up with your problem, then you have a poor hardware design to begin with that will fail the next time your requirements change. In most applications the CPU is waiting most of the time, so why in the world do you care? Gaming didn't get faster because they optimized the hell out of branches. It got faster because they are selling GPUs.

  • @KX36
    @KX36 2 years ago

    My low-mid-level (second from the bottom) interview started with 70% leadership questions, followed by 30% extremely basic technical ones. I didn't expect leadership questions at such a low level, and I messed up that interview so hard, but fortunately they hired all 4 people who made it to interview, so it's all good!

    • @lepidoptera9337
      @lepidoptera9337 1 year ago

      You know why, right? Because an extremely technical person has a tendency to get lost in minutiae and might spend hours, days, and sometimes months thinking about how they can save a few nanoseconds in a program that needs to have 100% uptime instead. Like this guy.

    • @DasAntiNaziBroetchen
      @DasAntiNaziBroetchen 1 year ago

      @@lepidoptera9337 lol whatever makes you feel better.

  • @edgararakelyan9326
    @edgararakelyan9326 1 year ago +4

    Awkward ending to say the least

  • @PiotrKundu
    @PiotrKundu 3 months ago

    1:03:28 EOF

  • @none_of_your_business
    @none_of_your_business 2 years ago +1

    bool(x)+bool(y) is a bit scary; booleans shouldn't have arithmetic addition defined, only boolean operations.

  • @user-nm7sp5xj7q
    @user-nm7sp5xj7q 2 years ago +1

    2:58 beginning

  • @lepidoptera9337
    @lepidoptera9337 1 year ago

    You know that you have no real problems in your life if you are focused on the branch prediction pipeline of your CPU. ;-)

  • @XKS99
    @XKS99 1 year ago +1

    It’s crazy the quality of people Russia has lost

  • @skybuck2000
    @skybuck2000 2 years ago +2

    Do your cats own laptops ? LOL.

  • @mytech6779
    @mytech6779 8 months ago

    Just ran perf stat on a python3 hello world and got 6% branch misses. pypy3 had 3%, but way more total branches...
    With g++ or clang++ at -O0 or -O3 I still get 2.5% misses (though a lot fewer total branches, 800k vs 30M) using iostream with endl.
    Same branches and misses for printf().
    Also, 3M instructions for printing hello world?
    Hmmm... 950k instructions and 220k branches for `int main(){ return 0; }` must be overhead from perf or the OS; the assembly is basically zero eax then ret.

    • @mytech6779
      @mytech6779 8 months ago

      Also, just tested some assembly loops. I'd say around 10G instructions or 1G branches per run is a good level to wash out the bulk of the overhead. Maybe bump that an order of magnitude for fine tuning (but then you're also going to want some control over CPU core-thread scheduling to prevent context switching from molesting your cache).

  • @736939
    @736939 2 years ago

    How does a C++ developer find a job? I'm not joking: I'm a dotnet developer, and I'm asking a serious question. Companies (as usual) want to see professional C++ developers right after they finish university, which is impossible.

    • @BarafuAlbino
      @BarafuAlbino 1 year ago +2

      Ignore all the requirements written in the vacancy description. Most companies don't care what they have written there.

  • @kilasuelika6955
    @kilasuelika6955 2 years ago +8

    Let me make a summary: one can use a function table to replace branches.

    • @greob
      @greob 2 years ago +6

      58:15 Didn't he say that doesn't work, though?

    • @iUniversEi
      @iUniversEi 2 years ago +3

      That would be a really bad summary, because this is something that does NOT work, as presented in the talk.

    • @kebien6020
      @kebien6020 2 years ago +4

      He specifically mentioned this as something that almost never works (58:19).

    • @Derheight
      @Derheight 2 years ago +2

      I think a function table is a kind of branching mechanism because the jump address depends on a runtime value which must be predicted to avoid flushing the pipeline.

    • @MMIIRRKKOO
      @MMIIRRKKOO 2 years ago +2

      A function *results* table, forcing you to evaluate everything. He specifically said that a table of function pointers is usually bad, and if those calls could have been inlined, it's usually terrible.

  • @paulfunigga
    @paulfunigga 1 year ago +1

    You write this kind of code, and then some other person will look at it and be like, "Was this guy on drugs when he wrote this?" I think all of this makes C++ a horrid language.