Introduction to Hardware Efficiency in Cpp - Ivica Bogosavljevic - CppCon 2022

  • Published 12 Sep 2024

COMMENTS • 9

  • @basheyev 1 year ago +4

    Great talk! Conclusion: be branch prediction, data prefetcher, vectorization & cache line friendly!

  • @topin8997 1 year ago +2

    I think that's a good introduction to get a general idea of fast code, which boils down to "keep your data compact, access it sequentially". As it is an introduction, there was only brief mention of profiler tools, without going into any details. Still, there _were_ performance tests that clearly show why it's better. Two more things are worth mentioning: "reduce memory allocation/deallocation and conditional jumps wherever possible".
    I can't find the video here, but one guy said he reduced the computation time of some train logistics simulation from days to hours by reusing some vectors. That's because for large vectors the OS actually allocates the pages on first access, not all at once immediately. Just measure how much time it takes to create a vector(100_000_000) and then to std::fill it.
    Next, which was only tangentially mentioned, is conditions and branch mispredictions. The CPU actively predicts which branch of a condition is most likely to be taken next and executes it in advance. That's why for loops are fast: they are likely to continue rather than exit. But if branching is random, the predictor fails constantly. Sometimes code like r = a*(c>0) + b*(c<=0); is faster than r = c > 0 ? a : b;. Nowadays most compilers can vectorize this simple line, but they may fail in some more complex cases, so keeping branching to a minimum is a good thing anyway.
    EDIT: Check out Ivica's blog johnysswlab.com/author/ibogi/ for a lot more details on optimization. Looks great

    • @47Mortuus 1 year ago +1

      Apart from the fact that "r = c > 0 ? a : b;" is often translated to machine code using branch-free, 1-clock-cycle conditional moves: for actual cases where a * (c > 0) is faster, PLEASE dear god PLEAAAAASE use a & -(c > 0) instead, as -0 is all 0 bits and -1 is all 1 bits in two's complement. I just hate to see that 4-cycle-latency, 1-issue-per-cycle-throughput integer multiplication when telling people about such micro-optimizations, which you can even encapsulate in a meaningfully named forceinline function.
      But again: measure first, and second, look at and understand the compiled code in the assembly language of the platform you're targeting, as CMP + CMOVcc is faster than even 2x ILP { CMP, SETcc, NEG, AND }, ADD, and most definitely faster than 2x ILP + 1 cycle overhead { CMP, SETcc, IMUL }, ADD...
      Micro-optimization requires much, MUCH more knowledge than one might assume at first - sometimes ADDING A BRANCH TO A SINGLE HARDWARE INSTRUCTION CAN BE FASTER, as with "uint32_t c = a / b" being slower than "uint32_t c = b > a ? 0 : a / b", depending on your data of course, and even if it is poorly predicted by the hardware. This kind of micro-optimization is way further down the road than optimal memory layout, which is the most impactful optimization by a mile and pretty much the only topic in this talk, and it only requires knowledge of higher-level languages such as C++, which is pretty much the only language used at this event. And since micro-optimization is a way more advanced topic, it can often de-optimize code when done poorly and/or in a naive/misinformed manner, as illustrated in your comment (compilers cannot always optimize your "r = a*(c>0) + b*(c<=0)").

  • @DotcomL 1 year ago +2

    A fantastic collection of tips here, thank you for the talk

  • @Azer_GG 1 year ago +1

    Thanks for the great talk!

  • @Luca-yy4zh 1 year ago

    Thanks for these useful tips

  • @player-eric 1 year ago

    Hi! Could you please provide accurate subtitles for this video?

  • @stavb9400 1 year ago

    Does your matrix multiplication yield the same result when you switch the k and j loops?

      • @mellowdv 1 year ago

      I tried it, and on my CPU both ran for around 70 seconds - no real difference after swapping the loops, but I'm no performance expert.