Parallel C++: False Sharing

  • Published 11 Sep 2024
  • In this video we look at the basics of false sharing!
    For code samples: github.com/coff...
    For live content: / coffeebeforearch

COMMENTS • 19

  • @simonepernice8059
    @simonepernice8059 1 year ago +7

    That is the most enlightening lesson on cache-related performance effects that I have ever had! Thank you very much Nick! Cheers, Simone

  • @yihan4835
    @yihan4835 1 year ago +3

    I'm currently trying to break into the HFT space, and your video is super helpful!

  • @Gauthamphongalkar
    @Gauthamphongalkar 10 months ago +1

    In every tutorial people talk theory, but you show it practically, which makes you special! Thank you! BIG FAN :)

  • @giusepperana6354
    @giusepperana6354 1 year ago +1

    It's incredible how much of a difference it makes, and it's explained very well and clearly. Thanks for sharing - fantastic video.

  • @tenko3211
    @tenko3211 9 months ago

    In the false_sharing example, since each thread has its own counter, there is no need to use std::array for the counters. I found that not using std::array makes the program much faster.

  • @ychevall
    @ychevall 4 months ago

    Do we still need -lpthread in modern g++? I thought gcc was now smart enough to include the library by default.
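
    For reference, a typical GCC build line for the threaded examples (the file name here is illustrative):

        # -pthread remains the documented flag for compiling and linking threaded code
        g++ -O3 -pthread false_sharing.cpp -o false_sharing
        # Plain -lpthread still links too; since glibc 2.34 the pthread functions live
        # in libc itself, so on recent Linux systems the extra flag is mostly a no-op.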

  • @thisisolorin394
    @thisisolorin394 1 year ago +1

    What would happen if an array of int were used instead of atomic? Would the other elements of the array be invalidated as well if they are not atomic?

    • @CoffeeBeforeArch
      @CoffeeBeforeArch  1 year ago

      Yep - invalidations occur whether or not you use atomic. If a core wants to write to an integer on a cache line that is currently in another core's cache, it has to invalidate that cache line and get exclusive access to it.
      Atomic makes sure that the entire read+modify+write operation occurs as one atomic unit (i.e., it can't be broken up). Atomic is more of a specifier for how actions are performed on a piece of memory (not really the details of what that memory looks like). An atomic int and a regular int are both just 4B of memory. With an atomic int, though, things like increment result in atomic instructions (e.g., an increment with the lock prefix on x86), while an increment of a normal integer will just result in something like a plain increment instruction (without a lock prefix).
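
      A minimal sketch of that difference (illustrative, not the exact code from the video):

          #include <atomic>

          std::atomic<int> a{0};
          int b = 0;

          void bump() {
              a++;  // atomic read-modify-write, e.g. `lock addl $1, a(%rip)` on x86
              b++;  // plain increment, e.g. `addl $1, b(%rip)` - no lock prefix
          }

          // Both `a` and `b` occupy 4 bytes; only the generated instructions differ.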

    • @thisisolorin394
      @thisisolorin394 1 year ago

      @@CoffeeBeforeArch Hm, I tried it on my machine and got basically no cache misses when running perf c2c, and the program ran in 0.x sec or so, like in the case with alignas. Any thoughts?

    • @CoffeeBeforeArch
      @CoffeeBeforeArch  1 year ago +1

      @@thisisolorin394 If the program completed almost instantly, there's a good chance that the compiler just optimized the operation away. Compiler optimizers are pretty clever about getting rid of code that generates values that are not used later in the program (things like atomic prevent that from happening).
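
      A minimal illustration of that effect (hypothetical code, assuming optimizations like -O2 are enabled):

          #include <atomic>

          // `sum` is never used afterwards, so the optimizer is free to delete the
          // whole loop - and this "benchmark" finishes almost instantly.
          void dead_work() {
              int sum = 0;
              for (int i = 0; i < 100000000; i++) sum++;
          }

          // Increments of an atomic are visible to other threads, so the
          // optimizer keeps the work.
          void live_work(std::atomic<int>& sum) {
              for (int i = 0; i < 100000000; i++) sum++;
          }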

  • @nareknazari6443
    @nareknazari6443 1 year ago

    Hello, what is your compiler?

    • @mtusharx
      @mtusharx 1 year ago

      GNU C++ (g++) - I couldn't see the version.

  • @empireempire3545
    @empireempire3545 1 year ago

    Isn't this alignas a bit of a dirty hack? ;) I mean, you're using up way more memory than you need to - in this case it doesn't matter, but what to do in cases when it does? Another thing - what if your code is supposed to work on multiple different types of processors and there simply ISN'T a single size of cache line?
    I wonder if this problem can be solved in a different way somehow.

    • @CoffeeBeforeArch
      @CoffeeBeforeArch  1 year ago +3

      It's a pretty standard technique for solving these kinds of problems. A dummy array of padding bytes is also pretty standard, and is shown as a solution in a recent performance blog post by Netflix - netflixtechblog.com/seeing-through-hardware-counters-a-journey-to-threefold-performance-increase-2721924a2822
      There usually is a space-speed trade-off in performance (e.g., double buffering doubles the amount of memory allocated to buffers). Padding to avoid false sharing is generally small, and not a major overhead in terms of overall memory capacity. You only need the padding at the edges of where a problem is being partitioned, and it is at most a cache line of padding at each boundary, which is, in general, only 64B each. If you are in some super memory-constrained embedded system, your only real option would be to move around the data members that are being written to so that they are naturally spaced apart (or find a way to limit the number of writes you're doing in the first place).
      Software optimizations are not always portable - you're inherently tuning to the underlying architecture your code is running on. If you have a different cache line size, you might have to tune things differently. You could have the spacing controlled by something like a preprocessor macro that is set at compile time based on the cache line size of the target architecture.
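
      A sketch of both ideas in code (the constant name and the 64B fallback are assumptions, not something shown in the video):

          #include <atomic>
          #include <cstddef>
          #include <new>  // std::hardware_destructive_interference_size (C++17)

          #ifdef __cpp_lib_hardware_interference_size
          constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
          #else
          constexpr std::size_t kCacheLine = 64;  // common on x86; tune per target architecture
          #endif

          // alignas pads each counter out to its own cache line, so writes from
          // different threads no longer invalidate each other's lines.
          struct alignas(kCacheLine) PaddedCounter {
              std::atomic<int> value{0};
          };

          PaddedCounter counters[4];  // one counter per thread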

    • @matheusmarchetti628
      @matheusmarchetti628 1 year ago

      If the counter variable were local to the lambda, it should also avoid false sharing. Scott Meyers showed a similar example in his presentation about caches.

    • @killacrad
      @killacrad 7 months ago

      @@matheusmarchetti628 That's a totally different use case; with the counter variable being thread-local, the performance penalty from cache coherence does not apply.
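
      A sketch of that local-counter pattern (the names here are made up for illustration):

          #include <atomic>

          std::atomic<int> total{0};

          void worker() {
              int local = 0;               // private to this thread (stack/register)
              for (int i = 0; i < 100000; i++)
                  local++;                 // hot loop touches no shared cache line
              total += local;              // one shared (atomic) write at the end
          }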

  • @deverloperfantom1372
    @deverloperfantom1372 1 year ago

    hi wi the wery smole lesson dot`t trasted infotmasion bu the lesson i`m stat goot see serealizm