i used to think all these conferences were bs, but man.. these days i just kept watching cppcon when i was supposed to watch memes and vines
relatable.....
I just read that while staying up for 10 hours... watching almost only cppcon 🙃
This one is if you write Rust code :-D
The best part is that you can learn from them regardless of if you use C++ or not. I write Rust but since there isn't much native content for Rust, I watch these CppCon videos and they're made really well by really smart and experienced people.
Memes and WHATS? This comment sure shows its age ...
Really love this talk; spending 1 hour watching this video is actually more efficient than spending a day reading cppreference :)
I totally agree. Very few talks make me feel this way.
Maybe it is just me who spent more than one hour watching this video 😅
Well on the one hand you're watching something by a person who wants to teach you something, on the other hand you're reading something written by someone who assumes you're already an expert in what you're reading about.
Finally a good explanation of memory orders without hand-waving and saying 'it's too complex to worry about'
Hello, can you please explain what he meant at the 56:00 timestamp when he was talking about loading an atomic value in a destructor? I don't understand it, and it has been bothering me for 4 days now. If you can at least point me to some resource, I would be really thankful. I even asked a question in the cpp forum, but it seems no one really knows, as I got zero answers. Two things I don't understand: 1) Why is reading that value in the destructor wrong, or why can it cause undefined behavior? 2) How does the fact that we could use a non-atomic read solve the problem? It is even harder for me to understand because I can't even detect or get a hint at what the issue is. Thank you very much.
@@somehrjdbddhsjsbnsndndk751 It's not wrong and it won't cause undefined behavior. However, it's also not necessary: atomic accesses are needed only when two threads may be accessing the same value at the same time. Usually, when the destructor runs, no other thread is accessing the same object. Therefore, atomic access is not necessary. To explicitly state that I do not expect any other thread to call methods on an object at the same time as it's destroyed, I would like to use explicit non-atomic accesses in the destructor. On the other hand, there are some rare cases when another thread may observe the object while it's being destroyed. To highlight that I'm dealing with this specific and tricky situation, I would like to use atomic accesses in the destructor only when they are truly needed. The standard gives me no way to express this difference. Every access to an atomic variable is atomic. Even if I know that particular access happens in a single-threaded context, I can't convey this knowledge to the reader of my program.
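For anyone else puzzled by this, here's a tiny sketch of the situation (the class and names are mine, not from the talk): the destructor runs when no other thread should be touching the object, yet every access to the atomic member is still formally atomic; memory_order_relaxed is the closest the standard lets you get to saying "no synchronization needed here".

#include <atomic>
#include <cstdio>

class Counter {
    std::atomic<long> count_{0};
public:
    void increment() { count_.fetch_add(1, std::memory_order_relaxed); } // concurrent use
    ~Counter() {
        // By the time the destructor runs, no other thread should hold a reference,
        // so this read is logically single-threaded - but it is still an atomic load.
        long final_value = count_.load(std::memory_order_relaxed);
        std::printf("final count: %ld\n", final_value);
    }
};

int main() { Counter c; c.increment(); } // the relaxed load in ~Counter runs here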
Finally, this guy made me understand memory ordering. Nice presentation...
Glad it helped!
I really enjoy listening to Fedor. Knowledgeable speaker doing a great job explaining complicated topics.
This talk was superb. Complex topic delivered eloquently. Thanks for that.
Glad you liked it!
This was an amazing talk. I’ve generally had a very loose understanding of memory ordering - enough to know I need to stay away from it. I can confidently say I finally understand it thanks to this.
Rarely will I comment on videos. This video was just too good not to. Hands down one of my favorite talks.
I was hoping for a clear explanation of the various std::memory_order types. It starts at 42:26
You have to be patient
How I understand it: A memory barrier forces the cpu to commit its variable changes to a place where they are visible to other threads on other cores, such as main memory or through special sockets between caches (as he talks about in the q&a). Otherwise changes are accumulated in L1 and L2 caches and written to main memory together (flush of the cache line). x.store(1, std::memory_order_release) commits all writes that happened on the executing CPU before the memory barrier to main memory and makes it instantly visible to other threads. x.load(std::memory_order_acquire) makes sure that all the writes that happened until the writing of x (with x.store()) in the other thread are now visible to this thread -> it "acquires" those writes, basically requesting clarity of memory at that point
@@Astfresser If I understand it, release/acquire must be paired, and then everything up to the point of the store operation is guaranteed to be visible. SEQ_CST is quite hard to understand. And then it's a bit hard to understand what happens when there are pairs like release/seq_cst.
And I guess once I use condition_wait and notify_one, I can just use relaxed on all atomic variables because these synchronization primitives and mutexes use memory barriers as well.
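If it helps, here's a minimal sketch of the release/acquire pairing described above (my own toy example, not code from the talk): the acquire load that observes the release store also observes every write that happened before that store.

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                      // plain, non-atomic data
std::atomic<bool> ready{false};    // the flag that carries the synchronization

void producer() {
    data = 42;                                    // happens before the release store
    ready.store(true, std::memory_order_release); // "publishes" everything written above
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {} // spin until the store is seen
    assert(data == 42); // guaranteed: the acquire load "acquired" the producer's writes
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}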
the most decent and clear explanation of atomics i've seen so far
Best video explaining memory ordering ever.
All these conferences on cppcon are awesome.
Thank you so much!
These conferences are pure gold
Am I the only one who is slightly confused, especially about the memory order part? Everyone says this talk is great and all, but I'd watched Herb Sutter's talk on atomics before and thought I somewhat understood it; now I've watched this and I'm confused.
Very insightful. Especially with the pictures things fell in the place!
As to the naming of compare_exchange_weak(), from Anthony Williams' "C++ Concurrency in Action: Practical Multithreading", pg. 113: "For compare_exchange_weak(), the store might not be successful even if the original value was equal to the expected value, in which case the value of the variable is unchanged and the return value of compare_exchange_weak() is false. This is most likely to happen on machines that lack a single compare-and-exchange instruction, if the processor can't guarantee that the operation has been done atomically - possibly because the thread performing the operation was switched out in the middle of the necessary sequence of instructions and another thread scheduled in its place by the operating system where there are more threads than processors. This is called a spurious failure, because the reason for the failure is a function of the timing rather than the values of the variables."
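For reference, this is the usual shape of the retry loop where a spurious failure is harmless (a sketch of mine, not code from the book or the talk): the loop re-reads the current value and tries again regardless of why the exchange failed.

#include <atomic>

// Atomically multiply x by m using a CAS loop; a spurious failure of
// compare_exchange_weak just costs one extra trip around the loop.
long atomic_multiply(std::atomic<long>& x, long m) {
    long expected = x.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads the current value into 'expected'.
    while (!x.compare_exchange_weak(expected, expected * m,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed)) {
        // nothing to do: 'expected' now holds the latest value, so just try again
    }
    return expected * m; // the value we successfully stored
}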
On LoadLink/StoreConditional architectures, the StoreConditional could fail for arbitrary reasons. For example, you might have a single hardware monitor, and you need to invalidate it (lose the link) on a context switch, or when cache coherence messages require you to move the line associated with the monitor.
"TimedLock" is not necessarily a bad way to model this. However, it's a bit misleading in the sense that time isn't expressed in terms of a simple time unit; rather, it's saying "enough time has passed for a frustrating operation to arrive before I make progress, forcing me to try again."
@@Key_C0de hey man...
i like this guy
great work ... : ) ,,
.. in case you like to review a topic again:
// 11:24 limitations for atomic
// 15:40 compare_exchange_strong
// 23:06 atomic vs mutex speed
// 26:30 secret of atomic !
// 33:58 compare_exchange_weak
// 42:12 memory
// 56:48 review
13:15 This is crazy! I would NEVER have guessed this.
To misquote Haldane's Law, things aren't just more complicated than we imagine, they're more complicated than we CAN imagine.
at 48:00 he is referencing some other talk, about locks?
what video is that?
Great talk, it's the clearest explanation I ever watched.
Thank you for your comment. Pleased to hear that this presentation was helpful.
c++ is beautiful, and this talk is great
What an amazing talk! Very well explained!
Program B (code at 2:20) performs the computation locally and only locks the mutex to store the result. No wonder it is so much faster than A, where an atomic read-modify-write has to lock the cache line for each update. That's comparing apples and oranges.
Pruibiebe Hastoet that’s his point. Algorithm trumps everything.
Then, measure performance to verify assumptions.
Amazing presentation. Learned a lot. Thank you!
Excellent. First time I really understand these ideas.
You let me know more details about atomics under the hood. When it comes to hardware, the optimization granularity can be really small.
That's a great talk! Worth watching!
Eye opening presentation, very well explained
To be really honest, with this I finally comprehended why threads are important and, along with that, got a little intro to parallel computing.
Really getting into the C++ magic stuff
the message I get is "use someone else's lock free queue"
Very good talk! Thank you very much for speeding up my learning about atomics and lock free algorithms.
Thank you Mr.Fedor !
Interesting. std::atomic actually offers more guarantees than I assumed it would (the miracle of low expectations :-).
Great. Thanks for the speech, examples and benchmarks.
That was really interesting and useful talk, thanks a lot!
Glad to hear that!
For the 2nd question, memory_order_relaxed just guarantees that the operation is performed atomically, but applies no ordering constraint.
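As a small illustration (my own toy example): a pure event counter, where only atomicity matters and no other data is being published through it, can use relaxed everywhere.

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long> hits{0};

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([] {
            for (int j = 0; j < 100000; ++j)
                hits.fetch_add(1, std::memory_order_relaxed); // atomic, but no ordering
        });
    for (auto& w : workers) w.join();
    std::printf("%ld\n", hits.load(std::memory_order_relaxed)); // always 400000
}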
"It's hard to write if you don't care about occaisonal bugs; It's *really* hard if you do"
Great presentation! That intro was great, LOL
Glad you liked it!
Great talk. I have my own benchmark, but it shows: atomic > mutex > spinlock ('>' means faster). My benchmark definitely doesn't agree with his. Any chance I can take a look at the source code of his benchmark?
Awesome talk; I've watched it several times! I have a question, though:
I understand that unsynchronized reading and writing to the same location in memory is undefined behavior. I also understand that the CPU/memory subsystem moves data in terms of cache lines. What I am still unclear on is what happens if I have two variables in my program that just so happen to be on the same cache line (e.g. two 32-bit ints), but I didn't know that: My program has two threads that each read and write from their own variable, but I, the programmer, thought it was OK because there's no (intentional) "sharing?" If these threads are non-atomically writing to their own variables, do they still have the same problem where they can "stomp" on each other's writes?
For modern processors that can do single-byte reads, there's no such worry.
If two variables are on the same cache line but not at the same address, correctness-wise nothing bad happens (caches are hardware optimizations that are supposed to be invisible when observable results are concerned). Performance-wise, if two threads are accessing the same cache line, it's very bad: it uses the same hardware lock as it does when using a true shared value like a mutex. Each thread locks the entire cache line and blocks the other thread until it's done. So you get the performance of a mutex without the benefits. Now, in order to get two values that close, you have to either declare them as local variables next to each other or put them in the same data structure. You are very unlikely to get this by calling new/malloc twice. But if you have two threads operating on parts of the same data structure (like elements in a vector) then you need to be careful. If it's a large vector (array, etc) and you divide it between threads (thread 0 gets elements 0 to N-1, thread 1 gets N-1 to 2N), don't worry about it, odds are the threads will be looping over their sections of the vector and won't hit the boundary around Nth element at the same time. If you have a data structure with one or two small elements per thread (like an array of ints, one per thread) then you will get dismal performance unless you scatter them across different cache lines. The easiest way is using alignas to align each per-thread value on a 64-byte boundary (in C++17, alignas(std::hardware_destructive_interference_size) is available instead of hardcoding 64).
Thank you,@@FedorPikus!
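A minimal sketch of the alignas approach Fedor describes (the struct and loop are mine; assumes a toolchain that provides std::hardware_destructive_interference_size, otherwise hardcode 64 as he says):

#include <atomic>
#include <new>      // std::hardware_destructive_interference_size (C++17)
#include <thread>
#include <vector>

// One counter per thread, each on its own cache line, so the threads
// don't keep stealing the same line from each other ("false sharing").
struct alignas(std::hardware_destructive_interference_size) PaddedCounter {
    std::atomic<long> value{0};
};

int main() {
    std::vector<PaddedCounter> counters(4);
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t)
        threads.emplace_back([&counters, t] {
            for (int i = 0; i < 1000000; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : threads) th.join();
}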
It's funny to me that the default is not a CAS-based multiply. On the other hand, it is tragic that they made the method name for CAS so goddamn long. It's somehow longer than the standard name for the operation (compare and swap).
Amazing talk... I just loved it
How come no programming language handles cache lines explicitly?
That is the job of the OS and micro architecture.
It's not portable to do so; different architectures have different cache models.
Is there a reason for the omission of atomic single bit operations in the standard?
(Excellent presentation!)
Spurious failures are there for ll/sc architectures. sc will fail if the atomic value (cache line) has been written, even if the same value was written as was originally stored. This is useful, as it makes it easier to write ABA-safe algorithms. sc will also fail "spuriously" in other situations such as a context switch (or cache flush, etc.) between the ll and sc operations. cmpxchg has no equivalent failure mode, because the operation is atomic with respect to context switches. ll and sc, as separate instructions, don't have this same guarantee.
ll/sc platforms can spuriously fail the sc in many ways; a cache flush/context switch is the biggie, but they might well not allow multiple locations to be LL'ed simultaneously; they might not even allow a regular load to occur between an ll/sc pair. I think the compiler should make sure to respect these rules (there's no point doing nested ll/sc pairs if they'll always fail, after all) but the compiler can't prevent context switches.
If you're writing for an ll/sc architecture, trapping these spurious failures and handling them specially can allow certain algorithmic improvements. In this situation, you'd use the weak version. If you're using an algorithm written for a CAS architecture, and have no special handling of sc failures, then you'll want to use the strong version anyway.
Yep, pretty much.
In the one LL/SC architecture I helped architect, if anything came within a hair's breadth of the location being monitored (due to LL), we'd shoot down the monitor and force a loop.
One thing we did mess up on that architecture: We didn't properly invalidate the monitor when taking an interrupt, so we required a costly software workaround to manually do so. This was an artifact of our memsys folks not really talking with our CPU folks, and the fact that the architect above it all was being somewhat ignored by both, at least with respect to LL/SC, since multicore systems were still rare in our space at that time.
It was a rather unfortunate situation, as the feature could have been quite nice. IIRC, we ended up only shipping LL/SC support on one device, which is rather disappointing to me. It was a casualty of politics, I suppose.
Our particular variant of LL/SC actually split it into _three_ operations: LoadLink/StoreLink/CommitLink. The reason had to do with our pipeline behavior (exposed pipeline VLIW). We didn't have a natural way to store a value to memory and write a result to the register file afterward. So, SL would deposit data in the monitor alongside the monitored address, and CMTL would actually pull the trigger (write the value to memory) if the monitor hadn't been invalidated. From the CPU's perspective, CMTL's pipeline behavior was similar to a load.
To prevent ABA, I simply include a counter within some bits of the cmpxchg'd value. That way I know not only that the same value has been written, but also how many times... not that knowing that count is that helpful in real-life scenarios.
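A hedged sketch of that trick (field widths and names are made up for illustration): pack a generation counter next to the value, so even if the value goes A -> B -> A the counter has moved on and a stale CAS fails.

#include <atomic>
#include <cstdint>

// Low 32 bits hold the value, high 32 bits hold a generation counter that is
// bumped on every successful write.
std::atomic<uint64_t> slot{0};

bool try_update(uint32_t expected_value, uint32_t new_value) {
    uint64_t old = slot.load(std::memory_order_acquire);
    if (static_cast<uint32_t>(old) != expected_value)
        return false;                               // value already changed
    uint64_t next_gen = (old >> 32) + 1;            // next generation
    uint64_t desired = (next_gen << 32) | new_value;
    return slot.compare_exchange_strong(old, desired,
                                        std::memory_order_acq_rel,
                                        std::memory_order_acquire);
}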
Good talk
Would someone happen to know which talk he is referring to when he says "I'm not going to go into as much details as Paul did"?
On slide 53 (38:00) what happens if one thread sets, say, q[7] while another sets q[8]? If they are both in the same cache line, won't the threads interfere with each other?
50:56
On x86:
- all loads are acquire-loads, all stores are release-stores
● but adding the other barrier is expensive
Does it mean that on x86, even if I specify "memory_order_relaxed" for a store (just to give an example), the real memory order continues to be "memory_order_release"?
No possibility of "downgrade" in this architecture?
I saw this 2 years late, but not necessarily. Nothing prevents the compiler from rearranging things around even if x86 architecture doesn’t. Or worse, the compiler doesn’t rearrange now, but does in a newer version in the future and the bug becomes visible
I figure that atomics work a lot like git: you prepare memory in your local cache ("local repo") and release it ("pushing to a remote origin").
Fantastic!
28:51 Isn't that supposed to be an xmm register rather than a mmx register?
I'd assume so. Using MMX has been counterproductive in the last 15-20 years. Of course, if sizeof(long) == 4 (bytes), you could fit two of them in an MMX register. No idea why anyone would ever want to do that. To clarify, his sizeof(long) seems to be 8, which is *not* true on all x86-64 platforms.
25:42 I am confused. How does "spinlock" catch up at a large number of threads? In your benchmark it is faster all the way - it beats atomic easily, doesn't it?
Also, at 51:13, "release on read and acquire on write are really expensive". Isn't it prohibited at compile time to do those operations?
Depends on the hardware, but yes, spinlock can be faster (a lot of effort went into optimizing and eliding locks in hardware). Note that even when it gives faster throughput, spinlock burns a lot more CPU time, so it's possible that the overall application is better off doing atomics or CAS.
In C++, it is UB (but not diagnosed by the compiler) to use acquire order with store() or release order with load(). You have to use the exchange hack. That being said, this is a C++ limitation, hardware may be able to do it, and X86 can but it's very expensive relative to the "natural" memory order (of course, you have to write inline assembly to get this behavior).
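To make the "exchange hack" concrete (a sketch based on my reading of the comment above, not code from the talk): store() cannot take acquire ordering, but an exchange whose result you discard is a read-modify-write and can carry both orders.

#include <atomic>

std::atomic<int> x{0};

void store_with_acquire_semantics(int v) {
    // x.store(v, std::memory_order_acquire);    // not allowed: UB per the standard
    x.exchange(v, std::memory_order_acq_rel);    // an RMW can carry both orders; result ignored
}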
@9:16 The threads execute in parallel to increment the value of x from 0.
Both threads could read x as 0, set it to 1, and write it back to memory.
Even though the value is already there, it would just be rewritten by the other thread as 1 again.
So how come x gets to be 2?
Both threads try to do the increment. One thread succeeds, but the other gets blocked (or told to try again) and waits until the first has completed before trying the increment again.
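Concretely (a minimal sketch of the 9:16 scenario, my own code): with a plain int both threads can read 0 and both write 1, but with an atomic read-modify-write each increment is one indivisible step, so the result is always 2.

#include <atomic>
#include <cassert>
#include <thread>

int main() {
    std::atomic<int> x{0};
    // Each fetch_add is a single atomic read-modify-write: read, add 1, write back,
    // with the cache line held so the other thread cannot sneak in between.
    std::thread t1([&] { x.fetch_add(1); });
    std::thread t2([&] { x.fetch_add(1); });
    t1.join();
    t2.join();
    assert(x.load() == 2); // never 1, unlike a plain non-atomic int
}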
Logarithmic conference cppcon.
Thanks.
Hello, can you please explain what he meant at the 56:00 timestamp when he was talking about loading an atomic value in a destructor? I don't understand it, and it has been bothering me for 4 days now. If you can at least point me to some resource, I would be really thankful. I even asked a question in the cpp forum, but it seems no one really knows, as I got zero answers. Two things I don't understand: 1) Why is reading that value in the destructor wrong, or why can it cause undefined behavior? 2) How does the fact that we could use a non-atomic read solve the problem? It is even harder for me to understand because I can't even detect or get a hint at what the issue is. Thank you very much.
I believe the implication was that it is nonsensical/bad for your program to allow one thread to be using an object at the same time another thread is destroying that object - one should never write code that allows this case to happen.
So, it is not the "reading that value in the destructor" which is wrong, it is the idea that another thread can be using the object while it is being concurrently destroyed which is wrong. And finally, the fact that there is a std::memory_order written there implies to the reader of the code that this nonsensical situation exists.
it really helps a lot !!!!
Great talk.
The example at the beginning of the talk is quite misleading. The mutex is only acquired once per thread, whereas the atomic int is incremented once per person in every thread! For a fair comparison, the atomic version should have a local accumulator like the mutex version does.
Fedor Pikus deliberately compared a slow algorithm that uses an atomic with a fast algorithm that uses a lock to prove the point he makes at 03:22 (slide 7): "Algorithms reign supreme". In other words, we should first make sure that we use good algorithms before we fiddle with details like locks versus atomics. I think his point was to remind his audience that everything they learn in this class is less important than good algorithms. Maybe he should have made that clearer. He addresses a few more details in his answer to the question at 01:06:55.
The instant I saw the slide, I blurted out, "That's stoopid."
Not a great idea to encourage people to raise their hands and then tell the audience that everyone who raised their hand got it wrong... Grrrrrr.
Why not? That's a great way of letting people know that their current mental model is incorrect.
JiveDadson the problem with these types of questions is the either/or presentation. He should have asked: did you expect this? 1) Yes; 2) No; 3) I don’t have enough information to have an informed opinion.
Amazing talk! Thanks!
How fantastic!
Note that for struct B { long x; long y; };, b.is_lock_free() == true before gcc7, but b.is_lock_free() == false since gcc7. This is due to an ABI change in libatomic. For more info, please look at the gcc mailing list: gcc.gnu.org/ml/gcc-patches/2017-01/msg02344.html.
On Windows the fastest way I've found in general is to use a benaphore.
“Demons fly out your nose”. Ha! Excellent talk.
Thank you sir. Really well explained.
Great video. Thanks
18:23, the joke is so good that it makes someone need an ambulance.
49:22: thx!! solve my doubt about CAS
Thanks
9:13, could someone explain the difference between:
#include <atomic>
std::atomic<int> x(0); // vs
std::atomic<int> x = 0;
Presumably the behavior would not be the same, otherwise he probably wouldn't have called attention to it. The top one is calling the default (I think) constructor and the other calling the copy (I think) constructor? How would that result in different behavior? Is it the implicit cast of 0 to atomic? It's such a minor difference, and C++ is filled with these sorts of quirks. It's almost impossible to write "good" code because there are so many of these land mines.
In any case, could someone clarify this situation a bit, please?
Yes, the second one is about copy constructor.
This is explained quite well here:
stackoverflow.com/questions/15249998/why-are-stdatomic-objects-not-copyable
Hey,
You can check the answer to your questions here.
stackoverflow.com/questions/43939343/in-class-initialization-of-atomic
Honestly, syntactically, I don't quite get behind that one. The two lines are both initialization (not assignment!) and should both call the very same copy ctor. Maybe this is something compiler specific (I'm on MSVC), but both versions compile just fine for me and there are no runtime issues whatsoever.
The way I interpreted it is that atomic simply doesn't have a copy constructor or assignment operator. And when I looked at the source that confirmed my suspicion. The reason for this I think is to prevent users from making the mistaken assumption that if you had two atomic variables x and y and you wrote x = y you'd think that was an atomic operation, which it wouldn't be, so they simply removed that option altogether.
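A small sketch of what the deleted copy operations force you to write instead (illustrative only):

#include <atomic>

std::atomic<int> x{1};
// std::atomic<int> y = x;     // does not compile: the copy constructor is deleted
std::atomic<int> y{x.load()};  // explicit: one atomic load, then a separate initialization

void copy_value() {
    y.store(x.load());         // two independent atomic operations, not one atomic copy
}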
Can anyone please help me with this?
Why is "struct a { long x; };" lock-free but not "struct b { long x; long y; long z; };"?
Links or any other source would help. Thanks in advance.
He explained it himself starting at 28:00.
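If you want to check it on your own platform, a quick sketch (results depend on the compiler/ABI, as the gcc7 comment elsewhere in these comments points out):

#include <atomic>
#include <cstdio>

struct A { long x; };                 // 8 bytes: fits a single atomic instruction
struct B { long x; long y; long z; }; // 24 bytes: no hardware instruction that wide

int main() {
    std::atomic<A> a;
    std::atomic<B> b;
    std::printf("atomic<A> lock-free: %d\n", (int)a.is_lock_free()); // typically 1
    std::printf("atomic<B> lock-free: %d\n", (int)b.is_lock_free()); // typically 0 (falls back to a lock)
}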
What a tough crowd
Can't SSE and AVX extensions be used to increase the atomic lock free size?
It sounds like Intel doesn't want to guarantee it: software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/536979
Double plus good.
future lecture should just invite some CPU architect to explain
The gripe about the C++ standard is wrong (56:16).
Reason: constructors and destructors execute in isolation.
Of course, unless you call the destructor directly.
43:11 was that Elemental by Tears For Fears? 😂
"Welcome to the defense of dark art class" 🤣
Why "aquire and release barrier" instead of "sync caches"?
Why did the designers of C++ not make it so that each new thread gets a unique copy of all variables in the program, so that such problems never arise?
Great talk but the plot @ 22:50 has the worst legend ever
The expression "become visible" is just awful.
What Pikus meant by "becoming visible" after an acquire barrier is simply that operations that appear after the barrier in the program are guaranteed to execute after the barrier.
Likewise, "becoming visible" before a release barrier means operations before the barrier in the program are guaranteed to have been executed by the time the barrier is reached.
A little caveat is that for the acquire barrier, operations before the barrier can be moved to execute after the barrier, but operations after the barrier in the program cannot be moved to execute before the barrier is reached. The opposite applies to release barrier.
Becoming visible means exactly that, the written data is now visible to other threads
Measurably failed attempt to create drama at the beginning - they're different algorithms - you can't compare apples to oranges
🤔 At 10:00: My understanding is that all threads (cores) share the same caches. If they don't have different caches, then what is meant by communication between caches?
As I understand it, all cores usually share the L3 cache, the L2 is usually shared by a smaller group, and the L1 is usually per core. In Zen, the L1 and L2 are shared by one core (2 threads) and the L3 is shared by all cores.
Wonderful!!!
Many thanks!!