C++ and Beyond 2012: Herb Sutter - atomic Weapons 2 of 2

toalexsmail

Додати в
- Мій плейлист
- Переглянути пізніше
Поділитися

Поділитися

Вставка

Розмір відео:

Показувати елементи керування програвачем

Автоматичне відтворення

Автоповтор

Опубліковано 26 гру 2024
channel9.msdn....
Download the slides. 1drv.ms/b/s!Aq...
Abstract:
This session in one word: Deep.
It's a session that includes topics I've publicly said for years is Stuff You Shouldn't Need To Know and I Just Won't Teach, but it's becoming achingly clear that people do need to know about it. Achingly, heartbreakingly clear, because some hardware incents you to pull out the big guns to achieve top performance, and C++ programmers just are so addicted to full performance that they'll reach for the big red levers with the flashing warning lights. Since we can't keep people from pulling the big red levers, we'd better document the A to Z of what the levers actually do, so that people don't SCRAM unless they really, really, really meant to.
Topics Covered:
The facts: The C++11 memory model and what it requires you to do to make sure your code is correct and stays correct. We'll include clear answers to several FAQs: "how do the compiler and hardware cooperate to remember how to respect these rules?", "what is a race condition?", and the ageless one-hand-clapping question "how is a race condition like a debugger?"
The tools: The deep interrelationships and fundamental tradeoffs among mutexes, atomics, and fences/barriers. I'll try to convince you why standalone memory barriers are bad, and why barriers should always be associated with a specific load or store.
The unspeakables: I'll grudgingly and reluctantly talk about the Thing I Said I'd Never Teach That Programmers Should Never Need To Now: relaxed atomics. Don't use them! If you can avoid it. But here's what you need to know, even though it would be nice if you didn't need to know it.
The rapidly-changing hardware reality: How locks and atomics map to hardware instructions on ARM and x86/x64, and throw in POWER and Itanium for good measure - and I'll cover how and why the answers are actually different last year and this year, and how they will likely be different again a few years from now. We'll cover how the latest CPU and GPU hardware memory models are rapidly evolving, and how this directly affects C++ programmers.
For more information see
www.toalexsmail...

КОМЕНТАРІ • 21

@Roms8313 2 роки тому ⁺⁴
this talk is amazing :O thanks so much
@qwertyuiop-cu2ve Рік тому ⁺²
At 1:25:39, why can line B go above the decrement? More generally why can something in the body of an if statement be reordered before something in the condition check? This seems to defeat the point of an if statement to avoid doing work in certain cases depending on the condition.
@BertPdeboy 7 місяців тому
my guess is "delete" being an high-level operation that itself requires memory writes, hence requires memory ordering. The presentation has an optimization example where unrelated stores are moved up before the release.
The example below uses the same optimization example. The multithreaded observed effect of pointer delete is the pointer becomes "null".
if control_block_ptr->refs.fetch_sub(1,memory_order_release) == 0 {
....global::allocator::mem_block_free_list = { ... }
....control_block_ptr = null
}
~ ~ could become ~ ~
r1 = control_block_ptr
r2 = control_block_ptr.refs
control_block_ptr = null // legal compiler optimization
r3 = r2.fetch_sub(1,memory_order_release) // flushes write buffer
if r3 == 0 {
....global::allocator::mem_block_free_list = { ... }
}
if r3 != 0 {
....control_block_ptr = r1 // updates write buffer
}
// lack of additional release, lack of buffer flush, aka race condition since fetch_sub call.
which shows a wrong ordering. note;
- that the code above is still highlevel, the compiler produces more concrete assembly statements
- the ordering change could happen anywhere in the technology stack between ]c-program ; cpu-cache[
@aharrison24 5 місяців тому ⁺²
Because the compiler is allowed to assume that there are no data races on accesses to non-atomic variables. Without the acquire ordering on the decrement, the compiler is for instance allowed to speculatively reorder a non-atomic load from the body of the if-statement to a point *before* the if-statement. Because there can't be data races, it 'knows' the value it reads should be unaffected by whether it does the load before or after checking the condition. That's not a safe assumption for the ref-count code, so adding acquire ordering ensures that any modifications to the object in thread 1 are completed and visible to thread 2 before it is able to perform any operations from the body of the if-statement.
@qwertyuiop-cu2ve 4 місяці тому
@@aharrison24 It doesn't sit right with me that the body can be done before checking. This goes against the causality that you intend by writing the if statement. A load of the pointer itself won't cause any data races (as far as I know in this example it doesn't change after initialization), it's about the data that the pointer points to. If the other thread dereferences the pointer after delete, then we have a problem.
@aharrison24 4 місяці тому
@@qwertyuiop-cu2ve I'm totally with you on it not sitting right! But that's why we have to completely change our mindset when reasoning about this stuff. In a single thread of execution, the compiler is allowed to pull all sorts of cunning tricks to make code go faster. As long as those tricks don't change the *observable* behaviour of the code, then they are fair game (the "as if" rule). The CPU too will perform out-of-order execution of instructions in order to avoid costly pipeline stalls.
Given that accessing main memory can be so expensive, it's common to reorder loads to happen as early as possible, so that there is less waiting around when the value is actually needed. In the specific case you asked about, if the compiler can see that a variable might be read inside an if-block, it's perfectly reasonable for it to generate code that speculatively reads the value into a register *before* evaluating the if-condition. If control flow subsequently doesn't enter the if-block then the value is simply thrown away.
In single-threaded code this can be done safely if the compiler can see that no-one else could change the value of the variable in the meantime (and it can assume no data races). In multi-threaded code these optimisations can obviously cause very surprising effects. And that's why the control-block decrement needs to have acquire semantics in the slide you referenced.
The art of lock-free programming is in using the memory-order arguments to atomic operations to constrain the behaviour of both the compiler and the CPU. But seeing things through the lens of 'instruction reordering' is often not helpful. Really the only way to successfully (and portably) reason about multi-threaded code involving relaxed atomics is to think in terms of the *synchronizes-with* and *happens-before* relations of the C++ memory model. Those relations do not talk about control flow. They are concerned only with controlling how memory operations on one thread are made visible to other threads.
@dv8silencer 6 років тому ⁺¹
Thank you Herb Sutter
@dhairyar3556 8 місяців тому ⁺²
herb is a god
@jaredmulconry Рік тому
21:49 It's worth noting that speculative execution in modern processors will behave like the top example. Since speculatively-executed operations will not commit their changes until proven correct, it will be as-if the final write happened conditionally.
The good news is, in practice, this doesn't significantly impact how the CPU operates.
However, Herb is correct that the compiler needs to give the CPU the equivalent of that code for it to behave that way.
Key takeaway: avoid putting memory barriers or conditional logic in performance-critical code. Loops, especially. The optimiser and processor will thank you.
@szirsp 2 роки тому
28:10 What if the read accesses a HW register and the register read has a side effect? (maybe register read clears the interrupt status flags)
Are all reads always truly safe everywhere?
Would volatile or some other keyword prevent the compiler from generating unwanted reads?
@szirsp 2 роки тому
(or consumes data from HW FIFO)
@nobodynada_ 2 роки тому ⁺¹
At the compiler level: yes, 'volatile' will prevent the compiler from generating unwanted reads. However, as Herb discusses in the last section, volatile only applies to compiler optimizations; the hardware is still allowed to synthesize memory reads, and in addition to marking your variables as volatile you need to turn on some architecture-specific settings to tell the hardware to play ball.
On both x86 and ARM, you need to set some bits in the pagetables to indicate the memory type, which tells the hardware what optimizations it's allowed to apply when accessing the memory. For x86, you'll want to mark the memory as "uncacheable" or "strong uncacheable" to deny speculative reads (x86 developers' manual, volume 3A, sections 4.9 "Paging and Memory Typing" and 11.3 "Methods of Caching Available"). On ARM, you want to mark the pages as device memory (developer.arm.com/documentation/den0024/a/Memory-Ordering/Memory-types/Device-memory; ARM Architectural Reference Manual, sections B2.7 "Memory types available" and D8.5 "Memory region attributes").
@szirsp 2 роки тому
@@nobodynada_ Thank you
@mysanoop Рік тому
Is standalone fence like a barrier, and not tied to any particular variable?
And alternatively, in the case of a release+acquire -- will other unrelated releases and acquires be able to interleave into the release+acquire in question?
@hanyanglee9018 2 роки тому ⁺⁴
Do we have any update to this topic today? It's been 10 years.
@herbsutter4340 Рік тому ⁺¹²
It's all still current. I just referred a commercial compiler optimization question to this talk a few minutes ago :)
@brewtaldeafmedal1262 10 місяців тому ⁺²
And people say God isn't real?
@jamesonnash4948 3 роки тому ⁺¹
At about minute 12, you said that it may be impossible to update a single bit in a byte. However, isn't this merely possibly slower, and not impossible on any processor for regular memory? For example, a possible implementation on x86 might be to use cmpxchg in a loop until a write succeeds that only changes the specific bit? On LL/SC processors, it seems like this would be even easier, and at no increased performance penalty, since all atomic operations would be written with this loop pattern. What makes this impossible?
@quangtung2912 3 роки тому ⁺⁵
That's not problem he talked about. The problem is about updating on some bits that is not conceptually shared, but because of lacking update operation on only single bit, it might end up sharing.
For example, you have a byte with 1 bit is only used at a single thread, no synchronization is need, and another 7 bits is shared and protected by a lock.
If you update that one bit, you might end up updating the other 7 bits, too.
The compiler doesn't know it's shared, or it's atomic, it doesn't make any sense for compiler to always be conservative here by always generating LL/SC instructions.
@MMIIRRKKOO 5 років тому ⁺²
Intel 8051 does single bits read and writes if I'm not mistaken (on *some* memory locations)
@rostislavstelmach9168 4 роки тому ⁺²
8051 is a microcontroller, not a microprocessor, thus it does single bit writes. On the other hand, processors can do only word writes ( 1 byte or 4 bytes ).