CppCon 2017: Chandler Carruth “Going Nowhere Faster”

  • Published 27 Jul 2024
  • CppCon.org
    -
    Presentation Slides, PDFs, Source Code and other presenter materials are available at: github.com/CppCon/CppCon2017
    -
    You care about the performance of your C++ code. You have followed basic patterns to make your C++ code efficient. You profiled your application or server and used the appropriate algorithms to minimize how much work is done and the appropriate data structures to make it fast. You even have reliable benchmarks to cover the most critical and important parts of the system for performance. But you're profiling the benchmark and need to squeeze even more performance out of it... What next?
    This talk dives into the performance and optimization concerns of the important, performance critical loops in your program. How do modern CPUs execute these loops, and what influences their performance? What can you do to make them faster? How can you leverage the C++ compiler to do this while keeping the code maintainable and clean? What optimization techniques do modern compilers make available to you? We'll cover all of this and more, with piles of code, examples, and even live demo.
    While the talk will focus somewhat on x86 processors and the LLVM compiler, everything will be broadly applicable, and basic mappings for other processors and toolchains will be discussed throughout. However, be prepared for a lot of C++ code and assembly.
    -
    Chandler Carruth: Google, Software Engineer
    Chandler Carruth leads the Clang team at Google, building better diagnostics, tools, and more. Previously, he worked on several pieces of Google’s distributed build system. He makes guest appearances helping to maintain a few core C++ libraries across Google’s codebase, and is active in the LLVM and Clang open source communities. He received his M.S. and B.S. in Computer Science from Wake Forest University, but disavows all knowledge of the contents of his Master’s thesis. He is regularly found drinking Cherry Coke Zero in the daytime and pontificating over a single malt scotch in the evening.
    -
    Videos Filmed & Edited by Bash Films: www.BashFilms.com
    *-----*
    Register Now For CppCon 2022: cppcon.org/registration/
    *-----*

COMMENTS • 81

  • @eauxpie
    @eauxpie 6 years ago +146

    If only he dug into that last question, we might have known about Spectre that much earlier.

    • @henke37
      @henke37 4 years ago +12

      Maybe not spectre in general, but certainly the variant with stale cache entry loads.

    • @VivekYadav-ds8oz
      @VivekYadav-ds8oz 2 years ago +4

      What a butterfly effect would that have been lol

  • @echosystemd
    @echosystemd 6 years ago +106

    Chandler helps me realize that I know nothing about benchmarking.

  • @osere6432
    @osere6432 3 years ago +27

    In his 2018 talk he discusses the last question in great detail

    • @andik70
      @andik70 3 years ago +1

      Do you have a link?

    • @shadowtuber1589
      @shadowtuber1589 1 year ago

      ua-cam.com/video/_f7O3IfIR2k/v-deo.html

    • @kamilziemian995
      @kamilziemian995 1 year ago

      This talk from 2018 is: CppCon 2018: Chandler Carruth “Spectre: Secrets, Side-Channels, Sandboxes, and Security”.
      ua-cam.com/video/_f7O3IfIR2k/v-deo.html

  • @jackwang3397
    @jackwang3397 2 years ago +10

    I have no idea about assembly language or the further details but still feel the passion of the lecturer and the audience. I am satisfied and cheer like everyone else did at 35:10 during my lunch break. Good job!

  • @hanneshauptmann
    @hanneshauptmann 6 years ago +78

    I am a simple man, I see a good talk, I press like.

    • @dipi71
      @dipi71 6 years ago

      ... and write an encouraging comment to do the same.
      So do I. Cheers!

  • @Zeturic
    @Zeturic 4 years ago +49

    20:48
    Going from .01 to .04 is a 300% increase in cache misses (i.e. 4x the amount), not .03%. When you look at it that way, dramatic changes in performance aren't that surprising.

    • @movax20h
      @movax20h 4 years ago +11

      The problem is that he is counting L1 dcache misses relative to L1 dcache loads within one run. Because of the dcache misses his program runs slower, performing fewer iterations and fewer dcache loads overall. A clearer way would be to fix the number of iterations so that each benchmark does the same number of loads, then compare across runs. Still, the 4x observation from the relative measurements is a good one. A 0.01% cache-miss rate doesn't mean that 0.01% of your TIME was spent on misses; it is just a counter. But L1 dcache misses are roughly 20 times more expensive than hits, so if you weight the numbers properly you will see really sensible information. You just need to know how to interpret these numbers.
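
      The fixed-iteration idea in the reply above can be sketched as follows (a minimal illustration of my own, not the talk's benchmark harness): run an exact number of loads so that perf counters from two variants are directly comparable.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Do exactly `iters` loads regardless of how slow each one is, so that
// counters (misses, loads) from two runs cover the same amount of work.
uint64_t sum_fixed_iters(const std::vector<uint32_t>& v, std::size_t iters) {
    uint64_t sum = 0;
    for (std::size_t i = 0; i < iters; ++i)
        sum += v[i % v.size()];  // wrap around the data
    return sum;
}
```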

  • @kamilziemian995
    @kamilziemian995 1 year ago +10

    How I wish I had 1/1000 of Carruth's knowledge of compilers.

  • @ldmnyblzs
    @ldmnyblzs 6 years ago +18

    That command prompt is just pure madness!

  • @emanuellandeholm5657
    @emanuellandeholm5657 1 year ago +1

    I'm not a C++ developer (I have a background in C), and I don't really know x86, but these talks by Chandler Carruth are so interesting to me. This is like crack! :D

  • @victornoagbodji
    @victornoagbodji 6 years ago +1

    amazing as always!

  • @ZeroPlayerGame
    @ZeroPlayerGame 5 years ago +5

    The tight loop speeds up because the branch is never mispredicted after the first execution, so it is predicted very well and the memory store is simply avoided altogether.

    • @movax20h
      @movax20h 4 years ago +2

      The memory store is not avoided. It still happens. It is super fast because it is to a cache line that is already in L1 dcache (we just loaded it), and due to something called store-to-load forwarding (not important here, but it works behind the scenes to resolve load-after-store problems quickly, even when the store misses L1 dcache).
      The probable reason the code runs at equal speed is something called op fusion: the CPU recognizes the jump by a small relative offset followed by a move, and essentially replaces the pair with a conditional store, I think.

  • @bluecircle56
    @bluecircle56 5 years ago +7

    "I don't even know. Magic and unicorns?"
    Great talk!

  • @abc3631
    @abc3631 4 years ago +2

    That's what I love about Chandler's talks: he gets to the nub of the topic, and hard ones at that, rather than just glossing over them.

  • @AlexeySalmin
    @AlexeySalmin 6 years ago +1

    TSO doesn't allow reordering anything past a store. Therefore a regular load won't flush the store buffer because it's being reordered ahead of buffered writes, not past them.

  • @JackAdrianZappa
    @JackAdrianZappa 5 years ago +5

    The most important thing you can get out of this talk is that Magic and Unicorns keeps the processor from crashing! :D

  • @christophrcr
    @christophrcr 6 years ago +15

    In the clamp example, after the first iteration, all values will already be clamped. I guess that's why branch prediction will always be right after some point.
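
    A minimal reconstruction of the clamp loop being discussed (my own sketch, not the slide code): after one pass, no element exceeds 255, so on any later pass the branch condition is always false and predicts perfectly.

```cpp
#include <cstdint>
#include <vector>

// Clamp every element to 255. The compiler may emit a branch or a cmov
// for the condition; which one is faster is exactly the talk's question.
void clamp_to_255(std::vector<uint32_t>& v) {
    for (auto& x : v)
        if (x > 255) x = 255;
}
```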

  • @tobiasfuchs7016
    @tobiasfuchs7016 6 years ago +12

    10:00 Ohmygosh I'm using the same nvim setup, shell and even colorscheme as Chandler Carruth, I will never change my ~/.config again!

    • @orbital1337
      @orbital1337 6 years ago +1

      What's the colorscheme? Looks pretty nice.

    • @leonhrad
      @leonhrad 6 years ago

      pretty sure it's jellybeans

    • @victornoagbodji
      @victornoagbodji 6 years ago +1

      yes please, give away the colorscheme and nvim setup : )

    • @ericcurtin2812
      @ericcurtin2812 6 years ago +1

      Vanilla vim, dark background for life lol:
      .vimrc
      syntax off
      set background=dark
      set mouse=a
      set hlsearch

  • @llothar68
    @llothar68 6 years ago +5

    Dammit, he always gets the nicest hardware to play with. Hey boss, can you hear me? I want an AMD EPYC workstation too.

  • @davidjulitz7446
    @davidjulitz7446 4 years ago +1

    Hmmm, cmovxx should be the right choice.
    Could it be that cmovxx only performs worse because it can only use registers? It looks to me like moving an immediate made the performance difference here.
    However, modern processors are obviously too complex to know the best optimization for every particular case.

  • @aqg7vy
    @aqg7vy 6 years ago +3

    What is that vim setup??

  • @warrenhenning8064
    @warrenhenning8064 3 years ago +2

    6:08 did anything ever come of the Efficiency Sanitizer?

  • @FalcoGer
    @FalcoGer 1 year ago

    36:30 use std::inner_product. I'm sure it is well optimized and does what you want without you re-implementing it.
    Running std::inner_product on vectors with 65535 elements with random values was around 3 times faster than calling the dotproduct function. (27.2ms vs 75.3ms on my machine)
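
    For reference, a self-contained sketch of the std::inner_product call the commenter suggests (the timings above are theirs; this only shows the API, with my own tiny example data):

```cpp
#include <numeric>
#include <vector>

// Dot product via the standard library, equivalent to a handwritten
// sum-of-products loop: 1*4 + 2*5 + 3*6 = 32 for the data below.
int dot(const std::vector<int>& a, const std::vector<int>& b) {
    return std::inner_product(a.begin(), a.end(), b.begin(), 0);
}
```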

  • @SatyajitGhana7
    @SatyajitGhana7 4 years ago

    Code for the benchmark utility?

  • @N00byEdge
    @N00byEdge 6 years ago +17

    So what if he used $255 instead of %ebx on the cmove too??

    • @CornedBee
      @CornedBee 6 years ago +24

      You can't. cmov only works between registers.

  • @chris3kpl
    @chris3kpl 6 years ago +4

    I wonder if simple masking with 0xff or a cast to char would be faster in the clamp example, because of the special case of clamping to 255.

    • @MatthewChaplain
      @MatthewChaplain 6 years ago +14

      Yes, but you would get a modulo instead of a clamp. e.g. 257 & 0xFF is 1, not 255.

    • @chris3kpl
      @chris3kpl 6 years ago +4

      Yes, you're right! :)
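
    A tiny check of the point in this thread, written as my own sketch: masking wraps modulo 256, while clamping saturates at 255, so the two only agree for values that are already in range.

```cpp
// Masking keeps the low 8 bits: 257 -> 1 (wraps around).
int mask_255(int x)  { return x & 0xFF; }

// Clamping saturates: 257 -> 255.
int clamp_255(int x) { return x > 255 ? 255 : x; }
```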

  • @dangkhoi137
    @dangkhoi137 3 years ago +2

    What are his vimrc settings?

  • @PeterFoxtrott
    @PeterFoxtrott 4 years ago +2

    Could someone explain to me what he means when he says that modern processors don't have forward branch prediction anymore? What else would a branch predictor do other than predict forward? And does that mean it no longer exists on modern processors? 32:06

    • @movax20h
      @movax20h 4 years ago +2

      There are two branch predictors at a global level in the CPU: a 'static' predictor and a dynamic predictor. The static predictor is used when the code is completely fresh and has never run before (cold code). It usually only takes into account whether a jump is forward or backward, whether it is a small or big jump, and whether it is a relative or absolute jump. It will typically predict short relative backward jumps as taken (because they are most likely loop jumps), but predict large forward jumps as not taken (they would go to even colder code, possibly error handling or end-of-loop handling). Big relative backward jumps will usually also be predicted not taken, since they are unlikely to be loops, or are likely to be very big loops anyway, so the CPU probably already has plenty to do. It might still ask the cache to prefetch the cache line for a big jump either way, but that can be wasteful, and on a multi-core CPU it can add a lot of wasted memory traffic that other cores could have used. Sometimes these heuristics can be tweaked in BIOS or microcode for a specific CPU, but they also change between CPU generations and between manufacturers, and either way it is hard to know which is better. Most CPU ISAs don't have assembly hints to indicate which jumps are more likely; it is done only by code locality (closer code is more likely to be part of a loop and should be preferred, but I don't know where the cutoff is - probably tuned on a lot of different code bases during design).
      The issue in his code is actually mostly due to the random distribution of his data and how speculative execution works: he flushes the entire pipeline on 50% of the loop iterations.

  • @zarakivishalkenchan
    @zarakivishalkenchan 5 years ago +1

    This talk's title should have been "cmov spoiling instruction pipelining".

  • @ginkner
    @ginkner 6 years ago +2

    Stupid question, but what command line is that? I've never seen git look like that.

    • @chris3kpl
      @chris3kpl 6 years ago +6

      Ben, it's the fish shell in tmux

    • @chris3kpl
      @chris3kpl 6 years ago

      Ben, and the editor is probably nvim

    • @MitchPleune
      @MitchPleune 6 years ago +2

      Looks like it says fish, and that looks like Powerline or something similar.

    • @gehngis
      @gehngis 6 years ago +3

      Yes and the theme looks like bobthefish: github.com/oh-my-fish/oh-my-fish/blob/master/docs/Themes.md#bobthefish

  • @arnabthakuria2243
    @arnabthakuria2243 4 months ago

    What font is this? Looks very nice.

  • @user-bf6in4cj3k
    @user-bf6in4cj3k 5 months ago

    Where can I find the code that was used in this presentation?

  • @user-bf6in4cj3k
    @user-bf6in4cj3k 5 months ago

    Where can I get the C++ code in his presentation?

  • @Cwize1
    @Cwize1 6 years ago +4

    My theory on why using 255 is faster: 255 is the max value of an unsigned byte. So it will have a tendency to crop up a lot within programs. So internally the CPU has a register dedicated solely to storing that value. (Though I know nothing about CPU design. So I am probably completely wrong.)

    • @peterfireflylund
      @peterfireflylund 6 years ago +2

      You are. Some CPUs hardwire a "register" to a specific value, but that value is 0, not 255 (or any other "all bits set" value). The x86 CPUs don't do this, but the ARM CPUs do in 64-bit mode. The x86 CPUs are in your laptop; the ARM CPUs are in your phone.
      Byte masks are common in performance-critical code, so modern compilers recognize them.

    • @ChandlerCarruth
      @ChandlerCarruth 6 years ago +23

      Sadly, the entire demo with the clamp loop was broken; no need to speculate about this one goof. =[ My apologies about this, I feel really bad. I tried to craft a tiny reproducer of an issue I see commonly in the real world and thought I had succeeded, but my data was poor.
      I now have a much better example. Suffice to say, a key ingredient is that the branch *has* to be well predicted, or the branch can't possibly help (which you see in the subsequent discussion around how the processor works). A second key ingredient is that you can craft much more dramatic examples using cmov vs. a branch, which I should have done to make this clearer and less confusing. Again, sorry about that.

    • @Peter_Cordes
      @Peter_Cordes 5 years ago +1

      @@ChandlerCarruth - I wrote up an analysis a while ago of a case where `gcc -O3` is slower (on sorted data) because it chooses CMOV. stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-than-o2 (And some versions really shoot themselves in the foot by insisting on putting CMOV into the loop-carried dep chain, even if you write it in C as conditional zeroing of an input to a loop-carried ADD). Fun fact: gcc -O3 -fprofile-use figures this out and uses a branch when you do profile-guided optimization on predictable data.
      I assume that's the kind of case you say would make a better example, because it would tie in with your 2nd bit about the dot product loop (which does have a loop-carried dep chain, so it would hurt the "unrolling" across iterations that out-of-order execution gives us). Using a control dependency which the CPU will speculate past can very much be a win vs. a CMOV data dependency in a loop-carried dependency chain.
      ----
      If CMOV had been much slower, it makes me think of stackoverflow.com/questions/54190958/throughput-analysis-on-memory-copy-benchmark where a byte-at-a-time asm copy loop is only running at 1 iter per 4 clocks on IvyBridge, even though it's mostly getting cache hits. But I think that's some kind of HW prefetch failure, not totally due to stores depending on loads.
      Breaking the data dependency between load and store with a branch instead of CMOV was plausible in your test, but unlikely on a modern x86 with deep enough out-of-order execution to keep lots of those independent dependency chains in flight at once. Maybe on an in-order Atom. :P

  • @BenjaminDirgo
    @BenjaminDirgo 6 years ago +5

    I wonder if Spectre makes this talk obsolete.

    • @ericcurtin2812
      @ericcurtin2812 6 years ago +1

      I opened a PR on his GitHub wondering the exact same thing: github.com/chandlerc/chandlerc.github.io/pull/1
      If I were going to get a response, I probably would have gotten one by now.

    • @meneldal
      @meneldal 5 years ago +4

      Well, he made a talk about that this year: ua-cam.com/video/_f7O3IfIR2k/v-deo.html

    • @mcneeleypeter
      @mcneeleypeter 5 years ago +1

      Nothing has changed from this talk due to Spectre. The hardware still operates in the exact same way.

  • @charlesreynolds8696
    @charlesreynolds8696 1 year ago

    What are the implications of this post-spectre?

    • @alpers.2123
      @alpers.2123 1 year ago +2

      It turns out, it wasn't magic and unicorns

  • @cmilkau
    @cmilkau 5 years ago

    "You can't speculate on cmov" - There seems to be no reason why you couldn't. Maybe today's processors cannot, but I can't see any difference between branching and conditional operations that would prevent speculation. After all, a branch is just a conditional write to the instruction pointer, which is kind of the hardest thing you could speculate about. What am I missing?

    • @Peter_Cordes
      @Peter_Cordes 5 years ago +2

      If you want speculation that can mispredict and might need to roll back, use a branch! If you want a data dependency, use CMOV. Speculating CMOV would defeat the entire purpose. (Unless you dynamically evaluate and predict whether it would be a good idea to do that for any given CMOV).
      But that would be hard to implement. An instruction that could decode to *either* an ALU operation or a branch would need another whole predictor, and a whole new mechanism for pretending we branched in the internal uops when the x86 code didn't contain a branch.
      Plus you'd need that predictor as an input to every decoder. Or if it's only to the complex decoder, then CMOV could cause delays in decoding by having to always decode in the complex decoder, even if it chooses to only decode to a single ALU uop (with a normal data dependency) instead of a branch, in cases where it wouldn't predict well as a branch, e.g. on a condition that was only true some of the time, not almost all the time.
      (Possibly related: one reason that `rep movs` startup is slow is that micro-code branches on Intel CPUs are special and can't use branch prediction. See stackoverflow.com/questions/33902068/what-setup-does-rep-do for a quote from Andy Glew, former Intel architect who implemented fast strings on P6. This may mean it would be impossible for a CMOV to decode to a branch-like uop over a mov uop in a way that could actually speculate.)
      See also stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-than-o2 for another case where CMOV is not a win, but it is a win on unsorted data. With more links and details about when CMOV is/isn't a good idea.

    • @bloopbleep7082
      @bloopbleep7082 4 years ago

      Conditional moves don't change the instruction stream. Conditional branches potentially do. Thus you usually don't need to flush the instruction pipeline with conditional moves.

  • @marcpanther7924
    @marcpanther7924 6 years ago

    What is he using? My bash doesn't look as cool as that

    • @dipi71
      @dipi71 6 years ago +2

      It’s bobthefish; but note the numerous redrawing bugs during the talk. It doesn’t look completely trustworthy to me. Oh well, maybe it’s nvim.

  • @pfeilspitze
    @pfeilspitze 5 years ago +7

    "April 2019: Intel® Architecture Code Analyzer has reached its End Of Life" :(

    • @LeDabe
      @LeDabe 1 year ago

      Intel IACA is dead, long live llvm-mca

  • @SanjeevkumarMSnj
    @SanjeevkumarMSnj 3 years ago

    Which terminal is he using?

  • @marianaldenhoevel7240
    @marianaldenhoevel7240 8 months ago

    My takeaway:
    If you really care about performance you have to measure it (fine) and try non-obvious things to see whether they change something (not that great).
    Even if Chandler disagrees, we then just hope that the system you will be running on in anger agrees with the one you measured. The next CPU may think differently. And that CPU can be on the machine you use in production, or simply change under you next month.
    That feels like the software equivalent of "shut up and calculate" in physics. Not a very satisfying place to be.
    Luckily my code has all the time in the world and can just blunder through whatever my thick brain writes down.

  • @kosbarable
    @kosbarable 5 years ago

    Was it in The Simpsons?
    Probably not.
    Thumbs up!

  • @slavanap
    @slavanap 6 years ago +1

    35:03. Man, you just jump over a memory write operation (for the majority of your data elements). There is a huge impact that comes from that. I guess it's more noticeable than not using `cmov`.
    ADDED: 49:01 YES! Keep your hands off `cmov`! That's not the issue with this instruction!

    • @SolomonUcko
      @SolomonUcko 4 years ago +1

      Didn't he say that the majority of the elements are over 255, and need to be modified?

  • @Thecommet
    @Thecommet 5 years ago

    9:08 Your indices array ranges from 0 to count, when it should be 0 to count-1. v[count] is out of bounds

    • @movax20h
      @movax20h 4 years ago +5

      The RNG will not generate count. RNG(0, count, count) generates 'count' elements (the last parameter), each of them between 0 and count-1 inclusive. It is very common in RNG design not to include the max element, similar to how iterators and loops work.
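
      The half-open convention the reply describes can be sketched with the standard library (my own version; the talk's RNG(0, count, count) helper follows the same convention but is not this code):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Produce `count` random indices, each in [0, count-1], so v[idx] is
// always in bounds for a vector of size `count`.
std::vector<std::size_t> make_indices(std::size_t count) {
    std::mt19937 rng(42);  // fixed seed, for reproducibility
    // std::uniform_int_distribution is inclusive on both ends,
    // hence the explicit count - 1 upper bound.
    std::uniform_int_distribution<std::size_t> dist(0, count - 1);
    std::vector<std::size_t> out(count);
    for (auto& i : out) i = dist(rng);
    return out;
}
```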

  • @joe-ti8rz
    @joe-ti8rz 6 years ago

    Go Chandler. Survive. Be better. Buy the Tibetan Book of the Dead and meditate.

  • @noxabellus
    @noxabellus 6 years ago

    Looks, talks, acts like a Chandler. Thinks like a damn Einstein O_O