Godbolt Compiler Explorer Adventures
- Published 30 Sep 2024
- In this video I want to share a really fun and extremely interesting website. It's called the Godbolt Compiler Explorer! Really, an amazing, fun website for C++ and C coders who are interested in modern optimizing compilers, and Assembly language.
Link to Compiler Explorer:
godbolt.org/
Sorry, but I accidentally specified the MS compiler's AVX2 flag wrong, so it didn't AVX2 anything... :( Anywho, here are the compiler settings I was using for each compiler (with the amended MS AVX2 flag):
Clang 9: -O3 -march=core-avx2 -m64 -cl-fast-relaxed-math
GCC 9.2: -O3 -mavx2 -funsafe-math-optimizations
Intel ICC 19.0.1: -Ofast -march=core-avx2 -m64
MSVC 19.22: /Ox /arch:AVX2 /fp:fast
Support What's a Creel? on Patreon:
/ whatsacreel
FaceBook: / whatsacreel
Music Channel: / @creelsmusic5814
Another channel with random things: / @ftlol6091
Software used to make this vid:
Visual Studio 2019 Community:
www.visualstud...
Blender:
www.blender.org/
OBS:
obsproject.com/
Davinci Resolve 16:
www.blackmagic...
Meshroom:
alicevision.gi...
"fun safe math optimization"
Lol
Good to see compiler explorer is getting so much popularity. Great video creel! There's always something interesting I learn by watching your videos. Keep 'em coming!
Cheers mate!! It's such an amazingly informative website! Matt Godbolt has done some really fun talks available on YouTube too! Imho an excellent presenter, and an amazing website :)
Never trust the compiler to do all the optimization for you :D Great video as always!
Fer sure brus!! They're really amazing at some things, and then not so much, hahaha! Cheers for watching :)
@@WhatsACreel From what I can tell, clang and gcc place a higher priority on "clever" optimizations than on low-hanging fruit, and a higher priority on ensuring that they apply optimizations whenever possible than on avoiding "optimizations" that would yield bogus code. For example, if T1 and T2 have matching representations (e.g. 64-bit "long" and "long long"), then given `if (mode) *(T1*)x = something; else *(T2*)x = something;`, gcc will optimize that to `*(T2*)x = something;` and then further optimize the surrounding code by recognizing that the assignment "can't" affect an object of type T1.
Further, when targeting the Cortex-M0, gcc is also prone to make some rather silly "optimizations" such as taking an object that is loaded with a constant before a loop, substituting a constant, and then reloading that constant on every iteration of the loop. Because of nonsense like that, code which exploits the register qualifier can actually end up faster at -O0 than -O2, and could be better yet if -O2 were augmented with a mode that applied safe peephole optimizations like omitting sign-extension instructions when no downstream code will care about the upper bits of the registers in question. If -O2 can't do better than that, what's the point?
I built the exact same thing for CUDA C++ in 2009, called cudapad. It had a window where you typed code in on the left, and the right side would instantly (in a second) pop up with the asm. It would draw lines visually to match up the two sides, and it also had a diff. It also had different compiler options that could be selected. I guess I wish I had pursued it more.
Ever since I found this channel, assembly has been more interesting than I ever thought it could be.
So good of you to say! Cheers for stopping by mate, have a good one :)
re: find minimum: I found the difference between the gcc and intel versions most interesting.
The Intel version would be faster: they've pre-treated the possibility of an unaligned array by scalar-processing the first few elements until alignment is reached (I assume clang also did this), then broadcast the found minimum as the lane starting points for the direct-from-memory (alignment-requiring) AVX instructions. The GCC version, on the other hand, reads the memory in 2 separate 128-bit unaligned reads, which can be slightly faster than unaligned 256-bit reads, but that ends up requiring 3 instructions instead of 1, including a slow lane-swap. Both of them missed the possibility of doing 2 vpminsd operations per cycle by using just one register, but clang is OTT with its block of 8 vpmin's using 4 registers. 2 is enough.
I would personally avoid 3-part addressing in a fast loop unless I'm addressing multiple arrays from the same index, since it forces extra work on the address generators - just base and offset is enough, incrementing the base pointer and decrementing a separate loop counter with break on zero, instead of having a cmp.
the loop ends up looking like:
.loop:
    vpminsd ymm0, ymm0, ymmword ptr [rdi]
    vpminsd ymm1, ymm1, ymmword ptr [rdi + 32]
    add     rdi, 64
    sub     rax, 1
    jne     .loop
and should execute at 1 loop/clock cycle on anything with AVX2 instructions.
The compiler explorer is particularly convenient for exploring situations where gcc or clang use their twisted interpretation of the Standard as an excuse to get "creative". Consider, for example, "unsigned mul(unsigned short x, unsigned short y) { return x*y;}". According to the published Rationale document, the authors of the Standard thought that compilers for modern hardware would process such code sensibly even for values INT_MAX+1u to UINT_MAX, and there was thus no need to worry about whether it mandated such behavior. The gcc compiler, however, will sometimes interpret the lack of a mandate as an excuse to jump the rails. While I haven't found any situations where clang would get creative with that particular construct, it often gets creative in other ways.
That's a really interesting point! I am not sure where the compilers bend the rules. I am sure they do, but I figured it was all related to relaxed flags in the command line? Maybe they outright dodge the standard all the time?? Cheers for sharing and watching mate :)
I think you should make a monthly series out of this! Looking at fun code in godbolt and judging how well different compilers do! Very entertaining :)
Did he just say: "...ignore my poor programming skills..." ? 18:54 ... Man, I would like to have 1/2 of your poor programming skills :D
Love this site! It's really good for learning how compilers translate code into individual instructions and how exactly those instructions are used for us beginner assembly learners, great tool indeed :)
For sure mate!! Compilers are really amazing ASM coders, really awesome tool for learning :)
I just thought of a funny project we could try when the Creel AI lib gets ready: THE ULTIMATE NEURAL C++ COMPILER. Use c++ code as an input and the neural net outputs an assembly. The compiled code could be given inputs and its performance could be evaluated based on whether it gets all the outputs correct. I think we might be able to turn some simple algorithms into assembly (or get a good laugh). :D
Ok, this is the greatest idea ever!! Not sure if it would be amazing or terrible, but it would certainly be fun :)
Dear sir, Can you make a series of videos about performance programming with the Intel compiler Parallel Studio XE, for example using its automatic vectorization constructs, Math Kernel Library, and others?
This is an excellent idea! I don't have the Intel compiler installed. I do have MKL tho. It's actually Amazing!!! I'm still confused as to how it works, but would love to share some of that info at some point. cheers for watching brus, and sharing ideas :)
Efficient algorithm to sum from 1 to N: print(N*(N+1)/2)
re: floating-point AVX unrolling: clang is slightly overkill but also misses 50% of the performance it could be getting. Newer CPUs (Skylake and after) have 2 vector add ports, and the instruction latency is 4 cycles whereas throughput is 0.5 (2 adds per cycle). In order to not have the adding process held up by the loop dependency, it needs to use 8 registers, so the unrolling by 16 is slightly overkill, but it's still only 50% effective because it's only using 4 registers for the summing. clang manages 50% of that. intel manages 25% of that. gcc manages 12% of that. Maybe clang wasn't updated to account for 2 adders yet?
LEA is sometimes used to perform math operations that fit the formula
C + R1 + R2*2^n (where n is 0, 1, 2, or 3), so lea reg1, [reg2+1] is like reg1 = reg2 + 1, and so on.
Sometimes you will see the compiler doing lea reg1, [reg1+reg1*8] to multiply reg1 by 9, for example. I suspect the reason is using different parts of the pipeline to do math right next to each other so there doesn't need to be a stall: if an adder/multiplier is being used, and you need another multiply by 9 right after it, use the addressing unit to do the multiply by 9.
This is very cool. I decided to write my own dumb pop_count() function and try it in GodBolt myself.
None of the compilers picked up what I was doing.
// Luke's dumb PopCount
int PopCount(int i)
{
    int pop = 0;
    for (; i > 0; i = i >> 1)
        pop += (i % 2);
    return pop;
}
I used the same compiler versions and options.
I reckon that's a really good PopCount! Surprising that the compilers didn't figure it out. Cheers for sharing mate :)
@@WhatsACreel I'll take that as a HUGE compliment. Thanks!
@@WhatsACreel I'm not a C programmer OR an ASM programmer, I work in higher-level languages for the most part. Is there a way of writing popcount in C so that EVERY compiler says "Yes, I see what you're doing - let's use popcount"? It seems to me that that would be the best way to write it. Or maybe there's a compiler macro for it or something?
@@TehMagilla I don't know of any way that all the compilers recognize. There's a macro, yes, _popcount, or something like that. If you search for 'popcnt intrinsic', it should turn up. On x86, that's the fastest way to compute the pop count.
Outside of that, the best way I know is called Kernighan's algorithm. Definitely worth a look if you've not seen it :)
Why did it not multi-thread the sum()?
On min() - that's a radix sort & get single array[0] value. (Neg value to smallest array index)
If TV was like this video, I'd be watching it all the time
Cheers mate!! Thanks for watching :)
Well, your videos sure can satisfy the nerdy urges ;)
Maybe the "xor eax, eax" (eax = 0) is because you put the "uint count = 0;" at the start, and eax = count,
so perhaps the compiler didn't realize that it didn't need that line (?)
Yeah, there's something strange going on there. I think it was deep trickery. But certainly possible it was just a mistake, hahaha. Cheers for watching mate :)
This is a trick that used to be used a LOT on the Z80 to zero a register without using a load immediate. On the Z80 xor reg,reg is a smaller instruction than load immediate. I think the same thing was used on 6502 (but I can't remember).
Oh, yes, XOR is commonly used in x86 for that too! "XOR reg, reg" is treated as a special instruction in a lot of CPUs, and executes abnormally quickly, something like 1/4 of a clock cycle from memory. It's got me baffled exactly why the compiler would zero it, tho, considering popcnt will overwrite it...? I think it's either a trick to break a dependency chain on EAX, or it's the thing mentioned by spj... If you "XOR EAX, EAX", then you're saying any pending computations on EAX are irrelevant, and the POPCNT can begin to execute immediately. I think that's what it is.
I mean... it says right there... COMPILER EXPLORER not godbolt compiler....
Oh my thumbnail... damn!! I spent soo much time on the lightning bolts... I will fix it, hold up brus! Cheers for watching :)
There we go, it's fixed, not sure how long it'll take to roll out, but cheers for pointing this out :)
@@WhatsACreel I've gotten used to it by now :) Great video, thanks :)
Haha, awesome, Matt! Sooo good of you to stop by! Cheers for the website brus :)
So the first one is from the sum of 1 to n integers being n(n+1)/2, I'm assuming, because that's the only way I can see why imul is used.
9:52 Maybe xor fills the branch delay slot? Although normally this slot occurs after a branch instruction and not at the branch target as far as I know...
There's a thought!! I like this idea, I am not sure. It's certainly something funny going on. Something deep like that. Well, cheers for watching mate :)
been a fan of your channel for a LONG time and i absolutely love the way you explain all of your topics and showcase some really interesting topics such as this one - compiler optimizations and loop unrolling!
love it, cheers mate
Cheers mate, so good of you to say! Cheers for watching brus :)
Fascinating! It would be interesting to input some code that would take a while to actually run, then take your own guess as to which compiler will do the best job. Then actually compile it and see which one really is faster.
Fer sure mate! I'd back Clang every time! Thing is a beast :)
@@WhatsACreel I have never tried Clang, always been a fan of the GNU compilers, but after watching this I may just give it a shot. Been away from programming for a while and thinking about getting back into it. Heard good things about Clang in the past.
But who explores the explorers.... you do.
Nice video! At 5:00 "because" is misspelled 😉
Thank you for creating the best and most helpful videos about low-level programming (Assembly) that I have been able to find. It really seems the current popular opinion about programming is "You shouldn't use Assembly, the compiler is better than you!", which is obviously wrong in even these limited examples. Well, back to binge watching your videos. Thanks again!
Gday Creel!
Gday mate :)
xor eax, eax (or another register) is common in "windows" assembly, making sure to start with a clean register
You are an absolute lad, greetings from germany.
Ignore my troll persona.
Absolutely amazing, well done ya, lets get out of ere xD
keeping those algorithms happy..
This is so cool.
Cheers mate :) thanks for watching
AVX AVX AVX
Hey, I got inspired and got thunderstruck by autism and wanted to optimize the first example for clang and made this:
# ifndef U64
# define U64 unsigned long long
# endif
const U64 sum_1__n(const U64 n = 0ULL) noexcept {
U64 t{0ULL}, _{0ULL};
for(; _
That was fun. Thanks.
You're welcome brother, cheers for watching :)
Are we optimizing for speed or code size? The latter is preferable when in doubt.
I'd say for speed. Tho I didn't specify directly, I just used O3 and fast maths. There's certainly a lot more flags available in these compilers for both speed and size, so maybe it was somewhere-in-the-middle optimizations? Hahaha, cheers for watching brus :)
Welcome to The Asm Is Right, Come On Down! More compiler gameshows please!
Awesome! More compiler game shows would be fun :)
👍
Is there an offline mode?
Yes, you can get it set up locally.
@@Qazqi how?
@@unodos1821 See the Github project that's linked to from the site: github.com/mattgodbolt/compiler-explorer
@@Qazqi Thanks.