Godbolt Compiler Explorer Adventures
- Published 30 Sep 2024
- In this video I want to share a really fun and extremely interesting website. It's called the Godbolt Compiler Explorer! Really, an amazing, fun website for C++ and C coders who are interested in modern optimizing compilers, and Assembly language.
Link to Compiler Explorer:
godbolt.org/
Sorry, but I accidentally specified the MS compiler's AVX2 flag wrong, so it didn't AVX2 anything... :( Anywho, here are the compiler settings I was using for each compiler (with the amended MS AVX2 flag):
Clang 9: -O3 -march=core-avx2 -m64 -cl-fast-relaxed-math
GCC 9.2: -O3 -mavx2 -funsafe-math-optimizations
Intel ICC 19.0.1: -Ofast -march=core-avx2 -m64
MSVC 19.22: /Ox /arch:AVX2 /fp:fast
Support What's a Creel? on Patreon:
/ whatsacreel
FaceBook: / whatsacreel
Music Channel: / @creelsmusic5814
Another channel with random things: / @ftlol6091
Software used to make this vid:
Visual Studio 2019 Community:
www.visualstud...
Blender:
www.blender.org/
OBS:
obsproject.com/
Davinci Resolve 16:
www.blackmagic...
Meshroom:
alicevision.gi...
"fun safe math optimization"
Lol
Good to see compiler explorer is getting so much popularity. Great video creel! There's always something interesting I learn by watching your videos. Keep 'em coming!
Cheers mate!! It's such an amazingly informative website! Matt Godbolt has done some really fun talks available on YouTube too! Imho an excellent presenter, and an amazing website :)
Never trust the compiler to do all the optimization for you :D Great video as always!
Fer sure brus!! They're really amazing at some things, and then not so much, hahaha! Cheers for watching :)
@@WhatsACreel From what I can tell, clang and gcc place a higher priority on "clever" optimizations than on low-hanging fruit, and a higher priority on ensuring that they apply optimizations whenever possible than on avoiding "optimizations" that would yield bogus code. For example, if T1 and T2 have matching representations (e.g. 64-bit "long" and "long long"), then given `if (mode) *(T1*)x = something; else *(T2*)x = something;`, gcc will optimize that to `*(T2*)x = something;` and then further optimize the surrounding code by recognizing that the assignment "can't" affect an object of type T1.
Further, when targeting the Cortex-M0, gcc is also prone to make some rather silly "optimizations" such as taking an object that is loaded with a constant before a loop, substituting a constant, and then reloading that constant on every iteration of the loop. Because of nonsense like that, code which exploits the register qualifier can actually end up faster at -O0 than -O2, and could be better yet if -O2 were augmented with a mode that applied safe peephole optimizations like omitting sign-extension instructions when no downstream code will care about the upper bits of the registers in question. If -O2 can't do better than that, what's the point?
I built the exact same thing for CUDA C++ in 2009, called cudapad. It had a window where you typed code in on the left, and the right side would instantly (in a second) pop up with the asm. It would draw lines visually to match up the two sides, and it also had a diff. It also had different compiler options that could be selected. I guess I wish I had pursued it more.
Ever since I found this channel, assembly has been more interesting than I ever thought it could be.
So good of you to say! Cheers for stopping by mate, have a good one :)
re: find minimum: I found the difference between the gcc and intel versions most interesting.
The Intel version would be faster: they've pre-treated the possibility of an unaligned array by scalar-processing the first few elements until alignment is reached (I assume clang also did this), then broadcast the found minimum as the lane starting points for the direct-from-memory (alignment-requiring) AVX instructions. The GCC version, on the other hand, reads the memory in 2 separate 128-bit unaligned reads, which can be slightly faster than unaligned 256-bit reads, but that ends up requiring 3 instructions instead of 1, including a slow lane-swap. Both of them missed the possibility of doing 2 vpminsd operations per cycle by using just one register, but clang is OTT with its block of 8 vpmin's using 4 registers. 2 is enough.
I would personally avoid 3-part addressing in a fast loop unless I'm addressing multiple arrays from the same index, since it forces extra work on the address generators - just base and offset is enough, incrementing the base pointer and decrementing a separate loop counter with break on zero, instead of having a cmp.
the loop ends up looking like:
.loop:
    vpminsd ymm0, ymm0, ymmword ptr [rdi]
    vpminsd ymm1, ymm1, ymmword ptr [rdi + 32]
    add     rdi, 64
    sub     rax, 1
    jne     .loop
and should execute at 1 loop/clock cycle on anything with AVX2 instructions.
The compiler explorer is particularly convenient for exploring situations where gcc or clang use their twisted interpretation of the Standard as an excuse to get "creative". Consider, for example, "unsigned mul(unsigned short x, unsigned short y) { return x*y;}". According to the published Rationale document, the authors of the Standard thought that compilers for modern hardware would process such code sensibly even for values INT_MAX+1u to UINT_MAX, and there was thus no need to worry about whether it mandated such behavior. The gcc compiler, however, will sometimes interpret the lack of a mandate as an excuse to jump the rails. While I haven't found any situations where clang would get creative with that particular construct, it often gets creative in other ways.
That's a really interesting point! I am not sure where the compilers bend the rules. I am sure they do, but I figured it was all related to relaxed flags in the command line? Maybe they outright dodge the standard all the time?? Cheers for sharing and watching mate :)
I think you should make a monthly series out of this! Looking at fun code in godbolt and judging how well different compilers do! Very entertaining :)
Did he just say: "...ignore my poor programming skills..." ? 18:54 ... Man, I would like to have 1/2 of your poor programming skills :D
Love this site! It's really good for learning how compilers translate code into individual instructions and how exactly those instructions are used for us beginner assembly learners, great tool indeed :)
For sure mate!! Compilers are really amazing ASM coders, really awesome tool for learning :)
I just thought of a funny project we could try when the Creel AI lib gets ready: THE ULTIMATE NEURAL C++ COMPILER. Use c++ code as an input and the neural net outputs an assembly. The compiled code could be given inputs and its performance could be evaluated based on whether it gets all the outputs correct. I think we might be able to turn some simple algorithms into assembly (or get a good laugh). :D
Ok, this is the greatest idea ever!! Not sure if it would be amazing or terrible, but it would certainly be fun :)
Dear sir, Can you make a series of videos about performance programming with the Intel compiler Parallel Studio XE, for example using its automatic vectorization constructs, Math Kernel Library, and others?
This is an excellent idea! I don't have the Intel compiler installed. I do have MKL tho. It's actually Amazing!!! I'm still confused as to how it works, but would love to share some of that info at some point. cheers for watching brus, and sharing ideas :)
Efficient algorithm to sum from 1 to N: print(N*(N+1)/2)
re: floating-point AVX unrolling: clang is slightly overkill but also misses 50% of the performance it could be getting. Newer CPUs (Skylake and after) have 2 vector add ports, and the instruction latency is 4 cycles whereas throughput is 0.5 (2 adds per cycle). In order to not have the adding process held up by the loop dependency, it needs to use 8 registers, so the unrolling by 16 is slightly overkill, but it's still only 50% effective because it's only using 4 registers for the summing. clang manages 50% of that. intel manages 25% of that. gcc manages 12% of that. Maybe clang wasn't updated to account for 2 adders yet?
LEA is sometimes used to perform math operations that fit the formula
C + R1 + R2*2^n (where n is 0, 1, 2, or 3), so lea reg1, [reg2+1] is like reg1 = reg2 + 1, and so on.
Sometimes you will see the compiler doing lea reg1, [reg1+reg1*8] to multiply reg1 by 9, for example. I suspect the reason is using different parts of the pipeline to do math right next to each other so there doesn't need to be a stall: if an adder/multiplier is being used, and you need another multiply by 9 right after it, use the addressing unit to do the multiply by 9.
This is very cool. I decided to write my own dumb pop_count() function and try it in GodBolt myself.
None of the compilers picked up what I was doing.
// Luke's dumb PopCount
int PopCount(int i)
{
    int pop = 0;
    for (; i > 0; i = i >> 1)
        pop += (i % 2);
    return pop;
}
I used the same compiler versions and options.
I reckon that's a really good PopCount! Surprising that the compilers didn't figure it out. Cheers for sharing mate :)
@@WhatsACreel I'll take that as a HUGE compliment. Thanks!
@@WhatsACreel I'm not a C programmer OR an ASM programmer, I work in higher-level languages for the most part. Is there a way of writing popcount in C so that EVERY compiler says "Yes, I see what you're doing - let's use popcount"? It seems to me that that would be the best way to write it. Or maybe there's a compiler macro for it or something?
@@TehMagilla I don't know of any way that all the compilers recognize. There's a macro, yes, _popcount, or something like that. If you search for 'popcnt intrinsic', it should turn up. On x86, that's the fastest way to compute the pop count.
Outside of that, the best way I know is called Kernighan's algorithm. Definitely worth a look if you've not seen it :)
Why did it not multi-thread the sum()?
On min() - that's a radix sort & get single array[0] value. (Neg value to smallest array index)
If TV was like this video, I'd be watching it all the time
Cheers mate!! Thanks for watching :)
Well, your videos sure can satisfy the nerdy urges ;)
Maybe the "xor eax, eax" (eax = 0) is because you put the "uint count = 0;" at the start, and eax = count,
so perhaps the compiler didn't realize that it didn't need that line (?)
Yeah, there's something strange going on there. I think it was deep trickery. But certainly possible it was just a mistake, hahaha. Cheers for watching mate :)
This is a trick that used to be used a LOT on the Z80 to zero a register without using a load immediate. On the Z80 xor reg,reg is a smaller instruction than load immediate. I think the same thing was used on 6502 (but I can't remember).
Oh, yes, XOR is commonly used in x86 for that too! "XOR reg, reg" is treated as a special instruction in a lot of CPUs, and executes abnormally quickly, something like 1/4 of a clock cycle from memory. It's got me baffled exactly why the compiler would zero it, tho, considering popcnt will overwrite it...? I think it's either a trick to break a dependency chain on EAX, or it's the thing mentioned by spj... If you "XOR EAX, EAX", then you're saying any pending computations on EAX are irrelevant, and the POPCNT can begin to execute immediately. I think that's what it is.
I mean... it says right there... COMPILER EXPLORER not godbolt compiler....
Oh my thumbnail... damn!! I spent soo much time on the lightning bolts... I will fix it, hold up brus! Cheers for watching :)
There we go, it's fixed, not sure how long it'll take to roll out, but cheers for pointing this out :)
@@WhatsACreel I've gotten used to it by now :) Great video, thanks :)
Haha, awesome, Matt! Sooo good of you to stop by! Cheers for the website brus :)
So the first one is from the sum of 1 to n integers being n(n+1)/2, I'm assuming, because that's the only way I can see why imul is used.
9:52 Maybe xor fills the branch delay slot? Although normally this slot occurs after a branch instruction and not at the branch target as far as I know...
There's a thought!! I like this idea, I am not sure. It's certainly something funny going on. Something deep like that. Well, cheers for watching mate :)
been a fan of your channel for a LONG time and i absolutely love the way you explain all of your topics and showcase some really interesting topics such as this one - compiler optimizations and loop unrolling!
love it, cheers mate
Cheers mate, so good of you to say! Cheers for watching brus :)
Fascinating! It would be interesting to input some code that would take a while to actually run, then take your own guess as to which compiler will do the best job. Then actually compile it and see which one really is faster.
Fer sure mate! I'd back Clang every time! Thing is a beast :)
@@WhatsACreel I have never tried Clang, always been a fan of the GNU compilers, but after watching this I may just give it a shot. Been away from programming for a while and thinking about getting back into it. Heard good things about Clang in the past.
But who explores the explorers.... you do.
Nice video! At 5:00 "because" is misspelled 😉
Thank you for creating the best and most helpful videos about low-level programming (Assembly) that I have been able to find. It really seems the current popular opinion about programming is "You shouldn't use Assembly, the compiler is better than you!", which is obviously wrong in even these limited examples. Well, back to binge watching your videos. Thanks again!
Gday Creel!
Gday mate :)
xor eax, eax (or another register) is common in "windows" assembly, making sure to start with a clean register
You are an absolute lad, greetings from germany.
Ignore my troll persona.
Absolutely amazing, well done ya, lets get out of ere xD
keeping those algorithms happy..
This is so cool.
Cheers mate :) thanks for watching
AVX AVX AVX
Hey, I got inspired and got thunderstruck by autism and wanted to optimize the first example for clang and made this:
# ifndef U64
# define U64 unsigned long long
# endif
const U64 sum_1__n(const U64 n = 0ULL) noexcept {
U64 t{0ULL}, _{0ULL};
for(; _
That was fun. Thanks.
You're welcome brother, cheers for watching :)
Are we optimizing for speed or code size? The latter is preferable when in doubt.
I'd say for speed. Tho I didn't specify directly, I just used O3 and fast maths. There's certainly a lot more flags available in these compilers for both speed and size, so maybe it was somewhere-in-the-middle optimizations? Hahaha, cheers for watching brus :)
Welcome to The Asm Is Right, Come On Down! More compiler gameshows please!
Awesome! More compiler game shows would be fun :)
👍
Is there an offline mode?
Yes, you can get it set up locally.
@@Qazqi how?
@@unodos1821 See the Github project that's linked to from the site: github.com/mattgodbolt/compiler-explorer
@@Qazqi Thanks.