AVX512 (2 of 3): Programming AVX512 in 3 Different Ways

Creel

Published 21 Nov 2024

COMMENTS • 74

@punishedsnake492 4 years ago ⁺⁴⁵
This is pure gold. Not much info could be found about using AVX512, so I'm very grateful for this series of videos.
@WhatsACreel 4 years ago ⁺¹²
Glad you liked it, Snake! Cheers for watching :)
@MrGooglevideoviewer 2 years ago ⁺¹
bit late for this comment... but I just wanted to say you are a bloody legend. Thanks for going to all the effort of making these. Cheers! 👍
@HuntingKingYT 1 year ago ⁺²
GCC Auto-Vectorization: I’m gonna end this man’s whole career
@anonmouse-zr9cn 1 year ago
This is great. Very approachable.
@SystemsDevel 2 years ago ⁺³
Thanks for the amazing x86-64 Assembly videos! Learned so much! What is this beautiful syntax highlighting you have in VS? Please tell me :)
@PunmasterSTP 4 months ago
VCL? More like "Very cool, and powerful as hell!" 👍
@hudevin7187 2 years ago
Thanks for your interesting and useful stuff.
@timkox9640 4 years ago ⁺⁵
Really interesting stuff, thank you for this great explanation. At some point you mention that the asm code is not faster than standard C++. Is it because of the unaligned arrays or is there more to it?
@WhatsACreel 4 years ago ⁺¹²
It's because the C++ compiler can't inline our ASM, so the function call itself will slow things to the point where there's no point in using ASM like that. If we get into ASM, we usually try to stay there for as long as possible. The way we did it in the vid was just a few instructions, and so the function call will make that much slower than regular C++, even though the AVX512 instructions themselves will probably be very fast! Sorry I was unclear. Thanks for watching mate, hope this clears it up :)
@timkox9640 4 years ago
@WhatsACreel Ok, it makes sense, thank you very much. I love your content, can't wait for the third part. Cheers ;)
@BSOD.Enjoyer 4 years ago ⁺³
@WhatsACreel With clang or gcc, you can do inline asm in x64. clang/LLVM is pretty easy to set up in Visual Studio. It is not conformant C++ though.
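A minimal sketch of the inline asm route mentioned above, assuming GCC or clang on x86-64 (MSVC does not accept inline asm in x64 builds). The function name `add_inline_asm` is made up for illustration, and a plain C++ fallback is included so the snippet still builds on other targets. Because the asm lives inside an ordinary inline function, the compiler can inline the call site, which a separately assembled routine cannot offer.

```cpp
// Hypothetical helper: adds two ints with one x86 ADD via extended inline asm.
inline int add_inline_asm(int a, int b) {
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
    // AT&T operand order: "addl %1, %0" adds operand 1 (b) into operand 0 (a).
    // "+r" marks a as a read-write register operand; "cc" says flags change.
    __asm__("addl %1, %0" : "+r"(a) : "r"(b) : "cc");
    return a;
#else
    // Fallback for compilers or targets without this inline asm dialect.
    return a + b;
#endif
}
```

Calling `add_inline_asm(2, 3)` behaves like `2 + 3`; the extended asm syntax is a compiler extension, which is the "not conformant" part.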
@frognik79 4 years ago ⁺⁶
The first one seems so simple, is there any difference in assembly between them?
@WhatsACreel 4 years ago ⁺¹²
Mr. Fog's library is amazing! There's sometimes a small speed difference between it and intrinsics or native ASM, but it's usually negligible. That would indicate that sometimes it's not translated directly to single instructions. But, VCL includes a whole bunch of really powerful mathematical functions that are not available in ASM, so that's great. Some really fast implementations too!!
There's thousands of instructions in the x86/64 instruction set, and so the library doesn't attempt to capture all the flexibility of native ASM, but I'd say for many conceivable tasks, VCL is certainly a very good way to go! I find it's an excellent way to prototype vector algorithms too. Really simple and easy to debug :)
I'm sure there's differences, yep. I've never looked deeply into what they are, but found the library performs really very well. Hope this helps, and cheers for watching :)
@AlyxSharkBite-2000 3 years ago
I just came across this, great video!
@diegonayalazo 3 years ago
Thanks
@ricos1497 4 years ago ⁺²
Great video. I understood very little of it, but it was interesting nevertheless. Have you ever done a video on your background and what type of programming you do, as I'm quite interested to know where to start on things like this. My experience of coding is writing PowerShell scripts, VBA, a bit of C#, SQL scripts (on the data side, rather than performance), that sort of thing, but I'm entirely self taught and effectively piggyback on what others have done before me. I have little understanding of the back end processes and such like. Any recommendations on where to start to get into these things, should I get a degree or should I just try hacking into the FBI and hope for a lucky hit? Much appreciated.
@xniyana9956 3 years ago ⁺¹
I'm nowhere close to being on Creel's level but I do understand a lot of this stuff. I can actually write a fair bit of x86 assembly code but only the old school way, e.g. mov, add, cmp, div etc. and only 32 bit assembly for now. I also know some C/C++, C#, VB.Net, VB6 and a couple other languages.
It's not as intimidating as you think. I'm entirely self-taught. But I'm not going to lie, it won't be easy for you to level up out of basic scripting but just being able to write scripts puts you sooooo far ahead of the average person. I'll recommend you focus heavily on C# since you're already familiar with it and you can learn a lot in that environment. There is plenty out there in the world of C#. Just write a lot of code and read a lot when you get stuck. Rinse and repeat and within 2 years you should be able to do some amazing things. Also, don't be afraid to push yourself.
@steveokinevo 4 years ago ⁺¹
Beautiful Chris, just beautiful man, great video. Like the xmm and ymm sets, with aligned data on 16 byte and 32 byte boundaries respectively, the avx512 would be 64-byte aligned?
@WhatsACreel 4 years ago ⁺⁴
Yes, that's right! AVX512 alignment is 64 bytes. Since the original AVX instruction set, the alignment restrictions have been relaxed. We still have to align for the MOVAPD and other aligned moves, but we can use unaligned operands as the final parameters to AVX and AVX512 instructions. Cheers for watching mate :)
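A small sketch of what 64-byte alignment looks like on the C++ side, using only standard `alignas`. The names `Zmm16f` and `is_aligned` are invented for this example. A type like this is safe to use with the aligned moves, while data at an arbitrary address would need the unaligned forms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper: 16 floats = 64 bytes, the width of one ZMM register.
// alignas(64) asks the compiler to place every instance on a 64-byte boundary.
struct alignas(64) Zmm16f {
    float lane[16];
};

// Returns true when p sits on an `alignment`-byte boundary.
// The alignment must be a non-zero power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

For a `Zmm16f v;`, `is_aligned(&v, 64)` holds, while a pointer 4 bytes into `v` is not 64-byte aligned.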
@steveokinevo 4 years ago
@WhatsACreel No worries, NICE ONE for that, Cheers
@amber1862 4 years ago ⁺³
Beautiful video as always mate! Have you got a PayPal donation account set up at all? I'm sure I'm not alone when I say that I'd love to donate without having to sign up to Patreon :)
I have a future video request/idea relating to a question asked in this comment section: I'd love a video on assembly-related optimization traps beginners (like myself) can fall into, such as seeing the 3 lines of assembly towards the end of this video and not realising that although the SIMD add operation itself is one cycle, the loading and storing portion of the same instruction can take MANY cycles when doing it manually. A top 10 optimization traps video would be INCREDIBLY useful!
@WhatsACreel 4 years ago ⁺¹
Sorry, I don't have a one time donation set up. Cheers for the thought though, I am thinking of setting one up. And it would be excellent to make an ASM traps and pitfalls vid! I really like that idea. At the moment, I'm working on an 'ASM Misconceptions' vid, so that's kind of similar. I hope I can record and share soon. Well, thank you for the suggestions and thank you for watching, have a great day :)
@WhoNoMe 4 years ago ⁺³
Why did you write in the video that the “manual” asm is “much” slower than using the vectorclass.h even tho u use like 3 instructions in asm?
@nayjames123 4 years ago ⁺³
Because you have to load and store the vectors from memory for the add. In C++ they could stay in the registers, removing the need for unnecessary loads/stores
@salainen6850 4 years ago
@nayjames123 There is also overhead when calling the assembly function.
@WhatsACreel 4 years ago ⁺⁵
Yes, the answers here are right! It's because the ASM will require a function call. The compiler can inline and optimize its own functions, but when we use ASM, it doesn't optimize. So when we go into ASM, we usually want to stay there for as long as possible, otherwise the time for the function call and loading of the data will not be mitigated and the ASM will perform poorly.
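One way to picture "staying there for as long as possible" is to give the external routine a whole array per call instead of one vector per call, so the call overhead is paid once. The sketch below is plain C++; `add_arrays` is a hypothetical stand-in for a routine that would really be written in ASM, and `add_vectors` is an illustrative caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an external (e.g. ASM) routine. It loops over all
// n elements internally, so one call covers the whole workload.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Caller side: a single call per pair of arrays, not one call per 16 floats.
std::vector<float> add_vectors(const std::vector<float>& a,
                               const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    add_arrays(a.data(), b.data(), out.data(), a.size());
    return out;
}
```

With this shape, the cost of the call and of setting up pointers is spread over every element processed rather than repeated for each small vector.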
@bbq1423 4 years ago ⁺⁷
Question: Do SIMD instructions run in parallel on a transistor level or is there some kind of internal for loop in the CPU?
@WhatsACreel 4 years ago ⁺²¹
Yes, they are parallel. For the most part anyway. There might be some complex instructions that split into micro ops, which could be executed by different pipes in sequence. But generally, it's all at the same time. Cheers for watching!
@alan2here 4 years ago ⁺¹
@WhatsACreel cheers for answering questions :)
@arditm2178 4 years ago
So... Do avx512 gather scatter instructions provide any performance benefit or is it just for cleaner code? And perhaps a chance for future better hardware implementation?
@WhatsACreel 4 years ago ⁺¹
I am not sure on the performance. If I remember correctly, they gather elements based on the bits of a K register. I assume the normal penalty for cache line misses would still hold, since the instructions would only be reading from 1 or 2 cache lines. Pretty much the same as any other instruction that reads the whole 64 bytes.
That's speculation though, and I'd certainly love to explore it a little. My memory is often flawed, so I might be completely wrong. I'd say they're useful, but they're not the completely arbitrary gathers that we might hope for.
Hope this helps, and if you do explore the performance I'd love to read/hear about it if you'd like to share. Cheers for watching mate, have a good one :)
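For reference, the AVX-512 gather instructions (VGATHERDPS, for example) take a vector of indices in addition to the K register: the indices choose where each lane is read from, and the K bits choose which lanes are loaded at all. That means one gather can touch many cache lines. A scalar model of that behaviour, as a rough sketch only, is below; the function name `masked_gather` and its signature are made up, and it says nothing about real performance.

```cpp
#include <array>
#include <cstdint>

// Scalar model of a 16-lane masked gather of floats (illustrative only).
// Lane i is loaded from base[index[i]] when bit i of mask is set;
// lanes whose mask bit is clear keep their value from src.
std::array<float, 16> masked_gather(const std::array<float, 16>& src,
                                    std::uint16_t mask,
                                    const std::array<std::int32_t, 16>& index,
                                    const float* base) {
    std::array<float, 16> dst = src;
    for (int i = 0; i < 16; ++i) {
        if (mask & (1u << i)) {
            dst[i] = base[index[i]];
        }
    }
    return dst;
}
```

With a mask of 0x0005 only lanes 0 and 2 are fetched, each from whatever address its index points at.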
@gideonmaxmerling204 4 years ago ⁺¹
may I ask, why can't you pass the zmmwords to assembly and return the result through zmm0 instead of doing pointers (using vectorcall)?
edit: is there a better calling convention than the "c" calling convention?
@WhatsACreel 4 years ago ⁺³
Do you know, I'm not sure if there's a better one. I haven't studied calling conventions for a while. The x64 ones tend to all be very similar. C is pretty good. It uses registers for the first 4 ints and floats. But these vector types are arrays, so I'm not sure there's any better way to pass them than by pointer. Which is essentially what the C convention does.
It's easy to establish your own calling convention once you're in Assembly. Then you don't have to worry about any calling convention, unless you interact with other C functions. Of course, that can be tricky if you're not careful too!
Sorry I can't help more. Thank you for watching, and thank you for this interesting question :)
@gideonmaxmerling204 4 years ago ⁺¹
@WhatsACreel you should try calling a normal C++ function and passing it an intrinsic zmmword, then looking at the disassembly
@OpenGL4ever 1 year ago
There is a Wikipedia article about calling conventions called "x86 calling conventions". It should give a nice overview.
@MagnusTheUltramarine 3 years ago
Why is it that the parameters are passed to rcx, rdx and r8? Also, what is stored in those registers?
Anyways, thanks a lot for these videos!
@WhatsACreel 3 years ago
It's just the calling convention. Folks had to decide on some registers and they just decided those ones. It's different in Linux. Yeah, but I have no good explanation, just the convention.
There's nothing special about those registers, they're general purpose, you can store whatever you like in them.
Hope this helps, have a good one :)
@MagnusTheUltramarine 3 years ago
@WhatsACreel Thanks man. I watched all your new playlist on modern x64 assembly, and avx512. You truly enjoy what you explain!
Maybe you could make some videos on how to make a mini retro game or some kind of program in masm, in order to put these ideas in practice, just an idea
@OpenGL4ever 1 year ago
Calling conventions are compiler specific and that's what the compiler expects when an extern function is used. Many different compilers adhere to a specific calling convention and as a developer there isn't much you can do about it because then you would have to change the compiler.
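To make the register question concrete, here is a sketch of how a three-pointer prototype lines up with the two common x64 conventions. The function name `add_16_floats` is hypothetical (the video's own routine may be named and shaped differently), and the body is a C++ stand-in so the example links without an assembler.

```cpp
#include <cstddef>

// Hypothetical prototype for an external routine with C linkage.
// The first integer/pointer arguments travel in registers:
//   Windows x64:              a -> rcx, b -> rdx, out -> r8  (a 4th would use r9)
//   System V (Linux, macOS):  a -> rdi, b -> rsi, out -> rdx
extern "C" void add_16_floats(const float* a, const float* b, float* out);

// C++ stand-in for the body; in the ASM version these three pointers would
// simply be read from the registers listed above.
extern "C" void add_16_floats(const float* a, const float* b, float* out) {
    for (std::size_t i = 0; i < 16; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

So on Windows, rcx, rdx and r8 hold nothing more than the three pointer values the caller passed, in declaration order.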
@roberthowell8267 4 years ago ⁺¹¹
I don't know about this... I'm most DEFINITELY going with an AMD cpu
@WhatsACreel 4 years ago ⁺¹³
I can't fault AMD right now! Great CPUs :)
@llothar68 4 years ago
And it's also not recommended when programming for macOS now. Because Rosetta 2 does not emulate AVX512
@AlyxSharkBite-2000 3 years ago ⁺¹
If you want to do AVX512 you will need an Intel CPU, either one of the LGA 2066 i9s or one of the upcoming 11th Gen Core i7 or i9 (or equivalent Xeon). AMD doesn’t support AVX512.
@roberthowell8267 3 years ago ⁺¹
@AlyxSharkBite-2000 ok we'll see how relevant avx512 is soon enough
@AlyxSharkBite-2000 3 years ago ⁺¹
@roberthowell8267 Oh, I wasn’t saying it was. I was only saying that if you wanted it (figured you did since this was an AVX512 video) you needed an Intel. Didn’t want you to pick up a CPU and have it missing a feature you wanted.
@lx2222x 4 years ago ⁺¹
LIKE AND SUB, THIS MAN IS AWESOME
@EpicHardware 4 years ago ⁺⁶
wow 0 dislikes, i guess haters don't care about avx :P
@amber1862 4 years ago ⁺³
Many acclaimed studies have shown it's physically impossible to dislike an Australian talking about low-level performance computing.
@theexplosionist2019 4 years ago ⁺²
There won't be any mainstream AVX-512 until Rocket Lake.
@WhatsACreel 4 years ago ⁺¹
That's an interesting point of view! Maybe this gigantic instruction set won't work out at all? Fascinating time we are in right now :)
@llothar68 4 years ago ⁺³
@WhatsACreel I thought Intel learned this with blowing 100 billions on Itanium VLIW architecture. But Intel is run by business graduates and not technical persons like Lisa Su.
@Alex-op2kc 3 years ago ⁺¹
Part 3: ua-cam.com/video/543a1b-cPmU/v-deo.html
@AlexDanut 3 years ago ⁺²
Ok, but seriously, what is a creel?
@NeilRoy 4 years ago
Hey, I seen that Blender folder on your desktop. Whatcha doin' with Blender? :)
@WhatsACreel 4 years ago ⁺²
Blender is amazing!! I use it for the 3D in some of these vids, and photogrammetry (which is creating models from photographs), sometimes I just make little towns and houses and things for fun :) I'd like to sell on turbo squid or Unity store eventually, but at the moment, it's such a learning curve, still just a beginner :)
@NeilRoy 4 years ago ⁺¹
@WhatsACreel Nice! Been messing around with it myself. Another neat program is MakeHuman which is free and allows you to create human 3D models which you can import into Blender. Also free. So, make some people for your towns. :)
@WhatsACreel 4 years ago ⁺¹
@NeilRoy Make Human is really great! Thanks mate :)
@WhatsACreel 4 years ago ⁺²
There's a plugin for Blender called Manuel Bastioni Lab. It's good too. Not a lot of assets though.
@TheNoodlyAppendage 2 years ago
The problem with AVX512 is the hardware supports the opcodes, but doesn't support them in hardware. With only 2 FPUs it's no faster than SSE
@OpenGL4ever 1 year ago
This will have the very simple reason that there are currently hardly any applications for the end user that support AVX512. But for compilers and developers it's good that they can buy CPUs that can do AVX512 so that they can adapt their software and compilers to it. That is why the space on the silicon chip was very likely saved.
As soon as the software supports AVX512 better, there will also be hardware with more AVX512 execution units per core, so that the additional performance compared to AVX2 can also be used.
It's basically a chicken and egg problem. But why waste space for the chicken if the egg hasn't even been laid yet.
@dankillinger 4 years ago ⁺⁵
:)
@theterribleanimator1793 4 years ago
:)
@WhatsACreel 4 years ago ⁺⁴
:)
@anthonynjoroge5780 4 years ago
:-}
@ClayWheeler 3 years ago
Not gonna lie. There's an Intel fanboy who said "Intel is better because video games can run on AVX 512 on it".
I was like: "Bruh, show me any video game that requires AVX 512 right now"
@OpenGL4ever 1 year ago
No game currently requires AVX512, nor would that be wise as it would then run on very little hardware. However, there are games that support it and use it when it's available. The Last of Us is one of them and there are also comparisons on YT. Just search for AVX512 on/off.
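"Use it when it's available" usually means a runtime check followed by picking a code path. Below is a minimal sketch of that idea, assuming GCC or clang on x86, where `__builtin_cpu_supports` is available; on any other compiler or target the sketch simply reports no AVX512. The names `cpu_has_avx512f` and `pick_kernel` are invented for illustration, and a real program would select between actual kernels rather than strings.

```cpp
#include <string>

// Reports whether the running CPU advertises the AVX-512 Foundation subset.
// Assumption: GCC/clang builtins on x86; elsewhere this returns false.
inline bool cpu_has_avx512f() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") != 0;
#else
    return false;
#endif
}

// Hypothetical dispatcher: names the code path a program would take.
inline std::string pick_kernel() {
    return cpu_has_avx512f() ? "avx512" : "scalar";
}
```

The check is done once at startup in most designs, and the chosen path is then reused for the rest of the run.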
@orestescm7644 3 years ago
shame this will not work with Ryzen cpus
@DaveAxiom 4 years ago
11:28 NASM uses the standard Intel syntax. MASM uses a modified proprietary syntax!
@jozo035 4 years ago ⁺⁴
AVX512 is just too good to be true. In theory you can get 256 SP-GFlops per core (FMA at 4 GHz). With 28 or 56 cores (if you have dual die Xeons available) you have performance above GPU accelerators at much lower TDP (which is often the most important parameter).
In reality, AVX512 proved to be disaster (Xeon-Phi 7xxx was released in 2014)...
@WhatsACreel 4 years ago ⁺³
Yep, rough introduction, for sure! Similar things occurred with the original AVX. Though, at that time, they didn't have Ryzen to compete with! I think it was the opposite too. I think the original AVX was slow to start up, but once it was going it sped up?
I hope the throughput of the floating point can be improved. It's my only real worry about the instruction set. Oh, and compilers too - I mean, it was hard enough to effectively vectorise code, trying to automatically wrangle an instruction set like AVX512 from regular C++ code will be very difficult! I'm sure those compiler authors are clever enough to get some amazing things happening already :)
At the moment, throughput is 1 per cycle for the simpler floating instructions. It's 1/2 for AVX, so you get pretty much the same amount of flops. If they can improve that, even a little, I think it will do wonders!
Certainly love the masking abilities!! Really great stuff :)
Only time will tell :)
@OpenGL4ever 1 year ago
@WhatsACreel The reason is that for AVX2 they use two AVX2 units per core in a superscalar way. And with AVX512, these two units just work together as one. So in the end, the result is the same. But if you ask me, this is not important at the moment, because compilers have to adapt anyway first.
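The numbers in this thread can be checked with a little arithmetic. The sketch below assumes the usual peak formula (lanes × 2 flops per FMA × FMA units × clock in GHz); the unit counts are assumptions chosen to match the comments, not measurements of any particular CPU, and `peak_sp_gflops` is a made-up helper name.

```cpp
// Peak single-precision GFLOPS per core:
//   lanes * 2 (an FMA counts as a multiply plus an add) * FMA units * GHz
constexpr double peak_sp_gflops(int lanes, int fma_units, double ghz) {
    return lanes * 2.0 * fma_units * ghz;
}

// 16 float lanes per ZMM register, two 512-bit FMA units, 4 GHz: the
// 256 SP-GFLOPS figure quoted above.
static_assert(peak_sp_gflops(16, 2, 4.0) == 256.0, "two 512-bit FMA units");

// One 512-bit FMA per cycle gives the same peak as two 256-bit FMAs per
// cycle, which is the "same amount of flops" point made in the replies.
static_assert(peak_sp_gflops(16, 1, 4.0) == peak_sp_gflops(8, 2, 4.0),
              "one 512-bit unit matches two 256-bit units");
```

So the 256 figure needs two full-width FMA units per core; with the two 256-bit units fused into one 512-bit unit, the peak stays at 128 and matches AVX2.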
