This is pure gold. Not much info could be found about using AVX512, so I'm very grateful for this series of videos.
Glad you liked it Snake! Cheers for watching :)
bit late for this comment... but I just wanted to say you are a bloody legend. Thanks for going to all the effort of making these. Cheers! 👍
GCC Auto-Vectorization: I’m gonna end this man’s whole career
This is great. Very approachable.
Thanks for the amazing x86-64 Assembly videos! Learned so much! What is this beautiful syntax highlighting you have in VS? Please tell me :)
VCL? More like "Very cool, and powerful as hell!" 👍
Thanks for your interesting and useful stuff.
Really interesting stuff, thank you for this great explanation. At some point you mention that the asm code is not faster than standard C++. Is it because of the unaligned arrays or is there more to it?
It's because the C++ compiler can't inline our ASM, so the function call itself slows things to the point where there's no point in using ASM like that. If we get into ASM, we usually try to stay there for as long as possible. The way we did it in the vid was just a few instructions, so the function call makes it much slower than regular C++, even though the AVX512 instructions themselves will probably be very fast! Sorry I was unclear. Thanks for watching mate, hope this clears it up :)
@@WhatsACreel Ok, it makes sense, thank you very much. I love your content, can't wait for the third part. Cheers ;)
@@WhatsACreel With clang or gcc, you can do inline asm in x64. clang/LLVM is pretty easy to set up in visual studio. It is not conformant c++ though.
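To make the inline asm point concrete, here is a minimal sketch of GCC/Clang extended inline assembly. It is not the code from the video: the function name `add_i64` is made up for illustration, and it uses a plain scalar `add` rather than AVX512 so it builds on any x86-64 machine. Because the asm statement sits inside an ordinary inline function, the compiler is free to inline it at the call site, which is exactly what a separately assembled function can't offer.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: adds two 64-bit integers.
// On x86-64 with GCC or Clang it uses extended inline assembly (AT&T syntax);
// on any other target it falls back to plain C++ so the sketch still builds.
inline std::int64_t add_i64(std::int64_t a, std::int64_t b) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    // "+r"(a): a is both read and written, held in a general purpose register.
    // "r"(b):  b is read only, also held in a register.
    // "cc":    ADD changes the flags register, so tell the compiler.
    __asm__("add %1, %0" : "+r"(a) : "r"(b) : "cc");
    return a;
#else
    return a + b;
#endif
}
```

Calling `add_i64(2, 3)` gives 5 either way. MSVC does not accept this syntax for x64 targets, which is why the video goes through a separate assembly file and pays for the call.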
The first one seems so simple, is there any difference in assembly between them?
Mr. Fog's library is amazing! There's sometimes a small speed difference between it and intrinsics or native ASM, but it's usually negligible. That would indicate that sometimes it's not translated directly to single instructions. But, VCL includes a whole bunch of really powerful mathematical functions that are not available in ASM, so that's great. Some really fast implementations too!!
There's thousands of instructions in the x86/64 instruction set, and so the library doesn't attempt to capture all the flexibility of native ASM, but I'd say for many conceivable tasks, VCL is certainly a very good way to go! I find it's an excellent way to prototype vector algorithms too. Really simple and easy to debug :)
I'm sure there's differences, yep. I've never looked deeply into what they are, but found the library performs really very well. Hope this helps, and cheers for watching :)
I just came across this, great video!
Thanks
Great video. I understood very little of it, but it was interesting nevertheless. Have you ever done a video on your background and what type of programming you do, as I'm quite interested to know where to start on things like this. My experience of coding is writing powershell scripts, vba, a bit of C#, SQL scripts (on the data side, rather than performance) that sort of thing, but I'm entirely self taught and effectively piggy back on what others have done before me. I have little understanding of the back end processes and such like. Any recommendations on where to start to get into these things, should I get a degree or should I just try hacking into the FBI and hope for a lucky hit? Much appreciated.
I'm nowhere close to being on Creel's level but I do understand a lot of this stuff. I can actually write a fair bit of x86 assembly code but only the old school way, e.g. mov, add, cmp, div etc., and only 32-bit assembly for now. I also know some C/C++, C#, VB.Net, VB6 and a couple of other languages.
It's not as intimidating as you think. I'm entirely self-taught. But I'm not going to lie, it won't be easy for you to level up out of basic scripting but just being able to write scripts puts you sooooo far ahead of the average person. I'll recommend you focus heavily on C# since you're already familiar with it and you can learn a lot in that environment. There is plenty out there in the world of C#. Just write a lot of code and read a lot when you get stuck. Rinse and repeat and within 2 years you should be able to do some amazing things. Also, don't be afraid to push yourself.
Beautiful Chris just beautiful man great video, like the xmm and ymm sets, with aligned data on 16 byte and 32 byte boundaries respectively, the avx512 would be 64byte aligned ?
Yes, that's right! AVX512 alignment is 64 bytes. Since the original AVX instruction set, the alignment restrictions have been relaxed. We still have to align for MOVAPD and other aligned moves, but we can use unaligned operands as the final parameters to AVX and AVX512 instructions. Cheers for watching mate :)
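For anyone who wants to see the 64-byte alignment from the C++ side, here is a small sketch. The names `ZmmBlock` and `is_aligned` are made up for this example. It uses standard `alignas` to request the alignment and a pointer check to confirm it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// 16 floats = 64 bytes, the width of one ZMM register.
// alignas(64) asks the compiler to place every ZmmBlock on a 64-byte boundary.
struct alignas(64) ZmmBlock {
    float lanes[16];
};

// True when p sits on an address that is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A `ZmmBlock` declared as a local or global passes `is_aligned(&block, 64)`, so it is safe for the aligned moves. An address 4 bytes into the block fails the check and would need an unaligned move instead.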
@@WhatsACreel No worries, NICE ONE for that, Cheers
Beautiful video as always mate! Have you got a PayPal donation account set up at all? I'm sure I'm not alone when I say that I'd love to donate without having to sign up to Patreon :)
I have a future video request/idea relating to a question asked in this comment section: I'd love a video on assembly-related optimization traps beginners (like myself) can fall into, such as seeing the 3 lines of assembly towards the end of this video and not realising that although the SIMD add operation itself is one cycle, the loading and storing portion of the same instruction can take MANY cycles when doing it manually. A top 10 optimization traps video would be INCREDIBLY useful!
Sorry, I don't have a one time donation set up. Cheers for the thought though, I am thinking of setting one up. And it would be excellent to make an ASM traps and pitfalls vid! I really like that idea. At the moment, I'm working on an 'ASM Misconceptions' vid, so that's kind of similar. I hope I can record and share soon. Well, thank you for the suggestions and thank you for watching, have a great day :)
Why did you write in the video that the "manual" asm is "much" slower than using vectorclass.h, even though you use only like 3 instructions in asm?
Because you have to load and store the vectors from memory for the add. In c++ they could stay in the registers removing the need for unnecessary loads/stores
@@nayjames123 There is also overhead when calling the assembly function.
Yes, the answers here are right! It's because the ASM will require a function call. The compiler can inline and optimize its own functions, but when we use ASM, it doesn't optimize. So when we go into ASM, we usually want to stay there for as long as possible, otherwise the time for the function call and loading of the data will not be mitigated and the ASM will perform poorly.
Question: Do SIMD instructions run in parallel at the transistor level, or is there some kind of internal for loop in the CPU?
Yes, they are parallel. For the most part anyway. There might be some complex instructions that split into micro ops, which could be executed by different pipes in sequence. But generally, it's all at the same time. Cheers for watching!
@@WhatsACreel cheers for answering questions :)
So... Do avx512 gather scatter instructions provide any performance benefit or is it just for cleaner code? And perhaps a chance for future better hardware implementation?
I am not sure on the performance. If I remember correctly, they gather elements based on the bits of a K register. I assume the normal penalty for cache line misses would still hold, since the instructions would only be reading from 1 or 2 cache lines. Pretty much the same as any other instruction that reads the whole 64 bytes.
That's speculation though, and I'd certainly love to explore it a little. My memory is often flawed, so I might be completely wrong. I'd say they're useful, but they're not the completely arbitrary gathers that we might hope for.
Hope this helps, and if you do explore the performance I'd love to read/hear about it if you'd like to share. Cheers for watching mate, have a good one :)
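For reference, here is a scalar model of what a masked AVX512 gather such as VGATHERDPS does, written as plain C++ so it runs anywhere. It is a sketch of the semantics only, not the hardware implementation, and `gather_model` and `kLanes` are made-up names. The instruction takes a vector of indices, and the K register acts as a write mask choosing which lanes get loaded. Since each lane has its own index, one gather can touch as many as 16 different cache lines.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16;  // 16 floats per ZMM register

// Scalar model of a masked 32-bit gather:
// for every lane whose mask bit is set, load base[index[lane]];
// lanes whose mask bit is clear keep the value already in dst.
std::array<float, kLanes> gather_model(const float* base,
                                       const std::array<std::int32_t, kLanes>& index,
                                       std::uint16_t mask,
                                       std::array<float, kLanes> dst) {
    assert(base != nullptr);
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
        if (mask & (1u << lane)) {
            dst[lane] = base[index[lane]];
        }
    }
    return dst;
}
```

With a mask of `0x0005`, only lanes 0 and 2 are loaded and the other 14 lanes keep their old contents. So the gathers are arbitrary in where they read from, but every lane that lands on a new cache line can still pay the usual miss penalty.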
may I ask, why can't you pass the zmmwords to assembly and return the result through zmm0 instead of doing pointers (using vectorcall)?
edit: is there a better calling convention than the "c" calling convention?
Do you know, I'm not sure if there's a better one. I haven't studied calling conventions for a while. The x64 ones tend to all be very similar. C is pretty good. It uses registers for the first 4 ints and floats. But these vector types are arrays, so I'm not sure there's any better way to pass them than by pointer. Which is essentially what the C convention does.
It's easy to establish your own calling convention once you're in Assembly. Then you don't have to worry about any calling convention, unless you interact with other C functions. Of course, that can be tricky if you're not careful too!
Sorry I can't help more. Thank you for watching, and thank you for this interesting question :)
@@WhatsACreel you should try calling a normal c++ function and passing it an intrinsic zmmword then looking at the disassembly
There is a Wikipedia article about calling conventions called "x86 calling conventions". It should give a nice overview.
Why is it that the parameters are passed to rcx, rdx and r8? also what is stored in those registers?
Anyways, thanks a lot for these videos!
It's just the calling convention. Folks had to decide on some registers and they just decided on those ones. It's different in Linux. Yeah, I have no good explanation, it's just the convention.
There's nothing special about those registers, they're general purpose, you can store whatever you like in them.
Hope this helps, have a good one :)
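To spell out the "it's different in Linux" part, here is a small lookup sketch of where the first integer or pointer arguments go under the two common x64 conventions. The enum and function names are made up for the example. The register lists are the documented ones: RCX, RDX, R8, R9 for the Microsoft x64 convention, and RDI, RSI, RDX, RCX, R8, R9 for System V AMD64 (Linux, macOS).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class Abi { MicrosoftX64, SystemVAmd64 };

// Returns where the n-th (0-based) integer or pointer argument is passed.
// Arguments beyond the register list go on the stack.
std::string int_arg_location(Abi abi, std::size_t n) {
    static const char* const ms[] = {"rcx", "rdx", "r8", "r9"};
    static const char* const sysv[] = {"rdi", "rsi", "rdx", "rcx", "r8", "r9"};
    if (abi == Abi::MicrosoftX64) {
        return n < 4 ? ms[n] : "stack";
    }
    return n < 6 ? sysv[n] : "stack";
}
```

So a function taking three pointers, like the one in the video, finds them in RCX, RDX and R8 on Windows, and in RDI, RSI and RDX on Linux. What is stored in them is simply the argument values, here the addresses of the arrays.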
@@WhatsACreel Thanks man. I watched all your new playlist on modern x64 assembly, and avx512. You truly enjoy what you explain!
Maybe you could make some videos on how to make a mini retro game or some kind of program in masm, in order to put these ideas into practice, just an idea.
Calling conventions are compiler specific and that's what the compiler expects when an extern function is used. Many different compilers adhere to a specific calling convention and as a developer there isn't much you can do about it because then you would have to change the compiler.
I don't know about this... I'm most DEFINITELY going with an AMD cpu
I can't fault AMD right now! Great CPUs :)
And it's also not recommended when programming for macOS now, because Rosetta 2 does not emulate AVX512.
If you want to do AVX512, you will need an Intel CPU: either one of the LGA 2066 i9s or one of the upcoming 11th Gen Core i7 or i9 chips (or an equivalent Xeon). AMD doesn't support AVX512.
@@AlyxSharkBite-2000 ok we'll see how relevant avx512 is soon enough
@@roberthowell8267 Oh, I wasn't saying it was. I was only saying that if you wanted it (I figured you did, since this was an AVX512 video), you needed an Intel. I didn't want you to pick up a CPU and find it doesn't have a feature you wanted.
LIKE AND SUB, THIS MAN IS AWESOME
wow 0 dislikes, i guess haters don't care about avx :P
Many acclaimed studies have shown it's physically impossible to dislike an Australian talking about low-level performance computing.
There won't be any mainstream AVX-512 until RocketLake.
That's an interesting point of view! Maybe this gigantic instruction set won't work out at all? Fascinating time we are in right now :)
@@WhatsACreel I thought Intel learned this from blowing 100 billion on the Itanium VLIW architecture. But Intel is run by business graduates and not technical people like Lisa Su.
Part 3: ua-cam.com/video/543a1b-cPmU/v-deo.html
Ok, but seriously, what is a creel?
Hey, I seen that Blender folder on your desktop. Whatcha doin' with Blender? :)
Blender is amazing!! I use it for the 3D in some of these vids, and photogrammetry (which is creating models from photographs), sometimes I just make little towns and houses and things for fun :) I'd like to sell on turbo squid or Unity store eventually, but at the moment, it's such a learning curve, still just a beginner :)
@@WhatsACreel Nice! Been messing around with it myself. Another neat program is MakeHuman which is free and allows you to create human 3D models which you can import into Blender. Also free. So, make some people for your towns. :)
@@NeilRoy Make Human is really great! Thanks mate :)
There's a plugin for Blender called Manuel Bastioni Lab. It's good too. Not a lot of assets though.
The problem with AVX512 is the hardware supports the opcodes, but doesn't support them in hardware. With only 2 FPUs it's no faster than SSE.
This will have the very simple reason that there are currently hardly any applications for the end user that support AVX512. But for compilers and developers it's good that they can buy CPUs that can do AVX512 so that they can adapt their software and compilers to it. That is why the space on the silicon chip was very likely saved.
As soon as the software supports AVX512 better, there will also be hardware with more AVX512 execution units per core, so that the additional performance compared to AVX2 can also be used.
It's basically a chicken and egg problem. But why waste space for the chicken if the egg hasn't even been laid yet.
:)
:)
:)
:-}
Not gonna lie. There's an Intel fanboy who said "Intel is better because video games can run on AVX 512 on it".
I was like: "Bruh, show me any video game that requires AVX 512 right now"
No game currently requires AVX512, nor would that be wise as it would then run on very little hardware. However, there are games that support it and use it when it's available. The Last of Us is one of them and there are also comparisons on YT. Just search for AVX512 on/off.
shame this will not work with Ryzen cpus
11:28 NASM uses the standard Intel syntax. MASM uses a modified proprietary syntax!
AVX512 is just too good to be true. In theory you can get 256 SP GFLOPS per core (FMA at 4 GHz). With 28 or 56 cores (if you have dual-die Xeons available) you have performance above GPU accelerators at much lower TDP (which is often the most important parameter).
In reality, AVX512 proved to be a disaster (Xeon-Phi 7xxx was released in 2014)...
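The 256 figure can be checked with a bit of arithmetic. This sketch assumes two 512-bit FMA units per core, which is what the number implies and is not true of every AVX512 chip. `peak_gflops` is a made-up helper, and the result is a theoretical ceiling, not a measurement.

```cpp
#include <cassert>

// Theoretical peak in GFLOPS for one core:
//   lanes per vector * 2 flops per FMA (multiply + add) * FMA units * clock in GHz
constexpr double peak_gflops(int vector_bits, int element_bits,
                             int fma_units, double ghz) {
    return (vector_bits / element_bits) * 2 * fma_units * ghz;
}
```

`peak_gflops(512, 32, 2, 4.0)` works out to 16 lanes * 2 * 2 * 4 = 256, matching the comment above. The same formula for 256-bit vectors, `peak_gflops(256, 32, 2, 4.0)`, gives 128, and a chip with a single 512-bit FMA unit also lands at 128.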
Yep, rough introduction, for sure! Similar things occurred with the original AVX. Though, at that time, they didn't have Ryzen to compete with! I think it was the opposite too. I think the original AVX was slow to start up, but once it was going it sped up?
I hope the throughput of the floating point can be improved. It's my only real worry about the instruction set. Oh, and compilers too - I mean, it was hard enough to effectively vectorise code, trying to automatically wrangle an instruction set like AVX512 from regular C++ code will be very difficult! I'm sure those compiler authors are clever enough to get some amazing things happening already :)
At the moment, throughput is 1 per cycle for the simpler floating instructions. It's 1/2 for AVX, so you get pretty much the same amount of flops. If they can improve that, even a little, I think it will do wonders!
Certainly love the masking abilities!! Really great stuff :)
Only time will tell :)
@@WhatsACreel The reason is that for AVX2 they use two AVX2 units per core in a superscalar way, and with AVX512 these two units just work together as one. So in the end, the result is the same. But if you ask me, this is not important at the moment, because compilers have to adapt first anyway.