How come you were using vmovupd instead of vmovups in the rounding example to load the single-precision floats? Is there any difference between the two, like vmovups requiring 4-byte alignment? Or do the vmovu* instructions remove all alignment requirements and expand to the same micro-ops?
Oh, well spied!! It is a mistake. On some CPUs I understand there is a penalty for switching data types like that, so it should definitely be VMOVUPS! Pinned mate, thanks for pointing this out :)
Oh man, I can't wait to see some AVX1024 registers 😆
Watching a video like this makes me understand how CPUs keep gaining millions upon millions of transistors. The muxing, control lines, registers and logic in general to implement all of these instructions, things like broadcasting etc would just keep piling on the transistors..!
And the detail in that koala drawing ... 🤯
You are a master, sir.
Oh, wasn't expecting the art montage at the end. Appreciate it all the same with the series 🤭
Interesting stuff. Those masks are quite fascinating. Also love your dad's artwork. Very talented.
Cheers brus! You're a legend Neil :)
Your dad is an absolute legend indeed!
We ain't in Kansas anymore, Toto. Loved this trilogy, thank you. The Kmasks, the compressed displacement, broadcasting, the register files, all of it is exciting. I have experimented with SIMD since you first introduced us to SSE. I suspect that the power of these instructions will only really be experienced after a paradigm shift in the way we structure data. The classic vision of data in records (structs, classes etc.) has served us well with the classic architectures. These revolved around pointers and pointer arithmetic (shock! horror! bare naked pointers are at the foundation of it ALL). The new architecture is less friendly to the mixing of numerical, textual and bit-field data. It thrives on sequential lists of data all of the same type. So, data currently stored thus: Name: Michael, Age: 69, Salary: beyond your wildest dreams; Name: Creel, Age: ... etc. will need to be stored as Name: Michael, Creel; Age: 69... etc. I perhaps need a lot more examples of data for clarity, but the idea is that each numerical field can be accessed as one long array. Why? When one isn't number crunching enterprise amounts of data, the overhead of gathering the numerical data from classic records can erase the advantage of these powerful instruction sets. It is not an obstacle, just a different view of your data.
I really like your Dad's pictures. I can see why you are so proud of him. Stay healthy.
I’m with you mate! SIMD is really exciting stuff, but it does leave many languages in the dust. There’s just too much flexibility to express with most modern languages.
I think you are alluding to a topic called “SOA vs AOS”. Storing data as an array of structures, versus a structure of arrays. SIMD is very good at SOA, but computer languages usually use AOS. We want all the names together in an array, all the ages in another array, etc. Then we can manipulate or search 16 ages at once in SIMD, and we don’t have to gather them from all over RAM :)
That’s a perfect example of one of the ways modern languages are not designed to take advantage of this stuff! I did a video a long time ago on that topic, but I don’t think I covered it well. Maybe we could revisit it?
I reckon you’re spot on Michael! Thanks for the kind words mate, stay healthy too :)
Your channel is a gem ❤
Great intro to this instruction set! I'm a database guy, so not quite sure when I'll ever write my first assembly code, but your teaching style is so good that I can't help watching!
Awesome, I received 11 points! The compressed displacement explanation and example was brilliant. And thank you for sharing your Dad's artwork.
Really very good picture of the koala. By the way, great tutorial for the Intel AVX512 series, all three parts.
Great video - it's interesting as a developer to learn some asm/intrinsics.
Great video, great lesson about AVX512 mechanisms
Really cool and awesome videos on some of AVX512! Can you explore more of AVX512 in the future, especially the FMA instructions? CUDA with tensor cores boosts GEMM computation throughput by such a big degree that the Ampere A100 now basically has more silicon for tensor cores than for the common CUDA cores. The GA100 actually has far fewer FP32 CUDA cores than the GA102 gaming/content creation lineup (like the RTX A6000 or 3090). It would be interesting to see how an AVX512 FMA implementation of BLAS boosts speed/throughput in comparison to AVX2 and no AVX at all.
I would love to! Trouble is getting hardware :( I have a 1050ti, that's about it at the moment. Certainly a fine card, but not exactly state of the art. Some more AVX512 vids would be fun!! The fmads are awesome!! Thank you for the suggestions mate, and thanks for watching :)
The GA100 has 6912 CUDA cores, with over 13,000 FP32 units. The GA102-derived RTX 3090 and RTX 3080 have around 4,000-5,000 FP32 CUDA cores, with around 8,000-10,000 FP32 FPUs. Most of the die area on Ampere GPUs is still reserved by the shader/CUDA cores. The GA100 lacks RT cores, those are for GA102 and smaller dies only.
AVX2 also has automatic broadcasting - cool instructions!
Thanks! I was able to understand most of the stuff.
Hello mate. Will you make some videos on OpenCL and GPU programming? It would be a nice, interesting addition to your high-performance computing guide.
Thank you for the video.
@Creel
20:24
Did you notice that it rounded myFloats[0] = 1.5 to 2, but myFloats[4] = 0.5 to 0?
I would consider that strange. If a value is x.5, I would always expect it to round upwards.
Really interesting video, big thanks for this.
I'm sure I'm missing something, but in that AVXFoundationDetection code, after the cpuid instruction I see you test bit 16 of ebx by first shifting bit 16 into bit 0 (shr ebx, 16) and then testing bit 0 (old bit 16) with 'and ebx, 1'. Could you also test bit 16 directly with that 'and', bypassing the need for a shift? All bits would be zeroed apart from bit 16, so the appropriate status flags (e.g. Z) would still only reflect the status of bit 16, and the true/false return is maintained?
I noticed that when using "round nearest" 1.5 rounds up to 2, but 0.5 rounds down to 0. I think both values can be represented exactly as 4-byte floats though, so I found this a bit surprising.
Yes, why did this happen?
I also noticed it, and wanted to ask the same!
same here
Same here
This CPU is capabale of ZVX521 Fondation instruction set!
The year is 2051. Intel has added a new instruction: `DOOM`.
(it runs Doom :P)
*EDIT:* wow, your dad's art is so cool :0
loved it, thank you so much!
Great video!! May I know which CPU are you using?
It's an i5 1035g1 I believe. Cheers for watching :)
@@WhatsACreel For those who want to know the generation, it's a 10th-gen Ice Lake part.
8:21 - The only thing special about k0 is that you can't use it in a lot of instructions. The *encoding* "000" is used to mean "no mask". It doesn't *read* k0 when you do that, it just is hardwired to "no mask". That's why changing the contents of k0 doesn't change that behaviour - it never actually reads the register. But because the encoding is reserved, it also means you can't use k0 for most instructions. It's a perfectly good normal register, it's just the *encoding* for most vector instructions is reserved, so you can only really use it in the other mask-register instructions - as a temporary or things like that.
Annoyingly, some assemblers allow you to use the "{k0}" syntax, which is technically illegal. Because again - the instruction doesn't read k0! They should produce an error, but they don't.
Would you ever do a performance comparison between AVX-512 and AVX2? AVX-512 is notorious for downclocking due to the heat generated. I believe it only happens when using the 512-bit registers, so AVX-512 instructions on 256-bit registers don't have that problem? Not sure, but it'd be great to see an investigation (both the instructions and the extra registers - if we can still use the full 32 256-bit registers at full speed, AVX-512 would still be worth it IMO).
Great topic! I'm not sure if we'll cover it. AVX512 runs at half the speed for floating point, that's even before the downclocking. I didn't check all the instructions, and the broadcasting and masks still add a lot of flexibility. But, certainly at the moment, it seems like AVX512 is best used for integer operations. Cheers for the suggestion :)
@@WhatsACreel It really depends on the uarch. I don't know much about Ice Lake, but I do know about Skylake Server (SKX), which has had AVX512 for about 3 years now.
On this uarch, you can execute 2 FP instructions per cycle, which is the same as for integer ones. So downclocking aside, AVX512 is twice as fast as AVX2 on this uarch.
Maybe you could do a little micro-benchmark on your machine to see?
17:28 - you're welcome!
Thanks
Okay series, but great fucking drawing. Your dad totally stole the show. The detail is incredible.
Very well taught. Cheers! :)
Wooooow!!
I find it funny that in almost every other language, it's like: you don't have it? figure out on the internet how to make it work, it will eventually work.
On assembly: you don't have it? get a CPU which has it
Could you explain why the AVX512 assembly code works like that, and why we use those registers?
man I love your videos so much :'D
Can Rust target different instruction sets too? I can't really find much about its compiler options, other than people praising the errors it logs lol
I've never heard of a low level mask like kmask before!! I'm a noob so that's not surprising, but the idea of bit level masks really tickled my brain 😊
Great one, dude!
Can you make videos that explain how, for example, loop streaming works, and how one can exploit it to get very fast loops?
I really have no idea how to code with AVX512. The last coding class I took in college was in 2010, which was advanced C++. I am extremely interested in getting back into programming, beginning with emulators.
Hey Creel, a question: How can I create a txt file and read and write to it using assembly?
Coming from Intel 8051: Gee, I can subtract 1 from all the big azz registers. Future is now!
Is there an option with VS to build .exe files which then load the code blocks upon execution depending on the architecture? e.g. You can have procs optimized for AVX512 which run instead if the CPU can handle it and an alternate proc module if not?
You could use the preprocessor for this. It's available in C and C++.
I'm not sure what to do with all this knowledge
I have a question: why use lea when you can use mov with the pointer operator? I don't really remember the syntax, but you know what I mean.
Also, in the instruction vcvtps2dq, I understand everything except the dq. I understand that it is a 4-byte int, but what is dq?
LEA is typically preferred by most compilers. I believe it is because it doesn't update the flags, so it avoids creating dependency conflicts between instructions when the CPU wants to do some reordering.
I can't remember what the code was doing? LEA makes a pointer, MOV just moves the data. It's possible you can use the data directly in a SIMD instruction, it's usually only the final operand that can be memory, but check the manual if you reckon there might be a faster way than my instructions. It's certainly possible I just did something stupid :)
As for the DQ, I have to admit, I have no idea!! Haha, as far as I know, it means Double Quadword, so it refers to the 128 bits of an SSE register. I'm not sure what that's got to do with integers though?
@@WhatsACreel what you did was:
lea rax, myDouble
but in fact, you could have just done:
mov rax, offset myDouble
to just load the address of mydouble, since myDouble is just an alias for [*some number*]
@@gideonmaxmerling204 Oh that's great!! I've never seen that syntax! I've always just used LEA for addresses and MOV for data. Cheers for sharing mate, that's cool :)
@@WhatsACreel thinking about it, you could also have not done the mov instruction and just "bcst [ OFFSET myDouble]".
But considering that myDouble is an alias for [*number*], I think you should try doing "bcst myDouble".
I would test it myself but my cpu only has avx2
Question: why is SIMD still being developed? Isn't it obsoleted by GPUs, or is there something it can do that GPUs can't?
It takes a very long time to send data to the GPU, tell the GPU what to do with it (and only 1 thing for all the data), and get the data back. It is only worth the overhead for very large parallel data sets, that don't need to communicate with the CPU too much. CPU SIMD doesn't require the overhead of using the GPU and is also more flexible.
Great point AAA!
I'd also say that GPU's really are SIMD. The warps in CUDA programming show us that a GPU is really just a 32 way SIMD device, it's not very different from a CPU at all. GPU's are becoming more and more like CPU's while CPU's are becoming more and more GPU like. Maybe they'll meet in the middle at some point?
Just my two cents, cheers for watching mates :)
In my experience, debugging/troubleshooting CPU code is way easier than GPU code. Part of this, I believe, is that no mainstream language has GPU code as a first-class language feature; you always have to install the vendor's SDK or a third-party library. For example, C# got official ARM support before it got official GPGPU support (which is none).
As GPU capabilities grow we might soon ask: why are CPUs developed?
Another reason is that high-level compilers that generate normal x86 code have no idea about the GPU. The GPU must therefore always be specially programmed and accessed via APIs. With SIMD units of the CPU, all you have to do is tell the compiler to use SIMD and then it will optimize your code for it as much as possible without you having to rewrite the code. However, it makes sense to design the code in such a way that it can be easily optimized for SIMD units.
You should sell prints of that koala drawing.
Cool stuff. Too bad my 9900k cannot do any of that XD
Can you finish your Direct2D series?
xor eax, eax
shr ebx, 17
adc eax, 0
just to be more confuzering!
by the way the order of your playlist is reversed. so it goes from part 3 down to 1.
int83886080[3] moon_pos; // 10MB, accurate for 3 million iterations
int2[4] quadrants = {0, 1, 2, 3};
unfloat1024 n; // unsigned, float, unit range (0 to 1)
Man, AVX512 looks so useful. Sad that I won't be able to use it, since almost all the code I write is for an ARMv5TE processor
ARM has NEON though! :P
You can emulate AVX512 in bochs and run bochs on your ARM processor. It will be very slow, but for learning it should be good enough.
@@OpenGL4ever its a console
@@Illya9999 Well, then you're out of luck. As far as I know, on ARM a SIMD unit is only available from a later generation, ARMv6.
@@OpenGL4ever yep
Your camera overlay could be cut a little smaller. But it wasn't in the way of the content, so it was good anyway.
AMD will eventually have to add AVX-512 if it becomes popular with software companies, or else AMD will be screwed
Zen4 will have it afaik