There's a difference between MHz and IPS. Not only do recent processors run faster, they also execute multiple instructions per clock using multiple pipelines (superscalar). Old CPUs didn't even manage one instruction per clock. That only became possible with pipelining.
They do indeed!! And they superscalar these vector instructions too! Really wild performance! Cheers for sharing and watching mate :)
@EramSemperRecta i8088
C Intrinsics are very useful if you want to write high performance functions; The compiler understands them and will manage the register allocation and optimization, which saves you from doing it if you code in asm!
There are still times when you are able to produce faster code than the compiler, even using intrinsics.
It takes a lot of effort, but it's very satisfying when you can gain that extra speed!
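A minimal sketch of what the intrinsics approach looks like in C++, assuming an x86 target with SSE2. The function name `add_floats` is made up for illustration, and the scalar path is only there so the code still builds where SSE2 is not available. The compiler picks the XMM registers and schedules the loads and stores itself.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE/SSE2 intrinsics
#define HAVE_SSE2 1
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for n floats.
void add_floats(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(HAVE_SSE2)
    // Four floats per iteration; unaligned loads, so any pointer works.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail (and full fallback when SSE2 is not enabled).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Calling `add_floats(a, b, out, 6)` handles the first four elements in one vector add and the last two in the scalar tail.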
Now with epic backgrounds! Really enjoying learning some asm, thanks!
He has been travelling the world in this tutorial series and now he's filming while walking up the stairway to heaven
This is one of those videos that makes you truly appreciate what goes on in a CPU and just how far we've come from simple add ax,bx instructions. I started watching this because I wanted to understand SSE instructions needed to solve a Windows CrackMe, I leave with a wondrous appreciation of Assembly and their implementation in the CPU in general. Bravo good Sir!
Thanks heaps mate :) Modern CPU's are really amazing things!! Cheers for watching mate :)
OMG, I have wondered how to use these registers exactly, I've seen them in other's code before, but wasn't confident enough to use them. Great introduction, and now feel better on how to experiment with them. I subscribed over a year ago, and love your videos, I love assembly. Thanks!
WHOA! So cool! The future is here!!!
Too true!! Cheers for watching mate :)
These videos are like well organised lectures in some university. They are short and easy to understand.
Absolutely loving your videos mate, just found you! Keep it coming :D
the demo is marvelous
I remember years ago using intrinsics to make a scale2x - 4x algo that ran 60fps+, only ran at like 10fps using regular x86.
These videos are really great...highly appreciate it ✨ I would want arm64 assembly next or maybe risc-v ⚡
Wow, such a good explanation that a beginner like me on (modern) asm learns easily. Thumbs up
Love your stuff man... you have come up in the world with the graphics. In some ways I like the x87 for its built-in trig functions; I wish SSE and AVX had them, especially for matrix rotation in 3D graphics. AVX512 is awesome, mainly for processing graphics and images and for copying large amounts of data.
Oh sh*t! Stepping up the production! My only critique is letting the bright sun shine into the camera during the segment near 5:27. Other than that, nice! Seeing those register blocks was satisfying!
I took MIMD to an extreme level on my 12-core HT Xeon:
I wanted to search all the seed states of my Z80 pseudo-random number generator so it would produce some desired sequences.
AVX code allowed me to process 2 sets (16 XMM registers was enough to do this) of 16 bytes each on 24 threads to process 768 bytes at a time!
Using 256 bit YMM registers to process a total of 1536 bytes at a time didn't gain much; AVX only has 128 bit integer operations, so the lane manipulations ate most of the gains!
The AVX2 version works on an AVX2 CPU I have, but it's only 4-core, which only processes 256 bytes at a time ... @ 4 GHz instead of 3 GHz on my Xeon.
🎶 Black Hole Sun, won't you come, and wash the rain awaayyyy
RIP Chris Cornell :( Great song!
Your videos are better than a good cheese. And it's a french guy who tells you that !
Thank you so much for these videos!
You're welcome, thanks for watching :)
Just found out that Clang compiler supports x64 inline assembly with variables as operands. All you need to do is download Clang toolkit via Visual Studio Installer and change your platform toolset to LLVM (clang-cl) in project's properties.
6:12
>AVX2
>2016
i7-5820k has AVX2 and the launch was 2014.
1:46
>Pentium 4 @ 3,8Ghz - 2000
3,8Ghz was achieved in the year 2004.
This is so interesting. With a good knowledge of it, how much could you improve a video rendering software, a physics engine, a 3D rendering library or maybe things like a neural network or a trading software?
ASM can do wonders! I'm hoping to release two projects soon showing some really fun tricks! Thanks for watching mate :)
I'm pretty sure you can achieve much greater performance with offloading to a powerful modern gpu nowadays for a lot of the use cases you mentioned.
Fascinating stuff.
Great fun! Cheers for watching mate :)
This is absolutely great. Thanks!!!
Good long work on animation!
Blender is amazing! Cheers for watching mate :)
i came here to see examples of how movaps is used, but i received more knowledge than i expected to, and i loved it. sub'd! :D
just love it ❤
Great content! Is Creel creating the SIMD video? that was also something
Great as always!
MMX was a TOTAL BODGE! It shared the FPU register stack, so you couldn't mix MMX and FPU instructions; there was a big performance penalty to switch modes!
Great video! It is my understanding that G++ and MSVC handle intrinsics differently. Will it be possible at some point in this series of SIMD videos to cover that difference? I appreciate that the series is primarily concerned with assembly (most assembly developers are familiar with the differences between AT&T and Intel syntaxes) but C++ can be valuable when the data is embedded in various classes and /or structs. This can complicate the gathering of that data for passing to external PROCs.
I've never used intrinsics with GNU, so this is new info for me :) I'd certainly like to explore intrinsics at some point. Cheers for watching mate :)
Thanks for these videos! They are awesome! They're extremely helpful for a CS student like me. Do you have any assembly books you would recommend for beginners?
HHahahahahhhaha the graphics crack me up! 😂🤣😂🤣😂🤣😂 Entertaining and extremely educational!! Thank you!
Very informative. Thank you!!!
Would you ever consider tackling the topic of memory fences? Maybe even non temporal moves and also prefetching?
I think all your assembly tutorial videos are fantastic. Your other videos are mostly beyond my capacity to understand, but I still find them highly interesting. But this video made me so dizzy! 😂 Seriously, I've had to stop twice so far, to pause, so I could stare at my wall until it stopped moving.
I'm currently halfway through the video and convinced I must have picked up a can of beer instead of a can of coke.
Keep up the content, it's great, entertaining and very educational, but please... I beg you, no more spinning, lens flares and checkerboards.
OK, that's a 5 minute break; let's see if I can make it to the end without stopping again...
Really great advice :) I wondered this while editing! Thanks for letting me know mate, and thanks for watching :)
I get the feeling that whenever Intel adds some new instructions, they ONLY add the instructions to help with the VERY LIMITED use cases they are thinking of at the time!
For instance, the SSE 2&4 Insert/Extract instructions, which operate on 128 bit XMM registers using constant index; Why not also allow the index to be in a register!?!?!?
Then, when AVX extends 128 bit SIMD to 256 bits, the insert/extract instructions weren't given AVX 256 bit forms to address bytes 0 to 31 etc. !
Also, some of the crazy programs I have written could have benefited greatly from something like BTS YMM0,AX (return the state of a single bit and set it [I used AX there to imply extension to AVX512]) to use for 256 very fast flags that don't require memory read/write!
AVX512 is still missing these capabilities!
I think this all stems from SSE being 128 bits, so all the instructions were implemented within the CPU around that size.
When AVX was being designed, Intel took the easy (cheaper) way out and just added another 128-bit 'lane' without actually doing a decent job by properly extending it!
thanks, great vid!:D
Thanks!
Playlist: ua-cam.com/play/PLKK11Ligqitg9MOX3-0tFT1Rmh3uJp7kA.html
So what letter will they use to prefix the "MM" when they go to 1024 bit registers?
Could you do tutorials on using straight up assembly using nasm and not sticking to Visual Studio and MASM.
Do the xmm and ymm registers accept data only from memory? I see in their instruction sets that source operands are usually memory or other xmm and ymm registers.
Can you explain X86 vs ARM. What's your prediction.
Because vc19 inlines functions extremely well, and you can't create inline assembler functions in x64, it is impossible to beat the “release” mode compiler at anything if you turn its optimisation up to the max. As such, could you make a tutorial on using intrinsics, which can be inlined, to beat a compiler with a real world function?
It's difficult if the function is small. It's not impossible to beat the compiler though. Depending on the algorithm, sometimes it's really easy. I'd like to make some videos on Agner Fog's Vector Class Library. It's similar but cleaner than intrinsics. Hopefully I can finish that soon. Cheers for watching and cheers for this great suggestion :)
@@WhatsACreel Good to get a response - "Cheers, mate!"
wow, you went all out on titles & animation!
PS. Question -- how many Suns do you guys have down under?
Hahaha! It's just the HDRI background! Some pretty wild lens flare :) Trying to up the production quality. Cheers for watching mate!
@@WhatsACreel hey, dude, thanks for making these!
I was hoping to learn the alignment requirements before the end of the video! very informative as a whole though, great job!
for avx2, what are the alignment requirements for reading or writing a ymm register from memory? is it 32 byte alignment?
Yes, alignment is 32 bytes I believe. So, there's VMOVAPS, which moves aligned data, but if your data isn't aligned (or you don't know) then you can use VMOVUPS, move unaligned packed singles. Each of the move instructions has an aligned and an unaligned version: VMOVDQA/VMOVDQU, VMOVAPD/VMOVUPD, etc.
As for the other instructions, the legacy SSE encodings (ADDPS and so on) need a memory operand to be 16-byte aligned. The VEX-encoded AVX forms, like VADDPS or whatever, accept an unaligned memory operand as the second source, though aligned data is still the safer bet for speed. There are no versions that allow memory as the first operand.
I hope this helps! Cheers for watching :)
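A small C++ sketch of the aligned/unaligned split described above, assuming a compiler with AVX enabled (the scalar branch is only a fallback so the code still builds without it). `is_aligned` and `double8` are hypothetical names. `_mm256_load_ps` corresponds to VMOVAPS and needs a 32-byte aligned address; `_mm256_loadu_ps` corresponds to VMOVUPS and does not.

```cpp
#include <cassert>
#include <cstdint>
#if defined(__AVX__)
#include <immintrin.h>  // AVX intrinsics
#endif

// True if p is a multiple of alignment (alignment must be a power of two).
bool is_aligned(const void* p, std::uintptr_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: doubles eight consecutive floats in place.
void double8(float* data) {
#if defined(__AVX__)
    const bool aligned = is_aligned(data, 32);
    // Aligned form faults on a misaligned address, so only use it when safe.
    __m256 v = aligned ? _mm256_load_ps(data) : _mm256_loadu_ps(data);
    v = _mm256_add_ps(v, v);
    if (aligned) {
        _mm256_store_ps(data, v);
    } else {
        _mm256_storeu_ps(data, v);
    }
#else
    for (int i = 0; i < 8; ++i) {
        data[i] += data[i];
    }
#endif
}
```

In practice many people simply use the unaligned form everywhere; the branch here is only to show both instructions side by side. Declaring a buffer as `alignas(32) float buf[16];` is one way to get an address the aligned form accepts.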
Very good! Is there a way to check if the cpu has avx or avx512 support in asm? Like ifdef avx512 -> do 512 bit register operations and if not do the present smaller ones. Like a generic way to write the assembly. I am new to asm so this might just be nonsense.
Not nonsense at all! There certainly is a way to check which instruction sets are available from ASM! It's a special instruction called "CPUID". We put a function number into EAX, then call CPUID, and it returns information in EAX, EBX, ECX and EDX. All of the information is encoded. You need the AMD or Intel manuals to check what the bits mean. But there's loads of information, including which instructions the CPU supports! I would love to make a video on this topic. In the meantime, you could google the CPUID instruction. It's pretty dense, but good stuff!
Well, hope this helps, thanks for watching :)
Sorry, I remember I did a CPUID video: ua-cam.com/video/p5X1Sf5ejCc/v-deo.html
It's old, so it won't go into AVX512. Would be great to make an update. Anywho, cheers for watching :)
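For anyone who wants to try the CPUID check from C++ rather than ASM, here is a rough sketch. It assumes GCC or Clang on x86, where `<cpuid.h>` provides `__get_cpuid` and `__get_cpuid_count`; on other toolchains it just reports nothing. `SimdFlags`, `bit_set` and `query_simd_flags` are made-up names. The bit positions used are leaf 1 ECX bit 28 for AVX, and leaf 7 (sub-leaf 0) EBX bits 5 and 16 for AVX2 and AVX512F. These bits only say what the CPU supports; a complete check also has to confirm that the OS saves the wider registers (OSXSAVE and XGETBV), which is left out here.

```cpp
#include <cassert>
#include <cstdint>
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>  // GCC/Clang wrappers around the CPUID instruction
#define HAVE_CPUID_H 1
#endif

// Hypothetical result type for this sketch.
struct SimdFlags {
    bool avx;
    bool avx2;
    bool avx512f;
};

// True if the given bit of value is 1.
bool bit_set(std::uint32_t value, unsigned bit) {
    return ((value >> bit) & 1u) != 0;
}

SimdFlags query_simd_flags() {
    SimdFlags flags = {false, false, false};
#if defined(HAVE_CPUID_H)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    // Function (leaf) 1: basic feature bits come back in ECX and EDX.
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        flags.avx = bit_set(ecx, 28);
    }
    // Function (leaf) 7, sub-leaf 0: extended feature bits in EBX.
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        flags.avx2 = bit_set(ebx, 5);
        flags.avx512f = bit_set(ebx, 16);
    }
#endif
    return flags;
}
```

A program could call `query_simd_flags()` once at startup and pick the 512-bit, 256-bit or 128-bit code path from the result.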
Thank you for the video!
Is there a way to do horizontal sums on floating point using AVX2?
vhaddps repeated 3 times works.
Like Explosionist mentioned, there's no way to do this in one instruction. You can combine a few horizontal adds together though. You have to include an extra VPERM2F128 to get the upper and lower halves to add. So you VPERM2F128 to some other tmp register, then add those halves, then horizontal add. Hope this helps, cheers for watching :)
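The sequence described above can be sketched with intrinsics roughly like this, assuming AVX is enabled at compile time (otherwise the scalar fallback is used). `hsum8` is a hypothetical name. `_mm256_permute2f128_ps` is the intrinsic for VPERM2F128 and `_mm256_hadd_ps` is VHADDPS.

```cpp
#include <cassert>
#if defined(__AVX__)
#include <immintrin.h>  // AVX intrinsics
#endif

// Hypothetical helper: returns the sum of eight floats starting at p.
float hsum8(const float* p) {
#if defined(__AVX__)
    __m256 v = _mm256_loadu_ps(p);
    // Swap the upper and lower 128-bit halves into a temporary.
    __m256 swapped = _mm256_permute2f128_ps(v, v, 0x01);
    // Each lane now holds a0+a4, a1+a5, a2+a6, a3+a7.
    __m256 s = _mm256_add_ps(v, swapped);
    // Two horizontal adds collapse the four partial sums into one.
    s = _mm256_hadd_ps(s, s);
    s = _mm256_hadd_ps(s, s);
    // The total sits in every element; read element 0.
    return _mm_cvtss_f32(_mm256_castps256_ps128(s));
#else
    float total = 0.0f;
    for (int i = 0; i < 8; ++i) {
        total += p[i];
    }
    return total;
#endif
}
```

The cross-lane step is needed because VHADDPS on a YMM register only adds neighbours within each 128-bit half.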
I have a question about assembly. If I write some MASM code that runs on Intel Core processor, can I compile it on amd ryzen processor as well ? Or are there some differences?
Yes, Intel and AMD CPU's are both x86/x64. They run the same Assembly code! No need for a recompile or anything. Cheers for watching mate :)
@@WhatsACreel Thank you for the response.
Another question: I have a std::list of structs . They are sorted according to an uint16_t field. I am trying to find a faster algorithm than std::upper_bound(). Upper_bound takes significantly longer to find the insertion point in a std::list than it does with std::vector. The number of structs is generally 36863. If none found, then repeat for the next 8 structs. Does this sound doable, given the overhead of gathering the fields for packing etc.?
A std::list is a doubly linked list, so reaching any element other than the front or back takes linear time, O(n), while std::vector gives constant-time, O(1), random access. That is why std::upper_bound is slower on the list: it still does O(log n) comparisons, but it has to walk the list node by node to get to each midpoint. Accessing the first or last element is O(1) for both data structures.
Using a simple binary search on a sorted std::vector to find the struct, and then exchanging, swapping or replacing its values in O(1), might work. You won't get faster than O(log n) for the search, which is what this kind of solution offers.
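A minimal sketch of the std::vector version, assuming the structs are kept sorted by their uint16_t field. `Item`, its fields and `insertion_point` are made-up names, since the original struct isn't shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical struct; only the sort key matters for the search.
struct Item {
    std::uint16_t key;
    int payload;
};

// Index of the first element whose key is greater than `key`,
// i.e. where a new element with that key would be inserted.
// `items` must already be sorted by key.
std::size_t insertion_point(const std::vector<Item>& items, std::uint16_t key) {
    auto it = std::upper_bound(
        items.begin(), items.end(), key,
        [](std::uint16_t k, const Item& item) { return k < item.key; });
    return static_cast<std::size_t>(it - items.begin());
}
```

With random-access iterators the search takes O(log n) steps; an actual `items.insert(...)` at that position still has to shift the elements after it.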
Creel, any reason you don't use syntax highlighting? Or did you get a new computer or something and not re-install AsmDude?
Nevermind! In the next video Creel has syntax highlighting! "ASM Dude VS 2017 Extension"
Chris man, jumping ahead here, but I have been playing with AVX YMM registers. How would one align to 32? ALIGN only allows 16. It keeps crashing with an access violation on test in Visual Studio.
Would I need to set up a separate data segment?
Found this but still unsure.
JUNK SEGMENT PAGE 'DATA'
test_ymm real4 8 dup(8.8)
JUNK ENDS
Cheers pal
I usually use _aligned_malloc. It's only available in Windows C++. You can also allocate additional padding (alignment + 4), then add some amount to your allocated pointer to make it aligned, and record the amount you added in the int just before the pointer. Hold on, I wrote some code. Not sure it is 100%, but you probs get the idea. You can do the same in ASM too if you're not in C++.
The code is largely untested. Just wrote it this morning in thinking about your comment, so do be careful with it. Hope this helps mate, have a good one :)
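The code from that reply didn't survive in this thread, so the following is not the original, just a rough sketch of the padding idea as described: over-allocate by the alignment plus four bytes, round the pointer up, and store the offset in the four bytes just before the returned pointer. The names `padded_aligned_alloc` and `padded_aligned_free` are hypothetical, and the alignment is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Returns a pointer aligned to `alignment` (a power of two), or nullptr.
void* padded_aligned_alloc(std::size_t size, std::size_t alignment) {
    const std::size_t header = sizeof(std::uint32_t);
    unsigned char* raw =
        static_cast<unsigned char*>(std::malloc(size + alignment + header));
    if (raw == nullptr) {
        return nullptr;
    }
    // Leave room for the header, then round up to the next aligned address.
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + header;
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    const std::uintptr_t aligned = (start + mask) & ~mask;
    const std::uint32_t offset = static_cast<std::uint32_t>(
        aligned - reinterpret_cast<std::uintptr_t>(raw));
    unsigned char* user = raw + offset;
    // Record how far we moved, in the 4 bytes just before the user pointer.
    std::memcpy(user - header, &offset, header);
    return user;
}

// Frees a pointer returned by padded_aligned_alloc.
void padded_aligned_free(void* p) {
    if (p == nullptr) {
        return;
    }
    unsigned char* user = static_cast<unsigned char*>(p);
    std::uint32_t offset = 0;
    std::memcpy(&offset, user - sizeof(offset), sizeof(offset));
    std::free(user - offset);
}
```

For YMM data you would call `padded_aligned_alloc(bytes, 32)` and release the block with `padded_aligned_free`; mixing it with plain `free` would corrupt the heap.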
@@WhatsACreel UPDATED: Thanks mate yeah doing it from ASM ill try the mentioned above with padding and note it, both sides of the coin MASM and c++. Funnily enough this seems to work with the align directive, testing here:
JUNK32 SEGMENT ALIGN(32) ".data"
ALIGN 32 ;now accepted
test_ymm real4 8 dup(8.8)
JUNK32 ENDS
Can now align by 32 once the segment is on a new paragraph boundary. Before, in the .DATA segment, it would give ERROR:
"invalid combination with segment alignment" and only allowed max of 16
It now compiles with 32.
cheers ill keep at it.
Hi, dude, please keep the updates coming. Thanks from China.
OK, MMX is old, obsolete ... but is it supported on todays CPUs? Can you still use it in asm?
Vectors
Waaaay too many ads ...
the moving background is extremely distracting; otherwise great video
My dude, the tutorials are awesome! However CG makes this one completely incomprehensible, hard to follow and distracting. Please, consider using plain old slides. Cheers!