Thanks to you I now know assembly better than, say, most other programming languages xD
Hands down one of the best instructors on YouTube. Just plowed through the whole series and it was great. I took over 1000 lines of notes and I feel like I have a way better understanding of assembly than I did before. I hope to see it keep going!
Cheers mate! Thank you for watching :)
@@WhatsACreel do you have linkedin? I'd love to connect
Just marathoned this whole series in 2 days, amazing work. Sending some money your way!
Ha! Thank you my friend :)
This series was so awesome! Absolutely fantastic! You are a talented teacher! And a gentleman and scholar besides! =D
I learned assembler for several Motorola chips and several Intel chips in the mid 80's and early 90's. I've always wanted to create a computer language of my own design for PCs (Windows, Linux, and OS X). It's a pipe dream, but recently I've been thinking about it again. I don't want to use llvm or gcc or anything; I want to start by creating a little mini home-spun assembler and then just slowly add a piece at a time. Build up the language from there.
The instruction set reference alone for x86-64 is over 2500 pages long! I don't need nearly all those instructions for my purposes! Only a tiny subset. But jeepers, how do people even come up with compilers for that!? I guess llvm, gcc, nasm and masm all somehow manage to produce x64 machine code. It seems like they must have had scores of programmers working for years! I learned about compilers and linkers and such in school, but that was in the early 90's. Things have changed a bit since the 80386! Holy pee! And my memory isn't exactly eidetic.
Anyway, somehow, I want to roll my own little assembler. And roll my own linker for calling external C libraries, too, if I can manage it. But it's hard to figure out where to even start. I guess the first thing is to find out about the .exe format, and then maybe do some disassembly of some C or C++ code, see if I can find the instructions in the .exe, and then manually change them to some different ones that I find in the reference manual... and just kind of plink away at it from there.
Need to mention that AVX and especially AVX-512 instructions can throttle the CPU frequency, so for some CPUs it's faster to use several SSE multiplies instead
You filled a huge technical gap for me, thanks a million! Brilliant series; please keep it up.
Would love another video showing an example problem with operation masks to go branchless!
Awesome! You got an incredible amount of information into 23 minutes. Well done you.
Cheers mate! Glad you liked it :)
Pure Quality, great vid man
Thank you for stopping by, Herr Ste :)
Well, I just bought my first computer and turned it on no problem. I think this video filled me in with all I need. Just you didn't cover what an email is.
Hahaha :)
Yay new video!
Yay! Thank you for watching :)
Thank you for this series!
I would love to see you do a series on OpenCL
plz I need the tutorial
What a great video again.
Would be good a video showing some small asm task which is faster than doing a corresponding C++ version. So an example of some simple task C++ compiler does slower than your asm code.
Playlist: ua-cam.com/play/PLKK11Ligqitg9MOX3-0tFT1Rmh3uJp7kA.html
Cheers, you're the best!
In a stroke of massive autism, I once decided to write wrappers (static inline header functions) for all the intrinsics, with better names that match the original asm names, without all these ugly underscores and prefixes. I also used function overloading in a lot of places, and templates instead of the horrible _MM_SHUFFLE macro: m128 x = shufps(y, z); If anyone is interested, I can maybe go over the code once more and upload it somewhere. I used it with g++ and I am not so sure it would work with other compilers as it is right now.
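A minimal sketch of the kind of wrapper described above, assuming SSE intrinsics from <xmmintrin.h>; the alias m128, the template signature, and the helper names are illustrative, not the commenter's actual code:

```cpp
#include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_shuffle_ps, _MM_SHUFFLE

using m128 = __m128;     // drop the ugly underscores

// Template parameters replace the _MM_SHUFFLE macro; they have to be
// compile-time constants because SHUFPS takes an 8-bit immediate.
template <int a3, int a2, int a1, int a0>
static inline m128 shufps(m128 x, m128 y) {
    return _mm_shuffle_ps(x, y, _MM_SHUFFLE(a3, a2, a1, a0));
}

// Wrapper named after the plain asm mnemonic.
static inline m128 addps(m128 x, m128 y) { return _mm_add_ps(x, y); }
```

Usage would then look something like `m128 x = shufps<3, 2, 1, 0>(y, z);`, which matches the commenter's note that it was built and tested with g++.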
Ha! This sounds amazing! Good on you :)
@@WhatsACreel github.com/asdfjkloe/simd
@@WhatsACreel github dot com slash asdfjkloe slash simd (can not post links directly here)
@@INT41O github.com/asdfjkloe/simd
Yo dude, your videos are a blessing. Been watching them the past few weeks. Question tho, are you gonna explore other ISAs like RISC-V or AVR?
The PIC asm is cool too! I don't have a programmer though, so I am not sure :( I did have Atmel Studio and MPLAB installed at one point - they had some great emulators, from memory... But it's been a while. Certainly an excellent suggestion! But cheers for watching :)
@@WhatsACreel All good man! Like, for real, before I found your channel I had zero idea where I wanted to go with my knowledge of ASM, especially since I was trained on old asf microprocessors like the venerable 8086.
Nice video dude. One correction though. There are 32 YMM registers that were added to AVX. So you go from YMM0 to YMM31. And they also expanded XMM0 thru XMM31.
They did indeed!! But only in AVX512. I wanted to save AVX512 because of those changes, the mask registers and the new instructions. I really hope AMD adopts it. It is a crazy awesome instruction set!! Anywho, thank you for the info, and thank you for watching :)
Does anyone have a book recommendation on x64 Assembly to actually turn the introductory knowledge from this series into something more advanced?
Excellent video, thanks. You really remind me of Vincent Schiavelli btw...
Ha! The guy is a legend! Looks a little like a Zombie, but I'll take it! Thanks for watching :)
It is safe to assume that x64 machines do have SSE2. It is 20 years old, even Windows 8 and later versions require it, and if I remember correctly Microsoft's compiler always assumes SSE2 to be available when compiling for x64, but not for x86. According to the 2020 Steam HW survey, 100% of users had SSE2 and SSE3 available.
edit: An SSE2-capable processor is always assumed by Microsoft's compiler for x64; for x86, an SSE2-capable processor is assumed by default.
That's good to know! I think the adoption of AVX and AVX2 has been fairly universal too. It's got to be around 90% by now? Anywho, cheers for the info, and cheers for watching :)
@@WhatsACreel Yes. AVX adoption is already quite high, 92% according to the same survey, but AVX2 not yet; it was at 76%.
x64 MUST have SSE2 support. It's in the specs. Floating point numbers must be passed in SSE registers according to the x64 ABI.
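As a side note on the feature-detection question above, here is a minimal sketch of checking for SSE2/AVX/AVX2 at runtime using MSVC's __cpuid/__cpuidex intrinsics (bit positions are from Intel's CPUID documentation; a complete AVX/AVX2 check would also verify OSXSAVE via XGETBV, which is omitted here, and GCC/Clang users would reach for __builtin_cpu_supports instead):

```cpp
#include <intrin.h>    // __cpuid, __cpuidex (MSVC)
#include <iostream>

int main() {
    int r[4];                          // r[0]=EAX, r[1]=EBX, r[2]=ECX, r[3]=EDX
    __cpuid(r, 1);                     // leaf 1: basic feature flags
    bool sse2 = (r[3] >> 26) & 1;      // EDX bit 26
    bool avx  = (r[2] >> 28) & 1;      // ECX bit 28 (OS support check omitted)
    __cpuidex(r, 7, 0);                // leaf 7, sub-leaf 0: extended features
    bool avx2 = (r[1] >> 5) & 1;       // EBX bit 5
    std::cout << "SSE2 " << sse2 << ", AVX " << avx << ", AVX2 " << avx2 << '\n';
    return 0;
}
```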
Really amazing work, indeed. I love the way you explain things. I also enjoyed the quantum computing introduction. You could also do an assembly video with a quick one-qubit gate calculation, using the power of AVX :D
Quantum SIMD??? :D
Great, great videos, thanks!!!
I was wondering, why did they add so many SIMD registers, 32 for AVX-512? What's the use of having so many? Can the CPU process more than one register per cycle (superscalar SIMD)?
Great vid, as always! Is there a performance difference with aligned vs unaligned data in AVX2?
Thanks for the care you put into your videos!
What's the difference between this playlist and the other x64 assembly/c++ playlist you have?
Thanks Creel!
Could you discuss integer SIMD when you come back around to this topic? Also, are there vector x scalar instructions, so you can do something like multiply all members of a vector by a single scalar, or do you just have to pack a vector with copies of the same scalar?
I'd like to cover integers, yes! They're more fiddly since there's more types, signed/unsigned, there's no division, and the multiplication is complicated. Some really amazing instructions though!! So I'm hoping to cover them shortly.
The scalar operations only work on the lowest elements, so we have to broadcast the scalar if we want it to affect all elements.
Cheers for the suggestion, and cheers for watching :)
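To illustrate the broadcast idea from the reply above, a minimal AVX sketch (the helper name scale is just for illustration):

```cpp
#include <immintrin.h>

// Multiply every element of v by the scalar s: broadcast s into all
// 8 lanes first (VBROADCASTSS), then do a packed multiply (VMULPS).
static inline __m256 scale(__m256 v, float s) {
    __m256 sv = _mm256_set1_ps(s);
    return _mm256_mul_ps(v, sv);
}
```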
Is this lesson the latest 🧐 because I will download the whole playlist 😍
Cheers for the videos man! Just binged this series in a day or two. Just wondering - how did you learn all this? I remember you saying you studied music - did you do something computer science related as well? Really curious as I'm studying EE but really interested in this sort of stuff as well! (instead of just ARM and microcontroller programming etc.)
What happens when you divide by zero with SIMD instructions? An interrupt?
It sets the result to infinity (or NaN for 0/0) and continues without error! Cheeky, cheeky, IEEE 754! :)
@@WhatsACreel Lol, sneaky!
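A tiny sketch of that behaviour with SSE intrinsics; by default the floating-point exceptions are masked, so nothing traps:

```cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    __m128 num = _mm_set_ps(0.0f, -1.0f, 2.0f, 1.0f);  // lanes 0..3: 1, 2, -1, 0
    __m128 den = _mm_setzero_ps();
    __m128 q   = _mm_div_ps(num, den);                  // DIVPS by zero: no trap
    float out[4];
    _mm_storeu_ps(out, q);
    std::printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);  // inf, inf, -inf, nan
    return 0;
}
```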
Also, I wanted to see if I could put these instructions to use in C++ by trying to make a simple "vector3" class with a member function to add vector3's together. But I'm starting to think someone has done this already. Does anyone else here know of some sort of implementation I could look at?
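For anyone curious, a minimal, hypothetical sketch of what such a class could look like with SSE (real libraries such as DirectXMath do this far more thoroughly):

```cpp
#include <immintrin.h>

// Hypothetical vector3 backed by one SSE register; the 4th lane is padding
// so the packed-single instructions can be used directly.
struct vector3 {
    __m128 v;
    vector3(float x, float y, float z) : v(_mm_set_ps(0.0f, z, y, x)) {}
    explicit vector3(__m128 r) : v(r) {}
    vector3 operator+(const vector3& o) const {
        return vector3(_mm_add_ps(v, o.v));   // ADDPS adds all lanes at once
    }
};
```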
Is there any gain in switching from intrinsics to asm? I've noticed a small gain switching from VCL to intrinsics (around 12-15 sec on a 7-minute run).
There is, yes. There’s benefits to both. For this particular video, we would see much better speed from the intrinsics. Really, if we jump into ASM, we want to stay there for as long as possible, since the switching between C++ and ASM can be very expensive due to saving and restoring registers and setting up stack frames.
But the benefits of ASM are that we get to control the registers, for when not even the L1 cache is fast enough! We can define our own calling conventions which return multiple data or recurse indefinitely, use any instructions we want whether there's intrinsics or not, or even commandeer RBP and RSP if we're running out of registers. We also have access to very low level techniques, like self-modifying code. You could get all 12 cores to write machine code for each other while they execute in a beautiful symphony of seg faults and race conditions!
Haha, that was from an answer I wrote for Quora once. It’s true though. Modern Assembly is a monstrous language with ridiculous possibilities. Hopefully we can explore some of these techniques at some point.
Anywho, cheers for watching mate :)
@@WhatsACreel hmmm, very interesting, I thought L1 was the absolute limit (or that was my target, to be L1 bound), now I understand I can go even further, to zero, thank you!!
Been watching the whole series in 2024 and it's still great! I need the tutorial for setting up Visual Studio 2022 though.. the UI for writing assembly in an empty C++ project is different and I still can't figure it out 🤕
The first video, on how to set it up, still works.
If the zmm registers are for 512 bits, then what is going to be used for 1024 bits? Are they just gonna make it overflow and start using amm, then bmm, cmm etc... :D
Or they'll do what Arm and RISC-V are doing: set aside fixed-width SIMD altogether and add variable-length vector registers.
Hello! I have a question
When dealing with vectors, can I simply use MOVSS instead of MOVAPS? Or would something bad happen?
You can use it, but MOVSS only moves one single. That's the mnemonic for Move Scalar Single. It moves 32 bits into the lowest element of the first operand. The PS version is the packed one. Hope this helps, and cheers for watching :)
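A small sketch of that difference, using the intrinsics that map to those instructions (the function name demo is just illustrative):

```cpp
#include <immintrin.h>

void demo() {
    alignas(16) float data[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    __m128 one = _mm_load_ss(data);   // MOVSS:  { 1, 0, 0, 0 } - lowest element only
    __m128 all = _mm_load_ps(data);   // MOVAPS: { 1, 2, 3, 4 } - needs 16-byte alignment
    (void)one; (void)all;
}
```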
I was thinking you were going to show how to do it with lower level asm instead of having the specialized hardware do it
Not sure I understand mate. These instructions are as low level as it gets! They are single instructions for the CPU. They don't break into smaller ASM or anything. There's no lower level exposed to programmers than single Assembly instructions. They perform multiple operations at once, but that's just how SIMD works, that's what makes modern CPU's sooo powerful! I hope this clears it up a little, and cheers for watching :)
@@WhatsACreel I guess what I could say is I need to know how to do anything you can think of with floating point numbers on paper but only in a binary representation.
Oh, ok! Yes, the IEEE754 standard is probs worth looking up! I did a video series a long time ago on that, but I think other folks have since released much better ones. Or you might be interested in fixed point? It’s not floating point, but it’s very interesting and sometimes extremely fast!! Anywho, thank you for watching and good luck finding the info you're looking for :)
+1
yo
Sup :)
I didn't know Nicolas Cage was into science :)
What a mess modern x64 assembler is 💩
Definitely a different world
Certainly a lot of instructions! Thousands!! Cheers for watching :)