Modern x64 Assembly 15: Introduction to SIMD

Creel

Published 25 Nov 2024

COMMENTS • 91

@Libertoso 2 years ago ⁺⁴
He taught Assembly in the middle of the jungle, at the beach, and now from inside the matrix itself
@MagnetLoop 3 years ago ⁺²
I like the graphics in this video. It looks like the host is telling us about SIMD and at the same time flexing its power with the graphics.
@cmuller1441 4 years ago ⁺²²
There's a difference between MHz and IPS. Not only do recent processors run faster, they also execute multiple instructions per clock using multiple pipelines (superscalar). Old CPUs didn't even do one instruction per clock. That was only possible with pipelining.
@WhatsACreel 4 years ago ⁺⁴
They do indeed!! And they superscalar these vector instructions too! Really wild performance! Cheers for sharing and watching mate :)
@MA-nx3xj 4 years ago
@EramSemperRecta i8088
@Lord-Sméagol A month ago
C Intrinsics are very useful if you want to write high performance functions; The compiler understands them and will manage the register allocation and optimization, which saves you from doing it if you code in asm!
There are still times when you are able to produce faster code than the compiler, even using intrinsics.
It takes a lot of effort, but it's very satisfying when you can gain that extra speed!
@cellularmitosis2 4 years ago ⁺⁵
Now with epic backgrounds! Really enjoying learning some asm, thanks!
@Aerazar 2 years ago
He has been travelling the world in this tutorial series and now he's filming while walking up the stairway to heaven
@EvilSapphireR 4 years ago ⁺¹
This is one of those videos that makes you truly appreciate what goes on in a CPU and just how far we've come from simple add ax,bx instructions. I started watching this because I wanted to understand the SSE instructions needed to solve a Windows CrackMe; I leave with a wondrous appreciation of Assembly and its implementation in the CPU in general. Bravo good Sir!
@WhatsACreel 4 years ago
Thanks heaps mate :) Modern CPUs are really amazing things!! Cheers for watching mate :)
@tombranson9341 4 years ago ⁺²
OMG, I have wondered how to use these registers exactly. I've seen them in others' code before, but wasn't confident enough to use them. Great introduction, and now I feel better about how to experiment with them. I subscribed over a year ago, and love your videos, I love assembly. Thanks!
@LukeAvedon 4 years ago ⁺³
WHOA! So cool! The future is here!!!
@WhatsACreel 4 years ago ⁺¹
Too true!! Cheers for watching mate :)
@davitberishvili8062 3 years ago
These videos are like well organised lectures in some university. They are short and easy to understand.
@itsjustdel 3 years ago ⁺¹
Absolutely loving your videos mate, just found you! Keep it coming :D
@fgtdjkg A year ago
the demo is marvelous
@frognik79 4 years ago
I remember years ago using intrinsics to make a scale2x - 4x algo that ran 60fps+, only ran at like 10fps using regular x86.
@md.jannatulnayem4328 2 years ago
These videos are really great...highly appreciate it ✨ I would want arm64 assembly next or maybe risc-v ⚡
@jiifox3245 3 years ago
Wow, such a good explanation that a beginner like me on (modern) asm learns easily. Thumbs up
@lukehanscom482 4 years ago ⁺¹
Love your stuff man.. you have come up in the world with the graphics.. in some ways I like the x87 for its built-in trig functions, wish SSE/AVX had them, especially for matrix rotation in 3D graphics.. AVX512 is awesome, mainly for processing graphics and images and for copying large amounts of data
@Alex-op2kc 3 years ago
Oh sh*t! Stepping up the production! My only critique is letting the bright sun shine into the camera during the segment near 5:27. Other than that, nice! Seeing those register blocks was satisfying!
@Lord-Sméagol A month ago
I took MIMD to an extreme level on my 12-core HT Xeon:
I wanted to search all the seed states of my Z80 pseudo-random number generator so it would produce some desired sequences.
AVX code allowed me to process 2 sets (16 XMM registers was enough to do this) of 16 bytes each on 24 threads to process 768 bytes at a time!
Using 256 bit YMM registers to process a total of 1536 bytes at a time didn't gain much; AVX only has 128 bit integer operations, so the lane manipulations ate most of the gains!
The AVX2 version works on an AVX2 CPU I have, but it's only 4-core, which only processes 256 bytes at a time ... @ 4 GHz instead of 3 GHz on my Xeon.
@danielgawedzki3425 4 years ago ⁺¹
🎶 Black Hole Sun, won't you come, and wash the rain awaayyyy
@WhatsACreel 4 years ago
RIP Chris Cornell :( Great song!
@awazin4031 3 years ago
Your videos are better than a good cheese. And it's a French guy who tells you that!
@E7ite 4 years ago ⁺¹
Thank you so much for these videos!
@WhatsACreel 4 years ago
You're welcome, thanks for watching :)
@bigtemp9697 4 years ago
Just found out that the Clang compiler supports x64 inline assembly with variables as operands. All you need to do is download the Clang toolkit via the Visual Studio Installer and change your platform toolset to LLVM (clang-cl) in the project's properties.
@ecchichanf 4 years ago ⁺¹
6:12
>AVX2
>2016
i7-5820k has AVX2 and the launch was 2014.
1:46
>Pentium 4 @ 3,8Ghz - 2000
3,8Ghz was achieved in the year 2004.
@azertyuiop7893 4 years ago ⁺¹²
This is so interesting. With a good knowledge of it, how much could you improve video rendering software, a physics engine, a 3D rendering library or maybe things like a neural network or trading software?
@WhatsACreel 4 years ago ⁺¹⁶
ASM can do wonders! I'm hoping to release two projects soon showing some really fun tricks! Thanks for watching mate :)
@sociocritical 4 years ago ⁺¹
I'm pretty sure you can achieve much greater performance by offloading to a powerful modern GPU nowadays for a lot of the use cases you mentioned.
@NeilRoy 4 years ago ⁺¹
Fascinating stuff.
@WhatsACreel 4 years ago ⁺¹
Great fun! Cheers for watching mate :)
@epimenide9i 3 years ago
This is absolutely great. Thanks!!!
@him21016 4 years ago ⁺²
Good long work on animation!
@WhatsACreel 4 years ago ⁺¹
Blender is amazing! Cheers for watching mate :)
@jeremyng1021 2 years ago
I came here to see examples of how movaps is used, but I received more knowledge than I expected to, and I loved it. sub'd! :D
@md.jannatulnayem4328 A year ago
just love it ❤
@dagonmeister 2 years ago
Great content! Is Creel creating the SIMD video? That was also something
@FranciscoCrespoOM 4 years ago
Great as always!
@Lord-Sméagol A month ago
MMX was a TOTAL BODGE! It shared the FPU register stack, so you couldn't mix MMX and FPU instructions; there was a big performance penalty to switch modes!
@willofirony 4 years ago ⁺³
Great video! It is my understanding that G++ and MSVC handle intrinsics differently. Will it be possible at some point in this series of SIMD videos to cover that difference? I appreciate that the series is primarily concerned with assembly (most assembly developers are familiar with the differences between AT&T and Intel syntaxes) but C++ can be valuable when the data is embedded in various classes and/or structs. This can complicate the gathering of that data for passing to external PROCs.
@WhatsACreel 4 years ago
I've never used intrinsics with GNU, so this is new info for me :) I'd certainly like to explore intrinsics at some point. Cheers for watching mate :)
@ronaldskorobogat3152 4 years ago ⁺²
Thanks for these videos! They are awesome! They're extremely helpful for a CS student like me. Do you have any assembly books you would recommend for beginners?
@gert-janvanderkamp3508 4 years ago ⁺¹
HHahahahahhhaha the graphics crack me up! 😂🤣😂🤣😂🤣😂 Entertaining and extremely educational!! Thank you!
@FreeenergyDan 4 years ago
Very informative. Thank you!!!
@alienrenders 3 years ago
Would you ever consider tackling the topic of memory fences? Maybe even non temporal moves and also prefetching?
@codersg 4 years ago ⁺²
I think all your assembly tutorial videos are fantastic. Your other videos are mostly beyond my capacity to understand, but I still find them highly interesting. But this video made me so dizzy! 😂 Seriously, I've had to stop twice so far, to pause, so I could stare at my wall until it stopped moving.
I'm currently halfway through the video and convinced I must have picked up a can of beer instead of a can of coke.
Keep up the content, it's great, entertaining and very educational, but please... I beg you, no more spinning, lens flares and checkerboards.
OK, that's a 5 minute break; let's see if I can make it to the end without stopping again...
@WhatsACreel 4 years ago ⁺¹
Really great advice :) I wondered this while editing! Thanks for letting me know mate, and thanks for watching :)
@Lord-Sméagol A month ago
I get the feeling that whenever Intel adds some new instructions, they ONLY add the instructions to help with the VERY LIMITED use cases they are thinking of at the time!
For instance, the SSE 2&4 Insert/Extract instructions, which operate on 128 bit XMM registers using constant index; Why not also allow the index to be in a register!?!?!?
Then, when AVX extends 128 bit SIMD to 256 bits, the insert/extract instructions weren't given AVX 256 bit forms to address bytes 0 to 31 etc. !
Also, some of the crazy programs I have written could have benefited greatly from something like BTS YMM0,AX (return the state of a single bit and set it [I used AX there to imply extension to AVX512]) to use for 256 very fast flags that don't require memory read/write!
AVX512 is still missing these capabilities!
I think this all stems from SSE being 128 bits, so all the instructions were implemented within the CPU around that size.
When AVX was being designed, Intel took the easy (cheaper) way out and just added another 128-bit 'lane' without actually doing a decent job by properly extending it!
@famouz5880 3 years ago
thanks, great vid!:D
@diegonayalazo 3 years ago
Thanks!
@Alex-op2kc 3 years ago
Playlist: ua-cam.com/play/PLKK11Ligqitg9MOX3-0tFT1Rmh3uJp7kA.html
@TheGuyThatEveryoneIgnores 4 years ago
So what letter will they use to prefix the "MM" when they go to 1024 bit registers?
@s1nister688 4 years ago
Could you do tutorials on using straight up assembly using nasm and not sticking to Visual Studio and MASM?
@sagivalia5041 2 years ago
Do the xmm and ymm registers accept data only from memory? I see in their instruction sets that source operands are usually memory or other xmm and ymm registers
@DB-nl9xw 4 years ago
Can you explain x86 vs ARM? What's your prediction?
@him21016 4 years ago ⁺¹
Because vc19 inlines functions extremely well, and you cannot write inline assembler functions in x64, it is impossible to beat the “release” mode compiler at anything if you turn its optimisation up to the max. As such, could you make a tutorial on using intrinsics, which can be inlined, to beat the compiler with a real-world function?
@WhatsACreel 4 years ago ⁺¹
It's difficult if the function is small. It's not impossible to beat the compiler though. Depending on the algorithm, sometimes it's really easy. I'd like to make some videos on Agner Fog's Vector Class Library. It's similar but cleaner than intrinsics. Hopefully I can finish that soon. Cheers for watching and cheers for this great suggestion :)
@him21016 4 years ago
@@WhatsACreel Good to get a response - "Cheers, mate!"
@sent4dc 4 years ago ⁺¹
wow, you went all out on titles & animation!
PS. Question -- how many Suns do you guys have down under?
@WhatsACreel 4 years ago
Hahaha! It's just the HDRI background! Some pretty wild lens flare :) Trying to up the production quality. Cheers for watching mate!
@sent4dc 4 years ago
@@WhatsACreel hey, dude, thanks for making these!
@chai116 4 years ago ⁺¹
I was hoping to learn the alignment requirements before the end of the video! very informative as a whole though, great job!
for avx2, what are the alignment requirements for reading or writing a ymm register from memory? is it 32 byte alignment?
@WhatsACreel 4 years ago ⁺²
Yes, alignment is 32 bytes I believe. So, there's VMOVAPS, which moves aligned data, but if your data isn't aligned (or you don't know) then you can use VMOVUPS, move unaligned packed singles. Each of the move instructions has an aligned and an unaligned version: VMOVDQA/VMOVDQU, VMOVAPD/VMOVUPD, etc.
As for the other instructions, the data has to be aligned when you use a memory operand as the second operand. So like VADDPS or whatever. There are no unaligned versions of those. There are also no versions that allow memory as the first operand.
I hope this helps! Cheers for watching :)
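The aligned/unaligned split described in this reply can be sketched from C++ with intrinsics. This is only an illustration under stated assumptions: it targets GCC or Clang on x86, the names `is_aligned`, `scale8`, `scale8_aligned` and `scale8_unaligned` are hypothetical, and the scalar loop is a fallback for builds or CPUs without AVX. `_mm256_load_ps`/`_mm256_store_ps` correspond to VMOVAPS, and `_mm256_loadu_ps`/`_mm256_storeu_ps` to VMOVUPS.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_HAVE_AVX_PATH 1
#endif

// True when p sits on a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

#ifdef SKETCH_HAVE_AVX_PATH
// Aligned flavour (VMOVAPS): p must be 32-byte aligned or the load faults.
__attribute__((target("avx")))
static void scale8_aligned(float* p, float k) {
    __m256 v = _mm256_load_ps(p);
    _mm256_store_ps(p, _mm256_mul_ps(v, _mm256_set1_ps(k)));
}

// Unaligned flavour (VMOVUPS): accepts any address.
__attribute__((target("avx")))
static void scale8_unaligned(float* p, float k) {
    __m256 v = _mm256_loadu_ps(p);
    _mm256_storeu_ps(p, _mm256_mul_ps(v, _mm256_set1_ps(k)));
}
#endif

// Multiply eight consecutive floats by k, choosing the load/store flavour
// from the pointer's alignment. Uses a scalar loop when AVX is unavailable.
void scale8(float* p, float k) {
#ifdef SKETCH_HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx")) {
        if (is_aligned(p, 32)) scale8_aligned(p, k);
        else                   scale8_unaligned(p, k);
        return;
    }
#endif
    for (int i = 0; i < 8; ++i) p[i] *= k;
}
```

Declaring the buffer as `alignas(32) float buf[16];` makes `scale8(buf, 2.0f)` take the aligned path, while `scale8(buf + 1, 2.0f)` takes the unaligned one. One caveat worth checking against the Intel manual: as far as I know, the VEX-encoded arithmetic forms such as VADDPS do not fault on unaligned memory operands; it is the legacy SSE encodings (ADDPS with a memory operand) that require 16-byte alignment.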
@JusticeHunter 4 years ago ⁺¹
Very good! Is there a way to check if the cpu has avx or avx512 support in asm? Like ifdef avx512 -> do 512 bit register operations and if not do the present smaller ones. Like a generic way to write the assembly. I am new to asm so this might just be nonsense.
@WhatsACreel 4 years ago
Not nonsense at all! There certainly is a way to check which instruction sets are available from ASM! It's a special instruction called "CPUID". We put a function number into EAX, then call CPUID, and it returns information in EAX, EBX, ECX and EDX. All of the information is encoded. You need the AMD or Intel manuals to check what the bits mean. But there's loads of information, including which instructions the CPU supports! I would love to make a video on this topic. In the meantime, you could google the CPUID instruction. It's pretty dense, but good stuff!
Well, hope this helps, thanks for watching :)
@WhatsACreel 4 years ago
Sorry, I remember I did a CPUID video: ua-cam.com/video/p5X1Sf5ejCc/v-deo.html
It's old, so it won't go into AVX512. Would be great to make an update. Anywho, cheers for watching :)
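For readers working from C++ rather than ASM, the CPUID check can be sketched as below. It assumes GCC or Clang on x86 (the `<cpuid.h>` helpers `__get_cpuid` and `__get_cpuid_count`); the struct and function names are hypothetical. The bit positions are the documented ones: leaf 1 ECX bit 28 for AVX, leaf 7 (sub-leaf 0) EBX bit 5 for AVX2 and bit 16 for AVX-512F. The sketch deliberately skips the second half of a complete check, which is confirming via OSXSAVE and XGETBV that the operating system saves the wider registers.

```cpp
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#endif

// Hypothetical summary of the feature bits this sketch looks at.
struct SimdFeatures {
    bool avx;
    bool avx2;
    bool avx512f;
};

// Decode the raw register values returned by CPUID.
//   leaf 1,            ECX bit 28 -> AVX
//   leaf 7 sub-leaf 0, EBX bit 5  -> AVX2
//   leaf 7 sub-leaf 0, EBX bit 16 -> AVX-512 Foundation
SimdFeatures decode_features(std::uint32_t leaf1_ecx, std::uint32_t leaf7_ebx) {
    SimdFeatures f;
    f.avx     = ((leaf1_ecx >> 28) & 1u) != 0;
    f.avx2    = ((leaf7_ebx >> 5)  & 1u) != 0;
    f.avx512f = ((leaf7_ebx >> 16) & 1u) != 0;
    return f;
}

// Run CPUID for the two leaves and decode the result. On compilers or
// targets without <cpuid.h> this reports every feature as absent.
SimdFeatures query_features() {
    std::uint32_t leaf1_ecx = 0, leaf7_ebx = 0;
#ifdef SKETCH_HAVE_CPUID
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) leaf1_ecx = ecx;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) leaf7_ebx = ebx;
#endif
    return decode_features(leaf1_ecx, leaf7_ebx);
}
```

A dispatcher would call `query_features()` once at start-up and pick the 512-bit, 256-bit or 128-bit routine from the result, which is roughly the "ifdef avx512" idea from the question, decided at run time instead of assembly time.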
@ProjectPhysX 4 years ago ⁺¹
Thank you for the video!
Is there a way to do horizontal sums on floating point using AVX2?
@theexplosionist2019 4 years ago ⁺¹
vhaddps repeated 3 times works.
@WhatsACreel 4 years ago
Like Explosionist mentioned, there's no way to do this in one instruction. You can combine a few horizontal adds together though. You have to include an extra VPERM2F128 to get the upper and lower halves to add. So you VPERM2F128 to some other tmp register, then add those halves, then horizontal add. Hope this helps, cheers for watching :)
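A horizontal sum along these lines can be sketched with intrinsics. Note the substitution: instead of VPERM2F128 this version pulls out the upper half with `_mm256_extractf128_ps` (VEXTRACTF128), which serves the same purpose of bringing the two 128-bit lanes together before the horizontal adds. It assumes GCC or Clang on x86; `hsum8` and `hsum8_avx` are hypothetical names, and the scalar loop covers builds or CPUs without AVX.

```cpp
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_HAVE_AVX_PATH 1

// Sum eight floats with AVX: fold the upper lane onto the lower lane,
// then finish with two horizontal adds.
__attribute__((target("avx")))
static float hsum8_avx(const float* p) {
    __m256 v  = _mm256_loadu_ps(p);            // a0..a7, any alignment
    __m128 lo = _mm256_castps256_ps128(v);     // a0..a3
    __m128 hi = _mm256_extractf128_ps(v, 1);   // a4..a7
    __m128 s  = _mm_add_ps(lo, hi);            // a0+a4, a1+a5, a2+a6, a3+a7
    s = _mm_hadd_ps(s, s);                     // pairwise sums
    s = _mm_hadd_ps(s, s);                     // total in every element
    return _mm_cvtss_f32(s);                   // read element 0
}
#endif

// Horizontal sum of eight consecutive floats.
float hsum8(const float* p) {
#ifdef SKETCH_HAVE_AVX_PATH
    if (__builtin_cpu_supports("avx")) return hsum8_avx(p);
#endif
    float total = 0.0f;
    for (int i = 0; i < 8; ++i) total += p[i];
    return total;
}
```

The vector path adds the elements in a different order from the scalar loop, so results for general floating-point data can differ in the last bits between the two paths.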
@2605mac 4 years ago ⁺¹
I have a question about assembly. If I write some MASM code that runs on an Intel Core processor, can I compile it on an AMD Ryzen processor as well? Or are there some differences?
@WhatsACreel 4 years ago ⁺²
Yes, Intel and AMD CPUs are both x86/x64. They run the same Assembly code! No need for a recompile or anything. Cheers for watching mate :)
@2605mac 4 years ago
@@WhatsACreel Thank you for the response.
@willofirony 4 years ago
Another question: I have a std::list of structs. They are sorted according to a uint16_t field. I am trying to find a faster algorithm than std::upper_bound(). Upper_bound takes significantly longer to find the insertion point in a std::list than it does with std::vector. The number of structs is generally 36863. If none found, then repeat for the next 8 structs. Does this sound doable, given the overhead of gathering the fields for packing etc.?
@tombranson9341 4 years ago
A std::list is a doubly linked list, so reaching any element other than the front or back takes linear time, O(n), while std::vector reaches any element by index in constant time, O(1). That is why it is slower; accessing the first or last element is O(1) for both data structures.
Just using a simple binary_search algorithm to find the element or struct (as long as the elements in the array are pre-sorted), and then exchange, swap or replace the values of the struct in O(1) might work, and you won't get faster than O(log n), which this kind of solution might offer.
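The vector-based lookup discussed here can be sketched as follows. `Record`, its fields and `insert_sorted` are hypothetical stand-ins for the structs in the question. On a `std::vector`, `std::upper_bound` performs O(log n) comparisons and O(log n) iterator steps; on a `std::list` it still performs O(log n) comparisons, but walking the iterators costs O(n), which matches the slowdown reported above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record type, kept sorted by its 16-bit key.
struct Record {
    std::uint16_t key;
    int payload;
};

// Insert r after any existing records with the same key so that v stays
// sorted. Returns the index at which r was placed.
std::size_t insert_sorted(std::vector<Record>& v, const Record& r) {
    // Binary search for the first record whose key is greater than r.key.
    auto pos = std::upper_bound(
        v.begin(), v.end(), r.key,
        [](std::uint16_t key, const Record& rec) { return key < rec.key; });
    std::size_t index = static_cast<std::size_t>(pos - v.begin());
    v.insert(pos, r);  // shifts the tail: O(n) element moves in the worst case
    return index;
}
```

The search itself is logarithmic, but inserting into the middle of a vector still moves the elements behind the insertion point. For roughly 37,000 small structs that is a contiguous block move, which is usually cheap compared with chasing list nodes, though that is worth measuring rather than assuming.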
@Alex-op2kc 3 years ago
Creel, any reason you don't use syntax highlighting? Or did you get a new computer or something and not re-install AsmDude?
@Alex-op2kc 3 years ago
Nevermind! In the next video Creel has syntax highlighting! "ASM Dude VS 2017 Extension"
@steveokinevo 4 years ago
Chris man, jumping ahead here, but I have been playing with AVX YMM registers. How would one align to 32? ALIGN only allows 16. It keeps crashing with a memory access violation on test in Visual Studio.
Would I need to set up a separate data segment?
Found this but still unsure.
JUNK SEGMENT PAGE 'DATA'
test_ymm real4 8 dup(8.8)
JUNK ENDS
Cheers pal
@WhatsACreel 4 years ago ⁺¹
I usually use _aligned_malloc. It's only available in Windows C++. You can also allocate additional padding (alignment + 4). Then add some amount to your allocated pointer to make it aligned, and record the amount you added in the int before the pointer. Hold on, I wrote some code. Not sure it is 100%, but you probs get the idea. You can do the same in ASM too if you're not in C++.
@WhatsACreel 4 years ago ⁺¹
The code is largely untested. Just wrote it this morning in thinking about your comment, so do be careful with it. Hope this helps mate, have a good one :)
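The padding approach described in the reply above can be sketched as follows. This is not Creel's code, which does not appear in this thread; it is a minimal illustration with hypothetical names (`aligned_alloc_sketch`, `aligned_free_sketch`), assuming a power-of-two alignment of at least 4 and an offset that fits in 32 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Allocate `size` bytes whose address is a multiple of `alignment`.
// Over-allocates by alignment + 4 bytes, rounds the pointer up, and stores
// the distance back to the raw block in the 4 bytes before the result.
void* aligned_alloc_sketch(std::size_t size, std::size_t alignment) {
    assert(alignment >= 4 && (alignment & (alignment - 1)) == 0);
    unsigned char* raw = static_cast<unsigned char*>(
        std::malloc(size + alignment + sizeof(std::uint32_t)));
    if (!raw) return nullptr;

    // Leave room for the stored offset, then round up to the alignment.
    std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(std::uint32_t);
    std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    std::uintptr_t aligned = (start + mask) & ~mask;

    std::uint32_t offset =
        static_cast<std::uint32_t>(aligned - reinterpret_cast<std::uintptr_t>(raw));
    unsigned char* user = raw + offset;
    std::memcpy(user - sizeof offset, &offset, sizeof offset);  // remember the offset
    return user;
}

// Release a block obtained from aligned_alloc_sketch.
void aligned_free_sketch(void* p) {
    if (!p) return;
    unsigned char* user = static_cast<unsigned char*>(p);
    std::uint32_t offset;
    std::memcpy(&offset, user - sizeof offset, sizeof offset);  // read it back
    std::free(user - offset);                                   // original malloc pointer
}
```

A buffer for YMM data would be requested as `aligned_alloc_sketch(bytes, 32)` and released with `aligned_free_sketch`. Outside the MSVC runtime, where `_aligned_malloc` pairs with `_aligned_free`, standard alternatives include C++17 `std::aligned_alloc` and POSIX `posix_memalign`, so the manual version is mainly useful for seeing what such functions have to do.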
@steveokinevo 4 years ago
@@WhatsACreel UPDATED: Thanks mate, yeah, doing it from ASM. I'll try the above with padding and note it, both sides of the coin, MASM and C++. Funnily enough this seems to work with the align directive, testing here:
JUNK32 SEGMENT ALIGN(32) ".data"
ALIGN 32 ;now accepted
test_ymm real4 8 dup(8.8)
JUNK32 ENDS
Can now align by 32 once the segment is on a new paragraph boundary; before, in the .DATA segment, it would give ERROR:
"invalid combination with segment alignment" and only allowed max of 16
It now compiles with 32.
Cheers, I'll keep at it.
@clayouyang2157 4 years ago
Hi dude, please keep updating. Thanks from China.
@SquallSf A year ago
OK, MMX is old, obsolete ... but is it supported on today's CPUs? Can you still use it in asm?
@huypt7739 3 years ago
Vectors
@rockapedra1130 4 years ago
Waaaay too many ads ...
@dontaskme1625 4 years ago
the moving background is extremely distracting; otherwise great video
@singlebit6661 4 years ago
My dude, the tutorials are awesome! However CG makes this one completely incomprehensible, hard to follow and distracting. Please, consider using plain old slides. Cheers!
