Top 10 Craziest Assembly Language Instructions

  • Published 25 Dec 2024

COMMENTS • 1.3K

  • @NotDwight
    @NotDwight 3 years ago +4179

    TIL there's an audience for top 10 videos about assembly instructions. Cool.

    • @thegrandnil764
      @thegrandnil764 3 years ago +88

      I'm surprised our community is so large

    • @TheActualDP
      @TheActualDP 3 years ago +44

      I'm surprised this has > 10^5 views.

    • @icedragon769
      @icedragon769 3 years ago +42

      having only ever worked with RISC assembly like MIPS in school, seeing the extremes of what you poor poor x86 driver authors have to deal with is entertaining and enlightening.

    • @jimviau327
      @jimviau327 3 years ago +4

      Sojit, in this case it doesn't appear that this video content will ever be of service to the quality of life you are seeking. Did I just write that? I'm not even sure I understand myself. :)

    •  3 years ago +3

      @@TheActualDP It has 2#10_1111_0010_1010_0010# views (I love Ada's based integers :D)

  • @davidjohnston4240
    @davidjohnston4240 3 years ago +1922

    RdSeed - It's not always slow. There's a FIFO on the output of the RNG, and RdSeed pulls from that FIFO. If you haven't just pulled a bunch of values from it, the value will be available immediately because the FIFO is not empty. If you try to continuously pull from RdSeed and measure the average time per instruction, it will appear slower because you are limited to the physical rate of generation of full entropy numbers from the RNG, which requires a whole lot of computation: generate 512 bits from the entropy source, AES-CBC-MAC them together to get 128 bits (that's two RdRand results' worth), XOR that with an output from the DRBG (another 3 AES operations, just like SP800-90C describes), then stuff the two 64 bit numbers from the 128 bit result into the output FIFO. How do I know all that? I designed it.
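A toy model of the behavior described above, not Intel's design: the FIFO depth of 8 and the rate of one value per 10 ticks are invented for illustration, and `ToySeedFifo` is a hypothetical name. It only shows why a few isolated reads return immediately while a sustained burst settles at the generation rate. On real hardware RDSEED does not block; it signals "no value available" through the carry flag and software retries, which the `waited` counter stands in for here.

```python
class ToySeedFifo:
    """Toy output FIFO refilled at a fixed rate (all numbers are invented)."""

    def __init__(self, depth=8, ticks_per_value=10):
        self.depth = depth
        self.ticks_per_value = ticks_per_value
        self.count = depth   # values currently buffered; the FIFO starts full
        self.progress = 0    # ticks of work done toward the next value

    def advance(self, ticks):
        """Let time pass; the generator refills the FIFO while it has room."""
        for _ in range(ticks):
            if self.count < self.depth:
                self.progress += 1
                if self.progress == self.ticks_per_value:
                    self.count += 1
                    self.progress = 0

    def read(self):
        """Take one value and return how many ticks the caller had to wait."""
        waited = 0
        while self.count == 0:
            self.advance(1)
            waited += 1
        self.count -= 1
        return waited


fifo = ToySeedFifo()
burst = [fifo.read() for _ in range(12)]  # wait time of each read in a burst
print(burst)
```

In this model the first eight reads of the burst wait 0 ticks because the FIFO is full, and every later read waits the full 10 ticks; after an idle period the FIFO refills and reads are immediate again.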

    • @xelaxander
      @xelaxander 3 years ago +295

      The true gold is down in the comments

    • @LKRaider
      @LKRaider 3 years ago +113

      Oh cool. When did you design it? Care to share some history?

    • @davidjohnston4240
      @davidjohnston4240 3 years ago +816

      @@LKRaider It was around 2009 I started. It ended up first in the Ivy Bridge processors with the RdRand instruction. I had been working on writing cryptographic protocols in standards committees (802.11i, 802.16 etc) and they all needed cryptographically secure random numbers, and when I looked at the SP800-90 specification back then, it was not sufficient. It described DRBGs (aka PRNGs) but not entropy extraction or physical entropy sources. A small team of 4 people was assembled: myself, a mathematician, an analog designer and a corporate cat herder. The math guy came up with some of the mathematical principles and identified the best papers describing how to quantify the entropy, the analog guy did the physical entropy source, the cat herder got it into silicon, and I designed the digital logic that takes the partially random bits, turns them into full random bits with an entropy extractor and seeds a PRNG/DRBG with that full entropy data to make the resulting stream of random numbers fast enough. Since then the other three left (2 retired and one died) and I've been the main owner of the RNGs since. RdSeed, which gives full entropy output as per SP800-90C and X9.82, was added with Broadwell. This was so you could make arbitrarily large keys from it. Faster and slower versions were created (fast for servers, slower for energy efficient chips). I've also designed a few other types of RNG for specific needs, like super small ones, non uniform ones and floating point ones. I contributed to the development of SP800-90B and SP800-90C and the revision of SP800-90A, which now cover most of what you need in a secure RNG. A couple of years ago I finished a book on random numbers which was published (Random Number Generators, Principles and Practices). So getting involved to solve my problem of where do I get random numbers has turned into the defining part of my career. The standards are still changing. Certification requirements are still evolving and the need for new RNGs that fit in different contexts continues apace, so it has become a full time job for myself and a small number of colleagues.

    • @luxsomething
      @luxsomething 3 years ago +79

      Wow that's amazing

    • @ohchristusername
      @ohchristusername 3 years ago +181

      @@davidjohnston4240 What a lovely comment chain to stumble upon, great read!
      May your random continue to prosper!

  • @electroflame6188
    @electroflame6188 3 years ago +1466

    Dot product of packed singles in your area

    • @Rudxain
      @Rudxain 3 years ago +94

      I would like it in my boot sector

    • @TheLightningStalker
      @TheLightningStalker 3 years ago +21

      The probability of finding a project worth uploading commits of my sus code is very low.

    • @molybd3num823
      @molybd3num823 3 years ago +14

      @@TheLightningStalker but never zero

    • @dubbynelson
      @dubbynelson 3 years ago +24

      dot product of deez nuts packed on your chin

    • @sumuduranathunga
      @sumuduranathunga 3 years ago

      I think 🤔 it must be cross product

  • @luck3949
    @luck3949 3 years ago +1635

    Wow, so the task I was given in a job interview was actually an assembler one-liner. Good to know.

    • @DOSeater
      @DOSeater 3 years ago +280

      If you'd said that in the job interview you'd get instantly hired

    • @luck3949
      @luck3949 3 years ago +210

      @@DOSeater I wish I knew this 2 months ago. I got that job anyway, but it took a few more interview iterations. Now I'm a happy developer of a delivery robot :)

    • @DOSeater
      @DOSeater 3 years ago +29

      @@luck3949 Nice! I'm happy it worked out for you

    • @guywithknife
      @guywithknife 3 years ago +366

      "Oh, that's easy, you can do it in one cycle using the PSCMPXCHGFMADDRABCXYZUW instruction"

    • @mika2666
      @mika2666 3 years ago +15

      Which one was it?

  • @Requiem100500
    @Requiem100500 3 years ago +709

    I love how hyped this guy is about CPU instructions. Really fun to listen to.

    • @tkeleth2931
      @tkeleth2931 3 years ago +29

      This dude could describe paint drying on a wall and I'd be entertained. I've never seen an assembly instruction before this video lol

    • @ChristopherGray00
      @ChristopherGray00 2 years ago +2

      i don't know why but for me it's quite annoying.

    • @HuntingKingYT
      @HuntingKingYT 1 year ago +1

      I'm also hyped when I learn something truly revolutionary

    • @MichaelMantion
      @MichaelMantion 1 year ago +1

      I am surprised he wasn't more excited.

    • @____________________________.x
      @____________________________.x 1 year ago +1

      Are you kidding me? I hate his voice with every fibre of my being. I've subbed only because he has subtitles and the other videos look interesting. That first 30 seconds was excruciating, I may need a lie down in a dark room

  • @ChildOfTheLie96
    @ChildOfTheLie96 3 years ago +575

    Lol, this guy has that kind of voice that makes it sound like he's constantly on the brink of laughter

    • @KanaalMTS
      @KanaalMTS 3 years ago +6

      The way you write sounds very British 😂😂

    • @douwehuysmans5959
      @douwehuysmans5959 3 years ago +4

      He sounds like BuzzFeed's IT guy

    • @julian-xy7gh
      @julian-xy7gh 3 years ago +5

      I have the same feeling with Tim from the Unmade Podcast. Maybe it's the Australian accent haha

    • @bakedbeings
      @bakedbeings 3 years ago +8

      @@julian-xy7gh Australian here: it's not universal for Aussies, he's just a gem 💎

    • @2112jonr
      @2112jonr 3 years ago +7

      More like madness.
      Assembly language has that effect...
      .

  • @0xABADCAFE
    @0xABADCAFE 3 years ago +1352

    So the most amazing thing about these instructions to me is the fact so many of them run in single digit cycles. You have to marvel at the engineering effort that has gone into it. Also, a compiler has to basically be sentient to know when and how to use some of these.

    • @MrHaggyy
      @MrHaggyy 3 years ago +128

      Yes, millions of hours of engineering went into getting to the point where you could write Hello World in Python etc.

    • @altaroffire56
      @altaroffire56 3 years ago +508

      No. If the compiler was sentient, it would kill itself.

    • @swarnavasamanta2628
      @swarnavasamanta2628 3 years ago +28

      @@altaroffire56 LOL

    • @swarnavasamanta2628
      @swarnavasamanta2628 3 years ago +75

      @@MrHaggyy And billions of hours for a javascript hello world. I think capable computer engineers brought this upon themselves by providing layers and layers of abstraction and burying the need for the internal concepts necessary to get something done. No wonder developers now are too shallow in their concepts; it's probably not their fault if they get hired after only 6 months of python for data structures (they have no incentive to learn the deeper internals if they get paid a shitload for sitting at a desk). Hell, I would say most people choose programming or development for making bucks; learning and interest come later. There are only a few people now who are truly interested in and curious about the core of things, and it might just be that after 10 years understanding these would be a luxury and not a necessity. Also no wonder most programmers hate their jobs and want to die after getting one.

    • @MrHaggyy
      @MrHaggyy 3 years ago +82

      @@swarnavasamanta2628 mhm i think the horizon of programmer/developer/engineer in this field got much broader. Yes, there are many abstraction layers we have invented and standardized over the years. I have a mechatronics degree with a microsystem-technology specialization. Most of my field works on improving the hardware for existing assembly code. But we also introduce new things in hardware which we map to assembly or C/C++ code. On that layer, you have the guys who are building assemblers, linkers, and compilers. These are the programs you need to actually execute code on a machine. On top of that, you have the Microsoft, Android, Apple, Linux, etc guys who write an operating system that provides useability with that stuff. And on that foundation, you can start building languages, IDEs, or any program you can open on your computer. And if we finally have these higher-level languages and programs we can start building frameworks or things like python. That field can write very powerful applications that millions of people can use, or that run on many machines at the same time, or all the things these cloud-native guys are doing. The interest in these fields is widely different. I personally love hardware, and the guys I work with love building hardware or building systems with hardware. Systems can be the new Intel i3-i7, over to raspberry pi or smartphone processor, to small controllers like an STM32 which are used in smartwatches, cars, microwaves, freezers down to something like an Arduino which is easy to learn.
      There are a lot of people working on those layers, many of them being the stereotypical white European/North American older man. But this field is one of the most global out there, with Korea, Taiwan, Japan and China being the "most" impactful.
      The amount of things you could learn about computer and software layers is way beyond one's reach. 99.99% of all programmers don't have a clue how transistors are formed into bit logic, scaled to 16-32-64-86-128bit wide memory, how this memory became a register with a specific purpose and how you address this register so you can call it. But you don't need to know it in order to write a program. :-) we have you covered in that one :-)
      So even assembly can teach you a lot about how a computer works, you don't need to write it. In fact you shouldn't write it for any used code. Use a compiler and write it in a higher-level language. All the smart people from the compiler department will cover you there. And so on and so on. Until the hip young facebook star engineer can write his php or python code for his next new feature. And if we do something amazing down the layers he will get a new version that will make his software even better than before. And the only thing he needs to do is trust the work of other people.
      The unpleasant truth about why so many programmers want to die or really do it is a mismatch between management, expectations and skills, paired with bad working environments. Coding and engineering computers is a mentally very hard and demanding task. You have to know your tools, get to know the problem, which I like to call a puzzle, identify the pieces of your puzzle, sometimes create a new piece that fits, and solve the puzzle. This takes time. A good time is anything from 2 to 4 hours. Less is only sufficient for really easy tasks; longer is better, but you need to train for it and you need to go to the toilet, move, eat, sleep etc. In most companies, this deep focus session gets corrupted by meetings, telephone, angry managers, or people that think they are important to the problem. These corruptions drain a lot of willpower, and unless you are a (senior) engineer and prepared for this kind of stuff it will depress you. You need to get your routines in place in order to sustain. The other part is that once you have solved the puzzle your company needs to give you a reward for this. If your management doesn't like your result and lets you feel their dislike, you need someone holding you on the bridge. That's why many companies in this field like Facebook and Intel don't have 9 to 5 jobs. You get paid to work for them. There are recommendations on how you should set up your routines and there are people helping you. But you can come and go as you like. But you get certain tasks and a timeframe. Once the timeframe is over people all over the world are counting on you getting the job done in time.
      So: a very wide, very different, and very interesting domain. And it's very rewarding if you know that you did something that all of mankind will use and benefit from a few months after you finished your work.

  • @zrebbesh
    @zrebbesh 3 years ago +656

    "HCF" -- Halt and Catch Fire.
    On a lot of early CPUs (1970s/1980s, yes damnit I am old) the manual gave the bit pattern for each instruction - and the rest of the bit patterns did undocumented things. Some were just a different way to spell NOP, some did deeply bizarre unintended things that happened because the bits randomly activated chunks of the CPU circuitry that mixed and matched chunks that were used in different combinations for other commands, and some did things that were only ever intended to be done in the factory, during QA testing.
    We used to hunt through these "undocumented instructions" looking for anything interesting or cool that we could then figure out uses for. But this was a bit risky. A fair number of CPUs had at least one undocumented instruction that would immediately cause the machine to lock up and, a few seconds later, destroy the CPU. Sometimes they caught fire, sometimes they melted through the PCB. Sometimes they desoldered themselves from the board and fell out. Whenever we found it we called it a "Halt And Catch Fire" instruction and patched the name 'HCF' into our macro assembler for that bit pattern, in order to avoid accidentally finding it again.
    Naturally when I saw the title of this video I figured HCF would be at the top of the list.
    Finding an HCF usually meant a new version of the chip as soon as the company could mask it off. We thought of ourselves as contributing to their QA efforts, although very few of them thanked us for it.

    • @ducksonplays4190
      @ducksonplays4190 3 years ago +69

      That is ridiculous, thank you for this comment.

    • @rty1955
      @rty1955 3 years ago +54

      Write while rewind
      Eject disc
      Read & write while ripping tape
      Disable console
      Activate emergency power off
      Electrocute operator
      Sense card deck on printer and open cover
      Write past EOT
      Read and scramble data
      I have a huge list of them along with my green cards

    • @Safyire_
      @Safyire_ 3 years ago +24

      Can you give some examples of interesting undocumented instructions you came across?

    • @zrebbesh
      @zrebbesh 3 years ago +146

      @@Safyire_ We found things like 'compare while swapping' that swapped the values in two registers while writing 1 to the comparison bit if the first was higher than the second. That was actually a little bit useful. We found a lot of things that tried to do two or three things at once but did them in a random-ish order because of race conditions. One of those was useful because it consistently did xor before swap if the CPU was hot and swap before xor if the CPU was cold, so we could write code that monitored the CPU and shut things down if it got too hot. We found instructions that connected multiple registers to the bus for output, meaning the result of the instruction would be written to four different registers at once. We also found instructions that connected multiple registers to the bus for input, which was useless and sometimes damaged the CPU. It was a real crapshoot. Also a very expensive hobby if you damaged the machine and your professor wasn't ready to write it off to "research." CPUs were not cheap.

    • @morgwai667
      @morgwai667 3 years ago +12

      @@rty1955 @Zrebbesh you crazy old hackers! ;-) you are legends! :)

  • @1111757
    @1111757 3 years ago +699

    I can't get over this presentation. That's the kind of nerdy content you expect to find in a recording of a 10 year old talk that was given to 50 people in a tent :D

  • @KazeN64
    @KazeN64 1 year ago +10

    I've used MIPS excessively and never looked at X86 much. This feels like when you were playing yugioh in 1999 and you were summoning and setting 1 card every turn and then you get teleported to 2023 where people play their entire deck in one turn and have cards with effects that are 7 paragraphs

    • @jhgvvetyjj6589
      @jhgvvetyjj6589 11 months ago

      Even when cutting off all SSE and up instructions (making it useful for legacy x86 device targeting) there is still a lot of complexity, including very precise x87 floating point and MMX vectorization. What makes it especially fascinating is how compatible it has become; a 640×480 60fps renderer on a very old x86 processor with MMX might very well be the exact same program that does 3840×2160 60fps on a modern PC.

    • @splits8999
      @splits8999 3 months ago

      huh.... what the fuck

  • @zactron1997
    @zactron1997 3 years ago +769

    Good lord that poor silicon. I can't even begin to imagine how you'd design chips to implement some of these instructions. I'd love to see a followup video showing some examples of using these instructions, and if they're superseded, what should be used instead!

    • @Noctew
      @Noctew 3 years ago +78

      They committed the cardinal sin in the 1970s with REP MOVx and it went downhill from there.

    • @fake12396
      @fake12396 3 years ago +125

      microcode, lots of microcode

    • @shinyhappyrem8728
      @shinyhappyrem8728 3 years ago +28

      I'd think that there are massive groups of "one circuit per operation", and they all work in parallel. From all the results only the specified one is selected.

    • @Lukas-er4nd
      @Lukas-er4nd 3 years ago +11

      Microcode. Lots and lots of microcode.

    • @polypolyman
      @polypolyman 3 years ago +35

      A long time ago, they actually gave up on x86, and have been making much simpler chips that convert x86 to that simpler system using "microcode"

  • @Andrath
    @Andrath 3 years ago +967

    You'd almost think silicon makers like to mess with compiler writers.

    • @kestasjk
      @kestasjk 3 years ago +142

      I doubt these instructions were aimed at people writing compilers, they'd be aimed at people doing things with encryption, low-level synchronization, multimedia.. I think these days people would first try and come up with a GPU based way to tackle these large data-processing problems, but before GPUs were general purpose parallel computers you had to do these single instruction multiple data things on the CPU

    • @toboterxp8155
      @toboterxp8155 3 years ago +67

      @@kestasjk Also, doing stuff with a good CPU instruction is generally more efficient than doing it on the GPU, simply because you have to send across the data and get the result back on a GPU.

    • @kestasjk
      @kestasjk 3 years ago +44

      @@toboterxp8155 Sort of.. The thing is if you’ve got enough data the GPU is so much faster it’s worth the overhead (and the memory space is getting more integrated / unified all the time), and if you’ve not got enough data to make sending to the GPU worthwhile the speed up for processing a small amount of data on the CPU more efficiently probably isn’t worth it. Perhaps for certain encryption or compression tasks where it can’t be parallelised very well on the GPU but it still needs lots of processing power they may still be useful, but I doubt these sorts of instructions are used in modern software very often

    • @toboterxp8155
      @toboterxp8155 3 years ago +21

      @@kestasjk You're generally correct, but those instructions are a standard way of making programs faster, used to this day. If your task isn't easily converted to the GPU, you don't want the extra work, or you don't want the program to require a GPU, using some complex instructions is an easy, fast and simple way to optimize for some extra speed when needed.

    • @kestasjk
      @kestasjk 3 years ago +17

      @@toboterxp8155 True.. but I think you can probably attribute ARM/NVIDIA’s ability to keep improving by leaps and bounds while Intel is reaching a plateau to its need to maintain a library of instructions that aren’t really necessary in modern software. If it gets rid of them old software breaks, if it keeps them any improvement it wants to make to the architecture needs to work with all these. Intel went for making the fastest possible CPU, but we now know a single thread can only go so fast (and the tricks like branch prediction have exposed gaping security holes in CPUs, forcing users to choose a pretence of security or turning branch prediction off and getting a huge performance hit). So parallelism is the future: In the 00s this meant multi-core CPUs, today this means offloading massive jobs to the GPU, but the breakthrough will come with CPUs and GPUs merging into one. Not to an SoC, like we already have, but with GPU-like programmable shaders as a part of the CPU instruction set and compiler chain, so that talking about CPU/GPU will be like talking about CPU/ALU. You’ll be able to do the operations like these instructions do in a single cycle, but by setting up a “CUDA-core” with general purpose instructions that can access the same memory.

  • @flowerpt
    @flowerpt 3 years ago +1259

    Intel: One cycle
    Bioinformaticists: lemme reimplement that in Python and take 300,000 cycles to compute the same thing.

    • @kestasjk
      @kestasjk 3 years ago +134

      Don't worry; as long as computer time remains far more valuable than developer time, and no alternative graphics-based technology appears for custom parallel processing operations, Intel will be just fine

    • @SimonBuchanNz
      @SimonBuchanNz 3 years ago +34

      @@kestasjk eh, emulation of x86 on ARM on both Windows and Mac is apparently good enough now that I'd be seriously worried if I was Intel. AMD at least have their GPUs...

    • @JayOhm
      @JayOhm 3 years ago +21

      @@SimonBuchanNz I think AMD wouldn't mind going ARM too much, if they have to. Maybe even will design dual-instruction-set chips for the transition period. Good thing that China won't let Nvidia buy ARM.
      In general, nowadays there is a tendency towards "crossplatform" software design practices, so the question of "Can it run widespread software fast?" would soon become irrelevant. For example, Adobe Lightroom already works on ARM on Windows and their other products will follow soon. Itanium might not have flopped if it happened a few years from now, at least not for the reason it did, which was poor x86 emulation performance.

    • @codycast
      @codycast 3 years ago +24

      @@JayOhm how exactly can China stop a US company from buying a UK company?
      Should we find out what Italy and Argentina think too?

    • @JayOhm
      @JayOhm 3 years ago +10

      @@codycast The short answer is Qualcomm. They are banned by US so if ARM becomes US-owned, Qualcomm will no longer be able to legally produce ARM chips. Possible political implications of that are just too painful to risk so regulators almost certainly won't allow it.

  • @icarvs_vivit
    @icarvs_vivit 3 years ago +83

    #1 is the definition of insane and incredibly useful.
    Thank you for translating the Enginese into English.
    Now I can delete my string comparison macros forever.

  • @redsmith9953
    @redsmith9953 3 years ago +289

    I just remember porting the Torque game engine to PSP. Out of all the work, I remember the CMPXCHG instruction for the mutex: I implemented it with some native PSP intrinsic. Good memories. The best optimization trick, too: the game was doing 10 fps at best, and the problem was matrix transposition between the engine and the PSP "opengl", so I made the transposition happen on the fly by changing the order of reading and writing of the registers in the VFPU instructions, kicking the Sony engineers' 'axe' ; ), and getting 30 fps, enough to pass their performance standards.

    • @KangJangkrik
      @KangJangkrik 3 years ago +9

      Wow you made PSP games?

    • @redsmith9953
      @redsmith9953 3 years ago +33

      @@KangJangkrik, I made the Torque game engine port, and on top of that another team was developing games using it.

    • @DiThi
      @DiThi 3 years ago +1

      Nice, but wouldn't it have been better to change which indices of matrices are used in vector and matrix functions? E.g. using m[4] instead of m[1] and vice versa.

    • @kyrylmelekhin2667
      @kyrylmelekhin2667 3 years ago +1

      Matrix transpose is the dumbest operation ever, you shouldn't be doing that, ever.

    • @redsmith9953
      @redsmith9953 3 years ago +14

      @@DiThi that implementation costs 20 fps on that platform; you need to swap the entire matrix operations for every calculation. Sounds trivial, but it was not for a 333 MHz processor with slow RAM.
      before was:
      matrix.transpose(); // bloated operation
      vector.mul(matrix);
      after optimization was:
      vector.mul(matrix); // due to the trick no transpose needed
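The algebra behind that trick can be sketched in a few lines. This is not the commenter's VFPU code, only an illustration with hypothetical helper names: multiplying by the transposed matrix gives the same result as multiplying by the original matrix while reading its elements in swapped index order, so the separate transpose pass can be dropped.

```python
def transpose(m):
    """Materialize the transposed matrix (the costly extra pass)."""
    return [list(row) for row in zip(*m)]


def vec_mul(v, m):
    """Row vector times matrix: out[j] = sum over i of v[i] * m[i][j]."""
    n = len(v)
    return [sum(v[i] * m[i][j] for i in range(n)) for j in range(n)]


def vec_mul_transposed(v, m):
    """Same result as vec_mul(v, transpose(m)); only the read order changes."""
    n = len(v)
    return [sum(v[i] * m[j][i] for i in range(n)) for j in range(n)]


matrix = [[1, 2, 3, 4],
          [5, 6, 7, 8],
          [9, 10, 11, 12],
          [13, 14, 15, 16]]
vector = [1, 0, 2, -1]

slow = vec_mul(vector, transpose(matrix))   # transpose, then multiply
fast = vec_mul_transposed(vector, matrix)   # multiply with swapped indices
print(slow == fast)
```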

  • @DukePaprikar
    @DukePaprikar 3 years ago +41

    Yeah, watch-mojo really dropped the ball by not covering this one.

  • @FinaISpartan
    @FinaISpartan 3 years ago +337

    Can't wait till you remake this vid in 10 years with all the custom RISC-V extension instructions. Gonna be pretty wild to see what people come up with.

    • @ritteradam
      @ritteradam 3 years ago +15

      The big mistake Intel made is to create fixed width vector instructions. The V in RISC-V points to the importance of the variable width vector instructions where the assembly code doesn’t need to know the vector register size (V extension), and a similar matrix extension is coming for machine learning I think (though V is already a great improvement)

    • @canaDavid1
      @canaDavid1 3 years ago +35

      @@ritteradam The V in risc-v is a roman numeral standing for 5, as it is the 5th iteration of risc from Berkeley (i think).

    • @ritteradam
      @ritteradam 3 years ago +16

      @@canaDavid1 Officially yes, but you can find videos of the people who developed RISC-V on YouTube, and they mentioned that they originally developed it because they wanted to get the vector extension right, and that's why they called it RISC-V at the start.

    • @bFix
      @bFix 3 years ago +8

      Also it's a reduced instruction set (risc) and not a complex instruction set (cisc) like x86
      So why should risc-v even get some of these?
      just do them in software and let the compiler do its magic.

    • @TheMixedupstuff
      @TheMixedupstuff 3 years ago +18

      The point of risc-v is to have a common set of instructions understood by many cpus and to be extended with application specific extensions where needed. So you can be 100% sure there will be many wild instruction extensions.

  • @soranuareane
    @soranuareane 3 years ago +37

    CMPXCHG is how mutual-exclusion, locks, and semaphores are implemented in systems like QEMU. I remember having to fix a bug with a race condition in the QEMU Sparc interpreter by adding judicious use of CMPXCHG locking. It's an amazing instruction and, with its guaranteed atomic behavior, can be used to trivialize mutexes.
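A minimal sketch of how a lock can be built on compare-and-exchange. This is not QEMU's code, and all class and function names are hypothetical. Python has no CMPXCHG, so a `threading.Lock` inside `AtomicWord` stands in for the hardware's atomicity; on x86 the instruction compares against the accumulator register and needs the LOCK prefix to be atomic across cores.

```python
import threading
import time


class AtomicWord:
    """Software stand-in for one memory word with a compare-and-exchange."""

    def __init__(self, value=0):
        self._value = value
        self._hw = threading.Lock()  # simulates the hardware atomicity

    def compare_exchange(self, expected, new):
        """Store `new` only if the word holds `expected`; report success."""
        with self._hw:
            if self._value == expected:
                self._value = new
                return True
            return False


class SpinLock:
    """Mutex built from a single word: 0 means free, 1 means held."""

    def __init__(self):
        self._word = AtomicWord(0)

    def acquire(self):
        # Retry until we are the thread that flips the word from 0 to 1.
        while not self._word.compare_exchange(0, 1):
            time.sleep(0)  # yield to other threads while spinning

    def release(self):
        self._word.compare_exchange(1, 0)


def run_demo(threads=4, increments=500):
    """Increment a shared counter under the spin lock; return the total."""
    lock = SpinLock()
    total = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            total[0] += 1
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return total[0]
```

Only one thread can win the 0 to 1 exchange at a time, which is the whole mutual-exclusion guarantee; everything else is retry logic.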

  • @ZILtoid1991
    @ZILtoid1991 3 years ago +148

    PMADDWD is quite useful for fast affine transformation functions. On SSE2, I can even calculate two pixels at once
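A sketch of what PMADDWD computes, modeled in plain Python: signed 16-bit lanes are multiplied pairwise and each adjacent pair of products is summed into one signed 32-bit lane. The lane layout used for the affine example is an assumption about how two pixels might be packed into one 128-bit register, not the commenter's actual code, and it leaves out the translation term.

```python
def pmaddwd(a, b):
    """Model of PMADDWD on lists of signed 16-bit integers.

    Returns one 32-bit result per adjacent pair: a[i]*b[i] + a[i+1]*b[i+1].
    """
    assert len(a) == len(b) and len(a) % 2 == 0
    out = []
    for i in range(0, len(a), 2):
        s = a[i] * b[i] + a[i + 1] * b[i + 1]
        # Wrap to signed 32-bit; only the all -32768 case can overflow.
        s = (s + 2**31) % 2**32 - 2**31
        out.append(s)
    return out


def affine_two_pixels(p0, p1, a, b, c, d):
    """Apply the 2x2 matrix [[a, b], [c, d]] to two points with one pmaddwd."""
    x0, y0 = p0
    x1, y1 = p1
    coords = [x0, y0, x0, y0, x1, y1, x1, y1]   # eight 16-bit lanes
    coeffs = [a, b, c, d, a, b, c, d]
    nx0, ny0, nx1, ny1 = pmaddwd(coords, coeffs)
    return (nx0, ny0), (nx1, ny1)


print(affine_two_pixels((5, 7), (1, -4), 2, 0, 0, 3))
```

With eight lanes per register, the two output coordinates of both pixels come out of a single multiply-add, which matches the "two pixels at once" remark.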

  • @Kyrelel
    @Kyrelel 3 years ago +66

    Bear in mind that some instructions were not designed, they are a by-product of the design process.
    In essence, take any bit-pattern that is not assigned to an instruction and look at what the processor will do.
    Most often it will do nothing (which is why there are so many NOPs in instruction sets) or it may crash, but sometimes it will do something weird and wonderful and be included as an "official" instruction while the designers pretend it was intentional.

    • @Rudxain
      @Rudxain 2 years ago +12

      That's like exploiting hardware-level undefined-behavior

    • @lPlanetarizado
      @lPlanetarizado 1 year ago +12

      there is a comment that mentions HCF -Halt and Catch Fire- , "undocumented instruction" that sometimes could catch fire...damn, thats amazing lol

    • @appelnonsurtaxe
      @appelnonsurtaxe 1 year ago +7

      @@lPlanetarizado That wouldn't happen today on your PC's x86. Or this would be a terrible security issue. On modern systems userspace processes should be able to (try to) run any instruction they want without the CPU melting down.

    • @NormanVN
      @NormanVN 1 year ago +12

      All of the instructions in this video were quite intentional, but niche. Well, only some are niche. cmpxchg is a _foundational_ instruction whose importance cannot be overstated, while pshufb is going to be in pretty much every vector codebase. dpps is pretty well known, parallel dot product. not a fan of dpps tbh.

  • @quadroninja2708
    @quadroninja2708 1 year ago +9

    This video has such unique editing. The topic isn't any less obscure, and it's really cool to hear the author being so enthusiastic about those instructions. It's a really interesting experience

  • @bobbymorelli9763
    @bobbymorelli9763 3 years ago +104

    alright guys let's brainstorm what kind of algorithm could benefit from all 10...maybe search for a specific font in an image by comparing each glyph's bitmap to the image using MPSADBW and search for words within identified glyphs using the last instruction?

    • @AlexanderBukh
      @AlexanderBukh 3 years ago +40

      careful, or you might end up creating another awfully named megainstruction

    • @bakedbeings
      @bakedbeings 3 years ago +9

      @@AlexanderBukh ALRTGYSBSTRM

    • @nyanpasu64
      @nyanpasu64 3 years ago +1

      Needs moar threads.

    • @abebuckingham8198
      @abebuckingham8198 3 years ago +2

      MPSADBW can be used for all sorts of optimization problems as the sum of absolute differences is a metric. It's often faster than using the Euclidean metric which requires a square root and you can substitute one for the other in many situations.
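A sketch of the sum of absolute differences used as a distance, with hypothetical helper names. `sliding_sad` mimics the general shape of MPSADBW (a 4-byte block compared against eight overlapping positions of an 11-byte window), but it ignores the instruction's immediate operand that selects offsets, so treat it as an approximation of the idea rather than an exact model.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length byte sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sliding_sad(window, block):
    """SAD of `block` against every len(block)-sized slice of `window`."""
    n = len(block)
    return [sad(window[i:i + n], block) for i in range(len(window) - n + 1)]


def best_match(window, block):
    """Offset in `window` where `block` fits best (smallest SAD)."""
    scores = sliding_sad(window, block)
    return scores.index(min(scores))


window = [0, 0, 9, 8, 7, 6, 0, 0, 0, 0, 0]  # 11 bytes give 8 candidate offsets
block = [9, 8, 7, 6]
print(sliding_sad(window, block))
print(best_match(window, block))
```

No square root or multiplication is needed, which is why this distance is attractive for block matching in searches such as motion estimation.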

    • @gazehound
      @gazehound 10 months ago

      you could feasibly use a good chunk of these by implementing a fancy video encoding

  • @dkosmari
    @dkosmari 3 years ago +25

    The carryless multiplication is polynomial multiplication modulo 2. It's used to implement things like CRC computation, and Reed-Solomon error correction codes.
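
    A quick Python sketch of what that means: treat each integer as a polynomial with coefficients mod 2, so "add" becomes XOR and no carries propagate. The CRC part below is the bare textbook remainder (no bit reflection, no initial value, no final XOR), and the generator 0x107 is just an example polynomial chosen for illustration, not any particular standard's parameters.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less multiply: polynomial product over GF(2)."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # XOR instead of add, so nothing carries
        a <<= 1
        b >>= 1
    return result


def poly_mod(value: int, poly: int) -> int:
    """Remainder of value divided by poly, both read as GF(2) polynomials."""
    width = poly.bit_length()
    while value.bit_length() >= width:
        value ^= poly << (value.bit_length() - width)
    return value


GENERATOR = 0x107            # x^8 + x^2 + x + 1, an example 8-bit generator
message = 0xC0FFEE

# Textbook CRC: remainder of message * x^8, appended to the message.
crc = poly_mod(message << 8, GENERATOR)
codeword = (message << 8) ^ crc

print(bin(clmul(0b11, 0b11)))           # (x + 1)^2 over GF(2)
print(poly_mod(codeword, GENERATOR))    # a valid codeword leaves no remainder
```

    Real PCLMULQDQ-based CRC code folds large chunks at a time rather than dividing bit by bit, but the algebra is the same.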

    • @jgunther3398
      @jgunther3398 1 year ago

      i was disturbed to find any mul instruction. i loved my homemade multiplication and division routines

    • @gazehound
      @gazehound 10 months ago +1

      Yes, it's useful for all kinds of codes. It's a direct implementation of a field theory concept

  • @rockercas
    @rockercas 3 years ago +468

    wow, those were 1010 assembly language instructions, not a mere 10!

    • @i_am_aladeen
      @i_am_aladeen 3 years ago +22

      I actually crunched these numbers in my head before I realized what you did. I feel ashamed. +1

    • @bbq1423
      @bbq1423 3 years ago +62

      There are 10 kinds of people in this world. Those who know binary, and those who do not.

    • @threepointonefour607
      @threepointonefour607 3 years ago +61

      @@bbq1423 there are 10 kinds of people in the world: those who understand hexadecimal and F the rest

    • @skilz8098
      @skilz8098 3 years ago +3

      @@threepointonefour607 0000 0000b - 1111 1111b == 0x00 - 0xFF, since each hex digit is exactly four bits (log2(x) = 4·log16(x))! If you are doing simple programming, then 90% of the time you'll only need hexadecimal. If you are actually building and designing hardware and implementing its data paths, control lines and control bits... You are not going to get very far without binary and Boolean Algebra! If you get into Cryptography, or Signal Analysis you might want to know binary as you'll end up performing a lot of bit manipulation!

    • @alg3n320
      @alg3n320 3 years ago +15

      @@bbq1423 and those who didn't expect a trinary joke

  • @lonsbury
    @lonsbury 3 years ago +282

    I feel bad for the CPU engineers who will need to add compatibility for this stuff in 20 years
    Edit: finished watching the video. This was pretty fascinating, and the 3D text made it very nice to watch. I hope you gain more subscribers!

    • @vylbird8014
      @vylbird8014 3 years ago +15

      They'll do it in microcode, I imagine. Apart from the RNG, they can all be done purely in heaps of microcode if you don't care about performance, no dedicated hardware needed.

    • @gorilladisco9108
      @gorilladisco9108 3 years ago +13

      If you ever learn about microprocessors, it's all about microcode. Every assembly instruction is a function call to microcode. The design will basically be the same, with microcode printed in ROM inside the chip. You just have to be creative using that microcode to come up with a new instruction.

    • @johnbrown9181
      @johnbrown9181 3 years ago +16

      @@gorilladisco9108 There's definitely a lot more to it than just microcode. Some things are both easy and compact in hardware, such as a linear-list search or swizzling, and microcode won't get you there.
      Also I'm not aware of any major RISC implementations that use a significant amount of microcode, very much unlike x86.

    • @gorilladisco9108
      @gorilladisco9108 3 years ago +7

      @@johnbrown9181 And that's why you won't see any instructions like the ones listed in this video on any RISC microprocessors. The thing about x86 and other CISC microprocessors is they use microcode liberally.
      Microcode is how a microprocessor works. All you have to do is to have imagination.

    • @Waccoon
      @Waccoon 3 years ago +4

      Depends on how fast it needs to be. Optimizing complex instructions to use all of a core's hardware is difficult, but just getting older instructions to work for the sake of compatibility isn't that hard. Hence, x86 code from a couple decades ago will work fine on a modern x64 chip, while ARM, PowerPC, and other RISC designs have suffered mountains of compatibility issues over time.

  • @GaryBickford
    @GaryBickford 3 years ago +52

    Don't forget the Motorola 6800 "Halt and catch fire" instruction. It was an unpublished byte code that caused a branch to itself until the chip overheated.

    • @BrianG61UK
      @BrianG61UK 3 years ago +7

      No. en.wikipedia.org/wiki/Halt_and_Catch_Fire_(computing)

    • @GaryBickford
      @GaryBickford 3 years ago +6

      @@BrianG61UK Long ago a computer center I worked in had a list created by IBMers in the 1960s of amusing opcodes, including HCF. But I didn't want to complicate the text, and the MC6800 item is there in the Wikipedia description, though I did have the details incorrect😊.

    • @tomysshadow
      @tomysshadow 3 years ago +2

      This video is about x86 though. Given, it does have the HLT instruction, and if you use it in your user mode application it will catch fire (if by catching fire you mean cause a privileged instruction exception) :0)

    • @rty1955
      @rty1955 3 years ago

      HCF was around in the 60s way before the 6800

    • @GaryBickford
      @GaryBickford 3 years ago +3

      @@rty1955 yes, I recall on the wall of a data center I worked at, a paper list of spoof IBM machine instructions that included this HCF instruction. Iirc there was also BAH, Branch And Hang😂. The only CPU that actually did this that I'm aware of was the early 6800, but it's possible there were others. The 6800 was an "unimplemented" instruction bit pattern that unbeknownst to Motorola effectively branched to itself immediately and repeatedly until the heat built up enough to burn the logic.
      I also personally experienced the results of two amusing (to me) episodes: at a college I was attending, a kid running a canned BASIC business program managed somehow to overwrite the entire disk map, effectively erasing everything, and a kid looking for a job used social engineering to get the guy running jobs to dive and hit the Big Red Halt button. Each of those events caused the Computer Center to be offline for more than a week. And an entire computer center at a company where I worked got completely fried including three mainframes due to a lightning strike right at the pole outside the Center. The senior manager had resisted spending the $5 million required for a motor generator to isolate the computers from the world. We had 400 engineers twiddling thumbs for two weeks. He got a new job.

  • @ishdx9374
    @ishdx9374 3 years ago +27

    the last one seems so damn complex it's unbelievable it takes 3-4 cycles

  • @ukyoize
    @ukyoize 3 years ago +55

    The string instructions seem like half of a grep implementation.

  • @elietheprof5678
    @elietheprof5678 3 years ago +10

    Excellent visualizations btw. Way more straightforward than instruction manuals that try to explain everything with just words.

  • @quickstartprojects2162
    @quickstartprojects2162 3 years ago +40

    Finally SSE 4.2 string compare is understandable. I wish we had the Australian version, Creel version, of the intel instruction set manuals.

    • @deppy2165
      @deppy2165 3 years ago +11

      if you're struggling with the intel manuals I personally find the amd manuals more comprehensible

  • @CrittingOut
    @CrittingOut 1 year ago +1

    one of the assembly instruction videos of all time.

  • @glikar1
    @glikar1 3 years ago +32

    Exciting! Love your enthusiasm. Almost makes c redundant. There is something about machine code that feels right.

    • @bootmii98
      @bootmii98 3 years ago +3

      did you know that ++ and -- were VAX intrinsics?

    • @seneca983
      @seneca983 3 years ago +7

      There is something about machine code that feels right.
      I dunno. I've not done any actual assembly programming so maybe my opinion doesn't matter but x86 just seems so bloated and inelegant.

    • @swarnavasamanta2628
      @swarnavasamanta2628 3 years ago

      @@seneca983 you would be partially right. Bloated or not depends on the way of implementation, if these instructions were to be implemented by microcode, yes absolutely, better let the programmer handle them. But if they are direct on chip Hardware implementation of these instructions then it's a different story, it takes the opposite route of bloat. Takes 1 instruction instead of writing a 100 line function in C and hoping compiler would get the translation right. Also x86 being firmly established the engineers have to make sure they are compatible all the way. Support for languages will drop eventually, while x86 is going to stay.

    • @seneca983
      @seneca983 3 years ago +1

      @@swarnavasamanta2628 One advantage of a simpler and smaller instruction set is that microcoding might not then be necessary and the chip could be simpler.
      Indeed x86 would be rather difficult to supplant. However, it seems possible that ARM could do it though it's uncertain and would probably take a long time if it happened.

    • @swarnavasamanta2628
      @swarnavasamanta2628 3 years ago

      @@seneca983 ARM is definitely a beast, and their methodology is completely different from other CISC approaches. It began first as a project to see if a computer really needs large complex instructions, they thought they would come at a halt problem but nothing really came up and they could make everything work with 1 cycle simple instructions (although with a bit of microcode). At this point hard to tell what the future holds, maybe there will be standardization when one architecture has so many advantages that renders other architectures almost useless or unworthy of learning curve. Who knows what the future holds but up until that the architecture land of computers is like wild wild west and i kind of love it that way.

  • @superblaubeere27
    @superblaubeere27 3 years ago +18

    7:30 Btw, the carryless multiply is extremely useful when making parsers

    • @mohammedjawahri5726
      @mohammedjawahri5726 3 years ago

      :o, can u elaborate pls xD

    • @superblaubeere27
      @superblaubeere27 3 years ago +1

      @@mohammedjawahri5726 here is a video about it, you will need the context: ua-cam.com/video/wlvKAT7SZIQ/v-deo.html

    • @mohammedjawahri5726
      @mohammedjawahri5726 3 years ago

      @@superblaubeere27 thanks!

    • @0MoTheG
      @0MoTheG 3 years ago

      @@superblaubeere27 You mean at 35:00 ?

    • @superblaubeere27
      @superblaubeere27 3 years ago

      @@0MoTheG exactly.

  • @DjVortex-w
    @DjVortex-w 3 years ago +29

    You know that an instruction is complex if implementing it in a higher-level programming language would take literally hundreds of lines of code.

  • @kippers12isOG
    @kippers12isOG 3 years ago +8

    I love your vids mate. You’re such a god dam likeable character

  • @galier2
    @galier2 3 years ago +44

    TMS-9900 also has a very unique instruction: X Rn . Execute the instruction in register n. It's the only CPU I know of that has the equivalent of an eval() function (as the registers are stored in external RAM, it's clear that it's not difficult to implement in that case).

    • @Rudxain
      @Rudxain 3 years ago +6

      It has SEVERE security issues. But hey, at least it can be used for self-modifying programs

    • @galier2
      @galier2 3 years ago +9

      @@Rudxain for a CPU that doesn't have privilege levels or memory protection, I don't think that security is an issue with the X instruction.

    • @peterfireflylund
      @peterfireflylund 1 year ago +2

      S/360 had the EX instruction for that. The instruction wasn’t in a register but in memory (S/360 was variable length, 2/4/6 bytes). This kind of instruction was fairly common in the 50’s and 60’s.

    • @galier2
      @galier2 1 year ago

      @@peterfireflylund interesting. Btw in the TMS-9900 the instruction is also in memory because the register window is in memory.

  • @first-thoughtgiver-of-will2456
    @first-thoughtgiver-of-will2456 2 years ago

    This and 2 minute papers are the most important channels on my YouTube. Thank you for your service.

  • @ProjectPhysX
    @ProjectPhysX 3 years ago +7

    Fantastic video! Such exotic instructions can insanely speed up / shorten certain algorithms. Back when I did MPASM (has only 35ish instructions), there are some rarely used ones that magically do exactly what you can also emulate in 10 more common instructions.
    From the instructions in the video I so far only used cmpxchg to emulate floating-point atomic addition in OpenCL.
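
    The usual shape of that trick, sketched in Python: reinterpret the float as its 32-bit pattern, compute the new value, and retry the compare-and-swap until nobody else changed the word in between. `Cell32` is a hypothetical stand-in for a memory word (its internal lock only plays the role of the hardware's atomicity); this is not OpenCL code, just the loop structure under those assumptions.

```python
import struct
import threading


class Cell32:
    """Stand-in for a 32-bit memory word with an atomic compare-exchange."""

    def __init__(self, bits=0):
        self._bits = bits
        self._guard = threading.Lock()   # models hardware atomicity only

    def load(self):
        return self._bits

    def compare_exchange(self, expected, new):
        """Store new only if the word still holds expected; return the value seen."""
        with self._guard:
            seen = self._bits
            if seen == expected:
                self._bits = new
            return seen


def float_to_bits(x):
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(b):
    return struct.unpack("<f", struct.pack("<I", b))[0]


def atomic_add_float(cell, delta):
    """CAS loop: retry until our update lands on an unchanged word."""
    while True:
        old = cell.load()
        new = float_to_bits(bits_to_float(old) + delta)
        if cell.compare_exchange(old, new) == old:
            return bits_to_float(old)


cell = Cell32(float_to_bits(0.0))


def worker():
    for _ in range(1000):
        atomic_add_float(cell, 1.0)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = bits_to_float(cell.load())
print(total)
```

    The integer compare is what makes this work: the CAS never needs to understand floats, it only checks that the bit pattern is the one the thread read.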

  • @T33K3SS3LCH3N
    @T33K3SS3LCH3N 3 years ago +2

    My little brother is doing a similar major as I did and will have a course with some practical work in assembly next year. Your video just gave me the inspiration to help him find some more "creative" solution to those assignments.

  • @sasas845
    @sasas845 3 years ago +8

    I've worked with or in close proximity of most of these. If you do high performance number crunching or data crunching, the value logistics (i.e. which value needs to be in what operand in which SIMD position) very quickly becomes a major issue and for that all these shuffle/rotate/select instructions are a godsend, especially since they tend to be just rewiring of existing ALU functionality so AFAIK should be easy to implement in silicon. Number 1 on the list is the only instruction family I'd put into "space magic" territory, but I might just not have seen its use case yet.

  • @MjuMeli
    @MjuMeli 3 years ago +14

    This getting recommended to people is almost as oddly specific as the sound of sorting algorithms

  • @SaHaRaSquad
    @SaHaRaSquad 3 years ago +49

    Not gonna lie, string comparison on the instruction set level actually sounds pretty useful. Not a fan of the absolutely insane arguments though.

    • @WhatsACreel
      @WhatsACreel  3 years ago +3

      Yes, they are magnificent instructions!! Assembly can be super fiddly to code, but very powerful if you have the time to make sure it is correct.

    • @Gulleization
      @Gulleization 3 years ago

      Yeah, as an accountant by profession I still wonder how mathematical reconciliation of bank statements and checking accounts can be so complicated to program and usually buggy.
      I guess that last instruction combined with machine learning techniques really could speed up the process.

    • @SaHaRaSquad
      @SaHaRaSquad 3 years ago +10

      @@Gulleization You absolutely don't want machine learning near anything that requires accurate numbers. ML has its place but it isn't nearly as useful or reliable as the hype often makes it appear.

    • @somdudewillson
      @somdudewillson 1 year ago

      @@SaHaRaSquad It depends on the type of ML. Neural networks are generally fuzzy, but there are lots and lots of other kinds of machine learning implementations, and some of them work very well for accurate numbers.

    • @jgunther3398
      @jgunther3398 1 year ago

      it would only be four or five instructions in a loop. but if it was four or five times faster and all you did was compare strings, very valuable!
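
      For a feel of what one of those string instructions replaces, here is a loose Python model of the simplest mode as I understand it: "equal any" with implicit (NUL-terminated) lengths, returning the index of the first byte in a 16-byte chunk that appears in a small character set, or 16 if there is none. The other modes, the flags and the mask-returning variants are left out, and the helper names are assumptions, so treat it as a sketch rather than a reference.

```python
def implicit_length(operand: bytes) -> bytes:
    """Valid part of a 16-byte operand: everything before the first NUL."""
    operand = operand[:16]
    end = operand.find(b"\x00")
    return operand if end < 0 else operand[:end]


def pcmpistri_equal_any(char_set: bytes, chunk: bytes) -> int:
    """Index of the first chunk byte found in char_set, or 16 if none match."""
    valid_set = implicit_length(char_set)
    for i, byte in enumerate(implicit_length(chunk)):
        if byte in valid_set:
            return i
    return 16


def find_first_of(text: bytes, char_set: bytes) -> int:
    """strpbrk-style search, 16 bytes per step (assumes text has no NUL bytes)."""
    for base in range(0, len(text), 16):
        index = pcmpistri_equal_any(char_set, text[base:base + 16])
        if index < 16:
            return base + index
    return -1


genome = b"ACGT" * 5 + b"NACGT"
print(find_first_of(genome, b"NX"))
print(pcmpistri_equal_any(b"aeiou", b"strength"))
```

      The hardware does the whole 16x16 byte comparison in one go, which is where the win over a byte-at-a-time loop comes from.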

  • @gabrote42
    @gabrote42 3 years ago +2

    I haven't watched a video like this ever. Saving it for arguments. Thanks!

  • @Chrisuan
    @Chrisuan 3 years ago +4

    Found this randomly in my suggestions. Insane content, great stuff. As a C++ programmer this assembly stuff scares me lol

    • @GogiRegion
      @GogiRegion 3 years ago +2

      I’ve never done programming in assembly on any newer hardware, so to me I always thought of assembly operations as stuff like move this to there, add, subtract, compare two registers, so even as someone who’s used assembly this is absurd to me.

  • @MoosesValley
    @MoosesValley 1 year ago

    Appreciate the tour. Did quite a lot of Assembly coding in my earlier years, and quickly grew to love it - it's a lot of fun when you get up and running, but you need to keep so much more information in your brain / at your finger tips compared to higher level languages.

  • @educate9946
    @educate9946 3 years ago +4

    I love this presentation, it fits the weirdness of the ops! Great job!

  • @bbq1423
    @bbq1423 3 years ago +322

    Wouldn’t it be better to call them functions instead of instructions at this point?

    • @jjoonathan7178
      @jjoonathan7178 3 years ago +167

      Needs a RUNDOOM instruction.

    • @allmycircuits8850
      @allmycircuits8850 3 years ago +37

      @@jjoonathan7178 At least IDDQD seems plausible, integer divide quads by double, store results as double :)

    • @oldxuyoutube1
      @oldxuyoutube1 3 years ago +54

      They have their own implementation circuitry, therefore they should be called instructions, and this is also one of the most important features of the x86 ISA: we make a complex operation into an instruction to shorten the execution time and make the program smaller.

    • @yadt
      @yadt 3 years ago +27

      @@oldxuyoutube1 well, there is microcode...

    • @microcolonel
      @microcolonel 3 years ago +1

      No, because they are not functions; maybe you could call them routines but not functions.

  • @salainen6850
    @salainen6850 3 years ago +13

    PEXT is so useful! I can finally get the correct bits from a 4X 1R 1G 1B 1I 8-bit color buffer to the "layers" in mode 12h easily!
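
    Roughly what that looks like, as a Python model of PEXT: gather the bits selected by a mask and pack them at the bottom of the result. The pixel layout used here (bit 3 = R, bit 2 = G, bit 1 = B, bit 0 = I, top four bits unused) is my own assumption for the example, as are the names; adjust the masks to the real buffer format.

```python
def pext(value: int, mask: int) -> int:
    """Parallel bit extract: pack the bits of value selected by mask."""
    result = 0
    out_bit = 0
    while mask:
        lowest = mask & -mask          # isolate the lowest set mask bit
        if value & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1               # clear that mask bit
    return result


ONE_PER_BYTE = 0x0101010101010101
RED_MASK = 0x08 * ONE_PER_BYTE         # assumed: bit 3 of every pixel byte
INTENSITY_MASK = 0x01 * ONE_PER_BYTE   # assumed: bit 0 of every pixel byte

# Eight pixels, one per byte, pixel 0 in the lowest byte.
pixels = bytes([0x8, 0x0, 0xC, 0x1, 0x8, 0xF, 0x0, 0x2])
word = int.from_bytes(pixels, "little")

red_plane = pext(word, RED_MASK)       # one plane bit per pixel
intensity_plane = pext(word, INTENSITY_MASK)
print(format(red_plane, "08b"), format(intensity_plane, "08b"))
```

    One call per plane turns eight chunky pixels into one planar byte, which is the conversion a planar mode like 12h wants.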

    • @WhatsACreel
      @WhatsACreel  3 years ago +4

      Mode 12h? Are you coding EGA? That's awesome!

    • @salainen6850
      @salainen6850 3 years ago +1

      @@WhatsACreel Yup! I think I should also do something on UEFI though, as it gives higher resolutions.

    • @ivanbrezina7632
      @ivanbrezina7632 3 years ago +1

      Also DES, RC4 and other cyphers based on Feistel's schema would be ridiculously slow without this.

  • @Ale-bj7nd
    @Ale-bj7nd 1 year ago +1

    I always forget how beautiful assembly is.

  • @monad_tcp
    @monad_tcp 3 years ago +91

    This made me realize that X86 is more abstract than the C language, each of those instructions is like 4 or 5 lines of C.

    • @andersjjensen
      @andersjjensen 3 years ago +60

      Now imagine having to teach a compiler to take your 5 lines of C code.... and figuring out which of the five thousand different x86 instructions is the perfect fit :P

    • @codahighland
      @codahighland 3 years ago +21

      That's the opposite of more abstract. Being more abstract means you have tools that are more general-purpose in order to handle a variety of different uses. These instructions are not abstract; they are intended for specific purposes and aren't especially useful at all otherwise.
      Consider that these instructions are actually implemented as microcode inside the CPU -- miniature programs built out of primitive building blocks.

    • @sunnohh
      @sunnohh 3 years ago +9

      @@codahighland i guess what he is really trying to say is that x86 is so bloated you can implement the same thing a billion different ways

    • @codahighland
      @codahighland 3 years ago +3

      @@davestephens3246 Was the ad hominem even necessary? I wasn't judging. I was just giving information.

    • @monad_tcp
      @monad_tcp 3 years ago +1

      @@codahighland "they are more general-purpose in order to handle a variety of different uses"
      that's why I said what I said. "X86 is more abstract than C"
      x86 has lots and lots of complexity, the instruction set has lots of arguments and things that happen in some state and not in others, the instruction is variable length.
      So, the instructions can be used for lots of different purposes, with different modes, different registers, and so on, and so forth.
      The instructions are actually implemented as microcode should be more than enough evidence that assembly is more abstract than the machine itself.
      Assembly is much more complex than the abstract machine that defines C and which you program to.
      C is basically a macro-assembler for the PDP11, X86 is a monster near it, it can do a lot, much more things, you can fine control memory load/store ordering, lots of abstract things that you can't even do in C, like barriers, for example.
      One practical example: there are SIMD instructions where a single instruction will do an entire for loop with a sum and a comparison to a variable, but in a register, so 4 or 5 lines of C is just a single asm instruction in x86, and the compilers know how to translate that. Because you can't even declare data-parallelism in C, the compilers have to pretty much guess, otherwise the CPU would be idling because C programs are sequential. But what we care about is how data relates to itself, not the control-flow of the program; the CPU couldn't care less about it (speculative execution for the win!), all because C has less abstraction power than the machine itself.
      C is really, really outdated.

  • @HowieDue416
    @HowieDue416 1 year ago

    This video makes no sense to me, but my uncles used to code in assembly language. It just truly gives me awe and appreciation for the pioneers who used this language (WITHOUT DEBUGGING) and makes me see them in a new light as men of math.
    Thanks for humbling me and thanking god that there are higher level languages

  • @jimviau327
    @jimviau327 3 years ago +82

    I'm no programmer but it appears to me that programming these instructions into a CPU is just about as complicated and fascinating as quantum physics.

    • @BrightBlueJim
      @BrightBlueJim 3 years ago +12

      Then you don't know much about quantum physics. The point is that these instructions were added because doing these operations (which are needed in very specific cases) in software is otherwise very inefficient. In fact, in a microcoded CPU, they aren't that difficult to implement. If you really had to do these things in "hardware" (i.e., dedicated logic gates), that would be a whole lot of square microns.

    • @mage3690
      @mage3690 2 years ago +1

      @@BrightBlueJim man, what a day and age we live in, to have real estate measured in microns! I'm only 20 years old, and I'm already living in the future. Imagine what the _actual_ future holds!

    • @Roxor128
      @Roxor128 1 year ago +1

      @@mage3690 Microns? If you want an actual comparison to real estate in terms of cost for high-end parts, you're going to want something a little bigger. Your unit will be the nanohectare (10mm^2). Your typical big complicated chip will therefore be around 20-40 nanohectares in size and will have cost Intel or AMD the equivalent of buying 20-40 hectares of actual land to develop.

    • @jgunther3398
      @jgunther3398 1 year ago +1

      half of the fun was all the bizarre "words" that mystified everybody else. it made you feel special. it's not as complicated as it looks. abstracting the problem into code is harder

    • @jimviau327
      @jimviau327 1 year ago

      @mage3690 here is a hint. In the future you will be a borg. With NeuralLink, all will be connected to the WEB and our reality will be online. Disconnecting from it would represent another phase of consciousness. Then, you will be able to experiment with 5 phases of consciousness , sleep, awake, dream, WEB and illumination. The latter being the most fantastic of all.

  • @NogCube
    @NogCube 3 years ago +1

    I love your style bro! This is a great one. 👌
    Back to 2000.

  • @realhet
    @realhet 3 years ago +75

    PUNPCKLQDQ is sad and disappointed not being able to get on the list ;D

    • @juliankandlhofer7553
      @juliankandlhofer7553 3 years ago +28

      Gesundheit.

    • @WhatsACreel
      @WhatsACreel  3 years ago +21

      I am sorry, PUNPCKLQDQ... :( if we do a follow-up video, I will be sure to include the unpacking instructions in that :)

    • @realhet
      @realhet 3 years ago +18

      @@WhatsACreel I remember doing a 8x8 16bit matrix transpose for a jpeg decoder with only 8 sse regs and 2 memory temp 'regs' with these crazy-named instructions. It was so satisfying when it finally started working correctly. :D
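
      For the curious, the textbook way to transpose an 8x8 block of 16-bit values with the unpack family is three rounds of interleaving at 16-, 32- and 64-bit granularity. Below is a Python model of that arrangement, with each "register" as a list of eight words. It is a sketch of the common pattern, not necessarily the exact sequence used in that decoder, and it ignores the register-pressure juggling that made the original hard.

```python
def unpack(a, b, width, high):
    """Interleave width-word groups from the low or high half of a and b.

    width=1 models PUNPCKLWD/PUNPCKHWD, width=2 PUNPCKLDQ/PUNPCKHDQ,
    width=4 PUNPCKLQDQ/PUNPCKHQDQ.
    """
    start = 4 if high else 0
    out = []
    for i in range(start, start + 4, width):
        out += a[i:i + width] + b[i:i + width]
    return out


def transpose8x8(rows):
    # Round 1: interleave 16-bit words of adjacent row pairs.
    t = []
    for i in range(0, 8, 2):
        t += [unpack(rows[i], rows[i + 1], 1, False),
              unpack(rows[i], rows[i + 1], 1, True)]
    # Round 2: interleave 32-bit pairs, combining rows 0-3 and rows 4-7.
    u = []
    for i in (0, 1, 4, 5):
        u += [unpack(t[i], t[i + 2], 2, False),
              unpack(t[i], t[i + 2], 2, True)]
    # Round 3: interleave 64-bit halves to finish each column.
    columns = []
    for i in range(4):
        columns += [unpack(u[i], u[i + 4], 4, False),
                    unpack(u[i], u[i + 4], 4, True)]
    return columns


matrix = [[row * 8 + col for col in range(8)] for row in range(8)]
for line in transpose8x8(matrix):
    print(line)
```

      That is 24 unpacks in total, and with only 8 registers some intermediates have to spill, which matches the "2 memory temps" part of the story.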

    • @WhatsACreel
      @WhatsACreel  3 years ago +11

      @@realhet Wow!! Things were certainly tough when we only had 8 regs :)

    • @SuperSmashDolls
      @SuperSmashDolls 3 years ago +13

      At some point that stopped being an x86 instruction and started being a DooM cheatcode.

  • @louistournas120
    @louistournas120 1 year ago +1

    It is great having a visual of these operations.
    Intel had once made an app that showed how each SSE instruction worked. I used that to learn and to write assembly code.

  • @lx2222x
    @lx2222x 3 years ago +4

    Very cool video with very good animations, pls continue making these videos 👍, I just love ur channel

  • @furyzenblade3558
    @furyzenblade3558 3 years ago +1

    Woa, high quality video, I love it! And the 3d visuals really help to represent the instructions

  • @OzoneGrif
    @OzoneGrif 3 years ago +48

    I wonder which language compilers are able to detect these patterns and use the ASM operand instead of doing the slow imperative way.

    • @WhatsACreel
      @WhatsACreel  3 years ago +37

      I love Clang! It does a lot of optimisations. You might have to use intrinsics, but these things are available in C++. Best way to know if the compiler is using decent instructions is to disassemble and check what it's doing. Or use the ‘Godbolt Compiler Explorer’ website.
      I don't think there's any compilers that are better at applying these instructions than humans. The gap is narrowing, and maybe one day, we'll get AI compilers that can do these things better.

    • @OzoneGrif
      @OzoneGrif 3 years ago +6

      @@WhatsACreel Right, I guess the best bet would be to use/create libraries providing these functions as interfaced tooling; the libraries making use of ASM internally if possible (since it depends on the CPU type)

    • @Winnetou17
      @Winnetou17 3 years ago +6

      @@WhatsACreel AI compilers that can do things better than humans! NEVER! Maybe just faster... (insecure human signing off)

    • @mthf5839
      @mthf5839 3 years ago +2

      @@Winnetou17 I might be wooshing rn, but there are quite a few examples of AI doing better than humans. Google has some wild stuff for recognizing numbers from blurred photos for its street view stuff.

    • @swarnavasamanta2628
      @swarnavasamanta2628 3 years ago +5

      @@OzoneGrif please no more abstraction by library interfaces at low level. It is a nightmare, i say let good programmers handle this.

  • @intvnut
    @intvnut 3 years ago +3

    Carryless multiplication also comes up in error correcting codes and checksums. And, of course, it can implement INTERCAL's unary bitwise XOR if you multiply by 3.

    • @intvnut
      @intvnut 3 years ago +1

      Hmm... my other comment about PEXT got deleted, probably because I included a link. PEXT implements INTERCAL's _select_ operator. And I believe PDEP can implement INTERCAL's _mingle_ operator. It's good to see Intel catching up with the amazing INTERCAL language!
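
      A Python sketch of that correspondence: PEXT gathers the bits chosen by a mask (INTERCAL's select), and two PDEPs with alternating-bit masks interleave two 16-bit operands (mingle). Which operand lands on the odd bits is my reading of the INTERCAL manual, so treat the operand order as an assumption; the helper names are made up.

```python
def pext(value: int, mask: int) -> int:
    """Parallel bit extract: pack the bits of value selected by mask."""
    result, out_bit = 0, 0
    while mask:
        lowest = mask & -mask
        if value & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1
    return result


def pdep(value: int, mask: int) -> int:
    """Parallel bit deposit: scatter the low bits of value to the mask positions."""
    result = 0
    while mask:
        lowest = mask & -mask
        if value & 1:
            result |= lowest
        value >>= 1
        mask &= mask - 1
    return result


def select(a: int, b: int) -> int:
    """INTERCAL-style select: bits of a where b has ones, packed to the right."""
    return pext(a, b)


def mingle(a: int, b: int) -> int:
    """INTERCAL-style mingle: interleave two 16-bit values (a on the odd bits)."""
    return pdep(a, 0xAAAAAAAA) | pdep(b, 0x55555555)


print(select(179, 201))
print(mingle(0xFFFF, 0), mingle(0, 0xFFFF))
```

      Since mingle and select with the same mask undo each other, the pair also gives a cheap round-trip check.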

  • @adamengelhart5159
    @adamengelhart5159 3 years ago +42

    The other day I learned about the POLY instruction on the VAX. That's POLY as in polynomial, so when I heard of it I thought "well, I guess there could be a use for it in numerical apps, maybe? It's not like it's going to be more than a few coefficients. Maybe a cubic; that's only four."
    I was only off by twenty-eight! That's right--the VAX can, with a single terrible opcode, compute the value of up to a thirty-first degree polynomial, to either float or double precision.
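
    What a POLY-style instruction boils down to is Horner's rule over a table of coefficients. Here is a minimal Python sketch; the coefficient order (highest degree first) and the function name are my choices for the example and say nothing about how the VAX lays out its table or rounds intermediate results.

```python
def poly(x, coefficients):
    """Evaluate a polynomial by Horner's rule; coefficients are highest degree first."""
    result = 0.0
    for c in coefficients:
        # One multiply and one add per coefficient, no explicit powers of x.
        result = result * x + c
    return result


# x^3 - 3x + 5 at x = 2
print(poly(2.0, [1, 0, -3, 5]))

# A degree-31 polynomial (32 coefficients) stays perfectly tame for |x| < 1.
print(poly(0.5, [1.0] * 32))
```

    High degrees only blow up when |x| is large; for series evaluated on a small interval the terms shrink instead, which is the use case such an instruction is aimed at.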

    • @TotalImmort7l
      @TotalImmort7l 3 years ago +3

      Isn't assembly strangely awesome?

    • @romannasuti25
      @romannasuti25 3 years ago +2

      ...wouldn't a 31 degree polynomial just smash the value to negative infinity, positive infinity, or zero? What the hell is even the use of that lol

    • @juanthehorse420
      @juanthehorse420 3 years ago +7

      @@romannasuti25 nope, if you need to do some crazy ass Taylor series or something and just look at a certain portion

    • @meneldal
      @meneldal 3 years ago

      @@juanthehorse420 Outside of bragging about computing Pi faster, is there any use for 10+ long Taylor series in practice?

    • @MsHumanOfTheDecade
      @MsHumanOfTheDecade 3 years ago +1

      @@meneldal approximating any function with nicer ones and then being able to calculate that fast on the fly can be useful, though most of those often-used functions have fast instructions themselves at this point.

  • @FlorianEagox
    @FlorianEagox 3 years ago

    I love that I can tell how much fun you were having with this!

  • @SvetlinAnkov
    @SvetlinAnkov 3 years ago +8

    @Creel, I love how you slipped in DNA nucleotide bases in the string match example 😃

    • @allmycircuits8850
      @allmycircuits8850 3 years ago +6

      As soon as genetic scientists move from Excel to ASM, we are DOOMED!

  • @davannaleah
    @davannaleah 3 years ago +4

    I remember the old Intel 8085 had some hidden instructions we used in our projects. We knew they would not be changed because the instructions were used in some of the development tools for the MDS (Microprocessor Development System). There were instructions like LDHLISP with an 8-bit offset parameter. Basically it was "Load the HL register Indirectly with the Stack Pointer with the offset added"; it was essential for writing re-entrant code (in 8085 assembler!). BTW this was way back in 1980!

  • @desmond-hawkins
    @desmond-hawkins 1 year ago +4

    About *CMPXCHG* being "absolutely bizarre" (6:22), this is not only used for mutexes and semaphores as explained, but is also the most common primitive used for "lock-free" concurrent data structures (see for example Doug Lea's amazing ConcurrentSkipListMap implementation). It is so useful that many languages export it in some core library, like std::atomic in C++ or java.util.concurrent in Java. Most programs you use every day likely rely on it or its equivalent in another architecture, unlike some of the other weird instructions listed in this video.

    • @michaelcederberg7937
      @michaelcederberg7937 1 year ago +1

      And it is not very useful as presented where all operands were registers. You want to execute this on a piece of memory.
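
      To show why the memory operand matters, here is a Python sketch of a classic lock-free (Treiber-style) stack whose only synchronization is compare-and-swap on the shared head pointer. `AtomicRef` is a hypothetical stand-in for a memory word updated with LOCK CMPXCHG (its internal lock only models the hardware's atomicity), and the sketch leans on garbage collection to dodge the ABA problem that real implementations must handle.

```python
import threading


class AtomicRef:
    """Stand-in for a shared memory word with an atomic compare-and-swap."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()   # models hardware atomicity only

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Replace the value only if it is still expected; report success."""
        with self._guard:
            if self._value is expected:
                self._value = new
                return True
            return False


class Node:
    def __init__(self, item, next_node):
        self.item = item
        self.next = next_node


class LockFreeStack:
    def __init__(self):
        self.head = AtomicRef(None)

    def push(self, item):
        while True:
            old = self.head.load()
            # Retry if another thread moved the head in the meantime.
            if self.head.compare_and_swap(old, Node(item, old)):
                return

    def pop(self):
        while True:
            old = self.head.load()
            if old is None:
                return None
            if self.head.compare_and_swap(old, old.next):
                return old.item


stack = LockFreeStack()


def worker(base):
    for i in range(500):
        stack.push(base + i)


threads = [threading.Thread(target=worker, args=(n * 1000,)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

popped = []
item = stack.pop()
while item is not None:
    popped.append(item)
    item = stack.pop()
print(len(popped))
```

      Nobody ever blocks here: a thread that loses the race just re-reads the head and tries again, which only works because the compare and the store hit the shared location as one step.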

  • @adamwieckowski6082
    @adamwieckowski6082 3 years ago +2

    pmaddwd is my all time favorite instruction. Totally priceless for video coding!
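
    For reference, a Python model of what pmaddwd computes on one 128-bit register: multiply eight pairs of signed 16-bit values and add adjacent products into four signed 32-bit sums. The filter taps and sample values below are invented for the example.

```python
def pmaddwd(a, b):
    """Model of 128-bit PMADDWD: 8 signed 16-bit lanes in, 4 signed 32-bit sums out."""
    assert len(a) == len(b) == 8
    sums = []
    for i in range(0, 8, 2):
        total = a[i] * b[i] + a[i + 1] * b[i + 1]
        # Wrap to signed 32 bits; this only triggers when all four inputs are -32768.
        total = (total + 2 ** 31) % 2 ** 32 - 2 ** 31
        sums.append(total)
    return sums


samples = [1, 2, 3, 4, 5, 6, 7, 8]
taps = [3, -1, 3, -1, 3, -1, 3, -1]    # a made-up 2-tap filter applied to each pair
print(pmaddwd(samples, taps))
```

    Getting a multiply and the first level of the accumulation tree in one instruction is what makes it so handy for filters and transforms on 16-bit data.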

  • @colinstu
    @colinstu 3 years ago +5

    that glow around the bright text on dark background is driving my eyeballs crazy.

    • @WhatsACreel
      @WhatsACreel  3 years ago +1

      Noted! Thanks for letting me know and cheers for watching :)

    • @colinstu
      @colinstu 3 years ago +2

      @@WhatsACreel interesting vid / instructions nonetheless. but yeah, the glow reminds me of when my eyes are wet from crying, I kept having to pause and rub my eyes to "dry" them only to see it's still foggy looking lol.

    • @WhatsACreel
      @WhatsACreel  3 years ago +3

      @@colinstu Ha! I felt the same way while making it! I toned down the glow from 6 to 2.5. It was still hard to look at, but I’d already rendered half the animations, so had to settle. I’m hoping to use animations resembling construction paper in the future. They are very easy to look at, but more time consuming to create. We will have to see how we go.

    • @snoozy355
      @snoozy355 3 years ago

      @@WhatsACreel what software did you create your animations in?

  • @gazehound
    @gazehound 10 months ago

    Wow that carryless multiplication instruction took me straight back to my Information & Coding Theory class.

  • @kingtutthefirst
    @kingtutthefirst 3 years ago +3

    I've always loved the absurdity of the PA-RISC2 instructions SET/RESET/MOVE TO SYSTEM MASK and the PSW E-bit. By changing it, you change the endianness of the entire CPU... And, because of pipelining, the instruction has to be followed by 7 palindromic NOP instructions. That's just always cracked me up.

  • @WhaleMilk
    @WhaleMilk 3 years ago +1

    gonna admit, I don't know a lick of Assembly, but I enjoy trying to decode what anything here means while also listening to this dude's voice. Very entertaining

  • @alienrenders
    @alienrenders 3 years ago +21

    Is it bad that I've used most of these and consider them perfectly normal? Glad you didn't get into OS level instructions that set up descriptors and gates. Now those are weird.

    • @keokawasaki7833
      @keokawasaki7833 3 years ago +8

      bruh that shit fucks with my head, i tried getting into it but then the whole GDT, protected mode, gates and shit just knocked the air out of me by punching my brain in the balls (figuratively)

    • @ethanpayne4116
      @ethanpayne4116 3 years ago

      considering these instructions normal is like knowing the difference between the ruddy northeastern gray-banded ant and the ruddy northeastern gray-striped ant. The world of CISC is truly a jungle

  • @patrickpinholt
    @patrickpinholt 1 year ago +1

    Fun to hear about the rarely seen instructions 🎉🎉🎉

  • @soonts
    @soonts 3 years ago +5

    addsubps was probably made for complex numbers packed into these vectors.
    mpsadbw and the similar psadbw were indeed made for video codecs, to estimate errors. You should avoid mpsadbw because it is too slow, but psadbw is good.
    I think the craziest of them are for cryptography, like aeskeygenassist or sha1rnds4. Good luck explaining what they do.
    Other notable mentions are insertps (SSE 4.1; inserts a lane into a vector + selectively zeroes out lanes; I've used it for lots of things), pmulhrsw (SSSE3; hard to explain what it does but I used it to apply volume to 16-bit PCM audio), and everything from the FMA3 set (easy to explain what they do, that’s ±(a*b)±c in one instruction for float numbers, but the throughput is so good).
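Since pmulhrsw is described above as hard to explain, here is a scalar sketch of what it computes per 16-bit lane, applied to the PCM volume use case the comment mentions. The function names are hypothetical, the gain is assumed to be a Q15 fixed-point fraction (32767 is roughly 1.0), and the code relies on arithmetic right shift of negative values, which is guaranteed from C++20 and is what mainstream compilers do in earlier standards.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One lane of pmulhrsw: multiply two signed 16-bit values, keep the high
// bits of the 32-bit product, and round to nearest instead of truncating.
inline std::int16_t mulhrs_q15(std::int16_t sample, std::int16_t gain) {
    std::int32_t product = static_cast<std::int32_t>(sample) * gain;
    std::int32_t rounded = ((product >> 14) + 1) >> 1;
    return static_cast<std::int16_t>(rounded);
}

// Hypothetical usage: scale a buffer of 16-bit PCM samples by a Q15 gain.
// The real instruction would process 8 samples (SSSE3) per operation.
inline void apply_volume(std::vector<std::int16_t>& pcm, std::int16_t gain_q15) {
    for (std::int16_t& sample : pcm) {
        sample = mulhrs_q15(sample, gain_q15);
    }
}
```

With a gain of 16384 (0.5 in Q15), a sample of 1000 becomes 500 and a sample of 3 becomes 2, because 1.5 is rounded rather than truncated.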

    • @WhatsACreel
      @WhatsACreel  3 years ago +2

      Great points mate! Cheers for watching :)

  • @Erizo_
    @Erizo_ 1 year ago +1

    I never knew I needed this, until now.

  • @xymaryai8283
    @xymaryai8283 3 years ago +17

    god, not even cryptographers would bother figuring these instructions out nowadays. no wonder RISC instruction sets are so much faster for the same electrons, they don't need to snake around the dark winding alleys of the ALU

    • @mduckernz
      @mduckernz 3 years ago +3

      They absolutely do, though. Crypto nearly exclusively is written in assembler, and prioritises code that always takes the same amount of time to execute (to prevent timing attacks), and code that also otherwise doesn't leak state (the amount of time something takes to execute is a leak, but if it's always the same you can't extract any data from it)
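To illustrate the "always takes the same amount of time" idea from this reply, here is a sketch of a comparison whose loop does not exit early on the first mismatch. The function name is hypothetical, and this is an illustration only: an optimizing compiler is free to transform the loop, which is one reason real cryptographic libraries often drop to assembly or use vetted primitives for this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Compare two buffers without branching on their contents. Every byte is
// always visited, so the running time depends on `len` but, at the source
// level, not on where (or whether) the buffers differ.
inline bool constant_time_equal(const std::uint8_t* a,
                                const std::uint8_t* b,
                                std::size_t len) {
    std::uint8_t diff = 0;
    for (std::size_t i = 0; i < len; ++i) {
        diff |= static_cast<std::uint8_t>(a[i] ^ b[i]);  // accumulate any mismatch
    }
    return diff == 0;
}
```

A naive `memcmp`-style loop that returns at the first differing byte would leak, through timing, how long the matching prefix is.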

  • @SERVOPUNK
    @SERVOPUNK 3 years ago +1

    Ah, now I have a solution for the task of making any x86 compiler author cry in 15 minutes.

  • @islandfireballkill
    @islandfireballkill 3 years ago +180

    I wonder how complicated it would be to try to formulate compiler autorecognition for instruction selection for these. That last one is easily a couple hundred lines of C code.

    • @FinaISpartan
      @FinaISpartan 3 years ago +90

      Very complicated. Most of these optimizations are often missed by C compilers and have to be manually implemented in assembly. In some cases (video de/encoding) up to 50% of the codebase has to be rewritten in asm for these reasons.

    • @Abu_Shawarib
      @Abu_Shawarib 3 years ago +56

      Your only hope is to use a library that already has fast paths coded in assembly to do this for you.

    • @jfwfreo
      @jfwfreo 3 years ago +30

      The best way to do this would be to implement these as compiler intrinsics that would then be substituted with the correct ASM instructions.

    • @bootmii98
      @bootmii98 3 years ago +1

      @@jfwfreo what if some other arch doesn't have them? most compiler suites support at least one other architecture.

    • @jfwfreo
      @jfwfreo 3 years ago +10

      @@bootmii98 Most compilers for x86/x64 (including GCC and Microsoft) already support a boatload of compiler intrinsics for SSE and all sorts of things.
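One common way to reconcile the two positions in this thread is an intrinsic on targets that have the instruction plus a portable fallback elsewhere. The sketch below assumes the SSE2 intrinsic `_mm_madd_epi16` (which maps to pmaddwd) from `<emmintrin.h>`; the wrapper name `madd16` and the choice of preprocessor checks are assumptions made for illustration, not a recommendation from the commenters.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics, including _mm_madd_epi16
#endif

// Multiply eight signed 16-bit pairs and add adjacent products, giving
// four 32-bit sums (the pmaddwd operation).
inline std::array<std::int32_t, 4> madd16(const std::array<std::int16_t, 8>& a,
                                          const std::array<std::int16_t, 8>& b) {
    std::array<std::int32_t, 4> out{};
#if defined(__SSE2__) || defined(_M_X64)
    // x86 path: the compiler emits pmaddwd and handles register allocation.
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a.data()));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.data()));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), _mm_madd_epi16(va, vb));
#else
    // Portable path for architectures without the instruction.
    for (std::size_t i = 0; i < 4; ++i) {
        std::int64_t sum = static_cast<std::int64_t>(a[2 * i]) * b[2 * i] +
                           static_cast<std::int64_t>(a[2 * i + 1]) * b[2 * i + 1];
        // Wrap to 32 bits like the hardware does in the one overflowing case.
        out[i] = static_cast<std::int32_t>(static_cast<std::uint32_t>(sum));
    }
#endif
    return out;
}
```

Callers see one function either way, so the architecture question is confined to a single `#if` instead of spreading through the codebase.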

  • @spacewolfjr
    @spacewolfjr 3 years ago +3

    Creel, you are most excellent!

  • @Clairvoire
    @Clairvoire 1 year ago

    I didn't know PEXT existed until now... it's exactly what I need for fixed point multiplication, thanks!
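For anyone else meeting PEXT for the first time, a scalar sketch of its behaviour may help: it gathers the bits of the source selected by the mask and packs them contiguously at the low end of the result. The function name `pext64` is hypothetical; on CPUs with BMI2 the single instruction replaces this whole loop.

```cpp
#include <cassert>
#include <cstdint>

// Scalar illustration of PEXT (parallel bit extract).
// Walk the mask from its lowest set bit upward; each selected source bit
// is written to the next free low-order position of the result.
inline std::uint64_t pext64(std::uint64_t source, std::uint64_t mask) {
    std::uint64_t result = 0;
    int out_bit = 0;
    while (mask != 0) {
        std::uint64_t lowest = mask & (~mask + 1);  // isolate lowest set mask bit
        if (source & lowest) {
            result |= std::uint64_t{1} << out_bit;
        }
        ++out_bit;
        mask &= mask - 1;                           // clear that mask bit
    }
    return result;
}
```

For example, extracting with mask 0x0F0F from 0xABCD keeps the nibbles B and D and packs them into 0xBD.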

  • @JohnCLiberte
    @JohnCLiberte 3 years ago +6

    Just imagine pitch meetings to decide which instructions should go in the set :D. I'm surprised they don't have a 'calculate your taxes and clean the house' instruction

    • @boptillyouflop
      @boptillyouflop 3 years ago +3

      These instructions do have a couple really solid selling points: (1) they don't write to multiple registers (2) they don't do special memory accesses (3) they don't cause any weird special interrupts.

  • @nicholash8021
    @nicholash8021 3 years ago +2

    CMPXCHG -- Probably the most important instruction of them all.

    • @carstenschultz5
      @carstenschultz5 3 years ago +1

      Yes, nothing exotic about this. It’s also in LLVM IR, for example.

    • @nicholash8021
      @nicholash8021 3 years ago

      @@carstenschultz5 Not exotic but critical for establishing synchronization contexts in multi threaded systems.

    • @carstenschultz5
      @carstenschultz5 3 years ago +1

      @@nicholash8021 , I was agreeing with you. It just does not belong in a list of crazy instructions.

  • @overcritical304
    @overcritical304 3 years ago +3

    Honestly, 2 days ago I was trying to figure out what the hell MPSADBW does! Love you Creel, I hope you will make videos with in-depth explanations of these instructions.

    • @WhatsACreel
      @WhatsACreel  3 years ago +1

      Hahaha, that's awesome! Thank you for watching :)
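As a partial answer to the question in this thread, here is a scalar sketch of the core of MPSADBW: a 4-byte reference block is slid across an 11-byte window, and each of the 8 positions yields one 16-bit sum of absolute differences. The function name is hypothetical, and the sketch assumes both block offsets (selected by the immediate operand in the real instruction) are zero.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified MPSADBW with immediate offsets fixed at 0 (an assumption).
// `window` plays the role of the first operand (bytes 0..10 are used),
// `reference` the second operand (bytes 0..3 are used).
inline std::array<std::uint16_t, 8> mpsadbw_offset0(
        const std::array<std::uint8_t, 16>& window,
        const std::array<std::uint8_t, 16>& reference) {
    std::array<std::uint16_t, 8> sums{};
    for (std::size_t i = 0; i < 8; ++i) {          // 8 sliding positions
        unsigned total = 0;
        for (std::size_t j = 0; j < 4; ++j) {      // 4-byte block per position
            int diff = static_cast<int>(window[i + j]) - static_cast<int>(reference[j]);
            total += static_cast<unsigned>(diff < 0 ? -diff : diff);
        }
        sums[i] = static_cast<std::uint16_t>(total);
    }
    return sums;
}
```

In motion estimation the smallest of the eight sums points at the horizontal shift where the reference block matches the window best.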

  • @Molotom
    @Molotom 1 year ago

    You have great energy and enthusiasm in this video! Keep it up :)

  • @SimGunther
    @SimGunther 3 years ago +49

    EIEIO
    I know it's a PPC instruction, but still...
    Seriously, the craziest ASM instructions are the ones not documented in any of the instruction manuals, but are only found by the sandsifter program (written by xoreaxeaxeax)

    • @sebastiaanpeters2971
      @sebastiaanpeters2971 3 years ago

      Any proof for your second claim?

    • @danyildiabin4953
      @danyildiabin4953 3 years ago +12

      @@sebastiaanpeters2971
      ua-cam.com/video/_eSAF_qT_FY/v-deo.html
      ua-cam.com/video/ajccZ7LdvoQ/v-deo.html
      This guy had a few talks about undocumented instructions or whole undocumented cpu hardware blocks

    • @SimGunther
      @SimGunther 3 years ago +7

      @@sebastiaanpeters2971 Any of Chris Domas' talks around unlocking God Mode or breaking x86 should suffice

    • @StannyObelisk
      @StannyObelisk 3 years ago +12

      Old McDonald had an assembler, EIEIO.

  • @notamouse5630
    @notamouse5630 1 year ago +1

    CMPXCHG16B is used for atomic operations required by lock free and block free queues.

  • @reirei_tk
    @reirei_tk 3 years ago +4

    Honestly it's amazing how much work PCMPxSTRx can do in 3 or 4 clock cycles.

  • @UnkownUnkown01
    @UnkownUnkown01 1 year ago

    As weird as this video is, I've never enjoyed a video so much. I think it's just the enthusiasm this guy has... damn, I wish everyone who made videos like this had that same enthusiasm. If you're reading this, thanks, I can't remember the last time I liked a video this much

  • @jerradn
    @jerradn 3 years ago +4

    I felt like I had to clean my glasses several times during this video, haha.

  • @swoopskee
    @swoopskee 3 years ago +1

    whoah, this is some premium content right here, thank you! Subbed and notifications on

  •  3 years ago +4

    It's like watching golden globes for nerds

  • @Lux_Lost
    @Lux_Lost 3 years ago +1

    I don't know where I went wrong in life to end up here, but I'm enjoying it, so it's chill

  • @DogsRNice
    @DogsRNice 3 years ago +5

    Of all the thousands of videos I’ve watched this is the one that went farthest over my head

  • @distrologic2925
    @distrologic2925 3 years ago

    Love how excited he is constantly

  • @haraldsbaumanis
    @haraldsbaumanis 3 years ago +7

    It would be very interesting to talk to the people who designed these chips

    • @GogiRegion
      @GogiRegion 3 years ago +3

      I’m just imagining that the entire design team for #1 probably goes into extreme PTSD flashbacks any time they see the letters PCMP anywhere near STR. I just can’t imagine what the proposal that led to the instruction being considered was like.

  • @douggale5962
    @douggale5962 3 years ago

    To be honest, I 3/4 expected this to be a dumb list, but I was pleasantly surprised that you actually know some stuff!

  • @iandrsaurri625
    @iandrsaurri625 3 years ago +11

    Imagine how fast programs would be if our compilers could instantly see when these obscure commands would be useful and then put them into place. I don't even understand how these instructions take so few clock cycles

    • @adrien.cordonnier
      @adrien.cordonnier 3 years ago +3

      Imagine how fast programs would be if developers could see when these obscure commands would be useful and then put them into place.

    • @joekerr5418
      @joekerr5418 3 years ago +1

      If only the compilers had a mind of their own. Well, the developers do, but nah

  • @sgthometar
    @sgthometar 3 years ago

    Imma be honest: I didn't understand most of this. But your enthusiasm is contagious.

  • @mojeimja
    @mojeimja 3 years ago +3

    I cannot imagine a compiler that utilizes these fully! Use asm, optimize by hand!

    • @soonts
      @soonts 3 years ago

      I agree it’s borderline impossible for compilers to emit them automatically. I saw clang’s auto-vectorizer emitting vpshufb but that was very simple code.
      I disagree about ASM. All these instructions can be used in C or C++ as compiler intrinsics, way more practical.

    • @mojeimja
      @mojeimja 3 years ago

      @@soonts yes, but if someone can understand and use intrinsics properly, then they can just write the entire function in ASM too (right there inside the C code), so it's not about how exactly you use these instructions, it's about using them efficiently at all.

    • @soonts
      @soonts 3 years ago

      @@mojeimja The code I write often has both SIMD and scalar parts, interleaved tightly.
      Modern compilers are quite good at scalar stuff: they abuse the LEA instruction for integer math because it’s faster, and do many more non-obvious things. Just because they suck at automatic vectorization doesn’t mean they suck generally.
      For SIMD code, manually allocating registers and conforming to the ABI (i.e. which registers to back up and restore when doing function calls) is not fun.
      With intrinsics, the compiler takes care of these boring pieces.

  • @saudude2174
    @saudude2174 1 year ago

    PEXT made me laugh for some reason. Don't know if it's the particular tone you explained it in or the absolute (seemingly for my stupid brain) randomness and bizarreness of this operation, but I love it.

  • @elietheprof5678
    @elietheprof5678 3 years ago +5

    I have to admit, when CPUs changed from 32 bit to 64 bit, I was skeptical. Like how often do you really need to count beyond 2 billion anyway? But now I see why 64-bit instruction sets can be useful as fuck, and faster for the same clock speed.

  • @lt3880
    @lt3880 1 year ago

    This was in my recommendations dozens of times in the last year. I finally watched, and I don't know what to do with this information