having only ever worked with RISC assembly like MIPS in school, seeing the extremes of what you poor poor x86 driver authors have to deal with is entertaining and enlightening.
Sojit , in this case It doesn't appear that this video content will ever be of service to the quality of life you are seeking. Did I just wrote that ? I'm not even sure I understand myself. :)
3 роки тому+3
@@TheActualDP It has 2#10_1111_0010_1010_0010# views (I love ADA's based integers :D)
RdSeed - It's not always slow. There's a FIFO on the output of the RNG. RdSeed pulls from that FIFO. If you haven't just pulled a bunch of values from the FIFO, the value will be available immediately because the FIFO is not empty. If you try to continuously pull from Rdseed and measure the average time per instruction, it will appear slower because you are limited to the physical rate of generation of full entropy numbers from the RNG, which requires a whole lot of computation - Generate 512 bits from the entropy source, AES-CBC-MAC them together to get 128 bits (that's two RdRand result's worth) XOR it with and output from the DRBG (another 3 AES operations, just like SP800-90C describes) stuff the two 64 bit numbers from the 128 bit result into the output FIFO. How do I know all that? I designed it.
@@LKRaider It was around 2009 I started. It ended up first in the Ivy Bridge processors with the RdRand instruction. I had been working on writing cryptographic protocols in standard committees (802.11i, 802.16 etc) and they all needed cryptographically secure random numbers and when I looked at the SP800-90 specification back then, it was not sufficient. It described DRBGs (aka PRNGs) but not entropy extraction or physical entropy sources. A small team of 4 people was assembled, myself, a mathematician, an analog designer and a corporate cat herder. the math guy came up with some of the mathematical principles and identified the best papers describing how to quantify the entropy, the analog guy did the physical entropy source, the cat herder got it into silicon and I designed the digital logic that takes the partially random bits, turns them into full random bits with an entropy extractor and seeds a PRNG/DRBG with that full entropy data to make the resulting stream of random numbers fast enough. Since then the other three left (2 retired and one died) and I've been the main owner of the RNGs since. RdSeed which gives full entropy output as per SP800-90C and X9.82 was added with Broadwell. This was so you could make arbitrarily large keys from it. Faster and slower versions were created (fast for servers, slower for energy efficient chips) also I've designed a few other types of RNG for specific needs, like super small ones, non uniform ones and floating point ones. I contributed to the development of SP80090B and SP800-90C and the revision of SP800-90A which now cover most of what you need in a secure RNG. A couple of years ago I finished a book on random numbers which was published (Random Number Generators, Principles and Practices). So getting involved to solve my problem of where do I get random numbers has turned into the defining part of my career. The standard are still changing. Certification requirements are still evolving and the need for new RNGs that fit in different contexts keeps up apace, so it has become a full time job for myself and a small number of colleagues.
@@DOSeater I wish I knew this 2 months ago. I got that job anyway, but it took a little more interview iterations. Now I'm a happy developer of a delivery robot :)
Are you kidding me? I hate his voice with every fibre of my being. I've subbed only because he has subtitles and the other videos look interesting. That first 30 seconds was excruciating, I may need a lie down in a dark room
So the most amazing thing about these instructions to me is the fact so many of them run in single digit cycles. You have to marvel at the engineering effort that has gone into it. Also, a compiler has to basically be sentient to know when and how to use some of these.
@@MrHaggyy And billions of hours for a javascript hello world. i think capable computer engineers brought this upon their selves by providing layers and layers of abstraction and burying need for internal necessary concepts to get something done. No wonder the developers now are too shallow in their concepts, probably not their fault if they get hired only after 6 months of python for data structures (they have no incentive to learn the deeper internals if they get paid shitload for sitting in a desk). Hell i would say most people choose programming or development for making bucks, learning and interest comes later. There only a few people now who are truly interested and curious in the core of things and it might just be that after 10 years understanding these would just be luxury and not necessity. Also no wonder why most programmers hate their jobs and want to die after getting one.
@@swarnavasamanta2628 mhm i think the horizon of programmer/developer/engineer in this field got much broader. Yes, there are many abstraction layers we have invented and standardized over the years. I have a mechatronics degree with a microsystem-technology specialization. Most of my field works on improving the hardware for existing assembly code. But we also introduce new things in hardware which we map to assembly or C/C++ code. On that layer, you have the guys who are building assemblers, linkers, and compilers. These are the programs you need to actually execute code on a machine. On top of that, you have the Microsoft, Android, Apple, Linux, etc guys who write an operating system that provides useability with that stuff. And on that foundation, you can start building languages, IDEs, or any program you can open on your computer. And if we finally have these higher-level languages and programs we can start building frameworks or things like python. That field can write very powerful applications that millions of people can use, or that run on many machines at the same time, or all the things these cloud-native guys are doing. The interest in these fields is widely different. I personally love hardware, and the guys I work with love building hardware or building systems with hardware. Systems can be the new Intel i3-i7, over to raspberry pi or smartphone processor, to small controllers like an STM32 which are used in smartwatches, cars, microwaves, freezers down to something like an Arduino which is easy to learn. There are a lot of people working on those layers. Many of them being the stereotype white europeon/north-american older man. But this field is one of the most global out there. With Korea, Taiwan, Japan and China being the "most" impactful. The amount of things you could learn about computer and software layers is way beyond one's reach. 99.99% of all programmers don't have a clue how transistors are formed into bite logic, scaled to 16-32-64-86-128bit wide memory, how this memory became a register with a specific purpose and how you address this register so you can call it. But you don't need to know it in order to write a program. :-) we have you covered in that one :-) So even assembly can teach you a lot about how a computer works, you don't need to write it. In fact you shouldn't write it for any used code. Use a compiler and write it in a higher-level language. All the smart people from the compiler department will cover you there. And so on and so on. Until the hip young facebook star engineer can write his php or python code for his next new feature. And if we do something amazing down the layers he will get a new version that will make his software even better than before. And the only thing he needs to do is trust the work of other people. The unpleasant truth about why so many programmers want to die or really do it is a mismatch between management, expectations and skills pared with bad working environments. Coding and engineering computers is a mentally very hard and demanding task. You have to know your tools, get to know the problem, which I like to call a puzzle, identify the pieces of your puzzle, sometimes create a new piece that fits, and solve the puzzle. This takes time. A good time is anything from 2 to 4 hours. Less is only sufficient for really easy tasks, longer is better but you need to train for it and you need to go to the toilet, move, eat, sleep etc. In most companies, this deep focus session gets corrupted by meetings, telephone, angry managers, or people that think they are important to the problem. These corruptions drain a lot of willpower and unless you are an (senior) engineer and prepared for this kind of stuff it will depress you. You need to get your routines in place in order to sustain. The other part is once you solved the puzzle your company needs to give you a reward for this. If your management doesn't like your result and lets you feel their miss liking, you need someone holding you on the bridge. That's why many companies in this field like Facebook and Intel don't have 9 to 5 jobs. You get paid to work for them. There are recommendations on how you should set up your routines and there are people helping you. But you can come and go as you like. But you get certain tasks and a timeframe. Once the timeframe is over people all over the world are counting on you getting the job done in time. So very wide, very different, and very interesting domain. And it's very rewarding if you know that you did something that all of mankind will use and benefit from in a view month after you finished your work.
"HCF" -- Halt and Catch Fire. On a lot of early CPUs (1970s/1980s, yes damnit I am old) the manual gave the bit pattern for each instruction - and the the rest of the bit patterns did undocumented things. Some were just a different way to spell NOP, some did deeply bizarre unintended things that happened because the bits randomly activated chunks of the CPU circuitry that mixed and matched chunks that were used in different combinations for other commands, and some did things that were only ever intended to be done in the factory, during QA testing. We used to hunt through these "undocumented instructions" looking for anything interesting or cool that we could then figure out uses for. But this was a bit risky. A fair number of CPUs had at least one undocumented instruction that would immediately cause the machine to lock up and, a few seconds later, destroy the CPU. Sometimes they caught fire, sometimes they melted through the PCB. Sometimes they desoldered themselves from the board and fell out. Whenever we found it we called it a "Halt And Catch Fire" instruction and patched the name 'HCF' into our macro assembler for that bit pattern, in order to avoid accidentally finding it again. Naturally when I saw the title of this video I figured HCF would be at the top of the list. Finding an HCF usually meant a new version of the chip as soon as the company could mask it off. We thought of ourselves as contributing to their QA efforts, although very few of them thanked us for it.
Write while rewind Eject disc Read & write while ripping tape Disable console active emergency power off Electrocute operator Sense card deck on printer and open cover Write past EOT Read and scramble data I have a huge list of them along with my green cards
@@Safyire_ We found things like 'compare while swapping' that swapped the values in two registers while writing 1 to the comparison bit if the first was higher than the second. That was actually a little bit useful. We found a lot of things that tried to do two or three things at once but did them in a random-ish order because of race conditions. One of those was useful because it consistently did xor before swap if the CPU was hot and swap before xor if the CPU was cold, so we could write code that monitored the CPU and shut things down if it got too hot. We found instructions that connected multiple registers to the bus for output, meaning the result of the instruction would be written to four different registers at once. We also found instructions that connected multiple registers to the bus for input, which was useless and sometimes damaged the CPU. It was a real crapshoot. Also a very expensive hobby if you damaged the machine and your professor wasn't ready to write it off to "research." CPUs were not cheap.
I can't get over this presentation. That's the kind of nerdy content you expect to find in a recording of a 10 year old talk that was given to 50 people in a tent :D
I've used MIPS excessively and never looked at X86 much. This feels like when you were playing yugioh in 1999 and you were summoning and setting 1 card every turn and then you get teleported to 2023 where people play their entire deck in one turn and have cards with effects that are 7 paragraphs
Even when cutting off all SSE and up instructions (making it useful for legacy x86 device targetting) there is still a lot of complexity, including very precise x87 floating point and MMX vectorization. What makes it especially fascinating is how compatible it has become; a 640×480 60fps renderer on a very old x86 processor with MMX might very well be the exact same program that does 3840×2160 60fps on a modern PC.
Good lord that poor silicon. I can't even begin to imagine how you'd design chips to implement some of these instructions. I'd love to see a followup video showing some examples of using these instructions, and if they're superceded, what should be used instead!
I'd think that there are massive groups of "one circuit per operation", and they all work in parallel. From all the results only the specified one is selected.
I doubt these instructions were aimed at people writing compilers, they'd be aimed at people doing things with encryption, low-level synchronization, multimedia.. I think these days people would first try and come up with a GPU based way to tackle these large data-processing problems, but before GPUs were general purpose parallel computers you had to do these single instruction multiple data things on the CPU
@@kestasjk Also, doing stuff with a good CPU instruction is generally more efficient than doing it on the GPU, simply because you have to send across the data and get the result back on a GPU.
@@toboterxp8155 Sort of.. The thing is if you’ve got enough data the GPU is so much faster it’s worth the overhead (and the memory space is getting more integrated / unified all the time), and if you’ve not got enough data to make sending to the GPU worthwhile the speed up for processing a small amount of data on the CPU more efficiently probably isn’t worth it. Perhaps for certain encryption or compression tasks where it can’t be parallelised very well on the GPU but it still needs lots of processing power they may still be useful, but I doubt these sorts of instructions are used in modern software very often
@@kestasjk Your generally correct, but those instructions are a standard way of making programs faster, used to this day. If your task isn't easily converted to the GPU, you don't want the extra work, or you don't want the program to require a GPU, using some complex instructions is an easy, fast and simple way to optimize for some extra speed when needed.
@@toboterxp8155 True.. but I think you can probably attribute ARM/NVIDIA’s ability to keep improving by leaps and bounds while Intel is reaching a plateau to its need to maintain a library of instructions that aren’t really necessary in modern software. If it gets rid of them old software breaks, if it keeps them any improvement it wants to make to the architecture needs to work with all these. Intel went for making the fastest possible CPU, but we now know a single thread can only go so fast (and the tricks like branch prediction have exposed gaping security holes in CPUs, forcing users to choose a pretence of security or turning branch prediction off and getting a huge performance hit). So parallelism is the future: In the 00s this meant multi-core CPUs, today this means offloading massive jobs to the GPU, but the breakthrough will come with CPUs and GPUs merging into one. Not to an SoC, like we already have, but with GPU-like programmable shaders as a part of the CPU instruction set and compiler chain, so that talking about CPU/GPU will be like talking about CPU/ALU. You’ll be able to do the operations like these instructions do in a single cycle, but by setting up a “CUDA-core” with general purpose instructions that can access the same memory.
Don't worry; as long as computer time remains far more valuable than developer time, and no alternative graphics-based technology appears for custom parallel processing operations, Intel will be just fine
@@kestasjk eh, emulation of x86 on ARM on both Windows and Mac is apparently good enough now that I'd be seriously worried if I was Intel. AMD at least have their GPUs...
@@SimonBuchanNz I think AMD wouldn't mind going ARM too much, if they have to. Maybe even will design dual-instruction-set chips for the transition period. Good thing that China won't let Nvidia buy ARM. In general, nowadays there is a tendency towards "crossplatform" software design practices, so the question of "Can it run widespread software fast?" would soon become irrelevant. For example, Adobe Lightroom already works on ARM on Windows and their other products will follow soon. Itanium might not have flopped if it happened a few years from now, at least not for the reason it did, which was poor x86 emulation performance.
@@codycast The short answer is Qualcomm. They are banned by US so if ARM becomes US-owned, Qualcomm will no longer be able to legally produce ARM chips. Possible political implications of that are just too painful to risk so regulators almost certainly won't allow it.
#1 is the definition of insane and incredibly useful. Thank you for translating the Enginese into English. Now I can delete my string comparison macros forever.
I just remember, porting the torque game engine to PSP, and from all the work, the CMPXCHG instruction for the mutex, i implemented some native PSP intrinsic to do that, good memories, the best optimization trick also, the game was doing 10 fps at the best, the problem was matrix transposition, between the engine and PSP "opengl", so i made a transposition on the fly changing the order of reading and writing of the registers in the VFPU instructions, kicking the Sony engineers 'axe' ; ), and getting 30 fps, enough to pass their performance standards.
Nice, but wouldn't it have been better to change which indices of matrices are used in vector and matrix functions? E.g. using m[4] instead of m[1] and vice versa.
@@DiThi that implementation costs 20 fps in that platform, you need to swap the entire matrix operations for every calculation, sounds trivial but was not for a 333 Mhz processor with slow RAM. before was: matrix.transpose(); // bloated operation vector.mul(matrix); after optimization was: vector.mul(matrix); // due to the trick no transpose needed
Can't wait till you remake this vid in 10 years with all the custom RISC-V extension instructions. Gonna be pretty wild to see what people come up with.
The big mistake Intel made is to create fixed width vector instructions. The V in RISC-V points to the importance of the variable width vector instructions where the assembly code doesn’t need to know the vector register size (V extension), and a similar matrix extension is coming for machine learning I think (though V is already a great improvement)
@@canaDavid1 Officially yes, but you can find videos of the people who developed RISC- on UA-cam, and they mentioned that they originally developed it because they wanted to get the vector extension right, and that's why they called it RISC-V at the start.
Also it's a reduced instruction set (risc) and not a complex instruction set (cisc) like x86 So why should risc-v even get some of these? just do them in software and let the compiler do it's magic.
The point of risc-v is to have a common set of instructions understood by many cpus and to be extended with application specific extensions where needed. So you can be 100% sure there will be many wild instruction extensions.
CMPXCHG is how mutual-exclusion, locks, and semaphores are implemented in systems like QEMU. I remember having to fix a bug with a race condition in the QEMU Sparc interpreter by adding judicious use of CMPXCHG locking. It's an amazing instruction and, with its guaranteed atomic behavior, can be used to trivialize mutexes.
Bear in mind that some instructions were not designed, they are a by-product of the design process. In essence, take any bit-pattern that is not assigned to an instruction and look at what the processor will do. Most often it will do nothing (which his why there are so many NOP's in instruction sets) or it may crash, but sometimes it will do something weird and wonderful and be included as an "official" instruction while the designers pretend it was intentional.
@@lPlanetarizadohat wouldn't happen today on your PC's x86. Or this would be a terrible security issue. On modern systems userspace processes should be able to (try to) run any instruction they want without the CPU melting down.
All of the instructions in this video were quite intentional, but niche. Well, only some are niche. cmpxchg is a _foundational_ instruction whose importance cannot be understated, while pshufb is going to be in pretty much every vector codebase. dpps is pretty well known, parallel dot product. not a fan of dpps tbh.
This video has such an unique editing. The topic isn't any less obscure, and it's really cool to hear the author being so enthusiastic about those instructions. It's a really interesting experience
alright guys lets brainstorm what kind of algorithm could benefit from all 10...maybe search for a specific font in an image by comparing each glyphs bitmap to the image using MPSADBW and search for words within identified glyphs using the last instruction?
MPSADBW can be used for all sorts of optimization problems as the sum of absolute differences is a metric. It's often faster than using the Euclidean metric which requires a square root and you can substitute one for the other in many situations.
The carryless multiplication is polynomial multiplication modulo 2. It's used to implement things like CRC computation, and Reed-Solomon error correction codes.
@@threepointonefour607 0000 0000b - 1111 1111b == 0x00 - 0xFF since log2(x) is a factor of log16(x)! If you are doing simple programming, then 90% of the time you'll only need hexadecimal. If you are actually building and designing hardware and implementing it's data paths, control lines and control bits... You are not going to get very far without binary and Boolean Algebra! If you get into Cryptography, or Signal Analysis you might want to know binary as you'll end up performing a lot of bit manipulation!
I feel bad for the CPU engineers who will need to add compatibility for this stuff in 20 years Edit: finished watching the video. This was pretty fascinating, and the 3D text made it very nice to watch. I hope you gain more subscribers!
They'll do it in microcode, I imagine. Apart from the RNG, they can all be done purely in heaps of microcode if you don't care about performance, no dedicated hardware needed.
If you ever learn about microprocessors, it's all about microcode. Every assembly instruction are function call to microcode. The design will basically the same, with microcode printed in ROM inside the chip. You just have to be creative using that microcode to come up with a new instruction.
@@gorilladisco9108 There's definitely a lot more to it than just microcode. Things that are both easy and compact in hardware - such as a linear-list search or swizzling - and microcode won't get you there. Also I'm not aware of any major RISC implementations that use a significant amount of microcode, very much unlike x86.
@@johnbrown9181 And that's why you won't see any instruction like the ones listed on this video on any RISC microprocessors. The thing about x86 and other CISC microprocessors is they use microcode liberally. Microcode is how a microprocessor work. All you have to do is to have imagination.
Depends on how fast it needs to be. Optimizing complex instructions to use all of a core's hardware is difficult, but just getting older instructions to work for the sake of compatibility isn't that hard. Hence, x86 code from a couple decades ago will work fine on a modern x64 chip, while ARM, PowerPC, and other RISC designs have suffered mountains of compatibility issues over time.
Don't forget the Motorola 6800 "Halt and catch fire" instruction. It was an unpublished byte code that caused a branch to itself until the chip overheated.
@@BrianG61UK Long ago a computer center I worked in had a list created by IBMers in the 1960s of amusing opcodes, including HCF. But I didn't want to complicate the text, and the MC6800 item is there in the Wikipedia description, though I did have the details incorrect😊.
This video is about x86 though. Given, it does have the HLT instruction, and if you use it in your user mode application it will catch fire (if by catching fire you mean cause a privileged instruction exception) :0)
@@rty1955 yes, I recall on the wall of a data center I worked at, a paper list of spoof IBM machine instructions that included this HCF instruction. Iirc there was also BAH, Branch And Hang😂. The only CPU that actually did this that I'm aware of was the early 6800, but it's possible there were others. The 6800 was an "unimplemented" instruction bit pattern that unbeknownst to Motorola effectively branched to itself immediately and repeatedly until the heat built up enough to burn the logic. I also personally knew experienced the result of two amusing (to me) episodes - at a college I was attending, a kid running a canned BASIC business program that managed somehow to overwrite the entire disk map, effectively erasing everything, and a kid looking for a job used social engineering to get the guy running jobs to dive and hit the Big Red Halt button. Each of those events caused the Computer Center to be offline for more than a week. And an entire computer center at a company where I worked got completely fried including three mainframes due to a lightning strike right at the pole outside the Center. The senior manager had resisted spending the $5 million required for a motor generator to isolate the computers from the world. We had 400 engineers twiddling thumbs for two weeks. He got a new job.
There is something about machine code that feels right. I dunno. I've not done any actual assembly programming so maybe my opinion doesn't matter but x86 just seems so bloated and inelegant.
@@seneca983 you would be partially right. Bloated or not depends on the way of implementation, if these instructions were to be implemented by microcode, yes absolutely, better let the programmer handle them. But if they are direct on chip Hardware implementation of these instructions then it's a different story, it takes the opposite route of bloat. Takes 1 instruction instead of writing a 100 line function in C and hoping compiler would get the translation right. Also x86 being firmly established the engineers have to make sure they are compatible all the way. Support for languages will drop eventually, while x86 is going to stay.
@@swarnavasamanta2628 One advantage of a simpler and smaller instruction set is that microcoding might not then be necessary and the chip could be simpler. Indeed x86 would be rather difficult to supplant. However, it seems possible that ARM could do it though it's uncertain and would probably take a long time if it happened.
@@seneca983 ARM is definitely a beast, and their methodology is completely different from other CISC approaches. It began first as a project to see if a computer really needs large complex instructions, they thought they would come at a halt problem but nothing really came up and they could make everything work with 1 cycle simple instructions (although with a bit of microcode). At this point hard to tell what the future holds, maybe there will be standardization when one architecture has so many advantages that renders other architectures almost useless or unworthy of learning curve. Who knows what the future holds but up until that the architecture land of computers is like wild wild west and i kind of love it that way.
TMS-9900 also has a very unique instruction: X Rn . Execute the instruction in register n. It's the only CPU I know of that has the equivalent of an eval() function (as the registers are stored in external RAM, it's clear that it's not difficult to implement in that case).
S/360 had the EX instruction for that. The instruction wasn’t in a register but in memory (S/360 was variable length, 2/4/6 bytes). This kind of instruction was fairly common in the 50’s and 60’s.
Fantastic video! Such exotic instructions can insanely speed up / shorten certain algorithms. Back when I did MPASM (has only 35ish instructions), there are some rarely used ones that magically do exactly what you can also emulate in 10 more common instructions. From the instructions in the video I so far only used cmpxchg to emulate floating-point atomic addition in OpenCL.
My little brother is doing a similar major as I did and will have a course with some practical work in assembly next year. Your video just gave me the inspiration to help him find some more "creative" solution to those assignments.
I've worked with or in close proximity of most of these. If you do high performance number crunching or data crunching, the value logistics (i.e. which value needs to be in what operand in which SIMD position) very quickly becomes a major issue and for that all these shuffle/rotate/select/ are a godsend, especially since they tend to be just rewiring of existing ALU functionality so AFAIK should be easy to implement in silicon. Number 1 on the list is the only instruction family I'd put into "space magic" territory, but I might just not have seen its use case yet.
Yeah, as an accountant by profession I still wonder how mathematical reconciliation of bank statements and checking accounts can be so complicated to program and usually buggy. I guess that last instruction combined with machine learning techniques really could speed up the process.
@@Gulleization You absolutely don't want machine learning near anything that requires accurate numbers. ML has its place but it isn't nearly as useful or reliable as the hype often makes it appear.
@@SaHaRaSquad It depends on they type of ML. Neural networks are generally fuzzy, but there are lots and lots of other kinds of machine learning implementations, and some of them work very well for accurate numbers.
I’ve never done programming in assembly on any newer hardware, so to be I always thought of assembly operations as stuff like move this to there, add, subtract, compare two registers, so even as someone who’s used assembly this is absurd to me.
Appreciate the tour. Did quite a lot of Assembly coding in my earlier years, and quickly grew to love it - it's a lot of fun when you get up and running, but you need to keep so much more information in your brain / at your finger tips compared to higher level languages.
They have their own implementation circuitry therefore they should be called instruction, and this is also one of the most important feature of x86 ISA, we make complex operation into an instruction to shorten the execution time and make program smaller.
Now imagine having to teach a compiler to take your 5 lines of C code.... and figuring out which of the five thousand different x86 instructions is the perfect fit :P
That's the opposite of more abstract. Being more abstract means you have tools that are more general-purpose in order to handle a variety of different uses. These instructions are not abstract; they are intended for specific purposes and aren't especially useful at all otherwise. Consider that these instructions are actually implemented as microcode inside the CPU -- miniature programs built out of primitive building blocks.
@@codahighland "they are more general-purpose in order to handle a variety of different uses" that's why I said what I said. "X86 is more abstract than C" x86 has lots and lots of complexity, the instruction set has lots of arguments and things that happen in some state and not in others, the instruction is variable length. So, the instructions can be used for lots of different purposes, with different modes, different registers, and so on, and so forth. The instructions are actually implemented as microcode should be more than enough evidence that assembly is more abstract than the machine itself. Assembly is much more complex than the abstract machine that defines C and which you program to. C is basically a macro-assembler for the PDP11, X86 is a monster near it, it can do a lot, much more things, you can fine control memory load/store ordering, lots of abstract things that you can't even do in C, like barriers, for example. One practical example, there are SIMD instructions that a single instruction will to an entire for loop with sum and comparative to a variable, but in a register, like 4 or 5 lines of C is just a single asm in x86, and the compilers know how to translate that, because you can't even declare data-paralelism in C, the compilers have to pretty much guess so otherwise the CPU would be idling because C programs are sequential, but what we care about is how data relates to itself, not the control-flow of the program, the CPU couldn't care less about it (speculative execution for the win!), all because the C has less abstraction power than the machine itself. C is really, really outdated.
This video makes no sense to me, but my uncles used to code in assembly language. It just truly gives me awe and appreciation for the pioneers who used this language (WITHOUT DEBUGGING) and makes me see them in a new light as men of math. Thanks for humbling me and thanking god that there are higher level languages
Then you don't know much about quantum physics. The point is that these instructions were added because doing these operations (which are needed in very specific cases) in software is otherwise very inefficient. In fact, in a microcoded CPU, they aren't that difficult to implement. If you really had to do these things in "hardware" (i.e., dedicated logic gates), that would be a whole lot of square microns.
@@BrightBlueJim man, what a day and age we live in, to have real estate measured in microns! I'm only 20 years old, and I'm already living in the future. Imagine what the _actual_ future holds!
@@mage3690 Microns? If you want an actual comparison to real estate in terms of cost for high-end parts, you're going to want something a little bigger. Your unit will be the nanohectare (10mm^2). Your typical big complicated chip will therefore be around 20-40 nanohectares in size and will have cost Intel or AMD the equivalent of buying 20-40 hectares of actual land to develop.
half of the fun was all the bizarre "words" that mystified everybody else. it made you feel special. it's not as complicated as it looks. abstracting the problem into code is harder
@mage3690 here is a hint. In the future you will be a borg. With NeuralLink, all will be connected to the WEB and our reality will be online. Disconnecting from it would represent another phase of consciousness. Then, you will be able to experiment with 5 phases of consciousness , sleep, awake, dream, WEB and illumination. The latter being the most fantastic of all.
@@WhatsACreel I remember doing a 8x8 16bit matrix transpose for a jpeg decoder with only 8 sse regs and 2 memory temp 'regs' with these crazy-named instructions. It was so satisfying when it finally started working correctly. :D
It is great having a visual of these operations. Intel had once made an app that showed how each SSE instruction worked. I used that to learn and to write assembly code.
I love Clang! It does a lot of optimisations. You might have to use intrinsics, but these things are available in C++. Best way to know if the compiler is using decent instructions is to disassemble and check what it's doing. Or use the ‘Godbolt Compiler Explorer’ website. I don't think there's any compilers that are better at applying these instructions than humans. The gap is narrowing, and maybe one day, we'll get AI compilers that can do these things better.
@@WhatsACreel Right, I guess the best bet would be to use/create libraries providing these functions as interfaced tooling; the librairies making use of ASM internally if possible (since it depends on the CPU type)
@@Winnetou17 I might be wooshing rn, but there are quite a few examples of AI doing better than humans. Google has some wild stuff for recognizing numbers from blured photos for its street view stuff.
Carryless multiplication also comes up in error correcting codes and checksums. And, of course, it can implement INTERCAL's unary bitwise XOR if you multiply by 3.
Hmm... my other comment about PEXT got deleted, probably because I included a link. PEXT implements INTERCAL's _select_ operator. And I believe PDEP can implement INTERCAL's _mingle_ operator. It's good to see Intel catching up with the amazing INTERCAL language!
The other day I learned about the POLY instruction on the VAX. That's POLY as in polynomial, so when I heard of it I thought "well, I guess there could be a use for it in numerical apps, maybe? It's not like it's going to be more than a few coefficients. Maybe a cubic; that's only four." I was only off by twenty-eight! That's right--the VAX can, with a single terrible opcode, compute the value of up to a thirty-first degree polynomial, to either float or double precision.
@@meneldal approximating any function with nicer ones and then being able to calculate that fast on the fly can be useful, though most of those often-used functions have fast instructions themselves at this point.
I remember the old Intel 8085 had some hidden instructions we used in our projects we knew they would not be changed because the instruction were used in some of the development tools for the MDS (Microprocessor Development System). There were instructions like LDHLISP with an 8 offset parameter. Basically it was "Load the HL register Indirectly with the Stack Pointer with the offset added" it was essential for writing re-entrant code (in 8085 assembler!). BTW this was way back in 1980!
About *CMPXCHG* being "absolutely bizarre" (6:22), this is not only used for mutexes and semaphores as explained, but is also the most common primitive used for "lock-free" concurrent data structures (see for example Doug Lea's amazing ConcurrentSkipListMap implementation). It is so useful that many languages export it in some core library, like in C++ or java.util.concurrent in Java. Most programs you use every day likely rely on it or its equivalent in another architecture, unlike some of the other weird instructions listed in this video.
@@WhatsACreel interesting vid / instructions nonetheless. but yeah, the glow reminds me of when my eyes are wet from crying, I kept having to pause and rub my eyes to "dry" them only to see it's still foggy looking lol.
@@colinstu Ha! I felt the same way while making it! I toned down the glow from 6 to 2.5. It was still hard to look at, but I’d already rendered half the animations, so had to settle. I’m hoping to use animations resembling construction paper in the future. They are very easy to look at, but more time consuming to create. We will have to see how we go.
I've always loved the absurdity of the PA-RISC2 instructions SET/RESET/MOVE TO SYSTEM MASK and the PSW E-bit. By changing it, you change the endianness of the entire CPU... And, because of pipelining, the instruction has to be followed by 7 palindromic NOP instructions. That's just always cracked me up.
gonna admit, I don't know a lick of Assembly, but I enjoy trying to decode what anything here means while also listening to this dude's voice. Very entertaining
Is it bad that I've used most of these and consider them perfectly normal? Glad you didn't get into OS level instructions that set up descriptors and gates. Now those are weird.
bruh that shit fucks with my head, i tried getting into it but then the whole GDT, protected mode, gates and shit just knocked the air out of me by punching my brain in the balls (figuratively)
considering these instructions normal is like knowing the difference between the ruddy northeastern gray-banded ant and the ruddy northeastern gray-striped ant. The world of CISC is truly a jungle
addsubps was probably made for complex numbers packed into these vectors. mpsadbw and similar psadbw indeed were made for video codecs, to estimate errors. You should avoid mpsadbw because too slow, but psadbw is good. I think the craziest of them are for cryptography, like aeskeygenassist or sha1rnds4. Good luck explaining what they do. Another notable mentions are insertps (SSE 4.1; inserts a lane into vector + selectively zeroes out lanes; I used for lots of things), pmulhrsw (SSSE3; hard to explain what it does but I used it to apply volume to 16-bit PCM audio), and all of the from FMA3 set (easy to explain what they do, that’s ±(a*b)±c in one instruction for float numbers, but the throughput is so good).
god, not even cryptographers would bother figuring these instructions out nowadays. no wonder RISC instruction sets are so much faster for the same electrons, they don't need to snake around the dark winding alleys of the ALU
They absolutely do, though. Crypto nearly exclusively is written in assembler, and prioritises code that always takes the same amount of time to execute (to prevent timing attacks), and code that also otherwise doesn't leak state (the amount of time something takes to execute is a leak, but if it's always the same you can't extract any data from it)
I wonder how complicated it would be to try to formulate compiler autorecognition for instruction selection for these. That last one is easily a couple hundred lines of C code.
Very complicated. Most of these optimizations are often missed by c compilers and have to be manually implemented in assembly. In some cases (video de/encoding) up to 50% of the codebase has to be rewritten in asm for these reasons.
@@bootmii98 Most compilers for x86/x64 (including GCC and Microsoft) already support a boatload of compiler intrinsics for SSE and all sorts of things.
Just imagine pitch meetings to decide which instructions should go in the set :D. I'm surprised they don't have a 'calculate your taxes and clean the house' instruction
These instructions do have a couple really solid selling points: (1) they don't write to multiple registers (2) they don't do special memory accesses (3) they don't cause any weird special interrupts.
Honestly, 2 days ago I was trying to figure out what the hell does MPSADBW do!. Love you Creel, I hope you will make videos on in-depth explanation of these instruction.
EIEIO I know it's a PPC instruction, but still... Seriously, the craziest ASM instructions are the ones not documented in any of the instruction manuals, but are only found by the sandsifter program (written by xoreaxeaxeax)
@@sebastiaanpeters2971 ua-cam.com/video/_eSAF_qT_FY/v-deo.html ua-cam.com/video/ajccZ7LdvoQ/v-deo.html This guy had a few talks about undocumented instructions or whole undocumented cpu hardware blocks
As weird as this video is, I never enjoyed a video so much, I think it's just the enthusiasm this guy has... damn, I wish everyone who made videos like there would have that same enthusiasm, but if you're reading this, thanks, I can't remember the last time I liked a video this much
I’m just imagining that the entire design team for #1 probably go into extreme PTSD flashbacks any time they see the letters PCMP anywhere near STR. I just can’t imagine what the proposal Idea was like that led to the instruction being considered.
Imagine how fast programs would be if our compilers could instantly see when these obscure commands would be useful and then put them into place. I dont even understand how these instructions take so few clockcycles
I agree it’s borderline impossible for compilers to emit them automatically. I saw clang’s auto-vectorizer emitting vpshufb but that was very simple code. I disagree about ASM. All these instructions can be used in C or C++ as compiler intrinsics, way more practical.
@@soonts yes, but if one can understand and use intrinsic properly, then heshe can just write entire function in ASM too (right there inside C code), so it not about how exactly to use it, its about to use it efficiently at all.
@@mojeimja The code I write often has both SIMD and scalar parts, interleaved tightly. Modern compilers are quite good at scalar stuff, they abuse LEA instruction for integer math because it’s faster, and do many more non-obvious things. Just because they suck at automatic vectorization doesn’t mean they suck generally. For SIMD code, manually allocating registers, and conforming to the ABI (i.e. which registers to backup/restore when doing function calls) is not fun. With intrinsics, the compiler takes care about these boring pieces.
PEXT made me laugh for some reason. Don't know if it's the particular tone you explained it in or the absolute (seemingly for my stupid brain) randomness and bizarreness of this operation, but I love it.
I have to admit, when CPUs changed from 32 bit to 64 bit, I was skeptical. Like how often do you really need to count beyond 2 billion anyway? But now I see why 64-bit instruction sets can be useful as fuck, and faster for the same clock speed.
TIL I learned there's an audience for top 10 videos about assembly instructions. Cool.
I'm surprised our community is so large
I'm surprised this has > 10^5 views.
having only ever worked with RISC assembly like MIPS in school, seeing the extremes of what you poor poor x86 driver authors have to deal with is entertaining and enlightening.
Sojit , in this case It doesn't appear that this video content will ever be of service to the quality of life you are seeking. Did I just wrote that ? I'm not even sure I understand myself. :)
@@TheActualDP It has 2#10_1111_0010_1010_0010# views (I love ADA's based integers :D)
RdSeed - It's not always slow. There's a FIFO on the output of the RNG. RdSeed pulls from that FIFO. If you haven't just pulled a bunch of values from the FIFO, the value will be available immediately because the FIFO is not empty. If you try to continuously pull from Rdseed and measure the average time per instruction, it will appear slower because you are limited to the physical rate of generation of full entropy numbers from the RNG, which requires a whole lot of computation - Generate 512 bits from the entropy source, AES-CBC-MAC them together to get 128 bits (that's two RdRand result's worth) XOR it with and output from the DRBG (another 3 AES operations, just like SP800-90C describes) stuff the two 64 bit numbers from the 128 bit result into the output FIFO. How do I know all that? I designed it.
The true gold is down in the comments
Oh cool. When did you design it? Care to share some history?
@@LKRaider It was around 2009 I started. It ended up first in the Ivy Bridge processors with the RdRand instruction. I had been working on writing cryptographic protocols in standard committees (802.11i, 802.16 etc) and they all needed cryptographically secure random numbers and when I looked at the SP800-90 specification back then, it was not sufficient. It described DRBGs (aka PRNGs) but not entropy extraction or physical entropy sources. A small team of 4 people was assembled, myself, a mathematician, an analog designer and a corporate cat herder. the math guy came up with some of the mathematical principles and identified the best papers describing how to quantify the entropy, the analog guy did the physical entropy source, the cat herder got it into silicon and I designed the digital logic that takes the partially random bits, turns them into full random bits with an entropy extractor and seeds a PRNG/DRBG with that full entropy data to make the resulting stream of random numbers fast enough. Since then the other three left (2 retired and one died) and I've been the main owner of the RNGs since. RdSeed which gives full entropy output as per SP800-90C and X9.82 was added with Broadwell. This was so you could make arbitrarily large keys from it. Faster and slower versions were created (fast for servers, slower for energy efficient chips) also I've designed a few other types of RNG for specific needs, like super small ones, non uniform ones and floating point ones. I contributed to the development of SP80090B and SP800-90C and the revision of SP800-90A which now cover most of what you need in a secure RNG. A couple of years ago I finished a book on random numbers which was published (Random Number Generators, Principles and Practices). So getting involved to solve my problem of where do I get random numbers has turned into the defining part of my career. The standard are still changing. Certification requirements are still evolving and the need for new RNGs that fit in different contexts keeps up apace, so it has become a full time job for myself and a small number of colleagues.
Wow that's amazing
@@davidjohnston4240 What a lovely comment chain to stumble upon, great read!
May your random continue to prosper!
Dot product of packed singles in your area
I would like it in my boot sector
The probability of finding a project worth uploading commits of my sus code is very low.
@@TheLightningStalker but never zero
dot product of deez nuts packed on your chin
I think 🤔 it's must be cross product
Wow, so the task I was given in a job interview was actually an assambler one-liner. Good to know.
If you'd said that in the job interview you'd get instantly hired
@@DOSeater I wish I knew this 2 months ago. I got that job anyway, but it took a little more interview iterations. Now I'm a happy developer of a delivery robot :)
@@luck3949 Nice! I'm happy it worked out for you
"Oh, that's easy, you can do it in one cycle using the PSCMPXCHGFMADDRABCXYZUW instruction"
Which one was it?
I love how hyped this guy is about CPU instructions. Really fun to listen to.
This dude could describe paint drying on a wall and I'd be entertained. I've never seen an assembly instruction before this video lol
i don't know why but for me it's quite annoying.
I'm also hyped when I learn something truly revolutionary
I am surprised he wasn't more excited.
Are you kidding me? I hate his voice with every fibre of my being. I've subbed only because he has subtitles and the other videos look interesting. That first 30 seconds was excruciating, I may need a lie down in a dark room
Lol, this guy has that kind of voice that makes it sound like he's constantly on the brink of laughter
The way you write sounds very British 😂😂
He sounds like BuzzFeeds IT guy
I have the same feeling with Tim from the Unmade Podcast. Maybe it's the Australian accent haha
@@julian-xy7gh Australian here: it's not universal for Aussies, he's just a gem 💎
More like madness.
Assembly language has that effect...
.
So the most amazing thing about these instructions to me is the fact so many of them run in single digit cycles. You have to marvel at the engineering effort that has gone into it. Also, a compiler has to basically be sentient to know when and how to use some of these.
Yes there went millions of hours of engineering into getting to the point where you could write Hallo World in Python etc.
No. If the compiler was sentient, it would kill itself.
@@altaroffire56 LOL
@@MrHaggyy And billions of hours for a javascript hello world. i think capable computer engineers brought this upon their selves by providing layers and layers of abstraction and burying need for internal necessary concepts to get something done. No wonder the developers now are too shallow in their concepts, probably not their fault if they get hired only after 6 months of python for data structures (they have no incentive to learn the deeper internals if they get paid shitload for sitting in a desk). Hell i would say most people choose programming or development for making bucks, learning and interest comes later. There only a few people now who are truly interested and curious in the core of things and it might just be that after 10 years understanding these would just be luxury and not necessity. Also no wonder why most programmers hate their jobs and want to die after getting one.
@@swarnavasamanta2628 mhm i think the horizon of programmer/developer/engineer in this field got much broader. Yes, there are many abstraction layers we have invented and standardized over the years. I have a mechatronics degree with a microsystem-technology specialization. Most of my field works on improving the hardware for existing assembly code. But we also introduce new things in hardware which we map to assembly or C/C++ code. On that layer, you have the guys who are building assemblers, linkers, and compilers. These are the programs you need to actually execute code on a machine. On top of that, you have the Microsoft, Android, Apple, Linux, etc guys who write an operating system that provides useability with that stuff. And on that foundation, you can start building languages, IDEs, or any program you can open on your computer. And if we finally have these higher-level languages and programs we can start building frameworks or things like python. That field can write very powerful applications that millions of people can use, or that run on many machines at the same time, or all the things these cloud-native guys are doing. The interest in these fields is widely different. I personally love hardware, and the guys I work with love building hardware or building systems with hardware. Systems can be the new Intel i3-i7, over to raspberry pi or smartphone processor, to small controllers like an STM32 which are used in smartwatches, cars, microwaves, freezers down to something like an Arduino which is easy to learn.
There are a lot of people working on those layers. Many of them being the stereotype white europeon/north-american older man. But this field is one of the most global out there. With Korea, Taiwan, Japan and China being the "most" impactful.
The amount of things you could learn about computer and software layers is way beyond one's reach. 99.99% of all programmers don't have a clue how transistors are formed into bite logic, scaled to 16-32-64-86-128bit wide memory, how this memory became a register with a specific purpose and how you address this register so you can call it. But you don't need to know it in order to write a program. :-) we have you covered in that one :-)
So even assembly can teach you a lot about how a computer works, you don't need to write it. In fact you shouldn't write it for any used code. Use a compiler and write it in a higher-level language. All the smart people from the compiler department will cover you there. And so on and so on. Until the hip young facebook star engineer can write his php or python code for his next new feature. And if we do something amazing down the layers he will get a new version that will make his software even better than before. And the only thing he needs to do is trust the work of other people.
The unpleasant truth about why so many programmers want to die or really do it is a mismatch between management, expectations and skills pared with bad working environments. Coding and engineering computers is a mentally very hard and demanding task. You have to know your tools, get to know the problem, which I like to call a puzzle, identify the pieces of your puzzle, sometimes create a new piece that fits, and solve the puzzle. This takes time. A good time is anything from 2 to 4 hours. Less is only sufficient for really easy tasks, longer is better but you need to train for it and you need to go to the toilet, move, eat, sleep etc. In most companies, this deep focus session gets corrupted by meetings, telephone, angry managers, or people that think they are important to the problem. These corruptions drain a lot of willpower and unless you are an (senior) engineer and prepared for this kind of stuff it will depress you. You need to get your routines in place in order to sustain. The other part is once you solved the puzzle your company needs to give you a reward for this. If your management doesn't like your result and lets you feel their miss liking, you need someone holding you on the bridge. That's why many companies in this field like Facebook and Intel don't have 9 to 5 jobs. You get paid to work for them. There are recommendations on how you should set up your routines and there are people helping you. But you can come and go as you like. But you get certain tasks and a timeframe. Once the timeframe is over people all over the world are counting on you getting the job done in time.
So very wide, very different, and very interesting domain. And it's very rewarding if you know that you did something that all of mankind will use and benefit from in a view month after you finished your work.
"HCF" -- Halt and Catch Fire.
On a lot of early CPUs (1970s/1980s, yes damnit I am old) the manual gave the bit pattern for each instruction - and the the rest of the bit patterns did undocumented things. Some were just a different way to spell NOP, some did deeply bizarre unintended things that happened because the bits randomly activated chunks of the CPU circuitry that mixed and matched chunks that were used in different combinations for other commands, and some did things that were only ever intended to be done in the factory, during QA testing.
We used to hunt through these "undocumented instructions" looking for anything interesting or cool that we could then figure out uses for. But this was a bit risky. A fair number of CPUs had at least one undocumented instruction that would immediately cause the machine to lock up and, a few seconds later, destroy the CPU. Sometimes they caught fire, sometimes they melted through the PCB. Sometimes they desoldered themselves from the board and fell out. Whenever we found it we called it a "Halt And Catch Fire" instruction and patched the name 'HCF' into our macro assembler for that bit pattern, in order to avoid accidentally finding it again.
Naturally when I saw the title of this video I figured HCF would be at the top of the list.
Finding an HCF usually meant a new version of the chip as soon as the company could mask it off. We thought of ourselves as contributing to their QA efforts, although very few of them thanked us for it.
That is ridiculous, thank you for this comment.
Write while rewind
Eject disc
Read & write while ripping tape
Disable console
active emergency power off
Electrocute operator
Sense card deck on printer and open cover
Write past EOT
Read and scramble data
I have a huge list of them along with my green cards
Can you give some examples of interesting undocumented instructions you came across with?
@@Safyire_ We found things like 'compare while swapping' that swapped the values in two registers while writing 1 to the comparison bit if the first was higher than the second. That was actually a little bit useful. We found a lot of things that tried to do two or three things at once but did them in a random-ish order because of race conditions. One of those was useful because it consistently did xor before swap if the CPU was hot and swap before xor if the CPU was cold, so we could write code that monitored the CPU and shut things down if it got too hot. We found instructions that connected multiple registers to the bus for output, meaning the result of the instruction would be written to four different registers at once. We also found instructions that connected multiple registers to the bus for input, which was useless and sometimes damaged the CPU. It was a real crapshoot. Also a very expensive hobby if you damaged the machine and your professor wasn't ready to write it off to "research." CPUs were not cheap.
@@rty1955 @Zrebbesh you crazy old hackers! ;-) you are legends! :)
I can't get over this presentation. That's the kind of nerdy content you expect to find in a recording of a 10 year old talk that was given to 50 people in a tent :D
make that a 20 year old talk
what were you expecting with this title??
@@ethanpayne4116 make that 40, i was there :)
I've used MIPS excessively and never looked at X86 much. This feels like when you were playing yugioh in 1999 and you were summoning and setting 1 card every turn and then you get teleported to 2023 where people play their entire deck in one turn and have cards with effects that are 7 paragraphs
Even when cutting off all SSE and up instructions (making it useful for legacy x86 device targetting) there is still a lot of complexity, including very precise x87 floating point and MMX vectorization. What makes it especially fascinating is how compatible it has become; a 640×480 60fps renderer on a very old x86 processor with MMX might very well be the exact same program that does 3840×2160 60fps on a modern PC.
huh.... what the fuck
Good lord that poor silicon. I can't even begin to imagine how you'd design chips to implement some of these instructions. I'd love to see a followup video showing some examples of using these instructions, and if they're superceded, what should be used instead!
They committed the cardinal sin in the 1970s with REP MOVx and it went downhill from there.
microcode, lots of microcode
I'd think that there are massive groups of "one circuit per operation", and they all work in parallel. From all the results only the specified one is selected.
Microcode. Lots and lots of microcode.
A long time ago, they actually gave up on x86, and have been making much simpler chips that convert x86 to that simpler system using "microcode"
You'd almost think silicon makers like to mess with compiler writers.
I doubt these instructions were aimed at people writing compilers, they'd be aimed at people doing things with encryption, low-level synchronization, multimedia.. I think these days people would first try and come up with a GPU based way to tackle these large data-processing problems, but before GPUs were general purpose parallel computers you had to do these single instruction multiple data things on the CPU
@@kestasjk Also, doing stuff with a good CPU instruction is generally more efficient than doing it on the GPU, simply because you have to send across the data and get the result back on a GPU.
@@toboterxp8155 Sort of.. The thing is if you’ve got enough data the GPU is so much faster it’s worth the overhead (and the memory space is getting more integrated / unified all the time), and if you’ve not got enough data to make sending to the GPU worthwhile the speed up for processing a small amount of data on the CPU more efficiently probably isn’t worth it. Perhaps for certain encryption or compression tasks where it can’t be parallelised very well on the GPU but it still needs lots of processing power they may still be useful, but I doubt these sorts of instructions are used in modern software very often
@@kestasjk Your generally correct, but those instructions are a standard way of making programs faster, used to this day. If your task isn't easily converted to the GPU, you don't want the extra work, or you don't want the program to require a GPU, using some complex instructions is an easy, fast and simple way to optimize for some extra speed when needed.
@@toboterxp8155 True.. but I think you can probably attribute ARM/NVIDIA’s ability to keep improving by leaps and bounds while Intel is reaching a plateau to its need to maintain a library of instructions that aren’t really necessary in modern software. If it gets rid of them old software breaks, if it keeps them any improvement it wants to make to the architecture needs to work with all these. Intel went for making the fastest possible CPU, but we now know a single thread can only go so fast (and the tricks like branch prediction have exposed gaping security holes in CPUs, forcing users to choose a pretence of security or turning branch prediction off and getting a huge performance hit). So parallelism is the future: In the 00s this meant multi-core CPUs, today this means offloading massive jobs to the GPU, but the breakthrough will come with CPUs and GPUs merging into one. Not to an SoC, like we already have, but with GPU-like programmable shaders as a part of the CPU instruction set and compiler chain, so that talking about CPU/GPU will be like talking about CPU/ALU. You’ll be able to do the operations like these instructions do in a single cycle, but by setting up a “CUDA-core” with general purpose instructions that can access the same memory.
Intel: One cycle
Bioinformaticists: lemme reimplement that in Python and take 300,000 cycles to compute the same thing.
Don't worry; as long as computer time remains far more valuable than developer time, and no alternative graphics-based technology appears for custom parallel processing operations, Intel will be just fine
@@kestasjk eh, emulation of x86 on ARM on both Windows and Mac is apparently good enough now that I'd be seriously worried if I was Intel. AMD at least have their GPUs...
@@SimonBuchanNz I think AMD wouldn't mind going ARM too much, if they have to. Maybe even will design dual-instruction-set chips for the transition period. Good thing that China won't let Nvidia buy ARM.
In general, nowadays there is a tendency towards "crossplatform" software design practices, so the question of "Can it run widespread software fast?" would soon become irrelevant. For example, Adobe Lightroom already works on ARM on Windows and their other products will follow soon. Itanium might not have flopped if it happened a few years from now, at least not for the reason it did, which was poor x86 emulation performance.
@@JayOhm how exactly can China stop a US company from buying a UK company?
Should we find out what Italy and Argentina think too?
@@codycast The short answer is Qualcomm. They are banned by US so if ARM becomes US-owned, Qualcomm will no longer be able to legally produce ARM chips. Possible political implications of that are just too painful to risk so regulators almost certainly won't allow it.
#1 is the definition of insane and incredibly useful.
Thank you for translating the Enginese into English.
Now I can delete my string comparison macros forever.
I just remember, porting the torque game engine to PSP, and from all the work, the CMPXCHG instruction for the mutex, i implemented some native PSP intrinsic to do that, good memories, the best optimization trick also, the game was doing 10 fps at the best, the problem was matrix transposition, between the engine and PSP "opengl", so i made a transposition on the fly changing the order of reading and writing of the registers in the VFPU instructions, kicking the Sony engineers 'axe' ; ), and getting 30 fps, enough to pass their performance standards.
Wow you made PSP games?
@@KangJangkrik , i made the Torque game engine port, and on top of that another team was developing games using it.
Nice, but wouldn't it have been better to change which indices of matrices are used in vector and matrix functions? E.g. using m[4] instead of m[1] and vice versa.
Marix transpose is the dumbest operation ever, you shouldn't be doing that, ever.
@@DiThi that implementation costs 20 fps in that platform, you need to swap the entire matrix operations for every calculation, sounds trivial but was not for a 333 Mhz processor with slow RAM.
before was:
matrix.transpose(); // bloated operation
vector.mul(matrix);
after optimization was:
vector.mul(matrix); // due to the trick no transpose needed
Yeah, watch-mojo really dropped the ball by not covering this one.
Can't wait till you remake this vid in 10 years with all the custom RISC-V extension instructions. Gonna be pretty wild to see what people come up with.
The big mistake Intel made is to create fixed width vector instructions. The V in RISC-V points to the importance of the variable width vector instructions where the assembly code doesn’t need to know the vector register size (V extension), and a similar matrix extension is coming for machine learning I think (though V is already a great improvement)
@@ritteradam The V in risc-v is a roman numeral standing for 5, as it is the 5th iteration of risc from Berkeley (i think).
@@canaDavid1 Officially yes, but you can find videos of the people who developed RISC- on UA-cam, and they mentioned that they originally developed it because they wanted to get the vector extension right, and that's why they called it RISC-V at the start.
Also it's a reduced instruction set (risc) and not a complex instruction set (cisc) like x86
So why should risc-v even get some of these?
just do them in software and let the compiler do it's magic.
The point of risc-v is to have a common set of instructions understood by many cpus and to be extended with application specific extensions where needed. So you can be 100% sure there will be many wild instruction extensions.
CMPXCHG is how mutual-exclusion, locks, and semaphores are implemented in systems like QEMU. I remember having to fix a bug with a race condition in the QEMU Sparc interpreter by adding judicious use of CMPXCHG locking. It's an amazing instruction and, with its guaranteed atomic behavior, can be used to trivialize mutexes.
PMADDWD is quite useful for fast affine transformation functions. On SSE2, I can even calculate two pixels at once
Bear in mind that some instructions were not designed, they are a by-product of the design process.
In essence, take any bit-pattern that is not assigned to an instruction and look at what the processor will do.
Most often it will do nothing (which his why there are so many NOP's in instruction sets) or it may crash, but sometimes it will do something weird and wonderful and be included as an "official" instruction while the designers pretend it was intentional.
That's like exploiting hardware-level undefined-behavior
there is a comment that mentions HCF -Halt and Catch Fire- , "undocumented instruction" that sometimes could catch fire...damn, thats amazing lol
@@lPlanetarizadohat wouldn't happen today on your PC's x86. Or this would be a terrible security issue. On modern systems userspace processes should be able to (try to) run any instruction they want without the CPU melting down.
All of the instructions in this video were quite intentional, but niche. Well, only some are niche. cmpxchg is a _foundational_ instruction whose importance cannot be understated, while pshufb is going to be in pretty much every vector codebase. dpps is pretty well known, parallel dot product. not a fan of dpps tbh.
This video has such an unique editing. The topic isn't any less obscure, and it's really cool to hear the author being so enthusiastic about those instructions. It's a really interesting experience
alright guys lets brainstorm what kind of algorithm could benefit from all 10...maybe search for a specific font in an image by comparing each glyphs bitmap to the image using MPSADBW and search for words within identified glyphs using the last instruction?
careful, or you might ending up creating another awfully named megainstruction
@@AlexanderBukh ALRTGYSBSTRM
Needs moar threads.
MPSADBW can be used for all sorts of optimization problems as the sum of absolute differences is a metric. It's often faster than using the Euclidean metric which requires a square root and you can substitute one for the other in many situations.
you could feasibly use a good chunk of these by implementing a fancy video encoding
The carryless multiplication is polynomial multiplication modulo 2. It's used to implement things like CRC computation, and Reed-Solomon error correction codes.
i was disturbed to find any mul instruction. i loved my homemade multiplication and division routines
Yes, it's useful for all kinds of codes. It's a direct implementation of a field theory concept
wow, that were 1010 assembly language instructions, not a mere 10!
I actually crunched these numbers in my head before I realized what you did. I feel ashamed. +1
There are 10 kinds of people in this world. Those who know binary, and those who do not.
@@bbq1423 there are 10 kinds of people in the world: those who understand hexadecimal and F the rest
@@threepointonefour607 0000 0000b - 1111 1111b == 0x00 - 0xFF since log2(x) is a factor of log16(x)! If you are doing simple programming, then 90% of the time you'll only need hexadecimal. If you are actually building and designing hardware and implementing it's data paths, control lines and control bits... You are not going to get very far without binary and Boolean Algebra! If you get into Cryptography, or Signal Analysis you might want to know binary as you'll end up performing a lot of bit manipulation!
@@bbq1423 and those who didn't expect a trinary joke
I feel bad for the CPU engineers who will need to add compatibility for this stuff in 20 years
Edit: finished watching the video. This was pretty fascinating, and the 3D text made it very nice to watch. I hope you gain more subscribers!
They'll do it in microcode, I imagine. Apart from the RNG, they can all be done purely in heaps of microcode if you don't care about performance, no dedicated hardware needed.
If you ever learn about microprocessors, it's all about microcode. Every assembly instruction are function call to microcode. The design will basically the same, with microcode printed in ROM inside the chip. You just have to be creative using that microcode to come up with a new instruction.
@@gorilladisco9108 There's definitely a lot more to it than just microcode. Things that are both easy and compact in hardware - such as a linear-list search or swizzling - and microcode won't get you there.
Also I'm not aware of any major RISC implementations that use a significant amount of microcode, very much unlike x86.
@@johnbrown9181 And that's why you won't see any instruction like the ones listed on this video on any RISC microprocessors. The thing about x86 and other CISC microprocessors is they use microcode liberally.
Microcode is how a microprocessor work. All you have to do is to have imagination.
Depends on how fast it needs to be. Optimizing complex instructions to use all of a core's hardware is difficult, but just getting older instructions to work for the sake of compatibility isn't that hard. Hence, x86 code from a couple decades ago will work fine on a modern x64 chip, while ARM, PowerPC, and other RISC designs have suffered mountains of compatibility issues over time.
Don't forget the Motorola 6800 "Halt and catch fire" instruction. It was an unpublished byte code that caused a branch to itself until the chip overheated.
No. en.wikipedia.org/wiki/Halt_and_Catch_Fire_(computing)
@@BrianG61UK Long ago a computer center I worked in had a list created by IBMers in the 1960s of amusing opcodes, including HCF. But I didn't want to complicate the text, and the MC6800 item is there in the Wikipedia description, though I did have the details incorrect😊.
This video is about x86 though. Given, it does have the HLT instruction, and if you use it in your user mode application it will catch fire (if by catching fire you mean cause a privileged instruction exception) :0)
HCF was around in the 60s way before the 6800
@@rty1955 yes, I recall on the wall of a data center I worked at, a paper list of spoof IBM machine instructions that included this HCF instruction. Iirc there was also BAH, Branch And Hang😂. The only CPU that actually did this that I'm aware of was the early 6800, but it's possible there were others. The 6800 was an "unimplemented" instruction bit pattern that unbeknownst to Motorola effectively branched to itself immediately and repeatedly until the heat built up enough to burn the logic.
I also personally knew experienced the result of two amusing (to me) episodes - at a college I was attending, a kid running a canned BASIC business program that managed somehow to overwrite the entire disk map, effectively erasing everything, and a kid looking for a job used social engineering to get the guy running jobs to dive and hit the Big Red Halt button. Each of those events caused the Computer Center to be offline for more than a week. And an entire computer center at a company where I worked got completely fried including three mainframes due to a lightning strike right at the pole outside the Center. The senior manager had resisted spending the $5 million required for a motor generator to isolate the computers from the world. We had 400 engineers twiddling thumbs for two weeks. He got a new job.
the last one seems so damn complex it's unbelievable it takes 3-4 cycles
The string instructions seem like half of grep implemintation.
Excellent visualizations btw. Way more straightforward than instruction manuals that try to explain everything with just words.
Finally SSE 4.2 string compare is understandable. I wish we had the Australian version, Creel version, of the intel instruction set manuals.
if you're struggling with the intel manuals I personally find the amd manuals more comprehensible
one of the assembly instruction video's of all time.
Exciting! Love your enthusiasm. Almost makes c redundant. There is something about machine code that feels right.
did you know that ++ and -- were VAX intrinsics?
There is something about machine code that feels right.
I dunno. I've not done any actual assembly programming so maybe my opinion doesn't matter but x86 just seems so bloated and inelegant.
@@seneca983 you would be partially right. Bloated or not depends on the way of implementation, if these instructions were to be implemented by microcode, yes absolutely, better let the programmer handle them. But if they are direct on chip Hardware implementation of these instructions then it's a different story, it takes the opposite route of bloat. Takes 1 instruction instead of writing a 100 line function in C and hoping compiler would get the translation right. Also x86 being firmly established the engineers have to make sure they are compatible all the way. Support for languages will drop eventually, while x86 is going to stay.
@@swarnavasamanta2628 One advantage of a simpler and smaller instruction set is that microcoding might not then be necessary and the chip could be simpler.
Indeed x86 would be rather difficult to supplant. However, it seems possible that ARM could do it though it's uncertain and would probably take a long time if it happened.
@@seneca983 ARM is definitely a beast, and their methodology is completely different from other CISC approaches. It began first as a project to see if a computer really needs large complex instructions, they thought they would come at a halt problem but nothing really came up and they could make everything work with 1 cycle simple instructions (although with a bit of microcode). At this point hard to tell what the future holds, maybe there will be standardization when one architecture has so many advantages that renders other architectures almost useless or unworthy of learning curve. Who knows what the future holds but up until that the architecture land of computers is like wild wild west and i kind of love it that way.
7:30 Btw, the carryless multiply is extremely useful when making parsers
:o, can u elaborate pls xD
@@mohammedjawahri5726 here is a video about it, you will need the context: ua-cam.com/video/wlvKAT7SZIQ/v-deo.html
@@superblaubeere27 thanks!
@@superblaubeere27 You mean at 35:00 ?
@@0MoTheG exactly.
You know that an instruction is complex if implementing it in a higher-level programming language would take literally hundreds of lines of code.
I love your vids mate. You’re such a god dam likeable character
TMS-9900 also has a very unique instruction: X Rn . Execute the instruction in register n. It's the only CPU I know of that has the equivalent of an eval() function (as the registers are stored in external RAM, it's clear that it's not difficult to implement in that case).
It has SEVERE security issues. But hey, at least it can be used for self-modifying programs
@@Rudxain for a CPU that doesn't have priviledge levels or memory protection, I don't think that security is an issue with the X instruction.
S/360 had the EX instruction for that. The instruction wasn’t in a register but in memory (S/360 was variable length, 2/4/6 bytes). This kind of instruction was fairly common in the 50’s and 60’s.
@@peterfireflylund interesting. Btw in the TMS-9900 the instruction is also in memory because the register window is in memory.
This and 2 minute papers are the most important channels on my UA-cam thank you for your service.
Fantastic video! Such exotic instructions can insanely speed up / shorten certain algorithms. Back when I did MPASM (has only 35ish instructions), there are some rarely used ones that magically do exactly what you can also emulate in 10 more common instructions.
From the instructions in the video I so far only used cmpxchg to emulate floating-point atomic addition in OpenCL.
My little brother is doing a similar major as I did and will have a course with some practical work in assembly next year. Your video just gave me the inspiration to help him find some more "creative" solution to those assignments.
I've worked with or in close proximity of most of these. If you do high performance number crunching or data crunching, the value logistics (i.e. which value needs to be in what operand in which SIMD position) very quickly becomes a major issue and for that all these shuffle/rotate/select/ are a godsend, especially since they tend to be just rewiring of existing ALU functionality so AFAIK should be easy to implement in silicon. Number 1 on the list is the only instruction family I'd put into "space magic" territory, but I might just not have seen its use case yet.
This getting recommended to people is almost as oddly specific as the sound of sorting algorithms
Not gonna lie, string comparison on the instruction set level actually sounds pretty useful. Not a fan of the absolutely insane arguments though.
Yes, they are magnificent instructions!! Assembly can be super fiddly to code, but very powerful if you have the time to make sure it is correct.
Yeah, as an accountant by profession I still wonder how mathematical reconciliation of bank statements and checking accounts can be so complicated to program and usually buggy.
I guess that last instruction combined with machine learning techniques really could speed up the process.
@@Gulleization You absolutely don't want machine learning near anything that requires accurate numbers. ML has its place but it isn't nearly as useful or reliable as the hype often makes it appear.
@@SaHaRaSquad It depends on they type of ML. Neural networks are generally fuzzy, but there are lots and lots of other kinds of machine learning implementations, and some of them work very well for accurate numbers.
it would only be four or five instructions in a loop. but if it was four or five times faster and all you did was compare strings, very valuable!
I haven't watched a video like this ever. Saving it for arguments. Thanks!
Found this randomly in my suggestions. Insane content, great stuff. As a C++ programmer this assembly stuff scares me lol
I’ve never done programming in assembly on any newer hardware, so to be I always thought of assembly operations as stuff like move this to there, add, subtract, compare two registers, so even as someone who’s used assembly this is absurd to me.
Appreciate the tour. Did quite a lot of Assembly coding in my earlier years, and quickly grew to love it - it's a lot of fun when you get up and running, but you need to keep so much more information in your brain / at your finger tips compared to higher level languages.
I love this presentation, it fits the weirdness of the ops! Great job!
Wouldn’t it be better to call them functions instead of instructions at this point?
Needs a RUNDOOM instruction.
@@jjoonathan7178 At least IDDQD seems plausible, integer divide quads by double, store results as double :)
They have their own implementation circuitry therefore they should be called instruction, and this is also one of the most important feature of x86 ISA, we make complex operation into an instruction to shorten the execution time and make program smaller.
@@oldxuyoutube1 well, there is microcode...
No, because they are not functions; maybe you could call them routines but not functions.
PEXT is so useful! I can finally get the correct bits from a 4X 1R 1G 1B 1I 8-bit color buffer to the "layers" in mode 12h easily!
Mode 12h? Are you coding EGA? That's awesome!
@@WhatsACreel Yup! I think I should also do something on UEFI though, as it gives higher resolutions.
Also DES, RC4 and other cyphers based on Feistel's schema would ridiculously slow without this.
I always forget how beauty assembly is.
This made me realize that X86 is more abstract than the C language, each of those instructions are like 4 or 5 lines of C.
Now imagine having to teach a compiler to take your 5 lines of C code.... and figuring out which of the five thousand different x86 instructions is the perfect fit :P
That's the opposite of more abstract. Being more abstract means you have tools that are more general-purpose in order to handle a variety of different uses. These instructions are not abstract; they are intended for specific purposes and aren't especially useful at all otherwise.
Consider that these instructions are actually implemented as microcode inside the CPU -- miniature programs built out of primitive building blocks.
@@codahighland i guess what he is really trying to say is that x86 is so bloated you can implement the same thing a billion different ways
@@davestephens3246 Was the ad hominem even necessary? I wasn't judging. I was just giving information.
@@codahighland "they are more general-purpose in order to handle a variety of different uses"
that's why I said what I said. "X86 is more abstract than C"
x86 has lots and lots of complexity, the instruction set has lots of arguments and things that happen in some state and not in others, the instruction is variable length.
So, the instructions can be used for lots of different purposes, with different modes, different registers, and so on, and so forth.
The instructions are actually implemented as microcode should be more than enough evidence that assembly is more abstract than the machine itself.
Assembly is much more complex than the abstract machine that defines C and which you program to.
C is basically a macro-assembler for the PDP11, X86 is a monster near it, it can do a lot, much more things, you can fine control memory load/store ordering, lots of abstract things that you can't even do in C, like barriers, for example.
One practical example, there are SIMD instructions that a single instruction will to an entire for loop with sum and comparative to a variable, but in a register, like 4 or 5 lines of C is just a single asm in x86, and the compilers know how to translate that, because you can't even declare data-paralelism in C, the compilers have to pretty much guess so otherwise the CPU would be idling because C programs are sequential, but what we care about is how data relates to itself, not the control-flow of the program, the CPU couldn't care less about it (speculative execution for the win!), all because the C has less abstraction power than the machine itself.
C is really, really outdated.
This video makes no sense to me, but my uncles used to code in assembly language. It just truly gives me awe and appreciation for the pioneers who used this language (WITHOUT DEBUGGING) and makes me see them in a new light as men of math.
Thanks for humbling me and thanking god that there are higher level languages
I'm no programmer but it appears to me that programming these instructions into a CPU is just about as complicated and fascinating as quantum physics.
Then you don't know much about quantum physics. The point is that these instructions were added because doing these operations (which are needed in very specific cases) in software is otherwise very inefficient. In fact, in a microcoded CPU, they aren't that difficult to implement. If you really had to do these things in "hardware" (i.e., dedicated logic gates), that would be a whole lot of square microns.
@@BrightBlueJim man, what a day and age we live in, to have real estate measured in microns! I'm only 20 years old, and I'm already living in the future. Imagine what the _actual_ future holds!
@@mage3690 Microns? If you want an actual comparison to real estate in terms of cost for high-end parts, you're going to want something a little bigger. Your unit will be the nanohectare (10mm^2). Your typical big complicated chip will therefore be around 20-40 nanohectares in size and will have cost Intel or AMD the equivalent of buying 20-40 hectares of actual land to develop.
half of the fun was all the bizarre "words" that mystified everybody else. it made you feel special. it's not as complicated as it looks. abstracting the problem into code is harder
@mage3690 here is a hint. In the future you will be a borg. With NeuralLink, all will be connected to the WEB and our reality will be online. Disconnecting from it would represent another phase of consciousness. Then, you will be able to experiment with 5 phases of consciousness , sleep, awake, dream, WEB and illumination. The latter being the most fantastic of all.
I love your style bro! This is a great one. 👌
Back to 2000.
PUNPCKLDQD is sad and disappointed not being able to get on the list ;D
Gesundheit.
I am sorry, PUNPCKLQDQ... :( if we do a follow-up video, I will be sure to include the unpacking instructions in that :)
@@WhatsACreel I remember doing a 8x8 16bit matrix transpose for a jpeg decoder with only 8 sse regs and 2 memory temp 'regs' with these crazy-named instructions. It was so satisfying when it finally started working correctly. :D
@@realhet Wow!! Things were certainly tough when we only had 8 regs :)
At some point that stopped being an x86 instruction and started being a DooM cheatcode.
It is great having a visual of these operations.
Intel had once made an app that showed how each SSE instruction worked. I used that to learn and to write assembly code.
Very cool video with very good animations, pls continue making this videos 👍, I just love ur channel
Woa, high quality video, I love it! And the 3d visuals really help to represent the instructions
I wonder which language compilers are able to detect these patterns and use the ASM operand instead of doing the slow imperative way.
I love Clang! It does a lot of optimisations. You might have to use intrinsics, but these things are available in C++. Best way to know if the compiler is using decent instructions is to disassemble and check what it's doing. Or use the ‘Godbolt Compiler Explorer’ website.
I don't think there's any compilers that are better at applying these instructions than humans. The gap is narrowing, and maybe one day, we'll get AI compilers that can do these things better.
@@WhatsACreel Right, I guess the best bet would be to use/create libraries providing these functions as interfaced tooling; the librairies making use of ASM internally if possible (since it depends on the CPU type)
@@WhatsACreel AI compilers that can do things better than humans! NEVER! Maybe just faster... (insecure human signing off)
@@Winnetou17 I might be wooshing rn, but there are quite a few examples of AI doing better than humans. Google has some wild stuff for recognizing numbers from blured photos for its street view stuff.
@@OzoneGrif please no more abstraction by library interfaces at low level. It is a nightmare, i say let good prpgrammers handle this.
Carryless multiplication also comes up in error correcting codes and checksums. And, of course, it can implement INTERCAL's unary bitwise XOR if you multiply by 3.
Hmm... my other comment about PEXT got deleted, probably because I included a link. PEXT implements INTERCAL's _select_ operator. And I believe PDEP can implement INTERCAL's _mingle_ operator. It's good to see Intel catching up with the amazing INTERCAL language!
The other day I learned about the POLY instruction on the VAX. That's POLY as in polynomial, so when I heard of it I thought "well, I guess there could be a use for it in numerical apps, maybe? It's not like it's going to be more than a few coefficients. Maybe a cubic; that's only four."
I was only off by twenty-eight! That's right--the VAX can, with a single terrible opcode, compute the value of up to a thirty-first degree polynomial, to either float or double precision.
Isn't assembly strangely awesome?
...wouldn't a 31 degree polynomial just smash the value to negative infinity, positive infinity, or zero? What the hell is even the use of that lol
@@romannasuti25 nope, if you need to do some crazy ass Taylor series or something and just look at a certain portion
@@juanthehorse420 Outside of bragging about computing Pi faster, is there any use for 10+ long Taylor series in practice?
@@meneldal approximating any function with nicer ones and then being able to calculate that fast on the fly can be useful, though most of those often-used functions have fast instructions themselves at this point.
I love that I can tell how much fun you were having with this!
@Creel, I love how you slipped in DNA nucleotide bases in the string match example 😃
As soon as genetic scientists move from Excel to ASM, we are DOOMED!
I remember the old Intel 8085 had some hidden instructions we used in our projects we knew they would not be changed because the instruction were used in some of the development tools for the MDS (Microprocessor Development System). There were instructions like LDHLISP with an 8 offset parameter. Basically it was "Load the HL register Indirectly with the Stack Pointer with the offset added" it was essential for writing re-entrant code (in 8085 assembler!). BTW this was way back in 1980!
About *CMPXCHG* being "absolutely bizarre" (6:22), this is not only used for mutexes and semaphores as explained, but is also the most common primitive used for "lock-free" concurrent data structures (see for example Doug Lea's amazing ConcurrentSkipListMap implementation). It is so useful that many languages export it in some core library, like in C++ or java.util.concurrent in Java. Most programs you use every day likely rely on it or its equivalent in another architecture, unlike some of the other weird instructions listed in this video.
And it is not very useful as presented where all operands were registers. You want to executed this on a piece of memory.
pmaddwd is my all time favorite instruction. Totally priceless for video coding!
that glow around the bright text on dark background is driving my eyeballs crazy.
Noted! Thanks for letting me know and cheers for watching :)
@@WhatsACreel interesting vid / instructions nonetheless. but yeah, the glow reminds me of when my eyes are wet from crying, I kept having to pause and rub my eyes to "dry" them only to see it's still foggy looking lol.
@@colinstu Ha! I felt the same way while making it! I toned down the glow from 6 to 2.5. It was still hard to look at, but I’d already rendered half the animations, so had to settle. I’m hoping to use animations resembling construction paper in the future. They are very easy to look at, but more time consuming to create. We will have to see how we go.
@@WhatsACreel what software did you create your animations in?
Wow that carryless multiplication instruction took me straight back to my Information & Coding Theory class.
I've always loved the absurdity of the PA-RISC2 instructions SET/RESET/MOVE TO SYSTEM MASK and the PSW E-bit. By changing it, you change the endianness of the entire CPU... And, because of pipelining, the instruction has to be followed by 7 palindromic NOP instructions. That's just always cracked me up.
gonna admit, I don't know a lick of Assembly, but I enjoy trying to decode what anything here means while also listening to this dude's voice. Very entertaining
Is it bad that I've used most of these and consider them perfectly normal? Glad you didn't get into OS level instructions that set up descriptors and gates. Now those are weird.
bruh that shit fucks with my head, i tried getting into it but then the whole GDT, protected mode, gates and shit just knocked the air out of me by punching my brain in the balls (figuratively)
considering these instructions normal is like knowing the difference between the ruddy northeastern gray-banded ant and the ruddy northeastern gray-striped ant. The world of CISC is truly a jungle
Fun to hear about the rarely seen instructions 🎉🎉🎉
addsubps was probably made for complex numbers packed into these vectors.
mpsadbw and similar psadbw indeed were made for video codecs, to estimate errors. You should avoid mpsadbw because too slow, but psadbw is good.
I think the craziest of them are for cryptography, like aeskeygenassist or sha1rnds4. Good luck explaining what they do.
Another notable mentions are insertps (SSE 4.1; inserts a lane into vector + selectively zeroes out lanes; I used for lots of things), pmulhrsw (SSSE3; hard to explain what it does but I used it to apply volume to 16-bit PCM audio), and all of the from FMA3 set (easy to explain what they do, that’s ±(a*b)±c in one instruction for float numbers, but the throughput is so good).
Great points mate! Cheers for watching :)
I never knew i needed this, until now.
god, not even cryptographers would bother figuring these instructions out nowadays. no wonder RISC instruction sets are so much faster for the same electrons, they don't need to snake around the dark winding alleys of the ALU
They absolutely do, though. Crypto nearly exclusively is written in assembler, and prioritises code that always takes the same amount of time to execute (to prevent timing attacks), and code that also otherwise doesn't leak state (the amount of time something takes to execute is a leak, but if it's always the same you can't extract any data from it)
Ah, now I have a solution for the task of making any x86 compiler author cry in 15 minutes.
I wonder how complicated it would be to try to formulate compiler autorecognition for instruction selection for these. That last one is easily a couple hundred lines of C code.
Very complicated. Most of these optimizations are often missed by c compilers and have to be manually implemented in assembly. In some cases (video de/encoding) up to 50% of the codebase has to be rewritten in asm for these reasons.
Your only hope is to use a library that already has fast paths coded in assembly to do this for you.
The best way to do this would be to implement these as compiler intrinsics that would then be substituted with the correct ASM instructions.
@@jfwfreo what if some other arch doesn't have them? most compiler suites support at least one other architecture.
@@bootmii98 Most compilers for x86/x64 (including GCC and Microsoft) already support a boatload of compiler intrinsics for SSE and all sorts of things.
Creel, you are most excellent!
I didn't know PEXT existed until now... it's exactly what I need for fixed point multiplication, thanks!
Just imagine pitch meetings to decide which instructions should go in the set :D. I'm surprised they don't have a 'calculate your taxes and clean the house' instruction
These instructions do have a couple really solid selling points: (1) they don't write to multiple registers (2) they don't do special memory accesses (3) they don't cause any weird special interrupts.
CMPXCHG -- Probably the most important instruction of them all.
Yes, nothing exotic about this. It’s also in LLVM IR, for example.
@@carstenschultz5 Not exotic but critical for establishing synchronization contexts in multi threaded systems.
@@nicholash8021 , I was agreeing with you. It just does not belong in a list of crazy instructions.
Honestly, 2 days ago I was trying to figure out what the hell does MPSADBW do!. Love you Creel, I hope you will make videos on in-depth explanation of these instruction.
Hahaha, that's awesome! Thank you for watching :)
You have great energy and enthusiasm in this video! Keep it up :)
EIEIO
I know it's a PPC instruction, but still...
Seriously, the craziest ASM instructions are the ones not documented in any of the instruction manuals, but are only found by the sandsifter program (written by xoreaxeaxeax)
Any proof for your second claim?
@@sebastiaanpeters2971
ua-cam.com/video/_eSAF_qT_FY/v-deo.html
ua-cam.com/video/ajccZ7LdvoQ/v-deo.html
This guy had a few talks about undocumented instructions or whole undocumented cpu hardware blocks
@@sebastiaanpeters2971 Any of Chris Domas' talks around unlocking God Mode or breaking x86 should suffice
Old McDonald had an assembler, EIEIO.
CMPXCHG16B is used for atomic operations required by lock free and block free queues.
Honestly it's amazing how much work PCMPxSTRx can do in 3 or 4 clock cycles.
As weird as this video is, I never enjoyed a video so much, I think it's just the enthusiasm this guy has... damn, I wish everyone who made videos like there would have that same enthusiasm, but if you're reading this, thanks, I can't remember the last time I liked a video this much
I felt like I had to clean my glasses several times during this video, haha.
whoah, this is some premium content right here, thank you! Subbed and notifications on
It's like watching golden globes for nerds
I don't know where I went wrong in life to end up here, but I'm enjoying it, so it's chill
Of all the thousands of videos I’ve watched this is the one that went farthest over my head
Furry cringe
@@GeneralKenobi69420you have 69420 in your username
Love how excited he is constantly
It would be very interesting to talk to the people who designed these chips
I’m just imagining that the entire design team for #1 probably go into extreme PTSD flashbacks any time they see the letters PCMP anywhere near STR. I just can’t imagine what the proposal Idea was like that led to the instruction being considered.
To be honest, I 3/4 expected this to be a dumb list, but I was pleasantly surprised that you actually know some stuff!
Imagine how fast programs would be if our compilers could instantly see when these obscure commands would be useful and then put them into place. I dont even understand how these instructions take so few clockcycles
Imagine how fast programs would be if developers could see when these obscure commands would be useful and then put them into place.
If only the compilers had a mind of its own. Well the developers do, but nah
Imma be honest: I didn't understand most of this. But your enthusiasm is contagious.
I can not imagine a compiler that utilizes these fully! Use asm, optimize by hand!
I agree it’s borderline impossible for compilers to emit them automatically. I saw clang’s auto-vectorizer emitting vpshufb but that was very simple code.
I disagree about ASM. All these instructions can be used in C or C++ as compiler intrinsics, way more practical.
@@soonts yes, but if one can understand and use intrinsic properly, then heshe can just write entire function in ASM too (right there inside C code), so it not about how exactly to use it, its about to use it efficiently at all.
@@mojeimja The code I write often has both SIMD and scalar parts, interleaved tightly.
Modern compilers are quite good at scalar stuff, they abuse LEA instruction for integer math because it’s faster, and do many more non-obvious things. Just because they suck at automatic vectorization doesn’t mean they suck generally.
For SIMD code, manually allocating registers, and conforming to the ABI (i.e. which registers to backup/restore when doing function calls) is not fun.
With intrinsics, the compiler takes care about these boring pieces.
PEXT made me laugh for some reason. Don't know if it's the particular tone you explained it in or the absolute (seemingly for my stupid brain) randomness and bizarreness of this operation, but I love it.
I have to admit, when CPUs changed from 32 bit to 64 bit, I was skeptical. Like how often do you really need to count beyond 2 billion anyway? But now I see why 64-bit instruction sets can be useful as fuck, and faster for the same clock speed.
This was in my recommendations dozens of times in the last year. I finally watched, and I dont know what to do with this information