Wanna learn MORE awesome stuff while helping out the channel? Go get a FREE month of Skillshare using my link: join.skillshare.com/learn-1/?coupon=AFF30D23
"I didn't want to use an incredibly low level library, because it felt like cheating" Been there before, I know where this road leads. I wish you luck on your future breadboard CPU project!
Seriously, there is a reason high level languages exist. If you can let the compiler and the language take care of your branch tables for you, you can save a lot of development time. These days, the only reasons to work with assembly are for SIMD optimization, extremely optimized library functions (often involving SIMD), and occasionally interrupt routines. There are occasional times with small micro-controllers like this where you need to use assembly, but code the compiler creates is typically roughly equivalent to what you'd create, and often better. Don't underestimate the power of auto-loop unrolling and function inlining. The linker can take all sorts of shortcuts that you wouldn't think to.
@@Xevos701 The Firestorm implementation of integer division is surprisingly efficient, but it's nonetheless 2.5-4 times worse in performance than the corresponding multiplication.
I remember building a 8 bit cpu, the division circuit was by far the longest to figure out. I started with obviously +-, *... then tried for months to figure out how to do division (without obviously looking for the answer online) once I realized how you could do division with substractions and additions it took me just a couple of weeks to deal with IO and W/R to ram.
I dabble with Z80 assembly. Multiplication is a charm. e.g. multiply HL register by 10 push de ld d, h ld e, l add hl, hl add hl, hl add hl, de add hl, hl pop de I've written something like that so often by now its just muscle memory :D
Motorola 6809 (1978) has a multiply command, takes register A and multiplies that by register B then combines the A & B registers to make a register D (which is 8+8 bit so 16 bit) and outputs the multiplication result.
Ah, tables of squares and difference-of-squares multiplication methods. And then there are Newton-Raphson approximations for computing reciprocals so you can pipeline divisions using multiplication. All good fun...
I think the real reason we do this isn't because we don't like libraries, it's because we want to understand. There's so many times that I've written something myself, fully understood the entire problem, gotten the bugs worked out, and as soon I was entirely satisfied with my understanding, I simply immediately replaced my solution with a library, because I don't want to maintain any of that nonsense!
@@JonJenkins1982 not only that, but even for higher level programmers, every time you deal with bit twiddling and such, you learn to think that way even better. Case in point, I had mostly figured out the solution for this only a few moments after he pointed out that ARM M0 cores don't have division capabilities, or at least close to it. I was thinking the solution was overflowing the irrelevant bits out of the multiplication register, but this is effectively the same thing, just putting them into a word that we simply ignore. It's not that I, or anyone else, have some sort of unique insight into how it works, I never would have thought of that earlier in my career. It's just that being exposed to these sorts of cool hacks over time gives you the tools you need to come up with new bit hacks when you encounter a problem.
The thing people often don't understand is that every operation eventually leads to an addition or a subtraction; multiplying is just adding the same number multiple times, and dividing is just looping through a bunch of potential numbers to see which one is closest.
Multiplication is not repeated addition - just multiply 3.2 X 4.7 - cannot be done by repeated addition. Multiplication is a scaling operation (I have 3.2 lots of 4.7). We. muck up children's primary education by lying to them that multiplication is repeated addition.
When you multiply A by B I promise you, the CPU doesn't add A to itself B times. There's adding involved but there's much smarter ways to achieve multiplication. Division isn't guess work, but it is much slower. It basically does binary long division which does require recursive synchronous computation and usually takes more than 1 clock cycle and uses a heap of silicon .
@@daniel.watching Of course, it does not add B times, but it adds (log B) times and performs a shift of A after every addition. Yet, it adds A several times.
5 / 9 is roughly 569 / 1024. Dividing by multiples of 2 is an Arithmetic Shift Right. Multiply by 569, shift the result Right by ten bits. Multiply by 36409, shift 16 bits... and 2330169 with a shift of 32 bits merges nicely with your code.
Sir, this is what I call an elegant solution. Easy to understand. Easy to code. Easy to process. And give the most precise results available for this architeture (I think).
@@VictorCampos87 This is exactly why they don't have these operations on the processor. Because skilled embedded developers don't need it and it would make it too expensive or large for the applications it was built for. People that do this every day generally don't need it.
The good old freedom units that you got forced on by the uk cause nothing stands mor for freedom than the British empire. Did you learn about picas and points btw? And can you tell me how many miles 1 pica + 3 points are?
For Non-Americans an easy way to understand Fahrenheit is to think of them as a percentage anything below 40% is cold 40% to around 75% are nice medium temperatures and anything above that is hot as hell!
@Devin Nall Yeah like most metric stuff is just better full stop, but Celcius is not more beneficial for a human. Fs range of 0-100 being "about dangerously cold but manageable with layers" to "about dangerously hot but manageable with hydration and limiting sun exposure" is much easier to just understand what a temp actually means to me planning my day. Celcius is based about water state changes but I'm not water i don't care when it changes state for weather forecasts
Even as late as the the Nintendo DS ARM 9, we needed to use fixed point mathematics. The hardware registers themselves accepted only fixed point values. Same with trancendental trig functions. We had to use lookup tables in BRADIAN(Binary Radian) units interpolated on the fly. Fun times.
Fixed point or fractional integers can in fact do adequate calculations for most practical things and in signal processing, it just gets increasingly burdening on the programmer to decide where the decimal point is put best for each number represented in the program to not lose precision or keep the S/N. Consequently, it is easier for the programmer to have this done at runtime with floating points, but also more wasteful and also possibly numerically more unstable/indeterministic when dealing with feed back or iteration over data sets. Scientific calculations or a FFT (and other transformations) are examples where floating point calculations may be needed. In a FFT for example all data points in the time domain may contribute to only one data point in the frequency domain and vice versa, floating point may be the better choice to deal with the bigger dynamic of the transformed compared to the source data, compared to just increasing integer bit precision overall.
Yep one of the hardest problems in computer design has always been finding a good way to implement division. There have been some creative solutions, like in the cray-1 supercomputer which would instead compute a reciprocal that you could then multiply to do your division. But often, designers just omit division all together from their CPUs.
As someone who does microcontroller stuff, I often do that sort of thing. There is even a method that allows you to effectively divide by a variable amount. It is based on two observations: 1) A/B = (N*A)/(N*B) so you can pick an N to make the denominator something easier to deal with 2) For X very small 1/(1+X) = 1-X this allows you to deal with any number near a power of two without a divide.
@@crashhappy7296 The 2nd thing is really more like an observation Taylor started with. The two parts together effectively do a divide but it is not Taylor.
At the machine level, computers don't subtract, multiply, or divide. They add, compliment and shift. So, in binary, shift the dividend left past the radix as many times as you want significant digits. Two's compliment the divisor and add it to the dividend until the quotient sign bit is negative. Shift the quotient right the same amount you shifted the dividend left. Viola! The decimal is not called decimal in any other number base but 10. It's the radix in any number base. The video was great! Good job digging into the software that deep!!!!
@@__christopher__ Yep. And even the logic can be expressed in discrete components. And those in gathered natural materials. Sand, dirt, burnt things and wizardry ;-D
bro's just casually programming some ARM assembly. love it man stuff like this makes me motivated to learn low level stuff even though it looks kinda scary to me
I think it's a waste of time beyond knowing "what is assembly and how it works + basics" (Note: I, myself, know the "basics" I mentioned above. It's logical that "knowing" is always superior comparing to "not knowing"...)
@@Yas-gs8cm Once you learn assembly your thought process never becomes the same again. Even if you work with higher level languages you still constantly think about: "What happens once this gets converted to assembly". It helps you develop new ways to solve problems that you wouldn't have if you never coded in assembly.
For those of you who, like me, had some difficulty understanding his explanation. His final calculation is something like: C = (F - 32) * 5 * Const / (2^32); where Const = ceil((2^32) / 9); He divides by 2^32 by grabbing the highest 32 bits of a little-endian architecture number, because it would be equivalent to bit shifting the 64-bit positive number 32 times to the right.
For something that's only C to F or reverse, just calculate it ahead of time on excel and put a lookup table inside the code. It is embedded system and it will be faster and smaller than pulling in a math library.
A couple of extra notes: - This division method (multiply+shift) is attempted by compilers whenever the divisor is known at compile time. It isn't used to divide by an unknown number as the constants can't be predetermined. - Another alternative to implementing these things in hardware is to translate the instruction into micro ops on the cpu. (I can't say for certain which instructions do this but I believe all exponential/trig functions usually do.) I believe the reciprocal is usually calculated in hardware so a division might be converted to this plus a multiplication And yes, regular divisions can take 100s of cycles
There are a lot of iterative algorithms for computing divisions. My favourite is successive approximation: x = a/b, we rearrange to x.b = a. We then *guess* the value of x 1 bit at a time, by setting each bit to '1' from MSB to LSB and checking if x_guess.b is larger than a. If it is smaller (or equal), then we leave that bit set. If it is larger, then we reset that bit to '0'. This algorithm works nicely for a lot of monotonic functions with simple inverses (e.g. division --> multiplication, square root --> square, and there is even a clever trick for calculating logarithms this way).
Of course, that algorithm is slow, but I think it is elegant. In digital hardware, it needs N clock cycles to calculate an N-bit result (but uses an incredibly small amount of silicon). In a CPU, much longer.
@@absalomdraconis There are plenty of ways of doing it faster in hardware, it just costs more transistors. Depending on the clock frequency, it may even be possible in 1 clock cycle. The best solution depends on the use case.
I used this same technique years ago. Another way of thinking about it is that you’re converting your fraction to a close fraction which has a power of 2 as the denominator since that will just be a bit shift operation. It was an old intel 8088 microprocessor. It had a divide opcode, but execution took over 160 clock cycles. The real-time code couldn’t finish processing one block of data before the next arrived. Converting all the multiply & divides into multiply & shifts saved the day.
Yes... but also no... Like right now I could cheat and just take some random Water Shader someone made for Godot, copy that code and put in my game... BUT... I actually wanna know how water shaders work because it's fundamental to recreating a submarine simulator that's actually good and doesn't feel like an asset flip.
@@JohnnyThund3r If you utilize a library which is well-designed, the combined manpower of its developers shall render it better than anything that a single developer like yourself or I could create. Plaigarism doesn't inherently mean inferiority.
The generalization of this concept is called "division by multiplicative Inverse" or less commonly "multiplication by the reciprocal" and is a relatively common practice when you have a division where the divisor is a constant. It's talked about a lot in low-level books, like "Hacker's Delight." Basically, you precompute the division by encoding it as a reciprocal. It's fatal flaw is you have to know what you are dividing with from the start. To go farther, you have to use something like the Newton-Raphson method for division. This is the same as the above, in that it finds the reciprocation quotient. The first major difference is that it is a run time operation. That is, it FINDS the needed reciprocal during execution, no precomputation. The second major difference is that it uses an approximation to division _as a function_ to get close to the correct result, then uses Newton-Raphson method to refine that guess to as many digits as you desire. For 32bits, I find that 3 iterations is plenty. Incidentally, all of this is possible because floating point/fixed point number systems encode multiplication and division _into_ the number itself, in the same sense that having a sign bit encodes addition and subtraction into a number. Floating point, specifically, also encodes exponentiation into the number, which is why it is the usual goto number system. ... The comment claiming this to be called "Division by Invariant Multiplication" is the first time I have ever heard it called specifically by that name, and has just one research paper with that exact title... Wikipedia calls it "multiplication by [the] reciprocal" which someone responding to said comment linked to as well. multiplication by the reciprocal is a basic math operation, and gets you a lot more hits than "Division by Invariant Multiplication." It's also certainly not complicated to understand, as this ~5 min video has clearly demonstrated.
@@rickroller1566 en.wikipedia.org/wiki/Division_algorithm#Newton%E2%80%93Raphson_division ...specifically, you want "Xi =(2-DXi) ... which can be calculated from Xi using only multiplication and subtraction, or using two fused multiply-adds. "
Loving this type of video format where you just tell a short story about something you have been working on yourself and learning something new! Would love to see more of this format!
I love how this highlights how division is fundamentally different from the other basic operations. It opens up a whole new domain of numbers and we have to contend with that.
The mathematical proofs for multiplication and division is what made me hate math. I am an chem engineer by education and environmental by trade. I love equations, but screw any sort of proof. Hated every second of every math class I took in college for that reason. I would have loved to complete matrix theory, (still have nightmares there), but proofs made me quit. I think I made it less than a month. Kicker is, the class even said it WASN'T going to have proofs.
@@goldenhate6649 Pretty much, I do recreational math all day, but forget deep dives into multivar calculus or other long routes by hand just to demonstrate a simple conjecture
Learnt this early on. Computers divide by left shifting/right shifting the number in a register. It is multiplication by the decimal representation of the divisor.
@@mrbutish 3 is exactly representable as float, as are 1 and 2, also 3.00...0004 is clearly not an integer, meaning its not a possible output from a shift. Possibly you mean .1 + .2 != .3? which is plausible given .1, .2 and .3 aren't exactly representable.
In the not-so-good old days, on an IBM mainframe and before IEEE math, we've got a lot of fun adding 100 events each one with a weight of 1/100... just to discover after many hours reviewing the code that 1/100 is a periodic fraction in binary! So we changed and weighted 128 events.
My favorite trick is to take some huge power of 2, let’s say 1024. Turn 5/9 into the closest fraction with 1024 as the denominator. In this case you would get 570/1024. Now all you have to do is multiply your binary number by 570, and then move the decimal 10 places to the left. I’m sure you guys know way better ways to do this, but this was my trick when I had to write code in assembly for my computer science class.
@@gapplegames1604 Depending on the architecture and the number of bits you have there are tradeoffs with precision/max value you can accept. Since 32 bit multiplies tend to be faster (on cheap hardware), something with a lower bit shift like your example (but most likely 16 to use tricks with using just high/low word) has its uses in some situations. If you're stuck with an 8-bit AVR, you're definitely not doing 64 bits multiplies (and probably not 32 either).
Except multiplying by 590 reduces the domain and causes overflow in cases the better method doesn't. You usually like to write crappy buggy code, without consideration of any problems or limitations or domain restrictions. Yes we know. Hopefully nobody is paying you for such garbage tho
@@gregorymorse8423 jesus christ bro chill, he's explaining how he did it "for my computer science class", he's just working with the closest possible precision he can given the limitations of practical time constraints and diminishing returns. For the purposes of understanding how assembly works, the goal of most courses that are going to require a basic understanding of this kind of math, this is a perfectly sufficient method.
Once you described how that weird arbitrarily large number was found/represented, it reminded me of the Fast Inverse Square Root function in Quakes engine that tells the computer to cast a float as an int with no conversion
The real kicker is the matrix that allows you to do square roots without doing a square root and in only a few FLOPS. Its the foundation of all modern video games that use triangular modeling.
The first computer which I programmed in Assembly was the IBM 704, in 1960. There were two floating point division instructions, Divide AND Stop & Divide OR Stop. This had to do with the requirement that the result had to be less than one. So the inputs had to be scaled to make sure.
In the “old” days of fixed point arithmetic, machines came in fractional or integer models. So, we kept track of powers of 2 in a note by each statement so as to loose track, x(2^n). The “n” was physically typed in the comment beside each statement. As far as timing, a multiply took 4 cycles while a divide took 13 computer cycles. This was for a 16-bit process while double, triple, quad etc precision timing grew to 2n cycles, 3 n etc. So we always tried to find ways of using multiples, especially for critical time simulations.
I've been developing on a Motorola/freescale/NXP DSC family chip for a few years now and I'm always having to do multiply shift approximations. A simple divide operation takes a few hundred microseconds, and when I'm running a one millisecond loop that can blow things out quick. This is just one of those cases where close enough really is close enough.
I've got some bad news... looks like the Cortex M0, M0+ & M1 don't support the SMULL instruction (32-bit multiplication with 64-bit result). You only get the lower 32 bits. If you set the compiler flag `-march=armv6-m` in Godbolt, you'll see it calls a library function for division (`__aeabi_idiv`) instead.
Yup, the same thing tripped me up when I was fooling around with some playful attempts to do baremetal stuff on a Raspberry Pi 3B, which has a Cortex A53 I believe. I didn't bother to include or link any libraries, because I was only doing very basic arithmetic and pointer shit anyway, outputting over UART. I was quite surprised that I was getting an error referencing this function call, in a place where I fully expected a plain integer division to be.
I strongly doubt if he actually needs to convert millions of Fahrenheit to Celsius. For a reduced range a 32-bit multiplication with a 16-bit constant and a 16-bit shift is enough for correctness.
@@Carewolf "Dont code for armv6... That stuff is ancient." ah yes - the "ancient" 9 years ago... cause nobody would ever use something that was first released that long ago............................................................ yeah no.
@@ABaumstumpf You dont because arm before v7 is not a single platform, it makes no sense to target armv6 because there is no such single ISA, it is a huge mess. if you target arm32 you target armv7 or a VERY specific arm cpu.. And besides armv6 is from 2002, so 20+ years old, not 9
If you had divided by a variable instead of 9 in the C source, the true face of division would show up. The code would then have to call a runtime division algorithm instead of multiplying by a constant. The ARM assembler code for efficient division is worth a look.
I'm old enough to remember the days when all calculators were mechanical (sometimes electrically powered, but not always). Unless they were the more expensive variety, they did not support division. I specifically remember my first hand-held 4-function electronic calculator (circa 1960) and being truly amazed that it could divide! There was actually a short pause (maybe 1/8 second) before it displayed the answer to a division problem.
Before the prevalence of div instructions and FPUs this was standard practice in embedded systems and decades of products used fixed point math to do incredible things, including keeping your car on the road in the winter!
As soon as you aren't working with whole numbers, most humans start to break at division as well. 90/9 is something most people can do, 90/8 becomes a bit tricky to some, but 90/7.6 will require most people to pull out a writing sheet to 'cache partial results', and even makes us struggle past this point. And this is while we are the creatures that learn how to divide things evenly with others even before we learn math.
the big difference here is that 8 and 9 have 1 digit while 7.6 has two not the fact its not a whole number if you say divide by .9 now its a cakewalk for most people again
@@ss3nm0dn4r8 I wouldn't say most would find it easy again (althought definitely easier then 90/7.6), but that's also because the solution is a whole, natural number again.
Old timer here... visiting you from the days before MUL and DIV instructions in microprocessors. I should be well versed in all manner of integer math operations, but thankfully I've been able to forget (or at least not recall and use) all that for years now. Heck, for some of the industrial machines (Z80 based) I programmed in the '80s we splurged and added a math co-processor card! Probably the main takeaway in your journey shown here is that ARM is older than we tend to think. Born in the '80s itself, ARM was radically new (RISC) but still influenced by its times. Also, adding hardware DIV might have been considered antithetical to the new RISC concept.
And nowadays RISC and CISC aren't really meaningful anymore. The most widely used RISC designs have tons of instructions and extensions and the most widely used CISC designs translate CISC instructions into RISC-like µOps
The ARM1 processor didn't have a multiply instruction either. Multiply was added for the ARM2 not really for maths but because of how often the C compiler had to calculate addresses (eg array members) and multiply is very useful for that.
Really cool video!! It feels kind of great when I get to see someone else passionate about assembly language. This video reminded me about the time I was figuring out on how I should multiply and divide a number in the assembly language of ATMega8A microcontroller (and later with an 8085 processor): I had to use repeated addition and subtraction to get the job done. Later on, when I worked with 8086 microprocessor I got to know that division is an expensive operation (just like LOOP and REPNE instruction in x86 processor. Even in x86-64 family of processors, they are just artificates which no one has ever optimized, so everyone favours JMP instead of LOOP). Again, this was a great video! 👍👍
I strongly recommend the book "Hacker's Delight" by Henry S. Warren, Jr. It's particularly valuable for anyone doing assembly language programming. I wish it had been around in the 80s when I started out.
It is nice to see young kids discovering the wonders of the world! :) What if I told you that the famous 6502 CPU which powered such machines like the C64 (and many-many others) had not only no division but no multiplication too. And it was still able to rule the 80s.
When I was using Apple II, I wrote a macro library for 16- and 32-bit int operations like add/subtract/multiply/divide. But when I got to wanting to do floating point from assembler, I cheated and just bought a floating-point processor board.
@@w花bwell, by deciding to work with floating points, you kinda accepted the fact that you will sacrifice memory and performance. After some time studying hardwares, specially old ones, i got why its that so fundamental and common.
Traditionally, we used log/exp approach on 8bit/16bit. Hitachi and Mips RISC cpu work around the problem. On mips, multiplication and division are carried out in parallel and store the result in specific registers. Hitachi "sh" cpus provides a division by part, we can set the required precision: one instruction to initialize, another to perform a step.
@@w花bYes, of course, it was Turing complete. I just said that it was even more primitive, but still a CPU. Getting surprised that some CPU has no division shows that you're way too young. :)
Nice solution! I remember reading a book back in the 90's called "Game Programmer's Black Book" that detailed workarounds for faster division performance on the 486's and Pentiums of the day.
Those of us who work with smaller micros do this on an almost daily basis. Create a multiplier/divisor pair with the divisor a power of 2. 2^8, 2^16 and 2^32 are especially nice because they just involve discarding low bytes. The only caution is to make sure your intermediate multiplication cannot overflow. For "normal" temperatures, a 2^8 divisor is all you need. 5/9 = 142.22, so the multiplier can be 142 or 143. Now you work back to determine the maximum input temperature that will avoid overflow, so 65535/142 = 461 and change. The conversion will work with inputs up to 461 degrees, with an error of about 0.16%. You can also add a rounding factor of 256/2 to the multiply before the 2^8 "division".
When we manually tried different division approaches on a computer engineering courses I was fascinated by how hard division acually is. It makes sense though - division is also harder to do on paper. P.S. there is also a square root
Division in binary is actually not that difficult, since the (appropriately shifted) divisor either does or doesn't go into the dividend (determined using subtraction). There are even shortcuts you can do to make this fairly efficient. However, it still takes a number of cycles equal to the quotient length.
@@Mueller3D yeah, but it's much less intuitive than multiplication. It's also longer algorithms with more steps and branching. In multiplication you just go with a naive approach. In division I don't think any approach can be considered naive
That's actually a pretty cool solution. I thought you were going to talk about floating point division and how inaccurate it is, but this was also interesting to watch
This is the bit i love. Go deep, go to circuit level... half bit and full adders, shift registers... We have paper, pencils, and extra ways of accessing memory. A puter is stick with whatever figures are in its amu. It "sees" one digit at a time. It cant see the whole number, it cant go back and "carry the one"... Subtraction by addition and complements has always fascinated me...
Taking me back to applying newton's method for calculating square root in MIPS assembly as a programming assignment in an introductory computer architecture/CS class. I think more than anything else it taught me to respect all of the crazy shit that's happening at the lowest levels (although of course it goes even lower).
I usually pre-scale a fraction a with an acceptable binary power, eg 2^8 giving (5/9)*256 = 142, and multiply with that then right shifting the result by 8 to fast divide.
Welcome to 6502 assembler with more bits! 😱 Seriously, at the assembly level (especially with RISC type instruction sets) you end up really appreciating how much heavy lifting your compiler does for you. I've never written anything in ARM assembler, but this video takes me back. Thanks for sharing! 👍🍺
One of my favorite optimizations had to do with converting timestamps from 32-bit number of seconds to separate year/month/day/weekday/hour/minute/seconds fields on a 6502. The typical approach involves multiple 32-bit divisions. I realized that most of those quotient bits were always zero. So I essentially unrolled those divisions and eliminated all of the trial subtractions that were guaranteed to result in a zero. It ended up needing a total of something like 34 subtractions, resulting in a speedup of 7X.
On systems with fixed instruction length even stuff like loading immediates (that is, constants) to registers is quite funky, even on symbolic assembler. On x86 with variable instruction length but no widely available barrel shifter logic available to instructions these are quite literal "move immediate" instructions, but on ARM, complicated constants either take several instructions to load (if not using constant islands), or generate constants in interesting ways...
JRISC in 1993 ( Atari Jaguar ) and SH2 in 1994 on Sega 32x and Saturn offered 32-bit division. It has some latency, but did not block the processor. Similar how quake did division on CISC Pentium in 1996 .
I started this on the wrong foot, I had "just use a high level language" yelling in my head until I realized that looking at these from the low level is the point of the video and indeed the channel
I remember when I first started learning to use assembly, I started coding for Game Boy. I decided I'll write a few simple functions to get a feel of assembly, so I wrote a function for Fibonacci sequence then I attempted to write a function for “n!” !!! Imagine my surprise when I found out that GBZ80(CPU of Game Boy) doesn't have any instruction for multiplication. I had to write a multiplication function using bit shifting. I have used a lot of different CPU's assembly and most of the very old ones (6502, 65C816, …) doesn't have a mul\div instruction; and old ones (8086, 8186) only have integer mul\div; and newer ones (from pentium until now) are very hard to program (if you want to make them fast, that is) because of SIMDs.
This was a triumph! FWIW most x86_64 CPUs have floating point division hardware but integer division is implemented using micro ops for newton-raphson method. This bugs me to no end because you normally want high throughput for floating division, so you use n/r for that, but for ints it usually is a single division on the critical path. The difference is huge - unpipelined trial subtraction hardware has an 8 cycle typical latency, but n/r is 26 cycles.
@@soonts ooh, interesting! DIV and IDIV timings going back to Bulldozer appear to be for a slim hardware division unit with up to 18 cycles of latency. Intel are still in the high 20s though. Thanks, I've been team red for over a decade but somehow never noticed this.
“Division is slow, we’re just gonna ditch that” - I love the approach. Back in university assembly course, instead of dividing by 1000 I just bit shifted by 10 because it’s faster and good enough. My professor wondered about the loss of precision and made me fix it, though. 😂
Nice video, thanks! It is worth noting that, even when division is supported by the hardware, it can be quite slow. I once timed the basic floating point operations on a Cortex-A5, which has an FPU implementing the VFPv4-D16 instruction set. Multiplications turned out to be fast (one/two cycles in single/double precision respectively), but divisions took more than seven times longer than multiplications.
@@zlcoolboy That's still faster than doing it in software. This method only works because it's a known constant that can be approximated. Once you divide by a variable, you've got to use much more sophisticated approximations, which will likely take multiple times as long as a hardware divide unit.
The RISC philosophy, love it or hate it 😉 Yes, I saw the interesting multiply-to-divide trick in printf routines to display numbers, where it has to divide by 10 (or powers of 10). I was impressed, it was packed with such little techniques to save cycles (on x86_64).
@@linuxization4205 It depends which categories you are willing to consider. There are microcontrollers and CPUs derived for ex. from the 6809, 8051 or 6502 and so on which are borderline or plain CISC. Though many designs, including the x86, only have CISC instructions but a RISC architecture, so it's not as straightforward as when the term was initially coined. 😀 I believe Intel started translating the instructions into micro-instructions in their Pentium series, otherwise they wouldn't have achieved the same level of performance. It was bold and brilliant. Nexgen, which was bought by AMD to save their failing K5 architecture, did the same back then.
@@phenanrithe It was the Pentium Pro, (P6 or i686), which all modern x86 chips still identify as. So yeah, the big x86 CISC architecture uses RISC-like tricks under the hood and only displays itself as CISC on the outside for compatibility's sake. And that whole thing was also supposed to be replaced. After IA-16 and IA-32 should be IA-64, but that went to nowhere just like i960, i860 or i432. And obviously Itanium which, as P7 was somewhat supposed to be the successor in 98
@@HappyBeezerStudios Yeah... what happened was that IA-64/Itanium was a disaster, it tried to optimize all the wrong things and ended up being somehow SLOWER than x86 if you can believe such a thing.
When I was first taught about logic gates and how to make a full adder curcuit, I went on an absolute rampage trying to figure out how pure logic can be used to perform mathematical functions. Subtraction is easy once you know how to add, and multiplication is just adding with extra steps, so I figured them out quite quickly, but I remember spending weeks on division. I ended up figuring out how the division process worked, but was left with the problem that: a) like half the results of dividing in binary end with recurring decimals b) Three decimal places would only give my answer a resolution of 1/8th This resulted in some divisions appearing to give the same answer, because the difference between them was too small to see. So I thought "I know, I'll just represent the remainder as a fraction, that should be easier." The problem with that is that nobody wants to be told that: 27÷18=27/18 So I would have to find some way to simplify the fraction. Which is not easier at all. As it turns out, the only way to find the prime factors of a number is to divide it by every prime number smaller than half of the original, smallest to largest, and see if the result is a whole number. Which means for each digit you add to the input the list of prime numbers needed to check grows exponentially. And then you have the added problem that once you find that a number is divisible by one of the primes, you then have to check whether you can divide it by that prime again, and keep checking indefinitely until you get a remainder, and while that's easy enough to show in a flow chart, making a circuit to do that is a whole different matter. The "solution" I decided on was to make a curcuit that only devides three digit binary numbers, and have the remainder be displayed up to three decimal (binamal?) places as long as there is no "loss" (I.e. There is no data carried off the end of the divider circuit). If there is loss, the result is displayed as a fraction which can then be simplified. The reason this is easier to simplify is that now the only fraction that will ever need simplification are 3/6 and 2/6 This means that my simplification circuit only needs to check if the numerator and denominator are both even, and if so divide them by two, and if not, check if they're divisible by 3, and if so divide them. If they are neither divisible by 2 or 3 they are displayed as they are, since I know from my meta-knowledge of all possible divisions of three digit binary numbers that any other fraction I could encounter would either be already in its simplest form, or be possible to display to three decimal places without loss. This was really a solution, but I'm still pretty proud that I made it at least look like it can divide.
Amazing content. It would also be interesting videos about division of non-constant numbers, the implementation of a floating point unit (showing how to add and multiply floats) and also the implementation of algorithms like sine and square root. All of that comparing C and assembly.
I encountered such a thing on a school project with FPGA's. With an FPGA you allocate logic units (AND, OR etc.) to build your own logic circuit. Using a single division operation used like a thousand logic units, which is a lot. Instead we used an division algorithm which got it down to under 200. It probably used multiplication and bit shifting, but I'm not sure anymore.
Most Americans don't even know that the Fahrenheit scale is 0-96 and falsely assume it's 0-100 because they underestimate what a dumb scale it is. He literally used the coldest temperature in his home town as 0 and the average human body temperature as 96, which as you probably know, he got wrong.
I don’t think I’ll go into learning assembly at any point. But I started programming in my 3rd year of college. I first switched over from music education when I learned I didn’t want to be a teacher. I first learned some basics in game maker, then the teacher taught us some simple Visual Basic (it was outdated frankly even back in 2019, but it was a good intro language imo, statically typed yet simple to grasp the syntax). From there the next classes taught C#, sql server, then some basic web dev with html, css, and php. I didn’t finish there, but went on to teach myself python, JavaScript, Ruby, and I’m even trying to pick up some C++ and Go. I transferred my credits to WGU and am trying to finish up my degree and get out there as a developer, and I love that you, primeagen, melkey, and others put great content out there and it helps me to keep myself motivated to finish.
As an old assembly hound, it's fun to see something like ARM. I don't do a lot of RISC stuff in ASM because of the old adage "RISC is for people who write compilers. CISC is for people who write programs" On the 8086 and 8088 we too couldn't rely on there even being a FPU, and our MUL and DIV statements were painfully slow -- in excess of 150 clocks EACH. An old trick for doing percentages in that case was to do a multiply very like this as a fraction of 65536. What made that "fun" was you ended up with the integer result in DX, and the numerator (65536 as the denominator) in AX. We were further hobbled by there being no MUL reg, immed ; assumes temperature in AX sub ax, 0x20 mov bx, 0x8E38 ; roughly 5/9ths 0x10000 mul bx ; DX = integer result, AX = numerator over 0x10000 Becuase it uses only the one multiply, it takes half the time that MUL 5 and DIV 9 would. Which is good when we're talking turning 300+ clocks into 150.
What are 8086 and 68000 even doing in all those cycles. The factors are 16 bit as is the ALU. This only makes sense to me if MUL and DIV were just an afterthought which had to be cheap. Indeed I have seen a large stretches of code without any of these instructions. This typical IBM-PC business code does not MUL or DIV. Maybe once to calculate taxes.
@@ArneChristianRosenfeldt A hardware multiply requires a lot... no, A LOT ... of silicon. We're talking as much as 20 times the silicon as an addition. To that end a lot of early processors left it out entirely, or implemented it in "microcode" In fact that's exactly what the 8086 / 8088 does, is use microcode. A small bit of code on the processor that pretends to be an opcode, that runs on the other instructions of the processor. Even with a hardware multiply the operation is intensive enough to require a lot of clock cycles as multiply and divide -- even just as integer -- are expensive operations. A mul reg16 taking anywhere from 25 to 40 clocks depending on CPU. (laugh is the 486 is slower than the 286 or 386), and thus it wasn't uncommon for stuff like vga video routines to still use shifts instead. ; ACCEPTS ; AX = Y ; CORRUPTS ; AX ; RETURNS ; DI = Y * 320 xchg ah, al ; == y * 256 mov di, ax shl ax, 2 ; == y * 64 add di, ax ; di == y * 320 Runs in about half the clock cycles of a hardware multiply right up through the first pentium. Though still not as fast as a XLAT
@@jasonknight1085 still does not explain why anybody would need more than 20 cycles in microcode. With a barrel shifter and some look ahead for zeros all those shift special cases would go away. Zero flag running from msb to lsb for early out (like in 386). Silicon usage goes up with the square of bits. Hence it makes sense to implement 8x8 as MUL ( costs like a 64 bit add ), and then run the microcode over this. 4 cycles for 68k like 16x16 -> 32. ARM7 in the GBA has hardware 32x8 and needs 4 cycles for 32x32.
@@ArneChristianRosenfeldt Do you realize just how much silicon an ARMv4 has that an 8086 does not? We're talking about a ~29,000 transistor and ~5000 gate processor here. That ARM7TDMI (the 7 is misleading, it's a v4) is what, 280,000 transistors and 90k or so gates? What you describe in silicon alone would be a fifth the die space the 8086 used. JUST for the multiple logic alone. Remember, this is an age where a DRAM memory access was 4 clocks per bus width (so a word was 8 clocks on the 8088), and even a simple byte shift by fixed one place took two clocks. Memory shift was 15 + and effective address calculation, and shift reg16 by CL was 8+4n. But yeah, once transistor counts grew these problems went away. By the time of the 80186 you're looking at 26 to 33 clocks, and the 286 it's 13 to 26 clocks. The 186 was 55,000 transistors, the 286 was 134,000. Also worthy of mention is the NEC V20 and V30, drop-in replacements for the 8088 and 8086 respectively, with timings closer to and in some places even faster than the 286. See its flat 21 clocks for all 8 bit multiplies and 26 clocks for 16 bit. It achieved this by having over 60,000 transistors, more than double the stock intel part. The type of infrastructure needed for the microcode you describe by the time you do both 8 bit and 16 bit mul and div is more silicon than many processors even had when the 8086 started being designed in 1976. That's no joke either, look at the 8080 with its whopping 4500 transistors-- and it was considered a chonky boy on release. The Z80 was around 8500, 6502 was 4500. And IF contemporary processors had a multiple opcode (many did not) they were inherently 8 bit with a 16 bit result (in two registers). There's a reason the Motorola 68000 with its amazing fast hardware multiply via its 16 bit ALU needed 68000 transistors (hence its name), an absurd number for the time that took five years to even come close to looking viable outside of big iron use... because manufacturing processes for that many transistors reduced yields resulting in elevated prices.
I remember when I first hit by this wall, but mine was bigger, just imagine ( 8-bit processor, and no multiply instructions) It was a PIC mcu, and it almost boiled me to write a floating point 64-bit based division and multiplication macros, it was enjoyable and hard challenge to tackle.
A breakthrough moment for processor-based math for me was, way back in a college Computer Architecture class, realizing that, internally, computers do addition and multiplication exactly as we do on paper, they just do it in base 2. Changing the base simplifies the operation greatly. There is no deep magic.
Number 1 thing I learned in CPU design is early on CPUs at the logic level can do one form of math. Addition . It does everything with addition. It uses flags or registers to determine negative numbers. It also uses counters for multiplication and division.
I love understanding what happens in the the low-level architecture even though I prefer coding in C for an MCU. Much appreciated!! I've subscribed 🔥🔥🔥🔥🔥🔥♥♥
It's called RISC (Reduced Instruction Set Computing) for a reason, if you want hardware division operations use a CISC (Complex Inst..) processor or a mathematical co-processor. ;)
RISC does not mean that it won't have certain instructions, or that it lacks functionality, or that it is a _limited_ instruction set computer. It just means load/store arch, orthogonal reg file and instr set.
@@GeorgeTsiros Well it's not limited it's just reduced, but it still is general purpose. ;) The idea behind RISC was to improve the efficiency per transistor count and a division unit would mean more transistors with almost no benefit regarding the most programs.
Also important to know is that both integer divide and FPU divide are *very* slow even on architectures that supports it in hardware. An integer divide is up to 12 cycles on Cortex M4 and 10-50 cycles on modern X86-64 for example, Skylake needed like 40-100 cycles for integer div just a couple generations ago. FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4. Also noteworthy is that even architectures such as M0/M0+ can implement external division hardware, such as the case for the RP2040 using M0+ architecture which integer division hardware needs only 8 cycles and therefore is actually faster than standard Cortex M4 for integer division.
@@ewenchan1239 Can you restate your question, as it doesnt really make any sense. Are you asking why a div instruction isnt substituted with a combination of add/sub/mul instructions? They are, the compiler will try to replace a divide either by a shift, a combination of muls and shifts or adds and shifts, depending on if your target supports mul instructions. However this only works if the divisor is a compile time constant. For other cases the compiler most likely will use div instruction unless you tell it otherwise. For example you could use libdivide instead, which can be faster than hardware divide on a lot of targets. Or you could tell it to use its default inbuild algorithm, which most likely will be quite a bit slower than a hardware divider *if* the compiler cant optimize it down, which in a lot of cases it can.
@@_--_--_ "Can you restate your question, as it doesnt really make any sense." You state: "FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4." If your statement above is true, then why isn't it used in lieu of using the ACTUAL divide instruction, ALL the time? (You've answered it for the most part in the second half of your reply above.) But for said second half of your reply to be true, you would have to know: a) the divisor is a compile time constant (which means that if it isn't a compile time constant, it won't work) and b) you explicitly tell the compiler what to do/how to handle the DIV instructions, which typically means that you would have to know the specific purpose or application that you are programming for, AND that the difference in the quotient is not significant nor critical enough IF the approximation is sufficient. (i.e. you're programming for yourself rather than for someone else to use where the importance of the epsilon value is not known nor what someone else's tolerance limit for epsilon is)
@@ewenchan1239 "You state: "FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4."" I mentioned the integer divide instruction a bit before that, lets not mix up integer operations with floating point. As far as I am aware we are talking about integer arithmetics specifically. "a) the divisor is a compile time constant (which means that if it isn't a compile time constant, it won't work)" Wont work is not the right wording here. Yes if the divisor isnt a compile time constant than the compiler cant replace the div instruction with a faster combination of other instructions by default, however thats an optimization problem, it still "works". Also its often a good idea to rethink your code, often a lot of divisions are unnecessary (especially with floating point, but we are talking integer here), the compiler cant do this for you most of the time. This was mostly my intent of the original comment anyway, the compiler cant optimize all your mistakes away and using divides carelessly is often a mistake if performance is of any importance. "which typically means that you would have to know the specific purpose or application that you are programming for" As this discussion centers around low level programming, mostly embedded programming, we are talking about target specific code, where you know exactly what target you are progamming for and its specific compiler. Portable code is not relevant to this discussion. "AND that the difference in the quotient is not significant nor critical enough IF the approximation is sufficient." I am not talking about approximation, thats a good way of potentially optimizing code if its feasible, but thats not part of this discussion. All compiler optimizations or replacement software algorithms will give the exact same output as the hardware divider.
@@_--_--_ "I mentioned the integer divide instruction a bit before that, lets not mix up integer operations with floating point." But you also the "cost" (in terms of the number of operations) it takes for FP ADD/SUB and FP MUL on said Cortex M4 as well. "FPU divide on Cortex M4 is 14 cycles." "As far as I am aware we are talking about integer arithmetics specifically." My understanding is that we can be talking about both (given that you can represent floating point numbers as a sum-product of pairs of two integers, can you not? (i.e. the approximation for 1/9 ~= 1 * 2^-4 + 1 * 2^-5 + 1*2^-6 ~= 0.109375) And isn't the mul/shift - it's doing floating point math with integers, isn't it? (unless I am missing something here and/or misunderstanding what @Low Level Learning is talking about here, in this video) "Wont work is not the right wording here. Yes if the divisor isnt a compile time constant than the compiler cant replace the div instruction with a faster combination of other instructions by default, however thats an optimization problem, it still "works"." The "won't work" here refers to the substitution or the replacement of the slower div operation with the faster combination of other instructions. That's the "work" that you were talking about it doing. Therefore; pursuant to your statement, if you aren't dividing by a compile time constant, then said replacement/substitution won't take place - i.e. the "work" that I am referring to, based on what you wrote, won't take place. Ergo; "won't 'work'" (div operation won't get replaced by "combination of other instructions" (which should help to perform the division faster without using the div operation/instruction).) "often a lot of divisions are unnecessary" Varies, I would think. "This was mostly my intent of the original comment anyway, the compiler cant optimize all your mistakes away and using divides carelessly is often a mistake if performance is of any importance." Again, I think that it varies. (depends on what you are doing/why you are using division in the first place) re: mistakes I wouldn't necessarily call using divisions a mistake. I agree that this is where a skilled programmer comes in, but if you are learning or you are an unskilled programmer, then that's something that they will have to learn. Take for example, the inversion of 1/(1/9). If you use exact arithmetic, your E(X) is 9. If you approximate (1/9) ~= 1*2^-4 + 1*2^-5 + 1*2^-6 ~= 0.109375, then 1/0.109375 = 9.142857142857142.... Therefore; your epsilon would be 9.142857142857142... - 9 = 0.142857142857142.... or about 1.6% error. And therefore; the question that you would have to ask yourself is "is that 1.6% error significant or not?" In some cases, it might not be. In other cases, it can be VERY significant. It really depends on what you are doing. (Note that often times, if you are replacing a division operation, I have yet to see someone who advocates for the replacement or substitution of the div operations with a combination of other operations - they don't automatically talk about (nor calculate) their epsilon and let the user know what that epsilon is that arises from said replacement/substitution. I think that if you are going to implement such a kind of replacement (of the div operation) with a combination of other operations, where and whenever possible, you SHOULD be REQUIRED to tell the user what your epsilon is, when appropriate.) "As this discussion centers around low level programming, mostly embedded programming, we are talking about target specific code, where you know exactly what target you are progamming for and its specific compiler. Portable code is not relevant to this discussion." That's an assumption on your part though. There is nothing that explicitly states this. As I said, it is important for people NOT to have it as a key takeaway from watching this video that they are going to implement this everywhere they see because they see it as a speed/performance improvement whilst they fail to recognise that this speed/performance improvement comes at the expense of precision and accuracy. Even for assembly code (cf. GROMACS/Folding@Home), maintaining the precision and the accuracy can still be vital for the program to perform the calculations that is asked of it properly and correctly. "I am not talking about approximation, thats a good way of potentially optimizing code if its feasible, but thats not part of this discussion." The replacement of a division operation with a combination of other operations is almost always an approximation of the actual division operation. (cf. the example that @Low Level Learning shows above with respect to the approximation of 1/9 ~= 1 * 2^-4 + 1 * 2^-5 + 1 * 2^-6 ~= 0.109375) That IS an approximation. That IS what he did and that IS what is shown.
For the multiplication part you could optimize it even further by removing multiplication entirely and just storing f-32 in a register then bit shifting left by 2 followed by adding the f-32 which would be the same as multiplying by 5. The the divide by 9 is way tougher to optimize though because it seems like you can't to the same with right bit shifts unless I'm missing something :/ Also I guess this would only work if you wanted the whole integer part of the result, if you wanted some decimal places too I guess you would need to use fixed point
@@jhgvvetyjj6589 because division in this case is floor division which is not associative (x*y)/z != x*(y/z). For example (2*3)/4 = 6/4 = 1 != 0 = 2*0 = 2*(3/4). EDIT: Ah, I see that is not what you are asking now.
Wouldn't your register just end up trashing your bits if you go beyond scope? xxxxooooxxxxoooo >> 6 bits for example, would you not end up with 000000111100? then if you were shift left it'd come back as zeros?
Lookup tables are your friend on 6502 where practical. Check this video which has a link to page (as I recall) that shows how to do 8x8 bit multiply on 6502 with 4 table lookups, 2 8 bit add/subtracts, and a 16 but subtract in under 40 clock cycles, ua-cam.com/video/WEliEAc3ZyA/v-deo.html
If you need to do many chained calculations containing different operators, it can be faster to do them as rationals, and only do the final division at the end. Summation and division will end up to be three operations minimally, but it can pay off quite a bit for lower complexity with other operations. Though it can quickly run into rollover issues, so might need to keep an eye out for that, and apply a simplify step (gcd algorithm) where necessary.
Been there done that on 6502, z80 a whole lot. And fixed point is the easiest and fastest approach. Also many CPUs don’t have multiply, they are both mathematical monsters 🤪
One method I've seen on 6502 is to use a lookup table of squares; (x+y)^2 - (x-y)^2 == (x*y*4). Useful for 8-bit * 8-bit -> 16-bit multiplies. You can do the divide by four with shifts or just build it into the table.
One time on a high school math test I showed my work in complement division as a protest against showing my work. As it turns out I'm not great at math, but I like to flex when I can and I'm pretty sure the math teacher knew about it but had to look it up for a refresher to follow it. Great victory.
As a computer engineering student currently learning about CPU design, it’s also important to bring up that new CPU architectures (arm & RISC-V) are shifting towards reduced instruction sets to increase efficiency. This is why division isn’t prioritised, as it can be done in *very* roundabout ways that may still be faster than a physical hardware implementation. As long as you have the fundamental instructions, you can do essentially everything.
680x0 had those too in the FPU for '020 and '030, but they removed them again for the '040 because they're slow even in hardware and add so much complexity and use up so much silicon.
This is a very great video. Thanks for making it! I am an silicon engineer developed processors and I am surprised with ARM's brilliant solution! That make die smaller and power efficient!
If you're gonna use "SCARY FIXED POINT" to a division, you may as well fold the multiply by five in, and save a whole multiply instruction. So instead of setting up 1/9 as your magic constant, set up 5/9 (just a different constant). But, yeah, depending on your CPU, some things need to be coded using multiple instructions. Might be multiply, might be divide, might be floating point. I worked on a graphics system coded in 68000/C where all pixels were held in fixed point 24.8 (a whole number of 256ths of a pixel, if you like)
Yeah, fixed point is a very effective technique to get around divisions, unfortunately you can't always use it when high precision is required. Luckly the problem does not apply when you have an FPU. A very interesting alternative I suggest you to look into is CORDIC, it computes only 1 bit per iteration, but if your architecture is superscalar that is not a big issue. Furthermore I also read something about CORDIC and loop pipelining at a certain point, thus looking that up could help as well!
As a side-note: In the 'ARM system developers guide' by A.N. Sloss, published 2004 by Elsevier, a 40 opcode or so division routine is shown which divides two 32b numbers in under 32 cycles. This routine was the reason that up to 2004 there was no need for a hardware divider for the ARM, as till that time these hardware dividers needed 1 cycle/bit. The routine uses 2 Newton-Rhapson iterations and a correction process.
Yeah, pretty much your only alternative would be to build a nested-loop that adds the divisor to a total, counts the number of times the loop runs before the total reaches the dividend, then move the decimal point on the divisor and the loop-counter, and repeat the whole process until you run out of space to store decimal places.
Learned this at university while studying computer engineering. That is one of the advantages of a 4 year education over general programming bootcamps. You get into the specific details of why things happen at a low level.
in a vaccuum yea, who wouldn't want to? but realistically, you wasted a ton of your money and life for something a "general programming bootcamp" could have gotten u into your field a LOT faster for a LOT less money. Funny how that works. but in order to suck off the tit that is high lvl education grant money, one must pay the blood price of an undergrad + grad degree. so funny how that works. :")
@@housemana No chance anyone out of a coding boot camp could get my type of job. It requires quite a bit hardware knowledge as well as low level programing knowledge. Let alone the amount of research we have to do including advanced mathematics to invent tech that is not currently in use. This is just one of many upon many things covered in an extended program. In case you are not familiar with software development, there are many branches and specialties that required varying levels of knowledge. Usually the more difficult jobs tend to have higher requirements, but also have higher pay ceilings. Except maybe game development (High standards, shit pay and low quality of life). I'm not saying this to shit on people who go to boot camps. In fact I have a ton of friends who have advanced their careers and greatly benefited from these. But if your goal is to be a software dev, a 4 year university will undoubtedly provide you with a more thorough education and better job prospects as starting point. I just see a lot of videos essentially spreading your same message, and it is just not accurate.
Dividers don't need to be big or complicated; a basic shift-and-subtract divider is little more than a wide adder, it's just slow to do it that way - depending on the implementation, typically something like 1-2 cycles per quotient bit, so for a 32:32 bit divide, it would take something like 32-64 cycles. Pretty much the same hardware can also do multiplication, just as slowly. These are usually the first multiplication/division algorithms you learn about when you study HW for arithmetic circuits. This long runtime is awkward to deal with; having operations that long-running is a pain if you try to fit them in a simple microcontroller design that otherwise has nothing of the sort, and also makes it harder to guarantee timely servicing of interrupts etc. For this reason early RISC CPUs either didn't have any multiplication/division support at all or effectively treated it like a coprocessor which sometimes (e.g. when interrupts happened) could "lose" results and then you'd have to retry. With ARMv1 in particular, the instruction set was such that you could just implement the basic repeated-doubling-and-add multiplication and shift-and-subtract division operations in software. That would typically take something like 3-4 cycles per bit for the straightforward way, which is 3-4 times slower than dedicated hardware would be, but it avoids all the complexity and consequences of dealing with long-running arithmetic operations. ARMv2 added multiplication instructions, and did get a better multiplication algorithm than software shift-and-add - but it still only handled 2 bits of the product per cycle, so (since it was a 32-bit CPU) multiplications could take as much as 16 cycles! This used modified Booth recoding, which is a staple in multipliers to this day, but otherwise pretty much just ran on the regular ALU. Over time this got improved further. The problem with division is that it's not as gradual: the 1-bit-at-a-time algorithms are simple, there are 2-bit-at-a-time algorithms that are already a lot more complicated (and need a fair amount of tables), and then there's yet another giant complexity jump if you want faster than that. So historically, there was a reason to be weary about including either multipliers or dividers, and if you're gonna do one of the two, multipliers are used a lot more often and also have a lot more options for implementers. This is especially important because most divisions (and remainders) in real-world code are by constants and for those, we can use the multiply-with-reciprocal method which just requires a decently fast multiplier and not a divider. Real-world integer divisions by non-constants are not unheard of, but they're rare enough that to this day, if you're going to do a simple chip, it's not doubtful whether a dedicated hardware divider is worth it; software division routines are usually good enough for those cases, and especially when you have a fast multiplier available (which enables better algorithms). The really fun part is that for a long time, until fairly recently, Intel x86/x64 CPUs didn't have a real integer divider either. Sure they have an integer divide _instruction_, but pretty much everything Intel released between the original 1993 Pentium and the Icelake microarchitecture (released in 2019) does not have an integer divide _unit_. All these CPUs have a floating-point unit with a floating-point divider, and would use that floating-point divider for integers too. (Running a short, or in some cases not-so-short, microcode routine to do so.) The reasons for this are similar to why ARM Cortex cores don't have it: this isn't particularly fast, but integer divide instructions are rare in practice, and a dedicated integer divider just wasn't worth it. They did change their mind on that eventually as mentioned, but it took them over 25 years!
what is a pity though is that almost no one implements "fuzzy"(analog or asynchronous digital backed) divide instructions, so they could be used in DSP/AI/GFX etc. cases where error is not a problem. There are neat asynchronous methods, exploiting propagation/delay effects which take just a little of silicon and are fairly precise . OFC most ppl argue that divide can be emulated or real hardware unit used, but then it cannot do either deep pipelines, nor SIMD or other bulk matrix transformation. DSP and GFX suffer most as not only people need beefy chips to do fairly simple operations, but also power consumption hurts.
When I was in university, I had to write assembly code, and later had to take a Design of Compilers course, both of which were required for my major. Design of Compilers in particular was painful.
Wanna learn MORE awesome stuff while helping out the channel? Go get a FREE month of Skillshare using my link: join.skillshare.com/learn-1/?coupon=AFF30D23
The "Degrees" class unambiguously requires floating-point numbers.
What about the neon DSP code? That handles div. Wide accumulated Also.
"I didn't want to use an incredibly low level library, because it felt like cheating" Been there before, I know where this road leads. I wish you luck on your future breadboard CPU project!
Either breadboard or TempleOS, one of two roads
RNJesus speaks to me.
Seriously, there is a reason high level languages exist. If you can let the compiler and the language take care of your branch tables for you, you can save a lot of development time. These days, the only reasons to work with assembly are for SIMD optimization, extremely optimized library functions (often involving SIMD), and occasionally interrupt routines.
There are occasional times with small micro-controllers like this where you need to use assembly, but code the compiler creates is typically roughly equivalent to what you'd create, and often better. Don't underestimate the power of auto-loop unrolling and function inlining. The linker can take all sorts of shortcuts that you wouldn't think to.
@@EmptyZoo393 What's the fun in that?
@@mikicerise6250
Do you speak Aramaic? What has He been telling you?
Lemme guess. "Donald Trump has been sent to save you all" maybe?
I like how it changed to night with the easy bit at the start
FINALLY someone noticed that XD
@@LowLevelTV legend
The Firestorm core in Apple M1 has a dedicated Divider ALU.
@@Xevos701 The Firestorm implementation of integer division is surprisingly efficient, but it's nonetheless 2.5-4 times worse in performance than the corresponding multiplication.
Multiply by 555, then bit shift
This brings back painful memories of 8-bit computing. Consider yourself lucky to have a multiply instruction.
I remember building a 8 bit cpu, the division circuit was by far the longest to figure out.
I started with obviously +-, *... then tried for months to figure out how to do division (without obviously looking for the answer online) once I realized how you could do division with substractions and additions it took me just a couple of weeks to deal with IO and W/R to ram.
just add the number multiple times if you need multiplication.
I dabble with Z80 assembly. Multiplication is a charm.
e.g. multiply HL register by 10
push de
ld d, h
ld e, l
add hl, hl
add hl, hl
add hl, de
add hl, hl
pop de
I've written something like that so often by now its just muscle memory :D
Motorola 6809 (1978) has a multiply command, takes register A and multiplies that by register B then combines the A & B registers to make a register D (which is 8+8 bit so 16 bit) and outputs the multiplication result.
Ah, tables of squares and difference-of-squares multiplication methods. And then there are Newton-Raphson approximations for computing reciprocals so you can pipeline divisions using multiplication. All good fun...
I love how us, programmers, are spending hours and nights to not use an existing external lib just to say "I've made it myself" 🤣
nothing ever beat than making your own function all day instead of just copy pasting
I think the real reason we do this isn't because we don't like libraries, it's because we want to understand.
There's so many times that I've written something myself, fully understood the entire problem, gotten the bugs worked out, and as soon I was entirely satisfied with my understanding, I simply immediately replaced my solution with a library, because I don't want to maintain any of that nonsense!
If you do that, you acquire skills that easily translate to other realms of computational problem solving.
@@JonJenkins1982 not only that, but even for higher level programmers, every time you deal with bit twiddling and such, you learn to think that way even better.
Case in point, I had mostly figured out the solution for this only a few moments after he pointed out that ARM M0 cores don't have division capabilities, or at least close to it. I was thinking the solution was overflowing the irrelevant bits out of the multiplication register, but this is effectively the same thing, just putting them into a word that we simply ignore.
It's not that I, or anyone else, have some sort of unique insight into how it works, I never would have thought of that earlier in my career. It's just that being exposed to these sorts of cool hacks over time gives you the tools you need to come up with new bit hacks when you encounter a problem.
agree
The thing people often don't understand is that every operation eventually leads to an addition or a subtraction; multiplying is just adding the same number multiple times, and dividing is just looping through a bunch of potential numbers to see which one is closest.
Addition and Subtraction eventually end up in pure binary logic. It would be enough to have bit shift, AND, OR, and setting and testing bits.
Yes but actually no. Thatd be too slow. Its logical bitwise operations.
Multiplication is not repeated addition - just multiply 3.2 X 4.7 - cannot be done by repeated addition. Multiplication is a scaling operation (I have 3.2 lots of 4.7). We. muck up children's primary education by lying to them that multiplication is repeated addition.
When you multiply A by B I promise you, the CPU doesn't add A to itself B times. There's adding involved but there's much smarter ways to achieve multiplication.
Division isn't guess work, but it is much slower. It basically does binary long division which does require recursive synchronous computation and usually takes more than 1 clock cycle and uses a heap of silicon .
@@daniel.watching Of course, it does not add B times, but it adds (log B) times and performs a shift of A after every addition. Yet, it adds A several times.
5 / 9 is roughly 569 / 1024. Dividing by multiples of 2 is an Arithmetic Shift Right.
Multiply by 569, shift the result Right by ten bits. Multiply by 36409, shift 16 bits...
and 2330169 with a shift of 32 bits merges nicely with your code.
Sir, this is what I call an elegant solution. Easy to understand. Easy to code. Easy to process. And give the most precise results available for this architeture (I think).
@@VictorCampos87 This is exactly why they don't have these operations on the processor. Because skilled embedded developers don't need it and it would make it too expensive or large for the applications it was built for. People that do this every day generally don't need it.
@@Kyle-ke5fx What do you do when a variable divide comes up?
@@joshuahudson2170 do it the same way we do long division by hand in decimal
@@TrackedHiker That's a guess and check pattern.
Other solution: don't use Fahrenheits.
Imperial Unit Gang
The good old freedom units that you got forced on by the uk cause nothing stands mor for freedom than the British empire.
Did you learn about picas and points btw? And can you tell me how many miles 1 pica + 3 points are?
F that!
For Non-Americans an easy way to understand Fahrenheit is to think of them as a percentage anything below 40% is cold 40% to around 75% are nice medium temperatures and anything above that is hot as hell!
@Devin Nall Yeah like most metric stuff is just better full stop, but Celcius is not more beneficial for a human. Fs range of 0-100 being "about dangerously cold but manageable with layers" to "about dangerously hot but manageable with hydration and limiting sun exposure" is much easier to just understand what a temp actually means to me planning my day. Celcius is based about water state changes but I'm not water i don't care when it changes state for weather forecasts
Even as late as the the Nintendo DS ARM 9, we needed to use fixed point mathematics. The hardware registers themselves accepted only fixed point values.
Same with trancendental trig functions. We had to use lookup tables in BRADIAN(Binary Radian) units interpolated on the fly.
Fun times.
Did you program DS games?
Fixed point or fractional integers can in fact do adequate calculations for most practical things and in signal processing, it just gets increasingly burdening on the programmer to decide where the decimal point is put best for each number represented in the program to not lose precision or keep the S/N.
Consequently, it is easier for the programmer to have this done at runtime with floating points, but also more wasteful and also possibly numerically more unstable/indeterministic when dealing with feed back or iteration over data sets.
Scientific calculations or a FFT (and other transformations) are examples where floating point calculations may be needed.
In a FFT for example all data points in the time domain may contribute to only one data point in the frequency domain and vice versa, floating point may be the better choice to deal with the bigger dynamic of the transformed compared to the source data, compared to just increasing integer bit precision overall.
Yep one of the hardest problems in computer design has always been finding a good way to implement division.
There have been some creative solutions, like in the cray-1 supercomputer which would instead compute a reciprocal that you could then multiply to do your division. But often, designers just omit division all together from their CPUs.
Well this was just division... By constant....
The DS also has a divider unit, which makes up somewhat for the lack of support by the CPU :p
As someone who does microcontroller stuff, I often do that sort of thing. There is even a method that allows you to effectively divide by a variable amount. It is based on two observations:
1) A/B = (N*A)/(N*B) so you can pick an N to make the denominator something easier to deal with
2) For X very small 1/(1+X) = 1-X this allows you to deal with any number near a power of two without a divide.
That second approximation, and the others that result from that Taylor series expansion, are all over physics.
@@crashhappy7296 The 2nd thing is really more like an observation Taylor started with. The two parts together effectively do a divide but it is not Taylor.
At the machine level, computers don't subtract, multiply, or divide. They add, compliment and shift. So, in binary, shift the dividend left past the radix as many times as you want significant digits. Two's compliment the divisor and add it to the dividend until the quotient sign bit is negative. Shift the quotient right the same amount you shifted the dividend left. Viola! The decimal is not called decimal in any other number base but 10. It's the radix in any number base.
The video was great! Good job digging into the software that deep!!!!
At the machine level, they also don't add, either. Addition is build out of purely logical operations.
@@__christopher__ Yep. And even the logic can be expressed in discrete components. And those in gathered natural materials. Sand, dirt, burnt things and wizardry ;-D
bro's just casually programming some ARM assembly. love it man stuff like this makes me motivated to learn low level stuff even though it looks kinda scary to me
Do it!
I think it's a waste of time beyond knowing "what is assembly and how it works + basics"
(Note: I, myself, know the "basics" I mentioned above. It's logical that "knowing" is always superior comparing to "not knowing"...)
Learning it will improve your high-level coding. Worth it!
@@Yas-gs8cm Once you learn assembly your thought process never becomes the same again. Even if you work with higher level languages you still constantly think about: "What happens once this gets converted to assembly". It helps you develop new ways to solve problems that you wouldn't have if you never coded in assembly.
@@Yas-gs8cm this isn't even the basics though, this video is super simple stuff
For those of you who, like me, had some difficulty understanding his explanation.
His final calculation is something like: C = (F - 32) * 5 * Const / (2^32); where Const = ceil((2^32) / 9);
He divides by 2^32 by grabbing the highest 32 bits of a little-endian architecture number, because it would be equivalent to bit shifting the 64-bit positive number 32 times to the right.
Thx! That actually cleared things up nicely :D
Thanks, it cleared up my doubts
You are a wizard
For something that's only C to F or reverse, just calculate it ahead of time on excel and put a lookup table inside the code. It is embedded system and it will be faster and smaller than pulling in a math library.
A couple of extra notes:
- This division method (multiply+shift) is attempted by compilers whenever the divisor is known at compile time. It isn't used to divide by an unknown number as the constants can't be predetermined.
- Another alternative to implementing these things in hardware is to translate the instruction into micro ops on the cpu. (I can't say for certain which instructions do this but I believe all exponential/trig functions usually do.) I believe the reciprocal is usually calculated in hardware so a division might be converted to this plus a multiplication
And yes, regular divisions can take 100s of cycles
There are a lot of iterative algorithms for computing divisions. My favourite is successive approximation: x = a/b, we rearrange to x.b = a. We then *guess* the value of x 1 bit at a time, by setting each bit to '1' from MSB to LSB and checking if x_guess.b is larger than a. If it is smaller (or equal), then we leave that bit set. If it is larger, then we reset that bit to '0'. This algorithm works nicely for a lot of monotonic functions with simple inverses (e.g. division --> multiplication, square root --> square, and there is even a clever trick for calculating logarithms this way).
Of course, that algorithm is slow, but I think it is elegant. In digital hardware, it needs N clock cycles to calculate an N-bit result (but uses an incredibly small amount of silicon). In a CPU, much longer.
@@weetabixharry : I wonder if anyone's tried to shorten it with a "binary search" or priority encoder?
@@absalomdraconis There are plenty of ways of doing it faster in hardware, it just costs more transistors. Depending on the clock frequency, it may even be possible in 1 clock cycle. The best solution depends on the use case.
@@absalomdraconis And it already is a binary search (because we're setting 1 bit at a time, not incrementing the value +1 for each guess).
I used this same technique years ago. Another way of thinking about it is that you’re converting your fraction to a close fraction which has a power of 2 as the denominator since that will just be a bit shift operation. It was an old intel 8088 microprocessor. It had a divide opcode, but execution took over 160 clock cycles. The real-time code couldn’t finish processing one block of data before the next arrived. Converting all the multiply & divides into multiply & shifts saved the day.
This is why, whenever you're programming, and realize a solution that seems like cheating, do it. Cheat.
Real
He isn't trying to make something, hes doing it to learn
Yes... but also no...
Like right now I could cheat and just take some random Water Shader someone made for Godot, copy that code and put in my game... BUT... I actually wanna know how water shaders work because it's fundamental to recreating a submarine simulator that's actually good and doesn't feel like an asset flip.
@@JohnnyThund3r If you utilize a library which is well-designed, the combined manpower of its developers shall render it better than anything that a single developer like yourself or I could create. Plaigarism doesn't inherently mean inferiority.
@@RokeJulianLockhart.s13ouq they're not talking about using a library, they're talking about ripping other assets into their project
The generalization of this concept is called "division by multiplicative Inverse" or less commonly "multiplication by the reciprocal" and is a relatively common practice when you have a division where the divisor is a constant. It's talked about a lot in low-level books, like "Hacker's Delight." Basically, you precompute the division by encoding it as a reciprocal. It's fatal flaw is you have to know what you are dividing with from the start.
To go farther, you have to use something like the Newton-Raphson method for division. This is the same as the above, in that it finds the reciprocation quotient. The first major difference is that it is a run time operation. That is, it FINDS the needed reciprocal during execution, no precomputation. The second major difference is that it uses an approximation to division _as a function_ to get close to the correct result, then uses Newton-Raphson method to refine that guess to as many digits as you desire. For 32bits, I find that 3 iterations is plenty.
Incidentally, all of this is possible because floating point/fixed point number systems encode multiplication and division _into_ the number itself, in the same sense that having a sign bit encodes addition and subtraction into a number. Floating point, specifically, also encodes exponentiation into the number, which is why it is the usual goto number system.
... The comment claiming this to be called "Division by Invariant Multiplication" is the first time I have ever heard it called specifically by that name, and has just one research paper with that exact title... Wikipedia calls it "multiplication by [the] reciprocal" which someone responding to said comment linked to as well. multiplication by the reciprocal is a basic math operation, and gets you a lot more hits than "Division by Invariant Multiplication." It's also certainly not complicated to understand, as this ~5 min video has clearly demonstrated.
Based informative comment
Doesn't Newton's method use division? What am I missing?
@@rickroller1566 en.wikipedia.org/wiki/Division_algorithm#Newton%E2%80%93Raphson_division
...specifically, you want "Xi =(2-DXi) ... which can be calculated from Xi using only multiplication and subtraction, or using two fused multiply-adds. "
you rlly are an alchemist
Why dont you divide by 1.8
Loving this type of video format where you just tell a short story about something you have been working on yourself and learning something new! Would love to see more of this format!
Thank you! I really had fun with this one
And has the magic number (at least in the 32-bit version) 0x5F3759DF.
I love how this highlights how division is fundamentally different from the other basic operations. It opens up a whole new domain of numbers and we have to contend with that.
The mathematical proofs for multiplication and division is what made me hate math. I am an chem engineer by education and environmental by trade. I love equations, but screw any sort of proof. Hated every second of every math class I took in college for that reason. I would have loved to complete matrix theory, (still have nightmares there), but proofs made me quit. I think I made it less than a month. Kicker is, the class even said it WASN'T going to have proofs.
In a sense, subtraction also opens up a new domain as it gives negative numbers.
@@goldenhate6649 Pretty much, I do recreational math all day, but forget deep dives into multivar calculus or other long routes by hand just to demonstrate a simple conjecture
It simply isn't. Subtraction does the same thing.
@@weirdlyspecific302 but division open up more
Learnt this early on. Computers divide by left shifting/right shifting the number in a register. It is multiplication by the decimal representation of the divisor.
That is why python has the 1+2= 3.0000...0004
@@mrbutish 3 is exactly representable as float, as are 1 and 2, also 3.00...0004 is clearly not an integer, meaning its not a possible output from a shift. Possibly you mean .1 + .2 != .3? which is plausible given .1, .2 and .3 aren't exactly representable.
@@nathansnail ya you're right. its 0.1 + 0.2 not 1 + 2
In the not-so-good old days, on an IBM mainframe and before IEEE math, we've got a lot of fun adding 100 events each one with a weight of 1/100... just to discover after many hours reviewing the code that 1/100 is a periodic fraction in binary! So we changed and weighted 128 events.
@@EvilidelRio how does that work? could I get an elaboration on what that means?
@@senatuspopulusqueromanum if you added 1/100 100 times you didn't get back 1.0 but something like 0.999996798...
My favorite trick is to take some huge power of 2, let’s say 1024. Turn 5/9 into the closest fraction with 1024 as the denominator. In this case you would get 570/1024. Now all you have to do is multiply your binary number by 570, and then move the decimal 10 places to the left. I’m sure you guys know way better ways to do this, but this was my trick when I had to write code in assembly for my computer science class.
Just realized this is almost exactly what you did in your video!
@@gapplegames1604 This is exactly what he did in the video
@@gapplegames1604 Depending on the architecture and the number of bits you have there are tradeoffs with precision/max value you can accept. Since 32 bit multiplies tend to be faster (on cheap hardware), something with a lower bit shift like your example (but most likely 16 to use tricks with using just high/low word) has its uses in some situations. If you're stuck with an 8-bit AVR, you're definitely not doing 64 bits multiplies (and probably not 32 either).
Except multiplying by 590 reduces the domain and causes overflow in cases the better method doesn't. You usually like to write crappy buggy code, without consideration of any problems or limitations or domain restrictions. Yes we know. Hopefully nobody is paying you for such garbage tho
@@gregorymorse8423 jesus christ bro chill, he's explaining how he did it "for my computer science class", he's just working with the closest possible precision he can given the limitations of practical time constraints and diminishing returns. For the purposes of understanding how assembly works, the goal of most courses that are going to require a basic understanding of this kind of math, this is a perfectly sufficient method.
Once you described how that weird arbitrarily large number was found/represented, it reminded me of the Fast Inverse Square Root function in Quakes engine that tells the computer to cast a float as an int with no conversion
//Evil floating point bit level hacking.
@@endergamer7444 // what the fuck?
@@endergamer7444 // what the duck
//Newton Iteration
The real kicker is the matrix that allows you to do square roots without doing a square root and in only a few FLOPS. Its the foundation of all modern video games that use triangular modeling.
The first computer which I programmed in Assembly was the IBM 704, in 1960. There were two floating point division instructions, Divide AND Stop & Divide OR Stop. This had to do with the requirement that the result had to be less than one. So the inputs had to be scaled to make sure.
In the “old” days of fixed point arithmetic, machines came in fractional or integer models. So, we kept track of powers of 2 in a note by each statement so as to loose track, x(2^n). The “n” was physically typed in the comment beside each statement. As far as timing, a multiply took 4 cycles while a divide took 13 computer cycles. This was for a 16-bit process while double, triple, quad etc precision timing grew to 2n cycles, 3 n etc. So we always tried to find ways of using multiples, especially for critical time simulations.
-has library that handles said issue easily
-decides not to use it, and wastes a bunch of time reinventing the wheel.
I've been developing on a Motorola/freescale/NXP DSC family chip for a few years now and I'm always having to do multiply shift approximations. A simple divide operation takes a few hundred microseconds, and when I'm running a one millisecond loop that can blow things out quick.
This is just one of those cases where close enough really is close enough.
Yeah I’ll probably do another video for this but it’s crazy that even on modern CISC chips DIV instructions take sometimes up to 12-20 clock cycles
@@LowLevelTV But then the human brain typically struggles with division the most too. I still can't even get my head around long division lol.
@@LowLevelTV That is because Division is one of the hardest problems in Computer Arithmetic.
Maybe we should teach them short division method.
@@LowLevelTV If you want to realize you're still coding at a high level even in asm, then try to implement a divide yourself in FPGA (-:
I've got some bad news... looks like the Cortex M0, M0+ & M1 don't support the SMULL instruction (32-bit multiplication with 64-bit result). You only get the lower 32 bits. If you set the compiler flag `-march=armv6-m` in Godbolt, you'll see it calls a library function for division (`__aeabi_idiv`) instead.
Yup, the same thing tripped me up when I was fooling around with some playful attempts to do baremetal stuff on a Raspberry Pi 3B, which has a Cortex A53 I believe. I didn't bother to include or link any libraries, because I was only doing very basic arithmetic and pointer shit anyway, outputting over UART. I was quite surprised that I was getting an error referencing this function call, in a place where I fully expected a plain integer division to be.
I strongly doubt if he actually needs to convert millions of Fahrenheit to Celsius. For a reduced range a 32-bit multiplication with a 16-bit constant and a 16-bit shift is enough for correctness.
Dont code for armv6... That stuff is ancient.
@@Carewolf "Dont code for armv6... That stuff is ancient."
ah yes - the "ancient" 9 years ago... cause nobody would ever use something that was first released that long ago............................................................ yeah no.
@@ABaumstumpf You dont because arm before v7 is not a single platform, it makes no sense to target armv6 because there is no such single ISA, it is a huge mess. if you target arm32 you target armv7 or a VERY specific arm cpu.. And besides armv6 is from 2002, so 20+ years old, not 9
If you had divided by a variable instead of 9 in the C source, the true face of division would show up. The code would then have to call a runtime division algorithm instead of multiplying by a constant. The ARM assembler code for efficient division is worth a look.
I'm old enough to remember the days when all calculators were mechanical (sometimes electrically powered, but not always). Unless they were the more expensive variety, they did not support division. I specifically remember my first hand-held 4-function electronic calculator (circa 1960) and being truly amazed that it could divide! There was actually a short pause (maybe 1/8 second) before it displayed the answer to a division problem.
0:37 "I felt I was on my way to a quick LOW LEVEL victory" nice one.
Before the prevalence of div instructions and FPUs this was standard practice in embedded systems and decades of products used fixed point math to do incredible things, including keeping your car on the road in the winter!
As soon as you aren't working with whole numbers, most humans start to break at division as well.
90/9 is something most people can do, 90/8 becomes a bit tricky to some, but 90/7.6 will require most people to pull out a writing sheet to 'cache partial results', and even makes us struggle past this point.
And this is while we are the creatures that learn how to divide things evenly with others even before we learn math.
I'd say that most people (including myself) would actually no even try to do 90/7.6 by themselves at all, sheet or not.
@@farfa2937 My solution would be: "90/7.6=Fuckyou"
90/7.6=900/76
the big difference here is that 8 and 9 have 1 digit while 7.6 has two not the fact its not a whole number if you say divide by .9 now its a cakewalk for most people again
@@ss3nm0dn4r8 I wouldn't say most would find it easy again (althought definitely easier then 90/7.6), but that's also because the solution is a whole, natural number again.
Old timer here... visiting you from the days before MUL and DIV instructions in microprocessors. I should be well versed in all manner of integer math operations, but thankfully I've been able to forget (or at least not recall and use) all that for years now. Heck, for some of the industrial machines (Z80 based) I programmed in the '80s we splurged and added a math co-processor card!
Probably the main takeaway in your journey shown here is that ARM is older than we tend to think. Born in the '80s itself, ARM was radically new (RISC) but still influenced by its times. Also, adding hardware DIV might have been considered antithetical to the new RISC concept.
And nowadays RISC and CISC aren't really meaningful anymore. The most widely used RISC designs have tons of instructions and extensions and the most widely used CISC designs translate CISC instructions into RISC-like µOps
The ARM1 processor didn't have a multiply instruction either. Multiply was added for the ARM2 not really for maths but because of how often the C compiler had to calculate addresses (eg array members) and multiply is very useful for that.
dividing using assembly was one of my least favorite assignments when i took a class with assembly
Really cool video!! It feels kind of great when I get to see someone else passionate about assembly language. This video reminded me about the time I was figuring out on how I should multiply and divide a number in the assembly language of ATMega8A microcontroller (and later with an 8085 processor): I had to use repeated addition and subtraction to get the job done. Later on, when I worked with 8086 microprocessor I got to know that division is an expensive operation (just like LOOP and REPNE instruction in x86 processor. Even in x86-64 family of processors, they are just artificates which no one has ever optimized, so everyone favours JMP instead of LOOP).
Again, this was a great video! 👍👍
Thanks for watching! :D
I strongly recommend the book "Hacker's Delight" by Henry S. Warren, Jr. It's particularly valuable for anyone doing assembly language programming. I wish it had been around in the 80s when I started out.
I also thought of that immediately, because I remember that in Hacker’s Delight there is this exact algorithm. Such a good book.
It is nice to see young kids discovering the wonders of the world! :) What if I told you that the famous 6502 CPU which powered such machines like the C64 (and many-many others) had not only no division but no multiplication too. And it was still able to rule the 80s.
When I was using Apple II, I wrote a macro library for 16- and 32-bit int operations like add/subtract/multiply/divide. But when I got to wanting to do floating point from assembler, I cheated and just bought a floating-point processor board.
You simply have to implement that in software as far as I know. It will just be slower but that's It.
@@w花bwell, by deciding to work with floating points, you kinda accepted the fact that you will sacrifice memory and performance.
After some time studying hardwares, specially old ones, i got why its that so fundamental and common.
Traditionally, we used log/exp approach on 8bit/16bit. Hitachi and Mips RISC cpu work around the problem. On mips, multiplication and division are carried out in parallel and store the result in specific registers. Hitachi "sh" cpus provides a division by part, we can set the required precision: one instruction to initialize, another to perform a step.
@@w花bYes, of course, it was Turing complete. I just said that it was even more primitive, but still a CPU. Getting surprised that some CPU has no division shows that you're way too young. :)
Nice solution! I remember reading a book back in the 90's called "Game Programmer's Black Book" that detailed workarounds for faster division performance on the 486's and Pentiums of the day.
Being in management today, I watch videos like this to get back to my roots and just feel good. Thanks for posting this! 🙂
Those of us who work with smaller micros do this on an almost daily basis. Create a multiplier/divisor pair with the divisor a power of 2. 2^8, 2^16 and 2^32 are especially nice because they just involve discarding low bytes. The only caution is to make sure your intermediate multiplication cannot overflow. For "normal" temperatures, a 2^8 divisor is all you need. 5/9 = 142.22, so the multiplier can be 142 or 143. Now you work back to determine the maximum input temperature that will avoid overflow, so 65535/142 = 461 and change. The conversion will work with inputs up to 461 degrees, with an error of about 0.16%. You can also add a rounding factor of 256/2 to the multiply before the 2^8 "division".
When we manually tried different division approaches on a computer engineering courses I was fascinated by how hard division acually is. It makes sense though - division is also harder to do on paper.
P.S. there is also a square root
That's why they call it a TRIAL divisor, if it's too big you get to start over again.
Division in binary is actually not that difficult, since the (appropriately shifted) divisor either does or doesn't go into the dividend (determined using subtraction). There are even shortcuts you can do to make this fairly efficient. However, it still takes a number of cycles equal to the quotient length.
@@Mueller3D yeah, but it's much less intuitive than multiplication. It's also longer algorithms with more steps and branching. In multiplication you just go with a naive approach. In division I don't think any approach can be considered naive
That's actually a pretty cool solution. I thought you were going to talk about floating point division and how inaccurate it is, but this was also interesting to watch
the depth you go into each topic is remarkable, truly enlightening!
This is the bit i love.
Go deep, go to circuit level... half bit and full adders, shift registers...
We have paper, pencils, and extra ways of accessing memory.
A puter is stick with whatever figures are in its amu. It "sees" one digit at a time. It cant see the whole number, it cant go back and "carry the one"...
Subtraction by addition and complements has always fascinated me...
Taking me back to applying newton's method for calculating square root in MIPS assembly as a programming assignment in an introductory computer architecture/CS class. I think more than anything else it taught me to respect all of the crazy shit that's happening at the lowest levels (although of course it goes even lower).
MIPS is dead! The company responsible for MIPS now is RISC-V.
I usually pre-scale a fraction a with an acceptable binary power, eg 2^8 giving (5/9)*256 = 142, and multiply with that then right shifting the result by 8 to fast divide.
Welcome to 6502 assembler with more bits! 😱
Seriously, at the assembly level (especially with RISC type instruction sets) you end up really appreciating how much heavy lifting your compiler does for you. I've never written anything in ARM assembler, but this video takes me back. Thanks for sharing! 👍🍺
One of my favorite optimizations had to do with converting timestamps from 32-bit number of seconds to separate year/month/day/weekday/hour/minute/seconds fields on a 6502. The typical approach involves multiple 32-bit divisions. I realized that most of those quotient bits were always zero. So I essentially unrolled those divisions and eliminated all of the trial subtractions that were guaranteed to result in a zero. It ended up needing a total of something like 34 subtractions, resulting in a speedup of 7X.
On systems with fixed instruction length even stuff like loading immediates (that is, constants) to registers is quite funky, even on symbolic assembler. On x86 with variable instruction length but no widely available barrel shifter logic available to instructions these are quite literal "move immediate" instructions, but on ARM, complicated constants either take several instructions to load (if not using constant islands), or generate constants in interesting ways...
JRISC in 1993 ( Atari Jaguar ) and SH2 in 1994 on Sega 32x and Saturn offered 32-bit division. It has some latency, but did not block the processor. Similar how quake did division on CISC Pentium in 1996 .
I started this on the wrong foot, I had "just use a high level language" yelling in my head until I realized that looking at these from the low level is the point of the video and indeed the channel
When confronted with similar issues in CS, I adopted the more straightforward solution of giving up coding forever.
That's a more common solution than many folk realise.
I remember when I first started learning to use assembly, I started coding for Game Boy. I decided I'll write a few simple functions to get a feel of assembly, so I wrote a function for Fibonacci sequence then I attempted to write a function for “n!” !!!
Imagine my surprise when I found out that GBZ80(CPU of Game Boy) doesn't have any instruction for multiplication. I had to write a multiplication function using bit shifting.
I have used a lot of different CPU's assembly and most of the very old ones (6502, 65C816, …) doesn't have a mul\div instruction; and old ones (8086, 8186) only have integer mul\div; and newer ones (from pentium until now) are very hard to program (if you want to make them fast, that is) because of SIMDs.
This was a triumph!
FWIW most x86_64 CPUs have floating point division hardware but integer division is implemented using micro ops for newton-raphson method. This bugs me to no end because you normally want high throughput for floating division, so you use n/r for that, but for ints it usually is a single division on the critical path. The difference is huge - unpipelined trial subtraction hardware has an 8 cycle typical latency, but n/r is 26 cycles.
On modern AMD CPUs, integer division only takes 2 micro-ops. Still slow though, for 64-bit integers the latency is up to 19 cycles.
@@soonts ooh, interesting! DIV and IDIV timings going back to Bulldozer appear to be for a slim hardware division unit with up to 18 cycles of latency. Intel are still in the high 20s though. Thanks, I've been team red for over a decade but somehow never noticed this.
@@soonts It takes a varied amount of time depending on the value you divide with. Anywhere from 2 to 30 cycles.
Not true. Hardware division makes use of SRT, not Newton's method.
@@capability-snob divq on ice lake is fast as hell. That is 128 bit division.
Next time I get lost in a lucid dream, I'll try to divide with an old chipset. If division is supported I'm dreaming
Divception!
Thank you for not making this video unnecessarily long like so many UA-camrs do (8+ minute videos with boring history and whatnot).
“Division is slow, we’re just gonna ditch that” - I love the approach. Back in university assembly course, instead of dividing by 1000 I just bit shifted by 10 because it’s faster and good enough. My professor wondered about the loss of precision and made me fix it, though. 😂
Nice video, thanks!
It is worth noting that, even when division is supported by the hardware, it can be quite slow. I once timed the basic floating point operations on a Cortex-A5, which has an FPU implementing the VFPv4-D16 instruction set. Multiplications turned out to be fast (one/two cycles in single/double precision respectively), but divisions took more than seven times longer than multiplications.
Shows that ARM was right to forego a division module.
@@zlcoolboy That's still faster than doing it in software. This method only works because it's a known constant that can be approximated. Once you divide by a variable, you've got to use much more sophisticated approximations, which will likely take multiple times as long as a hardware divide unit.
The RISC philosophy, love it or hate it 😉 Yes, I saw the interesting multiply-to-divide trick in printf routines to display numbers, where it has to divide by 10 (or powers of 10). I was impressed, it was packed with such little techniques to save cycles (on x86_64).
Literally the only CISC cpu that is halfway alive and not x86 is the Motorola 680x0 architecture.
@@linuxization4205 It depends which categories you are willing to consider. There are microcontrollers and CPUs derived for ex. from the 6809, 8051 or 6502 and so on which are borderline or plain CISC. Though many designs, including the x86, only have CISC instructions but a RISC architecture, so it's not as straightforward as when the term was initially coined. 😀
I believe Intel started translating the instructions into micro-instructions in their Pentium series, otherwise they wouldn't have achieved the same level of performance. It was bold and brilliant. Nexgen, which was bought by AMD to save their failing K5 architecture, did the same back then.
@@phenanrithe It was the Pentium Pro, (P6 or i686), which all modern x86 chips still identify as.
So yeah, the big x86 CISC architecture uses RISC-like tricks under the hood and only displays itself as CISC on the outside for compatibility's sake.
And that whole thing was also supposed to be replaced. After IA-16 and IA-32 should be IA-64, but that went to nowhere just like i960, i860 or i432. And obviously Itanium which, as P7 was somewhat supposed to be the successor in 98
@@HappyBeezerStudios Yeah... what happened was that IA-64/Itanium was a disaster, it tried to optimize all the wrong things and ended up being somehow SLOWER than x86 if you can believe such a thing.
True story. Encounter this frequently when working with code generators! Good job!
:D
When I was first taught about logic gates and how to make a full adder curcuit, I went on an absolute rampage trying to figure out how pure logic can be used to perform mathematical functions.
Subtraction is easy once you know how to add, and multiplication is just adding with extra steps, so I figured them out quite quickly, but I remember spending weeks on division.
I ended up figuring out how the division process worked, but was left with the problem that:
a) like half the results of dividing in binary end with recurring decimals
b) Three decimal places would only give my answer a resolution of 1/8th
This resulted in some divisions appearing to give the same answer, because the difference between them was too small to see.
So I thought "I know, I'll just represent the remainder as a fraction, that should be easier."
The problem with that is that nobody wants to be told that:
27÷18=27/18
So I would have to find some way to simplify the fraction. Which is not easier at all.
As it turns out, the only way to find the prime factors of a number is to divide it by every prime number smaller than half of the original, smallest to largest, and see if the result is a whole number. Which means for each digit you add to the input the list of prime numbers needed to check grows exponentially.
And then you have the added problem that once you find that a number is divisible by one of the primes, you then have to check whether you can divide it by that prime again, and keep checking indefinitely until you get a remainder, and while that's easy enough to show in a flow chart, making a circuit to do that is a whole different matter.
The "solution" I decided on was to make a curcuit that only devides three digit binary numbers, and have the remainder be displayed up to three decimal (binamal?) places as long as there is no "loss" (I.e. There is no data carried off the end of the divider circuit).
If there is loss, the result is displayed as a fraction which can then be simplified.
The reason this is easier to simplify is that now the only fraction that will ever need simplification are 3/6 and 2/6
This means that my simplification circuit only needs to check if the numerator and denominator are both even, and if so divide them by two, and if not, check if they're divisible by 3, and if so divide them.
If they are neither divisible by 2 or 3 they are displayed as they are, since I know from my meta-knowledge of all possible divisions of three digit binary numbers that any other fraction I could encounter would either be already in its simplest form, or be possible to display to three decimal places without loss.
This was really a solution, but I'm still pretty proud that I made it at least look like it can divide.
I can't get out of bed and brush my teeth and this guy is doing this with his time
Amazing content. It would also be interesting videos about division of non-constant numbers, the implementation of a floating point unit (showing how to add and multiply floats) and also the implementation of algorithms like sine and square root. All of that comparing C and assembly.
It looks like the library function __aeabi_idiv/__aeabi_uidiv is generally used on ARM to handle division by a non-constant denominator.
Binary long division isn't as complicated as you might think, though it's fiddly
That's why I stick to multiplication by inverses
So much easier
@@LowLevelTV Obviously I posted a witless comment as quickly as I could to get attention. Thanks for the content!
Anytime man
I encountered such a thing on a school project with FPGA's. With an FPGA you allocate logic units (AND, OR etc.) to build your own logic circuit.
Using a single division operation used like a thousand logic units, which is a lot. Instead we used an division algorithm which got it down to under 200. It probably used multiplication and bit shifting, but I'm not sure anymore.
Most Americans don't even know that the Fahrenheit scale is 0-96 and falsely assume it's 0-100 because they underestimate what a dumb scale it is.
He literally used the coldest temperature in his home town as 0 and the average human body temperature as 96, which as you probably know, he got wrong.
I don’t think I’ll go into learning assembly at any point. But I started programming in my 3rd year of college. I first switched over from music education when I learned I didn’t want to be a teacher. I first learned some basics in game maker, then the teacher taught us some simple Visual Basic (it was outdated frankly even back in 2019, but it was a good intro language imo, statically typed yet simple to grasp the syntax). From there the next classes taught C#, sql server, then some basic web dev with html, css, and php. I didn’t finish there, but went on to teach myself python, JavaScript, Ruby, and I’m even trying to pick up some C++ and Go. I transferred my credits to WGU and am trying to finish up my degree and get out there as a developer, and I love that you, primeagen, melkey, and others put great content out there and it helps me to keep myself motivated to finish.
As an old assembly hound, it's fun to see something like ARM. I don't do a lot of RISC stuff in ASM because of the old adage "RISC is for people who write compilers. CISC is for people who write programs"
On the 8086 and 8088 we too couldn't rely on there even being a FPU, and our MUL and DIV statements were painfully slow -- in excess of 150 clocks EACH.
An old trick for doing percentages in that case was to do a multiply very like this as a fraction of 65536. What made that "fun" was you ended up with the integer result in DX, and the numerator (65536 as the denominator) in AX. We were further hobbled by there being no MUL reg, immed
; assumes temperature in AX
sub ax, 0x20
mov bx, 0x8E38 ; roughly 5/9ths 0x10000
mul bx
; DX = integer result, AX = numerator over 0x10000
Becuase it uses only the one multiply, it takes half the time that MUL 5 and DIV 9 would. Which is good when we're talking turning 300+ clocks into 150.
So true, typical MUL reg
What are 8086 and 68000 even doing in all those cycles. The factors are 16 bit as is the ALU. This only makes sense to me if MUL and DIV were just an afterthought which had to be cheap. Indeed I have seen a large stretches of code without any of these instructions. This typical IBM-PC business code does not MUL or DIV. Maybe once to calculate taxes.
@@ArneChristianRosenfeldt A hardware multiply requires a lot... no, A LOT ... of silicon. We're talking as much as 20 times the silicon as an addition. To that end a lot of early processors left it out entirely, or implemented it in "microcode"
In fact that's exactly what the 8086 / 8088 does, is use microcode. A small bit of code on the processor that pretends to be an opcode, that runs on the other instructions of the processor.
Even with a hardware multiply the operation is intensive enough to require a lot of clock cycles as multiply and divide -- even just as integer -- are expensive operations. A mul reg16 taking anywhere from 25 to 40 clocks depending on CPU. (laugh is the 486 is slower than the 286 or 386), and thus it wasn't uncommon for stuff like vga video routines to still use shifts instead.
; ACCEPTS
; AX = Y
; CORRUPTS
; AX
; RETURNS
; DI = Y * 320
xchg ah, al ; == y * 256
mov di, ax
shl ax, 2 ; == y * 64
add di, ax ; di == y * 320
Runs in about half the clock cycles of a hardware multiply right up through the first pentium. Though still not as fast as a XLAT
@@jasonknight1085 still does not explain why anybody would need more than 20 cycles in microcode. With a barrel shifter and some look ahead for zeros all those shift special cases would go away. Zero flag running from msb to lsb for early out (like in 386). Silicon usage goes up with the square of bits. Hence it makes sense to implement 8x8 as MUL ( costs like a 64 bit add ), and then run the microcode over this. 4 cycles for 68k like 16x16 -> 32. ARM7 in the GBA has hardware 32x8 and needs 4 cycles for 32x32.
@@ArneChristianRosenfeldt Do you realize just how much silicon an ARMv4 has that an 8086 does not? We're talking about a ~29,000 transistor and ~5000 gate processor here. That ARM7TDMI (the 7 is misleading, it's a v4) is what, 280,000 transistors and 90k or so gates?
What you describe in silicon alone would be a fifth the die space the 8086 used. JUST for the multiple logic alone.
Remember, this is an age where a DRAM memory access was 4 clocks per bus width (so a word was 8 clocks on the 8088), and even a simple byte shift by fixed one place took two clocks. Memory shift was 15 + and effective address calculation, and shift reg16 by CL was 8+4n.
But yeah, once transistor counts grew these problems went away. By the time of the 80186 you're looking at 26 to 33 clocks, and the 286 it's 13 to 26 clocks.
The 186 was 55,000 transistors, the 286 was 134,000.
Also worthy of mention is the NEC V20 and V30, drop-in replacements for the 8088 and 8086 respectively, with timings closer to and in some places even faster than the 286. See its flat 21 clocks for all 8 bit multiplies and 26 clocks for 16 bit.
It achieved this by having over 60,000 transistors, more than double the stock intel part.
The type of infrastructure needed for the microcode you describe by the time you do both 8 bit and 16 bit mul and div is more silicon than many processors even had when the 8086 started being designed in 1976. That's no joke either, look at the 8080 with its whopping 4500 transistors-- and it was considered a chonky boy on release. The Z80 was around 8500, 6502 was 4500. And IF contemporary processors had a multiple opcode (many did not) they were inherently 8 bit with a 16 bit result (in two registers).
There's a reason the Motorola 68000 with its amazing fast hardware multiply via its 16 bit ALU needed 68000 transistors (hence its name), an absurd number for the time that took five years to even come close to looking viable outside of big iron use... because manufacturing processes for that many transistors reduced yields resulting in elevated prices.
I remember when I first hit by this wall, but mine was bigger, just imagine ( 8-bit processor, and no multiply instructions)
It was a PIC mcu, and it almost boiled me to write a floating point 64-bit based division and multiplication macros, it was enjoyable and hard challenge to tackle.
PIC is trash, move to ARM master race.
@@cpK054L design one and then call it trash, for studying fundamentals pic mcus are the best , ARM is more advanced and complex.
@@abdox86 I used PICs for undergraduate programs and for the price to performance ARM gave me a better result as a general purpose MCU
A breakthrough moment for processor-based math for me was, way back in a college Computer Architecture class, realizing that, internally, computers do addition and multiplication exactly as we do on paper, they just do it in base 2. Changing the base simplifies the operation greatly. There is no deep magic.
Number 1 thing I learned in CPU design is early on CPUs at the logic level can do one form of math. Addition . It does everything with addition. It uses flags or registers to determine negative numbers. It also uses counters for multiplication and division.
I love understanding what happens in the the low-level architecture even though I prefer coding in C for an MCU. Much appreciated!! I've subscribed 🔥🔥🔥🔥🔥🔥♥♥
It's called RISC (Reduced Instruction Set Computing) for a reason, if you want hardware division operations use a CISC (Complex Inst..) processor or a mathematical co-processor. ;)
RISC does not mean that it won't have certain instructions, or that it lacks functionality, or that it is a _limited_ instruction set computer. It just means load/store arch, orthogonal reg file and instr set.
@@GeorgeTsiros Well it's not limited it's just reduced, but it still is general purpose. ;)
The idea behind RISC was to improve the efficiency per transistor count and a division unit would mean more transistors with almost no benefit regarding the most programs.
Also important to know is that both integer divide and FPU divide are *very* slow even on architectures that supports it in hardware.
An integer divide is up to 12 cycles on Cortex M4 and 10-50 cycles on modern X86-64 for example, Skylake needed like 40-100 cycles for integer div just a couple generations ago.
FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4.
Also noteworthy is that even architectures such as M0/M0+ can implement external division hardware, such as the case for the RP2040 using M0+ architecture which integer division hardware needs only 8 cycles and therefore is actually faster than standard Cortex M4 for integer division.
If integer ADD/SUB/MUL is so fast, then why isn't it used everywhere?
@@ewenchan1239 Can you restate your question, as it doesnt really make any sense. Are you asking why a div instruction isnt substituted with a combination of add/sub/mul instructions? They are, the compiler will try to replace a divide either by a shift, a combination of muls and shifts or adds and shifts, depending on if your target supports mul instructions.
However this only works if the divisor is a compile time constant. For other cases the compiler most likely will use div instruction unless you tell it otherwise. For example you could use libdivide instead, which can be faster than hardware divide on a lot of targets. Or you could tell it to use its default inbuild algorithm, which most likely will be quite a bit slower than a hardware divider *if* the compiler cant optimize it down, which in a lot of cases it can.
@@_--_--_
"Can you restate your question, as it doesnt really make any sense."
You state:
"FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4."
If your statement above is true, then why isn't it used in lieu of using the ACTUAL divide instruction, ALL the time?
(You've answered it for the most part in the second half of your reply above.)
But for said second half of your reply to be true, you would have to know:
a) the divisor is a compile time constant (which means that if it isn't a compile time constant, it won't work)
and b) you explicitly tell the compiler what to do/how to handle the DIV instructions, which typically means that you would have to know the specific purpose or application that you are programming for, AND that the difference in the quotient is not significant nor critical enough IF the approximation is sufficient.
(i.e. you're programming for yourself rather than for someone else to use where the importance of the epsilon value is not known nor what someone else's tolerance limit for epsilon is)
@@ewenchan1239
"You state:
"FPU divide on Cortex M4 is 14 cycles. Compare that to 1 cycle for integer add/sub/mul/shift and 1 cycle for FPU add/sub and 3 cycles for FPU mul on M4.""
I mentioned the integer divide instruction a bit before that, lets not mix up integer operations with floating point. As far as I am aware we are talking about integer arithmetics specifically.
"a) the divisor is a compile time constant (which means that if it isn't a compile time constant, it won't work)"
Wont work is not the right wording here. Yes if the divisor isnt a compile time constant than the compiler cant replace the div instruction with a faster combination of other instructions by default, however thats an optimization problem, it still "works".
Also its often a good idea to rethink your code, often a lot of divisions are unnecessary (especially with floating point, but we are talking integer here), the compiler cant do this for you most of the time. This was mostly my intent of the original comment anyway, the compiler cant optimize all your mistakes away and using divides carelessly is often a mistake if performance is of any importance.
"which typically means that you would have to know the specific purpose or application that you are programming for"
As this discussion centers around low level programming, mostly embedded programming, we are talking about target specific code, where you know exactly what target you are progamming for and its specific compiler. Portable code is not relevant to this discussion.
"AND that the difference in the quotient is not significant nor critical enough IF the approximation is sufficient."
I am not talking about approximation, thats a good way of potentially optimizing code if its feasible, but thats not part of this discussion.
All compiler optimizations or replacement software algorithms will give the exact same output as the hardware divider.
@@_--_--_
"I mentioned the integer divide instruction a bit before that, lets not mix up integer operations with floating point."
But you also the "cost" (in terms of the number of operations) it takes for FP ADD/SUB and FP MUL on said Cortex M4 as well.
"FPU divide on Cortex M4 is 14 cycles."
"As far as I am aware we are talking about integer arithmetics specifically."
My understanding is that we can be talking about both (given that you can represent floating point numbers as a sum-product of pairs of two integers, can you not? (i.e. the approximation for 1/9 ~= 1 * 2^-4 + 1 * 2^-5 + 1*2^-6 ~= 0.109375)
And isn't the mul/shift - it's doing floating point math with integers, isn't it? (unless I am missing something here and/or misunderstanding what @Low Level Learning is talking about here, in this video)
"Wont work is not the right wording here. Yes if the divisor isnt a compile time constant than the compiler cant replace the div instruction with a faster combination of other instructions by default, however thats an optimization problem, it still "works"."
The "won't work" here refers to the substitution or the replacement of the slower div operation with the faster combination of other instructions. That's the "work" that you were talking about it doing.
Therefore; pursuant to your statement, if you aren't dividing by a compile time constant, then said replacement/substitution won't take place - i.e. the "work" that I am referring to, based on what you wrote, won't take place. Ergo; "won't 'work'" (div operation won't get replaced by "combination of other instructions" (which should help to perform the division faster without using the div operation/instruction).)
"often a lot of divisions are unnecessary"
Varies, I would think.
"This was mostly my intent of the original comment anyway, the compiler cant optimize all your mistakes away and using divides carelessly is often a mistake if performance is of any importance."
Again, I think that it varies. (depends on what you are doing/why you are using division in the first place)
re: mistakes
I wouldn't necessarily call using divisions a mistake.
I agree that this is where a skilled programmer comes in, but if you are learning or you are an unskilled programmer, then that's something that they will have to learn.
Take for example, the inversion of 1/(1/9).
If you use exact arithmetic, your E(X) is 9.
If you approximate (1/9) ~= 1*2^-4 + 1*2^-5 + 1*2^-6 ~= 0.109375,
then 1/0.109375 = 9.142857142857142....
Therefore; your epsilon would be 9.142857142857142... - 9 = 0.142857142857142....
or about 1.6% error.
And therefore; the question that you would have to ask yourself is "is that 1.6% error significant or not?" In some cases, it might not be. In other cases, it can be VERY significant.
It really depends on what you are doing.
(Note that often times, if you are replacing a division operation, I have yet to see someone who advocates for the replacement or substitution of the div operations with a combination of other operations - they don't automatically talk about (nor calculate) their epsilon and let the user know what that epsilon is that arises from said replacement/substitution. I think that if you are going to implement such a kind of replacement (of the div operation) with a combination of other operations, where and whenever possible, you SHOULD be REQUIRED to tell the user what your epsilon is, when appropriate.)
"As this discussion centers around low level programming, mostly embedded programming, we are talking about target specific code, where you know exactly what target you are progamming for and its specific compiler. Portable code is not relevant to this discussion."
That's an assumption on your part though.
There is nothing that explicitly states this.
As I said, it is important for people NOT to have it as a key takeaway from watching this video that they are going to implement this everywhere they see because they see it as a speed/performance improvement whilst they fail to recognise that this speed/performance improvement comes at the expense of precision and accuracy.
Even for assembly code (cf. GROMACS/Folding@Home), maintaining the precision and the accuracy can still be vital for the program to perform the calculations that is asked of it properly and correctly.
"I am not talking about approximation, thats a good way of potentially optimizing code if its feasible, but thats not part of this discussion."
The replacement of a division operation with a combination of other operations is almost always an approximation of the actual division operation.
(cf. the example that @Low Level Learning shows above with respect to the approximation of 1/9 ~= 1 * 2^-4 + 1 * 2^-5 + 1 * 2^-6 ~= 0.109375)
That IS an approximation.
That IS what he did and that IS what is shown.
For the multiplication part you could optimize it even further by removing multiplication entirely and just storing f-32 in a register then bit shifting left by 2 followed by adding the f-32 which would be the same as multiplying by 5. The the divide by 9 is way tougher to optimize though because it seems like you can't to the same with right bit shifts unless I'm missing something :/ Also I guess this would only work if you wanted the whole integer part of the result, if you wanted some decimal places too I guess you would need to use fixed point
Why not combine multiplication by 5 and multiplication by 1÷9 (477218589÷4294967296) to do multiplication by 5÷9 (2386092943÷4294967296)
@@jhgvvetyjj6589 because division in this case is floor division which is not associative (x*y)/z != x*(y/z). For example (2*3)/4 = 6/4 = 1 != 0 = 2*0 = 2*(3/4).
EDIT: Ah, I see that is not what you are asking now.
Damn you guys are so nerd
this is how binary multiplication already works!
Wouldn't your register just end up trashing your bits if you go beyond scope?
xxxxooooxxxxoooo >> 6 bits for example, would you not end up with 000000111100?
then if you were shift left it'd come back as zeros?
I learned about this in a pure math course but this gives another perspective. Well done!
"Felt like cheating" is a dangerous phrase that has been known to cost programmers years of their lives.
YES - more of this please! I’m learning 6502 (cc65 but no stdlib lib emulation), and this kind of trickery totally relates. 😊
Lookup tables are your friend on 6502 where practical. Check this video which has a link to page (as I recall) that shows how to do 8x8 bit multiply on 6502 with 4 table lookups, 2 8 bit add/subtracts, and a 16 but subtract in under 40 clock cycles, ua-cam.com/video/WEliEAc3ZyA/v-deo.html
If you need to do many chained calculations containing different operators, it can be faster to do them as rationals, and only do the final division at the end. Summation and division will end up to be three operations minimally, but it can pay off quite a bit for lower complexity with other operations. Though it can quickly run into rollover issues, so might need to keep an eye out for that, and apply a simplify step (gcd algorithm) where necessary.
Been there done that on 6502, z80 a whole lot. And fixed point is the easiest and fastest approach.
Also many CPUs don’t have multiply, they are both mathematical monsters 🤪
One method I've seen on 6502 is to use a lookup table of squares; (x+y)^2 - (x-y)^2 == (x*y*4). Useful for 8-bit * 8-bit -> 16-bit multiplies. You can do the divide by four with shifts or just build it into the table.
One time on a high school math test I showed my work in complement division as a protest against showing my work.
As it turns out I'm not great at math, but I like to flex when I can and I'm pretty sure the math teacher knew about it but had to look it up for a refresher to follow it. Great victory.
Man you know I never thought dividing would be one of this frustrating problems that seem to just take forever
As a computer engineering student currently learning about CPU design, it’s also important to bring up that new CPU architectures (arm & RISC-V) are shifting towards reduced instruction sets to increase efficiency. This is why division isn’t prioritised, as it can be done in *very* roundabout ways that may still be faster than a physical hardware implementation. As long as you have the fundamental instructions, you can do essentially everything.
Meanwhile x86 has native sine cosine and tangent and can convert between extended precision binary floating point and 18 digit decimal integers
680x0 had those too in the FPU for '020 and '030, but they removed them again for the '040 because they're slow even in hardware and add so much complexity and use up so much silicon.
@@andrewdunbar828 680x0 is dead.
0:21 Or you could put C= (F-32) * 0.5555556 because I’m pretty sure that 5/9 will never change
Print (Var=C)
Var=C
Lol
But not all ARM processors have an FPU
This is a very great video. Thanks for making it! I am an silicon engineer developed processors and I am surprised with ARM's brilliant solution! That make die smaller and power efficient!
Will Smith: "You can't do division"
Robot: "Can you?"
At 3:10, is that IEEE 754?
Learned more here than in most of my CS courses
Hell yeah
If you're gonna use "SCARY FIXED POINT" to a division, you may as well fold the multiply by five in, and save a whole multiply instruction.
So instead of setting up 1/9 as your magic constant, set up 5/9 (just a different constant).
But, yeah, depending on your CPU, some things need to be coded using multiple instructions. Might be multiply, might be divide, might be floating point.
I worked on a graphics system coded in 68000/C where all pixels were held in fixed point 24.8 (a whole number of 256ths of a pixel, if you like)
Yeah, fixed point is a very effective technique to get around divisions, unfortunately you can't always use it when high precision is required. Luckly the problem does not apply when you have an FPU. A very interesting alternative I suggest you to look into is CORDIC, it computes only 1 bit per iteration, but if your architecture is superscalar that is not a big issue. Furthermore I also read something about CORDIC and loop pipelining at a certain point, thus looking that up could help as well!
As a side-note: In the 'ARM system developers guide' by A.N. Sloss, published 2004 by Elsevier, a 40 opcode or so division routine is shown which divides two 32b numbers in under 32 cycles. This routine was the reason that up to 2004 there was no need for a hardware divider for the ARM, as till that time these hardware dividers needed 1 cycle/bit. The routine uses 2 Newton-Rhapson iterations and a correction process.
Yeah, pretty much your only alternative would be to build a nested-loop that adds the divisor to a total, counts the number of times the loop runs before the total reaches the dividend, then move the decimal point on the divisor and the loop-counter, and repeat the whole process until you run out of space to store decimal places.
binary search is also an alternative
Learned this at university while studying computer engineering. That is one of the advantages of a 4 year education over general programming bootcamps. You get into the specific details of why things happen at a low level.
Im learning this rn in my computer systems class and I still can’t wrap my head around it
in a vaccuum yea, who wouldn't want to? but realistically, you wasted a ton of your money and life for something a "general programming bootcamp" could have gotten u into your field a LOT faster for a LOT less money. Funny how that works. but in order to suck off the tit that is high lvl education grant money, one must pay the blood price of an undergrad + grad degree. so funny how that works. :")
@@housemana No chance anyone out of a coding boot camp could get my type of job. It requires quite a bit hardware knowledge as well as low level programing knowledge. Let alone the amount of research we have to do including advanced mathematics to invent tech that is not currently in use. This is just one of many upon many things covered in an extended program. In case you are not familiar with software development, there are many branches and specialties that required varying levels of knowledge. Usually the more difficult jobs tend to have higher requirements, but also have higher pay ceilings. Except maybe game development (High standards, shit pay and low quality of life).
I'm not saying this to shit on people who go to boot camps. In fact I have a ton of friends who have advanced their careers and greatly benefited from these. But if your goal is to be a software dev, a 4 year university will undoubtedly provide you with a more thorough education and better job prospects as starting point. I just see a lot of videos essentially spreading your same message, and it is just not accurate.
@@kool2btrue why can't you just read the textbooks at home? I am a self taught software engineer and I just read books on this stuff
@@kool2btrue I bet I could get your job
Sooo... you need to divide, to... divide????
Amazing how you give explanation as to why it don't have a divide instruction,
I knew the "Multiply and Shift" technique, but I hadn't thought to use the Overflow Register deliberately to do an implicit shift.
im a computer in this case
if you use the aproximation of 5/9 to solve the problem, it would get another problem in the future, or could be a easier solution too?
Dividers don't need to be big or complicated; a basic shift-and-subtract divider is little more than a wide adder, it's just slow to do it that way - depending on the implementation, typically something like 1-2 cycles per quotient bit, so for a 32:32 bit divide, it would take something like 32-64 cycles. Pretty much the same hardware can also do multiplication, just as slowly. These are usually the first multiplication/division algorithms you learn about when you study HW for arithmetic circuits.
This long runtime is awkward to deal with; having operations that long-running is a pain if you try to fit them in a simple microcontroller design that otherwise has nothing of the sort, and also makes it harder to guarantee timely servicing of interrupts etc. For this reason early RISC CPUs either didn't have any multiplication/division support at all or effectively treated it like a coprocessor which sometimes (e.g. when interrupts happened) could "lose" results and then you'd have to retry.
With ARMv1 in particular, the instruction set was such that you could just implement the basic repeated-doubling-and-add multiplication and shift-and-subtract division operations in software. That would typically take something like 3-4 cycles per bit for the straightforward way, which is 3-4 times slower than dedicated hardware would be, but it avoids all the complexity and consequences of dealing with long-running arithmetic operations.
ARMv2 added multiplication instructions, and did get a better multiplication algorithm than software shift-and-add - but it still only handled 2 bits of the product per cycle, so (since it was a 32-bit CPU) multiplications could take as much as 16 cycles! This used modified Booth recoding, which is a staple in multipliers to this day, but otherwise pretty much just ran on the regular ALU. Over time this got improved further. The problem with division is that it's not as gradual: the 1-bit-at-a-time algorithms are simple, there are 2-bit-at-a-time algorithms that are already a lot more complicated (and need a fair amount of tables), and then there's yet another giant complexity jump if you want faster than that.
So historically, there was a reason to be weary about including either multipliers or dividers, and if you're gonna do one of the two, multipliers are used a lot more often and also have a lot more options for implementers. This is especially important because most divisions (and remainders) in real-world code are by constants and for those, we can use the multiply-with-reciprocal method which just requires a decently fast multiplier and not a divider.
Real-world integer divisions by non-constants are not unheard of, but they're rare enough that to this day, if you're going to do a simple chip, it's not doubtful whether a dedicated hardware divider is worth it; software division routines are usually good enough for those cases, and especially when you have a fast multiplier available (which enables better algorithms).
The really fun part is that for a long time, until fairly recently, Intel x86/x64 CPUs didn't have a real integer divider either. Sure they have an integer divide _instruction_, but pretty much everything Intel released between the original 1993 Pentium and the Icelake microarchitecture (released in 2019) does not have an integer divide _unit_. All these CPUs have a floating-point unit with a floating-point divider, and would use that floating-point divider for integers too. (Running a short, or in some cases not-so-short, microcode routine to do so.) The reasons for this are similar to why ARM Cortex cores don't have it: this isn't particularly fast, but integer divide instructions are rare in practice, and a dedicated integer divider just wasn't worth it. They did change their mind on that eventually as mentioned, but it took them over 25 years!
what is a pity though is that almost no one implements "fuzzy"(analog or asynchronous digital backed) divide instructions, so they could be used in DSP/AI/GFX etc. cases where error is not a problem.
There are neat asynchronous methods, exploiting propagation/delay effects which take just a little of silicon and are fairly precise .
OFC most ppl argue that divide can be emulated or real hardware unit used, but then it cannot do either deep pipelines, nor SIMD or other bulk matrix transformation.
DSP and GFX suffer most as not only people need beefy chips to do fairly simple operations, but also power consumption hurts.
When I was in university, I had to write assembly code, and later had to take a Design of Compilers course, both of which were required for my major. Design of Compilers in particular was painful.