Floating Point Bit Hacks Every Programmer Should Know (Including Fast Inverse Square Root - Quake)

  • Published 17 Nov 2024

COMMENTS • 108

  • @S3Kglitches
    @S3Kglitches 3 years ago +201

    Meanwhile I've seen the latest famous video on the Quake 3 fast inverse square root; you did it 8 years ago, bravo

    • @bubeshp
      @bubeshp 3 years ago +9

      Nemean's channel?

    • @lynx5327
      @lynx5327 3 years ago +1

      @@bubeshp probably

    • @AmrReflection
      @AmrReflection 3 years ago +2

      Seems that's why the YouTube algorithm is now showing me this. Hehehe :D

    • @abhi36292
      @abhi36292 3 years ago

      @@lynx5327 yes

    • @ahmadalastal5303
      @ahmadalastal5303 3 years ago

      Dave's Garage put up a video on the Quake 3 fast inverse square root. Link: ua-cam.com/video/uCv5VRf8op0/v-deo.html

  • @johnp893
    @johnp893 3 years ago +29

    For your last function, you calculate inverse-sqrt(x) and then take the reciprocal of the result to calculate sqrt(x). Instead, you should calculate inverse-sqrt(x) and multiply by x; this also gives you sqrt(x) but uses a multiply instead of a reciprocal, which should be faster (it works because x/sqrt(x) = sqrt(x)).
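    For illustration, a minimal sketch of that suggestion; invRsqrt() below is just a hypothetical stand-in for whatever fast inverse-sqrt routine is being used:

    float fastSqrt(float x)
    {
        /* x * (1/sqrt(x)) == sqrt(x): one multiply instead of a reciprocal */
        return x * invRsqrt(x);   /* invRsqrt() is a placeholder name, not a real API */
    }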

  • @micp5740
    @micp5740 3 years ago +32

    I was searching for a faster way to produce square roots, so I timed the approach shown here and compared it to the built-in sqrt implementation.
    On an Intel i7 CPU using the Visual Studio compiler, the plain version (no Newton) of the simplified sqrt routine really is much faster than the built-in one (roughly 8x).
    But if you're looking for accuracy: the error is up to 6%, averaging around 2.5% (tested on numbers from 1 to 200,000,000). To me that is a bit too far off.
    A single round of Newton-Raphson improves accuracy significantly (max err 0.2%, avg 0.04%) but is far more CPU hungry than the bit hack itself. This slows the fast sqrt down almost to the level of the real sqrt (still roughly 30% faster).
    So additional Newton rounds would render it useless.
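    For reference, a sketch of the kind of routine being compared here (an assumed form; the exact constant in the video may differ):

    #include <stdint.h>
    #include <string.h>

    float fastSqrtApprox(float x)
    {
        uint32_t i;
        memcpy(&i, &x, sizeof i);        /* raw bits of x */
        i = (i >> 1) + 0x1FC00000u;      /* halve the (biased) exponent: crude sqrt, ~6% worst-case error */
        float y;
        memcpy(&y, &i, sizeof y);
        y = 0.5f * (y + x / y);          /* one Newton round: far more accurate, but the divide eats most of
                                            the speed win -- drop this line for the plain version */
        return y;
    }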

    • @meisterlumpi7822
      @meisterlumpi7822 3 years ago +1

      Tried the sqrt with 500,000 random values on a Core i7 with gcc and clang. Without Newton you can cut the runtime roughly in half; adding one round of Newton eats up any performance benefit on gcc, and on clang it's even slower than std::sqrt.
      I doubt that any of these are useful with modern CPUs and compilers.

  • @pwnmeisterage
    @pwnmeisterage 3 years ago +19

    15:11
    0x5f3759df is just thrown in there.
    Meanwhile, the constant "threehalfs" is explicitly defined.
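    For anyone who hasn't seen it, this is the routine in question, essentially as it appears in the Quake III source (original comments and the second, commented-out iteration omitted):

    float Q_rsqrt(float number)
    {
        long i;
        float x2, y;
        const float threehalfs = 1.5F;          /* explicitly defined... */

        x2 = number * 0.5F;
        y  = number;
        i  = *(long *)&y;                       /* reinterpret the float's bits */
        i  = 0x5f3759df - (i >> 1);             /* ...while this constant is just thrown in */
        y  = *(float *)&i;
        y  = y * (threehalfs - (x2 * y * y));   /* one Newton-Raphson iteration */
        return y;
    }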

    • @austinhastings8793
      @austinhastings8793 3 years ago +5

      Guess which one the coder actually understood?

    • @abhi36292
      @abhi36292 3 years ago +1

      Check out Nemean's video, he explained it;
      it's basically an initial approximation value that then gets refined into the solution

    • @teemuleppa3347
      @teemuleppa3347 3 years ago +2

      @@abhi36292 Also, Dave from Dave's Garage made a video about it just this week, I think

  • @TheWeepingCorpse
    @TheWeepingCorpse 11 years ago +6

    Even though I've been programming for over 20 years I still learn from your videos. Keep up the amazing work.

  • @Kattemageren
    @Kattemageren 3 years ago +2

    These kinds of videos are delightfully nerdy. Thanks!

  • @mikwes8913
    @mikwes8913 3 years ago +2

    Actually I think understanding the Fast Inverse Square Root is easier if you first try to do an approximation of the reciprocal function y=1/x.
    Let's first concentrate just on positive normal floating point numbers which have all their trailing mantissa bits set to zero.
    These are, of course, the powers of two: 1.0, 2.0, 4.0, and also 0.5, 0.25 etc in the opposite direction.
    for x=1.0 the reciprocal is 1.0
    for x=2.0 the reciprocal is 0.5
    for x=4.0 the reciprocal is 0.25
    and
    for x=0.5 the reciprocal is 2.0
    for x=0.25 the reciprocal is 4.0
    So we see, relative to the identity point at 1.0, that for each +1 in the input exponent we need to do a -1 in the output exponent.
    This kind of directly inverse relationship between input and output leads us directly to an approximation that works by subtracting the exponent from some well chosen constant.
    Since for now we consider the trailing mantissa bits to be zero/irrelevant, this leads to the surprisingly simple approximation for y=1/x as follows:
    bits(y) = MAGIC - bits(x);
    Since we know that for x=1.0f y=1.0f should be the result, we can easily find a fitting magic constant:
    MAGIC = 2 * bits(1.0f);
    While this approximation indeed yields the correct results for the powers of two, the interesting question now is: What happens between these points, if we allow the trailing mantissa bits to be nonzero? The answer is surprisingly simple.
    When going step by step from 1.0 up to but not including 2.0 in the floating point encodings space we do not cross any exponent boundary. Our steps map to equi-distant steps in the floating point number space, i.e. it is a perfectly linear mapping.
    While we do our steps from 1.0 to 2.0 in the input encoding our approximation of course does the likewise steps on the output encoding, just in the downwards direction instead of upwards, which is equivalent to traversing the numbers 1.0 down to 0.5, also in a linear fashion.
    The result is, when we plot our approximation function for y=1/x, the intervals between the powers of two will be just straight lines connecting the base points. So this is actually a piecewise linear approximation for y=1/x, with the base points being at the powers of two, and the line segments connecting the base points.
    This approximation, except for maybe the extremes like subnormals on the lower end or very big values approaching +Inf at the upper end, is also an upper bound for y=1/x, since it is always above the curve. It also handles positive and negative numbers.
    One might be tempted to distribute the relative error more evenly by lowering the magic number a bit, so that the error is spread to both sides of the actual curve, and this actually works. But before doing so let's take another look at the piecewise approximation we have done so far.
    If we look at the line segments in the plot from (0.5, 2.0) to (1.0, 1.0) and then to (2.0, 0.5), we see that the slope of the first line segment is -2.0, while it is -0.5 on the second line segment.
    This means that on each base point, the slope of our approximation decreases by a factor of 4, which might be surprising at first, since due to the binary nature of the floating point encoding we might be lured into thinking it should be a factor of 2. But it is easily understandable where this factor of 4 comes from when noticing that on each base point we actually cross two exponent boundaries simultaneously: one in the x-axis (input), and one in the y-axis (output).
    So if we now lower the magic number also our line segments will be moved down in the graph accordingly. The line segments will now also cross the x- and y- exponent boundaries in different places instead of both at the same time, with each crossing now being a separate base point where the slope decreases only by a factor of 2. Since we now have double the base points we can fit the approximation to the actual curve even better!
    Finding out the best magic constant can then be done either in an analytical way or experimentally. Besides the outer boundaries (subnormals and upper boundary at +Inf), which have to be analyzed separately, it is sufficient to analyze one segment (say 1.0 to 2.0) since all other segments behave the same.
    The Fast Inverse Square Root works pretty much the same, but instead of stepping down the exponent by one for every step up in the input exponent, you step it down only by the half rate, i.e -1 for every +2 in the input exponent - that's what the shift does. Analyzing this approximation is of course more involved, since one needs to think about the case distinction between odd and even exponents, as the lowest exponent bit is shifted down into the trailing mantissa bits.
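    A minimal sketch of the reciprocal approximation described above (names are illustrative):

    #include <stdint.h>
    #include <string.h>

    /* piecewise-linear approximation of y = 1/x; MAGIC = 2 * bits(1.0f) = 0x7F000000 */
    float approxRecip(float x)
    {
        uint32_t i;
        memcpy(&i, &x, sizeof i);
        i = 0x7F000000u - i;          /* bits(y) = MAGIC - bits(x) */
        float y;
        memcpy(&y, &i, sizeof y);
        return y;                     /* exact at powers of two, an over-estimate in between */
    }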

  • @stevemann3090
    @stevemann3090 11 years ago +11

    Your last trick, computing sqrt(x) = 1 / (1/sqrt(x)), can be simplified. Since sqrt(x) = x * (1/sqrt(x)), you can replace the reciprocal by a multiplication, which is faster and can also be done in parallel. If you like these sorts of computational speed-ups, especially if you want high precision, try Richard E. Crandall's book "Projects in Scientific Computation" (1994).

    • @WhatsACreel
      @WhatsACreel  11 years ago +4

      Well done, this is a great trick! Thanks for watching and commenting!

    • @calvinfernandes1054
      @calvinfernandes1054 10 years ago

      thanks for that book recommendation!

    • @stevemann3090
      @stevemann3090 10 years ago +2

      Cal Ferns
      I should also mention Crandall's later book, "Topics in Advanced Scientific Computation" (1996) which continues ideas from his first book.

    • @calvinfernandes1054
      @calvinfernandes1054 10 years ago

      Thanks again Steve, I bought Hacker's Delight and I'm going through it to begin with. I will definitely take a look at your recommendations right after

    • @1skeeta4u
      @1skeeta4u 4 years ago

      @@stevemann3090 Any fast track book recommendations?

  • @happygimp0
    @happygimp0 3 years ago +13

    The compiler is smart enough to find the fastest way to negate a float. A plain -x will be enough; no need for the negateFloat() function.
    When you want integers with a certain number of bits, use uint8_t, uint16_t, uint32_t and uint64_t. int can have any number of bits greater than or equal to 16; int can have padding, while the uintN_t types can't have padding and never have trap values.

    • @WhatsACreel
      @WhatsACreel  3 years ago +5

      We can neg 8 floats in a third of a cycle. We’d need to be low level, but it is 33% faster than sub from 0 on Intel, and 2x on AMD. A compiler might vectorise and apply this technique, yes. But often it will not.
      It is good to understand and share, even if it is just to check the compiler is doing what we expect. If I code this with neg in scalar C++, and the compiler spits out x87 instructions, I'm gonna take the reins, the compiler is drunk... Haha! Silly example, but you get the point, no?
      You are right, it won’t work with 386’s, Atmels or PICs. We need to be careful and understand our architectures and compilers before applying bit hacks.
      Cheers for watching, have a good one :)

    • @happygimp0
      @happygimp0 3 years ago +5

      @@WhatsACreel I tested it on my PC: with gcc -O2, -x is faster than the ^0x80000000 trick; with -O3 both are as fast as -x was before.

    • @WhatsACreel
      @WhatsACreel  3 years ago +3

      @@happygimp0 Great investigating mate!! I wonder why? Do you know what CPU you are using? Maybe a followup video is due. Thanks for sharing :)

    • @happygimp0
      @happygimp0 3 years ago +2

      @@WhatsACreel Intel(R) Pentium(R) CPU B950
      gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)
      On Debian AMD64.

    • @happygimp0
      @happygimp0 3 years ago +2

      @@WhatsACreel Testcode:
      #include <stdint.h>

      /* plain negation */
      float neg(float c)
      {
          return -c;
      }

      /* reinterpret the float's bits as an unsigned integer
         (pointer-cast type punning as in the video; memcpy would be the
         strictly portable way to do this) */
      static inline uint32_t asInt(float i)
      {
          return *(uint32_t*)&i;
      }

      static inline float asFloat(uint32_t i)
      {
          return *(float*)&i;
      }

      /* negation by XOR-ing the sign bit */
      float neg2(float c)
      {
          return asFloat(asInt(c) ^ 0x80000000);
      }
      I compiled it to a shared library and tested it by calling it. Yes, this may not be accurate, since without the library it would be inlined.
      I think it is slower because it has to move the float to a normal register, do the XOR and move it back.
      CPUs are pretty advanced; they know the XOR trick too and probably have it embedded in hardware directly.

  • @WhatsACreel
    @WhatsACreel  11 years ago +24

    Cheers, you made my day!

  • @happygimp0
    @happygimp0 3 years ago +15

    My fast sqrt() function: float fastSqrt(float x) { (void)x; return 1; }
    It always returns 1, which means it is not accurate for any input which is not 1, but it is really, really fast.

    • @АндрейОнищенко-з8х
      @АндрейОнищенко-з8х 3 years ago

      There can be cases where you operate with floats that are expected to be only powers of two, so his example may be relevant if you find somewhere to apply it. I don't think your joke is really appropriate

    • @edgarbonet1
      @edgarbonet1 3 years ago

      @@АндрейОнищенко-з8х If you know you have a power of two, just extract the exponent with frexp(). No bit hacking needed.

    • @АндрейОнищенко-з8х
      @АндрейОнищенко-з8х 3 years ago

      @@edgarbonet1 That's not what I commented on. You are right, but now you are talking about a different thing

    • @happygimp0
      @happygimp0 3 years ago

      @@АндрейОнищенко-з8х If you want to take the square root of a float that is a power of 2, you can just get the exponent and divide it by 2; if it was even, you create the float again with the new exponent, and if it was odd, you create the float with the rounded-down exponent and put a sqrt(2) constant in the mantissa.
      Because sqrt(n) == 2**(log2(n)/2).
      And this is exact as long as your input is a power of 2.
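      A small sketch of that idea using the standard frexpf/ldexpf helpers (assuming the input really is a power of two; sqrtPow2 is an illustrative name):

      #include <math.h>

      float sqrtPow2(float x)
      {
          int e;
          (void)frexpf(x, &e);                       /* x == 0.5 * 2^e, i.e. x == 2^k with k = e - 1 */
          int k = e - 1;
          if (k % 2 == 0)
              return ldexpf(1.0f, k / 2);            /* even exponent: exactly 2^(k/2) */
          return ldexpf(1.41421356f, (k - 1) / 2);   /* odd exponent: sqrt(2) * 2^((k-1)/2) */
      }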

    • @АндрейОнищенко-з8х
      @АндрейОнищенко-з8х 3 years ago

      @@happygimp0 Yeah, I have no argument with that, that's right

  • @edgarbonet1
    @edgarbonet1 3 years ago +2

    Re trick #2: that trick does not work in decimal. The trick in binary is log₂(x)≈x−1 for x∈[1,2]. It works with IEEE 754 because the leading digit of the mantissa is not stored, thus the mantissa field already contains x−1. The “equivalent” trick you are showing in decimal is log₁₀(x)≈x÷10, which is not quite the same: it has a division by 10 (a digit shift) instead of a subtraction.
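    A sketch of the binary version of that trick: read the raw bits, scale, and subtract the exponent bias (approxLog2 is an illustrative name):

    #include <stdint.h>
    #include <string.h>

    float approxLog2(float x)
    {
        uint32_t i;
        memcpy(&i, &x, sizeof i);
        /* exponent plus the mantissa field (which holds the fraction x-1), minus the bias of 127 */
        return (float)i * (1.0f / (1 << 23)) - 127.0f;
    }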

  • @octaviooctavio1548
    @octaviooctavio1548 3 years ago

    I didn't know how to get the raw bits from float numbers, and back again. Great video. Thanks.

  • @KasranFox
    @KasranFox 9 years ago +1

    This is a neat discussion of the fast inverse square root! I thought the origin of the magic numbers used in the calculations was even more fascinating, but I'm perhaps a little weird. The other tricks in this are handy as well.

  • @EvolutionCode
    @EvolutionCode 12 years ago +1

    Excellent! I love the way you include some history, very exciting stuff!
    Would really enjoy more of these kinds of videos so I can learn how to optimize my software and make it faster. Thanks A LOT!

  • @greg77389
    @greg77389 3 years ago +5

    This is the kind of stuff you'd only be able to figure out if you know assembly and binary well, which is why I think all modern coders should know this stuff. If all you know is high-level programming languages, you're not going to be very innovative when it comes to making optimizations like this.

    • @АндрейОнищенко-з8х
      @АндрейОнищенко-з8х 3 years ago +3

      But with modern computational power you don't even need such optimizations. I've worked on multiple large-scale projects and everywhere the main rule is always "readability > optimization"

    • @runneypo
      @runneypo 3 years ago +4

      @@АндрейОнищенко-з8х Thinking like this is why modern applications don't feel very fast even though hardware is a lot faster than it used to be

    • @АндрейОнищенко-з8х
      @АндрейОнищенко-з8х 3 years ago

      @@runneypo which applications do you mean for example?

    • @runneypo
      @runneypo 3 years ago +3

      @@АндрейОнищенко-з8х Things like web browsers / Electron apps and office applications; there's a popular talk by Mike Acton which talks about this

    • @runneypo
      @runneypo 3 years ago +1

      @Abitamim Bharmal You misunderstood. I wasn't saying electron was faster than native C/C++ at all. Learn to read before commenting maybe?

  • @Skandalos
    @Skandalos 5 years ago +18

    Your link to Lomont's page doesn't work anymore. Here's the Wikipedia entry for the fast inverse square root, where you can also find a link to Lomont's paper.
    en.wikipedia.org/wiki/Fast_inverse_square_root

    • @vharse1
      @vharse1 3 years ago +4

      Which is: www.lomont.org/papers/2003/InvSqrt.pdf

  • @fllthdcrb
    @fllthdcrb 3 years ago +4

    6:22 Considering the log of a negative number has a nonzero imaginary part, yeah, anything you calculate for that with this stuff is guaranteed to be wrong.

  • @GegoXaren
    @GegoXaren 9 years ago +1

    Here are a few macros that make life a lil' bit easier if you are doing work using many different compilers.
    #if defined(__GNUC__)
    #define ALIGNED_8 __attribute__((aligned (8)))
    #define ALIGNED_16 __attribute__((aligned (16)))
    #elif defined(_MSC_VER)
    #define ALIGNED_8 __declspec(align(8))
    #define ALIGNED_16 __declspec(align(16))
    #else
    #define ALIGNED_8
    #define ALIGNED_16
    #endif

  • @magras8220
    @magras8220 3 years ago +1

    There is no need for this negate function, because the compiler will emit exactly the same code for the unary minus operator (-x). Modern compilers have no problem with -1*x either. More than that, they will use SIMD instructions for it.

  • @alexloktionoff6833
    @alexloktionoff6833 1 year ago

    Is there a video about the fast floating point reciprocal, i.e. 1/x?

  • @nickhuynh6321
    @nickhuynh6321 3 years ago +6

    I wonder how many of these tricks get used when you have really good compiler optimization...

    • @billowen3285
      @billowen3285 3 years ago

      Bro

    • @JamesUKE92
      @JamesUKE92 3 years ago +2

      I don’t think the optimiser will choose to reduce accuracy on your behalf, but I’d have thought things like negating and multiplying constants rather than dividing would be done

    • @47Mortuus
      @47Mortuus 3 years ago +2

      @@JamesUKE92 Multiplying by the reciprocal will never be done unless you pass "unsafe math" compiler flags - they don't produce the _EXACT_ _BITWISE_ _RESULT_ (even when you divide by a constant!). FMA instructions (fused multiply add hardware instructions, including all of ((-/+a) * b) +/- c) take 4 clock cycles on new AMD processors vs 3 or 4 for both add and multiply each _AND_ _ARE_ _MORE_ _ACCURATE_ , since rounding only happens once, and they reduce code size by 50%, but the compiler will not generate them unless you allow it to. Again, it doesn't produce the exact same bitwise result, even though they are more accurate, even.
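      (A side note, as a sketch: standard C does let you ask for a fused multiply-add explicitly, without any unsafe-math flags, via fmaf()/fma(); mulAdd below is just an illustrative wrapper.)

      #include <math.h>

      /* a*b + c with a single rounding; compilers emit the hardware FMA
         instruction for this when the target supports it */
      float mulAdd(float a, float b, float c)
      {
          return fmaf(a, b, c);
      }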

  • @davidwilliss5555
    @davidwilliss5555 3 years ago +1

    Question: Can any of this be done using doubles instead of floats? Presumably you'd use unsigned long (or better yet, a type explicitly defined as 64-bit; it's been a while since I did C/C++), and the magic numbers would be different.
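    For what it's worth, a sketch of the double version: the same idea with a 64-bit integer type. The constant 0x5FE6EB50C7B537A9 is the one commonly cited for the double-precision inverse square root (taken on trust here, not re-derived):

    #include <stdint.h>
    #include <string.h>

    double fastRsqrtDouble(double x)
    {
        uint64_t i;
        memcpy(&i, &x, sizeof i);
        i = 0x5FE6EB50C7B537A9ULL - (i >> 1);
        double y;
        memcpy(&y, &i, sizeof y);
        return y * (1.5 - 0.5 * x * y * y);    /* one Newton-Raphson iteration */
    }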

  • @matthewexline6589
    @matthewexline6589 3 years ago

    So when it comes to multiplication being faster than division: what happens when we multiply by a float such as 0.5f? Does this turn out to be just as bad as dividing by 2? Likewise, and more importantly for me right now because I'm actually trying to develop my own little game, I have an algorithm that does some basic multiplication, but it's always multiplying floats which have decimals; is this basically just as slow as division then?

  • @rollmeister
    @rollmeister 3 years ago

    Grigori Perelman should be tasked to find more programming maths hacks. He is the greatest mathematician of all time. A super genius.

    • @bezbos
      @bezbos 3 years ago

      Could you elaborate on this? I couldn't find any work by Grigori Perelman related to programming.

  • @Rallion1
    @Rallion1 3 years ago

    Those links the video says will be in the description are not in the description

  • @bruh-moment-21
    @bruh-moment-21 3 years ago

    Big fan! you’re amazing!

  • @DiegoGomezG
    @DiegoGomezG 3 years ago +1

    OMG thanks! this is great to know D: !

  • @CharlieOspina
    @CharlieOspina 5 years ago +2

    Hi there! Can anybody tell me what exactly a magic constant is? What is its definition? I'm also trying to do an implementation of 1/x in MATLAB but I can't find any documentation on the internet about how to do it, and this video is the closest thing that I've found.
    Thanks for the help!

    • @WhatsACreel
      @WhatsACreel  5 years ago +1

      It's just a number that solves a problem in a seemingly magical way. In order to figure out a magic constant for 1/x you'd need to study the floating point standard, the mathematics, and probably experiment and code for many hours, days or weeks. Magic numbers are extremely arcane and difficult to understand. I'd have thought MATLAB wouldn't have any trouble inverting a number? You could do the Quake trick and square it, no? Or just compute the inverse without any magic at all... Good luck with your code mate, and cheers for watching :)

    • @CharlieOspina
      @CharlieOspina 5 years ago +2

      @@WhatsACreel Thanks! This helps me a lot. Greetings from Colombia :)

    • @WhatsACreel
      @WhatsACreel  5 years ago +1

      @@CharlieOspina Awesome brus! Hola from Oz :)

  • @АндрейОнищенко-з8х
    @АндрейОнищенко-з8х 3 years ago +4

    Just to be fair, I wouldn't say it's something "every programmer should know" like the description says. It's needed only if you are doing low-level optimizations, and not many real projects ever get to needing this

    • @m.sierra5258
      @m.sierra5258 3 years ago +8

      "Floating Point Bit Hacks that a select group of programmers may or may not benefit from" just doesn't have the same ring to it

  • @aaronsmith6632
    @aaronsmith6632 3 years ago

    Regarding the rsqrt magic number, it looks like 0x3f800000 (1.0 in floating point) * 1.5 = 0x5f400000, which is very close to the magic number. Mystery solved!

  • @panjak323
    @panjak323 3 years ago

    Do bit hacks work only with float, or can you use them with double too?

    • @marcs9451
      @marcs9451 1 year ago

      double is 64-bit while float is 32-bit. You can do similar tricks but you'll have to adapt the functions a bit

    • @panjak323
      @panjak323 1 year ago

      @@marcs9451 Thx, I did find it out myself eventually.

  • @EnricoPolanski
    @EnricoPolanski 7 years ago +2

    What is SSE?

    • @creelsmusic5814
      @creelsmusic5814 7 years ago +6

      It's a bunch of instructions and registers added to x86 for performance programming. There are lots of cool instruction sets; SSE is an old one, it's still useful tho. Hope that helps and thanks for watching!

    • @fllthdcrb
      @fllthdcrb 3 years ago +1

      Streaming SIMD Extensions. It allows you to (mainly) do the same operations on multiple operands (that are packed into single registers) in parallel. To take a simple example, if you want to add the elements of two arrays (A[i] + B[i] for all i), you could use ADD instructions to do it one pair at a time. With SSE (and enhancements), you can load them in packed form into some of the dedicated registers and use the ADDPS instruction to handle several at a time, depending on the element size (well, the original SSE only worked on a four-32-bit-floats type, but SSE2 expanded the types to two 64-bit doubles, as well as various integer types, from two 64-bit integers down to sixteen 8-bit integers; more recent extensions expanded them even more, as well as the register size).
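      A minimal sketch of that array-add example with SSE intrinsics (addArrays is an illustrative name; n is assumed to be a multiple of 4 for brevity):

      #include <xmmintrin.h>

      void addArrays(const float *a, const float *b, float *out, int n)
      {
          for (int i = 0; i < n; i += 4) {
              __m128 va = _mm_loadu_ps(a + i);            /* load four floats from each input */
              __m128 vb = _mm_loadu_ps(b + i);
              _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* one packed add handles all four */
          }
      }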

  • @MrEliseoD
    @MrEliseoD 3 years ago

    But we still have no way of representing 16,777,217 on 32-bit FP...

  • @seditt5146
    @seditt5146 3 years ago

    Wouldn't it be better to use Templates for the AsInt and AsFloat functions these days?

    • @АндрейОнищенко-з8х
      @АндрейОнищенко-з8х 3 years ago +1

      It's only about float; double and extended are much different, and int32 or uint32 are the only int types that match float in number of bytes. You will need to define specific functions anyway, so templates don't help here.
      And why "these days"? Nothing really changed on that topic since the video was published; templates existed long before

    • @seditt5146
      @seditt5146 3 years ago

      @@АндрейОнищенко-з8х ""and why "these days"? nothing really changed on that topic since video was published, templates existed long before"
      Exactly.
      As far as the other point, at least the template is preprocessor and does not have to force the optimizer to do work.

    • @АндрейОнищенко-з8х
      @АндрейОнищенко-з8х 3 years ago +1

      @@seditt5146 AFAIK, in terms of build time, template instantiation is generally more expensive than compile-time optimization, but for small cases like you suggest maybe not, actually. Though there will still be functions, and the optimizer will still check if it's beneficial to treat them as inline. I still think precise definitions without templates are better here, then

    • @seditt5146
      @seditt5146 3 years ago

      @@АндрейОнищенко-з8х Perhaps. As far as the code generated, it's likely the same if I had to guess. Templates are not as detrimental as people seem to think, though. Honestly, for whatever reason that does not quite make sense to me... I have seen more issues with compile times from macros than from properly used templates. I think templates largely get a bad rap in that department because they have the potential to increase compile times, but that is only if you are doing some really crazy shit. Matter of fact, I have seen some really crazy shit (check out the video on compile-time reflection without macros, it's cool). What he does is insane and yet he still stated he saw no real detriment in compile times.
      I think the issue with macros largely lies in antiquated technology in the source code that has not gotten the proper overhauls it desperately needs.
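      For the template question above, a sketch of what such a helper might look like (asBits is an illustrative name; C++20 code can simply use std::bit_cast):

      #include <cstdint>
      #include <cstring>

      template <typename To, typename From>
      To asBits(const From &src)
      {
          static_assert(sizeof(To) == sizeof(From), "sizes must match");
          To dst;
          std::memcpy(&dst, &src, sizeof dst);   /* well-defined type punning */
          return dst;
      }

      /* usage: uint32_t bits = asBits<uint32_t>(1.0f);  float f = asBits<float>(bits); */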

  • @zorbix3652
    @zorbix3652 3 years ago

    Who is this man? What even is a creel? :P

    • @austinhastings8793
      @austinhastings8793 3 years ago

      A creel is a basket for holding live fish after catching them.

  • @stinchjack
    @stinchjack 3 years ago

    the audio sounds like I recorded it lol

  • @NoNameAtAll2
    @NoNameAtAll2 3 years ago

    Hello

  • @toddstewart4404
    @toddstewart4404 3 years ago

    ridiculous, false, click-bait title.