This should unleash the Clang experts on the problem. Look forward to the eventual episode uncovering the truth.
Off the top of my head, this reminds me of something the prof in my assembly language courses at university said: dealing with whole registers is faster for the CPU than partial registers, so use whole registers unless you really need the space. So if your cache is big enough that 64 bit values aren't causing misses, the extra work of dealing with smaller values isn't worth it.
SURELY the cycle time of accessing rax vs. ax is *no different*, and the same goes for doing ops on them.
That was my guess. Given that the memory/cache used is small in both cases, the masking and bit shifting needed to handle integers smaller than the architecture's register length are probably the culprit. But why on one compiler and not the other? And why only at a specific optimization level? Also, obviously, with large memory/cache usage the cache misses would have a far greater cost than those masking and bit shifting operations. So this is probably specific to this use case and not something true in general.
@@ajjr1228 so, I did some testing with Assembly:
whole.s:
.global _start
.text
_start:
mov $0, %rax
loop:
add $1, %rax
cmp $1000000000, %rax
jb loop
mov $60, %rax
syscall
small.s:
.global _start
.text
_start:
mov $0, %eax
loop:
add $1, %eax
cmp $1000000000, %eax
jb loop
mov $60, %rax
syscall
As you can see, the only difference is that the loop uses eax in small.s and rax in whole.s
I created both programs (small and whole respectively) with: gcc -c small.s && ld -o small small.o
I then executed both multiple times like this: time for i in 1..10000; do ./small; done
I needed to do it in a (bash) loop to actually get an observable number.
While I am not sure how representative it is because of that, when executed repeatedly in fast succession I regularly get about half a second overall for small and a quarter of a second for whole.
So, I do think that it makes a difference.
But on the other hand, I needed a pretty extreme example to be able to show it at all, to the point where I think it will rarely make a difference in practice. Nonetheless, good to know in case it does.
@@ajjr1228 I am speculating absolutely wildly here, but CPUs have had instruction-level parallelism (ILP) for decades. Using the full register and being able to discard its contents could, abstractly speaking, theoretically mean that it uses less bandwidth.
This is pure speculation. Haven't even watched the video yet.
I kinda assumed the same, yeah. int32 and int64 have nice SIMD intrinsics; not so sure about support for int8.
One issue I see is the use of the indices array. For each board value, it stores a Point with its location. Assuming 32-bit indexing, this means that the indices array takes up 8 times as much data as the game board itself. Since this is absolutely trivial to recompute, it is better to do so to reduce memory bandwidth and cache footprint. Replacing std::transform with two nested for loops gives me a 2x speedup with both compilers.
The nested loops let the compiler hoist half the modulo operations from the inner loop to the outer loop, which, even after the optimizations from the divisor being known at compile time, is a very expensive operation compared to everything else in the loop.
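To make the shape of that change concrete, here is a minimal hedged sketch; all names (run_step, floor_modulo, width, height) are stand-ins rather than the video's actual identifiers. The point is just that the y part of the wrapped index is computed once per row instead of once per cell, and no Point array is needed at all.
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the floored modulo used for wraparound indexing.
std::size_t floor_modulo(std::int64_t dividend, std::int64_t divisor) {
    return static_cast<std::size_t>(((dividend % divisor) + divisor) % divisor);
}

void run_step(const std::vector<std::uint8_t>& cur, std::vector<std::uint8_t>& next,
              std::int64_t width, std::int64_t height) {
    for (std::int64_t y = 0; y < height; ++y) {
        // Hoisted: the row term only depends on y.
        const std::size_t row = floor_modulo(y, height) * static_cast<std::size_t>(width);
        for (std::int64_t x = 0; x < width; ++x) {
            const std::size_t idx = row + floor_modulo(x, width);
            next[idx] = cur[idx]; // placeholder for the actual per-cell update rule
        }
    }
}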
It would have been nice to compare the LLVM bitcode that Clang generated for the two different versions, just to see if there is something weird going on between Clang and LLVM.
If I had to guess, I would say that Clang is emitting some alignment attribute for smaller field sizes that makes LLVM go crazy and in turn generate unnecessary memory access ops.
So is next video about int_fast? ;)
Maybe an alignment problem: if you have a processor with 64-bit registers and your struct uses two 16-bit values, then it needs to do some masking and shifting to match 32/64 bits.
Hi Jason, first of all, thank you for the intriguing problem and for all the knowledge you share for free!
I've run both versions under VTune and toplev, and they are similar. The only thing that changes is the number of instructions retired, and for some reason, toplev in some runs also reported high fetching latency due to branch resteer but with a low branch mispredict, which is weird.
Getting to the bottom of why the fast version had fewer instructions wasn't easy, and in the end, the key was Clang's -Rpass-missed='.*'.
Running the build command with that on my machine and with Clang version 18.1.8, the slow version has more "vectorization was impossible" messages. The slow version doesn't get vectorized in two points where the fast one does:
slow.cpp:79:43: remark: Cannot SLP vectorize list: vectorization was impossible with available vectorization factors [-Rpass-missed=slp-vectorizer]
79 | return ((dividend % divisor) + divisor) % divisor;
slow.cpp:131:63: remark: Cannot SLP vectorize list: vectorization was impossible with available vectorization factors [-Rpass-missed=slp-vectorizer]
131 | return static_cast(floor_modulo(p.y, height) * width +
Clang also points to the exact operations that fail to be vectorized: the second modulo in the first message and the multiplication in the second message.
Now that I had a good starting point, I went on Compiler Explorer and looked for the difference in the generated assembly for the % operator, and as expected, the fast version looks cleaner. The second modulo gets somehow "embedded" in the general computation, whereas the slow version generates five more instructions, one of which is even a conditional move.
Taking into consideration that integer modulo operations are kind of hellish for CPUs, I can easily imagine that they truly are the bottleneck. The last question that remains unanswered is why Clang can perform vectorization on the larger integer but not on the smaller one. But to be honest, I'm not an LLVM engineer, and this question is way over my head.
Weren't arithmetic types smaller than int (or std::int_fast32_t) known to be significantly slower and harder for the compiler to optimize around?
Could you elaborate? I encountered this exact issue once on an older gcc. uint16_t -> uint32_t alone made a 2x performance improvement. I was as shocked as Jason, since smaller integers should improve caching. I just assumed that integer promotion rules somehow prevented some optimization but didn't investigate the mess and moved on after the code ran fast enough.
@@Y2B123 The native word size on most CPUs is 32-bit or 64-bit, meaning that they can perform basic arithmetic operations only on registers of this size. So doing arithmetic on a uint16_t would, in the most naive implementation, require a masked read of the value into another register, performing the desired operation, and then a masked copy of the result back. Even though the compiler will usually optimize this, the resulting assembly will never be optimal.
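A hedged illustration of that narrowing overhead (not code from the video): with uint16_t, each addition is promoted to int and then has to be truncated back to 16 bits, which typically shows up as extra zero-extension/masking instructions, while the uint64_t version needs none.
#include <cstddef>
#include <cstdint>

std::uint16_t sum16(const std::uint16_t* data, std::size_t n) {
    std::uint16_t acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc = static_cast<std::uint16_t>(acc + data[i]); // promote, add, truncate back
    return acc;
}

std::uint64_t sum64(const std::uint64_t* data, std::size_t n) {
    std::uint64_t acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc += data[i]; // full-width add, no narrowing needed
    return acc;
}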
@@cristian-si1gb Thanks for the reply. I am not familiar with computer architecture and forgot about memory addressability. Still, it is unbelievable that one or two additional instructions per item can sometimes dwarf memory loading delays. Perhaps there wasn't much locality to begin with.
When you call `floor_modulo`, `dividend` is a signed type (e.g. std::int8_t or std::int64_t) and `divisor` has type std::size_t which means `dividend` will be converted to type std::size_t during the modulo operation. This will likely give the wrong result when you pass -1 as `dividend`.
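A small hedged sketch of that signed/unsigned mixing; this hypothetical floor_modulo only mirrors the described signature (signed dividend, std::size_t divisor), it is not the video's actual function.
#include <cstddef>
#include <cstdint>
#include <cstdio>

long long floor_modulo(long long dividend, std::size_t divisor) {
    // dividend is converted to an unsigned type here (on a typical 64-bit platform)
    return ((dividend % divisor) + divisor) % divisor;
}

int main() {
    // The intended floored result of -1 mod 5 is 4, but because -1 first
    // converts to a huge unsigned value, this prints 0 instead.
    std::printf("%lld\n", floor_modulo(-1, 5));
}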
It may have something to do with clang/llvm upcasting indices to match the size of the pointer when it needs to index into the array, while the loop variable calculations are still done using the loop variable type. This can add some overhead in tight loops and may sometimes affect loop unrolling if the loop is just around the unroll threshold.
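A loose illustration of that point (not the video's code): indexing with a narrow signed type typically forces a sign-extension to pointer width before the address computation, while a size_t index is already pointer-sized.
#include <cstddef>
#include <cstdint>

std::int64_t sum_narrow(const std::int64_t* a, std::int8_t n) {
    std::int64_t s = 0;
    for (std::int8_t i = 0; i < n; ++i)
        s += a[i]; // i generally widened (e.g. movsx) before being used as an index
    return s;
}

std::int64_t sum_wide(const std::int64_t* a, std::size_t n) {
    std::int64_t s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i]; // index already matches pointer width
    return s;
}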
Instead of int#_t, is it better to use int_fast#_t so that the compiler can decide to use a wider type?
With gcc-14.2.1 and clang-19.1.5 on 64 bit x86_64, sizeof(int_fast8_t) is 1, but sizeof(int_fast16_t) and sizeof(int_fast32_t) are both 8.
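For reference, a quick way to reproduce those sizes; the values are implementation-defined and may differ on other platforms/ABIs.
#include <cstdint>
#include <cstdio>

int main() {
    std::printf("int_fast8_t:  %zu\n", sizeof(std::int_fast8_t));
    std::printf("int_fast16_t: %zu\n", sizeof(std::int_fast16_t));
    std::printf("int_fast32_t: %zu\n", sizeof(std::int_fast32_t));
    std::printf("int_fast64_t: %zu\n", sizeof(std::int_fast64_t));
}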
Okay, I see a few problems with your code. One possible area of squiffiness is the Point struct: two 8-bit values could cause some weird access code to be generated at specific optimization levels, depending on packing and how the compiler understands the padding, but I'd only look there second. The fact that the game boards are allowed to be sizes that aren't a power of two really screws with things, because you're performing a lot of divisions with each and every index operation just for wraparound bounds checking. If you used a power of two then you could just do a `bitwise and`, and bounds checking would be essentially free. However, the problem isn't just that division is slow in general, but that division on 8-bit values causes weird code to be generated, so I'd look there first. If you really insist on using modulo division to bounds check and allow oddball board sizes, then I would calculate the reciprocal of your divisor once at setup for each board and use multiplication instead.
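A minimal sketch of the power-of-two idea (the struct and names are illustrative, not taken from the video): when width and height are powers of two, the toroidal wraparound becomes a mask, so indexing needs no division or modulo at all, and masking the two's-complement bits also handles negative coordinates correctly.
#include <cstdint>

struct MaskedBoard {
    std::uint32_t width;   // must be a power of two
    std::uint32_t height;  // must be a power of two

    std::uint32_t index_of(std::int64_t x, std::int64_t y) const {
        const std::uint32_t wx = static_cast<std::uint32_t>(x) & (width - 1);
        const std::uint32_t wy = static_cast<std::uint32_t>(y) & (height - 1);
        return wy * width + wx;
    }
};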
x86_64 CPUs are optimized for 64-bit operations. Even though 64-bit operations use more bits, the CPU might have more transistors per bit dedicated to them. In 64-bit mode, 16-bit operations require an operand-size override prefix. Using a partial register limits what the compiler can do with the remainder of the register, and storing multiple variables in one register can make it harder for the CPU to guess how to pipeline instructions, so using partial registers sometimes doesn't gain anything.
To me it looks like an alignment problem. I just checked the run_board() version, and it generates so many more instructions. At -O2 the min_int_t version is closer to the int64_t one at -O3. Aligning the members of Point to 8 bytes improves things a bit, but not completely. I have no experience with x86, but from my ARM Cortex-M7 days, misaligned access is really inefficient.
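Roughly the shape of that experiment (the real Point layout from the video isn't reproduced here, and the sizes noted in the comments assume a typical ABI):
#include <cstdint>

struct PointPacked {
    std::int8_t x;
    std::int8_t y;
};
static_assert(sizeof(PointPacked) == 2, "two packed bytes, no padding");

struct PointAligned {
    alignas(8) std::int8_t x; // each member gets its own 8-byte-aligned slot
    alignas(8) std::int8_t y;
};
static_assert(sizeof(PointAligned) == 16, "padded out to two 8-byte slots");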
I had a similar experience recently with gcc. I changed a 16-bit integer type to 32-bit and gained a 2x performance improvement immediately. I didn't think too much of it because it was among a series of refactors. I assumed that it was due to some weird optimization issue with integer promotion or something.
I wonder if -march would have a different impact on ARM/RISC-V machines, since IMHO those tend to be a little more diverse in their features, while x86 is very much a fixed arch without that many feature differences among CPUs (also, I might be wrong, I am just wondering).
On x86, -march can be used to automatically enable SSE, AVX, AVX512, etc., which is great for highly parallelizable stuff but not as useful for most sequential things.
It all depends on which optional extensions could actually benefit the task versus generic one-size-fits-all instructions (and maybe, to a much lesser extent, the relative quality of the hardware implementations on a per-module basis, i.e. if the scalar ALUs are amazing but the SIMD unit is a terrible kludge).
So there could be a major benefit for one particular combination of program and hardware while another sees insignificant changes.
Working with bytes on x86 has issues. A byte register overlaps a 64-bit register, but modifying the byte does NOT modify the upper (64-8=) 56 bits. That causes stalls in the CPU pipeline, because the CPU must wait for the result of the byte operation before it can do anything with the full register (this is called a partial register stall). 32-bit operations, on the other hand, zero out the upper 32 bits of the 64-bit register, which means they are just as fast as 64-bit operations. This matters if you do multiplications and such, where an 8-bit x 8-bit multiply gets promoted to 16-bit (another no-no). The compiler knows this and injects code to avoid having to do 8-bit operations, hence the lookup tables. If you get your indices to 32-bit you will get (slightly) faster code than the 64-bit version. Obviously all the extra code trashes your cache as well.
You get the bulk of the performance back just by switching GameBoard::width and GameBoard::height to size_t, though why the presumed pessimisation occurs is beyond me. That takes perf on my machine from 40s to 23s. Switching the storage type to int64_t takes it down to 15s, but absent a gcc comparison I don't know how fair that is.
Not sure if you've done a video on this subject yet; if not, it might be a good time to do one on -Rpass=.*, -Rpass-missed=.*, etc. (for Clang) and -fopt-info etc. for GCC. You can use these flags in Godbolt and get underlined hints in the source, etc. 10000 for Width and Height fits into an int16 and is being vectorised.
I didn't see a commit with the type hardcoded as int8_t or int32_t. I am pretty sure that clang is just confused with the constexpr min() function and misses the point that the type is an integer, or something like that.
excellent question. I've reached that conclusion, but it's really based on outdated information in my case. GCC works so why change it...
My immediate thought (now that I've actually watched to the end) is that the processor really prefers naturally aligned data: you pay a not-insignificant premium for non-aligned or byte-aligned data. That said, you did mention keeping as much data in cache as possible (question: are you actually achieving the desired caching?), and I don't know that the unaligned access penalty is paid, or at least costs as much, with data already in cache. I'd like to know what happens with (aligned 32-bit) quantities, since that's the natural integer size for contemporary Intel architectures. I would expect the same, if not (slightly, maybe un-measurably) greater, perf gains.
Top of my initial-suspects list: whether clang decided/managed to prove there's no aliasing going on, given that the index references are to some one-byte, hence `char`-like, type and so have to be presumed to alias everything.
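A speculative illustration of that mechanism (not the video's code): int8_t is normally a signed char, and char-family stores may alias anything, so the compiler has to assume the loop bound might change inside the loop; with a wider non-char element type it can prove otherwise.
#include <cstddef>
#include <cstdint>

void fill8(std::int8_t* cells, const std::uint32_t* len) {
    for (std::uint32_t i = 0; i < *len; ++i)
        cells[i] = 1; // store may alias *len, so *len is conservatively re-read
}

void fill64(std::int64_t* cells, const std::uint32_t* len) {
    for (std::uint32_t i = 0; i < *len; ++i)
        cells[i] = 1; // int64_t stores cannot alias a uint32_t, so *len can be hoisted
}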
Your code where you try to pick the best index size is an example of a coder trying to optimise code instead of leaving it to the compiler. It's been decades since we had to do this. Trust your compiler and stop trying to do its work. As for why the small index was slower: it's the "read claim" engines in the cache circuits. I dug deep into this subject twenty years ago for a console. The tl;dr is that packing data can be very bad for cache.
The problem is that GCC compilation is like 3 times slower, even if it's a lot faster than many other compilers like MSVC in terms of performance.
That sounds like, for some reason, they made some optimizations for 64-bit types assuming they would be used a lot and didn't do the same for smaller types. Could be related to using them as indexes.
Whaaaa? I get homework as a Christmas present. Best. Christmas. Ever.
#include <optional>

std::optional<int> fn(int x, bool b)
{
    std::optional<int> res; // empty by default
    if (b) {
        res.emplace(x);
    }
    return res;
}
Now compile with gcc and clang using -O3, and see the crap gcc generates.
This is what "zero cost abstractions" looks like in real life, haha =))
Let's see:
Rax is 64
Eax is 32
Ax is 16
But: Ax also splits into Ah (high byte) and Al (low byte)
Awesome video! Please make more videos about this topic!
Given that this could be a bug and you're using Clang 18.1.0, why don't you try with a newer version? Right now you can use 19.1.6 and Godbolt has 19.1.0 available.
Perhaps use a profiler to know where the hotspots are for further study and comparison.
I have noticed that clang-18 is faster than clang-19
If -O3 bloats up binary size as the conference talk you quoted claims, then obviously it does not matter for tiny programs like yours, so this is not disproving anything.
Am I the only one who wants to see ICC in the comparison? There's not much data about it.
When you wrote "GCC > clang" it seems like it should be the other way around, since the bigger number is the time.
Just assume the ">" means "better" to solve the confusion.
Is this release mode? I assume so.
Not at a real computer now so can’t look, but my thought is to keep turning off optimizations (because -O0 is not literally zero any more than -Wall is all) and see if one of those is the culprit
Too bad every optimization nowadays requires an entirely new compiler or an entirely new language. It's all about driving engagement.
Compilers are magic lmao
It is absolutely not. GCC is far behind Clang/LLVM in key areas (sometimes it is even behind .NET's ILC/RyuJIT).
You're trying to be smart with type sizing.
because it doesn't ship with clang-format.
Ever heard of formatting in tables? Why not use a common base (seconds) to 2 decimal places? The table is unreadable and one can hardly tell what you are describing.
Mostly, only the headers were aligned wrong.
It's probably not that int64_t is particularly fast. More likely that there is something about the implementation of that 8-bit type that is clogging the works.
Especially considering that you were using a fairly new consumer-grade GPU for acceleration, most of which generally don't have native 64b implemented in hardware, so the emulated 64b is 1/10th as fast as 32b rather than ½ as fast.
(To get GPU hardware double-precision, you either need an nvidia card with xx100 GPU chip, which are like $20k used. Or some of the ~10 year old AMD FirePro workstation cards, some of which had very janky compute drivers, esp GCN1 and 1.1, so good luck.)
No worries, AI will fix all bugs in the future, especially these kind of bugs.