I like how he expands most of these topics into complete talks for the next 10 years (this is 2008). I've seen probably most of them, but they deserve some time. And here he is, ready to do it all in one hour.
I tried his version of improved atoi. It indeed is faster than the standard library atoi. However, manually implemented atoi with "bad" dependency "result = result * 10 + (*b - '0')" showed the same performance as the "improved" version . IMHO it makes sense, since "result += ..." presents the same dependency on the previous result as "result = result * 10 + (*b - '0')". In clang the results are even more stunning. His "bad" version of atoi is in fact *2 times faster* than the trick with precalculated powers of 10, because when clang sees "result = result * 10 + (*b - '0')", it uses two LEA instruction instead of multiply: leaq (%rcx,%rcx,4), %rcx, leaq (%rsi,%rcx,2), %rcx. The first one performs multiplication by 5, the second combines multiply by 2 and addition.
He did mention that some of the advice might "not age very well with improvements in compiler and CPU technology" (3:23). Clang has done a lot in that space, but it's good to know which advice hasn't aged well. Thanks for that!
54:34, I recently discovered that unrolling a loop of 'sscanf's to (I don't remember well) 50 numbers made it more than 3x faster, for both Clang and GCC. I thought they do that by default in -O2/3.
I honestly didn't know how to implement strtr. So as a challenge when listening to this, I wrote it as an infinite loop. It's much easier to think about and I think I got a pretty optimized linear version
I like Andrei, but I have to disagree that optimal way to implement StrStr is by starting with an infinite loop. I just implemented the algorithm in C# and I used the loop condition regularly. I would really like to see Andrei's implementation on this, because it may be that he is doing extra comparisons in the loop body.
Another good presentation by Alexandrescu, but watch out!, "digits10" is actually a pessimization, as benchmarked in my x86-64 computer, here: github.com/thecppzoo/inprogress/blob/master/src/main.cpp The reason is that it has too many unpredictable microbranches for reasonable inputs, this is "sloware". Eventually I will publish a more detailed analysis, for the time being, you may use my implementation of digits10 which is almost twice as fast for sequential inputs and about five times faster for random inputs
Thanks for the bench. The whole hour, I've been thinking "why not just use `ceil(log10(abs(a)))`?". Sure, float ops up the wazoo, might not be worth it for small numbers, but no looping required at all. I'll add it to your benches when I find some time :)
On Visual Studio 2015 using Microsoft 7 Professional with 2 Cores, I ran a test version of the non-parallel, parallel computation and unrolling, but the non-parallel routine was always faster. I even tried putting the large array and string length computation in the routine but it was still slower than the non-parallel computation. Relying on parallelism was slightly slower than unrolling in the last 2 tests. I developed 2 programs. I included the first program below and the output from the 2nd program! Output from 1st program: TEST 1: Convert string to number: 1234. Not Use Parallelism Times: min = 0 max = 5307 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234 Use Parallelism Times: min = 0 max = 6310 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234 Use Unrolling Times: min = 0 max = 6621 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234 TEST 2: Convert string to number: 12345678. Not Use Parallelism Times: min = 1 max = 9573 Min Iter = 10000 Max Iter = 100000000 RESULT: 12345678 Use Parallelism Times: min = 1 max = 10227 Min Iter = 10000 Max Iter = 100000000 RESULT: 12345678 Use Unrolling Times: min = 1 max = 9799 Min Iter = 10000 Max Iter = 100000000 RESULT: 12345678 TEST 3: Convert string to number: 1234567890123456. Not Use Parallelism Times: min = 1 max = 12625 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234567890123456 Use Parallelism Times: min = 2 max = 17928 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234567890123456 Use Unrolling Times: min = 2 max = 17227 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234567890123456 ------------------------------------------------------------------------------------------------------------------------------ I quickly modified the first version of the below program. In this 2nd program for the unrolling test, I replaced the "for" loop with an "if" statement. The "if" statement checked if there were "4", "8", or "16" digits in the test string. Then for each active option, I unrolled 4 times, 8 times, or 16 times. The resulting output from the program was that the routine based on unparllelism was always fastest. The second fastest was always unrolling. The routine based on parallelism was always the slowest routine. ** Basically, all the interesting things Andrei said about parallelism (and unrolling) does not work with the program that I developed on Visual Studio 2015. I believe all those fine and eloquent things should work elsewhere though! ------------------------------------------------------------------------------------------------------------------------------ Output from 2nd program: TEST 1: Convert string to number: 1234. Not Use Parallelism Times: min = 0 max = 3520 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234 Use Parallelism Times: min = 0 max = 5156 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234 Use Unrolling Times: min = 0 max = 4933 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234 TEST 2: Convert string to number: 12345678. Not Use Parallelism Times: min = 0 max = 5161 Min Iter = 10000 Max Iter = 100000000 RESULT: 12345678 Use Parallelism Times: min = 1 max = 8520 Min Iter = 10000 Max Iter = 100000000 RESULT: 12345678 Use Unrolling Times: min = 1 max = 8295 Min Iter = 10000 Max Iter = 100000000 RESULT: 12345678 TEST 3: Convert string to number: 1234567890123456. Not Use Parallelism Times: min = 1 max = 9065 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234567890123456 Use Parallelism Times: min = 1 max = 15273 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234567890123456 Use Unrolling Times: min = 2 max = 15128 Min Iter = 10000 Max Iter = 100000000 RESULT: 1234567890123456 ------------------------------------------------------------------------------------------------------------------------------ #include "stdafx.h" //USED BY VISUAL STUDIO 2015 #include #include using namespace std; // 2 ROUTINES THAT CREATE AN INTEGER FROM A STRING REPRESENTATION OF THAT NUMBER. //THE SLOWEST IS SUPPOSE TO RELY ON NON-PARALLELISM. //ROUTINE FROM TALK: slide 59 unsigned long long atoui_notuse_parallelism(const char* b, const char* e) { unsigned long long result = 0; for (; *b != *e; ++b) { result = result * 10 + (*b - '0'); } return result; } static const long long pow10[21] = { 1UL, 10UL, 100UL, 1000UL, 10000UL, 100000UL, 1000000UL, 10000000UL, 100000000UL, 1000000000UL, 10000000000UL, 100000000000UL, 1000000000000UL, 10000000000000UL, 100000000000000UL, 1000000000000000UL, 10000000000000000UL, 100000000000000000UL, 1000000000000000000UL, 10000000000000000000UL }; //IDEA FOR ROUTINE CAME FROM :slide 63 unsigned long long atoui_use_parallelism(const char* b, const char* e, short i) { unsigned long long result = 0; for (; *b != *e; ++b) { result += pow10[i--] * (*b - '0'); } return result; } //IDEA FOR ROUTINE CAME FROM UNROLLING SLIDE: slide unsigned long long atoui_use_unrolling (const char* b, const char* e, short i) { unsigned long long result = 0; for (; *b != *e;) { result += pow10[i--] * (*b - '0'); b++; result += pow10[i--] * (*b - '0'); b++; result += pow10[i--] * (*b - '0'); b++; result += pow10[i--] * (*b - '0'); b++; } return result; } int main(int argc, int** argv[]) { int elapsed_Time; int max, min; int max_num_iterations, min_num_iterations; long long result; std::clock_t start; //Holds starting time __int64 test_Number; char test[30]; //3 TEST VALUES char test2[9] = "1234."; char test3[11] = "12345678."; char test4[18] = "1234567890123456."; const char* b; const char* e = "."; int strlength; for (auto j = 1; j
he didn't do _the hands_ on a more serious note, sometimes i watch andrei for serious... sometimes not so much (example, the allocators video, that one was a kneeslapper)
I like how he expands most of these topics into complete talks for the next 10 years (this is 2008). I've seen probably most of them, but they deserve some time. And here he is, ready to do it all in one hour.
The talk was not from 2008. It was from 2017.
I love how he lets the audience choose the topic of the talk on the fly :D Who else does that :D
I tried his version of improved atoi. It indeed is faster than the standard library atoi. However, manually implemented atoi with "bad" dependency "result = result * 10 + (*b - '0')" showed the same performance as the "improved" version . IMHO it makes sense, since "result += ..." presents the same dependency on the previous result as "result = result * 10 + (*b - '0')".
In clang the results are even more stunning. His "bad" version of atoi is in fact *2 times faster* than the trick with precalculated powers of 10, because when clang sees "result = result * 10 + (*b - '0')", it uses two LEA instruction instead of multiply: leaq (%rcx,%rcx,4), %rcx, leaq (%rsi,%rcx,2), %rcx. The first one performs multiplication by 5, the second combines multiply by 2 and addition.
He did mention that some of the advice might "not age very well with improvements in compiler and CPU technology" (3:23). Clang has done a lot in that space, but it's good to know which advice hasn't aged well. Thanks for that!
54:34, I recently discovered that unrolling a loop of 'sscanf's to (I don't remember well) 50 numbers made it more than 3x faster, for both Clang and GCC. I thought they do that by default in -O2/3.
I honestly didn't know how to implement strtr. So as a challenge when listening to this, I wrote it as an infinite loop. It's much easier to think about and I think I got a pretty optimized linear version
amazing talk!
50:12, I would expect isdigit to be faster, because it would only need to check some bits.
I like Andrei, but I have to disagree that optimal way to implement StrStr is by starting with an infinite loop. I just implemented the algorithm in C# and I used the loop condition regularly. I would really like to see Andrei's implementation on this, because it may be that he is doing extra comparisons in the loop body.
Now I have big doubts regarding my optimization skills :-)
Great Talk. I learned a lot from you.
why does he call integers "integrals"
Another good presentation by Alexandrescu, but watch out!, "digits10" is actually a pessimization, as benchmarked in my x86-64 computer, here:
github.com/thecppzoo/inprogress/blob/master/src/main.cpp
The reason is that it has too many unpredictable microbranches for reasonable inputs, this is "sloware". Eventually I will publish a more detailed analysis, for the time being, you may use my implementation of digits10 which is almost twice as fast for sequential inputs and about five times faster for random inputs
Thanks for the bench. The whole hour, I've been thinking "why not just use `ceil(log10(abs(a)))`?".
Sure, float ops up the wazoo, might not be worth it for small numbers, but no looping required at all. I'll add it to your benches when I find some time :)
Certain cicadas come out every 17 years. If you need to know if this year is a cicada year, return (year %17 + offset)
37:15 wtf we have them on our WRISTS these days!
So pipelined cpu's are secretly vector processors.
On Visual Studio 2015 using Microsoft 7 Professional with 2 Cores, I ran a test version of the non-parallel, parallel computation and unrolling, but the non-parallel routine was always faster. I even tried putting the large array and string length computation in the routine but it was still slower than the non-parallel computation. Relying on parallelism was slightly slower than unrolling in the last 2 tests.
I developed 2 programs. I included the first program below and the output from the 2nd program!
Output from 1st program:
TEST 1: Convert string to number: 1234.
Not Use Parallelism Times: min = 0 max = 5307 Min Iter = 10000 Max Iter = 100000000
RESULT: 1234
Use Parallelism Times: min = 0 max = 6310 Min Iter = 10000 Max Iter = 100000000
RESULT: 1234
Use Unrolling Times: min = 0 max = 6621 Min Iter = 10000 Max Iter = 100000000
RESULT: 1234
TEST 2: Convert string to number: 12345678.
Not Use Parallelism Times: min = 1 max = 9573 Min Iter = 10000 Max Iter = 100000000
RESULT: 12345678
Use Parallelism Times: min = 1 max = 10227 Min Iter = 10000 Max Iter = 100000000
RESULT: 12345678
Use Unrolling Times: min = 1 max = 9799 Min Iter = 10000 Max Iter = 100000000
RESULT: 12345678
TEST 3: Convert string to number: 1234567890123456.
Not Use Parallelism Times: min = 1 max = 12625 Min Iter = 10000 Max Iter = 100000000
RESULT: 1234567890123456
Use Parallelism Times: min = 2 max = 17928 Min Iter = 10000 Max Iter = 100000000
RESULT: 1234567890123456
Use Unrolling Times: min = 2 max = 17227 Min Iter = 10000 Max Iter = 100000000
RESULT: 1234567890123456
------------------------------------------------------------------------------------------------------------------------------
I quickly modified the first version of the below program. In this 2nd program for the unrolling test, I replaced the "for" loop with an "if" statement. The "if" statement checked if there were "4", "8", or "16" digits in the test string. Then for each active option, I unrolled 4 times, 8 times, or 16 times. The resulting output from the program was that the routine based on unparllelism was always fastest. The second fastest was always unrolling. The routine based on parallelism was always the slowest routine.
** Basically, all the interesting things Andrei said about parallelism (and unrolling) does not work with the program that I developed on Visual Studio 2015. I believe all those fine and eloquent things should work elsewhere though!
------------------------------------------------------------------------------------------------------------------------------
Output from 2nd program:
TEST 1: Convert string to number: 1234.
Not Use Parallelism Times: min = 0 max = 3520 Min Iter = 10000 Max Iter = 100000000
RESULT: 1234
Use Parallelism Times: min = 0 max = 5156 Min Iter = 10000 Max Iter = 100000000
RESULT: 1234
Use Unrolling Times: min = 0 max = 4933 Min Iter = 10000 Max Iter = 100000000
RESULT: 1234
TEST 2: Convert string to number: 12345678.
Not Use Parallelism Times: min = 0 max = 5161 Min Iter = 10000 Max Iter = 100000000
RESULT: 12345678
Use Parallelism Times: min = 1 max = 8520 Min Iter = 10000 Max Iter = 100000000
RESULT: 12345678
Use Unrolling Times: min = 1 max = 8295 Min Iter = 10000 Max Iter = 100000000
RESULT: 12345678
TEST 3: Convert string to number: 1234567890123456.
Not Use Parallelism Times: min = 1 max = 9065 Min Iter = 10000 Max Iter = 100000000
RESULT: 1234567890123456
Use Parallelism Times: min = 1 max = 15273 Min Iter = 10000 Max Iter = 100000000
RESULT: 1234567890123456
Use Unrolling Times: min = 2 max = 15128 Min Iter = 10000 Max Iter = 100000000
RESULT: 1234567890123456
------------------------------------------------------------------------------------------------------------------------------
#include "stdafx.h" //USED BY VISUAL STUDIO 2015
#include
#include
using namespace std;
// 2 ROUTINES THAT CREATE AN INTEGER FROM A STRING REPRESENTATION OF THAT NUMBER.
//THE SLOWEST IS SUPPOSE TO RELY ON NON-PARALLELISM.
//ROUTINE FROM TALK: slide 59
unsigned long long atoui_notuse_parallelism(const char* b, const char* e)
{
unsigned long long result = 0;
for (; *b != *e; ++b)
{
result = result * 10 + (*b - '0');
}
return result;
}
static const long long pow10[21] = {
1UL,
10UL,
100UL,
1000UL,
10000UL,
100000UL,
1000000UL,
10000000UL,
100000000UL,
1000000000UL,
10000000000UL,
100000000000UL,
1000000000000UL,
10000000000000UL,
100000000000000UL,
1000000000000000UL,
10000000000000000UL,
100000000000000000UL,
1000000000000000000UL,
10000000000000000000UL
};
//IDEA FOR ROUTINE CAME FROM :slide 63
unsigned long long atoui_use_parallelism(const char* b, const char* e, short i)
{
unsigned long long result = 0;
for (; *b != *e; ++b)
{
result += pow10[i--] * (*b - '0');
}
return result;
}
//IDEA FOR ROUTINE CAME FROM UNROLLING SLIDE: slide
unsigned long long atoui_use_unrolling (const char* b, const char* e, short i)
{
unsigned long long result = 0;
for (; *b != *e;)
{
result += pow10[i--] * (*b - '0'); b++;
result += pow10[i--] * (*b - '0'); b++;
result += pow10[i--] * (*b - '0'); b++;
result += pow10[i--] * (*b - '0'); b++;
}
return result;
}
int main(int argc, int** argv[])
{
int elapsed_Time;
int max, min;
int max_num_iterations, min_num_iterations;
long long result;
std::clock_t start; //Holds starting time
__int64 test_Number;
char test[30]; //3 TEST VALUES
char test2[9] = "1234.";
char test3[11] = "12345678.";
char test4[18] = "1234567890123456.";
const char* b;
const char* e = ".";
int strlength;
for (auto j = 1; j
For benchmarking, you should probably use Clang or GCC.
You use std::clock, but you need to know that it isn't representing real time, but cpu-usage time
Use high_resolution_clock or steady_clock
What a deadpan audience
I bet this abstract was writen by andrei
Did they find a bunch of zombies to attend this talk or are the English completely humor less?
Aren't those the same?
he didn't do _the hands_
on a more serious note, sometimes i watch andrei for serious... sometimes not so much (example, the allocators video, that one was a kneeslapper)
Interesting talk. But that audience....
42:44 FORTH! RPL! PS!