Fastware - Andrei Alexandrescu

Поділитися
Вставка
  • Опубліковано 26 сер 2024

КОМЕНТАРІ • 29

  • @dedvzer
    @dedvzer 2 роки тому +3

    I like how he expands most of these topics into complete talks for the next 10 years (this is 2008). I've seen probably most of them, but they deserve some time. And here he is, ready to do it all in one hour.

    • @simonfarre4907
      @simonfarre4907 11 місяців тому +3

      The talk was not from 2008. It was from 2017.

  • @TheOnlyAndreySotnikov
    @TheOnlyAndreySotnikov 3 роки тому +11

    I tried his version of improved atoi. It indeed is faster than the standard library atoi. However, manually implemented atoi with "bad" dependency "result = result * 10 + (*b - '0')" showed the same performance as the "improved" version . IMHO it makes sense, since "result += ..." presents the same dependency on the previous result as "result = result * 10 + (*b - '0')".
    In clang the results are even more stunning. His "bad" version of atoi is in fact *2 times faster* than the trick with precalculated powers of 10, because when clang sees "result = result * 10 + (*b - '0')", it uses two LEA instruction instead of multiply: leaq (%rcx,%rcx,4), %rcx, leaq (%rsi,%rcx,2), %rcx. The first one performs multiplication by 5, the second combines multiply by 2 and addition.

    • @tedrivera8051
      @tedrivera8051 2 роки тому +5

      He did mention that some of the advice might "not age very well with improvements in compiler and CPU technology" (3:23). Clang has done a lot in that space, but it's good to know which advice hasn't aged well. Thanks for that!

    • @MrAbrazildo
      @MrAbrazildo 8 місяців тому

      54:34, I recently discovered that unrolling a loop of 'sscanf's to (I don't remember well) 50 numbers made it more than 3x faster, for both Clang and GCC. I thought they do that by default in -O2/3.

  • @fritzschnitzmueller3768
    @fritzschnitzmueller3768 3 роки тому +2

    I love how he lets the audience choose the topic of the talk on the fly :D Who else does that :D

  • @Voltra_
    @Voltra_ 2 роки тому +1

    I honestly didn't know how to implement strtr. So as a challenge when listening to this, I wrote it as an infinite loop. It's much easier to think about and I think I got a pretty optimized linear version

  • @MrAbrazildo
    @MrAbrazildo 8 місяців тому

    50:12, I would expect isdigit to be faster, because it would only need to check some bits.

  • @victornoagbodji
    @victornoagbodji 7 років тому +2

    amazing talk!

  • @xvoidee
    @xvoidee 4 роки тому +2

    Now I have big doubts regarding my optimization skills :-)

  • @ace100hyper3
    @ace100hyper3 2 роки тому

    I like Andrei, but I have to disagree that optimal way to implement StrStr is by starting with an infinite loop. I just implemented the algorithm in C# and I used the loop condition regularly. I would really like to see Andrei's implementation on this, because it may be that he is doing extra comparisons in the loop body.

  • @chriswatch582
    @chriswatch582 6 років тому +5

    On Visual Studio 2015 using Microsoft 7 Professional with 2 Cores, I ran a test version of the non-parallel, parallel computation and unrolling, but the non-parallel routine was always faster. I even tried putting the large array and string length computation in the routine but it was still slower than the non-parallel computation. Relying on parallelism was slightly slower than unrolling in the last 2 tests.
    I developed 2 programs. I included the first program below and the output from the 2nd program!
    Output from 1st program:
    TEST 1: Convert string to number: 1234.
    Not Use Parallelism Times: min = 0 max = 5307 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234
    Use Parallelism Times: min = 0 max = 6310 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234
    Use Unrolling Times: min = 0 max = 6621 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234
    TEST 2: Convert string to number: 12345678.
    Not Use Parallelism Times: min = 1 max = 9573 Min Iter = 10000 Max Iter = 100000000
    RESULT: 12345678
    Use Parallelism Times: min = 1 max = 10227 Min Iter = 10000 Max Iter = 100000000
    RESULT: 12345678
    Use Unrolling Times: min = 1 max = 9799 Min Iter = 10000 Max Iter = 100000000
    RESULT: 12345678
    TEST 3: Convert string to number: 1234567890123456.
    Not Use Parallelism Times: min = 1 max = 12625 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234567890123456
    Use Parallelism Times: min = 2 max = 17928 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234567890123456
    Use Unrolling Times: min = 2 max = 17227 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234567890123456
    ------------------------------------------------------------------------------------------------------------------------------
    I quickly modified the first version of the below program. In this 2nd program for the unrolling test, I replaced the "for" loop with an "if" statement. The "if" statement checked if there were "4", "8", or "16" digits in the test string. Then for each active option, I unrolled 4 times, 8 times, or 16 times. The resulting output from the program was that the routine based on unparllelism was always fastest. The second fastest was always unrolling. The routine based on parallelism was always the slowest routine.
    ** Basically, all the interesting things Andrei said about parallelism (and unrolling) does not work with the program that I developed on Visual Studio 2015. I believe all those fine and eloquent things should work elsewhere though!
    ------------------------------------------------------------------------------------------------------------------------------
    Output from 2nd program:
    TEST 1: Convert string to number: 1234.
    Not Use Parallelism Times: min = 0 max = 3520 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234
    Use Parallelism Times: min = 0 max = 5156 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234
    Use Unrolling Times: min = 0 max = 4933 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234
    TEST 2: Convert string to number: 12345678.
    Not Use Parallelism Times: min = 0 max = 5161 Min Iter = 10000 Max Iter = 100000000
    RESULT: 12345678
    Use Parallelism Times: min = 1 max = 8520 Min Iter = 10000 Max Iter = 100000000
    RESULT: 12345678
    Use Unrolling Times: min = 1 max = 8295 Min Iter = 10000 Max Iter = 100000000
    RESULT: 12345678
    TEST 3: Convert string to number: 1234567890123456.
    Not Use Parallelism Times: min = 1 max = 9065 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234567890123456
    Use Parallelism Times: min = 1 max = 15273 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234567890123456
    Use Unrolling Times: min = 2 max = 15128 Min Iter = 10000 Max Iter = 100000000
    RESULT: 1234567890123456
    ------------------------------------------------------------------------------------------------------------------------------
    #include "stdafx.h" //USED BY VISUAL STUDIO 2015
    #include
    #include
    using namespace std;
    // 2 ROUTINES THAT CREATE AN INTEGER FROM A STRING REPRESENTATION OF THAT NUMBER.
    //THE SLOWEST IS SUPPOSE TO RELY ON NON-PARALLELISM.
    //ROUTINE FROM TALK: slide 59
    unsigned long long atoui_notuse_parallelism(const char* b, const char* e)
    {
    unsigned long long result = 0;
    for (; *b != *e; ++b)
    {
    result = result * 10 + (*b - '0');
    }
    return result;
    }
    static const long long pow10[21] = {
    1UL,
    10UL,
    100UL,
    1000UL,
    10000UL,
    100000UL,
    1000000UL,
    10000000UL,
    100000000UL,
    1000000000UL,
    10000000000UL,
    100000000000UL,
    1000000000000UL,
    10000000000000UL,
    100000000000000UL,
    1000000000000000UL,
    10000000000000000UL,
    100000000000000000UL,
    1000000000000000000UL,
    10000000000000000000UL
    };
    //IDEA FOR ROUTINE CAME FROM :slide 63
    unsigned long long atoui_use_parallelism(const char* b, const char* e, short i)
    {
    unsigned long long result = 0;
    for (; *b != *e; ++b)
    {
    result += pow10[i--] * (*b - '0');
    }
    return result;
    }
    //IDEA FOR ROUTINE CAME FROM UNROLLING SLIDE: slide
    unsigned long long atoui_use_unrolling (const char* b, const char* e, short i)
    {
    unsigned long long result = 0;
    for (; *b != *e;)
    {
    result += pow10[i--] * (*b - '0'); b++;
    result += pow10[i--] * (*b - '0'); b++;
    result += pow10[i--] * (*b - '0'); b++;
    result += pow10[i--] * (*b - '0'); b++;
    }
    return result;
    }
    int main(int argc, int** argv[])
    {
    int elapsed_Time;
    int max, min;
    int max_num_iterations, min_num_iterations;
    long long result;
    std::clock_t start; //Holds starting time
    __int64 test_Number;
    char test[30]; //3 TEST VALUES
    char test2[9] = "1234.";
    char test3[11] = "12345678.";
    char test4[18] = "1234567890123456.";
    const char* b;
    const char* e = ".";
    int strlength;
    for (auto j = 1; j

    • @nivo6379
      @nivo6379 3 роки тому +1

      For benchmarking, you should probably use Clang or GCC.

    • @NoNameAtAll2
      @NoNameAtAll2 3 роки тому

      You use std::clock, but you need to know that it isn't representing real time, but cpu-usage time
      Use high_resolution_clock or steady_clock

  • @abc3631
    @abc3631 4 роки тому +5

    What a deadpan audience

  • @yuehuajia3176
    @yuehuajia3176 6 років тому +2

    Great Talk. I learned a lot from you.

  • @leipzig3
    @leipzig3 3 роки тому

    Certain cicadas come out every 17 years. If you need to know if this year is a cicada year, return (year %17 + offset)

  • @eduardomadrid2380
    @eduardomadrid2380 7 років тому +5

    Another good presentation by Alexandrescu, but watch out!, "digits10" is actually a pessimization, as benchmarked in my x86-64 computer, here:
    github.com/thecppzoo/inprogress/blob/master/src/main.cpp
    The reason is that it has too many unpredictable microbranches for reasonable inputs, this is "sloware". Eventually I will publish a more detailed analysis, for the time being, you may use my implementation of digits10 which is almost twice as fast for sequential inputs and about five times faster for random inputs

    • @FoxDr
      @FoxDr 3 роки тому +1

      Thanks for the bench. The whole hour, I've been thinking "why not just use `ceil(log10(abs(a)))`?".
      Sure, float ops up the wazoo, might not be worth it for small numbers, but no looping required at all. I'll add it to your benches when I find some time :)

  • @bool2max
    @bool2max Рік тому

    why does he call integers "integrals"

  • @FEE1DEAD
    @FEE1DEAD 3 роки тому +1

    Interesting talk. But that audience....

  • @typon1
    @typon1 5 років тому +13

    Did they find a bunch of zombies to attend this talk or are the English completely humor less?

    • @Vitorian
      @Vitorian 4 роки тому +1

      Aren't those the same?

    • @GeorgeTsiros
      @GeorgeTsiros 3 роки тому +2

      he didn't do _the hands_
      on a more serious note, sometimes i watch andrei for serious... sometimes not so much (example, the allocators video, that one was a kneeslapper)

  • @GeorgeTsiros
    @GeorgeTsiros 5 років тому

    37:15 wtf we have them on our WRISTS these days!

  • @TheDvrx
    @TheDvrx 3 роки тому

    So pipelined cpu's are secretly vector processors.

  • @gbizzotto
    @gbizzotto 5 років тому

    I bet this abstract was writen by andrei

  • @GeorgeTsiros
    @GeorgeTsiros 5 років тому

    42:44 FORTH! RPL! PS!