Sorting Algorithms: Speed Is Found In The Minds of People - Andrei Alexandrescu - CppCon 2019

  • Published 26 Aug 2024

COMMENTS • 87

  • @oleksiimomot4040
    @oleksiimomot4040 4 years ago +88

    Amazing talk with important conclusions and funny jokes from Andrei. Thanks a lot!

  • @fritsgerms3565
    @fritsgerms3565 2 years ago +18

    Andrei is one of the original standard-bearers of C++. I always enjoy his talks. He is creative and unique in his thinking. I wish CppCon would become smaller and raise its quality back to its golden age around 2014.

  • @movax20h
    @movax20h 4 years ago +24

    Excellent talk. There are so many unsolved problems in sorting (especially adaptive sorts, parallel and distributed sorting, in-place merge-style sorting, etc.). People often overlook the actual complexity (both theoretical and practical) of sorting. Many years ago I had some interesting sorting requirements, and I found out that despite 60 years of research and a lot of progress, to this day there is no known optimal solution, nor a scheme that works well on average.

  • @CedricMialaret
    @CedricMialaret 4 years ago +26

    Must-watch presentation, as always with A.A.

  • @MichaelPohoreski
    @MichaelPohoreski 4 years ago +26

    _Design by Introspection_ is a subset of **Data-Oriented Design.**
    There are 3 types of optimizations (from slowest to fastest):
    1. Bit twiddling
    2. Algorithmic
    3. Data-oriented (cache usage and branch prediction)
    The FIRST rule of optimization is:
    **KNOW THY DATA**
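
    (A minimal illustrative sketch of the "know thy data" point, not from the talk: the same loop over an array-of-structs versus a struct-of-arrays. The Particle layout here is a hypothetical example.)

    ```cpp
    #include <vector>

    // Array-of-structs: summing x drags the five unused fields of each
    // Particle through the cache alongside the one field we want.
    struct Particle { double x, y, z, vx, vy, vz; };

    double sum_x_aos(const std::vector<Particle>& ps) {
        double s = 0;
        for (const auto& p : ps) s += p.x;  // strided access, wasted cache bytes
        return s;
    }

    // Struct-of-arrays: the x values are contiguous, so the same loop
    // streams through full cache lines with nothing wasted.
    struct Particles { std::vector<double> x, y, z, vx, vy, vz; };

    double sum_x_soa(const Particles& ps) {
        double s = 0;
        for (double v : ps.x) s += v;  // sequential, prefetch-friendly
        return s;
    }
    ```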

    • @kuhluhOG
      @kuhluhOG 4 years ago +1

      @Ziggi Mon data-oriented design means that you write your algorithms for your data, and don't abstract your data into something general; you stay with the specific.
      It comes down to: use different algorithms for different data, don't generalize (at least that's how I understand it; I don't use it myself, though).
      The first time I saw something about this topic was from a (professional) video game developer who also said: "If we had the time, we would write everything in assembler, but we don't."

    • @sethsora7963
      @sethsora7963 4 years ago

      Glad to know that someone else recognized this too.

    • @LEGOOOOOOOOOOOS
      @LEGOOOOOOOOOOOS 2 years ago

      Is there a book or talk on data-oriented design?

  • @andrewzmorris
    @andrewzmorris 4 years ago +16

    Fantastic talk. Andrei is such a great speaker, and the content was very interesting. I'm not sure how to actually improve our code in the ways Andrei highlights, but it certainly seems to be something we should be talking about.

  • @evikone
    @evikone 27 days ago

    I have Andrei's book on the D language--an excellent and insightful work.

  • @absurdengineering
    @absurdengineering 4 years ago +71

    I have a problem with this talk: it presumes a lot.
    It presumes branch prediction, caches of certain depth, speculative execution, and who knows what else. Sure: modern desktop CPUs have it, as do the large ARMs. Now take this code and start moving it backwards in history. Say, to Raspberry Pi A. To a P-II. To a 486. To a 386. To AVR. Heck, rewrite it in Turbo Pascal for CP/M. I’ve tried: it all falls apart as you pass ARMv7. And the simpler the architecture, the worse the regression.
    The standard library is supposed to be as universal as possible. It may run on varied architectures. Yet here we micro-optimize in ways that make things much worse as soon as you're outside the realm of the most modern CPU. These improvements make the sort more than 2x slower on a 486, for example. Sure, that's not exactly a common target. But Arduino supports C++20, and if we try this trick there, we get the same performance regressions. So before anyone sticks this code into any standard library, they had better examine the supported platforms and make it optional, enabled only on modern architectures.
    I also have a problem with sweeping statements like "they don't teach it in books". CS introductory texts all presume an uncached, in-order architecture where every memory access and comparison has equal cost. We all know this, even if the authors fail miserably at putting this assumption in big bold letters at the start of every chapter. But the standard analysis of algorithms doesn't somehow fail here: as soon as you make the costs account for caching and branch prediction, you get somewhat more complex equations, but the computed costs once again approximate the costs of running on real hardware. So, given that we all understand what assumptions the classic CS texts make, it's disingenuous to say "they don't teach it in the books". They don't because, even with the simplified logical architecture the texts presume, it's still hard enough. And there are plenty of theoretical algorithm treatments that do in fact take architectural details into account.
    Also, it's not as if the CS classics are completely detached from modernity. There's this thing that was all the rage back when core memory was too small to fit all the business data. Remember tape sorting? It takes excellent advantage of memory caching and prefetch, and you start seeing these advantages as soon as you "upgrade" from a 386 to a 486.
    Also, this general admonition against branches and in favor of data-integrated condition tests: all is fine and dandy if you don't introduce data dependencies. They weren't a thing back on a 486, but these days short-acting data dependencies deparallelize the CPU, and the worse they get, the more the billion-transistor CPU starts to work like a fast but "dumb" 486 with 1000x fewer transistors…
    This talk is a wonderful show and it is full of technical insights, but it is high-grade engineering entertainment, not an engineering project report. Don't try to apply these things without a clear understanding of what's involved but unsaid, and for crying out loud: repeat all the benchmarks in your target application, with your target data.

    • @Sogartar
      @Sogartar 3 years ago +4

      I think his point was exactly that. If you can introspect the relevant things, like the type of data, the hardware, etc., you can compile an algorithm more adapted to the task.

    • @KafeinBE
      @KafeinBE 1 year ago

      Of course the talk should probably have mentioned this caveat, but I don't agree that the standard library should optimize for any CPUs other than the most modern. The most modern are going to be the fastest, and that's naturally where speed and efficiency matter the most.
      Additionally, caches, pipelining, and speculative execution have been a feature of architecture design for a while now, and a pretty pervasive one. Chances are, if you're not coding bare-metal, you have to take these things into account. Sure, a very small IoT device won't get the best sort, but chances are the programmer ought to roll their own sort there anyway.

  • @JordiMontesSanabria
    @JordiMontesSanabria 4 years ago +21

    25:50 "I am almost sorry there is no rotation there"
    LOL

    • @MalcolmParsons
      @MalcolmParsons 4 years ago +1

      There's a rotate() in insertion_sort().
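
      (For reference, a standalone sketch of how rotate shows up in insertion sort; this version uses std::upper_bound to find the insertion point, and is illustrative rather than the slide code verbatim.)

      ```cpp
      #include <algorithm>
      #include <iterator>
      #include <vector>

      // Insertion sort expressed with std::rotate: each element is rotated
      // into its place within the already-sorted prefix [first, i).
      template <typename It>
      void insertion_sort(It first, It last) {
          for (It i = first; i != last; ++i)
              std::rotate(std::upper_bound(first, i, *i), i, std::next(i));
      }

      int main() {
          std::vector<int> v{3, 1, 4, 1, 5};
          insertion_sort(v.begin(), v.end());  // v == {1, 1, 3, 4, 5}
      }
      ```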

  • @dengan699
    @dengan699 4 years ago +7

    Best C++ speaker.
    And I'm not even a C++ developer!

  • @6754bettkitty
    @6754bettkitty 3 years ago +5

    42:20 That goto comment is one of the funniest things I have ever seen in programming!

  • @greenfloatingtoad
    @greenfloatingtoad 4 years ago +4

    Selecting an algorithm through specifics about types is, I believe, compatible with (or part of) what Stepanov envisioned. The difference between a Concept and an algebraic object is that the operations of a Concept come with a cost.

  • @ruroruro
    @ruroruro 4 years ago +15

    I think it may be interesting to try using a different intermediate structure than the heap. If I understood correctly, the key reason why constructing a heap improves performance is that a heap is in some sense already partially sorted, so the linear searches end up being shorter or more uniform, improving locality and maybe even branch prediction.
    There may exist some data structure that is simpler to construct than the heap (in terms of big-O or in terms of real performance) but has the same or a similar effect on the following linear insertion sort.
    After all, as Alexandrescu himself said, "the heap is a delicate fine-tuned structure", so constructing a heap may actually be overkill. We are not using all the invariants the heap provides, so maybe we can get away with some structure with weaker invariants.

    • @michaelryabko885
      @michaelryabko885 4 years ago +2

      It would be interesting to see performance with different types of heaps, even. Here A.A. only tried the basic 2-child heap (which makes sense, as it is a lot easier to reason about), but using something like a Fibonacci heap might have interesting results. Although that reasoning goes harder in the direction of strong invariants and may lead to worse performance overall.

    • @arirahikkala
      @arirahikkala 4 years ago +3

      @@michaelryabko885 Keep in mind that these heaps are still all under a size threshold that went up to, what, 255 elements at most. And that your heap has to live in the array that's getting sorted.

    • @yash1152
      @yash1152 1 year ago

      related question at 1:25:53

  • @chidude16
    @chidude16 1 year ago +1

    That was fantastic! Brilliant. Pair that with Carbon and we're rocking....

  • @oscarasterkrans9301
    @oscarasterkrans9301 4 years ago +3

    Fantastic talk as always from Andrei.

  • @03Prashanth
    @03Prashanth 3 years ago +7

    22:50 That comment on the undergrads in India was hilarious. Quite true, but still... hahahaha

  • @srh80
    @srh80 1 year ago

    The guy from Somebody Feed Phil moonlights as a C++ guru. Amazing talk.

  • @bluecircle56
    @bluecircle56 3 years ago +3

    Andrei Alexandrescu talk, wheee!...

  • @maximilian19931
    @maximilian19931 1 year ago

    ASM instructions have cycle costs, mostly defined in the CPU documentation booklet (usually a table of how many cycles each instruction takes). It would be useful to use these at compile time to produce code that does the job with the cheapest instruction sequence.
    The architecture definition for a compile target should include the per-instruction cycle costs so the compiler can choose the least costly route available.

  • @saaah707
    @saaah707 2 months ago

    56:00 IIRC, a reverse-sorted array is the worst case for insertion sort. Why is the performance better than for the sorted case (55:40)?

  • @CT7ALW
    @CT7ALW 3 years ago +6

    All that "boolean arithmetic instead of ifs" goes a bit against "you're not smarter than the compiler", doesn't it? It's almost like that division example where you try to be clever with bit shifts and the compiler puts the division back...
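
    (A minimal sketch of the kind of rewrite in question, assuming nothing beyond the talk's general idea around 22:00: feed the comparison into arithmetic instead of control flow. Whether it helps is exactly the commenter's point; modern compilers often emit cmov/setcc for either form, so measure.)

    ```cpp
    #include <cstddef>

    // Branchy: the comparison drives a conditional jump the CPU must
    // predict; on random data it mispredicts about half the time.
    std::size_t count_less_branchy(const double* a, std::size_t n, double pivot) {
        std::size_t count = 0;
        for (std::size_t i = 0; i < n; ++i)
            if (a[i] < pivot) ++count;
        return count;
    }

    // Branchless: the comparison result (0 or 1) feeds arithmetic,
    // so there is no data-dependent branch to mispredict.
    std::size_t count_less_branchless(const double* a, std::size_t n, double pivot) {
        std::size_t count = 0;
        for (std::size_t i = 0; i < n; ++i)
            count += static_cast<std::size_t>(a[i] < pivot);
        return count;
    }
    ```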

  • @colinmaharaj
    @colinmaharaj 3 years ago +2

    I need to start traveling to meet these giants

  • @sakuraikamoto3699
    @sakuraikamoto3699 4 years ago +3

    Thank you, sir. This is the best talk for learning engineering 👍👍

  • @OMGclueless
    @OMGclueless 4 years ago +9

    How on earth do you write unguarded_insertion_sort for a generic type? It's one thing for std::sort to return elements in an arbitrary order if a programmer overlooks that their less-than predicate doesn't define a total order. It's totally another for the insertion operation to underflow the buffer and start writing to arbitrary locations.
    Actually, I don't even know that you can do it for doubles. At least, his code has a bug: at 29:22, if the element at *begin is NaN, then make_heap will leave it there because greater() is false for all comparisons with that element. Then unguarded_insertion_sort will underflow below that element because all less() comparisons will be false as well. IEEE 754 floating point doesn't define a total order through the default comparison operators, so you can't use them the way Andrei does.

    • @mxBug
      @mxBug 3 years ago

      OK, so let the consumer of the sort function supply a comparison with a total order.
      If you are concerned with how the compiler can ensure the correctness of the comparison, perhaps dependent typing is your thing?
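
      (A sketch of one such consumer-supplied comparator for doubles, assuming we simply want NaNs grouped at the end; C++20's std::strong_order in <compare> offers IEEE totalOrder if you want the full treatment.)

      ```cpp
      #include <algorithm>
      #include <cmath>
      #include <vector>

      // A strict weak ordering over all doubles, NaNs included:
      // every non-NaN value compares less than any NaN, and NaNs are
      // mutually equivalent, so they all end up at the back.
      bool total_less(double a, double b) {
          const bool a_nan = std::isnan(a), b_nan = std::isnan(b);
          if (a_nan || b_nan) return b_nan && !a_nan;
          return a < b;
      }

      void sort_doubles(std::vector<double>& v) {
          std::sort(v.begin(), v.end(), total_less);  // safe even with NaNs present
      }
      ```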

    • @Kromaatikse
      @Kromaatikse 2 years ago +2

      First you compare the element being inserted with the element at the very beginning of the array. If it goes before that, you insert it there and you're done. If it doesn't, then the unguarded insertion run is guaranteed to terminate at or before the first element, *regardless* of whether the elements are totally-ordered or not.
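
      (A sketch of that guard, similar in structure to what standard-library insertion sorts do; the identifiers here are illustrative.)

      ```cpp
      #include <algorithm>
      #include <iterator>
      #include <utility>

      // Insert *cur into the sorted range [first, cur). One explicit check
      // against *first makes the backward scan safe with no per-step bounds
      // test, even if the comparison isn't a total order.
      template <typename It>
      void linear_insert(It first, It cur) {
          auto val = std::move(*cur);
          if (val < *first) {
              // New minimum: shift the whole prefix right, insert at the front.
              std::move_backward(first, cur, std::next(cur));
              *first = std::move(val);
          } else {
              // !(val < *first), so this scan must stop at or before first.
              It hole = cur;
              for (It prev = std::prev(cur); val < *prev; --prev) {
                  *hole = std::move(*prev);
                  hole = prev;
              }
              *hole = std::move(val);
          }
      }
      ```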

  • @Dante3085
    @Dante3085 4 years ago +6

    1:16:58 Thank you

  • @peceed
    @peceed 2 years ago

    For every set of parametric disorder measures there exists an algorithm that is asymptotically optimal for all of them; unfortunately, the proof interleaves Turing machines, so we can expect a slowdown linear in the size of the set. What is surprising is that there exist algorithms that are optimal and practically fast for many different disorder measures.

  • @stephenhowe4107
    @stephenhowe4107 4 years ago +6

    The slide at 26:56 is wrong.
    It should be unguarded_insertion_sort(begin + 1, end);
    because the 2nd smallest element can be at either begin + 1 or begin + 2 and it is still a valid heap.

    • @MikhailMatrosov
      @MikhailMatrosov 4 years ago +2

      Exactly, but he insists it should be +3. Maybe we are missing something?..

    • @jeffdovalovsky7259
      @jeffdovalovsky7259 4 years ago +6

      The slides in the linked GitHub acknowledge that begin+3 was wrong, and correctly use begin+2.
      Stepanov's unguarded_insertion_sort(first, last) accepts a range of unsorted elements, which it inserts as needed into a preceding pre-sorted sub-sequence. begin and begin+1 are already sorted, begin+2 is not, so it can be unguarded_insertion_sort(begin+2, end).

    •  4 years ago +2

      @@MikhailMatrosov, Stephen Howe - in the comments on this talk at Reddit (link in the description of the video) Andrei says that it should be +2, not +3 (because the first two elements are already sorted). The +3 was a mistake in the slides; he has +2 in the actual code.

    • @stephenhowe4107
      @stephenhowe4107 4 years ago +1

      @ : No, that is not the correct explanation. Andrei explained it to me; you're right about the +2.
      The first element is sorted.
      The second and third elements can be out of order, because that is still a valid heap.
      You can have 1, 7, 4 and that is a valid heap, as the root is less than its children, the 2nd and 3rd elements.
      The fact that the 2nd and 3rd elements are out of order is neither here nor there; it is a valid heap.
      But it does not matter, because this is unguarded insertion sort, not insertion sort.
      So the 2nd and 3rd elements get corrected. That is why Andrei has +2 (which works thanks to the unguarded version) rather than +3 (which does not work).
      In correspondence with Andrei, I was insisting on +1 because of this, but he pointed out that this is unguarded insertion sort, not insertion sort, and therefore +2 is the minimum value that is guaranteed to work.
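
      (Putting the thread's conclusion into code: a reconstructed sketch of the corrected combination, an assumed shape rather than the talk's code verbatim. make_heap with greater builds a min-heap, so *begin is the global minimum and serves as the sentinel that makes the unguarded insertion safe.)

      ```cpp
      #include <algorithm>
      #include <functional>
      #include <utility>

      template <typename It>
      void heap_then_unguarded_insertion_sort(It begin, It end) {
          if (end - begin < 3) {  // tiny inputs: just order the pair directly
              if (end - begin == 2 && *(begin + 1) < *begin) std::iter_swap(begin, begin + 1);
              return;
          }
          std::make_heap(begin, end, std::greater<>{});  // min-heap: *begin is the minimum
          // begin and begin+1 are now in order, so insertion starts at begin+2;
          // the scan below needs no bounds test because the sentinel at begin stops it.
          for (It i = begin + 2; i != end; ++i) {
              auto val = std::move(*i);
              It hole = i;
              while (val < *(hole - 1)) {
                  *hole = std::move(*(hole - 1));
                  --hole;
              }
              *hole = std::move(val);
          }
      }
      ```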

  • @hanyanglee9018
    @hanyanglee9018 2 years ago

    17:55 Fifteen seconds earlier I had come up with the idea of branch prediction. When he asked the question, I answered at my screen: "Caching!" I'm definitely in poverty.

  • @joseville
    @joseville 2 years ago +3

    36:12 I went the other way. I thought you'd add an arbitrarily large element (larger than any element in the list) to the list before heapify and then pop after sort.
    39:45 Infinite loops FTW

    • @tyronewilson6463
      @tyronewilson6463 1 year ago

      The arbitrarily large element idea is probably hard with the use of function pointers to define the comparisons.

  • @wessmall7957
    @wessmall7957 4 years ago

    This is earth shattering

  • @antonpetrenko9176
    @antonpetrenko9176 4 years ago +3

    fire man and great talk!

  • @Sebanisu
    @Sebanisu 4 years ago +3

    I wonder if GNU has fixed the issues he found.

  • @monad_tcp
    @monad_tcp 3 years ago

    48:57 I guess there's a way to improve further by exploiting addressing modes.

  • @KafeinBE
    @KafeinBE 1 year ago

    What happens if you repeat the silly idea? Do make_heap twice: once on the entire array as shown, then once again on the array minus the first three elements?

  • @aretha6360
    @aretha6360 4 years ago +3

    I thought infinite loops were a bad thing in C/C++, since the compiler has no easy way of unrolling the loop (it doesn't know when it stops), which would interfere with the compiler's vectorisation ability? Also, in my code, if I want an infinite loop I tend to write while(true), but he uses for(;;). Is there a difference?

    • @lewistherin4096
      @lewistherin4096 4 years ago +1

      I guess there is no difference between for and while, but perhaps check godbolt.org/ and test it yourself with the compiler you are using.

    • @darkengine5931
      @darkengine5931 3 years ago +2

      I'm curious about the first question as well. If the loop unrolling is applied at the IR level, it may make no difference as far as how we write them. But I'm not sure how optimizers typically approach loop unrolling optimizations. For the second question, 'for(;;)' tends to be what's accepted as the standard way to write infinite loops in C++. If you use 'while (true)', some compilers may complain about branching with a compile-time constant for its expression.

    • @6754bettkitty
      @6754bettkitty 3 years ago

      C++ Insights might show a difference, if there is one...

    • @kylek.3689
      @kylek.3689 2 years ago

      Vectorization is undesirable for sorting an arbitrarily sized list of elements, since you can't predict how many elements you will need to sort ahead of time.
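
      (A minimal side-by-side for the question above, with hypothetical done()/step() helpers: both spellings mean the same thing, and mainstream compilers generate identical code for them. for(;;) mainly sidesteps the "conditional expression is constant" warning some older compilers gave for while(true).)

      ```cpp
      bool done();  // hypothetical helpers, declared only for illustration
      void step();

      void loop_for() {
          for (;;) {          // classic C idiom for "loop forever"
              if (done()) break;
              step();
          }
      }

      void loop_while() {
          while (true) {      // same semantics, same codegen
              if (done()) break;
              step();
          }
      }
      ```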

  • @Mage_Chartreux
    @Mage_Chartreux 1 year ago +1

    Was this tested on many different computers? I just watched an excellent talk on how memory layout during runtime can account for plus or minus 40% of the variation in the time it takes to run something. I'm wondering if a lot of this isn't just because the data is getting restructured.

    • @yash1152
      @yash1152 1 year ago

      1:24:35 > _"different intel machines"_

    • @NXTangl
      @NXTangl 1 year ago

      Anything that lays out doubles differently is going to be pretty damn exotic.

  • @Attlanttizz
    @Attlanttizz 4 years ago +9

    33:38 The underscore bonus... 🤣

  • @greenfloatingtoad
    @greenfloatingtoad 4 years ago +1

    Until we have introspection, I think we can fake it with type_traits-style structs of constants.

  • @rkalle66
    @rkalle66 2 years ago

    The performance order of doing silly things is O(n²). Am I right? It's like gambling with 20/20 hindsight. It reminds me of talking about soccer in Germany, where somebody says "the game lasts 90 minutes, or until the referee whistles the end" - the kind of truism that costs you a 5-Mark coin for the piggy bank.

    • @andralex
      @andralex 2 years ago

      It is indeed quadratic, but not a gamble. Note that the quadratic algorithm is always invoked on small to medium input sizes (for larger inputs you'd use quicksort).

  • @ThanhNguyen-rz4tf
    @ThanhNguyen-rz4tf 3 years ago +3

    Note to myself (you should ignore it):
    15:22 -> 16:43 Compares linear-optimistic insertion sort vs binary insertion sort. => A worse comparison count (linear) doesn't mean worse runtime.
    22:00 Try to turn conditions into arithmetic.
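
    (The two insertion-point searches being compared, as a sketch: the binary one does O(log n) comparisons, but each is an unpredictable branch; the linear one does more comparisons, but they are cheap, sequential, and predictable, which is why it can win at small sizes.)

    ```cpp
    #include <algorithm>

    // Linear scan: equivalent to std::upper_bound, one well-predicted
    // comparison per element, perfectly sequential memory access.
    template <typename It, typename T>
    It linear_insertion_point(It first, It last, const T& value) {
        while (first != last && !(value < *first)) ++first;
        return first;
    }

    // Binary search: O(log n) comparisons, but each one is a coin-flip
    // branch on random data, and the probes jump around in memory.
    template <typename It, typename T>
    It binary_insertion_point(It first, It last, const T& value) {
        return std::upper_bound(first, last, value);
    }
    ```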

  • @yash1152
    @yash1152 1 year ago

    16:35 why binary-search insertion sort is bad

  • @Dante3085
    @Dante3085 4 years ago +1

    Yeah, this guy is fun :D

  • @JorgeLuis-ts6qp
    @JorgeLuis-ts6qp 2 years ago +1

    I'll leave 1:03:00 here for future me.

  • @HtS643KyS6555GxQ3edA
    @HtS643KyS6555GxQ3edA 3 years ago +1

    I wonder what the benchmarks would be on the Apple M1 chip?

    • @yash1152
      @yash1152 1 year ago

      > _"am wondering about benchmarks on apple m1 chip"_
      did you check?

  • @yash1152
    @yash1152 1 year ago

    1:04:35 what's hot/cold code??

    • @saaah707
      @saaah707 2 months ago

      Hot code is executed many times; cold code is not executed often.
      I believe the goal is to minimize thrashing of the instruction and data caches.
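
      (A small sketch of how you can hint this to the compiler in C++20: the attributes let it lay the cold path out of line so the hot path stays dense in the instruction cache. The handler below is hypothetical.)

      ```cpp
      #include <cstdio>

      void process(int value) {
          if (value < 0) [[unlikely]] {
              // Cold path: rarely taken, can be moved out of line.
              std::fprintf(stderr, "bad value: %d\n", value);
              return;
          }
          // Hot path: the common case stays on the fall-through route.
          std::printf("processing %d\n", value);
      }
      ```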

  • @akshitjangra2471
    @akshitjangra2471 1 year ago

    What sorting algorithm was the speaker trying to improve?

  • @NXTangl
    @NXTangl 1 year ago

    PDQsort would have handled random01.

  • @childhood1888
    @childhood1888 2 years ago

    46:11

  • @LaurentLaborde
    @LaurentLaborde 2 years ago +1

    from 1:09:34 I keep shouting "A DATABASE !!" every 20s but he's not answering :(

    • @LaurentLaborde
      @LaurentLaborde 2 years ago

      A database does A LOT of work in order to be as fast as possible when needed. Which is 50% of his talk.

  • @qm3ster
    @qm3ster 2 years ago +1

    Nah, generic programming should support this.
    Don't just statically introspect the size of the element.
    Compile the moving and comparison code just to analyze it, then use the results of that in the body of the sort implementation :v

  • @tomcheng2437
    @tomcheng2437 4 years ago

    I thought quicksort was not stable...

    • @VioletGiraffe
      @VioletGiraffe 4 years ago

      Good point, there's std::sort and std::stable_sort.

    • @greenfloatingtoad
      @greenfloatingtoad 4 years ago +7

      Idempotency is weaker than stability. If you put in perfectly sorted data, an idempotent sort will leave it alone. But if the data is even a little bit unsorted, it might change the order in which equivalent elements show up (stability).
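
      (A concrete illustration of the stability part, sorting by key only: std::stable_sort guarantees the input order of equal keys, plain std::sort does not.)

      ```cpp
      #include <algorithm>
      #include <string>
      #include <utility>
      #include <vector>

      int main() {
          std::vector<std::pair<int, std::string>> v{
              {2, "banana"}, {1, "apple"}, {1, "avocado"}};
          auto by_key = [](const auto& a, const auto& b) { return a.first < b.first; };

          auto s = v;
          std::stable_sort(s.begin(), s.end(), by_key);
          // Guaranteed: {1,"apple"}, {1,"avocado"}, {2,"banana"} - input order kept.

          auto u = v;
          std::sort(u.begin(), u.end(), by_key);
          // Keys are sorted, but the two key==1 entries may come out in either order.
      }
      ```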

  • @max0x7ba
    @max0x7ba 4 years ago +2

    The statement that radix sort only works for integers is not accurate. See Malte Skarupke's brilliant talk "Sorting in less than O(n log n): Generalizing and optimizing radix sort": ua-cam.com/video/zqs87a_7zxw/v-deo.html

  • @mishatsaritsyn4843
    @mishatsaritsyn4843 1 month ago

    So this is how generic programming dies... with thunderous applause

  • @monad_tcp
    @monad_tcp 3 years ago

    1:20:27 Concepts don't work for this; the only solution is tagging types and dependent typing. You have to tag the AST directly, attribute types in it, and then forward them. You can't do that in C++, because you can't introspect non-constexpr code, so concepts are useless (for solving that problem), as is SFINAE, and template metaprogramming. So in a way, generic programming (in C++) is why we can't have nice things. You could actually do this optimization in C#, though, because the JIT compiler has that information!
    Or you could manually tag your UDTs, as he said, but that's not optimal. It works, but we can do better; it's possible. Perhaps C++ is why we can't have nice things, but we can have better languages.