Why is Radix Sort so Fast? Part 3 Comparison and Code, Radix Sort vs QuickSort

Поділитися
Вставка
  • Опубліковано 1 чер 2024
  • Support What's a Creel? on Patreon: / whatsacreel
    Office merch store: whats-a-creel-3.creator-sprin...
    FaceBook: / whatsacreel
    In this 3 part series, we will explore sorting algorithms from the fundamentals all the up to implementations of both a comparison sort and a base 256 Radix Sort.
    In this 3rd and final part, we look at some code and compare the performance of RadixSort and QuickSort with lists of various sizes, consisting or randomly generated unsigned 32 bit integers.
    GeeksForGeeks Radix Sort: www.geeksforgeeks.org/radix-s...
    GeeksForGeeks QuickSort: www.geeksforgeeks.org/quick-s...
    Apologies for the code below, I have had to replace all greater or equal symbols with GE, and all less or equal with LE, greater is G and less is L. For the unedited version, please refer to the video!
    Code:
    // Radix sort based on Geeks for Geeks:
    // www.geeksforgeeks.org/radix-s...
    static void RadixSort256(unsigned int* arr, int n)
    {
    if (n LE 1) return; // Added base case
    unsigned int* output = new unsigned int[n]; // output array
    int* count = new int[256];
    unsigned int* originalArr = arr; // So we know which was input
    for (int shift = 0, s = 0; shift L 4; shift++, s += 8)
    {
    // Zero the counts
    for (int i = 0; i L 256; i++)
    count[i] = 0;
    // Store count of occurrences in count[]
    for (int i = 0; i L n; i++)
    count[(arr[i] GG s)&0xff]++;
    // Change count[i] so that count[i] now contains
    // actual position of this digit in output[]
    for (int i = 1; i L 256; i++)
    count[i] += count[i - 1];
    // Build the output array
    for (int i = n - 1; i GE 0; i--)
    {
    // precalculate the offset as it's a few instructions
    int idx = (arr[i] GG s) & 0xff;
    // Subtract from the count and store the value
    output[--count[idx]] = arr[i];
    }
    // Copy the output array to input[], so that input[]
    // is sorted according to current digit
    // We can just swap the pointers
    unsigned int* tmp = arr;
    arr = output;
    output = tmp;
    }
    // If we switched pointers an odd number of times,
    // make sure we copy before returning
    if (originalArr == output)
    {
    unsigned int* tmp = arr;
    arr = output;
    output = tmp;
    for (int i = 0; i L n; i++)
    arr[i] = output[i];
    }
    delete[] output;
    delete[] count;
    }
    Quicksort:
    int Partition(unsigned int* data, int lo, int hi)
    {
    unsigned int pivot = data[lo + (hi - lo) / 2];
    int i = lo - 1;
    int j = hi + 1;
    for (;;)
    {
    do {} while (data[++i] L pivot);
    do {} while (data[--j] G pivot);
    if (i GE j)
    return j;
    // Swap [i] and [j]
    unsigned int tmp = data[i];
    data[i] = data[j];
    data[j] = tmp;
    }
    }
    void QuickSort(unsigned int* data, int lo, int hi)
    {
    if (lo L hi)
    {
    int p = Partition(data, lo, hi);
    QuickSort(data, lo, p);
    QuickSort(data, p + 1, hi);
    }
    }
    void QuickSort(unsigned int* data, int count)
    {
    if (count LE 1) return; // Added base case
    QuickSort(data, 0, count - 1);
    }
    Software used to make this vid:
    Visual Studio 2019 Community: www.visualstudio.com/downloads/
    Blender: www.blender.org/
    OBS: obsproject.com/
    Davinci Resolve 16: www.blackmagicdesign.com/prod...
    OpenOffice: www.openoffice.org/
    Gimp: www.gimp.org/
    80's 3D neon effect in the thumbnail is from Ducky 3D's: • Blender - 80's Style A...
    Background HDRI from thumbnail and intro is from HDRI Haven: hdrihaven.com/

КОМЕНТАРІ • 292

  • @BradleyBroom
    @BradleyBroom 3 роки тому +302

    For when you can't use radix sort, there's a whole science to picking the pivot for quicksort. For large arrays picking pivots close to the median gets you closer to O(n log n) and helps to avoid worst case (n2). Picking a single random value from the array is somewhere between those (since you're unlikely to pick the median). To pick a good pivot , pick several random values from the array and use the median of those. The science is how many random values to use as a function of n.

    • @WhatsACreel
      @WhatsACreel  3 роки тому +65

      Really great stuff mate! I've usually gone with random in the past. Median sounds like a great choice! Cheers for sharing mate, and cheers for watching :)

    • @MrWorshipMe
      @MrWorshipMe 3 роки тому +2

      In his benchmark it scales pretty much as N log N a factor of 1000 for the inout equals a factor of 10000 for the run time is what you'd expect.

    • @freespam9236
      @freespam9236 3 роки тому +3

      @@WhatsACreel my pivot selection has been 1st, middle and last element, find the median and use that as pivot
      so extra comparisons but worse case hard to happen - in reality when somebody actually designed input to hurt the pivot picking algorithm

    • @sharpfang
      @sharpfang 3 роки тому +50

      Yeah, picking the pivot close to median is easy! Just sort the array and pick the middle element!

    • @freespam9236
      @freespam9236 3 роки тому

      @@sharpfang median of 3 elements is not that hard, just some if's

  • @captainbodyshot2839
    @captainbodyshot2839 3 роки тому +198

    std::chrono::high_resolution_clock is the way to go when you want to measure performance of anything in C++

    • @WhatsACreel
      @WhatsACreel  3 роки тому +40

      Yes, good suggestion! I certainly could have used something more accurate :)

    • @anthonynjoroge5780
      @anthonynjoroge5780 3 роки тому +38

      No I'd say std::chrono::steady_clock is the best since it is,well,steady.
      High_resolution_clock is unstable in some implementations and is not recommended for measuring time spans

    • @sharpfang
      @sharpfang 3 роки тому +4

      Or at least repeat the run multiple times and sum up the numbers. Instead of 10 0's get time from start to finish. (subtract n times of dry run, generating the array and copying it to output unsorted).

    • @philipbotha6718
      @philipbotha6718 3 роки тому +1

      @@anthonynjoroge5780 Sometimes it is the same clock. I've seen libs where high_res_clk is an alias of -steady_clk-system_clock. Edit: wrong clock.

    • @anthonynjoroge5780
      @anthonynjoroge5780 3 роки тому +1

      @@philipbotha6718 I doubt that.I only know that to be true only for system_clock and high_resolution_clock. However,please give me an example of such a library. I'd love to see it.

  • @TomStorey96
    @TomStorey96 3 роки тому +54

    As if I needed any more distractions at the moment (it's hard enough to finish existing projects), now I want to write some sorting algorithms!

  • @CompanionCube
    @CompanionCube 3 роки тому +33

    use this if you often need to toggle between 2 lines
    //* remove first / to toggle
    func1();
    /*/
    func2();
    //*/

    • @WhatsACreel
      @WhatsACreel  3 роки тому +6

      That's awesome! :)

    • @DVSProductions
      @DVSProductions 3 роки тому +8

      Or just use
      #if doFunc
      Func1();
      #else
      Func2() ;
      #endif

  • @yvrelna
    @yvrelna 3 роки тому +75

    The weakness of Quicksort is that it cannot take advantage of partially sorted data or data that's exactly or almost exactly reversed. In real life, you often have two or more sorted dataset that you want to combine into a single sorted whole, or dataset that's mostly sorted but with just a few items in incorrect places. Indeed the worst case scenarios for quicksort usually is with sorted or reversed dataset.
    Some hybrid sorting algorithms like timsort or introsort can take advantage of those existing structures in the data and perform a nearly O(n) sort, while avoiding worst case scenarios and maintaining O(n log n) for any input.
    The standard sorting algorithm in the standard libraries of most programming languages usually uses one of these hybrid sorts rather than a naive sorting algorithm like quicksort.

    • @WhatsACreel
      @WhatsACreel  3 роки тому +21

      That's very often the case, yes! Certainly best to use a sorting algorithm that takes advantage of any patterns in the data! Cheers for sharing :)

    • @movax20h
      @movax20h 3 роки тому +7

      radix sort doesn't take advantage of partially sorted data either. quicksort hybrids, like timsort and introsort, can be massages to do these tricks. radix sort kind of hard, without some extra passes on data. radix sort usually will always destroy partial sorting of data in the first step.

    • @mnslr
      @mnslr 3 роки тому

      @@movax20h if you knew apriori that some selection of the data was already sorted could you improve it by splitting off the initially sorted bit? Im thinking something like this: copy out the sorted bit and replace its values in the modified inputArr with zeros/minimal/maximal possible values, except for the sorted selection's first and last (minimum and maximum) values. Then do quicksort on the inputArr using the retained first and last values as pivots. I'm having trouble thinking of a way to reintegrate the stored separately part but it feels like maybe there's a way. Maybe some kind of binary search or something?

    • @benhetland576
      @benhetland576 3 роки тому +1

      @@mnslr You could do a merge sort as the final merging step, which I suppose should even be doable in-place if the two parts (pre-sorted + qsort output) still remain in that same array.

    • @SevenRiderAirForce
      @SevenRiderAirForce 3 роки тому +6

      "naive sorting algorithm like quicksort" I can just feel the freshmen CS majors sweating as they read that haha!

  • @ridlr9299
    @ridlr9299 2 роки тому +8

    Although it looked like radix performed better at all sizes, I think it’s important to clarify that if each of these sorts was done 1000 times for each given size, quick sort would indeed perform much better on the smaller sizes.

  • @sheridenboord7853
    @sheridenboord7853 3 роки тому +2

    Amazing stuff. Love your work. I started on 8 bit machines back in 1978. You guys trying to learn, subscribe to this guy and watch all his stuff. Now Mr Creel, you been doing assembler. So you could use the processors clock counter to measure how long things take in cpu clocks. Forget the libraries. The magic x86 instruction is RDTSC - Read Time-Stamp Counter into EDX:EAX, opcode 0F 31.

  • @xniyana9956
    @xniyana9956 2 роки тому +3

    Wow. I never knew there was a sort that was so much faster than Quick Sort. I learned something here today. Radix Sort is nothing short of amazing!

    • @plasmaastronaut
      @plasmaastronaut 2 роки тому +1

      quicksort is a misnomer. its not that quick. The 'quick' means if u don't want to spend time thinking about which sorting algorithm to use u can quickly use quicksort for now and worry about improving the quickness later. Should be called lazysort, but then no-one would use it.

  • @galier2
    @galier2 3 роки тому +34

    2 small optimisations available for your RadixSort. memset and memcpy are a tick faster than loops, but the more important optimisation is that you don't need to shift and mask the input values. You can use a byte * instead of the uint * and you increment it by 4 in the innerloop and by 1 in the outerloop (of course endianness becomes an issue but in practice not, everything is little endian nowadays).

    • @AAA-de6gt
      @AAA-de6gt 3 роки тому

      I don't think endianness matters. Big-endian just ends up with sorting bytes from the MSD.

    • @WhatsACreel
      @WhatsACreel  3 роки тому +4

      Great suggestions! I really like this *byte idea! :)

    • @galier2
      @galier2 3 роки тому +1

      @@WhatsACreel Theoretically it will not make a big difference though. It only pushes the explicit shifting/masking from the ALU to the load-store unit where it is done implicitely. It can also introduce a partial register write stall that plagued Intel CPUs for a long time (when writing a byte register AL, BL etc., accessing the rest of the register (EAX, EBX or RA, RBX) would result in penaltie). So careful testing is necessary.

    • @benjaminfacouchere2395
      @benjaminfacouchere2395 3 роки тому +11

      There's a great video about modern compiler optimization: ua-cam.com/video/w0sz5WbS5AM/v-deo.html
      The key takeaway is: compilers have become extremely good in unwrapping loops and optimizing bit manipulations, if you are able to recognize the optimization easily the compiler will too.

    • @Viewable11
      @Viewable11 2 роки тому +1

      @@benjaminfacouchere2395 Another optimiziation: use constants instead of variables wherever possible, as this effects branch prediction and preemptive execution of the CPU.

  • @movax20h
    @movax20h 3 роки тому +2

    Btw. for 64 bit integers the radix sort might be a bit worse, unless you have a really big number of numbers. But for a lot of 32 bit or 16 bit integers, it will works blazing fast. A tricks, is to decide number of buckets before start of the algorithm, based on the size of the key, and number of keys. I.e. if you have billion large numbers, it makes sense to probably use Radix-65536. Because few extra MB for counts, is nothing compared to the temporary array. Also radix sort is reasonably easy to parallelize: in each phase, each thread updates counts using atomic, or own local copy of counts, then they are aggregated (by one thread or by multiple threads, various methods are possible). Then prefix sum is calculated (by one thread, or in parallel). Then shuffling is done in parallel too, with some extra tricks to make it correct.

  • @NuriTANRIKUT
    @NuriTANRIKUT 3 роки тому +3

    great series! very informative.
    couple of things I might improve, in order of importance IMO:
    1) keep GenerateRandomData() call out of timing block. you don't need to measure random number generation
    2) use std::chrono::steady_clock for better timing (milliseconds is enough but you can go up to nanoseconds precision). important point here is to use double instead of default int64_t:
    using milliseconds = std::chrono::duration;
    using std::chrono::steady_clock;
    auto start = steady_clock::now();
    // stuff
    auto end = steady_clock::now();
    double elapsed_ms = duration_cast(end - start).count();
    3) use std::vector (when possible: std::array) instead of C style arrays for two main reasons: a) no need to new/delete, b) swapping input and output arrays becomes as simple as std::swap(input, output) and it is O(1)
    there are more small stuff but hey it's a rabbit hole :)

  • @GabrielPerez-ml2go
    @GabrielPerez-ml2go 3 роки тому +4

    I was always scared of radix sort, thanks to your three part series I feel much much better, THANK YOU

  • @mivoe99
    @mivoe99 3 роки тому

    Great video series. Really enjoyed it. Altough I already learned about sorting algorithms at university I haven't taken a look at RadixSort until this series.

  • @goranjosic
    @goranjosic 3 роки тому

    This is my first time watching some of your videos and this is great! Excellently explained.

  • @cptechno
    @cptechno 3 роки тому

    Thanks for bringing this topic up! I didn't study these sorts at university. I intend to use both the bucket-sort and the radix-sort to sort trees and graphs. These algorithm have wonderful extensions to more complex objects than numbers and the number of variations on these opens up a lot of possibilities!

  • @kleetus92
    @kleetus92 2 роки тому

    Fascinating series on sorting. it's been 3 decades since I messed with this stuff, and even then, we didn't get this deep. Thank you for taking the time to explain this.
    Side note, has anyone told you that you sound very much like Eric Idle from the Monty Python crew? I don't mean that as defamatory.

  • @ferinzz
    @ferinzz 3 роки тому

    Awesome stuff! Very informative series that demonstrates why the methods do become faster, and not just how to do the method itself. Thank you very much, I greatly appreciated this.

  • @cacheman
    @cacheman 3 роки тому +4

    I have an article on github that covers sorting other numeric types and some radix sort optimizations, e.g column skipping. It's called "Radix sorting from the ground up"

  • @movax20h
    @movax20h 3 роки тому +2

    When comparing to quicksort, you should really use the built in C++ std::sort. Granted it is not required to be quick sort, but it is usually a hybrid of quick sort, heap sort in place, and insertion sort for small arrays, sometimes with some additional tricks (like detection of partially sorted or reversed arrays, like timsort; or other good pivot selection methods, i.e. median of 3, or something like this). For totally random data most of these tricks are not supper important, but some minor details and microoptimisations in std::sort are still relevant, like optimising increment / decrement, index comparissons, and form of the loop, makes a difference.

  • @morpheus_uat
    @morpheus_uat 2 роки тому

    this was really an interesting adventure into the radix sorting alg, good job there mate

  • @shyrealist
    @shyrealist 3 роки тому +2

    I think you can do the radix sort in place by calculating the prefix sum as you go and performing swaps based on the intermediate values, it still requires the final pass per iteration in the reverse direction to put them in the correct order... actually on second thoughts the swaps would have to be inserts (shifting elements), to work around that you could treat the original array as two buffers and I think one of them would need to be circular... hmmm... I wish I had the time to explore this idea!

  • @roccov3614
    @roccov3614 3 роки тому +18

    Am I right in thinking that the number of digits in the numbers affects radix sort but not quicksort? If so, I would have liked to see a set of tests where the number of elements are constant but the number of digits of the elements increases, and seeing if what effect that would have.

    • @DerekRoss1958
      @DerekRoss1958 10 місяців тому

      That applies to sorting strings more than numbers. If you're just using the standard integer number types used in the video they tend to have a fixed number of digits. It's just that all the leading digits are zeroes. So you can just use the radix sort, implemented as shown in the video.
      The only place where numbers might cause a problem for radix sort is where you have a number type that's not implemented using the standard types. Even there (or for strings) you just need to re-inplement radix sort with a variable number of passes, depending upon the length of string or the number of base-256 digits in the number.
      And when sorting non-standard datatypes, quicksort can be affected too.

  • @mechwarrior83
    @mechwarrior83 10 місяців тому

    Thank you so much for putting up this masterclass.

  • @mikefochtman7164
    @mikefochtman7164 3 роки тому +1

    Another small optimization. Regarding the allocation of the output array and the extra logic to detect an odd number of pointer-swaps and copying the results.
    Instead, have the caller responsible for allocating the output array and passing a pointer for that in along with the input array. Then the RadixSort() routine can just pass back the 'output' pointer. The caller can use the returned pointer for the sorted array, and also test against what was originally allocated to deallocate whichever one is no longer needed. No need to copy results and onus is on caller to allocate all memory and clean up. (the 256 element 'count' array could be static as well, so very minimal stack usage).

  • @darske1
    @darske1 3 роки тому

    Awesome video as always, sir
    Thank you very much!

  • @clock4883
    @clock4883 3 роки тому

    Once again! Amazing video. Keep it up! :D

  • @blarghblargh
    @blarghblargh 3 роки тому

    I really like your explanations. For a programmer they are very easy to understand.
    My frustration with compsci books and articles is so often that they talk a little too abstractly, so for example wouldn't really worry about trying to address the "bucket" problem in radix sort.
    One odd hunch I had while you were describing that problem was: could you avoid allocating a second array if you used a radix 2 to somehow do clever in-place bucketing, or would that not work because you might get too many items in a row for one bucket and it wouldn't balance out? I plan to work through that soon myself to find out, but figured I'd mention it before then.

  • @pleilleur
    @pleilleur 2 роки тому

    Thank you ! Interesting and fun video

  • @mrmr4737
    @mrmr4737 Рік тому

    FANTASTIC Video! I appreciate it:) Cheers mate🐲

  • @Matt23488
    @Matt23488 3 роки тому +2

    Yes, with radix sort, the greater than/less than comparison is built into the fact that the order in which we store the counts is itself an ordered list. You're leveraging compile-time information to help with speed. Of course, that's directly why you're using more memory.

  • @mshonle
    @mshonle 2 роки тому

    One note on your quicksort here: in the worst case, it requires O(n) space (the space is the call stack). To fix this, you need two modifications: (1) on the recursive call, recurse on the smallest subarray first; and (2) change the second recursive call into a tail call [in C/C++, reassign the parameters and goto the function start; in Kotlin declare the function as tailrec, et cetera].

  • @chesshooligan1282
    @chesshooligan1282 3 роки тому

    After three 15-minute videos explaining in detail why the radix sort is so fast, ends the series with...
    "It's probably got something to do with it."
    Great work, mate. Thanks.

  • @neverremindyou6172
    @neverremindyou6172 3 роки тому +1

    Great vid again! This sure looks like if i dont know how much there will be to sort, it would be better to straight up use radix sort.

  • @johnvodopija1743
    @johnvodopija1743 3 роки тому +1

    This was a terrific and engaging mini series. It gave me flashbacks of computer science at uni. Thank you 👍😎🇦🇺

    • @sylvainforget2174
      @sylvainforget2174 3 роки тому +1

      Indeed,. Also, Radix Sort was never even mentioned. 🇨🇦

  • @blendpinexus1416
    @blendpinexus1416 3 роки тому

    i like the comparison done in this series

  • @crysicle
    @crysicle 3 роки тому +1

    @What's a Creel Note: it's possible to find a more precise speed of the algorithms for the 0 - 10000 array lengths by running each algorithm n times and then dividing the time taken to run them by n. This way you wouldn't have a 0 sort time for lower inputs as enough time would be given for the timer to reach values greater than 0. Thus, after division you'd have some mantissa values to show the algorithms speed.

    • @khatharrmalkavian3306
      @khatharrmalkavian3306 3 роки тому

      He's done that before (you can still see the loop in his test harness, but set to one iteration). He puts the data generation inside the test unit because he doesn't realize that a GB of RAM can store 268,435,456 integers, so it's okay to generate his test data outside the loop.

  • @elliott8175
    @elliott8175 3 роки тому +3

    Nice video. FYI:
    *do {} while (/*code*/);* is the same as *while(/*code*/);* in c++.
    *if (originalArr == output) /*...*/* This section of the function isn't needed because it'll always be false: The number of shifts is a constant of 4.
    For optimisations:
    - We could put *count* on the memory stack to avoid unnecessary system calls (using *new* ).
    - As others have pointed out, *memcpy* should double the speed of zeroing the *count* array.

    • @safulkin
      @safulkin 3 роки тому

      staticaly allocated count is not good idea. what will happend when two different threads execute this code on diffent cpu cores? [spoiler] data corruption
      int[256] is enough small to stack allocation.

    • @elliott8175
      @elliott8175 3 роки тому

      @@safulkin Sorry, I meant allocate it on the stack. (I was meaning that *count* shouldn't be _dynamically_ allocated, and so chose the antonym _static,_ but of course that has its own meaning.)

    • @R4ngeR4pidz
      @R4ngeR4pidz 3 роки тому +1

      if im not mistaken
      in the case where the statement isnt met, do while will always run the code atleast once, whereas while wont
      is this different in cpp o.o?

    • @elliott8175
      @elliott8175 3 роки тому +2

      @@R4ngeR4pidz A while loop will run the body of the code if the while condition is true - but to know if the condition is true it needs to run the code in *while(/*code*/)* . So it really is the same as *do/while* .

  • @calypso1985
    @calypso1985 3 роки тому +3

    A comment here Mr creel. I would like to see the ram use in every instance in that table.

  • @YerushalayimShelZahv
    @YerushalayimShelZahv 3 роки тому

    love your videos, if you get a chance would you mind doing a deep dive into hashmaps vs arrays for linear search?

  • @torak456
    @torak456 3 роки тому +3

    Was the pattern of Radix Sort using 2x the memory of Quick Sort something that held through larger arrays? Would have been cool to see that in the final chart at the end. Awesome video series, thank you! :)

    • @WhatsACreel
      @WhatsACreel  3 роки тому +3

      Sorry, I glossed over that! It does continue, yes. Radix Sort, as used in the video, doubles the array, so it takes about double the RAM of QuickSort. Well, cheers for watching mate, and have a good one :)

  • @pissfilth
    @pissfilth 3 роки тому

    Well explained Creel..
    Showing the actual data on slides, and the process steps, is quite helpful for any viewer.
    Cheers mate!
    Will there be an AVX512 example implementation of the Radix sort in the future? and some high precision timing measurement perhaps? (like others also suggested in previous comments).. preferably in clockcycles..
    It would make a nice tutorial video on how to tackle problems using SIMD. Count in registers, shuffle data elements (or indexes of elements) in registers.. and in the end you have a really usefull, very! fast sorting algorithm.
    ..or maybe make it a competition: Can you make it faster? // viewer_interaction++; (no, i did not yet make that code already myself :) )
    BTW: ***Your dad makes nice drawings***
    I myself had that hobby too: Painting and drawing,.. until i discovered computers, and asm.

  • @melickon
    @melickon 3 роки тому

    Greate stuff! P.S. The final table is missing required RAM column

  • @snnwstt
    @snnwstt 3 роки тому

    There is also a "database sort", that is, we know how many "pages" are required to hold the sorted result since each "pages" can hold a fix number of elements (that number is our choice). The memory allocation occurs thus just once. Then, one element at a time from the array to be sorted, we look into which page it should go (each page keeping the min and max value it holds in book-keeping while adding or transferring data to another page), incorporate it to the page it belongs if there is room in that page, else, move half of its data to an blank page (blank up to now, its memory had already been reserved), new page which will be properly linked. Since the amount of all required memory is known at the start ( TWICE the initial data), the memory allocation is done just once. All the other manipulations are book-keeping. Note that database "index" rank the data by keeping the order of the index, not of the value, being ordered, which works fine for string, or float, or even orderable structures of multiple "fields".
    I suspect it MAY be faster than a quick sort, given, that, in the first place, if you have to sort 1G elements, they don't happen 1G at a time, but one by one, making the database sort working from "day one" instead of having to wait the 1G-th element to be known.

  • @wedusk
    @wedusk 3 роки тому

    Thoroughly enjoyed. Reminds me of uni!

  • @stephenevans8355
    @stephenevans8355 3 роки тому

    Most of the data I deal with is already partially sorted with combinations of hill sorted, valley sorted and reverse sorted. This can be pathological input for Quicksort but makes another algorithm like CombSort11 shine.

  • @bartekltg
    @bartekltg 3 роки тому

    std::sort is using a hybrid sorting algorithm called introsotr, as you mentioned, it is qsort + heapsort (+ of course O(n^2) sort for short subtables, insertsort I think). The qsort uses "median-of-three" (*) pivot It works well for sorted, reverse sorted and random tables, a bit faster even than a random pivot. But there exist permutations that break it. This is where heapsort is used. As iterations of qqicksort progresses, the depth of iteration is checked, and if it is too large (greater than log[n]*const), the problematic region is sorted using heapsort (a bit slower in mean case, but the worst case is nlogn)
    *) it is the median of the first, middle, and last element, not of random ones. Whan interesting, gcc uses the second, middle and last elements, and the median of those is inserted in the first place. Not sure why, but I would guess it gives less jumps in code and is a bit faster.
    The main advantage of the median rule is we are guaranteed that in the range is at least one element not smaller, and at least one element not greater than the pivot. This allows us to get rid of some if-statements.

  • @DavidLindes
    @DavidLindes 2 роки тому

    Cool series; thanks! Not sure how I managed to go this long without having this algorithm on my radar. I suppose just because it's relatively scope-limited to fixed-length unsigned integers, so its utility is near-zero in other contexts?? Anyway, still cool to know about, and has its place. I imagine it could also be used to pre-sort other kinds of data, if one can easily compute a fixed-length unsigned int that represents the most significant sorting components (e.g. casting char[] arrays, to effectively sort by the first few characters)? Interesting stuff.

  • @DFPercush
    @DFPercush 3 роки тому

    Fascinating video series, thanks for educating us. :) I will say though, I don't often find applications where I only need a list of numbers sorted. The numbers are often attached or keyed to a data structure, and the whole structure needs to travel with the number, or at least be referenced somehow. Combined with something like a hash map to quickly... well, map, the numbers back to their containing structures, this could be a nice intermediate step. But if you've already constructed a hash map, why would you need a sort... *scratches head for a minute* ... Actually, if you can pack the number to sort, as well as an index to the original structure, into a single integral value, perhaps a int64 or just a byte array, and only do the radix sort on the least significant bytes, then the indeces or pointers would be part of the sorted data. Yeah... I'll have to try this some time. Comparison sorts are all I learned in school, it's very interesting to see other approaches.

    • @DFPercush
      @DFPercush 3 роки тому

      So, follow up... I wrote a radix sort template class, it can do integers, strings, and any structure really, as long as it can be represented as a sortable buffer of bytes. I haven't tried negative numbers yet though. The way I ended up doing it was to allocate two buffers for indeces into the original data, in and out, with a buffer swapping approach. A little heavy on the memory, but very flexible. That also gives the option of returning just the list of indeces without modifying the data (view()), or doing swaps on the original array (sort()). It relies on two functor classes to obtain the size of the inputs and evaluate the sortable bytes, but it was trivial to default those for ints and make a "prefab" for strings that you can pass. It also has a persistent buffer option and preAlloc() to avoid those malloc penalties for repeated calls.
      sorter rs; rs.sort(data,size);
      or
      sorter rs; rs.sort(data, size);
      Pretty handy! Thanks again for the inspiration!
      If anybody wants it, github.com/DFPercush/RadixSort
      I'll keep working on getting that negative number thing... sorted.

  • @NorbiPeti
    @NorbiPeti 3 роки тому

    Nice series

  • @tektronix475
    @tektronix475 3 роки тому +1

    image processing in assembly. any advice about books, tutorials, forums, info?. thanks!

  • @AutomaticHourglass
    @AutomaticHourglass 2 роки тому

    @creel Did you tried using SSE or AVX2? It might give you significant performance boost on radix sort since you can mask 16 8 bits(SSE) or 32 8 bits(AVX2) at the same time? I'm curious about the results..

  • @AC-bi3bz
    @AC-bi3bz 3 роки тому

    Thanks for this amazing series of videos ... explained well and entertaining. Chapeau!

  • @az-kalaak6215
    @az-kalaak6215 2 роки тому

    If you know the size of the array in advance, then radix sort have a huge advantage since you can just use template to allocate an array during compilation, saving time, you can even do the qame for the counter array, again saving time
    Unfortunately, initializing them to 0 will not reduce it
    It will even remove all allocation, leading to no needed deallocation so faster and more stable program

  • @najamhaq
    @najamhaq 3 роки тому

    Good job on not giving older part links so people who land here first will get harder timer finding where it all began.

  • @jugoslavilic9761
    @jugoslavilic9761 10 місяців тому

    Great videos.
    Is it possible to somehow check the start list of data, whether it is just integer numbers, negative, positive, floating, just letters, combination of all, …
    So spend some time checking data and get information how much memory we have and then decide which sort algorithm is the the best, actually the fastest for the given data and available resources.
    That would be very close to ultimate, universal, fastest algorithm for any type of data.
    Of course this is only if we do not know what type of data to expect.
    I know its not just type of data and memory we have, so maybe include more informations and checks.

  • @florianh.6256
    @florianh.6256 3 роки тому +3

    An interesting topic related to this and probably relevant for many cs students that are obviously watching this (judging comments from previous videos) would be the effects of multithreading these things. Spoiler: Quicksort is not stable enough even with good pivots to get as much out of it. Also to justify the overhead you need to have more than 1e7 elements to begin with. Thinking about it: Multithreading is a good example where the stability of an algorithm actually matters.
    If you want to spice it up: Sorting is one of those things that can make really good use of GPU acceleration. Maybe run a cpu radix sort vs a gpu bubble sort on the same data set. So many fun topics around this to explore.
    Another nice topic for an addendum video: The practical side of things. I the wild the choice of sorting algorithm is more or less academic. There is close to no cost/benefit to customfit some std_lib sorting to you particular dataset. Yes, there are cases where every ms matters, particular in lowest end embedded or HPC implementations, but for your average job you just use some standard sorting and be done with it. As with any algorithm you should be aware of its pros and cons, but do not obsess with reinventing the wheel. "You can always come back later to fix it, when it breaks everything"-Mantra is sadly reality. Or try to "outsmart" modern compilers with your new learned asm knowledge.

    • @WhatsACreel
      @WhatsACreel  3 роки тому +1

      Well, there is a GPU radix sort, yes. I remember watching a GTC talk on it a long time ago. Though I couldn't tell you which it was. I did actually make a multi-threaded version of the Radix sort for this video series too, but scrapped it. I just split the data into 8, and then used a merge step at the end. It was ok. Mostly bottle necked by the RAM reads I think.
      Certainly a lot of fun topics, yes!

    • @nikilragav
      @nikilragav 3 роки тому

      Hmm how do you even parallelize radix sort for gpu? I was thinking about this yesterday - it is still iterative along the number of digits. You have to do one digit first, reshuffle the numbers then do the next digit, so I'm not sure there's much of a speed up. Also you'll run into weird race conditions if try to change the same memory location in multiple gpu kernels/threads. Have to think more about how to avoid that

    • @florianh.6256
      @florianh.6256 3 роки тому +2

      @@nikilragav There is a ton of literature on the topic. I think one of the best illustration of the idea is in "HARRIS M. Parallel prefix sum (scan) with CUDA. GPU Gems 3". as it also adresses implementation "problems" you face on GPUs. There are quite a bit of optimizations on the idea and gpus have better implementations nowadays (and more cores, and shaders to play with). For example "Fast 4-way parallel radix sorting on GPUs"
      Another thing to keep in mind: GPUs do not use the same memory system as CPUs. It is quite different in its usage. There are also tons of absolutely unintuitive parallel algos that one just dont know about when starting with the topic.
      There is also other neat sorting stuff from the last years. worth looking through their cites too. doi: 10.1109/ACCESS.2019.2920917 or doi: 10.1155/2020/4793545

    • @nikilragav
      @nikilragav 3 роки тому

      @@florianh.6256 Hi! I *briefly* looked at that one after I wrote my comment. Yep I'm somewhat familiar with GPU algos and memory allocations

    • @Akronymus_
      @Akronymus_ 3 роки тому

      At work, I often found myself having the best performance with insertion sort. Eg.: Inserting events into a file from networked sources based on timestamp. The "better" sorting algorithms often are worse, depending on your dataset.

  • @hs3881
    @hs3881 3 роки тому +1

    It would be great to see the shootout on strings of different lengths, too.

  • @sebbes333
    @sebbes333 3 роки тому +1

    *For more sort comparisons:*
    Check out *w0rthy* : ua-cam.com/users/w0rthyAvideos
    For example: "Sorts - Classic" ua-cam.com/video/CnzYaK5aE9c/v-deo.html
    Check out *Timo Bingmann* videos: ua-cam.com/users/TimoBingmannvideos
    For example: "15 Sorting Algorithms in 6 Minutes" ua-cam.com/video/kPRA0W1kECg/v-deo.html

  • @unvergebeneid
    @unvergebeneid 3 роки тому

    Is there a reason those two loops at the beginning of the Quicksort code are do while loops instead of simple while loops? Since the body is empty, I can't see how they would behave differently.

  • @kashgarinn
    @kashgarinn 3 роки тому

    What would be the fastest sort for lexicographical sorting of double fp (x,y) datapoints?

  • @AndreSomers
    @AndreSomers 3 роки тому

    As quicksort (& friends) use a devide & conquer approach, they are relatively easy to parallelize, which does not help in the big-O speed of the algorithm but does help in the wall-clock speed. How well is radix sort scalable in this way? I don't see an obvious way do do that in the steps you described, except for chopping up the input array into parts, radix-sorting those, and then use a merge sort on the sorted parts perhaps?
    Also: how useful (if at all) is this technique if you dont't just want to sort a bunch of integers, but a collection of structs that you want to sort by a field that is an integer?

    • @CrystalLily1302
      @CrystalLily1302 2 роки тому

      Very late reply, but the way you could do it is by using the fields as keys and then reiterating over the list of integer elements to create a ssorted list of the structs, would be slower than a quicksort unless you have something in the 5k-10k range of structs/objects but theoretically useful if you have massive object systems that need to be sorted.

  • @bagok701
    @bagok701 3 роки тому

    I wonder if radix sorting by the most significant digit instead of the least would be faster, reading memory nearby another memory address could be binned? or would it be worse if memory modules allow reads from different parts of memory in parallel?

  • @murrayhorn8817
    @murrayhorn8817 3 роки тому

    It would be interesting to see this on a Risk processor like the ARM or Risk V with an index of 16 ompared with 256. This could result in the counts being stored in 16 of the 32 registers as opposed to memory. Would this speed things up enough to compensate for double the amount of passes compared to an index size of 256?

  • @fev4
    @fev4 3 роки тому

    This is fucking awesome. Subscribed. I just found this channel today, incredibly good. :o

  • @davidblake8612
    @davidblake8612 3 роки тому

    Cool series! Ta.

  • @raunaquepatra3966
    @raunaquepatra3966 2 роки тому +2

    I think the complexity of radix sort should be O(N log_b_(Max value)) where log_b is log with base = no of buckets used. For most computers max value = 2^32 so the complexity becomes O(32N) =O(N). But every it's tought as O(N) which is misleading according to me.

    • @ichigo_nyanko
      @ichigo_nyanko 2 роки тому

      k = log_b_(Max value) is always a constant so it will always end up as O(kN) = O(N)

  • @nebulagaming678
    @nebulagaming678 2 роки тому

    Hi guys! How would I extend this function to work with 64 bits and potentially with 128 bit numbers? Thanks!

  • @monkingflame9493
    @monkingflame9493 2 роки тому

    whats the scalability of the algorithm on multicore? like a quicksort will start with 1 and could be made to scale up to n^2 threads. how does radix sort scale? is there a limit on the threads it can use? like there is obviosly a bottle neck with it having to form the prefix sum.

    • @rpuxa163
      @rpuxa163 2 роки тому

      You can divide an array, sort each part in separate thread using radix and then merge like quick sort.

  • @davidjames1684
    @davidjames1684 3 роки тому

    What might happen if you wanted to use radix sort to sort 32 bit integers and you typecast them to four 8 bit extended ASCII characters, so even a very large decimal number like 4 billion would only require 4 digits (base 256)? Would it work properly? Would it be faster than using base 10 or base 16?

  • @j.p.hendrix4389
    @j.p.hendrix4389 3 роки тому

    I'm trying to figure if radix sort would turn into a special case if the bucket size was one bit in size. So instead of dividing 256, dividing by 2. From an assembly point of view this could be interesting in code size I guess?

  • @AndrewLapteff
    @AndrewLapteff Рік тому

    AMAZING!

  • @tonupif
    @tonupif 2 роки тому

    Tnx!!!

  • @sykeassai
    @sykeassai 2 роки тому

    I came into this in video 2, and the explanation of radix thought makes a lot of sense. but I'm left with a couple of questions.
    I did some basic programming on my degree about 12 years ago and I honestly can't say I remember all that much of it,
    but I am wondering 1) what language is that written in, (I'm guessing Visual Studio? in which case if I want to relearn programming at some point that is maybe where I should start as programing writing software?)
    2) what is the program used to run that code?

    • @pkmnfrk
      @pkmnfrk 2 роки тому

      The language is C++. The compiler/debugger is Visual Studio. If you are looking to get back into programming, I suggest Python instead of C++, it's much easier to use and with many fewer gotchas. The downside is that it's not nearly as fast as C++

  • @Vaaaaadim
    @Vaaaaadim 3 роки тому +1

    A nice short video series. The prefix sum idea was especially clever.
    Could you possibly do even better if you try to optimize the sorting algorithm to make better use of locality? By that I mean, this seems like it would be doing fairly random accesses to memory, whereas having accesses that are mostly linear would perform better. To my understanding, having fairly random accesses causes page faults.

    • @WhatsACreel
      @WhatsACreel  3 роки тому

      Hmmm... that's an interesting thought. The data is random, so there will have to randomness somewhere in the algorithm - in a QuickSort the randomness is in the elements that we swap. If we use a small radix, we get much better locality, but at the cost of more iterations. I can't think of a way out other than finding a nice compromise. That seems to be about base 256 radix sort. The counts are small enough that they fit into the L1 cache, but large enough that there's not too many iterations. Well, cheers for sharing this thought, it is very interesting! And cheers for watching :)

    • @Vaaaaadim
      @Vaaaaadim 3 роки тому

      @@WhatsACreel
      Perhaps one could make buffers for each possible radix (digit? symbol? value?) such that the L1(or possibly the larger caches) can fit everything. Once a buffer is totally filled, we flush its contents to where the data is supposed to go. Then the page faulting would only occur once during each time we need to flush a buffer.

  • @NoorquackerInd
    @NoorquackerInd 3 роки тому +1

    I'm _real_ happy I only sort by Unix timestamps

  • @HL65536
    @HL65536 3 роки тому

    would be interesting to see what the speeds are like with longer data types. 64 bit, 128bit,...? Is there a tipping point?

  • @horaciomlhh
    @horaciomlhh 3 роки тому +10

    would had been nice to see the memory ratio in the comparison table

    • @kazedcat
      @kazedcat 3 роки тому +2

      The memory ratio is 2 or it approaches 2 as the number of elements approaches infinity.

  • @davidjames1684
    @davidjames1684 3 роки тому

    I think a really good sort algorithm would be one that first scans the data to be sorted (size, type, randomness, min/max data values...), and based on that, selects the most appropriate sorting algorithm. For example, it could have 10 complete sorting algorithms available to itself, but decides which one to use based on the data. Some of those 10 could even be combination algorithms involing pieces of different subalgorithms.
    Also, I wanted to mention that heapsort has a simple tweak in that instead of just plucking the root node each time, you can pluck that and one of the children (if available). For example, if we had a maxheap with the first 3 elements being 10, 4, and 6 (in that order), instead of just plucking the 10 and then reheapifying, we can pluck the 10 and the 6. Someone actually benchmarked this for me and it was like 3% faster than 1 element popping heapsort. A mild improvement but hey if it works and it is faster, why not use it?

    • @JirayD
      @JirayD 2 роки тому

      Isn't that what introsort is all about?

  • @hyruleengineer7252
    @hyruleengineer7252 3 роки тому

    Great Video Series!
    I think that counting-based sorting algorithms also have an advantage over compare-based algorithms, when run on a modern computer, as pipelining can be utilised more effectively, because the code does not include any branches, right?

    • @mikefochtman7164
      @mikefochtman7164 3 роки тому

      Seems good, and the count array might be cached since it's not that large.

    • @CrystalLily1302
      @CrystalLily1302 2 роки тому

      The only downside that I can see is that masse parallelization is harder to do with counting based algorithms compared to the divide and conquer compare algorithms, so if you're running on a cloud server with tons of nodes or even on a smaller system with many cores then then the parallelization could increase speed a lot.

  • @NicosLeben
    @NicosLeben 3 роки тому

    The process memory is shown in real time. When you are looking at it the sorting is already done and all the memory was freed again. Look at @11:54 and @12:21 in the process memory graph. You can see it spiking at the beginning and then it's freed again.
    I am also curious why you are measuring the time of the random number genator too. Yes, it is just O(n) but for a comparison between both sorting algorithms you can not add a constant number to each time.

  • @vukpsodorov5446
    @vukpsodorov5446 3 роки тому +1

    i was wondering; returning the radix sort counts to 0 runs through 256 iterations, when in reality if we're dealing with integers you can only have 10 digits (0-9) as you showed in the previous video. i'm not sure if i just missed something, but i think there's a bit more performance to be had (not much though, considering it only does that once per main iteration, of which you only have a few). alternatively, if you do use a larger list of them, would you be able to dodge returning them to 0 altogether, somehow moving to the next 10 digits in the array of counters (so the first iteration uses 0-9, the second uses 10-19 and so on), you never return any to 0, and then you just delete that as you do when you're done? i haven't given it much thought or experimented with it, but i figure it's doable and would likely increase performance (although i guess that would only really happen when dealing with some seriously large numbers in the array, with lots of digits where every instruction in that part of the sort counts)

    • @WhatsACreel
      @WhatsACreel  3 роки тому

      Yes, sorry, it was using base 256. So there’s 256 digits. Instead of 0 to 9 it goes from 0 to 255 - each digit is one byte. It’s just easier for the CPU.
      I really like the idea of using 4 counts arrays to avoid zeroing! If we could be guaranteed that malloc would zero for us, we’d be good as gold! I’m sure that would save some time! I don’t think malloc does Zero anything so it mightn’t save us time and we probs have to do it ourselves :( Haha, still cheers for watching mate, and cheers for the suggestion :)

    • @vukpsodorov5446
      @vukpsodorov5446 3 роки тому

      @@WhatsACreel well, i suppose you could set all of the digits to zero when initiating the array (or at least i think so, if we're thinking of the same problem)
      and in any case, thank you for these videos!

    • @mikefochtman7164
      @mikefochtman7164 3 роки тому

      @@WhatsACreel Might even try something like memset(count, 0, 256*sizeof(count[0])). The '256*sizeof(count[0])' would be resolved at compile time to something like 'memset(count, 0, 1024)' (assuming 32-bit int), so this is probably the fastest way to zero out a block of memory. No overhead setting up a for() loop or indexing and such.

  • @overlisted
    @overlisted 2 роки тому

    A funny thing that I just thought while watching this: You can use radix sort to sort strings alphabetically (letters = numbers, radix = the size of the alphabet)! Is this how programs usually do that?

  • @_c_e_
    @_c_e_ 3 роки тому

    Can you do a multi pivot sort using threading?

  • @raphiki4780
    @raphiki4780 2 роки тому

    3:02 I'm wondering why you used for(;;). Woulnd't it better to use while(1) instead ?

    • @mshonle
      @mshonle 2 роки тому

      Any reasonable compiler would compile for(;;) and while(1) to the same code, so there is no performance difference, so it’s a stylistic choice. It’s more idiomatic to use “for(;;)” for a “forever” loop, not just because of the cute name and that it’s one character shorter, but because having an empty conditional expression in a for statement is the one place in the language that specifically defines a natural infinite loop. As I said, a compiler would treat “while(1)” the same, but semantically the code is literally saying “load the immediate value 1 and see if it’s non-zero,” which feels like an unneeded step given the language already directly supports indefinite looping. For similar reasons, if the value of an increment operation isn’t used, many chose to use the pre-increment notation (++i) over the post-increment notation (i++).

  • @soberhippie
    @soberhippie 3 роки тому +1

    Would using other modulos (apart from 10) give better or worse results? Or, rather, how would it affect the speed and what are the things to consider when choosing the base for modulos?

    • @WhatsACreel
      @WhatsACreel  3 роки тому +1

      Yes, it can make a big difference! Mostly, we stick to powers of 2 for the modulus, so we can use boolean AND instead of dividing to get the digits. Cheers for watching mate, have a good one :)

    • @soberhippie
      @soberhippie 3 роки тому

      @@WhatsACreel Thank you for your videos, they are not only educational, but a lot more fun than other similar ones )

  • @GRHmedia
    @GRHmedia 4 місяці тому

    The radix short should effectively require 2x the data size in memory while the quick sort should be able to work within the data's own space.
    If one used a thread pool the quick sort could be improved a good bit in performance. Doubt it would catch up with radix sort though. It would require waiting to each division is done to split the work load.

  • @Gutagi
    @Gutagi 3 роки тому +1

    Im a begginer programmer and would like to know how to easily make a function getDigits(number,base)
    I managed to make a base 10 one, However other bases don’t work...

  • @alex_smallet
    @alex_smallet 3 роки тому +1

    Great series. But why did you not mention that radix sort can start from both ends? I.e. - there is LSD (the one you covered) and MSD sort, which is basically the same, but starts from the leftmost digit.

    • @WhatsACreel
      @WhatsACreel  3 роки тому

      Yes, sorry. It's just I had the code for LSD and some animation ideas. No real reason other than that. Maybe we can do another vid in the future on MSD? I'm not sure. Both are fine algorithms, and cheers for watching :)

  • @dm121984
    @dm121984 3 роки тому +2

    I wonder if you can use radix-type sorting on classes and the like. For example, a string - you could radix sort character by character I think - I just wonder if overhead would kill it. For other data types, for example a class with 3 numbers (hours, mins, secs), I imagine you could radix by seconds, then minutes, then hours.

    • @WhatsACreel
      @WhatsACreel  3 роки тому

      It's certainly capable of sorting strings, yep! It's pretty flexible, really, so long as you can figure out how to read the data as something like unsigned int you're good to go! Cheers for watching mate :)

    • @sakari_n
      @sakari_n 3 роки тому

      You are correct about sorting data with multiple key values per element, because radix sort is type of stable sort you can sort starting from the least significant key value to most significant (like in your example first sort by seconds then by minute and last by hour).

  • @kestans
    @kestans 3 роки тому +1

    I had some data to extract from siemens PLC database and make a monthly report to exel or push report on the fly to the web interface. What I did I read the month data to ram and save indexes of data of each new day, I then create multiple threads on cpu so each thread handles calculations for that day data so thread receives the memory address of data and index locations of that day data. so it is running eg 30 threads and as if computer has more cores it reduced the time need for calculations as much times as much more cores the cpu had.

  • @AlessioSangalli
    @AlessioSangalli 3 роки тому +3

    Fantastic mini-series; absolutely fantastic. However the part where you measure the performance of sorting 10, 100 elements is cringe 😅😊 come on you should have started at 1.000.000 🤣🤣

  • @raunaquepatra3966
    @raunaquepatra3966 2 роки тому +1

    Can we do radix sort for floats?

  • @harrysvensson2610
    @harrysvensson2610 3 роки тому +1

    It would be fun to hear if any of the algorithms could be sped up by using SIMD functions, and then by utilizing locality (maybe structs with alignas?) for the cache lines to reduce the number of cache misses.
    Optimize it to the ground!

    • @kungfujesus06
      @kungfujesus06 3 роки тому

      It can be and has been. Intel performance primitives provides a very nice radix sort for all integer widths as well as floats

    • @controlledsingularity8084
      @controlledsingularity8084 3 роки тому

      gpgpu will smoke cpu simd.

    • @kungfujesus06
      @kungfujesus06 3 роки тому

      @@controlledsingularity8084 For a very branchy integer based operation? Maybe, but probably not by the margin you think it will. Do any of the STL-like utility libraries for CUDA have a radix sort to compare it to? Likely the copy latency will be the longest thing

  • @FalcoGer
    @FalcoGer 3 роки тому

    allocating stack memory is basically free (just increment the stack pointer register)
    malloc/new is where the overhead comes in. Asking the operating system for memory is a big deal.

  • @valseedian
    @valseedian 2 роки тому

    when you're working with truly huge datasets radix has no comparison. it's the only choice for pre-bucketizing string lists over a million unique points.
    I've been using a pseudoradix methodology to keep my homebrew linked list map class sorted live.
    the memory overhead isn't impossible to overcome when using linked lists, and radix also allows for parallelism where practically all others are requisite linear.
    when I'm dealing with data sets that won't fit in working ram, radix to the rescue again! just make your first bucketizing step output to file storage.
    continue bucketizing to the disc until buckets fit in working ram.
    array radix is kinda limiting for that tiny lookup bias. I feel like on average you gain more from being able to move blocks of memory and insert nodes than you lose from that lookup while the sort process is happening.
    my linked list comes with 2 jump pointers that keep relatively close to binary locality when the map is open for active additions or deletions. optionally the map could be collapsed into an array when it's keys became const.

  • @cmdlp4178
    @cmdlp4178 3 роки тому +5

    Small optimization:
    The count array could have been allocated on the stack, that would just be 4*256=1024 bytes. Today stack usage and arrays on the stack are not a problem, as long as the arrays have constant size. In C there are variable length arrays with array size not known at compiletime, the same thing could be done with alloca. If you would declare an array with variable size you should always check if it has a relatively tiny size like 10kB or less. On windows you have a default maximum stack size of 1MB.

    • @harrysvensson2610
      @harrysvensson2610 3 роки тому

      "On windows you have a default maximum stack size of 1MB."
      Is that the maximum size an array can be when passed as an argument to a function (aka put on the stack)?

    • @mikefochtman7164
      @mikefochtman7164 3 роки тому

      @@harrysvensson2610 You wouldn't normally pass an array as an argument, but rather a pointer to that memory location where the array is (or in C++ it could be a reference, but the effect is the same. The stack just has to hold 'sizeof(pointer)' bytes.

    • @benjaminfacouchere2395
      @benjaminfacouchere2395 3 роки тому

      I don't really see the optimization here, care to explain it?

    • @phizc
      @phizc 3 роки тому

      @@benjaminfacouchere2395 Allocating and deleting allocated memory on the heap takes some time. If you sort a large array it's not going to matter much, but if you sort many small arrays that time is going to add up. Having the array on the stack is just changing the stack pointer, so it's essentially free.
      Replacing
      int* count = new int[256];
      with
      int count [256];
      and removing
      delete[] count;
      makes the count array be on the stack instead of the heap. It will be removed automatically when the function ends, and again, it's just changing the stack pointer, so it's free.
      Regarding the maximum stack size: Every time the program enters a function it "creates" space on the stack for the variables that function uses. So if you make a recursive function that needs 1000 bytes of stack space, it will have used a million bytes when it's gone 1000 steps "deep";
      e.g.
      void rec( int x )
      {
      if( x < 0 ) return;
      int a[250];
      rec( x - 1);
      }
      rec( 1000 );
      will used 1 million bytes (more actually since arguments are also put on the stack) before it starts to unwind. With a maximum stack size of 1 MB, you may run out stack space and crash if something down the call stack also allocated a lot of data on the stack.

    • @benjaminfacouchere2395
      @benjaminfacouchere2395 3 роки тому

      @@phizc Sure, you have a small one-time allocation penalty, but that's it.
      Obviously we won't use radix sort on small arrays as other algorithms are more efficient there, also a count array of 256 is overkill then.

  • @wazydotmeter
    @wazydotmeter 3 роки тому

    Have a contest between different users rigs to see who's is the fastest? with the samples of code used.

  • @pw6048
    @pw6048 3 роки тому +1

    For radixsort, would it matter if you would scan only one time and build 4 array's instead of scan 4 times?
    so instead of :
    for (int shift = 0, s = 0; shift L 4; shift++, s += 8)
    {
    // Zero the counts
    for (int i = 0; i L 256; i++)
    count[i] = 0;
    // Store count of occurrences in count[]
    for (int i = 0; i L n; i++)
    count[(arr[i] GG s)&0xff]++;
    // Change count[i] so that count[i] now contains
    // actual position of this digit in output[]
    for (int i = 1; i L 256; i++)
    count[i] += count[i - 1];
    // Build the output array
    for (int i = n - 1; i GE 0; i--)
    .........
    something like :
    // Zero the counts
    for (int i = 0; i L 256; i++)
    {
    count[0][i] = 0;
    count[1][i] = 0;
    count[2][i] = 0;
    count[3][i] = 0;
    }
    // Store count of occurrences in count[]
    for (int i = 0; i L n; i++)
    {
    count[0][(arr[i] GG 0)&0xff]++;
    count[1][(arr[i] GG 8)&0xff00]++;
    count[2][(arr[i] GG 16)&0xff0000]++;
    count[3][(arr[i] GG 24)&0xff000000]++;
    }
    // Change count[i] so that count[i] now contains
    // actual position of this digit in output[]
    for (int i = 1; i L 256; i++)
    {
    count[0][i] += count[0][i - 1];
    count[1][i] += count[1][i - 1];
    count[2][i] += count[2][i - 1];
    count[3][i] += count[3][i - 1];
    }
    for (int shift = 0, s = 0; shift L 4; shift++, s += 8)
    {
    // Build the output array
    for (int i = n - 1; i GE 0; i--)
    this way initializing would be a little faster, and stil fit in L1 cache, and you dont scan the input 4 times. More work per loop.
    the building of the output arrays stil has to take place 4 time.
    but would this be faster? ( i'm not that good at C++ to test this myself )

    • @surters
      @surters 8 днів тому

      The quad counting seems like a nice idea, write a paper and post the address here!

  • @ValdaXD
    @ValdaXD 3 роки тому +1

    Oh. Code...
    Awesome! I thought this was gonna be a 2 part video 😅😅

  • @PaulTheFox1988
    @PaulTheFox1988 3 роки тому

    Looking at the results for quick sort I noticed it was almost linear once you reached 100k elements and above, which I wouldn't have expected.
    I mean, it's still nearly an order of magnitude slower than radix at 1bn elements but even at 10m elements it's still a usable algorithm if 1 second of wait time is acceptable, especially with radix sorts massive RAM requirements.