ARM is a load-store ISA, but presumably Apple did something for x86 emulation that allows it to operate in a register-memory manner. Not sure if that applies to native ARM code or not. ARM definitely has some bit-twiddling instructions; I'd be a little surprised if the compiler is generating shifted bit masks and ANDs for your bit test. For the scalar pipeline, ARM's 32-bit ISA had predication, but it looks like aarch64 dropped that complexity. What you really want to maximize your integer throughput is something that auto-vectorizes (or explicitly vectorize it yourself with the NEON intrinsics). Of course, if I remember from your last video, this code has integer division in it, which takes a huge performance hit on all architectures in terms of latency. x86 and ARM both lack vectorized division due to the ridiculously complicated amount of gyrations that have to occur in the ALU for it. That having been said, I haven't finished your video yet, I'm only 5 minutes in. I'm curious how this goes.
love it, was curious about the M1...don't have one...not in a hurry to get one...but curious where Apple is headed with it. Looking forward to your compiler comparison. Also something I don't get to look at much...in my world it's visual studio...and you live with it. But I know from prior experience that is not the only game out there.
Neat. Somehow, _all_ the results were actually impressive. The lowly Pi 3 is impressive for how narrow the delta actually is between cheapest possible self-contained computer and a TOTL desktop CPU. The Pi 4 for how much tighter that gap. The M1 for being a brand new product with the slider pegged dead in the middle between "optimized for low power" and "optimized for high performance." And, of course, the Threadripper for having the biggest 🥜 of just about any CPU available. haha
Controversial/manipulative test. Changing vector&lt;bool&gt; to vector&lt;char&gt; puts the M1 on par with the TR, and with a C array it scores 13.5k (a C array improves the result for x86 as well)
@@DavesGarage true. But the std::vector implementation appears to be suboptimal for ARM systems. I wonder how it's gonna work with manual bitmasks. I would expect a ~2x improvement for the rPi
The highest result I saw with the code from this video is 7929 on a Ryzen 9 3900X with g++, 7877 with clang. With the repo code it was 10191 (g++) and 10684 (clang). In theory the Ryzen 7 5800X should be about the fastest.
Just a few years ago (seems that way, anyway) single-core IPC would be the main thing to look at regarding games, but I have recently started playing PC games again and nearly all of them either use multiple threads/cores or in some cases require them. Multithread/multicore performance is more important to games now.
My guess? Any RISC instruction set is probably not going to perform its best in this workload, which is very load/store intensive. Stuff that chews on a few registers is where ARM/MIPS/RISC-V shine, not striding over a list where memory accesses are a separate instruction.
I wonder how the raspis would fare when overclocked, would love to see that comparison as well. Also, your speech became much easier to understand since the blue screen video, so that's nice for all non-native English speakers (:
I wish you would have just added one more benchmark. It could have been the same problem to solve, only allowed for multi threading. That way, we could also speculate as to what performance degradation we could (maybe/maybe not) see from the M1 doing what it does.
I agree the M1 is definitely doing something interesting for x86 emulation, though it appears to be just adding hardware support for strong memory ordering when running code intended for the x86, which given the cache heavy nature of this benchmark probably wouldn’t have much effect.
This was the straw that broke the camel's back in favour of me buying an M1 Mac after a decade of netbooks and secondhand business laptops from Japan. The high performance with long battery life and low heat output got me close, but not close enough to fork out the $$$ until I saw even the x86 emulation was sometimes faster than on x86 hardware.
@@andrewdunbar828 what makes Dave’s tests here interesting is that the M1 is a Laptop CPU... the Threadripper is a Desktop CPU. It will be fun to see what Apple do in the Desktop space with their ARM implementation!
Hello! I figured it out, I think. The M1's performance cores run at 3200 MHz; the AMD 3970X runs at 3700 MHz. If we assume the code is written (I have not looked at it yet) in such a way that the compiler simply can't optimize it, because each instruction depends on the result of a prior instruction (hope that makes sense), then each CPU turns in roughly 2 passes per MHz. So the M1 would score 3200*2 = 6400 and the AMD 3970X would score 3700*2 = 7400. It lines up quite nicely, but it does not test the difference in architecture in a good way.
This channel makes me happy.
I am not even a programmer but somehow listening to Dave is interesting and calming at the same time
Totally! 🥰
Me too
He made me appreciate windows .... that says alooooot
It's so calming, entertaining, educational and just plain fun.
Thanks for giving your time freely to play with this sort of stuff.
YouTube is an amazing medium for us mortals to engage with interesting people like yourself.
Keep up the great work 👍
Thanks Dave I am a Software Engineer, just graduated from college and am starting out. I love your content. I once had a professor who said "Programming is wizardry, and programmers are wizards." Someday I hope to be as great a wizard as you buddy.
All the best 👍 Per (DK)
Mellow piano music, sparkly lights.. new Dave's Garage episode! ... It feels like Christmas! Dave thank you so much.. as always, top notch content.
This is bloody brilliant.
Also, the fact that Nano was used as the editor made my day. Kudos to you sir!
nano is so nice :D
@@bobbydazzler6990 No.. masochists!
I LOVE the fact you talk at a nice, normal pace. There are some channels I watch at 1.5x speed just to get them to talk at a normal pace.
I just ran across your channel a week ago, and I'm really enjoying hearing your take on different programming issues! I used to work out the details of an algorithm using whatever scripting language was available on the platform, and once I had a solid plan, I would go back and rewrite it using C or FORTRAN or whatever else. This proved an effective way to cook up some great code that could do the job. Thanks for all of the great comments during your videos!
That's the comparison that we needed but didn't know it!
This has quickly become my favorite channel.
This guy is what YouTube should be
Thanks for the kind words!
I'm a simple man. I see Dave drop a video, I watch it. It's really not complicated. You're a legend, dude 👏
I appreciate that!
Dave you talk in perfect speed. For once I don't have to speed up the video I'm watching 🤣🤣
That's funny ;-). Yup, I default to 1.25X I think!
Agreed
Zoomers
Hahah :D
I watch these at 2x speed. But then again, I watch most others at 3x.
As a car/drag racing enthusiast and hardware engineer learning to code this was an excellent episode. Just subbed!
I don't usually notice background music without hating it but I think you found the right balance of musical complexity and intrusiveness
Great video, as always! Maybe another metric to consider: price per pass? :)
For example: the Pi 3B+, $35/305 ~ $0.11/pass
And Watts consumed per pass ;-)
@@donaldklopper yeah, outlay is usually nothing compared to power in industry. Outlay is usually only an issue for homes and small businesses that let equipment sit idle 99.99% of the time, even while "working".
Oh damn this is gonna get wild
I actually like the speed you talk at. Yours are the only videos I can watch at regular speed, instead of 2x like most others and 1.5x for everything else.
So I changed the vector&lt;bool&gt; to std::array, and got ~13000 passes on my M1 Air. FYI, it was ~4500 passes with vector&lt;bool&gt;.
That probably would lead it to also be faster on the other platforms, since it becomes static memory.
This was unexpected. I ran the CPP code in a WSL 2 terminal running Ubuntu. The CPU in the box is an AMD Ryzen 3800X running at stock speeds. And still, it outpaced the Threadripper. The first run turned in a score of 9622!
Passes: 9622, Time: 5.000000, Avg: 0.000520, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
Hi Dave,
thanks for producing this channel! Very enjoyable!
I ran PrimeCPP on my 5950X in WSL2:
Passes: 11267, Time: 5.000000, Avg: 0.000444, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
Passes: 11327, Time: 5.000000, Avg: 0.000441, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
Passes: 11346, Time: 5.000000, Avg: 0.000441, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
Cool! I've seen a 12000 as well from another viewer, but I think he was overclocked!
@@DavesGarage , User_Overclocked_Error - Only Machines Should Be Overclocked (0xB00B1377)
I love that we are mathing it up on different systems.
The M1 is still very impressive for a very new product in its first life cycle. Also, factoring in the power consumption makes it look even more impressive.
Also cost makes it impressive for its performance: you could get almost 3 Mac minis for the cost of just the Threadripper chip
@@michaelhenecke the threadripper is a server chip, no person needs that many cores
@@jan-lukas Yep, and you can get a decent gaming laptop for the price of a Mac
It'll be interesting to plot the same chart but divide by Watts used by the CPU.... Surprising results...
And you mentioned Turbo Pascal! I like you.
Thanks for the quality content. This is both entertaining and educational.
Would be cool to see an optimized version of a wasm and Node benchmark in addition to the vector optimizations you made to the CPP benchmark!
The showdown of the decade
Thanks for making these, as a constantly learning programmer these are invaluable.
I wrote a multithreaded solution to prime number generation in C++ a few months ago, it's actually not too hard to implement. Would be interesting to see how much the threadripper outpaces the M1 when you use all the cores lmao and would perhaps be a good next-step up from this.
Single thread performance is still super important. So much software is single threaded.
@@tommcintosh4705 Sure, it's important, but it's not more important than multithreaded performance. Things that tend to take a long time (e.g. compilation, 3d rendering, encoding video files, etc.) also tend to benefit from multiple threads, plus with more threads you can run more software concurrently (e.g. even if most software _was_ single threaded, being able to run more of it simultaneously could be a huge benefit).
Also, all current implementations of x86 have SMT: an optimization around the weakness x86 has in purely single-threaded workloads, allowing a single core to do a bit more than one thread's worth of tasks at once (essentially, a lot of the core's resources are left idle by its design, and that idle portion can be used to execute another thread at the same time). The M1 specifically has a relatively large advantage in that _one_ aspect, but essentially you're handicapping x86 by not letting it use its benefits as well.
Based on that, it's pretty misleading to show off single-threaded performance and act as if it's _that_ important of a metric.
Edit: to be clear, I'm not saying Dave is being misleading here, but that Apple's sudden surge of "hey, check out the single-threaded performance of our M1 part and see how powerful it is, also do benchmarks with single threads plz thx bye" is misleading and the fact it's worked: many people are suddenly trying to come up with super synthetic benchmarks that show off this weakness of x86 and push it as a huge problem, when it is typically _not_ that huge of a deal in practical usage.
@@tommcintosh4705 Well yeah
@@tommcintosh4705 I find that very little software is still single-threaded nowadays. Even games which are often very intensive on a particular single thread are usually multithreaded.
@@nephatrine Yup, no matter how much optimization you do on single-threaded code, it'll be hard to beat just spawning a crap ton of threads, even with bad optimization (if you can, that is).
I recently had a .NET job run on a single thread for almost 50 minutes (and that was optimized), but running it on 12 threads got it below 5 minutes. Try doing that on 1 core, I dare you. (Also, later I got it running on my GPU using OpenCL; it ran the same task in under 10 seconds XD)
- Tell me you're a Windows developer without saying "I'm a Windows developer"
- OK.exe
./no
@@mek101whatif7 imma need you to $rm -rf / right tf now
windefproc
Exactly what I came to the comments for
@@jabalahkhaldun3467 Useless unless you --no-preserve-root
I really appreciate you and your channel. This is a great example of a proper benchmark
i'm aspiring to take my interest in tech further, and this channel is a reason for that!
Thanks for this episode. Looking forwards to see how different compilers perform.
Congrats, you are the first youtuber who convinced me to click on the Like button upfront.
ikr
I run a Ryzen 1600 (14 nm version no OC 3.2-3.6 GHz clock speeds). And I got this result with g++ -Ofast
Passes: 8427, Time: 5.000000, Avg: 0.000593, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
I would expect it to be a lot lower.
Got a similar result on my AMD Ryzen 7 4800H with Radeon Graphics, no OC in a Laptop.
Passes: 9840, Time: 5.000000, Avg: 0.000508, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
I got 8200 passes on Ryzen 3600X but compiled with MSVC. WTF?
Using clang in Ubuntu 21.04, my Ryzen 4750GE w/o overclock:
Passes: 10777, Time: 5.000000, Avg: 0.000464, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
I was wondering what was going on; glad to see I'm not alone. 3900X @ 4.2GHz all-core OC -> Windows 10 -> VirtualBox VM running Mint 20.1 = 9384 passes.
Running PrimeCPP on my iMac with a 10700K CPU results in:
Passes: 8607, Time: 5.000000, Avg: 0.000581, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
You have no idea how apt the drag racing analogy is. I've been working on my own cars for more than 40 years. I know my way around an engine. But the idea of tearing down and rebuilding a 10,000 HP engine in 45 minutes is basically sci-fi to me. Similarly, I've been playing with computers since my folks bought us an Apple IIe back in the mid '80s. But what you do here is basically voodoo. Sure, I understand the concepts. It's the depth and breadth of the minutia that impresses me. Fun stuff.
Great work once again, Dave!
The subtitles are helpful, especially because I watch at 2x speed. There were a couple places where they were missing. I remember one when you were talking about BTR in the beginning, and one when you were talking about the bugs found in your code.
Edit: and the entire Python apologetics chapter
99.2 K Subs as I type this! You found your groove and your channel is growing nicely! I remember (as it was not so long ago) joining when your sub count measured in the hundreds. I do hope that you will continue to feature automotive content and tech projects as well. Well done, Dave!
Sorry about your stroke, Dave. Rapid recovery! 😁
You and CuriousMarc are my favorite YouTubers right now
the Threadrippers and zen2 in general are such beasts man.
10:23 Nice of you to have mentioned the std::vector thing, that was discussed in some comments of the previous video.
It would be interesting to see whether its template specialization in your STL implementation was done actually with bitfields (and if so, what are the differences compared to your bitfield manipulation), or using actual 1-byte bools (that would be then byte-aligned)...
becoming one of my favorite channels.
Dave, I can't program anything more advanced than a PLC, but whenever a page with your videos loads, I hit the thumbs up regardless, as you always increase my understanding of stuff I have no knowledge of. Thank you!
Code a prime calculator in ladder logic ;)
@@stonent I do most of the stuff in FB, but point taken lol
I smashed the thumbs-up button. I couldn't argue with your logic.
You smashed it? Do I sound like Peter McKinnon? You can just lightly click it. But I thank you nonetheless!
Mr. Dave, you're one of the best content creators that I've had the pleasure to find on YouTube
Dave, I am really enjoying your videos! I am currently studying Computer Science in school and hope to pursue a career in programming and your videos are inspiring me to continue my pursuits!
Really entertaining - the right balance of tech with humor i enjoy - and always stay for the outtakes - Thanks Dave
Glad you enjoyed it!
Dude, well done.
Hey, thanks!
I love that your terminal window is blue with light grey text.
Nice information; glad you brought up that Python isn't the answer to all code. Lately, with all the "do it in Python" rants in a lot of the developer areas, it's nice to hear "use the language that makes sense for the task at hand." Thanks again!
Some coders want everything available in the language they already know. That's how we got the do it all in Python crowd and do it all in JavaScript crowd as well.
I do heaps of programming with deep learning, sometimes Web server logic, etc. A lot also includes prototyping, so my calculations of "speed" always include how long I need to code.
Sure, had I written my code in pure C/C++/etc., it probably would have been 100 times faster than it is now. But I need to get stuff done instead of obsessing on how low-level I can get. Had I done that, I would probably have finished 10% of my work shortly before retirement in a couple of decades.
It's perfectly sensible that there are languages on so many levels (no pun intended). No point in starting a war over _that_, too.
Except for R. This just sucks. ;)
Yay, I don't know why a video like that makes me this happy
I'm really getting a lot out of your content, Dave. Many thanks.
Hi, would be nice if the github url was mentioned in the description. Otherwise nice episode.
Looks like he fixed that.
Thanks @DavePL, there goes a few hours on my long weekend playing with this :) Great content BTW now one of my favourite channels.
I was about to go and write GoLang, PHP, and Pascal implementations, then I saw all the existing implementations and now I'm not sure it's worth just being another "me too" :)
Interestingly the CPP versions of this achieve 4820 on my super old i7-870. FYI I achieved 8221 on my i9-9900K
Dave, I love the content and the upvote is worth it just because you bothered to make chapter markers in this video!
Dang! That's just peachy, a (former) Microsoft employee has forced me to upgrade once again. I just upgraded to a subscriber.😁 Thank you for the great content.
For giggles I ran Dave's code on my computers here. The Windows boxes (Ryzen 5 and Intel i5) run g++ in Debian under WSL2; the other machines run Debian or Raspberry Pi OS on bare metal. To be honest, I'm very impressed with the Ryzen.
AMD Ryzen 5 3600X 6-Core Processor => Passes: 9605
AMD Athlon(tm) II X3 460 Processor => Passes: 3642
Intel(R) Core(TM) i3-4005U CPU => Passes: 2551
Intel(R) Atom(TM) CPU N270 => Passes: 911
Intel(R) Atom(TM) CPU N450 => Passes: 871
Raspberry Pi 3 => Passes: 764
This will be of no interest to anyone but a Pi 1 Model B (from 2012) achieves a score of 97
It makes me happy to know! Thank you for sharing!
This is so detailed and nerdy. I love it!
I assume if you went around saying "You had the best tool", you may in fact be THAT tool. Great info and done with a sense of humour, logic and pragmatism that seems to be a rarity these days. Keep it up.
The bloopers got me! Whole ep of gag reel please lololololol
Great channel Dave, lots of great info. Hope you can help the folks porting Windows to the Raspberry Pi with your knowledge.
I gave that feedback about talking speed, and he kept that in mind 😀. Hats off sir.
Glad you liked it! I'm always paying attention and trying :-)
The video on compiler performance should be interesting. I'm getting 9000-10000 on my 3600 with gcc and clang (clang a little bit faster), while it shouldn't be that much faster than the 3970X in single core.
One interesting thing is that replacing vector&lt;bool&gt; with a pretty simple bitset gives a pretty good speedup in clang, while not in gcc.
Nice video, Dave
I think due to thermals the Threadripper runs slower per core while having more of them, so the higher boost clock gives you an advantage.
Even though I am currently swinging in a hammock in front of a volcano in Costa Rica, I could not miss a Dave's Garage premiere.
Living the dream!
I may be joining you, Liberal Lunatic Free Zone...
The .exe extension at 6:36 does reveal your Windows roots..
Well presented and articulated though, as always.
Great job!
I've watched so many of your videos that I was amused that I was not already subbed. Well I fixed that bug. Speaking of bugs, could you do a video about all the rare bugs you know about? Always found that fun.
Dave you rock! I love your channel!!
Enjoy the channel. Good stories and random bloopers. Cheers!
Got ~10k on an old 6600k and was sort of surprised, but in the end it makes sense as it's a single core workload. Great video.
3 haters who don't have any clue what he's talking about. I mean, I know what he's talking about but don't know how to do it... but I don't hate. Thanks for entertaining content!
1:16 Hell yes! Thumbs up and subscribed right away. You manage time very well in all videos i have seen so far.
I thought I had a stroke when I saw Cascade working on my shared control system in 1988, maybe 90. It was so funny it deserved to get shared.
I would love to see a drag race between C++ and Rust!
Appreciate your effort to include subtitles in an informative video like this. You talk like a C program running on the newest CPU, while my brain is a Pentium 3 running Java that's constantly overheating.
At 10:03, your test of index % 2 == 0 vs index & 1 == 0 only makes a difference if you are running in debug, not in release mode, as release mode will always compile SomeVariable % 2 == 0 to the more optimized version (i.e. not use modulo explicitly, as it is a very costly operation in relative terms).
For the record, gcc and clang won't use modulo explicitly in debug builds if index is unsigned; msvc will. However, if index is signed, msvc and gcc won't use modulo, but clang will.
@@pikachulovesketchup666 Of course all compilers do it; my observation was simply about debug vs release builds, and as Nathan showed, that's not the entire story today.
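For anyone wondering why the compiler can swap these forms freely: the two predicates agree for every integer, including negatives under two's complement, so optimizers treat them as interchangeable. A tiny sketch:

```cpp
// Two ways to ask "is n even?" — with optimizations enabled, mainstream
// compilers emit the same AND-based code for both forms.
bool even_mod(int n) { return n % 2 == 0; }   // modulo form
bool even_and(int n) { return (n & 1) == 0; } // bitwise form
```

Note the negative case: in C++, -3 % 2 is -1 (not 0) and -3 & 1 is 1, so both predicates still return false and the equivalence holds.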
> When it comes to gaming and certain other workloads, that [single-core performance] is the reality of what matters
Luckily that's been slowly changing since Moore's law broke down and CPU manufacturers started adding more cores! There are going to be workloads that can never be parallelized, but luckily there's a lot of low-hanging fruit for typical applications to add parallelism.
Much love and appreciation from the Italian computer science UA-cam community!
Love this follow-up to the first SW drag race video...and we get bloopers! Great work Dave (and production staff?) :)
Just me and a couple of shop dogs! Maybe at 200K I can hire a student editor :-)
ARM is a load-store ISA, but presumably Apple did something for x86 emulation that allows it to operate in a register-memory manner. Not sure if that applies to native ARM code or not. ARM definitely has some bit-twiddling instructions; I'd be a little surprised if the compiler is generating shifted bit masks and ANDs for your bit test.
For the scalar pipeline, ARM's 32-bit ISA had predication, but it looks like aarch64 dropped that complexity. What you really want to maximize your integer throughput is something that auto-vectorizes (or explicitly vectorize it yourself with the NEON intrinsics). Of course, if I remember from your last video, this code has integer division in it, which takes a huge performance hit on all architectures in terms of latency. x86 and ARM both lack vectorized division due to the ridiculously complicated amount of gyrations that have to occur in the ALU for it.
That having been said, I haven't finished your video yet, I'm only 5 minutes in. I'm curious how this goes.
My time feels valued
love it, was curious about the M1...don't have one...not in a hurry to get one...but curious where Apple is headed with it. Looking forward to your compiler comparison. Also something I don't get to look at much...in my world it's visual studio...and you live with it. But I know from prior experience that is not the only game out there.
Naming the output .exe is well played ,)
Love your videos Dave all the way from uk
17k views and over 4k likes.
Such a stroooong like to view ratio. You're going very strong here, Dave!
Best of luck!
Neat. Somehow, _all_ the results were actually impressive.
The lowly Pi 3 is impressive for how narrow the delta actually is between cheapest possible self-contained computer and a TOTL desktop CPU.
The Pi 4 for how much tighter that gap is.
The M1 for being a brand new product with the slider pegged dead in the middle between "optimized for low power" and "optimized for high performance."
And, of course, the Threadripper for having the biggest 🥜 of just about any CPU available. haha
Controversial/manipulative test. Changing vector&lt;bool&gt; to vector&lt;char&gt; puts the M1 on par with the TR, and with a C array it scores 13.5k (a C array improves the result for x86 as well).
vector&lt;char&gt; wastes 7 bits per byte and can't accommodate the larger sieves though!
@@DavesGarage True. But the std::vector&lt;bool&gt; implementation appears to be suboptimal for ARM systems. I wonder how it would work with manual bitmasks. I would expect a ~2x improvement for the rPi.
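To make the trade-off in this thread concrete: a byte-per-flag sieve does plain loads and stores with no shift/mask work, at eight times the memory of a packed bit array. A rough sketch for illustration, not the repo's actual code:

```cpp
#include <vector>
#include <cstddef>

// Byte-per-flag sieve: each entry occupies a whole char, so marking and
// testing are single loads/stores with no bit shifting or masking,
// at the cost of 8x the memory of a packed vector<bool>.
std::size_t count_primes_bytes(std::size_t limit) {
    std::vector<char> composite(limit + 1, 0);
    std::size_t count = 0;
    for (std::size_t p = 2; p <= limit; ++p) {
        if (composite[p]) continue;
        ++count;                                 // p is prime
        for (std::size_t m = p * p; m <= limit; m += p)
            composite[m] = 1;                    // mark multiples of p
    }
    return count;
}
```

Whether the byte version wins depends on how much of the sieve fits in cache; for the 1,000,000 limit used in the video it still returns the expected 78,498 primes.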
Good stuff. Saskatooner here
Dunno if Rosetta Code already has a prime sieve implementation or not, but this could make a fine addition.
Watching you never gets old.
Great video! I'd like to see a CPP vs Rust vs Go showdown
The highest result I saw with the code from this video is 7929 on a Ryzen 9 3900X with g++, 7877 with clang. With the repo code it was 10191 (g++) and 10684 (clang). In theory the Ryzen 7 5800X should be about the fastest.
got 8908 on my R7 5800x
but using the standard run.cmd from the github
Just a few years ago (seems that way anyway) single-core IPC would be the main thing to look at regarding games, but I have recently started playing PC games again and nearly all of them either use multiple threads/cores or in some cases require them. Multithread/multicore performance is more important to games now.
My guess? Any RISC instruction set is probably not going to perform its best in this workload, which is very load/store intensive. Stuff that chews on a few registers is where ARM/MIPS/RISC-V shine, not striding over a list where memory accesses are separate instructions.
I wonder how the raspis would fare when overclocked, would love to see that comparison as well.
Also, your speech became much easier to understand since the blue screen video, so that's nice for all non-native English speakers (:
I wish you would have just added one more benchmark. It could have been the same problem to solve, only allowed for multi threading. That way, we could also speculate as to what performance degradation we could (maybe/maybe not) see from the M1 doing what it does.
Thank you Dave for sharing this video.
Closing in on 100k!
Love the increase in editing quality
Thanks! Not sure which ones you mean compared to, but at least it's the right direction!
It would be interesting to include x64/Rosetta vs. arm64/native on the M1...
I agree the M1 is definitely doing something interesting for x86 emulation, though it appears to be just adding hardware support for strong memory ordering when running code intended for the x86, which given the cache heavy nature of this benchmark probably wouldn’t have much effect.
This was the straw that broke the camel's back in favour of me buying an M1 Mac after a decade of netbooks and secondhand business laptops from Japan. The high performance with long battery life and low heat output got me close, but not close enough to fork out the $$$ until I saw even the x86 emulation was sometimes faster than on x86 hardware.
@@andrewdunbar828 what makes Dave’s tests here interesting is that the M1 is a Laptop CPU... the Threadripper is a Desktop CPU. It will be fun to see what Apple do in the Desktop space with their ARM implementation!
@@blooddude I might be mistaken, but from what I know the ARM based architectures don't scale that well.
@@blooddude if anything
Hello!
I figured it out I think.
The M1's performance cores run at 3200 MHz; the AMD 3970X runs at 3700 MHz.
If we assume the code is written (I have not looked at it yet) in such a way that the compiler simply can't optimize it, because each instruction depends on the result of a prior instruction (hope that makes sense), each CPU scores roughly 2 passes per MHz.
So the M1 would score 3200*2 = 6400 and AMD 3970X would score 3700*2 = 7400.
It lines up quite nicely, but it does not test the difference in architecture in a good way.