Software Drag Racing: M1 vs ThreadRipper vs Pi
- Published Jun 1, 2024
- Dave pits a new Apple Silicon M1 against an AMD ThreadRipper 3970X while a Pi3B+ and Pi4 try to tag along! See the surprising results and the reasons behind them in this episode of Dave's Garage Software Drag Racing.
Code for this project is available here:
github.com/PlummersSoftwareLL...
0:00 Start
2:50 Single Core Workloads Defined
4:00 BT and BTR Instructions
5:50 Install C, C++ etc
6:20 PI3B+
8:00 PI4
8:35 PIs compared
9:00 M1
11:08 Spoiler results
11:20 Github details
13:00 Python Apologetics
14:30 Closing
This channel makes me happy.
I am not even a programmer but somehow listening to Dave is interesting and calming at the same time
Totally! 🥰
Me too
He made me appreciate windows .... that says alooooot
It's so calming, entertaining, educational and just plain fun.
Thanks for giving your time freely to play with this sort of stuff.
UA-cam is an amazing medium for us mortals to engage with interesting people like yourself.
Keep up the great work 👍
I don't usually notice background music without hating it but I think you found the right balance of musical complexity and intrusiveness
So I changed the vector to std::array, and got ~13000 passes on my m1 air. Fyi, it was ~4500 passes with vector.
that probably would lead it to also be faster on the other implementations, since it becomes static memory
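Following up on the std::array swap above, here's a minimal sketch (not the repo's actual code) of a sieve over a fixed-size std::bitset, which similarly keeps the bit array out of heap-allocated std::vector storage; the 1,000,000 limit matches the benchmark's Limit field:

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical sketch: a classic sieve of Eratosthenes over a std::bitset,
// whose storage is fixed at compile time rather than heap-allocated.
constexpr std::size_t LIMIT = 1'000'000;

std::size_t count_primes()
{
    static std::bitset<LIMIT + 1> composite;  // ~122 KB, static storage
    composite.reset();                        // all numbers start as "prime"
    std::size_t count = 0;
    for (std::size_t i = 2; i <= LIMIT; ++i) {
        if (!composite[i]) {
            ++count;
            for (std::size_t j = i * i; j <= LIMIT; j += i)
                composite[j] = true;          // mark multiples as composite
        }
    }
    return count;
}
```

For the benchmark's limit of one million this should count 78,498 primes, matching the Count1/Count2 fields in the results posted throughout this thread.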
Dave you talk in perfect speed. For once I don't have to speed up the video I'm watching 🤣🤣
That's funny ;-). Yup, I default to 1.25X I think!
Agreed
Zoomers
Hahah :D
I watch these at 2x speed. But then again, I watch most others at 3x.
becoming one of my favorite channels.
This guy is what UA-cam should be
Thanks for the kind words!
- Tell me you're a Windows developer without saying "I'm a Windows developer"
- OK.exe
./no
@@mek101whatif7 imma need you to $rm -rf / right tf now
windefproc
Exactly what I came to the comments for
@@jabalahkhaldun3467 Useless unless you --no-preserve-root
Great video, as always! Maybe another metric to consider: price per pass? :)
For example: the Pi 3B+, $35/305 ~ $0.11/pass
And Watts consumed per pass ;-)
@@donaldklopper yeah, outlay is usually nothing compared to power, in an industry. Outlay is usually only an issue for home and small businesses that let equipment sit idle 99.99% of the time, even while "working".
I smashed the thumbs-up button. I couldn't argue with your logic.
You smashed it? Do I sound like Peter McKinnon? You can just lightly click it. But I thank you nonetheless!
Hi, would be nice if the github url was mentioned in the description. Otherwise nice episode.
Looks like he fixed that.
I LOVE the fact you talk at a nice, normal pace. There are some channels I watch at 1.5x speed just to get them to talk at a normal pace.
Neat. Somehow, _all_ the results were actually impressive.
The lowly Pi 3 is impressive for how narrow the delta actually is between cheapest possible self-contained computer and a TOTL desktop CPU.
The Pi 4 for how much tighter that gap is.
The M1 for being a brand new product with the slider pegged dead in the middle between "optimized for low power" and "optimized for high performance."
And, of course, the Threadripper for having the biggest 🥜 of just about any CPU available. haha
This will be of no interest to anyone but a Pi 1 Model B (from 2012) achieves a score of 97
It makes me happy to know! Thank you for sharing!
Thanks Dave I am a Software Engineer, just graduated from college and am starting out. I love your content. I once had a professor who said "Programming is wizardry, and programmers are wizards." Someday I hope to be as great a wizard as you buddy.
All the best 👍 Per (DK)
For giggles I ran Dave's code on my computers here. The Windows boxes (Ryzen 5 and Intel i5) run g++ in Debian under WSL2, the other machines run Debian or Raspberry Pi OS on bare metal. To be honest, I'm very impressed with the Ryzen.
AMD Ryzen 5 3600X 6-Core Processor => Passes: 9605
AMD Athlon(tm) II X3 460 Processor => Passes: 3642
Intel(R) Core(TM) i3-4005U CPU => Passes: 2551
Intel(R) Atom(TM) CPU N270 => Passes: 911
Intel(R) Atom(TM) CPU N450 => Passes: 871
Raspberry Pi 3 => Passes: 764
M1 is still very impressive for a very new product in its first life cycle. Also, factoring in the power consumption makes it look even more impressive.
Also, cost makes it impressive for its performance: you could get almost 3 Mac minis for the cost of just the Threadripper chip.
@@michaelhenecke the threadripper is a server chip, no person needs that many cores
@@jan-lukas Yep, and we can get a decent gaming laptop with mac price
Appreciate your effort to include subtitles in an informative video like this. You talk like a C program runs on the newest CPU, while my brain is a Pentium 3 running Java which is constantly overheating.
This was unexpected. I ran the CPP code in a WSL 2 terminal running Ubuntu. The CPU on the box is an AMD Ryzen 3800X running at stock speeds. And still, it outpaced the Threadripper. The first run turned in a score of 9622!
Passes: 9622, Time: 5.000000, Avg: 0.000520, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
I wrote a multithreaded solution to prime number generation in C++ a few months ago, it's actually not too hard to implement. Would be interesting to see how much the threadripper outpaces the M1 when you use all the cores lmao and would perhaps be a good next-step up from this.
Single thread performance is still super important. So much software is single threaded.
@@tommcintosh4705 Sure, it's important, but it's not more important than multithreaded performance. Things that tend to take a long time (e.g. compilation, 3d rendering, encoding video files, etc.) also tend to benefit from multiple threads, plus with more threads you can run more software concurrently (e.g. even if most software _was_ single threaded, being able to run more of it simultaneously could be a huge benefit).
Also, all current implementations of x86 have SMT: an optimization around the weakness that it has in purely single-threaded workloads, by allowing a single core to do a bit more than one thread's worth of tasks at once (essentially, a lot of the core's resources are left idle by its design, and that idle portion can be used to execute another thread at the same time). The M1 specifically has a relatively large advantage in that _one_ aspect, but essentially you're handicapping x86 by not letting it use its benefits as well.
Based on that, it's pretty misleading to show off single-threaded performance and act as if it's _that_ important of a metric.
Edit: to be clear, I'm not saying Dave is being misleading here, but that Apple's sudden surge of "hey, check out the single-threaded performance of our M1 part and see how powerful it is, also do benchmarks with single threads plz thx bye" is misleading and the fact it's worked: many people are suddenly trying to come up with super synthetic benchmarks that show off this weakness of x86 and push it as a huge problem, when it is typically _not_ that huge of a deal in practical usage.
@@tommcintosh4705 Well yeah
@@tommcintosh4705 I find that very little software is still single-threaded nowadays. Even games which are often very intensive on a particular single thread are usually multithreaded.
@@nephatrine Yup, no matter how much optimization you do on single-threaded code, it'll be hard to beat just spawning a crap ton of threads, even with bad optimization (if you can, that is).
I recently had some .NET code run on a single thread for almost 50 minutes (and that was optimized), but running it on 12 threads got it below 5 minutes. Try doing that on 1 core, I dare you. (Also, later I got it running on my GPU using OpenCL, and it ran the same task in under 10 seconds XD)
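The "just spawn threads" pattern from this thread can be sketched like so — a deliberately naive trial-division prime counter (not the benchmark's sieve) split across workers by interleaving candidates; the function names and thread-splitting scheme are illustrative, not from the repo:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Naive primality test by trial division -- slow on purpose; the point
// here is the threading pattern, not sieve performance.
static bool is_prime(std::size_t n)
{
    if (n < 2) return false;
    for (std::size_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Split the range [2, limit] across num_threads workers, each taking
// every num_threads-th candidate, and sum the per-thread counts.
std::size_t count_primes_parallel(std::size_t limit, unsigned num_threads)
{
    std::atomic<std::size_t> total{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([=, &total] {
            std::size_t local = 0;   // accumulate locally to avoid contention
            for (std::size_t n = 2 + t; n <= limit; n += num_threads)
                if (is_prime(n)) ++local;
            total += local;
        });
    }
    for (auto& w : workers) w.join();
    return total;
}
```

Interleaving candidates (rather than handing each thread a contiguous chunk) keeps the per-thread work roughly balanced, since trial division gets more expensive for larger numbers.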
I run a Ryzen 1600 (14 nm version no OC 3.2-3.6 GHz clock speeds). And I got this result with g++ -Ofast
Passes: 8427, Time: 5.000000, Avg: 0.000593, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
I would expect it to be a lot lower.
Got a similar result on my AMD Ryzen 7 4800H with Radeon Graphics, no OC in a Laptop.
Passes: 9840, Time: 5.000000, Avg: 0.000508, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
I got 8200 passes on Ryzen 3600X but compiled with MSVC. WTF?
Using clang in Ubuntu 21.04, my Ryzen 4750GE w/o overclock:
Passes: 10777, Time: 5.000000, Avg: 0.000464, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
I was wondering what was going on; glad to see I'm not alone. 3900X @ 4.2GHz all-core OC -> Windows 10 -> VirtualBox VM running Mint 20.1 = 9384 passes.
Running PrimeCPP on my iMac with a 10700K CPU results in:
Passes: 8607, Time: 5.000000, Avg: 0.000581, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
My time feels valued
Sorry about your stroke, Dave. Rapid recovery! 😁
That's the comparison that we needed but didn't know it!
I just ran across your channel a week ago, and I'm really enjoying hearing your take on different programming issues! I used to work out the details of an algorithm using whatever scripting language was available on the platform, and once i had a solid plan, I would go back and rewrite it using C or FORTRAN or whatever else. This proved an effective way to cook up some great code that could do the job. Thanks for all of the great comments during your videos!
I'm a simple man. I see Dave drop a video, I watch it. It's really not complicated. You're a legend dude 👏
I appreciate that!
Mellow piano music, sparkly lights.. new Dave's Garage episode! ... It feels like Christmas! Dave thank you so much.. as always, top notch content.
Dude, well done.
Hey, thanks!
This is bloody brilliant.
Also, the fact that Nano was used as the editor made my day. Kudos to you sir!
nano is so nice :D
@@bobbydazzler6990 No.. masochists!
It'll be interesting to plot the same chart but divide by Watts used by the CPU.... Surprising results...
And you mentioned Turbo Pascal! I like you.
You have no idea how apt the drag racing analogy is. I've been working on my own cars for more than 40 years. I know my way around an engine. But the idea of tearing down and rebuilding a 10,000 HP engine in 45 minutes is basically sci-fi to me. Similarly, I've been playing with computers since my folks bought us an Apple IIe back in the mid '80s. But what you do here is basically voodoo. Sure, I understand the concepts. It's the depth and breadth of the minutia that impresses me. Fun stuff.
The showdown of the decade
Thanks for this episode. Looking forward to seeing how different compilers perform.
at 10:03 your testing of index % 2 == 0 and index & 1 == 0 only makes a difference if you are running in debug, not in release mode, as release mode will always compile SomeVariable % 2 == 0 to the more optimized version (i.e. not use modulo explicitly, as it is a very costly operation in relative terms).
For the record, gcc and clang won't use modulo explicitly in debug builds if index is unsigned; msvc will. However, if index is signed, msvc and gcc won't use modulo, but clang will.
@@pikachulovesketchup666 of course all compilers do it; my observation was simply about debug vs release builds, and as Nathan showed, that's not the entire story today.
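A small sketch of the two forms this thread is comparing (illustrative only, not Dave's actual code). With optimizations on, compilers typically emit the same test-the-low-bit code for both. One trap worth knowing: in C++, == binds tighter than &, so the unparenthesized `index & 1 == 0` parses as `index & (1 == 0)`, which is always zero — hence the parentheses below:

```cpp
// Both predicates ask "is index even?". Release builds compile the
// modulo form down to the same bit test as the explicit form.
bool is_even_mod(unsigned index) { return index % 2 == 0; }
bool is_even_bit(unsigned index) { return (index & 1) == 0; }
```

Using an unsigned index matters for the modulo form: with a signed index the compiler must handle negative operands, which is part of why the thread above sees different codegen between signed and unsigned.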
I love that your terminal window is blue with light grey text.
We do like charts!
Congrats, you are the first youtuber who convinced me to click on the Like button upfront.
ikr
Thanks for making these, as a constantly learning programmer these are invaluable.
I really appreciate you and your channel. This is a great example of a proper benchmark
As a car/drag racing enthusiast and hardware engineer learning to code this was an excellent episode. Just subbed!
Great work once again, Dave!
The subtitles are helpful, especially because I watch at 2x speed. There were a couple places where they were missing. I remember one when you were talking about BTR in the beginning, and one when you were talking about the bugs found in your code.
Edit: and the entire Python apologetics chapter
Thanks for the quality content. This is both entertaining and educational.
Oh damn this is gonna get wild
I actually like the speed you talk at. Yours are the only videos I can watch at regular speed, instead of 2x like most others and 1.5x for everything else.
This has quickly become my favorite channel.
I love that we are mathing it up on different systems.
Dave, I am really enjoying your videos! I am currently studying Computer Science in school and hope to pursue a career in programming and your videos are inspiring me to continue my pursuits!
99.2 K Subs as I type this! You found your groove and your channel is growing nicely! I remember (as it was not so long ago) joining when your sub count measured in the hundreds. I do hope that you will continue to feature automotive content and tech projects as well. Well done, Dave!
I'm not a native English speaker, but I'm mostly OK with your speed, except in those rare circumstances where you talk native slurred American without much emphasis on words :D That's really hard to get for me. And yeah... a programmer that's been on the dark side now spilling the beans... how can I not subscribe
This is so detailed and nerdy. I love it!
I would love to see a drag race between C++ and Rust!
Thank you Dave for sharing this video.
Hi Dave,
thanks for producing this channel! Very enjoyable!
I ran PrimeCPP on my 5950X in WSL2:
Passes: 11267, Time: 5.000000, Avg: 0.000444, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
Passes: 11327, Time: 5.000000, Avg: 0.000441, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
Passes: 11346, Time: 5.000000, Avg: 0.000441, Limit: 1000000, Count1: 78498, Count2: 78498, Valid: 1
Cool! I've seen a 12000 as well from another viewer, but I think he was overclocked!
@@DavesGarage , User_Overclocked_Error - Only Machines Should Be Overclocked (0xB00B1377)
Great channel Dave, lots of great info. Hope you can help folks porting Windows to the raspberry with your knowledge.
Love this follow-up to the first SW drag race video...and we get bloopers! Great work Dave (and production staff?) :)
Just me and a couple of shop dogs! Maybe at 200K I can hire a student editor :-)
i'm aspiring to take my interest in tech further, and this channel is a reason for that!
This beats watching mindless TV. I am learning something about some thing I truly enjoy, computers.
Got ~10k on an old 6600k and was sort of surprised, but in the end it makes sense as it's a single core workload. Great video.
Nice information; glad you brought up that Python isn't the answer to all code. Lately, with all the "do it in Python" rants in a lot of the developer areas, it's nice to hear: use the language that makes sense for the task at hand. Thanks again!
Some coders want everything available in the language they already know. That's how we got the do it all in Python crowd and do it all in JavaScript crowd as well.
I do heaps of programming with deep learning, sometimes Web server logic, etc. A lot also includes prototyping, so my calculations of "speed" always include how long I need to code.
Sure, had I written my code in pure C/C++/etc., it probably would have been 100 times faster than it is now. But I need to get stuff done instead of obsessing on how low-level I can get. Had I done that, I would probably have finished 10% of my work shortly before retirement in a couple of decades.
It's perfectly sensible that there are languages on so many levels (no pun intended). No point in starting a war over _that_, too.
Except for R. This just sucks. ;)
the Threadrippers and zen2 in general are such beasts man.
I've watched so many of your videos that I was amused that I was not already subbed. Well I fixed that bug. Speaking of bugs, could you do a video about all the rare bugs you know about? Always found that fun.
Thanks!
10:23 Nice of you to have mentioned the std::vector thing, that was discussed in some comments of the previous video.
It would be interesting to see whether its template specialization in your STL implementation was done actually with bitfields (and if so, what are the differences compared to your bitfield manipulation), or using actual 1-byte bools (that would be then byte-aligned)...
Really entertaining - the right balance of tech with humor i enjoy - and always stay for the outtakes - Thanks Dave
Glad you enjoyed it!
Dang! That's just peachy, a (former) Microsoft employee has forced me to upgrade once again. I just upgraded to a subscriber.😁 Thank you for the great content.
Watching you never gets old.
I thought I had a stroke when I saw Cascade working on my shared control system in 1988, maybe 90. It was so funny it deserved to get shared.
Wish I hadn't given a thumbs up for this video... because now I can't give it a thumbs up anymore 😊 Nice vid!
The .exe extension at 6:36 does reveal your Windows roots..
Well presented and articulated though, as always.
Great job!
The bloopers got me! Whole ep of gag reel please lololololol
Even though I am currently swinging in a hammock in front of a volcano in Costa Rica, I could not miss a Dave's Garage premiere.
Living the dream!
I may be joining you, Liberal Lunatic Free Zone...
Dave, I love the content and the upvote is worth it just because you bothered to make chapter markers in this video!
Would be cool to see an optimized version of a wasm and Node benchmark in addition to the vector optimizations you made to the CPP benchmark!
3 haters who don’t have any clue what he’s talking about. I mean I know what he’s talking about but don’t know how to do it...but I don’t hate Thanks for entertaining content!
1:16 Hell yes! Thumbs up and subscribed right away. You manage time very well in all the videos I have seen so far.
Woo drag racing, yeah!
ARM is a load-store ISA, but presumably Apple did something for x86 emulation that allows it to operate in a register-memory manner. Not sure if that applies to native ARM code or not. ARM definitely has some bit-twiddling instructions; I'd be a little surprised if the compiler is generating shifted bit masks and ANDs for your bit test.
For the scalar pipeline, ARM's 32-bit ISA had predication, but it looks like aarch64 dropped that complexity. What you really want to maximize your integer throughput is something that auto-vectorizes (or explicitly vectorize it yourself with the NEON intrinsics). Of course, if I remember from your last video, this code has integer division in it, which takes a huge performance hit on all architectures in terms of latency. x86 and ARM both lack vectorized division due to the ridiculously complicated amount of gyrations that have to occur in the ALU for it.
That having been said, I haven't finished your video yet, I'm only 5 minutes in. I'm curious how this goes.
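For readers who haven't watched the BT and BTR chapter yet: here's a hedged, portable sketch of what those x86 instructions do — probe or clear one bit in a packed bit array. The helper names are made up for illustration; optimizing compilers can lower these shift-and-mask patterns to single bit-test instructions on ISAs that have them:

```cpp
#include <cstdint>
#include <vector>

// Test bit i of a packed 64-bit-word bit array (software analogue of BT).
inline bool bit_test(const std::vector<std::uint64_t>& bits, std::size_t i)
{
    return (bits[i >> 6] >> (i & 63)) & 1u;   // word = i/64, bit = i%64
}

// Clear bit i (software analogue of BTR, "bit test and reset").
inline void bit_reset(std::vector<std::uint64_t>& bits, std::size_t i)
{
    bits[i >> 6] &= ~(std::uint64_t{1} << (i & 63));
}
```

A bit-packed sieve spends most of its time in exactly these two operations, which is why the video singles them out when comparing architectures.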
great video, very interesting comparison and i love the jazz in the background
Glad you liked it!
Love your videos Dave all the way from uk
super interesting shtuff
Dave you rock! I love your channel!!
love it, was curious about the M1...don't have one...not in a hurry to get one...but curious where Apple is headed with it. Looking forward to your compiler comparison. Also something I don't get to look at much...in my world it's visual studio...and you live with it. But I know from prior experience that is not the only game out there.
Enjoy the channel. Good stories and random bloopers. Cheers!
Naming the output .exe is well played ;)
It would be interesting to include x64/Rosetta vs. arm64/native on the M1...
I agree the M1 is definitely doing something interesting for x86 emulation, though it appears to be just adding hardware support for strong memory ordering when running code intended for the x86, which given the cache heavy nature of this benchmark probably wouldn’t have much effect.
This was the straw that broke the camel's back in favour of me buying an M1 Mac after a decade of netbooks and secondhand business laptops from Japan. The high performance with long battery life and low heat output got me close, but not close enough to fork out the $$$ until I saw even the x86 emulation was sometimes faster than on x86 hardware.
@@andrewdunbar828 what makes Dave’s tests here interesting is that the M1 is a Laptop CPU... the Threadripper is a Desktop CPU. It will be fun to see what Apple do in the Desktop space with their ARM implementation!
@@blooddude I might be mistaken, but from what I know the ARM based architectures don't scale that well.
@@blooddude if anything
Great video and test!!
Thanks for sharing :-)
Good stuff.
You and curious Marc are my favorite UA-camrs right now
i like PI
and your channel
Great video! I'd like to see a CPP vs Rust vs Go showdown
Thanks @DavePL, there goes a few hours on my long weekend playing with this :) Great content BTW now one of my favourite channels.
I was about to go and write GoLang, PHP, and Pascal implementations, then I saw all the existing implementations and now I'm not sure it's worth just being another "me too" :)
Interestingly the CPP versions of this achieve 4820 on my super old i7-870. FYI I achieved 8221 on my i9-9900K
I'm really getting a lot out of your content, Dave. Many thanks.
Juhu! Don't know why a video like that makes me that happy
> When it comes to gaming and certain other workloads, that [single-core performance] is the reality of what matters
Luckily that's been slowly changing since Moore's law has broken down and CPU manufacturers have been adding more cores! There are going to be workloads that can never be parallelized, but luckily there's a lot of low-hanging fruit for typical applications to add parallelism.
Sub' earned by not dragging this out. 👍
My guess? Any RISC instruction set is probably not going to perform its best in this workload, which is very load/store intensive. Stuff that chews on a few registers is where ARM/MIPS/RISC-V shine, not striding over a list where memory accesses are a separate instruction.
Dave, I can't program anything more advanced than a PLC, but whenever a page with your videos loads, I hit the thumbs up regardless, as you always increase my understanding of the stuff I have no knowledge in. Thank you!
Code a prime calculator in ladder logic ;)
@@stonent I do most of the stuff in FB, but point taken lol
Just a few years ago (seems that way anyway), single-core IPC would be the main thing to look at regarding games, but I have recently started playing PC games again and nearly all of them either use multiple threads/cores or in some cases require them. Multithread/multicore performance is more important to games now.
Love your content 👏👏👏👏
First of all, I really enjoy the content you produce.
An idea for a topic: the y-cruncher program and its multi-threaded Pi calculation.