It was great meeting you in St. Louis! Your build was amazing, and I'm totally jealous of that 1.4 _TERABITS_ of bandwidth! The 512 GB of RAM on the Threadripper is a little ostentatious, but it's really cool to highlight both ends of current ARM 'cluster-in-a-box' solutions.
I'm all about minimal... love the ITX build on the Streacom ITX Bench chassis... I have one of these as well, and this is definitely giving me some ideas here. Would love a standalone system next to my workstation to run some Rancher K3s...
@@unknowntechio definitely love the packaging on it. Imagine having that on your desk and being able to tinker with it. So much power, such small footprint. And the energy efficiency, what if you ran it on solar energy?!
12:45 Use case: Super NAT box with network traffic analysis. NAT VPS is becoming more common, and what better way to do the NAT than a big, powerful NAT gateway. Keep the rest of the infrastructure management exactly the same as non-NAT. I am assuming you can push the network traffic via PCIe to the host node, so you could do traffic analysis for network optimization, monitoring, security, etc. Further, you can use the host node as the VPN gateway to access the NAT side, management, etc.
I'd really love a video with an actual installed software use-case scenario to help me understand what kinds of real work could be done on a DPU cluster. I haven't yet been able to wrap my head around what's possible from the general DPU conversation. What can I do today with that massive x86 and those 8 DPUs? Show us some live working use-cases! 😁😍
I have a good one, and it was my original inspiration early last September. Container orchestrator with a custom runtime to offload functions or calculations to the ISA most suited for the workload. For instance, one would use Cavium MIPS for network and crypto, x86 for general use, GPU for the obvious, SPARC for heat/time machine/death by Oracle, and Arm for.... Um... 8-bit retro gaming, farty tasks, bragging rights... Idk what Arm excels at offhand besides liliputing, size vs cost,
I have been part of projects that did simulated distributed environments in a box, for showing what's possible. This would be that, but with performance available. Yes, you could do it with VMs, but if you had an edge based solution that used ARM this would be a pretty cool way to do it.
@@morosis82 Right but what's the need for the Threadripper? I'm trying to wrap my head around the software configuration of this setup. Wouldn't they each run an ARMv8 Linux as a cluster? Or like you said, you could run KVM/ESXi etc. on the ASUS mainboard, but how are those cards presented to the hypervisor? Do they use a special DPU driver and QEMU handles architecture virtualization? I simply don't understand lol. Also I was looking at the DPU brochures and I don't fully understand what workloads these are meant for. And I don't know if $15k is a budget build for this sort of compute or not. Once again the $7,000 AMD chip seems pointless or overkill at the very least.
I am so impatient waiting for your fiber videos; I am starting to get really interested in the topic thanks to your coverage. Also good choice, this one with Jeff is a really good collab!
Hey Patrick! This pair of videos with Jeff made my day. I am jealous of both of you! One thing that would help us understand your project better would be some benchmarks, which you didn't do here or in the main article.
We used this platform for some of the Threadripper Pro coverage we did earlier this year so it seemed redundant benchmarking a CPU we have already done a lot with. On the DPU side, the UCSC / Sandia folks did a killer job also looking at the accelerator performance for crypto and such. That paper is so good that I am not sure what we would add other than "the A72 core is crazily slower than the Zen2 core." I mean, I can run Linpack on it, but we will have the Ampere Altra Max next week which is more interesting for performance Arm since those are N1 cores.
That is an awesome build. The only thing I would worry about is heat build up around the DPU cards. I would want some fans taking in cool air and blowing it on the DPU cards.
Jeff: man I want that node cluster
Patrick: man I want that node cluster
2 fellow nerds at different price points but still wanting what the other guy has
Totally could. The other way to look at it though is that you would need to buy ~20 of the Turing Pi 2 clusters to get similar capabilities, minus the networking. That was the point we were showing. A small version and a big version.
That's amazing, all those 100G ports on that little thing haha! You pretty well have the same fiber as a data center build-out! :) We have a bunch of 40G fiber going into our newest build-out along with some 100G.
It's not often you get to see what high-performance ARM can get to. :) What kind of power supply did that require? And have you measured peak draw on it?
The system right now is on an 850W PSU. I think at max these take 60-65W or so across the card, including optics. The 2.5GHz cards can go to 75W or something like that (sorry, cannot remember off the top of my head).
This is a cool build, and great to see the crossover with Jeff. Any chance that you could run some benchmarks on the Arm DPUs similar to what Jeff did on his?
Probably will do for the main site this week. The HPL is so bad on these lower-end Arm cores that the Threadripper does much better. In the meantime, the UCSC/ Sandia paper on BlueField-2 performance I think is done very well. arxiv.org/pdf/2105.06619.pdf
From the paper: "Individual Results Analysis: As we expected, the BlueField-2 card’s performance generally ranked lowest of all the systems tested except the RPi4." So, while not great, node for node, the STH Arm cluster should outperform Jeff's cluster in compute workloads.
5:38 Chad with a Tesla Cybertruck vs. Jeff with a puny Toyota Prius meme right here. Joking aside, thanks for this silly mis-comm; that was seriously funny.
you RANDOMLY meet him at an ARCH?? As in ARCH LINUX??? Are you trying to tell us something Patrick????? ....... .... ...... ....... ..... .... .. . yea I didn't think so
Nice cluster box. As a fellow competitive person, I think that I could easily beat it. Put an X12DPG-QT6 Supermicro motherboard in a 4U rack chassis box: 2 x 40-core Ice Lakes + 4TB RAM, 6 DPUs + 4-port 10G Ethernet. You could liquid cool almost everything, including at least 4 DPUs. Thanks for the video.
It can, if you leave the CPU out of it. (RDMA is AWESOME!) Even without the Threadripper Pro, on my old Intel Xeon E5-2690 (v1) cluster, in the IB bandwidth benchmark, I can top out at around 96-97 Gbps (out of a theoretical max of 100 Gbps) when running in RDMA mode. In actual application usage, I'm closer to around maybe 80-82 Gbps, but that is also HIGHLY dependent on the application that you're using (and how well MPI has been implemented in the application itself). When I benchmarked the network bandwidth using Fluent, I was able to get 10 GB/s max across four nodes (all of which are connected to a Mellanox MSB-7890 externally managed 36-port 100 Gbps EDR InfiniBand switch). (Sadly, alas, the X9 platform only has PCIe 3.0 x16, and the ConnectX-4 dual 100 Gbps VPI port cards are also only PCIe 3.0 x16, which means that the bus interface can't support two 100 Gbps connections at full line speed simultaneously.) But you can definitely get a lot closer to it than anything else. Ethernet on the VPI ports has between a 1-3% penalty vs. InfiniBand.
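A quick back-of-the-envelope check of that PCIe 3.0 x16 limit (a sketch; the per-lane figure is the nominal 8 GT/s rate after 128b/130b encoding, ignoring protocol overhead):

```python
# Can a PCIe 3.0 x16 slot feed two 100 Gbps ports at line rate?
# PCIe 3.0 runs 8 GT/s per lane with 128b/130b encoding.
lane_gbps = 8 * (128 / 130)   # ~7.88 Gb/s usable per lane
slot_gbps = lane_gbps * 16    # ~126 Gb/s for the whole x16 slot
print(f"PCIe 3.0 x16 ~ {slot_gbps:.0f} Gb/s")
print(slot_gbps >= 200)       # False: two 100 Gbps ports exceed the slot
```

So one port can run at full line rate, but driving both simultaneously is capped by the host interface, matching the comment above.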
Right. There is enough bandwidth assuming DMA is used, but it's somewhat close, so it might be a little limited depending on how much overhead there is. If you try to do actual CPU processing on the data, you will probably run into Infinity Fabric bottlenecks. PCIe 4.0 x16 has 31.5 GB/s = 0.252 Tb/s of bandwidth, so with 7 nodes the available bandwidth between the nodes and CPU is sufficient: 1.764 Tb/s. (These are all full-duplex links.) DDR4-3200 provides a data rate of 25.6 GB/s per channel, which times 8 channels gives 1.6384 Tb/s total. But I think the main intended use of the DPUs is to offload a lot of the processing to the ARM cores, so that the CPU doesn't have to actually process all the data.
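The aggregate numbers above can be reproduced in a few lines (a sketch using the same nominal 31.5 GB/s per slot and 25.6 GB/s per memory channel figures):

```python
# Aggregate host bandwidth across 7 DPU slots vs. 8 channels of DDR4-3200.
pcie4_x16_gbs = 31.5                               # GB/s per PCIe 4.0 x16 slot
pcie_total_tbps = pcie4_x16_gbs * 8 * 7 / 1000     # 7 slots, GB/s -> Tb/s
ddr4_channel_gbs = 25.6                            # GB/s per DDR4-3200 channel
mem_total_tbps = ddr4_channel_gbs * 8 * 8 / 1000   # 8 channels, GB/s -> Tb/s
print(round(pcie_total_tbps, 3))   # 1.764
print(round(mem_total_tbps, 4))    # 1.6384
```

The two totals are within about 8% of each other, which is why the comment flags memory bandwidth as the next thing to watch if the CPU touches the data.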
@@movax20h What do the Bluefields actually do? They're high throughput, but what is their purpose? And even then I don't understand why they are paired with the Threadripper. Would this be a machine learning solution, and that's the need for the bandwidth? I'm reading the STH article and it mentions crypto offloading, but are we talking about mining?
Well... if you can get your hands on a dual tower, like the TT Core W200, with the P200 add ons, you could probably run 2 of those rigs you have there, plus enough space to fit ... 64 arm boards? Depends on how nuts you go with it, but R-pi boards fit in drive slots real nice like...
This is a little bit overkill for my dream of having a Plex DVR cluster: record on one node, and use all the other nodes to transcode the videos from the massive MPEG2 files over to x265 using Tdarr or HandBrake. But the Turing Pi, even with the 6-core Jetsons, might not have enough power to transcode. Currently my 3950X needs supplemental support from a 4650G to keep up with recordings. Might need two Turing Pi V2s, each with 24 cores.
@@GeorgeWashingtonLaserMusket For one, the Jetson Nanos I have do have a hardware transcoder. However, the problem I have is that going from MPEG2 to H265 creates larger files than the original.
I've tested:
NVidia NVENC on Maxwell, Pascal, and Turing (IIRC Maxwell couldn't transcode to H265 at all, or couldn't do it from MPEG2)
AMD VCE on Vega, Vega II, Vega 2.5 (4000 and 5000 series Ryzen APUs), RDNA1
Intel QSV on HD4000, HD5000, HD530, HD630
Apple Video Toolbox on M1
ALL of them create larger files than the original, but only when going from MPEG2 to H265, whereas CPU encoding normally halves the file size, or does even better on low-grain black and white.
My 3950X is about as fast as my GTX 1650, though it uses 3x more power at 90W vs 30W. This is a sacrifice I'm willing to make to achieve my intended goal of not losing quality while significantly reducing file size.
I'm more than a little annoyed that there was both a supercomputer conference AND Patrick from STH in my city, and I didn't know about it until 2 weeks later.
All I can say is that I've been planning this build, albeit on a much smaller scale, since early September. I guess now we wait to see other ISAs get in the mix... or perhaps a DPU crypto offload challenge? IDK, but I'm a huge fan of efficiency and minimalist OSes. Let's have the parts of the application run on the silicon where they run best and not encumber the devs with sticking to one architecture. Now we need an ISA-aware Swarm Orchestrator....
Excuse me. I don't know anything about clusters, or how to build one. Although it does look like an interesting subject to further investigate. I wanted to ask one thing. DAMN, WHAT IS THAT ARCH IN THE INTRO MADE OF ? Oh my GOD, amazing.
Not too bad really. You are right, there is ducted airflow behind the cable cover, and they do need airflow. On the other hand, the chips themselves are sub-55W, since there is power budget for the RAM, optics, and such on the card as well. The networking is using most of the power on them.
You are totally right that they are hot, but we have run it for several weeks. These are the lower power ones and there is a chassis fan blowing over them.
I am not sure what you mean? The Top500 results include full cluster-wide networking. You would normally only run Linpack on CPUs with vector acceleration such as AVX2 on the AMD chips (or AVX512 on Xeons). There the Threadripper is around 1.5-1.6Tflops. So you are about 150-160 Raspberry Pi 4's just for the main CPU on this. On the DPU side, the A72 cores have more performance per core than the Pi's but are really there to accelerate crypto. As you scale small nodes, the interconnect networking ends up eating a lot more power and performance. That is why you cannot just take a number from this machine (or a small cluster) and compare it to a Top500 result, because scaling interconnects is such a big deal in terms of power and performance. We actually test a lot of the high-end 4-8 GPU supercomputer nodes, and high-density CPU nodes, so people would look at us funny in the industry if we made a claim that this cluster, or any small cluster, represented a part of a Top500 linpack run.
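The 150-160x figure checks out if you assume roughly 10 GFLOPS of Linpack on a Raspberry Pi 4 (an assumed ballpark for illustration, not a number measured in this video):

```python
# Rough HPL equivalence: Threadripper Pro vs. Raspberry Pi 4 boards.
threadripper_tflops = 1.55   # ~1.5-1.6 TFLOPS, per the comment above
rpi4_gflops = 10.0           # ASSUMED ballpark for HPL on a Pi 4
equivalent_pis = threadripper_tflops * 1000 / rpi4_gflops
print(round(equivalent_pis))  # 155
```

Note this compares raw per-node FLOPS only; as the comment explains, interconnect power and scaling losses make a real 155-Pi cluster far worse than this ratio suggests.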
@@favesongslist Yes, but it would have made his look a lot less efficient if it used 4x the power and cost 1.5-2x this entire cluster just to get the same performance as the x86 cores alone. That would 100% make sense to folks in the industry, as the RPi cores/boards are not made for Linpack-style workloads.
Had a few screenshots in the video. For the actual box, the fans spin up, a few blinking lights. Not much to it really. The side needs to be on to ensure there is enough airflow though. Hopefully when the new studio is finished we can do more with everything on. Good feedback.
@@ServeTheHomeVideo that one was nice. But I’d love to see and hear more about those fiber runs and equipment you mentioned in this video. Anyways, you are doing a great job and I really appreciate every video you put out.
Interesting video, even though I don't think I would choose this solution for my small project; I would reach for grown-up server hardware instead. But it's amazing to see where you can go with today's technology and look at a slightly unusual solution. Both gentlemen have interesting content, and it would be interesting to see a joint project ;)
I know it's going to be those Bluefield cards! Just one simple question: is the host sharing the 14x100G connections with the DPUs or are those ports exclusively for the DPUs?
Is there like an older DPU one could buy off eBay for cheap to build something? I have an X58 workstation board with like 8 PCIe lanes. Would love to try this.
To be frank, the BlueField-2 feels very rough. I am not sure if I would recommend getting an older BlueField 1 or something. My guess is that we will start seeing more used units next year when the new generations come out.
@@HavokR505 More how you would use these is to use them as distinct nodes. Imagine running ESXi, as an example, on the BlueField-2 DPU, then being able to use that to provision vSAN, the bare metal server, and handle all the networking via the DPU. So ESXi runs on the DPU instead of passing through the DPU to the host machine.
@@ServeTheHomeVideo As I recall from the video though, the cluster nodes don't have NVMe onboard or DAS (unless it's passed through from the AMD Threadripper Pro system). And the AMD Threadripper Pro system itself doesn't have a 100 Gbps connection that's tied/directly connected to it. So it would be interesting to see how you might deploy an NVMe-oF solution if the DPUs don't have any onboard or direct-attached NVMe drives and/or the NVMe storage has to be passed through to the DPUs. Like, I would understand if the DPU had either onboard or directly attached NVMe and then you could present that to the fabric. But I've never seen how you would do the same if you don't have any onboard NVMe nor any NVMe directly attached to said DPU. That's interesting.
Is there a budget-friendly ARM card? I have this weird dream where I can have Proxmox running a few VMs on x86, but I'd like to be able to spin up an Arm VM from the same machine. Does hardware exist to do this that isn't a £2000 DPU? Some kind of PCIe card that took compute modules would be ideal.
It might be cheaper to just run an RPi or a cloud instance for Arm. We are going to have more on the ASRock Rack mATX board for Ampere Altra in a few weeks. Also search for ASRock Rack ALTRA-NAS, which I hope to do in Q1 2024.
@@ServeTheHomeVideo It would definitely be cheaper, but it wouldn't be as cool. I remember you could get an 086 co-pro for a BBC Micro, and with a few cunning commands your trusty BBC would suddenly be a PC compatible... Now I know it's a completely different era of computing, but I think it would be a lot of fun if my x86 machine could BE an Arm machine too, rather than just emulating one.
Why aren't the parts linked in the description? No overall breakdown of the prices of the parts either. What's the point of a video like this if the sources aren't linked?
Idle is ~200W with 14x CWDM optics and the 10Gbase-T ports lit up. The networking part uses a ton of power. Max not over 750W yet. In terms of noise, not silent, but nowhere near the screaming servers we normally test. There is a lot more that can be done like ducting a 120mm fan to the cards that you would want to do to keep noise down.
I think we have a 25GbE 48 port switch somewhere in the queue. Let me look into doing more of those 10G units. I think I saw some on the schedule for next year that the team is working on.
@@ServeTheHomeVideo Speaking of 25GbE switches, any word on when we might see stuff like a fanless homelab friendly 25GbE switch similar to the MikroTik CRS305 or 309? I picked up some used 25GbE NICs on ebay and unless I directly connect them I'm stuck with running them at 10 gig.
I would love to see an update to your cluster with:
1. Dual Ampere Altra 128C CPUs (with PCIe 5.0 upgrade)
2. Many more DPUs
3. NVIDIA GPUs
You can use 16 of the 32 PCIe 5.0 lanes on BlueField-3 to connect to an NVIDIA GPU with PCIe 5.0, or two NVIDIA GPUs with PCIe 5.0 to 2 x PCIe 4.0.
2 x Ampere Altra -> 12 BlueField-3 -> 12 NVIDIA GPUs (PCIe 5.0)
2 x Ampere Altra -> 12 BlueField-3 -> 24 NVIDIA GPUs (PCIe 4.0)
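The lane math in that suggestion works because each PCIe generation doubles the per-lane rate, so one Gen 5 x16 link carries about as much as two Gen 4 x16 links (a sketch with nominal transfer rates; both generations use 128b/130b encoding, so the ratio is exact):

```python
# Nominal per-lane transfer rates in GT/s.
gen4_lane, gen5_lane = 16, 32
gen5_x16 = gen5_lane * 16            # one Gen 5 x16 link
two_gen4_x16 = gen4_lane * 16 * 2    # two Gen 4 x16 links
print(gen5_x16 == two_gen4_x16)      # True
```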
Thanks Jeff! :-)
Aaaaaaaahhhh! Fan Girling in the corner here, the ultimate CS comic book crossover!
@@ServeTheHomeVideo Turns out the best order to watch them both is interleaving them ;)
I must say, this is by far the coolest, most awesome, and most unexpected collaboration of the year! Good job guys!
Thanks Bogdan. I was telling Jeff last night how much I liked his video (he shared it with me yesterday.)
Did I actually say "Intel cores" instead of AMD or x86? That is why I normally do not edit my own videos! Oops
Jeff Geerling lookin' bad ass. Wouldn't be surprised if he gets additional TSA screening.
I saw St Louis Arch and I immediately thought of Jeff Geerling. And then he appeared! This is awesome
Funny who you randomly bump into while on the road.
I like the industrial look of the build. Nothing beats the beauty of rows and rows of RAM and PCI-E cards.
Watched Jeff pretty much right away. What I loved is that he used his own software he wrote to control it. He's the man!
Always impressive how connected Jeff is. He's got more comments than Tom has Friends.
Yeah, he also has lots of comments at Linus Tech Tips.
Nothing. Those are just some fancy kungfu for show.
X2 for your question.
Awesome! Two of my most favorite YT creators finally met, it's like a crossover on a super hero movie. XD
I love this Patrick - Jeff meeting at the iconic Gateway Arch, St. Louis, MO, 2:05. Both passionate about their stuff.
I think we are going to have more on the STH main site too. Maybe a piece on the Fluke meters we are using to test the fiber.
@@ServeTheHomeVideo great!
It's a cluster in a box..... a cluster in a box baby
(Present to camera) "What a coincidence that I would meet you here!" Lol
Planned coincidences are the best kind!
:-)
You didn't have to do him like that. You really didn't have to bring a server grade build to a challenge with the raspberry pi guy.
That is a real worry. There are chassis fans blowing over the cards that are not shown well.
Patrick: "Let's do a li'l spinny", 4:51. I am so proud of that move of your cluster on the high stool. Table still yet to come though.
Yea. Table is an issue
I saw Jeff's videos and this one and I thought "oh, what a coincidence" xD
"wifi6" XD oh yes. i think i need this for my next youtube machine.
not overkill at all.
Amazing crossover 😊
Glad you enjoyed it
I was at Jeff's and was just waiting for this video. And voila... here it is.
We planned to have them come out at the same time today :-)
And this year's Oscar for best actor goes to... :D
Without this channel I would not know what I want for a career. Thank you!
You could buy several of his clusters for the cost of one of your nodes!
Patrick - I have 1.4 Tbps network bandwidth
Me - *crying in my 2.5gbps* 🤣
I'm waiting for this. Anyone else been doing the same?
Arm64, M.2 array, ECC, 10GbE, ZFS.
Cool collaboration with Jeff
At $15,000, it's still probably a better deal than an Ampere ARM server if you're only getting one.
Wow, what a crazy random happenstance to meet GeerlingGuy RIGHT THERE at The Arch. Amazing.
Also amazing that there was a Sony FX3 setup on a nearby garbage can recording and we both had mics on as well. Amazing!
Basically having the same fibre cabling structures in data centres or central offices. It surely is very convenient than placing switches everywhere.
Fantastic collab video!!!! Glad to see Jeff!!!
I think Jeff covers that a bit in his video mate. You can plug a 24 pin adapter and run that through a round tip adapter. Pretty cool.
I love the acting, it's a lot better than in a Marvel movie.
Ha!
It was a good competition 🔥
My next pfSense home router. ;-)
Patrick, you seem even happier than normal today bro.
I love getting to do these projects.
That is freakin awesome!!!!!
WOW!!! That thing is AMAZING!! (and Patrick's computer is nice too LOL) I only have ONE question: Can it run Crysis? :D
Proxmox Arm and RKE2, and Proxmox x86 + RKE2, and I'd be a happy nerd. Especially with an InfiniBand disk shelf/SAN.
You cheated, he said “none of this x86 stuff” :P but yeah, it is super cool. Clusters are fun!
Your enthusiasm is legendary!
Don't forget the PSP cores - you have plenty of arm cores on that ROME x86 processor itself - one per CCD 😂
This is red shirt Patrick, I’m surprised you didn’t modify anything like red shirt Jeff does :)
Even red sweatshirt Patrick!
insane rig
Wow, what a setup
Thank you for the link to the paper!
Jeff does not drive a Prius!
Lovely build and home fiber project.
Small note: You seem to be shooting 24fps footage and importing it into a 30fps video. This causes terrible judder! Remember to shoot in 30 or 60fps.
Awesome sauce
Looking at this and wondering if that CPU could actually route 1.4Tbps.
Maybe an idea for next project :D
The CPU itself would bottleneck around 150 Gbps. But the NICs have flow offload, and it is not too hard to use these features with standard tools.
Best for running Qubes OS for the big machine
Patrick... how many coffees do you have before the catchphrase? Ehehehehe. Keep up the good work :)
0. If I drink coffee before I do these, people say I speak too quickly. I usually only record when I am tired.
@@ServeTheHomeVideo WOW :)
Well... if you can get your hands on a dual tower, like the TT Core W200, with the P200 add ons, you could probably run 2 of those rigs you have there, plus enough space to fit ... 64 arm boards? Depends on how nuts you go with it, but R-pi boards fit in drive slots real nice like...
This is a little bit overkill to my dream of having a plex DVR cluster, record on one node, and use all the other nodes to transcode the videos from the massive MPEG2 files, over to X265 using TDARR or handbrake. But the TuringPi, even with the 6 core Jetsons, might not have enough power to transcode. Currently my 3950x needs supplemental support from a 4650G to keep up with recordings. Might need two turingpi V2, each with 24 cores.
@@GeorgeWashingtonLaserMusket For one, the Jetson Nanos I have do include a hardware transcoder. However,
the problem I have is that going from MPEG2 to H.265 creates larger files than the original.
I've tested:
NVIDIA NVENC on Maxwell, Pascal, and Turing (IIRC Maxwell couldn't transcode to H.265 at all, or couldn't do it from MPEG2)
AMD VCE on Vega, Vega II, Vega 2.5 (4000- and 5000-series Ryzen APUs), and RDNA1
Intel QSV on HD 4000, HD 5000, HD 530, and HD 630
Apple Video Toolbox on M1
ALL of them create larger files than the original, but only when going from MPEG2 to H.265, whereas the CPU normally halves the file size, or does even better on low-grain black-and-white content.
My 3950X is about as fast as my GTX 1650, though it uses 3x more power at 90 W vs 30 W. This is a sacrifice I'm willing to make to achieve my goal of not losing quality while significantly reducing file size.
This video was fantastic!
Thanks George! I hope you have a great day.
...but can it run Crysis 2 in 4K without DLSS?
Just kidding, this is awesome. I wanna see what you can do with an ultimate cluster like that.
Jeff stole Patrick's camera, we have the video evidence
I'm more than a little annoyed that there was both a supercomputer conference AND Patrick from STH in my city, and I didn't know about it until 2 weeks later.
Would there be a point in something like this, but with 4 Ghost/Beast Canyon NUCs?
That would be super cool as well. Look at the Intel VCAs too if you want to go down that path.
Thanks to both of you for some awesome content
Thanks for watching!
All I can say is that I've been planning this build, albeit on a much smaller scale, since early September. I guess now we wait to see other ISAs get in the mix... or perhaps a DPU crypto offload challenge? IDK, but I'm a huge fan of efficiency and minimalist OSes. Let's have the parts of the application run on the silicon where they run best and not encumber the devs with sticking to one architecture. Now we need an ISA-aware Swarm orchestrator....
Excuse me.
I don't know anything about clusters, or how to build one. Although it does look like an interesting subject to further investigate.
I wanted to ask one thing. DAMN, WHAT IS THAT ARCH IN THE INTRO MADE OF ?
Oh my GOD, amazing.
The real question is was it red shirt, or blue shirt Jeff?
Red Shirt Jeff isn't allowed near the prototype boards :D
I was into the video until 14:06, the fan wire stuck between the case and mobo plate LOL
Can they see each other over PCIe without involving the CPU?
Those cards look like they'll catch on fire unless you add an industrial fan and a custom shroud to your build... what are the thermals like?
Not too bad really. You are right, there is ducted airflow behind the cable cover and they need airflow. On the other hand, the chips themselves are sub-55 W, since there is power budget for the RAM, optics, and such on the card as well. The networking is using most of the power on them.
Heh, have you ever run this bundle for more than 15 mins? Those BF2 cards get pretty HOT and need blow-through, server-like cooling.
You are totally right that they are hot, but we have run it for several weeks. These are the lower power ones and there is a chassis fan blowing over them.
@@ServeTheHomeVideo Did you check their temperatures during this run? And what do you mean that these were low power? Special models or settings?
So bazooka to a knife fight it is... I just hope Jeff has a nice case to make up for the difference in firepower...
These two guys need an Emmy for acting!
That's an ARM and a LEG!
Great one.
Nice video, thank you for sharing :)
Wow, what a server in a box. Could you ask Jeff if he can help you see how this server cluster performs against the top supercomputer ratings?
I am not sure what you mean? The Top500 results include full cluster-wide networking. You would normally only run Linpack on CPUs with vector acceleration such as AVX2 on the AMD chips (or AVX-512 on Xeons). There the Threadripper is around 1.5-1.6 Tflops, so you are at about 150-160 Raspberry Pi 4s just for the main CPU on this. On the DPU side, the A72 cores have more performance per core than the Pis, but are really there to accelerate crypto.
As you scale small nodes, the interconnect networking ends up eating a lot more power and performance. That is why you cannot just take a number from this machine (or a small cluster) and compare it to a Top500 result, because scaling interconnects is such a big deal in terms of power and performance.
We actually test a lot of the high-end 4-8 GPU supercomputer nodes, and high-density CPU nodes, so people would look at us funny in the industry if we made a claim that this cluster, or any small cluster, represented a part of a Top500 linpack run.
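The Raspberry Pi comparison above is simple division (a sketch; the ~10 Gflops per Pi 4 is the ballpark figure implied by the 150-160 ratio, not a measured benchmark):

```python
# Rough ratio behind the "150-160 Raspberry Pi 4s" figure above.
# Assumes ~1.5-1.6 Tflops Linpack for the Threadripper and
# ~10 Gflops (0.010 Tflops) per Raspberry Pi 4 (ballpark numbers).

threadripper_tflops = (1.5, 1.6)
rpi4_tflops = 0.010

lo = threadripper_tflops[0] / rpi4_tflops
hi = threadripper_tflops[1] / rpi4_tflops
print(f"~{lo:.0f} to ~{hi:.0f} Pi 4s")
```

As the reply notes, this per-node ratio ignores interconnect power and scaling losses, so it cannot be compared directly to a real Top500 run.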
@@ServeTheHomeVideo It would have been fun to have added what Jeff Geerling did in his video.
@@favesongslist Yes, but it would have made his look a lot less efficient if it used 4x the power and cost 1.5-2x this entire cluster just to get the same performance as the x86 cores alone. That would 100% make sense to folks in the industry, as the RPi cores/boards are not made for Linpack-style workloads.
How big were the burgers at that St. Louis McDonald's? Did they spend all their money on the first arch and have none left for the second?
I tried going for BBQ, but they were closed.
Would be interesting to see it running.
Had a few screenshots in the video. For the actual box, the fans spin up, a few blinking lights. Not much to it really. The side needs to be on to ensure there is enough airflow though. Hopefully when the new studio is finished we can do more with everything on. Good feedback.
Great content! Any chance for a studio tour video in the future?
Possibly. The last one with the blue door studio did not do well though: ua-cam.com/video/q1no7rXWALs/v-deo.html
@@ServeTheHomeVideo that one was nice. But I’d love to see and hear more about those fiber runs and equipment you mentioned in this video. Anyways, you are doing a great job and I really appreciate every video you put out.
I'm new to this whole Arm cluster server thing (thanks, YouTube algorithm). What is the use case for this type of system?
Looks cool!
High-level acting XD
Interesting video. I don't think I would choose this solution for my small project (I would reach for grown-up server hardware instead), but it's amazing to see where you can go with today's technology and to look at a slightly unusual solution. Both gentlemen have interesting content, and it would be interesting to see a joint project ;)
I knew it was going to be those BlueField cards! Just one simple question: is the host sharing the 14x 100G connections with the DPUs, or are those ports exclusively for the DPUs?
Shared
@@ServeTheHomeVideo Great to hear. The BlueField/ConnectX-6 cards really blur the line between being NICs and being their own systems.
By Arm, did you mean spend an ARM and a Leg?!!!
Is there an older DPU one could buy off eBay for cheap to build something with?
I have an X58 workstation board with like 8 PCIe lanes. Would love to try this.
To be frank, the BlueField-2 feels very rough. I am not sure if I would recommend getting an older BlueField 1 or something. My guess is that we will start seeing more used units next year when the new generations come out.
@@ServeTheHomeVideo ok fair enough. I assume passing through the processing power to like a hypervisor or something is a nightmare for an amateur.
@@HavokR505 More how you would use these is to use them as distinct nodes. Imagine running ESXi, as an example, on the BlueField-2 DPU, then being able to use that to provision vSAN, the bare metal server, and handle all the networking via the DPU. So ESXi runs on the DPU instead of passing through the DPU to the host machine.
Hm, 3 months later and the Turing Pi is still TBA.. ^^
I'm still trying to think of what I would use the ARM cores/processors for.
The big one will eventually be running services like NVMe-oF and handling network offloads. For now, it is more fun to use them as cluster nodes.
@@ServeTheHomeVideo
As I recall from the video though, the cluster nodes don't have NVMe onboard or DAS (unless it's passed through from the AMD Threadripper Pro system).
And the AMD Threadripper Pro system itself doesn't have a 100 Gbps connection that's tied/directly connected to it.
So it would be interesting to see how you might deploy an NVMe-oF solution if the DPUs don't have any onboard or direct-attached NVMe drives and/or the NVMe storage has to be passed through to the DPUs.
I would understand if the DPU had either onboard or directly attached NVMe, and then you could present that to the fabric.
But I've never seen how you would do the same if you don't have any onboard NVMe nor any NVMe that's directly attached to said DPU.
That's interesting.
Is there a budget-friendly Arm card? I have this weird dream where I can have Proxmox running a few VMs on x86, but I'd like to be able to spin up an Arm VM from the same machine. Does hardware exist to do this that isn't a £2000 DPU? Some kind of PCIe card that takes compute modules would be ideal.
It might be cheaper to just run a RPi or a cloud instance for Arm. We are going to have more on the ASRock Rack mATX board for Ampere Altra in a few weeks. Also search for ASRock Rack ALTRA-NAS which I hope to do in Q1 2024
@@ServeTheHomeVideo it would definitely be cheaper but I wouldn't be as cool.
I remember you could get an 086 co-pro for a BBC Micro, and with a few cunning commands your trusty BBC would suddenly be PC compatible... Now I know it's a completely different era of computing, but I think it would be a lot of fun if my x86 machine could BE an Arm machine too, rather than just emulating one.
Why aren't the parts linked in the description?
No overall breakdown of the prices of the parts either
What's the point of a video like this if the sources aren't linked?
So... What can you do with your overkill system?
Great challenge and execution!!! How much energy does your server consume? How loud/quiet is it? Thank you, and keep doing great videos.
Idle is ~200 W with 14x CWDM optics and the 10Gbase-T ports lit up. The networking part uses a ton of power. Max has not gone over 750 W yet. In terms of noise, it is not silent, but nowhere near the screaming servers we normally test. There is a lot more that can be done to keep noise down, like ducting a 120mm fan to the cards.
@@ServeTheHomeVideo Wow, I would have guessed at least twice those figures...
btw I use arch.... to meet people...
OK, but since all of them are on a single board, why not install Ubuntu on an SSD connected to the board and make all the nodes boot from that?
Could you maybe do a segment on used 24-48 port 10 gig switches that can be found in an affordable price range?
I think we have a 25GbE 48 port switch somewhere in the queue. Let me look into doing more of those 10G units. I think I saw some on the schedule for next year that the team is working on.
@@ServeTheHomeVideo Speaking of 25GbE switches, any word on when we might see something like a fanless, homelab-friendly 25GbE switch similar to the MikroTik CRS305 or 309? I picked up some used 25GbE NICs on eBay, and unless I directly connect them I'm stuck running them at 10 gig.
5:53 Define 7 TG, I recognize that case from a mile away because I have the same one right under my desk :^)
I would love to see an update to your cluster with:
1. Dual Ampere Altra 128C CPUs (with PCIe 5.0 upgrade)
2. Many more DPUs
3. NVIDIA GPUs
You can use 16 of the 32 PCIe 5.0 lanes on BlueField-3 to connect one NVIDIA GPU at PCIe 5.0 x16, or split that PCIe 5.0 x16 into 2x PCIe 4.0 x16 for two NVIDIA GPUs.
2 x Ampere Altra -> 12 BlueField 3 -> 12 NVIDIA GPUs (PCIe 5.0)
2 x Ampere Altra -> 12 BlueField 3 -> 24 NVIDIA GPUs (PCIe 4.0)
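The tradeoff between the two topologies above can be sketched with rough link numbers (a sketch; the ~63 GB/s and ~31.5 GB/s figures are approximate usable per-direction rates for PCIe 5.0 x16 and 4.0 x16):

```python
# Per-GPU and aggregate link bandwidth for the two topologies above.
# Assumes ~63 GB/s usable on a PCIe 5.0 x16 link and ~31.5 GB/s on
# a PCIe 4.0 x16 link (approximate figures, one direction).

pcie5_x16 = 63.0   # GB/s
pcie4_x16 = 31.5   # GB/s

# 12 BlueField-3 cards, each feeding one GPU at PCIe 5.0 x16:
cfg_a = {"gpus": 12, "per_gpu": pcie5_x16}
# 12 BlueField-3 cards, each split into two PCIe 4.0 x16 GPUs:
cfg_b = {"gpus": 24, "per_gpu": pcie4_x16}

for name, cfg in (("12x GPU @ PCIe 5.0", cfg_a),
                  ("24x GPU @ PCIe 4.0", cfg_b)):
    total = cfg["gpus"] * cfg["per_gpu"]
    print(f"{name}: {cfg['per_gpu']} GB/s per GPU, {total} GB/s total")
```

Both layouts end up with the same aggregate GPU link bandwidth; the choice is per-GPU bandwidth versus GPU count.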
Woah! Is that a R6 case I see?