If you spec up a similar Supermicro SYS-740GP-TNRT, you end up around €7,500, so another €5,000 to get the 4 Wormhole boards, 4 interconnects, and cables seems quite reasonable if I'm being honest. That's €1,250 per board and interconnect.
The boards are $1400 each from Tenstorrent's website. So that's a saving :)
It's not expensive for a workstation at all, assuming they also put in a quality SSD - and don't forget all the speccing and QA they had to do to make it a working system.
They recommend Ubuntu 20.04 on a brand new system? Tenstorrent should at least be on 22.04, if not 24.04, by now. That hints at a deeper issue. Do their test systems/Docker containers really not have anything newer when they're trying to deploy these things into production systems? Surely it's not that hard to update to the latest release and do some regression testing.
What are successful use cases where this hardware can be applied?
They do have 20.04 and 22.04 builds on their TT-Buda releases page, with Python 3.8 and 3.10 respectively, and torch 2.1.0. That seems about average for this industry.
Audio seems to be out of sync but 👍👍
Seems fine here? Looking at the project file in the video editor and it's all lined up with internal vs external recording
@@TechTechPotato It's very much out of sync.
@@TechTechPotato definitely out of sync
Yup. Out of sync.
Definitely out of sync. The interview clips are in sync though, just the talking head parts that are out of sync.
If you can speak to somebody from the Ascalon team, I'd be quite interested in what they thought of implementing the compressed instructions.
There was a lot of debate about whether they hinder high-performance implementations; Qualcomm proposed removing them from the RVA profile, but that wasn't accepted, so they're here to stay.
I'm wondering if they're maybe just harder to design and verify, but not impactful on the final hardware cost, considering that engineers have more experience with fixed-size ISAs, or with the amalgamation that is x86, where you might use other implementation strategies.
Qualcomm just wanted to use their ARM IP. Fuck Qualcomm.
yo the front panel on the TT-Loud system fucks way harder than it needs to.
😂
My Spidey-Sense is tingling... 🕸🕷
The Ethernet scaleout is impressive, but aren't those 800G optics extremely expensive? A recent SemiAnalysis article did some calculations on the NVL576 Nvidia B200 system with 16 racks, arguing that it won't find a lot of customers, like the H200 variant, as the optics alone add up to $10k(!) per GPU. I guess (active) copper cabling is cheaper, and those 300W cards can certainly be packed tighter than 1000W B200 chips, but I wonder how big deployments can be while still being cost effective.
This is super interesting to me. I've heard statements in real estate about the minimal extra cost for buildings to look nice, how most developers just save that money rather than trying to build a prettier world, and how incorrect that view is. I see these cases the same way personally: I can't stand the look of school/business PCs from OEMs, so TT's build quality seems so exciting!!
I know you're specialized and probably backed up with work already, but it would be amazing to hear your thoughts on AtomicSemi, and Sam. To me that's a dream company, if only my background had allowed me the education to work there. Since that's the case, I hope amazing journalists such as yourself will cover the topic a lot as they go forward with more designs, and give us a glimpse inside the future of fabs and American design.
God bless
Hey, over the years I've also been making lists of IT hardware with great design, and many times it correlates - great looks on great products. I suppose it's because it shows something of the mindset, good hiring, and also that there's slack for doing things right.
It does not always translate to more success, but comparing product by product one can very much judge the book by its cover.
great product they got there, great progress over the last one! nice openings too 🙂 would love to run their dc+cloud envs (but can't :( )
the pricing is upsetting me as a hardware geek: the $999 card vs $1399 - you basically need to get the dual-chip one for economic reasons, but two of the cheaper ones make more sense for testing :> the scalability concept looks good and is a great choice for existing datacenters.
when you're there, please also let him talk about their ip offerings a bit, they mention a lot of potential uses.
(and i'd love to be able to just buy a $50 slot on their cloud for testing out baby steps, without getting in touch and distracting someone)
Can Wormhole also do training and backpropagation; or is it only for inference (like Grayskull)?
One note: that 512GB does fully fill out all the DRAM slots on the air-cooled workstation. If you want more, you'd have to replace DIMMs (at least assuming their listed specs are accurate).
Also 8004 series would be Sienna, not Genoa.
I think they listed spare DIMM sockets for both models, so likely it's 128GB modules or larger?
Do we have any performance comparisons between this and AMD/Intel/NVidia equipment?
Not really. A lot more model bring-up is needed to know the real-world performance. But for Stable Diffusion, Wormhole is right now (with not-so-well-optimized software) at the same perf/watt as GPUs, while using 14nm instead of 7/4nm.
Also, the programming model is better when scaling out: you just get a larger grid size for the systolic array.
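As a loose illustration of that claim (a toy sketch in plain Python, NOT Tenstorrent's actual API): in a grid-style programming model, scaling out just means partitioning the same computation over more tiles, while the per-core kernel stays unchanged.

```python
# Toy illustration (not Tenstorrent's API): tile C = A @ B over an r x c grid
# of "cores", where each core computes one output block. Scaling out just
# means using a larger grid; the per-core kernel is identical.

def matmul_block(A, B, r0, r1, c0, c1):
    """One 'core' computes output rows r0:r1, cols c0:c1."""
    n = len(B)  # inner (contraction) dimension
    return [[sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(c0, c1)] for i in range(r0, r1)]

def grid_matmul(A, B, grid_rows, grid_cols):
    m, p = len(A), len(B[0])
    C = [[0] * p for _ in range(m)]
    rs, cs = m // grid_rows, p // grid_cols  # block sizes per core
    for gr in range(grid_rows):
        for gc in range(grid_cols):
            block = matmul_block(A, B, gr * rs, (gr + 1) * rs,
                                 gc * cs, (gc + 1) * cs)
            for i, row in enumerate(block):
                for j, v in enumerate(row):
                    C[gr * rs + i][gc * cs + j] = v
    return C

A = [[1, 2, 3, 4]] * 4
B = [[1 if i == j else 0 for j in range(4)] for i in range(4)]  # identity
assert grid_matmul(A, B, 2, 2) == grid_matmul(A, B, 4, 4)  # same answer, bigger grid
```

The point is that going from a 2×2 to a 4×4 grid changes only the partitioning, not the kernel - which is the claimed advantage over rewriting communication code when adding GPUs.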
These new cards remind me of IBM’s mainframe approach: not the fastest in raw speed but excelling in data throughput and distribution, and similar to NVIDIA’s Hopper architecture’s focus on massive I/O and connectivity improvements. Will be cool to try out.
tenstorrent is the one ai company where less messaging would probably make their messaging clearer
jim keller is an excellent hire for running the company - but in interviews... there are just too many non-tenstorrent questions everyone wants to ask him. hire a marketing guy to field tenstorrent product-related questions in a more focused and limited way, and then we can ask mr keller all the technical and historical questions we want without derailing the message. jim keller in a sense steals the spotlight from his own company sometimes, or at least overshadows it.
(personally id love to hear more about his early career at dec working with the alpha cpu team - almost certainly the best engineering team ever assembled it seems in hindsight. is there any major tech company in the US today without a former alpha engineer in their c-suite somewhere?)
I agree. A lot of interviews I've seen with Jim Keller about Tenstorrent tend to get way off track and into subjects like x86 vs ARM, Tesla, Apple, etc. However, there is a counterargument to having someone else do all the interviews: interviewers would be less likely to reach out, and people would be less likely to watch, without Jim Keller being there. Most people know of Tenstorrent because of Jim Keller, not because the hardware stands out above the rest, so it may be a mistake to have someone else do the interviews.
@@__aceofspades ...perhaps there already is such a person and his phone never rings lol
ur point is well taken - but i wonder how much of that perception identifying keller with tenstorrent is obscuring the distinction between tenstorrent and the other, less substantial start-ups - they are one of the very few that have actual revenue and actual sales
they also seem to be one of the very few that has both competent sw and hw
Question: Are there any plans for a product to address the automotive sector?
Would like to know how Mr Jim Keller's plans are panning out for him, ups and downs included.
It's ironic that the QuietBox has a louder design language than the LoudBox.
So these chips/cards are supposed to be used together, with model parameter data distributed over many cards? And this hasn't been done because other companies didn't prioritize network speed, which made it unfeasible?
Nvlink is pretty fast, but yeah.
> And this hasn't been done because other companies did not prioritize network speed which makes this unfeasible?
Groq does that, each card has like 220MB of SRAM cache on-die and that's it, no external memory
@@niter43 wow, that sounds like little memory! How much network speed do they have?
Question: Will we see Tenstorrent IP in the mainstream PC market any time soon? I could think of a market for PCIe addon cards to serve the vast market of non-AI capable PCs.
Question: if they paint the picture of multiple developers using the same device(s), and show that those will be interfaced via PCIe, does it have support for SR-IOV or is it one plain PCIe device? How about the 2 cpu ones like the 130 or 300, do they show two devices on PCIe or one? SR-IOV would fit better in the HPC environment where I'd usually hand out a private cloud account to the researchers/devs and have OpenNebula (etc) assign them a VM with a virtual PCI instance.
If I'm not mistaken, Ljubiša is no longer with Tenstorrent... I wonder if you could catch up with him and see what he is into these days...
I did a video about his new company back in March! ua-cam.com/video/kbYVXURdN0s/v-deo.html
Cool stuff
IBM has an interesting solution to this where you can rent out an entire rack/cluster, and all you pay is what it costs to run the system. It's turnkey: IBM sends out hardware and personnel to put it in place and get it running. I'm not familiar with the details, but this is roughly what I've read.
Q for Keller: do they have any cloud providers interested, or is TT working on a cloud?
I'm a bit confused as to why the external bandwidth is the focus here; the 3.2 Tb/s is achieved over a 16×200G interface, whereas Nvidia has single ports supporting 800G.
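For what it's worth, the aggregate figure is just ports × per-port rate (a trivial sanity check, assuming 16 ports at 200 Gb/s each):

```python
# Aggregate Ethernet bandwidth from per-port figures (assumed: 16 x 200 Gb/s).
ports = 16
gbps_per_port = 200
aggregate_tbps = ports * gbps_per_port / 1000
print(aggregate_tbps)  # 3.2 (Tb/s)
```

So the 3.2 Tb/s total matches four of Nvidia's 800G ports; the difference is port count and fan-out, not aggregate rate.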
Just bought an 8700G to learn NPU programming. Now I want one of these too, but I'd have to convince them to donate one to me…
Is there any use besides inference for FP8 yet?
At first sight, Wormhole is a mediocre name. But considering that the whole point is connecting a bunch of processors as if they weren't separated by many feet, the name is actually very compelling. I LOVE it.
I bought an E150, but maybe I'll buy a Wormhole card for Xmas. For now, ask them to put the mugs up for sale 😂
Why do you need to buy the actual hardware to develop software for it? Can't a dev just rent an instance of the hardware and use that to test the software?
Workstations are better when you use them long enough and want to connect directly to local infra (e.g. a SAN).
Quite a price for a first-party product; the Tinygrad tinybox offers better value for an OEM build. But it sure has its advantages in networking and memory.
The tinybox is $15k (red) or $25k (green). It comes with only 128GB DRAM and either a single MI250X or a single H100. Lead time is 2-5 months. Networking is also only 1 GbE by default.
Yasmina from Tenstorrent looks like the girl from the healthy junk food youtube channel
The memory is way too small, if it had 64GB per card I would buy one without hesitation, and if it is half as good as advertised I'd buy one of the workstations.
It is cheaper to buy a 3090 FE: it has peak INT8 tensor TOPS of 284 (568 with sparsity) and 24GB of GDDR6X per card, and will do fp32, fp16, fp8, fp4, and fp2 just from the tensor cores, for personal use, with CUDA out of the box. Also, NVIDIA fixed the drivers so it will spill over to system RAM if you exceed the 24GB of VRAM, and if you run out of RAM it will even spill over to an SSD or M.2 drive. With GGUF you could use, say, a used Genoa CPU with 12-channel ECC DDR5 at 460.8 GB/s to keep bandwidth high for the spillover. This is what I use at home: 192 threads, 128 lanes of PCIe 5.0, AVX-512, built for $4k. The CPU alone can run 20B to 70B models no problem with 12 sticks of 32GB DDR5 for 384GB of RAM, and the motherboard has 10Gb Ethernet on board!
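The 460.8 GB/s figure checks out as the theoretical peak for 12-channel DDR5, assuming DDR5-4800 modules and a 64-bit (8-byte) data path per channel:

```python
# Theoretical peak DRAM bandwidth (assumptions: DDR5-4800, 12 channels,
# 8 bytes transferred per channel per memory transfer).
channels = 12
transfers_per_sec = 4800e6   # 4800 MT/s
bytes_per_transfer = 8       # 64-bit channel width
peak_gb_s = channels * transfers_per_sec * bytes_per_transfer / 1e9
print(peak_gb_s)  # 460.8 (GB/s), matching the figure quoted above
```

Sustained real-world bandwidth will be lower; this is just the marketing-peak arithmetic.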
Are they only targeting fp8? What about fp16, 32, and 64? Do devs have to write their own software? Nvidia's biggest advantage is their software stack: it's the best, and it's nearly ten years ahead of any other ecosystem.
I am in a position to build a new workstation for myself: my thesis is planned to be done by the end of September, and my workstation is planned for mid-October (new Intel CPU generation). The only real solution so far has been dual 3090 Ti/4090, as the RTX 6000 is way too expensive. So it's interesting to see Tenstorrent trying to approach developer workstations - but I am not yet convinced. I want to see performance claims as well as compatibility claims.
My workload is language model evaluation, so I want to be able to run any model off Hugging Face, at fp16/bf16, and deterministically - I don't care about your fp2 and int4 quantization... The software needs to be there; ideally it just works in place with accelerate, or maybe vLLM soon. Training doesn't matter, and neither does scale-out - this is a single rackmounted/desktop system that I can give a job queue and then look at my runs. And 12k or 15k for this is sorta meh.
If a single Qualcomm Cloud AI 100 Ultra can do larger models than this whole system - this isn't what I am looking for. But it's the right direction I'd say. Going with nvidia or perhaps even intel is the more likely direction for me.
If you don't care about training perf, then get a 7900 XTX? Most benchmarks seem to put it around 3070-class perf for inference, but you get 24 gigs of VRAM for far cheaper than the Nvidia counterpart.
but how would you be getting that Qualcomm card?
@@udirt there might be some resellers, but I think emailing their sales people is a good start
Why Xeon and not Threadripper Pro?
Nvm 😂
Tenstorrent probably selected an affordable SMCI board that customers can actually get in a reasonable amount of time.
Can you run hybrid cloud on these?
Ahh the wasteful ignorance of AI.. love to see the waste
I couldn't care less about AI hardware, honestly. It's getting boring that this is the only thing everyone is covering and talking about anymore.
I mean this whole video is an ad anyway lol
I think you're on the wrong channel and video; this is for developers who are interested in AI.
@nebadon2025 Since the Nvidia and Supermicro share prices exploded, AI hype is all the rage; it's where the money is at.