Maverick Chips for the Next Silicon Generation

  • Published Dec 26, 2024

COMMENTS • 174

  • @greggleswong • 1 month ago (+62)

    "19.5, 67.0, 45.0"
    "That's numberwang!"

  • @bean_TM • 1 month ago (+39)

    Seems incredibly cool. But I'd need proof it actually works first

  • @Pavlobot5 • 1 month ago (+50)

    Sounds like "trust me bro" performance

    • @fnorgen • 1 month ago (+7)

      With some of these modern efficiency-optimized server CPUs being crammed full of those little e-cores, I don't really see the value proposition for this thing. They claim they'll have some super advanced branch prediction system that will drastically improve throughput, but that looks like a very tough problem to solve.
      And if there's enough money in HPC, it wouldn't surprise me to see Nvidia splitting their high-end GPUs into low-precision-optimized and high-precision-optimized product lines. They have the budget for it.

    • @lalishansh • 1 month ago (+2)

      I remember Jim Keller saying CPUs are mostly about branch prediction (nowadays); this seems like branch prediction on steroids. EXCITING!!

    • @niks660097 • 6 days ago

      @@lalishansh He is not wrong, CPUs are mostly about branch prediction. Look at the Zen architecture: it's still 4-wide decode on the frontend, compared to Intel's 6-wide and the Apple M series's wider decoder, but it still has better IPC (i.e. Zen 5) because of their branch predictor and micro-op cache.

  • @MoonDweller1337 • 1 month ago (+16)

    How is it different from traditional branch prediction and speculative execution?

    • @TechTechPotato • 1 month ago (+16)

      Both end up going towards a fixed compute array. Here the size of the compute array changes given the workflow.

    • @TheWunder • 1 month ago (+6)

      @@TechTechPotato Thank you Dr Potato

    • @lucasfernandesgrotto6279 • 1 month ago (+1)

      @@TechTechPotato And they do that without being an FPGA?

    • @quantumbacon • 1 month ago (+3)

      Guess they'll have fixed silicon with various paths/pipelines/SE&BP widths. Then the compiler wraps in some path-optimiser tables that cause registers to get used in a programmatic way, which are attached to the various pipelines.
      So at some point the path lookups get evaluated and compete for optimal use.
      Feels like it doesn't work for modelling where 'every solution' gets calculated. Or anything like Monte Carlo.
      Also, devs at the Department of Energy know how to optimise, or they are using machine learning to assist in optimising code.
      Makes sense to let Nvidia GPUs run the code optimisers.

    • @lucasfernandesgrotto6279 • 1 month ago

      @@quantumbacon thank you!

  • @TheCebulon • 1 month ago (+7)

    There is one important number missing!
    And it means a lot to me: 42.

  • @ProjectPhysX • 1 month ago (+28)

    This sounds fancy but in practice is unlikely to work. HPC codes need dumb vector processors with the FP64 vector compute throughput and the memory bandwidth and capacity to back it up. They don't need fancy dynamic branch prediction. HPC codes don't use branching for the most part, so there is nothing that smart branch prediction can even optimize. The telemetry collection for this branch prediction will probably even slow it down.
    If the thing doesn't support OpenCL/SYCL and instead needs recompiling, it is basically DOA. Recompiling for special hardware never goes smoothly; there is always some detail that doesn't work and needs extra debugging, and developers don't have the time or money to adjust their codes for another proprietary chip that does things differently than the industry standard. See Intel Knights Corner and how that worked out...
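The bandwidth argument in this comment can be made concrete with a back-of-the-envelope roofline check. The peak-FLOPs and bandwidth figures below are illustrative assumptions, not any specific chip's spec:

```python
# Roofline sketch: a kernel is bandwidth-bound when its arithmetic
# intensity (FLOPs per byte moved) is below the machine balance.
peak_fp64 = 80e12   # FLOP/s, illustrative accelerator peak (assumption)
mem_bw = 3.2e12     # byte/s, illustrative HBM bandwidth (assumption)

balance = peak_fp64 / mem_bw  # FLOPs per byte needed to saturate the ALUs

# Stream-like HPC kernel: ~2 FLOPs per 8-byte load + 8-byte store
ai = 2 / 16
attainable = min(peak_fp64, ai * mem_bw)

print(f"machine balance: {balance:.1f} FLOP/byte")
print(f"attainable: {attainable / peak_fp64:.1%} of peak")
```

With these numbers the kernel reaches well under 1% of peak FP64, which is why branch prediction is beside the point for streaming codes: the memory system, not the ALUs, sets the ceiling.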

    • @henriksundt7148 • 1 month ago (+3)

      You are right if it is only applied to standard, massively parallel tasks, like training of weights and biases in a static feed-forward structure / neural net. However, these architectures are to a large extent popular precisely because of the availability of this kind of uniform hardware. There are so many tasks that a) today are performed on the CPU but could be faster (there are examples in the video's description), and b) approaches that are not popular because they are slow. In these domains, NextSilicon can have an impact.

    • @foobarf8766 • 1 month ago (+4)

      HPC does use branching in scientific applications; it is limited by GPUs being vector-only, and IBM Power is still around for good reason. But you are right about the work in porting -- and that also goes against GPUs -- when those applications don't lend themselves to floating point; there are surely a few in meteorology and climate science.

    • @ABaumstumpf • 1 month ago

      And even with the branches: in decent code you are already reaching the hardware limits: either you fully saturate the ALUs or the memory bandwidth.
      (Or in bad cases like we have had today - networking... 250 workers can put some load on the system.)

  • @bernadettetreual • 1 month ago (+4)

    I find it hard to believe that this is true. It's like the theoretical advantages of the Java VM over AoT-compiled code. They never materialized.

  • @MrHaggyy • 1 month ago (+9)

    This looks really interesting in combination with Mojo. The big problem in HPC is getting the right ALU in the right configuration. I'm curious how they get that much performance out of branch prediction. In the optimization problems I know, we had to explore a search space until we hit a sufficiently low error, for example. The branches were: run this stuff, and occasionally check if it's sufficient. From my understanding, utilizing the ALU correctly by compiling certain functions like matrix multiply with the hardware in mind would be much more effective, as it cuts down the serial path you are parallelizing.
    The combination of AI and HPC will be an interesting market. An engineer can explore a vastly bigger search space if you train AI to do textbook engineering and give it the right amount of HPC. This kind of workflow is already done in fluid dynamics for vehicles and buildings and has been used to optimize combustion in ICEs. I'm also pretty sure turbine manufacturers use it as well.
    IBM tried to innovate in that space with variable precision. The idea was to brute-force the search space in low resolution, sort the results, and recompute the areas of interest with higher precision. But I don't think it got adopted that well. Probably just too complex to handle.

    • @foobarf8766 • 1 month ago (+2)

      The SIMD instructions on my general-purpose (AMD) CPU give good baby-step giant-step performance, but I haven't compared with the latest Power ISA to really say. Stuff like that which is heavy on branching doesn't lend itself well to GPUs. (edit: %s/prediction/branching)
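Baby-step giant-step is a good example of the branchy, hash-table-heavy integer workload being described. A minimal sketch of the discrete-log version (illustrative, not tuned for SIMD):

```python
from math import isqrt

def bsgs(g, h, p):
    """Solve g^x = h (mod p) by baby-step giant-step: O(sqrt(p)) time and space."""
    m = isqrt(p - 1) + 1
    baby = {pow(g, j, p): j for j in range(m)}   # baby steps: g^j for j < m
    giant = pow(g, -m, p)                        # g^(-m) mod p (Python 3.8+)
    gamma = h % p
    for i in range(m):                           # giant steps: h * g^(-i*m)
        if gamma in baby:
            return i * m + baby[gamma]
        gamma = gamma * giant % p
    return None                                  # no solution exists

print(bsgs(3, 13, 17))  # 3^4 = 81 = 13 (mod 17), so prints 4
```

The hash lookup and data-dependent early exit are exactly what maps poorly to lockstep GPU lanes and benefits from fast scalar/SIMD cores.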

    • @MrHaggyy • 1 month ago

      Well, it depends on how you branch. I played around with fixed-step Euler and Runge-Kutta on my GPU, and if you run the same instructions on a large enough dataset, they fit really well. The same is true for back-propagation-based algorithms. But it gets tricky when you have something like ODE45, which has variable step size and conditional branching. On those, a few dozen cores that combine the benefits of different initial conditions and variable step size/branching would be best.

  • @nextlifeonearth • 1 month ago (+8)

    So it's basically a super-wide out-of-order execution pipeline with speculative execution on steroids. So instead of, like, 4 FPUs per core they have, say, 256 and a massive execution buffer (I expect the HBM is for that) to keep them fed.
    Their ISA is probably defined for OoO, which is why the recompile is enough.
    The branch predictor simply learns and works farther ahead than any current CPU.
    Or that's what I'm getting from this.

    • @elad_raz • 1 month ago

      No ISA, it's a dataflow

  • @autohmae • 1 month ago (+7)

    HPC is not a small market; if they can get a good chunk, it's a good niche

  • @soonts • 1 month ago (+4)

    GPUs are only "fixed" within one dispatch / one draw call. However, many practical compute problems/3D scenes are split into thousands if not millions of dispatches / draw calls, and CPU-side code can adjust the size of dispatches / length of draw calls at runtime.
    BTW, on Windows it is critical to control the count of in-flight compute thread groups / draw calls because the OS insists GPUs should stay responsive at all times even when loaded, and has the timeout detection and recovery (TDR) feature in the OS kernel to enforce the policy.
    To be fair, until work graphs arrived in D3D12, said flexibility was tricky to implement. D3D11 supports indirect dispatches / draw calls, queries to track completion of things, and other queries to measure time spent computing / rendering things, but developers need to build their custom pipelines on top of these primitives.

  • @dddslimebbb • 1 month ago (+4)

    I'm seeing a lot of mentions of branch prediction, but this seems to be a misunderstanding (or am I the one misunderstanding?)
    My reading is that the code "flow" is analyzed, and this is used to allocate more compute "width" in that area. Whereas branch prediction is about guessing what comes next on a branch so you don't starve your pipeline. Branch prediction may be an important part of this chip but it isn't what's being showcased.

    • @clehaxze • 1 month ago

      That turns the computation problem into VLIW. Maybe the chip is dynamically assigning how many execution units each branch gets? So less-taken branches get 4 units while the most common code path gets 16. I can see this working somewhat. But instruction retiring and the branch-misprediction penalty are going to be nuts. And writing a compiler for this sounds like a yucky problem (assuming they use some annotation in the ISA).

  • @capability-snob • 1 month ago (+3)

    I'm all for divergence-busting techniques, even if quite a lot of HPC workloads don't absolutely need them. I suspect the bigger challenge with general vector compute is around memory access, Ian mentioned this briefly but it looks worth digging into.

  • @kehoste • 1 month ago (+1)

    @TechTechPotato I can't wait for a more detailed deep-dive on this.
    The stuff they had on display at their booth was by far the most impressive thing I saw on the Supercomputing'24 exhibit floor (though I definitely missed a bunch of things, no doubt).

  • @mytech6779 • 1 month ago (+11)

    Intel Xe has good 64-bit performance; they purposefully favored HPC rather than AI training, and it is used in Argonne Lab's Aurora computer.
    Xe2 ("Battlemage" when in the Arc graphics form factor) should be even better, with 64-bit int support. Intel oneAPI already offers write-once-compile-everywhere SYCL/C++, across vendors and device types (AdaptiveCpp is another SYCL compiler alternative to oneAPI).
    Nvidia double precision fell off a cliff years ago (like 10% of FP32 these days, basically just software support rather than native 64-bit register size); my old 2012 AMD W7000 GCN 1.0 had DP speed exactly half of SP speed due to 64-bit registers that were split for 32-bit.

    • @BozesanVlad • 1 month ago

      I'm curious why ARC is FPGA, at least at driver level to "make" it work as a GPU

    • @rightwingsafetysquad9872 • 1 month ago (+1)

      Nvidia has 64 bit performance that is 1/4 that of 32 bit on their fattest chips. The ones that make it into GeForce cards do not. GA100 had full 64b support, GA102 did not.

    • @foobarf8766 • 1 month ago (+1)

      Goes to show the cost of branched compute tasks on GPUs. For baby-step giant-step I find better performance on Ryzen than can be extracted from any Radeon.

    • @xpk0228 • 1 month ago (+1)

      Blackwell cut a lot on FP64 since their focus is now on AI.

    • @mytech6779 • 1 month ago (+2)

      @@rightwingsafetysquad9872 Ah, yes, x100 chips are actually the expected 2:1 of physical 64-bit. But all the others (even in the enterprise line) have something near a 40:1 drop-off for double precision, at least since Pascal.
      (I'm only referring to compute speed inside the GPU, without memory bottleneck considerations.)

  • @jamesdk5417 • 1 month ago (+23)

    As an older gamer, you always surprise me with how little I know about things outside of gaming. Thanks very much!

  • @keyboard_g • 1 month ago (+3)

    The latest versions of the .NET runtime profile the code and JIT second- and third-generation versions of it to tune hot paths and produce better assembly instructions.

  • @ullibowyer • 1 month ago (+3)

    When I have a set of ALUs and some branchy code, the proportion of ALUs running the rare path automatically goes up as the rare path gets more common. This gets more complicated with vector units, which suffer a large slowdown when a small number of lanes follow a rare path, but if that's what is being addressed here then the message is being lost/oversimplified. 😢 On the other hand, dataflow programming is awesome, so nice to hear something which sounds like that 🎉

    • @TechTechPotato • 1 month ago (+2)

      It's something we'll go into in time as we dive deeper, for sure

  • @vasudevmenon2496 • 1 month ago (+1)

    I still find it hard to solve FP64 conversion with the mantissa or exponent part when it's a negative number. I remember Pascal had much better FP64 performance than Maxwell, and my friend's 1050 Ti was way faster than a GTX 980 in peak CUDA workload. Great to see this approach.

  • @Chriva • 1 month ago (+16)

    1:10 No amount of bits can please the real hardcore people. Check out arbitrary precision maths 😂 (Hunt for primes and pi decimals in particular)

    • @incription • 1 month ago (+2)

      Yep, it's literally unavoidable, although you can use integers in place of large floating points, you just have to adjust the formulas
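Python's built-in integers are already arbitrary-precision, so the integer workaround is easy to demonstrate. The Lucas-Lehmer test below checks Mersenne primes (the prime-hunting workload mentioned above) with exact integer math on values no FP64 unit could represent:

```python
def lucas_lehmer(p):
    """Exact Mersenne-primality test for M_p = 2^p - 1 (p an odd prime)."""
    m = (1 << p) - 1          # arbitrary-precision integer, no rounding ever
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m   # intermediates grow to hundreds of bits
    return s == 0

print(lucas_lehmer(127))  # True: M127 has 39 digits, far beyond FP64's 53-bit mantissa
print(lucas_lehmer(11))   # False: M11 = 2047 = 23 * 89
```

Dedicated wide-integer hardware would accelerate exactly the `s * s` step, which dominates the runtime for large exponents.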

    • @foobarf8766 • 1 month ago (+2)

      It's true, I need a 160-bit integer math processor. I don't even care about this floating point stuff, I'm not trying to make a bad poetry machine

    • @levygaming3133 • 1 month ago (+1)

      @@foobarf8766 what kinda math are you doing where you’re not ever going to get decimals? Or even just _use_ decimals? Prime number hunting the slow way?
      (Especially b/c I’m pretty sure that prime number hunting takes advantage of AVX floats.)

  • @HansCNelson • 1 month ago

    IF runtime adaptive acceleration really works (big if), is it fair to think that it could jump big parts of the CUDA moat?

  • @rwantare1 • 1 month ago (+10)

    And then a small code change destroys your performance because their (proprietary?) runtime optimiser no longer understands what you're trying to do.
    This already happens with speculative execution, just that programming for magic performance gains is no fun.

    • @nextlifeonearth • 1 month ago (+2)

      To my understanding, speculative execution is exactly what they're doing, but bigger than ever before.
      A giant branch predictor, and a ton of FPUs per core fed by this branch predictor. The recompile is probably for their own ISA that they designed for OoO.

    • @elad_raz • 1 month ago (+1)

      @@nextlifeonearth This is why we don't follow instructions. No processor core, no execution pipeline, no ISA. Stay tuned for a technology launch in a few months.

  • @dddslimebbb • 1 month ago (+1)

    Would be interesting to see how this could integrate with MLIR (the LLVM magic that Mojo uses). I also wonder if, with sufficient support, this could be well suited for accelerating functional programming languages without the traditional FP when translating to something that will run on traditional hardware. That might make some HPC-using mathematicians *very* happy.

  • @erictayet • 1 month ago (+1)

    I think I know what it is. Quick background: I specialised in DSP when I was in school, so I've worked with fixed-precision DSPs, MATLAB & gcc in the past.
    Based on what I'm hearing, I will generalise this chip as a general-purpose multi-precision DSP that supports pipelining & branch prediction.
    Imagine a DSP that only runs Intel SSE & AVX with SMT support but with on-die HBM, with a compiler that has a front-end like MATLAB but can directly target this new chip.
    Interested to learn how wrong I am when the chip comes out.

    • @elad_raz • 1 month ago (+1)

      @@erictayet It is dataflow hardware, and stay tuned to learn more!

    • @erictayet • 1 month ago

      @@elad_raz So like a state machine as implemented in an FPGA to simulate a K-map? But each state machine has an ALU/FPU to run in a neural net rather than a simple comparator?
      Just shooting wildly here. I have worked with Altera FPGAs in my work and it's a completely different way of thinking about how the machine works. Certainly not the von Neumann machine I'm used to coding for.

  • @tristan7216 • 1 month ago (+1)

    Sounds like branch prediction at a larger scale, reorganizing the placement of code on chip to optimize data flow. They should be able to measure the performance per Watt boost on open source science codes, so I'd expect it works pretty well if they've done that. It'll depend on the code though. Interesting.

  • @Swordhero111 • 1 month ago (+2)

    Is this just a CGRA with extra steps?

  • @henrycobb • 1 month ago (+6)

    Intel promised that Itanium just needed a better compiler. How'd that work out?

    • @foobarf8766 • 1 month ago (+2)

      They positioned that against IBM Power, which was like 20 years of compiler work ahead, so not bad considering? But OpenCL is a thing now, so maybe this has a better chance?

    • @mytech6779 • 1 month ago (+1)

      Itanium relied 100% on the compiler. That was the whole point: do all the pipelining stuff in software at compile time rather than in hardware at every execution, and thus a net savings in silicon area and power consumption.

  • @TheGreenPianist • 1 month ago (+1)

    Nice to see that FP64 is not ignored all the way in these times 😅 Our NWP models are increasingly mixed FP32/FP64 precision, but a large part of the code will always just need many FP64 flops

  • @countdown4100 • 1 month ago (+1)

    8:05 "It's not that. They've told me it's not that." Yeah? Well, then what is it?

  • @Cybot_2419 • 1 month ago (+2)

    Does this only support OMP target, or is there something similar to CUDA/HIP to program this? I'm wondering if it's worth it to port GPU codes to this (ones that use CUDA/HIP and not OMP target) that are mainly memory-bandwidth constrained. Or is this more intended for codes that are CPU-only?

  • @TheBackyardChemist • 1 month ago (+6)

    Do they have a good OpenCL driver? I am not going to write vendor-specific code for the product of a company that might do a nitrogen triiodide impression and go poof at any moment.

    • @TechTechPotato • 1 month ago (+6)

      That's the beauty, the code here isn't vendor specific.

    • @TheBackyardChemist • 1 month ago (+3)

      @@TechTechPotato I am not convinced yet but I hope they succeed

  • @djsnowpdx • 1 month ago (+1)

    Your video about all-big-core smartphones cleaned up how I think about Apple CPUs now. I just disregard the little cores. So the M4 is fast, but with only 3-4 big cores, you might consider the M4 Pro for any CPU-intensive workflows, and the M4 Max is not much better, so only buy that if you need more GPU than the M4 Pro, and expect a slight battery life hit. Thanks Dr. Cutress!

  • @darveshgorhe • 1 month ago (+1)

    What's the difference between the runtime optimization performed by Maverick 2 and something like a JIT compiler or branch prediction? Is the idea that the more-used code paths actually use more hardware, whereas JIT compilers and branch prediction create heuristics for code paths in software?

  • @Quarky_ • 1 month ago (+1)

    17:20 is this a 20 min ad?

  • @rb8049 • 1 month ago (+2)

    I remember the Fairchild multi-chip module CPUs in the 1980s.

  • @karehaqt • 1 month ago (+18)

    Ian, please talk about what's happening with Super Micro: shares down 30% due to their auditors Ernst & Young resigning today. The tech press seems oddly quiet about the whole thing, which has been ongoing for months.

    • @TechTechPotato • 1 month ago (+13)

      Company investor relations tend to only talk to the investor press. It's rare that the tech press gets a call about share prices

    • @karehaqt • 1 month ago (+3)

      @@TechTechPotato It just seems weird to me that nobody has even spoken of it, especially since the DoJ started investigating them for alleged accounting violations. I'm just wondering if this is going to tank their AI dreams.

    • @muhdiversity7409 • 1 month ago (+2)

      @@karehaqt I watched something a few weeks ago that explained exactly how naughty they were being. Something to do with multiple companies colluding to inflate the books. I think that probably explains the media blitz they have been doing across YT talking about their DC solutions. Probably in an attempt to drown out the bad news.

    • @todorkolev7565 • 1 month ago (+1)

      I just watched a PR piece (L1tech) about Super Micro and I was still shocked people see them as a legit company, because I remember when we had to replace all our servers because they were bugged with Chinese spy chips... SuperMicro is greasing the right wheels, apparently!

  • @platin2148 • 1 month ago

    As long as no input is serially dependent on any other

  • @xpk0228 • 1 month ago

    Well, this seems like they have to produce working compilers for their hardware, and that is really hard. I guess we should wait and see, but Intel tried with IA-64 and even they could not get the compilers working.

  • @Veptis • 1 month ago (+1)

    Modern NPUs only do INT8 (plus a bit more FP on the DSPs)... so I am now wondering if you can write some kernels to do FP32 math with the INT8 MACs

    • @ProjectPhysX • 1 month ago (+2)

      Possible yes, but throughput will be awful, especially with emulation support for denormals. So it doesn't really make sense.

    • @Veptis • 1 month ago (+1)

      @ProjectPhysX I have seen doom run on worse hardware... But this will be my summer project for the winter

  • @ManuFortis • 1 month ago

    It may not be their intended usage, but I have an idea for a game I've wanted to make for a long time, and I think this technology from NextSilicon will make it incredibly easy for me to accomplish now, in comparison to before. Before, I was looking at the potential of having to deploy servers just for hosting the background logic going on in the game, not even the multiplayer aspects. This just flipped the table for me. If it can be integrated well enough with the kind of system I have in mind right now... it could be done all in one server. Before, I was looking at the potential of a cluster, and gasping at the prices.
    So instead I decided to downgrade some of the graphics that I would end up using, because it would at least free up some compute in the CPU and GPU.
    But with this... that's not necessary anymore. If I understand correctly, that is. If I do understand correctly, I can offload all the game logic onto the accelerator, allowing the GPU and CPU involved to do their own tasks separately. Or in the worst-case scenario, it merely makes the operation of all that game logic much more efficient while still being a load on the CPU and GPU to some extent. But if that's the worst case, I can work with that. I think.
    What's the game idea? As much as I would love to share, it would be a dang shame if said idea were poached. I will instead say this at the very least: imagine an MMO where everything you do actually affects everyone else, and not just through some premade restrictive scripts, but actual logic dictating what the most likely scenario is next. When you pull a pail of water from a river, the flow behind you actually reduces. If you chop down a tree, it actually stays chopped down until a new one can grow to replace it, properly. Not spawning in on a set timer. If you pull too much water from that river, the tree may not grow at all due to lack of groundwater in the area. (If taken far enough.)
    The way I was looking at the likely path for coding something like that, I was met with the need for parallelism. And a lot of compute capability. You aren't running something like that on a typical CPU, to put it bluntly. And GPUs in the consumer market, well... not happening there either. So I started to look at accelerators. And that's how I got to the server clusters.
    I put that game idea on hold, because I just cannot even begin to afford to do something on that scale. But with this Maverick chip, I feel like Pandora seeing hope at the bottom of the box.

  • @adul00 • 1 month ago (+1)

    This looks awfully similar to profile-guided optimization (PGO), which collects runtime information to help an ordinary compiler (like GCC) optimize code better for that execution pattern / scenario.
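The GCC flow this comment refers to is `-fprofile-generate` (instrumented run) followed by `-fprofile-use` (optimized rebuild). A toy sketch of the same feedback loop, in Python; the function names, branch labels, and threshold are made up for illustration:

```python
from collections import Counter

profile = Counter()  # stands in for the on-disk profile data

def kernel_instrumented(x):
    # Phase 1: the instrumented build records which branch each call takes.
    branch = "small" if x < 1000 else "large"
    profile[branch] += 1
    return x * 2 if branch == "small" else x // 2

# Representative training run, analogous to running the -fprofile-generate binary.
for x in [3, 7, 12, 5000, 9, 42]:
    kernel_instrumented(x)

# Phase 2: "recompile" with the hot branch laid out first,
# analogous to rebuilding with -fprofile-use.
hot = profile.most_common(1)[0][0]
print(f"hot branch: {hot}, counts: {dict(profile)}")
```

The difference with the chip described in the video is that this feedback loop would run continuously in hardware at runtime rather than once at build time.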

  • @MrGarrax • 1 month ago (+1)

    Sounds interesting - an accelerator that adapts over time to your code and improves performance and efficiency. But if there is a bug inside this system, that will be very unfunny to debug. Well, we'll have to wait and see. Thx 4 the news.

  • @reinerfranke5436 • 1 month ago (+1)

    Seems to me a clever SW solution looking for a hardware demonstration, to later also target "legacy" CPU and GPU mixtures. As I learned from SPICE circuit simulation on GPUs, part of the code is easy to port to a small graph flow of some hundred lines, but anything sparse-matrix gets hit by long memory latency. FEM is possibly a different target, where very small kernels are at 100% compute and the HBM only feeds huge chunks of partitioned data.
    Still, all have memory/compute separation. I think the real thing is coming with stacked memory, where the interconnects are counted in millions, each transferring billions per second, not hundreds transferring tens of billions. This will break the memory limit open for new applications of code.

  • @bayanzabihiyan7465 • 1 month ago (+2)

    Doesn't MI300X (and MI300A) have superb FP64 performance while having the memory BW to support it?
    You mentioned Nvidia, but AMD is, I believe, a bigger player in HPC; they power some of the world's best HPC supercomputers.

    • @TechTechPotato • 1 month ago

      Based on total compute, yes, but AMD is only in a small handful of (top) systems.

    • @ProjectPhysX • 1 month ago (+2)

      Yes MI300X is 82 TFlops vector FP64, and 163 TFlops matrix FP64. That thing is a beast and it will be hard for a startup to become even remotely competitive.

    • @xpk0228 • 1 month ago (+1)

      AMD will probably do better than NVDA in HPC since they did not gut their FP64 path like Blackwell did. Also there is less of a software issue there.

  • @DanFrederiksen • 1 month ago

    What chemistry did you need FP64 for? 32-bit covers quite a range

  • @quibster • 1 month ago

    So this is like an adaptive ASIC, but they are also saying for 100% sure they will do the software and not lump it on the customer? Could this be the way to go if you "just want more HPC"?

  • @sambojinbojin-sam6550 • 1 month ago

    "It's not that. We've said it's not that."
    "Ok, it's kinda that, but with a patent. Big difference."

  • @proesterchen • 1 month ago

    Sounds IA-64-like in its reliance on compiler and predication, at least for the initial setup, while the hardware reconfiguration must have really terrible latency if they go with split resources on branches rather than just redoing the ops using the full hardware on a miss.

  • @JoeHacobian • 1 month ago

    So they basically made a (JIT-next meets the V8 engine) processor for general compute

  • @petertrypsteen • 12 days ago

    But how does the 64-bit double-precision performance compare to software emulation with 32-bit single-precision data types?
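One way to frame that question: FP32 hardware plus software compensation can recover much of FP64's accuracy at the cost of extra instructions per operation. A stdlib-only sketch using Kahan (compensated) summation, where `f32()` rounds every intermediate to single precision to mimic FP32 hardware:

```python
import struct

def f32(x):
    # round a Python float to IEEE-754 single precision ("FP32 hardware")
    return struct.unpack('f', struct.pack('f', x))[0]

def naive_sum32(xs):
    s = 0.0
    for x in xs:
        s = f32(s + f32(x))
    return s

def kahan_sum32(xs):
    # compensated summation: c carries the rounding error forward
    s = c = 0.0
    for x in xs:
        y = f32(f32(x) - c)
        t = f32(s + y)
        c = f32(f32(t - s) - y)
        s = t
    return s

data = [0.1] * 1_000_000  # exact sum: 100000
print(naive_sum32(data))  # plain FP32 drifts by hundreds
print(kahan_sum32(data))  # stays within a couple of FP32 ulps of 100000
```

The trade-off is roughly 4x the operations per element, which is why native FP64 units still win when throughput matters; full FP64 emulation (double-float arithmetic) costs even more.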

  • @gadlicht4627 • 1 month ago (+1)

    A lot of the best models use neural networks as part of the model, but not the full model, so there will be continued use for improvements in the non-ML part. For example, if you know the exact physics of a simulation, or the laws it obeys, using ML might be frankly stupid if your computer can handle computation of those exact terms well. It may even take less computation power, as you get rid of superfluous things and instead everything goes to actual calculation. If you do not know the exact physics or laws, or it's computationally impossible, you can still get a boost by modelling what you can model and using a neural network to modify that model in a hybrid approach. The model not based on a neural network can lead to the neural network being more grounded in reality (so better results), needing less training as it's grounded, being faster at times, and more. This is very much a case-by-case thing

  • @thiagofreire4496 • 1 month ago

    Hi, Ian. Does RISC-V already have instructions equivalent to NEON, SVE and SVE2 on ARM CPUs?

  • @xwingfighter999 • 1 month ago

    So my favourite density functional theory package running at the speed of a GPU? Without having to ask the devs to rewrite their whole codebase in CUDA? I am interested.

  • @TheLkdude • 1 month ago

    SRC systems developed similar technology under reconfigurable computing

  • @juancarlospizarromendez3954 • 1 month ago

    Is there not GDDR7 memory?

  • @thegeforce6625 • 1 month ago

    I'm probably wrong, but this kinda reminds me of those Transmeta Crusoe chips from the early 2000s.

  • @evdrivertk • 1 month ago

    I'm thinking that the 800-pound gorillas (Intel/AMD) are going to come out with special compilers that convert your C++/Fortran code to their architecture without all the hand-porting effort.

  • @1introvert_guy • 1 month ago (+1)

    12:15 This is such a marketing graph (well, because it is!). But I hate these graphs :/ especially because I can't see the numbers or more details.

  • @MaxHaydenChiz • 1 month ago (+1)

    I really want to understand how this hardware works. Is it a variation of a CGRA? Regardless, extraordinary claims require extraordinary evidence.

    • @foobarf8766 • 1 month ago

      Also curious, but is it really that extraordinary? IBM made similar leaps between Power generations (4096 entries in the Power10 TLB), and the Intel/AMD entry into the HPC space with GPUs is because of their price point, not capabilities.

  • @alexg50446
    @alexg50446 Місяць тому

    Is it only better at FP64 in performance/power, but not lower precision?

  • @skypickle29
    @skypickle29 Місяць тому

    How is this different from branch prediction? I even remember the DEC Alpha, which had a processor monitoring the CPU for metrics like this. Unless the processor can reconfigure an FPGA that is optimal for the observed calculations, then rewrite the code on the fly to maximize efficiency, the design will not be optimal.

  • @RwilliaMHI
    @RwilliaMHI Місяць тому

    It's not an FPGA+ASIC programming another FPGA within the SoC, like it wasn't commingling of funds at FTX crypto.

  • @jedijackattack3594
    @jedijackattack3594 Місяць тому +1

    So it's a feed-forward DPU. We've had these for ages and I don't think it's going to help for most HPC tasks.

    • @foobarf8766
      @foobarf8766 Місяць тому

      If you mean the IBM/DARPA thing that was never going to go retail, but now that OpenCL is a thing, this might have a chance?

  • @artifactingreality
    @artifactingreality Місяць тому

    I have been imagining such a chip myself: a self-programming FPGA, if you will. Amazing that someone is going to build it.

  • @NickChapmanThe
    @NickChapmanThe Місяць тому +2

    Appreciate the perspective. The late disclosure seemed a little disingenuous.

  • @MrMrMrMrT
    @MrMrMrMrT Місяць тому

    Isn’t it at a cost disadvantage from a power-draw perspective?

  • @dankodnevic3222
    @dankodnevic3222 Місяць тому

    After years of reading about miracle devices which turned out to be flops, I'd rather believe it when I see it. On the precision issue, I would like to see a scalable FPU that goes beyond FP64 in hardware when needed (high-order polynomials, etc.), more than some magical branch prediction.

  • @incription
    @incription Місяць тому

    It doesn't accelerate AI in any way, does it? Just to make sure.

    • @TechTechPotato
      @TechTechPotato  Місяць тому +1

      Only at full precision, not reduced precision modes

  • @SimplestUsername
    @SimplestUsername 14 днів тому

    Could you please make a video breaking down Googles quantum computer and what it's actually capable of?

  • @jonathanjones7751
    @jonathanjones7751 Місяць тому +1

    Ponte Vecchio did 52TFLOPS of FP64 but intel sunset it. Was that more hardware or software that limited its adoption?

    • @TechTechPotato
      @TechTechPotato  Місяць тому +4

      A bit of both, but also the theoretical memory bandwidth was almost impossible to achieve. The Chips and Cheese team even worked with Intel for their coverage and struggled to get >50%.

    • @jonathanjones7751
      @jonathanjones7751 Місяць тому

      The memory bandwidth is a great point. 47 tiles or something, and we're seeing memory issues with Foveros on ARL. Thank you for the reply. Hopefully it can get remedied for Falcon Shores, if that is still an HPC part.

    • @xpk0228
      @xpk0228 Місяць тому +1

      It's more that the design of PVC is just not good. From what we see in Aurora, the 52 TFLOPS is a peak figure and unsustainable under real-life conditions. MI250X, on the other hand, can do 45 TFLOPS consistently in Frontier.

  • @Matlockization
    @Matlockization Місяць тому

    It was very interesting that you would display who and how many accelerators were used. I don't see why Intel can't populate their P & E cores in a grid with GPU cores right now. However, I think AMD is closer to this practically than Intel. Obviously, I have concerns about latency.

  • @Mark_Williams.
    @Mark_Williams. Місяць тому

    Remember these numbers. Look at this cool new tech! Numbers under embargo... bah! lol
    Looks very cool though.
    Gives me vibes of Intel's alleged Royal Core project with rentable units. An architecture that dynamically adapts to the workload to improve performance. Interesting stuff!

  • @ABaumstumpf
    @ABaumstumpf Місяць тому

    I mean that is what branch-predictors are already doing. And everything you and them have presented so far sounds exactly like a CPU with an FPGA and some fixed-function blocks - which falls flat in terms of performance compared to the more normal vectorisation-approach for most cases, but can be faster if the workload is not your normal memory intensive task but rather you need some more complex operations and have extra blocks for that (some extra trig-hardware etc).
    And really? Code is mostly taking the most-likely path? XD

  • @oj0024
    @oj0024 Місяць тому

    Does the number 0.8373 mean anything to you?

  • @moienahmadi2377
    @moienahmadi2377 Місяць тому

    Founder of NextSilicon is Elad Raz. According to Founders Village: "Mr. Raz served in the elite 8200 intelligence unit of the Israel Defense Forces". The more you know... ⭐

    • @TechTechPotato
      @TechTechPotato  Місяць тому

      Israel does have mandatory military service. A lot of tech people there have been in intelligence units one way or another - it's why Israel is a tech hub.

  • @kamilhorvat8290
    @kamilhorvat8290 Місяць тому

    Is this Transmeta CPUs reinvented?

  • @jimtekkit
    @jimtekkit Місяць тому

    I'm hoping like hell that Radeon will bring back some FP64 compute performance to the masses with UDNA. Nvidia severely nerfed it with Maxwell and even many Quadros are nerfed. The upsell is insanely steep. Radeon aren't much better right now with their focus on CDNA for that type of workload.

  • @PterAntlo
    @PterAntlo Місяць тому

    I wish them the best, but that sounds very much like what Intel said with Larrabee: you don't have to adapt your program, just recompile it and our compiler/lib/JIT will do the rest. And well, that didn't work out as well as everyone hoped.

  • @LogioTek
    @LogioTek Місяць тому +3

    Radeon VII still good then?

    • @TechTechPotato
      @TechTechPotato  Місяць тому +5

      Efficiency ain't great, and the software stack needs work, but zoom zoom

    • @LogioTek
      @LogioTek Місяць тому

      @TechTechPotato Yeah, tell me about the AMD software/driver stack. I sometimes get AMD driver crashes just from playing YouTube videos on my 7950X3D iGPU. When I actually edit videos it becomes a nightmare.
      From my tinkering several years ago, Radeon VII efficiency doubles from reducing core and memory clocks by 25% each.

  • @acasccseea4434
    @acasccseea4434 Місяць тому +4

    Doing disclosures at the end is dodgy...
    If you don't want to spend watch time, at least put a text up...

  • @sameeranjoshi1087
    @sameeranjoshi1087 Місяць тому

    Good one

  • @JohnJohn-ts6ux
    @JohnJohn-ts6ux Місяць тому

    Hi sir, love your videos very much, I admire your hard work, thank you so much again. Could you please do a video on the MediaTek 9400 CPU vs the Snapdragon Elite? Because I'm thinking of getting the Samsung Ultra 25, or possibly an Oppo high-end flagship smartphone with the MediaTek 9400 CPU. Which one performs better? Thanks for your time, keep it up 😀😀

  • @firsttyrell6484
    @firsttyrell6484 Місяць тому +3

    This chip looks like a nightmare to optimize for. Look, on the first run this part of code was slow, I'm going to optimize it. On the next run this part of code does not matter anymore due to hardware magic (optimization), but the code is still slow in some other place instead, back to square one.

  • @AhmadAli-kv2ho
    @AhmadAli-kv2ho Місяць тому

    There's 256-bit floating point?

    • @lbgstzockt8493
      @lbgstzockt8493 Місяць тому

      Theoretically you can have any power of two for your size, it just gets really impractical really fast. Pretty much nobody does more than 256 bits.

  • @philflip1963
    @philflip1963 Місяць тому

    The Road Not Taken
    By Robert Frost
    Two roads diverged in a yellow wood,
    And sorry I could not travel both
    And be one traveler, long I stood
    And looked down one as far as I could
    To where it bent in the undergrowth;
    Then took the other, as just as fair,
    And having perhaps the better claim,
    Because it was grassy and wanted wear;
    Though as for that the passing there
    Had worn them really about the same,
    And both that morning equally lay
    In leaves no step had trodden black.
    Oh, I kept the first for another day!
    Yet knowing how way leads on to way,
    I doubted if I should ever come back.
    I shall be telling this with a sigh
    Somewhere ages and ages hence:
    Two roads diverged in a wood, and I-
    I took the one less traveled by,
    And that has made all the difference.

  • @MasamuneX
    @MasamuneX Місяць тому

    what if we made an ASIC that just "changes"

  • @TheoneandonlyRAH
    @TheoneandonlyRAH Місяць тому

    this is nice!

  • @cj09beira
    @cj09beira Місяць тому +6

    kind of a shame CDNA wasn't mentioned at all, when it's much more HPC-focused than the Nvidia counterparts

    • @TechTechPotato
      @TechTechPotato  Місяць тому +7

      More content to come ! :)

    • @rb8049
      @rb8049 Місяць тому

      Does MATLAB run on it?

    • @cj09beira
      @cj09beira Місяць тому

      @@TechTechPotato Btw, any plans to talk about SOI? Its been absent of late since GF gave up on 7nm. With all this new quest for high performance, I wonder why what seems like an "easy" avenue for a frequency and/or efficiency boost isn't being used.

    • @JorgetePanete
      @JorgetePanete Місяць тому

      it's*

  • @quantumbacon
    @quantumbacon Місяць тому

    Ian, I think you might be giving people the impression that FP64 makes calculations at 64-bit precision. This is incorrect.
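    For context on this point: IEEE 754 double precision ("FP64") carries a 53-bit significand (52 stored bits plus one implicit bit), not 64 bits of precision. A minimal Python check (my own illustration, not from the video) makes that concrete:

    ```python
    import sys

    # IEEE 754 double stores a 53-bit significand, not 64 bits of precision.
    print(sys.float_info.mant_dig)  # 53

    # Consequence: integers above 2**53 are no longer exactly representable.
    print(2.0**53 == 2.0**53 + 1)   # True  -- the +1 is rounded away
    print(2.0**52 == 2.0**52 + 1)   # False -- still exact at this magnitude
    ```

    The remaining 11 bits go to the exponent and 1 bit to the sign, which is where the "64" in FP64 comes from.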

  • @alexcastas8405
    @alexcastas8405 Місяць тому

    'applications run orders of magnitude faster' ... big claims

  • @RicoElectrico
    @RicoElectrico Місяць тому

    I wonder if Intel will acquire them only to sell off 5 years later.

  • @pcoverthink
    @pcoverthink Місяць тому

    The L1 size is a huge red flag for me. Money can buy good nodes and a lot of HBM, but this L1 amount sounds like BS.

  • @vogue43
    @vogue43 Місяць тому +1

    All that about flow was pretty much the ... before profit. It explained nothing. Magic happens, perf goes to the moon, trust me bro.

  • @acasccseea4434
    @acasccseea4434 Місяць тому

    Sounds like branch prediction😅

  • @kilngod1943
    @kilngod1943 Місяць тому

    AMD accelerators get 3x better FP64 compute than Nvidia; there is a reason national labs are buying AMD-based supercomputers.

  • @ultraveridical
    @ultraveridical Місяць тому

    Another video, another mention of "clients". These are becoming ads more and more, and with the disclosure near the end.

    • @TechTechPotato
      @TechTechPotato  Місяць тому +1

      This video isn't an ad. But good try though. I'm an analyst and consultant. All my clients, past and present, are listed in the description. I'm very open about this.

  • @MrAndrzejWu
    @MrAndrzejWu Місяць тому

    ok it sounds interesting :)

  • @DS-pk4eh
    @DS-pk4eh Місяць тому

    I thought AMD had good hardware with 64bit FP support

  • @foobarf8766
    @foobarf8766 Місяць тому

    Intel and AMD should be here with products like this... where are they? Smoking blockchains behind the bike sheds again?