You could have added frequency to compute-on-top for the reasons you gave for latency. I like that you mentioned smartphone chips. Only a few applications favor high frequencies; any server or mobile design will favor efficiency over frequency in the compute/watt trade-off. Also, thermodynamics is a complicated topic. If you match the thickness of the layers and the thermal mass (mass × specific heat) correctly, a cache die could act as a heat spreader. You also have the option of cooling the CPU through the power and GND connections. I wonder if we will see custom-matched cache sizes on top of CUs outside of AI chips. Something like we have in microcontrollers, where certain addresses in RAM can be pulled by a peripheral and don't require a load/fetch/store from the CPU.
Interesting dilemma, thx for that! We will be 100% sure once the X3D is delidded, hopefully soon (der8auer)😉 BTW, as other commenters like @jannegrey have pointed out - how is this going to interact with BPD (Backside Power Delivery)? Thinning chip dies will happen no matter what, however BPD might bring some new challenges, right?
My take is that they will keep compute on top for the gaming CPU if it works this generation, but then have compute at the bottom for other products (x900 and x950 + servers, where they brute-force cooling). At least until they change how we think of thermal dissipation in that context and somehow find a way to efficiently cool what's at the bottom.
Nope, EoS means you don't mix and match. This is how Zen5 was designed, so this is likely what it will be on all products using Vcache this generation.
@lordec911 Oh sorry, I didn't mean for this generation, but for future ones (more Zen 7, or whatever it will be called by then, than Zen 6). This one would basically be to see "how it goes".
@@nekogami87 Oh, I missed that. As to that, who knows. EoS and KISS means you stick to a single design/technique but maybe there are enough benefits to customize for the future market segments. I still think the end goal is the IOD basically becoming an active interposer chip with the CCDs (and Vcache) stacked on top of it (maybe a small GPU chiplet too). Then you could throw HBM on it or LPDDR next to it. Basically a single tile version of the current Instincts.
@@RenRenification I think you got me wrong in the first place, I'm not talking about any 3D V-Cache, just CCD on top of the I/O die instead of beside it
I think in next few generations we will see CCD on top of cache on top of the I/O die. Next step will be to increase amount of cache layers in-between CCD and I/O die
Ian, I think the purpose of placing the L3 cache memory on top or bottom of the same die is that, this way, the die has less area (mm²) and is therefore cheaper to manufacture. I think that if the 64 MB of L3 cache were planar, the die would have more area and be much more expensive to manufacture. The cache memory layer being closer to the x86 cores is just a consequence. And the correct term would be a "layer" of L3 cache memory and not a "die" of cache memory, since there is only one die. I don't think there is a "soldering" of 2 dies by the TSVs.
If there is a need for spacers with compute on top, why not leverage those spacers to improve signaling? That'd save on the need to use TSVs to move power to the top of the stack, since there'd be an alternative path up. Since the spacers have just surface wiring, do they even need to be shaped like a box? Could the outside edge be sloped to reduce wire length? That'd be a shorter distance than two right angles (one in the package and the other vertical through the spacer/TSV). Similarly, if the compute and cache dies don't need to be the same dimensions, you could have the top of the SRAM stack use the same wire bonding for power off of the spacers. Looking from the top, imagine two rectangles, rotate one 90 degrees from the other, then overlay them to get a cross-shaped arrangement. For a hypothetical Turin-X, eight stacks of 64 MB SRAM underneath all 16 chiplets would equate to 8704 MB of L3 cache in a package; going for 128 MB SRAM dies would permit 16896 MB of L3 cache (math sketched below). That's more SRAM as L3 than the average consumer system nowadays has DRAM. I'm rather disappointed that AMD has no official plans for Turin-X, as leveraging multiple stacks would be a game changer for cache-sensitive workloads. Even more mundane workloads would be able to run entirely out of the L3 cache. The L3 latency with 8 stacks would not be good, but still radically faster than DRAM, and bandwidth would not change based on stack size. These wouldn't be cheap parts (16 compute dies, 128 SRAM dies and 1 IO die), but for some markets I'd imagine they'd pay the premium, as it'd still be cheaper than some per-core software licensing schemes.
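Sanity-checking those totals (my assumptions: 16 CCDs, 32 MB of native L3 per CCD, eight stacked dies per CCD):

```python
# Hypothetical Turin-X capacity check - assumed config, not an AMD spec:
# 16 CCDs, 32 MB native L3 each, 8 stacked V-cache dies per CCD.
ccds, native_l3_mb, stack_height = 16, 32, 8

for die_mb in (64, 128):
    total = ccds * (native_l3_mb + stack_height * die_mb)
    print(f"{die_mb} MB dies -> {total} MB of L3")
# 64 MB dies -> 8704 MB of L3
# 128 MB dies -> 16896 MB of L3
```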
The thing I was most surprised about is how insulating the bonding layers were. It seems to me that reducing these has helped probably as much as the swap itself, because I don't see thermal TSVs on the CCD, so the thermal transfer from the lower layer must be improved.
You assume that the bonding layer is perfectly flat, but at the nanometer scale it is not totally flat. So the interface will have significantly higher thermal resistance.
@@Eternalduoae There must be a capping layer, a migration barrier and a surface adhesive. You can remove some of this if you have a material that can serve multiple functions.
@@Boris-Vasiliev It should have better cooling because the hot spot is now closer to the cooling solution. This is the reason why this new X3D has unlocked overclocking. AMD is now confident that the chip will not cook itself.
Is it possible to design it such that most of the power goes to the top from the edges of the dies, to whichever die type (memory or compute) is on top? Would this reduce the power routed through the chips from bottom to top? (Power, compared to data, is not latency sensitive, i.e. it does not need to propagate through the entire wire path like data does, so if power takes the "longer route" it should still work the same.) If the power distribution went around the edges to the top chip instead of through the bottom chip, I wonder if the freed-up space between the substrate and the bottom chip could be used for "copper lanes" of some shape, dissipating heat more effectively out "sideways" onto the substrate area outside of the chiplets (to cool the bottom chip a bit more). I'm thinking of these "copper heat lanes" carrying neither power nor data, just heat (unless they could also provide power at the same time). Just some curious "layman" thoughts.
It seems like compute on top is going to be the bigger win here, as many of the shortcomings of compute on bottom are just layering issues instead of thermal issues - i.e., something that can be fixed vs. something with real physics limits. Granted, the more stacked layers, the more the other approach would make sense, but for a single stack it doesn't seem like much of an issue. That said, I still think a more direct L4 would see some similar performance uplifts if it had a direct connection to the compute itself without having to go over the IF - a via connection that sits to the side, which a fully stacked memory cache could connect to.
So ~ the one thing this layout has got going for it is thermals. In every other way ~ EVERY OTHER WAY ~ it works better with the memory on top. But you can't cool it.
So, 2 things: 1. Why doesn't AMD put the 3D cache on the IO die? Wouldn't that let them also produce high-core-count X3D chips? 2. With the cache on the bottom, wouldn't it improve latency as far as memory reads/writes go? (I could be wrong here, but as I understand it, CPUs have to go through the L3 cache to get to memory anyway, so with the L3 on the bottom you can essentially check L3 on the way to memory in a much more efficient pattern, can you not?)
1. You could, but you would have a much higher latency/power penalty due to the cores having to go through IF links for access. Long term, if they get the IOD to N4P or turn the IOD into an active interposer with low enough power consumption, they will stack the CCDs on top. 2. That's interesting... I would agree that cache on bottom seems like the better option, but you can also configure/level the cache with the cache on top so that you don't have to go off-chip for memory reads, i.e. the cache on the CCD keeps the data there until it is written out to memory, though that may not be ideal for cache-heavy workloads.
@@lordec911 You make a valid point as far as #1 goes, and I think for #2... I guess it would be workload dependent. You'd have to measure how much information is new vs. how much is read/written to memory vs. read/written to cache... I'm assuming with new information you'd take a hit, but with cache reads and memory reads the hit would be smaller, if there's a hit at all.
L3 cache on the IO die would mean much higher latency between cores and the L3 cache. I suppose they could do some sort of L4 cache stacked on the IO die which is shared between all CCDs, but adding more and more levels of cache results in diminishing returns, so it may not have been worth it.
If they are skipping Turin-X, I really hope we will at least see one full-V-cache 16-core AM5 part, either Ryzen or Epyc. Perhaps a V-cache Threadripper PRO?
Maybe a compute sandwich design is a possibility? I.e. cache layers both above AND below the compute die. Maybe not high-end gaming but servers would be plausible.
That's just a bad idea. Firstly, you should not break the cache into separate parts, because it increases latency. And second, those two chips would be operating in totally different thermal and power conditions, which means they'd need different designs and separate production lines. The solution is always one way or the other, not both. We just don't know yet what is better: memory on top or on the bottom.
How are you supposed to effectively cool a high-package-power compute die under multiple layers of HBM? My thinking is the thermals should be so much better it's worth the cost and complexity of having the compute die on top. However, overcoming the latency may destroy all these benefits...
What about putting efficiency cores at the bottom with their cache on top, and performance cores at the top with their cache at the bottom? The P-cores at the top could even share the cache sitting over the E-cores on one side, while at the same time using their own main cache at the bottom. I think that could work?
I was wondering: it's generally said that moving data around once consumes more power than one arithmetic operation on the compute chip, so how is it that the compute chip generates more heat? Is it because the number of arithmetic operations inside the compute chip is much higher than the number of times we're moving data?
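Pretty much - a back-of-envelope with made-up but order-of-magnitude-plausible energy figures (loosely in the spirit of published ~45 nm numbers; real chips differ):

```python
# Per-event cost: moving data once can cost more than one bare ALU op...
E_ALU_OP    = 1e-12   # ~1 pJ per arithmetic op (assumed)
E_L3_ACCESS = 20e-12  # ~20 pJ per stacked-cache access (assumed)

# ...but the core executes far more ops than L3 accesses, and each
# instruction also burns energy in fetch/decode/schedule/clocking,
# which all lives on the compute die.
E_CORE_OVERHEAD = 5e-12          # per instruction, on compute die (assumed)
ops, l3_accesses = 1e11, 2e9     # hypothetical 1 second of work

compute_J = ops * (E_ALU_OP + E_CORE_OVERHEAD)
cache_J   = l3_accesses * E_L3_ACCESS
print(f"compute die: {compute_J:.2f} J, cache die: {cache_J:.2f} J")
# compute die: 0.60 J, cache die: 0.04 J
```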
Has anyone thought about trying to make heat conduct more efficiently through the bottom: through the substrate and the socket and so on? If you could get the heat relatively efficiently through the bottom and into a heatsink under the chip then that could mitigate the thermal disadvantage of compute-on-bottom. (And ofc there’s also the possibility of (fairly-)efficiently dissipating heat both through the top and the bottom.) I’m sure it wouldn’t exactly be easy to improve conduction through the substrate, but likely every other approach to the problem is difficult and/or unsatisfactory too.
You're forgetting that AMD wants to use that power headroom for increased clocks, and power scales linearly with clock speed. Worse yet, if those clocks come with increased voltage, power climbs much faster, since the voltage term is squared.
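For reference, the textbook CMOS dynamic-power relation (generic, not AMD-specific):

$$P_{\text{dyn}} = \alpha\, C\, V^{2} f$$

where $\alpha$ is the activity factor, $C$ the switched capacitance, $V$ the supply voltage and $f$ the clock. Since the last few hundred MHz usually require raising $V$ as well, power near the top of the V/f curve grows roughly with the cube of frequency.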
Imagine a CPU design where the CPU PCB is a hollow □ square with the die in the middle, where the contact pads of the LGA are on both sides of the CPU PCB, but only around a thinner perimeter, with a larger void of pads in the center that is rebalanced with added pads on the opposite-side perimeter. The die would have an IHS on both sides and board contact pads around the sandwiched IHSs in a perimeter on both sides. Then the die could be cooled from both sides with 2 heatsinks. Instead of routing all power and data channels to the bottom layer of the die, each individual layer of the die would route all IO directly to the outside perimeter of the die, where it would make contact with the outer square PCB where the power/data contact pads are. The LGA "Land Grid Array" pads on both sides of the CPU PCB could be sandwiched between two hollow □ square RAM modules from both sides of the motherboard and compress onto a rim LGA on the motherboard which channels all power and IO, so that the CPU die could have more IO directions for routing and 2x more cooling.

To explain it with an analogy: instead of a skyscraper with limited elevators and stairways that have to share the same vertical column space to transfer things between levels, the skyscraper has doors on every level that lead to the outside edge of the building, where an object-routing highway can access any level without passing through other levels, so that layers can communicate with less interference. Of course the transistors at the center of the die would have the most latency to reach the outside, so they could also have central routing, like elevators down the middle, with a gradient falloff toward the outside rim.

They could make the hollow square □ RAM modules fit into a hollow motherboard rim socket from both sides to hold the CPU between them in their sockets. Then CPU coolers and RAM as described above could be installed on both sides of the motherboard to help the CPU run faster. CPU manufacturers could have higher stacked cache and make thicker dies with more cores that have better routing in a smaller space and can be cooled more effectively. Intel could call it Sandwich Lake and AMD Sandwich Canyon, and they could feed masses of drooling nerds with new potato sandwich chips.
I like my 9950X, but out of the box one CCX is 300 MHz slower than the other. You can get them closer by manually tuning, but I feel like I got shafted by AMD just a bit. I paid for 16 cores, not 8 fast and 8 less fast.
Would it work if the 3D V-cache chip had spacing in the middle, like a hole, or was separated by something like a die that manages thermals, now that the compute die sits in the stack and thermals are affected?
I don't know much about silicon and engineering and whatnot, but seeing as the memory is lower down, would it be possible that AMD is working toward some kind of unified cache that bridges both chiplets?
I *love* how everyone and their brother is an armchair-quarterback CPU designer now. *eyes Ian nervously - what have you done* You should collab with the Asianometry and SemiAnalysis guys more often and form a team. Or more collabs with Level1 Wendell or GamersNexus Steve would be good too.
We have seen the massive benefit that the large L3 cache provides in gaming applications. Even though AMD could stack the cache higher, they haven't done so thus far. Why do you think AMD is not shipping chips with 400+ MB of L3 cache? Is it due to diminishing returns, technology limitations, or something else?
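Diminishing returns is a good bet; the usual way to frame it (the standard average-memory-access-time model, symbols generic):

$$\text{AMAT} = t_{\text{hit}} + m \times t_{\text{miss}}$$

Miss rate $m$ tends to fall only roughly with the square root of capacity (the classic rule of thumb), while $t_{\text{hit}}$ creeps up with array size, so each doubling past ~100 MB buys less game-relevant latency than the previous one while costing a full extra die.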
Hmmh, how much heat does the I/O chiplet from AMD produce? Because I smell some memory real estate that could be used as well. Might not be prime latency, but if you need large strings of bits and bytes out FAST, why not try it? I mean, if you build part of the chip as a skyscraper, you need to do that with the rest of the chip too, right? Can't just "top off to the IHS" with solder.
Mmmm, cache. 9950X3D, 16 cores, 208 MB of cache. I'll take that! Makes sense that consumers will get that this time, since they will have the capacity by not doing Turin-X this cycle.
Since the cache memory on the bottom is now sized the same as the CPU die, why can't they use the extra space that is essentially blank silicon for extra power vias to reduce resistance?
Wouldn't cache at the bottom make sense when HBM comes in... where HBM can still sit on top, cache at the bottom, and the cores in between?
Didn't you forget a major win for "Compute on Bottom": more free die space from much smaller vias on the FEOL layer? High Yield looked at the Zen 5 CCD from Fritz's pictures, and the on-CCD cache was much denser, with the vias barely noticeable.
Also, I remember in the early-to-mid 2010s, when die stacking and HBM talk got started, the thermal limitations were a big deal. There were papers I found about dummy/thermal bumps/vias to help transfer heat through layers; I want to say the sweet spot was around 15-20% extra. Haven't heard anything about it now that we are actually stacking chips... so what happened? Did it not actually produce results, or did power/thermal limits get pushed too far for that type of solution?
Why would you need TSVs in the CCD when the cache is on the bottom? I doubt the High Yield analysis is correct. He analyzed it from the wrong assumption that the cache die is on top.
@@kazedcat Oh, good point. I was still assuming the TSVs on the CCD would be needed to reach the correct layers for good power/data distribution but you can do that normally with the bonded pads.
CPU design decisions are made years ahead of product launches, and I think AMD were looking at Intel with their high frequencies and thought that even with 3D V-Cache they might struggle in gaming against the Intel 14900K etc., so even with all the added complexity, the only room for them to gain gaming performance was to push frequencies higher than was possible with Zen 4 (7000-series X3D). Personally, I am interested to see if they do dual 3D V-Cache on the 12- and 16-core 9000 series; if they don't, then I will skip Zen 5 and wait for Zen 6.
The more cache you can get off the CPU die the better: tighter-packed cores, closer cache memory because it's just 1 layer down - not a zillion km across or insane paths with nuts latency. I've always seen cache as a band-aid: you pay for it in heat and energy, and it covers up for the poor ability to transfer data from other parts of the package. AMD still have a serious ARM program going, I think? ARM is so much easier to target with the smaller instruction set. I really would love to see LS present a Snapdragon-killer ARM product from AMD with all their IP included, to humble the likes of Mali etc.
Then you get complications with embedding the cache chiplet into the substrate, or having dummy silicon structures that need vias, meaning 4 different pieces have to be perfectly aligned for bonding. It is much simpler to just make the cache chiplet less dense and the same size as the CCD.
I might be silly here, but could they not build the V-cache on the other side of the processor, i.e. on the same bit of silicon, like an A-side and B-side of a record?
If I understood correctly: currently only one side of the die has transistors etc. etched on it, and figuring out a way to do both sides would be a huge improvement. Normally, layers of connections are then built on top of this side while the other is left as pure unworked silicon. The through-silicon vias that connect these stacked chiplets are made by basically digging deep enough into the die that they're exposed after the unworked side is ground off.
@@whyjay9959 yeah. That's what I thought. Even if the costs were higher per finished die the overall costs would be lower because you get rid of a dedicated cache die.
Thanks for putting this into perspective. I thought it was a good idea; now it seems they're setting all the wrong priorities here... and helping Intel close up to their efficiency for no good reason.
I don't believe the ~100 μm of extra distance signals need to travel through the bottom die into the compute die is a measurable latency hit. Also, the reduced temperature of the compute die, thanks to the better thermals of it being on top, will reduce power consumption significantly - more than the hit from power and data having to travel further.
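Back-of-envelope (treating the path as an ideal line at roughly half the speed of light; real RC-limited TSVs are slower, but still in the picosecond range):

$$t \approx \frac{100\ \mu\text{m}}{0.5\,c} \approx \frac{10^{-4}\ \text{m}}{1.5 \times 10^{8}\ \text{m/s}} \approx 0.7\ \text{ps}, \qquad \text{vs. one 5 GHz cycle} = 200\ \text{ps}$$

So even with generous margins for the bond interface, the extra vertical distance is a tiny fraction of a clock cycle.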
Once you factor in the additional capacitance and inductance of the TSVs the added latency is probably measurable, but not a big enough issue to counter the thermal benefits of having the v-cache on bottom.
Impedance is what determines latency, and hybrid bonding has higher impedance compared to a direct connection. The extra TSVs in the cache die also add impedance.
@@MaxIronsThird Anandtech, before their shutdown, published latency data for Zen 4 & Zen 5. There is a noticeable difference even within the same CCD, and even after AGESA 1.2.0.1 the inter-CCD latency for the 9950X is still higher than the 7950X, but that could also be attributed to the Infinity Fabric staying at a 256-bit bus while the CCD has a 512-bit bus.
@@TheBlackIdentety Right. Will wait for 9950X3D reviews before deciding which to pair with my upcoming 5090 build, coz it's a PC I want to last 10+ years.
Top or bottom doesn't matter; the substrate is in some ways a heat sink... try cooling a CPU with another CPU! Man, that substrate and the IHS catch the heat really fast!
According to Gamers Nexus, AMD said that the lower die gets thinned (to ~20 µm), while the upper stays (relatively) thick. Meaning in the old approach the CCD got thinned and had a low thermal mass, adding to the problem of accumulating heat. In the new approach the cache die is thinned, which should a) mean more thermal mass for the CCD and b) reduce the named negative aspects of this approach.
Latency shouldn't suffer, because of increased clocks (especially base clocks). I am wondering about something Ian didn't talk about - the fact that there are fewer TSVs, or at least some forms of connections between the dies. Also, they look smaller, which will probably make the alignment tolerances for layering those chiplets tighter. Last time it was 9 microns; now it will hopefully not be too much less. Or maybe I am misremembering the details of the high-res photos? Maybe there are fewer TSVs, but they are bigger? Damn, I will have to re-check it.
Thermal mass isn’t the term you mean. I think you mean thermal resistance? Once at steady state the thermal mass doesn’t matter.
@@jarretta2656 I think they're talking about thermal capacity.
@@Eternalduoae as in the maximum allowable temperature?
@@jarretta2656 No, as in the amount of heat that can be absorbed per unit mass. In chemistry we'd call it specific heat capacity.
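For what it's worth, the lumped model usually used here makes both points explicit (a generic first-order model, symbols mine):

$$C_{\text{th}}\,\frac{dT}{dt} = P - \frac{T - T_{\text{amb}}}{R_{\text{th}}}, \qquad T_{\text{steady}} = T_{\text{amb}} + P\,R_{\text{th}}$$

Heat capacity $C_{\text{th}} = m c$ only sets how fast the die heats up (it buys time during bursts); once $dT/dt = 0$, the temperature is set entirely by power and thermal resistance, which is the steady-state point made above.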
Remember, a cooler transistor needs way less voltage than a hot one, so keeping them cool should have top priority. Lower temps help with uncontrolled local heat buildup that can result in rapid degradation. Thermal expansion should also be greatly reduced. All this will make the CPU more bulletproof, so perhaps even fewer RMAs for AMD.
I don't think the latency shift will be major, since the distance change is minuscule. The reason it was on top in the first place was that it came out of the skunk works and was a fast way to get more out of an existing design. If the cache tech had been ready on day one, they would have made it underneath from the start.
I love the innovation and focus on making my gaming faster unlike the competition.
At smaller nodes and lower voltages a cooler transistor actually needs more voltage
@@andrewschmier6749 Is that a typo? At lower V you need more V?
@TechTechPotato
I can't believe I missed it, but I forgot to ask the important question - will "Backside Power Delivery" change the "pros and cons" of this "flip"?
Or rather, is it possible that with it, maintaining high frequency will be easier, and some of the negatives will no longer apply?
I too am looking forward to Dr Cutress's thoughts on Backside Power Delivery's arrival and interaction with stacking.
Well pointed out, I am quite curious as well how BPD will affect all 3D stacking concepts, whether with logic on the top or bottom. Obviously, as one commenter pointed out, thinning silicon dies will be needed no matter which approach is chosen, as the thermal benefits are pretty significant. Another industry advance, in my opinion, will be both modified existing and new materials which will deliver much-needed benefits, including thermal dissipation. Chip stacking plays and will continue to play a key role, including advanced packaging, as producing large dies becomes even more expensive. Also, SRAM cache, which plays an even more important role for AI chips, has reached its limits and cannot be shrunk anymore, while it takes up expensive space on the die, so it will have to be stacked as well. It looks like SRAM L1 cache will continue as it is due to its small size, but L2/L3/L4, which can be much larger, will definitely get stacked or will eventually be replaced by the new type of memory the industry has been looking for for decades. There isn't a holy-grail memory for now, but resistive memristor types or the even more promising PCM might win industry adoption. Looking at the "natural way" of things, which is the most efficient, hardened and proven over millions of years: the brain as an organic processor doesn't distinguish compute logic from memory and treats them as the same, which says a lot about which way future compute, not just AI, should go.
@@El.Duder-ino You have to understand that this "SRAM doesn't scale with node" thing is only true if you compare it to the logic scale factor. You will always be able to get "some" scaling. Plus, making SRAM dense isn't too hard. There are other techniques that make it scale decently, apart from 2.5D and 3D stacking.
@@jannegrey Good to know👍 Even more reasons for making on-chip memory, SRAM or something new that might come up in the future, larger and more prevalent in all chips. AMD's recent V-Cache consumer CPUs like the just-released 9800X3D benefit from larger cache, and not only in games. Judging from the early tests, flipping the cache die underneath compute was definitely the right choice, as results are much better than with the opposite approach seen in the last gen. The industry needs not only faster cores but also on-chip memory that is at least as fast and efficient, but larger, and not only for AI workloads. Maybe PCM memory will deliver that together with non-volatility, but until then we are stuck with SRAM, which is getting limited, as stacking won't completely solve its major capacity + thermal issues.
@@El.Duder-ino Most of it comes from the higher TDP and MUCH higher thermal, frequency, power and voltage limits. The rest comes from SRAM, because Zen cores are famously starved for memory. Now they can keep 3 times as much predictable data and instructions on-package without having to go to system memory - yeah, that's gonna improve things a lot. In Cinebench, if you put the 9700X and 9800X3D on the same PPT, they have basically the same score (though that's hard to do, because their PPT values don't exactly match up; currently the BIOS runs the 9800X3D with power settings like a regular 7000-series part, plus X3D chips have always had a different PPT from the rest).
There is, however, a setting that "equalizes" them. Sort of: ECO mode. And it shows a 1-3% difference in favor of the 9800X3D. Given that it doesn't hit any of its other limits, that probably just means the cache is big enough that some instructions manage to repeat themselves before they get purged.
Also - warning - some channels have launch-day results for the 9700X and other Zen 5 CPUs from before the Windows updates and upgraded BIOS, and then compare them to a 9800X3D with all those improvements.
Zen 5 was a ground-up redesign, and it included this cache being at the bottom. That being said, there are some things in Zen 5 that are active but unused, which Alexander Yee wrote about. He suspected they will be used in Zen 6 and that what we see is silicon left in place to test electrical compatibility. I do wonder, though, whether some microcode couldn't activate those parts. They aren't fused off or anything; the firmware just doesn't use them.
People like giving Arrow Lake the benefit of the doubt - saying it's Intel's "Zen moment", and that software wasn't initially optimized for Zen either and it took years until it finally was. I mean, version 24H2 of Windows 11 boosts Zen 3 performance enough that it comfortably competes with 12th Gen. So 7 years on and we still get optimizations for a "new" architecture. But Zen 5 is "new" as well. You can group Zen 1 to Zen 4 into one family, and now Zen 5 to presumably at least Zen 8 into another, which means it will also get a boost from optimizations.
For V-cache though... we'd need an Alexander Yee-level deep dive, and not many people go that deep. I'll check if there is something, but probably not for the next few weeks.
Sounds like an optimization, not a dilemma. There already was dead space in the previous cache chiplet. It sounds like a good place to put power and signaling
Are you sure all those connections were somewhere on the outside of the chip? Because memory clusters work best when they are closely packed; spreading them out to allow TSV connections makes things a bit problematic.
I don't get the fourth point. Any data access has to go through the cache anyway. There's no "skip the cache access step" latency advantage for cache on top here.
Dude is still holding his Intel stock and hoping for the best.
If he knew what he thinks he knows, he would have helped Intel with Arrow Lake.
Maybe if it needs to access something in another CCD's cache chip?
Not literally through the cache, but through the cache controller, that's the difference. The cache controller looks in embedded tag memory whether things are cached or not, and dispatches either external bus fetch or cache fetch, and it also puts things into the cache when they arrive from external bus.
Besides it all depends on the memory mapping and there's lots of stuff that bypasses cache entirely. PCIe address spaces for example are excluded from cache. DRAM can be mapped cached or uncached. Usermode apps just happen to only see the cached DRAM.
Now the question is where the L3 cache controllers live: whether there's one per CCD, or one per CCX, and the cache dies are connected by a simple address/data bus that simply extends the internal cache structure of the CCD, or whether the extra cache dies are connected by the same bus (Infinity Fabric) as the CCD and IOD and come with their own L3 controllers. If I were to bet, it's one of the former, since the closer you can bring the L3 cache controller to L2 the better, as they have to manage synchronisation and invalidation traffic.
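A toy sketch of the look-up/dispatch flow described above (direct-mapped for brevity; names and structure are mine, the real L3 is set-associative with coherency state on top):

```python
class CacheController:
    """Toy direct-mapped cache: tag memory decides cache fetch vs bus fetch."""
    def __init__(self, num_lines: int, line_size: int = 64):
        self.num_lines, self.line_size = num_lines, line_size
        self.tags = {}   # index -> tag   (the embedded tag memory)
        self.data = {}   # index -> line  (the SRAM array)

    def read(self, addr: int) -> bytes:
        index = (addr // self.line_size) % self.num_lines
        tag = addr // (self.line_size * self.num_lines)
        if self.tags.get(index) == tag:
            return self.data[index]            # hit: dispatch cache fetch
        line = self._external_bus_fetch(addr)  # miss: dispatch bus fetch
        self.tags[index] = tag                 # fill when data arrives
        self.data[index] = line
        return line

    def _external_bus_fetch(self, addr: int) -> bytes:
        return bytes(self.line_size)           # stand-in for IF/DRAM access
```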
@@SianaGearz Thanks for the detailed comment, upvoted. Of course by "through the cache" I meant through the cache controller. There has to be a logic block that maintains the relevant data structures and implements pieces like coherency and replacement policies, especially since we're talking about shared L3 cache. I still find Ian's description utterly confusing. Starting from 9:18 - "the data has to go into the core first, the core then accesses the cache, back to the core, and then the core deals what it has to do by going to the PCI-E controller and then out to the GPU".
@@boshi9 I believe Ian's main point there is that all of the vias to the compute die have to physically pass through the cache chip, the issues being latency, alignment of connections, and signal integrity.
You could put compute layer in the middle and sandwich it with 2 memory layers.
So, you'd rather have an extra 64 MB of cache on top of the 96 MB currently on 3D models, but keep the lower clocks of the older models?
Maybe, but according to the images of the die, that would take up a good amount of space on the core layer.
"Why not both?"
This would be a logical sandwich. 😅
But the thermals?
The X3D series were known for lower performance when there were cache misses, particularly for compute tasks, where they performed worse. They may be targeting better performance for those tasks by ensuring higher clocks. There's nothing keeping them from reversing the stack for datacenter chips; it's just that they'd incur more R&D and validation costs for implementing a second kind of stack.
Given how Zen 5 hasn't been a real improvement over Zen 4... I suspect they had to try something different, and it paid off. But I also think the X3D cost them more in the process.
Well, today's Nov 1st for those of us at most 4.5hrs behind UTC! 😂
I can't imagine there is any influence on latency with compute on top. A fraction of a mm of extra, not-perfectly-straight conductors going through a low-ish power IC (which can even be optimized to avoid placing circuits around the TSVs) sounds like nothing compared to going several mm away to the IO die through a substrate where multiple connections are already present. If FCLK turns out to be the same, then latency is guaranteed to be the same. I guess the upper limit of FCLK might be lower in the worst case.
Just the person I was hoping would cover this!
One thing that's surprising is the lack of L3 capacity increase.
The vcache die is ~36mm² on their current products. The Zen 5 die is 70.6mm².
What are they doing with all that additional die space?
Putting all those TSVs and spacing out the SRAM clusters to better manage heat.
It's a compromise between capacity and latency. The bigger the memory, the more it has to sift through per cycle, and latency is something you want to avoid in its intended application, i.e. games.
You have to remember that Zen 5 underwent a significant increase in transistor density for better branch prediction, so a lot of the interconnect for V-cache was reduced on the logic layer, since the thirsty V-cache can be fed directly with juice from the power vias underneath. It's a 2-birds-with-1-stone approach.
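On the capacity/latency point, the usual first-order model (a rule of thumb, not AMD data): word and bit lines have to span the array, so the wire-delay component grows with the array's linear dimension:

$$t_{\text{access}} \approx t_{\text{logic}} + k\sqrt{A}, \qquad A \propto \text{capacity}$$

So roughly, 4x the capacity doubles the wire-delay term, which is one reason a gaming-tuned L3 stops growing once the working set fits.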
The cells aren't as dense as in previous V-cache models. If they had seen any considerable performance increase from raising cache capacity even more, I'm sure they could've easily done it. IMO 64 MB is already way more than enough for gaming; I'm pretty sure 32 MB would be only 5 to 10% slower in games.
@@MaxIronsThird Quite the contrary. If you check High Yield's video of Zen 5 under the microscope, you will see that the V-cache this time around is a lot narrower and more centrally compact.
So instead of stacking silicon on top, you can have taller power vias to feed the CCD.
Now that the v-cache is underneath the CCD, they had to put a bunch of vias in the v-cache die to make connections between the package substrate and the CCD. I'm guessing those are taking up most of that additional die space. It's also possible they switched to a lower power SRAM cell for the v-cache which uses more area but reduces power consumption (and thus reduces heat), since the v-cache is now further away from the heatsink. But my guess is it's mainly the through-silicon vias.
My question is: will the 9950X3D be dual-cache this time, since there is no longer the heat/clocks tradeoff of the previous generations?
I think they should probably at least make a version of it which uses dual stacked cache chiplets, because they can easily make them without significant extra development or production costs, I think, and because some people would definitely be willing to pay more for one which has stacked cache on both compute chiplets. It will also allow them to have an even more impressive flagship CPU, which should have some marketing value for them.
I know not that many people would actually want one, but a lot of people would buy it just for bragging rights, and for the bigger spec number, and the people who actually run workloads which would benefit from that would absolutely love them.
So, unless they're worried about the added development complexity and cost of making that additional SKU, it seems to me that they should do it, but I wouldn't bet on them doing it either. If they make one of these, they might not even want to bother with a version which only has a single cache chiplet on one of the compute chiplets.
It won't work. The data banks must contain data from both sides, meaning the cache would need to be split in half.
While it might be interesting to do this on a 9950X, personally I believe they already missed the mark here with the Epyc 4004 series. For the 3D V-cache chips they produced, which were effectively 7000-series Ryzen, I would have loved to see a 3D V-cache 16-core chip - it would have been a potential upgrade from a 7950X for my server workloads. The total clock speed for these chips is also not as relevant, I believe; if max clock speed is desired, you'd pick the non-3D-V-cache chip anyway, I would guess... Too bad they didn't do that; hopefully they'll refresh the Epyc 4004 with Zen 5 with dual 3D V-cache. It would be an easy upgrade path then.
@@maou5025 Is that a problem though? Each core has its own L1 and L2 cache, and I assumed each CCD had its own separate L3 cache. Sure, I can see a lot of issues with cache-line invalidation across the CCDs for certain workloads, but that would be a big concern even without the V-cache, and it appears AMD already dealt with that problem years ago.
If cache latency doesn't get worse, then I don't see why they shouldn't just pile on more V-cache, even if only to make thread scheduling simpler for the OS, what with each CCD getting the same performance characteristics. There were quite a few problems on the 7950X3D where cache-dependent threads kept getting assigned to the wrong CCD, resulting in worse performance in many applications than on the 7800X3D.
@@fnorgen If one CCD has to cross over and address the other, the problem is exactly the same, V-cache or not. If AMD has to spend money on addressing cross-CCD latency, they would probably rather spend it on more cache or on updating the CCD to 10 cores.
Mad love for the silicon design representation in EXCEL!
Discord still needs the video link posted. Thanks for the video!!!
Always appreciate your insights!
One thing I would be curious about with the L3 die is whether the cache can be accessed by the other CCD for cross-CCD cache hits. We know that cache hits across CCD boundaries are expensive, and one potential win would be for multi-threaded workloads to hit the L3 cache of the other CCD with marginally less latency. In conjunction with higher boost frequencies, I could see an argument that overall latency in two-CCD designs may not be as bad, and it may reduce OS scheduling overhead if pinning workloads to specific CCDs becomes less of an issue.
With the leaked benchmarks of the Zen 5 9800X3D chips, there is a significant bump over the 7800X3D in multi-core performance, but surely this is due to the higher boost frequency as opposed to lower compute-to-RAM latency. In any event, I'm confident the Dr got this one right with thermals being the primary driver here, but I am also interested in seeing the cross-CCD latency in the two-CCD designs when they're reviewed vs Zen 4 - will average cross-CCD latency be better or worse per time interval?
Clock latency isn't the be-all and end-all. Take DDR5, for example: CAS latency goes up in terms of cycles (not time), but performance is still higher overall thanks to the higher base frequency.
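To put rough numbers on that (a sketch; the kits below are just illustrative examples, not a claim about specific products):

# CAS latency in nanoseconds = CL cycles / memory clock.
# For DDR, the memory clock in MHz is half the transfer rate in MT/s.
def cas_ns(cl_cycles: int, data_rate_mts: int) -> float:
    memory_clock_mhz = data_rate_mts / 2
    return cl_cycles / memory_clock_mhz * 1000  # cycles/MHz -> ns

print(cas_ns(16, 3200))  # DDR4-3200 CL16 -> 10.0 ns
print(cas_ns(30, 6000))  # DDR5-6000 CL30 -> 10.0 ns, same time despite ~2x the cycles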
C4 bumps - sounds like Marine Technology
Seeing as SRAM doesn't scale, I would imagine AMD wants to reduce native CCD cache amounts. There are several things they could do simultaneously, as I see it:
1) node shift from 4nm > 3nm, obviously.
2) SRAM doesn't scale: so AMD reduces L3 from 32MB > 16MB on mainstream parts for a decent cost reduction (gamers can go and buy a V-cache version as usual with a 128MB slice).
3) lower clocks mean more efficiency, so the gaming parts are obviously highly efficient, as we see today.
4) they offer a 24-core model: CCD0 is 8x Zen 6 with V-cache (16MB+128MB), CCD1 is 16x Zen 6c, so it's great for gaming and production while sipping power.
Actually, I wonder about just having a massive 32c Zen 6c CCD with a double GMI link... but the 1-core performance would tank by about 20%, from 5GHz > 4GHz.
There are so many options and combinations of cores, clocks and cache that the SKU possibilities get busy fast.
SRAM might not scale but die shots are in, and they've somehow densified that SRAM somewhat fierce! Looks like this rule isn't written for them.
According to TSMC, SRAM bit-cell size is the same between their N5 and N3E. N3B had slightly smaller SRAM cells, but only Apple used that.
Cache on bottom is the better approach, because cache on top means the clock speed of the cores has to be reduced. There would also have to be spacers that act as thermal insulators.
I actually would give the "Packaging" point to compute on top, because you don't need ~3 additional structural silicon pieces to fill the gaps, which makes alignment much easier. With compute on top it's also far easier to put 2 CCDs on top of one big (2x size) V-Cache slice. Maybe that's something we will see with the 9900X3D and 9950X3D, because AMD could connect those 16 cores directly via this hypothetical big V-Cache slice. It would act more like a monolithic 16C CPU with better latencies. With this possibility in mind, you could also put the "latency" point on the right side (compute on top).
Greetings from Germany!
@6:27 Not necessarily valid. Looking at how their cache was top-stacked previously, there were "dummy" silicon blocks on either side of the cache because the cache wasn't large enough to cover the entire die. If the cache is similarly sized and on the bottom, couldn't power be routed through where the dummy blocks would be? Seems feasible. You would still have to have all the data vias properly insulated, but power would be less of an issue, I'd think.
It's not just power vias but also data. With both, but especially data, you have to be careful with how it's routed to get the "cleanest" signal possible. Routing through the "dummy" silicon area increases the complexity of the chip, decreases signal strength, and increases voltage loss and data latency.
With compute on top, the distance between substrate and compute is the equivalent of slightly longer motherboard traces. Good motherboard design could/should negate this additional length to the signal path. Compute to cache distance is the same either way. Compute on top gives better thermals, leading to the ability to overclock. Personally, I think the advantage is with compute on top.
I love power bottoms.
"When in doubt, C4." - Jamie Hyneman
You could have added frequency to compute-on-top for the reasons you gave for latency. I like that you mentioned smartphone chips. Only a few applications favor high frequencies; any server or mobile workload will favor efficiency over frequency in the compute/watt trade-off. Also, thermodynamics is a complicated topic: if you match the thickness of the layers and the thermal mass (mass * specific heat) correctly, a cache die could act as a heat spreader. You also have the option to cool the CPU through the power and GND connections.
I wonder if we will see custom-matched cache sizes on top of CUs outside of AI chips. Something like we have in microcontrollers, where certain addresses in RAM can be pulled by a peripheral and don't require a load/fetch/store from the CPU.
Interesting dilemma, thx for that! We will be 100% sure once X3D is delidded, hopefully soon (der8auer)😉 BTW, as other commenters like @jannegrey have pointed out - how is this going to interact with BPD (Backside Power Delivery)? Thinning chip dies will happen no matter what, but BPD might bring some new challenges, right?
My take is that they will keep compute on top for the gaming CPUs if it works this generation, but then have compute at the bottom for other products (x900 and x950, plus servers, where they brute-force cooling).
At least until they change how we think of thermal dissipation in that context and somehow find a way to efficiently cool what's at the bottom.
Nope, EoS means you don't mix and match. This is how Zen5 was designed, so this is likely what it will be on all products using Vcache this generation.
@lordec911 Oh sorry, I didn't mean for this generation, but for future ones (more Zen 7, or whatever it will be called by then, than Zen 6). This one would basically be to see "how it goes".
@@nekogami87 Oh, I missed that. As to that, who knows. EoS and KISS means you stick to a single design/technique but maybe there are enough benefits to customize for the future market segments.
I still think the end goal is the IOD basically becoming an active interposer chip with the CCDs (and Vcache) stacked on top of it (maybe a small GPU chiplet too). Then you could throw HBM on it or LPDDR next to it. Basically a single tile version of the current Instincts.
@@lordec911 can't wait to have enough stacked layer to get a "cube" full computer :D
Could this be a test run for having the compute dies on top of the I/O die as an "interposer"?
Doubtful, too much latency
@@RenRenification why would it be more than going through the substrate with infinity fabric?
@ Because the I/O die is physically farther away than something sitting directly on top of the CCD.
@@RenRenification I think you got me wrong in the first place. I'm not talking about any 3D V-Cache, just the CCD on top of the I/O die instead of beside it.
I think in the next few generations we will see CCD on top of cache on top of the I/O die. The next step will be to increase the number of cache layers in between the CCD and the I/O die.
Thanks for having this up after the GN video
Ian, I think the purpose of placing the L3 cache memory on top of or below the same die is that, this way, the die has less area (mm²) and is therefore cheaper to manufacture. If the 64 MB of L3 cache were planar, the die would have more area and be much more expensive to manufacture. The cache memory layer being closer to the x86 cores is just a consequence.
And the correct term would be "layer" of L3 cache memory, not "die" of cache memory, since there is only one die. I don't think two dies are "soldered" together by the TSVs.
I love the images made in Excel. Proper engineering!
I am still hoping for a dual CCD chip with both having the Vcache. One can hope.
Would be great for a dual-GPU cloud gaming server.
If there is a need for spacers with compute on top, why not leverage those spacers to improve signaling? That'd remove the need to use TSVs to move power to the top of the stack, since there would be an alternative path up. And since the spacers have just surface wiring, do they even need to be shaped like a box? Could the outside edge be sloped to reduce wire length? That'd be a shorter distance than two right angles (one in the package and the other vertical through the spacer/TSV).
Similarly, if the compute and cache dies don't need to be the same dimensions, you could have the top of the SRAM stack use the same wire bonding for power off of the spacers. Looking from the top, imagine two rectangles, rotate one 90 degrees from the other, then overlay them to get a cross-shaped arrangement.
For a hypothetical Turin-X, eight stacks of 64 MB SRAM underneath all 16 chiplets would equate to 8704 MB of L3 cache in a package. Going for 128 MB SRAM dies would permit 16896 MB of L3 cache. That's more SRAM as L3 than the average consumer system nowadays has DRAM. I'm rather disappointed that AMD has no official plans for Turin-X, as leveraging multiple stacks would be a game changer for cache-sensitive workloads. Even more mundane workloads could run entirely out of the L3 cache. The L3 latency with 8 stacks would not be good, but still radically faster than DRAM, and bandwidth would not change based on stack size. These wouldn't be cheap parts (16 compute dies, 128 SRAM dies and 1 IO die), but for some markets I'd imagine they'd pay the premium, as it'd still be cheaper than some per-core software licensing schemes.
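The arithmetic works out if you assume each CCD keeps its native 32 MB of L3 underneath the stack (my assumption, matching current Zen CCDs):

# Hypothetical Turin-X cache totals: 16 CCDs, each with 32 MB native L3
# plus a stack of 8 V-cache dies.
NATIVE_L3_MB = 32
CCDS = 16
STACK_LAYERS = 8

for die_mb in (64, 128):
    total = CCDS * (NATIVE_L3_MB + STACK_LAYERS * die_mb)
    print(f"{die_mb} MB dies -> {total} MB of L3")  # 8704 MB and 16896 MB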
You spent some 2 minutes repeating over and over and over and over, in slightly different phrases, that the compute die on top is easier to cool.
The thing I was most surprised about is how insulating the bonding layers were.
It seems to me that reducing these has probably helped as much as the switcharound, because I don't see thermal TSVs on the CCD, so the thermal transfer from the lower layer must have been improved.
You assume that the bonding layer is perfectly flat, but at the nanometer scale it is not totally flat, so the interface will have significantly higher thermal resistance.
@kazedcat Yeah, it just wasn't something I was thinking about. But it was interesting to see that they optimised that layer, too.
@@Eternalduoae There must be a capping layer, a migration barrier and a surface adhesive. You can remove some of these if you have a material that can serve multiple functions.
I'm curious whether X3D chips will have better cooling than regular ones, because the cores are now moved closer to the heatsink.
@@Boris-Vasiliev It should have better cooling because the hot spot is now closer to the cooling solution. This is the reason why this new X3D has unlocked overclocking. AMD is now confident that the chip will not cook itself.
Is it possible to design in a way such that most of the power goes to the top from the edges of the dies, to whichever die type (memory or compute) is on top?
Would this reduce power going through the chips, from bottom to top? (power compared to data is not latency sensitive i.e. it does not need to propagate through the entire wire paths like data, so thinking if power takes the "longer route" it would still work the same)
If the power distribution now goes around the edges to the top chip instead of through the bottom chip, I wonder if the freed-up space between the substrate and the bottom chip could be used for "copper lanes" of some shape that dissipate heat more effectively sideways, onto the substrate area outside the chiplets (to cool the bottom chip a bit more). I'm thinking of these "copper heat lanes" carrying neither power nor data, just heat (unless they could also provide power at the same time).
Just some curious "layman" thoughts
Isn't MI300A already essentially this? That thing has cache on the bottom while Zen4 CCD on top.
It seems like compute on top is going to be the bigger win here, as many of the shortcomings of compute on bottom are just layering issues instead of thermal issues - i.e., something that can be fixed vs something with real physics limits. Granted, the more stacked layers, the more the other approach would make sense, but for a single stack it doesn't seem like much of an issue. With that said, I still think a more direct L4 would see some similar performance uplifts if it had a direct connection to the compute itself without having to go over the IF - a via connection that sits to the side, which a fully stacked memory cache could connect to.
So ~ the one thing this layout has going for it is thermals. In every other way ~ EVERY OTHER WAY ~ it works better with the memory on top. But you can't cool it.
He's just objectively wrong on Latency though.
So 2 things:
1. Why doesn't AMD put the 3D cache on the IO die? Wouldn't that also let them produce high-core-count X3D chips?
2. With the cache on the bottom, wouldn't it improve latency as far as memory reads/writes go? (I could be wrong here, but as I understand it, CPUs have to go through the L3 cache to get to memory anyway, so with the L3 on the bottom you can essentially check L3 on the way to memory in a much more efficient pattern, can you not?)
1. You could, but you would have a much higher latency/power penalty due to the cores having to go through IF links for access. Long-term, if they get the IOD to N4P or turn the IOD into an active interposer with low enough power consumption, they will stack the CCDs on top.
2. That's interesting... I would agree that cache on bottom seems like the better option but you can also configure/level the cache with the cache on top so that you don't have to go off-chip for memory reads, i.e. the cache on the CCD keeps the data there until it is written out to memory, though that may not be ideal for cache heavy workloads.
@@lordec911 You make a valid point as far as #1 goes, and I think for number 2... I guess it would be workload-dependent. You'd have to measure how much information is new information and how much is read/written to memory vs read/written to cache... I'm assuming with new information you'd take a hit, but with cache reads and memory reads the hit would be less, if there's a hit at all.
L3 cache on the IO die would mean much higher latency between cores and the L3 cache. I suppose they could do some sort of L4 cache stacked on the IO die which is shared between all CCDs, but adding more and more levels of cache results in diminishing returns, so it may not have been worth it.
1. AMD does do that with the MI300 series, and will likely do something similar with Zen 6, given they are rumored to switch the interconnect too
Do both.. P-cores + integrated graphics + E-cores cache on the bottom with P-core cache and E-cores + SOC + AI acceleration on top.
Cash goes out the door, who really needs these new CPUs?
me?
If they are skipping Turin-X, I really hope we will at least see one full-V-cache 16-core AM5 part, either Ryzen or Epyc. Perhaps a V-cache Threadripper PRO?
Maybe a compute sandwich design is a possibility? I.e. cache layers both above AND below the compute die. Maybe not high-end gaming but servers would be plausible.
That's just a bad idea. Firstly, you should not break the cache into separate parts, because it increases latency. And second, those two chips would be operating in totally different thermal and power conditions, which means they'd need different designs and separate production lines. The solution is always one way or the other, not both. We just don't know yet which is better: memory on top or on the bottom.
I wonder when they are going to start stacking compute dies as a next step instead of using chiplets.
Fascinating stuff. Rocket science is child's play compared to manufacturing these chips.
How are you supposed to effectively cool a high-package-power compute die under multiple layers of HBM? My thinking is the thermals should be so much better that it's worth the cost and complexity of having the compute die on top. However, overcoming the latency may destroy all these benefits...
Maybe those empty spaces on the sides of the chip could be used for a slower L4 cache as a buffer?
What about putting efficiency cores at the bottom with their cache on top, and performance cores at the top with their cache at the bottom? The performance cores on top could even share the cache sitting over the efficiency cores on one side while using their own main cache at the bottom at the same time. I think that could work?
I wonder if any snack manufacturers would consider making something that looks exactly like a finished microchip wafer.
I was wondering: it's generally said that moving data around once consumes more power than one arithmetic operation on the compute chip, so how is it that the compute chip generates more heat? Is it because the number of arithmetic operations inside the compute chip is much higher than the number of times we're moving data?
Would be expensive... but tiny channels for internal water flow for cooling.
I won't be surprised when diamond substrates are used for thermal conduction. It's so cool that you can also wear it as jewelry! 😂
Has anyone thought about trying to make heat conduct more efficiently through the bottom: through the substrate and the socket and so on? If you could get the heat relatively efficiently through the bottom and into a heatsink under the chip then that could mitigate the thermal disadvantage of compute-on-bottom. (And ofc there’s also the possibility of (fairly-)efficiently dissipating heat both through the top and the bottom.) I’m sure it wouldn’t exactly be easy to improve conduction through the substrate, but likely every other approach to the problem is difficult and/or unsatisfactory too.
Wouldn't power decrease with the lower operating temperature of the compute die, more than the increase in data transmission power?
You're forgetting that AMD wants to use that power headroom for increased clocks, and power scales linearly with clocks. Worse yet, if it comes with increased voltage for those clocks, power scales quadratically with voltage.
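To make that scaling concrete, a first-order sketch using the standard dynamic-power relation P ≈ C·V²·f (the voltage/frequency pairs below are made-up illustrative values, not measured ones):

# Dynamic power scales roughly with capacitance * voltage^2 * frequency.
def dyn_power(volts: float, freq_ghz: float, cap: float = 1.0) -> float:
    return cap * volts**2 * freq_ghz

stock = dyn_power(1.10, 5.0)
boosted = dyn_power(1.25, 5.5)  # +10% clock, bought with ~14% more voltage
print(boosted / stock)          # ~1.42x the power for a 10% frequency gain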
Imagine a CPU design where the CPU PCB is a hollow □ square with the die in the middle, where the contact pads of the LGA are on both sides of the CPU PCB, but only around a thin perimeter, with a large void of pads in the center that are rebalanced with added pads on the opposite side's perimeter. The die would have an IHS on both sides and board contact pads around the sandwiched IHSs in a perimeter on both sides. Then the die could be cooled from both sides with 2 heatsinks. Instead of routing all power and data channels to the bottom layer of the die, each individual layer of the die would route all IO directly to the outside perimeter of the die, where it would make contact with the outer square PCB where the power/data contact pads are. The LGA "Land Grid Array" pads on both sides of the CPU PCB could be sandwiched between two hollow □ square RAM modules from both sides of the motherboard and compress onto a rim LGA on the motherboard which channels all power and IO, so that the CPU die could have more IO directions for routing and 2x more cooling. To explain it with an analogy: instead of a skyscraper with limited elevators and stairways that have to share the same vertical column space to transfer things between levels, the skyscraper has doors on every level that lead to the outside edge of the building, where an object-routing highway can access any level without passing through other levels, so that layers can communicate with less interference. Of course the transistors at the center of the die would have the most latency to reach the outside, so they could also have central routing, like elevators down the middle, with a gradient falloff toward the outside rim.
They could make the hollow square □ RAM modules fit into a hollow motherboard rim socket from both sides to hold the CPU between them in their sockets. Then CPU coolers and RAM as described above could be installed on both sides of the motherboard to help the CPU run faster. CPU manufacturers could have taller stacked cache and make thicker dies with more cores that have better routing in a smaller space and can be cooled more effectively. Intel could call it Sandwich Lake and AMD Sandwich Canyon, and they could feed masses of drooling nerds with new potato sandwich chips.
Autism alert
I like my 9950X, but out of the box one CCX is 300MHz slower than the other. You can get them closer by manually tuning, but I feel like I got shafted by AMD just a bit. I paid for 16 cores, not 8 fast and 8 less fast.
Would it work if the 3D V-Cache chip had spacing in the middle, like a hole, or was separated by something like a die that manages thermals, now that the cache die sits on the bottom and thermals are affected?
I don't know much about silicon and engineering and whatnot, but seeing as the memory is lower down, would it be possible that AMD is working toward some kind of unified cache that bridges both chiplets?
I *love* how everyone and their brother is an armchair quarterback CPU designer now. *eyes Ian nervously - what have you done*
You should collab with the Asianometry and SemiAnalysis guys more often and form a team. Or more collabs with Level1 Wendell or GamersNexus Steve would be good too.
We have seen the massive benefit that the large L3 cache provides in gaming applications. Even though AMD could stack the cache, they haven't done so thus far. Why do you think AMD is not shipping chips with 400+MB L3 cache? Is it due to diminishing returns, technology limitations, or something else?
Hmmh, how much heat does AMD's I/O chiplet produce? Because I smell some memory real estate that could be used as well. Might not be prime latency, but if you need large strings of bits and bytes out FAST, why not try it? I mean, if you build part of the chip as a skyscraper, you need to do that with the rest of the chip too, right? Can't just "top off to the IHS" with solder.
I wonder if they thought about a cache-compute-cache sandwich?
Mmmm cache: 9950X3D, 16 cores, 208MB of cache. I'll take that! It makes sense that consumers will get it this time, since AMD will have spare capacity from not doing Turin-X this cycle.
Since the cache memory on the bottom is now sized the same as the CPU die, why can't they use the extra space that is essentially blank silicon for extra power vias to reduce resistance?
they need way more power vias now that the ccd is on top, so they're definitely already doing that
The right decision for higher clocks.
This is why we need a 4th spatial dimension.
9:42 "your amount of L3 access is low" - what does that mean? Any memory access will go through L3 and populate/read cache lines.
On the bottom it will go to L1 and L2 first. L3 acts like a pool.
Cache does not work like this.
Wouldn't cache at the bottom make sense when HBM comes in... where HBM can still lie on top, cache at the bottom, and the cores in between?
Thermals is the only downside of the 7800X3D, curious to see how the 9800X3D will "stack up" ;)
Didn't you forget a major win for "Compute on Bottom": more free die space from much smaller vias on the FEOL layer?
High Yield looked at the Zen 5 CCD from Fritz's pictures, and the on-CCD cache was much denser, with the vias barely noticeable.
Also, I remember in the early to mid 2010s, when die stacking and HBM talk got started, the thermal limitations were a big deal.
There were papers I found about dummy/thermal bumps/vias to help transfer heat through layers, I want to say the sweet spot was around 15-20% extra.
Haven't heard anything about it now that we are actually stacking chips... so what happened?
Did it not actually produce results, or did power/thermal limits get pushed too far for that type of solution?
Why would you need TSVs in the CCD when the cache is on the bottom? I doubt the High Yield analysis is correct. He analyzed it from the wrong assumption that the cache die is on top.
@@kazedcat Oh, good point. I was still assuming the TSVs on the CCD would be needed to reach the correct layers for good power/data distribution but you can do that normally with the bonded pads.
CPU design decisions are made years ahead of product launches. I think AMD were looking at Intel's high frequencies and thought that even with 3D V-Cache they might struggle in gaming against the Intel 14900K etc., so even with all the added complexity, the only room for them to gain gaming performance was to push frequencies higher than was possible with Zen 4 (7000-series X3D). Personally, I am interested to see if they do dual 3D V-Cache on the 12- and 16-core 9000 series; if they don't, I will skip Zen 5 and wait for Zen 6.
No Turin-X? 😢😢😢 I was waiting for it
Any recommendations for a 12-16 core CPU that would be the better choice? I have a gaming system and a workstation, both AM5.
I think they should build cooling channels into the chip. Doesn't graphene conduct heat really well?
It does. However, it also conducts electricity very well.
This is indeed the direction this tech is headed.
Cool, get a bottom and top cache with shared HBM RAM interposed.
Now that they’ve done both individually, why not both at the same time? A cache sandwich.
Will AMD apply the same bottom X3D approach to the Epyc Turin CPUs?
I mention this at the end of the video :)
Isn't it just whichever produces the most heat goes closest to the heat sink?
I must wonder - will AMD go out with a bang and release a small run of 5950X3D chips for AM4? (And where do I get one lol)
Can you tell me something about the "idle power draw" of the Ryzen 9000 series? Is it better or worse than the 7000 series?
For simple stacking, would it make sense to make a compute "sandwich" between 2 cache dies?
I feel like that would be the worst of both worlds - now not even the thermals are good.
There are probably some specific workloads that would benefit from that, but in general it's probably not worth doing.
Isn't 3D V-Cache rather thermally sensitive to begin with?
The more cache you can get off the CPU die the better: tighter-packed cores, and closer cache memory, because it's just 1 layer down - not a zillion km across or insane paths with nuts latency. I've always seen cache as a band-aid: you pay for it in heat and energy, and it's covering up for the poor ability to transfer data between other parts of the package.
AMD still has a super ARM program going, I think? ARM is so much easier to target with the smaller instruction set. I really would love to see LS present a Snapdragon-killer ARM product from AMD with all their IP included, to humble the likes of Mali etc.
Come on AMD, let us experiment with micro-channel liquid cooling.
The question is, when did they know this? What's in the research labs right now?
The cache chiplet was smaller than the Core chiplet, maybe they don't have to bring everything through the cache chiplet.
Then you get complications with embedding the cache chiplet into the substrate, or having dummy silicon structures that need vias, meaning 4 different pieces have to be perfectly aligned for bonding. It is much simpler to just make the cache chiplet less dense and the same size as the CCD.
You can flip your cache, hell, you can flip the whole chip. You can't flip one-sided printer paper - it will disappear.
What about compute | substrate | cache? 🙃
I might be silly here, but could they not build the V-cache on the other side of the processor, i.e. on the same bit of silicon, like an A side and B side of a record?
If I understood correctly: currently only one side of the die has transistors etc. etched onto it, and figuring out a way to do both sides could be a huge improvement.
Normally, layers of connections are then built on top of this side while the other is left as pure unworked silicon. The 'through-silicon-vias' that connect these stacked chiplets are made by basically digging deep enough into the die that they're exposed after the unworked side is ground off.
@@whyjay9959 yeah. That's what I thought. Even if the costs were higher per finished die the overall costs would be lower because you get rid of a dedicated cache die.
Cores, cache and IMC on separate plates, please
That's Zen6
Where is Chips and Cheese's 285K review? 😆
how about top and bottom?
Thanks for putting this into perspective. I thought it was a good idea; now it seems they're setting all the wrong priorities here... and helping Intel close up to their efficiency for no good reason.
-Because I was inverted (mavericache)
Just stack it like a Big Mac, no worries.
Just glue it on the side like a little solar panel.
I don't believe the ~100μm of extra distance signals need to travel through the bottom die into the compute die is a measurable latency hit.
Also, the reduced temperature of the compute die from the better thermals of being on top will reduce power consumption significantly - more than the hit from power and data having to travel further to reach it.
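A back-of-the-envelope check on the distance part (assuming signals move at roughly half the speed of light, which generously ignores RC-dominated on-chip interconnect):

EXTRA_DISTANCE_M = 100e-6         # ~100 um of extra vertical travel
SIGNAL_SPEED_M_S = 1.5e8          # ~c/2, an optimistic on-chip figure

flight_time_ps = EXTRA_DISTANCE_M / SIGNAL_SPEED_M_S * 1e12
cycle_time_ps = 1 / 5.0e9 * 1e12  # one clock cycle at 5 GHz

print(flight_time_ps)  # ~0.67 ps
print(cycle_time_ps)   # 200 ps: the flight time is a rounding error;
                       # TSV/bond parasitics, not raw distance, are what matter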
It increases crosstalk and you need higher voltage to hit higher frequency, which increases heat generated within the SRAM layer. That's my take.
it's definitely not a lot of latency added
Once you factor in the additional capacitance and inductance of the TSVs the added latency is probably measurable, but not a big enough issue to counter the thermal benefits of having the v-cache on bottom.
Impedance is what determines latency, and hybrid bonding has higher impedance compared to no hybrid bonding. The extra TSVs in the cache die also add impedance.
@@MaxIronsThird AnandTech, before their shutdown, published latency data comparing Zen 4 and Zen 5. There is a noticeable difference even within the same CCD, and even after AGESA 1.2.0.1 the cross-CCD latency for the 9950X is still higher than the 7950X, but that could also be attributed to the Infinity Fabric staying at a 256-bit bus while the CCD has a 512-bit bus.
Will 9950X3D outgun 9800X3D in gaming?
Unless it has two V-Cache CCDs, no. Even if it does it won't be by much.
@@TheBlackIdentety Right. Will wait for 9950X3D reviews before deciding which to pair with my upcoming 5090 build, coz it's a PC I wanna last 10+ years.
Nope
@@R6ex I can't imagine keeping the same PC for 10 years. It only takes a few years for cheap hardware to outperform the best old hardware
@@JDD_Tech_MODS 😥
Top or bottom doesn't matter; the substrate is in some ways a heat sink... try cooling a CPU with another CPU! Man, that substrate and the IHS catch the heat really fast!
Nothing about it being on the bottom is a bad thing