Excellent video, I'm both intrigued and confused. AMD's chiplet design seems to be much more rudimentary than Intel's EMIB and Foveros. Yet it worked much better. Why?
Packaging is only one part of CPU performance and efficiency. Monolithic is the most performant and most efficient. More advanced packaging will not compensate for deficiencies in the node, microarchitecture, or core layouts/counts. You can use chiplets to increase performance by having more silicon than monolithic, or reduce costs by using such smaller dies that the higher yields and simpler design pay for the packaging and then some.
AMD uses a tiny chiplet to house their CPU cores and uses this chiplet in most of their server, workstation, and mainstream CPUs. With such scale AMD can easily bin their products to have high frequencies for desktop and excellent efficiency for server and workstation; they are also very cheap with excellent yields. AMD's usage of a separate IO die lets them use a separate node for IO, so AMD can save a lot of money. The negatives of using such primitive packaging are higher in server and workstation. AMD remains more efficient than Intel by using more advanced nodes and having higher core counts. The higher core counts also allow AMD to maintain a performance lead. All together, AMD's chiplet philosophy reduces costs for the company.
Intel is less efficient than AMD primarily due to less advanced nodes. The 13900K at ISO power is more efficient than the 5950X, but is competing with the 7950X which has a node advantage. Meteor Lake is actually comparable in efficiency to Phoenix in spite of using tiles, but the silicon interposer causes it to be much more expensive than Phoenix. On server Intel uses massive tiles over 5x as large as AMD's CPU chiplet. Intel is limited by yields, causing many of their server products to be delayed. Intel can't go bigger due to poor yields, so their server products have a core count disadvantage compared to AMD. Delays cause Intel to launch on poorer nodes, and having fewer cores further hurts performance and efficiency compared to AMD.
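A rough way to see the yield side of that argument is a simple Poisson defect model. The sketch below uses made-up die areas and defect density purely for illustration, not AMD's or Intel's actual numbers:

```python
import math

def poisson_yield(die_area_mm2: float, defect_density_per_cm2: float) -> float:
    """Fraction of defect-free dies under a simple Poisson defect model."""
    defects_per_die = defect_density_per_cm2 * (die_area_mm2 / 100.0)  # mm^2 -> cm^2
    return math.exp(-defects_per_die)

D0 = 0.1          # defects per cm^2 (assumed, illustrative only)
monolithic = 300  # mm^2, hypothetical 16-core monolithic die
chiplet = 70      # mm^2, hypothetical 8-core CCD

print(f"monolithic yield:  {poisson_yield(monolithic, D0):.1%}")
print(f"chiplet yield:     {poisson_yield(chiplet, D0):.1%}")
# Getting two good small chiplets is still more likely than one good big die:
print(f"two good chiplets: {poisson_yield(chiplet, D0) ** 2:.1%}")
```

With these assumed numbers the big die yields ~74% while a pair of small chiplets still lands around ~87%, which is the cost argument in a nutshell.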
AMD got into it first on consumer platforms. We've only seen EMIB in server chips from Intel so far. Up until very recently with Meteor Lake, which uses Foveros on a full interposer, all Intel consumer CPUs have been functionally monolithic. AMD actually loses power everywhere with their current interconnects compared to a monolithic chip, and you can see this in idle power draw where a 14900K can get into the single digit watts while the 7950X sits above 20W most of the time. Where they got ahead was with a process node advantage, sometimes multiple steps ahead. Intel's 14nm was impressive for how much they extracted from it and how much power the dies can take before exploding, but it was never a highly efficient node aside from the year or so it was brand new and everything else was just worse, making it look good in comparison.
Silicon interposers or bridges should be manufacturable on obsolete manufacturing processes like >28nm, no? And they wouldn't be nearly as sensitive to minor defects as a CPU or GPU. So they'd increase cost, but not anywhere near as much as you might guess based on the size of the "chip" compared to a modern CPU or GPU. Those old foundries are probably very cheap to place orders with, as well as very reliable.
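A quick back-of-envelope sketch along those lines; the wafer cost, usable-area fraction and interposer size below are assumptions for illustration, not foundry pricing:

```python
import math

wafer_diameter_mm = 300
wafer_cost_usd = 1500.0      # assumed cost of a processed wafer on a mature node
interposer_area_mm2 = 800.0  # assumed large interposer, near the reticle limit
edge_loss = 0.85             # assumed fraction of wafer area usable for whole dies

wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
dies_per_wafer = int(wafer_area * edge_loss / interposer_area_mm2)
print(f"~{dies_per_wafer} interposers per wafer "
      f"-> ~${wafer_cost_usd / dies_per_wafer:.0f} each before yield and assembly")
```

Even a reticle-sized interposer comes out in the tens of dollars under these assumptions; the bigger cost drivers tend to be assembly, test, and packaging capacity rather than the interposer silicon itself.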
Don't Meteor Lake CPUs use silicon interposers? If they change Infinity Fabric, I think we should expect an additional "interconnect tier" that is decoupled from Infinity Fabric: faster than Infinity Fabric, but slower than L3 links. Chips and Cheese also showed that the L3 latency penalty from N21 to N31 was surprisingly small.
There is no EUV reticle limit for interposers because the feature size should be well above the minimum for 193nm wavelength lithography. No one will make a huge interposer on a fab that could be making 28nm planar semiconductors instead. It's all going to be coming from depreciated fabs, probably from the 90nm days or earlier. The only question I have is if manufacturers will start putting active or passive components on interposers. Resistors seem like an easy thing to do. Inductors are typically done near top metal so that would also be easy. Capacitors would need poly, gate oxide and doping so that's a lot more process but it gets you close to doing CMOS or bipolar devices.
So Zen 6 Infinity Link is going to attack the weakest link in current Zen systems, which held them back from completely obliterating Intel in gaming. Sometimes some games just straight up have way poorer frame pacing on AMD compared to Intel, even though AMD has higher fps on average.
I honestly think that beyond Zen 6, CPU performance won't matter much in the consumer market, it will all be about GPU and AI acceleration. The best interconnect for CPUs will be the one that is the cheapest while not significantly limiting CPU performance. The best interconnect overall is the one that works best for GPUs and AI accelerators.
With technologies like bridge dies, I see a cut in latency by reducing the SERDES overhead. When your channel is sufficiently short, I think it is likely we see a switch back from lots of unsynced serial links to massive parallel buses with 'dumb' transceivers. This can cut out a lot of protocol overhead, saving power and latency.
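The power side of that trade-off is easy to sketch. The pJ/bit figures below are illustrative assumptions (not vendor specs), just to show why shorter, denser links matter:

```python
# Back-of-envelope link power at a given bandwidth for assumed energy-per-bit values.
def link_power_w(bandwidth_gbytes_s: float, pj_per_bit: float) -> float:
    bits_per_s = bandwidth_gbytes_s * 1e9 * 8
    return bits_per_s * pj_per_bit * 1e-12  # pJ -> J per second = W

bw = 64.0  # GB/s, hypothetical die-to-die read bandwidth
for name, pj in [("long on-substrate SerDes link (assumed ~2 pJ/bit)", 2.0),
                 ("short fan-out/bridge parallel bus (assumed ~0.5 pJ/bit)", 0.5)]:
    print(f"{name}: {link_power_w(bw, pj):.2f} W at {bw} GB/s")
```

Roughly 1 W versus 0.25 W for the same 64 GB/s under these assumptions; multiply that by several links and by read-plus-write traffic and the packaging choice starts to show up in the power budget.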
Frame pacing for games and latency-sensitive applications will be the biggest beneficiaries of Zen 6. This also addresses energy consumption when communicating between different dies, so idle power consumption should go way down as well. That makes chiplets a more viable technology for mobile applications like laptops and maybe even handhelds.
On desktop and server it seems overall better to increase cache than fabric performance, if improving the fabric comes at a higher cost. One must not forget that the IF design was developed due to cost: it was the cheaper way to increase core counts beyond what was otherwise possible. Is the fabric power even an issue right now? From everything I see, the issue is that power is not scaling as well as density on new nodes. So we end up with 95C cores and 200W+ CPUs.
With the new Infinity Link, would it be easier to place smaller chiplets? I think it would be more efficient to surround the IO die with the CPU dies to spread the heat more evenly and reduce the paths.
Yes, it is kind of crazy AMD is taking this approach for three generations and now a fourth. But consumers apparently don't care enough about the latency and power efficiency drawbacks. For me, with a passively cooled system, it looks insane that a chip like the Ryzen 5600X uses 30W just sitting at the desktop, compared to 15W for a 5700G (which of course uses a monolithic die and therefore isn't wasting power on the interconnect). Unfortunately, reviews focus on peak performance and people don't notice the fans spinning a few hundred rpm faster than necessary. I do hope Zen 6 will bring a change to this.
Well, to be fair, their competition (Intel) is FAR less efficient on PC, laptop and servers. And if AMD wants, as they've shown, they can compete even with ARM in efficiency (look at their latest Ryzen CPUs for laptops and their crazy efficiency).
@@peq42_ yes, overall efficiency is excellent for Ryzen CPUs. But I was commenting on the efficiency at idle/low loads. The chiplet design is wasting a lot of energy when the CPU is not being used or used lightly.
Think about how you would cool that stack. The I/O die can run fairly hot itself. L3 inside it could be interesting, but keeping as much cache close to the cores is better, which is what V-cache is for. Long term I suspect an active interposer handling I/O could happen, but bridges between dies would also do mostly the same job.
If CAMM2 modules take off, we first have to deal with those. CKD might become an optional part of the spec in the DDR6 we see this year. A single CAMM2 offers dual channel, and since the modules are smaller, getting more "sticks" in takes less space. Much easier to stack 2-3 of those modules.
Would it be possible to use older-generation silicon tech for the interconnect silicon? This might not be true, but my assumption would be that we only have a very limited capacity to make sub-5nm stuff but a much larger capacity to make 20nm-class stuff. And I was thinking, couldn't we just use the 20nm factories to make the interconnects cheaply and the 5nm factories for the actual important chips? Surely the interconnects don't need to have the fanciest, smallest gates possible?
AMD could have just elevated the area needed for silicon bridges if possible, but it's worth a look for an even layout and the high concentration of heat dissipation needs of Zen 4. Should be coming, hopefully.
I think the same as you, and Intel is doing what they have: they don't have experience with organic RDL, and I don't even know if they have any demo of this technology. And RDL looks great because, from what I know, even if the package doesn't work, they can remove the RDL and put on a new one.
My guess would be that Ryzen will take Organic RDL/Infinity Links for sure, but for server CPUs, I'd bet everything on Silicon Bridges. There's just no point in going cheap there, especially if they're reworking the entire packaging.
They should add the dual data back in and let the chipset be a glorified USB controller. Hopefully that allows much better routing and avoids retimers pushing the signal all over the mobo and back again.
Well, with AM5 they have lots of Z-axis room to play with in theory, due to making the IHS thicker to be "backwards compatible" with AM4 heatsinks. So some Z-axis package stacking? That is assuming that, one, they do this and, two, they plan on trying to cram it into AM5 still. I would not be surprised if Zen 6 is a new socket though, if they are going to be doing any of what I have been seeing in the leaks.
Silicon interposers also kind of defeat the point of splitting the chips in the first place, no? It was to lower manufacturing losses and inaccuracies while being more thermally and thus energy efficient by spreading the heat load. Not sure I'm following your logic, as it would undo all of the benefits.
You still keep the benefit of manufacturing different things on different nodes, not to mention easier semi-custom designs, which are AMD's forte, or at least an important part of their business.
While the locations of the chiplets on the CPU are the same, actually looking at their layouts they are very different. The IO die between Zen 2 and Zen 3 is basically the same, of course, but the core chiplets are all very different from one another. And the Zen 4 IO die also looks really different than the previous one.
Intel is making MTL's interposer on a modified version of an old node. 40nm might not offer the interconnect density, but 28-20nm would be plenty. I don't know what TSMC makes their silicon packages on though; they might be using 16nm as they've had that node for ages.
Yes, but 1st gen Threadripper/Epyc behaves more like 2-4 separate processors on one package, and starting with 2nd gen it's one processor with CPU cores in different chiplets.
Excellent prediction analysis, thank you once again for a very educational video. Chip packaging will most certainly play a very important role in the future. As you elaborated, it's going to be about finding the right optimum between performance and cost. AMD's approach is clearly to be more performant per dollar and as efficient per watt as they can be, to keep their competitive edge against the competition. Can you open a "can of worms" in future videos about Intel's approach of using glass substrates, as well as other competing technologies besides organic? I also wonder what Nvidia will deliver with their custom Arm cores and what kind of SoCs/APUs we are going to see from them. With the failed acquisition of Arm, Nvidia clearly showed their future plan to not be just a GPU gaming/compute accelerator company, but a complete CPU + GPU and accelerator company like AMD and Intel.
You're skilled at illustrating how industry buzzwords translate into real-world scenarios. Much appreciated.
yes
But are they really buzzwords? I mean, most packaging acronyms are somewhat descriptive.
@@filipenicoli_ The difference is whether the person using it understands what it means and whether it's being used to inform or advertise.
Like the word modular, for example: very different in circuitry and engineering as a whole, but in a commercial context it's a painful buzzword.
Sure is! Makes you wonder why the manufacturers can't just explain it that way, but I trust him more anyway.
2:56 The IF protocol isn't PCI-E, that's what the physical layer is using (basically the wires). The protocol is based on HyperTransport. Please disregard my mistake.
And, as someone on Patreon pointed out, Navi 31/32 might already use ASE FOCoS packaging; we don't know for sure if it's InFO_oS/R from TSMC.
Latency is not the boogeyman that everyone seems to think it is. The larger caches in the Zen CPUs help mitigate memory latency and actually benefit SMT by giving the second thread a little more space to perform as thread 1 waits for data.
Your description of Zen 2 Infinity Fabric is correct: each interconnect is a single point-to-point connection, each of which can be saturated by the data from a dual-channel memory implementation. Zen 3 changed things, replacing the single bidirectional point-to-point connections with a loop that provides dual bidirectional interconnects that double the data transfer bandwidth and eliminate the saturation bottlenecks when memory reads and GPU writes were competing with each other, which caused the slow gaming performance on the Ryzen 1000-3000 CPUs.
Zen 4 changed things up a little to reduce power consumption, limiting the IF to 2000 MHz instead of the frequency of the memory. Dual connections at 2000 MHz don't double to match the memory bandwidth of DDR5-6000 MT/s, but still provide enough bandwidth not to be bottlenecked by the dual channels of memory and the GPU competing for IF bandwidth.
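For rough scale, here is the arithmetic behind that comparison, assuming the commonly reported Zen 4 figures of 32 B/clk read and 16 B/clk write per CCD-to-IOD (GMI) link at FCLK; treat these as ballpark numbers, not official specs:

```python
# Rough bandwidth comparison: one GMI link at FCLK vs. dual-channel DDR5-6000.
fclk_mhz = 2000
read_gb_s = 32 * fclk_mhz * 1e6 / 1e9    # assumed 32 B per FCLK cycle (read)
write_gb_s = 16 * fclk_mhz * 1e6 / 1e9   # assumed 16 B per FCLK cycle (write)

ddr5_6000_dual_channel = 6000e6 * 8 * 2 / 1e9  # MT/s * 8 B/transfer * 2 channels

print(f"GMI per CCD: {read_gb_s:.0f} GB/s read, {write_gb_s:.0f} GB/s write")
print(f"DDR5-6000 dual channel: {ddr5_6000_dual_channel:.0f} GB/s")
```

Under these assumptions a single CCD's read path (64 GB/s) cannot absorb the full 96 GB/s that dual-channel DDR5-6000 can deliver, which is exactly the kind of gap a new interconnect would need to close.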
I already know what you meant because you didn't say that it uses PCIe, you said it's based on PCIe
@@bradmorri Since there's no way to compare, your statement is baseless, as there's no data to back it up. I have to disagree when you run apps that are heavily threaded and data always needs to be passed between threads AND you have a 2 CCD part. That latency is going to add up. And in fact this is why Intel has been able to compete in different areas against Zen 4.
The next issue is, and I'm sorry but you're just GOING to be wrong here: as cores get faster, either through IPC uplifts or clock speed improvements, EVERY bit of latency will matter more and more, and to say otherwise would be wild.
Cache is NOT read-ahead, and cache ONLY provides benefit in certain applications, mostly when you need to keep REUSING the same data/code over and over, and in the world of PC this happens more in gaming than anything else, which is why X3D parts are better for gaming. But if you do a render task, sorry, but that cache is almost worthless because you are CONTINUALLY using new data from streams, and then creating new streams. These are read-from-memory or read-from-disk, then write-back-to-disk operations. Now, for code there's a lot more benefit from bigger L1 and L2 caches, especially when you keep running some function over and over again, as in that render task. You don't get the benefit for the data with a larger cache, but you do get the benefit for the code.
I mean, really, go check out any myriad of CPU reviews and look at what applications have benefited from larger L1, L2 and L3 caches. Core-to-core data transfers, which once again will happen in heavily threaded apps when you have to pass data from a core on one CCD to another, are impacted by that latency, and it's part of the reason why moving to 2 CCD parts doesn't scale as nicely as most people would want. But that latency also affects other operations, as I said.
Can you make a video about Intel's backside power delivery *BPD* technology?
@@bradmorri your comment is at odds with itself.
There's a reason why we don't have monster L1/L2 caches on CPUs, which is because of latency. The physically bigger the cache becomes, the higher the latency and the less efficient the core becomes. That's why AMD made such a huge deal about v-cache and 3d stacking, it allowed them to make the caches larger without moving them physically further away from the logic that needs them.
And why do we need big caches? Latency. Having to wait to go out to memory is slow, and if we can avoid doing so, we should. DMA exists to cut latency, it's one of the main drivers for CXL, etc. Some things are obviously more latency-bound than others, but poor latency hurts everything.
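The textbook average-memory-access-time (AMAT) formula makes both sides of this thread concrete. The latencies below are assumed, illustrative values only:

```python
# AMAT = hit_time + miss_rate * miss_penalty (classic single-level approximation).
def amat(hit_ns: float, miss_rate: float, miss_penalty_ns: float) -> float:
    return hit_ns + miss_rate * miss_penalty_ns

l3_hit_ns = 10.0   # assumed L3 hit latency
dram_ns = 80.0     # assumed extra latency for going out to DRAM on an L3 miss

for miss_rate in (0.20, 0.10, 0.05):  # a bigger cache roughly means a lower miss rate
    print(f"L3 miss rate {miss_rate:.0%}: AMAT = {amat(l3_hit_ns, miss_rate, dram_ns):.1f} ns")
```

A bigger cache (lower miss rate) pulls the average down a lot, but any latency added to the hit time or the miss path shows up directly in the average too, which is why both commenters have a point.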
I remember Adored's video from years and years ago talking about new interposer technology. As always, there's a 5-10 year gap from research to application.
Intel has already used silicon interposers since the Alder Lake era, around 2020-2021. Their tiles use Foveros. Intel's EMIB is like within-package communication, which is what AMD does with chiplets. Pretty cool Intel tech. If Intel has already been doing it since 2021, I think it won't be long until AMD catches up and starts using silicon interposers for their chiplets. Maybe in 2-3 years.
He covered the "Buttered Donut" and many laughed. Turns out he was right, and Zen 3 did have the "Buttered Donut" tech. It was the foundation for the 3D caches.
@@hammerheadcorvette4 That sounds very dramatic. Last time I checked, AMD never specified their chiplet topology, but that was back when Zen 3 launched.
@@slimjimjimslim5923 AMD did it in GPUs prior to that with Fiji (Radeon Fury).
The substrate manufacturing can be incredibly difficult, similar to the silicon itself. Future CPUs will be staggeringly complex beyond the transistors themselves.
I guess Apple makes back their Ultra SoC interposer investment by charging 200 dollars for one additional 256GB generic SSD NAND flash chip.
£200 for 8GB of RAM...
lol never thought of it like that
Yeah, Apple tax. But IMO, in terms of tech implementation, Apple did wonders with their own chips. I can only applaud the performance. Especially the GPU section is very, very impressive per watt.
Apple = trash, marketed at the non-technical retail crowd of sheep.
Apple are planning on using a silicon interposer for the construction of individual chips, meaning each processing core will be cut and glued into place, as will each cache, GPU, neural processor, analog processor and whatever specialty chunks they decide to add next. Shouldn't expect to see it before M7 though (~2028ish).
I know some of this packaging is TSMC's or licensed from other research groups. But unlike Intel, AMD has really pushed the envelope with Zen, including all the research they've backed and invested in across their supply chain. I'm really excited for Zen 6 because that's when a High-NA EUV node, GAAFET, next-gen packaging with 2.5D interconnects, RDNA5 (improved RT, image reconstruction, mesh shaders, etc.), an all-new Zen 5+ architecture with iterative improvements in Zen 6, an all-new Zen 6 memory controller, and full-width AVX-512 will all converge into a single innovation step. I'd imagine this is the platform a Steam Deck 2 APU will be built on. 2026.
Si interposers are costly because they are generally huge, but on a consumer platform like Zen 6 they can make sense if AMD manages to shrink both the logic and IO die and put them right next to each other; but at that point they might as well use EMIB.
Still too expensive for a consumer platform; InFO is what will be used, like in RDNA 3.
@@thevaultsup what do those terms mean?
@@keylanoslokj1806 Consumer platform = Ryzen, InFO = new ways of packaging chiplets. I mean, I don't even know what you actually mean; the video already explains most of it.
Fantastic video. Really enjoyed the breakdown and elaboration on the various interconnect options and their cost/benefit implications. Subbed for more!
I concur with your analysis and your conclusion that Infinity Link (meaning InFO in the end) will be AMD's next interconnect technology. Moreover, I'm not only seeing AMD go with the same interconnect technology for client and server, I see the mere potential to do so as a very compelling argument in favor of Infinity Link.
Also, it would be a very AMD thing to do. Ever since their return to competitiveness, AMD has capitalized on implementing the most cost-effective solution to an engineering problem that just gets the job done without overextending themselves and further iterating on proven solutions afterwards.
Other than that, a very well thought out video that I enjoyed a lot! Your presentation skills have improved significantly. Hard to believe you're still doing all this in your spare time. Should you ever come somewhere around the middle of Germany, I'd very much like to have a beer together. 😉
Oh - and Tom has tainted you. You've obviously converted to the "mm-squared" crowd. 😂
These are excellent illustrations and explanations for those of us who are nerds but not experts!
Wow, this was sooo above my understanding, but you made it understandable, and the hows and whys!
Very good, sensible, credible video! Discovered your channel with the Steam Deck OLED video you did and this one is excellent as well!
One question about Infinity Links:
Since it improves latency and bandwidth, would that be a major advantage for the iGPU as well? Even by Zen 6 it's highly unlikely that we have on-package GDDR or HBM, but if just the local cache were shared between the CPU and iGPU, that would somewhat alleviate the enormous bottleneck of the iGPU, right? While benefitting from the energy efficiency and only going out to main memory when it is needed.
What packaging technology is utilized on Apple's Max chips? Are they silicon bridges?
Anyhow, thank you for these explainers and compendiums. Really appreciated.
Do you mean the Ultra versions, which is two Max chips glued together? Yes, that uses silicon bridges. Even the Max chips that don't end up in an Ultra still have a relatively large portion of their area dedicated to the interconnect they don't use.
There's no official information, but Apple is using a silicon bridge. It's either CoWoS-L or InFO_LSI.
Thank you both for the answer.
Yes, I meant the Ultra version. I had the lineup mixed up in my mind.
This is interesting then, since it means that the vias that connect to the bridge are vertical with respect to the plane of the die, and not stacked on the side as per Apple's marketing material. I was wondering how that was possible through lithography.
@@salmiakki5638 It is a very confusing lineup. Pro, Max, Ultra... you can be forgiven for thinking the Max would be... like... the max and not the mid :D
I wonder about how this changes X3D implementation?
I can see them carrying over the existing design where the memory chiplet is stacked on the CCD. I could also see them stacking the CCD on top of the memory chiplet, removing the need for TSVs on the CCD, enabling increased logic density and a smaller die area on the cutting-edge node. This would improve thermals as well.
That being said, AMD could use a Si interposer with integrated memory for the X3D variants and an organic interposer for the non-X3D variants.
Here's to more insight and knowledge from none other than High Yield! Thank you for the lessons!
As an AM5 user, I'd just like to see a higher core count chiplet. Also I'd like to see a 2 chiplet CPU with both chips using V-Cache.
Interesting, nice insight into complicated topics. Thanks for simplifying.
5:20 I don't really see why the EUV reticle matters. As long as you don't place any (or at least not too much) logic in the interposer and it really is just an interposer, then you can use way less advanced nodes with larger reticles or where mask stitching is easier to do.
In fact, the upper metals in newer process nodes are still done using immersion machines, not EUV (at least that was true the last time I looked at it, and it would be weird if it changed). You need to go through the upper metals anyway, so it's not like you can achieve a higher bump density, so using the same expensive processing nodes for an interposer makes little sense in my opinion (as long as you don't put serious logic in it, that is).
I just looked it up: our last research chip, Occamy, had chiplets fabbed in 12nm and the interposer used 65nm.
You're at ETH?
I agree, makes little sense to have the EUV reticle matter. As far as I am aware, CoWoS uses passive silicon interposers. I know ST has done quite a bit of work on active interposers, moving things like power management, clock management, power gating hardware into the interposers, with the interposer based on something like a 65nm technology. Could even offset the higher cost of the interposer as you save area on the 2/3 nm CPU die because you move that stuff onto the interposer.
That was exactly what I was thinking, older process node fabs would be delighted to have the opportunity to fabricate a relatively high value interposer die and you would get great yields on such a simple layout.
Hi, just subscribed - your channel rang a bell from a MooresLawIsDead podcast you participated in!
While watching a video I got an ad for an advanced packaging company... They make custom cardboard boxes.
Since I've started watching videos like this, I've started getting ads for help with advanced macular degeneration.
Wrong AMD, advertisers. 😂
5:33 Who else thought at first he was showing the headquarters of some random silicon tech company?
Now I can't unsee it.
You make a good case for what AMD might do for Zen 6 consumer products. I also think we might get a preview of it with Zen 5 based Strix Halo if it is multi-die as has been rumored. InFO_oS seems like the natural way of going about it, since we've already seen it work with GPU IP in RDNA 3 and it would need a more efficient interconnect for mobile and APU use. Odds are probably still good that for at least some enterprise products they will use silicon bridges or interposers, especially since, like Intel, AMD will also be looking to use HBM on Epyc if the rumors pan out.
I don't know how you did it but I was highly entertained and interested throughout the video
It's dark magic ;)
For Epyc I think the silicon interposer makes a lot of sense, as the margins are higher there; for Ryzen, AMD's Zen 6 solution makes a lot of sense.
Great info and well presented as always! Thanks for the explanation, and organic interposer technology definitely seems like the best solution for Zen 6. I also am glad to hear you talk about how power and heat of data transmission is becoming a performance bottleneck, and the role of advanced packaging in solving it. I'd love to see you do a video on in memory/near memory computing and how that may filter down to consumer products.
The current revised rationalised layout of Zen CPUs hints at future layouts. They'll probably use infinity link, it costs less and will increase yields. They've already rationalised their CPU layouts into more logical blocks, they're getting their house in order and probably using current generation of CPUs to prepare for their next move. Look at the dies, clues are there.
Curious why the ‘double die’ method used with early Pentium D’s was abandoned. It seemed like a good solution for two dies communicating with each other and yields. If both dies were good, you have really low power intercommunications already in place. If one die is good and one is bad, you can create a lower end chip by slicing in half. If the yields are mixed, then you have an intermediate product (i.e. 7900X instead of 7950X).
I think the Pro vs. Max version of Apple Silicon chips use this approach.
Data rates are much higher these days. First gen Ryzen used the dual die method, but it had all sorts of problems moving memory around efficiently when scaled out.
The IO die method allowed keeping the cores all fed evenly and kept costs down. But, DDR5 is outpacing the bandwidth AMD can get with the current link density, so a solution to that issue is required.
@@unvergebeneid Sort of, yes, but they use the area of the die that would normally be cut to transfer data, so it's just one piece of silicon.
The Pentium D was a terrible solution. There was no interconnect at all. They were two separate processors communicating over the front side bus through the socket. Same situation later with the Core 2 Quad, which was two separate Core 2 Duo chips that could only communicate over the FSB through the socket - no interconnect between the dies at all on the package.
@@TrueThanny wasn't the FSB famously slow and bottlenecky anyway?
I haven't understood the "advanced packaging" topic mentioned over recent years, but now I feel all caught up :)
I wonder what comes next after silicon bridges.
Stacking dies on top of the I/O die?
@@brainletmong6302 Yeah. AMD claims it's no problem with two layers (normal CPU + V-Cache), but three layers? More? We're bound to see issues.
From what I've seen the next step is to get latency and bandwidth of existing interconnects back to monolithic levels, but far future I would expect to see stacked logic and I/O like you've said. I could see CCDs stacked on top of an active interposer that houses the I/O functions.
How to manage the heat layer in X3D variants? High IPC, low latency and high cache CPUs are AMD's future, if they don't mess up.
The current X3D stacking is fine, it's just not rated for the same high temperature as the CCD below.
@@pham3383 The obvious and more sensible option is to just put the cache chip underneath instead of on top. The drawback is that you basically can't do 'optional' cache chip additions anymore and have to use it as standard, unless you want to make a different chiplet without the same connections underneath. But the benefits are great: you can use more cache, you can get better cooling on the compute that needs it, and you can put more cores in a CCD since you don't need to put any large L3 on the main compute die anymore (or alternatively use a smaller die for the compute chiplet).
I read somewhere Zen6 would have 3D cache underneath
I am not sure AMD has the specs on Zen6 final yet.
Though putting it underneath makes for easier transport of heat away from the CPU.
@@jamegumb7298 Zen6 is over 2 years away, they're still in the simulation phase
Lower CPU temps are good, but is the 3D V-Cache itself sensitive to heat? @@jamegumb7298
I was just thinking about this while I was changing the thermal compound on my 7900 XTX the other day. Thanks for sharing more about what is going on under those chips!
These animations are so cool, awesome video!
Motherboard redesign to allow more than 128 PCIe lanes at full speed. Imagine having at least 4 PCI Express slots with x16 lanes at full speed and having workstation/server capability in a moderately priced package. Imagine the improvement. The only thing is, one can't work without the other.
The best deep dives on the tube. Cheers🍻
This channel deserves more... please share, guys.
I'd like to see highly efficient interposers and interconnects. Very interesting video, thank you
Damn, I love this stuff... right up there with the best on YouTube. No one else is delivering such digestible info on these topics.
Cool channel, finally I get a good recommendation from YouTube...
I sincerely hope they will produce a breakthrough in bandwidth capacity so as to make commonplace the real-time speech to text to speech, synchronizing, and blending so as to remove muddling accents without completely removing speakers' otherwise natural speech character and tonality. Such a breakthrough would be equivalent to inventing the babelfish, and probably result in a Nobel Prize.
16:22 I would like to see photonic interconnects. There were articles about them for years, but so far no actual mass market products use them.
Love this breakdown. Rocking an undervolted 7700X and it's a VERY capable CPU.
The era of band-aids and patches. It was the norm to bring as much as possible onto one die for decades, of course, because it saved power. But the cost per transistor largely stopped improving around the 28 nm node, so now this makes economic sense. Chiplets with just a PCB are hobbled. We can call these things anything we want, but they are all just fancy versions of wiring boards with more or less capacitance depending on the feature size. Intel seems to be betting on glass (not mentioned here). Although it is put forward that silicon is not economical, that seems to be what Intel is doing for now until they move to glass. It will be interesting to see who uses IFS technologies in the next few years.
How is the EUV reticle size affecting the interposer? They would use EUV for core chiplets and increasingly older processes for bigger silicon. Wouldn't be surprised if the interposer was 28nm or older.
DUV has the same reticle size. And yes, you are right, interposers are mostly older, non EUV nodes.
I hope that the implementation of organic substrates at those scales does not mean shorter lifespans, where the effects of high temperatures over a span of years end up degrading the organic layers, causing cracking, corrosion of conductors, or bad contacts. Right now I can take a CPU from 20 years ago with 40k hours of usage and it still works.
Thanks! I can hardly wait for Zen 5.
Thanks for your amazing content. Can't wait for Zen 5.
TSMC-made Ryzen: in-package interconnect wiring, performance and latency take a backseat.
Intel 14nm & 10nm: a +100 MHz per year "Refreshed" backseat.
Cache cascade. I would like to see Zen6 with support for more on package and on motherboard memory cache chips. The on package version could potentially leverage the new interconnect as a high capacity and high bandwidth L4 cache. The on motherboard variant could be a cache between main memory and storage similar to Optane/3D Xpoint. CXL/infinity fabric might be utilized for the on motherboard cache while infinity link is used for the L4 cache on the cpu package. The existing extended L3 3D V-Cache should remain as well in addition to these other caches for models where the cost is tolerable. These additional caches could help improve overall performance in some cases in addition to any internal CPU architectural improvements.
Thanks for providing us these insights. Probably you are right in your conclusion. They need to feed the increasing number of cores with more memory bandwidth. Didn't know they left so much power consumption on the table 😮, good that they have so much room for improvement.
Great video and analysis! You took an incredibly complex subject and made it understandable to the average YouTube layman. And I am glad AMD is focusing on power efficiency and not just performance, because that's what makes Apple's silicon so great. I'm watching this video on an M4 iPad Pro, which can beat AMD's top offerings at 1/10th the power cost.
A great video. One metric that could be the most important is thermal efficiency; if we overheat, then performance is limited. Does one interconnect dissipate heat better than the others?
Wow ^^ super interesting! Many thanks for your efforts!
I was amazed to see package-on-package solutions being used on earlier versions of the Raspberry Pi some years ago. Would thermal management be the major technical challenge for these kinds of platforms?
Mobile SOCs have been using memory on package for a very long time, it's tech based on Toshiba's DRAM package staking, from 2 decades ago. RPIs have always used Broadcom SOCs for embedded solutions, that do have stacked DRAM over logic, and they were somewhat early in implementing the technology for the mobile market.
The main limitation is memory size, although you can stack 10 high right now. Some of the cooling efficiency loss is gained back as memory transfers are more efficient with less metal in the way. But yes, the more you stack, the more you alternate silicon and packaging material and silicon etc. - creating a heat barrier.
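A back-of-the-envelope way to picture that heat barrier is to treat each extra layer as a thermal resistance in series. The resistance values, power, and ambient temperature below are arbitrary assumptions, only meant to show the trend, not to model any real stack.

```python
# Rough series thermal-resistance sketch; all resistance values (K/W) and the
# 80 W / 35 degC operating point are assumptions for illustration only.
def junction_temp_c(t_ambient_c, power_w, resistances_k_per_w):
    return t_ambient_c + power_w * sum(resistances_k_per_w)

base       = [0.15, 0.25]        # assumed die-to-lid and lid-to-cooler resistances
stack_1hi  = base + [0.10]       # + one stacked DRAM/bond layer (assumed)
stack_10hi = base + [0.10] * 10  # + ten layers, as in the 10-high stacks mentioned above

for name, stack in (("bare logic die", base), ("1-high stack", stack_1hi), ("10-high stack", stack_10hi)):
    print(f"{name}: {junction_temp_c(35, 80, stack):.0f} C junction")
```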
Thinking about this more, there should definitely be a connection between the chips.
I really love how you used the green ambient light for an AMD video.
I don't expect the cheap and efficient on-substrate interconnect will go away; it's still good enough for plenty of use cases, so there's no need to invite the extra cost and production capacity constraints of advanced packaging for products that don't really need it.
I am very excited for the weird laser stuff that Mark Papermaster was talking about.
If they don't take advantage of the optical properties of the silicon, they will be left behind. As a note... it was the backside power delivery breakthrough, along with optical silicon switching and outboarding the I/O, that enabled Moore's law to scale to 6x.
Whatever interconnect technology AMD does choose, I assume it will be the same between consumer-grade chips (Ryzen) and enterprise-grade chips (EPYC). The reason I believe this is that EPYC and Ryzen already share many of the same chiplets, and they could use the consumer-grade chips as a testing ground for their enterprise-grade chips and vice versa (like how 3D V-Cache was an EPYC-first technology that they brought to Ryzen).
I feel as though in the near future these connections will become even more thorough, using an almost organic material. I realize with silicon we are getting there, and this coming technology is amazing; imagine 'growing' all the interposers and 'wires' for a CPU.
edit: oh. Well.. yeah then ok you explain this (and I didn't even realize they used this already, man I'm behind the times!!)
Didn't Vega also use an organic interposer? I think AMD only used a silicon interposer once, in their Fury series cards.
How about connecting the individual chiplets via edge-on connectors?
I wonder if this switch will somehow affect the next iterations of the MI chips, or will they stay on silicon interposers because of the size of the chip.
5:20 why the EUV reticle limit in particular? I would've assumed you could use a much older node since that would still be way denser than a PCB
I misspoke, it's not about the EUV reticle limit, since most interposers are produced on older nodes. But DUV has the same 858 mm² reticle limit.
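For reference, the 858 mm² figure is just the standard full-field exposure size shared by current DUV and EUV scanners; the 26 mm × 33 mm field dimensions are the assumption here:

```latex
A_{\text{reticle field}} = 26\,\mathrm{mm} \times 33\,\mathrm{mm} = 858\,\mathrm{mm}^2
```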
I think AMD will again use the dense organic interconnect that they used on the RX 7900 XTX. A silicon interposer for EPYC would be difficult and expensive due to size, so if servers aren't using interposers, consumer products won't either. I think the RDL material is a possibility for server too. Actually, since the compute chiplets are the same, server and consumer packages have to use the same fan-out packaging technology. So it will be RDL.
I think it's great that the CPU manufacturer still pushes ideas to improve its CPUs, even though AMD is now on top in many cases and way superior in the server market; they still have a bunch of areas where Zen can improve.
Unlike 12 years ago, when Bulldozer came out and Piledriver/Excavator brought no real changes, just higher clock speeds. Intel was superior in every measurement and also just pushed clock speeds while making nearly no changes.
Looks like I will go and upgrade to the Zen 6 3D model once it comes out :)
Very well explained thank you.
They all sound like good options. I wonder how costly it would be to line up technology providers for each option.
Excellent video, I'm both intrigued and confused.
AMD's chiplet design seems to be much, much more rudimentary than Intel's EMIB and Foveros. Yet it worked much better. Why?
I think it worked because it's simpler. It didn't require as big a leap in technology as Intel's strategy.
Intel used monolithic dies for desktop and chiplets for laptop; AMD used monolithic dies for laptop and chiplets for desktop. Weird, but each has its own advantages.
Packaging is only one part of CPU performance and efficiency. Monolithic is most performant and most efficient. More advanced packaging will not compensate for deficiencies in the node, micro architecture, or core layouts/counts. You can use chiplets to increase performance by having more silicon than monolithic, or reduce costs by using such smaller dies that the higher yields and simpler design pay for the packaging and then some.
AMD uses a tiny chiplet to house their CPU cores and uses this chiplet in most of their server, workstation, and mainstream CPUs. With such scale AMD can easily bin their products to have high frequencies for desktop and excellent efficiency for server and workstation; the chiplets are also very cheap with excellent yields. AMD's use of a separate IO die allows them to use a separate node for IO, and AMD can save a lot of money. The negatives of using such primitive packaging are higher in server and workstation. AMD remains more efficient than Intel by using more advanced nodes and having higher core counts. The higher core counts also allow AMD to maintain a performance lead. All together, AMD's chiplet philosophy reduces costs for the company.
Intel is less efficient than AMD primarily due to less advanced nodes. The 13900K at ISO power is more efficient than the 5950X, but it is competing with the 7950X, which has a node advantage. Meteor Lake is actually comparable in efficiency to Phoenix in spite of using tiles, but the silicon interposer causes it to be much more expensive than Phoenix. On server, Intel uses massive tiles over 5x as large as AMD's CPU chiplet. Intel is limited by yields, causing many of their server products to be delayed. Intel can't go bigger due to poor yields, so their server products have a core count disadvantage compared to AMD. Delays cause Intel to launch on poorer nodes, and having fewer cores further hurts performance and efficiency compared to AMD.
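The yield argument can be made concrete with a simple Poisson defect model. The die areas and defect density below are illustrative assumptions, not published figures for any AMD or Intel product; the point is only how quickly yield falls as area grows.

```python
# Poisson yield sketch: Y = exp(-A * D0). Areas and defect density are
# assumptions chosen only to illustrate why a ~5x larger tile yields worse.
import math

def die_yield(area_mm2: float, defects_per_cm2: float) -> float:
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)

D0 = 0.1           # assumed defects per cm^2 on a mature node
small = 70.0       # roughly chiplet-sized die (assumption)
large = 5 * small  # a tile ~5x larger, as claimed above

print(f"~{small:.0f} mm2 die: {die_yield(small, D0):.0%} yield")
print(f"~{large:.0f} mm2 die: {die_yield(large, D0):.0%} yield")
```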
AMD got into it first on consumer platforms. We've only seen EMIB in server chips from Intel so far. Up until very recently with Meteor Lake, which uses Foveros on a full interposer, all Intel consumer CPUs have been functionally monolithic. AMD actually loses power everywhere with their current interconnects compared to a monolithic chip, and you can see this in idle power draw where a 14900K can get into the single digit watts while the 7950X sits above 20W most of the time.
Where they got ahead was with a process node advantage, sometimes multiple steps ahead. Intel's 14nm was impressive for how much they extracted from it and how much power the dies can take before exploding, but it was never a highly efficient node aside from the year or so it was brand new and everything else was just worse, making it look good in comparison.
Very interesting video! Micro (nano?) electronics are insane.
Silicon interposers or bridges should be manufacturable on obsolete manufacturing processes like >28nm, no? And they wouldn't be nearly as sensitive to minor defects as a CPU or GPU. So they'd increase cost, but not anywhere near as much as you might guess based on the size of the "chip" compared to a modern CPU or GPU. Those old foundries are probably very cheap to place orders with, as well as very reliable.
Don't Meteor Lake CPUs use silicon interposers?
If they change Infinity Fabric, I think we should expect an additional "interconnect tier" that is decoupled from Infinity Fabric: faster than Infinity Fabric, but slower than L3 links.
Chips and Cheese also showed that the L3 latency penalty from N21 to N31 was surprisingly small.
There is no EUV reticle limit for interposers because the feature size should be well above the minimum for 193nm wavelength lithography.
No one will make a huge interposer on a fab that could be making 28nm planar semiconductors instead. It's all going to be coming from depreciated fabs, probably from the 90nm days or earlier.
The only question I have is if manufacturers will start putting active or passive components on interposers. Resistors seem like an easy thing to do. Inductors are typically done near top metal so that would also be easy. Capacitors would need poly, gate oxide and doping so that's a lot more process but it gets you close to doing CMOS or bipolar devices.
The DUV reticle limit is the same.
So Zen 6 Infinity Link is going to attack the weakest link in current Zen systems, which held them back from completely obliterating Intel in gaming. Sometimes games just straight up have way poorer frame pacing on AMD compared to Intel, even though AMD has higher fps on average.
I honestly think that beyond Zen 6, CPU performance won't matter much in the consumer market, it will all be about GPU and AI acceleration.
The best interconnect for CPUs will be the one that is the cheapest while not significantly limiting CPU performance.
The best interconnect overall is the one that works best for GPUs and AI accelerators.
With technologies like bridge dies, I see a cut in latency from reducing the SerDes overhead. When your channel is sufficiently short, I think it is likely we see a switch back from lots of unsynchronized serial links to massive parallel buses with 'dumb' transceivers. This can cut out a lot of protocol overhead, saving power and latency.
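A rough sketch of that trade-off for moving one 64-byte cache line: the link rate, bus width, line coding, and fixed SerDes latency below are all assumptions picked for illustration, not anything AMD has disclosed.

```python
# Back-of-the-envelope: narrow SerDes link vs. wide "dumb" parallel bus for
# one 64-byte cache line. All rates, widths, and overheads are assumptions.
PAYLOAD_BITS = 64 * 8

# Serial: one 32 Gb/s lane, 64b/66b-style coding, plus an assumed fixed
# serialize/deserialize + framing latency across both ends.
serial_ns = PAYLOAD_BITS * (66 / 64) / 32e9 * 1e9 + 10.0

# Parallel: 512 wires toggling at 2 GHz, no line coding, negligible framing.
parallel_ns = (PAYLOAD_BITS / 512) / 2e9 * 1e9

print(f"serial lane : {serial_ns:.1f} ns per cache line")
print(f"parallel bus: {parallel_ns:.2f} ns per cache line")
```

The wide bus only works because a bridge die makes 500+ short wires cheap; over a package or board, the serial link wins on pin count and signal integrity, which is why SerDes took over in the first place.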
I wonder which applications Zen 6 will do better at, considering the design change.
Frame pacing for games and latency-sensitive applications will be the biggest beneficiaries of Zen 6.
This also addresses the energy cost of communicating between different dies, so idle power consumption should go way down as well.
So chiplets are a more viable technology for mobile applications like laptops and maybe even handhelds.
On desktop and server it seems overall better to spend on more cache than on fabric performance, if one has to come at the expense of the other.
One must not forget that the IF design was developed due to cost. It was the cheaper way to increase core counts beyond what was otherwise possible.
Is the fabric power even an issue right now? From everything I see, the issue is that power is not scaling as well as density on new nodes. So we end up with 95C cores and 200W+ CPUs.
With the new Infinity Link, would it be easier to place smaller chiplets? I think it would be more efficient to surround the I/O die with the CPU dies to spread the heat more evenly and shorten the paths.
Yes, it is kind of crazy AMD is taking this approach for three generations and now a fourth. But consumers apparently don't care enough about the latency and power efficiency drawbacks. For me, with a passively cooled system, it looks insane that a chip like the Ryzen 5600X uses 30 W just sitting at the desktop, compared to 15 W for a 5700G (which of course uses a monolithic die and therefore isn't wasting power on the interconnect). Unfortunately, reviews focus on peak performance, and people don't notice the fans spinning a few hundred rpm faster than necessary. I do hope Zen 6 will bring a change to this.
Well, to be fair, their competition (Intel) is FAR less efficient on PC, laptop, and servers. And if AMD wants, as they've shown, they can compete even with ARM in efficiency (look at their latest Ryzen CPUs for laptops and their crazy efficiency).
@@peq42_ yes, overall efficiency is excellent for Ryzen CPUs. But I was commenting on the efficiency at idle/low loads. The chiplet design is wasting a lot of energy when the CPU is not being used or used lightly.
Are they also finally transitioning to GAA transistors?
So much gluing.
I would like to have a 16-core CCX; the sooner the better. We have been stuck with eight cores for far too long.
Why not 3D stacking, with chiplets on top of the I/O die and L3 inside the I/O die? And Infinity Links to connect more I/O dies on EPYC...
Think about how you would cool that stack. The I/O die can run fairly hot itself. L3 inside it could be interesting, but keeping as much cache close to the cores is better, which is what V-cache is for. Long term I suspect an active interposer handling I/O could happen, but bridges between dies would also do mostly the same job.
@@DigitalJedi On the cooling side, it's the same as current X3D parts, but with the hotter chips on top instead of on the bottom...
Any chance we'll see quad-channel RAM on consumer motherboards in the future? That bandwidth on the Apple M chips looks so nice...
If CAMM2 modules take off, we first have to deal with those. CKDs might become an optional part of the DDR6 spec we see this year.
A single CAMM2 offers dual channel, and since they take up less space, getting more "sticks" in is easier. Much easier to stack 2-3 of those modules.
@@jamegumb7298 I don't think you can stack more than two.
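For context on the quad-channel question above, peak DRAM bandwidth scales linearly with channel count. In the sketch below, DDR5 channels are treated as 64-bit for simplicity, and the speeds are just example configurations, not claims about any specific platform.

```python
# Peak DRAM bandwidth sketch; channel width treated as 64 bits for simplicity,
# transfer rates are example configurations.
def peak_gb_per_s(mt_per_s: int, channels: int, bits_per_channel: int = 64) -> float:
    return mt_per_s * 1e6 * channels * bits_per_channel / 8 / 1e9

print(f"dual-channel DDR5-6000: {peak_gb_per_s(6000, 2):.0f} GB/s")
print(f"quad-channel DDR5-6000: {peak_gb_per_s(6000, 4):.0f} GB/s")
```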
Would it be possible to use older-generation silicon tech for the interconnect silicon? This might not be true, but my assumption would be that we only have a very limited capacity to make sub-5nm stuff but a much larger capacity to make, say, 20nm stuff. And I was thinking: couldn't we just use the 20nm factories to make the interconnects cheaply and the 5nm factories for the actual important chips? Surely the interconnects don't need to have the fanciest, smallest gates possible?
AMD could have just raised the area needed for silicon bridges if possible, but it is worth a look for an even layout and for the concentrated heat dissipation needs of Zen 4. Should be coming, hopefully.
I think the same as you, and Intel is doing what they have: they don't have experience with organic RDL, and I don't even know if they have any demo of this technology.
And RDL looks great because, from what I know, even if the package doesn't work, they can remove the RDL and put on a new one.
My guess would be that Ryzen will take Organic RDL/Infinity Links for sure, but for server CPUs, I'd bet everything on Silicon Bridges. There's just no point in going cheap there, especially if they're reworking the entire packaging.
The IOD should have a snoop filter or directory for cache coherence.
They should add the dual data back in and let the chipset be a glorified USB controller.
Hopefully that allows much better routing and avoids retimers pushing the signal all over the mobo and back again.
Ain't it funny how we are going back to a "monolithic" design.
Well, with AM5 they have lots of Z-axis room to play with in theory, due to making the IHS thicker to be "backwards compatible" with AM4 heatsinks. So some Z-axis package stacking? That is assuming that, one, they do this and, two, they plan on trying to cram it into AM5 still. I would not be surprised if Zen 6 comes with a new socket, though, if they are going to be doing any of what I have been seeing in the leaks.
Will Zen6 include an NPU for the desktop iteration? 😢
Silicon interposers also kind of defeat the point of splitting the chips in the first place, no? The point was to lower manufacturing losses and inaccuracies while being more thermally, and thus energy, efficient by spreading the heat load. Not sure I'm following your logic, as it would undo all of the benefits.
You still keep the benefit of manufacturing different things on different nodes, not to mention easier semi-custom designs, which are AMD's forte, or at least an important part of their business.
Would a new socket be needed, AM6?
While the locations of the chiplets on the CPU are the same, actually looking at their layouts they are very different. The IO die between Zen 2 and Zen 3 is basically the same, of course, but the core chiplets are all very different from one another. And the Zen 4 IO die also looks really different from the previous one.
Memory bandwidth is also very relevant on the desktop platform, on the graphics side of APUs, if that's affected by this.
AMD also needs to further enhance 3D V-Cache and provision all cores to have it in the higher-end product line, not just half of them.
Both dies with V-Cache isn't that much better, from what I've read.
5:19 Does an interposer have to be manufactured on the most bleeding-edge nodes though? Wouldn't something like 40/28nm be enough?
No, and they are not. Still, due to their size and mask stitching, they are expensive.
Intel is making MTL's on a modified version of an old node. 40nm might not offer the interconnect density, but 28-20nm would be plenty. I don't know what TSMC makes their silicon packages on though; they might be using 16nm, as they've had that node for ages.
Isn't Zen 1 Threadripper the actual first occurrence of chiplets? Zen 2 was later, wasn't it?
Yes, but 1st gen Threadripper/EPYC behaves more like 2-4 separate processors on one package, and starting with the 2nd gen it's one processor with CPU cores in different chiplets.
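If you want to see that "separate processors on one package" effect yourself, a crude sketch is to pin two processes to different cores and ping-pong a shared flag. The core IDs below are assumptions (which logical CPU sits on which CCD/CCX varies by part, check lscpu), it's Linux-only, and Python overhead dominates, so only the relative same-CCD vs. cross-CCD difference is meaningful.

```python
# Crude core-to-core "ping-pong" latency probe (Linux-only; interpreter
# overhead dominates, so only the relative difference is indicative).
# Core IDs 0/1 and 0/8 are assumptions; check your topology with lscpu.
import multiprocessing as mp
import os, time

ITERS = 100_000

def responder(flag, cpu):
    os.sched_setaffinity(0, {cpu})
    for _ in range(ITERS):
        while flag.value != 1:   # wait for ping
            pass
        flag.value = 0           # send pong

def ping_pong(cpu_a, cpu_b):
    flag = mp.Value('i', 0, lock=False)
    proc = mp.Process(target=responder, args=(flag, cpu_b))
    proc.start()
    os.sched_setaffinity(0, {cpu_a})
    start = time.perf_counter()
    for _ in range(ITERS):
        flag.value = 1           # send ping
        while flag.value != 0:   # wait for pong
            pass
    elapsed = time.perf_counter() - start
    proc.join()
    return elapsed / ITERS / 2 * 1e9  # rough ns per one-way hop

if __name__ == "__main__":
    print(f"cores 0 and 1 (likely same CCX): {ping_pong(0, 1):.0f} ns")
    print(f"cores 0 and 8 (maybe other CCD): {ping_pong(0, 8):.0f} ns")
```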
Excellent prediction analysis; thank you once again for a very educational video. Chip packaging will most certainly play a very important role in the future. As you elaborated, it's going to be about finding the right optimum between performance and cost. AMD's approach is clearly to be more performant per dollar and as efficient per watt as they can be to keep their competitive edge.
Can you open a "can of worms" in future videos about Intel's approach of using glass substrates, as well as other competing technologies besides organic?
I also wonder what Nvidia will deliver with their custom Arm cores and what kind of SoCs/APUs we are going to see from them. With the failed acquisition of Arm, Nvidia clearly showed their future plan is not to be just a GPU gaming/compute accelerator company, but a complete CPU + GPU and all-other-kinds-of-accelerators company like AMD and Intel.