But thankfully, contrary to AdoredTV, this is a channel that does not make shit up all the time and is talking about interesting things (that he also understands).
@@ABaumstumpf Wow, you could be a professional political pundit on Fox News with how wrong you are. Jim doesn't "make shit up"; he interprets information to deduce expected outcomes. Sometimes he gets it wrong, sometimes right, but he's clear that what he does has a high margin of error. He also talks about interesting things (in this case, 3 years before this channel). Ian also talks about things he doesn't really understand (in fact, I'm pretty sure I recall him stating such in this very video about the very topics under discussion). Basically, you don't like Jim, you do like Ian. That's cool, but don't vomit garbage everywhere in the hopes it blinds us to the truth.
@@ChrispyNut "he interprets information to deduce expected outcomes" Yeah no. He just regurgitates stories he got from anywhere without doing even the most basic of checks most of the time. And only very rarely does he do some simple interpolation (can't do much wrong there). "Sometimes gets it wrong sometimes right, but he's clear that what he does has a high margin of error." In a way - he has a very high error-margin as in - he is about as accurate as saying "at some time in the morning the sun will rise and in the evening it will set". "(in fact, I'm pretty sure I recall him stating such in this very video about the very topics under discussion)" Well, Ian understands what he understands and what he does not full understand (or where he is not an expert) - aka the opposite of AdoredTV. "Basically, you don't like Jim, you do like Ian. " Nope, i just can't stand that AdoredTV is spouting bullshit.
I think you are all missing one more option for the ring topology of AMD (not exclusive to other hybrid versions, or the use of cross connections): using a skip-one ring topology. Various options here, especially with the use of bidirectional, dual rings. 1 to 2, skip 3, 4 to 5, skip 6... Second ring, 1 to 3 (skip 2), etc... Various skip options (do-not-stop stations) available, depending on your needs... With the use of a bisector in each ring, you would get some significantly lower latencies. Even if you lower bandwidth on the rings for power saving, or turn one off for further power savings, you effectively lower the total latency at all times. (In this scenario, you wouldn't turn off one ring until you absolutely need to do so.)
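A quick way to sanity-check an idea like this is to just count hops. Below is a minimal sketch, assuming an illustrative 16-node layout (not AMD's actual design), that builds an edge list and uses BFS to report worst-case and average hop counts for a plain bidirectional ring versus the same ring with a skip-one second ring added. The same helper can be fed any other edge list (bisected rings, cross-links, etc.) to compare variants.

```python
from collections import deque

def hops(n, edges):
    """Worst-case and average hop count over all node pairs, via BFS."""
    adj = {i: set() for i in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)                      # treat every link as bidirectional
    worst, total, pairs = 0, 0, 0
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for dst, d in dist.items():
            if dst != src:
                worst = max(worst, d)
                total += d
                pairs += 1
    return worst, total / pairs

n = 16
ring      = [(i, (i + 1) % n) for i in range(n)]
skip_ring = ring + [(i, (i + 2) % n) for i in range(0, n, 2)]  # second ring skipping every other stop

print("plain ring      :", hops(n, ring))       # worst 8 hops, average ~4.27
print("ring + skip ring:", hops(n, skip_ring))
```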
19:30 - Ian, could it be that AMD is using something which is bisecting the ring but doing so in a dynamic manner? ie. multiple bisections, but they can change their routing clock to clock like a nonblocking crossbar?
Outstanding video. Really makes you think about what is possible with the Zen road map. Seems the real limiting factor will be socket size. Even with the 3D concept you discussed, the fact remains that more cores equals more space. I am sure 3 nm will help with the space problem some, but it will not be perfect, nor will it be the only method used. I think on the EPYC side the socket size was planned well out from the beginning and will probably support 128-core products with node shrinks. The new AM5 socket size for consumers is going to tell a lot about where core counts are going in that market.
Something you did not mention, but which I could see benefitting from the interposer layer, would be a GPU chiplet with its own 3D-layered graphics memory/cache (64-128 MB). It would open AMD up to providing more powerful integrated GPUs. The ability to make an APU would just be a choice of including the chiplet on the CPU package. Until your interposer idea, I understood why this was not done, as the CPU/GPU traffic would use most of the bandwidth through the IO die. The interposer would remove a good portion of the IO die bandwidth being used, allowing for a very powerful APU to be produced using chiplets. My thinking here would be that the Ryzen chip would have space for up to three 8-core chiplets on an interposer. The high-end chips with more than 16 cores would only be CPUs. The chips with 16 or fewer cores would have space for a GPU chiplet, allowing AMD to produce up to 16-core APUs. On the flip side, they could produce 4/6/8-core APUs with two GPU chiplets for better graphics performance.
In all, your take on the third layer in the 3D stack being an interposer opens many doors for AMD. Their chiplet approach has proven to have significant advantages, and I am sure 3D cache is just the beginning of the "3D" nature of future chips. Again, great video and insight.
Now... imagine that the interposer is based on silicon photonics... you want one ring... trivial... two rings, three rings, fifty rings... it's all a matter of adding another wavelength of light... oh, and it doesn't generate heat, or magnetic field, or more power. I never really understood the significance of this tech until you explained the interposer. I doubt it will make it down to the consumer, but for the rack designs, this will be huge.
How about vertically stacking not only cache but also cores? Thermal issues aside, a third dimension could open up new interesting solutions. Really interesting video, thank you!
Unfortunately it's the thermal issues that are the reason why that doesn't happen. But 3D topology is basically graph theory, and we have centuries of research there.
Thanks Ian, you took something like the topology of a CPU and made it seem easier than it is (of course it was only a glimpse of it), great work. By the way, the potato always reminds me of a Pringle or Lay's, haha, makes me want to buy some.
I think the very reason why Intel was struggling to get to 10nm in the first place is that the mesh may be too complex, whereas AMD (and by extension TSMC) can make simpler designs like bisected rings and achieve great yields on a smaller node. Hence, they were able to get it released first (3rd gen), and then refine it and get better results on the same process node (Ryzen 5000 series).
Solid point. Solid possibility. & not that it matters as it pertains to whether this scenario is any more or less likely, but compared w/ the popular narrative that intel's struggles have been a result of complacency it'd be interesting to find the reality was just the opposite; they floundered out of an unwillingness to compromise away from theoretical peak performance (the cutting edge). The long term rewards of success w/ such a complex topology would be quite the siren song. The performance possible w/ said approach could likely result in 5+ yrs of virtually assured industry dominance. That said, whether sticking w/ such an approach given its power efficiency disadvantages as market trends continued toward greater & greater emphasis on performance ÷ power was a wise decision is a whole other can of worms.
The reason Intel struggled with 10nm comes down to a few design decisions:
1. They went for contact over active gate (COAG) to replace the standard FinFET contact scheme; it was way too sensitive to disturbance at fab time and yields were terrible as a result. Everyone is now going for gate-all-around or Intel's SuperFin, and COAG was dropped.
2. They used quad patterning for the fine-grained details before EUV was available. Way too complicated, with a wide error range; double patterning is the reasonable limit.
3. They used cobalt electrical channels instead of copper, the reason being that copper needed more insulation relative to wire at these small scales and cobalt would not. However, cobalt is hard and brittle in comparison to copper, and temperature swings might break or fracture these channels.
Each reason is ranked by its contribution to the disaster.
Dr. Cutress, love this type of content. Could you be bothered to discuss the various implementations of SMT, to help us learn why it's practical (aside from ~33% greater bandwidth) and also "pointless" at the same time?
Ring of rings, nerd of nerds, power of powers, analysis of analytics. I have to sip a coffee not to fall into slumber when i'm watching this stuff, but afterwards i feel like a Power User. Super Hero level achieved :) Can you make a colorful T-shirt with a Folded Torus Topology?
18:35 With two bisected rings, doing forward and back communication, wouldn't it be possible to have the two rings 90 degrees apart from each other in the diagram? On the image on the right, there's two blue horizontal bars signifying the connections of the ring. If you added two bars of a different color, say, red, going vertically, it would illustrate my idea. Would that not give you a connected network that was acting "nearly like" a fully connected one? Potentially a best of both worlds scenario, aye?
At what level of cache size will it become easier and faster to move the program code and state to the data?
Maybe Ryzen's hops are not stops. Maybe transmission is more like a broadcast on a shared bus - the whole bus becomes busy and only the sending and receiving cores are interested in the data. When you put shortcuts in the ring bus you can temporarily cut one ring bus into two sub-buses, allowing two simultaneous transmissions. Add bidirectionality and the ring bus gets even more throughput. I don't know, just guessing.
With multiplexers, all you need is a one-time broadcast from all cores to every other core to find their location on the 'ring', then store that location as an address on the 'ring' so you can write to that address.
That would essentially be a shared bus used by all cores. The problem with that, I suspect, is that you get contention when multiple cores need to talk to one another: you need to sync and incur stalls when the bus is in use, and the overall bandwidth is smaller than with lots of small interconnects. Also, you may get into trouble with your fabric clock speeds, since the signal needs to go all the way around in long traces, instead of just a number of short, fast hops.
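For a feel of that contention point, here is a toy cycle-count model (my own sketch, not anything from the video): on a single shared bus only one transfer can be granted per cycle, while on a segmented unidirectional ring, transfers that use disjoint segments can overlap. The ring size, traffic pattern, and one-cycle-per-transfer assumption are all made up for illustration.

```python
import random

N = 8  # cores on the ring

def segments(src, dst):
    """Ring segments (i -> i+1 mod N) occupied by a clockwise transfer."""
    segs, i = set(), src
    while i != dst:
        segs.add(i)
        i = (i + 1) % N
    return segs

def drain(transfers, shared_bus):
    """Greedy per-cycle schedule; each transfer is assumed to take one cycle."""
    pending, cycles = list(transfers), 0
    while pending:
        cycles += 1
        busy, leftover = set(), []
        for src, dst in pending:
            segs = segments(src, dst)
            if shared_bus and busy:               # the single bus is already granted this cycle
                leftover.append((src, dst))
            elif not shared_bus and busy & segs:  # a needed ring segment is occupied
                leftover.append((src, dst))
            else:
                busy |= segs                      # transfer proceeds this cycle
        pending = leftover
    return cycles

random.seed(0)
traffic = [tuple(random.sample(range(N), 2)) for _ in range(32)]  # 32 random (src, dst) pairs
print("shared bus cycles    :", drain(traffic, shared_bus=True))   # one grant per cycle -> 32
print("segmented ring cycles:", drain(traffic, shared_bus=False))  # disjoint transfers overlap
```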
Active interposers are also lithography-field limited. Stitching, as done for CIS (2x for full-frame and 4x for medium-format sensors), is difficult. Array stitching with yield tolerance is the right choice for the active communication interposer. 32nm instead of 65nm is better for >10 GBaud serial links because fT peaks there.
A 4-core fully connected design has 3 connections at each core; a ring only needs 2 connections at each core. I'd guess that there are cross-connects on the ring using those extra interconnects.
6:00 It seems like a link between 1 and 16, and between 4 and 12, would significantly improve latency while only marginally increasing power usage. Kinda surprised that isn't a factor in their mesh design.
It is a bit interesting, especially since even a mesh can be of different dimensions with the same number of nodes, akin to the ring with bi-fold links. Hmm. I like the idea of an interconnect die in a stacked-die chip. Great vid, and presentation. B)
I was thinking about your last-minute talk on the 5950... What about putting the chiplets directly on the IOD via TSVs? A good amount of IOD power is for the SerDes of the chiplets. Maybe the mesh can be put on the IOD, or an interposer can be put between the chiplet and the IOD (the V-Cache can still be present on top)...
How about a 3-layer stack (as they are doing with V-NAND)? Bottom layer just for inter-core connectivity (potentially on a 12nm process node), middle layer cores etc. (possibly with some interconnects, allowing them to "cross" the other interconnects), top layer cache? Edit: I should wait until the end of the video before commenting, shouldn't I.
@@TechTechPotato So given that cutting edge processes are the most expensive and have the worst yields, a 3 layer stack with a cheap interconnect stack, tiny single core dies on the latest process, and one big unified cache die covering all of them on a not-quite-cutting edge process, giving the best yields?
I'm merely an observer here, but can a twisted toroidal shape still be described as a 'ring' (for the sake of politricks), but actually present full connectivity with 8 rings twisted into one toroidal shape?
I wonder why no one ever mentions the root-tree kind of topology. It's definitely 3D, and the stem is a multi-lane multiplexer where a 32-way connection can be made relatively easily on each lane. Scale that with stop-gap multiple stems and a CPU finds itself in GPU territory, but with a much more efficient topology. Just a thought.
AMD have made a number of statements regarding the power requirements of the IO Die in EPYC - knowing the breakdown of DDR4, PCI-e, all those other on-SoC buses and SerDes in terms of power usage...
I really don’t think that AMD is concerned with interconnect power. TSMC is already working on micro heat spreader designs for your 3D chiplet stack. Basically a die which is all metal with an inner chamber of thermal fluid to dissipate heat.
It's not just about power. Also latency. What Ian is proposing would essentially make the 6 chiplets on each side into a mega-chiplet latency-wise: 48 cores all 1 hop away, instead of a hop to the IO die, across the die, and then a hop to the correct chiplet.
Thanks Dr. Ian for the step by step break down of this topic. The future could get way more exciting as they scale up the cores. It will probably take years to trickle down to consumer products? I'm pretty happy with my Ryzen 5800X on a single CCX.
Yes, the interposer can be an interconnectivity interface, although it might not be, because they showed a 3900XT3D at an event this last August, and therefore the added connectivity wasn't used in the 3000-series baseline. My question is, how did they engineer the unused connection points (I remember hearing that those connections were already there) to be compatible with both the Zen 2 and the Zen 3 architecture? Or the L3 cache installed on the 3900XT3D and the one on the upcoming 5800X3D aren't the same - they shouldn't be.
I've been so curious about possible mesh architectures. The way you described AMD's potential 3D multi-chiplet with the (butterfly, torus) mesh interposer was so interesting. I also think there may be another step AMD may take, by maybe arranging a stacked mesh. Having a stacked interposer mesh should allow latency to be reduced dramatically. The human brain has a stacked mesh, I think. The way we can access simple memories, or fine details and skills that we've learned, is like stacks on stacks on stacks, etc. Like one core to another core to another core etc., the mesh stacks could give that similar connectivity. We may never see it in our Zen 6 or Zen(?) PCs, but that must be a path they are trying to achieve in the future, don't you think? I don't imagine the V-Cache will be individually allocated for each core in the future though, just massive cache for every core to access. They are probably heading to where a similar mesh could be used for cache as well. Sorry for rambling. I hope you have a great day and keep up the super content.
Stupid question, but why do cores have to be connected to each other? As far as I know the L1 and L2 caches are private to each core, so they don't need it for that, and to access shared data they simply go to the L3 cache (or main memory), can't they? Why exactly would core 1 speak to core 2?
We may eventually hit the limit of digital computing on silicon, but there are advances in quantum and optical technologies, which have a lot of potential.
@@saricubra2867 If you look beyond desktop CPUs, that outlook becomes so INSANELY fucking stupid. Apple specifically, but also all high-end ARM architectures, have insane per-watt single-thread performance. Desktops aren't nearly everything...
How do you verify the latencies from one CPU node to another CPU node, and from one CPU node to the DDR memory? Do you use some tools to generate traffic which can stress the fabric (the ring or the mesh)? I am assuming you do all this testing at a post-silicon level.
They have a program that tests thread-to-thread latency. The program ping-pongs data between threads, and they can measure the performance at varying ping-pong levels.
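The ping-pong pattern itself is simple. Here is a minimal sketch of the idea (my own illustration, not the actual tool): two processes pinned to different cores pass a token back and forth through shared memory. Python and OS overhead dwarf the real core-to-core latency, so actual measurements are done in pinned native code with atomic loads/stores; the core numbers and iteration count here are arbitrary assumptions, and the affinity call is Linux-only.

```python
import multiprocessing as mp
import os, time

ITERS = 100_000          # hand-offs per process (arbitrary)
CORE_A, CORE_B = 2, 3    # assumed core IDs; pick two cores that exist on your machine

def player(flag, core, me, other, iters):
    os.sched_setaffinity(0, {core})   # pin this process to one core (Linux-only)
    for _ in range(iters):
        while flag.value != me:       # spin until the token is ours
            pass
        flag.value = other            # hand the token to the other core

if __name__ == "__main__":
    flag = mp.Value("i", 0, lock=False)  # shared token in shared memory
    a = mp.Process(target=player, args=(flag, CORE_A, 0, 1, ITERS))
    b = mp.Process(target=player, args=(flag, CORE_B, 1, 0, ITERS))
    start = time.perf_counter()
    a.start(); b.start()
    a.join(); b.join()
    elapsed = time.perf_counter() - start
    # 2 * ITERS hand-offs in total; the result is dominated by Python overhead,
    # so treat it only as an upper bound on the real core-to-core latency.
    print(f"~{elapsed / (2 * ITERS) * 1e9:.0f} ns per hand-off")
```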
IMHO, as scale increases, CPU architects will take more and more pages from the network/datacenter engineering playbooks. I would imagine after some number of cores they'll switch to either a Clos topology (through multiple levels of switches), or maybe they'll go straight to a dragonfly topology.
Super interesting video, a lot of info made easy to digest :) Adored did also touch on this topic, for those that are interested in this kind of CPU topology and interposer stuff.
28nm interposer (passive, active?) for those bigger wires, performance optimized 5nm cores, then density optimized 5nm cache. Sounds like a tasty sandwich!
So I "attended" Hot Chips this year, but was disappointed that I didn't get any semiconductor company swag. Any advice on how to get T-shirts and other stuff from these kinds of companies?
If you watched the Synopsys talk, there was a link to a free t-shirt. Intel had a small contest that was easy, and a t-shirt there. Otherwise this year was devoid of swag compared to last year.
Wouldn't it just be a lot smarter to REQUIRE less interconnectivity? And for more cores you could just deal with groups: say you have a future 128-core CPU, it could be a 4x32-core or 8x16-core design with some inefficiency instead. Kinda like NUMA.
Isn't EPYC Rome/Milan essentially that already? You have 4-core (Zen 2) or 8-core (Zen 3) CCXes that are internally interconnected, and each CCX (group) connects through the IO die to the other CCXes. The question, I guess, is: would it make sense for future AMD designs to somewhat directly connect the CCXes to each other as well for lower inter-CCX latency? And in what topology? Thus making a trade-off between latency and the number of interconnects required.
Any thoughts on how this kind of 3D interposer could work for a big.LITTLE setup? I guess the issue with stacking stuff on top of the cores is that it would become an extra layer for the heat to travel through before reaching the heat sink.
Makes you wonder if they can integrate some sort of cooling into the dies themselves to quickly reject heat through the layers - micro heat pipes maybe, or something more clever such as graphene or carbon nanotubes. Sooner or later those will probably be part of the semiconductors anyway, and they also have good heat transfer if I remember correctly. I guess the only trick is how to arrange them vertically through the horizontal layers of chiplets.
One Ring to rule them all, One Ring to find them, One Ring to bring them all, and in the darkness bind them. In the Land of Interposers, where the interconnects lie.
Shoutout to Adored's video tackling the subject in mid-2018: ua-cam.com/video/G3kGSbWFig4/v-deo.html
AMD's crossbar interconnect technology is difficult to scale beyond 8 CCXs; I think with Genoa AMD will have more trouble addressing crossbar latency. Also, AMD needs something like Intel's EMIB, because sending traces to the substrate using a SerDes will add additional latency and power consumption. If it's a silicon bridge like EMIB then you don't need SerDes, because your traces don't leave the silicon die. But then having 16 + 1 chiplets and connecting them through EMIBs could be challenging. AMD needs to develop a mesh interconnect to have meaningful scalability with processors of 128 or more cores.
But you didn't learn, huh.
They only wanted "a little bit of evil", just as you're only wanting "a little but of 3D". Sorry, it's all or nothing ;P
Not sure where in the video I'm referencing but it got a very large head-tilt and much chattering with myself XD
AMD might throw us a curveball here. They could be using the ring bus to only pass the L3 cache address of the information to the other core it's communicating with, while simultaneously putting the information in the L3 cache, and then the other core pulls the information from that L3 cache address directly instead. That would explain why, in almost every single way, it performs as if they were all p2p connected. And since less data is effectively moved on the bus, the bandwidth is never saturated and it uses less power.
Ohhh nooo, it's melting, my precious power ring, gone forever.... REEEEEEEEEE
I clearly remember the exact same butter donuts from AdoredTV a few years ago. Quite ironic that not many believed him.
Also, TechPowerUp recently published a few AMD slides that were worth including in this video, tbh.
Sadly, all of the actual innovation AMD always tries to bring may again be overshadowed by Intel's obscure garbage. Creating those "energy efficient cores" (who needs more than, let's say, 2 or 4 of them for an idle PC state to perform their allegedly main function? Yet Alder Lake has 8 of them and Raptor Lake will have 16) is just the most efficient way for them to regain the performance crown in rendering and other scalable-by-default apps, while keeping a comfortable (nowadays, thanks to AMD forcing Intel to make at least 8) amount of actual cores for ages again. And with this trick, Intel holding those crowns will basically hold down the industry and handicap global computing power and its scalability, because everything, like software development and optimization, is made for the leader, so we're stuck in that vicious cycle of single-core circlejerk.
I have honestly never considered the topology of CPU core interconnects, thanks for this fun mental exercise!
I like how you say "honestly," as if you're surprised that you've never done it, despite how commonplace it is among the general population to ponder the topology of CPU core interconnects.
It's one of the most important and still unsolved design issues in CPU engineering, and it's been a hot topic for two decades now. But yeah, if you don't have an interest in CPU design or high performance computing, it's a topic you don't run into.
My favorite long-term candidate for CPU-interconnect & NoC topology would be a 2D/3D flattened butterfly topology, & depending on how the cores & components are laid out, maybe a twisted variant thereof. Butterfly networks in general have several advantages over rings and fully connected designs, and even over mesh designs, for a lot of parameters and also overall. They are also well understood & have efficient routing algorithms. Original 2007 paper: "Flattened butterfly: a cost-efficient topology for high-radix networks" by John Kim, William J. Dally, Dennis Abts, & there are several studies analysing this topology. Unfortunately it's unlikely that any processor or chip will use such a topology until 2030, because it still has a patent pending - unless any of the manufacturers is willing to pay the inventors.
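To make the trade-off concrete, here is a small sketch (my own toy comparison, not from the paper's evaluation) that builds a k x k 2D mesh and a 2D flattened butterfly, where every node is fully connected within its row and its column, and compares their diameter (worst-case hops) and per-node link count. Real NoC studies also weigh wire length, channel width, and router complexity, which this ignores.

```python
from collections import deque
from itertools import combinations

def diameter_and_degree(n_nodes, edges):
    """Network diameter (max shortest-path hops) and max links per node."""
    adj = {i: set() for i in range(n_nodes)}
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    diam = 0
    for s in range(n_nodes):
        dist = {s: 0}; q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1; q.append(v)
        diam = max(diam, max(dist.values()))
    return diam, max(len(links) for links in adj.values())

k = 4                                    # 4x4 = 16 cores
nid = lambda r, c: r * k + c
mesh = [(nid(r, c), nid(r, c + 1)) for r in range(k) for c in range(k - 1)] + \
       [(nid(r, c), nid(r + 1, c)) for r in range(k - 1) for c in range(k)]
flat_bfly = [(nid(r, a), nid(r, b)) for r in range(k) for a, b in combinations(range(k), 2)] + \
            [(nid(a, c), nid(b, c)) for c in range(k) for a, b in combinations(range(k), 2)]

print("2D mesh            : diameter %d, max links per node %d" % diameter_and_degree(k * k, mesh))       # 6 hops, 4 links
print("flattened butterfly: diameter %d, max links per node %d" % diameter_and_degree(k * k, flat_bfly))  # 2 hops, 6 links
```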
Fascinating Ian, thank you as always - tech hasn’t been this exciting in quite a long time!
Tech is always interesting when you read research papers. That co-authored paper about networks is a few years old, I think 2017 or 2018, so this has been contemplated for a while.
@@Jaker788 2015 I believe.
And LOL @ "interesting... reading research papers". Well, there's a clear division right there on subjective terms (for some anything to do with "reading" automatically excludes it from possibly being "interesting", let alone "research papers) :D
Please note, I'm also self-mocking as I don't like reading (I'm a slow reader so I don't read much so I fall out of habit ad-infinitum)
@@ChrispyNut Yeah I'll agree that research into concepts isn't exactly as exciting as something being presented as functional.
It's the difference between "yeah we could potentially do this really awesome thing" and "look at this really awesome thing we actually have working and you can buy it soon!"
@@Jaker788 No, I think there's a misunderstanding.
I'm WAY more into concepts and theories and ifs, butts and maybes than I am finished "stuff" (especially as the finished, commercialised stuff barely resembles the initial concepts (see Intel Light Peak -> Thunderbolt)).
Just ... such people are in the minority (and I'm in the minority of the minority in not being a reader). :)
I spend so much of my time with my head in the future that when the latest, greatest, taking-the-world-by-storm thing is all the rage, I'm seen as a misery guts cos it's so fuckin' lame next to the thing I'd been thinking about, which was interrupted by the thing I'd been thinking about decade(s) earlier. :'-(
If you get my drift.
I remember a long-ass 2018 AMD research paper on a shitton of different connectivity approaches, from a simple mesh to some crazy shit like double-crossed-toroidal-butterfly-god-of-ancient-elves.
Link is already in the description!
@@TechTechPotato that's a true journalist - always a step ahead!
SGI workstations had a glorious design; at the centre of it there was a massive switch. Every component could talk to every other one, at full speed. Complicated, yet simple, very fast and responsive system.
There is a guy on YT that does teardowns on systems he collected, and he opened one up. I wish all pc's followed this glorious architecture.
It's expensive. You spend your budget on chiplets and interconnects - bigger interconnect -> fewer chiplets, so you have to make a tradeoff.
Isn't Infinity Fabric already doing this? The entire I/O goes through Infinity Fabric, and as far as AMD's papers go, Infinity Fabric has inbuilt buffers of multiple KBs for all I/O operations.
@@niks660097 yes, but SGI was in the 90s.
Infinity Fabric does take power, so you have to budget your components against the total TDP budget on the package.
The beauty of SGI's crossbar was that it allowed systems to scale bandwidth with socket count. The initial Origin2000 series did impose a latency penalty but this was resolved with the Origin3000 series which increased the scaling from 128 to 2048 sockets with lower worst case latency, increasing bisection bw to more than 1TB/sec. Naturally these systems ran as a NUMA design with a single OS instance (or it could be partitioned), so a user doing, say, defense imaging or GIS could run a task and immediately have access to dozens or hundreds of CPUs and relevant connected I/O and gfx power (infact I've still not seen any modern product that quotes faster image loading rates than the Group Station, though it's likely a thing but just not public; NVIDIA probably makes custom tech that isn't COTS, ditto AMD, indeed SGI did this at times, eg. for Lockheed).
The 8-port crossbar was the most complex chip SGI ever designed, it required 6 months of Verilog testing. Each port had a 2MB cache buffer, so although installed CPUs might have 2MB L2 (such as in a max spec Octane2), the crossbar had a lot more memory of a similar type, so not cheap. The crossbar had four independent connections and these could change which ports were connected to which other ports on each clock tick, allowing for continuously variable I/O paths. At the same time, applications could not only lock in an I/O path to secure guaranteed bandwidth (with DMA), the REACT extensions to IRIX supported real-time response certainty aswell, hence the broad use of SGIs in defense and other industrial applications. This meant, for example, that a digital video stream could be routed through to main RAM without involving the CPUs, and with a hw guarantee that it would never drop a frame, while at the same time the same crossbar is routing other data aswell.
CPUs were not connected directly to the crossbar though; in Origin, CPUs and RAM were connected to a HUB chip. Each HUB had two ports: one goes to a crossbar, the other to the router fabric (similar tech, ie. NUMAlink). Thus, any CPU could connect to any other either directly via its local HUB, or via a crossbar link, or via a router link. See:
www.sgidepot.co.uk/mod_block_diag_server.gif
This did mean more hops though with Origin2000, but the arch changed with the 3000 series to solve this (along with a modular brick design instead of connected half-racks), resulting in much lower latency penalties for long routes (I think the worst case scenario in a 1024-CPU O3K is 50% latency penalty for the most distant nodes). The design also used an interesting caching mechanism to cope with the situation where data changed by one CPU could invalidate copies held by many others, but that's a whole other thing. There's a lot more nuance to all this of course (see below for refs, PDFs, etc.)
Note Octane used a simplified chip called HEART, to which the CPUs and RAM are connected, but HEART has just a single link to the crossbar because there's no router fabric.
For more, see my index pages:
www.sgidepot.co.uk/origin/
www.sgidepot.co.uk/octane/
Note SGI had been planning to scale single image support with Origin4000 to 37500 sockets (along with IR5 for gfx), but alas with all the management screwups, loss of staff, etc., that never happened, but the NUMAlink tech lives on, I think HP is still using it as NUMALink8 or something, giving 64GB/sec per port, though I doubt they'll carry on the arch any further.
A caveat to the awesomeness though: many XIO option cards (such as PCI, FC, etc.) used an XIO/PCI bridge chip and the early versions of these chips were kinda naff, limiting PCI bw to around 185MB/sec. The boards for O3K were better. Still, I was able to get 600MB/sec from an Octane, which for 1997 is kinda nuts. Not had a chance to try the same thing with my O3800 yet.
@@dercooney I can't speak for AMD or modern markets, but for SGI the cost aspect wrt their target markets was largely irrelevant. One oil company told me their $2M Onyx2/RealityCentre setup paid for itself in *six seconds* (brownie points if you can guess how). Note I was the head sysadmin of a RealityCentre for a few years; an early version, it was a 16-CPU 3-rack Onyx2 with five IR2E pipes.
Wish I had the time to do vids on my SGIs, but alas YT came along a tad too late for that really. Maybe some day.
Now this is a quality vid, you deserve way more credit for stuff like this.
Very interesting... wonder how this develops further, with TSMC investing further into 3D capabilities, and how much of that AMD can/will make use of.
SoIC™, their latest 3DIC solution.
AMD only makes x86 clones; they will never produce that in 3D. A new architecture is needed, new languages.
Quickly got flashbacks to BNC ring networking (when 10 Mbps was AMAZING), funny how similar computing is up and down the chain, back and forth through time.
That is my next pick up line.
Instead of asking: What kind of guy are you into?
I will ask: What is your minimum specification?
Her:
Height: 7 foot
D size: 10 inches
@@suntzu1409 Imma go out on a limb and guess the B in your acronym stands for "boobie" and you think you're an "inspector".
@@Nathan-gn3ls federal boobie inspector???
@@Nathan-gn3ls federal 🅱️oob🅱️ie inspector
This is probably the exact opposite of the technique that PUAs use to hack the programming of a fembot.
step 1 is realizing they will never divulge accurate information regarding their user manual.
EDIT: I almost forgot it's 2021. You might have better luck if it's a guy.
Yes, graph theory! Love that some of the theoretical issues show up in 3D topology.
Oh no! The potatoes are multiplying!! 1:40
I watched this whole video. I didn't understand everything you talked about but I came away largely grasping the big picture. I enjoyed this, thank you.
Incredibly informative Video. Thank you Ian
20:38 What's not clear to me exactly is why you'd use an interposer for interconnect within a lone chiplet. You can do your butterfly/torus/etc. on regular metal layers without needing to go out to an interposer; it's plenty doable to have signals weave across different metal layers. Is there a shortage of metal layers in the readily available processes? I wouldn't imagine so. Even when you bring multiple chiplets into play, you can design things such that the intra-chiplet links of a big butterfly/torus are in metal layers, while only the inter-chiplet links go into the interposer.
The point of the video is that as you scale to 16 cores, a ring doesn't work, so you might want to do an on-die mesh. But even then, there are better meshes, so with a one-chiplet interposer it would be easier to work on independently, or optimize when it comes to SerDes links.
@@TechTechPotato Ah, I see.
@@TechTechPotato great video! I bet a big contributing factor of what's designed and used in the future will be highly dependent on how well software and core scheduling evolves and works. From a workload perspective, the amount of use cases for more than 8 cores to be allocated to a single process/job whereby inter-core latency is highly important seems very low even in the enterprise space. In the vast majority of use cases that I've seen whereby beyond 8 cores is required, it has been a highly parallel and core/thread independent workload. Therefore, if essentially good NUMA aware scheduling is being used, I highly doubt there'd be many use cases whereby the extra connections and overhead of a more complex and expensive architecture would be worth it short to medium term. 4 slower cores in Zen 1 was certainly not ideal for plenty of enterprise and hosting use cases while the latest 8 high performance low latency interconnected cores seems by far the sweet spot.
These ideas go back decades and will always hold true. I'm hoping AMD has split the core-to-interconnect so that the core can stay the same, but they can change out the interconnect topology at will.
Thanks Ian! It's really interesting to talk about how all these techniques used in individual products years ago can be combined to make something extremely innovative. Take the interposer and maybe some HBM from Vega, combine it with Zen 4 chiplets, the V-Cache from Zen3D, and the I/O die from Zen 2, for something incredible. I'm still waiting to see if Zen 4's iGPU will be a separate chiplet, or maybe part of the I/O die. You can tell AMD has a long-term game plan and has been executing one piece at a time, and when the time is right they start combining those individual pieces.
Happy to have this video recommended in my feed! Learning a lot more from UA-cam than school :P
I do think you are correct in that their next step is using an interposer. The step after will be an active interposer with some logic built in; this is when it becomes really exciting.
I'm not terribly worried about AMD and their ability to innovate with design with respect to CPUs. The constraint will likely be the capabilities of their foundry partner.
Physical/engineering limitations, always the party-pooper of cool concepts..... until they become the enabler and round we go in the endless loop until a blackhole comes along to keep infinity out of the equation :)
@@ChrispyNut computers can get only so dense until you start dealing with exotic matter, weird particle effects, & then black holes. I want to see how wild CPUs will get in the future
@@Apocalymon I don't mean we create the black hole, rather that everything eventually ends up in a black hole (figuratively, not literally everything).
Maybe we see how wild CPUs get all the time, the organic brain.
When I studied Networking, I never thought I would see the same topology concepts applied to the hardware itself later on.
Nice vid. I remember watching a vid from Jim at Adored some time ago about interposer technology and all the untapped potential. Nice to know that they are finally looking into those options for higher bandwidth / lower latency.
I just found your channel and it is really interesting. I'm currently doing my bachelor's in mathematics and maybe want to go into this industry. Thanks for the interesting content.
Wow, that is very educational! I finally get the hang of what is going on with CPU designs now.
Amazingly, I understood a lot through your explanation. As a marketing & retail professional, I really do not need to know much about this. I'm just really a curious gamer. :P
This was the best visualization of how topology relates to computing; need to share it with students.
Please don't, he displays the ring architecture wrong. The Intel ring is a bus; transferring data from one core to another doesn't go through other cores and increase latency like he describes. A CPU core is not a router. In the Intel ring bus, all cores are connected to the bus; there is no such thing as hops. The huge disadvantage of the ring bus is that only one core can transfer data at a time. All other cores must wait. The bus is shared between all cores.
It's like a network hub, the more devices on the hub, the slower the bandwidth because it's shared.
The advantage, latency stays the same no matter how many cores
I mean - with Zen3 it might just be a simple bisected direct-routed ring bus. That gives you basically static time between all cores, but lower total bandwidth when all cores try to talk to each other. And that is also what we see when comparing Zen2 to Zen3: Zen2 has higher bandwidth when just 2 cores are talking with each other and when the data-size is small. But with higher traffic the difference vanishes.
Sadly there is very little benchmark software to measure that... now I am kinda intrigued, might just measure the CPUs I have at hand.
02:05 "means you only need to hop once means you have lower latency"
Is this really what it means? I thought the ring bus is called that because it's shared (i.e. only one member can send data at a time), not because each member acts as a blocking waypoint. The latency improvement comes from fewer owners, but not having a shared link among all members can hurt latency in practical tasks.
Core fabrics are point to point. They don't use a shared bus because a shared bus has very high impedance, which slows down signaling. For a very long connection they would even need to break the link into several sections and put repeaters in between to reduce impedance. The latency penalty of a repeater is less than the latency penalty of a high-impedance interlink.
@@kazedcat my original question still kind of applies - are multiple nodes allowed to write to the ring bus simultaneously?
Пётр Б. There is no single ring; the ring is made up of multiple point-to-point link fragments. So each node can broadcast on its own link fragment at the same time, but the nodes have to receive the signal in sequence. If a node is broadcasting at one end and receiving a signal at the other end, the received signal has to be stored in a buffer before it can be rebroadcast onto the next link fragment. If the ring is bidirectional then there will be buffers for each direction.
@@kazedcat if communication occurs only between adjacent nodes, is the throughput of the bus 4x the per-link throughput?
@@ПётрБ-с2ц Each node needs two receivers and two transmitters to form a bidirectional ring, so the peak throughput of a node is 4x the per-direction bandwidth.
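To put rough numbers on the "who gets to transmit at once" question: here is a small Python sketch that routes uniform all-to-all traffic around a segmented bidirectional 8-node ring, always taking the shorter direction (ties sent in the increasing-index direction - that routing rule is my assumption, not documented Intel or AMD behaviour), and counts how many flows share each directed link. The busiest link is what caps aggregate bandwidth, independent of per-hop latency.

```python
from collections import Counter
from itertools import permutations

N = 8  # cores on the ring

def route(src, dst):
    """Return the directed links a packet crosses going from src to dst,
    taking the shorter way around (increasing-index direction on a tie)."""
    cw = (dst - src) % N          # hops if we go the "clockwise" way
    step = 1 if cw <= N - cw else -1
    links, node = [], src
    while node != dst:
        nxt = (node + step) % N
        links.append((node, nxt))
        node = nxt
    return links

# Uniform all-to-all traffic: one flow per ordered (src, dst) pair.
load = Counter()
for src, dst in permutations(range(N), 2):
    for link in route(src, dst):
        load[link] += 1

flows = N * (N - 1)
print(f"{flows} flows total")
print(f"busiest link carries {max(load.values())} flows")
print(f"average link load    {sum(load.values()) / len(load):.1f} flows")
# With every core talking at once, each physical link is shared by several
# flows, so per-core bandwidth drops even though per-hop latency does not.
```

With this routing rule the busiest direction ends up carrying 10 of the 56 flows over a single link, which is the sense in which a ring trades bandwidth, rather than latency, as it grows.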
That video and the IBM video earned you my sub, sir. Amazing content
Nicely done again Ian. I love your deep dives.
While the cross-bar topology provides the lowest node-to-node latency, it comes at the cost of scalability due to the accumulation of hot spots in the cross-bar switch fabric. This was the reason Intel moved to a ring topology beyond 6 cores with the 32nm Westmere, at the marginal cost of a latency increase, more wiring and interface logic, but at the same time it gained some bandwidth.
Absolutely loving your content Ian. These kinds of deep dives rarely happen elsewhere, and shed light on what is a pretty complicated, and rather important, industry! As for the 3D designs, will cooling become a significant issue? And how will they solve it?
A network-only based interposer will have some logic, but I doubt it's that significant
Or they have a central bidirectional ring with minimal physical distance inside the ring, but the longest distance is from the core to the ring. So latency between ring hops could be marginal, but there would be a single significant latency, and that's from the core to the ring.
Has anyone wondered why the Ryzen logo looks like a burning circle, an eye? He will ryze again!
"One Ring to rule them all,
One Ring to find them,
One Ring to bring them all
and in the darkness bind them."
And if you stab your finger in the middle everything disappears.
What if the "structural silicon" on top of the cores is replaced with another bisected ring connecting the cores on that side? Could allow for 16 core chiplets.
More power, more complexity, more cost.
Can always do more to get more faster, it's whether it's worth the cost :|
@@ChrispyNut I would imagine a bottom interposer, making the stack 3 high, would also incur a rise in packaging cost
@@TheBackyardChemist You literally made me facepalm and given how sweaty I am, that was really unpleasant!
I think part of the reason they put the cache on top of the cache and not the cores was heat, so I don't see them putting anything on top if they can help it.
@@duckrutt Well, they are already putting silicon over it, it's just blank. Interconnects use some power but not huge amounts of it. I think a bigger issue is having to drill TSVs near the cores.
Ever since zen started, they have always mentioned the infinity fabric, which seems to be the secret sauce to the long term scalability of their designs
That talk about interconnection topologies and butterdonuts gave me AdoredTV flashbacks.
Which is the more appealing accent though?
Link to that video is already in the description
But thankfully, contrary to AdoredTV, this is a channel that does not make shit up all the time and is talking about interesting things (that he also understands).
@@ABaumstumpf Wow you could be a professional political pundit on Fox News with how wrong you are.
Jim doesn't "Make shit up", he interprets information to deduce expected outcomes. Sometimes gets it wrong sometimes right, but he's clear that what he does has a high margin of error.
He also talks about interesting things (in this case, 3 years before this channel).
Ian also talks about things he doesn't really understand (in fact, I'm pretty sure I recall him stating such in this very video about the very topics under discussion).
Basically, you don't like Jim, you do like Ian. That's cool, but don't vomit garbage everywhere in the hope it blinds us to the truth.
@@ChrispyNut "he interprets information to deduce expected outcomes"
Yeah no. He just regurgitates stories he got from anywhere without doing even the most basic of checks most of the time. And only very rarely does he do some simple interpolation (can't do much wrong there).
"Sometimes gets it wrong sometimes right, but he's clear that what he does has a high margin of error."
In a way - he has a very high error-margin as in - he is about as accurate as saying "at some time in the morning the sun will rise and in the evening it will set".
"(in fact, I'm pretty sure I recall him stating such in this very video about the very topics under discussion)"
Well, Ian understands what he understands and knows what he does not fully understand (or where he is not an expert) - aka the opposite of AdoredTV.
"Basically, you don't like Jim, you do like Ian. "
Nope, I just can't stand that AdoredTV is spouting bullshit.
I think you are all missing one more option for AMD's ring topology (not exclusive of other hybrid versions, or of using cross connections): a skip-one ring topology. There are various options here, especially with bidirectional, dual rings. 1 to 2, skip 3, 4 to 5, skip 6... Second ring: 1 to 3 (skip 2), etc. Various skip options (stations the ring doesn't stop at) are available, depending on your needs. With a bisector in each ring, you would get some significantly lower latencies. Even if you lower the bandwidth of the rings for power saving, or turn one off for further power savings, you effectively lower the total latency at all times. (In this scenario, you wouldn't turn off one ring until you absolutely need to do so.)
BTW, this doesn't preclude the use of an interposer connectivity setup either
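One way to gauge what skip links buy you: take a plain bidirectional ring and give every node one extra link to the node k positions ahead, then compare average and worst-case hop counts. This is only my reading of the skip-one idea above, and a quick Python sketch rather than anything resembling a real fabric:

```python
from collections import deque

def hop_stats(n, skip=None):
    """Average and worst-case hops on an n-node bidirectional ring,
    optionally with an extra bidirectional 'skip' link from every node
    to the node `skip` positions ahead."""
    adj = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
    if skip:
        for i in range(n):
            adj[i].add((i + skip) % n)
            adj[(i + skip) % n].add(i)
    dists = []
    for src in range(n):
        # breadth-first search from src
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        dists += [dist[d] for d in range(n) if d != src]
    return sum(dists) / len(dists), max(dists)

for n in (8, 16, 32):
    plain = hop_stats(n)
    skip2 = hop_stats(n, skip=2)
    skip4 = hop_stats(n, skip=4)
    print(f"{n:2d} cores  plain ring avg/worst {plain[0]:.2f}/{plain[1]}  "
          f"+skip-2 {skip2[0]:.2f}/{skip2[1]}  +skip-4 {skip4[0]:.2f}/{skip4[1]}")
```

Even a skip-2 chord roughly halves the worst-case hop count on these ring sizes, which is the flavour of gain the skip idea above is after; what it costs in extra wiring and arbitration is the part a hop count can't show.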
19:30 - Ian, could it be that AMD is using something which is bisecting the ring but doing so in a dynamic manner? ie. multiple bisections, but they can change their routing clock to clock like a nonblocking crossbar?
Fun Fact: Bi-Directional can be Full Duplex or Half Duplex, the ability to send data both ways at the same time or not.
SGI's design was full duplex 25 years ago, so it would be odd if AMD's was only half. :)
Outstanding video. Really makes you think about what is possible with the Zen road map. It seems the real limiting factor will be socket size. Even with the 3D concept you discussed, the fact remains that more cores equals more space. I am sure 3nm will help with the space problem some, but it will not be perfect, nor will it be the only method used. I think on the EPYC side the socket size was planned out well from the beginning and will probably support 128-core products with node shrinks. The new AM5 socket size for consumers is going to tell a lot about where core counts are going to go in that market.
Something you did not mention, but that I could see benefitting from the interposer layer, would be a GPU chiplet with its own 3D-layered graphics memory/cache (64-128 MB). It would open AMD up to providing more powerful integrated GPUs. The ability to make an APU would just be a choice of including the chiplet on the CPU package. Until your interposer idea, I understood why this was not done, as the CPU/GPU traffic would use most of the bandwidth through the IO die. The interposer would free up a good portion of the IO die bandwidth, allowing for a very powerful APU to be produced using chiplets.
My thinking here would be that the Ryzen chip would have space for up to three 8-core chiplets on an interposer. The high-end chips with greater than 16 cores would only be CPUs. The 16-and-lower core count chips would have space for a GPU chiplet and allow AMD to produce up to 16-core APUs. On the flip side, they could produce 4/6/8-core APUs with two GPU chiplets for better graphics performance.
In all, your take on the third layer in the 3D stack being an interposer opens many doors for AMD. Their chiplet approach has proven to have significant advantages, and I am sure 3D cache is just the beginning of the "3D" nature of future chips.
Again great video and insight.
Now... imagine that the interposer is based on silicon photonics... you want one ring... trivial... two rings, three rings, fifty rings... it's all a matter of adding another wavelength of light... oh, and it doesn't generate heat, or a magnetic field, or draw more power. I never really understood the significance of this tech until you explained the interposer. I doubt it will make it down to the consumer, but for rack designs this will be huge.
How about vertically stacking not only cache but also cores? Thermal issues aside, a third dimension could open up new interesting solutions.
Really interesting video, thank you!
Unfortunately it's the thermal issues that are the reasons why that doesn't happen. But 3D topology is basically graph theory, and we have centuries of research there.
It's been a while since I last heard about the Butter Donut... I think it was a video from AdoredTV some years back.
Link to that video is already in the description
Every time I think of CPU interconnects now, I immediately see the words "butter donut" appear in my mind and get hungry. Thanks AMD and Adored!
Thanks Ian, you took something like CPU topology and made it look easier than it is (of course it was only a glimpse of it), great work. By the way, the potato always reminds me of a Pringle or Lays, hahaha, makes me want to buy one.
I think the very reason why Intel was struggling to go to 10nm in the first place is that the mesh may be too complex, whereas AMD (and by extension TSMC) can make simpler designs like bisected rings and achieve great yields on a smaller node; hence they were able to get it released first (3rd Gen), then refine it and get better results on the same process node (Ryzen 5000 series).
Solid point. Solid possibility.
& not that it matters as it pertains to whether this scenario is any more or less likely, but compared w/ the popular narrative that intel's struggles have been a result of complacency it'd be interesting to find the reality was just the opposite; they floundered out of an unwillingness to compromise away from theoretical peak performance (the cutting edge).
The long term rewards of success w/ such a complex topology would be quite the siren song. The performance possible w/ said approach could likely result in 5+ yrs of virtually assured industry dominance.
That said, whether sticking w/ such an approach given its power efficiency disadvantages as market trends continued toward greater & greater emphasis on performance ÷ power was a wise decision is a whole other can of worms.
The reason Intel struggled with 10nm comes down to a few design decisions:
1. They went for contact over active gate (COAG), which was way too sensitive to disturbance at fab time, and yields were terrible as a result. Everyone is now going for gate-all-around or Intel's SUPERFIN! COAG was dropped.
2. They used quad patterning for the fine-grained details before EUV was available. Way too complicated, with a wide error range; double patterning is the reasonable limit.
3. They used cobalt interconnects instead of copper, the reason being that copper needs a relatively thick barrier at these small scales and cobalt would not. However, cobalt is hard and brittle in comparison to copper, so temperature swings might break or fracture these channels.
Each reason is ranked in contribution to the disaster.
@@Jaker788 Interesting. Thanks for sharing.
love your content, man. i feel like i'm taking an intro to electronics engineering course. Thanks for putting these videos together! :)
Dr. Cutress, love this type of content. Could you be bothered to discuss the various implementations of SMT, to help us learn why it's practical (aside from ~33% greater bandwidth) and also "pointless" at the same time?
Ring of rings, nerd of nerds, power of powers, analysis of analytics. I have to sip a coffee not to fall into slumber when I'm watching this stuff, but afterwards I feel like a Power User. Super Hero level achieved :) Can you make a colorful T-shirt with a Folded Torus Topology?
18:35
With two bisected rings, doing forward and back communication, wouldn't it be possible to have the two rings 90 degrees apart from each other in the diagram?
On the image on the right, there are two blue horizontal bars signifying the connections of the ring. If you added two bars of a different color, say red, going vertically, it would illustrate my idea.
Would that not give you a connected network that was acting "nearly like" a fully connected one? Potentially a best of both worlds scenario, aye?
At what level of cache size will it become easier and faster to move the program code and state to the data?
Maybe Ryzen's hops are not stops. Maybe transmission is more like a broadcast on a shared bus - the whole bus becomes busy and only the sending and receiving cores are interested in the data. When you put shortcuts in the ring bus you can temporarily cut one ring bus into two sub-buses, allowing for two simultaneous transmissions. Add bidirectionality and the ring bus gets even more throughput. I don't know, just guessing.
I remember this paper about connectivity; AdoredTV made a video about it. I hope we'll see it in reality, even if only in servers.
That video is in the description and pinned comment, paper is in the description
With multiplexers, all you need is a one-time broadcast from every core to every other core to find its location on the "ring", then store that location as an address in that "ring" so you can write to that address.
That would essentially be a shared bus used by all cores. The problem with that, I suspect, is that you get contention when multiple cores need to talk to one another, you need to sync and incur stalls while the bus is in use, and the overall bandwidth is smaller than with lots of small interconnects. Also, you may get into trouble with your fabric clock speeds, since the signal needs to go all the way around in long traces instead of just a number of short, fast hops.
Active interposers are also limited by the lithography field size. Reticle stitching, as used for CMOS image sensors (2x for full-frame and 4x for medium-format sensors), is difficult. Array stitching with yield tolerance is the right choice for an active communication interposer. 32nm instead of 65nm is better for >10 GBaud serial links because fT peaks there.
Wonderful video. I like all of the technical discussion on how the new chip technology can work.
A 4-core fully connected topology has 3 connections at each core; a ring only needs 2 connections at each core. I'd guess that there are cross-connects on the ring using those extra interconnects.
6:00
It seems like links between 1 and 16 and between 4 and 12 would significantly improve latency while only marginally increasing power usage.
Kinda surprised that isn't a factor in their mesh design.
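Purely as a sanity check on that suggestion, here is a rough Python sketch of a 4x4 mesh with and without those two extra long-range links. I've numbered the cores 1-16 row by row, which may not match the video's layout, so treat the endpoints as illustrative:

```python
from collections import deque
from itertools import product

ROWS = COLS = 4  # 4x4 mesh, cores numbered 1..16 row by row (my numbering)

def mesh(extra_links=()):
    """Adjacency for a ROWS x COLS mesh, plus any extra point-to-point links."""
    adj = {r * COLS + c + 1: set() for r, c in product(range(ROWS), range(COLS))}
    for r, c in product(range(ROWS), range(COLS)):
        i = r * COLS + c + 1
        if c + 1 < COLS:
            adj[i].add(i + 1); adj[i + 1].add(i)        # link to the core to the east
        if r + 1 < ROWS:
            adj[i].add(i + COLS); adj[i + COLS].add(i)  # link to the core to the south
    for a, b in extra_links:
        adj[a].add(b); adj[b].add(a)
    return adj

def hop_stats(adj):
    """Average and worst-case shortest-path hops over all distinct core pairs."""
    dists = []
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        dists += [dist[d] for d in adj if d != src]
    return sum(dists) / len(dists), max(dists)

plain   = hop_stats(mesh())
express = hop_stats(mesh(extra_links=[(1, 16), (4, 12)]))  # the links suggested above
print(f"plain 4x4 mesh         avg hops {plain[0]:.2f}  worst {plain[1]}")
print(f"with 1-16 & 4-12 links avg hops {express[0]:.2f}  worst {express[1]}")
```

With this numbering the 1-16 diagonal collapses from 6 hops to 1 and the average drops a little, but the overall worst case only falls from 6 to 5 - so where you land the extra links matters as much as how many you add.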
It is a bit interesting, especially since even a mesh can be of different dimensions with the same number of nodes, akin to the ring with bi-fold links. Hmm. I like the idea of an interconnect die in a stacked-die chip.
Great vid, and presentation. B)
I was thinking about your last-minute talk on the 5950... What about putting the chiplets directly on the IOD via TSVs? A good amount of IOD power is for the SerDes of the chiplets. Maybe the mesh can be put on the IOD, or an interposer can be put between the chiplet and the IOD (the V-Cache can still be present on top)...
How about a 3 layer stack (as they are doing with VNAND).
Bottom layer just for inter-core connectivity (potentially on a 12nm process node), middle layer cores etc (possibly with some interconnects, allowing them to "cross" the other interconnects), top layer cache?
Edit: I should wait until the end of the video before commenting, shouldn't I.
Consider it a ++ that you guessed the end before the end! Realistically, most interposers to date are 65nm-ish. Super cheap, super easy to do.
@@TechTechPotato Does 65nm introduce latency for communication? How come 65nm is acceptable for a mesh interposer?
@@TechTechPotato So given that cutting edge processes are the most expensive and have the worst yields, a 3 layer stack with a cheap interconnect stack, tiny single core dies on the latest process, and one big unified cache die covering all of them on a not-quite-cutting edge process, giving the best yields?
I'm merely an observer here, but can a twisted toroidal shape still be described as a 'ring' (for the sake of politricks), but actually present full connectivity with 8 rings twisted into one toroidal shape?
A ring is a 1-dimensional interconnect. A mesh is 2-dimensional. A torus would be 3-dimensional.
@@kazedcat Nope, that would be a donut. A torus is just the 2D surface!
I wonder why no one ever mentions the root-tree kind of topology. It's definitely 3D, and the stem is a multi-lane multiplexer where a 32-way connection can relatively easily be made on each lane. Scale that with multiple stems as a stop-gap and a CPU finds itself in GPU territory, but with a much more efficient topology. Just a thought.
Thanks for the explanation! Very well presented
AMD have made a number of statements regarding the power requirements of the IO Die in EPYC - knowing the breakdown of DDR4, PCI-e, all those other on-SoC buses and SerDes in terms of power usage...
I really don’t think that AMD is concerned with interconnect power. TSMC is already working on micro heat spreader designs for your 3D chiplet stack. Basically a die which is all metal with an inner chamber of thermal fluid to dissipate heat.
It's not just about power, but also latency. What Ian is proposing would essentially make the 6 chiplets on each side into a mega-chiplet latency-wise: 48 cores all 1 hop away, instead of a hop to the IO die, across that die, and then a hop to the correct chiplet.
I think it's less about heat dissipation and more about efficiency...
Sht, I was thinking of exactly such a thing while watching this video
Thanks Dr. Ian for the step by step break down of this topic. The future could get way more exciting as they scale up the cores. It will probably take years to trickle down to consumer products? I'm pretty happy with my Ryzen 5800X on a single CCX.
BTW, dunno if you spotted it, but some document about Zen 4 had weird (not making sense) stuff about IF/CCD/CCX, which indicated a difference from Zen 3.
Very informative. Thanks.
Yes, the interposer can be an interconnectivity interface, although it might not be, because they showed a 3900XT3D at an event this last August, and therefore the added connectivity wasn't used in the 3000-series baseline. My question is: how did they engineer the unused connection points (I remember hearing that those connections were already there) to be compatible with both the Zen 2 and the Zen 3 architecture? Or the L3 cache installed on the 3900XT3D and the one on the next 5800X3D aren't the same - they shouldn't be.
Take your double bisected ring and cross the bisects and you now have something that looks like an "infinity" symbol. "Infinity Cache" coincidence?
Interesting insight. You may be onto something.
I've been so curious about possible mesh architectures. The way you described AMD's potential 3D multi-chiplet with the (butterfly, torus) mesh interposer was so interesting. I also think there may be another step AMD may take, by maybe arranging a stacked mesh. Having a stacked interposer mesh should allow the latency to be reduced dramatically. The human brain has a stacked mesh, I think. The way we can access simple memories or fine details and skills that we've learned is like stacks on stacks on stacks, etc. Like one core to another core to another core, etc., the mesh stacks could give that similar connectivity. We may never see it in our Zen 6 or Zen(?) PCs, but that must be a path they are trying to achieve in the future, don't you think? I don't imagine the V-Cache will be individually allocated for each core in the future though, just massive cache for every core to access. They are probably heading to where a similar mesh could be used for cache as well. Sorry for rambling. I hope you have a great day and keep up the super content.
Stupid question, but why do cores have to be connected to each other? As far as I know the L1 and L2 caches are private to each core, so they don't need it for that, and to access shared data they simply go to the L3 cache (or main memory), can't they? Why exactly would core 1 speak to core 2?
Crossbar reminds me of cross-connect in physical telecommunications copper pairs.
I wonder how much faster subjectively computers will be in 10 years.
At this point, I wonder when will we reach the limits of physics.
@@kamilazman2943 The limit of physics is centuries away.
We may eventually hit the limit of digital computing on silicon, but there are advances in quantum and optical technologies, which have a lot of potential.
It took almost 10 years to get twice the single-thread performance of a 2600K at around 3.4GHz, and modern CPUs have to clock 30% higher to do it - kinda embarrassing.
@@saricubra2867 If you look beyond desktop CPUs, that outlook becomes so INSANELY fucking stupid. Apple specifically, but also all high-end Arm architectures, have insane per-watt single-thread performance. Desktops aren't nearly everything...
How do you verify the latencies from one CPU node to another CPU node, and from one CPU node to the DDR memory? Do you use some tools to generate traffic which can stress the fabric (the ring or the mesh)? I am assuming you do all this testing at the post-silicon level.
They have a program that tests thread-to-thread latency. The program ping-pongs data between threads and they can measure the performance at varying ping-pong levels.
How do these ring configurations translate into FPS in Crysis?
You just buy the cheapest processor you find on sale, and then slap in a good graphics card.
If you like your CPU array, put a ring on it.
I’ve read about this, but talked about as if this was the end of AMD.
This sounds really interesting and quite hopeful for AMD in the future.
How do Intel chips perform close to AMD with much less cache?
This is a very good video with interesting analysis and good explanations
Your content is great, hope you keep on gaining subscribers/growing.
I have come back to this paper repeatedly as a reference for multiplayer level design.
IMHO, as scale increases CPU architects will take more and more pages from the network/datacenter engineering playbooks. I would imagine after some number of cores they'll switch to either a Clos topology (through multiple levels of switches) or maybe they'll go straight to a Dragonfly topology.
Love to watch your vids to get a better understanding of the semiconductor sector. What stock would you personally go for at this moment?
Super interesting video, a lot of info made easy to digest :)
Adored also touched on this topic, for those that are interested in this kind of CPU topology and interposers.
Thanks this helped me with my Cisco collision domain testing!
28nm interposer (passive, active?) for those bigger wires, performance optimized 5nm cores, then density optimized 5nm cache.
Sounds like a tasty sandwich!
So, will you make a video about every article on AT now?
This vid has made me subscribe. thanks mate, was really informative!
brilliant. Linus should watch this video.
I know this video is old now, but is packaging the future of IC performance development?
So I "attended" Hot Chips this year, but was disappointed that I didn't get any semiconductor company swag. Any advice on how to get T-shirts and other stuff from these kinds of companies?
If you watched the synopsys talk, there was a link to a free t-shirt. Intel had a small contest that was easy, and a t-shirt there. Otherwise this year was devoid of swag compared to last year
Can you explain how GPU linking works with so many "cores"?
Wouldn't it just be a lot smarter to REQUIRE less interconnectivity? And for more cores you could just deal with groups: say you have a future 128-core CPU, it could be a 4x32-core or 8x16-core design that accepts some inefficiency instead. Kinda like NUMA.
Isn't Epyc Rome/Milan essentially that already? You have 4-core (Zen2) or 8-core (Zen3) CCXes that are internally interconnected, and each CCX (group) connects through the IO-die to the other CCXes.
The question I guess is, would it make sense for future AMD designs to somewhat directly connect the CCXes to each other as well, for lower inter-CCX latency? And in what topology? Thus making a trade-off between latency and the number of interconnects required.
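As a toy illustration of that trade-off, here is a quick Python sketch that treats each CCX as a single node and compares three layouts: the current hub style where everything goes through the IO die, a ring of CCXes on top of the hub, and fully connected CCXes. The 8-CCX count and the specific layouts are assumptions for illustration; real link widths, SerDes cost and routing are ignored.

```python
from collections import deque

CCXES = 8       # CCX "groups"; cores inside a CCX already talk to each other directly
IO_DIE = "IO"   # the central hub in the current style of layout

def avg_ccx_hops(adj):
    """Average shortest-path hops between all distinct pairs of CCX nodes."""
    ccxes = [n for n in adj if n != IO_DIE]
    dists = []
    for src in ccxes:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        dists += [dist[d] for d in ccxes if d != src]
    return sum(dists) / len(dists)

def layout(extra_ccx_links):
    """Every CCX is linked to the IO die; extra_ccx_links adds direct CCX-to-CCX wires."""
    adj = {IO_DIE: set(range(CCXES))}
    for c in range(CCXES):
        adj[c] = {IO_DIE}
    for a, b in extra_ccx_links:
        adj[a].add(b); adj[b].add(a)
    return adj

hub_only = layout([])
ccx_ring = layout([(c, (c + 1) % CCXES) for c in range(CCXES)])
ccx_full = layout([(a, b) for a in range(CCXES) for b in range(a + 1, CCXES)])

for name, adj, extra in [("hub only (IO die)", hub_only, 0),
                         ("hub + ring of CCXes", ccx_ring, CCXES),
                         ("hub + fully connected CCXes", ccx_full, CCXES * (CCXES - 1) // 2)]:
    print(f"{name:28s} extra links {extra:2d}  avg CCX-to-CCX hops {avg_ccx_hops(adj):.2f}")
```

The fully connected option drops the average to a single hop but needs 28 extra links on top of the 8 hub links, while a simple ring of CCXes gets part of the way there for only 8 extra links - exactly the wiring-versus-latency tension described above.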
This went way above my head really fast.
Start with section 1. The idea is to go slowly into the topic of simply "how do we connect things" and build from there.
"Buttered Donut" was covered by Adored TV yrs ago. He was on to something then...
Love the content well done. Well researched
Absolutely love this channel. Thank you much, sir
Any thoughts on how this kind of 3D interposer could work for a big.little setup? I guess the issue with stacking stuff on top of the cores is that it would become an extra layer for the heat to travel through before reaching the heat sink.
Makes you wonder if they can integrate some sort of cooling into the dies themselves to quickly reject heat through the layers - micro heat pipes maybe, or something more clever such as graphene or carbon nanotubes. Sooner or later those will probably be part of the semiconductors anyway, and they also have good heat transfer if I remember correctly. I guess the only trick is how to arrange them vertically through the horizontal layers of chiplets.