Back in the early '80s I was on a design team working on a Hardware Modeling Library at Mentor. Our device allowed 'chips' of up to 256 active pins (400 pins total) to be included in software simulations (pre-widespread use of VHDL, obviously). I designed the physical interface into the customers' 'chip' (among other things). It was very interesting to query packaging folks (from Intel, Fairchild, Wakefield, Brit-Telecom, Mercedes, etc.) on what their upper limit on pin count was... often I got a very cautious glance and ... "Well, how many can you give us?" as an answer. Many of them were only willing to talk about more traditional hybrid-on-ceramic packaging. Whenever I turned to 3D packaging, I got a variety of answers, from "Nope, not for at least 5 years" to ... "Well, that depends on the vertical height we can have" [our device had 8 card slots spaced 1.25 inches on center and had to account for a 0.125 inch thick controlled-impedance PCB and ZIF socketing... either 4x 64 pin, 2x 128 pin or 1x 400 pin (256 active pins)]. Just for fun, I'd ask if the full 12 inches was enough. The answer would be: "Of course, yes, but we still want to have up to 8 devices installed in the card cage... and 1.25 inches each is a bit tight." I always interpreted that as meaning they wanted over 1 inch EACH for their concept of a 3D multi-chip interconnect, including all cooling heatsinks, fans, etc. Our answer was "just pull one interface board (running 7, not 8 devices) and then you have 2.5 inches to work with ... otherwise, buy a second HML". Some customers did not smile at that suggestion. At over 100 grand, this was not a cheap device, bitd... but you could run up to 4 boxes under one multi-unit license. Only one young engineer at an unnamed aerospace company did not flinch at the 1 inch headroom... I imagine THEY were the ones who had the most compact version of 3D packaging at the time.
BTW, this was the same set of informational "interviews" that forced us to go to 256 active pins. When we started this, we thought we could get away with "just" 128 active pins. Virtually EVERYONE told us to double it. Our engineering manager CRIED when he heard that... us design-grunts were cheering it on!! MOAR POWER is always good, right? Now, designing the backplane and send/receive data lines and phasing clocks to get insane state-transition control times, THAT was fun to do. Controlling crosstalk and race conditions in the PCB layout just about cost me my sanity, but I made it work and it remains THE QUIETEST system (as measured on a FCC testing facility at Mariposa) that I was ever associated with... and the biggest. It was an amazing box of rawk. Tho no one asked for it, getting up to 512 active pins for one model would have required a change in the way we transmitted/stored the data for vector-in, vectors-out, tri-state-data and timing analysis data.
Great video! I taught a course on Advanced Packaging a number of years ago and it's nice to see the industry moving towards MCM/Chiplet designs and, now, stacking of chips. Perhaps a future video can be done on passive vs active interposers?
I remember a professor saying that optimizing separate parts doesn't necessarily mean optimizing the full result when the parts are working together... so in general, integrating and considering the entire system is usually better. It is kind of crazy to me that we went from discrete chips, to almost fully integrated, and are now going back to somewhat discrete... But I guess that some systems have become too complex for us to optimize in an integrated mode nowadays.
Intel glued CPUs together in the past - the Pentium II and III in Slot versions had on-package cache, for example - and later the first Core generation had multiple "glued" together CPUs too - like the Core 2 Extreme QX6700, the Core 2 Quad Q6600 or the Pentium D 945
Of course the problem with stacking is heat dissipation. A cube of silicon won't be able to get the heat out of the center fast enough. I guess they'll go to fluid cooling of these cubes if they ever stack that deep.
That joke is on Intel. They aren't even able to use glue the right way: when they glued the IHS to the PCB (for example on the i7 8700K), it caused terribly bad temperatures. I don't think it would be a good idea for Intel to start gluing anything together again.
And now Intel, AMD, ARM, and others are getting together to come up with an interconnect standard. We could see chiplets from different vendors on the same package. Imagine AMD cores alongside ARM and even FPGA fabric, tasked with handling different aspects of a system.
Will be interesting to see what new solutions it makes possible. E.g. AMD has dominated the console space because it is one of the only players that has all the technology and IP required to produce a monolithic SoC that can provide a good experience for gaming. My hope is that eventually it allows manufacturers to shop around to some degree, combining whichever parts are available and best suited for their product. For instance, I could imagine a high-end tablet or gaming handheld using a CPU from Qualcomm and a GPU from Nvidia, and using UCIe to avoid those two companies having to collaborate too closely or share sensitive IP. Of course you can already have a separate CPU and discrete GPU from different companies, but this complicates the cooling, and I suspect packaging everything together could have cost and latency advantages too.
As I understand it, the reverse happened when a Japanese calculator company wanted Intel to make a number of different chips for its calculators; however, a person at Intel decided to combine all the functions into the first CPU chip.
Usually integration (aka "fewer components") is the way to save on money and complexity. Like it saves big time, in some cases multiple orders of magnitude. But he explained why currently the reverse is true, at least for highly complex CPUs.
@@graealex 100%. It's a balance between silicon yield and costs versus post-silicon process costs. In this day and age, silicon costs are astronomical so it makes sense to invest in designs that reduce them.
If I remember correctly, Intel was making ASICs. The ASICs were fully integrated, whereas the CPUs were general purpose and needed additional components to function. Same thing here: monolithic CPUs are great with everything on one chip, but if you can break it down, generalize the hardware and make it profitable, then it will be a success.
Quick tip for you: if you use OBS for recording, you can set your microphone audio to be on a different audio track than the desktop audio, making it possible to treat the audio in an external program and then import it back during the editing process. That way you can remove the breath sounds from your audio. It can be achieved by importing the audio into Audacity (first you will need to use your preferred video editor to export only the audio as an mp3 or wav, for example), selecting a part of the audio where the breathing is noticeable and going to Effect -> Noise Reduction -> Get Noise Profile, then pressing CTRL + A and going to Effect -> Noise Reduction -> mess around with the settings (in your case use low settings, since the breathing is noticeable but not intense) and click "OK". After that, export the audio from Audacity and replace the audio in the video with the exported one in your preferred video editor :)
I worked in the aerospace industry sometime ago. They had a MultiChip module that contained an ASIC die which they had designed in house, and dies that were basically off the shelf, Ethernet PHY, Flash, MRAM, and oscillator. MCMs have been used there for quite some time it seems. They did it because they had basically no choice.
Computers started out with multi-chip designs. Central processing units weren't a single chip - they were several chips, all doing different functions. It was Intel's big innovation to pull all those functions together and put them onto a single piece of silicon. Not quite all (I/O and memory controllers remained external to the CPU for a long time), but close enough. The funny thing is that AMD is deliberately going backwards. They were the first to move the memory controller onto the CPU die, which provided a considerable performance boost, and now they've reversed that. You really can't stress too much how important a very high-speed interconnect is when doing such things.
VLSI integrated functions onto one die, lowering cost and improving power efficiency, while benefiting from Dennard scaling, which made shrinks smaller, cheaper and faster WITHOUT increasing thermal density. Now not only does I/O not scale, but neither does cache on current processes, and the energy density in high-performance logic has become a real limitation. Thus chiplets space out heat generation in many-core CPUs, while large local caches allow gangs of CPUs to work mostly with local data, as trips to main memory become so proportionately slower that performance-sensitive software does all it can to avoid unpredicted memory access. Not only is a 64c/128t chiplet design cheaper, but the monolithic equivalent would be impossible to market as a profitable product. Binning of chiplets, with selective disabling of cores, allows for far greater core efficiency or clock speed in premium models because the chiplets can be chosen to suit a SKU. A monolithic die has to space out cores to reduce logic density, wasting expensive wafers, and cannot mix and match parts to meet SKU requirements.
What is missed is that AMD chiplets DON'T just WORK. They overcome the downsides of familiar MCMs and revolutionized computing from the stagnant Intel sinecure it was. They have been incredibly clever. They had one shot at avoiding oblivion - they needed a single design which scaled cost-competitively to ~all markets. They didn't link existing CPUs, but designed a teamable processor core unit (the 4-core CCX) from the ground up. As others here say, the key downside is the power & lag of inter-core data transmission over any distance, so this was the focus of the CCX design - each of the 4 cores had a direct hardware link to each of the other (adjacent) cores - this needed only 3 such ultra-fast intra-CCX links on each core. Inter-CCX links were lavishly and liberally cached to minimise lag. Having given Fabric the optimum hardware, they then put disproportionate focus on Fabric. They initially strategically yielded to Intel on the sexy IOPS at which monolithic ~wins (gaming), & gradually gained a flanking foothold & then an advantage based on raw power and the low cost of harvesting - even on an inferior process initially. So I have a problem with dismissing it as something Intel etc. can just whip up in the lab & marketing - BS. a/ It requires a ground-up redesign of their entire processor range & discarding a lot of IP. b/ It's been over a decade of hard yards - validation, patents, .... Infinity Fabric is a huge moat for AMD - it's clear that Intel was caught utterly flat-footed by it in 2017 & has mostly just tried to ignore it since - not take it for the serious threat it is.
That's not all, either: before Zen 2, most interconnects between separate pieces of silicon were very slow, with huge latencies (100s of nanoseconds). AMD found a way to deliver 512b-wide IF (servers) and < 100 nanoseconds of latency by working with TSMC, which paved the way for others.
The irony is that the QPU would work if they dropped their UMA. You could push data from any core to any adjacent one. Supercomputers with thousands of cores work like that, a core can't randomly address any other core, not without calling the "network". But then you would have the Cell architecture with their PPUs. Fabric is easy to program I guess, because it's just like a network switch.
The great thing about chiplets is the ability to mix different type of process tech on the same chiplet device. For example, the CPU (which is built on a process tuned for low capacitance) and the DRAM (which is built on a process tuned for high capacitance ...)
MCMs faced two major issues during my time at Intel several years ago. The first was the most obvious: putting multiple chips in the same package meant the heat dissipation of the package now had to be designed for the two chips - ostensibly doubling the required heat dissipation. We handled that by throttling down the chip clocks, a process that has been largely automated by onboard temperature sensors and variable clock generators. IE., if you put two of today's advanced chips in a single package, they will self-throttle downwards. The second issue is that anytime you have a signal leave or enter a chip, it goes in or out via a pad driver, essentially an amplifier of that signal designed to drive the large capacitance of a PCB trace. This adds delay to the line. A lot of it. Internal trace drivers only have to worry about driving very short lengths of narrow and flat aluminium/copper. Not so with PCB traces. MCM pad drivers could be backed off a bit to compensate for minimal traces in the MCM, but typical MCMs unite chips that would normally be packaged separately, requiring a design change. PS., like the old movie "support your local gunslinger", keep in mind I am a stupid guy in the process world. The real process guys have to keep a fan on their heads to keep their brains from overheating :-)
It isn't really doubling the heat dissipation though. Taking an existing device at X wattage and splitting it up into different MCM components doesn't raise the wattage of the combined chips by any appreciable amount -- even if we include an active interposer. However, adding *additional* chips (read: cores, features, etc) to a design would *definitely* increase the thermals. Put another way, the advantage of a fully integrated chip vs MCM isn't about thermals, but moreso that fully integrated chips were a result of continuous process improvements over the course of history. It became *cheaper* to produce an integrated chip rather than go with MCM or COB. The overall performance improvements were a side benefit. Asianometry does touch upon the latency issues with MCM in the video too -- and it's something that AMD has seemingly solved architecturally.
@@scottfranco1962 No, X/2 + X/2 = X. However, X + Y is definitely X + Y. A theoretical chip that is designed in a monolithic fashion versus MCM will consume roughly the same amount of power.
@@scottfranco1962 Or put another way, designing a *device* that has a power dissipation of X watts in a particular package will be the same regardless of the number of chips inside. There is of course the arrangement of the individual chips underneath the heat spreader, but that's something else for another day.
Far as I know, chip manufacturers aren't incorporating heat pipe design concepts into the surface of their chips- there should be plenty of heat performance still available.
Remember the Pentium II? Intel had problems placing all that primary cache memory on the processor chip, so they used the multichip approach, with the cache being the added chiplets. The cache ran at half of the processor clock speed anyway. They returned to a monolithic structure with the next generation, after one year...
IIRC, the (earlier) Pentium Pro had a separate, interconnected cache within the same package. Unfortunately, it was expensive and ran 16bit code poorly, but it really cranked with pure 32 bit code.
Returned? The Pentium II was the first Intel processor that included a L2 cache at all. If you wanted L2 prior to that, it had to be SRAM chips on the motherboard. Or, with the 386, if you wanted any cache at all, it was on the motherboard.
@@TrueThanny It was way more fun back then! Nothing like discovering the newsgroups and getting a JPEG viewer to look at your first digital nudie pics!
IBM developed the first large MCMs in the late 1970s with volume shipments starting in the 1980s. Their first MCMs have 100 and 118 chips on each module. They were originally used in mainframes.
I'm not completely knowledgeable, but every time I come across some technology/architecture I hear "yeah, IBM made a product with that back then and it dipped but then it gained traction" lol
You are missing where the package innovation came from. Multi-die packages were driven by memory and image sensors (processors). For example, through-silicon via technology was developed and matured for these applications. I still think they are the leading edge for this topic. A modern ISP is an amazing piece of technology, with wafer-level direct bonding and an image sensor chip thinned to just 10 µm so it can be illuminated from the backside. Memory is producing stacks with up to 16 chips.
Great video. I didn't really understand why chips on a bigger fab process are more expensive, but once you talked about using smaller chips to 'approximate' to the size of a bigger one, it made sense (my 'aha!' moment) - bigger fabs use more space, so there's less space to make more of them = if it has errors, it is a waste and now you've wasted all that space. If it had been made smaller, then at the least the cost for a bad chip is less. Thanks!
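The yield intuition in the comment above can be sketched with a simple Poisson defect model, Y = exp(-A * D0). This is a rough illustration only, not how any fab actually prices wafers: the defect density, wafer cost and die areas are made-up assumptions, and the extra packaging cost of the multi-die case is ignored.

```python
import math

def die_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Poisson yield model: probability that a die has zero defects."""
    return math.exp(-area_cm2 * defects_per_cm2)

def cost_per_good_die(area_cm2, defects_per_cm2, wafer_cost, wafer_area_cm2):
    """Wafer cost spread over the good dies only (edge losses ignored)."""
    dies_per_wafer = wafer_area_cm2 / area_cm2
    good_dies = dies_per_wafer * die_yield(area_cm2, defects_per_cm2)
    return wafer_cost / good_dies

# All values below are assumptions chosen for illustration.
D0 = 0.2              # defects per cm^2
WAFER_COST = 10_000.0  # arbitrary currency units
WAFER_AREA = 706.0     # roughly a 300 mm wafer, in cm^2

# One large die versus four small dies giving the same total silicon area.
monolithic = cost_per_good_die(6.0, D0, WAFER_COST, WAFER_AREA)
four_chiplets = 4 * cost_per_good_die(1.5, D0, WAFER_COST, WAFER_AREA)

print(f"one 6 cm^2 die:         {monolithic:.2f}")
print(f"four 1.5 cm^2 chiplets: {four_chiplets:.2f}")
```

In this toy model the same silicon area is cheaper as four small dies, because a single defect scraps a 1.5 cm^2 die instead of a 6 cm^2 one.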
A funny thing about chiplets is that they're helping overall, but they are so good that they leave no silicon for the low end. The low availability of the 3300X and the lack of a 5300X is because AMD is getting few to no chiplets with fewer than 6 working cores.
This is the fault of the way AMD approached Zen 2 and 3 and nothing inherent with chiplets. Something as simple as a 4 or 6 core part doesn't require chiplets and if AMD does this for Zen 4 I think Intel will eat AMD for lunch at the low end. If I were designing AMD's Zen 4 Ryzen based CPUs and APUs, I would use 5 chips in all. The first would be a 6 core monolithic die, with no graphics, that allows a tile based interface to add a graphics chiplet. I would create 2 graphics chiplets for different power levels. I can now use this single chip for 4 - 6 core CPUs and APUs. That's a lot of versatility still. I'd then make another chiplet that has 8 cores, the IO functionality and the ability to add graphics just like the 6 core chip, and on another side be able to add another 8 core chiplet, with no IO functionality, so it's simply 8 more cores. AMD did the whole IO die thing with GloFo because they were still under contract to GloFo for die. I don't for a single second believe Intel is going to move to tiling for every single part because it just doesn't make sense to add complexity when the die isn't very big. So this is where AMD failed, at low cost items, and they shouldn't have to rely on having enough bad monolithic APU die to make lower core count parts. That's just bad. And, using an 8 core chiplet to make a 4 core part is BAD. It's a waste of 50% of the die.
@@johndoh5182 That's a lot of tooling though; it isn't cost-effective to have that many different designs. It's also a waste of capacity since you have so many SKUs hogging precious laser space. AMD has 3 designs: CPU, GPU, IO. Intel has maybe 1 or 2 designs: APU. Every Intel SKU is a binned-down 12900K or is a 12900K.
@@johndoh5182 You seem to have a backward idea of the whole chiplet concept. Remember when AMD was on the brink of collapse: high-end desktop CPUs had like 4 cores, and maybe a halo product had 8 cores. With chiplets AMD raised the bar; nowadays the most common gaming machines have 6 cores. AMD is now becoming Intel since no competition is in sight (thankfully we got ADL, but that's not enough). Instead of asking AMD to downgrade their effort, I would instead ask Intel to up their game, so we can raise the bar again and midrange can have more cores, like 12, hopefully making 4-6 cores the low end. That's the beauty of competition, although I don't see it coming anytime soon since prices are slowly creeping up.
@@jakejakedowntwo6613 No? AMD makes 1 die each for APU, CCD and IO, and a bunch of dies for discrete GPUs. A 6-core CCD is the only new one that AMD would need to make. And Intel makes more than 1 die for Alder Lake.
I'm surprised you didn't mention Intel tried this a few times. They did it with the Pentium 4 to get dual-cores (really dual-dies) to market. Then again with the QX6700 (2006) and Q6600 (2007), which contained two dual-core chiplets to deliver quad-core processors, and they used the MCM naming for those too. There were also Xeon equivalents at the time. In Intel's case, AMD got to 4 cores with a single die and Intel was a bit behind, so they had to use an MCM to catch up, then they went back to a single-die design. Now it has switched around and chiplets are being used to get higher core counts at a reduced cost, like you said in the video.
Intel is so behind AMD and Apple on chiplet technology that it’s not even funny. AMD's 5800X3D has 64MB of vertically stacked L3 cache at $450! Intel’s Foveros/EMIB technology announced in 2019 is still nonexistent!
*voltage, frequency & chiplet efficiency* Multi-Chip has some very interesting EFFICIENCY options available. Since POWER DRAW does not scale linearly, and is often the limit (especially in laptops) then adding MORE chiplets at lower performance per chiplet can result in better performance. If you plan to run at a max of, say, 1500MHz you can also optimize around that as a max frequency further saving power somewhat. So there's a lot more flexibility compared to a traditional monolithic design... and not being on the most cutting edge NODE might not matter so much if you have this flexibility. Maybe put those cost savings into adding more chiplets on less efficient nodes? There's definitely going to be a BALANCE of everything that results in the best performance per dollar to make and that's going to vary depending on the product, current node costs etc.
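A minimal sketch of the non-linear power argument above, using the textbook CMOS approximation P = C * V^2 * f. The linear voltage/frequency curve and every number here are assumptions made up for illustration; real silicon has a more complicated curve and also leaks static power.

```python
def dynamic_power(freq_ghz: float, volts: float, cap: float = 1.0) -> float:
    """Classic CMOS dynamic power approximation: P = C * V^2 * f."""
    return cap * volts ** 2 * freq_ghz

def volts_for(freq_ghz: float) -> float:
    """Hypothetical linear voltage/frequency curve (an assumption, not real data)."""
    return 0.6 + 0.15 * freq_ghz

# One chiplet pushed to 4 GHz versus two chiplets at 2 GHz each.
one_fast = dynamic_power(4.0, volts_for(4.0))
two_slow = 2 * dynamic_power(2.0, volts_for(2.0))

# Both options deliver the same aggregate clock cycles per second
# (for perfectly parallel work), but the slower pair draws less power here.
print(f"1 x 4 GHz: {one_fast:.2f} (arbitrary units)")
print(f"2 x 2 GHz: {two_slow:.2f} (arbitrary units)")
```

The comparison only holds for workloads that actually spread across both chiplets; a single-threaded task would still prefer the one fast chiplet.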
Can you do a video on yourself...how you generate so many high quality videos in such fast iterations? Truly amazing! I seriously would be interested how you do this!
@@scottfranco1962 one minute they are the world's leading chip producer, with military and bleeding edge computing resting on their back, and the next minute they are a highly armed guerilla fighting for their existence... its crazy how close both versions seems to be
Just a friendly heads up about your audio; your VO seems to have a bit of ringing to it. I noticed it around the 8:30 mark. This might be your mic arm resonating in which case it should be tightened. If the mic is desk mounted, put it on something soft so the desk can't resonate. Lastly, it might be feedback if you're listening to yourself with speakers but I'm doubtful of that last one.
Back in the days, mainframe CPUs were basically a bunch of chiplets in a single package and a heat spreader on top. "CPU Galaxy" is a good channel to see some of these old beasts.
Credit needs to go to software developers for finding ways to parallelize tasks. All the chiplets in the world won’t work if the software isn’t designed to run on multiple cores. You young folks probably won’t remember the days before multithreaded software, but trust me when I say it sucked. Programs would freeze as they completed a task, which in the days of slow CPUs and mechanical drives was quite often.
Nothing has changed. Most true multicore/threaded designs are one-off. Think supercomputer clusters. True multithread/multicore software engineers are hard to find. The average software engineer is scared to death of multithreaded designs, and it would not be a stretch to imagine that most multithread apps have latent bugs.
@@scottfranco1962 The more services the OS provides, the more those can be multi-threaded. Even the natural layers allows for it, e.g. GUI in a separate thread from the app.
That is Windows thinking. Devtools for multithreading have existed for 30-plus years on real OSes like Unix. Even today Win10/11 can't handle over 16 threads in a single process. Most Windows apps start to taper off after 4 cores. Apple solved this almost 10 years ago with Grand Central Dispatch. Microsoft should do the same, and if Microsoft wants to make an OS that actually works they should use a Linux kernel and put their brilliant desktop system over it. That would also solve the multithreading problem in Windows, if MSFT wants to do it.
The Ryzen chiplet design and the huge investment needed to make it work were a very risky move by AMD and almost an all-in. With Bulldozer selling extremely poorly and no new CPU products on the market for about 5 years, I'm not sure the AMD CPU division would've survived if Ryzen had been a failure. My guess is that it was the 2 major consoles using Bulldozer and GCN 1 that kept AMD alive. Of course we can thank the silicon gods that it worked out, or we'd be paying a pretty penny for Intel 4-core CPUs in 14 nm to this day :D
**Edit**: explained by a reply to my comment. 10:03 - "[...] that handles analog functions like USB and SATA [...]". Neither of those are analog, and I'm not aware of any homophone that would make that statement make sense. Did you perhaps mean something like "auxiliary"?
In chip fabs, chip design is split into 3 general categories: logic, SRAM, and analog. Analog (also known as I/O) is everything that touches the outside of the chip.
@@Quxxy to the chip's circuit that's running at multiple gigahertz, any I/O is "analog", in a sense that the signal doesn't arrive clearly in a well defined form for the chip's circuit. There are special circuits in the chip that translate that "analog" voltages to a more manageable form inside the chip, and they are what's known as analog circuits
Around 20 years ago we saw the end of the MHz race between CPUs. Slower chips doing things smarter and more efficiently substituted for faster clock speeds. Chips have been mired in the same 'bloatware' that software finds itself in.... we don't program smarter, we just throw more lines of code and billions more transistors at a problem. I believe that the 7 nm 'node' is over-extended, and just like with clock speeds, we will take a step back towards more realistic geometries. AMD has taken an interesting tack towards simplification. A database server does not need ANY graphics functionality, so why include it? Design a couple of chiplets that implement database functionality in hardware, omit the graphics engine and glue them together with a CPU. Fewer transistors, easier geometries and better performance.
Scaling up the easiest way makes sense, until it stops making sense, and then scaling up the next-easiest way makes sense. Repeat until full theoretical capability is achieved.
That's one of the problems with x86 (and ARM to some extent). About 90% of all instructions a program executes are jump, compare, add, subtract, load and store, and x86 has ~15 000 instructions, which makes decoding A LOT more difficult and requires more bits to encode an instruction, eating away silicon space that could go to more useful things such as more pipelines or a better branch predictor. That's one of the reasons why RISC-V is appealing, since it wants to keep things as simple as possible, which also allows for higher code density.
Sure, what is the clock speed of the human brain? Probably less than 1 kHz. Yet all of the computers in the world cannot equal one. Parallel processing in 3D solids, baby! Oh, yeah, and all that for the cost of a beer or so...
Xilinx (now AMD) used a multidie approach for ultrascale too. Some designs that ran fine on V7 have problems meeting timing at the die crossing interface in ultrascale.
@TacticalMoonstone I like to know how similar or different the process technology is between sensor manufacturing vs processors. I have a feeling that image sensors use older equipment/process. Also, how far behind is full frame / large sensor semiconductors compared to latest cell phone cameras. If you would scale up for example, iPhone main sensor to larger size, how much would it cost? TowerJazz, Samsung, Sony, sigma…I’m also curious about those.
Remember the jump in performance that came with the Pentium II Xeon / Pro? It seemed like a big jump in performance. That used a primitive version of chiplet based L2 cache memory if I recall correctly; This is why it had a weird rectangular shape and case. I think big CPU cache is the way to go, generally.
SoC is the opposite of what AMD is using. They are doing SiP, system in a package; it's been in use for decades. The next evolution of that is stacked die, where you use through-silicon vias to directly connect die without bond wires and footprint increases. SoC is putting multiple die on the same wafer, making large die with sub-die only connected by the top metal layers or bond wires in the package. That was a large expense where board space was the primary concern. The other huge advantage of this approach is that you don't need as many different devices on a single die, reducing mask layers for each and not having to share thermal budgets at each layer. This reduces cost and complexity. The reduction in base and interconnect layers also greatly improves the cost of scrap. Your net yield increases slightly, but anything you scrap at end of line has many fewer masks, so it costs less. These advantages are shared by traditional SiP, but stacked die is a way to take Moore's law into 3D and continue it beyond atomic size constraints.
The biggest roadblock for MCMs was cache memory latency bottlenecks. Consequently, integration of memory on die for CPUs, GPUs and SoCs would be a priority from the mid-90s onward, while on the server side, massively parallel blade architecture solved the problem temporarily, but by ~2010 latency was again an issue. Chiplets solve this another way: splitting up the processors into manageable increments.
It's a bit of hair splitting but technically USB sends digital communication over analog differential pairs. The usb controller interface he mentions translates the external analog electrical signal into internal digital data. PAM 4 in PCIe 6 adds a more clear analog touch, with the communication signal voltage being able to represent 00, 01, 10 and 11 instead of just 0 or 1.
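To make the "00, 01, 10 and 11 instead of just 0 or 1" point concrete, here is a small sketch of two-level (NRZ) versus four-level (PAM4) symbol mapping. The Gray-coded table and the abstract level numbers 0 to 3 are assumptions for illustration; it does not model real voltages, equalization, or the error correction an actual link needs.

```python
# Gray-coded mapping of bit pairs to four amplitude levels (assumed for
# illustration; the levels are abstract indices, not real voltages).
PAM4_LEVELS = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}
PAM4_BITS = {level: bits for bits, level in PAM4_LEVELS.items()}

def nrz_encode(bits):
    """NRZ: one bit per symbol, two levels."""
    return list(bits)

def pam4_encode(bits):
    """PAM4: two bits per symbol, four levels."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

def pam4_decode(symbols):
    """Map each received level back to its bit pair."""
    out = []
    for level in symbols:
        out.extend(PAM4_BITS[level])
    return out

data = [0, 0, 0, 1, 1, 0, 1, 1]
symbols = pam4_encode(data)
print(symbols)                        # four symbols carry eight bits
print(pam4_decode(symbols) == data)   # round trip recovers the bits
```

The same eight bits take eight NRZ symbols but only four PAM4 symbols, which is where the bandwidth gain comes from; the cost is smaller spacing between levels, so the receiver has to resolve a more "analog" signal.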
Why not separate the system-on-a-chip into multiple processors like it used to be? Have a separate video card and a neural AI calculator. What does "diffused in USA" mean on the AMD processor?
I wonder whether I saw this approach with the first Intel quad cores. As far as I remember, these were two 2-core chips "glued together" on a single package.
Intel already proved that chiplets work back in the P4D and C2 era, except AMD shamed Intel for gluing and went as far as calling it a fake dual core and later a fake quad core.
Intel's first "dual-core" chip, the Pentium D, was two single-core chips placed on the same package. There were zero interconnects. When one chip needed to communicate with the other, it had to do so via the FSB, so it was actually SMP on a single package, not a dual-core chip. Their first "quad-core" was similar. It was a pair of dual-core chips on the same package, with the same lack of interconnects.
@@AlfaPro1337 AMD was right. The Pentium D was SMP on a single package, and the Core 2 Quad was dual-core SMP on the same package. AMD had the first actual dual-core processor, and also the first actual quad-core processor. Many workloads showed the consequences of this, too.
@@TrueThanny There were 0 interconnects, because the I/O die is basically on the North Bridge? Plus, I'm guessing Intel opted for a ring bus-like, thus, the main core die communicate with the NB and fills up, until it hits a certain workload that it needs to shift some work to the second core die. It's still a dual-core in a sense--though, not a single package, but hey, AMD was being a dick in the 1st place. Heck, even the early consumer 1st gen Core i series are basically 2 die, but, one is the core and the other is the I/O die. Technically, Intel has been gluing for years.
@@mawkzin Thank you! But TSV seem to be "through-silicon-vias", i.e. to connect different layers to each other, no? But searching for TSV led me to pages that mention "microbumps", which I believe is how the individual chiplets are bonded to the base material. It seems that microbumps are µm-level solder balls. Now I'd love to know even more about this part of IC packaging! How are those microbumps manufactured? How are they placed so precisely? Are they heat-soldered?
Maybe we can go back to having a north bridge on the motherboard in typical PCs? It would be a small step. This also reminds me of how 80286 and 80386 CPUs needed an optional “math co-processor” to be installed in the motherboard in order to perform some floating point calculations.
A North bridge would come with massive latency costs, tho. They didn't integrate the North bridge just because it was cheaper, they also did it because having an imc massively increases performance
We wouldn't actually want that. You're just adding more lengths of wire and logic between a device and the CPU, which becomes a bottleneck and adds latency. Everything either needs access to the CPU, RAM or both. The chipset that still exists on desktop systems still has the same problem. AMD's Epyc based systems don't even have a chipset because there is room in the IO die for all those auxiliary functions. It's probably key to understand that the Northbridge is mainly only really different from the modern chipset (and southbridge) in that it housed the memory controller. That specifically makes way more sense to be on the CPU package. The northbridge was removed in favor of the Integrated Memory Controller. It would be a step backwards.
@@blkspade23 Okay so not worth it, I guess Ryzen already does the logical thing and puts the IMC in the I/O chiplet. It maxes out at moderate memory speeds but it makes sense from a value perspective.
The brilliance of AMD is that they have, essentially, only a single 8-core CPU die. By connecting many of them with different fabric chips and/or disabling broken cores by lasering them out of the circuits, they can build a chip anywhere in size from 2 cores to 128 cores, all with roughly one CPU design. So AMD really makes only ONE CPU and it has 8 cores, period.
Thanks. I was expecting more information about the packaging technology and process, that’s where the real innovation is as far as I understood. Is that covered by secrecy?
Large chips also have internal lag issues. Even at the speed of light, a signal that needs to travel 2 centimeters always takes twice as long as one that needs to travel 1 cm.
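To put rough numbers on that distance argument, here is a minimal sketch. It assumes an idealized signal moving at a fixed fraction of the speed of light; `velocity_factor` and the 5 GHz clock are illustrative assumptions, not figures from the video, and real on-chip wires are RC-limited and considerably slower than this upper bound.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def propagation_delay_ps(distance_cm: float, velocity_factor: float = 1.0) -> float:
    """Idealized time in picoseconds for a signal to cover distance_cm.

    velocity_factor is the assumed fraction of the speed of light at which
    the signal travels (1.0 is the physical upper bound).
    """
    if not 0.0 < velocity_factor <= 1.0:
        raise ValueError("velocity_factor must be in (0, 1]")
    speed_m_per_s = SPEED_OF_LIGHT_M_PER_S * velocity_factor
    return (distance_cm / 100.0) / speed_m_per_s * 1e12


def clock_period_ps(frequency_ghz: float) -> float:
    """Length of one clock cycle in picoseconds."""
    return 1000.0 / frequency_ghz


if __name__ == "__main__":
    period = clock_period_ps(5.0)  # hypothetical 5 GHz clock
    for distance_cm in (1.0, 2.0):
        delay = propagation_delay_ps(distance_cm)
        print(f"{distance_cm:.0f} cm: {delay:.1f} ps "
              f"({delay / period:.0%} of a 5 GHz cycle)")
```

Doubling the distance doubles the delay, and even in the best case a couple of centimeters already eats a noticeable share of one clock cycle.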
The older server CPUs were split up into separate dies on the same substrate based on functionality and the needs of the customer. There were different variations for them as needed for server processing. If it needed to do more calculations it would need more CPU cores on the same substrate. Back in the 90's Intel had separate cache dies in their Pentium II line because they found too many failures manufacturing CPU dies with cache at the time. Most of the failures really came down to the cache itself. Instead of wasting a whole CPU chip just because the cache was faulty, they decided to manufacture the cache separately. It does not cost more to have more instructions on the same die size. It costs more because of the higher percentage of failed chips manufactured, i.e. bad chips = no money. It may cost more for a finer lithography process on the same die size because of the expense of the equipment and development costs, and also the risk of higher failure rates. It may cost even more if the design of the chip requires more layers of processing. This is the idea behind AMD's "chiplets" in their new high-end processors on the 7nm process. Very few chip manufacturers use a 7nm process at this time, and it seems to be getting better but is still risky in terms of higher failure rates. I'm sure the failure rates are getting lower, but by splitting the dies into "chiplets" the manufacturers may waste fewer failed dies and earn more profit.
4:00 This is not what Moore said, because chiplets and tiles are in the same package, and he was talking about separately packaged parts. Don't give credit where it isn't deserved.
great coverage as always. I wonder if apple will ditch the monolithic dies for the pro lineup going forwards, they really are pushing that monolithic design SOC with their on silicon interconnect
I don't think it is possible for the A series chips, as they need to be small for iDevices. It's possible for the future M series. The current M1 Ultra is 2 chips in the same die, with the RAM (Unified Memory) as chiplets on the side.
@@Theoryofcatsndogs not really, the M1 ultra is really one monolithic die, since the interconnect is on silicon. they just cut off one of the dies if it fails which is how they preserve yield
As I understand this, using ‘chiplets’ means breaking up a single chip into several components. This means that each chiplet would need extra circuitry to concentrate the signals into channels for exporting and importing data from the other chiplets. This extra channel circuitry would not be needed in a single integrated chip. That makes the integrated chip inherently more efficient than a bunch of interconnected chiplets.
More efficient until you reach a size limit. More efficient except more prone to manufacturing defects. More efficient except more expensive. In an ideal world it wouldn't be necessary, but it is necessary for various reasons.
What you described is how Moore's Law basically allowed Intel to keep squeezing performance out of its chips over the past decades. Problem is: Moore's Law is dead. We are pretty much at the limit of how much we can shrink transistor dimensions. You can't just shrink them further and pack more of them in an integrated circuit. The failure rate is significantly higher. This is why Intel's chips and semiconductor roadmap got fucked because of the consistently poor yields they were getting.
Great video! I love all of them: thanks for the great work you do! When you refer to the USB and SATA connections to the AMD Ryzen I/O chip as "analog", what about them is analog? Is it the transceiver (not sure if this is the correct term in this context) that converts the external digital USB and SATA 5v signals to the 1.2-1.5v (I believe) signals that the I/O chip uses internally?
Nope, when you put the data rate of a digital signal high enough, it's not sharp edges and neat transitions on clock edges. You 100% have to treat them as analog signals when they come in from off-chip. On chip you can control levels and delays to make things near-enough digital ideals to treat them as crisp signals.
@@thewheelieguy Oh yeah, that totally makes sense. At the signal level everything is analog, but we (software and systems engineers) can usually ignore that. Thanks for the explanation!
@Asianometry If you could find time to make a video on the next generation of interconnects, both data and power, to deal with multiple chips on a substrate, the companies who are doing it, and whether Intel can compete with TSMC and Samsung with its next generation foundry, that would be awesome. When I look for information, it is all fairy tales. Could you cover what kind of materials would be used for interconnects and interposers, and why they are challenging?
He specifically showed the big perforated copper blocks with chips embedded and called them MCMs. I don't recall if he mentioned the letters I, B, and M but they're in there.
I work in embedded , no such issues there :) You would have to be mad to go chiplets there. I see the point for high end designs if you want to have some scalability and lower cost with some performance drawbacks.
Because they only had one chip. First-gen Zen was monolithic. Everything was on the one die, which had two memory controllers. The consumer platform got one chip, giving it up to eight cores and two memory channels. EPYC got four chips, which is up to 32 cores and eight memory channels. Later, Threadripper got two chips, for up to 16 cores and four memory channels. With Zen 2, all the I/O, including memory channels, moved into a separate die. That allowed an arbitrary number of cores for any platform, only really limited by space on the package.
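The first-gen Zen scaling described above is just multiplication by die count. A minimal sketch using only the per-die figures from the comment (up to eight cores and two memory channels per die); the `Platform` class and its names are made up for illustration:

```python
from dataclasses import dataclass

CORES_PER_DIE = 8            # per the comment: up to eight cores per first-gen Zen die
MEMORY_CHANNELS_PER_DIE = 2  # per the comment: two memory channels per die


@dataclass(frozen=True)
class Platform:
    """Hypothetical helper: a product line built from N identical dies."""
    name: str
    dies: int

    @property
    def max_cores(self) -> int:
        return self.dies * CORES_PER_DIE

    @property
    def memory_channels(self) -> int:
        return self.dies * MEMORY_CHANNELS_PER_DIE


FIRST_GEN_ZEN = [
    Platform("consumer", 1),
    Platform("Threadripper", 2),
    Platform("EPYC", 4),
]

if __name__ == "__main__":
    for platform in FIRST_GEN_ZEN:
        print(f"{platform.name}: {platform.dies} die(s), "
              f"up to {platform.max_cores} cores, "
              f"{platform.memory_channels} memory channels")
```

With Zen 2 the memory channels moved to the separate I/O die, so the second multiplication no longer applies and core count is limited mainly by package space.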
Some people here in the comments confuse chiplets with MCM. They're not the same. In an MCM design a single chip can work by itself. In a chiplet design a single chip can't work by itself without an additional chip. In the case of AMD one chip (CCX) contains the CPU cores and another chip contains the IO interfaces and connections (IO die). Neither works without the other. So a 386 motherboard or an IBM mainframe from 1965 is not based on chiplets. One thing is missing from this video: chiplets improve not just yields but also binning. Smaller chips can reach higher clock speeds.
lol at 1:10 the person in the clean room on top the machine has a fall protection harness. However, the distance they have to fall is so short, it won't activate before they hit something.
I recommend checking the caly-technologies die-yield-calculator, as I think you gave numbers for small chips on a mature process here (8:18). With a new and unrefined process, big chips like EPYC would have a yield of 1-5 chips per 300mm wafer (yield in the 10% range), and those wafers cost ~$10,000 each, which would make the CPU cost way above $500,000 each to pay for every step of the process. The same wafer and same config with chiplets gives 840 working chiplets, and maybe 10 half-working ones, that can be configured into 850 working Ryzens, 400 32-core Threadrippers or 200/100 EPYCs (assuming perfect interconnecting; you can drop 5% if you want to be realistic). IO dies on a lower process are easy to get and WAY cheaper. As TSMC and others are always at full load and companies fight for every wafer, AMD can make 10x as many chiplet-based CPUs as monolithic ones with the number of wafers they get from TSMC. AMD is already above what would be possible without splitting into smaller pieces, and if Intel had had chiplets during their 10 nm issues, they would have had good yields a few years sooner. I know it's extra complexity, extra time and an extra problem layer, but we are already past the point where monolithic is physically possible, and every manufacturer works around that in their own way. NVIDIA absorbs any issues by disabling 1 or 2 out of 70 cores completely and just saying the product has fewer cores than are physically on the die. I like AMD's way, as divide and conquer is what IT has found to be the best solution for a wide variety of problems....
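For anyone who wants to play with this kind of yield arithmetic, here is a rough sketch. It is not the calculator mentioned above: it assumes the simple Poisson yield model and a common first-order dies-per-wafer estimate, and the die areas and defect density are hypothetical values picked for illustration, not real AMD or TSMC figures.

```python
import math


def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    """First-order estimate of gross dies per wafer.

    Wafer area divided by die area, minus a correction for partial dies
    lost at the wafer edge. Ignores scribe lines and edge exclusion.
    """
    radius_mm = wafer_diameter_mm / 2.0
    gross = math.pi * radius_mm ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2)
    return max(0, int(gross - edge_loss))


def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson yield model: Y = exp(-A * D0), with the die area A in cm^2."""
    return math.exp(-(die_area_mm2 / 100.0) * defects_per_cm2)


def good_dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float,
                        defects_per_cm2: float) -> float:
    """Expected number of defect-free dies per wafer."""
    return (dies_per_wafer(wafer_diameter_mm, die_area_mm2)
            * poisson_yield(die_area_mm2, defects_per_cm2))


if __name__ == "__main__":
    WAFER_MM = 300.0
    DEFECTS_PER_CM2 = 0.5  # hypothetical immature process
    for label, area_mm2 in (("monolithic", 700.0), ("chiplet", 75.0)):
        print(f"{label} ({area_mm2:.0f} mm^2): "
              f"{dies_per_wafer(WAFER_MM, area_mm2)} gross dies, "
              f"yield {poisson_yield(area_mm2, DEFECTS_PER_CM2):.1%}, "
              f"~{good_dies_per_wafer(WAFER_MM, area_mm2, DEFECTS_PER_CM2):.0f} good dies")
```

Because yield falls off exponentially with die area in this model, the big die loses on both counts: fewer candidates per wafer and a much smaller fraction of them working.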
Man AMD should just stop gluing together all these parts. Give the competition time to copy their designs first... ohh wait, Intel just patented the AMD Ryzen architecture. Never mind.
@@anarsosoroo2891 He, humorously, said: "Intel used to dunk on AMD by calling chiplets 'glued together cores' but now Intel recently published a specification for a, supposedly, industry wide approach to a cross-vendor way to mix and match different accelerator chipslets into a coherent package. It 'incidentally' happens to be perfectly compatible with AMDs Infinity Fabric". But @J Bali is also right, as AMD (and a few specific AMD employees) is actually mentioned in the white paper. So AMD was in on it right from the inception. Hence Intel didn't just "patent Ryzen".
You didn't mention anything about how packaging technology suddenly enables the integration of multi-chiplet modules without sacrificing performance. And what's the difference between a SiP and a chiplet structure?
You're likely going to see it moving forward more in mobile gaming. For example, the chip being used in the Steamdeck is a 4-core APU derived from AMD's Zen3 micro-architecture paired with 2 RDNA compute units. As for phones, battery technology probably hasn't caught up yet that phone OEMs would look beyond the current SOCs they're using.
I remember being obsessed with Hyper transport technology. What always bothered me was this idea that theoretically we can unite all the best solutions from competitors in order to make one "ultimate" product. At the same time losing competition (or at least a threat of it) was the major reason why one company developed its "killer feature" to stay afloat in the market or to outcompete its rivals. On the other hand there is no such thing as the ultimate product with all the cutting edge technology in the real world since every product is just a mean for fulfilling certain need or completing certain task. For each need or task the best solution might not require the best of the best or "ultimate" tech in every regard.
@@scottfranco1962 My dilemma stems from the assumption that technological advancement comes either from competition or from some extreme external limitation. It seems logical to me that in the future we will compete directly in technology (of every process) rather than in products. I might be simply wrong, but isn't it the limitations from the weak points of your product that make you develop new technologies to overcompensate for your weaknesses?
@@AlexanderSylchuk Oh, you are begging for my favorite story. I interviewed with a company that made precision flow valves. These were mechanical nightmares of high precision that accurately measured things like gas flow in chemical processes. This is like half the chemical industry (did you know a lot of chemical processes use natural gas as their feed stock?). Anyways, what has that got to do with this poor programmer? Well, like most industries they were computerizing. They had a new product that used a "bang bang" valve run by a microprocessor. A bang bang valve is a short piston that is driven by a solenoid that when not energized, is retracted by a spring and opens a intake port and lets a small amount of gas into a chamber. then the solenoid energizes, pushes the piston up and the gas out another port. Each time the solenoid activates, a small amount of gas is moved along. Hence the "bang bang" part. If you want to find one in your house, look at your refrigerator. Its how the Freon compressor in it works. Ok, well, that amount of gas is not very accurately measured no matter how carefully you machine the mechanism. But, it turns out to be "self accurate", that is, whatever the amount of gas IS that is moved, it is always the same. The company, which had got quite rich selling their precision valves, figured they could produce a much cheaper unit that used the bang bang valve. So they ginned it up, put a compensation table in it so the microprocessor could convert gas flows to bang bang counts, and voila! ici la produit! It worked. Time to present it to the CEO! The CEO asks the engineers "just how accurate is it?" Engineer says: well... actually it is more accurate than our precision valves. And for far cheaper. The story as told me didn't include just how many drinks the CEO needed that night. 
So the CEO, realizing that he had seen the future, immediately set into motion a plan to obsolete their old, expensive units and make the newer, more accurate and cheaper computerized gas flow valves. Ha ha, just kidding. He told the engineers to program the damm thing to be less accurate so that it wouldn't touch their existing business. Now they didn't hire me. Actually long story, they gave me a personality test that started with something like "did you love your mother", I told them exactly where, in what direction, and how much force they could use to put their test and walked out. I didn't follow up on what happened, mainly because I find gas flow mechanics to be slightly less interesting than processing tax returns. But I think if I went back there, I would have found a smoking hole where the company used to be. And that is the (very much overly long) answer to your well meaning response.
@@scottfranco1962 Great story! For some reason I imagined something like the water valve from a washing machine as a "bang bang" valve. To me fridge compressors work more like a one-piston combustion engine, but they don't make distinct sounds due to their continuous work. Maybe with a larger piston which stops at every cycle they would produce a "bang-bang" sound. It actually reminded me of "Zero to One" by Peter Thiel, and your story is actually my main concern about that book. When you outcompete your market you will find it a lot harder to disrupt your own business with new technology. It's just like it was in the Soviet Union: the only tech that was improving was military, and only because of competition with the West.
A CPU you get for your computer is not SOC, not even chiplet based Ryzens. SOC includes not just the CPU and it's I/O, but also a GPU, storage and memory at least. Phones and consoles have SOCs, but not personal computers.
The Athlon Tri-Core was just enterprise quad-core chips that failed to yield due to manufacturing loss, so they just turned off the defective core in Microcode, sold it to consumers as the Tri-Core and we ate it up. Also because the failures were sometimes still semi-usable so we figured out ways to turn the defective CPUs back on.
In the 1980s AMD had the AM2900 bit-slice microprocessor design. I do not know if anyone actually used these, but they were an interesting idea: a multi-chip processor. The improvements in chip manufacturing made these redundant.
I can remember reading the data books on those when I was in my undergrad. The only industry applications I can think of were in very high speed signal processing, before the advent of integrated DSPs.
Xilinx is pronounced ZI-links, not ZEE-links. And panacea is typically pronounced with emphasis on the first and third syllables, not the second syllable. Otherwise, good video.
They worked because despite all of the performance disadvantages of chiplets, Intel spent half a decade relaunching Skylake on 14nm because of their initial 10nm failure. Sacrificing 10-20% of the power envelope and adding 10-20ns of memory latency to do hops over the package is fine if the competition is THAT stagnant. Had Ice Lake and Alder Lake launched in 2017 and 2018 as originally planned, this video would've been titled "why AMD's chiplets failed"
Back in the early 80's I was in a design team working on a Hardware Modeling Library at Mentor.
Our device allowed 'chips' of up to 256 active pins (400 pins total) to be included into software simulations (pre-widespread use of VHDL, obviously). I designed the physical interface into the customers' 'chip' (among other things). It was very interesting to query packaging folks (from Intel, Fairchild, Wakefield, Brit-Telecom, Mercedes, etc.) on what their upper limit on pin count was... often I got a very cautious glance and
... "Well, how many can you give us?" as an answer.
Many of them were only willing to talk about a more traditional hybrid-on-ceramic packaging. Whenever I turned to 3D packaging, I got a variety of answers, from "Nope, not for at least 5 years" to ... "Well, that depends on the vertical height we can have"
[our device had 8 card-slots spaced 1.25 inches on center and had to account for a 0.125 inch thick controlled impedance PCB, and ZIF socketing.... either 4x64 pin, 2x 128 pin or 1x400 pin (256 active pins)].
Just for fun, I'd ask if the full 12 inches was enough. The answer would be: "Of course, yes, but we still want to have up to 8 devices installed in the card cage... and 1.25 inches each is a bit tight," which I always interpreted as meaning they wanted over 1 inch EACH for their concept of a 3D multi-chip interconnect including all cooling heatsinks, fans, etc.
Our answer was "just pull one interface board (running 7, not 8 devices) and then you have 2.5 inches to work with ... otherwise, buy a second HML".
Some customers did not smile at that suggestion. At over 100 grand, this was not a cheap device, bitd...but you could run up to 4 boxes under one multi-unit license.
Only one young engineer at an unnamed aerospace company did not flinch at the 1 inch headroom... I imagine THEY were the ones who had the most compact version of the 3D packaging at the time.
BTW, this was the same set of informational "interviews" that forced us to go to 256 active pins.
When we started this, we thought we could get away with "just" 128 active pins. Virtually EVERYONE told us to double it.
Our engineering manager CRIED when he heard that... us design-grunts were cheering it on!! MOAR POWER is always good, right?
Now, designing the backplane and send/receive data lines and phasing clocks to get insane state-transition control times, THAT was fun to do.
Controlling crosstalk and race conditions in the PCB layout just about cost me my sanity, but I made it work and it remains THE QUIETEST system (as measured on a FCC testing facility at Mariposa) that I was ever associated with... and the biggest.
It was an amazing box of rawk.
Tho no one asked for it, getting up to 512 active pins for one model would have required a change in the way we transmitted/stored the data for vector-in, vectors-out, tri-state-data and timing analysis data.
Man, this is just so cool. I'm glad you shared your story here.
Great video! I taught a course on Advanced Packaging a number of years ago and it's nice to see the industry moving towards MCM/Chiplet designs and, now, stacking of chips. Perhaps a future video can be done on passive vs active interposers?
You're lau? I don't think you're lau.
@@feynstein1004 ?
@@michaellau5329 It's a reference to this video:
ua-cam.com/video/h1sCiXTlR8Q/v-deo.html
Thank me later 😂😂
the arm of cyborg arnold left behind at cyberdyne really helps this subject.
I remember a professor saying that optimizing separate parts doesn't necessarily mean optimizing the full result when the parts are working together... so in general, integration and considering the entire system is usually better. It is kind of crazy to me that we went from discrete chips, to almost fully integrated, and are now going back to somewhat discrete... But I guess some systems have become too complex for us to optimize in an integrated mode nowadays.
Intel joked that AMD just "glue" their chips together.
Also Intel: Yep, we'll be using chiplet design soon.
Intel glued CPUs together in the past - the Pentium II and III in slot versions had on-package cache, for example - and later the first Core generation had multiple "glued together" CPUs too - like the Core 2 Extreme QX6700, the Core 2 Quad Q6600 or the Pentium D 945
Of course the problem with stacking is heat dissipation. A cube of silicon won't be able to get the heat out of the center fast enough. I guess they'll go to fluid cooling of these cubes if they ever stack that deep.
Not really
Intels tiles are very different from amds chiplets
@@suntzu1409 emib baby, but it's worth a note theres multiple ways to skin a cat, and each have their own pros and cons.
That joke is on Intel. They weren't even able to use glue the right way when they glued the IHS to the PCB (for example on the i7 8700K), which caused terrible temperatures. I don't think it would be a good idea for Intel to start gluing anything together again.
And now Intel, AMD, ARM, and others are getting together to come up with an interconnect standard. We could see chiplets from different vendors on the same package. Imagine AMD cores alongside ARM and even FPGA fabric, tasked with handling different aspects of a system.
This might also solve the licensing hell Intel made to prevent vendors from integrating both ARM and x86 ISAs into one single die.
@@kekkocheng Why not connect 7 billion brains to Make a more powerful computer ?
It's already done. Check each Threadripper, EPYC and Ryzen Pro - each CPU has a small ARM CPU inside for the Platform Security Processor (PSP).
Will be interesting to see what new solutions it makes possible. E.g. AMD have dominated the console space because they are one of the only players that has all the technology and IP required to produce a monolithic SOC that can provide a good experience for gaming. My hope is that eventually it allows manufacturers to shop around to some degree, combining whichever parts are available and best suited for their product. For instance I could imagine a high end tablet or gaming handheld using a CPU from Qualcomm and a GPU from Nvidia, and using UCIe to avoid those two companies having to collaborate too closely or share sensitive IP.
Of course you can already have a separate CPU and discrete GPU from different companies, but this complicates the cooling, and I suspect packaging everything together could have cost and latency advantages too.
@@liamness Nvidia is not part of UCIe. It’s not possible (yet)
As I understand the reverse happened when a Japanese calculator company wanted intel to make a number of different chips for their calculator industry however a person in Intel decided to combine all the functions into the first CPU chip
Usually integration (aka "fewer components") is the way to save on money and complexity. Like it saves big time, in some cases multiple orders of magnitude. But he explained why currently the reverse is true, at least for highly complex CPUs.
@@graealex 100%. It's a balance between silicon yield and costs versus post-silicon process costs. In this day and age, silicon costs are astronomical so it makes sense to invest in designs that reduce them.
If I remember correctly Intel were making ASICs.
The ASICs were fully integrated where as the CPUs were general purpose and needed additional components to function
Same thing here, monolithic CPUs are great with everything on one chip but if you can break it down, generalize the hardware and make it profitable then it will be a success.
Quick tip for you, if you use obs for recording you can set your microphone audio to be on a different audio lane than the desktop audio making it possible for you to treat the audio on an external program and then import it back during the editing process, that way you can remove the breath sounds from your audio. It can be achieved by importing the audio to audacity (First you will need to use your preferred video editor to export only the audio as an mp3 or wav for example), choosing a part of the audio where the breathing is noticeable and going to effects -> Noise Reduction -> Get Noise Profile then doing CTRL + A and going to effect -> Noise Reduction -> mess around with the settings (in your case use low settings since the breathing is noticeable but not intense) and click "ok". After that export the audio from audacity and replace the audio from the video with the exported one on your preferred video editor :)
yeah dude the breathing is a bit distracting (although natural)
same thing i did back in 2015 x) old times
@@ShienChannel Yeah :) It's an old trick I learned from the Minecraft gameplay times back in 2014 when we all had shit microphones
I didn't really notice his breathing before but now I can't un-hear it. Thanks.
Maybe the breathing is for effect, like "breathless prose".
I worked in the aerospace industry sometime ago.
They had a MultiChip module that contained an ASIC die which they had designed in house, and dies that were basically off the shelf, Ethernet PHY, Flash, MRAM, and oscillator. MCMs have been used there for quite some time it seems. They did it because they had basically no choice.
Computers started out with multi-chip designs. Central processing units weren't a single chip - they were several chips, all doing different functions. It was Intel's big innovation to pull all those functions together and put them onto a single piece of silicon. Not quite all (I/O and memory controllers remained external to the CPU for a long time), but close enough.
The funny thing is that AMD is deliberately going backwards. They were the first to move the memory controller onto the CPU die, which provided a considerable performance boost, and now they've reversed that. You really can't stress too much how important a very high-speed interconnect is when doing such things.
Looking through computing history you will notice they do this a lot.
VLSI integrated functions onto one die lowering cost and improving power efficiency, while benefitting from Dennard scaling making shrinks smaller cheaper faster WITHOUT increasing thermal density.
Now not only does i/o not scale, but cache neither on current processes and the energy density in high performance logic has become a real limitation.
Thus chiplets space out heat generation in many core CPU, while large local caches allow gangs of CPUs to work mostly with local data, while trips to main memory become so proportionately slower that performance sensitive software does all it can to avoid unpredicted memory access.
Not only is a 64c/128t chiplet design cheaper, but such a part would be impossible to market as a profitable product if it were monolithic.
Binning of chiplets with selective disabling of cores, allows for far greater core efficiency or clock speed in premium models because the chiplets can be chosen to suit a SKU.
A monolithic die has to space out cores to reduce logic density wasting expensive wafers and cannot mix n match parts to meet SKU requirements.
What is missed is that AMD chiplets DONT just WORK. They overcome the downsides of familiar MCM and revolutionize computing from the stagnant Intel sinecure it was. They have been incredibly clever.
They had one shot at avoiding oblivion - they needed a single design which cost competitively scaled to ~all markets.
They didn't link existing CPUs, but designed a teamable processor core unit (4 core CCX) from the ground up.
as others here say, the key downside is the power & lag used by inter core data transmissions over any distance, so this was the focus of the CCX design - each of the 4 cores had a direct hardware link to each of the other (adjacent) cores - this needed only 3 such ultra fast intra ccx links on each core.
inter CCX links were lavishly and liberally cached to minimise lag.
Having given Fabric the optimum hardware, they then put disproportionate focus on Fabric.
They initially strategically yielded to Intel on the sexy IOPS at which monolithic ~wins (gaming), & gradually gained a flanking foot hold & then advantage based on raw power and cheap costs of harvesting - even at inferior process initially.
So I have a problem with dismissing it as something Intel etc. can just whip up in the lab & marketing - BS.
a/ it requires a ground up re-design of their entire processor range & discarding a lot of IP
b/ its been over a decade of hard yards - validation, patents, ....
Infinity Fabric is a huge moat for amd - its clear that Intel were caught utterly flat footed by it in 2017
& have mostly just tried to ignore it since- not take it for the serious threat it is.
That's not everything, either. Before Zen 2, most interconnects between separate pieces of silicon were very slow and had huge latencies (100s of nanoseconds). AMD found a way to bring a 512b-wide IF (servers) and < 100 nanoseconds of latency by working with TSMC, which paved the way for others.
the infinity fabric is the secret sauce. having a suitable interconnect for the fabric is what makes this possible for sure.
The irony is that the QPU would work if they dropped their UMA. You could push data from any core to any adjacent one. Supercomputers with thousands of cores work like that, a core can't randomly address any other core, not without calling the "network".
But then you would have the Cell architecture with their PPUs.
Fabric is easy to program I guess, because it's just like a network switch.
The great thing about chiplets is the ability to mix different type of process tech on the same chiplet device. For example, the CPU (which is built on a process tuned for low capacitance) and the DRAM (which is built on a process tuned for high capacitance ...)
MCMs faced two major issues during my time at Intel several years ago. The first was the most obvious: putting multiple chips in the same package meant the heat dissipation of the package now had to be designed for the two chips - ostensibly doubling the required heat dissipation. We handled that by throttling down the chip clocks, a process that has been largely automated by onboard temperature sensors and variable clock generators. IE., if you put two of today's advanced chips in a single package, they will self-throttle downwards.
The second issue is that anytime you have a signal leave or enter a chip, it goes in or out via a pad driver, essentially an amplifier of that signal designed to drive the large capacitance of a PCB trace. This adds delay to the line. A lot of it. Internal trace drivers only have to worry about driving very short lengths of narrow and flat aluminium/copper. Not so with PCB traces. MCM pad drivers could be backed off a bit to compensate for minimal traces in the MCM, but typical MCMs unite chips that would normally be packaged separately, requiring a design change.
PS., like the old movie "support your local gunslinger", keep in mind I am a stupid guy in the process world. The real process guys have to keep a fan on their heads to keep their brains from overheating :-)
It isn't really doubling the heat dissipation though. Taking an existing device at X wattage and splitting it up into different MCM components doesn't raise the wattage of the combined chips by any appreciable amount -- even if we include an active interposer. However, adding *additional* chips (read: cores, features, etc) to a design would *definitely* increase the thermals.
Put another way, the advantage of a fully integrated chip vs MCM isn't about thermals, but moreso that fully integrated chips were a result of continuous process improvements over the course of history. It became *cheaper* to produce an integrated chip rather than go with MCM or COB. The overall performance improvements were a side benefit.
Asianometry does touch upon the latency issues with MCM in the video too -- and it's something that AMD has seemingly solved architecturally.
@@michaellau5329 Didn't follow that at all. Two chips in a package with X and Y heat generation somehow is not X+Y? New math?
@@scottfranco1962 No, X/2 + X/2 = X. However, X + Y is definitely X + Y.
A theoretical chip that is designed in a monolithic fashion versus MCM will consume roughly the same amount of power.
@@scottfranco1962 Or put another way, designing a *device* that has a power dissipation of X watts in a particular package will be the same regardless of the number of chips inside. There is of course the arrangement of the individual chips underneath the heat spreader, but that's something else for another day.
Far as I know, chip manufacturers aren't incorporating heat pipe design concepts into the surface of their chips. There should be plenty of heat performance still available.
Remember Pentium II? Intel had problems placing all that primary cache memory on the processor chip, so they used the multichip approach, with the cache being the added chiplets. The cache ran at half of the processor clock speed anyway.
They returned to a monolithic structure with the next generation, after one year...
IIRC, the (earlier) Pentium Pro had a separate, interconnected cache within the same package. Unfortunately, it was expensive and ran 16bit code poorly, but it really cranked with pure 32 bit code.
Returned? The Pentium II was the first Intel processor that included an L2 cache at all. If you wanted L2 prior to that, it had to be SRAM chips on the motherboard. Or, with the 386, if you wanted any cache at all, it was on the motherboard.
@@TrueThanny It was way more fun back then! Nothing like discovering the newsgroups and getting a JPEG viewer to look at your first digital nudie pics!
YES!!! I love all your videos, but I get especially excited when I see you post a tech one!
IBM developed the first large MCMs in the late 1970s with volume shipments starting in the 1980s. Their first MCMs have 100 and 118 chips on each module. They were originally used in mainframes.
I'm not completely knowledgeable, but every time I come across some technology/architecture I hear "yeah, IBM made a product with that back then and it dipped but then it gained traction" lol
@@AvgAtBes2 Lisa Su (AMD CEO) worked at IBM from 95-07. So technically she is an IBM alumna.
You are missing where the packaging innovation came from. Multi-die packages were driven by memory and image sensors (processors). For example, through-silicon via technology was developed and matured for these applications. I still think they are the leading edge for this topic. A modern ISP is an amazing piece of technology, with wafer-level direct bonding and an image sensor chip thinned to just 10 µm so it can be illuminated from the backside. Memory makers are producing stacks with up to 16 chips.
This was so successful at the time, now Apple and Intel do it too with similar techs. good topic!
Great video. I didn't really understand why chips on a bigger fab process are more expensive, but once you talked about using smaller chips to 'approximate' to the size of a bigger one, it made sense (my 'aha!' moment) - bigger fabs use more space, so there's less space to make more of them = if it has errors, it is a waste and now you've wasted all that space. If it had been made smaller, then at the least the cost for a bad chip is less. Thanks!
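To put rough numbers on that 'aha' moment, here is a minimal sketch in Python. It assumes the simple Poisson yield model (yield = exp(-die area x defect density)); the defect density, die areas and cost unit below are made-up illustrative values, not figures from the video or from any foundry.

```python
import math

def poisson_yield(die_area_cm2, defects_per_cm2):
    """Fraction of dies expected to have zero defects (simple Poisson model)."""
    return math.exp(-die_area_cm2 * defects_per_cm2)

def silicon_cost_per_good_die(die_area_cm2, defects_per_cm2, cost_per_cm2=1.0):
    """Silicon spent per working die: area paid for, divided by the yield."""
    return die_area_cm2 * cost_per_cm2 / poisson_yield(die_area_cm2, defects_per_cm2)

# Hypothetical inputs: one 6 cm^2 monolithic die vs. eight 0.75 cm^2 chiplets
# (same total silicon), on a process with an assumed 0.2 defects per cm^2.
D0 = 0.2
monolithic_cost = silicon_cost_per_good_die(6.0, D0)
chiplet_cost = 8 * silicon_cost_per_good_die(0.75, D0)

print(f"monolithic: {monolithic_cost:.2f} cost units per good product")
print(f"8 chiplets: {chiplet_cost:.2f} cost units per good product")
```

The same total area comes out much cheaper when split up, because one defect only kills a small chiplet instead of the whole big die. Packaging and interconnect costs are ignored here.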
A funny thing about chiplets is that they're helping overall, but they are so good that they leave no silicon for the low end. The low availability of the 3300X and the lack of a 5300X are because AMD is getting little to no chiplets with fewer than 6 working cores.
This is the fault of the way AMD approached Zen 2 and 3 and nothing inherent with chiplets.
Something as simple as a 4 or 6 core part doesn't require chiplets and if AMD does this for Zen 4 I think Intel will eat AMD for lunch at the low end.
If I were designing AMD's Zen 4 Ryzen based CPUs and APUs, I would use 5 chips in all. The first would be a 6 core monolithic die, with no graphics, that allows a tile based interface to add a graphics chiplet. I would create 2 graphics chiplets for different power levels. I can now use this single chip for 4 - 6 core CPUs and APUs. That's a lot of versatility still. I'd then make another chiplet that has 8 cores, the IO functionality and the ability to add graphics just like the 6 core chip, and on another side be able to add another 8 core chiplet, with no IO functionality, so it's simply 8 more cores.
AMD did the whole IO die thing with GloFo because they were still under contract to GloFo for die.
I don't for a single second believe Intel is going to move to tiling for every single part because it just doesn't make sense to add complexity when the die isn't very big.
So this is where AMD failed, at low cost items, and they shouldn't have to rely on having enough bad monolithic APU die to make lower core count parts. That's just bad. And, using an 8 core chiplet to make a 4 core part is BAD. It's a waste of 50% of the die.
@@johndoh5182 That's a lot of tooling though; it isn't cost effective to have that many different designs. It's also a waste of capacity since you have so many SKUs hogging precious laser space.
AMD has 3 designs: CPU,GPU,IO
Intel has maybe 1 or 2 designs: APU
Every Intel SKU is a binned-down 12900K or is a 12900K
@@johndoh5182 You seem to have the whole chiplet idea backwards. Remember when AMD was on the brink of collapse: high-end desktop CPUs had about 4 cores, with maybe a halo product at 8 cores. With chiplets AMD raised the bar; nowadays the most common gaming machine has 6 cores. AMD is now becoming Intel since no competition is in sight (thankfully we got ADL, but that's not enough). Instead of asking AMD to downgrade their effort, I would ask Intel to up their game, so we can raise the bar again, the midrange can have more cores (like 12), and hopefully 4-6 cores becomes the low end. That's the beauty of competition, although I don't see it coming anytime soon since prices are slowly creeping up.
Suffering From Success™ 🗿🗿🗿🗿
@@jakejakedowntwo6613
No?
AMD makes 1 die for APU, CCD, IO each and a bunch of dies for discrete GPUs
6 core CCD is the only new one that AMD would need to make
And Intel makes more than 1 die for Alder Lake
Another well presented and very informative video. Thank you.
The original Intel P6 (1995) had 2 dies, a CPU plus a wire connected cache. That's the processor I started my long verification career on. Fun times!
was the 386 for me
Thanks for the overview. I have a high end threadripper CPU, nice to know how it got manufactured. First class video as usual.
I'm surprised you didn't mention Intel tried this a few times. They did it with the Pentium 4 to get dual-cores (really dual-dies) to market. Then again with the QX6700 (2006) and Q6600's (2007) which contained two dual-core chiplets to deliver quad-core processors and they used the MCM naming for those too. There were also XEON equivalents at the time.
In Intel's case, AMD got to 4 cores with a single die and Intel was a bit behind, so they had to use an MCM to catch up; then they went back to a single-die design. Now it has switched around, and chiplets are being used to get higher core counts at a reduced cost, like you said in the video.
Intel double cheeseburgers....
Those were the days.
Intel is so behind AMD and Apple on chiplet technology that it’s not even funny. AMD 5800X3D has a stacked 64MB of L3 vertical cache at $450! Intel’s Foveros EMIB technology announced in 2019 is still nonexistent!
Intel started doing it in 1997 with Pentium II
@@tringuyen7519 more like TSMC, not AMD.
@@dogman2387 Intel started doing it in 1981 with the iAPX 432
Hey, could you do a video on BYD, their blade batteries and their semiconductor division?
Quick tip: Xilinx is pronounced like, "zai links."
How do you know?
@@deusexaethera uh maybe chinese pronounce "Xi" as "Zai"
Ssst... don't tell that secret! Nobody else knows how to pronounce it other than Ksee Links or Shi Links. It's one of those PR failures.
no, it's pronounced "gif links"
Alternatively, it's pronounced "Ayy Em Dee" ;p
*voltage, frequency & chiplet efficiency*
Multi-Chip has some very interesting EFFICIENCY options available. Since POWER DRAW does not scale linearly, and is often the limit (especially in laptops) then adding MORE chiplets at lower performance per chiplet can result in better performance. If you plan to run at a max of, say, 1500MHz you can also optimize around that as a max frequency further saving power somewhat. So there's a lot more flexibility compared to a traditional monolithic design... and not being on the most cutting edge NODE might not matter so much if you have this flexibility. Maybe put those cost savings into adding more chiplets on less efficient nodes? There's definitely going to be a BALANCE of everything that results in the best performance per dollar to make and that's going to vary depending on the product, current node costs etc.
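The "more chiplets at lower clocks" point can be sketched with the usual dynamic power relation P ~ C * V^2 * f. This is only an illustration in Python: the linear voltage/frequency curve and all of its numbers are assumptions, and the workload is assumed to scale perfectly across chiplets, which real software often does not.

```python
def core_voltage(freq_ghz, v_base=0.6, v_per_ghz=0.2):
    """Assumed linear voltage/frequency relation (hypothetical numbers)."""
    return v_base + v_per_ghz * freq_ghz

def dynamic_power(freq_ghz, n_chiplets=1, capacitance=1.0):
    """Relative dynamic power in arbitrary units: n * C * V^2 * f."""
    v = core_voltage(freq_ghz)
    return n_chiplets * capacitance * v * v * freq_ghz

def throughput(freq_ghz, n_chiplets=1):
    """Idealized throughput: assumes perfect scaling across chiplets."""
    return n_chiplets * freq_ghz

# Same idealized throughput, two ways of getting there.
one_fast = dynamic_power(3.0, n_chiplets=1)
two_slow = dynamic_power(1.5, n_chiplets=2)

print(f"1 chiplet  @ 3.0 GHz: throughput {throughput(3.0, 1):.1f}, power {one_fast:.2f}")
print(f"2 chiplets @ 1.5 GHz: throughput {throughput(1.5, 2):.1f}, power {two_slow:.2f}")
```

Because voltage drops along with frequency, the two slower chiplets come out well below the single fast one in power for the same nominal work. Leakage and interconnect power are left out of this sketch.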
Great topic to cover, especially given the rise of importance of substrates and interposers and inter-die interconnects
How about a video about the Inmos Transputer? Loving what you do, nice 😉
I am wondering how chiplets are fundamentally different from transputers.
@@vulpo the Wikipedia article on the Transputer is quite good.
Can you do a video on yourself...how you generate so many high quality videos in such fast iterations? Truly amazing! I seriously would be interested how you do this!
Yeah, like "Why Asianometry videos work".
Cloning technology.
You don't want too much of a paper trail after the Chinese invade... :-)
He explains his process in this audio podcast episode: compoundingpodcast.com/ep24/ - worth your while!
@@scottfranco1962 one minute they are the world's leading chip producer, with military and bleeding edge computing resting on their back, and the next minute they are a highly armed guerilla fighting for their existence...
It's crazy how close both versions seem to be.
Just a friendly heads up about your audio; your VO seems to have a bit of ringing to it. I noticed it around the 8:30 mark. This might be your mic arm resonating in which case it should be tightened. If the mic is desk mounted, put it on something soft so the desk can't resonate. Lastly, it might be feedback if you're listening to yourself with speakers but I'm doubtful of that last one.
Thanks.
Back in the days, mainframe CPUs were basically a bunch of chiplets in a single package and a heat spreader on top. "CPU Galaxy" is a good channel to see some of these old beasts.
Credit needs to go to software developers for finding ways to parallelize tasks. All the chiplets in the world won’t work if the software isn’t designed to run on multiple cores. You young folks probably won’t remember the days before multithreaded software, but trust me when I say it sucked. Programs would freeze as they completed a task, which in the days of slow CPUs and mechanical drives was quite often.
Nothing has changed. Most true multicore/threaded designs are one-off. Think supercomputer clusters. True multithread/multicore software engineers are hard to find. The average software engineer is scared to death of multithreaded designs, and it would not be a stretch to imagine that most multithread apps have latent bugs.
@@scottfranco1962 The more services the OS provides, the more those can be multi-threaded. Even the natural layers allows for it, e.g. GUI in a separate thread from the app.
@@scottfranco1962 Nowadays almost all apps are multithreaded. Go/Julia languages offer multithreading out of the box.
@@scottfranco1962 its dumb easy to do with Go give it a spin! 👍
That is Windows thinking. Devtools for multithreading have existed for 30-plus years on real OSes like Unix. Even today Win10/11 can't handle over 16 threads on a single process. Most Windows apps start to taper off after 4 cores. Apple solved this almost 10 years ago with Grand Central Dispatch. Microsoft should do the same, and if Microsoft wants to make an OS that actually works they should use a Linux kernel and put their brilliant desktop system over it. That would also solve the multithreaded problem in Windows if MSFT wants to do it.
The Ryzen chiplet design and the huge investment needed to make it work were a very risky move by AMD and almost an all-in. With Bulldozer selling extremely poorly and no new CPU products on the market for about 5 years, I'm not sure if the AMD CPU division would've survived if Ryzen had been a failure. My guess is that it was the 2 major consoles using Bulldozer and GCN 1 keeping AMD alive. Of course we can thank the silicon gods that it worked out, or we'd be paying a pretty penny for Intel 4-core CPUs in 14 nm to this day :D
**Edit**: explained by a reply to my comment.
10:03 - "[...] that handles analog functions like USB and SATA [...]". Neither of those are analog, and I'm not aware of any homophone that would make that statement make sense. Did you perhaps mean something like "auxiliary"?
In chip fabs, chip design is split into 3 general categories: logic, SRAM, and analog. Analog (also known as I/O) is everything that touches outside the chip.
@@BusAlexey Aah, okay. Thank you. Do you know if that is a legacy from when a lot of the external I/O *would* have been analog?
@@Quxxy To the chip's circuit that's running at multiple gigahertz, any I/O is "analog", in the sense that the signal doesn't arrive in a clear, well-defined form for the chip's circuit. There are special circuits in the chip that translate those "analog" voltages to a more manageable form inside the chip, and they are what's known as analog circuits.
Around 20 years ago we saw the end of the MHz race between CPUs. Slower chips doing things smarter and more efficiently substituted for faster clock speeds.
Chips have been mired in the same 'bloatware' that software finds itself.... we don't program smarter, we just throw more lines of code and billions more transistors at a problem.
I believe that the 7 nm 'node' is over-extended, and just like with clock speeds, we will take a step back towards more realistic geometries.
AMD has taken an interesting tack towards simplification. A database server does not need ANY graphics functionality, why include it? Design a couple of chiplets that implement database functionality in hardware, omit the graphics engine and glue them together with a CPU. Fewer transistors, easier geometries and better performance.
Scaling up the easiest way makes sense, until it stops making sense, and then scaling up the next-easiest way makes sense. Repeat until full theoretical capability is achieved.
That's one of the problems with x86 (and ARM to some extent). About 90% of all instructions a program executes are jump, compare, add, subtract, load, store, and x86 has ~15 000 instructions, which makes decoding A LOT more difficult and requires more bits to encode an instruction, eating away silicon space for more useful things such as more pipelines or a better branch predictor. That's one of the reasons why RISC-V is appealing, since it wants to keep things as simple as possible, which also allows for higher code density.
Database acceleration has been done by Oracle with their SPARC cpus
Sure, what is the clock speed of the human brain? Probably less than 1 kHz. Yet all of the computers in the world cannot equal one. Parallel processing in 3D solids, baby!
Oh, yea, and all that for the cost of a beer or so...
Xilinx (now AMD) used a multidie approach for ultrascale too. Some designs that ran fine on V7 have problems meeting timing at the die crossing interface in ultrascale.
Can you do a video on camera image sensors?
@TacticalMoonstone I like to know how similar or different the process technology is between sensor manufacturing vs processors. I have a feeling that image sensors use older equipment/process. Also, how far behind is full frame / large sensor semiconductors compared to latest cell phone cameras. If you would scale up for example, iPhone main sensor to larger size, how much would it cost? TowerJazz, Samsung, Sony, sigma…I’m also curious about those.
Remember the jump in performance that came with the Pentium II Xeon / Pro? It seemed like a big jump in performance. That used a primitive version of chiplet based L2 cache memory if I recall correctly; This is why it had a weird rectangular shape and case. I think big CPU cache is the way to go, generally.
Pentium 2 and 3.. It plugged into a socket like the old agp slot
The Threadripper series is simply unbeatable in the prosumer market!
All it takes to beat them is an Arm... and a leg.... :0
SOC is the opposite of what amd is using. They are doing SIP system in a package, it’s been in use for decades. The next evolution of that is stacked die, where you use through silicon vias to directly connect die without bond wires and footprint increases. SOC is putting multiple die on the same wafer making large die with sub die only connected by the top metal layers or bond wires in the package. That was a large expense where board space was the primary concern.
The other huge advantage to this approach is you don’t need as many different devices on a single die, reducing mask layers for each and not having to share thermal budgets at each layer. This reduces cost and complexity. The reduction in base and interconnect layers also greatly improves cost of scrap. Your net yield increases slightly, but anything you scrap at end of line has many fewer masks, so it costs less. These advantages are shared by traditional SIP, but stacked die is a way to take Moore's law into 3D and continue it beyond atomic size constraints.
The biggest roadblock issue for MCMs was cache memory latency bottlenecks. Consequently, integration of memory on die for CPUs, GPUs and SoCs would be a priority from the mid-90's onward, while on the server side, massively parallel blade architecture solved the problems temporarily, but by ~2010, latency was again an issue. Chiplets solve this another way: splitting up the processors into manageable increments.
10:00 "Analog functions like USB and SATA"
Wtf? Can someone explain how USB is analog?
It's a bit of hair splitting but technically USB sends digital communication over analog differential pairs. The usb controller interface he mentions translates the external analog electrical signal into internal digital data.
PAM 4 in PCIe 6 adds a more clear analog touch, with the communication signal voltage being able to represent 00, 01, 10 and 11 instead of just 0 or 1.
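A toy illustration of the PAM4 point, in Python: each symbol carries two bits as one of four levels, so the same bit stream needs half as many symbols as plain two-level signalling. The function names are made up for this sketch, and the plain binary bit-to-level mapping is an assumption for readability; real links typically use Gray coding plus further encoding and error correction.

```python
def nrz_encode(bits):
    """Two-level signalling: one symbol (level 0 or 1) per bit."""
    return list(bits)

def pam4_encode(bits):
    """Four-level signalling: each pair of bits becomes one level 0..3.

    Plain binary mapping (00->0, 01->1, 10->2, 11->3), chosen for
    illustration only.
    """
    if len(bits) % 2 != 0:
        raise ValueError("PAM4 needs an even number of bits")
    return [bits[i] * 2 + bits[i + 1] for i in range(0, len(bits), 2)]

def pam4_decode(symbols):
    """Inverse of pam4_encode: split each level back into two bits."""
    bits = []
    for level in symbols:
        bits.extend([level // 2, level % 2])
    return bits

data = [0, 0, 0, 1, 1, 0, 1, 1]
symbols = pam4_encode(data)
print(len(nrz_encode(data)), "NRZ symbols vs", len(symbols), "PAM4 symbols")
```

The price is that the receiver now has to tell four voltage levels apart instead of two, which is where the "more analog" feel comes from.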
What about die stacking?
Why not separate the system-on-a-chip into multiple processors like it used to be? Have a separate video card and a neural AI calculator. What does "diffused" in USA mean on the AMD processor?
I wonder whether I saw this approach with the first Intel quad cores. As far as I remember, these were two 2-core chips "glued together" on a single package.
Intel already proved that chiplets work back in the P4D, C2 era, except AMD shamed Intel for gluing and went as far as calling it a fake dual core and later a fake quad core.
@@AlfaPro1337 and then Intel did the same thing to AMD after the success of ryzen. capitalism is wonderful
Intel's first "dual-core" chip, the Pentium D, was two single-core chips placed on the same package. There were zero interconnects. When one chip needed to communicate with the other, it had to do so via the FSB, so it was actually SMP on a single package, not a dual-core chip.
Their first "quad-core" was similar. It was a pair of dual-core chips on the same package, with the same lack of interconnects.
@@AlfaPro1337 AMD was right. The Pentium D was SMP on a single package, and the Core 2 Quad was dual-core SMP on the same package. AMD had the first actual dual-core processor, and also the first actual quad-core processor. Many workloads showed the consequences of this, too.
@@TrueThanny There were 0 interconnects, because the I/O die is basically on the North Bridge?
Plus, I'm guessing Intel opted for a ring bus-like, thus, the main core die communicate with the NB and fills up, until it hits a certain workload that it needs to shift some work to the second core die.
It's still a dual-core in a sense--though, not a single package, but hey, AMD was being a dick in the 1st place.
Heck, even the early consumer 1st gen Core i series are basically 2 die, but, one is the core and the other is the I/O die.
Technically, Intel has been gluing for years.
But how are those chiplet interconnects actually made physically? Bond wires? Solder balls?
TSMC calls it TSV if you want to know more.
@@mawkzin Thank you! But TSV seems to be "through-silicon vias", i.e. to connect different layers to each other, no?
But searching for TSV led me to pages that mention "microbumps", which I believe is how the individual chiplets are bonded to the base material. It seems that microbumps are µm-level solder balls.
Now I'd love to know even more about this part of IC packaging! How are those microbumps manufactured? How are they placed so precisely? Are they heat-soldered?
Maybe we can go back to having a north bridge on the motherboard in typical PCs? It would be a small step.
This also reminds me of how 80286 and 80386 CPUs needed an optional “math co-processor” to be installed in the motherboard in order to perform some floating point calculations.
A North bridge would come with massive latency costs, tho. They didn't integrate the North bridge just because it was cheaper, they also did it because having an imc massively increases performance
We wouldn't actually want that. You're just adding more lengths of wire and logic between a device and the CPU, which becomes a bottleneck and adds latency. Everything either needs access to the CPU, RAM or both. The chipset that still exists on desktop systems still has the same problem. AMD's Epyc based systems don't even have a chipset because there is room in the IO die for all those auxiliary functions. It's probably key to understand that the Northbridge is mainly only really different from the modern chipset (and southbridge) in that it housed the memory controller. That specifically makes way more sense to be on the CPU package. The northbridge was removed in favor of the Integrated Memory Controller. It would be a step backwards.
@@blkspade23 Okay so not worth it, I guess Ryzen already does the logical thing and puts the IMC in the I/O chiplet. It maxes out at moderate memory speeds but it makes sense from a value perspective.
@@blkspade23 by the way, thank you for your lengthy explanation/reply. I appreciate that you shared your knowledge.
The Brilliance of AMD is that they have - essentially - only a single 8-core CPU. By connecting many of them with different fabric chips and/or disabling broken ones by lasering them out of the circuits, they can build a chip anywhere in size from 2 CPUs to 128 CPUs - all with only 1 GPU, roughly. So AMD really makes only ONE CPU and it has 8 cores, period.
Asianometry should be a guest on MLID’s Broken Silicon podcast.
Thanks. I was expecting more information about the packaging technology and process, that’s where the real innovation is as far as I understood. Is that covered by secrecy?
Large chips also have internal lag issues. Even at the speed of light, a signal that needs to move 2 centimeters always takes twice as long as one that needs to travel 1 cm.
That’s like an hour longer
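Jokes aside, the distance argument can be checked with a quick back-of-the-envelope sketch in Python. The velocity factor of 0.5 and the 5 GHz clock are assumptions picked for illustration; real on-chip wires are usually limited by resistance and capacitance and are slower than this idealized figure.

```python
SPEED_OF_LIGHT_CM_PER_NS = 29.9792458  # speed of light in vacuum

def wire_delay_ps(distance_cm, velocity_factor=0.5):
    """Idealized flight time in picoseconds.

    velocity_factor is an assumed fraction of c; it ignores RC effects.
    """
    speed_cm_per_ns = SPEED_OF_LIGHT_CM_PER_NS * velocity_factor
    return distance_cm / speed_cm_per_ns * 1000.0

def clock_period_ps(freq_ghz):
    """Length of one clock cycle in picoseconds."""
    return 1000.0 / freq_ghz

period = clock_period_ps(5.0)  # assumed 5 GHz clock
for distance in (1.0, 2.0):
    delay = wire_delay_ps(distance)
    print(f"{distance:.0f} cm: {delay:.1f} ps, {delay / period:.0%} of a 5 GHz cycle")
```

Even in this optimistic model, crossing 2 cm eats a large share of one clock cycle, so doubling the distance is not free.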
The older server CPUs were split up into separate dies on the same substrate based on functionality and the needs of the customer. There were different variations for them as needed for server processing. If it needed to do more calculations, it would need more CPU cores on the same substrate. Back in the 90's Intel had separate cache dies in their Pentium 2 line because they found too many failures manufacturing CPU dies with cache at the time. The cache itself was the reason for most of the manufacturing failures. Instead of wasting a whole CPU chip just because the cache was faulty, they decided to manufacture the cache separately. It does not cost more to have more instructions on the same die size. It costs more because of the higher percentage of failed chips manufactured, i.e. bad chips = no money. It may cost more for a finer lithography process on the same die size because of the expense of the equipment and development costs, and also the risk of higher failure rates. It may cost even more if the design of the chip requires more layers of processing. This is the idea behind AMD's "chiplets" in their new high-end processors on the 7nm process. Very few chip manufacturers use a 7nm process at this time, and it seems to be getting better but is still risky in terms of higher failure rates. I'm sure the failure rates are getting lower, but by splitting the dies into "chiplets" the manufacturers can create fewer wasted failures and earn more profit.
LISA SU WAS MIT SUPERSTAR IN COMPUTERS AS STUDENT. GUESS NOW DR. LISA SU.....
4:00 This is not what Moore said, because chiplets and tiles are in the same package, and he was talking about separately packaged parts. Don't give credit where it isn't deserved.
great coverage as always. I wonder if apple will ditch the monolithic dies for the pro lineup going forwards, they really are pushing that monolithic design SOC with their on silicon interconnect
I don't think it is possible for the A series chips, as they need to be small for iDevices. Possible for future M series. The current M1 Ultra is 2 chips in the same die with RAM (Unified Memory) as chiplets on the side.
@@Theoryofcatsndogs not really, the M1 ultra is really one monolithic die, since the interconnect is on silicon. they just cut off one of the dies if it fails which is how they preserve yield
Nice vid !!!!
As I understand this, using ‘chiplets’ means breaking up a single chip into several components. This means that each chiplet would need extra circuitry to concentrate the signals into channels for exporting and importing data from the other chiplets. This extra channel circuitry would not be needed in a single integrated chip. That makes the integrated chip inherently more efficient than a bunch of interconnected chiplets.
Yup, had the technology allowed them to keep shrinking the process node in Integrated chips, they would've done so
Except for that it's not because it doesn't work. So sure.
More efficient until you reach a size limit. More efficient except more prone to manufacturing defects. More efficient except more expensive. In an ideal world it wouldn't be necessary, but it is necessary for various reasons.
What you described is how Moore's Law basically allowed Intel to keep squeezing performance out of its chips over the past decades. Problem is: Moore's Law is dead. We are pretty much at the limit of how much we can shrink transistor dimensions. You can't just shrink them further and pack more of them in an integrated circuit. The failure rate is significantly higher. This is why Intel's chips and semiconductor roadmap got fucked because of the consistently poor yields they were getting.
Great video! I love all of them: thanks for the great work you do!
When you refer to the USB and SATA connections to the AMD Ryzen I/O chip as "analog", what about them is analog? Is it the transceiver (not sure if this is the correct term in this context) that converts the external digital USB and SATA 5v signals to the 1.2-1.5v (I believe) signals that the I/O chip uses internally?
Nope, when you put the data rate of a digital signal high enough, it's not sharp edges and neat transitions on clock edges. You 100% have to treat them as analog signals when they come in from off-chip. On chip you can control levels and delays to make things near-enough digital ideals to treat them as crisp signals.
@@thewheelieguy Oh yeah, that totally makes sense. At the signal level everything is analog, but we (software and systems engineers) can usually ignore that. Thanks for the explanation!
Intel 2 years ago, "Well, if you wanna glue 2 chips together...."
Agnus, Denise, and Paula would like to discuss the Amiga home computer.
@Asianometry If you could find time to make a video on the next generation of interconnects, both data and power, to deal with multiple chips on a substrate, the companies who are doing it, and whether Intel can give competition to TSMC and Samsung with the next-generation foundry, that would be awesome.
When I look for information they are all fairy tales. Could you incorporate what kind of materials would be used for interconnect, interposers and why they are challenging.
It reminds me of Intel 2 or 3 "cartridge" processors. The reason was similar - yield issues from trying to put too much on a single chip.
Mhm why isn't L3 a separate unit or unit group integrated with the memory controller?
Also why would MCM be a curse word?
Good review.
12:03 - "Pana-see-ya" bud.
4:25 Dr Lisa Su is wearing a 1.5ct diamond.
no mention of IBM mainframe MCMs
He specifically showed the big perforated copper blocks with chips embedded and called them MCMs. I don't recall if he mentioned the letters I, B, and M but they're in there.
I work in embedded , no such issues there :) You would have to be mad to go chiplets there. I see the point for high end designs if you want to have some scalability and lower cost with some performance drawbacks.
Isn't going chiplet when you have more than 1 MCU in a PCB, lol.
Why do you say that the first chiplet product is Epyc when AMD released desktop Ryzen parts earlier than server parts?
Because they only had one chip. First-gen Zen was monolithic. Everything was on the one die, which had two memory controllers. The consumer platform got one chip, giving it up to eight cores and two memory channels. EPYC got four chips, which is up to 32 cores and eight memory channels. Later, Threadripper got two chips, for up to 16 cores and four memory channels.
With Zen 2, all the I/O, including memory channels, moved into a separate die. That allowed an arbitrary number of cores for any platform, only really limited by space on the package.
Some people here in the comments confuse chiplets with MCM. They're not the same.
In an MCM design, a single chip can work by itself.
In a chiplet design, a single chip can't work by itself without an additional chip. In AMD's case, one chip (CCX) contains the CPU cores and another chip (the IO die) contains the IO interfaces and connections. Neither works without the other.
So a 386 motherboard or an IBM mainframe from 1965 is not based on chiplets.
One thing is missing from this video: chiplets improve not just yields but the binning. Smaller chips can reach higher clockspeeds.
lol at 1:10 the person in the clean room on top the machine has a fall protection harness. However, the distance they have to fall is so short, it won't activate before they hit something.
The harness isn't there to protect the worker from the machine; it's there to protect the machine from the worker.
I recommend checking caly-technologies die-yield-calculator.
I think you gave numbers for small chips on a mature process here (8:18). With new and unrefined processes, big chips like Epyc would have a yield of 1-5 chips per 300mm wafer, which costs ~$10,000 each (yield in the 10% range). That would make the CPU cost way above $500,000 each to pay for every step of the process.
The same wafer and same config with chiplets gives 840 working chiplets, and maybe 10 halves, which can be configured into 850 working Ryzens, 400 32-core Threadrippers or 200/100 Epycs (assuming perfect interconnecting; you can drop 5% if you want to be realistic). IO dies on an older process are easy to get and WAY cheaper.
As TSMC and others are always at full load and companies fight for every wafer, AMD can make 10x as many chiplet-based CPUs as monolithic ones with the number of wafers they get from TSMC.
AMD is already above what would be possible without splitting into smaller pieces, and if Intel had had chiplets during their 10 nm issues, they would have had good yields a few years sooner.
I know it's extra complexity, extra time and an extra problem layer, but we're already past the point where monolithic is physically possible, and every manufacturer works around it in their own way.
NVIDIA absorbs any defects by disabling 1 or 2 out of 70 cores completely and just saying the product has fewer, so that it still has working products.
I like AMD's way, as divide and conquer is what IT has found to be the best solution for a wide variety of problems....
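For anyone who wants to play with the yield argument above, here is a minimal sketch using the textbook Poisson yield model and a common dies-per-wafer approximation. This is not necessarily what the calculator mentioned above uses, and the die areas and defect density below are made-up illustrative values, not AMD or TSMC figures:

```python
import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    """Common approximation for gross dies on a round wafer.
    Ignores scribe lines, edge exclusion and die aspect ratio."""
    radius = wafer_diameter_mm / 2
    gross = (math.pi * radius ** 2 / die_area_mm2
             - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))
    return int(gross)

def poisson_yield(die_area_mm2, defects_per_cm2):
    """Poisson model: probability that a die has zero defects."""
    die_area_cm2 = die_area_mm2 / 100.0
    return math.exp(-die_area_cm2 * defects_per_cm2)

def good_dies(wafer_diameter_mm, die_area_mm2, defects_per_cm2):
    """Expected number of defect-free dies per wafer."""
    return (dies_per_wafer(wafer_diameter_mm, die_area_mm2)
            * poisson_yield(die_area_mm2, defects_per_cm2))

# Hypothetical inputs: an immature process, one big die vs. small chiplets.
DEFECTS_PER_CM2 = 0.3   # assumed, for illustration only
MONOLITHIC_MM2 = 700    # assumed big server die
CHIPLET_MM2 = 75        # assumed small core chiplet

for label, area in [("monolithic", MONOLITHIC_MM2), ("chiplet", CHIPLET_MM2)]:
    print(f"{label}: {dies_per_wafer(300, area)} gross dies, "
          f"yield {poisson_yield(area, DEFECTS_PER_CM2):.1%}, "
          f"~{good_dies(300, area, DEFECTS_PER_CM2):.0f} good dies per wafer")
```

The model is crude (real fabs use more elaborate yield models, and partially defective dies can be sold as cut-down parts), but it shows why yield drops off so quickly as die area grows.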
And AMD's latest chiplet innovation is 3D V-Cache, which stacks additional cache on top of the base chiplet.
Man AMD should just stop gluing together all these parts. Give the competition time to copy their designs first... ohh wait, Intel just patented the AMD Ryzen architecture. Never mind.
@@abaj006 what? What do you mean?
@Anar he is spreading fake news that he didn't understand or read fully
@@drbali ohh, i was very confused at first xD
@@anarsosoroo2891 He, humorously, said: "Intel used to dunk on AMD by calling chiplets 'glued together cores' but now Intel recently published a specification for a, supposedly, industry wide approach to a cross-vendor way to mix and match different accelerator chiplets into a coherent package. It 'incidentally' happens to be perfectly compatible with AMD's Infinity Fabric".
But @J Bali is also right, as AMD (and a few specific AMD employees) is actually mentioned in the white paper. So AMD was in on it right from the inception. Hence Intel didn't just "patent Ryzen".
Thank you for telling me this.
You didn't mention anything about how packaging technology suddenly enables the integration of multi-chiplet modules without sacrificing performance. And what's the difference between an SiP and a chiplet structure?
I don't see Chiclets anywhere these days. Are they banned? I miss them.
Why haven't we seen any chiplets in smartphones instead of SoCs yet? Are they really that power hungry, or is there another issue?
You're likely going to see it moving forward more in mobile gaming. For example, the chip being used in the Steam Deck is a 4-core APU derived from AMD's Zen 2 micro-architecture paired with 8 RDNA 2 compute units.
As for phones, battery technology probably hasn't caught up enough for phone OEMs to look beyond the current SoCs they're using.
12:03 pana key not pana cia
Thanks for this
In 1983, I was at Intel and we were bonding together multiple 16K DRAMs in a package for IBM.
Voice synth needs a bit of a touch up. 🙂
Also the first dual core Pentium4 and Q6600 were 2 dies glued together :)
Mobile phone sales per year are 1.2 - 1.3 BILLION, not "in the millions each year".
Your link to the Podcast is broken.
no it works
I remember being obsessed with HyperTransport technology. What always bothered me was this idea that theoretically we can unite all the best solutions from competitors in order to make one "ultimate" product. At the same time, competition (or at least the threat of it) was the major reason why one company developed its "killer feature" to stay afloat in the market or to outcompete its rivals. On the other hand, there is no such thing as the ultimate product with all the cutting-edge technology in the real world, since every product is just a means of fulfilling a certain need or completing a certain task. For each need or task the best solution might not require the best of the best or "ultimate" tech in every regard.
It kinda happens already. AMD and Intel engineers can practically throw forks at each other during lunch...
@@scottfranco1962 My dilemma stems from the assumption that technological advancement comes either from competition or from some extreme external limitation. It seems logical to me that in the future we will compete directly in technology (of every process) rather than in products. I might simply be wrong, but isn't it the limitations from the weak points of your product that make you develop new technologies to overcompensate for your weaknesses?
@@AlexanderSylchuk Oh, you are begging for my favorite story. I interviewed with a company that made precision flow valves. These were mechanical nightmares of high precision that accurately measured things like gas flow in chemical processes. This is like half the chemical industry (did you know a lot of chemical processes use natural gas as their feed stock?). Anyways, what has that got to do with this poor programmer? Well, like most industries they were computerizing. They had a new product that used a "bang bang" valve run by a microprocessor. A bang bang valve is a short piston driven by a solenoid. When the solenoid is not energized, the piston is retracted by a spring, which opens an intake port and lets a small amount of gas into a chamber. Then the solenoid energizes and pushes the piston up and the gas out another port. Each time the solenoid activates, a small amount of gas is moved along. Hence the "bang bang" part. If you want to find one in your house, look at your refrigerator. It's how the Freon compressor in it works.
Ok, well, that amount of gas is not very accurately measured no matter how carefully you machine the mechanism. But, it turns out to be "self accurate", that is, whatever the amount of gas IS that is moved, it is always the same. The company, which had got quite rich selling their precision valves, figured they could produce a much cheaper unit that used the bang bang valve. So they ginned it up, put a compensation table in it so the microprocessor could convert gas flows to bang bang counts, and voila! ici la produit! It worked. Time to present it to the CEO! The CEO asks the engineers "just how accurate is it?" Engineer says:
well... actually it is more accurate than our precision valves. And for far cheaper.
The story as told to me didn't include just how many drinks the CEO needed that night.
So the CEO, realizing that he had seen the future, immediately set into motion a plan to obsolete their old, expensive units and make the newer, more accurate and cheaper computerized gas flow valves.
Ha ha, just kidding. He told the engineers to program the damn thing to be less accurate so that it wouldn't touch their existing business.
Now they didn't hire me. Actually long story, they gave me a personality test that started with something like "did you love your mother", I told them exactly where, in what direction, and how much force they could use to put their test and walked out.
I didn't follow up on what happened, mainly because I find gas flow mechanics to be slightly less interesting than processing tax returns. But I think if I went back there, I would have found a smoking hole where the company used to be.
And that is the (very much overly long) answer to your well meaning response.
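For the curious, the "compensation table" idea from the story can be sketched roughly like this. The calibration numbers, units and function name are all invented for illustration; the story doesn't say how the real product did it. The idea is just: measure the actual flow at a few stroke rates, then interpolate to find the stroke rate for a requested flow.

```python
from bisect import bisect_left

# Hypothetical calibration points: (strokes per second, measured flow in mL/s).
# Each stroke moves a repeatable but not precisely machined volume, so the
# table records what the valve actually delivers at each rate.
CALIBRATION = [(1, 0.9), (5, 4.8), (10, 9.9), (20, 20.4)]

def strokes_for_flow(target_flow):
    """Linearly interpolate the stroke rate needed for a target flow."""
    flows = [flow for _, flow in CALIBRATION]
    if not flows[0] <= target_flow <= flows[-1]:
        raise ValueError("target flow outside calibrated range")
    i = bisect_left(flows, target_flow)
    if flows[i] == target_flow:
        # Exact hit on a calibration point.
        return float(CALIBRATION[i][0])
    (rate_lo, flow_lo), (rate_hi, flow_hi) = CALIBRATION[i - 1], CALIBRATION[i]
    fraction = (target_flow - flow_lo) / (flow_hi - flow_lo)
    return rate_lo + (rate_hi - rate_lo) * fraction

print(strokes_for_flow(7.35))  # stroke rate between the 5 and 10 strokes/s points
```

The accuracy then comes from the repeatability of each stroke plus the calibration, not from the machining, which is the point of the story.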
@@scottfranco1962 Great story! For some reason I imagined something like the water valve from a washing machine as a "bang bang" valve. To me fridge compressors work more like a one-piston combustion engine, but they don't make distinct sounds due to their continuous work. Maybe with a larger piston which stops at every cycle they would produce a "bang-bang" sound. It actually reminded me of "Zero to One" by Peter Thiel, and your story is actually my main concern about that book. When you outcompete your market you will find it a lot harder to disrupt your own business with new technology. It's just like it was in the Soviet Union: the only tech that was improving was military, and only because of competition with the West.
A CPU you get for your computer is not an SoC, not even chiplet-based Ryzens. An SoC includes not just the CPU and its I/O, but also a GPU, storage and memory at least. Phones and consoles have SoCs, but not personal computers.
It would be better to explain SerDes, NRZ, or PAM4, which are used for data transfer between chiplets.
The Athlon Tri-Core was just enterprise quad-core chips that failed to yield due to manufacturing loss, so they just turned off the defective core in Microcode, sold it to consumers as the Tri-Core and we ate it up.
Also because the failures were sometimes still semi-usable so we figured out ways to turn the defective CPUs back on.
In the 1980s AMD had the Am2900 bit-slice microprocessor design. I don't know if anyone actually used these, but they were an interesting idea: a multi-chip processor. Improvements in chip manufacturing made them redundant.
I can remember reading the data books on those when I was in my undergrad. The only industry applications I can think of were in very high speed signal processing, before the advent of integrated DSPs.
Intel, 2017: "We'Re GoNnA gLuE ChIpS ToGethER"
Intel, 2022: "They're called _Tiles"_
What kinds of chips/applications are not suitable for chiplets?
Xilinx is pronounced ZI-links, not ZEE-links. And panacea is typically pronounced with emphasis on the first and third syllables, not the second syllable.
Otherwise, good video.
They worked because despite all of the performance disadvantages of chiplets, Intel spent half a decade relaunching Skylake on 14nm because of their initial 10nm failure. Sacrificing 10-20% of the power envelope and adding 10-20ns of memory latency to do hops over the package is fine if the competition is THAT stagnant. Had Ice Lake and Alder Lake launched in 2017 and 2018 as originally planned, this video would've been titled "why AMD's chiplets failed"
I dunno why it never occurred to me that they would use the same chiplets across both server and desktop processors.
AMD survived thanks to the economics of chiplets.
Intel is behind AMD in that, but will soon have better yields for server chips thanks to chiplets.
Remember Terminator 2: judgement day? Chiplets.
Habana intel differences? In passive data transfer I mean
Help us AMD, you're our only hope
Next up, someday: "Why Asianometry Videos Work"?
excellent - very interesting