respect the heck out of the multiple 'this is the boundary of my expertise' comments. in my experience when someone says that it just reaffirms that everything up to that point is trustworthy, or at least honest. it makes your speculation more interesting. not in the market for a laptop, but this tech space has been so cool to follow! i love where intel (and amd) is going.
Yup. But I am really glad we are seeing competition in this segment and companies are back testing out new and being bold with the design and not just being afraid of unknown and stagnating. Testing out new stuff and improving older stuff is never a bad thing.
Well with regards to IPC anyway AMD has been on the money with every irritation of gains they said they will get, Intel on the other hand... Well lets just say they have fallen short each and everytime.
Intel still has to prove that they can do efficient compute (but this step closer to Apple design might help) Amd still has to prove they can do efficient platform power, where Apple and Meteor Lake are miles ahead in true achievable battery life. Will see when notebooks are out.
@@Sam-jx5zy Yeah AMD has been making head way in power gating but still not near Intel's way in doing it, as for Apple well that's all based on ARM which is very efficient but ARM is by it's very nature, basic in how it processes requests and can't process long instructions and requests like x86. On the other hand they sip power so pros and cons.
High Yield explaining chip layouts is like music! I just nerd-out. This man is so good at the niche he is in. I wish this would be as financially lucrative as the value of the knowledge he espouses!
It's insane that this channel went from small youtuber with sub-1000 views to being invited to international trade shows by one of the biggest chip makers in less than three years! As someone who's here since the VTFET video I am amazed but also not surprised because the quality was there from the start. Herzlichen Glückwunsch und weiterhin alles Gute auf deinem Weg der dich hoffentlich weit bringen wird! 🍻
It seems a cpu architecture revolution is underway. The days of simply adding more cores are over. Intel and AMD are now innovating a lot more than they have been the last decade or two, and ARM CPUs are reaching insane performance levels, and neural processing is becoming much more prevalent. It seems that we have reached a turning point. It reminds me of the innovation that occurred during the movement from single core processors with huge pipelines, to multicore processors with reduced pipelines.
@@_EyeOfTheTiger reduced instruction set is better not worse for power per watt and compiled code efficiency. Risc set is enough for any task complicity, and much of modern workload are vectorised so no difference risc or cisc etc, more depends from vector extensions. big and complex decoders only waste for chip space and energy in cisc and later they also do all in mu-ops (so intel and amd chips really a risc chips with added on chip cisc to risc translation, since from i686 times 🤷♂️😂)
The problem is, all of this innovation is only coming because we haven't been able to make gains by upping core counts and building die on smaller process nodes, like you said. This is because we're reaching a point where we simply can't gain more benefits from those methods. So after all of this restructuring of the Chip and optimization is done, my layman opinion is that we're going to plateau. These sorts of organizational innovations can't happen forever, at a certain point you've reached peak efficiency for the tech you have available
I think a lot of us would be surprised to know how many of these 'new' technologies were actually patented decades ago. It's just a matter of culture and economics I think. No?
I suspect the L0 naming scheme does a few things. Allows for L2 to keep the same name since it has the same performance and size as previous gen and allows L3 to not be named L4, which may have some negative thoughts among the media (plus the potential for direct comparisons to AMD)
Also it must be using virtual addresses like L1, the larger L1 still has to start physical address translation for cache misses. So this small cache for recent data not in the register file may save energy by (usually) avoiding work.
@@mikeb3172 Cache coherency requires any cached data to be available to another core. But that requires physical addresses to obtain a read only or exclusive write to the memory cacheline. If L0 is read only with L1 able to invalidate entries then the simpler faster cache type fits. It sounds like a Jim Keller style idea that questions prevailing assumptions.
There used to be an intel generation during the stagnation years that had an L4 cache. DDR4 wasn't ready yet and the cores were memory-starved, so they put some cache on.
Adamantine is a separate cache tile that goes between the base tile and active tiles, so it can’t be on Lunar Lake with only 3 tiles. It’s possible for it to be on Arrow Lake as the tile implementation isn’t revealed, but I am doubtful of that.
First found High Yields channel 3 months back with the Zen 6 video and have become a fan ever since and have watched lots of his previous videos as well. Great content!
This is the best explanation of these things I’ve seen so far! I also don’t fully understand everything but I feel like you made it really easy and enjoyable to follow in one video. Thank you!
@@PKperformanceEUThere is no way intel will reach M4 max that quickly. Intel is good but the last few years haven’t been kind to intel or 10 years at that.
@@HighYield Aw, many more, High Yield. After all, I'm sure you've met many great people in the industry. More to come I say! Also, will we get some Strix or even Turin content? Skymont seems very impressive. I feel like AMD is sitting on Zen5c, which IPC is on par with Zen5, I'm saddened AMD didn't talk about it at all (perhaps in a future Hot Chips). They've left 8 Zen5c cores for consumer and the rest for Turin (dense). From what I've heard it's also a unified CCX, so no split cache, so much better latency (Zen 2 to Zen3), I don't know why they're sitting out on the design. That said, Turin dense, the CCDS look massive, and I don't think it'll fit on AM5. I'm really interested to know why the Zen5c CCDS look larger than Bergamos Zen4c CCDs. My thoughts lead me to it to having 12 CCDs instead of 8 in Bergamo. Could it be more GMI links, to fit more CCDs on package? Is that the reason why is bigger? Could a 12 Zen5c CCD fit onto AM5 package socket?
Look at you now! This is crazy, I remember watching your videos when you had 700 subscribers, and now you're getting invited to these events. Congrats!
That is what they said 50 years ago about 640kB. You have no idea how much memory we might need 5 or 10 years from now. Maybe even only 2 or 3 years from now.
@@TheRealEtaoinShrdlu win 7 times required 2gb of ram (8gb for best experience). now at least 8gb is required for daily tasks and non under engineered games. so 2-4 times ram in 14 years i say. but 32gb max... i say it may be not for ultra professional 3D / music producers. but who knows
Allow a user to add more ram and use the "onboard" RAM as cache or allow users to replace the SOC like they do for desktops. That or CXL3.1 can access more RAM via a PCIE enabled port and device.
There are already plenty of mobile use cases that don't need massive compute power but do need more than 32GB of RAM. It's an understandable compromise at times but it would be nice if there were more memory options.
Interesting, the L0 pre-L1 may avoid work. L1 caches use process specific virtual addresses with an address translation needed in parallel to validate the tag isn't a clash with data from another thread. (There's some great CPU engineering lecture videos in YT that explain how L1 operates) The tiny pre-L1 shouldn't imply the real L1 is an L2 accessed by physical addresses shareable between processes. So a pre-L1 cache ends up as L0 to avoid confusion. Now for some speculation and this could be why Lion Cove dropped HT. Without HT that cache using logical virtual addresses could make energy saving simplifications. If it is entirely flushed on thread changes and not shared between threads, no logical to physical address translation validation seems necessary. The small size may mean it can be looked up fast enough to pre-filter L1, with misses going to L1 after or the energy inexpensive fast cases are going to L1 simultaneously to complete faster with later validation on an L0 miss. So perhaps the underlying truth of the leaks was HT went to allow effectively a cache of register file and most recently used logical addresses accelerating L1.
The address translation is needed to figure out which cache line in the selected set has the desired virtual address (and all caches still store the full physical address each line is for, regardless of whether they're initially addressed virtually). The reason for traditional L1 being virtually-addressed is specifically to allow doing the translation (aka TLB lookup) in parallel. The reason such L1-s are so tiny is because they (ab)use the low 12 bits of physical and virtual addresses being the same (due to 4KB pages), and extend to 32KB or 48KB or whatever via just reading all 8 or 12 (aka associativity) possible matches, and selecting between them when the TLB result is gotten. A 192KB virtually-addressed cache would imply it reading an entire 48 possible cachelines (each being 64 bytes) on each access, which is utterly crazy. That said, assuming that L0 and L1 accesses aren't done in parallel, by the time the L0 concludes that it doesn't have the asked-for data, the TLB lookup will have finished anyway, and thus the L1 will be addressable physically with no additional delay, like it would with a traditional L2.
@@dzaimaThe point is in L1 the virtual address can be looked up, with the physical address translation in parallel for validation to ensure it's from the right process. 2 different physical cachelines can share the same logical page bits. You don't want the latency penalty of translating virtual addresses first because it's slow. The figuring out which cache line has the virtual address is back to front, process virtual addresses are mapped to physical memory via address mapping. The question what virtual address does this physical memory have is meaningless because it depends on what processes are sharing the memory page, you have a 1:n mapping. But the process thread running has a 1:1 translation. All the code I compiled tried to use relative addresses with relocatable code to minimise such problems.
@@RobBCactive Virtual to physical address mapping isn't a 1:1 translation even within one process - it can be n:1, as a process can map the same page to multiple locations in its own single virtual address space (and this is useful - see "Mesh: compacting memory manager"). Thus, addressing a cache by a full virtual address is impossible to do correctly without still having some physical mapping check somewhere.
@@dzaima just another reason to avoid the need for it, I think you are ignoring the possibility of a read only cache that writes through via L1 with its translation. Actual processing writes mostly to registers and then store operations. If you include all the L1 features what is the benefit of the L0 cache? The address translation isn't going to magically complete faster.
@@RobBCactive I'm saying that it'd be pretty reasonable for the Lion Cove L0 to function exactly like traditional L1-s, and its L1 can largely function exactly like a traditional L2. Haswell (2014) has a 32KB L1 with 4-cycle latency and 256KB L2 with 12-cycle latency, and it seems very possible to me that, with 10 years of process node improvements, similarly-structured caches (with the higher bandwidth of course) can map to Lion Cove's L0 and L1; and then the difference ends up being the modernly-sized extra level before the very-slow L3. I suppose it would be possible that Lion Cove's L0 leaves write ops to L1, but that'd obviously result in a rather larger write latency (though perhaps that doesn't matter too much given store forwarding).
Just came across this video of yours. With available information you did awesome work. I also look forward for real time performance with the release of Lunar lake. Similarly with the arrow lake. Time has ensured that ultra series have taken over H series chips from 6 years ago. Interesting times. Such videos not only educate but can also be useful for purchase decisions. Subscribed 👍
Intel used to make great ARM chips in their Xtensa series, up until their Atom SoC push in mobile. But they still hold on to their ARM architecture license. AMD also based their first competitive x86 products on their am29k RISC architecture. x86 (or more accurately AMD64) is just a layer of backwards compatibility and nothing more.
They just directly copy the ARM SoC to x86, but just like for Qualcomm, it's required 4 years just to copy the M1.. so those bullets comes out to slowly.. also Apple can scale their chip to desktop level, but check the Snapdragon X Elite, if you increase the power consumption with 250% (from 23W to 80W), the gained extra performance just 10%.. so food luck to make a desktop chip with that, so the chip itself doesn't mean a lot, if it's limited to only laptops, since the laptop market only a small part of the PC market..
@@PaulSpades If it means backwards compatability with out emulation... I am not buying mac in everything but name. And if it costs like curent Arm solution, it will be reasonable even as core for actual PC. But lack of Ram expandability is still a bit meh.
Great job on the video as usual bro! Thanks for the info and looking forward to seeing Lunar Lake AND Arrow lake hopefully later this year 🤞. Congrats on having Intel fly you out there too!
I'm seriously considering an LNL mini PC as an upgrade from my current 5600G mini PC and 7220U laptop, I feel like this thing can do it all with a much lower power and heat output (which will make it more portable than my current AM4 mini PC), 32 gigs is seriously enough for me since that's where I'm at right now, and the heaviest thing I run is probably just War Thunder and RPCS3
I'm really glad to hear intel seems to be going as wide as possible. It seems like that is why Apple chips are so fast and efficient, not ARM doing magic or something
Apple have caches v. close to the CPU, reducing latency and energy for data flows. Going wide doesn't help a lot of code, it inherently has serialising data dependencies.
Not sure if it's going wide that's helping here. From what I know, the efficiency of Apple chips came from 4 things: - better manufacturing node (M1 was N5, everybody else was on 7nm. M3 was on N3, everybody else was on N5. With Lunar Lake, we're finally on even grounds here) - on-chip RAM (while I hate non-upgradable RAM, I'm glad that Intel did this with Lunar Lake. There is a segment which clearly want battery life much more than upgradeable RAM) - non bloated OS (nothing to comment here, Windows (and Microsoft) sucks, Linux doesn't have enough support to be perfectly tuned yet) - laptop and motherboard design - this is much more subjective. Thing is that Apple actually prioritises battery life, while on PC side it's usually the benchmarks. Which is why many laptops are much louder and warmer. I also know that simply having some extra ports, that is, only having them exist, having something connected to them is not neccessary, that can also increase the minimum power required for the laptop to be on. Apple is famous for not having enough ports - I think this is also a reason of its efficiency Edit: forgot to add, M chips being on ARM also help on efficiency ... but not so much as most people claim (as if it's the only thing). My gut feeling is that it helps like 5-10%. As for the M chips being so fast ... other than the big memory bus width (up to 16!! channels on the Ultra chips) I'd say is also because of better manufacturing node. If you take the N5 and N4 and 5nm and 4nm generation of chips, Intel and AMD are better than M1 and M2. I mean, if you exclude the efficiency, Intel's Raptor Lake and Raptor Lake refresh which are on 10nm++++ are quite competitive even with M3 chips. Still, overall, the difference is not that big usually. The M cores/chips are clearly well designed.
it looks weird that media and display engine separated. they could switch display engine and 8MB side cache but the media engine does need some cache (not 8MB )
Good Job @Highyield , love your detailed reviews on these silicons. With respect to this video finally Intel catching up with various ARM platforms including Apple’s M series and Snapdragon X series.
@@betag24cn That was the first generation of E-cores. Did you watch the video? Skymont E-cores have similar IPC to Raptor Cove (Raptor Lake)... while being vastly more efficient
@@__aceofspades doesnt matter, tje concept is stupid, is fake you did not glued together two cpus because you were in panic, it is a dumb idea and points to the fact that your designs are terribñe on not generating heat thanks to absirds levels of power consumption, does not matter
It would be another lie by Intel. They said that Gracemont matched Skylake. Here we are years latter and the Haswell 4 core 8 thread i7-4700MQ laptop chip that i have has 25% higher IPC (CPU-Z) than the e-cores on my Alder Lake Core i7-12700K CPU with way, WAY faster DDR5. Lunar Lake is Intel's Bulldozer, there so many problems with the overall design of the chip. Meteor Lake makes more sense.
Will you do a video on Zen 5 and 5C? I'm interested in the capabilities of Zen 5C versus Lunar Lakes E-Cores and how the different paths they took paid off now.
I'm still waiting for a chip that integrates 32GB of HBM3e as an on-package L4 cache within the same SoC, while also supporting the addition of DDR5 memory modules with ECC capability, rather than being limited to just integrated memory.
The core layout with everything right next to the memory controller makes sense, and I'm glad to see intel moving in this direction. It'll be super interesting to see how x86 power consumption improves with this layout!
it is also interesting to see that the NPU has roughly a similar TOPs per area as the gpu, so expect it to be very power efficient, which also might mean that perhaps someone might find out a way to overclock it, since sometimes hardware optimized for efficiency has insane headroom for overclocking.
Lunar Lake looks like the biggest improvement for Intel in over a decade. In terms of performance per watt and GPU performance, it looks like Lunar Lake will beat Zen 5 and Qualcomm's X Elite. The only downside is that Lunar Lake is focused exclusively on thin and light laptops and handhelds, its not their highest performance product for mobile or desktops that is Arrow Lake which looks great for performance but will lose some efficiency and iGPU gains Lunar Lake brings.
I work for a major computer vendor and you're spot on. Your conclusion 110% speaks my mind and maims exactly what I've been saying when Intel presented us the LNL 3 weeks ago. I said that if LNL matches almost the battery perf of ARM by Qualcomm, this is going to be another Windows RT. ARM for Windows doesn't really offer a difference. We have already more performance than needed, NPU's are available en masse thanks to NVIDIA, it's just MS that firewalls for now the marketing bullshit storytelling about Copilot and that blocks other than embedded NPU's from being recognized by copilot, but this will change probably next year and they'll have to open the gates. What's left ? Battery performance. Ok, but if this gets matched, what's the point of having the whole industry shifting away from x86 ? Zero... ARM will be the thing that made Intel rethink it's architecture and from there the power efficiency and that's a good thing.
Hello, great video. I wanted to ask you, now that both mobile laptop cpus from amd and intel are announced, which cpu do you think is superior overall? Taking everything into account would you go with lunar lake or strix? Thanks
I think Lunar Lake has areal shot at the efficiency crown, but it does launch later in Q3, while AMD will launch sooner. Always wait for reviews, but for battery life I think LNL will be best. Strix Point should win in raw performance with up to 12 cores.
@@HighYield thanks for replying, patiently waiting on arrow lake desktop reveal as that is what im really interested in, im looking to upgrade to a new desktop with an rtx 5090, gonna go with whatever is faster amd 9000 or arrow lake. Ryzen 9000 vanilla series kinda disappointed me a bit tbh, pretty much same gaming performance as previous gen. Have to see what 9000 x3d chips have to offer.
I don't think they are comparable. AMD doesn't have something to compete with lunar lake given the low power target of lunar lake, and Intel has not announced what their answer to Strix Point is (though we all know it's going to be some variation of arrow lake). Intel will Winn the efficiency battle against Strix point and it's very likely that their GPU will be very competitive with hawk point at lower power, but it is unlikely it will be able to touch Strix point in GPU performance given that Strix point has 16 CUs. Overall, more and more excited for lunar lake. I think in a handheld form factor it's going to be very interesting.
@@sloanNYCthe shipping may be not late, but the real issue is supply. Here in my country you only able to find phoenix point/hawk point easily in gaming laptops while the thin & light category is dominated by intel.
Technically you can upgrade the memory after purchase. You just have to be really good at soldering 😁. I only tell this comment, because I knew someone who did this to his MacBook. Bought a. 8GB model abd with some patience and skill it became a 32 GB model 😅.
@@noticing33 I don't know but I know the device worked afterwards. I lost contact with him after his internship ended. But I think he wouldn't have done it, if it wouldn't have improvement performance.
I'm glad we have so vast mobile CPUs choice these days: Apple M1/M2/M3, Intel Meteor/Arrow/Lunar Lake, AMD Hawk Point/Strix Point and a new player - Snapdragon X Elite - is on its way. We never had a more difficult choice
They really should have spent that energy with Lunar Lake to replicate the chip concept on desktop PC as well. Sounds very interesting and hopefully comes to desktop
I know this is a video about Lunar Lake but this video gets me really really excited for Battlemage and desktop products like Arrow Lake If intel could figure out a V-Cache competitor and commit to multiple years of support for a motherboard platform they could make AMD straight up unattractive on desktop. I say that as someone with a 7950x3D and invested into AM5! I can't wait to see the next few years
@@aravindpallippara1577 Well currently, Lion cove is projected to have higher single threaded performance than Zen5 cores. That single thread lead will help with everything, including gaming. AMD has the biggest advantage in gaming rn with V-Cache, platform support and efficiency. with skymont, intel has a real chance of gaining a huge performance/watt uplift particularly in multi threaded loads which is where intel sucks down a comically large amount of power That's why I specified V-Cache and platform support would make AMD unattractive on desktop. Because Intel already has a decent chance of having class leading single threaded performance, adding V-Cache to an intel CPU would surely boost performance considerably (especially in games that love v-cache like Factorio, or Kerbal Space Program) And platform support like we have with AM5 would be really great. Having to upgrade every 2 gens is a huge downside compared to AMD's offerings and commitment to 2027+ support and why I personally went for the 7950x3D and AM5. V-Cache and platform support is just great
The V-cache is a solution because of the slow memory controller on Zen processors. When you glue the ram this close you dont have that AMD problem. No need for the same solution. The mystery cache is probably enough if Intel engineers did their job well.
@@impuls60 Imo that's an F Tier comment. L3 cache is going to beat out faster memory just by virtue of the insanely high bandwidth and lower latency. There's a reason intel loses in those games that favor 3D cache
@@impuls60 Agreed with the above commentor a cpu cache and ram have vastly different type of uses - cache is very raw and hence very fast as opposed to ram which needs to be encrypted and passed through os layer checks before being accessed - cache is still the king for performance of single thread operations.
A cynic would point out Pat was excited about Raptor and Meteor Lake and even Sapphire Rapids. These presentations have not been a reliable guide to what's delivered and when in recent years.
I'd love to see something like this for desktops where I can get an entire SOC with 32-64GB of ram all bundled together. I know there are upgradeability concerns but the performance benefits if you over spec could be really good, especially for ram heavy applications.
Depending on the details on moving data between the NPU and the GPU, using both at the same time could work really well for some AI workloads. Training a QLora where the main weights are only used for 4 bit inference that could run in the NPU and the backpropagation is done only for a low rank adaptor in fp32 or fp16 in the GPU could potentially work well. It won't be faster than a dedicated GPU, even a 3060 should outperform it. Memory bandwidth will likely also severely limit its performance. But often the issue with GPUs is not speed but available memory. Also this should be much more power efficient. It all will depend on software support, that is usually the issue with most non nvidia AI hardware.
AMD & Intel APU still conquered X86 ecosystem. Apple & Qualcomm still conquered ARM ecosystem with Nvidia own ARM CPU design will joining ARM ecosystem next year ❤❤❤
Cool to see a nice bit of Cache on the side to minimize DRAM access, L4 foresight on desktop? probably not but I love what I'm seeing from Intel this year, very exciting in more ways than expected. Maybe not quite leadership just yet but at least on par, the whole E-cores thing is evolving into something and I wont be surprise if it eventually gets to a point of Zen Dense. So far its still looking to be a split design mentality but a high IPC Philosophy so the ability to use E-cores for most task will get the best out of the efficiency. Last time I was this excited was Alder Lake?
lol, they announce things and never reoease tyem, meanwhile we get those wonderful i9 all damaged remember foverosm chip stackingm, that was in 2017, where is it? ah yes, in amd x3d cpus remember that intel innivation is not innnovation, is broken promises
@@betag24cnIntel did end up using foveros chip stacking, twice in fact. The more famous example is in Ponte Vecchio, where they have base dies (like described in this video mind you) except they have L4 cache. Vertically stacked. Also, underneath the compute so it doesn’t interfere with cooling. Wonder why anyone would ever try doing it the other way round? The other example was in a super obscure part that previewed Intel’s P core and E core design before alder lake. One p core and 4 atom cores. 5 cores total.
the NPU is pretty gigantic compared to for example what Apple does. Curious about the performance because Apples ones are ridiculously fast for their size
I think we'll have to wait for the release of the Lunar Lake laptop and the benchmark scores, but if you simply multiply the graphics scores of the Meteor Lake-H's 3DMark benchmark Time Spy and Fire Strike by 1.5, you get TS: 5250 FS: 13800. In terms of desktop GPUs, it's close to the performance of the GTX1660. In the country where I live, there are several articles that say it's 50% better in performance than the Meteor Lake-U, but if you multiply the GPU performance of the Meteor Lake-U by 1.5, it will be the same as the Meteor Lake-H's GPU performance. On a different note, is the presence or absence of hyperthreading related to the high single-thread performance of Apple silicon?
I can say the same about Apple ... being so efficient by simply using a better / newer manufacturing node. What a revolutionary concept ... And in regards to on-chip LPDDR, it wasn't done not because Intel or AMD didn't though of it or didn't knew how to do it. It wasn't done because of the tradeoffs, like a) no more upgradeability - something that many people actually like - and b) customizability - in case of soldered RAM, it's still to the OEM to add it, as it sees fit. And as the suppliers come, so to speak. You're much more constrained when you do it on-chip. I If it weren't for the efficiency improvements, I would be fully against it.
I have been wondering when we would finally see someone utilize on package memory to compete with Apple in power consumption. Fingers crossed hoping this is the beginning of the end for 8bg memory!
This doesn’t really save on power consumption, at least not the way Intel did it. It’s the same bog standard LPDDR memory as laptops have now, and it’s connected to the CPU with the same copper traces. The only thing that’s really changed is that it’s physically closer and OEM’s will have to source it from Intel (with the obligatory markup!) rather than sourcing it themselves from Micron, Samsung or SK Hynix. If I’m remembering the Apple chip architecture correctly, they’re using GDDR memory, and it’s actually integrated right into the same silicon as the rest of the chip. If true, that effectively means TSMC is making the memory for the Apple chip (they most certainly are NOT doing so for Intel) and the memory can’t even be physically separated from the rest of the chip.
@@benjaminlynch9958Apple is using lpddr5x; their implementation is basically the same as Intel’s. Shorter trace lengths do have benefits for power consumption and latency, although I’m not expecting large benefits from this implementation. The main benefit of this approach is a simplification of design for OEMs because they don’t have to worry about designing the memory system. It’s also beneficial for consumers by standardizing memory configurations, so companies can’t skimp on memory. 16GB is standard on Lunar Lake with higher end models going to 32.
Exciting stuff! Great video, as usual. I do have one question, though. Is it certain that a server implementation (or any) of Lion Cove would have SMT? Also, different implementations of the same architecture sounds more like a standard vs Dense Zen situation to me, and I think that it could get expensive to develop lots of just slightly different cores
Speaking of, we REALLY need the dynamic iGPU memory allocation that Apple has. On Windows' side I can see why it's not implemented and why nobody talks about it, as Microsoft couldn't give 2 flying Fs about Windows, especially in the performance side. If it's not ads or tracking the user, then it's priority 7384, to be done in 15 years from now. On Linux side I hope we'll see something, but usually GPU stuff comes from the manufacturer, so it would be Intel or AMD here for the iGPUs. And they're both busy on other areas, like the actual GPUs being competitive. And the drivers for Windows. Linux comes after that. Sigh.
@@sowa705 Oh, ok. I was under the impression that it's settled at boot time. I wonder then why did Apple presented (and people being wowed) as something new. I guess it was new for them.
Gluing the ram closer can yield far better ram function than CAMM. I think this cpu will be used in very small systems. I'm betting they will use even faster ram as soon as it comes available.
@@ChristopherBurtraw The on chip is mostly for the power savings, not neccesarly for bandwidth. Lunar Lake is optimized to be very efficient (and I hope it actually delivers). It should be perfect for ultrabooks which want really really long battery life and for gaming handhelds. For the rest of normal folk and normal (or powerful) laptops, we'll have Arrow Lake. And hopefully we'll see LPCAMM2 laptops with that. I dream of a Framework 16 with Arrow Lake and LPCAMM2 in which to add 128 GB of RAM and finally upgrade my almost 8 year old laptop, to one that will also last me 7-10 years.
@@Winnetou17 I'm hoping the next gen (after the one they just announced) 13 board will have it too. Framework won't want to implement this one even for the 13...
Do I understand correctly that 128 bit memory bus is the same bandwidth as what we'd get with dual-channel DDR modules? Since each module (channel) is usually 64-bit? Just trying to understand the overall memory bandwidth, I know we also get latency benefits and not downplaying it.
Can you explain the implications embedded RAM? My PC currently has 128GB of RAM. From several places I've looked, it sounds to me like with lunar lake you're limited to the 32GB of embedded RAM and it won't make use of normal RAM sticks? If so that's.. just not an option at all for developers and I'm really surprised that noone is getting out pitch forks. BUT I must be missing something so what am I missing?
Can you just combine GPU, NPU and CPU for the same inference task though? Or is Intel just adding up numbers to create a bigger number but in the real world, you will have to decide where to run any given model?
You are correct. Currently can't just add the numbers. Apparently work is being done to enable mismatched processors for ai batch-processing, but I don't expect it will release soon, if ever.
@@martin777xyz thanks for confirming. And yeah, from my understanding it sounds really tough to make these systems complement each other. Maybe some day we'll be running so many models locally that they can run in parallel but even that...
@@saricubra2867 The market segment is different for Lunar lake, it's meant for Thin & lights, thus it's competition is the X Elite. It is designed to be less powerful in multicore, more powerful in single core, and more efficient. Intel is trying to copy ARM. By the way, ram won't be and issue; the Lunar lake socs have a memory bandwidth of about 136 GB/s, slightly slower than the m4, however unlike Apple, 16gb is the minimum amount for Lunar Lake, 32 gigabyte is the maximum.
Can't wait to see if those promises in performance and power draw materializes. We could finally have smaller x86_64 handhelds and could reaffirm x86_64 as capable of reinventing itself. And in general its good seeing intel trying again. All the design decisions seem intelligent again and that it required a certain amount of redisgn
I love the idea of on package memory. It's fantastic to get the perspective of someone who sees this idea as an opportunity for improved efficiency and cost, rather than just a lack of upgradability.
I have two guesses on Intel 4 and later processes Meteor Lake and Lunar Lake all being mobile oriented and not desktop. One is that the processes are not suitable for high performance operation, but do get better power efficiency. 2) manage fab-process capacity.
They are lying. Raptor Cove has 36MB of L3 cache on a monolithic Ring Bus. A ring bus Haswell 4core8thread laptop Core i7-4700MQ that i have has 25% higher IPC than the e-cores on my current Alder Lake i7-12700K besides the HUGE difference on RAM speed.
It would be interestong if someone came up with a hybrid chip that has both x86 and ARM instruction cores. Which would allow running both X86 and ARM software natively. It could be an 8 core CPU with 2 X86 P cores + 6 ARM P cores.
@@reiniermoreno1653 Implementing the Software would easier compared to the hardware since we already have OSs that understand which ISA you are using. Binary executables also have info on which architecture they are designed for.
Even if its performance and efficiency is only on par with M1, I would say this is a win for Intel moving forward. I can imagine 16-20+ hrs batt life ultrabooks.
respect the heck out of the multiple 'this is the boundary of my expertise' comments. in my experience when someone says that it just reaffirms that everything up to that point is trustworthy, or at least honest. it makes your speculation more interesting.
not in the market for a laptop, but this tech space has been so cool to follow! i love where intel (and amd) is going.
The more things I learn, the more in realize how little I really know
@@HighYield Yep... but keep it up! :D Awesome to see great content like this on here.
@@HighYield This. You do need to know more than the average person to learn that, and it usually comes at a later age ;-)
Lunar lake looks compelling, but just like Zen 5, I'll believe it when I see it.
The only way you should do it.
Yup. But I am really glad we are seeing competition in this segment and companies are back testing out new and being bold with the design and not just being afraid of unknown and stagnating. Testing out new stuff and improving older stuff is never a bad thing.
Well with regards to IPC anyway AMD has been on the money with every irritation of gains they said they will get, Intel on the other hand... Well lets just say they have fallen short each and everytime.
Intel still has to prove that they can do efficient compute (but this step closer to Apple design might help)
Amd still has to prove they can do efficient platform power, where Apple and Meteor Lake are miles ahead in true achievable battery life.
Will see when notebooks are out.
@@Sam-jx5zy Yeah AMD has been making head way in power gating but still not near Intel's way in doing it, as for Apple well that's all based on ARM which is very efficient but ARM is by it's very nature, basic in how it processes requests and can't process long instructions and requests like x86. On the other hand they sip power so pros and cons.
High Yield explaining chip layouts is like music! I just nerd-out. This man is so good at the niche he is in. I wish this would be as financially lucrative as the value of the knowledge he espouses!
Yeah, this guy is the real deal. Really enjoy his content, even when it’s on x86
nerds
Much better detail than Asianometry, that's for sure
@@aerohk Jon is much smarter than me, he's just looking at the bigger picture most of the time. Like a certain technology, instead of a specific chip.
one characteristic of high IQ people is they pinpoint intelligent aspects in other people 🎩🧤🥂
I think they're doing some very interesting stuff with their SOCs. I'm happy they flew you out to the event, you're a great creator.
CPUs are going through somewhat of an architectural revolution. The days of simply adding more cores is over. The real innovation has begun.
It's insane that this channel went from small youtuber with sub-1000 views to being invited to international trade shows by one of the biggest chip makers in less than three years!
As someone who's here since the VTFET video I am amazed but also not surprised because the quality was there from the start.
Herzlichen Glückwunsch und weiterhin alles Gute auf deinem Weg der dich hoffentlich weit bringen wird! 🍻
Dankeschön. Ich bin immer noch überrascht
@@HighYield Verständlich aber hast es dir definitiv verdient.
Uhhhh huh….
It seems a cpu architecture revolution is underway. The days of simply adding more cores are over. Intel and AMD are now innovating a lot more than they have been the last decade or two, and ARM CPUs are reaching insane performance levels, and neural processing is becoming much more prevalent. It seems that we have reached a turning point. It reminds me of the innovation that occurred during the movement from single core processors with huge pipelines, to multicore processors with reduced pipelines.
ARM processors are still RISC based while x86 is CISC. ARM does well for certain task but it is a reduced instruction set
Frankly speaking after 4 years of inertia intel make something based on apple m1 ideas 🤔😉
@@_EyeOfTheTiger reduced instruction set is better not worse for power per watt and compiled code efficiency. Risc set is enough for any task complicity, and much of modern workload are vectorised so no difference risc or cisc etc, more depends from vector extensions. big and complex decoders only waste for chip space and energy in cisc and later they also do all in mu-ops (so intel and amd chips really a risc chips with added on chip cisc to risc translation, since from i686 times 🤷♂️😂)
The problem is, all of this innovation is only coming because we haven't been able to make gains by upping core counts and building die on smaller process nodes, like you said.
This is because we're reaching a point where we simply can't gain more benefits from those methods. So after all of this restructuring of the Chip and optimization is done, my layman opinion is that we're going to plateau. These sorts of organizational innovations can't happen forever, at a certain point you've reached peak efficiency for the tech you have available
I think a lot of us would be surprised to know how many of these 'new' technologies were actually patented decades ago. It's just a matter of culture and economics I think. No?
I can’t wait until real high-res die shots of these chips will be available
Get a scanning electron microscope
I suspect the L0 naming scheme does a few things. Allows for L2 to keep the same name since it has the same performance and size as previous gen and allows L3 to not be named L4, which may have some negative thoughts among the media (plus the potential for direct comparisons to AMD)
And naming their L1 L1.5 likely creates unnecessary problems on the SW side
Also it must be using virtual addresses like L1, the larger L1 still has to start physical address translation for cache misses.
So this small cache for recent data not in the register file may save energy by (usually) avoiding work.
My first thought was that the L0 is not shared per core, or micro code references to cache start at 0 and they're being more cohesive with naming.
@@mikeb3172 Cache coherency requires any cached data to be available to another core. But that requires physical addresses to obtain a read only or exclusive write to the memory cacheline.
If L0 is read only with L1 able to invalidate entries then the simpler faster cache type fits.
It sounds like a Jim Keller style idea that questions prevailing assumptions.
There used to be an intel generation during the stagnation years that had an L4 cache. DDR4 wasn't ready yet and the cores were memory-starved, so they put some cache on.
For a while I was expecting a big reveal of the Adamantine L4 cache.. Alas, it ended being the side cache
I was hoping for that too
Adamantine is a separate cache tile that goes between the base tile and active tiles, so it can’t be on Lunar Lake with only 3 tiles. It’s possible for it to be on Arrow Lake as the tile implementation isn’t revealed, but I am doubtful of that.
@@dex6316 Adamantine is an active Silicon base die which was rumoured to contained L4 cache. It is not a die that goes in between.
Congrats on getting a press invite dude! You deserve it.
but my rtx 3070 beats a pos NPU🤣🤣
First found High Yields channel 3 months back with the Zen 6 video and have become a fan ever since and have watched lots of his previous videos as well. Great content!
This is the best explanation of these things I’ve seen so far! I also don’t fully understand everything but I feel like you made it really easy and enjoyable to follow in one video. Thank you!
M4 and Lunar CPU fight is going to be interesting.
Hopefully intel becomes competitive with arrow lake against m4max.
@@PKperformanceEUThere is no way intel will reach M4 max that quickly. Intel is good but the last few years haven’t been kind to intel or 10 years at that.
@@GlobalWave1 Most likely. But it be nice to haven alternative if m4max will be more expensive than m3max i dont buy it
@@GlobalWave1 lunar lake is not a competitor to m4 max. its a competitor to m4
M4 is already reality and will be partnered by the M5, when Lunar Lake hits the market.
cpu competition is spicy again babyyy
High Yield with the early Computex coverage!!!!
My first, but hopefully not my last Computex.
@@HighYield Aw, many more, High Yield. After all, I'm sure you've met many great people in the industry. More to come I say! Also, will we get some Strix or even Turin content? Skymont seems very impressive. I feel like AMD is sitting on Zen5c, which IPC is on par with Zen5, I'm saddened AMD didn't talk about it at all (perhaps in a future Hot Chips). They've left 8 Zen5c cores for consumer and the rest for Turin (dense). From what I've heard it's also a unified CCX, so no split cache, so much better latency (Zen 2 to Zen3), I don't know why they're sitting out on the design.
That said, Turin dense, the CCDS look massive, and I don't think it'll fit on AM5. I'm really interested to know why the Zen5c CCDS look larger than Bergamos Zen4c CCDs. My thoughts lead me to it to having 12 CCDs instead of 8 in Bergamo. Could it be more GMI links, to fit more CCDs on package? Is that the reason why is bigger? Could a 12 Zen5c CCD fit onto AM5 package socket?
Look at you now! This is crazy, I remember watching your videos when you had 700 subscribers, and now you're getting invited to these events. Congrats!
Thanks! Tbh, it still feel like dream to me. I'm enjoying it as much as I can.
@@HighYield I'm glad :) and it's well deserved! Your hard work is definitely paying off
If the standard would be 32 GB for RAM on every soldered LPDDR5X RAM then no one will have issue over upgradability.
That is what they said 50 years ago about 640kB. You have no idea how much memory we might need 5 or 10 years from now. Maybe even only 2 or 3 years from now.
@@TheRealEtaoinShrdlu win 7 times required 2gb of ram (8gb for best experience). now at least 8gb is required for daily tasks and non under engineered games. so 2-4 times ram in 14 years i say.
but 32gb max... i say it may be not for ultra professional 3D / music producers. but who knows
Two option ram 16/32gb
Allow a user to add more ram and use the "onboard" RAM as cache or allow users to replace the SOC like they do for desktops.
That or CXL3.1 can access more RAM via a PCIE enabled port and device.
There are already plenty of mobile use cases that don't need massive compute power but do need more than 32GB of RAM.
It's an understandable compromise at times but it would be nice if there were more memory options.
Interesting, the L0 pre-L1 may avoid work.
L1 caches use process specific virtual addresses with an address translation needed in parallel to validate the tag isn't a clash with data from another thread. (There's some great CPU engineering lecture videos in YT that explain how L1 operates)
The tiny pre-L1 shouldn't imply the real L1 is an L2 accessed by physical addresses shareable between processes.
So a pre-L1 cache ends up as L0 to avoid confusion.
Now for some speculation and this could be why Lion Cove dropped HT.
Without HT that cache using logical virtual addresses could make energy saving simplifications. If it is entirely flushed on thread changes and not shared between threads, no logical to physical address translation validation seems necessary.
The small size may mean it can be looked up fast enough to pre-filter L1, with misses going to L1 after or the energy inexpensive fast cases are going to L1 simultaneously to complete faster with later validation on an L0 miss.
So perhaps the underlying truth of the leaks was HT went to allow effectively a cache of register file and most recently used logical addresses accelerating L1.
The address translation is needed to figure out which cache line in the selected set has the desired virtual address (and all caches still store the full physical address each line is for, regardless of whether they're initially addressed virtually). The reason for traditional L1 being virtually-addressed is specifically to allow doing the translation (aka TLB lookup) in parallel.
The reason such L1-s are so tiny is because they (ab)use the low 12 bits of physical and virtual addresses being the same (due to 4KB pages), and extend to 32KB or 48KB or whatever via just reading all 8 or 12 (aka associativity) possible matches, and selecting between them when the TLB result is gotten. A 192KB virtually-addressed cache would imply it reading an entire 48 possible cachelines (each being 64 bytes) on each access, which is utterly crazy.
That said, assuming that L0 and L1 accesses aren't done in parallel, by the time the L0 concludes that it doesn't have the asked-for data, the TLB lookup will have finished anyway, and thus the L1 will be addressable physically with no additional delay, like it would with a traditional L2.
@@dzaimaThe point is in L1 the virtual address can be looked up, with the physical address translation in parallel for validation to ensure it's from the right process. 2 different physical cachelines can share the same logical page bits.
You don't want the latency penalty of translating virtual addresses first because it's slow.
The figuring out which cache line has the virtual address is back to front, process virtual addresses are mapped to physical memory via address mapping.
The question what virtual address does this physical memory have is meaningless because it depends on what processes are sharing the memory page, you have a 1:n mapping. But the process thread running has a 1:1 translation.
All the code I compiled tried to use relative addresses with relocatable code to minimise such problems.
@@RobBCactive Virtual to physical address mapping isn't a 1:1 translation even within one process - it can be n:1, as a process can map the same page to multiple locations in its own single virtual address space (and this is useful - see "Mesh: compacting memory manager"). Thus, addressing a cache by a full virtual address is impossible to do correctly without still having some physical mapping check somewhere.
@@dzaima just another reason to avoid the need for it, I think you are ignoring the possibility of a read only cache that writes through via L1 with its translation.
Actual processing writes mostly to registers and then store operations.
If you include all the L1 features what is the benefit of the L0 cache? The address translation isn't going to magically complete faster.
@@RobBCactive I'm saying that it'd be pretty reasonable for the Lion Cove L0 to function exactly like traditional L1-s, and its L1 can largely function exactly like a traditional L2.
Haswell (2014) has a 32KB L1 with 4-cycle latency and 256KB L2 with 12-cycle latency, and it seems very possible to me that, with 10 years of process node improvements, similarly-structured caches (with the higher bandwidth of course) can map to Lion Cove's L0 and L1; and then the difference ends up being the modernly-sized extra level before the very-slow L3.
I suppose it would be possible that Lion Cove's L0 leaves write ops to L1, but that'd obviously result in a rather larger write latency (though perhaps that doesn't matter too much given store forwarding).
I remember watching your channel before you had less than 1000 subscribers. It's good to see you getting big enough to be invited to Intel events
I've seen your username around for a while now, thanks for sticking with me
I love how you are going into detail of the die space and die size and what each new process with L0 L1 L2 cache with lunar lake and what it means! 🎉
crazy you can soon build a tiny 4" by 4" work station, totally fine for 3d, illustration works and code. this is the way forwards.
Proud of you for getting invited to a press event! Well deserved. hoping lunar lake won't be as weird of a launch as strix
Just came across this video of yours. With available information you did awesome work. I also look forward for real time performance with the release of Lunar lake. Similarly with the arrow lake. Time has ensured that ultra series have taken over H series chips from 6 years ago. Interesting times. Such videos not only educate but can also be useful for purchase decisions. Subscribed 👍
Good effort from former IMG guys (among others). Congrats to the team
Thanks for the disclaimer early in the video. A perfect example of why you are exemplary and trustworthy.
The battle is far from over, X86 still has a lot of bullets to fight
haha. this is an advanced arm soc copy at every level of the design, except for the instructions decoder.
@@PaulSpades yeah just like how snapdragon ditch the low power cores for the laptops
Intel used to make great ARM chips in their Xtensa series, up until their Atom SoC push in mobile. But they still hold on to their ARM architecture license.
AMD also based their first competitive x86 products on their am29k RISC architecture.
x86 (or more accurately AMD64) is just a layer of backwards compatibility and nothing more.
They just directly copy the ARM SoC to x86, but just like for Qualcomm, it's required 4 years just to copy the M1.. so those bullets comes out to slowly.. also Apple can scale their chip to desktop level, but check the Snapdragon X Elite, if you increase the power consumption with 250% (from 23W to 80W), the gained extra performance just 10%.. so food luck to make a desktop chip with that, so the chip itself doesn't mean a lot, if it's limited to only laptops, since the laptop market only a small part of the PC market..
@@PaulSpades If it means backwards compatability with out emulation... I am not buying mac in everything but name. And if it costs like curent Arm solution, it will be reasonable even as core for actual PC. But lack of Ram expandability is still a bit meh.
Great job on the video as usual bro! Thanks for the info and looking forward to seeing Lunar Lake AND Arrow lake hopefully later this year 🤞. Congrats on having Intel fly you out there too!
Pretty cool. Looks interesting on paper. We will see how it performs in real life.
I love that I came across this channel. I can't believe how good the quality of content you have while still being a sub 50,000 sub channel.
Great video! Loved your lucid description of the LNL architecture.
I'm seriously considering an LNL mini PC as an upgrade from my current 5600G mini PC and 7220U laptop, I feel like this thing can do it all with a much lower power and heat output (which will make it more portable than my current AM4 mini PC), 32 gigs is seriously enough for me since that's where I'm at right now, and the heaviest thing I run is probably just War Thunder and RPCS3
I'm really glad to hear intel seems to be going as wide as possible. It seems like that is why Apple chips are so fast and efficient, not ARM doing magic or something
Apple have caches v. close to the CPU, reducing latency and energy for data flows.
Going wide doesn't help a lot of code, it inherently has serialising data dependencies.
Not sure if it's going wide that's helping here. From what I know, the efficiency of Apple chips came from 4 things:
- better manufacturing node (M1 was N5, everybody else was on 7nm. M3 was on N3, everybody else was on N5. With Lunar Lake, we're finally on even grounds here)
- on-chip RAM (while I hate non-upgradable RAM, I'm glad that Intel did this with Lunar Lake. There is a segment which clearly want battery life much more than upgradeable RAM)
- non bloated OS (nothing to comment here, Windows (and Microsoft) sucks, Linux doesn't have enough support to be perfectly tuned yet)
- laptop and motherboard design - this is much more subjective. Thing is that Apple actually prioritises battery life, while on PC side it's usually the benchmarks. Which is why many laptops are much louder and warmer. I also know that simply having some extra ports, that is, only having them exist, having something connected to them is not neccessary, that can also increase the minimum power required for the laptop to be on. Apple is famous for not having enough ports - I think this is also a reason of its efficiency
Edit: forgot to add, M chips being on ARM also help on efficiency ... but not so much as most people claim (as if it's the only thing). My gut feeling is that it helps like 5-10%.
As for the M chips being so fast ... other than the big memory bus width (up to 16!! channels on the Ultra chips) I'd say is also because of better manufacturing node. If you take the N5 and N4 and 5nm and 4nm generation of chips, Intel and AMD are better than M1 and M2. I mean, if you exclude the efficiency, Intel's Raptor Lake and Raptor Lake refresh which are on 10nm++++ are quite competitive even with M3 chips. Still, overall, the difference is not that big usually. The M cores/chips are clearly well designed.
As an ASIC Design Engineer , this is an amazing video. Was able to relate to a lot of concepts I learnt in school
Excited for this video!
The way tiles allow different parts of a CPU to use the most optimal process node is very cool.
it looks weird that media and display engine separated. they could switch display engine and 8MB side cache but the media engine does need some cache (not 8MB )
i really like ur transparency
Good Job @Highyield , love your detailed reviews on these silicons. With respect to this video finally Intel catching up with various ARM platforms including Apple’s M series and Snapdragon X series.
If those E cores are getting so good, I wouldn't mind having a budget option with just 6 E cores!
those things are basically a 8th gen i5 mixed with a atom, if you want that, go buy one, dont wait for the future
@@betag24cn That was the first generation of E-cores. Did you watch the video? Skymont E-cores have similar IPC to Raptor Cove (Raptor Lake)... while being vastly more efficient
@@__aceofspades doesnt matter, tje concept is stupid, is fake you did not glued together two cpus because you were in panic, it is a dumb idea and points to the fact that your designs are terribñe on not generating heat thanks to absirds levels of power consumption, does not matter
It would be another lie by Intel. They said that Gracemont matched Skylake.
Here we are years latter and the Haswell 4 core 8 thread i7-4700MQ laptop chip that i have has 25% higher IPC (CPU-Z) than the e-cores on my Alder Lake Core i7-12700K CPU with way, WAY faster DDR5.
Lunar Lake is Intel's Bulldozer, there so many problems with the overall design of the chip. Meteor Lake makes more sense.
intel is back on innovation track. I really want to bulid my next pc with intel. And Arc gpu😅
They are copying apple/arm and using tsmc, like AMD. Can't wait to see them do interesting things besides 14nm++++++++
"innovation track --> build my next pc" reads like "nVidia has -90 class GPUs that are great, let me build my next PC with GT 1030 (DDR4)"
yes, they are looking for new ways to get us on another decade of 4 cores is more than enough
Will you do a video on Zen 5 and 5C? I'm interested in the capabilities of Zen 5C versus Lunar Lakes E-Cores and how the different paths they took paid off now.
C cores are wayyyy better than e cores
For sure, but idk when I'll find the time. Soonish I think.
I love this channel so much
Bro, it's been a while but I still have that love for Intel
You produce incredibly great content. Subscribed!
I'm still waiting for a chip that integrates 32GB of HBM3e as an on-package L4 cache within the same SoC, while also supporting the addition of DDR5 memory modules with ECC capability, rather than being limited to just integrated memory.
bruh
the hell you talking about man lmfao
The core layout with everything right next to the memory controller makes sense, and I'm glad to see intel moving in this direction. It'll be super interesting to see how x86 power consumption improves with this layout!
it is also interesting to see that the NPU has roughly a similar TOPs per area as the gpu, so expect it to be very power efficient, which also might mean that perhaps someone might find out a way to overclock it, since sometimes hardware optimized for efficiency has insane headroom for overclocking.
I will neither understand, nor remember most of this, but it was interesting.
Thank you for this educational content! Underdog Intel is striking back with a mean kick! This is an amazing SOC! Its real competitor is the Apple M4!
Lunar Lake looks like the biggest improvement for Intel in over a decade. In terms of performance per watt and GPU performance, it looks like Lunar Lake will beat Zen 5 and Qualcomm's X Elite. The only downside is that Lunar Lake is focused exclusively on thin and light laptops and handhelds, its not their highest performance product for mobile or desktops that is Arrow Lake which looks great for performance but will lose some efficiency and iGPU gains Lunar Lake brings.
"looks like the biggest improvement for Intel in a decade"
No, it's since Alder Lake.
Great analysis! Really a real breakthrough.
I work for a major computer vendor and you're spot on. Your conclusion 110% speaks my mind and maims exactly what I've been saying when Intel presented us the LNL 3 weeks ago. I said that if LNL matches almost the battery perf of ARM by Qualcomm, this is going to be another Windows RT. ARM for Windows doesn't really offer a difference.
We have already more performance than needed, NPU's are available en masse thanks to NVIDIA, it's just MS that firewalls for now the marketing bullshit storytelling about Copilot and that blocks other than embedded NPU's from being recognized by copilot, but this will change probably next year and they'll have to open the gates. What's left ? Battery performance. Ok, but if this gets matched, what's the point of having the whole industry shifting away from x86 ? Zero...
ARM will be the thing that made Intel rethink it's architecture and from there the power efficiency and that's a good thing.
Not even 50k subs and you're already getting free Computex trips? Damn, balling
Gunna slap this into a new rig once it's released 🙏
Always wait for final silicon reviews. Something can look great on paper and be meh in reality
This guy got me high this morning. He got the sample
I’ve been using the Zenbook s14 for a couple of days and my god, the battery life is mind blowing, while also offering leading performance.
Hello, great video. I wanted to ask you, now that both mobile laptop cpus from amd and intel are announced, which cpu do you think is superior overall? Taking everything into account would you go with lunar lake or strix?
Thanks
I think Lunar Lake has areal shot at the efficiency crown, but it does launch later in Q3, while AMD will launch sooner. Always wait for reviews, but for battery life I think LNL will be best. Strix Point should win in raw performance with up to 12 cores.
@@HighYield AMD is usually late with actual shipping laptops though, no?
@@HighYield thanks for replying, patiently waiting on arrow lake desktop reveal as that is what im really interested in, im looking to upgrade to a new desktop with an rtx 5090, gonna go with whatever is faster amd 9000 or arrow lake. Ryzen 9000 vanilla series kinda disappointed me a bit tbh, pretty much same gaming performance as previous gen. Have to see what 9000 x3d chips have to offer.
I don't think they are comparable. AMD doesn't have something to compete with lunar lake given the low power target of lunar lake, and Intel has not announced what their answer to Strix Point is (though we all know it's going to be some variation of arrow lake).
Intel will Winn the efficiency battle against Strix point and it's very likely that their GPU will be very competitive with hawk point at lower power, but it is unlikely it will be able to touch Strix point in GPU performance given that Strix point has 16 CUs.
Overall, more and more excited for lunar lake. I think in a handheld form factor it's going to be very interesting.
@@sloanNYCthe shipping may be not late, but the real issue is supply. Here in my country you only able to find phoenix point/hawk point easily in gaming laptops while the thin & light category is dominated by intel.
Technically you can upgrade the memory after purchase. You just have to be really good at soldering 😁.
I only tell this comment, because I knew someone who did this to his MacBook. Bought a. 8GB model abd with some patience and skill it became a 32 GB model 😅.
But did it help perf
@@noticing33 I don't know but I know the device worked afterwards. I lost contact with him after his internship ended. But I think he wouldn't have done it, if it wouldn't have improvement performance.
Hahaha its gonna be fun for dyi project
That works when the memory is soldered to the motherboard, not when it's on the CPU die.
@@mattbosley3531 very true but this was still an Intel Mac with soldered memory.
While all of this is great stuff, here I am being hyped the most about hardware VVC decode 😅
I'm glad we have so vast mobile CPUs choice these days: Apple M1/M2/M3, Intel Meteor/Arrow/Lunar Lake, AMD Hawk Point/Strix Point and a new player - Snapdragon X Elite - is on its way. We never had a more difficult choice
They really should have spent that energy with Lunar Lake to replicate the chip concept on desktop PC as well.
Sounds very interesting and hopefully comes to desktop
I know this is a video about Lunar Lake but this video gets me really really excited for Battlemage and desktop products like Arrow Lake
If intel could figure out a V-Cache competitor and commit to multiple years of support for a motherboard platform they could make AMD straight up unattractive on desktop. I say that as someone with a 7950x3D and invested into AM5!
I can't wait to see the next few years
Shouldn't intel be exceeding not matching AMD's current offering to make AMD unattractive?
Or the standards are different for different companies?
@@aravindpallippara1577 Well currently, Lion cove is projected to have higher single threaded performance than Zen5 cores. That single thread lead will help with everything, including gaming. AMD has the biggest advantage in gaming rn with V-Cache, platform support and efficiency.
with skymont, intel has a real chance of gaining a huge performance/watt uplift particularly in multi threaded loads which is where intel sucks down a comically large amount of power
That's why I specified V-Cache and platform support would make AMD unattractive on desktop. Because Intel already has a decent chance of having class leading single threaded performance, adding V-Cache to an intel CPU would surely boost performance considerably (especially in games that love v-cache like Factorio, or Kerbal Space Program)
And platform support like we have with AM5 would be really great. Having to upgrade every 2 gens is a huge downside compared to AMD's offerings and commitment to 2027+ support and why I personally went for the 7950x3D and AM5. V-Cache and platform support is just great
The V-cache is a solution because of the slow memory controller on Zen processors. When you glue the ram this close you dont have that AMD problem. No need for the same solution. The mystery cache is probably enough if Intel engineers did their job well.
@@impuls60 Imo that's an F Tier comment. L3 cache is going to beat out faster memory just by virtue of the insanely high bandwidth and lower latency. There's a reason intel loses in those games that favor 3D cache
@@impuls60 Agreed with the above commentor a cpu cache and ram have vastly different type of uses - cache is very raw and hence very fast as opposed to ram which needs to be encrypted and passed through os layer checks before being accessed - cache is still the king for performance of single thread operations.
X86 is not dead yet. Pat also seemed very excited for Panther Lake...
A cynic would point out Pat was excited about Raptor and Meteor Lake and even Sapphire Rapids.
These presentations have not been a reliable guide to what's delivered and when in recent years.
@@RobBCactive
Raptor lake was great
@@technewseveryweek8332 if you read the tech news, you'd know better.
he was happy when he offered all their unused fabs to amd and nvidia because they are going extinct and nvidia seems to have listened
@@betag24cn to be fair, the Xeon Sierra Forest server is on an Intel 3nm node and with 144 E-cores has some advantages over Bergamo Zen4.
We love competition between Intel and AMD
Seems interesting stuff coming ;)
Something not related to ai and NPUs :/
I'd love to see something like this for desktops where I can get an entire SOC with 32-64GB of ram all bundled together. I know there are upgradeability concerns but the performance benefits if you over spec could be really good, especially for ram heavy applications.
Sad they just laid off 15,000 people.
Depending on the details on moving data between the NPU and the GPU, using both at the same time could work really well for some AI workloads. Training a QLora where the main weights are only used for 4 bit inference that could run in the NPU and the backpropagation is done only for a low rank adaptor in fp32 or fp16 in the GPU could potentially work well. It won't be faster than a dedicated GPU, even a 3060 should outperform it. Memory bandwidth will likely also severely limit its performance. But often the issue with GPUs is not speed but available memory. Also this should be much more power efficient.
It all will depend on software support, that is usually the issue with most non nvidia AI hardware.
I think they should have used the empty silicon left in the die to make the gpu more powerful
That packaging is impressive. Certainly more complex than what AMD is doing.
AMD & Intel APU still conquered X86 ecosystem. Apple & Qualcomm still conquered ARM ecosystem with Nvidia own ARM CPU design will joining ARM ecosystem next year ❤❤❤
Cool to see a nice bit of Cache on the side to minimize DRAM access, L4 foresight on desktop? probably not but I love what I'm seeing from Intel this year, very exciting in more ways than expected. Maybe not quite leadership just yet but at least on par, the whole E-cores thing is evolving into something and I wont be surprise if it eventually gets to a point of Zen Dense. So far its still looking to be a split design mentality but a high IPC Philosophy so the ability to use E-cores for most task will get the best out of the efficiency. Last time I was this excited was Alder Lake?
How Intel got its groove back! Great overview of their announcements.
lol, they announce things and never reoease tyem, meanwhile we get those wonderful i9 all damaged
remember foverosm chip stackingm, that was in 2017, where is it? ah yes, in amd x3d cpus
remember that intel innivation is not innnovation, is broken promises
@@betag24cnIntel did end up using foveros chip stacking, twice in fact. The more famous example is in Ponte Vecchio, where they have base dies (like described in this video mind you) except they have L4 cache. Vertically stacked. Also, underneath the compute so it doesn’t interfere with cooling. Wonder why anyone would ever try doing it the other way round?
The other example was in a super obscure part that previewed Intel’s P core and E core design before alder lake. One p core and 4 atom cores. 5 cores total.
Is Arrow Lake actually getting the 20A node, or just like this it will be using TSMC all around?
There will be some ARL parts on 20A.
Amazing contents!
the NPU is pretty gigantic compared to for example what Apple does. Curious about the performance because Apples ones are ridiculously fast for their size
I think we'll have to wait for the release of the Lunar Lake laptop and the benchmark scores, but if you simply multiply the graphics scores of the Meteor Lake-H's 3DMark benchmark Time Spy and Fire Strike by 1.5, you get TS: 5250 FS: 13800. In terms of desktop GPUs, it's close to the performance of the GTX1660. In the country where I live, there are several articles that say it's 50% better in performance than the Meteor Lake-U, but if you multiply the GPU performance of the Meteor Lake-U by 1.5, it will be the same as the Meteor Lake-H's GPU performance. On a different note, is the presence or absence of hyperthreading related to the high single-thread performance of Apple silicon?
I'm wondering why they would make it on TSMC N3B when N3E is already in production.
Intel had to take what was available when the order was made would be my bet.
Jupp
What a revolutionary set of concepts. On package LPDDR, a larger cache, and a split scheduler. Who could possibly have done such a thing…4 years ago
Imitation is the sincerest form of flattery.
I can say the same about Apple ... being so efficient by simply using a better / newer manufacturing node. What a revolutionary concept ...
And in regards to on-chip LPDDR, it wasn't done not because Intel or AMD didn't though of it or didn't knew how to do it.
It wasn't done because of the tradeoffs, like a) no more upgradeability - something that many people actually like - and b) customizability - in case of soldered RAM, it's still to the OEM to add it, as it sees fit. And as the suppliers come, so to speak. You're much more constrained when you do it on-chip. I
If it weren't for the efficiency improvements, I would be fully against it.
I have been wondering when we would finally see someone utilize on package memory to compete with Apple in power consumption. Fingers crossed hoping this is the beginning of the end for 8bg memory!
This doesn’t really save on power consumption, at least not the way Intel did it. It’s the same bog standard LPDDR memory as laptops have now, and it’s connected to the CPU with the same copper traces. The only thing that’s really changed is that it’s physically closer and OEM’s will have to source it from Intel (with the obligatory markup!) rather than sourcing it themselves from Micron, Samsung or SK Hynix.
If I’m remembering the Apple chip architecture correctly, they’re using GDDR memory, and it’s actually integrated right into the same silicon as the rest of the chip. If true, that effectively means TSMC is making the memory for the Apple chip (they most certainly are NOT doing so for Intel) and the memory can’t even be physically separated from the rest of the chip.
@@benjaminlynch9958Apple is using lpddr5x; their implementation is basically the same as Intel’s. Shorter trace lengths do have benefits for power consumption and latency, although I’m not expecting large benefits from this implementation. The main benefit of this approach is a simplification of design for OEMs because they don’t have to worry about designing the memory system. It’s also beneficial for consumers by standardizing memory configurations, so companies can’t skimp on memory. 16GB is standard on Lunar Lake with higher end models going to 32.
@@benjaminlynch9958 No, the RAM on Apple CPUs is just LPDDR** sitting on the substrate, the same as Lunar Lake.
All we need now is to add a cache chip over everything and we will have amazing performance at ultra low power consumption.
is this where the power and signal have been seperated (on opposite side)?
No, it’s on a TSMC node which doesn’t have backside power.
Exciting stuff! Great video, as usual. I do have one question, though. Is it certain that a server implementation (or any) of Lion Cove would have SMT? Also, different implementations of the same architecture sounds more like a standard vs Dense Zen situation to me, and I think that it could get expensive to develop lots of just slightly different cores
Yes, Lion Cove in Xeon will have SMT. And yes, LNC is also more flexible. Not really a “LNCc”, but there will be size differences.
@@HighYield Nice! Excited to see what they will come up with
Does this mean the gpu can access all the memory like the m series aka unified memory?
I can access the 8MB GPU L2 cache and the 8MB memory side cache.
every intel iGPU can access all of system RAM
Speaking of, we REALLY need the dynamic iGPU memory allocation that Apple has. On Windows' side I can see why it's not implemented and why nobody talks about it, as Microsoft couldn't give 2 flying Fs about Windows, especially in the performance side. If it's not ads or tracking the user, then it's priority 7384, to be done in 15 years from now.
On Linux side I hope we'll see something, but usually GPU stuff comes from the manufacturer, so it would be Intel or AMD here for the iGPUs. And they're both busy on other areas, like the actual GPUs being competitive. And the drivers for Windows. Linux comes after that. Sigh.
@@Winnetou17 you do have dynamic igpu allocation on windows, on intel all of memory is accessible to the igpu (unlike ryzen lol)
@@sowa705 Oh, ok. I was under the impression that it's settled at boot time. I wonder then why did Apple presented (and people being wowed) as something new. I guess it was new for them.
So there's no Intel process node in lunar lake? All from tsmc nodes.
The base tile ins manufactured by Intel and they also do testing + packaging. But yes, all the active silicon is TSMC.
@@HighYield if i not wrong isnt they gonna used on home manufacturing for 2025? basically all the things gonna used intel intel nodes
I'd like to see a version of this that is compatible with CAMM for memory instead of on-package.
Gluing the ram closer can yield far better ram function than CAMM. I think this cpu will be used in very small systems. I'm betting they will use even faster ram as soon as it comes available.
@@impuls60 CAMM supports the same 8533 speed that this package does. I don't buy this explanation.
@@ChristopherBurtraw The on chip is mostly for the power savings, not neccesarly for bandwidth.
Lunar Lake is optimized to be very efficient (and I hope it actually delivers). It should be perfect for ultrabooks which want really really long battery life and for gaming handhelds.
For the rest of normal folk and normal (or powerful) laptops, we'll have Arrow Lake. And hopefully we'll see LPCAMM2 laptops with that. I dream of a Framework 16 with Arrow Lake and LPCAMM2 in which to add 128 GB of RAM and finally upgrade my almost 8 year old laptop, to one that will also last me 7-10 years.
@@Winnetou17 I'm hoping the next gen (after the one they just announced) 13 board will have it too. Framework won't want to implement this one even for the 13...
Would be very much interested in the thermal perf as they are using the TSMC manufacturing
Do I understand correctly that 128 bit memory bus is the same bandwidth as what we'd get with dual-channel DDR modules? Since each module (channel) is usually 64-bit? Just trying to understand the overall memory bandwidth, I know we also get latency benefits and not downplaying it.
Yes, it’s 128-bit as most other consumer chips.
It will be interesting to see actual laptops.
Lunar Lake really seems like Apple Silicon. 8 wide superscalar P cores, LPDDR5 on package, system level cache on package, built on TSMC N3.
Can you explain the implications embedded RAM? My PC currently has 128GB of RAM. From several places I've looked, it sounds to me like with lunar lake you're limited to the 32GB of embedded RAM and it won't make use of normal RAM sticks? If so that's.. just not an option at all for developers and I'm really surprised that noone is getting out pitch forks.
BUT I must be missing something so what am I missing?
Can you just combine GPU, NPU and CPU for the same inference task though? Or is Intel just adding up numbers to create a bigger number but in the real world, you will have to decide where to run any given model?
You are correct. Currently can't just add the numbers. Apparently work is being done to enable mismatched processors for ai batch-processing, but I don't expect it will release soon, if ever.
@@martin777xyz thanks for confirming. And yeah, from my understanding it sounds really tough to make these systems complement each other. Maybe some day we'll be running so many models locally that they can run in parallel but even that...
Lunar Lake better change everything, because Intel is currently in meltdown.
Lunar Lake will flop hard. No ring bus, cache starved, less P-cores than Alder Lake or Raptor Lake P (i think), no memory upgrade path.
@@saricubra2867 The market segment is different for Lunar lake, it's meant for Thin & lights, thus it's competition is the X Elite. It is designed to be less powerful in multicore, more powerful in single core, and more efficient. Intel is trying to copy ARM. By the way, ram won't be and issue; the Lunar lake socs have a memory bandwidth of about 136 GB/s, slightly slower than the m4, however unlike Apple, 16gb is the minimum amount for Lunar Lake, 32 gigabyte is the maximum.
Chad skymont!!!
Hell yeah, this is why I love competition. Keep em coming Intel, and you too AMD.
Can't wait to see if those promises in performance and power draw materializes. We could finally have smaller x86_64 handhelds and could reaffirm x86_64 as capable of reinventing itself. And in general its good seeing intel trying again. All the design decisions seem intelligent again and that it required a certain amount of redisgn
I love the idea of on package memory. It's fantastic to get the perspective of someone who sees this idea as an opportunity for improved efficiency and cost, rather than just a lack of upgradability.
I have two guesses on Intel 4 and later processes Meteor Lake and Lunar Lake all being mobile oriented and not desktop. One is that the processes are not suitable for high performance operation, but do get better power efficiency. 2) manage fab-process capacity.
14:40 * " would have never expected that the E-cores in Arrow [Lunar] Lake have a higher IPC than the P-cores in Raptor Lake."
They are lying. Raptor Cove has 36MB of L3 cache on a monolithic Ring Bus.
A ring bus Haswell 4core8thread laptop Core i7-4700MQ that i have has 25% higher IPC than the e-cores on my current Alder Lake i7-12700K besides the HUGE difference on RAM speed.
It would be interestong if someone came up with a hybrid chip that has both x86 and ARM instruction cores. Which would allow running both X86 and ARM software natively. It could be an 8 core CPU with 2 X86 P cores + 6 ARM P cores.
i believe amd with the chiplets can do it, if they have not done it already
You will need also a hybrid SO that could understand and manage 2 different ISAs and architectures
@@reiniermoreno1653 you mean, windows 12?
@@reiniermoreno1653 Implementing the Software would easier compared to the hardware since we already have OSs that understand which ISA you are using. Binary executables also have info on which architecture they are designed for.
Even if its performance and efficiency is only on par with M1, I would say this is a win for Intel moving forward. I can imagine 16-20+ hrs batt life ultrabooks.